Message ID | 20230216051750.3125598-27-surenb@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Per-VMA locks | expand |
On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > When vma->anon_vma is not set, page fault handler will set it by either > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > a compatible adjacent VMA and that requires not only the faulting VMA > to be stable but also the tree structure and other VMAs inside that tree. > Therefore locking just the faulting VMA is not enough for this search. > Fall back to taking mmap_lock when vma->anon_vma is not set. This > situation happens only on the first page fault and should not affect > overall performance. I think I asked this before, but don't remember getting an aswer. Why do we defer setting anon_vma to the first fault? Why don't we set it up at mmap time?
On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > When vma->anon_vma is not set, page fault handler will set it by either > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > a compatible adjacent VMA and that requires not only the faulting VMA > > to be stable but also the tree structure and other VMAs inside that tree. > > Therefore locking just the faulting VMA is not enough for this search. > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > situation happens only on the first page fault and should not affect > > overall performance. > > I think I asked this before, but don't remember getting an aswer. > Why do we defer setting anon_vma to the first fault? Why don't we > set it up at mmap time? Yeah, I remember that conversation Matthew and I could not find the definitive answer at the time. I'll look into that again or maybe someone can answer it here. In the end rather than changing that logic I decided to skip vma->anon_vma==NULL cases because I measured them being less than 0.01% of all page faults, so ROI from changing that would be quite low. But I agree that the logic is weird and maybe we can improve that. I will have to review that again when I'm working on eliminating all these special cases we skip, like swap/userfaults/etc.
On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > When vma->anon_vma is not set, page fault handler will set it by either > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > to be stable but also the tree structure and other VMAs inside that tree. > > > Therefore locking just the faulting VMA is not enough for this search. > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > situation happens only on the first page fault and should not affect > > > overall performance. > > > > I think I asked this before, but don't remember getting an aswer. > > Why do we defer setting anon_vma to the first fault? Why don't we > > set it up at mmap time? > > Yeah, I remember that conversation Matthew and I could not find the > definitive answer at the time. I'll look into that again or maybe > someone can answer it here. After looking into it again I'm still under the impression that vma->anon_vma is populated lazily (during the first page fault rather than at mmap time) to avoid doing extra work for areas which are never faulted. Though I might be missing some important detail here. > > In the end rather than changing that logic I decided to skip > vma->anon_vma==NULL cases because I measured them being less than > 0.01% of all page faults, so ROI from changing that would be quite > low. But I agree that the logic is weird and maybe we can improve > that. I will have to review that again when I'm working on eliminating > all these special cases we skip, like swap/userfaults/etc.
On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > situation happens only on the first page fault and should not affect > > > > overall performance. > > > > > > I think I asked this before, but don't remember getting an aswer. > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > set it up at mmap time? > > > > Yeah, I remember that conversation Matthew and I could not find the > > definitive answer at the time. I'll look into that again or maybe > > someone can answer it here. > > After looking into it again I'm still under the impression that > vma->anon_vma is populated lazily (during the first page fault rather > than at mmap time) to avoid doing extra work for areas which are never > faulted. Though I might be missing some important detail here. I think this is because the kernel cannot merge VMAs that have different anon_vmas? Enabling lazy population of anon_vma could potentially increase the chances of merging VMAs. > > In the end rather than changing that logic I decided to skip > > vma->anon_vma==NULL cases because I measured them being less than > > 0.01% of all page faults, so ROI from changing that would be quite > > low. But I agree that the logic is weird and maybe we can improve > > that. I will have to review that again when I'm working on eliminating > > all these special cases we skip, like swap/userfaults/etc.
On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote: > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > situation happens only on the first page fault and should not affect > > > > overall performance. > > > > > > I think I asked this before, but don't remember getting an aswer. > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > set it up at mmap time? > > > > Yeah, I remember that conversation Matthew and I could not find the > > definitive answer at the time. I'll look into that again or maybe > > someone can answer it here. > > After looking into it again I'm still under the impression that > vma->anon_vma is populated lazily (during the first page fault rather > than at mmap time) to avoid doing extra work for areas which are never > faulted. Though I might be missing some important detail here. How often does userspace call mmap() and then _never_ fault on it? I appreciate that userspace might mmap() gigabytes of address space and then only end up using a small amount of it, so populating it lazily makes sense. But creating a region and never faulting on it? The only use-case I can think of is loading shared libraries: openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 (...) mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000 mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000 mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000 mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000 mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000 but that's a file-backed VMA, not an anon VMA.
On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote: > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > situation happens only on the first page fault and should not affect > > > > > overall performance. > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > set it up at mmap time? > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > definitive answer at the time. I'll look into that again or maybe > > > someone can answer it here. > > > > After looking into it again I'm still under the impression that > > vma->anon_vma is populated lazily (during the first page fault rather > > than at mmap time) to avoid doing extra work for areas which are never > > faulted. Though I might be missing some important detail here. > > How often does userspace call mmap() and then _never_ fault on it? > I appreciate that userspace might mmap() gigabytes of address space and > then only end up using a small amount of it, so populating it lazily > makes sense. But creating a region and never faulting on it? The only > use-case I can think of is loading shared libraries: > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > (...) > mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000 > mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000 > mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000 > mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000 > mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000 > > but that's a file-backed VMA, not an anon VMA. Might the case of dup_mmap() while forking be the reason why a VMA in the child process might be never used while parent uses it (or visa versa)? Again, I'm not sure this is the reason but I can find no other good explanation.
On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > situation happens only on the first page fault and should not affect > > > > > overall performance. > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > set it up at mmap time? > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > definitive answer at the time. I'll look into that again or maybe > > > someone can answer it here. > > > > After looking into it again I'm still under the impression that > > vma->anon_vma is populated lazily (during the first page fault rather > > than at mmap time) to avoid doing extra work for areas which are never > > faulted. Though I might be missing some important detail here. > > I think this is because the kernel cannot merge VMAs that have > different anon_vmas? > > Enabling lazy population of anon_vma could potentially increase the > chances of merging VMAs. Hmm. Do you have a clear explanation why merging chances increase this way? A couple of possibilities I can think of would be: 1. If after mmap'ing a VMA and before faulting the first page into it we often change something that affects anon_vma_compatible() decision, like vm_policy; 2. When mmap'ing VMAs we do not map them consecutively but the final arrangement is actually contiguous. Don't think either of those cases would be very representative of a usual case but maybe I'm wrong or there is another reason? > > > > In the end rather than changing that logic I decided to skip > > > vma->anon_vma==NULL cases because I measured them being less than > > > 0.01% of all page faults, so ROI from changing that would be quite > > > low. But I agree that the logic is weird and maybe we can improve > > > that. I will have to review that again when I'm working on eliminating > > > all these special cases we skip, like swap/userfaults/etc. > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
On Fri, Feb 17, 2023 at 08:13:01AM -0800, Suren Baghdasaryan wrote: > On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hyeyoo@gmail.com> wrote: > > > > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > > situation happens only on the first page fault and should not affect > > > > > > overall performance. > > > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > > set it up at mmap time? > > > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > > definitive answer at the time. I'll look into that again or maybe > > > > someone can answer it here. > > > > > > After looking into it again I'm still under the impression that > > > vma->anon_vma is populated lazily (during the first page fault rather > > > than at mmap time) to avoid doing extra work for areas which are never > > > faulted. Though I might be missing some important detail here. > > > > I think this is because the kernel cannot merge VMAs that have > > different anon_vmas? > > > > Enabling lazy population of anon_vma could potentially increase the > > chances of merging VMAs. > > Hmm. Do you have a clear explanation why merging chances increase this > way? A couple of possibilities I can think of would be: > 1. If after mmap'ing a VMA and before faulting the first page into it > we often change something that affects anon_vma_compatible() decision, > like vm_policy; > 2. When mmap'ing VMAs we do not map them consecutively but the final > arrangement is actually contiguous. > > Don't think either of those cases would be very representative of a > usual case but maybe I'm wrong or there is another reason? Ok. I agree it does not represent common cases. Hmm then I wonder how it went from the initial approach of "allocate anon_vma objects only via fork()" [1] to "populate anon_vma at page faults". [2] [3] Maybe Hugh, Andrea or Andrew have opinions? [1] anon_vma RFC2, lore.kernel.org https://lore.kernel.org/lkml/20040311065254.GT30940@dualathlon.random [2] The status of object-based reverse mapping, LWN.net https://lwn.net/Articles/85908 [3] rmap 39 add anon_vma rmap https://gitlab.com/hyeyoo/linux-historical/-/commit/8aa3448cabdfca146aa3fd36e852d0209fb2276a > > > > > > > In the end rather than changing that logic I decided to skip > > > > vma->anon_vma==NULL cases because I measured them being less than > > > > 0.01% of all page faults, so ROI from changing that would be quite > > > > low. But I agree that the logic is weird and maybe we can improve > > > > that. I will have to review that again when I'm working on eliminating > > > > all these special cases we skip, like swap/userfaults/etc. > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. > >
On Fri, Feb 17, 2023 at 08:10:35AM -0800, Suren Baghdasaryan wrote: > On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote: > > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > > When vma->anon_vma is not set, page fault handler will set it by either > > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > > to be stable but also the tree structure and other VMAs inside that tree. > > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > > situation happens only on the first page fault and should not affect > > > > > > overall performance. > > > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > > set it up at mmap time? > > > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > > definitive answer at the time. I'll look into that again or maybe > > > > someone can answer it here. > > > > > > After looking into it again I'm still under the impression that > > > vma->anon_vma is populated lazily (during the first page fault rather > > > than at mmap time) to avoid doing extra work for areas which are never > > > faulted. Though I might be missing some important detail here. > > > > How often does userspace call mmap() and then _never_ fault on it? > > I appreciate that userspace might mmap() gigabytes of address space and > > then only end up using a small amount of it, so populating it lazily > > makes sense. But creating a region and never faulting on it? The only > > use-case I can think of is loading shared libraries: > > > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > > (...) > > mmap(NULL, 1970000, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ce612e000 > > mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000 > > mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17b000) = 0x7f0ce62a9000 > > mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000 > > mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000 > > > > but that's a file-backed VMA, not an anon VMA. > > Might the case of dup_mmap() while forking be the reason why a VMA in > the child process might be never used while parent uses it (or visa > versa)? Again, I'm not sure this is the reason but I can find no other > good explanation. I found an explanation! Well, a partial one. If we MAP_PRIVATE a file mapping (like, er those ones up there) and only take read faults on it, we can postpone allocation of the anon_vma indefinitely. But once we take a write fault in that VMA, we need to allocate an anon_vma for it so that we can track the anonymous pages that have been allocated to satisfy the copy-on-write (see do_cow_fault()). However, I think in that caase, we could probably skip the find_mergeable_anon_vma() step. We don't today; we check whether a->vm_file == b->vm_file in anon_vma_compatible, but I wonder if that triggers often.
diff --git a/mm/memory.c b/mm/memory.c index 5e1c124552a1..13369ff15ec1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5242,6 +5242,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, if (!vma_is_anonymous(vma)) goto inval; + /* find_mergeable_anon_vma uses adjacent vmas which are not locked */ + if (!vma->anon_vma) + goto inval; + if (!vma_start_read(vma)) goto inval;
When vma->anon_vma is not set, page fault handler will set it by either reusing anon_vma of an adjacent VMA if VMAs are compatible or by allocating a new one. find_mergeable_anon_vma() walks VMA tree to find a compatible adjacent VMA and that requires not only the faulting VMA to be stable but also the tree structure and other VMAs inside that tree. Therefore locking just the faulting VMA is not enough for this search. Fall back to taking mmap_lock when vma->anon_vma is not set. This situation happens only on the first page fault and should not affect overall performance. Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- mm/memory.c | 4 ++++ 1 file changed, 4 insertions(+)