Message ID: 20230414180043.1839745-1-surenb@google.com (mailing list archive)
State: New
Series: [1/1] mm: handle swap page faults if the faulting page can be locked
On Fri, Apr 14, 2023 at 11:00 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> When page fault is handled under VMA lock protection, all swap page
> faults are retried with mmap_lock because folio_lock_or_retry
> implementation has to drop and reacquire mmap_lock if folio could
> not be immediately locked.
> Instead of retrying all swapped page faults, retry only when folio
> locking fails.

I just realized that the title of the patch is misleading. It's about
handling page fault under VMA lock. A better title would be something
like: "mm: handle swap page faults under vma lock if page is uncontended"

>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
> Patch applies cleanly over linux-next and mm-unstable
>
>  mm/filemap.c | 6 ++++++
>  mm/memory.c  | 5 -----
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6f3a7e53fccf..67b937b0f436 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>   * mmap_lock has been released (mmap_read_unlock(), unless flags had both
>   * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
>   * which case mmap_lock is still held.
> + * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
> + * with VMA lock only, the VMA lock is still held.
>   *
>   * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
>   * with the folio locked and the mmap_lock unperturbed.
> @@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>  bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>  			 unsigned int flags)
>  {
> +	/* Can't do this if not holding mmap_lock */
> +	if (flags & FAULT_FLAG_VMA_LOCK)
> +		return false;
> +
>  	if (fault_flag_allow_retry_first(flags)) {
>  		/*
>  		 * CAUTION! In this case, mmap_lock is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index d88f370eacd1..3301a8d01820 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3715,11 +3715,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	if (!pte_unmap_same(vmf))
>  		goto out;
>
> -	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
> -		ret = VM_FAULT_RETRY;
> -		goto out;
> -	}
> -
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> --
> 2.40.0.634.g4ca3ef3211-goog

On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> When page fault is handled under VMA lock protection, all swap page
> faults are retried with mmap_lock because folio_lock_or_retry
> implementation has to drop and reacquire mmap_lock if folio could
> not be immediately locked.
> Instead of retrying all swapped page faults, retry only when folio
> locking fails.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Let's just review what can now be handled under the VMA lock instead of
the mmap_lock, in case somebody knows better than me that it's not safe.

 - We can call migration_entry_wait(). This will wait for PG_locked to
   become clear (in migration_entry_wait_on_locked()). As previously
   discussed offline, I think this is safe to do while holding the VMA
   locked.
 - We can call remove_device_exclusive_entry(). That calls
   folio_lock_or_retry(), which will fail if it can't get the VMA lock.
 - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
   with Nouveau and amdkfd could comment on how safe this is?
 - I believe we can't call handle_pte_marker() because we exclude UFFD
   VMAs earlier.
 - We can call swap_readpage() if we allocate a new folio. I haven't
   traced through all this code to tell if it's OK.

So ... I believe this is all OK, but we're definitely now willing to
wait for I/O from the swap device while holding the VMA lock when we
weren't before. And maybe we should make a bigger deal of it in the
changelog.

And maybe we shouldn't just be failing the folio_lock_or_retry(),
maybe we should be waiting for the folio lock with the VMA locked.

On Fri, Apr 14, 2023 at 11:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> > When page fault is handled under VMA lock protection, all swap page
> > faults are retried with mmap_lock because folio_lock_or_retry
> > implementation has to drop and reacquire mmap_lock if folio could
> > not be immediately locked.
> > Instead of retrying all swapped page faults, retry only when folio
> > locking fails.
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Thank you for the reviews!

>
> Let's just review what can now be handled under the VMA lock instead of
> the mmap_lock, in case somebody knows better than me that it's not safe.
>
> - We can call migration_entry_wait(). This will wait for PG_locked to
> become clear (in migration_entry_wait_on_locked()). As previously
> discussed offline, I think this is safe to do while holding the VMA
> locked.
> - We can call remove_device_exclusive_entry(). That calls
> folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> with Nouveau and amdkfd could comment on how safe this is?
> - I believe we can't call handle_pte_marker() because we exclude UFFD
> VMAs earlier.
> - We can call swap_readpage() if we allocate a new folio. I haven't
> traced through all this code to tell if it's OK.
>
> So ... I believe this is all OK, but we're definitely now willing to
> wait for I/O from the swap device while holding the VMA lock when we
> weren't before. And maybe we should make a bigger deal of it in the
> changelog.
>
> And maybe we shouldn't just be failing the folio_lock_or_retry(),
> maybe we should be waiting for the folio lock with the VMA locked.

Wouldn't that cause holding the VMA lock for the duration of swap I/O
(something you said we want to avoid in the previous paragraph) and
effectively undo d065bd810b6d ("mm: retry page fault when blocking on
disk transfer") for VMA locks?

On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > become clear (in migration_entry_wait_on_locked()). As previously
> > discussed offline, I think this is safe to do while holding the VMA
> > locked.

Just to be clear, this particular use of PG_locked is not during I/O,
it's during page migration. This is a few orders of magnitude
different.

> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > traced through all this code to tell if it's OK.

... whereas this will wait for I/O. If we decide that's not OK, we'll
need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.

> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before. And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>
> Wouldn't that cause holding the VMA lock for the duration of swap I/O
> (something you said we want to avoid in the previous paragraph) and
> effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> disk transfer") for VMA locks?

I'm not certain we want to avoid holding the VMA lock for the duration
of an I/O. Here's how I understand the rationale for avoiding holding
the mmap_lock while we perform I/O (before the existence of the VMA lock):

 - If everybody is doing page faults, there is no specific problem;
   we all hold the lock for read and multiple page faults can be handled
   in parallel.
 - As soon as one thread attempts to manipulate the tree (eg calls
   mmap()), all new readers must wait (as the rwsem is fair), and the
   writer must wait for all existing readers to finish. That's
   potentially milliseconds for an I/O during which time all page faults
   stop.

Now we have the per-VMA lock, faults which can be handled without taking
the mmap_lock can still be satisfied, as long as that VMA is not being
modified. It is rare for a real application to take a page fault on a
VMA which is being modified.

So modifications to the tree will generally not take VMA locks on VMAs
which are currently handling faults, and new faults will generally not
find a VMA which is write-locked.

When we find a locked folio (presumably for I/O, although folios are
locked for other reasons), if we fall back to taking the mmap_lock
for read, we increase contention on the mmap_lock and make the page
fault wait on any mmap() operation. If we simply sleep waiting for the
I/O, we make any mmap() operation _which touches this VMA_ wait for
the I/O to complete. But I think that's OK, because new page faults
can continue to be serviced ... as long as they don't need to take
the mmap_lock.

So ... I think what we _really_ want here is ...

+++ b/mm/filemap.c
@@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
-	if (fault_flag_allow_retry_first(flags)) {
+	if (!(flags & FAULT_FLAG_VMA_LOCK) &&
+	    fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
 		 * even though return 0.
@@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,

 	ret = __folio_lock_killable(folio);
 	if (ret) {
-		mmap_read_unlock(mm);
+		if (!(flags & FAULT_FLAG_VMA_LOCK))
+			mmap_read_unlock(mm);
 		return false;
 	}
 } else {

On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > become clear (in migration_entry_wait_on_locked()). As previously
> > > discussed offline, I think this is safe to do while holding the VMA
> > > locked.
>
> Just to be clear, this particular use of PG_locked is not during I/O,
> it's during page migration. This is a few orders of magnitude
> different.
>
> > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > traced through all this code to tell if it's OK.
>
> ... whereas this will wait for I/O. If we decide that's not OK, we'll
> need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
>
> > > So ... I believe this is all OK, but we're definitely now willing to
> > > wait for I/O from the swap device while holding the VMA lock when we
> > > weren't before. And maybe we should make a bigger deal of it in the
> > > changelog.
> > >
> > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > maybe we should be waiting for the folio lock with the VMA locked.
> >
> > Wouldn't that cause holding the VMA lock for the duration of swap I/O
> > (something you said we want to avoid in the previous paragraph) and
> > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > disk transfer") for VMA locks?
>
> I'm not certain we want to avoid holding the VMA lock for the duration
> of an I/O. Here's how I understand the rationale for avoiding holding
> the mmap_lock while we perform I/O (before the existence of the VMA lock):
>
> - If everybody is doing page faults, there is no specific problem;
> we all hold the lock for read and multiple page faults can be handled
> in parallel.
> - As soon as one thread attempts to manipulate the tree (eg calls
> mmap()), all new readers must wait (as the rwsem is fair), and the
> writer must wait for all existing readers to finish. That's
> potentially milliseconds for an I/O during which time all page faults
> stop.
>
> Now we have the per-VMA lock, faults which can be handled without taking
> the mmap_lock can still be satisfied, as long as that VMA is not being
> modified. It is rare for a real application to take a page fault on a
> VMA which is being modified.
>
> So modifications to the tree will generally not take VMA locks on VMAs
> which are currently handling faults, and new faults will generally not
> find a VMA which is write-locked.
>
> When we find a locked folio (presumably for I/O, although folios are
> locked for other reasons), if we fall back to taking the mmap_lock
> for read, we increase contention on the mmap_lock and make the page
> fault wait on any mmap() operation.

Do you mean we increase mmap_lock contention by holding the mmap_lock
between the start of pagefault retry and until we drop it in
__folio_lock_or_retry?

> If we simply sleep waiting for the
> I/O, we make any mmap() operation _which touches this VMA_ wait for
> the I/O to complete. But I think that's OK, because new page faults
> can continue to be serviced ... as long as they don't need to take
> the mmap_lock.

Ok, so we will potentially block VMA writers for the duration of the I/O...
Stupid question: why was this a bigger problem for mmap_lock?
Potentially our address space can consist of only one anon VMA, so
locking that VMA vs mmap_lock should be the same from swap pagefault
POV. Maybe mmap_lock is taken for write in some other important cases
when VMA lock is not needed?

> So ... I think what we _really_ want here is ...
>
> +++ b/mm/filemap.c
> @@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>  bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>  			 unsigned int flags)
>  {
> -	if (fault_flag_allow_retry_first(flags)) {
> +	if (!(flags & FAULT_FLAG_VMA_LOCK) &&
> +	    fault_flag_allow_retry_first(flags)) {
>  		/*
>  		 * CAUTION! In this case, mmap_lock is not released
>  		 * even though return 0.
> @@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>
>  	ret = __folio_lock_killable(folio);
>  	if (ret) {
> -		mmap_read_unlock(mm);
> +		if (!(flags & FAULT_FLAG_VMA_LOCK))
> +			mmap_read_unlock(mm);
>  		return false;
>  	}
>  } else {
>

On 14 Apr 2023 14:51:59 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > If we simply sleep waiting for the
> > I/O, we make any mmap() operation _which touches this VMA_ wait for
> > the I/O to complete. But I think that's OK, because new page faults
> > can continue to be serviced ... as long as they don't need to take
> > the mmap_lock.
>
> Ok, so we will potentially block VMA writers for the duration of the I/O...
> Stupid question: why was this a bigger problem for mmap_lock?
> Potentially our address space can consist of only one anon VMA, so
> locking that VMA vs mmap_lock should be the same from swap pagefault
> POV. Maybe mmap_lock is taken for write in some other important cases
> when VMA lock is not needed?

This question churned in my mind; after searching the git history for
more than 30 minutes, the commit that rises is 89b15332af7c ("mm: drop
mmap_sem before calling balance_dirty_pages() in write fault"). And
John can tell you more about it.

On Fri, Apr 14, 2023 at 02:51:59PM -0700, Suren Baghdasaryan wrote:
> On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > > > become clear (in migration_entry_wait_on_locked()). As previously
> > > > discussed offline, I think this is safe to do while holding the VMA
> > > > locked.
> >
> > Just to be clear, this particular use of PG_locked is not during I/O,
> > it's during page migration. This is a few orders of magnitude
> > different.
> >
> > > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > > > traced through all this code to tell if it's OK.
> >
> > ... whereas this will wait for I/O. If we decide that's not OK, we'll
> > need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
> >
> > > > So ... I believe this is all OK, but we're definitely now willing to
> > > > wait for I/O from the swap device while holding the VMA lock when we
> > > > weren't before. And maybe we should make a bigger deal of it in the
> > > > changelog.
> > > >
> > > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > > maybe we should be waiting for the folio lock with the VMA locked.
> > >
> > > Wouldn't that cause holding the VMA lock for the duration of swap I/O
> > > (something you said we want to avoid in the previous paragraph) and
> > > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > > disk transfer") for VMA locks?
> >
> > I'm not certain we want to avoid holding the VMA lock for the duration
> > of an I/O. Here's how I understand the rationale for avoiding holding
> > the mmap_lock while we perform I/O (before the existence of the VMA lock):
> >
> > - If everybody is doing page faults, there is no specific problem;
> > we all hold the lock for read and multiple page faults can be handled
> > in parallel.
> > - As soon as one thread attempts to manipulate the tree (eg calls
> > mmap()), all new readers must wait (as the rwsem is fair), and the
> > writer must wait for all existing readers to finish. That's
> > potentially milliseconds for an I/O during which time all page faults
> > stop.
> >
> > Now we have the per-VMA lock, faults which can be handled without taking
> > the mmap_lock can still be satisfied, as long as that VMA is not being
> > modified. It is rare for a real application to take a page fault on a
> > VMA which is being modified.
> >
> > So modifications to the tree will generally not take VMA locks on VMAs
> > which are currently handling faults, and new faults will generally not
> > find a VMA which is write-locked.
> >
> > When we find a locked folio (presumably for I/O, although folios are
> > locked for other reasons), if we fall back to taking the mmap_lock
> > for read, we increase contention on the mmap_lock and make the page
> > fault wait on any mmap() operation.
>
> Do you mean we increase mmap_lock contention by holding the mmap_lock
> between the start of pagefault retry and until we drop it in
> __folio_lock_or_retry?

Yes, and if there is a writer (to any VMA, not just to the VMA we're
working on), this page fault has to wait for that writer. Rather than
just waiting for this I/O to complete.

> > If we simply sleep waiting for the
> > I/O, we make any mmap() operation _which touches this VMA_ wait for
> > the I/O to complete. But I think that's OK, because new page faults
> > can continue to be serviced ... as long as they don't need to take
> > the mmap_lock.
>
> Ok, so we will potentially block VMA writers for the duration of the I/O...
> Stupid question: why was this a bigger problem for mmap_lock?
> Potentially our address space can consist of only one anon VMA, so
> locking that VMA vs mmap_lock should be the same from swap pagefault
> POV. Maybe mmap_lock is taken for write in some other important cases
> when VMA lock is not needed?

I really doubt any process has only a single VMA. Usually, there's at
least stack, program text, program data, program rodata, heap, vdso,
libc text, libc data, libc rodata, etc.

$ cat /proc/self/maps |wc -l
23

That's 5 for the program, 5 for ld.so, 5 for libc, heap, stack, vvar,
vdso, and a few I can't identify.

Also, if our program only has one anon VMA, what is it doing that it
needs to modify the tree? Growing it, perhaps? Almost anything else
would cause it to have more than one VMA ;-)

But let's consider the case where we have a 1TB VMA which is servicing
the majority of faults and we try to mprotect one page in it. The
mprotect() thread will take the mmap_lock for write, then block on the
VMA lock until all the other threads taking page faults have finished
(ie waiting for all the swapin to happen). While it waits, new faults
on that VMA will also block (on the mmap_lock as the read_trylock on
the VMA lock will fail). So yes, this case is just as bad as the
original problem.

We have two alternatives here. We can do what we do today -- fall back
to taking the mmap_lock for read before dropping it while we wait for
the folio lock. Or we can do something like this ...

+++ b/mm/filemap.c
@@ -1698,7 +1698,10 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return false;

-		mmap_read_unlock(mm);
+		if (flags & FAULT_FLAG_VMA_LOCK)
+			vma_end_read(vma);
+		else
+			mmap_read_unlock(mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
 		else

... and then figure out a return value that lets the caller of
handle_mm_fault() know not to call vma_end_read().

Or we can decide that it's OK to reintroduce this worst-case scenario,
because we've now reduced the likelihood of it happening so far.

Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
>> When page fault is handled under VMA lock protection, all swap page
>> faults are retried with mmap_lock because folio_lock_or_retry
>> implementation has to drop and reacquire mmap_lock if folio could
>> not be immediately locked.
>> Instead of retrying all swapped page faults, retry only when folio
>> locking fails.
>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>
> Let's just review what can now be handled under the VMA lock instead of
> the mmap_lock, in case somebody knows better than me that it's not safe.
>
> - We can call migration_entry_wait(). This will wait for PG_locked to
> become clear (in migration_entry_wait_on_locked()). As previously
> discussed offline, I think this is safe to do while holding the VMA
> locked.

Do we even need to be holding the VMA locked while in
migration_entry_wait()? My understanding is we're just waiting for
PG_locked to be cleared so we can return with a reasonable chance the
migration entry is gone. If for example it has been unmapped or
protections downgraded we will simply refault.

> - We can call remove_device_exclusive_entry(). That calls
> folio_lock_or_retry(), which will fail if it can't get the VMA lock.

Looks ok to me.

> - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> with Nouveau and amdkfd could comment on how safe this is?

Currently this won't work because drivers assume mmap_lock is held
during pgmap->ops->migrate_to_ram(). Primarily this is because
migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
that asserts mmap_lock is taken in walk_page_range() and also
migrate_vma_insert_page().

So I don't think we can call that case without mmap_lock.

At a glance it seems it should be relatively easy to move to using
lock_vma_under_rcu(). Drivers will need updating as well though because
migrate_vma_setup() is called outside of fault handling paths so drivers
will currently take mmap_lock rather than vma lock when looking up the
vma. See for example nouveau_svmm_bind().

> - I believe we can't call handle_pte_marker() because we exclude UFFD
> VMAs earlier.
> - We can call swap_readpage() if we allocate a new folio. I haven't
> traced through all this code to tell if it's OK.
>
> So ... I believe this is all OK, but we're definitely now willing to
> wait for I/O from the swap device while holding the VMA lock when we
> weren't before. And maybe we should make a bigger deal of it in the
> changelog.
>
> And maybe we shouldn't just be failing the folio_lock_or_retry(),
> maybe we should be waiting for the folio lock with the VMA locked.

On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> >> When page fault is handled under VMA lock protection, all swap page
> >> faults are retried with mmap_lock because folio_lock_or_retry
> >> implementation has to drop and reacquire mmap_lock if folio could
> >> not be immediately locked.
> >> Instead of retrying all swapped page faults, retry only when folio
> >> locking fails.
> >
> > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >
> > Let's just review what can now be handled under the VMA lock instead of
> > the mmap_lock, in case somebody knows better than me that it's not safe.
> >
> > - We can call migration_entry_wait(). This will wait for PG_locked to
> > become clear (in migration_entry_wait_on_locked()). As previously
> > discussed offline, I think this is safe to do while holding the VMA
> > locked.
>
> Do we even need to be holding the VMA locked while in
> migration_entry_wait()? My understanding is we're just waiting for
> PG_locked to be cleared so we can return with a reasonable chance the
> migration entry is gone. If for example it has been unmapped or
> protections downgraded we will simply refault.

If we drop VMA lock before migration_entry_wait() then we would need
to lock_vma_under_rcu again after the wait. In which case it might be
simpler to retry the fault with some special return code to indicate
that VMA lock is not held anymore and we want to retry without taking
mmap_lock. I think it's similar to the last options Matthew suggested
earlier. In which case we can reuse the same retry mechanism for both
cases, here and in __folio_lock_or_retry.

>
> > - We can call remove_device_exclusive_entry(). That calls
> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>
> Looks ok to me.
>
> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
> > with Nouveau and amdkfd could comment on how safe this is?
>
> Currently this won't work because drivers assume mmap_lock is held
> during pgmap->ops->migrate_to_ram(). Primarily this is because
> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> that asserts mmap_lock is taken in walk_page_range() and also
> migrate_vma_insert_page().
>
> So I don't think we can call that case without mmap_lock.
>
> At a glance it seems it should be relatively easy to move to using
> lock_vma_under_rcu(). Drivers will need updating as well though because
> migrate_vma_setup() is called outside of fault handling paths so drivers
> will currently take mmap_lock rather than vma lock when looking up the
> vma. See for example nouveau_svmm_bind().

Thanks for the pointers, Alistair! It does look like we need to be
more careful with the migrate_to_ram() path. For now I can fallback to
retrying with mmap_lock for this case, like we do with all cases
today. Afterwards this path can be made ready for working under VMA
lock and we can remove that retry. Does that sound good?

> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> > VMAs earlier.
> > - We can call swap_readpage() if we allocate a new folio. I haven't
> > traced through all this code to tell if it's OK.
> >
> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before. And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>

Suren Baghdasaryan <surenb@google.com> writes:

> On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>>
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
>> >> When page fault is handled under VMA lock protection, all swap page
>> >> faults are retried with mmap_lock because folio_lock_or_retry
>> >> implementation has to drop and reacquire mmap_lock if folio could
>> >> not be immediately locked.
>> >> Instead of retrying all swapped page faults, retry only when folio
>> >> locking fails.
>> >
>> > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> >
>> > Let's just review what can now be handled under the VMA lock instead of
>> > the mmap_lock, in case somebody knows better than me that it's not safe.
>> >
>> > - We can call migration_entry_wait(). This will wait for PG_locked to
>> > become clear (in migration_entry_wait_on_locked()). As previously
>> > discussed offline, I think this is safe to do while holding the VMA
>> > locked.
>>
>> Do we even need to be holding the VMA locked while in
>> migration_entry_wait()? My understanding is we're just waiting for
>> PG_locked to be cleared so we can return with a reasonable chance the
>> migration entry is gone. If for example it has been unmapped or
>> protections downgraded we will simply refault.
>
> If we drop VMA lock before migration_entry_wait() then we would need
> to lock_vma_under_rcu again after the wait. In which case it might be
> simpler to retry the fault with some special return code to indicate
> that VMA lock is not held anymore and we want to retry without taking
> mmap_lock. I think it's similar to the last options Matthew suggested
> earlier. In which case we can reuse the same retry mechanism for both
> cases, here and in __folio_lock_or_retry.

Good point. Agree there is no reason to re-take the VMA lock after the
wait, although in this case we shouldn't need to retry the fault
(ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
out to userspace.

>>
>> > - We can call remove_device_exclusive_entry(). That calls
>> > folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>>
>> Looks ok to me.
>>
>> > - We can call pgmap->ops->migrate_to_ram(). Perhaps somebody familiar
>> > with Nouveau and amdkfd could comment on how safe this is?
>>
>> Currently this won't work because drivers assume mmap_lock is held
>> during pgmap->ops->migrate_to_ram(). Primarily this is because
>> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
>> that asserts mmap_lock is taken in walk_page_range() and also
>> migrate_vma_insert_page().
>>
>> So I don't think we can call that case without mmap_lock.
>>
>> At a glance it seems it should be relatively easy to move to using
>> lock_vma_under_rcu(). Drivers will need updating as well though because
>> migrate_vma_setup() is called outside of fault handling paths so drivers
>> will currently take mmap_lock rather than vma lock when looking up the
>> vma. See for example nouveau_svmm_bind().
>
> Thanks for the pointers, Alistair! It does look like we need to be
> more careful with the migrate_to_ram() path. For now I can fallback to
> retrying with mmap_lock for this case, like we do with all cases
> today. Afterwards this path can be made ready for working under VMA
> lock and we can remove that retry. Does that sound good?

Sounds good to me. Fixing that shouldn't be too difficult but will need
changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
to look at doing that if/when this change makes it in. Thanks.

>> > - I believe we can't call handle_pte_marker() because we exclude UFFD
>> > VMAs earlier.
>> > - We can call swap_readpage() if we allocate a new folio. I haven't
>> > traced through all this code to tell if it's OK.
>> >
>> > So ... I believe this is all OK, but we're definitely now willing to
>> > wait for I/O from the swap device while holding the VMA lock when we
>> > weren't before. And maybe we should make a bigger deal of it in the
>> > changelog.
>> >
>> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
>> > maybe we should be waiting for the folio lock with the VMA locked.
>>

On Mon, Apr 17, 2023 at 4:33 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <apopple@nvidia.com> wrote:
> >>
> >> Matthew Wilcox <willy@infradead.org> writes:
> >>
> >> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> >> >> When page fault is handled under VMA lock protection, all swap page
> >> >> faults are retried with mmap_lock because folio_lock_or_retry
> >> >> implementation has to drop and reacquire mmap_lock if folio could
> >> >> not be immediately locked.
> >> >> Instead of retrying all swapped page faults, retry only when folio
> >> >> locking fails.
> >> >
> >> > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> >
> >> > Let's just review what can now be handled under the VMA lock instead of
> >> > the mmap_lock, in case somebody knows better than me that it's not safe.
> >> >
> >> > - We can call migration_entry_wait().  This will wait for PG_locked to
> >> >   become clear (in migration_entry_wait_on_locked()).  As previously
> >> >   discussed offline, I think this is safe to do while holding the VMA
> >> >   locked.
> >>
> >> Do we even need to be holding the VMA locked while in
> >> migration_entry_wait()? My understanding is we're just waiting for
> >> PG_locked to be cleared so we can return with a reasonable chance the
> >> migration entry is gone. If for example it has been unmapped or
> >> protections downgraded we will simply refault.
> >
> > If we drop the VMA lock before migration_entry_wait() then we would need
> > to lock_vma_under_rcu() again after the wait. In which case it might be
> > simpler to retry the fault with some special return code to indicate
> > that the VMA lock is not held anymore and we want to retry without
> > taking mmap_lock. I think it's similar to the last option Matthew
> > suggested earlier. In which case we can reuse the same retry mechanism
> > for both cases, here and in __folio_lock_or_retry().
>
> Good point. Agree there is no reason to re-take the VMA lock after the
> wait, although in this case we shouldn't need to retry the fault
> (ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
> out to userspace.

Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
use the VMA reference again inside its loop unless we return
VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped the
VMA lock and stability of the VMA is not guaranteed at that point. So,
we do need to return VM_FAULT_RETRY, maybe with another bit indicating
that the retry does not need to fall back to mmap_lock. Something like
"return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".

> >> > - We can call remove_device_exclusive_entry().  That calls
> >> >   folio_lock_or_retry(), which will fail if it can't get the VMA lock.
> >>
> >> Looks ok to me.
> >>
> >> > - We can call pgmap->ops->migrate_to_ram().  Perhaps somebody familiar
> >> >   with Nouveau and amdkfd could comment on how safe this is?
> >>
> >> Currently this won't work because drivers assume mmap_lock is held
> >> during pgmap->ops->migrate_to_ram(). Primarily this is because
> >> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> >> that asserts mmap_lock is taken in walk_page_range() and also
> >> migrate_vma_insert_page().
> >>
> >> So I don't think we can call that case without mmap_lock.
> >>
> >> At a glance it seems it should be relatively easy to move to using
> >> lock_vma_under_rcu(). Drivers will need updating as well though because
> >> migrate_vma_setup() is called outside of fault handling paths so
> >> drivers will currently take mmap_lock rather than the VMA lock when
> >> looking up the VMA. See for example nouveau_svmm_bind().
> >
> > Thanks for the pointers, Alistair! It does look like we need to be
> > more careful with the migrate_to_ram() path. For now I can fall back to
> > retrying with mmap_lock for this case, like we do with all cases today.
> > Afterwards this path can be made ready for working under the VMA lock
> > and we can remove that retry. Does that sound good?
>
> Sounds good to me. Fixing that shouldn't be too difficult but will need
> changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
> to look at doing that if/when this change makes it in. Thanks.
>
> >> > - I believe we can't call handle_pte_marker() because we exclude UFFD
> >> >   VMAs earlier.
> >> > - We can call swap_readpage() if we allocate a new folio.  I haven't
> >> >   traced through all this code to tell if it's OK.
> >> >
> >> > So ... I believe this is all OK, but we're definitely now willing to
> >> >   wait for I/O from the swap device while holding the VMA lock when we
> >> >   weren't before.  And maybe we should make a bigger deal of it in the
> >> >   changelog.
> >> >
> >> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> >> > maybe we should be waiting for the folio lock with the VMA locked.
On Mon, Apr 17, 2023 at 4:50 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> Actually, __collapse_huge_page_swapin(), which calls do_swap_page(), can
> use the VMA reference again inside its loop unless we return
> VM_FAULT_RETRY or VM_FAULT_ERROR. That is not safe since we dropped the
> VMA lock and stability of the VMA is not guaranteed at that point. So,
> we do need to return VM_FAULT_RETRY, maybe with another bit indicating
> that the retry does not need to fall back to mmap_lock. Something like
> "return VM_FAULT_RETRY | VM_FAULT_USE_VMA_LOCK".

False alarm. __collapse_huge_page_swapin() is always called under
mmap_lock protection. I'll go over the code once more to make sure
nothing else would use the VMA after we drop the VMA lock in the page
fault path.
On Mon, Apr 17, 2023 at 6:07 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> [...]
>
> False alarm. __collapse_huge_page_swapin() is always called under
> mmap_lock protection. I'll go over the code once more to make sure
> nothing else would use the VMA after we drop the VMA lock in the page
> fault path.

I posted a new series at
https://lore.kernel.org/all/20230501175025.36233-1-surenb@google.com/
It implements the suggestions discussed in this thread. Feedback is
appreciated! Thanks!
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f3a7e53fccf..67b937b0f436 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * mmap_lock has been released (mmap_read_unlock(), unless flags had both
  * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  * which case mmap_lock is still held.
+ * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ * with VMA lock only, the VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
+	/* Can't do this if not holding mmap_lock */
+	if (flags & FAULT_FLAG_VMA_LOCK)
+		return false;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index d88f370eacd1..3301a8d01820 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3715,11 +3715,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {