Message ID | 20230501175025.36233-1-surenb@google.com (mailing list archive)
---|---
State | New
Series | [1/3] mm: handle swap page faults under VMA lock if page is uncontended
On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > +++ b/mm/memory.c > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > if (!pte_unmap_same(vmf)) > goto out; > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > - ret = VM_FAULT_RETRY; > - goto out; > - } > - > entry = pte_to_swp_entry(vmf->orig_pte); > if (unlikely(non_swap_entry(entry))) { > if (is_migration_entry(entry)) { You're missing the necessary fallback in the (!folio) case. swap_readpage() is synchronous and will sleep.
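For reference, the "(!folio) case" here is the swapcache-bypass fast path in do_swap_page(). Below is an abbreviated, hedged sketch of that path as it looked at the time (names such as si, entry and folio follow the surrounding kernel code; the allocation details are elided). The FAULT_FLAG_VMA_LOCK check is purely illustrative of where a fallback could sit so the synchronous swap_readpage() call is never reached under the VMA lock; it is not part of the posted patch.

```c
	if (!folio) {
		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
		    __swap_count(entry) == 1) {
			/* swapcache is skipped; the read further down is synchronous */
			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
				ret = VM_FAULT_RETRY;	/* redo under mmap_lock */
				goto out;
			}
			/* ... allocate folio, then swap_readpage(page, true, NULL) ... */
		} else {
			/*
			 * Readahead path: reads are issued asynchronously and
			 * completion is waited for in folio_lock_or_retry().
			 */
			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
		}
	}
```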
On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > +++ b/mm/memory.c > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > if (!pte_unmap_same(vmf)) > > > goto out; > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > - ret = VM_FAULT_RETRY; > > > - goto out; > > > - } > > > - > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > if (unlikely(non_swap_entry(entry))) { > > > if (is_migration_entry(entry)) { > > > > You're missing the necessary fallback in the (!folio) case. > > swap_readpage() is synchronous and will sleep. > > True, but is it unsafe to do that under VMA lock and has to be done > under mmap_lock? ... you were the one arguing that we didn't want to wait for I/O with the VMA lock held?
On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > +++ b/mm/memory.c > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > if (!pte_unmap_same(vmf)) > > > > goto out; > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > - ret = VM_FAULT_RETRY; > > > > - goto out; > > > > - } > > > > - > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > if (unlikely(non_swap_entry(entry))) { > > > > if (is_migration_entry(entry)) { > > > > > > You're missing the necessary fallback in the (!folio) case. > > > swap_readpage() is synchronous and will sleep. > > > > True, but is it unsafe to do that under VMA lock and has to be done > > under mmap_lock? > > ... you were the one arguing that we didn't want to wait for I/O with > the VMA lock held? Well, that discussion was about waiting in folio_lock_or_retry() with the lock being held. I argued against it because currently we drop mmap_lock lock before waiting, so if we don't drop VMA lock we would be changing the current behavior which might introduce new regressions. In the case of swap_readpage and swapin_readahead we already wait with mmap_lock held, so waiting with VMA lock held does not introduce new problems (unless there is a need to hold mmap_lock). That said, you are absolutely correct that this situation can be improved by dropping the lock in these cases too. I just didn't want to attack everything at once. I believe after we agree on the approach implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com for dropping the VMA lock before waiting, these cases can be added easier. Does that make sense?
On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > +++ b/mm/memory.c > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > if (!pte_unmap_same(vmf)) > > > > > goto out; > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > - ret = VM_FAULT_RETRY; > > > > > - goto out; > > > > > - } > > > > > - > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > if (is_migration_entry(entry)) { > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > swap_readpage() is synchronous and will sleep. > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > under mmap_lock? > > > > ... you were the one arguing that we didn't want to wait for I/O with > > the VMA lock held? > > Well, that discussion was about waiting in folio_lock_or_retry() with > the lock being held. I argued against it because currently we drop > mmap_lock lock before waiting, so if we don't drop VMA lock we would > be changing the current behavior which might introduce new > regressions. In the case of swap_readpage and swapin_readahead we > already wait with mmap_lock held, so waiting with VMA lock held does > not introduce new problems (unless there is a need to hold mmap_lock). > > That said, you are absolutely correct that this situation can be > improved by dropping the lock in these cases too. I just didn't want > to attack everything at once. I believe after we agree on the approach > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > for dropping the VMA lock before waiting, these cases can be added > easier. Does that make sense? OK, I looked at this path some more, and I think we're fine. This patch is only called for SWP_SYNCHRONOUS_IO which is only set for QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms (both btt and pmem). So the answer is that we don't sleep in this path, and there's no need to drop the lock.
On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > +++ b/mm/memory.c > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > if (!pte_unmap_same(vmf)) > > > > > > goto out; > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > - ret = VM_FAULT_RETRY; > > > > > > - goto out; > > > > > > - } > > > > > > - > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > under mmap_lock? > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > the VMA lock held? > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > the lock being held. I argued against it because currently we drop > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > be changing the current behavior which might introduce new > > regressions. In the case of swap_readpage and swapin_readahead we > > already wait with mmap_lock held, so waiting with VMA lock held does > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > That said, you are absolutely correct that this situation can be > > improved by dropping the lock in these cases too. I just didn't want > > to attack everything at once. I believe after we agree on the approach > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > for dropping the VMA lock before waiting, these cases can be added > > easier. Does that make sense? > > OK, I looked at this path some more, and I think we're fine. This > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > (both btt and pmem). So the answer is that we don't sleep in this > path, and there's no need to drop the lock. Yes but swapin_readahead does sleep, so I'll have to handle that case too after this.
On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > +++ b/mm/memory.c > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > goto out; > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > - goto out; > > > > > > > - } > > > > > > > - > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > under mmap_lock? > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > the VMA lock held? > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > the lock being held. I argued against it because currently we drop > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > be changing the current behavior which might introduce new > > > regressions. In the case of swap_readpage and swapin_readahead we > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > That said, you are absolutely correct that this situation can be > > > improved by dropping the lock in these cases too. I just didn't want > > > to attack everything at once. I believe after we agree on the approach > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > for dropping the VMA lock before waiting, these cases can be added > > > easier. Does that make sense? > > > > OK, I looked at this path some more, and I think we're fine. This > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > (both btt and pmem). So the answer is that we don't sleep in this > > path, and there's no need to drop the lock. > > Yes but swapin_readahead does sleep, so I'll have to handle that case > too after this. Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere in swapin_readahead()? It all looks like async I/O to me.
On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > +++ b/mm/memory.c > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > goto out; > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > - goto out; > > > > > > > > - } > > > > > > > > - > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > under mmap_lock? > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > the VMA lock held? > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > the lock being held. I argued against it because currently we drop > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > be changing the current behavior which might introduce new > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > That said, you are absolutely correct that this situation can be > > > > improved by dropping the lock in these cases too. I just didn't want > > > > to attack everything at once. I believe after we agree on the approach > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > for dropping the VMA lock before waiting, these cases can be added > > > > easier. Does that make sense? > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > (both btt and pmem). So the answer is that we don't sleep in this > > > path, and there's no need to drop the lock. > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > too after this. > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > in swapin_readahead()? It all looks like async I/O to me. Hmm. I thought that we have synchronous I/O in the following paths: swapin_readahead()->swap_cluster_readahead()->swap_readpage() swapin_readahead()->swap_vma_readahead()->swap_readpage() but just noticed that in both cases swap_readpage() is called with the synchronous parameter being false. So you are probably right here... 
Does that mean swapin_readahead() might return a page which does not have its content swapped-in yet?
On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote: > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > - goto out; > > > > > > > > > - } > > > > > > > > > - > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > under mmap_lock? > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > the VMA lock held? > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > the lock being held. I argued against it because currently we drop > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > be changing the current behavior which might introduce new > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > to attack everything at once. I believe after we agree on the approach > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > easier. Does that make sense? > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > path, and there's no need to drop the lock. > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > too after this. > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > in swapin_readahead()? It all looks like async I/O to me. > > Hmm. 
I thought that we have synchronous I/O in the following paths: > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > swapin_readahead()->swap_vma_readahead()->swap_readpage() > but just noticed that in both cases swap_readpage() is called with the > synchronous parameter being false. So you are probably right here... > Does that mean swapin_readahead() might return a page which does not > have its content swapped-in yet? That's my understanding. In that case it's !uptodate and still locked. The folio_lock_or_retry() will wait for the read to complete unless we've told it we'd rather retry.
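For reference, here is a condensed sketch of __folio_lock_or_retry() before this series (abbreviated from mm/filemap.c, not verbatim). It shows the behaviour Matthew describes: the function either waits for the folio lock, which for a folio still under read I/O means waiting for the read to complete, or returns false so the caller can retry the fault.

```c
bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
			   unsigned int flags)
{
	if (fault_flag_allow_retry_first(flags)) {
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return false;		/* don't wait; mmap_lock stays held */

		mmap_read_unlock(mm);		/* drop mmap_lock, then wait ... */
		if (flags & FAULT_FLAG_KILLABLE)
			folio_wait_locked_killable(folio);
		else
			folio_wait_locked(folio);	/* ... for the read to finish */
		return false;
	}
	if (flags & FAULT_FLAG_KILLABLE) {
		if (__folio_lock_killable(folio)) {
			mmap_read_unlock(mm);
			return false;		/* interrupted by a fatal signal */
		}
	} else {
		__folio_lock(folio);		/* may sleep until the read completes */
	}
	return true;
}
```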
On Tue, May 2, 2023 at 4:40 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, May 02, 2023 at 04:04:59PM -0700, Suren Baghdasaryan wrote: > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > > - goto out; > > > > > > > > > > - } > > > > > > > > > > - > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > > under mmap_lock? > > > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > > the VMA lock held? > > > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > > the lock being held. I argued against it because currently we drop > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > > be changing the current behavior which might introduce new > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > > to attack everything at once. I believe after we agree on the approach > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > > easier. Does that make sense? > > > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > > path, and there's no need to drop the lock. > > > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > > too after this. > > > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > > in swapin_readahead()? It all looks like async I/O to me. > > > > Hmm. 
I thought that we have synchronous I/O in the following paths: > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > > but just noticed that in both cases swap_readpage() is called with the > > synchronous parameter being false. So you are probably right here... > > Does that mean swapin_readahead() might return a page which does not > > have its content swapped-in yet? > > That's my understanding. In that case it's !uptodate and still locked. > The folio_lock_or_retry() will wait for the read to complete unless > we've told it we'd rather retry. Ok, and we already drop the locks in folio_lock_or_retry() when needed. Sounds like we cover this case with this patchset?
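The consumer side in do_swap_page() looks roughly like the sketch below (abbreviated, not verbatim). With the __folio_lock_or_retry() change from this patch, a VMA-locked fault that would otherwise have to wait comes back as VM_FAULT_RETRY and is redone under mmap_lock.

```c
	locked = folio_lock_or_retry(folio, vma->vm_mm, vmf->flags);
	if (!locked) {
		/* VMA-locked (or NOWAIT) fault: don't wait, retry the fault */
		ret |= VM_FAULT_RETRY;
		goto out_release;
	}
	/*
	 * Holding the folio lock here means any read I/O started by
	 * swapin_readahead() has completed (a failed read is caught by
	 * a later uptodate check).
	 */
```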
On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > - goto out; > > > > > > > > > - } > > > > > > > > > - > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > under mmap_lock? > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > the VMA lock held? > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > the lock being held. I argued against it because currently we drop > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > be changing the current behavior which might introduce new > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > to attack everything at once. I believe after we agree on the approach > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > easier. Does that make sense? > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > path, and there's no need to drop the lock. > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > too after this. > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > in swapin_readahead()? It all looks like async I/O to me. > > Hmm. 
I thought that we have synchronous I/O in the following paths: > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > swapin_readahead()->swap_vma_readahead()->swap_readpage() > but just noticed that in both cases swap_readpage() is called with the > synchronous parameter being false. So you are probably right here... In both swap_cluster_readahead() and swap_vma_readahead() it looks like if the readahead window is 1 (aka we are not reading ahead), then we jump to directly calling read_swap_cache_async() passing do_poll = true, which means we may end up calling swap_readpage() passing synchronous = true. I am not familiar with readahead heuristics, so I am not sure how common this is, but it's something to think about. > Does that mean swapin_readahead() might return a page which does not > have its content swapped-in yet? >
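A condensed sketch of the branch Yosry describes, based on swap_vma_readahead() of this era (abbreviated, not verbatim): a readahead window of 1 falls through to a single read_swap_cache_async() call with do_poll == true, which read_swap_cache_async() then passes on as swap_readpage()'s synchronous argument.

```c
	swap_ra_info(vmf, &ra_info);
	if (ra_info.win == 1)
		goto skip;

	/* ... issue asynchronous reads for the neighbouring swap entries ... */
skip:
	/* do_poll == (ra_info.win == 1), i.e. true when not reading ahead */
	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
				     ra_info.win == 1, NULL);
```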
On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > > - goto out; > > > > > > > > > > - } > > > > > > > > > > - > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > > under mmap_lock? > > > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > > the VMA lock held? > > > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > > the lock being held. I argued against it because currently we drop > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > > be changing the current behavior which might introduce new > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > > to attack everything at once. I believe after we agree on the approach > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > > easier. Does that make sense? > > > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > > path, and there's no need to drop the lock. > > > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > > too after this. > > > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > > in swapin_readahead()? It all looks like async I/O to me. > > > > Hmm. 
I thought that we have synchronous I/O in the following paths: > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > > but just noticed that in both cases swap_readpage() is called with the > > synchronous parameter being false. So you are probably right here... > > In both swap_cluster_readahead() and swap_vma_readahead() it looks > like if the readahead window is 1 (aka we are not reading ahead), then > we jump to directly calling read_swap_cache_async() passing do_poll = > true, which means we may end up calling swap_readpage() passing > synchronous = true. > > I am not familiar with readahead heuristics, so I am not sure how > common this is, but it's something to think about. Uh, you are correct. If this branch is common, we could use the same "drop the lock and retry" pattern inside read_swap_cache_async(). That would be quite easy to implement. Thanks for checking on it! > > > Does that mean swapin_readahead() might return a page which does not > > have its content swapped-in yet? > >
On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > > > > > > > > > > > +++ b/mm/memory.c > > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > > > > > > > > > > > if (!pte_unmap_same(vmf)) > > > > > > > > > > > goto out; > > > > > > > > > > > > > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > > > > > > > > > > > - ret = VM_FAULT_RETRY; > > > > > > > > > > > - goto out; > > > > > > > > > > > - } > > > > > > > > > > > - > > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > > > > > > > > > > > if (is_migration_entry(entry)) { > > > > > > > > > > > > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > > > > > > > > > > swap_readpage() is synchronous and will sleep. > > > > > > > > > > > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > > > > > > > > > under mmap_lock? > > > > > > > > > > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > > > > > > > > the VMA lock held? > > > > > > > > > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > > > > > > > the lock being held. I argued against it because currently we drop > > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > > > > > > > be changing the current behavior which might introduce new > > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > > > > > > > > > > > > > > That said, you are absolutely correct that this situation can be > > > > > > > improved by dropping the lock in these cases too. I just didn't want > > > > > > > to attack everything at once. I believe after we agree on the approach > > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > > > > > > > for dropping the VMA lock before waiting, these cases can be added > > > > > > > easier. Does that make sense? > > > > > > > > > > > > OK, I looked at this path some more, and I think we're fine. This > > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > > > > > > (both btt and pmem). So the answer is that we don't sleep in this > > > > > > path, and there's no need to drop the lock. 
> > > > > > > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > > > > > too after this. > > > > > > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > > > > in swapin_readahead()? It all looks like async I/O to me. > > > > > > Hmm. I thought that we have synchronous I/O in the following paths: > > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > > > but just noticed that in both cases swap_readpage() is called with the > > > synchronous parameter being false. So you are probably right here... > > > > In both swap_cluster_readahead() and swap_vma_readahead() it looks > > like if the readahead window is 1 (aka we are not reading ahead), then > > we jump to directly calling read_swap_cache_async() passing do_poll = > > true, which means we may end up calling swap_readpage() passing > > synchronous = true. > > > > I am not familiar with readahead heuristics, so I am not sure how > > common this is, but it's something to think about. > > Uh, you are correct. If this branch is common, we could use the same > "drop the lock and retry" pattern inside read_swap_cache_async(). That > would be quite easy to implement. > Thanks for checking on it! I am honestly not sure how common this is. +Ying who might have a better idea. > > > > > > > Does that mean swapin_readahead() might return a page which does not > > > have its content swapped-in yet? > > >
Yosry Ahmed <yosryahmed@google.com> writes: > On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: >> >> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> > >> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: >> > > >> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > >> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: >> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > >> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: >> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > > > >> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: >> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: >> > > > > > > > > > >> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: >> > > > > > > > > > > +++ b/mm/memory.c >> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) >> > > > > > > > > > > if (!pte_unmap_same(vmf)) >> > > > > > > > > > > goto out; >> > > > > > > > > > > >> > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { >> > > > > > > > > > > - ret = VM_FAULT_RETRY; >> > > > > > > > > > > - goto out; >> > > > > > > > > > > - } >> > > > > > > > > > > - >> > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); >> > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { >> > > > > > > > > > > if (is_migration_entry(entry)) { >> > > > > > > > > > >> > > > > > > > > > You're missing the necessary fallback in the (!folio) case. >> > > > > > > > > > swap_readpage() is synchronous and will sleep. >> > > > > > > > > >> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done >> > > > > > > > > under mmap_lock? >> > > > > > > > >> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with >> > > > > > > > the VMA lock held? >> > > > > > > >> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with >> > > > > > > the lock being held. I argued against it because currently we drop >> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would >> > > > > > > be changing the current behavior which might introduce new >> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we >> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does >> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). >> > > > > > > >> > > > > > > That said, you are absolutely correct that this situation can be >> > > > > > > improved by dropping the lock in these cases too. I just didn't want >> > > > > > > to attack everything at once. I believe after we agree on the approach >> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com >> > > > > > > for dropping the VMA lock before waiting, these cases can be added >> > > > > > > easier. Does that make sense? >> > > > > > >> > > > > > OK, I looked at this path some more, and I think we're fine. This >> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for >> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms >> > > > > > (both btt and pmem). 
So the answer is that we don't sleep in this >> > > > > > path, and there's no need to drop the lock. >> > > > > >> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case >> > > > > too after this. >> > > > >> > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere >> > > > in swapin_readahead()? It all looks like async I/O to me. >> > > >> > > Hmm. I thought that we have synchronous I/O in the following paths: >> > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() >> > > swapin_readahead()->swap_vma_readahead()->swap_readpage() >> > > but just noticed that in both cases swap_readpage() is called with the >> > > synchronous parameter being false. So you are probably right here... >> > >> > In both swap_cluster_readahead() and swap_vma_readahead() it looks >> > like if the readahead window is 1 (aka we are not reading ahead), then >> > we jump to directly calling read_swap_cache_async() passing do_poll = >> > true, which means we may end up calling swap_readpage() passing >> > synchronous = true. >> > >> > I am not familiar with readahead heuristics, so I am not sure how >> > common this is, but it's something to think about. >> >> Uh, you are correct. If this branch is common, we could use the same >> "drop the lock and retry" pattern inside read_swap_cache_async(). That >> would be quite easy to implement. >> Thanks for checking on it! > > > I am honestly not sure how common this is. > > +Ying who might have a better idea. Checked the code and related history. It seems that we can just pass "synchronous = false" to swap_readpage() in read_swap_cache_async(). "synchronous = true" was introduced in commit 23955622ff8d ("swap: add block io poll in swapin path") to reduce swap read latency for block devices that can be polled. But in commit 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio"), the polling is deleted. So, we don't need to pass "synchronous = true" to swap_readpage() during swapin_readahead(), because we will wait the IO to complete in folio_lock_or_retry(). Best Regards, Huang, Ying >> >> >> > >> > > Does that mean swapin_readahead() might return a page which does not >> > > have its content swapped-in yet? >> > >
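One possible shape of the change Ying suggests, sketched against the tail of read_swap_cache_async() of this era (a hypothetical illustration, not the eventual upstream patch): always issue the read asynchronously and let folio_lock_or_retry() wait for completion.

```c
	page = __read_swap_cache_async(entry, gfp_mask, vma, addr,
				       &page_was_allocated);
	if (page_was_allocated)
		swap_readpage(page, false, plug);	/* was: do_poll */
	return page;
```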
On Thu, May 4, 2023 at 10:03 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yosry Ahmed <yosryahmed@google.com> writes: > > > On Wed, May 3, 2023 at 12:57 PM Suren Baghdasaryan <surenb@google.com> wrote: > >> > >> On Wed, May 3, 2023 at 1:34 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >> > > >> > On Tue, May 2, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > >> > > > >> > > On Tue, May 2, 2023 at 3:31 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > >> > > > On Tue, May 02, 2023 at 09:36:03AM -0700, Suren Baghdasaryan wrote: > >> > > > > On Tue, May 2, 2023 at 8:03 AM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > >> > > > > > On Mon, May 01, 2023 at 10:04:56PM -0700, Suren Baghdasaryan wrote: > >> > > > > > > On Mon, May 1, 2023 at 8:22 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > > > >> > > > > > > > On Mon, May 01, 2023 at 07:30:13PM -0700, Suren Baghdasaryan wrote: > >> > > > > > > > > On Mon, May 1, 2023 at 7:02 PM Matthew Wilcox <willy@infradead.org> wrote: > >> > > > > > > > > > > >> > > > > > > > > > On Mon, May 01, 2023 at 10:50:23AM -0700, Suren Baghdasaryan wrote: > >> > > > > > > > > > > +++ b/mm/memory.c > >> > > > > > > > > > > @@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) > >> > > > > > > > > > > if (!pte_unmap_same(vmf)) > >> > > > > > > > > > > goto out; > >> > > > > > > > > > > > >> > > > > > > > > > > - if (vmf->flags & FAULT_FLAG_VMA_LOCK) { > >> > > > > > > > > > > - ret = VM_FAULT_RETRY; > >> > > > > > > > > > > - goto out; > >> > > > > > > > > > > - } > >> > > > > > > > > > > - > >> > > > > > > > > > > entry = pte_to_swp_entry(vmf->orig_pte); > >> > > > > > > > > > > if (unlikely(non_swap_entry(entry))) { > >> > > > > > > > > > > if (is_migration_entry(entry)) { > >> > > > > > > > > > > >> > > > > > > > > > You're missing the necessary fallback in the (!folio) case. > >> > > > > > > > > > swap_readpage() is synchronous and will sleep. > >> > > > > > > > > > >> > > > > > > > > True, but is it unsafe to do that under VMA lock and has to be done > >> > > > > > > > > under mmap_lock? > >> > > > > > > > > >> > > > > > > > ... you were the one arguing that we didn't want to wait for I/O with > >> > > > > > > > the VMA lock held? > >> > > > > > > > >> > > > > > > Well, that discussion was about waiting in folio_lock_or_retry() with > >> > > > > > > the lock being held. I argued against it because currently we drop > >> > > > > > > mmap_lock lock before waiting, so if we don't drop VMA lock we would > >> > > > > > > be changing the current behavior which might introduce new > >> > > > > > > regressions. In the case of swap_readpage and swapin_readahead we > >> > > > > > > already wait with mmap_lock held, so waiting with VMA lock held does > >> > > > > > > not introduce new problems (unless there is a need to hold mmap_lock). > >> > > > > > > > >> > > > > > > That said, you are absolutely correct that this situation can be > >> > > > > > > improved by dropping the lock in these cases too. I just didn't want > >> > > > > > > to attack everything at once. I believe after we agree on the approach > >> > > > > > > implemented in https://lore.kernel.org/all/20230501175025.36233-3-surenb@google.com > >> > > > > > > for dropping the VMA lock before waiting, these cases can be added > >> > > > > > > easier. Does that make sense? > >> > > > > > > >> > > > > > OK, I looked at this path some more, and I think we're fine. 
This > >> > > > > > patch is only called for SWP_SYNCHRONOUS_IO which is only set for > >> > > > > > QUEUE_FLAG_SYNCHRONOUS devices, which are brd, zram and nvdimms > >> > > > > > (both btt and pmem). So the answer is that we don't sleep in this > >> > > > > > path, and there's no need to drop the lock. > >> > > > > > >> > > > > Yes but swapin_readahead does sleep, so I'll have to handle that case > >> > > > > too after this. > >> > > > > >> > > > Sleeping is OK, we do that in pXd_alloc()! Do we block on I/O anywhere > >> > > > in swapin_readahead()? It all looks like async I/O to me. > >> > > > >> > > Hmm. I thought that we have synchronous I/O in the following paths: > >> > > swapin_readahead()->swap_cluster_readahead()->swap_readpage() > >> > > swapin_readahead()->swap_vma_readahead()->swap_readpage() > >> > > but just noticed that in both cases swap_readpage() is called with the > >> > > synchronous parameter being false. So you are probably right here... > >> > > >> > In both swap_cluster_readahead() and swap_vma_readahead() it looks > >> > like if the readahead window is 1 (aka we are not reading ahead), then > >> > we jump to directly calling read_swap_cache_async() passing do_poll = > >> > true, which means we may end up calling swap_readpage() passing > >> > synchronous = true. > >> > > >> > I am not familiar with readahead heuristics, so I am not sure how > >> > common this is, but it's something to think about. > >> > >> Uh, you are correct. If this branch is common, we could use the same > >> "drop the lock and retry" pattern inside read_swap_cache_async(). That > >> would be quite easy to implement. > >> Thanks for checking on it! > > > > > > I am honestly not sure how common this is. > > > > +Ying who might have a better idea. > > Checked the code and related history. It seems that we can just pass > "synchronous = false" to swap_readpage() in read_swap_cache_async(). > "synchronous = true" was introduced in commit 23955622ff8d ("swap: add > block io poll in swapin path") to reduce swap read latency for block > devices that can be polled. But in commit 9650b453a3d4 ("block: ignore > RWF_HIPRI hint for sync dio"), the polling is deleted. So, we don't > need to pass "synchronous = true" to swap_readpage() during > swapin_readahead(), because we will wait the IO to complete in > folio_lock_or_retry(). Thanks for investigating, Ying! It sounds like we can make some simplifications here. I'll double-check and if I don't find anything else, will change to "synchronous = false" in the next version of the patchset. > > Best Regards, > Huang, Ying > > >> > >> > >> > > >> > > Does that mean swapin_readahead() might return a page which does not > >> > > have its content swapped-in yet? > >> > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..84f39114d4de 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1706,6 +1706,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * mmap_lock has been released (mmap_read_unlock(), unless flags had both
  * FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT set, in
  * which case mmap_lock is still held.
+ * If flags had FAULT_FLAG_VMA_LOCK set, meaning the operation is performed
+ * with VMA lock only, the VMA lock is still held.
  *
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
@@ -1713,6 +1715,10 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
 bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
 			 unsigned int flags)
 {
+	/* Can't do this if not holding mmap_lock */
+	if (flags & FAULT_FLAG_VMA_LOCK)
+		return false;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..41f45819a923 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3711,11 +3711,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!pte_unmap_same(vmf))
 		goto out;
 
-	if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
-		ret = VM_FAULT_RETRY;
-		goto out;
-	}
-
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
@@ -3725,6 +3720,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
+			if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
+				/*
+				 * migrate_to_ram is not yet ready to operate
+				 * under VMA lock.
+				 */
+				ret |= VM_FAULT_RETRY;
+				goto out;
+			}
+
 			vmf->page = pfn_swap_entry_to_page(entry);
 			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);
When a page fault is handled under VMA lock protection, all swap page
faults are retried under mmap_lock because the folio_lock_or_retry()
implementation has to drop and reacquire mmap_lock if the folio could
not be immediately locked.

Instead of retrying all swapped page faults, retry only when folio
locking fails. Drivers implementing ops->migrate_to_ram might still
rely on mmap_lock, therefore fall back to mmap_lock in this case.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/filemap.c |  6 ++++++
 mm/memory.c  | 14 +++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)
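For context on how the VM_FAULT_RETRY fallback relied on by this series is consumed, the sketch below shows roughly how an arch page-fault handler (x86 at the time) drives the per-VMA-lock path; it is an abbreviated reconstruction, not the exact upstream code.

```c
	if (!(flags & FAULT_FLAG_USER))
		goto lock_mmap;

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		goto lock_mmap;
	if (unlikely(access_error(error_code, vma))) {
		vma_end_read(vma);
		goto lock_mmap;
	}
	fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);

	if (!(fault & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		goto done;
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);
	/* fall through and retry the whole fault under mmap_lock */
lock_mmap:
```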