[v2,00/35] Speculative page faults

Message-Id: <20220128131006.67712-1-michel@lespinasse.org>
Return-Path: <owner-linux-mm@kvack.org>
From: Michel Lespinasse <michel@lespinasse.org>
To: Linux-MM <linux-mm@kvack.org>, linux-kernel@vger.kernel.org,
    Andrew Morton <akpm@linux-foundation.org>
Cc: kernel-team@fb.com, Laurent Dufour <ldufour@linux.ibm.com>,
    Jerome Glisse <jglisse@google.com>, Peter Zijlstra <peterz@infradead.org>,
    Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
    Davidlohr Bueso <dave@stgolabs.net>, Matthew Wilcox <willy@infradead.org>,
    Liam Howlett <liam.howlett@oracle.com>, Rik van Riel <riel@surriel.com>,
    Paul McKenney <paulmck@kernel.org>, Song Liu <songliubraving@fb.com>,
    Suren Baghdasaryan <surenb@google.com>, Minchan Kim <minchan@google.com>,
    Joel Fernandes <joelaf@google.com>, David Rientjes <rientjes@google.com>,
    Axel Rasmussen <axelrasmussen@google.com>, Andy Lutomirski <luto@kernel.org>,
    Michel Lespinasse <michel@lespinasse.org>
Subject: [PATCH v2 00/35] Speculative page faults
Date: Fri, 28 Jan 2022 05:09:31 -0800
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Michel Lespinasse Jan. 28, 2022, 1:09 p.m. UTC

This patchset is my take on speculative page faults (spf).
It builds on ideas that have been previously proposed by Laurent Dufour,
Peter Zijlstra and others. While Laurent's previous proposal
was rejected around the time of LSF/MM 2019, I am hoping we can revisit
this now based on what I think is a simpler and more bisectable approach,
much improved scaling numbers in the anonymous vma case, and the Android
use case that has since emerged. I will expand on these points towards
the end of this message.

The patch series applies on top of linux v5.17-rc1;
a git tree is also available:
git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon

I would like these patches to be considered for inclusion into v5.18.
Several Android vendors are using Laurent Dufour's previous SPF work in
their kernel trees in order to improve application startup performance,
want to converge to an upstream accepted solution, and have reported good
numbers with previous versions of this patchset. Also, there is a broader
interest in reducing mmap lock dependencies in critical MM paths,
and I think this patchset would be a good first step in that direction.


This patchset follows the same overall structure as the v1 proposal,
with the following differences:
- Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
- The mmu notifier lock is new; this fixes a race in the v1 patchset
  between speculative COW faults and registering new MMU notifiers.
- Speculative handling of swap-cache pages has been removed.
- Commit 30 is new; this fixes build issues that showed up in some configs.


In principle it would also be possible to extend this work for handling
file mapped vmas; I have pending work on such patches too but they are
not mature enough to be submitted for inclusion at this point.


Patchset summary:

Classical page fault processing takes the mmap read lock in order to
prevent races with mmap writers. In contrast, speculative fault
processing does not take the mmap read lock, and instead verifies,
when the results of the page fault are about to get committed and
become visible to other threads, that no mmap writers have been
running concurrently with the page fault. If the check fails,
speculative updates do not get committed and the fault is retried
in the usual, non-speculative way (with the mmap read lock held).

The concurrency check is implemented using a per-mm mmap sequence count.
The counter is incremented at the beginning and end of each mmap write
operation. If the counter is initially observed to have an even value,
and has the same value later on, the observer can deduce that no mmap
writers have been running concurrently with it between those two times.
This is similar to a seqlock, except that readers never spin on the
counter value (they would instead revert to taking the mmap read lock),
and writers are allowed to sleep. One benefit of this approach is that
it requires no writer side changes, just some hooks in the mmap write
lock APIs that writers already use.
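
As an illustration only, the counter protocol described above can be
sketched in user-space C11. The names (struct mm_seq, mm_seq_*) are
hypothetical and are not the identifiers used in the series; the two
write-side helpers stand in for the hooks in the mmap write lock APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm counter; not the kernel's struct. */
struct mm_seq {
    atomic_ulong count;   /* odd while an mmap writer is active */
};

/* Writer side: would be called when the mmap write lock is taken. */
static void mm_seq_write_begin(struct mm_seq *s)
{
    unsigned long prev = atomic_fetch_add(&s->count, 1);

    /* Writers are serialized by the mmap lock, so prev must be even. */
    assert((prev & 1) == 0);
    (void)prev;
}

/* Writer side: would be called when the mmap write lock is released. */
static void mm_seq_write_end(struct mm_seq *s)
{
    unsigned long prev = atomic_fetch_add(&s->count, 1);

    assert((prev & 1) == 1);
    (void)prev;
}

/*
 * Reader side: sample the counter. Returns false if a writer is active;
 * the reader does not spin, it falls back to the locked path instead.
 */
static bool mm_seq_read_begin(struct mm_seq *s, unsigned long *snap)
{
    *snap = atomic_load(&s->count);
    return (*snap & 1) == 0;
}

/* True if no writer has started since the snapshot was taken. */
static bool mm_seq_read_valid(struct mm_seq *s, unsigned long snap)
{
    return atomic_load(&s->count) == snap;
}
```

A reader that gets false from either reader-side helper simply drops
its speculative work and retries with the mmap read lock held.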

The first step of a speculative page fault is to look up the vma and
read its contents (currently by making a copy of the vma, though in
principle it would be sufficient to only read the vma attributes that
are used in page faults). The mmap sequence count is used to verify
that there were no mmap writers concurrent to the lookup and copy steps.
Note that walking rbtrees while there may potentially be concurrent
writers is not an entirely new idea in linux, as latched rbtrees
are already doing this. This is safe as long as the lookup is
followed by a sequence check to verify that concurrency did not
actually occur (and abort the speculative fault if it did).
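
A minimal user-space sketch of this "copy, then validate" step follows.
It assumes a single vma standing in for the rbtree, and all names
(struct vma_sketch, struct mm_sketch, spf_snapshot_vma) are made up for
the example. A real implementation would also need READ_ONCE-style
accesses and memory barriers around the copy, which are omitted here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down vma: only fields a fault would read. */
struct vma_sketch {
    unsigned long start, end, flags;
};

struct mm_sketch {
    atomic_ulong seq;        /* odd while an mmap writer is active */
    struct vma_sketch vma;   /* stands in for the rbtree of vmas */
};

/*
 * Copy the vma into *out without taking any lock, then check that no
 * writer ran during the copy. Returns true only if the copy is known to
 * be consistent and covers addr; *seq receives the sampled count so the
 * caller can re-validate later, at commit time.
 */
static bool spf_snapshot_vma(struct mm_sketch *mm, unsigned long addr,
                             struct vma_sketch *out, unsigned long *seq)
{
    assert(mm && out && seq);

    *seq = atomic_load(&mm->seq);
    if (*seq & 1)
        return false;        /* writer active: do not even try */

    *out = mm->vma;          /* may observe garbage; validated below */

    if (atomic_load(&mm->seq) != *seq)
        return false;        /* a writer ran: discard the copy */

    return addr >= out->start && addr < out->end;
}
```

The point of the ordering is that the copy is never interpreted before
the second counter check has passed.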

The next step is to walk down the existing page table tree to find the
current pte entry. This is done with interrupts disabled to avoid
races with munmap(). Again, not an entirely new idea, as this repeats
a pattern already present in fast GUP. Similar precautions are also
taken when taking the page table lock.

Breaking COW on an existing mapping may require firing MMU notifiers.
Some care is required to avoid racing with registering new notifiers.
This patchset adds a new per-cpu rwsem to handle this situation.
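
The intended usage pattern can be approximated in user space with a
POSIX rwlock. This is only a sketch: a pthread rwlock stands in for the
per-cpu rwsem, and the names (struct notifier_sketch, notifier_*) are
hypothetical rather than the functions added by the series.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct notifier_sketch {
    pthread_rwlock_t lock;   /* stands in for the per-cpu rwsem */
    int nr_registered;       /* number of registered notifiers */
};

/* Registration path: may block, and excludes all speculative readers. */
static void notifier_register(struct notifier_sketch *n)
{
    int err = pthread_rwlock_wrlock(&n->lock);

    assert(err == 0);
    (void)err;
    n->nr_registered++;
    pthread_rwlock_unlock(&n->lock);
}

/*
 * Speculative fault path: never blocks. A false return means a
 * registration is in progress and the speculative fault should abort.
 */
static bool notifier_trylock(struct notifier_sketch *n)
{
    return pthread_rwlock_tryrdlock(&n->lock) == 0;
}

/* Drop the read side taken by a successful notifier_trylock(). */
static void notifier_unlock(struct notifier_sketch *n)
{
    pthread_rwlock_unlock(&n->lock);
}
```

While the read side is held, the set of registered notifiers cannot
change, so the speculative COW path can fire them safely.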


Commits 1 to 5 are preparatory cleanups.

Commits 6 and 7 introduce CONFIG_SPECULATIVE_PAGE_FAULT and let us
enable it on x86 so we can test the new code as it gets introduced.

Commits 8 and 9 extend handle_mm_fault() so it can be used for
speculative faults; initially these always abort with VM_FAULT_RETRY.

Commits 10 to 27 progressively implement the speculative handling of
page faults. Importantly, they are structured to be bisectable:
the new code gets enabled every few commits.
- Commit 10 adds the mmap sequence count that will be used for detecting
  when writers have been running concurrently with an spf attempt
  (in which case the attempt will be aborted);
- Commit 11 adds RCU safe vma freeing;
- Commit 12 adds a version of find_vma that doesn't check for mmap locking;
- Commit 13 does a lockless VMA lookup and starts the spf handling attempt;
- Commit 14 introduces an API for preventing page table reclamation
  (using RCU or disabling interrupts depending on build config options);
- (Commit 15 is a small refactor preparing for the next commit);
- Commit 16 walks down the existing page tables, carefully avoiding
  races with potential writers (munmap in particular);
- Commit 17 introduces pte_map_lock() and pte_spinlock(), which attempt
  to (optionally map and) lock an existing page table when it's time to
  commit page fault results to it;
- Commits 18 to 21 implement SPF for the simplest cases
  (do_anonymous_page and do_numa_page). This mostly comes down to
  using the pte_map_lock() and pte_spinlock() APIs where needed,
  and making sure to abort speculation in unsupported cases
  (mostly anon_vma allocation and userfaultfd);
- Commits 22 to 25 add a new mmu_notifier_lock;
- Commits 26 and 27 implement some additional SPF cases, using the new
  mmu_notifier_lock for the COW cases.
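
To illustrate the commit step that the list above builds towards, here
is a rough user-space sketch of "lock the page table, re-check the
sequence count, then install the entry". It is not the series'
pte_map_lock(): a pthread mutex stands in for the page table lock, and
struct pt_sketch, spf_lock_and_validate() and spf_commit_pte() are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pt_sketch {
    pthread_mutex_t lock;   /* stands in for the page table lock */
    unsigned long pte;      /* a single entry; 0 means "not present" */
};

/*
 * Take the lock, then re-check the count sampled at the start of the
 * fault. On success the lock is held. On failure nothing is held and
 * the caller should retry in the non-speculative way.
 */
static bool spf_lock_and_validate(struct pt_sketch *pt, atomic_ulong *mm_seq,
                                  unsigned long seq)
{
    pthread_mutex_lock(&pt->lock);
    if (atomic_load(mm_seq) != seq) {
        pthread_mutex_unlock(&pt->lock);
        return false;
    }
    return true;
}

/* Returns true only if this call installed new_pte. */
static bool spf_commit_pte(struct pt_sketch *pt, atomic_ulong *mm_seq,
                           unsigned long seq, unsigned long new_pte)
{
    bool installed;

    assert(new_pte != 0);
    if (!spf_lock_and_validate(pt, mm_seq, seq))
        return false;        /* an mmap writer ran: abort speculation */

    installed = (pt->pte == 0);
    if (installed)
        pt->pte = new_pte;   /* result becomes visible to other threads */
    pthread_mutex_unlock(&pt->lock);
    return installed;
}
```

Nothing is published before the check passes, so an aborted
speculative attempt leaves no visible side effects in the table.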

Commits 28 and 29 disable speculative handling for single threaded
userspace. This is for (minor) performance tuning and is pushed
towards the end of the series to make it easier to exercise the spf
paths as they are introduced.

Commits 30 and 31 add some extra statistics.

Commits 32 to 35 add spf support on the arm64 and powerpc architectures.


Michel Lespinasse (34):
  mm: export dump_mm
  mmap locking API: mmap_lock_is_contended returns a bool
  mmap locking API: name the return values
  do_anonymous_page: use update_mmu_tlb()
  do_anonymous_page: reduce code duplication
  mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
  x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  mm: add FAULT_FLAG_SPECULATIVE flag
  mm: add do_handle_mm_fault()
  mm: add per-mm mmap sequence counter for speculative page fault handling.
  mm: rcu safe vma freeing
  mm: separate mmap locked assertion from find_vma
  x86/mm: attempt speculative mm faults first
  mm: add speculative_page_walk_begin() and speculative_page_walk_end()
  mm: refactor __handle_mm_fault() / handle_pte_fault()
  mm: implement speculative handling in __handle_mm_fault().
  mm: add pte_map_lock() and pte_spinlock()
  mm: implement speculative handling in do_anonymous_page()
  mm: enable speculative fault handling through do_anonymous_page()
  mm: implement speculative handling in do_numa_page()
  mm: enable speculative fault handling in do_numa_page()
  mm: add mmu_notifier_lock
  mm: write lock mmu_notifier_lock when registering mmu notifiers
  mm: add mmu_notifier_trylock() and mmu_notifier_unlock()
  mm: implement speculative handling in wp_page_copy()
  mm: implement and enable speculative fault handling in handle_pte_fault()
  mm: disable speculative faults for single threaded user space
  mm: disable rcu safe vma freeing for single threaded user space
  mm: create new include/linux/vm_event.h header file
  mm: anon spf statistics
  arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  arm64/mm: attempt speculative mm faults first
  powerpc/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  powerpc/mm: attempt speculative mm faults first

Suren Baghdasaryan (1):
  percpu-rwsem: enable percpu_sem destruction in atomic context

 arch/arm64/Kconfig                    |   1 +
 arch/arm64/mm/fault.c                 |  62 ++++
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/mm/fault.c               |  64 ++++
 arch/x86/Kconfig                      |   1 +
 arch/x86/mm/fault.c                   |  63 ++++
 drivers/gpu/drm/i915/i915_gpu_error.c |   4 +-
 include/linux/mm.h                    |  68 +++-
 include/linux/mm_types.h              |  33 +-
 include/linux/mmap_lock.h             | 109 ++++--
 include/linux/mmu_notifier.h          |  52 ++-
 include/linux/percpu-rwsem.h          |  13 +-
 include/linux/vm_event.h              | 111 ++++++
 include/linux/vm_event_item.h         |  25 ++
 include/linux/vmstat.h                |  95 +-----
 kernel/fork.c                         |  18 +-
 kernel/locking/percpu-rwsem.c         |  32 ++
 mm/Kconfig                            |  22 ++
 mm/Kconfig.debug                      |   7 +
 mm/debug.c                            |   1 +
 mm/memory.c                           | 474 +++++++++++++++++++-------
 mm/mmap.c                             |  13 +-
 mm/vmstat.c                           |  25 ++
 23 files changed, 1040 insertions(+), 254 deletions(-)
 create mode 100644 include/linux/vm_event.h

David Hildenbrand Jan. 31, 2022, 9:56 a.m. UTC | #1

On 28.01.22 14:09, Michel Lespinasse wrote:

Hi Michel,

> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
> 
> The patch series applies on top of linux v5.17-rc1;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> 
> I would like these patches to be considered for inclusion into v5.18.

Just a general note: we certainly need (much more) review. And I think
we'll have to decide whether the maintenance effort + complexity
will be worth the benefit.

> Several android vendors are using Laurent Dufour's previous SPF work into
> their kernel tree in order to improve application startup performance,
> want to converge to an upstream accepted solution, and have reported good
> numbers with previous versions of this patchset. Also, there is a broader
> interest into reducing mmap lock dependencies in critical MM paths,
> and I think this patchset would be a good first step in that direction.
> 
> 
> This patchset follows the same overall structure as the v1 proposal,
> with the following differences:
> - Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
> - The mmu notifier lock is new; this fixes a race in v1 patchset
>   between speculative COW faults and registering new MMU notifiers.
> - Speculative handling of swap-cache pages has been removed.
> - Commit 30 is new; this fixes build issues that showed in some configs.
> 
> 
> In principle it would also be possible to extend this work for handling
> file mapped vmas; I have pending work on such patches too but they are
> not mature enough to be submitted for inclusion at this point.
> 

I'd have expected a performance evaluation at this point, to highlight
the possible benefit and possibly also the downsides, if any.

> 
> Patchset summary:
> 
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
> 
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
> 
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).
> 
> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap(). Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
> 
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with registering new notifiers.
> This patchset adds a new per-cpu rwsem to handle this situation.

I have to admit that this sounds complicated and possibly dangerous to me.


Here is one of my concerns, I hope you can clarify:

GUP-fast only ever walks page tables and doesn't actually modify any
page table state; in particular, it does not take page table locks, which
might not reside in the memmap directly but in auxiliary data. It works
because we only ever drop the last reference to a page table (to free it)
after we synchronized against GUP-fast either via an IPI or
synchronize_rcu(), as GUP-fast disables interrupts.


I'd assume that taking page table locks on page tables that might no
longer be spanned by a VMA because of concurrent page table
deconstruction is dangerous:


On munmap(), we do the VMA update under mmap_lock in write mode, to then
remove the page tables under mmap_lock in read mode.

Let's take a look at free_pte_range() on x86:

free_pte_range()
-> pte_free_tlb()
 -> tlb_flush_pmd_range()
  -> __tlb_adjust_range()
   /* Doesn't actually flush but only updates the tlb range */
 -> __pte_free_tlb()
  -> ___pte_free_tlb()
   -> pgtable_pte_page_dtor()
    -> ptlock_free()
    /* page table lock was freed */
   -> paravirt_tlb_remove_table()
    -> tlb_remove_page()
     -> tlb_remove_page_size()
      -> __tlb_remove_page_size()
       /* Page added to TLB batch flushing+freeing */

The later tlb_flush_mmu() via tlb_flush_mmu_free()->tlb_table_flush()
will then free the page tables, after synchronizing against GUP-fast. But
at that point we already deconstructed the page tables.

So just reading your summary here: what in your approach prevents taking
a page table lock from racing with page table lock freeing? I cannot
see how a seqcount would help.


IIUC, with what you propose we cannot easily have auxiliary data for a
page table, at least not via current pgtable_pte_page_dtor(), including
page locks, which is a drawback (and currently possibly a BUG in your
code?) at least for me. But I only read the cover letter, so I might be
missing something important :)

Suren Baghdasaryan Jan. 31, 2022, 5 p.m. UTC | #2

On Mon, Jan 31, 2022 at 1:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 28.01.22 14:09, Michel Lespinasse wrote:
>
> Hi Michel,
>
> > This patchset is my take on speculative page faults (spf).
> > It builds on ideas that have been previously proposed by Laurent Dufour,
> > Peter Zijlstra and others before. While Laurent's previous proposal
> > was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> > this now based on what I think is a simpler and more bisectable approach,
> > much improved scaling numbers in the anonymous vma case, and the Android
> > use case that has since emerged. I will expand on these points towards
> > the end of this message.
> >
> > The patch series applies on top of linux v5.17-rc1;
> > a git tree is also available:
> > git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> >
> > I would like these patches to be considered for inclusion into v5.18.
>
> Just a general note: we certainly need (much more) review. And I think
> we'll have to make a decision if the maintenance effort +  complexity
> will be worth the benefit.
>
> > Several android vendors are using Laurent Dufour's previous SPF work into
> > their kernel tree in order to improve application startup performance,
> > want to converge to an upstream accepted solution, and have reported good
> > numbers with previous versions of this patchset. Also, there is a broader
> > interest into reducing mmap lock dependencies in critical MM paths,
> > and I think this patchset would be a good first step in that direction.
> >
> >
> > This patchset follows the same overall structure as the v1 proposal,
> > with the following differences:
> > - Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
> > - The mmu notifier lock is new; this fixes a race in v1 patchset
> >   between speculative COW faults and registering new MMU notifiers.
> > - Speculative handling of swap-cache pages has been removed.
> > - Commit 30 is new; this fixes build issues that showed in some configs.
> >
> >
> > In principle it would also be possible to extend this work for handling
> > file mapped vmas; I have pending work on such patches too but they are
> > not mature enough to be submitted for inclusion at this point.
> >
>
> I'd have expected a performance evaluation at this point, to highlight
> the possible benefit and eventually also downsides, if any.

Hi David,
In Android we and several Android vendors reported application start
time improvements (a critical metric in Android world) on the previous
SPF posting.
My test results were included in the cover letter:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#m23c5cb33b1a04979c792db6ddd7e3245e5f86bcb
Android vendors reported their results on the same thread:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#m8eb304b67c9a33388e2fe4448a04a74879120b34
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#maaa58f7072732e5a2a77fe9f65dd3e444c2aed04
And Axel ran pft (pagefault test) benchmarks on server class machines
with results reported here:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#mc3965e87a702c67909a078a67f8f7964d707b2e0
The Android performance team recently reported a case where a
low-end device was having visible performance issues, and after
applying SPF the device became usable. I'm CC'ing Tim Murray from that
team to provide more information if possible.
As a side note, an older version of SPF has been used for several
years on Android, and many vendors specifically requested us to include
it in our kernels. It is currently maintained in the Android Common Kernel
as an out-of-tree patchset, and getting it upstream would be huge for
us in terms of getting more testing in a wider ecosystem and
reducing maintenance effort.
Thanks,
Suren.





Andrew Morton Feb. 1, 2022, 1:14 a.m. UTC | #3

On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@lespinasse.org> wrote:

> Patchset summary:
> 
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
> 
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation. If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
> 
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).

I'm surprised that descending the rbtree locklessly doesn't flat-out
oops the kernel.  How are we assured that every pointer which is
encountered actually points at the right thing?  Against things
which tear that tree down?

> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap().

Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
point of view?

> Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.
> 
> Breaking COW on an existing mapping may require firing MMU notifiers.
> Some care is required to avoid racing with registering new notifiers.
> This patchset adds a new per-cpu rwsem to handle this situation.

Matthew Wilcox Feb. 1, 2022, 2:20 a.m. UTC | #4

On Mon, Jan 31, 2022 at 05:14:34PM -0800, Andrew Morton wrote:
> On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@lespinasse.org> wrote:
> > The first step of a speculative page fault is to look up the vma and
> > read its contents (currently by making a copy of the vma, though in
> > principle it would be sufficient to only read the vma attributes that
> > are used in page faults). The mmap sequence count is used to verify
> > that there were no mmap writers concurrent to the lookup and copy steps.
> > Note that walking rbtrees while there may potentially be concurrent
> > writers is not an entirely new idea in linux, as latched rbtrees
> > are already doing this. This is safe as long as the lookup is
> > followed by a sequence check to verify that concurrency did not
> > actually occur (and abort the speculative fault if it did).
> 
> I'm surprised that descending the rbtree locklessly doesn't flat-out
> oops the kernel.  How are we assured that every pointer which is
> encountered actually points at the right thing?  Against things
> which tear that tree down?

It doesn't necessarily point at the _right_ thing.  You may get
entirely the wrong node in the tree if you race with a modification,
but, as Michel says, you check the seqcount before you even look at
the VMA (and if the seqcount indicates a modification, you throw away
the result and fall back to the locked version).  The rbtree always
points to other rbtree nodes, so you aren't going to walk into some
completely wrong data structure.
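
The validate-after-lookup pattern described above can be sketched in user-space C11. This is only an illustration under stated assumptions: `struct space`, `space_add()` and `lookup_speculative()` are hypothetical names, not functions from the patchset, and a flat array stands in for the rbtree. The reader copies what it found, then rechecks the sequence count, and gives up rather than waiting if a writer was or is active.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_RANGES 8

/* Hypothetical stand-ins for illustration; not the kernel's types. */
enum lookup_result { LOOKUP_ABORT, LOOKUP_MISS, LOOKUP_HIT };

struct range_snapshot { unsigned long start, end; };   /* [start, end) */

struct space {
    atomic_ulong seq;                /* odd while a writer is active */
    atomic_size_t nr;
    atomic_ulong start[MAX_RANGES];
    atomic_ulong end[MAX_RANGES];
};

/* Writers are assumed to be serialized by an outer lock (as mmap_lock does). */
static void space_write_begin(struct space *s)
{
    unsigned long seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void space_write_end(struct space *s)
{
    unsigned long seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, seq + 1, memory_order_release);
}

static int space_add(struct space *s, unsigned long start, unsigned long end)
{
    size_t nr = atomic_load_explicit(&s->nr, memory_order_relaxed);
    if (nr >= MAX_RANGES)
        return -1;
    space_write_begin(s);
    atomic_store_explicit(&s->start[nr], start, memory_order_relaxed);
    atomic_store_explicit(&s->end[nr], end, memory_order_relaxed);
    atomic_store_explicit(&s->nr, nr + 1, memory_order_relaxed);
    space_write_end(s);
    return 0;
}

/* Lockless lookup: the copied result is only trusted if seq did not move. */
static enum lookup_result lookup_speculative(struct space *s, unsigned long addr,
                                             struct range_snapshot *out)
{
    unsigned long seq = atomic_load_explicit(&s->seq, memory_order_acquire);
    enum lookup_result res = LOOKUP_MISS;

    if (seq & 1)
        return LOOKUP_ABORT;         /* writer active: do not wait for even */

    size_t nr = atomic_load_explicit(&s->nr, memory_order_relaxed);
    for (size_t i = 0; i < nr && i < MAX_RANGES; i++) {
        unsigned long start = atomic_load_explicit(&s->start[i], memory_order_relaxed);
        unsigned long end = atomic_load_explicit(&s->end[i], memory_order_relaxed);
        if (addr >= start && addr < end) {
            out->start = start;      /* copy first, validate afterwards */
            out->end = end;
            res = LOOKUP_HIT;
            break;
        }
    }

    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&s->seq, memory_order_relaxed) != seq)
        return LOOKUP_ABORT;         /* raced with a writer: discard the copy */
    return res;
}
```

A caller that gets `LOOKUP_ABORT` would fall back to the locked path, which is the role the non-speculative fault plays in the series.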

> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap().
> 
> Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> point of view?

I am not a fan of this approach.  For other reasons, I think we want to
switch to RCU-freed page tables, and then we can walk the page tables
with the RCU lock held.  Some architectures already RCU-free the page
tables, so I think it's just a matter of converting the rest.

Sebastian Andrzej Siewior Feb. 1, 2022, 5:17 p.m. UTC | #5

On 2022-01-31 17:14:34 [-0800], Andrew Morton wrote:
> On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@lespinasse.org> wrote:
> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap().
> 
> Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> point of view?

I applied the series on top of RT and gave it a shot. Nothing out of the
ordinary happened so that is good.

From browsing through the code:
- speculative_page_walk_begin() seems to disable interrupts.
  There is a spin_trylock() invocation in that area. That is okay since
  it is never invoked from in_irq(). But there should not be any regular
  spin_lock() in such a section.

- We do have a seqcount API. So instead of mmap_seq_read_start() one
  could use raw_read_seqcount(). The lockdep bits would also check if
  the associated lock (in this case mmap_lock) is held in the write
  path.

- The read side (mmap_seq_read_start()) does not attempt to stabilize
  the counter (waiting for even) which is good. Otherwise special care
  would be needed ;)

Sebastian

Michel Lespinasse Feb. 7, 2022, 5:39 p.m. UTC | #6

On Tue, Feb 01, 2022 at 02:20:39AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 31, 2022 at 05:14:34PM -0800, Andrew Morton wrote:
> > On Fri, 28 Jan 2022 05:09:31 -0800 Michel Lespinasse <michel@lespinasse.org> wrote:
> > > The next step is to walk down the existing page table tree to find the
> > > current pte entry. This is done with interrupts disabled to avoid
> > > races with munmap().
> > 
> > Sebastian, could you please comment on this from the CONFIG_PREEMPT_RT
> > point of view?
> 
> I am not a fan of this approach.  For other reasons, I think we want to
> switch to RCU-freed page tables, and then we can walk the page tables
> with the RCU lock held.  Some architectures already RCU-free the page
> tables, so I think it's just a matter of converting the rest.

Note - I have no problem with switching to RCU-freed page tables
everywhere when and if we end up needing to. I just don't see that
this need comes from the SPF patchset, so I don't think this should
be introduced as an artificial dependency.

--
Michel "walken" Lespinasse

Mel Gorman Feb. 23, 2022, 4:11 p.m. UTC | #7

On Fri, Jan 28, 2022 at 05:09:31AM -0800, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
> 
> The patch series applies on top of linux v5.17-rc1;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> 
> I would like these patches to be considered for inclusion into v5.18.
> Several android vendors are using Laurent Dufour's previous SPF work into
> their kernel tree in order to improve application startup performance,
> want to converge to an upstream accepted solution, and have reported good
> numbers with previous versions of this patchset. Also, there is a broader
> interest into reducing mmap lock dependencies in critical MM paths,
> and I think this patchset would be a good first step in that direction.
> 

I think there is a serious lack of performance data here. The only
performance point offered is the Android Application Startup case.
Unfortunately, that benefit may be specific to the Zygote process that
preloads classes that may be required and listens for new applications to
start. I suspect the benefit wouldn't apply to most Linux distributions
and even JVM-based workloads are not primarily constrained by the startup
cost. Improving application start up costs is not great justification
for this level of code complexity even though I recognise why it is a
key performance indicator for Android given that startup times affect
the user experience.

Laurent's original work was partially motivated by the performance of
a proprietary application. While I cannot replicate a full production
workload as that can only be done by the company, I could do a basic
evaluation commonly conducted on standalone systems. It was extremely
fault intensive with SPF success rates greater than 96% but almost no
change in actual performance. It's perfectly possible that the application
has changed since SPF was first proposed. The developers did spend a fair
amount of effort at making the application NUMA-aware and reusing memory
more aggressively to avoid faults. It's still very fault intensive but,
judging from the data, does not appear to suffer from parallel memory
operations.

On my own tests, the only preliminary test that was a clear winner
was will-it-scale using threads for the page-fault workloads and
page-fault-test for threads. To be fair, the increases there are dramatic
with a high success rate of speculative faults.

pft timings
                                 5.17.0-rc3             5.17.0-rc3
                                    vanilla        mm-spfault-v2r1
Amean     elapsed-1        32.66 (   0.00%)       32.77 *  -0.36%*
Amean     elapsed-4         9.17 (   0.00%)        8.89 *   3.07%*
Amean     elapsed-7         5.53 (   0.00%)        5.26 *   4.95%*
Amean     elapsed-12        4.13 (   0.00%)        3.50 *  15.16%*
Amean     elapsed-21        3.93 (   0.00%)        2.79 *  29.03%*
Amean     elapsed-30        4.02 (   0.00%)        2.94 *  26.79%*
Amean     elapsed-48        4.37 (   0.00%)        2.83 *  35.24%*
Amean     elapsed-79        4.13 (   0.00%)        2.17 *  47.36%*
Amean     elapsed-80        4.12 (   0.00%)        2.13 *  48.22%*

Ops SPFault Attempt                        0.00  4734439786.00
Ops SPFault Abort                          0.00     9360014.00
Ops SPFault Success                        0.00          99.80

This is the ideal case for SPF but not very realistic. Interestingly,
ebizzy barely benefitted even though it's threaded because it's not
guaranteed to be address space modification intensive.

Hackbench took a performance hit between 0-5% depending on the exact
configuration and machine used. It is threaded and had high SPF abort rates
(up to 50%). It's not a great example but it shows at least one example
where SPF hurts more than it helps and there may be other applications
that are harmed by having to retry faults.

The scope of SPF is narrow relative to the much older discussion of
breaking up mmap_sem. The only time SPF benefits is when faults are racing
against parallel memory address updates holding mmap_sem for write.
That requires a threaded application that is both intense in terms of
address space updates and fault intensive. That is much narrower than
threaded applications that are address space update intensive (e.g.
using mprotect to avoid accidentally leaking data, mapping data files
for IO etc). Have we examples of realistic applications that meet all the
criteria of "threaded", "address-space intensive" and "fault intensive"
that are common enough to justify the complexity?

Admittedly, I initially just threw this series at a collection of
workloads that simply stress the allocator because it stresses faults as
a side-effect but most of them did not match the criteria for "threaded
application that is both address space update intensive and fault
intensive". I'm struggling to think of good examples although redis
is a possibility. HPC workloads like NPB parallelised with OpenMP are a
possibility but I looked at some old results and while they do trap faults,
the vast majority are related to NUMA balancing.  The other ones I normally
consider for scaling purposes are process orientated and not threads.

On the patches themselves, I'm not sure the optimisation for ignoring SPF
is guaranteed to work as mm_users could be temporarily elevated although
probably not enough to matter. I also think patch 5 stands on its own and
could be sent separately. For the others, I didn't read them in sufficient
depth but noted that the level of similar logic between speculative
and non-speculative paths could be a maintenance headache to keep the
speculative and !speculative rules in sync. I didn't see obvious problems
as such but I still think the complexity is high for a corner case.

Suren Baghdasaryan March 8, 2022, 5:37 a.m. UTC | #8

On Wed, Feb 23, 2022 at 8:11 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jan 28, 2022 at 05:09:31AM -0800, Michel Lespinasse wrote:
> > This patchset is my take on speculative page faults (spf).
> > It builds on ideas that have been previously proposed by Laurent Dufour,
> > Peter Zijlstra and others before. While Laurent's previous proposal
> > was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> > this now based on what I think is a simpler and more bisectable approach,
> > much improved scaling numbers in the anonymous vma case, and the Android
> > use case that has since emerged. I will expand on these points towards
> > the end of this message.
> >
> > The patch series applies on top of linux v5.17-rc1;
> > a git tree is also available:
> > git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> >
> > I would like these patches to be considered for inclusion into v5.18.
> > Several android vendors are using Laurent Dufour's previous SPF work into
> > their kernel tree in order to improve application startup performance,
> > want to converge to an upstream accepted solution, and have reported good
> > numbers with previous versions of this patchset. Also, there is a broader
> > interest into reducing mmap lock dependencies in critical MM paths,
> > and I think this patchset would be a good first step in that direction.
> >
>
> I think there is a serious lack of performance data here. The only
> performance point offered is the Android Application Startup case.
> Unfortunately, that benefit may be specific to the Zygote process that
> preloads classes that may be required and listens for new applications to
> start. I suspect the benefit wouldn't apply to most Linux distributions
> and even JVM-based workloads are not primarily constrained by the startup
> cost. Improving application start up costs is not great justification
> for this level of code complexity even though I recognise why it is a
> key performance indicator for Android given that startup times affect
> the user experience.
>
> Laurent's original work was partially motivated by the performance of
> a proprietary application. While I cannot replicate a full production
> workload as that can only be done by the company, I could do a basic
> evaluation commonly conducted on standalone systems. It was extremely
> fault intensive with SPF success rates greater than 96% but almost no
> change in actual performance. It's perfectly possible that the application
> has changed since SPF was first proposed. The developers did spend a fair
> amount of effort at making the application NUMA-aware and reusing memory
> more aggressively to avoid faults. It's still very fault intensive but,
> judging from the data, does not appear to suffer from parallel memory
> operations.
>
> On my own tests, the only preliminary test that was a clear winner
> was will-it-scale using threads for the page-fault workloads and
> page-fault-test for threads. To be fair, the increases there are dramatic
> with a high success rate of speculative faults.
>
> pft timings
>                                  5.17.0-rc3             5.17.0-rc3
>                                     vanilla        mm-spfault-v2r1
> Amean     elapsed-1        32.66 (   0.00%)       32.77 *  -0.36%*
> Amean     elapsed-4         9.17 (   0.00%)        8.89 *   3.07%*
> Amean     elapsed-7         5.53 (   0.00%)        5.26 *   4.95%*
> Amean     elapsed-12        4.13 (   0.00%)        3.50 *  15.16%*
> Amean     elapsed-21        3.93 (   0.00%)        2.79 *  29.03%*
> Amean     elapsed-30        4.02 (   0.00%)        2.94 *  26.79%*
> Amean     elapsed-48        4.37 (   0.00%)        2.83 *  35.24%*
> Amean     elapsed-79        4.13 (   0.00%)        2.17 *  47.36%*
> Amean     elapsed-80        4.12 (   0.00%)        2.13 *  48.22%*
>
> Ops SPFault Attempt                        0.00  4734439786.00
> Ops SPFault Abort                          0.00     9360014.00
> Ops SPFault Success                        0.00          99.80
>
> This is the ideal case for SPF but not very realistic. Interestingly,
> ebizzy barely benefitted even though it's threaded because it's not
> guaranteed to be address space modification intensive.
>
> Hackbench took a performance hit between 0-5% depending on the exact
> configuration and machine used. It is threaded and had high SPF abort rates
> (up to 50%). It's not a great example but it shows at least one example
> where SPF hurts more than it helps and there may be other applications
> that are harmed by having to retry faults.
>
> The scope of SPF is narrow relative to the much older discussion of
> breaking up mmap_sem. The only time SPF benefits is when faults are racing
> against parallel memory address updates holding mmap_sem for write.
> That requires a threaded application that is both intense in terms of
> address space updates and fault intensive. That is much narrower than
> threaded applications that are address space update intensive (e.g.
> using mprotect to avoid accidentally leaking data, mapping data files
> for IO etc). Have we examples of realistic applications that meet all the
> criteria of "threaded", "address-space intensive" and "fault intensive"
> that are common enough to justify the complexity?
>
> Admittedly, I initially just threw this series at a collection of
> workloads that simply stress the allocator because it stresses faults as
> a side-effect but most of them did not match the criteria for "threaded
> application that is both address space update intensive and fault
> intensive". I'm struggling to think of good examples although redis
> is a possibility. HPC workloads like NPB parallelised with OpenMP are a
> possibility but I looked at some old results and while they do trap faults,
> the vast majority are related to NUMA balancing.  The other ones I normally
> consider for scaling purposes are process orientated and not threads.
>
> On the patches themselves, I'm not sure the optimisation for ignoring SPF
> is guaranteed to work as mm_users could be temporarily elevated although
> probably not enough to matter. I also think patch 5 stands on its own and
> could be sent separately. For the others, I didn't read them in sufficient
> depth but noted that the level of similar logic between speculative
> and non-speculative paths could be a maintenance headache to keep the
> speculative and !speculative rules in sync. I didn't see obvious problems
> as such but I still think the complexity is high for a corner case.

Hi Mel,
Thank you for taking the time to analyze SPF effects on different
workloads. Your feedback drove me to look into the reasons Android
benefits from this patchset. What we know is that the apps which benefit
the most are the ones with a high number of threads (~100), and when I
strace'd one of these apps I could see that each thread mmaps several
areas upon startup (stack and thread-local storage (TLS), thread
signal stack, indirect ref table).
So, I created a simple test that spawns a given number of threads,
each thread mmapping and faulting-in a given number of vmas with a
given number of pages in each one. Each thread records the time it
takes to mmap the vmas and fault-in the pages and the test reports the
total and the average times measured. You can find my test program
here: https://github.com/surenbaghdasaryan/spf_test/blob/main/spf_test.c
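
For readers who want the shape of such a test without following the link, a rough sketch is given here. It is not the linked spf_test.c; the names `struct worker`, `worker_fn()` and `spf_sketch_run()` are hypothetical, and details such as how the real test times threads or prevents the kernel from merging adjacent anonymous mappings are assumptions left out of this version.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical per-thread state; not taken from spf_test.c. */
struct worker {
    pthread_t tid;
    int nr_vmas;
    int nr_pages;
    long page_size;
    int64_t elapsed_ns;              /* -1 if an mmap() call failed */
};

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    size_t len = (size_t)w->nr_pages * (size_t)w->page_size;
    int64_t start = now_ns();

    for (int v = 0; v < w->nr_vmas; v++) {
        /* Each mmap() modifies the address space shared by all threads. */
        volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            w->elapsed_ns = -1;
            return NULL;
        }
        /* The first write to each page triggers an anonymous page fault. */
        for (int i = 0; i < w->nr_pages; i++)
            p[(size_t)i * (size_t)w->page_size] = 1;
        /* Mappings are deliberately left in place; the kernel may merge
         * adjacent ones, so the real vma count can be lower than nr_vmas. */
    }
    w->elapsed_ns = now_ns() - start;
    return NULL;
}

/* Returns the average per-thread time in ns, or -1.0 on failure. */
static double spf_sketch_run(int nr_threads, int nr_vmas, int nr_pages)
{
    struct worker *w = calloc((size_t)nr_threads, sizeof(*w));
    double total = 0.0;
    int started = 0, failed = 0;

    if (!w)
        return -1.0;
    for (int i = 0; i < nr_threads; i++) {
        w[i].nr_vmas = nr_vmas;
        w[i].nr_pages = nr_pages;
        w[i].page_size = sysconf(_SC_PAGESIZE);
        if (pthread_create(&w[i].tid, NULL, worker_fn, &w[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(w[i].tid, NULL);
        if (w[i].elapsed_ns < 0)
            failed = 1;
        else
            total += (double)w[i].elapsed_ns;
    }
    free(w);
    return (failed || started == 0) ? -1.0 : total / started;
}
```

Calling `spf_sketch_run(100, 2, 10)` would correspond in spirit to the `spf_test 100 2 10` command line; build with `-pthread`.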

I ran a number of tests on my Pixel 6 and SPF shows quite positive
results even with a small number of vmas and pages. A couple of examples:

100 threads, 2 vmas, 10 pages (cmdline: spf_test 100 2 10)
Baseline avg time: 1,889,398.01ns
SPF avg time: 327,299.36ns
Improvement: 83%

100 threads, 10 vmas, 2 pages (cmdline: spf_test 100 10 2)
Baseline avg time: 1,234,861.48ns
SPF avg time: 800,392.82ns
Improvement: 35%

100 threads, 10 vmas, 10 pages (cmdline: spf_test 100 10 10)
Baseline avg time: 12,199,939.04ns
SPF avg time: 3,223,206.41ns
Improvement: 74%

100 threads, 30 vmas, 30 pages (cmdline: spf_test 100 30 30)
Baseline avg time: 255,827,268.16ns
SPF avg time: 41,538,348.47ns
Improvement: 84%
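
The "Improvement" figures are consistent with the relative reduction in average time, (baseline - SPF) / baseline, rounded to a whole percent. That formula is inferred from the numbers rather than stated in the message, and `improvement_pct()` is a hypothetical helper used only to check the arithmetic:

```c
/* Relative reduction in average time, in percent (inferred formula). */
static double improvement_pct(double baseline_ns, double spf_ns)
{
    return 100.0 * (baseline_ns - spf_ns) / baseline_ns;
}
```

For the first example, `improvement_pct(1889398.01, 327299.36)` is a little under 83, matching the reported 83% after rounding.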

To minimize noise, each configuration was run several hundred times
with the same parameters and the results were averaged across
runs.
I think this test represents an example of what you were describing as
a "threaded application that is both address space update intensive
and fault intensive" because mmaps modify the address space with
page-faults happening in parallel. We can call it an artificial
workload but it does not strike me as something very unusual. I can
imagine other systems apart from Android which could spawn multiple
threads with each thread mapping some memory area to work with and
using that area immediately.
Thanks,
Suren.


>
> --
> Mel Gorman
> SUSE Labs
