[RFC,RESEND,00/28] per-VMA locks proposal

Message ID	20220901173516.702122-1-surenb@google.com (mailing list archive)
Date: Thu, 1 Sep 2022 10:34:48 -0700
Subject: [RFC PATCH RESEND 00/28] per-VMA locks proposal
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: michel@lespinasse.org, jglisse@google.com, mhocko@suse.com,
 vbabka@suse.cz, hannes@cmpxchg.org, mgorman@suse.de, dave@stgolabs.net,
 willy@infradead.org, liam.howlett@oracle.com, peterz@infradead.org,
 ldufour@linux.ibm.com, laurent.dufour@fr.ibm.com, paulmck@kernel.org,
 luto@kernel.org, songliubraving@fb.com, peterx@redhat.com,
 david@redhat.com, dhowells@redhat.com, hughd@google.com,
 bigeasy@linutronix.de, kent.overstreet@linux.dev, rientjes@google.com,
 axelrasmussen@google.com, joelaf@google.com, minchan@google.com,
 surenb@google.com, kernel-team@android.com, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org, linuxppc-dev@lists.ozlabs.org,
 x86@kernel.org, linux-kernel@vger.kernel.org

Suren Baghdasaryan Sept. 1, 2022, 5:34 p.m. UTC

Resending to fix the issue with the In-Reply-To tag in the original
submission at [4].

This is a proof of concept for the per-VMA locks idea that was discussed
during the speculative page fault (SPF) [1] discussion at LSF/MM this year
[2], which concluded with the
suggestion that “a reader/writer semaphore could be put into the VMA
itself; that would have the effect of using the VMA as a sort of range
lock. There would still be contention at the VMA level, but it would be an
improvement.” This patchset implements this suggested approach.

When handling page faults we look up the VMA that contains the faulting
page under RCU protection and try to acquire its lock. If that fails we
fall back to using mmap_lock, similar to how SPF handled this situation.
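
As a rough illustration of that control flow (not code from the patchset),
the sketch below models the fast path and the fallback in user-space C. All
type and function names here are hypothetical, a linear scan stands in for
the RCU-protected lookup, and a boolean stands in for the per-VMA lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which path ended up servicing the fault in this sketch. */
enum fault_path {
	FAULT_VIA_VMA_LOCK,	/* fast path: only the per-VMA lock was taken */
	FAULT_VIA_MMAP_LOCK,	/* fallback: mmap_lock would be taken for read */
	FAULT_NO_VMA,		/* no VMA covers the faulting address */
};

/* Hypothetical, heavily reduced stand-in for vm_area_struct. */
struct fault_vma {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	bool writer_active;	/* a writer currently holds this VMA's lock */
};

/* Stand-in for the RCU-protected lookup; a linear scan keeps it short. */
struct fault_vma *lookup_vma(struct fault_vma *vmas, size_t n,
			     unsigned long addr)
{
	assert(n == 0 || vmas != NULL);
	for (size_t i = 0; i < n; i++)
		if (addr >= vmas[i].start && addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/* Non-blocking attempt: fails instead of waiting for the writer. */
bool vma_read_trylock(const struct fault_vma *vma)
{
	return !vma->writer_active;
}

enum fault_path handle_fault(struct fault_vma *vmas, size_t n,
			     unsigned long addr)
{
	struct fault_vma *vma = lookup_vma(vmas, n, addr);

	/* Fast path: VMA found and its lock acquired without blocking. */
	if (vma && vma_read_trylock(vma))
		return FAULT_VIA_VMA_LOCK;

	/*
	 * Fallback: real code would take mmap_lock for read here and
	 * repeat the lookup under its protection.
	 */
	vma = lookup_vma(vmas, n, addr);
	return vma ? FAULT_VIA_MMAP_LOCK : FAULT_NO_VMA;
}
```

In this model a fault on a VMA that a writer is updating is not an error; it
simply takes the slower mmap_lock path.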

One notable way the implementation deviates from the proposal is the way
VMAs are marked as locked. During some mm updates, multiple VMAs need to
stay locked until the end of the update (e.g. vma_merge, split_vma, etc.).
Tracking all the locked VMAs, avoiding recursive locks and handling other
complications would make the code more complex. Therefore we
provide a way to "mark" VMAs as locked and then unmark all locked VMAs
at once. This is done using two sequence numbers - one in the
vm_area_struct and one in the mm_struct. A VMA is considered locked when
these sequence numbers are equal. To mark a VMA as locked we set the
sequence number in vm_area_struct to be equal to the sequence number
in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
This allows for an efficient way to track locked VMAs and to drop the
locks on all VMAs at the end of the update.
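
The two-sequence-number scheme described above can be sketched in user-space
C roughly as follows. This is an illustration, not code from the patchset:
the structs are reduced to the relevant fields, mm_lock_seq and write_locked
are assumed field names, and the sketch leaves out the blocking a real
implementation needs in order to wait for page fault handlers that currently
hold the VMA lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for mm_struct (field names are assumptions). */
struct mm_sketch {
	unsigned int mm_lock_seq;	/* bumped to unmark every VMA at once */
	bool write_locked;		/* models "mmap_lock held for write" */
};

/* Reduced stand-in for vm_area_struct. */
struct vma_sketch {
	struct mm_sketch *mm;
	unsigned int vm_lock_seq;	/* equals mm->mm_lock_seq while marked */
};

void vma_init(struct vma_sketch *vma, struct mm_sketch *mm)
{
	vma->mm = mm;
	/* Start one behind so that a fresh VMA is not considered locked. */
	vma->vm_lock_seq = mm->mm_lock_seq - 1;
}

/* A VMA is considered locked when the two sequence numbers are equal. */
bool vma_is_marked_locked(const struct vma_sketch *vma)
{
	return vma->vm_lock_seq == vma->mm->mm_lock_seq;
}

/* Mark: copy the mm's sequence number into the VMA. */
void vma_mark_locked(struct vma_sketch *vma)
{
	assert(vma->mm->write_locked);
	vma->vm_lock_seq = vma->mm->mm_lock_seq;
}

/* Unmark everything: one increment, no list of locked VMAs needed. */
void vma_mark_unlocked_all(struct mm_sketch *mm)
{
	assert(mm->write_locked);
	mm->mm_lock_seq++;
}
```

The point of the design is visible in vma_mark_unlocked_all(): its cost does
not depend on how many VMAs were marked during the update.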

The patchset implements per-VMA locking only for anonymous pages which
are not in swap. If the initial proposal is considered acceptable, then
support for swapped and file-backed page faults will be added.

Performance benchmarks show benefits similar to, although slightly smaller
than, those of the SPF patchset (~75% of SPF benefits). Still, with lower
complexity this approach might be more desirable.

The patchset applies cleanly over 6.0-rc3.
The tree for testing is posted at [3].

[1] https://lore.kernel.org/all/20220128131006.67712-1-michel@lespinasse.org/
[2] https://lwn.net/Articles/893906/
[3] https://github.com/surenbaghdasaryan/linux/tree/per_vma_lock_rfc
[4] https://lore.kernel.org/all/20220829212531.3184856-1-surenb@google.com/

Laurent Dufour (2):
  powerc/mm: try VMA lock-based page fault handling first
  powerpc/mm: define ARCH_SUPPORTS_PER_VMA_LOCK

Michel Lespinasse (1):
  mm: rcu safe VMA freeing

Suren Baghdasaryan (25):
  mm: introduce CONFIG_PER_VMA_LOCK
  mm: introduce __find_vma to be used without mmap_lock protection
  mm: move mmap_lock assert function definitions
  mm: add per-VMA lock and helper functions to control it
  mm: mark VMA as locked whenever vma->vm_flags are modified
  kernel/fork: mark VMAs as locked before copying pages during fork
  mm/khugepaged: mark VMA as locked while collapsing a hugepage
  mm/mempolicy: mark VMA as locked when changing protection policy
  mm/mmap: mark VMAs as locked in vma_adjust
  mm/mmap: mark VMAs as locked before merging or splitting them
  mm/mremap: mark VMA as locked while remapping it to a new address
    range
  mm: conditionally mark VMA as locked in free_pgtables and
    unmap_page_range
  mm: mark VMAs as locked before isolating them
  mm/mmap: mark adjacent VMAs as locked if they can grow into unmapped
    area
  kernel/fork: assert no VMA readers during its destruction
  mm/mmap: prevent pagefault handler from racing with mmu_notifier
    registration
  mm: add FAULT_FLAG_VMA_LOCK flag
  mm: disallow do_swap_page to handle page faults under VMA lock
  mm: introduce per-VMA lock statistics
  mm: introduce find_and_lock_anon_vma to be used from arch-specific
    code
  x86/mm: try VMA lock-based page fault handling first
  x86/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  arm64/mm: try VMA lock-based page fault handling first
  arm64/mm: define ARCH_SUPPORTS_PER_VMA_LOCK
  kernel/fork: throttle call_rcu() calls in vm_area_free

 arch/arm64/Kconfig                     |   1 +
 arch/arm64/mm/fault.c                  |  36 +++++++++
 arch/powerpc/mm/fault.c                |  41 ++++++++++
 arch/powerpc/platforms/powernv/Kconfig |   1 +
 arch/powerpc/platforms/pseries/Kconfig |   1 +
 arch/x86/Kconfig                       |   1 +
 arch/x86/mm/fault.c                    |  36 +++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c  |   4 +-
 fs/proc/task_mmu.c                     |   1 +
 fs/userfaultfd.c                       |   6 ++
 include/linux/mm.h                     | 104 ++++++++++++++++++++++++-
 include/linux/mm_types.h               |  33 ++++++--
 include/linux/mmap_lock.h              |  37 ++++++---
 include/linux/vm_event_item.h          |   6 ++
 include/linux/vmstat.h                 |   6 ++
 kernel/fork.c                          |  75 +++++++++++++++++-
 mm/Kconfig                             |  13 ++++
 mm/Kconfig.debug                       |   8 ++
 mm/init-mm.c                           |   6 ++
 mm/internal.h                          |   4 +-
 mm/khugepaged.c                        |   1 +
 mm/madvise.c                           |   1 +
 mm/memory.c                            |  82 ++++++++++++++++---
 mm/mempolicy.c                         |   6 +-
 mm/mlock.c                             |   2 +
 mm/mmap.c                              |  60 ++++++++++----
 mm/mprotect.c                          |   1 +
 mm/mremap.c                            |   1 +
 mm/nommu.c                             |   2 +
 mm/oom_kill.c                          |   3 +-
 mm/vmstat.c                            |   6 ++
 31 files changed, 531 insertions(+), 54 deletions(-)

Kent Overstreet Sept. 1, 2022, 8:58 p.m. UTC | #1

On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> Resending to fix the issue with the In-Reply-To tag in the original
> submission at [4].
> 
> This is a proof of concept for per-vma locks idea that was discussed
> during SPF [1] discussion at LSF/MM this year [2], which concluded with
> suggestion that “a reader/writer semaphore could be put into the VMA
> itself; that would have the effect of using the VMA as a sort of range
> lock. There would still be contention at the VMA level, but it would be an
> improvement.” This patchset implements this suggested approach.
> 
> When handling page faults we lookup the VMA that contains the faulting
> page under RCU protection and try to acquire its lock. If that fails we
> fall back to using mmap_lock, similar to how SPF handled this situation.
> 
> One notable way the implementation deviates from the proposal is the way
> VMAs are marked as locked. Because during some of mm updates multiple
> VMAs need to be locked until the end of the update (e.g. vma_merge,
> split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> and other complications would make the code more complex. Therefore we
> provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> all at once. This is done using two sequence numbers - one in the
> vm_area_struct and one in the mm_struct. VMA is considered locked when
> these sequence numbers are equal. To mark a VMA as locked we set the
> sequence number in vm_area_struct to be equal to the sequence number
> in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> This allows for an efficient way to track locked VMAs and to drop the
> locks on all VMAs at the end of the update.

I like it - the sequence numbers are a stroke of genius. For what it's doing
the patchset seems almost small.

Two complaints so far:
 - I don't like the vma_mark_locked() name. To me it says that the caller
   already took or is taking the lock and this function is just marking that
   we're holding the lock, but it's really taking a different type of lock. But
   this function can block, it really is taking a lock, so it should say that.
   
   This is AFAIK a new concept, not sure I'm going to have anything good either,
   but perhaps vma_lock_multiple()?

 - I don't like the #ifdef and the separate fallback path in the fault handlers.

   Can we make find_and_lock_anon_vma() do the right thing, and not fail unless
   e.g. there isn't a vma at that address? Just have it wait for vm_lock_seq to
   change and then retry if needed.

Suren Baghdasaryan Sept. 1, 2022, 11:26 p.m. UTC | #2

On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> > Resending to fix the issue with the In-Reply-To tag in the original
> > submission at [4].
> >
> > This is a proof of concept for per-vma locks idea that was discussed
> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> > suggestion that “a reader/writer semaphore could be put into the VMA
> > itself; that would have the effect of using the VMA as a sort of range
> > lock. There would still be contention at the VMA level, but it would be an
> > improvement.” This patchset implements this suggested approach.
> >
> > When handling page faults we lookup the VMA that contains the faulting
> > page under RCU protection and try to acquire its lock. If that fails we
> > fall back to using mmap_lock, similar to how SPF handled this situation.
> >
> > One notable way the implementation deviates from the proposal is the way
> > VMAs are marked as locked. Because during some of mm updates multiple
> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> > and other complications would make the code more complex. Therefore we
> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> > all at once. This is done using two sequence numbers - one in the
> > vm_area_struct and one in the mm_struct. VMA is considered locked when
> > these sequence numbers are equal. To mark a VMA as locked we set the
> > sequence number in vm_area_struct to be equal to the sequence number
> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> > This allows for an efficient way to track locked VMAs and to drop the
> > locks on all VMAs at the end of the update.
>
> I like it - the sequence numbers are a stroke of genius. For what it's doing
> the patchset seems almost small.

Thanks for reviewing it!

>
> Two complaints so far:
>  - I don't like the vma_mark_locked() name. To me it says that the caller
>    already took or is taking the lock and this function is just marking that
>    we're holding the lock, but it's really taking a different type of lock. But
>    this function can block, it really is taking a lock, so it should say that.
>
>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>    but perhaps vma_lock_multiple()?

I'm open to name suggestions but vma_lock_multiple() is a bit
confusing to me. Will wait for more suggestions.

>
>  - I don't like the #ifdef and the separate fallback path in the fault handlers.
>
>    Can we make find_and_lock_anon_vma() do the right thing, and not fail unless
>    e.g. there isn't a vma at that address? Just have it wait for vm_lock_seq to
>    change and then retry if needed.

I think it can be done but would come with additional complexity. I
was really trying to keep things as simple as possible after SPF got
shot down on the grounds of complexity. I hope to start simple and
improve only when necessary.

Peter Zijlstra Sept. 2, 2022, 7:42 a.m. UTC | #3

On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> This is a proof of concept for per-vma locks idea that was discussed
> during SPF [1] discussion at LSF/MM this year [2], which concluded with
> suggestion that “a reader/writer semaphore could be put into the VMA
> itself; that would have the effect of using the VMA as a sort of range
> lock. There would still be contention at the VMA level, but it would be an
> improvement.” This patchset implements this suggested approach.

The whole reason I started the SPF thing waay back when was because one
of the primary reporters at the time had very large VMAs and a per-vma
lock wouldn't actually help anything at all.

IIRC it was either scientific code initializing a huge matrix or a
database with a giant table; I'm sure the archives have better memory
than me.

Suren Baghdasaryan Sept. 2, 2022, 2:45 p.m. UTC | #4

On Fri, Sep 2, 2022 at 12:43 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> > This is a proof of concept for per-vma locks idea that was discussed
> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> > suggestion that “a reader/writer semaphore could be put into the VMA
> > itself; that would have the effect of using the VMA as a sort of range
> > lock. There would still be contention at the VMA level, but it would be an
> > improvement.” This patchset implements this suggested approach.
>
> The whole reason I started the SPF thing waay back when was because one
> of the primary reporters at the time had very large VMAs and a per-vma
> lock wouldn't actually help anything at all.
>
> IIRC it was either scientific code initializing a huge matrix or a
> database with a giant table; I'm sure the archives have better memory
> than me.

Regardless of the initial intent, SPF happens to be very useful for
cases when we have multiple threads establishing some mappings
concurrently with page faults (see details at [1]). Android vendors,
independently of each other, have been backporting your and Laurent's
patchset for years. I found internal reports of similar mmap_lock
contention issues in Google Fibers [2] and I suspect people would find
more places where this happens if they looked closer.

[1] https://lore.kernel.org/all/CAJuCfpE10y78SNPQ+LRY5EonDFhOG=1XjZ9FUUDiyhfhjZ54NA@mail.gmail.com/
[2] https://www.phoronix.com/scan.php?page=news_item&px=Google-Fibers-Toward-Open


Michal Hocko Sept. 5, 2022, 12:32 p.m. UTC | #5

Unless I am missing something, this is not based on the Maple tree
rewrite, right? Does the change in the data structure make any
difference to the approach? I remember discussions at LSFMM where it has
been pointed out that some issues with the vma tree are considerably
simpler to handle with the maple tree.

On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
[...]
> One notable way the implementation deviates from the proposal is the way
> VMAs are marked as locked. Because during some of mm updates multiple
> VMAs need to be locked until the end of the update (e.g. vma_merge,
> split_vma, etc).

I think it would be really helpful to spell out those issues in greater
detail. Not everybody is aware of those vma related subtleties.

Thanks for working on this Suren!

Suren Baghdasaryan Sept. 5, 2022, 6:32 p.m. UTC | #6

On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> Unless I am missing something, this is not based on the Maple tree
> rewrite, right? Does the change in the data structure makes any
> difference to the approach? I remember discussions at LSFMM where it has
> been pointed out that some issues with the vma tree are considerably
> simpler to handle with the maple tree.

Correct, this does not use the Maple tree yet, but once the Maple tree
transition happens and it supports RCU-safe lookups, my code in
find_vma_under_rcu() becomes really simple.

>
> On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> [...]
> > One notable way the implementation deviates from the proposal is the way
> > VMAs are marked as locked. Because during some of mm updates multiple
> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > split_vma, etc).
>
> I think it would be really helpful to spell out those issues in a greater
> detail. Not everybody is aware of those vma related subtleties.

Ack. I'll expand the description of the cases when multiple VMAs need
to be locked in the same update. The main difficulties are:
1. Multiple VMAs might need to be locked within one
mmap_write_lock/mmap_write_unlock session (will call it an update
transaction).
2. Figuring out when it's safe to unlock a previously locked VMA is
tricky because that might be happening in different functions and at
different call levels.

So, instead of the usual lock/unlock pattern, the proposed solution
marks a VMA as locked and provides an efficient way to:
1. Identify locked VMAs.
2. Unlock all locked VMAs in bulk.

We also postpone unlocking the locked VMAs until the end of the update
transaction, when we do mmap_write_unlock. Potentially this keeps a
VMA locked for longer than is absolutely necessary but it results in a
big reduction of code complexity.
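
A rough user-space sketch of such an update transaction follows. It is an
illustration rather than code from the patchset: the mmap_write_lock() and
mmap_write_unlock() functions below are toy stand-ins that only share their
names with the kernel functions, the struct fields are assumptions, and
adjust_one() and merge_pair() are hypothetical helpers standing in for code
at different call levels.

```c
#include <assert.h>
#include <stdbool.h>

struct mm_sketch {
	unsigned int mm_lock_seq;	/* assumed field name */
	bool mmap_write_locked;		/* models the write side of mmap_lock */
};

struct vma_sketch {
	struct mm_sketch *mm;
	unsigned int vm_lock_seq;
};

/* The update transaction begins. */
void mmap_write_lock(struct mm_sketch *mm)
{
	assert(!mm->mmap_write_locked);
	mm->mmap_write_locked = true;
}

/* The update transaction ends: every marked VMA is unlocked in one step. */
void mmap_write_unlock(struct mm_sketch *mm)
{
	assert(mm->mmap_write_locked);
	mm->mm_lock_seq++;
	mm->mmap_write_locked = false;
}

/*
 * Returns true if this call newly marked the VMA, false if it was
 * already marked earlier in the same transaction.
 */
bool vma_mark_locked(struct vma_sketch *vma)
{
	assert(vma->mm->mmap_write_locked);
	if (vma->vm_lock_seq == vma->mm->mm_lock_seq)
		return false;
	vma->vm_lock_seq = vma->mm->mm_lock_seq;
	return true;
}

/* Hypothetical inner helper: marks the VMA it is about to change. */
void adjust_one(struct vma_sketch *vma)
{
	vma_mark_locked(vma);
}

/* Hypothetical outer helper: marks both VMAs, then calls the inner one. */
void merge_pair(struct vma_sketch *a, struct vma_sketch *b)
{
	vma_mark_locked(a);
	vma_mark_locked(b);
	adjust_one(a);	/* re-marking is harmless, no unlock bookkeeping */
}
```

Neither helper has to know whether its caller already marked the VMA or who
is responsible for unlocking it; that decision is deferred to
mmap_write_unlock().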

>
> Thanks for working on this Suren!

Thanks for reviewing!
Suren.

> --
> Michal Hocko
> SUSE Labs
>

Kent Overstreet Sept. 5, 2022, 8:35 p.m. UTC | #7

On Mon, Sep 05, 2022 at 11:32:48AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > Unless I am missing something, this is not based on the Maple tree
> > rewrite, right? Does the change in the data structure makes any
> > difference to the approach? I remember discussions at LSFMM where it has
> > been pointed out that some issues with the vma tree are considerably
> > simpler to handle with the maple tree.
> 
> Correct, this does not use the Maple tree yet but once Maple tree
> transition happens and it supports RCU-safe lookups, my code in
> find_vma_under_rcu() becomes really simple.
> 
> >
> > On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> > [...]
> > > One notable way the implementation deviates from the proposal is the way
> > > VMAs are marked as locked. Because during some of mm updates multiple
> > > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > > split_vma, etc).
> >
> > I think it would be really helpful to spell out those issues in a greater
> > detail. Not everybody is aware of those vma related subtleties.
> 
> Ack. I'll expand the description of the cases when multiple VMAs need
> to be locked in the same update. The main difficulties are:
> 1. Multiple VMAs might need to be locked within one
> mmap_write_lock/mmap_write_unlock session (will call it an update
> transaction).
> 2. Figuring out when it's safe to unlock a previously locked VMA is
> tricky because that might be happening in different functions and at
> different call levels.
> 
> So, instead of the usual lock/unlock pattern, the proposed solution
> marks a VMA as locked and provides an efficient way to:
> 1. Identify locked VMAs.
> 2. Unlock all locked VMAs in bulk.
> 
> We also postpone unlocking the locked VMAs until the end of the update
> transaction, when we do mmap_write_unlock. Potentially this keeps a
> VMA locked for longer than is absolutely necessary but it results in a
> big reduction of code complexity.

Correct me if I'm wrong, but it looks like any time multiple VMAs need to be
locked we need mmap_lock anyways, which is what makes your approach so sweet.

If however we ever want to lock multiple VMAs without taking mmap_lock, then
deadlock avoidance algorithms aren't that bad - there's the ww_mutex approach,
which is simple and works well when there isn't much expected contention (the
advantage of the ww_mutex approach is that it doesn't have to track all held
locks). I've also written full cycle detection; that approach gets you fewer
restarts, at the cost of needing a list of all currently held locks.
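
For readers unfamiliar with it, the wait-die rule described in the kernel's
ww_mutex design documentation can be sketched roughly as below. This is a
user-space illustration only: the type and function names are made up, the
real ww_mutex API is not used, and stamps are assumed to be unique per
transaction with a lower stamp meaning an older transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ww_action {
	WW_ACQUIRE,	/* lock is free: take it */
	WW_WAIT,	/* requester is older than the holder: wait */
	WW_BACK_OFF,	/* requester is younger: release held locks, retry */
};

/* Hypothetical lock state; holder_stamp is meaningful only while held. */
struct ww_lock_sketch {
	bool held;
	unsigned long holder_stamp;
};

/*
 * Wait-die decision for a transaction with stamp my_stamp that wants a
 * lock. Only an older transaction ever waits for a younger one, so a
 * cycle of waiters cannot form.
 */
enum ww_action wait_die(const struct ww_lock_sketch *lock,
			unsigned long my_stamp)
{
	assert(lock != NULL);
	if (!lock->held)
		return WW_ACQUIRE;
	return my_stamp < lock->holder_stamp ? WW_WAIT : WW_BACK_OFF;
}
```

A transaction that gets WW_BACK_OFF is expected to drop the locks it already
holds and start over, which is where the restarts mentioned above come from.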

Suren Baghdasaryan Sept. 6, 2022, 3:46 p.m. UTC | #8

On Mon, Sep 5, 2022 at 1:35 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Mon, Sep 05, 2022 at 11:32:48AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Sep 5, 2022 at 5:32 AM 'Michal Hocko' via kernel-team
> > <kernel-team@android.com> wrote:
> > >
> > > Unless I am missing something, this is not based on the Maple tree
> > > rewrite, right? Does the change in the data structure makes any
> > > difference to the approach? I remember discussions at LSFMM where it has
> > > been pointed out that some issues with the vma tree are considerably
> > > simpler to handle with the maple tree.
> >
> > Correct, this does not use the Maple tree yet but once Maple tree
> > transition happens and it supports RCU-safe lookups, my code in
> > find_vma_under_rcu() becomes really simple.
> >
> > >
> > > On Thu 01-09-22 10:34:48, Suren Baghdasaryan wrote:
> > > [...]
> > > > One notable way the implementation deviates from the proposal is the way
> > > > VMAs are marked as locked. Because during some of mm updates multiple
> > > > VMAs need to be locked until the end of the update (e.g. vma_merge,
> > > > split_vma, etc).
> > >
> > > I think it would be really helpful to spell out those issues in a greater
> > > detail. Not everybody is aware of those vma related subtleties.
> >
> > Ack. I'll expand the description of the cases when multiple VMAs need
> > to be locked in the same update. The main difficulties are:
> > 1. Multiple VMAs might need to be locked within one
> > mmap_write_lock/mmap_write_unlock session (will call it an update
> > transaction).
> > 2. Figuring out when it's safe to unlock a previously locked VMA is
> > tricky because that might be happening in different functions and at
> > different call levels.
> >
> > So, instead of the usual lock/unlock pattern, the proposed solution
> > marks a VMA as locked and provides an efficient way to:
> > 1. Identify locked VMAs.
> > 2. Unlock all locked VMAs in bulk.
> >
> > We also postpone unlocking the locked VMAs until the end of the update
> > transaction, when we do mmap_write_unlock. Potentially this keeps a
> > VMA locked for longer than is absolutely necessary but it results in a
> > big reduction of code complexity.
>
> Correct me if I'm wrong, but it looks like any time multiple VMAs need to be
> locked we need mmap_lock anyways, which is what makes your approach so sweet.

That is correct. Anytime we need to take a VMA's write lock we have to
be holding the write side of the mmap_lock as well. That's what allows
me to skip locking in cases like checking if the VMA is already
locked.

>
> If however we ever want to lock multiple VMAs without taking mmap_lock, then
> deadlock avoidance algorithms aren't that bad - there's the ww_mutex approach,
> which is simple and works well when there isn't much expected contention (the
> advantage of the ww_mutex approach is that it doesn't have to track all held
> locks). I've also written full cycle detection; that approach gets you fewer
> restarts, at the cost of needing a list of all currently held locks.

Thanks for the tip! I'll take a closer look at ww_mutex.


Vlastimil Babka Sept. 11, 2022, 9:35 a.m. UTC | #9

On 9/2/22 01:26, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
>>
>> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
>> > Resending to fix the issue with the In-Reply-To tag in the original
>> > submission at [4].
>> >
>> > This is a proof of concept for per-vma locks idea that was discussed
>> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
>> > suggestion that “a reader/writer semaphore could be put into the VMA
>> > itself; that would have the effect of using the VMA as a sort of range
>> > lock. There would still be contention at the VMA level, but it would be an
>> > improvement.” This patchset implements this suggested approach.
>> >
>> > When handling page faults we lookup the VMA that contains the faulting
>> > page under RCU protection and try to acquire its lock. If that fails we
>> > fall back to using mmap_lock, similar to how SPF handled this situation.
>> >
>> > One notable way the implementation deviates from the proposal is the way
>> > VMAs are marked as locked. Because during some of mm updates multiple
>> > VMAs need to be locked until the end of the update (e.g. vma_merge,
>> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
>> > and other complications would make the code more complex. Therefore we
>> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
>> > all at once. This is done using two sequence numbers - one in the
>> > vm_area_struct and one in the mm_struct. VMA is considered locked when
>> > these sequence numbers are equal. To mark a VMA as locked we set the
>> > sequence number in vm_area_struct to be equal to the sequence number
>> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
>> > This allows for an efficient way to track locked VMAs and to drop the
>> > locks on all VMAs at the end of the update.
>>
>> I like it - the sequence numbers are a stroke of genius. For what it's doing
>> the patchset seems almost small.
> 
> Thanks for reviewing it!
> 
>>
>> Two complaints so far:
>>  - I don't like the vma_mark_locked() name. To me it says that the caller
>>    already took or is taking the lock and this function is just marking that
>>    we're holding the lock, but it's really taking a different type of lock. But
>>    this function can block, it really is taking a lock, so it should say that.
>>
>>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>>    but perhaps vma_lock_multiple()?
> 
> I'm open to name suggestions but vma_lock_multiple() is a bit
> confusing to me. Will wait for more suggestions.

Well, it does act like a vma_write_lock(), no? So why not that name? The
checking function for it is even called vma_assert_write_locked().

We just don't provide a single vma_write_unlock(), but a
vma_mark_unlocked_all(), that could be instead named e.g.
vma_write_unlock_all().
But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?

Suren Baghdasaryan Sept. 28, 2022, 2:28 a.m. UTC | #10

On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/2/22 01:26, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 1:58 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> >>
> >> On Thu, Sep 01, 2022 at 10:34:48AM -0700, Suren Baghdasaryan wrote:
> >> > Resending to fix the issue with the In-Reply-To tag in the original
> >> > submission at [4].
> >> >
> >> > This is a proof of concept for per-vma locks idea that was discussed
> >> > during SPF [1] discussion at LSF/MM this year [2], which concluded with
> >> > suggestion that “a reader/writer semaphore could be put into the VMA
> >> > itself; that would have the effect of using the VMA as a sort of range
> >> > lock. There would still be contention at the VMA level, but it would be an
> >> > improvement.” This patchset implements this suggested approach.
> >> >
> >> > When handling page faults we lookup the VMA that contains the faulting
> >> > page under RCU protection and try to acquire its lock. If that fails we
> >> > fall back to using mmap_lock, similar to how SPF handled this situation.
> >> >
> >> > One notable way the implementation deviates from the proposal is the way
> >> > VMAs are marked as locked. Because during some of mm updates multiple
> >> > VMAs need to be locked until the end of the update (e.g. vma_merge,
> >> > split_vma, etc). Tracking all the locked VMAs, avoiding recursive locks
> >> > and other complications would make the code more complex. Therefore we
> >> > provide a way to "mark" VMAs as locked and then unmark all locked VMAs
> >> > all at once. This is done using two sequence numbers - one in the
> >> > vm_area_struct and one in the mm_struct. VMA is considered locked when
> >> > these sequence numbers are equal. To mark a VMA as locked we set the
> >> > sequence number in vm_area_struct to be equal to the sequence number
> >> > in mm_struct. To unlock all VMAs we increment mm_struct's seq number.
> >> > This allows for an efficient way to track locked VMAs and to drop the
> >> > locks on all VMAs at the end of the update.
> >>
> >> I like it - the sequence numbers are a stroke of genius. For what it's doing
> >> the patchset seems almost small.
> >
> > Thanks for reviewing it!
> >
> >>
> >> Two complaints so far:
> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
> >>    already took or is taking the lock and this function is just marking that
> >>    we're holding the lock, but it's really taking a different type of lock. But
> >>    this function can block, it really is taking a lock, so it should say that.
> >>
> >>    This is AFAIK a new concept, not sure I'm going to have anything good either,
> >>    but perhaps vma_lock_multiple()?
> >
> > I'm open to name suggestions but vma_lock_multiple() is a bit
> > confusing to me. Will wait for more suggestions.
>
> Well, it does act like a vma_write_lock(), no? So why not that name. The
> checking function for it is even called vma_assert_write_locked().
>
> We just don't provide a single vma_write_unlock(), but a
> vma_mark_unlocked_all(), that could be instead named e.g.
> vma_write_unlock_all().
> But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?

Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
good to me. For vma_mark_unlocked_all() replacement, I would prefer
vma_write_unlock_all() which keeps the vma_write_XXX naming pattern to
indicate that these are operating on the same locks. If the fact that
it accepts mm_struct as a parameter is an issue then maybe
vma_write_unlock_mm() ?


Vlastimil Babka Sept. 29, 2022, 11:18 a.m. UTC | #11

On 9/28/22 04:28, Suren Baghdasaryan wrote:
> On Sun, Sep 11, 2022 at 2:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 9/2/22 01:26, Suren Baghdasaryan wrote:
>> >
>> >>
>> >> Two complaints so far:
>> >>  - I don't like the vma_mark_locked() name. To me it says that the caller
>> >>    already took or is taking the lock and this function is just marking that
>> >>    we're holding the lock, but it's really taking a different type of lock. But
>> >>    this function can block, it really is taking a lock, so it should say that.
>> >>
>> >>    This is AFAIK a new concept, not sure I'm going to have anything good either,
>> >>    but perhaps vma_lock_multiple()?
>> >
>> > I'm open to name suggestions but vma_lock_multiple() is a bit
>> > confusing to me. Will wait for more suggestions.
>>
>> Well, it does act like a vma_write_lock(), no? So why not that name. The
>> checking function for it is even called vma_assert_write_locked().
>>
>> We just don't provide a single vma_write_unlock(), but a
>> vma_mark_unlocked_all(), that could be instead named e.g.
>> vma_write_unlock_all().
>> But it's called on a mm, so maybe e.g. mm_vma_write_unlock_all()?
> 
> Thank you for your suggestions, Vlastimil! vma_write_lock() sounds
> good to me. For vma_mark_unlocked_all() replacement, I would prefer
> vma_write_unlock_all() which keeps the vma_write_XXX naming pattern to

OK.

> indicate that these are operating on the same locks. If the fact that
> it accepts mm_struct as a parameter is an issue then maybe
> vma_write_unlock_mm() ?

Sounds good!

