[v4,0/5] move per-vma lock into vm_area_struct

Message ID 20241120000826.335387-1-surenb@google.com

Message

Suren Baghdasaryan Nov. 20, 2024, 12:08 a.m. UTC
Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].
Splitting a single logical structure into multiple ones leads to more
complicated management, extra pointer dereferences and overall less
maintainable code. When that split-away part is a lock, it complicates
things even further. With no performance benefit, there is no reason
for this split. Merging the vm_lock back into vm_area_struct also allows
vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset.
This patchset:
1. moves vm_lock back into vm_area_struct, aligning it at the cacheline
boundary and changing the cache to be cacheline-aligned to minimize
cacheline sharing;
2. changes vm_area_struct initialization to mark new vma as detached until
it is inserted into vma tree;
3. changes vm_area_struct cache to SLAB_TYPESAFE_BY_RCU to allow for their
reuse and to minimize call_rcu() calls.
Pagefault microbenchmarks do not show noticeable performance change.
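
As a rough illustration of points 1 and 3 above (a sketch only; the helper
name and exact flag set here are assumptions, not the actual patch code), the
slab cache backing vm_area_struct could be set up along these lines:

#include <linux/slab.h>
#include <linux/mm_types.h>

static struct kmem_cache *vm_area_cachep;

void __init vma_cache_init(void)
{
	/*
	 * SLAB_HWCACHE_ALIGN keeps each vma (and the lock embedded in it)
	 * aligned to a cacheline boundary, minimizing false sharing with
	 * neighbouring objects.  SLAB_TYPESAFE_BY_RCU lets freed vmas be
	 * reused without waiting for an RCU grace period, which is why
	 * lockless readers must revalidate the vma (helped by the
	 * "detached" marking in point 2) after acquiring its lock.
	 */
	vm_area_cachep = kmem_cache_create("vm_area_struct",
					   sizeof(struct vm_area_struct),
					   0,
					   SLAB_HWCACHE_ALIGN | SLAB_PANIC |
					   SLAB_TYPESAFE_BY_RCU,
					   NULL);
}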

Changes since v3 [4]
- Added SOBs, per Lorenzo Stoakes and Davidlohr Bueso;
- Replaced vma write-locking in vma_mark_attached() with memory barriers to
order accesses to vma->detached vs vm_mm/vm_start/vm_end
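
A minimal sketch of the ordering described in the second bullet above
(helper bodies are hypothetical; the actual patch may differ):

static inline void vma_mark_attached(struct vm_area_struct *vma)
{
	/*
	 * Writer side: publish vm_mm/vm_start/vm_end before clearing
	 * "detached", so a lockless reader that sees the vma as attached
	 * also sees the fields it will validate against.
	 */
	smp_wmb();
	WRITE_ONCE(vma->detached, false);
}

static inline bool vma_is_attached(struct vm_area_struct *vma)
{
	bool detached = READ_ONCE(vma->detached);

	/* Reader side: pairs with the smp_wmb() in vma_mark_attached(). */
	smp_rmb();
	return !detached;
}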

Patch applies over mm-unstable

[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/
[4] https://lore.kernel.org/all/20241117080931.600731-1-surenb@google.com/

Suren Baghdasaryan (5):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: mark vma as detached until it's added into vma tree
  mm: make vma cache SLAB_TYPESAFE_BY_RCU
  docs/mm: document latest changes to vm_lock

 Documentation/mm/process_addrs.rst |  10 ++-
 include/linux/mm.h                 | 129 ++++++++++++++++++++++++-----
 include/linux/mm_types.h           |  19 ++---
 kernel/fork.c                      |  88 +++++++-------------
 mm/memory.c                        |  17 ++--
 mm/userfaultfd.c                   |  22 ++---
 mm/vma.c                           |   8 +-
 mm/vma.h                           |   2 +
 tools/testing/vma/vma_internal.h   |  55 +++++-------
 9 files changed, 198 insertions(+), 152 deletions(-)


base-commit: 5a7056135bb69da2ce0a42eb8c07968c1331777b

Comments

Shakeel Butt Nov. 20, 2024, 10:10 p.m. UTC | #1
On Tue, Nov 19, 2024 at 04:08:21PM -0800, Suren Baghdasaryan wrote:
> Back when per-vma locks were introduced, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regression is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].

If 'struct vm_area_struct' is prone to performance issues due to
cacheline misalignment then we should do something about the
__randomize_layout tag for it. I imagine we can identify the fields
which are performance critical to keep on the same cacheline (or on
different cachelines to avoid false sharing), then divide the fields
into cacheline groups and apply __randomize_layout within each group.
WDYT?
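
One possible shape of that idea (purely illustrative; the field grouping is
arbitrary and whether the randstruct plugin supports randomizing anonymous
groups like this is an assumption):

struct vm_area_struct {
	/* Hot fields the fault path reads together, kept in one group. */
	struct {
		unsigned long vm_start;
		unsigned long vm_end;
		struct mm_struct *vm_mm;
		pgprot_t vm_page_prot;
	} __randomize_layout;	/* shuffled only within this group */

	/* The per-VMA lock on its own cacheline (as in this series). */
	struct {
		struct vma_lock vm_lock;
	} ____cacheline_aligned_in_smp;

	/* Everything else, again randomized only within the group. */
	struct {
		unsigned long vm_flags;
		struct anon_vma *anon_vma;
		const struct vm_operations_struct *vm_ops;
		/* ... */
	} __randomize_layout;
};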
Suren Baghdasaryan Nov. 20, 2024, 11:52 p.m. UTC | #2
On Wed, Nov 20, 2024 at 2:10 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Nov 19, 2024 at 04:08:21PM -0800, Suren Baghdasaryan wrote:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
>
> If 'struct vm_area_struct' is prone to performance issues due to
> cacheline misalignment then we should do something about the
> __randomize_layout tag for it. I imagine we can identify the fields
> which are performance critical to keep on the same cacheline (or on
> different cachelines to avoid false sharing), then divide the fields
> into cacheline groups and apply __randomize_layout within each group.
> WDYT?

I think that's a good idea since shuffling these fields around does
affect performance. I can look into it once these changes get
finalized and the layout gets more stable.

Matthew Wilcox Nov. 21, 2024, 2 a.m. UTC | #3
On Wed, Nov 20, 2024 at 02:10:44PM -0800, Shakeel Butt wrote:
> If 'struct vm_area_struct' is prone to performance issues due to
> cacheline misalignment then we should do something about the
> __randomize_layout tag for it. I imagine we can identify the fields
> which are performance critical to keep on the same cacheline (or on
> different cachelines to avoid false sharing), then divide the fields
> into cacheline groups and apply __randomize_layout within each group.
> WDYT?

Pretty sure the people who think security is more important than
performance are the only ones who randomize structs.