
[RFC,v2,0/6] hugetlb: Change huge pmd sharing synchronization again

Message ID: 20220420223753.386645-1-mike.kravetz@oracle.com

Message

Mike Kravetz April 20, 2022, 10:37 p.m. UTC
I am sending this as a v2 RFC for the following reasons:
- The original RFC was incomplete and had a few issues.
- The only comments on the original RFC suggested eliminating huge pmd
  sharing to eliminate the associated complexity.  I do not believe this
  is possible, as user space code will notice its absence.  In any case,
  if we want to remove i_mmap_rwsem from the fault path to address fault
  scalability, we will need to address fault/truncate races.  Patches 3
  and 4 of this series do that.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

To illustrate the regression, I created a simple program (sketched
below) that does the following in an infinite loop:
- mmap a 4GB hugetlb file (the size ensures pmd sharing)
- fault in all pages
- unmap the hugetlb file
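
A minimal sketch of such a test program (the hugetlbfs mount point and
file name are assumptions, and error handling is trimmed for brevity):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE	(4UL << 30)	/* 4GB is large enough for pmd sharing */
#define STEP		(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	int fd = open("/dev/hugepages/test", O_CREAT | O_RDWR, 0644);

	if (fd < 0 || ftruncate(fd, MAP_SIZE))
		exit(1);

	for (;;) {
		/* mmap the 4GB hugetlb file */
		char *addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);

		if (addr == MAP_FAILED)
			exit(1);

		/* fault in all pages: touch one byte per huge page */
		for (unsigned long off = 0; off < MAP_SIZE; off += STEP)
			addr[off] = 1;

		/* unmap the hugetlb file */
		munmap(addr, MAP_SIZE);
	}
}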

The hugetlb fault code was then instrumented to collect the number of
times the mutex was locked and the wait time.  Samples are from 10 second
intervals on a 4 CPU VM with 8GB memory.  Eight instances of the
map/fault/unmap program are running.

next-20220420
-------------
[  690.117843] Wait_debug: faults sec  3506
[  690.118788]             num faults  35062
[  690.119825]             num locks   35062
[  690.120956]             intvl wait time 54688 msecs
[  690.122330]             max_wait_time   24000 usecs


next-20220420 + this series
---------------------------
[  484.965960] Wait_debug: faults sec  1419429
[  484.967294]             num faults  14194293
[  484.968656]             num locks   5611
[  484.969893]             intvl wait time 21087 msecs
[  484.971388]             max_wait_time   34000 usecs

As can be seen, fault performance suffers when other operations, such
as unmap, take i_mmap_rwsem in write mode.

This series proposes reverting c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  This moves acquisition of i_mmap_rwsem in the
fault path back to huge_pmd_share, where it is only taken when necessary.
After reverting these patches, we still need to handle:
- fault and truncate races
  Catch and properly back out faults beyond i_size (see the first
  sketch below).
  Backing out reservations is much easier after commit 846be08578ed,
  which expanded restore_reserve_on_error functionality.
- unshare and fault/lookup races
  Since the pointer returned from huge_pte_offset or huge_pte_alloc may
  become invalid before we lock the page table, we must revalidate it
  after taking the lock (see the second sketch below).  Code paths must
  back out and possibly retry if the page table pointer changed.
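
The first race is handled along the lines of the existing i_size check
in hugetlb_no_page(); this fragment is illustrative (the backout label
and surrounding variables are assumed from that context):

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * Recheck i_size under the page table lock: a truncate may have
	 * completed since the fault began.  If the fault is now beyond
	 * i_size, back out everything done so far (including reservation
	 * adjustments) and return SIGBUS.
	 */
	size = i_size_read(mapping->host) >> huge_page_shift(h);
	if (idx >= size) {
		spin_unlock(ptl);
		ret = VM_FAULT_SIGBUS;
		goto backout;
	}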
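
For the second race, the revalidation takes roughly this shape (the
retry structure is illustrative, not the series' actual code):

retry:
	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	if (!ptep)
		return 0;

	ptl = huge_pte_lock(h, mm, ptep);
	/*
	 * The pmd backing ptep may have been unshared (and its page
	 * freed) between the lookup and taking the lock.  Revalidate by
	 * repeating the lookup; back out and retry if the pointer changed.
	 */
	if (ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
		spin_unlock(ptl);
		goto retry;
	}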

Patch 5 in this series makes basic changes to page table locking for
hugetlb mappings.  Currently code uses (split) pmd locks for hugetlb
mappings if page size is PMD_SIZE.  A pointer to the pmd is required
to find the page struct containing the lock.  However, with pmd sharing
the pmd pointer is not stable until we hold the pmd lock.  To solve
this chicken/egg problem, we use the page_table_lock in mm_struct if
the pmd pointer is associated with a mapping where pmd sharing is
possible.  A study of the performance implications of this change still
needs to be performed.
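
A sketch of that lock selection (the helper and the
vma_shareable_possible() test are hypothetical names for what patch 5
describes; note the follow-up comment below, which retracts this
approach):

static spinlock_t *hugetlb_pte_lockptr(struct hstate *h,
				       struct mm_struct *mm,
				       struct vm_area_struct *vma,
				       pmd_t *pmd)
{
	/*
	 * If the mapping could be using a shared pmd, the pmd pointer
	 * (and thus the page containing its lock) is not stable until
	 * a lock is held, so fall back to the mm-wide lock.
	 * NOTE: per the follow-up below this is wrong -- the per-process
	 * mm->page_table_lock cannot serialize updates to a pmd page
	 * shared between processes.
	 */
	if (huge_page_size(h) == PMD_SIZE && vma_shareable_possible(vma))
		return &mm->page_table_lock;
	return pmd_lockptr(mm, pmd);
}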

Please help with comments or suggestions.  I would like to come up with
something that is performant AND safe.

Mike Kravetz (6):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlbfs: Do not use pmd locks if hugetlb sharing possible
  hugetlb: Check for pmd unshare and fault/lookup races

 arch/powerpc/mm/pgtable.c |   2 +-
 fs/hugetlbfs/inode.c      | 136 +++++++++++--------
 include/linux/hugetlb.h   |  30 ++---
 mm/damon/vaddr.c          |   4 +-
 mm/hmm.c                  |   2 +-
 mm/hugetlb.c              | 273 ++++++++++++++++++++++----------------
 mm/mempolicy.c            |   2 +-
 mm/migrate.c              |   2 +-
 mm/page_vma_mapped.c      |   2 +-
 mm/rmap.c                 |   8 +-
 mm/userfaultfd.c          |  11 +-
 11 files changed, 261 insertions(+), 211 deletions(-)

Comments

Mike Kravetz April 22, 2022, 4:38 p.m. UTC | #1
On 4/20/22 15:37, Mike Kravetz wrote:
> 
> Patch 5 in this series makes basic changes to page table locking for
> hugetlb mappings.  Currently code uses (split) pmd locks for hugetlb
> mappings if page size is PMD_SIZE.  A pointer to the pmd is required
> to find the page struct containing the lock.  However, with pmd sharing
> the pmd pointer is not stable until we hold the pmd lock.  To solve
> this chicken/egg problem, we use the page_table_lock in mm_struct if
> the pmd pointer is associated with a mapping where pmd sharing is
> possible.  A study of the performance implications of this change still
> needs to be performed.

Sorry, this approach is totally wrong!!!

If sharing pmds, we MUST use the pmd-specific lock, as that is common
between processes.  If we use the process-specific lock in mm_struct,
we are not synchronizing pmd updates between processes.

I am going to rethink the idea of a vma (process) specific
synchronization mechanism for pmd sharing.  I abandoned this early
because of some lock ordering issues, but there may already be code to
handle situations where we run into trouble with lock ordering.