mbox series

[RFC,v4,0/8] hugetlb: Change huge pmd sharing synchronization again

Message ID 20220706202347.95150-1-mike.kravetz@oracle.com (mailing list archive)
Headers show
Series hugetlb: Change huge pmd sharing synchronization again | expand

Message

Mike Kravetz July 6, 2022, 8:23 p.m. UTC
I am sending this as a RFC once more in the hope of generating comments
and discussion.  While the changes are intrusive, they do show a significant
performance benefit in my simulated workload.  Code is based on next-20220706.

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and benefit of this patch series is not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running,
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

next-20220629 (unmodified)
--------------------------
samples		min	max	avg
48		492006	523305	502552

next-20220629 (revert i_mmap_sema locking)
------------------------------------------
samples		min	max	avg
48		497107	522704	507234

next-20220629 (vma sema locking, this series)
---------------------------------------------
samples		min	max	avg
48		494480	532461	505579

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  Changing the locking scheme results in a significant
performance benefit.

next-20220629 (unmodified)
==========================
fork program (forks per second)
-------------------------------
instances	min		max		avg
1		3546.404513	3546.404513	3546.4
24		577.559827	583.790154	581.428

faulting program (faults per second)
------------------------------------
instances	min		max		avg
1		491608.876328	491608.876328	491609
24		74535.644215	84787.975212	75485.1

combine forking and faulting
----------------------------
instances	min		max		avg
24		445.067848	451.480526	448.606		(forks)
24		2119.829757	2215.549965	2144.06		(faults)

next-20220629 (revert i_mmap_sema locking)
==========================================
fork program (forks per second)
-------------------------------
instances	min		max		avg
1		3494.161969	3494.161969	3494.16
24		736.786700	744.573732	739.105

faulting program (faults per second)
------------------------------------
instances	min		max		avg
1		495692.786855	495692.786855	495693
24		94290.781748	96456.033087	95231.3

combine forking and faulting
----------------------------
instances	min		max		avg
24		74.797786	77.671360	76.0802		(forks)
24		82633.819315	83689.377936	83110.8		(faults)

next-20220629 (vma sema locking, this series)
=============================================
fork program (forks per second)
-------------------------------
instances	min		max		avg
1		3467.288792	3467.28879	3467.29
24		660.194795	669.520649	663.257

faulting program (faults per second)
------------------------------------
instances	min		max		avg
1		477994.366435	477994.366435	477994
24		89544.336665	91933.002686	90512.1

combine forking and faulting
----------------------------
instances	min		max		avg
24		149.437608	152.865862	151.34		(forks)
24		65078.863949	66276.888544	65596.6		(faults)

Just looking at the combined numbers:
	i_mmap_rwsem		No locking		vma locking
        ------------		----------		-----------
forks	448			76			151
faults	2144			83110			65596

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5 which
depends on c0d0381ade79.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races in page fault/file truncation.
Patches 3 and 4 of this series address those races.  This requires
using the hugetlb fault mutexes for more coordination between the fault
code and file page removal.

Patches 5 - 7 add infrastructure for a new vma based rw semaphore that
will be used for pmd sharing synchronization.  The idea is that this
semaphore will be held in read mode for the duration of fault processing,
and held in write mode for unmap operations which may call huge_pmd_unshare.
Acquiring i_mmap_rwsem is also still required to synchronize huge pmd
sharing.  However it is only required in the fault path when setting up
sharing, and will be acquired in huge_pmd_share().

Patch 8 makes use of this new vma lock.  Unfortunately, the fault code
and truncate/hole punch code would naturally take locks in the opposite
order which could lead to deadlock.  Since the performance of page faults
is more important, the truncation/hole punch code is modified to back
out and take locks in the correct order if necessary.

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/

Mike Kravetz (8):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlbfs: move routine remove_huge_page to hugetlb.c
  hugetlbfs: catch and handle truncate racing with page faults
  hugetlb: rename vma_shareable() and refactor code
  hugetlb: add vma based lock for pmd sharing synchronization
  hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  hugetlb: use new vma_lock for pmd sharing synchronization

 fs/hugetlbfs/inode.c    | 283 +++++++++++++++++++++++++----------
 include/linux/hugetlb.h |  39 ++++-
 kernel/fork.c           |   6 +-
 mm/hugetlb.c            | 321 ++++++++++++++++++++++++++++++----------
 mm/memory.c             |   2 +
 mm/rmap.c               | 113 ++++++++------
 mm/userfaultfd.c        |  14 +-
 7 files changed, 561 insertions(+), 217 deletions(-)

Comments

Ray Fucillo July 20, 2022, 2:16 p.m. UTC | #1
> On Jul 6, 2022, at 4:23 PM, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
> I am sending this as a RFC once more in the hope of generating comments
> and discussion. While the changes are intrusive, they do show a significant
> performance benefit in my simulated workload. Code is based on next-20220706.
> 
> hugetlb fault scalability regressions have recently been reported [1].
> This is not the first such report, as regressions were also noted when
> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") was added [2] in v5.7. At that time, a proposal to
> address the regression was suggested [3] but went nowhere.
> 
> The regression and benefit of this patch series is not evident when
> using the vm_scalability benchmark reported in [2] on a recent kernel.
> Results from running,

Mike, thank you for all the work and careful consideration here!  We did find that this patch set addresses the scalability regression that was at the root of critical issues for customers that upgraded to newer Linux distributions (those with commit c0d0381ade79)

Ray