[v12,00/31] Speculative page faults

Message ID 20190416134522.17540-1-ldufour@linux.ibm.com (mailing list archive)

Message

Laurent Dufour April 16, 2019, 1:44 p.m. UTC
This is a port to kernel 5.1 of the work done by Peter Zijlstra to handle
page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes, since the page fault handler will no longer wait for other
threads' memory layout changes to complete, assuming that those changes
happen in another part of the process's memory space. This type of page
fault is named a speculative page fault. If the speculative page fault
fails because a concurrent change has been detected or because the
underlying PMD or PTE tables are not yet allocated, the speculative
processing is aborted and a regular page fault is then tried.

The speculative page fault (SPF) handler has to look up the VMA matching
the fault address without holding the mmap_sem. This is done by
protecting the MM RB tree with RCU and by using a reference counter on
each VMA. When fetching a VMA under RCU protection, the VMA's reference
counter is incremented to ensure that the VMA will not be freed behind
our back during the SPF processing. Once that processing is done, the
VMA's reference counter is decremented. To ensure that a VMA is still
present when walking the RB tree locklessly, the VMA's reference counter
is incremented when that VMA is linked in the RB tree. When the VMA is
unlinked from the RB tree, its reference counter is only decremented at
the end of the RCU grace period, ensuring it stays available during that
time. This means that freeing a VMA could be delayed, which in turn
could delay the file closing for a file mapping. Since the SPF handler
is not able to manage file mappings, the file is closed synchronously
rather than during the RCU cleanup. This is safe since the page fault
handler aborts if a file pointer is associated with the VMA.

Using RCU fixes the overhead reported by Haiyan Song with the
will-it-scale benchmark [2].
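
As a rough C sketch of this scheme (find_vma_rcu(), put_vma() and
__free_vma() are names used in this series; __find_vma() and the
vm_ref_count field are named here for illustration only), the lockless
lookup could look like:

    static struct vm_area_struct *find_vma_rcu(struct mm_struct *mm,
                                               unsigned long addr)
    {
            struct vm_area_struct *vma;

            rcu_read_lock();
            vma = __find_vma(mm, addr);     /* lockless RB tree walk */
            /* take a reference, unless the VMA is already being freed */
            if (vma && !atomic_inc_not_zero(&vma->vm_ref_count))
                    vma = NULL;
            rcu_read_unlock();
            return vma;
    }

    static void put_vma(struct vm_area_struct *vma)
    {
            if (atomic_dec_and_test(&vma->vm_ref_count))
                    __free_vma(vma);        /* dropped the last reference */
    }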

The VMA's attributes checked during the speculative page fault
processing have to be protected against parallel changes. This is done
by using a per-VMA sequence lock, which allows the speculative page
fault handler to quickly check for changes in progress and to abort the
speculative page fault in that case.

Once the VMA has been found, the speculative page fault handler checks
the VMA's attributes to verify whether the page fault can be handled
correctly. If a concurrent change is detected through the sequence lock,
the speculative page fault is aborted and a *classic* page fault is
tried instead. VMA sequence locking is added around every modification
of the VMA attributes that are checked during the page fault.
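
A minimal sketch of the sequence count usage, assuming a seqcount_t
field named vm_sequence in struct vm_area_struct (the exact wrapper
names used by the series may differ):

    unsigned int seq;

    /* writer side (mmap_sem held for writing): any change to checked
     * VMA attributes is bracketed by the sequence count */
    write_seqcount_begin(&vma->vm_sequence);
    vma->vm_flags = new_flags;              /* e.g. from mprotect() */
    write_seqcount_end(&vma->vm_sequence);

    /* reader side: the speculative page fault handler */
    seq = raw_read_seqcount(&vma->vm_sequence);
    if (seq & 1)
            goto abort;     /* a change is in progress */
    /* ... speculative processing ... */
    if (read_seqcount_retry(&vma->vm_sequence, seq))
            goto abort;     /* the VMA changed under us */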

When the PTE is fetched, the VMA is checked again to see whether it has
been changed. So once the page table is locked, the VMA is known to be
valid: any other change touching this PTE would need to take the page
table lock first, so no parallel change is possible at this point.

The PTE is locked with interrupts disabled, which allows checking the
PMD to ensure that there is no collapsing operation in progress. Since
khugepaged first sets the PMD to pmd_none and then waits for the other
CPUs to acknowledge the IPI, if the PMD is valid at the time the PTE is
locked, we have the guarantee that the collapsing operation will have to
wait on the PTE lock to move forward. This allows the SPF handler to map
the PTE safely. If the PMD value is different from the one recorded at
the beginning of the SPF operation, the classic page fault handler is
called to handle the fault while holding the mmap_sem. As the PTE is
locked with interrupts disabled, the lock is taken using spin_trylock()
to avoid a deadlock when handling a page fault while a TLB invalidation
is requested by another CPU holding the PTE lock.

In pseudo code, this could be seen as:
    speculative_page_fault()
    {
	    vma = find_vma_rcu()
	    check vma sequence count
	    check vma's support
	    disable interrupt
		  check pgd,p4d,...,pte
		  save pmd and pte in vmf
		  save vma sequence counter in vmf
	    enable interrupt
	    check vma sequence count
	    handle_pte_fault(vma)
		    ..
		    page = alloc_page()
		    pte_map_lock()
			    disable interrupt
				    abort if sequence counter has changed
				    abort if pmd or pte has changed
				    pte map and lock
			    enable interrupt
		    if abort
		       free page
		       abort
		    ...
	    put_vma(vma)
    }
    
    arch_fault_handler()
    {
	    if (speculative_page_fault(&vma))
	       goto done
    again:
	    lock(mmap_sem)
	    vma = find_vma();
	    handle_pte_fault(vma);
	    if retry
	       unlock(mmap_sem)
	       goto again;
    done:
	    handle fault error
    }
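
Below is a hedged C sketch of the pte_map_lock() step above; the
vmf->sequence and vmf->orig_pmd fields stand for the values saved at the
beginning of the speculative fault, and the names are illustrative
rather than a verbatim copy of the series:

    static bool pte_map_lock(struct vm_fault *vmf)
    {
            bool ret = false;

            local_irq_disable();
            /* abort if the VMA changed since the speculative walk began */
            if (read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence))
                    goto out;
            /* abort if the PMD changed, e.g. khugepaged collapsing it */
            if (!pmd_same(*vmf->pmd, vmf->orig_pmd))
                    goto out;
            vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
            vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
            /*
             * trylock only: with IRQs off we must not spin on a PTE lock
             * held by a CPU waiting for our TLB-invalidate IPI ack.
             */
            if (!spin_trylock(vmf->ptl)) {
                    pte_unmap(vmf->pte);
                    goto out;
            }
            ret = true;
    out:
            local_irq_enable();
            return ret;
    }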

Support for THP is not done, because when checking the PMD we could be
confused by a collapsing operation in progress in khugepaged. The issue
is that pmd_none() can be true either if the PMD has not been populated
yet or if the underlying PTEs are in the process of being collapsed. So
we cannot safely allocate a PMD when pmd_none() is true.
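
To illustrate the ambiguity (sketch only; returning VM_FAULT_RETRY here
stands for "fall back to the classic path", the actual return convention
may differ):

    /*
     * khugepaged clears the PMD before installing the huge PMD, so at
     * this point pmd_none() cannot distinguish between a PMD that was
     * never populated (safe to allocate) and one temporarily cleared
     * by an ongoing collapse (not safe).
     */
    if (pmd_none(*vmf->pmd))
            return VM_FAULT_RETRY;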

This series adds a new software performance event named
'speculative-faults' or 'spf'. It counts the number of page fault events
successfully handled speculatively. When recording 'faults,spf' events,
the 'faults' event counts the total number of page fault events while
'spf' only counts the subset of faults processed speculatively.

This series also introduces some trace events. They allow identifying
why page faults were not processed speculatively. This does not take
into account the faults generated by a monothreaded process, which are
directly processed while holding the mmap_sem. These trace events are
grouped in a system named 'pagefault'; they are:

 - pagefault:spf_vma_changed : the VMA has been changed behind our back
 - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set
 - pagefault:spf_vma_notsup : the VMA's type is not supported
 - pagefault:spf_vma_access : the VMA's access rights are not respected
 - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind
   our back

To record all the related events, the easiest way is to run perf with
the following arguments:
$ perf stat -e 'faults,spf,pagefault:*' <command>

There is also a dedicated vmstat counter showing the number of page
faults successfully handled speculatively. It can be seen this way:
$ grep speculative_pgfault /proc/vmstat

It is possible to deactivate the speculative page fault handler by
writing 0 to /proc/sys/vm/speculative_page_fault.
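
For instance:
$ echo 0 > /proc/sys/vm/speculative_page_fault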

This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
functional on x86 and PowerPC. I cross-built it for arm64 but I was not
able to test it.

This series is also available on github [4].

---------------------
Real Workload results

Tests using a "popular in memory multithreaded database product" on a
128-core SMT8 Power system are in progress, and I will come back with
performance measurements as soon as possible. With the previous series
we saw up to 30% improvement in the number of transactions processed per
second, and we hope this will be the case with this series too.

------------------
Benchmarks results

Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
SPF is BASE + this series

Kernbench:
----------
Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
kernel (the kernel is built 5 times):

Average Half load -j 24 Run (std deviation):
                 BASE                   SPF
Elapsed Time     56.52   (1.39185)      56.256  (1.15106)       0.47%
User Time        980.018 (2.94734)      984.958 (1.98518)       -0.50%
System Time      130.744 (1.19148)      133.616 (0.873573)      -2.20%
Percent CPU      1965.6  (49.682)       1988.4  (40.035)        -1.16%
Context Switches 29926.6 (272.789)      30472.4 (109.569)       -1.82%
Sleeps           124793  (415.87)       125003  (591.008)       -0.17%

Average Optimal load -j 48 Run (std deviation):
                 BASE                   SPF
Elapsed Time     46.354  (0.917949)     45.968  (1.42786)       0.83%
User Time        1193.42 (224.96)       1196.78 (223.28)        -0.28%
System Time      143.306 (13.2726)      146.177 (13.2659)       -2.00%
Percent CPU      2668.6  (743.157)      2699.9  (753.767)       -1.17%
Context Switches 62268.3 (34097.1)      62721.7 (33999.1)       -0.73%
Sleeps           132556  (8222.99)      132607  (8077.6)        -0.04%

During a run on the SPF kernel, perf events were captured:
 Performance counter stats for '../kernbench -M':
       525,873,132      faults
               242      spf
                 0      pagefault:spf_vma_changed
                 0      pagefault:spf_vma_noanon
               441      pagefault:spf_vma_notsup
                 0      pagefault:spf_vma_access
                 0      pagefault:spf_pmd_changed

Very few speculative page faults were recorded, as most of the processes
involved are monothreaded (it seems that on this architecture some
threads were created during the kernel build processing).

Here are the kernbench results on a 1024 CPUs Power8 VM:

5.1.0-rc4-mm1+				5.1.0-rc4-mm1-spf-rcu+
Average Half load -j 512 Run (std deviation):
Elapsed Time 	 52.52   (0.906697)	52.778  (0.510069)	-0.49%
User Time 	 3855.43 (76.378)	3890.44 (73.0466)	-0.91%
System Time 	 1977.24 (182.316)	1974.56 (166.097)	0.14% 
Percent CPU 	 11111.6 (540.461)	11115.2 (458.907)	-0.03%
Context Switches 83245.6 (3061.44)	83651.8 (1202.31)	-0.49%
Sleeps 		 613459  (23091.8)	628378  (27485.2) 	-2.43%

Average Optimal load -j 1024 Run (std deviation):
Elapsed Time 	 52.964  (0.572346)	53.132 (0.825694)	-0.32%
User Time 	 4058.22 (222.034)	4070.2 (201.646) 	-0.30%
System Time 	 2672.81 (759.207)	2712.13 (797.292)	-1.47%
Percent CPU 	 12756.7 (1786.35)	12806.5 (1858.89)	-0.39% 
Context Switches 88818.5 (6772)		87890.6 (5567.72)	1.04% 
Sleeps 		 618658  (20842.2)	636297 (25044) 		-2.85%

During a run on the SPF kernel, perf events were captured:
 Performance counter stats for '../kernbench -M':
       149 375 832      faults
                 1      spf
                 0      pagefault:spf_vma_changed
                 0      pagefault:spf_vma_noanon
               561      pagefault:spf_vma_notsup
                 0      pagefault:spf_vma_access
                 0      pagefault:spf_pmd_changed

Most of the processes involved are monothreaded, so SPF is not
activated, but there is no impact on the performance.

Ebizzy:
-------
The test counts the number of records per second it can manage; the
higher the better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get
consistent results I repeated the test 100 times and measured the
average.

  		BASE		SPF		delta	
24 CPUs x86	5492.69		9383.07		70.83%
1024 CPUs P8 VM 8476.74		17144.38	102%

Here are the performance counters read during a run on a 48 CPUs x86 node:
 Performance counter stats for './ebizzy -mTt 48':
        11,846,569      faults
        10,886,706      spf
           957,702      pagefault:spf_vma_changed
                 0      pagefault:spf_vma_noanon
               815      pagefault:spf_vma_notsup
                 0      pagefault:spf_vma_access
                 0      pagefault:spf_pmd_changed

And the ones captured during a run on a 1024 CPUs Power VM:
 Performance counter stats for './ebizzy -mTt 1024':
         1 359 789      faults
         1 284 910      spf
            72 085      pagefault:spf_vma_changed
                 0      pagefault:spf_vma_noanon
             2 669      pagefault:spf_vma_notsup
                 0      pagefault:spf_vma_access
                 0      pagefault:spf_pmd_changed
		 
In ebizzy's case most of the page faults were handled speculatively,
leading to the ebizzy performance boost.

------------------
Changes since v11 [3]
 - Check vm_ops.fault instead of vm_ops since now all the VMAs have a
   vm_ops.
 - Abort the speculative page fault when doing swap readahead, because
   the VMA's boundaries are not protected at this time. This way the
   first swap-in does the readahead, and the next fault should be
   handled speculatively as the page is present in the swap cache.
 - Handle a race between copy_pte_range() and the wp_page_copy() called
   by the speculative page fault handler.
 - Ported to Kernel v5.0
 - Moved the VM_FAULT_PTNOTSAME define into mm_types.h
 - Use RCU to protect the MM RB tree instead of a rwlock.
 - Add a toggle interface: /proc/sys/vm/speculative_page_fault

[1] https://lore.kernel.org/linux-mm/20141020215633.717315139@infradead.org/
[2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
[3] https://lore.kernel.org/linux-mm/1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com/
[4] https://github.com/ldu4/linux/tree/spf-v12

Laurent Dufour (25):
  mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
  x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
  mm: make pte_unmap_same compatible with SPF
  mm: introduce INIT_VMA()
  mm: protect VMA modifications using VMA sequence count
  mm: protect mremap() against SPF handler
  mm: protect SPF handler against anon_vma changes
  mm: cache some VMA fields in the vm_fault structure
  mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
  mm: introduce __lru_cache_add_active_or_unevictable
  mm: introduce __vm_normal_page()
  mm: introduce __page_add_new_anon_rmap()
  mm: protect against PTE changes done by dup_mmap()
  mm: protect the RB tree with a sequence lock
  mm: introduce vma reference counter
  mm: Introduce find_vma_rcu()
  mm: don't do swap readahead during speculative page fault
  mm: adding speculative page fault failure trace events
  perf: add a speculative page fault sw event
  perf tools: add support for the SPF perf event
  mm: add speculative page fault vmstats
  powerpc/mm: add speculative page fault
  mm: Add a speculative page fault switch in sysctl

Mahendran Ganesh (2):
  arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
  arm64/mm: add speculative page fault

Peter Zijlstra (4):
  mm: prepare for FAULT_FLAG_SPECULATIVE
  mm: VMA sequence count
  mm: provide speculative fault infrastructure
  x86/mm: add speculative pagefault handling

 arch/arm64/Kconfig                    |   1 +
 arch/arm64/mm/fault.c                 |  12 +
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/mm/fault.c               |  16 +
 arch/x86/Kconfig                      |   1 +
 arch/x86/mm/fault.c                   |  14 +
 fs/exec.c                             |   1 +
 fs/proc/task_mmu.c                    |   5 +-
 fs/userfaultfd.c                      |  17 +-
 include/linux/hugetlb_inline.h        |   2 +-
 include/linux/migrate.h               |   4 +-
 include/linux/mm.h                    | 138 +++++-
 include/linux/mm_types.h              |  16 +-
 include/linux/pagemap.h               |   4 +-
 include/linux/rmap.h                  |  12 +-
 include/linux/swap.h                  |  10 +-
 include/linux/vm_event_item.h         |   3 +
 include/trace/events/pagefault.h      |  80 ++++
 include/uapi/linux/perf_event.h       |   1 +
 kernel/fork.c                         |  35 +-
 kernel/sysctl.c                       |   9 +
 mm/Kconfig                            |  22 +
 mm/huge_memory.c                      |   6 +-
 mm/hugetlb.c                          |   2 +
 mm/init-mm.c                          |   3 +
 mm/internal.h                         |  45 ++
 mm/khugepaged.c                       |   5 +
 mm/madvise.c                          |   6 +-
 mm/memory.c                           | 631 ++++++++++++++++++++++----
 mm/mempolicy.c                        |  51 ++-
 mm/migrate.c                          |   6 +-
 mm/mlock.c                            |  13 +-
 mm/mmap.c                             | 249 ++++++++--
 mm/mprotect.c                         |   4 +-
 mm/mremap.c                           |  13 +
 mm/nommu.c                            |   1 +
 mm/rmap.c                             |   5 +-
 mm/swap.c                             |   6 +-
 mm/swap_state.c                       |  10 +-
 mm/vmstat.c                           |   5 +-
 tools/include/uapi/linux/perf_event.h |   1 +
 tools/perf/util/evsel.c               |   1 +
 tools/perf/util/parse-events.c        |   4 +
 tools/perf/util/parse-events.l        |   1 +
 tools/perf/util/python.c              |   1 +
 45 files changed, 1277 insertions(+), 196 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

Comments

Michel Lespinasse April 22, 2019, 9:29 p.m. UTC | #1
Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to
go through it - I had not been following the previous iterations of
this series so I had to catch up. I will be sending comments for
individual commits, but before that I would like to discuss the series
as a whole.

I think these changes are a big step in the right direction. My main
reservation about them is that they are additive - adding some complexity
for speculative page faults - and I wonder if it'd be possible, over the
long term, to replace the existing complexity we have in mmap_sem retry
mechanisms instead of adding to it. This is not something that should
block your progress, but I think it would be good, as we introduce spf,
to evaluate whether we could eventually get all the way to removing the
mmap_sem retry mechanism, or if we will actually have to keep both.


The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?

The proposed spf mechanism only works at the pte level after the page
tables have already been created. The non-spf page fault path takes the
mm->page_table_lock to protect against concurrent page table allocation
by multiple page faults; I think unmapping/freeing page tables could
be done under mm->page_table_lock too so that spf could implement
allocating new page tables by verifying the vma after taking the
mm->page_table_lock ?

The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what is the issue there - is this due to the vma->vm_start
and vma->vm_pgoff reads in *__vm_normal_page() ?


My last potential concern is about performance. The numbers you have
look great, but I worry about potential regressions in PF performance
for threaded processes that don't currently encounter contention
(i.e. there may be just one thread actually doing all the work while
the others are blocked). I think one good proxy for measuring that
would be to measure a single threaded workload - kernbench would be
fine - without the special-case optimization in patch 22 where
handle_speculative_fault() immediately aborts in the single-threaded case.

Reviewed-by: Michel Lespinasse <walken@google.com>
This is for the series as a whole; I expect to do another review pass on
individual commits in the series when we have agreement on the toplevel
stuff (I noticed a few things like out-of-date commit messages but that's
really minor stuff).


I want to add a note about mmap_sem. In the past there has been
discussions about replacing it with an interval lock, but these never
went anywhere because, mostly, of the fact that such mechanisms were
too expensive to use in the page fault path. I think adding the spf
mechanism would invite us to revisit this issue - interval locks may
be a great way to avoid blocking between unrelated mmap_sem writers
(for example, do not delay stack creation for new threads while a
large mmap or munmap may be going on), and probably also to handle
mmap_sem readers that can't easily use the spf mechanism (for example,
gup callers which make use of the returned vmas). But again that is a
separate topic to explore which doesn't have to get resolved before
spf goes in.
Peter Zijlstra April 23, 2019, 9:38 a.m. UTC | #2
On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

IIRC (and I really don't remember all that much) the trickiest bit was
vs unmount. Since files can stay open past the 'expected' duration,
umount could be delayed.

But yes, I think I had a version that did all that just 'fine'. Like
mentioned, I didn't keep the refcount because it sucked just as hard as
the mmap_sem contention, but the SRCU callback did the fput() just fine
(esp. now that we have delayed_fput).
Michal Hocko April 23, 2019, 10:47 a.m. UTC | #3
On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
[...]
> I want to add a note about mmap_sem. In the past there has been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.

Well, I believe we should _really_ re-evaluate the range locking sooner
rather than later. Why? Because it looks like the most straightforward
approach to the mmap_sem contention for most usecases I have heard of
(mostly a mm{unm}ap, mremap standing in the way of page faults).
On the plus side it also makes us think about the current mmap (ab)users
which should lead to overall code improvements and maintainability.

SPF sounds like a good idea but it is a really big and intrusive surgery
to the #PF path. And more importantly without any real world usecase
numbers which would justify this. That being said I am not opposed to
this change I just think it is a large hammer while we haven't seen
attempts to tackle problems in a simpler way.
Anshuman Khandual April 23, 2019, 11:35 a.m. UTC | #4
On 04/16/2019 07:14 PM, Laurent Dufour wrote:
> In pseudo code, this could be seen as:
>     speculative_page_fault()
>     {
> 	    vma = find_vma_rcu()
> 	    check vma sequence count
> 	    check vma's support
> 	    disable interrupt
> 		  check pgd,p4d,...,pte
> 		  save pmd and pte in vmf
> 		  save vma sequence counter in vmf
> 	    enable interrupt
> 	    check vma sequence count
> 	    handle_pte_fault(vma)
> 		    ..
> 		    page = alloc_page()
> 		    pte_map_lock()
> 			    disable interrupt
> 				    abort if sequence counter has changed
> 				    abort if pmd or pte has changed
> 				    pte map and lock
> 			    enable interrupt
> 		    if abort
> 		       free page
> 		       abort

Wouldn't it be better if the 'page' allocated here could be passed on to
handle_pte_fault() below, so that in the fallback path it does not have to
enter the buddy allocator again? Of course it will require changes to
handle_pte_fault() to accommodate a pre-allocated non-NULL struct page to
operate on, or free it back into the buddy allocator if the fallback path
fails for some other reason. This would probably lower the overhead of the
SPF path for cases where it has to fall back on handle_pte_fault() after
pte_map_lock() in speculative_page_fault().

> 		    ...
> 	    put_vma(vma)
>     }
>     
>     arch_fault_handler()
>     {
> 	    if (speculative_page_fault(&vma))
> 	       goto done
>     again:
> 	    lock(mmap_sem)
> 	    vma = find_vma();
> 	    handle_pte_fault(vma);
> 	    if retry
> 	       unlock(mmap_sem)
> 	       goto again;
>     done:
> 	    handle fault error
>     }

- Anshuman
Matthew Wilcox (Oracle) April 23, 2019, 12:41 p.m. UTC | #5
On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> [...]
> > I want to add a note about mmap_sem. In the past there has been
> > discussions about replacing it with an interval lock, but these never
> > went anywhere because, mostly, of the fact that such mechanisms were
> > too expensive to use in the page fault path. I think adding the spf
> > mechanism would invite us to revisit this issue - interval locks may
> > be a great way to avoid blocking between unrelated mmap_sem writers
> > (for example, do not delay stack creation for new threads while a
> > large mmap or munmap may be going on), and probably also to handle
> > mmap_sem readers that can't easily use the spf mechanism (for example,
> > gup callers which make use of the returned vmas). But again that is a
> > separate topic to explore which doesn't have to get resolved before
> > spf goes in.
> 
> Well, I believe we should _really_ re-evaluate the range locking sooner
> rather than later. Why? Because it looks like the most straightforward
> approach to the mmap_sem contention for most usecases I have heard of
> (mostly a mm{unm}ap, mremap standing in the way of page faults).
> On the plus side it also makes us think about the current mmap (ab)users
> which should lead to overall code improvements and maintainability.

Dave Chinner recently did evaluate the range lock for solving a problem
in XFS and didn't like what he saw:

https://lore.kernel.org/linux-fsdevel/20190418031013.GX29573@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

I think scaling the lock needs to be tied to the actual data structure
and not have a second tree on-the-side to fake-scale the locking.  Anyway,
we're going to have a session on this at LSFMM, right?

> SPF sounds like a good idea but it is a really big and intrusive surgery
> to the #PF path. And more importantly without any real world usecase
> numbers which would justify this. That being said I am not opposed to
> this change I just think it is a large hammer while we haven't seen
> attempts to tackle problems in a simpler way.

I don't think the "no real world usecase numbers" is fair.  Laurent quoted:

> Ebizzy:
> -------
> The test counts the number of records per second it can manage; the
> higher the better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get
> consistent results I repeated the test 100 times and measured the
> average.
> 
>   		BASE		SPF		delta	
> 24 CPUs x86	5492.69		9383.07		70.83%
> 1024 CPUs P8 VM 8476.74		17144.38	102%

and cited 30% improvement for you-know-what product from an earlier
version of the patch.
Peter Zijlstra April 23, 2019, 12:48 p.m. UTC | #6
On Tue, Apr 23, 2019 at 05:41:48AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On the plus side it also makes us think about the current mmap (ab)users
> > which should lead to overall code improvements and maintainability.
> 
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
> 
> https://lore.kernel.org/linux-fsdevel/20190418031013.GX29573@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c
> 
> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.

Right, which is how I ended up using the split PT locks. They already
provide fine(r) grained locking.
Michal Hocko April 23, 2019, 1:42 p.m. UTC | #7
On Tue 23-04-19 05:41:48, Matthew Wilcox wrote:
> On Tue, Apr 23, 2019 at 12:47:07PM +0200, Michal Hocko wrote:
> > On Mon 22-04-19 14:29:16, Michel Lespinasse wrote:
> > [...]
> > > I want to add a note about mmap_sem. In the past there has been
> > > discussions about replacing it with an interval lock, but these never
> > > went anywhere because, mostly, of the fact that such mechanisms were
> > > too expensive to use in the page fault path. I think adding the spf
> > > mechanism would invite us to revisit this issue - interval locks may
> > > be a great way to avoid blocking between unrelated mmap_sem writers
> > > (for example, do not delay stack creation for new threads while a
> > > large mmap or munmap may be going on), and probably also to handle
> > > mmap_sem readers that can't easily use the spf mechanism (for example,
> > > gup callers which make use of the returned vmas). But again that is a
> > > separate topic to explore which doesn't have to get resolved before
> > > spf goes in.
> > 
> > Well, I believe we should _really_ re-evaluate the range locking sooner
> > rather than later. Why? Because it looks like the most straightforward
> > approach to the mmap_sem contention for most usecases I have heard of
> > (mostly a mm{unm}ap, mremap standing in the way of page faults).
> > On the plus side it also makes us think about the current mmap (ab)users
> > which should lead to overall code improvements and maintainability.
> 
> Dave Chinner recently did evaluate the range lock for solving a problem
> in XFS and didn't like what he saw:
> 
> https://lore.kernel.org/linux-fsdevel/20190418031013.GX29573@dread.disaster.area/T/#md981b32c12a2557a2dd0f79ad41d6c8df1f6f27c

Thank you, will have a look.

> I think scaling the lock needs to be tied to the actual data structure
> and not have a second tree on-the-side to fake-scale the locking.  Anyway,
> we're going to have a session on this at LSFMM, right?

I thought we had something for the mmap_sem scaling but I do not see
this in the list of proposed topics. But we can certainly add it there.

> > SPF sounds like a good idea but it is a really big and intrusive surgery
> > to the #PF path. And more importantly without any real world usecase
> > numbers which would justify this. That being said I am not opposed to
> > this change I just think it is a large hammer while we haven't seen
> > attempts to tackle problems in a simpler way.
> 
> I don't think the "no real world usecase numbers" is fair.  Laurent quoted:
> 
> > Ebizzy:
> > -------
> > The test counts the number of records per second it can manage; the
> > higher the better. I ran it like this: 'ebizzy -mTt <nrcpus>'. To get
> > consistent results I repeated the test 100 times and measured the
> > average.
> > 
> >   		BASE		SPF		delta	
> > 24 CPUs x86	5492.69		9383.07		70.83%
> > 1024 CPUs P8 VM 8476.74		17144.38	102%
> 
> and cited 30% improvement for you-know-what product from an earlier
> version of the patch.

Well, we are talking about
45 files changed, 1277 insertions(+), 196 deletions(-)

which is a _major_ surgery in my book. Having real life workload numbers
is nothing unfair to ask for IMHO.

And let me remind you that I am not really opposing SPF in general. I
would just like to see a simpler approach before we go for such a large
change. If the range locking is not really a scalable approach then all
right, but from what I've seen it should help with most of the
bottlenecks I have seen.
Laurent Dufour April 24, 2019, 7:33 a.m. UTC | #8
On 23/04/2019 11:38, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
>> The proposed spf mechanism only handles anon vmas. Is there a
>> fundamental reason why it couldn't handle mapped files too ?
>> My understanding is that the mechanism of verifying the vma after
>> taking back the ptl at the end of the fault would work there too ?
>> The file has to stay referenced during the fault, but holding the vma's
>> refcount could be made to cover that ? the vm_file refcount would have
>> to be released in __free_vma() instead of remove_vma; I'm not quite sure
>> if that has more implications than I realize ?
> 
> IIRC (and I really don't remember all that much) the trickiest bit was
> vs unmount. Since files can stay open past the 'expected' duration,
> umount could be delayed.
> 
> But yes, I think I had a version that did all that just 'fine'. Like
> mentioned, I didn't keep the refcount because it sucked just as hard as
> the mmap_sem contention, but the SRCU callback did the fput() just fine
> (esp. now that we have delayed_fput).

I had to use a refcount for the VMA because I'm using RCU in place of 
SRCU and only protecting the RB tree using RCU.

Regarding the file pointer, I decided to release it synchronously to 
avoid the latency of RCU during the file closing. As you mentioned, this 
could delay the umount, but not only that, as Linus Torvalds demonstrated 
in the past [1]. Anyway, since the file support is not here yet, there is 
no need for that currently.

Regarding the file mapping support, the concern is to ensure that 
vm_ops->fault() will not try to release the mmap_sem. This is true for 
most of the file systems, which use the generic fault handler, but there is 
currently no clever way to identify that except by checking the 
vm_ops->fault pointer. Adding a flag to the vm_operations_struct 
structure is another option.

That's doable as long as the underlying fault() function is not dealing 
with the mmap_sem. I made an attempt in the past, but I was thinking that 
the anonymous case should be accepted first before moving forward this way.

[1] 
https://lore.kernel.org/linux-mm/alpine.LFD.2.00.1001041904250.3630@localhost.localdomain/
Laurent Dufour April 24, 2019, 6:01 p.m. UTC | #9
On 22/04/2019 23:29, Michel Lespinasse wrote:
> Hi Laurent,
> 
> Thanks a lot for copying me on this patchset. It took me a few days to
> go through it - I had not been following the previous iterations of
> this series so I had to catch up. I will be sending comments for
> individual commits, but before that I would like to discuss the series
> as a whole.

Hi Michel,

Thanks for reviewing this series.

> I think these changes are a big step in the right direction. My main
> reservation about them is that they are additive - adding some complexity
> for speculative page faults - and I wonder if it'd be possible, over the
> long term, to replace the existing complexity we have in mmap_sem retry
> mechanisms instead of adding to it. This is not something that should
> block your progress, but I think it would be good, as we introduce spf,
> to evaluate whether we could eventually get all the way to removing the
> mmap_sem retry mechanism, or if we will actually have to keep both.

Until we get rid of the mmap_sem, which seems to be a very long story, I 
can't see how we could get rid of the retry mechanism.

> The proposed spf mechanism only handles anon vmas. Is there a
> fundamental reason why it couldn't handle mapped files too ?
> My understanding is that the mechanism of verifying the vma after
> taking back the ptl at the end of the fault would work there too ?
> The file has to stay referenced during the fault, but holding the vma's
> refcount could be made to cover that ? the vm_file refcount would have
> to be released in __free_vma() instead of remove_vma; I'm not quite sure
> if that has more implications than I realize ?

The only concern is the flow of operations done in the vm_ops->fault() 
processing. Most of the file systems rely on the generic filemap_fault(), 
which should be safe to use. But we need a clever way to identify fault 
handlers which are compatible with the SPF handler. This could be done 
using a tag/flag in the vm_ops structure or in the vma's flags.

This would be the next step.


> The proposed spf mechanism only works at the pte level after the page
> tables have already been created. The non-spf page fault path takes the
> mm->page_table_lock to protect against concurrent page table allocation
> by multiple page faults; I think unmapping/freeing page tables could
> be done under mm->page_table_lock too so that spf could implement
> allocating new page tables by verifying the vma after taking the
> mm->page_table_lock ?

I have to admit that I didn't dig further here.
Do you have a patch? ;)

> 
> The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
> I am not sure what is the issue there - is this due to the vma->vm_start
> and vma->vm_pgoff reads in *__vm_normal_page() ?

Yes, that's the reason: there is no way to guarantee the value of these 
fields in the SPF path.

> 
> My last potential concern is about performance. The numbers you have
> look great, but I worry about potential regressions in PF performance
> for threaded processes that don't currently encounter contention
> (i.e. there may be just one thread actually doing all the work while
> the others are blocked). I think one good proxy for measuring that
> would be to measure a single threaded workload - kernbench would be
> fine - without the special-case optimization in patch 22 where
> handle_speculative_fault() immediately aborts in the single-threaded case.

I'll have to give it a try.

> Reviewed-by: Michel Lespinasse <walken@google.com>
> This is for the series as a whole; I expect to do another review pass on
> individual commits in the series when we have agreement on the toplevel
> stuff (I noticed a few things like out-of-date commit messages but that's
> really minor stuff).

Thanks a lot for reviewing this long series.

> 
> I want to add a note about mmap_sem. In the past there has been
> discussions about replacing it with an interval lock, but these never
> went anywhere because, mostly, of the fact that such mechanisms were
> too expensive to use in the page fault path. I think adding the spf
> mechanism would invite us to revisit this issue - interval locks may
> be a great way to avoid blocking between unrelated mmap_sem writers
> (for example, do not delay stack creation for new threads while a
> large mmap or munmap may be going on), and probably also to handle
> mmap_sem readers that can't easily use the spf mechanism (for example,
> gup callers which make use of the returned vmas). But again that is a
> separate topic to explore which doesn't have to get resolved before
> spf goes in.
>
Michel Lespinasse April 27, 2019, 1:53 a.m. UTC | #10
On Wed, Apr 24, 2019 at 09:33:44AM +0200, Laurent Dufour wrote:
> On 23/04/2019 11:38, Peter Zijlstra wrote:
> > On Mon, Apr 22, 2019 at 02:29:16PM -0700, Michel Lespinasse wrote:
> > > The proposed spf mechanism only handles anon vmas. Is there a
> > > fundamental reason why it couldn't handle mapped files too ?
> > > My understanding is that the mechanism of verifying the vma after
> > > taking back the ptl at the end of the fault would work there too ?
> > > The file has to stay referenced during the fault, but holding the vma's
> > > refcount could be made to cover that ? the vm_file refcount would have
> > > to be released in __free_vma() instead of remove_vma; I'm not quite sure
> > > if that has more implications than I realize ?
> > 
> > IIRC (and I really don't remember all that much) the trickiest bit was
> > vs unmount. Since files can stay open past the 'expected' duration,
> > umount could be delayed.
> > 
> > But yes, I think I had a version that did all that just 'fine'. Like
> > mentioned, I didn't keep the refcount because it sucked just as hard as
> > the mmap_sem contention, but the SRCU callback did the fput() just fine
> > (esp. now that we have delayed_fput).
> 
> I had to use a refcount for the VMA because I'm using RCU in place of SRCU
> and only protecting the RB tree using RCU.
> 
> Regarding the file pointer, I decided to release it synchronously to avoid
> the latency of RCU during the file closing. As you mentioned, this could
> delay the umount, but not only that, as Linus Torvalds demonstrated in the
> past [1]. Anyway, since the file support is not here yet, there is no need
> for that currently.
>
> [1] https://lore.kernel.org/linux-mm/alpine.LFD.2.00.1001041904250.3630@localhost.localdomain/

Just to make sure I understand this correctly. If a program tries to
munmap a region while page faults are occurring (which means that the
program has a race condition in the first place), before spf the
mmap_sem would delay the munmap until the page fault completes. With
spf the munmap will happen immediately, while the vm_ops->fault()
is running, with spf holding a ref to the file. vm_ops->fault is
expected to execute a read from the file to the page cache, and the
page cache page will never be mapped into the process because after
taking the ptl, spf will notice the vma changed.  So, the side effects
that may be observed after munmap completes would be:

- side effects from reading a file into the page cache - I'm not sure
  what they are, the main one I can think of is that userspace may observe
  the file's atime changing ?

- side effects from holding a reference to the file - which userspace
  may observe by trying to unmount().

Is that the extent of the side effects, or are there more that I have
not thought of ?

> Regarding the file mapping support, the concern is to ensure that
> vm_ops->fault() will not try to release the mmap_sem. This is true for most
> of the file systems, which use the generic fault handler, but there is currently
> no clever way to identify that except by checking the vm_ops->fault pointer.
> Adding a flag to the vm_operations_struct structure is another option.
> 
> That's doable as long as the underlying fault() function is not dealing with
> the mmap_sem. I made an attempt in the past, but I was thinking that the
> anonymous case should be accepted first before moving forward this way.

Yes, that makes sense. Updating all of the fault handlers would be a
lot of work - but there doesn't seem to be anything fundamental that
wouldn't work there (except for the side effects of reordering spf
against munmap, as discussed above, which doesn't look easy to fully hide).
Michel Lespinasse April 27, 2019, 6 a.m. UTC | #11
On Wed, Apr 24, 2019 at 08:01:20PM +0200, Laurent Dufour wrote:
> On 22/04/2019 23:29, Michel Lespinasse wrote:
> > Hi Laurent,
> > 
> > Thanks a lot for copying me on this patchset. It took me a few days to
> > go through it - I had not been following the previous iterations of
> > this series so I had to catch up. I will be sending comments for
> > individual commits, but before that I would like to discuss the series
> > as a whole.
> 
> Hi Michel,
> 
> Thanks for reviewing this series.
> 
> > I think these changes are a big step in the right direction. My main
> > reservation about them is that they are additive - adding some complexity
> > for speculative page faults - and I wonder if it'd be possible, over the
> > long term, to replace the existing complexity we have in mmap_sem retry
> > mechanisms instead of adding to it. This is not something that should
> > block your progress, but I think it would be good, as we introduce spf,
> > to evaluate whether we could eventually get all the way to removing the
> > mmap_sem retry mechanism, or if we will actually have to keep both.
> 
> Until we get rid of the mmap_sem, which seems to be a very long story, I
> can't see how we could get rid of the retry mechanism.

Short answer: I'd like spf to be extended to handle file vmas,
populating page tables, and the vm_normal_page thing, so that we
wouldn't have to fall back to the path that grabs (and possibly
has to drop) the read side mmap_sem.

Even doing the above, there are still cases spf can't solve - for
example, gup, or the occasional spf abort, or even the case of a large
mmap/munmap delaying a smaller one. I think replacing mmap_sem with a
reader/writer interval lock would be a very nice generic solution to
this problem, allowing false conflicts to proceed in parallel, while
synchronizing true conflicts which is exactly what we want. But I
don't think such a lock can be implemented efficiently enough to be
put on the page fault fast-path, so I think spf could be the solution
there - it would allow us to skip taking that interval lock on most
page faults. The other places where we use mmap_sem are not as critical
for performance (they normally operate on a larger region at a time)
so I think we could afford the interval lock in those places.
Haiyan Song June 6, 2019, 6:51 a.m. UTC | #12
Hi Laurent,

Regression test for v12 patch serials have been run on Intel 2s skylake platform,
some regressions were found by LKP-tools (linux kernel performance). Only tested the
cases that have been run and found regressions on v11 patch serials.

The patch series was taken from https://github.com/ldu4/linux/tree/spf-v12.
Kernel commits:
  base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
  head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
THP: enable / disable
nr_task: 100%

The following are the benchmark results; every case was tested 4 times.

a). Enable THP
                                            base  %stddev   change    head   %stddev
will-it-scale.page_fault3.per_thread_ops    63216  ±3%      -16.9%    52537   ±4%
will-it-scale.page_fault2.per_thread_ops    36862           -9.8%     33256

b). Disable THP
                                            base  %stddev   change    head   %stddev
will-it-scale.page_fault3.per_thread_ops    65111           -18.6%    53023  ±2%
will-it-scale.page_fault2.per_thread_ops    38164           -12.0%    33565

Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
> page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> process since the page fault handler will not wait for other threads memory
> layout change to be done, assuming that this change is done in another part
> of the process's memory space. This type of page fault is named speculative
> page fault. If the speculative page fault fails because a concurrency has
> been detected or because underlying PMD or PTE tables are not yet
> allocating, it is failing its processing and a regular page fault is then
> tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, this is done by protecting the MM RB
> tree with RCU and by using a reference counter on each VMA. When fetching a
> VMA under the RCU protection, the VMA's reference counter is incremented to
> ensure that the VMA will not freed in our back during the SPF
> processing. Once that processing is done the VMA's reference counter is
> decremented. To ensure that a VMA is still present when walking the RB tree
> locklessly, the VMA's reference counter is incremented when that VMA is
> linked in the RB tree. When the VMA is unlinked from the RB tree, its
> reference counter will be decremented at the end of the RCU grace period,
> ensuring it will be available during this time. This means that the VMA
> freeing could be delayed and could delay the file closing for file
> mapping. Since the SPF handler is not able to manage file mapping, file is
> closed synchronously and not during the RCU cleaning. This is safe since
> the page fault handler is aborting if a file pointer is associated to the
> VMA.
> 
> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
> benchmark [2].
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA has been found, the speculative page fault handler would check
> for the VMA's attributes to verify that the page fault has to be handled
> correctly or not. Thus, the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried.  VMA sequence lockings are added when VMA attributes which are
> checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked, the VMA is valid, so any other changes
> leading to touching this PTE will need to lock the page table, so no
> parallel change is possible at this time.
> 
> The locking of the PTE is done with interrupts disabled, this allows
> checking for the PMD to ensure that there is not an ongoing collapsing
> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
> valid at the time the PTE is locked, we have the guarantee that the
> collapsing operation will have to wait on the PTE lock to move
> forward. This allows the SPF handler to map the PTE safely. If the PMD
> value is different from the one recorded at the beginning of the SPF
> operation, the classic page fault handler will be called to handle the
> operation while holding the mmap_sem. As the PTE lock is done with the
> interrupts disabled, the lock is done using spin_trylock() to avoid dead
> lock when handling a page fault while a TLB invalidate is requested by
> another CPU holding the PTE.
> 
> In pseudo code, this could be seen as:
>     speculative_page_fault()
>     {
> 	    vma = find_vma_rcu()
> 	    check vma sequence count
> 	    check vma's support
> 	    disable interrupt
> 		  check pgd,p4d,...,pte
> 		  save pmd and pte in vmf
> 		  save vma sequence counter in vmf
> 	    enable interrupt
> 	    check vma sequence count
> 	    handle_pte_fault(vma)
> 		    ..
> 		    page = alloc_page()
> 		    pte_map_lock()
> 			    disable interrupt
> 				    abort if sequence counter has changed
> 				    abort if pmd or pte has changed
> 				    pte map and lock
> 			    enable interrupt
> 		    if abort
> 		       free page
> 		       abort
> 		    ...
> 	    put_vma(vma)
>     }
>     
>     arch_fault_handler()
>     {
> 	    if (speculative_page_fault(&vma))
> 	       goto done
>     again:
> 	    lock(mmap_sem)
> 	    vma = find_vma();
> 	    handle_pte_fault(vma);
> 	    if retry
> 	       unlock(mmap_sem)
> 	       goto again;
>     done:
> 	    handle fault error
>     }
> 
> Support for THP is not done because when checking for the PMD, we can be
> confused by an in progress collapsing operation done by khugepaged. The
> issue is that pmd_none() could be true either if the PMD is not already
> populated or if the underlying PTE are in the way to be collapsed. So we
> cannot safely allocate a PMD if pmd_none() is true.
> 
> This series add a new software performance event named 'speculative-faults'
> or 'spf'. It counts the number of successful page fault event handled
> speculatively. When recording 'faults,spf' events, the faults one is
> counting the total number of page fault events while 'spf' is only counting
> the part of the faults processed speculatively.
> 
> There are some trace events introduced by this series. They allow
> identifying why the page faults were not processed speculatively. This
> doesn't take in account the faults generated by a monothreaded process
> which directly processed while holding the mmap_sem. This trace events are
> grouped in a system named 'pagefault', they are:
> 
>  - pagefault:spf_vma_changed : if the VMA has been changed in our back
>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>  - pagefault:spf_vma_access : the VMA's access right are not respected
>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>  back.
> 
> To record all the related events, the easier is to run perf with the
> following arguments :
> $ perf stat -e 'faults,spf,pagefault:*' <command>
> 
> There is also a dedicated vmstat counter showing the number of successful
> page fault handled speculatively. I can be seen this way:
> $ grep speculative_pgfault /proc/vmstat
> 
> It is possible to deactivate the speculative page fault handler by echoing
> 0 in /proc/sys/vm/speculative_page_fault.
> 
> This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
> functional on x86, PowerPC. I cross built it on arm64 but I was not able to
> test it.
> 
> This series is also available on github [4].
> 
> ---------------------
> Real Workload results
> 
> Test using a "popular in memory multithreaded database product" on 128cores
> SMT8 Power system are in progress and I will come back with performance
> mesurement as soon as possible. With the previous series we seen up to 30%
> improvements in the number of transaction processed per second, and we hope
> this will be the case with this series too.
> 
> ------------------
> Benchmarks results
> 
> Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
> SPF is BASE + this series
> 
> Kernbench:
> ----------
> Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
> kernel (kernel is build 5 times):
> 
> Average	Half load -j 24
> 		 Run	(std deviation)
> 		 BASE			SPF
> Elapsed	Time	 56.52   (1.39185)      56.256  (1.15106)       0.47% 
> User	Time	 980.018 (2.94734)      984.958 (1.98518)       -0.50%
> System	Time	 130.744 (1.19148)      133.616 (0.873573)      -2.20%
> Percent	CPU	 1965.6  (49.682)       1988.4  (40.035)        -1.16%
> Context	Switches 29926.6 (272.789)      30472.4 (109.569)       -1.82%
> Sleeps		 124793  (415.87)       125003  (591.008)       -0.17%
> 						
> Average	Optimal	load -j	48
> 		 Run	(std deviation)
> 		 BASE			SPF
> Elapsed	Time	 46.354  (0.917949)     45.968 (1.42786)        0.83% 
> User	Time	 1193.42 (224.96)       1196.78 (223.28)        -0.28%
> System	Time	 143.306 (13.2726)      146.177 (13.2659)       -2.00%
> Percent	CPU	 2668.6  (743.157)      2699.9 (753.767)        -1.17%
> Context	Switches 62268.3 (34097.1)      62721.7 (33999.1)       -0.73%
> Sleeps		 132556  (8222.99)      132607 (8077.6)         -0.04%
> 
> During a run on the SPF, perf events were captured:
>  Performance counter stats for '../kernbench -M':
>        525,873,132      faults
>                242      spf
>                  0      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                441      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> Very few speculative page faults were recorded as most of the processes
> involved are monothreaded (sounds that on this architecture some threads
> were created during the kernel build processing).
> 
> Here are the kerbench results on a 1024 CPUs Power8 VM:
> 
> 5.1.0-rc4-mm1+				5.1.0-rc4-mm1-spf-rcu+
> Average Half load -j 512 Run (std deviation):
> Elapsed Time 	 52.52   (0.906697)	52.778  (0.510069)	-0.49%
> User Time 	 3855.43 (76.378)	3890.44 (73.0466)	-0.91%
> System Time 	 1977.24 (182.316)	1974.56 (166.097)	0.14% 
> Percent CPU 	 11111.6 (540.461)	11115.2 (458.907)	-0.03%
> Context Switches 83245.6 (3061.44)	83651.8 (1202.31)	-0.49%
> Sleeps 		 613459  (23091.8)	628378  (27485.2) 	-2.43%
> 
> Average Optimal load -j 1024 Run (std deviation):
> Elapsed Time 	 52.964  (0.572346)	53.132 (0.825694)	-0.32%
> User Time 	 4058.22 (222.034)	4070.2 (201.646) 	-0.30%
> System Time 	 2672.81 (759.207)	2712.13 (797.292)	-1.47%
> Percent CPU 	 12756.7 (1786.35)	12806.5 (1858.89)	-0.39% 
> Context Switches 88818.5 (6772)		87890.6 (5567.72)	1.04% 
> Sleeps 		 618658  (20842.2)	636297 (25044) 		-2.85%
> 
> During a run on the SPF kernel, perf events were captured:
>  Performance counter stats for '../kernbench -M':
>        149 375 832      faults
>                  1      spf
>                  0      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                561      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> Most of the processes involved are single-threaded, so SPF is not activated,
> but there is no impact on performance.
> 
> Ebizzy:
> -------
> The test counts the number of records per second it can manage; higher is
> better. I ran it as 'ebizzy -mTt <nrcpus>'. To get consistent results I
> repeated the test 100 times and measured the average. The numbers below are
> records processed per second.
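> 
> As a sketch of that methodology (assuming ebizzy reports its result on a
> "<count> records/s" line; the averaging pipeline is only illustrative):
> $ for i in $(seq 100); do ./ebizzy -mTt 48; done | \
>       awk '/records\/s/ { s += $1; n++ } END { print s / n }'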
> 
>   		BASE		SPF		delta	
> 24 CPUs x86	5492.69		9383.07		70.83%
> 1024 CPUs P8 VM 8476.74		17144.38	102%
> 
> Here are the performance counters read during a run on a 48-CPU x86 node:
>  Performance counter stats for './ebizzy -mTt 48':
>         11,846,569      faults
>         10,886,706      spf
>            957,702      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                815      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> And the ones captured during a run on a 1024 CPUs Power VM:
>  Performance counter stats for './ebizzy -mTt 1024':
>          1 359 789      faults
>          1 284 910      spf
>             72 085      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>              2 669      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> In ebizzy's case most of the page faults were handled speculatively, which
> explains the ebizzy performance boost.
> 
> ------------------
> Changes since v11 [3]
>  - Check vm_ops.fault instead of vm_ops since now all the VMAs have a vm_ops.
>  - Abort the speculative page fault when doing swap readahead because the
>    VMA's boundaries are not protected at this time. With this change the
>    first swap-in does the readahead, and the next fault should be handled
>    speculatively as the page is then present in the swap cache.
>  - Handle a race between copy_pte_range() and the wp_page_copy() called by
>    the speculative page fault handler.
>  - Ported to Kernel v5.0
>  - Moved VM_FAULT_PTNOTSAME define in mm_types.h
>  - Use RCU to protect the MM RB tree instead of a rwlock.
>  - Add a toggle interface: /proc/sys/vm/speculative_page_fault
> 
> [1] https://lore.kernel.org/linux-mm/20141020215633.717315139@infradead.org/
> [2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
> [3] https://lore.kernel.org/linux-mm/1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com/
> [4] https://github.com/ldu4/linux/tree/spf-v12
> 
> Laurent Dufour (25):
>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>   mm: make pte_unmap_same compatible with SPF
>   mm: introduce INIT_VMA()
>   mm: protect VMA modifications using VMA sequence count
>   mm: protect mremap() against SPF handler
>   mm: protect SPF handler against anon_vma changes
>   mm: cache some VMA fields in the vm_fault structure
>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>   mm: introduce __lru_cache_add_active_or_unevictable
>   mm: introduce __vm_normal_page()
>   mm: introduce __page_add_new_anon_rmap()
>   mm: protect against PTE changes done by dup_mmap()
>   mm: protect the RB tree with a sequence lock
>   mm: introduce vma reference counter
>   mm: Introduce find_vma_rcu()
>   mm: don't do swap readahead during speculative page fault
>   mm: adding speculative page fault failure trace events
>   perf: add a speculative page fault sw event
>   perf tools: add support for the SPF perf event
>   mm: add speculative page fault vmstats
>   powerpc/mm: add speculative page fault
>   mm: Add a speculative page fault switch in sysctl
> 
> Mahendran Ganesh (2):
>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   arm64/mm: add speculative page fault
> 
> Peter Zijlstra (4):
>   mm: prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: provide speculative fault infrastructure
>   x86/mm: add speculative pagefault handling
> 
>  arch/arm64/Kconfig                    |   1 +
>  arch/arm64/mm/fault.c                 |  12 +
>  arch/powerpc/Kconfig                  |   1 +
>  arch/powerpc/mm/fault.c               |  16 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/mm/fault.c                   |  14 +
>  fs/exec.c                             |   1 +
>  fs/proc/task_mmu.c                    |   5 +-
>  fs/userfaultfd.c                      |  17 +-
>  include/linux/hugetlb_inline.h        |   2 +-
>  include/linux/migrate.h               |   4 +-
>  include/linux/mm.h                    | 138 +++++-
>  include/linux/mm_types.h              |  16 +-
>  include/linux/pagemap.h               |   4 +-
>  include/linux/rmap.h                  |  12 +-
>  include/linux/swap.h                  |  10 +-
>  include/linux/vm_event_item.h         |   3 +
>  include/trace/events/pagefault.h      |  80 ++++
>  include/uapi/linux/perf_event.h       |   1 +
>  kernel/fork.c                         |  35 +-
>  kernel/sysctl.c                       |   9 +
>  mm/Kconfig                            |  22 +
>  mm/huge_memory.c                      |   6 +-
>  mm/hugetlb.c                          |   2 +
>  mm/init-mm.c                          |   3 +
>  mm/internal.h                         |  45 ++
>  mm/khugepaged.c                       |   5 +
>  mm/madvise.c                          |   6 +-
>  mm/memory.c                           | 631 ++++++++++++++++++++++----
>  mm/mempolicy.c                        |  51 ++-
>  mm/migrate.c                          |   6 +-
>  mm/mlock.c                            |  13 +-
>  mm/mmap.c                             | 249 ++++++++--
>  mm/mprotect.c                         |   4 +-
>  mm/mremap.c                           |  13 +
>  mm/nommu.c                            |   1 +
>  mm/rmap.c                             |   5 +-
>  mm/swap.c                             |   6 +-
>  mm/swap_state.c                       |  10 +-
>  mm/vmstat.c                           |   5 +-
>  tools/include/uapi/linux/perf_event.h |   1 +
>  tools/perf/util/evsel.c               |   1 +
>  tools/perf/util/parse-events.c        |   4 +
>  tools/perf/util/parse-events.l        |   1 +
>  tools/perf/util/python.c              |   1 +
>  45 files changed, 1277 insertions(+), 196 deletions(-)
>  create mode 100644 include/trace/events/pagefault.h
> 
> -- 
> 2.21.0
>
Laurent Dufour June 14, 2019, 8:37 a.m. UTC | #13
Le 06/06/2019 à 08:51, Haiyan Song a écrit :
> Hi Laurent,
> 
> Regression tests for the v12 patch series have been run on an Intel
> 2-socket Skylake platform; some regressions were found by LKP-tools (Linux
> Kernel Performance). I only tested the cases that had been run and had
> shown regressions on the v11 patch series.
> 
> I got the patch series from https://github.com/ldu4/linux/tree/spf-v12.
> Kernel commit:
>    base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
>    head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)
> 
> Benchmark: will-it-scale
> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
> Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
> test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
> THP: enable / disable
> nr_task: 100%
> 
> The following are the benchmark results; each case was tested 4 times.
> 
> a). Enable THP
>                                              base  %stddev   change    head   %stddev
> will-it-scale.page_fault3.per_thread_ops    63216  ±3%      -16.9%    52537   ±4%
> will-it-scale.page_fault2.per_thread_ops    36862           -9.8%     33256
> 
> b). Disable THP
>                                              base  %stddev   change    head   %stddev
> will-it-scale.page_fault3.per_thread_ops    65111           -18.6%    53023  ±2%
> will-it-scale.page_fault2.per_thread_ops    38164           -12.0%    33565

Hi Haiyan,

Thanks for running these tests on your systems.

I did the same tests on my systems (x86 and PowerPC) and I didn't get the same numbers.
My x86 system has fewer CPUs but a larger amount of memory, but I don't think this
matters much since my numbers are far from yours.

x86_64 48CPUs 755G
     		5.1.0-rc4-mm1		5.1.0-rc4-mm1-spf
page_fault2_threads			SPF OFF			SPF ON
THP always 	2200902.3 [5%]		2152618.8 -2% [4%]	2136316   -3% [7%]
THP never	2185616.5 [6%]		2099274.2 -4% [3%]	2123275.1 -3% [7%]

     		5.1.0-rc4-mm1		5.1.0-rc4-mm1-spf
page_fault3_threads			SPF OFF			SPF ON
THP always	2700078.7 [5%]		2789437.1 +3% [4%]	2944806.8 +12% [3%]
THP never	2625756.7 [4%]		2944806.8 +12% [8%]	2876525.5 +10% [4%]

PowerPC P8 80CPUs 31G
     		5.1.0-rc4-mm1		5.1.0-rc4-mm1-spf
page_fault2_threads			SPF OFF			SPF ON
THP always	171732	 [0%]		170762.8 -1% [0%]	170450.9 -1% [0%]
THP never	171808.4 [0%]		170600.3 -1% [0%]	170231.6 -1% [0%]

     		5.1.0-rc4-mm1		5.1.0-rc4-mm1-spf
page_fault3_threads			SPF OFF			SPF ON
THP always	2499.6 [13%]		2624.5 +5% [11%]		2734.5 +9% [3%]
THP never	2732.5 [2%]		2791.1 +2% [1%]		2695   -3% [4%]

Numbers in brackets are the standard deviation in percent.

I ran each test 10 times and then computed the average and deviation.
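
In outline, such a driver could look like this (a minimal sketch, not the
attached script; the test binary name, its flags and the "average:<ops>"
output format are assumptions):

	#!/bin/bash
	# Run one will-it-scale test RUNS times with SPF off then on,
	# and print the mean and the deviation as a percentage.
	RUNS=10
	for spf in 0 1; do
		echo $spf > /proc/sys/vm/speculative_page_fault
		for i in $(seq $RUNS); do
			./page_fault2_threads -t $(nproc) -s 5 |
				awk -F: '/average/ { v = $NF } END { print v }'
		done | awk -v spf=$spf '{ s += $1; ss += $1 * $1 }
			END { m = s / NR; d = sqrt(ss / NR - m * m);
			      printf "SPF %d: avg %.1f [%.1f%%]\n", spf, m, 100 * d / m }'
	done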

Please find attached the script I run to get these numbers.
It would be nice if you could give it a try on your victim node and share the results.

Thanks,
Laurent.

> Best regards,
> Haiyan Song
> 
> On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
>> [...]
Laurent Dufour June 14, 2019, 8:44 a.m. UTC | #14
Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> Please find attached the script I run to get these numbers.
> It would be nice if you could give it a try on your victim node and share the results.

It sounds like the Intel mail filtering system doesn't like the attached shell script.
Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44

Thanks,
Laurent.
Haiyan Song June 20, 2019, 8:19 a.m. UTC | #15
Hi Laurent,

I downloaded your script and ran it on an Intel 2-socket Skylake platform with
the spf-v12 patch series.

Attached are the output results of this script.

The following comparison results are statistics computed from the script outputs.

a). Enable THP
                                            SPF_0          change       SPF_1
will-it-scale.page_fault2.per_thread_ops    2664190.8      -11.7%       2353637.6      
will-it-scale.page_fault3.per_thread_ops    4480027.2      -14.7%       3819331.9     


b). Disable THP
                                            SPF_0           change      SPF_1
will-it-scale.page_fault2.per_thread_ops    2653260.7       -10%        2385165.8
will-it-scale.page_fault3.per_thread_ops    4436330.1       -12.4%      3886734.2 
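
(For reference, the change column corresponds to (SPF_1 - SPF_0) / SPF_0,
e.g. for page_fault2 with THP enabled: (2353637.6 - 2664190.8) / 2664190.8
≈ -11.7%.)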


Thanks,
Haiyan Song


On Fri, Jun 14, 2019 at 10:44:47AM +0200, Laurent Dufour wrote:
> Le 14/06/2019 à 10:37, Laurent Dufour a écrit :
> > Please find attached the script I run to get these numbers.
> > It would be nice if you could give it a try on your victim node and share the results.
> 
> It sounds like the Intel mail filtering system doesn't like the attached shell script.
> Please find it there: https://gist.github.com/ldu4/a5cc1a93f293108ea387d43d5d5e7f44
> 
> Thanks,
> Laurent.
>
#### will-it-scale.page_fault2
#### THP always
#### SPF 0
average:2628818
average:2732209
average:2728392
average:2550695
average:2689873
average:2691963
average:2627612
average:2558295
average:2707877
average:2726174
#### SPF 1
average:2426260
average:2145674
average:2117769
average:2292502
average:2350403
average:2483327
average:2467324
average:2335393
average:2437859
average:2479865
#### THP never
#### SPF 0
average:2712575
average:2711447
average:2672362
average:2701981
average:2668073
average:2579296
average:2662048
average:2637422
average:2579143
average:2608260
#### SPF 1
average:2348782
average:2203349
average:2312960
average:2402995
average:2318914
average:2543129
average:2390337
average:2490178
average:2416798
average:2424216
#### will-it-scale.page_fault3
#### THP always
#### SPF 0
average:4370143
average:4245754
average:4678884
average:4665759
average:4665809
average:4639132
average:4210755
average:4330552
average:4290469
average:4703015
#### SPF 1
average:3810608
average:3918890
average:3758003
average:3965024
average:3578151
average:3822748
average:3687293
average:3998701
average:3915771
average:3738130
#### THP never
#### SPF 0
average:4505598
average:4672023
average:4701787
average:4355885
average:4338397
average:4446350
average:4360811
average:4653767
average:4016352
average:4312331
#### SPF 1
average:3685383
average:4029413
average:4051615
average:3747588
average:4058557
average:4042340
average:3971295
average:3752943
average:3750626
average:3777582
Chinwen Chang July 6, 2020, 9:25 a.m. UTC | #16
On Thu, 2019-06-20 at 16:19 +0800, Haiyan Song wrote:
> [...]

Hi Laurent,

We merged SPF v11 and some patches from v12 into our platforms. After
several experiments, we observed SPF has obvious improvements on the
launch time of applications, especially for those high-TLP ones,

# launch time of applications(s):

package           version      w/ SPF      w/o SPF      improve(%)
------------------------------------------------------------------                          
Baidu maps        10.13.3      0.887       0.98         9.49
Taobao            8.4.0.35     1.227       1.293        5.10
Meituan           9.12.401     1.107       1.543        28.26
WeChat            7.0.3        2.353       2.68         12.20
Honor of Kings    1.43.1.6     6.63        6.713        1.24
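
(For reference, the improve(%) column corresponds to
(w/o SPF - w/ SPF) / (w/o SPF), e.g. for Meituan:
(1.543 - 1.107) / 1.543 ≈ 28.26%.)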


By the way, we have verified our platforms with those patches and
achieved the goal of mass production.

Thanks.
Chinwen Chang
Laurent Dufour July 6, 2020, 12:27 p.m. UTC | #17
Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> [...]
> 
> Hi Laurent,
> 
> We merged SPF v11 and some patches from v12 into our platforms. After
> several experiments, we observed SPF has obvious improvements on the
> launch time of applications, especially for those high-TLP ones,
> 
> # launch time of applications(s):
> 
> package           version      w/ SPF      w/o SPF      improve(%)
> ------------------------------------------------------------------
> Baidu maps        10.13.3      0.887       0.98         9.49
> Taobao            8.4.0.35     1.227       1.293        5.10
> Meituan           9.12.401     1.107       1.543        28.26
> WeChat            7.0.3        2.353       2.68         12.20
> Honor of Kings    1.43.1.6     6.63        6.713        1.24

That's great news, thanks for reporting this!

> 
> By the way, we have verified our platforms with those patches and
> achieved the goal of mass production.

Another piece of good news!
For my information, what is your targeted hardware?

Cheers,
Laurent.
Chinwen Chang July 7, 2020, 5:31 a.m. UTC | #18
On Mon, 2020-07-06 at 14:27 +0200, Laurent Dufour wrote:
> Le 06/07/2020 à 11:25, Chinwen Chang a écrit :
> > [...]
> > 
> > Hi Laurent,
> > 
> > We merged SPF v11 and some patches from v12 into our platforms. After
> > several experiments, we observed SPF has obvious improvements on the
> > launch time of applications, especially for those high-TLP ones,
> > 
> > # launch time of applications(s):
> > 
> > package           version      w/ SPF      w/o SPF      improve(%)
> > ------------------------------------------------------------------
> > Baidu maps        10.13.3      0.887       0.98         9.49
> > Taobao            8.4.0.35     1.227       1.293        5.10
> > Meituan           9.12.401     1.107       1.543        28.26
> > WeChat            7.0.3        2.353       2.68         12.20
> > Honor of Kings    1.43.1.6     6.63        6.713        1.24
> 
> That's great news, thanks for reporting this!
> 
> > 
> > By the way, we have verified our platforms with those patches and
> > achieved the goal of mass production.
> 
> Another piece of good news!
> For my information, what is your targeted hardware?
> 
> Cheers,
> Laurent.

Hi Laurent,

Our targeted hardware belongs to the ARM64 multi-core series.

Thanks.
Chinwen
>
Joel Fernandes Dec. 14, 2020, 2:03 a.m. UTC | #19
On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
[..]
> > > Hi Laurent,
> > > 
> > > We merged SPF v11 and some patches from v12 into our platforms. After
> > > several experiments, we observed SPF has obvious improvements on the
> > > launch time of applications, especially for those high-TLP ones,
> > > 
> > > # launch time of applications(s):
> > > 
> > > package           version      w/ SPF      w/o SPF      improve(%)
> > > ------------------------------------------------------------------
> > > Baidu maps        10.13.3      0.887       0.98         9.49
> > > Taobao            8.4.0.35     1.227       1.293        5.10
> > > Meituan           9.12.401     1.107       1.543        28.26
> > > WeChat            7.0.3        2.353       2.68         12.20
> > > Honor of Kings    1.43.1.6     6.63        6.713        1.24
> > 
> > That's great news, thanks for reporting this!
> > 
> > > 
> > > By the way, we have verified our platforms with those patches and
> > > achieved the goal of mass production.
> > 
> > Another piece of good news!
> > For my information, what is your targeted hardware?
> > 
> > Cheers,
> > Laurent.
> 
> Hi Laurent,
> 
> Our targeted hardware belongs to the ARM64 multi-core series.

Hello!

I was trying to develop an intuition about why SPF gives an improvement for
you on small CPU systems. This is just a high-level theory, but:

1. Assume the improvement is because of elimination of "blocking" on
mmap_sem.
Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
places, thus causing blocking on mmap_sem in other paths? If so, is it
feasible to convert such usages to acquiring them in read-mode?

2. Assume the improvement is because of lesser read-side contention on
mmap_sem.
On small CPU systems, I would not expect reducing cache-line bouncing to give
such a dramatic improvement in performance as you are seeing.

Thanks for any insight on this!

- Joel
Laurent Dufour Dec. 14, 2020, 9:36 a.m. UTC | #20
Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> [..]
> 
> Hello!
> 
> I was trying to develop an intuition about why SPF gives an improvement for
> you on small CPU systems. This is just a high-level theory, but:
> 
> 1. Assume the improvement is because of elimination of "blocking" on
> mmap_sem.
> Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
> places, thus causing blocking on mmap_sem in other paths? If so, is it
> feasible to convert such usages to acquiring them in read-mode?

That's correct, and the goal of this series is to try not to hold the mmap_sem
in read mode during page fault processing.

Converting mmap_sem holders from write to read mode is not so easy, and that
work has already been done in some places. If you think there are areas where
this could be done, you're welcome to send patches fixing that.
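
To illustrate with a generic pattern (not a specific call site in this
series), such a conversion on a lookup-only path would look like:

	struct vm_area_struct *vma;

	/* Before: write mode, although nothing in the mm is modified. */
	down_write(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	/* ... read-only inspection of the VMA ... */
	up_write(&mm->mmap_sem);

	/* After: read mode is enough and lets page faults, which take the
	 * mmap_sem in read mode, proceed concurrently. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	/* ... read-only inspection of the VMA ... */
	up_read(&mm->mmap_sem);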

> 2. Assume the improvement is because of lesser read-side contention on
> mmap_sem.
> On small CPU systems, I would not expect reducing cache-line bouncing to give
> such a dramatic improvement in performance as you are seeing.

I don't think the reduction in cache-line bouncing is the main source of the
performance improvement; I would rather think it plays the smaller part here.
I guess this is mainly because a lot of page faults occur during loading time,
and thus SPF reduces the contention on the mmap_sem.

> Thanks for any insight on this!
> 
> - Joel
>
Joel Fernandes Dec. 14, 2020, 6:10 p.m. UTC | #21
On Mon, Dec 14, 2020 at 10:36:29AM +0100, Laurent Dufour wrote:
> Le 14/12/2020 à 03:03, Joel Fernandes a écrit :
> > On Tue, Jul 07, 2020 at 01:31:37PM +0800, Chinwen Chang wrote:
> > [..]
> > 
> > Hello!
> > 
> > I was trying to develop an intuition about why SPF gives an improvement for
> > you on small CPU systems. This is just a high-level theory, but:
> > 
> > 1. Assume the improvement is because of elimination of "blocking" on
> > mmap_sem.
> > Could it be that the mmap_sem is acquired in write-mode unnecessarily in some
> > places, thus causing blocking on mmap_sem in other paths? If so, is it
> > feasible to convert such usages to acquiring them in read-mode?
> 
> That's correct, and the goal of this series is to try not to hold the
> mmap_sem in read mode during page fault processing.
> 
> Converting mmap_sem holders from write to read mode is not so easy, and that
> work has already been done in some places. If you think there are areas where
> this could be done, you're welcome to send patches fixing that.
> 
> > 2. Assume the improvement is because of lesser read-side contention on
> > mmap_sem.
> > On small CPU systems, I would not expect reducing cache-line bouncing to give
> > such a dramatic improvement in performance as you are seeing.
> 
> I don't think the reduction in cache-line bouncing is the main source of the
> performance improvement; I would rather think it plays the smaller part here.
> I guess this is mainly because a lot of page faults occur during loading
> time, and thus SPF reduces the contention on the mmap_sem.

Thanks for the reply. I think I also wrongly assumed that acquiring the mmap
rwsem in write mode in a syscall makes SPF moot. Peter explained to me on IRC
that there's still a perf improvement in write mode if an unrelated VMA is
modified while another VMA is faulting. CMIIW - not an mm expert by any
stretch.

Thanks!

 - Joel