[RFC,V2,0/1] sched/numa: Fix disjoint set vma scan regression

Message ID cover.1684228065.git.raghavendra.kt@amd.com (mailing list archive)

Raghavendra K T May 16, 2023, 9:19 a.m. UTC
With the numa scan enhancements [1], only the threads which had previously
accessed a vma are allowed to scan it.

While this significantly reduced system time overhead, there are corner
cases which genuinely need some relaxation. One example is the concern raised
by PeterZ: unfairness among threads belonging to disjoint sets of vmas can
amplify the side effect of vma regions belonging to some of the tasks being
left unscanned.

[1] handled that issue by allowing the first two scans at mm level
(mm->numa_scan_seq) unconditionally, but that was not enough.

One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
where allocation is done by the main thread and the memory is divided into
24MB chunks that are continuously bzeroed.

(This test is run by default by the LKP tests, while numa01 is run by default
in mmtests, where each thread operates on the full 3GB region.)

While RFC [2] tried to address this issue, its logic relied more on heuristics.
After [2] was posted, [3] also confirmed the same regression.

The current patch addresses the same issue in a more accurate way as
follows:

(1) A task that tries to scan a disjoint vma it is not associated with is
now allowed to induce prot_none faults. The total number of such
unconditional scans allowed per vma is derived from the exact vma size
as follows:

total scans allowed = 1/2 * vma_size / scan_size.

(2) The total number of scans already done is maintained in a per-vma scan
counter.

(3) For a very long running task, this scan counter is reset each time the
whole mm of the task has been scanned 16 times (tracked using
mm->numa_scan_seq).
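
As a rough illustration of items (1)-(3), here is a minimal userspace sketch
in C. It is not the code from the patch: the struct, its field names and the
helper names are hypothetical, and the truncating integer division is an
assumption. Only the 1/2 * vma_size / scan_size formula and the reset period
of 16 come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Reset period from item (3): 16 full scans of the task's mm. */
#define SCAN_RESET_PERIOD 16UL

/* Hypothetical per-vma bookkeeping; NOT the kernel's actual structure. */
struct vma_scan_state {
	unsigned long vma_size;   /* size of the vma in bytes */
	unsigned long scan_count; /* unconditional scans done so far, item (2) */
	unsigned long reset_seq;  /* mm scan sequence at the last counter reset */
};

/* Item (1): total scans allowed = 1/2 * vma_size / scan_size. */
static unsigned long total_scans_allowed(unsigned long vma_size,
					 unsigned long scan_size)
{
	assert(scan_size > 0);
	return vma_size / scan_size / 2; /* truncation is an assumption */
}

/*
 * Decide whether a task not associated with this vma may still scan it.
 * mm_scan_seq stands in for mm->numa_scan_seq.
 */
static bool allow_unconditional_scan(struct vma_scan_state *v,
				     unsigned long scan_size,
				     unsigned long mm_scan_seq)
{
	/* Item (3): start a new budget after 16 whole-mm scans. */
	if (mm_scan_seq - v->reset_seq >= SCAN_RESET_PERIOD) {
		v->scan_count = 0;
		v->reset_seq = mm_scan_seq;
	}

	/* Budget exhausted: fall back to the access-based filtering of [1]. */
	if (v->scan_count >= total_scans_allowed(v->vma_size, scan_size))
		return false;

	v->scan_count++;
	return true;
}
```

For example, with a 3GB vma and an assumed scan size of 256MB, the formula
gives 1/2 * 3072MB / 256MB = 6 unconditional scans before the budget is used
up, until the counter is reset again.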

With the above patch, the reported numa01_THREAD_ALLOC regression is resolved.
Please note, however, that [1] brought a drastic decrease in system time
for mmtests numa01, and this patch adds back some of that system time.

Summary: the numa scan enhancement patch [1], together with the current
patchset, improves overall system time by filtering out unnecessary numa
scans while still retaining the necessary scanning in corner cases which
involve disjoint sets of vmas.

(Mel, PeterZ: this patch looks like a more precise way of handling the issue.)

Your comments/ideas are welcome.

Changes since V1:
1) Rewrote the entire logic to be based on actual vma size rather than heuristics
2) Added Reported-by for the kernel test robot and the internal LKP test
3) Rebased to 6.4-rc1 (ba0ad6ed89)

Result:
SUT: Milan w/ 2 numa nodes 256 cpus

Run of numa01_THREAD_ALLOC on 6.4.0-rc1 (base includes the numascan enhancement)
                           base-numascan            base        base+fix
real                            1m3.025s       1m24.163s        1m3.551s
user                         213m44.232s      251m3.638s     219m55.662s
sys                            6m26.598s       0m13.056s       2m35.767s

numa_hit                         5478165         4395752         4907431
numa_local                       5478103         4395366         4907044
numa_other                            62             386             387
numa_pte_updates                 1989274           11606         1265014
numa_hint_faults                 1756059             515         1135804
numa_hint_faults_local            971500             486          558076
numa_pages_migrated               784211              29          577728

Below is the mmtests autonuma performance:
autonuma
===========
base: 6.4.0-rc1+
                                       base w/o numascan     base (w/ numascan)             base + fix
Amean     syst-NUMA01                  247.46 (   0.00%)       18.52 *  92.51%*      148.18 *  40.12%*
Amean     syst-NUMA01_THREADLOCAL        0.23 (   0.00%)        0.21 *   5.06%*        0.22 *   1.90%*
Amean     syst-NUMA02                    0.70 (   0.00%)        0.70 *   1.02%*        0.73 *  -3.46%*
Amean     syst-NUMA02_SMT                0.59 (   0.00%)        0.59 *   0.00%*        0.58 *   2.42%*
Amean     elsp-NUMA01                  309.54 (   0.00%)      284.57 *   8.07%*      306.84 *   0.87%*
Amean     elsp-NUMA01_THREADLOCAL        1.02 (   0.00%)        1.02 *   0.42%*        1.04 *  -1.53%*
Amean     elsp-NUMA02                    3.22 (   0.00%)        3.55 * -10.21%*        3.32 *  -3.15%*
Amean     elsp-NUMA02_SMT                3.71 (   0.00%)        3.86 *  -4.08%*        3.74 *  -0.69%*

Duration User      383183.43   294971.18   357446.52
Duration System      1743.53      140.85     1048.57
Duration Elapsed     2232.09     2062.33     2214.44

Ops NUMA alloc hit                  57057379.00    43378289.00    51885613.00
Ops NUMA alloc local                57055256.00    43377265.00    51884407.00
Ops NUMA base-page range updates   137882746.00       25895.00    83600214.00
Ops NUMA PTE updates               137882746.00       25895.00    83600214.00
Ops NUMA hint faults               139609832.00       22651.00    84634363.00
Ops NUMA hint local faults %       113091055.00       18200.00    65809169.00
Ops NUMA hint local percent               81.01          80.35          77.76
Ops NUMA pages migrated             13415929.00        1798.00     9638327.00
Ops AutoNUMA cost                     699269.24         113.47      423940.14

links:
[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
[2] https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
[3] https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/

Note: I have updated patch 1 with the log required for the commit, so some of
the above results/info are duplicated there.

Raghavendra K T (1):
  sched/numa: Fix disjoint set vma scan regression

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)