[RFC,V3,0/1] sched/numa: Fix disjoint set vma scan regression

Message ID cover.1685506205.git.raghavendra.kt@amd.com (mailing list archive)

Message

Raghavendra K T May 31, 2023, 4:25 a.m. UTC
With the numa scan enhancements [1], only the threads which have previously
accessed a vma are allowed to scan it.

While this significantly reduced system time overhead, there were corner
cases which genuinely need some relaxation. For example, PeterZ raised a
concern that unfairness among threads belonging to disjoint sets of vmas
can amplify the side effect of vma regions belonging to some of the tasks
being left unscanned.

[1] handled that issue by unconditionally allowing the first two scans at
the mm level (mm->numa_scan_seq), but that was not enough.
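
For context, the per-task gate from [1] behaves roughly like this (a
paraphrased sketch of the described behaviour, not the exact fair.c code;
vma_accessed_by_current_task() is a placeholder for the real access-pid
check):

/* Paraphrased sketch of the gate described above (not the exact code). */
static bool vma_scan_allowed(struct vm_area_struct *vma)
{
	/* The first two mm-wide scan passes are unconditional. */
	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
		return true;

	/* Otherwise only threads that recently accessed this vma may scan it. */
	return vma_accessed_by_current_task(vma);	/* placeholder helper */
}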

One of the tests that exercises a similar side effect is numa01_THREAD_ALLOC,
where allocation happens in the main thread and the memory is divided into
24MB chunks that are continuously bzeroed (by 128 threads on my machine).
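
For reference, the access pattern is roughly the following (a simplified
userspace sketch, not the actual autonumabench source; the thread count,
chunk size and iteration count are assumptions):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define THREADS  128
#define CHUNK_SZ (24UL << 20)	/* 24MB per-thread chunk (assumption) */

static char *region;		/* one large region, allocated by main */

static void *worker(void *arg)
{
	char *chunk = region + (long)arg * CHUNK_SZ;

	/* Each thread repeatedly bzeroes only its own chunk. */
	for (int i = 0; i < 1000; i++)
		memset(chunk, 0, CHUNK_SZ);
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];

	/* Main thread allocates; workers never touch each other's chunks. */
	region = malloc(THREADS * CHUNK_SZ);

	for (long i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	for (int i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);
	free(region);
	return 0;
}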

This was found in an internal LKP run and was also reported in [4].

RFC V1 [2] tried to address this issue, but the logic relied more on
heuristics. RFC V2 [3] was rewritten based on vma size.

The current implementation drops some of the additional logic for
long-running tasks and revisits some of the READ_ONCE()/WRITE_ONCE() usage.

The current patch addresses the same issue in a more accurate way as
follows:

(1) A task that tries to scan a disjoint vma (i.e. a vma it is not
associated with) is now allowed to induce prot_none faults. The total
number of such unconditional scans allowed per vma is derived from the
exact vma size as follows:

total scans allowed = 1/2 * vma_size / scan_size.

(2) The total number of scans already done is tracked with a per-vma scan
counter; a rough sketch of the idea follows below.
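
A minimal sketch of how such a size-based cap could look, purely for
illustration (not the actual patch): the counter nr_disjoint_scans and the
helper name below are assumptions, and the scan-size handling is simplified.

/*
 * Illustrative sketch only: cap the number of unconditional (disjoint)
 * scans of a vma based on its size.
 * vma->numab_state->nr_disjoint_scans is a hypothetical counter.
 */
static bool vma_disjoint_scan_allowed(struct vm_area_struct *vma)
{
	unsigned long vma_size = vma->vm_end - vma->vm_start;
	/* sysctl_numa_balancing_scan_size is in MB */
	unsigned long scan_size = (unsigned long)sysctl_numa_balancing_scan_size << 20;
	/* total scans allowed = 1/2 * vma_size / scan_size, at least 1 */
	unsigned long max_scans = max(1UL, (vma_size / scan_size) >> 1);
	unsigned long done = READ_ONCE(vma->numab_state->nr_disjoint_scans);

	if (done >= max_scans)
		return false;

	WRITE_ONCE(vma->numab_state->nr_disjoint_scans, done + 1);
	return true;
}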

With the above patch, the reported numa01_THREAD_ALLOC regression is
resolved. Please note, however, that [1] drastically decreased system time
for mmtests numa01, and this patch adds back some of that system time.

Summary: the numa scan enhancement patch [1] together with the current
patchset improves overall system time by filtering out unnecessary numa
scans, while still retaining the necessary scanning in corner cases that
involve disjoint sets of vmas.

Your comments/ideas are welcome.

Changes since:
RFC V2:
1) Dropped the reset of the scan counter that tried to take care of long-running workloads
2) Corrected usage of READ_ONCE()/WRITE_ONCE() (Bharata)
3) Base is 6.4.0-rc2

RFC V1:
1) Rewrote the entire logic based on actual vma size rather than heuristics
2) Added Reported-by for the kernel test robot and the internal LKP test
3) Rebased to 6.4-rc1 (ba0ad6ed89)

Result:
SUT: Milan w/ 2 numa nodes 256 cpus

Run of numa01_THREAD_ALLOC on 6.4.0-rc2 (which has the numascan enhancement)
                	base-numascan	base		base+fix
real    		1m1.507s	1m23.259s	1m2.632s
user    		213m51.336s	251m46.363s	220m35.528s
sys     		3m3.397s	0m12.492s	2m41.393s

numa_hit 		5615517		4560123		4963875
numa_local 		5615505		4560024		4963700
numa_other 		12		99		175
numa_pte_updates 	1822797		493		1559111
numa_hint_faults 	1307113		523		1469031
numa_hint_faults_local 	612617		488		884829
numa_pages_migrated 	694370		35		584202

We can see that the real-time regression seen in base is recovered, but with
some additional system time overhead.

Below is the mmtest autonuma performance

autonumabench
===========
(base 6.4.0-rc2 that has numascan enhancement)
					base-numascan		base			base+fix
Amean     syst-NUMA01                  300.46 (   0.00%)       23.97 *  92.02%*       67.18 *  77.64%*
Amean     syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.22 *  -9.15%*        0.22 *  -9.15%*
Amean     syst-NUMA02                    0.70 (   0.00%)        0.71 *  -0.61%*        0.70 *   0.41%*
Amean     syst-NUMA02_SMT                0.58 (   0.00%)        0.62 *  -5.38%*        0.61 *  -3.67%*
Amean     elsp-NUMA01                  320.92 (   0.00%)      276.13 *  13.96%*      324.11 *  -0.99%*
Amean     elsp-NUMA01_THREADLOCAL        1.02 (   0.00%)        1.03 *  -1.83%*        1.03 *  -1.83%*
Amean     elsp-NUMA02                    3.16 (   0.00%)        3.93 * -24.20%*        3.14 *   0.81%*
Amean     elsp-NUMA02_SMT                3.82 (   0.00%)        3.87 *  -1.27%*        3.44 *   9.90%*

Duration User      403532.43   279173.53   359098.23
Duration System      2114.31      179.20      481.54
Duration Elapsed     2312.20     2004.48     2335.84

Ops NUMA alloc hit                  55795455.00    45452739.00    45500387.00
Ops NUMA alloc local                55794177.00    45435858.00    45500070.00
Ops NUMA base-page range updates   147858285.00       18601.00    42043107.00
Ops NUMA PTE updates               147858285.00       18601.00    42043107.00
Ops NUMA hint faults               150531983.00       18254.00    42450080.00
Ops NUMA hint local faults %       125691825.00       11964.00    32993313.00
Ops NUMA hint local percent               83.50          65.54          77.72
Ops NUMA pages migrated             13535786.00        2207.00     4654628.00
Ops AutoNUMA cost                     753952.10          91.44      212633.14

Please note that some system time overhead is added back for numa01, but we
still see a very good improvement w.r.t. the base without numascan.

[1] Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
[2] Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
[3] Link: https://lore.kernel.org/lkml/cover.1684228065.git.raghavendra.kt@amd.com/T/
[4] Link: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/

Raghavendra K T (1):
  sched/numa: Fix disjoint set vma scan regression

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
 2 files changed, 25 insertions(+), 7 deletions(-)

Comments

Swapnil Sapkal June 7, 2023, 11:40 a.m. UTC | #1
Hello Raghavendra,

On 5/31/2023 9:55 AM, Raghavendra K T wrote:
> [...]

I tested the patch with the LKP autonuma benchmark on a dual-socket 4th
Generation EPYC server (2 x 96C/192T) running in NPS1 mode. Below are
the results:

commit:
   6.4.0-rc2
   6.4.0-rc2+patch

       6.4.0-rc2            6.4.0-rc2+patch
---------------- ---------------------------
          %stddev     %change         %stddev
              \          |                \
     501.84           -12.5%     439.14       numa01.seconds
     228.66            -1.8%     224.44       numa01_THREAD_ALLOC.seconds
       0.51           +21.6%       0.62       numa02.seconds
     107.17            +0.0%     107.17       numa02_SMT.seconds
       2936            -9.1%       2669       elapsed_time
     794910            +3.7%     824178       system_time
     474520           -17.5%     391331       user_time

Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>

> [...]
--
Thanks and regards,
Swapnil
Raghavendra K T June 8, 2023, 4:01 a.m. UTC | #2
On 6/7/2023 5:10 PM, Sapkal Swapnil wrote:
> Hello Raghavendra,
> 
> On 5/31/2023 9:55 AM, Raghavendra K T wrote:
>> With the numa scan enhancements [1], only the threads which had 
>> previously
>> accessed vma are allowed to scan.
>>
>> While this had improved significant system time overhead, there were 
>> corner
>> cases, which genuinely need some relaxation for e.g., concern raised by
>> PeterZ where unfairness amongst the thread belonging to disjoint set 
>> of vmas,
>> that can potentially amplify the side effects, where vma regions 
>> belonging
>> to some of the tasks being left unscanned.
>>
>> [1] had handled that issue by allowing first two scans at mm level
>> (mm->numa_scan_seq) unconditionally. But that was not enough.
>>
>> One of the test that exercise similar side effect is 
>> numa01_THREAD_ALLOC where
>> allocation happen by main thread and it is divided into memory chunks 
>> of 24MB
>> to be continuously bzeroed (for 128 threads on my machine).
>>
>> This was found in internal LKP run and also reported by [4].
>>
>> While RFC V1 [2] tried to address this issue, the logic had more 
>> heuristics.
>> RFC V2 [3] was rewritten based on vma_size.
>>
>> Current implementation drops some of additional logic for long running 
>> task
>> and relooked some of the usage of READ_ONCE/WRITE_ONCE().
>>
>> The current patch addresses the same issue in a more accurate way as
>> follows:
>>
>> (1) Any disjoint vma which is not associated with a task, that tries to
>> scan is now allowed to induce prot_none faults. Total number of such
>> unconditional scans allowed per vma is derived based on the exact vma 
>> size
>> as follows:
>>
>> total scans allowed = 1/2 * vma_size / scan_size.
>>
>> (2) Total scans already done is maintained using a per vma scan counter.
>>
>> With above patch, numa01_THREAD_ALLOC regression reported is resolved,
>> but please note that with [1] there was a drastic decrease in system time
>> for mmtest numa01, this patch adds back some of the system time.
>>
>> Summary: numa scan enhancement patch [1] togethor with the current 
>> patchset
>> improves overall system time by filtering unnecessary numa scan
>> while still retaining necessary scanning in some corner cases which
>> involves disjoint set vmas.
>>
>> Your comments/Ideas are welcome.
>>
>> Changes since:
>> RFC V2:
>> 1) Drop reset of scan counter that tried to take care of long running 
>> workloads
>> 2) Correct usage of READ_ONCE/WRITE_ONCE (Bharata)
>> 3) Base is 6.4.0-rc2
>>
>> RFC V1:
>> 1) Rewrite entire logic based on actual vma size than heuristics
>> 2) Added Reported-by kernel test robot and internal LKP test
>> 3) Rebased to 6.4.-rc1 (ba0ad6ed89)
>>
>> Result:
>> SUT: Milan w/ 2 numa nodes 256 cpus
>>
>> Run of numa01_THREAD__ALLOC on 6.4.0-rc2 (that has w/ numascan 
>> enhancement)
>>                      base-numascan    base        base+fix
>> real            1m1.507s    1m23.259s    1m2.632s
>> user            213m51.336s    251m46.363s    220m35.528s
>> sys             3m3.397s    0m12.492s    2m41.393s
>>
>> numa_hit         5615517        4560123        4963875
>> numa_local         5615505        4560024        4963700
>> numa_other         12        99        175
>> numa_pte_updates     1822797        493        1559111
>> numa_hint_faults     1307113        523        1469031
>> numa_hint_faults_local     612617        488        884829
>> numa_pages_migrated     694370        35        584202
>>
>> We can see regression in base real time recovered, but with some 
>> additional
>> system time overhead.
>>
>> Below is the mmtest autonuma performance
>>
>> autonumabench
>> ===========
>> (base 6.4.0-rc2 that has numascan enhancement)
>>                     base-numascan        base            base+fix
>> Amean     syst-NUMA01                  300.46 (   0.00%)       23.97 
>> *  92.02%*       67.18 *  77.64%*
>> Amean     syst-NUMA01_THREADLOCAL        0.20 (   0.00%)        0.22 
>> *  -9.15%*        0.22 *  -9.15%*
>> Amean     syst-NUMA02                    0.70 (   0.00%)        0.71 
>> *  -0.61%*        0.70 *   0.41%*
>> Amean     syst-NUMA02_SMT                0.58 (   0.00%)        0.62 
>> *  -5.38%*        0.61 *  -3.67%*
>> Amean     elsp-NUMA01                  320.92 (   0.00%)      276.13 
>> *  13.96%*      324.11 *  -0.99%*
>> Amean     elsp-NUMA01_THREADLOCAL        1.02 (   0.00%)        1.03 
>> *  -1.83%*        1.03 *  -1.83%*
>> Amean     elsp-NUMA02                    3.16 (   0.00%)        3.93 * 
>> -24.20%*        3.14 *   0.81%*
>> Amean     elsp-NUMA02_SMT                3.82 (   0.00%)        3.87 
>> *  -1.27%*        3.44 *   9.90%*
>>
>> Duration User      403532.43   279173.53   359098.23
>> Duration System      2114.31      179.20      481.54
>> Duration Elapsed     2312.20     2004.48     2335.84
>>
>> Ops NUMA alloc hit                  55795455.00    45452739.00    
>> 45500387.00
>> Ops NUMA alloc local                55794177.00    45435858.00    
>> 45500070.00
>> Ops NUMA base-page range updates   147858285.00       18601.00    
>> 42043107.00
>> Ops NUMA PTE updates               147858285.00       18601.00    
>> 42043107.00
>> Ops NUMA hint faults               150531983.00       18254.00    
>> 42450080.00
>> Ops NUMA hint local faults %       125691825.00       11964.00    
>> 32993313.00
>> Ops NUMA hint local percent               83.50          
>> 65.54          77.72
>> Ops NUMA pages migrated             13535786.00        2207.00     
>> 4654628.00
>> Ops AutoNUMA cost                     753952.10          91.44      
>> 212633.14
>>
>> Please note there is a system time overhead added for numa01 but we 
>> still have very
>> good improvement w.r.t base without numascan.
>>
> 
> I tested the patch with lkp autonuma benchmark on a dual socket 4th 
> Generation EPYC server (2 X 96C/192T) running in NPS1 mode. Below are 
> the results:
> 
> commit:
>    6.4.0-rc2
>    6.4.0-rc2+patch
> 
>        6.4.0-rc2            6.4.0-rc2+patch
> ---------------- ---------------------------
>           %stddev     %change         %stddev
>               \          |                \
>      501.84           -12.5%     439.14       numa01.seconds
>      228.66            -1.8%     224.44       numa01_THREAD_ALLOC.seconds
>        0.51           +21.6%       0.62       numa02.seconds
>      107.17            +0.0%     107.17       numa02_SMT.seconds
>        2936            -9.1%       2669       elapsed_time
>      794910            +3.7%     824178       system_time
>      474520           -17.5%     391331       user_time
> 
> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
> 
>> [...]

Thank you, Swapnil.
This is a reminder again that LKP's numa01 = numa01_THREAD_ALLOC, which has
regained its numbers.

I will also wait to see whether the kernel test robot confirms that the issue
is fixed, and whether Mel/Peter have any objections/comments on the direction.

Regards