[RFC,V1,0/6] sched/numa: Enhance disjoint VMA scanning

Message ID: cover.1693287931.git.raghavendra.kt@amd.com

Message

Raghavendra K T Aug. 29, 2023, 6:06 a.m. UTC
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic") [1]
VMA scanning is allowed if:
1) The task has accessed the VMA.
 Rationale: Reduce overhead for tasks that have not touched the VMA and
 filter out unnecessary scanning.

2) Early phase of the VMA scan, where mm->numa_scan_seq is less than 2.
 Rationale: Understand the initial characteristics of VMAs and also
 prevent VMA scanning unfairness.
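
Taken together, the two conditions above amount to roughly the following
gate. This is a simplified userspace sketch; the struct and helper names are
illustrative, not the exact kernel code from [1]:

#include <stdbool.h>

/* Simplified model of the scan gate described above. */
struct vma_model {
        unsigned long access_pids;      /* bitmap of recent accessor PID hashes */
};

static bool vma_is_accessed_by(struct vma_model *vma, unsigned long pid_hash_bit)
{
        return vma->access_pids & pid_hash_bit;         /* condition 1) */
}

static bool allow_vma_scan(struct vma_model *vma, unsigned long pid_hash_bit,
                           int mm_numa_scan_seq)
{
        if (mm_numa_scan_seq < 2)                       /* condition 2): early phase */
                return true;
        return vma_is_accessed_by(vma, pid_hash_bit);   /* otherwise require access */
}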

While that works most of the time to reduce scanning overhead,
there are some corner cases associated with it.

This was found in an internal LKP run and also reported in [2]. There was
an earlier attempt to fix it:

Link: https://lore.kernel.org/linux-mm/cover.1685506205.git.raghavendra.kt@amd.com/T/

This is an entirely different series, following Mel's feedback, that addresses
the issue and is also a continuation of enhancing VMA scanning for NUMA balancing.

Problem statement (Disjoint VMA set):
======================================
Let's look at some of the corner cases with the example below of tasks and their
access pattern.

Consider N tasks (threads) of a process.
Set1 tasks accessing vma_x (group of VMAs)
Set2 tasks accessing vma_y (group of VMAs)

             Set1                      Set2
        -------------------         --------------------
        | task_1..task_n/2 |       | task_n/2+1..task_n |
        -------------------         --------------------	
                 |                             |
                 V                             V
        -------------------         --------------------
        |     vma_x       |         |     vma_y         |
        -------------------         --------------------	

Corner cases:
(a) Out of N tasks, not all get a fair opportunity to scan (PeterZ).
Suppose Set1 tasks get more opportunity to scan (maybe because of the
activity pattern of the tasks or other reasons in the current design); in the
above example, vma_x then gets scanned more often than vma_y.

An experiment illustrating this unfairness is documented here:
Link: https://lore.kernel.org/lkml/c730dee0-a711-8a8e-3eb1-1bfdd21e6add@amd.com/

(b) Sizes of VMAs can differ.
Suppose the size of vma_y is far greater than that of vma_x; a bigger
portion of vma_y can then potentially be left unscanned, since scanning is
bounded by a scan_size of 256MB (default) for each iteration.
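For example, with the default 256MB scan_size, a 16GB vma_y needs on the
order of 64 scan iterations just to be covered once, while a 256MB vma_x is
covered in a single iteration.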

(c) Highly active threads trap a few VMAs frequently, and some of the VMAs not
accessed for a long time can potentially be starved of scanning indefinitely
(Mel). This risks lacking enough hints/details about those VMAs if they are
needed later for migration.

(d) Memory allocated in some specific manner (Mel).
One example: suppose a main thread allocates memory and is then not
active. When other threads try to act upon that memory, they may not have many
hints about it if the corresponding VMA was never scanned.

(e) VMAs that are created after two full scans of the mm (mm->numa_scan_seq > 2)
will never get scanned. (Observed rarely, but very much possible depending on
workload behaviour.)

On top of this, a combination of some of the above (e.g., (a) and (b)) can
potentially amplify/worsen the side effects.

This patchset tries to address the above issues by enhancing the unconditional
VMA scanning logic.

High level ideas:
=================
Idea-1) Depending on vma_size, populate a per-VMA vma_scan_select value, decrement
it, and when it hits zero do a force scan (Mel).
The vma_scan_select value is repopulated when it hits zero.

This is how the VMA scanning phases look after implementation:

|<---p1--->|<-----p2----->|<-----p2----->|...

Algorithm:
p1: New VMA; in the initial phase, do not scan until scan_delay.

p2: Allow scanning if the task has accessed the VMA or vma_scan_select has hit zero.

Reinitialize vma_scan_select and repeat p2.

pros/cons:
+  : Rate limiting is built into the approach
+  : vma_size is taken into account for scanning
+/-: Scanning continues forever
-  : Changes in VMA size are taken care of only after a force scan, i.e.,
   vma_scan_select is repopulated only after it hits zero.

Idea-1 can potentially cover all the issues mentioned above.
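
To make Idea-1 concrete, below is a minimal userspace sketch. The
vma_scan_select field, the helpers and the exact seeding from vma_size are
assumptions based on the description above, not the final patch:

#include <stdbool.h>

#define SCAN_SIZE_PAGES (256UL << 8)    /* default 256MB scan_size, in 4K pages */

/* Hypothetical per-VMA state for Idea-1. */
struct vma_numab_model {
        unsigned long vma_scan_select;  /* force-scan countdown */
};

/* Seed the countdown from the VMA size; the exact scaling is a design choice
 * in the series, here simply the number of scan windows needed to cover it. */
static void vma_scan_select_init(struct vma_numab_model *nb, unsigned long vma_pages)
{
        nb->vma_scan_select = (vma_pages + SCAN_SIZE_PAGES - 1) / SCAN_SIZE_PAGES;
}

/* Phase p2: allow the scan if the task has accessed the VMA, or force it when
 * the countdown reaches zero, and only then repopulate the countdown. */
static bool vma_allow_scan(struct vma_numab_model *nb, bool task_accessed_vma,
                           unsigned long vma_pages)
{
        if (task_accessed_vma)
                return true;
        if (nb->vma_scan_select > 0)
                nb->vma_scan_select--;
        if (nb->vma_scan_select == 0) {
                vma_scan_select_init(nb, vma_pages);
                return true;            /* force scan */
        }
        return false;
}

The repopulate-only-after-force-scan step is what gives the built-in rate
limiting, and also the limitation that changes in VMA size are picked up only
after a force scan, as noted above.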

Idea-2) Take the bitmask_weight of the latest access_pids value (suggested by Bharata).
If the number of tasks accessing the VMA is >= 1, unconditionally allow scanning.

Idea-3) Take the bitmask_weight of the access_pid history of the VMA. If the number
of tasks accessing the VMA is > THRESHOLD (=3), unconditionally allow scanning.

Rationale (Ideas 2, 3): Do not miss out on scanning of critical VMAs.
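
Ideas 2 and 3 essentially reduce to a popcount over the access-PID bitmaps.
A rough userspace model follows; the access_pids history layout is an
assumption (the kernel would use bitmap_weight()/hweight_long()), and the
NR_ACCESS_PID_HIST=4 / SHARED_VMA_THRESH=3 values match those used in this
series:

#include <stdbool.h>

#define NR_ACCESS_PID_HIST 4    /* history windows kept per VMA */
#define SHARED_VMA_THRESH  3    /* accessor-count threshold for Idea-3 */

struct vma_access_model {
        unsigned long access_pids[NR_ACCESS_PID_HIST]; /* per-window PID-hash bitmaps */
};

/* Idea-2: any accessor recorded in the latest window => allow the scan. */
static bool scan_for_recent_access(const struct vma_access_model *v)
{
        return __builtin_popcountl(v->access_pids[0]) >= 1;
}

/* Idea-3: enough distinct accessors across the whole history => treat the
 * VMA as shared and allow the scan unconditionally. */
static bool scan_for_shared_vma(const struct vma_access_model *v)
{
        unsigned long combined = 0;
        int i;

        for (i = 0; i < NR_ACCESS_PID_HIST; i++)
                combined |= v->access_pids[i];
        return __builtin_popcountl(combined) > SHARED_VMA_THRESH;
}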

Idea-4) Have a per-VMA vma_scan_seq. Allow the unconditional scan until vma_scan_seq
reaches a value proportional (or equal) to vma_size/scan_size.
This is complementary to Idea-1.

This is how the VMA scanning phases look after implementation:

|<--p1--->|<-----p2----->|<-----p3----->|<-----p4----->...||<-----p2----->|<-----p3----->|<-----p4-----> ...||
                                                        RESET                                               RESET
Algorithm:
p1: New VMA, initial phase do not scan till scan_delay.

p2: Allow scanning if the task has accessed the VMA or vma_scan_seq has reached
 f(vma_size/scan_size), e.g., f = 1/2 * vma_size/scan_size.

p3: Allow scanning if the task has accessed the VMA or vma_scan_seq has reached
 f(vma_size/scan_size), in a rate-limited manner. This is an optional phase.

p4: Allow scanning iff the task has accessed the VMA.

Reset after p4 (optional).

Repeat p2, p3, p4.

Motivation: Allow aggressive scanning in the beginning, followed by rate-limited
scanning, and then completely disallow scanning to avoid unnecessary overhead.
The reset time could be a function of scan_delay, chosen long enough to let
long-running tasks forget history and start afresh.

+  : Rate limiting needs to be taken care of separately if needed.
+/-: Scanning continues only if RESET of vma_scan_seq is implemented.
+  : Changes in VMA size are taken care of in every scan.
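
A sketch of how the Idea-4 p2/p4 phases could be driven by a hypothetical
per-VMA vma_scan_seq, using the f = 1/2 example from p2 above (the optional
rate-limited p3 phase and the periodic RESET are omitted; all names are
illustrative):

#include <stdbool.h>

/* Hypothetical per-VMA counter for Idea-4. */
struct vma_seq_model {
        unsigned long vma_scan_seq;
};

/* Scan windows needed to cover the VMA once. */
static unsigned long scan_windows(unsigned long vma_size, unsigned long scan_size)
{
        return (vma_size + scan_size - 1) / scan_size;
}

/* p2: unconditional scanning until vma_scan_seq reaches f(vma_size/scan_size),
 * with f = 1/2 as in the example above.
 * p4: afterwards, scan only when the task has actually accessed the VMA. */
static bool idea4_allow_scan(struct vma_seq_model *v, bool task_accessed_vma,
                             unsigned long vma_size, unsigned long scan_size)
{
        unsigned long p2_limit = scan_windows(vma_size, scan_size) / 2;

        if (task_accessed_vma || v->vma_scan_seq < p2_limit) {
                v->vma_scan_seq++;
                return true;
        }
        return false;
}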

The current patch series implements Ideas 1, 2, 3 plus an extension of the
access-PID-history idea from PeterZ.

Results:
======
Base: 6.5.0-rc6+ (4853c74bd7ab)
SUT: Milan w/ 2 numa nodes 256 cpus

mmtest		numa01_THREAD_ALLOC manual run:

		base		patched
real		1m22.758s	1m9.200s
user		249m49.540s	229m30.039s
sys		0m25.040s	3m10.451s
	
numa_pte_updates 	6985	1573363
numa_hint_faults 	2705	1022623
numa_hint_faults_local 	2279	389633
numa_pages_migrated 	426	632990

kernbench
			base			patched
Amean     user-256    21989.09 (   0.00%)    21677.36 *   1.42%*
Amean     syst-256    10171.34 (   0.00%)    10818.28 *  -6.36%*
Amean     elsp-256      166.81 (   0.00%)      168.40 *  -0.95%*

Duration User       65973.18    65038.00
Duration System     30538.92    32478.59
Duration Elapsed      529.52      533.09

Ops NUMA PTE updates                  976844.00      962680.00
Ops NUMA hint faults                  226763.00      245620.00
Ops NUMA pages migrated               220146.00      207025.00
Ops AutoNUMA cost                       1144.84        1238.77

Results in other benchmarks I have tested:
Time based:
Hashjoin	4.21%
Btree	 	2.04%
XSbench		0.36%

Throughput based:
Graph500 	-3.62%
Nas.bt		3.69%
Nas.ft		21.91%

Note: The VMA scanning improvements [1] have refined scanning so much that the
system overhead we re-introduce with the additional scans looks glaringly
high. But if we consider the difference between before [1] and the current
series, the overall scanning overhead is considerably reduced.

1. Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
2. Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/

Note: The patch description is repeated in some patches to avoid the need to
copy from the cover letter again.

Peter Zijlstra (1):
  sched/numa: Increase tasks' access history

Raghavendra K T (5):
  sched/numa: Move up the access pid reset logic
  sched/numa: Add disjoint vma unconditional scan logic
  sched/numa: Remove unconditional scan logic using mm numa_scan_seq
  sched/numa: Allow recently accessed VMAs to be scanned
  sched/numa: Allow scanning of shared VMAs

 include/linux/mm.h       |  12 +++--
 include/linux/mm_types.h |   5 +-
 kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
 3 files changed, 102 insertions(+), 24 deletions(-)

Comments

Swapnil Sapkal Sept. 13, 2023, 5:28 a.m. UTC | #1
Hello Raghu,

On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> [...]

I have tested this series on a 4th generation EPYC processor. I am seeing improvements in autonuma-benchmark with the series.

o System Details

- 4th Generation EPYC System
- 2 x 128C/256T
- NPS1 mode

o Kernels

base:   4853c74bd7ab Merge tag 'parisc-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux


==================================================================
Test          : autonuma-benchmark
Units         : Time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
commit:
   base (4853c74bd7ab)
   base + this_series

   base (4853c74bd7ab)         base + this_series
---------------- ---------------------------
          %stddev     %change         %stddev
              \          |                \
     522.58           -11.2%     464.23        autonuma-benchmark.numa01.seconds
     273.93            -1.2%     270.75        autonuma-benchmark.numa01_THREAD_ALLOC.seconds
       0.60            +0.0%       0.60        autonuma-benchmark.numa02.seconds
     102.68            +3.7%     106.50        autonuma-benchmark.numa02_SMT.seconds

Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>

--
Thanks and Regards,
Swapnil
Raghavendra K T Sept. 13, 2023, 6:24 a.m. UTC | #2
On 9/13/2023 10:58 AM, Swapnil Sapkal wrote:
> Hello Raghu,
> 
> [...]
> 
> Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
> 


Thank you Swapnil.

Regards
- Raghu
Raghavendra K T Sept. 19, 2023, 6:30 a.m. UTC | #3
On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> [...]

Hello Andrew,

I am resending the patches rebased to mm-unstable, adding results from Oliver
and Swapnil.

(so that it is ready to merge once we get a go-ahead / no objection from
Mel, Peter... I am okay to work on any further suggestions, if any).

Hope that is okay.

Thanks and Regards
- Raghu
Ingo Molnar Sept. 19, 2023, 7:15 a.m. UTC | #4
* Raghavendra K T <raghavendra.kt@amd.com> wrote:

> On 8/29/2023 11:36 AM, Raghavendra K T wrote:
> > [...]
> 
> Hello Andrew,
> 
> I am resending the patches rebased to mm-unstable, adding results from Oliver
> and Swapnil.

Just for the record, a final version of this series should be submitted via 
the scheduler tree, not -mm.

Thanks,

	Ingo
Raghavendra K T Sept. 19, 2023, 8:06 a.m. UTC | #5
On 9/19/2023 12:45 PM, Ingo Molnar wrote:
> 
> * Raghavendra K T <raghavendra.kt@amd.com> wrote:
> 
>> On 8/29/2023 11:36 AM, Raghavendra K T wrote:
[...]

>>>
>>> Peter Zijlstra (1):
>>>     sched/numa: Increase tasks' access history
>>>
>>> Raghavendra K T (5):
>>>     sched/numa: Move up the access pid reset logic
>>>     sched/numa: Add disjoint vma unconditional scan logic
>>>     sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>>>     sched/numa: Allow recently accessed VMAs to be scanned
>>>     sched/numa: Allow scanning of shared VMAs
>>>
>>>    include/linux/mm.h       |  12 +++--
>>>    include/linux/mm_types.h |   5 +-
>>>    kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>>>    3 files changed, 102 insertions(+), 24 deletions(-)
>>>
>>
>> Hello Andrew,
>>
>> I am resending the patches rebased to mm-unstable, adding results from Oliver
>> and Swapnil.
> 
> Just for the record, a final version of this series should be submitted via
> the scheduler tree, not -mm.
> 

Thank you for the clarification, Ingo. Maybe Andrew also wanted me to rebase
directly onto the scheduler tree. The last patch series (numa scan
enhancements) had changes in mm/, but this time it is mostly fair.c.

I hope I have not missed anybody who should have been on the list from the
sched side.

-Raghu
Peter Zijlstra Sept. 19, 2023, 9:28 a.m. UTC | #6
On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:

> Peter Zijlstra (1):
>   sched/numa: Increase tasks' access history
> 
> Raghavendra K T (5):
>   sched/numa: Move up the access pid reset logic
>   sched/numa: Add disjoint vma unconditional scan logic
>   sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>   sched/numa: Allow recently accessed VMAs to be scanned
>   sched/numa: Allow scanning of shared VMAs
> 
>  include/linux/mm.h       |  12 +++--
>  include/linux/mm_types.h |   5 +-
>  kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>  3 files changed, 102 insertions(+), 24 deletions(-)

So I don't immediately see anything horrible with this. Mel, do you have
a few cycles to go over this as well?
Mel Gorman Sept. 19, 2023, 4:22 p.m. UTC | #7
On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:
> 
> > Peter Zijlstra (1):
> >   sched/numa: Increase tasks' access history
> > 
> > Raghavendra K T (5):
> >   sched/numa: Move up the access pid reset logic
> >   sched/numa: Add disjoint vma unconditional scan logic
> >   sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> >   sched/numa: Allow recently accessed VMAs to be scanned
> >   sched/numa: Allow scanning of shared VMAs
> > 
> >  include/linux/mm.h       |  12 +++--
> >  include/linux/mm_types.h |   5 +-
> >  kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
> >  3 files changed, 102 insertions(+), 24 deletions(-)
> 
> So I don't immediately see anything horrible with this. Mel, do you have
> a few cycles to go over this as well?

I've been trying my best to find the necessary time and it's still on my
radar for this week. Preliminary results don't look great for the first part
of the series up to the patch "sched/numa: Add disjoint vma unconditional
scan logic" even though other reports indicate the performance may be
fixed up later in the series. For example

autonumabench
                                   6.5.0-rc6              6.5.0-rc6
                         sched-pidclear-v1r5   sched-forcescan-v1r5
Min       syst-NUMA02        1.94 (   0.00%)        1.38 (  28.87%)
Min       elsp-NUMA02       12.67 (   0.00%)       21.02 ( -65.90%)
Amean     syst-NUMA02        2.35 (   0.00%)        1.86 (  21.13%)
Amean     elsp-NUMA02       12.93 (   0.00%)       21.69 * -67.76%*
Stddev    syst-NUMA02        0.54 (   0.00%)        0.90 ( -67.67%)
Stddev    elsp-NUMA02        0.18 (   0.00%)        0.44 (-144.19%)
CoeffVar  syst-NUMA02       22.82 (   0.00%)       48.50 (-112.58%)
CoeffVar  elsp-NUMA02        1.38 (   0.00%)        2.01 ( -45.56%)
Max       syst-NUMA02        3.15 (   0.00%)        3.89 ( -23.49%)
Max       elsp-NUMA02       13.16 (   0.00%)       22.36 ( -69.91%)
BAmean-50 syst-NUMA02        2.01 (   0.00%)        1.45 (  27.69%)
BAmean-50 elsp-NUMA02       12.77 (   0.00%)       21.34 ( -67.04%)
BAmean-95 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
BAmean-95 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
BAmean-99 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
BAmean-99 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)

                            6.5.0-rc6             6.5.0-rc6
                  sched-pidclear-v1r5  sched-forcescan-v1r5
Duration User        5702.00    10264.25
Duration System        17.02       13.59
Duration Elapsed       92.57      156.30

Similar results seen across multiple machines. It's not universally bad
but the NUMA02 tests appear to suffer quite badly and while not realistic,
they are somewhat relevant because numa02 is likely an "adverse workload"
for the logic that skips VMAs based on PID accesses.

For the rest of the series, the changelogs lacked detail on why those
changes helped. Patch 4's changelog lacks detail and patch 6 stating
"VMAs being accessed by more than two tasks are critical" is not helpful
either -- e.g. why are they critical? They are obviously shared VMAs and
therefore it may be the case that they need to be identified and interleaved
quickly but maybe not. Is the shared VMA that is critical a large malloc'd
area split into per-thread sections or something that is MAP_SHARED? The
changelog doesn't say so I have to guess. There are also a bunch of
magic variables with limited explanation (e.g. why NR_ACCESS_PID_HIST==4
and SHARED_VMA_THRESH=3?), the numab fields are not documented and the
changelogs lack supporting data. I suspect that patches 3-6 may be dealing
with regressions introduced by patch 2, particularly for NUMA02, but I'm
not certain as I didn't dedicate the necessary test time to prove that
and it's the type of information that should be in the changelog. While
there is nothing wrong with that as such, it's very hard to imagine how
patches 3-6 work in every case and be certain that the various parameters
make sense. That could cause difficulties later in terms of maintenance.

My initial thinking was "There should be a standalone series that deals
*only* with scanning VMAs that had no fault activity and were skipped due to
PID hashing". These are important because there may be no fault activity
because there is no scan activity, which is due to fault activity. The
series is incomplete and without changelogs but I pushed it anyway to

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5

The first two patches simply improve the documentation on what is going
on, patch 3 adds a tracepoint for figuring out why VMAs were skipped or
not skipped. Patch 4 handles a corner case to complete the scan of a VMA
once it has started regardless of what task is doing the scanning. The
last patch scans VMAs that have seen no fault activity once active VMAs
have been scanned.

It has its weaknesses because it may be overly simplistic and it forces
all VMAs to be scanned on every sequence, which is wasteful. It also hurts
NUMA02 performance, although not as badly as "sched/numa: Add disjoint
vma unconditional scan logic". On the plus side, it is easier to reason
about, it solves only one problem in the series, and any patch on top or
modification should justify each change individually.
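
To illustrate the intent of that last patch, the described behaviour is
roughly the following (an illustrative model only, not the code in the branch
above):

#include <stdbool.h>
#include <stddef.h>

struct vma_info {
        bool had_fault_activity;        /* hinting faults observed on this VMA */
        bool scanned_this_seq;          /* already covered in the current scan sequence */
};

static void scan_vma(struct vma_info *v)
{
        v->scanned_this_seq = true;
}

static void numa_scan_sequence(struct vma_info *vmas, size_t nr)
{
        size_t i;

        /* First, VMAs that are known to be active. */
        for (i = 0; i < nr; i++)
                if (vmas[i].had_fault_activity && !vmas[i].scanned_this_seq)
                        scan_vma(&vmas[i]);

        /* Then everything that saw no fault activity, so every VMA is covered
         * at least once per sequence (the wasteful part noted above). */
        for (i = 0; i < nr; i++)
                if (!vmas[i].scanned_this_seq)
                        scan_vma(&vmas[i]);
}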
Peter Zijlstra Sept. 19, 2023, 7:11 p.m. UTC | #8
On Tue, Sep 19, 2023 at 05:22:15PM +0100, Mel Gorman wrote:

> I've been trying my best to find the necessary time and it's still on my
> radar for this week. 

OK, no hurry! Take your time.
Raghavendra K T Sept. 20, 2023, 10:42 a.m. UTC | #9
On 9/19/2023 9:52 PM, Mel Gorman wrote:
> On Tue, Sep 19, 2023 at 11:28:30AM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 29, 2023 at 11:36:08AM +0530, Raghavendra K T wrote:
>>
>>> Peter Zijlstra (1):
>>>    sched/numa: Increase tasks' access history
>>>
>>> Raghavendra K T (5):
>>>    sched/numa: Move up the access pid reset logic
>>>    sched/numa: Add disjoint vma unconditional scan logic
>>>    sched/numa: Remove unconditional scan logic using mm numa_scan_seq
>>>    sched/numa: Allow recently accessed VMAs to be scanned
>>>    sched/numa: Allow scanning of shared VMAs
>>>
>>>   include/linux/mm.h       |  12 +++--
>>>   include/linux/mm_types.h |   5 +-
>>>   kernel/sched/fair.c      | 109 ++++++++++++++++++++++++++++++++-------
>>>   3 files changed, 102 insertions(+), 24 deletions(-)
>>
>> So I don't immediately see anything horrible with this. Mel, do you have
>> a few cycles to go over this as well?
> 
> I've been trying my best to find the necessary time and it's still on my
> radar for this week. 

Hello Mel,
Thanks you a lot for your time and for having a detailed look, and your
patches.

In summary, I will start with your patchset
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5
and see if there are any cumulative benefits from my patches (3-6) on
top of it.

Trying to give some details in response to your questions below; please
skip if it gets long.

> Preliminary results don't look great for the first part
> of the series up to the patch "sched/numa: Add disjoint vma unconditional
> scan logic" even though other reports indicate the performance may be
> fixed up later in the series. For example
> 
> autonumabench
>                                     6.5.0-rc6              6.5.0-rc6
>                           sched-pidclear-v1r5   sched-forcescan-v1r5
> Min       syst-NUMA02        1.94 (   0.00%)        1.38 (  28.87%)
> Min       elsp-NUMA02       12.67 (   0.00%)       21.02 ( -65.90%)
> Amean     syst-NUMA02        2.35 (   0.00%)        1.86 (  21.13%)
> Amean     elsp-NUMA02       12.93 (   0.00%)       21.69 * -67.76%*
> Stddev    syst-NUMA02        0.54 (   0.00%)        0.90 ( -67.67%)
> Stddev    elsp-NUMA02        0.18 (   0.00%)        0.44 (-144.19%)
> CoeffVar  syst-NUMA02       22.82 (   0.00%)       48.50 (-112.58%)
> CoeffVar  elsp-NUMA02        1.38 (   0.00%)        2.01 ( -45.56%)
> Max       syst-NUMA02        3.15 (   0.00%)        3.89 ( -23.49%)
> Max       elsp-NUMA02       13.16 (   0.00%)       22.36 ( -69.91%)
> BAmean-50 syst-NUMA02        2.01 (   0.00%)        1.45 (  27.69%)
> BAmean-50 elsp-NUMA02       12.77 (   0.00%)       21.34 ( -67.04%)
> BAmean-95 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
> BAmean-95 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
> BAmean-99 syst-NUMA02        2.22 (   0.00%)        1.52 (  31.68%)
> BAmean-99 elsp-NUMA02       12.89 (   0.00%)       21.58 ( -67.39%)
> 
>                     6.5.0-rc6   6.5.0-rc6
>                  sched-pidclear-v1r5 sched-forcescan-v1r5
> Duration User        5702.00    10264.25
> Duration System        17.02       13.59
> Duration Elapsed       92.57      156.30
> 
> Similar results seen across multiple machines. It's not universally bad
> but the NUMA02 tests appear to suffer quite badly and while not realistic,
> they are somewhat relevant because numa02 is likely an "adverse workload"
> for the logic that skips VMAs based on PID accesses.
> 
> For the rest of the series, the changelogs lacked detail on why those
> changes helped. Patch 4's changelog lacks detail and patch 6 stating
> "VMAs being accessed by more than two tasks are critical" is not helpful
> either -- e.g. why are they critical?

Agreed. For patches 5 and 6 (scanning shared VMAs and recently accessed
VMAs) there was a brief rationale in the cover letter, but perhaps it
was not enough.

More background:
I had used trace_printk()s to understand VMA sizes, the PID hash, the
success percentage of is_vma_accessed(), and also how many tasks
typically access a VMA, for some of the workloads
(vma_size below is in KB).

E.g.,
<...>-1451602 [116] ...1. 39195.488591: vma_fault: vma=ffff8bcab42ad7b8 pid=1451602 hash=40, success=1
<...>-1451481 [210] ..... 39196.948390: sched_numascan: comm=numa01 pid=1451481 vma = ffff8bc9228637b8 access_hist=4200000cfe66727 hashval = 26 bitmap_wt = 22, vma_size = 3153924 success = 1
<...>-1451570 [052] ...1. 39196.948725: vma_fault: vma=ffff8bc9228637b8 pid=1451570 hash=25, success=1
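
The instrumentation was roughly of the following shape (an approximation
of the debug change, not the exact code; the access_pids/hash usage
follows the existing vma_is_accessed() logic in kernel/sched/fair.c
around v6.5):

/*
 * Approximation of the debug instrumentation behind the trace lines
 * above.  vma_size is reported in KB, matching the trace.
 */
static bool vma_is_accessed(struct vm_area_struct *vma)
{
	unsigned long pids;
	unsigned int hash = hash_32(current->pid, ilog2(BITS_PER_LONG));
	bool accessed;

	pids = vma->numab_state->access_pids[0] |
	       vma->numab_state->access_pids[1];
	accessed = test_bit(hash, &pids);

	trace_printk("vma=%p pid=%d hash=%u vma_size=%lu success=%d\n",
		     vma, current->pid, hash,
		     (vma->vm_end - vma->vm_start) >> 10, accessed);

	return accessed;
}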

1) For very large VMAs we may incur a delay in scanning the whole VMA,
because we scan only in 256MB chunks per iteration (see the snippet
below) and filter out tasks which had not touched them, etc. So the
idea was to speed up the scanning.

2) Similar rationale for recently accessed VMAs, i.e., not to delay
scanning of very recently (hot) accessed VMAs.

[ I did not explore using young page info, mm walks etc. as I thought
they might be expensive. ]
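
For context, the 256MB figure mentioned in 1) is the default
numa_balancing scan size; roughly, simplified from task_numa_work() in
kernel/sched/fair.c, the per-iteration budget is:

	/* Per-iteration scan budget: why a multi-GB VMA needs many windows. */
	pages = sysctl_numa_balancing_scan_size;	/* 256 MB by default */
	pages <<= 20 - PAGE_SHIFT;			/* MB -> pages */
	virtpages = pages * 8;				/* virtual space to scan */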

> They are obviously shared VMAs and
> therefore it may be the case that they need to be identified and interleaved
> quickly

Yes. Mostly that was the idea, as mentioned above.

> but maybe not. Is the shared VMA that is critical a large malloc'd
> area split into per-thread sections or something that is MAP_SHARED? The
> changelog doesn't say, so I have to guess. There are also a bunch of
> magic variables with limited explanation (e.g. why NR_ACCESS_PID_HIST==4
> and SHARED_VMA_THRESH=3?),

Those thresholds were the result of multiple experiments I did
(SHARED_VMA_THRESH = 3, 4, ... and NR_ACCESS_PID_HIST = 3, 4, etc.).

One thing I did not look at is whether I should reduce the PID_RESET
interval (because we are maintaining more history now).
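
To make the intent of the two constants concrete, the idea was roughly
the following. This is an illustrative sketch, not the exact code from
the patches; it assumes numab_state keeps NR_ACCESS_PID_HIST access-pid
windows as per the "Increase tasks' access history" patch:

#define NR_ACCESS_PID_HIST	4	/* access-pid windows kept per VMA */
#define SHARED_VMA_THRESH	3	/* distinct PID-hash bits => "shared" */

/*
 * Sketch: treat a VMA as shared/hot if enough distinct PID-hash bits
 * are set across the recent access-pid history (cf. bitmap_wt in the
 * trace above), and scan it regardless of which task does the scan.
 */
static bool vma_is_shared_hot(struct vm_area_struct *vma)
{
	unsigned long pids = 0;
	int i;

	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
		pids |= vma->numab_state->access_pids[i];

	return bitmap_weight(&pids, BITS_PER_LONG) >= SHARED_VMA_THRESH;
}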

> the numab fields are not documented 
Agreed, I should have documented those better earlier.

> and the
> changelogs lack supporting data. I suspect that patches 3-6 may be dealing
> with regressions introduced by patch 2, particularly for NUMA02, but I'm

TBH, I did not intend to worsen NUMA02 with patch 2 and then improve it
later in the series. This is the data I had for the full patchset.

autonumabench
                              base                   patched
Min       syst-NUMA02        0.99 (   0.00%)        0.99 (   0.00%)
Min       elsp-NUMA02        3.04 (   0.00%)        3.04 (   0.00%)
Amean     syst-NUMA02        1.06 (   0.00%)        1.05 *   1.08%*
Amean     elsp-NUMA02        3.80 (   0.00%)        3.39 *  10.68%*
Stddev    syst-NUMA02        0.10 (   0.00%)        0.07 (  24.57%)
Stddev    elsp-NUMA02        0.73 (   0.00%)        0.34 (  52.86%)
CoeffVar  syst-NUMA02        9.04 (   0.00%)        6.89 (  23.75%)
CoeffVar  elsp-NUMA02       19.25 (   0.00%)       10.16 (  47.22%)
Max       syst-NUMA02        1.27 (   0.00%)        1.21 (   4.72%)
Max       elsp-NUMA02        4.91 (   0.00%)        4.04 (  17.72%)
BAmean-50 syst-NUMA02        1.00 (   0.00%)        1.01 (  -0.66%)
BAmean-50 elsp-NUMA02        3.21 (   0.00%)        3.12 (   2.60%)
BAmean-95 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-95 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)
BAmean-99 syst-NUMA02        1.03 (   0.00%)        1.02 (   0.32%)
BAmean-99 elsp-NUMA02        3.61 (   0.00%)        3.28 (   9.09%)

Duration User        1555.24     1377.57
Duration System         8.10        7.99
Duration Elapsed       30.86       26.49

But then I saw results from the kernel test robot, which compared
individual patches:

commit:
   2f88c8e802 ("(tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
   2a806eab1c ("sched/numa: Move up the access pid reset logic")
   1ef5cbb92b ("sched/numa: Add disjoint vma unconditional scan logic")
   68cfe9439a ("sched/numa: Allow scanning of shared VMAs")


2f88c8e802c8b128 2a806eab1c2e1c9f0ae39dc0307 1ef5cbb92bdb320c5eb9fdee1a8 68cfe9439a1baa642e05883fa64
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
    271.01            +0.8%     273.24            -0.7%     269.00           -26.4%     199.49 ±  3%  autonuma-benchmark.numa01.seconds
     76.28            +0.2%      76.44           -11.7%      67.36 ±  6%    -46.9%      40.49 ±  5%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      8.11            -0.9%       8.04            -0.7%       8.05            -0.1%       8.10        autonuma-benchmark.numa02.seconds
      1425            +0.7%       1434            -3.1%       1381           -30.1%     996.02 ±  2%  autonuma-benchmark.time.elapsed_time

I do see some negligible overhead from the first patch, but the second
patch still gave some improvement.

My observation with the patchset was an increase in system time because
of the additional scanning we re-introduced, but this was still 2x
better than where we started without the numascan enhancements.

> not certain as I didn't dedicate the necessary test time to prove that
> and it's the type of information that should be in the changelog. While
> there is nothing wrong with that as such, it's very hard to imagine how
> patches 3-6 work in every case and be certain that the various parameters
> make sense. That could cause difficulties later in terms of maintenance.
>

Agreed regarding maintenance.

> My initial thinking was "There should be a standalone series that deals
> *only* with scanning VMAs that had no fault activity and skipped due to
> PID hashing". These are important because there may be no fault activity
> because there is no scan activity, which is due to fault activity. The
> series is incomplete and without changelogs but I pushed it anyway to
> 

Agreed.

> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5
> 

Thanks. Patches 1-4 are simple to start with, with the force scan in
patch 5. Will experiment with these.

> The first two patches simply improve the documentation on what is going
> on, patch 3 adds a tracepoint for figuring out why VMAs were skipped or
> not skipped. Patch 4 handles a corner case to complete the scan of a VMA
> once it has started regardless of what task is doing the scanning. The
> last patch scans VMAs that have seen no fault activity once active VMAs
> have been scanned.
>
> It has its weaknesses because it may be overly simplistic and it forces
> all VMAs to be scanned on every sequence which is wasteful. It also hurts
> NUMA02 performance, although not as badly as "sched/numa: Add disjoint
> vma unconditional scan logic". On the plus side, it is easier to reason
> about, it solves only one problem in the series and any patch on top or
> modification should justify each change individually.
> 
Is there anything else you have in mind that I should look into, apart
from the above (rebasing onto your patches and experimenting with my
patches 3-6 for any cumulative improvements)?

Thanks and Regards
- Raghu