
[0/6] sched/numa: Complete scanning of partial and inactive VMAs

Message ID 20231010083143.19593-1-mgorman@techsingularity.net (mailing list archive)

Message

Mel Gorman Oct. 10, 2023, 8:31 a.m. UTC
NUMA Balancing currently uses PID fault activity within a VMA to
determine if it is worth updating PTEs to trap NUMA hinting faults.
While this reduces overhead, it misses two important corner cases.
The first is that if Task A partially scans a VMA that is active and
Task B resumes the scan but is inactive, then the remainder of the VMA
may be missed. Similarly, if a VMA is inactive for a period of time then
it may never be scanned again.
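
The first corner case can be illustrated with a minimal userspace sketch. This is only a model under stated assumptions, not the kernel's implementation: `struct vma_model`, `pid_to_bit()`, `record_fault()` and `task_should_scan()` are hypothetical names, and a plain modulo stands in for whatever hash the kernel applies to the PID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of per-VMA PID fault tracking; not kernel code. */
struct vma_model {
    uint64_t pids_active;   /* one bit per hashed PID that faulted recently */
};

/* Map a PID onto one of 64 bits. The modulo is an assumption for brevity. */
static unsigned int pid_to_bit(int pid)
{
    assert(pid >= 0);
    return (unsigned int)pid % 64u;
}

/* Record that the task with this PID took a NUMA hinting fault in the VMA. */
static void record_fault(struct vma_model *vma, int pid)
{
    vma->pids_active |= UINT64_C(1) << pid_to_bit(pid);
}

/* PID-activity check: a task only scans a VMA it has recently faulted in. */
static bool task_should_scan(const struct vma_model *vma, int pid)
{
    return (vma->pids_active >> pid_to_bit(pid)) & 1u;
}
```

In this model, if only Task A (say PID 100) has called `record_fault()`, then `task_should_scan()` is true for PID 100 and false for PID 101. A Task B with PID 101 that picks up the scan partway through the VMA would skip the remainder, which is the missed-scan situation described above.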

Patches 1-3 improve the documentation of the current per-VMA tracking
and add a trace point for scan activity. Patch 4 addresses a corner
case where the PID activity information may not be reset after the
expected timeout. Patches 5-6 complete the scanning of partial and
inactive VMAs within the scan sequence.

This could be improved further but it would deserve a separate series on
top with supporting data justifying the change. Otherwise any gain/loss
due to the additional changes could be masked by this series on its own.

 include/linux/mm.h                   |   4 +-
 include/linux/mm_types.h             |  36 +++++++++-
 include/linux/sched/numa_balancing.h |  10 +++
 include/trace/events/sched.h         |  52 ++++++++++++++
 kernel/sched/fair.c                  | 103 ++++++++++++++++++++++-----
 5 files changed, 182 insertions(+), 23 deletions(-)

Comments

Raghavendra K T Oct. 10, 2023, 11:39 a.m. UTC | #1
On 10/10/2023 2:01 PM, Mel Gorman wrote:
> NUMA Balancing currently uses PID fault activity within a VMA to
> determine if it is worth updating PTEs to trap NUMA hinting faults.
> While this reduces overhead, it misses two important corner cases.
> The first is that if Task A partially scans a VMA that is active and
> Task B resumes the scan but is inactive, then the remainder of the VMA
> may be missed. Similarly, if a VMA is inactive for a period of time then
> it may never be scanned again.
> 
> Patches 1-3 improve the documentation of the current per-VMA tracking
> and add a trace point for scan activity. Patch 4 addresses a corner
> case where the PID activity information may not be reset after the
> expected timeout. Patches 5-6 complete the scanning of partial and
> inactive VMAs within the scan sequence.
> 
> This could be improved further but it would deserve a separate series on
> top with supporting data justifying the change. Otherwise any gain/loss
> due to the additional changes could be masked by this series on its own.
> 

Thank you Mel for the patches. I see Ingo already took them into sched/core.
Here is my testing detail FWIW.

SUT:
- 4th Generation EPYC System
- 2 x 128C/256T
- NPS1 mode

base: 6.6-rc4
patch_v1r5: Mel's initial series with prev_scan_seq = -1 fix
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ sched-numabselective-v1r5

(This may not be relevant, but the numbers were even better for
thread_alloc, so ..)

patch_v1r13: current series

numa01_thread_alloc
=============
         base            patch_v1r5      patch_v1r13

real    8m46.557s       8m29.040s       8m38.098s
user    599m6.070s      268m38.140s     404m52.065s
sys     3655m38.681s    3794m10.079s    3751m36.779s

numa_hit                394964680       396000482       393981391
numa_local              197351688       198242761       197008099
numa_other              197612992       197757721       196973292
numa_pte_updates        1160            790360          812
numa_hint_faults        755             729196          553
numa_hint_faults_local  754             410220          263
numa_pages_migrated     1               318976          290

numa01
======
         base            patch_v1r5      patch_v1r13

real    18m26.691s      17m31.770s      17m33.540s
user    4501m40.194s    2148m7.993s     3295m57.897s
sys     3483m11.684s    4764m57.876s    4215m35.599s

numa_hit                395473956       395813242       395000242
numa_local              197776626       198188480       197983594
numa_other              197697330       197624762       197016648
numa_pte_updates        1447            4625319         7142774
numa_hint_faults        1390            4947832         10313097
numa_hint_faults_local  1288            2758651         5354895
numa_pages_migrated     102             594803          960422
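
For readers who want to compare the timing columns numerically, the following is a small sketch of one way to do it. The helpers `time_to_seconds()` and `percent_change()` are hypothetical names introduced here for illustration; they were not part of the testing above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Convert a time(1)-style string such as "8m46.557s" to seconds.
 * Returns a negative value if the string does not match that shape.
 */
static double time_to_seconds(const char *s)
{
    unsigned long minutes;
    double seconds;

    if (sscanf(s, "%lum%lfs", &minutes, &seconds) != 2)
        return -1.0;
    return (double)minutes * 60.0 + seconds;
}

/* Relative change from base to patched, in percent (negative = reduction). */
static double percent_change(double base, double patched)
{
    assert(base > 0.0);
    return (patched - base) / base * 100.0;
}
```

For example, `percent_change(time_to_seconds("599m6.070s"), time_to_seconds("404m52.065s"))` compares the base and patch_v1r13 user times for numa01_thread_alloc and comes out to roughly a 32% reduction.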


Thanks and Regards
- Raghu

Ingo Molnar Oct. 10, 2023, 9:45 p.m. UTC | #2
* Raghavendra K T <raghavendra.kt@amd.com> wrote:

> On 10/10/2023 2:01 PM, Mel Gorman wrote:
> > NUMA Balancing currently uses PID fault activity within a VMA to
> > determine if it is worth updating PTEs to trap NUMA hinting faults.
> > While this reduces overhead, it misses two important corner cases.
> > The first is that if Task A partially scans a VMA that is active and
> > Task B resumes the scan but is inactive, then the remainder of the VMA
> > may be missed. Similarly, if a VMA is inactive for a period of time then
> > it may never be scanned again.
> > 
> > Patches 1-3 improve the documentation of the current per-VMA tracking
> > and add a trace point for scan activity. Patch 4 addresses a corner
> > case where the PID activity information may not be reset after the
> > expected timeout. Patches 5-6 complete the scanning of partial and
> > inactive VMAs within the scan sequence.
> > 
> > This could be improved further but it would deserve a separate series on
> > top with supporting data justifying the change. Otherwise any gain/loss
> > due to the additional changes could be masked by this series on its own.
> > 
> 
> Thank you Mel for the patches. I see Ingo already took them into sched/core.
> Here is my testing detail FWIW.

Thank you for testing the series. I've added your Tested-by to the final
two patches that change behavior materially:

   Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Thanks,

	Ingo