
sched/numa: Add statistics of numa balance task migration and swap

Message ID 20250402010611.3204674-1-yu.c.chen@intel.com (mailing list archive)
State New

Commit Message

Chen Yu April 2, 2025, 1:06 a.m. UTC
On systems with NUMA balancing enabled, it is useful to track
task activity caused by NUMA balancing. NUMA balancing
has two mechanisms for task migration: one migrates the task to
an idle CPU in its preferred node, the other swaps two tasks on
different nodes if each is on the other's preferred node.

The kernel already has NUMA page migration statistics in
/sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched,
but does not have statistics for task migration/swap.
Add the task migration and swap counts accordingly.

The following two new fields:

numa_task_migrated
numa_task_swapped

will be displayed in both
/sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched

Previous RFC version can be found here:
https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
RFC->v1: Rename nr_numa_task_migrated to
         numa_task_migrated, and nr_numa_task_swapped to
         numa_task_swapped in /proc/{PID}/sched,
         so both cgroup's memory.stat and task's
         sched have the same field name.
---
 include/linux/sched.h         |  4 ++++
 include/linux/vm_event_item.h |  2 ++
 kernel/sched/core.c           | 10 ++++++++--
 kernel/sched/debug.c          |  4 ++++
 mm/memcontrol.c               |  2 ++
 mm/vmstat.c                   |  2 ++
 6 files changed, 22 insertions(+), 2 deletions(-)

Comments

Michal Koutný April 2, 2025, 1:24 p.m. UTC | #1
Hello Chen.

On Wed, Apr 02, 2025 at 09:06:11AM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful.
...
> The following two new fields:
> 
> numa_task_migrated
> numa_task_swapped
> 
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched

Why is the field /proc/$pid/sched not enough?

Also, you may want to update Documentation/admin-guide/cgroup-v2.rst
too.

Thanks,
Michal
Madadi Vineeth Reddy April 2, 2025, 1:33 p.m. UTC | #2
Hi Chen Yu,

On 02/04/25 06:36, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
> 
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
> 
> The following two new fields:
> 
> numa_task_migrated
> numa_task_swapped
> 
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched

I applied this patch, but I still don't see the two new fields
in /proc/{PID}/sched.

Am I missing any additional steps?

Thanks,
Madadi Vineeth Reddy

> 
> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> RFC->v1: Rename the nr_numa_task_migrated to
>          numa_task_migrated, and nr_numa_task_swapped
>          numa_task_swapped in /proc/{PID}/sched,
>          so both cgroup's memory.stat and task's
>          sched have the same field name.
> ---
>  include/linux/sched.h         |  4 ++++
>  include/linux/vm_event_item.h |  2 ++
>  kernel/sched/core.c           | 10 ++++++++--
>  kernel/sched/debug.c          |  4 ++++
>  mm/memcontrol.c               |  2 ++
>  mm/vmstat.c                   |  2 ++
>  6 files changed, 22 insertions(+), 2 deletions(-)
K Prateek Nayak April 2, 2025, 5:23 p.m. UTC | #3
On 4/2/2025 7:03 PM, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 02/04/25 06:36, Chen Yu wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful. NUMA balancing
>> has two mechanisms for task migration: one is to migrate the task to
>> an idle CPU in its preferred node, the other is to swap tasks on
>> different nodes if they are on each other's preferred node.
>>
>> The kernel already has NUMA page migration statistics in
>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>> but does not have statistics for task migration/swap.
>> Add the task migration and swap count accordingly.
>>
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
> 
> I applied this patch, but I still don't see the two new fields
> in /proc/{PID}/sched.
> 
> Am I missing any additional steps?

You also need to enable schedstats:

echo 1 > /proc/sys/kernel/sched_schedstats

After that it should be visible:

$ cat /proc/4030/sched
sched-messaging (4030, #threads: 641)
-------------------------------------------------------------------
se.exec_start                                :        283818.948537

...

nr_forced_migrations                         :                    0
numa_task_migrated                           :                    0
numa_task_swapped                            :                    0
nr_wakeups                                   :                    0

...
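
The check above can be wrapped in a small helper. This is only a minimal sketch: the function name show_numa_task_stats is made up for illustration, and the optional second argument (a path to use instead of /proc/sys/kernel/sched_schedstats) exists only so the function can be tried against saved copies of the files.

```shell
# Print the two NUMA task counters from a /proc/<pid>/sched style file.
# Usage: show_numa_task_stats SCHED_FILE [SCHEDSTATS_KNOB]
show_numa_task_stats() {
    sched_file=$1
    # Default to the sysctl file mentioned in this thread; the override
    # is an illustrative convenience, not part of the patch.
    knob=${2:-/proc/sys/kernel/sched_schedstats}

    # The fields only appear when schedstats is enabled.
    if [ "$(cat "$knob" 2>/dev/null)" != "1" ]; then
        echo "schedstats disabled; try: echo 1 > $knob" >&2
        return 1
    fi

    # Lines look like "numa_task_migrated   :   0", so awk's default
    # whitespace splitting gives name, ":", value.
    awk '$1 ~ /^numa_task_(migrated|swapped)$/ && $2 == ":" { print $1, $3 }' "$sched_file"
}
```

For example, "show_numa_task_stats /proc/4030/sched" would print one "name value" line per counter, or a hint on stderr if schedstats is still off. Writing to the sysctl file needs root.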
K Prateek Nayak April 2, 2025, 5:35 p.m. UTC | #4
Hello Chenyu,

On 4/2/2025 6:36 AM, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
> 
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
> 
> The following two new fields:
> 
> numa_task_migrated
> numa_task_swapped
> 
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched

Running sched-messaging with schedstats enabled, I could see both
"numa_task_migrated" and "numa_task_swapped" being populated for the
sched-messaging threads:

$ for i in $(ls /proc/4030/task/); do grep "numa_task_migrated" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
     400 0
     231 1
      10 2

$ for i in $(ls /proc/4030/task/); do grep "numa_task_swapped" /proc/$i/sched; done | tr -s ' ' | cut -d ' ' -f3 | sort | uniq -c
     389 0
     193 1
      47 2
      11 3
       1 4
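
The per-thread histogram above can also be produced with a single awk pass. This is a rough sketch rather than the exact pipeline used for the numbers above: the function name numa_field_histogram is hypothetical, and it assumes every input file uses the "name : value" layout of /proc/<pid>/sched.

```shell
# Histogram of one schedstat field across several sched files.
# Usage: numa_field_histogram FIELD FILE...
# Output: one "count value" line per distinct value, sorted by value.
numa_field_histogram() {
    field=$1
    shift
    # Refuse to run without files, otherwise awk would wait on stdin.
    [ "$#" -gt 0 ] || return 1

    awk -v f="$field" '
        $1 == f && $2 == ":" { count[$3]++ }
        END { for (v in count) print count[v], v }
    ' "$@" | sort -k2,2n
}
```

Assuming the per-thread files are also reachable as /proc/<pid>/task/<tid>/sched, a call such as "numa_field_histogram numa_task_swapped /proc/4030/task/*/sched" should give the same counts as the loop above, with the count in the first column.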

> 
> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
K Prateek Nayak April 2, 2025, 5:43 p.m. UTC | #5
Hello Michal,

On 4/2/2025 6:54 PM, Michal Koutný wrote:
> Hello Chen.
> 
> On Wed, Apr 02, 2025 at 09:06:11AM +0800, Chen Yu <yu.c.chen@intel.com> wrote:
>> On system with NUMA balancing enabled, it is found that tracking
>> the task activities due to NUMA balancing is helpful.
> ...
>> The following two new fields:
>>
>> numa_task_migrated
>> numa_task_swapped
>>
>> will be displayed in both
>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
> 
> Why is the field /proc/$pid/sched not enough?

The /proc/$pid/sched accounting is only done when schedstats are
enabled. memcg users might want to track it separately without relying
on schedstats, which also enables a bunch of other scheduler-related
stats collection and adds more overhead.
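
To illustrate the memcg side, here is a minimal sketch of how a user could watch one of the new counters from memory.stat without touching schedstats. The function name memcg_stat_delta and the idea of comparing two saved snapshots are assumptions for illustration; the only thing taken from the patch is that the counters appear as "name value" lines in memory.stat.

```shell
# Difference of one memory.stat counter between two saved snapshots.
# Usage: memcg_stat_delta KEY BEFORE_FILE AFTER_FILE
memcg_stat_delta() {
    key=$1
    before=$2
    after=$3

    # memory.stat lines are "name value"; pick the value for the key.
    b=$(awk -v k="$key" '$1 == k { print $2 }' "$before")
    a=$(awk -v k="$key" '$1 == k { print $2 }' "$after")

    # A key missing from a snapshot is treated as 0.
    echo $(( ${a:-0} - ${b:-0} ))
}
```

One possible use is to copy /sys/fs/cgroup/{GROUP}/memory.stat to a file before and after a benchmark run, then call "memcg_stat_delta numa_task_migrated before.stat after.stat" to get the number of NUMA task migrations charged to that group during the run.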

> 
> Also, you may want to update Documentation/admin-guide/cgroup-v2.rst
> too.
> 
> Thanks,
> Michal
Madadi Vineeth Reddy April 2, 2025, 6:08 p.m. UTC | #6
On 02/04/25 22:53, K Prateek Nayak wrote:
> On 4/2/2025 7:03 PM, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 02/04/25 06:36, Chen Yu wrote:
>>> On system with NUMA balancing enabled, it is found that tracking
>>> the task activities due to NUMA balancing is helpful. NUMA balancing
>>> has two mechanisms for task migration: one is to migrate the task to
>>> an idle CPU in its preferred node, the other is to swap tasks on
>>> different nodes if they are on each other's preferred node.
>>>
>>> The kernel already has NUMA page migration statistics in
>>> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
>>> but does not have statistics for task migration/swap.
>>> Add the task migration and swap count accordingly.
>>>
>>> The following two new fields:
>>>
>>> numa_task_migrated
>>> numa_task_swapped
>>>
>>> will be displayed in both
>>> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>>
>> I applied this patch, but I still don't see the two new fields
>> in /proc/{PID}/sched.
>>
>> Am I missing any additional steps?
> 
> You also need to enable schedstats:
> 
> echo 1 > /proc/sys/kernel/sched_schedstats
> 
> After that it should be visible:

Thanks, Prateek! I had missed enabling schedstats. Now that it's enabled,
I can see the fields.

Thanks,
Madadi Vineeth Reddy 

> 
> $ cat /proc/4030/sched
> sched-messaging (4030, #threads: 641)
> -------------------------------------------------------------------
> se.exec_start                                :        283818.948537
> 
> ...
> 
> nr_forced_migrations                         :                    0
> numa_task_migrated                           :                    0
> numa_task_swapped                            :                    0
> nr_wakeups                                   :                    0
> 
> ...
>
Madadi Vineeth Reddy April 2, 2025, 6:50 p.m. UTC | #7
Hi Chen Yu,

On 02/04/25 06:36, Chen Yu wrote:
> On system with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
> 
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched.
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
> 
> The following two new fields:
> 
> numa_task_migrated
> numa_task_swapped
> 
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched

I was able to see the fields and their corresponding values for schbench:

numa_task_swapped                            :                    2
numa_task_migrated                           :                    0
numa_task_swapped                            :                    1
numa_task_migrated                           :                    0
numa_task_swapped                            :                    0
numa_task_migrated                           :                    0
numa_task_swapped                            :                    1

Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy

> Previous RFC version can be found here:
> https://lore.kernel.org/lkml/1847c5ef828ad4835a35e3a54b88d2e13bce0eea.1740483690.git.yu.c.chen@intel.com/
> 
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
> RFC->v1: Rename the nr_numa_task_migrated to
>          numa_task_migrated, and nr_numa_task_swapped
>          numa_task_swapped in /proc/{PID}/sched,
>          so both cgroup's memory.stat and task's
>          sched have the same field name.
> ---
>  include/linux/sched.h         |  4 ++++
>  include/linux/vm_event_item.h |  2 ++
>  kernel/sched/core.c           | 10 ++++++++--
>  kernel/sched/debug.c          |  4 ++++
>  mm/memcontrol.c               |  2 ++
>  mm/vmstat.c                   |  2 ++
>  6 files changed, 22 insertions(+), 2 deletions(-)

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0785268c76f8..9623e5300453 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -532,6 +532,10 @@  struct sched_statistics {
 	u64				nr_failed_migrations_running;
 	u64				nr_failed_migrations_hot;
 	u64				nr_forced_migrations;
+#ifdef CONFIG_NUMA_BALANCING
+	u64				numa_task_migrated;
+	u64				numa_task_swapped;
+#endif
 
 	u64				nr_wakeups;
 	u64				nr_wakeups_sync;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index f70d0958095c..aef817474781 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -64,6 +64,8 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NUMA_HINT_FAULTS,
 		NUMA_HINT_FAULTS_LOCAL,
 		NUMA_PAGE_MIGRATE,
+		NUMA_TASK_MIGRATE,
+		NUMA_TASK_SWAP,
 #endif
 #ifdef CONFIG_MIGRATION
 		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c86c05264719..314d5cbce2b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3348,6 +3348,11 @@  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
+	__schedstat_inc(p->stats.numa_task_swapped);
+
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
+
 	if (task_on_rq_queued(p)) {
 		struct rq *src_rq, *dst_rq;
 		struct rq_flags srf, drf;
@@ -7948,8 +7953,9 @@  int migrate_task_to(struct task_struct *p, int target_cpu)
 	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
 		return -EINVAL;
 
-	/* TODO: This is not properly updating schedstats */
-
+	__schedstat_inc(p->stats.numa_task_migrated);
+	if (p->mm)
+		count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
 	trace_sched_move_numa(p, curr_cpu, target_cpu);
 	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
 }
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..f971c2af7912 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1206,6 +1206,10 @@  void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 		P_SCHEDSTAT(nr_failed_migrations_running);
 		P_SCHEDSTAT(nr_failed_migrations_hot);
 		P_SCHEDSTAT(nr_forced_migrations);
+#ifdef CONFIG_NUMA_BALANCING
+		P_SCHEDSTAT(numa_task_migrated);
+		P_SCHEDSTAT(numa_task_swapped);
+#endif
 		P_SCHEDSTAT(nr_wakeups);
 		P_SCHEDSTAT(nr_wakeups_sync);
 		P_SCHEDSTAT(nr_wakeups_migrate);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4de6acb9b8ec..1656c90b2381 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -460,6 +460,8 @@  static const unsigned int memcg_vm_event_stat[] = {
 	NUMA_PAGE_MIGRATE,
 	NUMA_PTE_UPDATES,
 	NUMA_HINT_FAULTS,
+	NUMA_TASK_MIGRATE,
+	NUMA_TASK_SWAP,
 #endif
 };
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 16bfe1c694dd..7de1583a63c9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1339,6 +1339,8 @@  const char * const vmstat_text[] = {
 	"numa_hint_faults",
 	"numa_hint_faults_local",
 	"numa_pages_migrated",
+	"numa_task_migrated",
+	"numa_task_swapped",
 #endif
 #ifdef CONFIG_MIGRATION
 	"pgmigrate_success",