diff mbox series

[v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

Message ID 20240219041920.1183-1-byungchul@sk.com (mailing list archive)
State New
Headers show
Series [v4] sched/numa, mm: do not try to migrate memory to memoryless nodes | expand

Commit Message

Byungchul Park Feb. 19, 2024, 4:19 a.m. UTC
Changes from v3:
	1. Rewrite the comment in code and the commit message to make it
	   more clear. (feedbacked by Oscar Salvador)
	2. Add "Reviewed-by: Oscar Salvador <osalvador@suse.de>"

Changes from v2:
	1. Rewrite the comment in code and the commit message becasue it
	   turns out that this patch is not the real fix for the oops
	   descriped. The real fix goes in another patch below:

	   https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

Changes from v1:
	1. Trim the verbose oops in the commit message. (feedbacked by
	   Phil Auld)
	2. Rewrite a comment in code. (feedbacked by Phil Auld)

--->8---
From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
From: Byungchul Park <byungchul@sk.com>
Date: Mon, 19 Feb 2024 13:10:47 +0900
Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes

Memoryless nodes do not have any memory to migrate to, so stop trying
it.

Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Huang, Ying Feb. 19, 2024, 8:43 a.m. UTC | #1
Hi, Byungchul,

Byungchul Park <byungchul@sk.com> writes:

> Changes from v3:
> 	1. Rewrite the comment in code and the commit message to make it
> 	   more clear. (feedbacked by Oscar Salvador)
> 	2. Add "Reviewed-by: Oscar Salvador <osalvador@suse.de>"
>
> Changes from v2:
> 	1. Rewrite the comment in code and the commit message becasue it
> 	   turns out that this patch is not the real fix for the oops
> 	   descriped. The real fix goes in another patch below:
>
> 	   https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
>
> Changes from v1:
> 	1. Trim the verbose oops in the commit message. (feedbacked by
> 	   Phil Auld)
> 	2. Rewrite a comment in code. (feedbacked by Phil Auld)
>
> --->8---
>>From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
> From: Byungchul Park <byungchul@sk.com>
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes
>
> Memoryless nodes do not have any memory to migrate to, so stop trying
> it.
>
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..3e3b44ae72d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  	int dst_nid = cpu_to_node(dst_cpu);
>  	int last_cpupid, this_cpupid;
>  
> +	/*
> +	 * Cannot migrate to memoryless nodes.
> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
>  	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.

Good catch!

IIUC, you will use patch as fix to the issue in

https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

If so, we need the Fixes: tag to make it land in -stable properly.

Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Phil Auld Feb. 19, 2024, 3:06 p.m. UTC | #2
On Mon, Feb 19, 2024 at 01:19:20PM +0900 Byungchul Park wrote:
> Changes from v3:
> 	1. Rewrite the comment in code and the commit message to make it
> 	   more clear. (feedbacked by Oscar Salvador)
> 	2. Add "Reviewed-by: Oscar Salvador <osalvador@suse.de>"
> 
> Changes from v2:
> 	1. Rewrite the comment in code and the commit message becasue it
> 	   turns out that this patch is not the real fix for the oops
> 	   descriped. The real fix goes in another patch below:
> 
> 	   https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> 
> Changes from v1:
> 	1. Trim the verbose oops in the commit message. (feedbacked by
> 	   Phil Auld)
> 	2. Rewrite a comment in code. (feedbacked by Phil Auld)
> 
> --->8---
> From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001
> From: Byungchul Park <byungchul@sk.com>
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes
> 
> Memoryless nodes do not have any memory to migrate to, so stop trying
> it.
> 
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..3e3b44ae72d1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
>  	int dst_nid = cpu_to_node(dst_cpu);
>  	int last_cpupid, this_cpupid;
>  
> +	/*
> +	 * Cannot migrate to memoryless nodes.
> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
>  	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
> -- 
> 2.17.1
> 
> 

Sorry, hadn't gotten far enough to see this version when I replied to v3...

Reviewed-by: Phil Auld <pauld@redhat.com>


Cheers,
Phil

--
Andrew Morton Feb. 20, 2024, 1:45 a.m. UTC | #3
On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> > +	/*
> > +	 * Cannot migrate to memoryless nodes.
> > +	 */
> > +	if (!node_state(dst_nid, N_MEMORY))
> > +		return false;
> > +
> >  	/*
> >  	 * The pages in slow memory node should be migrated according
> >  	 * to hot/cold instead of private/shared.
> 
> Good catch!
> 
> IIUC, you will use patch as fix to the issue in
> 
> https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> 
> If so, we need the Fixes: tag to make it land in -stable properly.

Yes, this changelog is missing rather a lot of important information.

I pulled together the below, please check.


From: Byungchul Park <byungchul@sk.com>
Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
Date: Mon, 19 Feb 2024 13:10:47 +0900

With numa balancing on, when a numa system is running where a numa node
doesn't have its local memory so it has no managed zones, the following
oops has been observed.  It's because wakeup_kswapd() is called with a
wrong zone index, -1.  Fixed it by checking the index before calling
wakeup_kswapd().

> BUG: unable to handle page fault for address: 00000000000033f3
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> Code: (omitted)
> RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
>  <TASK>
> ? __die
> ? page_fault_oops
> ? __pte_offset_map_lock
> ? exc_page_fault
> ? asm_exc_page_fault
> ? wakeup_kswapd
> migrate_misplaced_page
> __handle_mm_fault
> handle_mm_fault
> do_user_addr_fault
> exc_page_fault
> asm_exc_page_fault
> RIP: 0033:0x55b897ba0808
> Code: (omitted)
> RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
>  </TASK>

Fix this by avoiding any attempt to migrate memory to memoryless nodes.

Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/sched/fair.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
+++ a/kernel/sched/fair.c
@@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
 	int last_cpupid, this_cpupid;
 
 	/*
+	 * Cannot migrate to memoryless nodes.
+	 */
+	if (!node_state(dst_nid, N_MEMORY))
+		return false;
+
+	/*
 	 * The pages in slow memory node should be migrated according
 	 * to hot/cold instead of private/shared.
 	 */
Byungchul Park Feb. 20, 2024, 1:53 a.m. UTC | #4
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> 
> > > +	/*
> > > +	 * Cannot migrate to memoryless nodes.
> > > +	 */
> > > +	if (!node_state(dst_nid, N_MEMORY))
> > > +		return false;
> > > +
> > >  	/*
> > >  	 * The pages in slow memory node should be migrated according
> > >  	 * to hot/cold instead of private/shared.
> > 
> > Good catch!
> > 
> > IIUC, you will use patch as fix to the issue in
> > 
> > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> > 
> > If so, we need the Fixes: tag to make it land in -stable properly.
> 
> Yes, this changelog is missing rather a lot of important information.

This is not the root cause fix any more but just optimization. That's
why I didn't add Fixes: tag and cc stable@vger.kernel.org in here.

Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real
fix patch. check the following link please:

https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

	Byungchul

> I pulled together the below, please check.
> 
> 
> From: Byungchul Park <byungchul@sk.com>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> 
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed.  It's because wakeup_kswapd() is called with a
> wrong zone index, -1.  Fixed it by checking the index before calling
> wakeup_kswapd().
> 
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> >  <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> >  </TASK>
> 
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
> 
> Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
> Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Cc: Benjamin Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  kernel/sched/fair.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
>  	int last_cpupid, this_cpupid;
>  
>  	/*
> +	 * Cannot migrate to memoryless nodes.
> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
> +	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
>  	 */
> _
Byungchul Park Feb. 20, 2024, 2:07 a.m. UTC | #5
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> 
> > > +	/*
> > > +	 * Cannot migrate to memoryless nodes.
> > > +	 */
> > > +	if (!node_state(dst_nid, N_MEMORY))
> > > +		return false;
> > > +
> > >  	/*
> > >  	 * The pages in slow memory node should be migrated according
> > >  	 * to hot/cold instead of private/shared.
> > 
> > Good catch!
> > 
> > IIUC, you will use patch as fix to the issue in
> > 
> > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> > 
> > If so, we need the Fixes: tag to make it land in -stable properly.
> 
> Yes, this changelog is missing rather a lot of important information.
> 
> I pulled together the below, please check.

Plus, two guys gave the tags for the optimization patch as well:

   Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
   Acked-by: David Hildenbrand <david@redhat.com>

	Byungchul

> From: Byungchul Park <byungchul@sk.com>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> 
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed.  It's because wakeup_kswapd() is called with a
> wrong zone index, -1.  Fixed it by checking the index before calling
> wakeup_kswapd().
> 
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> >  <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> >  </TASK>
> 
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
> 
> Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
> Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Cc: Benjamin Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  kernel/sched/fair.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
>  	int last_cpupid, this_cpupid;
>  
>  	/*
> +	 * Cannot migrate to memoryless nodes.
> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
> +	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
>  	 */
> _
Byungchul Park Feb. 20, 2024, 2:33 a.m. UTC | #6
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote:
> On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
> 
> > > +	/*
> > > +	 * Cannot migrate to memoryless nodes.
> > > +	 */
> > > +	if (!node_state(dst_nid, N_MEMORY))
> > > +		return false;
> > > +
> > >  	/*
> > >  	 * The pages in slow memory node should be migrated according
> > >  	 * to hot/cold instead of private/shared.
> > 
> > Good catch!
> > 
> > IIUC, you will use patch as fix to the issue in
> > 
> > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> > 
> > If so, we need the Fixes: tag to make it land in -stable properly.
> 
> Yes, this changelog is missing rather a lot of important information.
> 
> I pulled together the below, please check.

To make it more clear, I need to explain it more. I posted the following
two patches while resolving the oops issue. However, two are going on
for different purposes.

1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com

   I started this patch as the fix for the oops. However, I found the
   root cause comes from using -1 as an array index. So let the root 
   cause fix go with another thread, 2). Nevertheless, 1) is still
   necessary as a *reasonable optimization* but not the real fix any
   more.

2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com

   I found the root cause of the oops comes from using -1 as an array
   index. So moved all the oops message, Fixes: tag, and cc stable to
   here. Long story short, 2) is the *real fix* for the oops.

	Byungchul

> From: Byungchul Park <byungchul@sk.com>
> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes
> Date: Mon, 19 Feb 2024 13:10:47 +0900
> 
> With numa balancing on, when a numa system is running where a numa node
> doesn't have its local memory so it has no managed zones, the following
> oops has been observed.  It's because wakeup_kswapd() is called with a
> wrong zone index, -1.  Fixed it by checking the index before calling
> wakeup_kswapd().
> 
> > BUG: unable to handle page fault for address: 00000000000033f3
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> >    rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812)
> > Code: (omitted)
> > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286
> > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480
> > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff
> > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003
> > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940
> > FS:  00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > PKRU: 55555554
> > Call Trace:
> >  <TASK>
> > ? __die
> > ? page_fault_oops
> > ? __pte_offset_map_lock
> > ? exc_page_fault
> > ? asm_exc_page_fault
> > ? wakeup_kswapd
> > migrate_misplaced_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > RIP: 0033:0x55b897ba0808
> > Code: (omitted)
> > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287
> > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0
> > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0
> > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075
> > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
> > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000
> >  </TASK>
> 
> Fix this by avoiding any attempt to migrate memory to memoryless nodes.
> 
> Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
> Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
> Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system")
> Signed-off-by: Byungchul Park <byungchul@sk.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> Reviewed-by: Phil Auld <pauld@redhat.com>
> Cc: Benjamin Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  kernel/sched/fair.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes
> +++ a/kernel/sched/fair.c
> @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t
>  	int last_cpupid, this_cpupid;
>  
>  	/*
> +	 * Cannot migrate to memoryless nodes.
> +	 */
> +	if (!node_state(dst_nid, N_MEMORY))
> +		return false;
> +
> +	/*
>  	 * The pages in slow memory node should be migrated according
>  	 * to hot/cold instead of private/shared.
>  	 */
> _
Andrew Morton Feb. 20, 2024, 3:05 a.m. UTC | #7
On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <byungchul@sk.com> wrote:

> > > IIUC, you will use patch as fix to the issue in
> > > 
> > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> > > 
> > > If so, we need the Fixes: tag to make it land in -stable properly.
> > 
> > Yes, this changelog is missing rather a lot of important information.
> 
> This is not the root cause fix any more but just optimization.

It would have been helpful to have told us this in the changelog :(

> That's
> why I didn't add Fixes: tag and cc stable@vger.kernel.org in here.
> 
> Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real
> fix patch. check the following link please:
> 
> https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/

But doesn't this patch "sched/numa, mm: do not try to migrate memory to
memoryless nodes" also fix the bug?  Do we truly need both?
Byungchul Park Feb. 20, 2024, 3:20 a.m. UTC | #8
On Mon, Feb 19, 2024 at 07:05:20PM -0800, Andrew Morton wrote:
> On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <byungchul@sk.com> wrote:
> 
> > > > IIUC, you will use patch as fix to the issue in
> > > > 
> > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> > > > 
> > > > If so, we need the Fixes: tag to make it land in -stable properly.
> > > 
> > > Yes, this changelog is missing rather a lot of important information.
> > 
> > This is not the root cause fix any more but just optimization.
> 
> It would have been helpful to have told us this in the changelog :(

Sorry for that and making you guys confused.

> > That's
> > why I didn't add Fixes: tag and cc stable@vger.kernel.org in here.
> > 
> > Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real
> > fix patch. check the following link please:
> > 
> > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/
> 
> But doesn't this patch "sched/numa, mm: do not try to migrate memory to
> memoryless nodes" also fix the bug?  Do we truly need both?

Yes. The oops is gone with "sched/numa, mm: do not try to migrate memory
to memoryless nodes". However, as you know, wrongly manipulating array
index is very dangerous - what hackers are looking for. Even with security
in mind, both are necessary. Plus, no one gurantees the problematic code
is gone through with a numa node that has no managed zones in the future.

	Byungchul
Andrew Morton Feb. 20, 2024, 3:28 a.m. UTC | #9
On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <byungchul@sk.com> wrote:

> > Yes, this changelog is missing rather a lot of important information.
> > 
> > I pulled together the below, please check.
> 
> To make it more clear, I need to explain it more. I posted the following
> two patches while resolving the oops issue. However, two are going on
> for different purposes.
> 
> 1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
> 
>    I started this patch as the fix for the oops. However, I found the
>    root cause comes from using -1 as an array index. So let the root 
>    cause fix go with another thread, 2). Nevertheless, 1) is still
>    necessary as a *reasonable optimization* but not the real fix any
>    more.

Well I altered this patch's changelog to tell readers that it is an
optimization.  But one does wonder why it isn't simply a bugfix. 
Attempting to migrate to a memoryless node is clearly as error. 
Presumably the called code handles it somehow, but in what fashion and
at what cost?

> 2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
> 
>    I found the root cause of the oops comes from using -1 as an array
>    index. So moved all the oops message, Fixes: tag, and cc stable to
>    here. Long story short, 2) is the *real fix* for the oops.
>
Byungchul Park Feb. 20, 2024, 4:09 a.m. UTC | #10
On Mon, Feb 19, 2024 at 07:28:41PM -0800, Andrew Morton wrote:
> On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <byungchul@sk.com> wrote:
> 
> > > Yes, this changelog is missing rather a lot of important information.
> > > 
> > > I pulled together the below, please check.
> > 
> > To make it more clear, I need to explain it more. I posted the following
> > two patches while resolving the oops issue. However, two are going on
> > for different purposes.
> > 
> > 1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com
> > 
> >    I started this patch as the fix for the oops. However, I found the
> >    root cause comes from using -1 as an array index. So let the root 
> >    cause fix go with another thread, 2). Nevertheless, 1) is still
> >    necessary as a *reasonable optimization* but not the real fix any
> >    more.
> 
> Well I altered this patch's changelog to tell readers that it is an
> optimization.  But one does wonder why it isn't simply a bugfix. 
> Attempting to migrate to a memoryless node is clearly as error. 

I agree with what Oscar Salvador said:

   "As this is not a bug fix but an optimization, as we will fail anyways
   in migrate_misplaced_folio() when migrate_balanced_pgdat() notices
   that we do not have any memory on that node."

   https://lore.kernel.org/lkml/ZdG1yO29WTyRiw8Q@localhost.localdomain/
   
So assuming all the related code works correctly, the migration will
safely fail even without this optimization patch.

	Byungchul

> Presumably the called code handles it somehow, but in what fashion and
> at what cost?
> 
> > 2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com
> > 
> >    I found the root cause of the oops comes from using -1 as an array
> >    index. So moved all the oops message, Fixes: tag, and cc stable to
> >    here. Long story short, 2) is the *real fix* for the oops.
> >
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..3e3b44ae72d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1828,6 +1828,12 @@  bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio,
 	int dst_nid = cpu_to_node(dst_cpu);
 	int last_cpupid, this_cpupid;
 
+	/*
+	 * Cannot migrate to memoryless nodes.
+	 */
+	if (!node_state(dst_nid, N_MEMORY))
+		return false;
+
 	/*
 	 * The pages in slow memory node should be migrated according
 	 * to hot/cold instead of private/shared.