Message ID | 20240219041920.1183-1-byungchul@sk.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v4] sched/numa, mm: do not try to migrate memory to memoryless nodes | expand |
Hi, Byungchul, Byungchul Park <byungchul@sk.com> writes: > Changes from v3: > 1. Rewrite the comment in code and the commit message to make it > more clear. (feedbacked by Oscar Salvador) > 2. Add "Reviewed-by: Oscar Salvador <osalvador@suse.de>" > > Changes from v2: > 1. Rewrite the comment in code and the commit message becasue it > turns out that this patch is not the real fix for the oops > descriped. The real fix goes in another patch below: > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > Changes from v1: > 1. Trim the verbose oops in the commit message. (feedbacked by > Phil Auld) > 2. Rewrite a comment in code. (feedbacked by Phil Auld) > > --->8--- >>From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001 > From: Byungchul Park <byungchul@sk.com> > Date: Mon, 19 Feb 2024 13:10:47 +0900 > Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes > > Memoryless nodes do not have any memory to migrate to, so stop trying > it. > > Signed-off-by: Byungchul Park <byungchul@sk.com> > Reviewed-by: Oscar Salvador <osalvador@suse.de> > --- > kernel/sched/fair.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d7a3c63a2171..3e3b44ae72d1 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, > int dst_nid = cpu_to_node(dst_cpu); > int last_cpupid, this_cpupid; > > + /* > + * Cannot migrate to memoryless nodes. > + */ > + if (!node_state(dst_nid, N_MEMORY)) > + return false; > + > /* > * The pages in slow memory node should be migrated according > * to hot/cold instead of private/shared. Good catch! IIUC, you will use patch as fix to the issue in https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ If so, we need the Fixes: tag to make it land in -stable properly. Feel free to add Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
On Mon, Feb 19, 2024 at 01:19:20PM +0900 Byungchul Park wrote: > Changes from v3: > 1. Rewrite the comment in code and the commit message to make it > more clear. (feedbacked by Oscar Salvador) > 2. Add "Reviewed-by: Oscar Salvador <osalvador@suse.de>" > > Changes from v2: > 1. Rewrite the comment in code and the commit message becasue it > turns out that this patch is not the real fix for the oops > descriped. The real fix goes in another patch below: > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > Changes from v1: > 1. Trim the verbose oops in the commit message. (feedbacked by > Phil Auld) > 2. Rewrite a comment in code. (feedbacked by Phil Auld) > > --->8--- > From 98f5d472c08e3ed5b9b6543290d392a2e50fcf3c Mon Sep 17 00:00:00 2001 > From: Byungchul Park <byungchul@sk.com> > Date: Mon, 19 Feb 2024 13:10:47 +0900 > Subject: [PATCH v4] sched/numa, mm: do not try to migrate memory to memoryless nodes > > Memoryless nodes do not have any memory to migrate to, so stop trying > it. > > Signed-off-by: Byungchul Park <byungchul@sk.com> > Reviewed-by: Oscar Salvador <osalvador@suse.de> > --- > kernel/sched/fair.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index d7a3c63a2171..3e3b44ae72d1 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, > int dst_nid = cpu_to_node(dst_cpu); > int last_cpupid, this_cpupid; > > + /* > + * Cannot migrate to memoryless nodes. > + */ > + if (!node_state(dst_nid, N_MEMORY)) > + return false; > + > /* > * The pages in slow memory node should be migrated according > * to hot/cold instead of private/shared. > -- > 2.17.1 > > Sorry, hadn't gotten far enough to see this version when I replied to v3... Reviewed-by: Phil Auld <pauld@redhat.com> Cheers, Phil --
On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > > + /* > > + * Cannot migrate to memoryless nodes. > > + */ > > + if (!node_state(dst_nid, N_MEMORY)) > > + return false; > > + > > /* > > * The pages in slow memory node should be migrated according > > * to hot/cold instead of private/shared. > > Good catch! > > IIUC, you will use patch as fix to the issue in > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > If so, we need the Fixes: tag to make it land in -stable properly. Yes, this changelog is missing rather a lot of important information. I pulled together the below, please check. From: Byungchul Park <byungchul@sk.com> Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes Date: Mon, 19 Feb 2024 13:10:47 +0900 With numa balancing on, when a numa system is running where a numa node doesn't have its local memory so it has no managed zones, the following oops has been observed. It's because wakeup_kswapd() is called with a wrong zone index, -1. Fixed it by checking the index before calling wakeup_kswapd(). > BUG: unable to handle page fault for address: 00000000000033f3 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812) > Code: (omitted) > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286 > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003 > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480 > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003 > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940 > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > PKRU: 55555554 > Call Trace: > <TASK> > ? __die > ? page_fault_oops > ? __pte_offset_map_lock > ? exc_page_fault > ? asm_exc_page_fault > ? wakeup_kswapd > migrate_misplaced_page > __handle_mm_fault > handle_mm_fault > do_user_addr_fault > exc_page_fault > asm_exc_page_fault > RIP: 0033:0x55b897ba0808 > Code: (omitted) > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287 > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0 > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0 > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075 > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000 > </TASK> Fix this by avoiding any attempt to migrate memory to memoryless nodes. Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system") Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Phil Auld <pauld@redhat.com> Cc: Benjamin Segall <bsegall@google.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- kernel/sched/fair.c | 6 ++++++ 1 file changed, 6 insertions(+) --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes +++ a/kernel/sched/fair.c @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t int last_cpupid, this_cpupid; /* + * Cannot migrate to memoryless nodes. + */ + if (!node_state(dst_nid, N_MEMORY)) + return false; + + /* * The pages in slow memory node should be migrated according * to hot/cold instead of private/shared. */
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote: > On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > > > > + /* > > > + * Cannot migrate to memoryless nodes. > > > + */ > > > + if (!node_state(dst_nid, N_MEMORY)) > > > + return false; > > > + > > > /* > > > * The pages in slow memory node should be migrated according > > > * to hot/cold instead of private/shared. > > > > Good catch! > > > > IIUC, you will use patch as fix to the issue in > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > > > If so, we need the Fixes: tag to make it land in -stable properly. > > Yes, this changelog is missing rather a lot of important information. This is not the root cause fix any more but just optimization. That's why I didn't add Fixes: tag and cc stable@vger.kernel.org in here. Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real fix patch. check the following link please: https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ Byungchul > I pulled together the below, please check. > > > From: Byungchul Park <byungchul@sk.com> > Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes > Date: Mon, 19 Feb 2024 13:10:47 +0900 > > With numa balancing on, when a numa system is running where a numa node > doesn't have its local memory so it has no managed zones, the following > oops has been observed. It's because wakeup_kswapd() is called with a > wrong zone index, -1. Fixed it by checking the index before calling > wakeup_kswapd(). > > > BUG: unable to handle page fault for address: 00000000000033f3 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] PREEMPT SMP NOPTI > > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812) > > Code: (omitted) > > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286 > > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003 > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480 > > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff > > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003 > > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940 > > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > PKRU: 55555554 > > Call Trace: > > <TASK> > > ? __die > > ? page_fault_oops > > ? __pte_offset_map_lock > > ? exc_page_fault > > ? asm_exc_page_fault > > ? wakeup_kswapd > > migrate_misplaced_page > > __handle_mm_fault > > handle_mm_fault > > do_user_addr_fault > > exc_page_fault > > asm_exc_page_fault > > RIP: 0033:0x55b897ba0808 > > Code: (omitted) > > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287 > > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0 > > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0 > > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075 > > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 > > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000 > > </TASK> > > Fix this by avoiding any attempt to migrate memory to memoryless nodes. > > Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com > Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com > Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system") > Signed-off-by: Byungchul Park <byungchul@sk.com> > Reviewed-by: Oscar Salvador <osalvador@suse.de> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Reviewed-by: Phil Auld <pauld@redhat.com> > Cc: Benjamin Segall <bsegall@google.com> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: <stable@vger.kernel.org> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > kernel/sched/fair.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes > +++ a/kernel/sched/fair.c > @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t > int last_cpupid, this_cpupid; > > /* > + * Cannot migrate to memoryless nodes. > + */ > + if (!node_state(dst_nid, N_MEMORY)) > + return false; > + > + /* > * The pages in slow memory node should be migrated according > * to hot/cold instead of private/shared. > */ > _
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote: > On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > > > > + /* > > > + * Cannot migrate to memoryless nodes. > > > + */ > > > + if (!node_state(dst_nid, N_MEMORY)) > > > + return false; > > > + > > > /* > > > * The pages in slow memory node should be migrated according > > > * to hot/cold instead of private/shared. > > > > Good catch! > > > > IIUC, you will use patch as fix to the issue in > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > > > If so, we need the Fixes: tag to make it land in -stable properly. > > Yes, this changelog is missing rather a lot of important information. > > I pulled together the below, please check. Plus, two guys gave the tags for the optimization patch as well: Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Byungchul > From: Byungchul Park <byungchul@sk.com> > Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes > Date: Mon, 19 Feb 2024 13:10:47 +0900 > > With numa balancing on, when a numa system is running where a numa node > doesn't have its local memory so it has no managed zones, the following > oops has been observed. It's because wakeup_kswapd() is called with a > wrong zone index, -1. Fixed it by checking the index before calling > wakeup_kswapd(). > > > BUG: unable to handle page fault for address: 00000000000033f3 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] PREEMPT SMP NOPTI > > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812) > > Code: (omitted) > > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286 > > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003 > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480 > > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff > > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003 > > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940 > > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > PKRU: 55555554 > > Call Trace: > > <TASK> > > ? __die > > ? page_fault_oops > > ? __pte_offset_map_lock > > ? exc_page_fault > > ? asm_exc_page_fault > > ? wakeup_kswapd > > migrate_misplaced_page > > __handle_mm_fault > > handle_mm_fault > > do_user_addr_fault > > exc_page_fault > > asm_exc_page_fault > > RIP: 0033:0x55b897ba0808 > > Code: (omitted) > > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287 > > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0 > > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0 > > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075 > > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 > > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000 > > </TASK> > > Fix this by avoiding any attempt to migrate memory to memoryless nodes. > > Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com > Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com > Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system") > Signed-off-by: Byungchul Park <byungchul@sk.com> > Reviewed-by: Oscar Salvador <osalvador@suse.de> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Reviewed-by: Phil Auld <pauld@redhat.com> > Cc: Benjamin Segall <bsegall@google.com> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: <stable@vger.kernel.org> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > kernel/sched/fair.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes > +++ a/kernel/sched/fair.c > @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t > int last_cpupid, this_cpupid; > > /* > + * Cannot migrate to memoryless nodes. > + */ > + if (!node_state(dst_nid, N_MEMORY)) > + return false; > + > + /* > * The pages in slow memory node should be migrated according > * to hot/cold instead of private/shared. > */ > _
On Mon, Feb 19, 2024 at 05:45:08PM -0800, Andrew Morton wrote: > On Mon, 19 Feb 2024 16:43:36 +0800 "Huang, Ying" <ying.huang@intel.com> wrote: > > > > + /* > > > + * Cannot migrate to memoryless nodes. > > > + */ > > > + if (!node_state(dst_nid, N_MEMORY)) > > > + return false; > > > + > > > /* > > > * The pages in slow memory node should be migrated according > > > * to hot/cold instead of private/shared. > > > > Good catch! > > > > IIUC, you will use patch as fix to the issue in > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > > > If so, we need the Fixes: tag to make it land in -stable properly. > > Yes, this changelog is missing rather a lot of important information. > > I pulled together the below, please check. To make it more clear, I need to explain it more. I posted the following two patches while resolving the oops issue. However, two are going on for different purposes. 1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com I started this patch as the fix for the oops. However, I found the root cause comes from using -1 as an array index. So let the root cause fix go with another thread, 2). Nevertheless, 1) is still necessary as a *reasonable optimization* but not the real fix any more. 2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com I found the root cause of the oops comes from using -1 as an array index. So moved all the oops message, Fixes: tag, and cc stable to here. Long story short, 2) is the *real fix* for the oops. Byungchul > From: Byungchul Park <byungchul@sk.com> > Subject: sched/numa, mm: do not try to migrate memory to memoryless nodes > Date: Mon, 19 Feb 2024 13:10:47 +0900 > > With numa balancing on, when a numa system is running where a numa node > doesn't have its local memory so it has no managed zones, the following > oops has been observed. It's because wakeup_kswapd() is called with a > wrong zone index, -1. Fixed it by checking the index before calling > wakeup_kswapd(). > > > BUG: unable to handle page fault for address: 00000000000033f3 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] PREEMPT SMP NOPTI > > CPU: 2 PID: 895 Comm: masim Not tainted 6.6.0-dirty #255 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 > > RIP: 0010:wakeup_kswapd (./linux/mm/vmscan.c:7812) > > Code: (omitted) > > RSP: 0000:ffffc90004257d58 EFLAGS: 00010286 > > RAX: ffffffffffffffff RBX: ffff88883fff0480 RCX: 0000000000000003 > > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88883fff0480 > > RBP: ffffffffffffffff R08: ff0003ffffffffff R09: ffffffffffffffff > > R10: ffff888106c95540 R11: 0000000055555554 R12: 0000000000000003 > > R13: 0000000000000000 R14: 0000000000000000 R15: ffff88883fff0940 > > FS: 00007fc4b8124740(0000) GS:ffff888827c00000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: 00000000000033f3 CR3: 000000026cc08004 CR4: 0000000000770ee0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > PKRU: 55555554 > > Call Trace: > > <TASK> > > ? __die > > ? page_fault_oops > > ? __pte_offset_map_lock > > ? exc_page_fault > > ? asm_exc_page_fault > > ? wakeup_kswapd > > migrate_misplaced_page > > __handle_mm_fault > > handle_mm_fault > > do_user_addr_fault > > exc_page_fault > > asm_exc_page_fault > > RIP: 0033:0x55b897ba0808 > > Code: (omitted) > > RSP: 002b:00007ffeefa821a0 EFLAGS: 00010287 > > RAX: 000055b89983acd0 RBX: 00007ffeefa823f8 RCX: 000055b89983acd0 > > RDX: 00007fc2f8122010 RSI: 0000000000020000 RDI: 000055b89983acd0 > > RBP: 00007ffeefa821a0 R08: 0000000000000037 R09: 0000000000000075 > > R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 > > R13: 00007ffeefa82410 R14: 000055b897ba5dd8 R15: 00007fc4b8340000 > > </TASK> > > Fix this by avoiding any attempt to migrate memory to memoryless nodes. > > Link: https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com > Link: https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com > Fixes: c574bbe917036 ("NUMA balancing: optimize page placement for memory tiering system") > Signed-off-by: Byungchul Park <byungchul@sk.com> > Reviewed-by: Oscar Salvador <osalvador@suse.de> > Reviewed-by: "Huang, Ying" <ying.huang@intel.com> > Reviewed-by: Phil Auld <pauld@redhat.com> > Cc: Benjamin Segall <bsegall@google.com> > Cc: Daniel Bristot de Oliveira <bristot@redhat.com> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Juri Lelli <juri.lelli@redhat.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Steven Rostedt <rostedt@goodmis.org> > Cc: Valentin Schneider <vschneid@redhat.com> > Cc: Vincent Guittot <vincent.guittot@linaro.org> > Cc: <stable@vger.kernel.org> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > kernel/sched/fair.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > --- a/kernel/sched/fair.c~sched-numa-mm-do-not-try-to-migrate-memory-to-memoryless-nodes > +++ a/kernel/sched/fair.c > @@ -1831,6 +1831,12 @@ bool should_numa_migrate_memory(struct t > int last_cpupid, this_cpupid; > > /* > + * Cannot migrate to memoryless nodes. > + */ > + if (!node_state(dst_nid, N_MEMORY)) > + return false; > + > + /* > * The pages in slow memory node should be migrated according > * to hot/cold instead of private/shared. > */ > _
On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <byungchul@sk.com> wrote: > > > IIUC, you will use patch as fix to the issue in > > > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > > > > > If so, we need the Fixes: tag to make it land in -stable properly. > > > > Yes, this changelog is missing rather a lot of important information. > > This is not the root cause fix any more but just optimization. It would have been helpful to have told us this in the changelog :( > That's > why I didn't add Fixes: tag and cc stable@vger.kernel.org in here. > > Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real > fix patch. check the following link please: > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ But doesn't this patch "sched/numa, mm: do not try to migrate memory to memoryless nodes" also fix the bug? Do we truly need both?
On Mon, Feb 19, 2024 at 07:05:20PM -0800, Andrew Morton wrote: > On Tue, 20 Feb 2024 10:53:43 +0900 Byungchul Park <byungchul@sk.com> wrote: > > > > > IIUC, you will use patch as fix to the issue in > > > > > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > > > > > > > If so, we need the Fixes: tag to make it land in -stable properly. > > > > > > Yes, this changelog is missing rather a lot of important information. > > > > This is not the root cause fix any more but just optimization. > > It would have been helpful to have told us this in the changelog :( Sorry for that and making you guys confused. > > That's > > why I didn't add Fixes: tag and cc stable@vger.kernel.org in here. > > > > Instead, I added Fixes: tag and cc'ed stable@vger.kernel.org in the real > > fix patch. check the following link please: > > > > https://lore.kernel.org/lkml/20240216111502.79759-1-byungchul@sk.com/ > > But doesn't this patch "sched/numa, mm: do not try to migrate memory to > memoryless nodes" also fix the bug? Do we truly need both? Yes. The oops is gone with "sched/numa, mm: do not try to migrate memory to memoryless nodes". However, as you know, wrongly manipulating array index is very dangerous - what hackers are looking for. Even with security in mind, both are necessary. Plus, no one gurantees the problematic code is gone through with a numa node that has no managed zones in the future. Byungchul
On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <byungchul@sk.com> wrote: > > Yes, this changelog is missing rather a lot of important information. > > > > I pulled together the below, please check. > > To make it more clear, I need to explain it more. I posted the following > two patches while resolving the oops issue. However, two are going on > for different purposes. > > 1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com > > I started this patch as the fix for the oops. However, I found the > root cause comes from using -1 as an array index. So let the root > cause fix go with another thread, 2). Nevertheless, 1) is still > necessary as a *reasonable optimization* but not the real fix any > more. Well I altered this patch's changelog to tell readers that it is an optimization. But one does wonder why it isn't simply a bugfix. Attempting to migrate to a memoryless node is clearly as error. Presumably the called code handles it somehow, but in what fashion and at what cost? > 2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com > > I found the root cause of the oops comes from using -1 as an array > index. So moved all the oops message, Fixes: tag, and cc stable to > here. Long story short, 2) is the *real fix* for the oops. >
On Mon, Feb 19, 2024 at 07:28:41PM -0800, Andrew Morton wrote: > On Tue, 20 Feb 2024 11:33:04 +0900 Byungchul Park <byungchul@sk.com> wrote: > > > > Yes, this changelog is missing rather a lot of important information. > > > > > > I pulled together the below, please check. > > > > To make it more clear, I need to explain it more. I posted the following > > two patches while resolving the oops issue. However, two are going on > > for different purposes. > > > > 1) https://lkml.kernel.org/r/20240219041920.1183-1-byungchul@sk.com > > > > I started this patch as the fix for the oops. However, I found the > > root cause comes from using -1 as an array index. So let the root > > cause fix go with another thread, 2). Nevertheless, 1) is still > > necessary as a *reasonable optimization* but not the real fix any > > more. > > Well I altered this patch's changelog to tell readers that it is an > optimization. But one does wonder why it isn't simply a bugfix. > Attempting to migrate to a memoryless node is clearly as error. I agree with what Oscar Salvador said: "As this is not a bug fix but an optimization, as we will fail anyways in migrate_misplaced_folio() when migrate_balanced_pgdat() notices that we do not have any memory on that node." https://lore.kernel.org/lkml/ZdG1yO29WTyRiw8Q@localhost.localdomain/ So assuming all the related code works correctly, the migration will safely fail even without this optimization patch. Byungchul > Presumably the called code handles it somehow, but in what fashion and > at what cost? > > > 2) https://lkml.kernel.org/r/20240216111502.79759-1-byungchul@sk.com > > > > I found the root cause of the oops comes from using -1 as an array > > index. So moved all the oops message, Fixes: tag, and cc stable to > > here. Long story short, 2) is the *real fix* for the oops. > >
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d7a3c63a2171..3e3b44ae72d1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1828,6 +1828,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct folio *folio, int dst_nid = cpu_to_node(dst_cpu); int last_cpupid, this_cpupid; + /* + * Cannot migrate to memoryless nodes. + */ + if (!node_state(dst_nid, N_MEMORY)) + return false; + /* * The pages in slow memory node should be migrated according * to hot/cold instead of private/shared.