Message ID | 20211101201312.11589-1-amakhalov@vmware.com (mailing list archive) |
---|---|
State | New |
Series | mm: fix panic in __alloc_pages |
On Mon, Nov 01, 2021 at 01:13:12PM -0700, Alexey Makhalov wrote:
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +	    !(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);

I don't think it's a great idea to push node_online() and the gfp check
into the caller. Can't we put this check in __alloc_pages_bulk() instead?
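[Note: for illustration, a minimal sketch of the alternative Wilcox suggests
-- hypothetical placement inside __alloc_pages_bulk() in mm/page_alloc.c,
assuming the same fallback policy as the posted patch; this hunk is not part
of the series:

	/* Every caller of the bulk allocator would then get the
	 * offline-node fallback without duplicating the check in the
	 * gfp.h wrappers. */
	if (preferred_nid == NUMA_NO_NODE ||
	    (!node_online(preferred_nid) && !(gfp & __GFP_THISNODE)))
		preferred_nid = numa_mem_id();
]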
[CC Oscar and David]

On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
> There is a kernel panic caused by __alloc_pages() accessing
> uninitialized NODE_DATA(nid). Uninitialized node data exists
> during the time when a CPU with a memoryless node was added but
> not onlined yet. The panic can be easily reproduced by disabling
> the udev rule for automatic onlining of hot added CPUs, followed
> by a hot add of a CPU with a memoryless node.
>
> This is a panic caused by percpu code doing allocations for
> all possible CPUs and hitting this issue:
>
> CPU2 has been hot-added
> BUG: unable to handle page fault for address: 0000000000001608
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 0 PID: 1 Comm: systemd Tainted: G E 5.15.0-rc7+ #11
> Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW
>
> RIP: 0010:__alloc_pages+0x127/0x290

Could you resolve this into a specific line of the source code please?

> [...]
>
> Call Trace:
>  pcpu_alloc_pages.constprop.0+0xe4/0x1c0
>  pcpu_populate_chunk+0x33/0xb0
>  pcpu_alloc+0x4d3/0x6f0
>  __alloc_percpu_gfp+0xd/0x10
>  alloc_mem_cgroup_per_node_info+0x54/0xb0
>  mem_cgroup_alloc+0xed/0x2f0
>  mem_cgroup_css_alloc+0x33/0x2f0
>  css_create+0x3a/0x1f0
>  cgroup_apply_control_enable+0x12b/0x150
>  cgroup_mkdir+0xdd/0x110
>  kernfs_iop_mkdir+0x4f/0x80
>  vfs_mkdir+0x178/0x230
>  do_mkdirat+0xfd/0x120
>  __x64_sys_mkdir+0x47/0x70
>  ? syscall_exit_to_user_mode+0x21/0x50
>  do_syscall_64+0x43/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> A node can be in one of the following states:
> 1. not present (nid == NUMA_NO_NODE)
> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>    NODE_DATA(nid) == NULL)
> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>    NODE_DATA(nid) != NULL)
>
> The alloc_pages_node() and alloc_pages_bulk_array_node() functions
> verify nid validity only and do not check whether the nid is online.
> The enhanced check makes it possible to handle page allocation when
> the node is in state 2.

I do not think this is a correct approach. We should make sure that the
proper fallback node is used instead. This means that the zone list is
initialized properly. IIRC this has been a problem in the past and it
has been fixed. The initialization code is quite subtle though so it is
possible that this got broken again.
> Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: stable@vger.kernel.org
> ---
>  include/linux/gfp.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f9..34a5a7def 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
>  static inline unsigned long
>  alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +	    !(gfp & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>
>  	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
> @@ -578,7 +579,8 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  					    unsigned int order)
>  {
> -	if (nid == NUMA_NO_NODE)
> +	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
> +	    !(gfp_mask & __GFP_THISNODE)))
>  		nid = numa_mem_id();
>
>  	return __alloc_pages_node(nid, gfp_mask, order);
> --
> 2.30.0
On 02.11.21 08:47, Michal Hocko wrote:
> [CC Oscar and David]
>
> On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
>> There is a kernel panic caused by __alloc_pages() accessing
>> uninitialized NODE_DATA(nid). Uninitialized node data exists
>> during the time when a CPU with a memoryless node was added but
>> not onlined yet.
>> [...]
>> A node can be in one of the following states:
>> 1. not present (nid == NUMA_NO_NODE)
>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>    NODE_DATA(nid) == NULL)
>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>    NODE_DATA(nid) != NULL)
>>
>> The alloc_pages_node() and alloc_pages_bulk_array_node() functions
>> verify nid validity only and do not check whether the nid is online.
>
> I do not think this is a correct approach. We should make sure that the
> proper fallback node is used instead. This means that the zone list is
> initialized properly. IIRC this has been a problem in the past and it
> has been fixed. The initialization code is quite subtle though so it is
> possible that this got broken again.

I'm a little confused:

In add_memory_resource() we hotplug the new node if required and set it
online. Memory might get onlined later, via online_pages().

So after add_memory_resource()->__try_online_node() succeeded, we have
an online pgdat -- essentially 3.

This patch detects if we're past 3. but says that it reproduced by
disabling *memory* onlining.

Before we online memory for a hotplugged node, all zones are !populated.
So once we online memory for a !populated zone in online_pages(), we
trigger setup_zone_pageset().

The confusing part is that this patch checks for 3. but says it can be
reproduced by not onlining *memory*. There seems to be something missing.
Do we maybe need a proper populated_zone() check before accessing zone
data?
On 11/2/21, 1:12 AM, "David Hildenbrand" <david@redhat.com> wrote:

Thanks for reviews,

On 02.11.21 08:47, Michal Hocko wrote:
> [CC Oscar and David]
>
> On Mon 01-11-21 13:13:12, Alexey Makhalov wrote:
>> There is a kernel panic caused by __alloc_pages() accessing
>> uninitialized NODE_DATA(nid).
>> [...]
>
> I do not think this is a correct approach. We should make sure that the
> proper fallback node is used instead. [...]

This approach behaves in the same way as if the CPU was not yet added
(state #1). So, we can think of state #2 as state #1 when the CPU is not
present.

I'm a little confused:

In add_memory_resource() we hotplug the new node if required and set it
online. Memory might get onlined later, via online_pages().

You are correct. In case of memory hot add, it is true. But in case of
adding a CPU with a memoryless node, try_online_node() will be called only
during CPU onlining, see cpu_up().

Is there any reason why try_online_node() resides in cpu_up() and not in
add_cpu()? I think it would be correct to online the node during CPU hot
add to align with memory hot add.

So after add_memory_resource()->__try_online_node() succeeded, we have
an online pgdat -- essentially 3.

This patch detects if we're past 3. but says that it reproduced by
disabling *memory* onlining.

This is the hot adding of both a new CPU and a new _memoryless_ node (with
CPU only). And onlining the CPU makes its node online. Disabling CPU
onlining puts the new node into state #2, which leads to the repro.

Before we online memory for a hotplugged node, all zones are !populated.
So once we online memory for a !populated zone in online_pages(), we
trigger setup_zone_pageset().

The confusing part is that this patch checks for 3. but says it can be
reproduced by not onlining *memory*. There seems to be something missing.

Do we maybe need a proper populated_zone() check before accessing zone data?

Thanks,
--Alexey
It is hard to follow your reply as your email client is not quoting
properly. Let me try to reconstruct.

On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
> On 02.11.21 08:47, Michal Hocko wrote:
[...]
>>>> CPU2 has been hot-added
>>>> BUG: unable to handle page fault for address: 0000000000001608
>>>> [...]
>>>> RIP: 0010:__alloc_pages+0x127/0x290
>>>
>>> Could you resolve this into a specific line of the source code please?

This got probably unnoticed. I would be really curious whether this is
a broken zonelist or something else.

>>>> A node can be in one of the following states:
>>>> 1. not present (nid == NUMA_NO_NODE)
>>>> 2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
>>>>    NODE_DATA(nid) == NULL)
>>>> 3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
>>>>    NODE_DATA(nid) != NULL)
>>>>
>>>> The alloc_pages_node() and alloc_pages_bulk_array_node() functions
>>>> verify nid validity only and do not check whether the nid is online.
>>>
>>> I do not think this is a correct approach. We should make sure that the
>>> proper fallback node is used instead. This means that the zone list is
>>> initialized properly. IIRC this has been a problem in the past and it
>>> has been fixed. The initialization code is quite subtle though so it is
>>> possible that this got broken again.

> This approach behaves in the same way as if the CPU was not yet added
> (state #1). So, we can think of state #2 as state #1 when the CPU is not
> present.

>> I'm a little confused:
>>
>> In add_memory_resource() we hotplug the new node if required and set it
>> online. Memory might get onlined later, via online_pages().
>
> You are correct. In case of memory hot add, it is true. But in case of
> adding a CPU with a memoryless node, try_online_node() will be called only
> during CPU onlining, see cpu_up().
>
> Is there any reason why try_online_node() resides in cpu_up() and not in
> add_cpu()? I think it would be correct to online the node during CPU hot
> add to align with memory hot add.

I am not familiar with cpu hotplug, but this doesn't seem to be anything
new so how come this became a problem only now?

>> So after add_memory_resource()->__try_online_node() succeeded, we have
>> an online pgdat -- essentially 3.
>
> This is the hot adding of both a new CPU and a new _memoryless_ node (with
> CPU only). And onlining the CPU makes its node online. Disabling CPU
> onlining puts the new node into state #2, which leads to the repro.

>> Before we online memory for a hotplugged node, all zones are !populated.
>> So once we online memory for a !populated zone in online_pages(), we
>> trigger setup_zone_pageset().
>>
>> The confusing part is that this patch checks for 3. but says it can be
>> reproduced by not onlining *memory*. There seems to be something missing.
>
> Do we maybe need a proper populated_zone() check before accessing zone data?

No, we need them initialized properly.
>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>>
>> You are correct. In case of memory hot add, it is true. But in case of
>> adding a CPU with a memoryless node, try_online_node() will be called only
>> during CPU onlining, see cpu_up().
>>
>> Is there any reason why try_online_node() resides in cpu_up() and not in
>> add_cpu()? I think it would be correct to online the node during CPU hot
>> add to align with memory hot add.
>
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became a problem only now?

So IIUC, the issue is that we have a node

a) That has no memory
b) That is offline

This node will get onlined when onlining the CPU as Alexey says. Yet we
have some code that stumbles over the node and goes ahead trying to use
the pgdat -- that code is broken.

If we take a look at build_zonelists() we indeed skip any
!node_online(node). Any other code should do the same. If the node is
not online, it shall be ignored, because we might not even have a pgdat
yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
stale or non-existent.

The node onlining logic when onlining a CPU sounds bogus as well: Let's
take a look at try_offline_node(). It checks:
1) That no memory is *present*
2) That no CPU is *present*

We should online the node when adding the CPU ("present"), not when
onlining the CPU.
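[Note: an illustrative sketch of the guard David describes -- the real
build_zonelists() walks the node fallback order and is more involved; this
only shows the check any consumer of a node id is expected to apply before
dereferencing its pgdat:

	int nid = cpu_to_node(cpu);	/* may name an offline node */

	if (nid != NUMA_NO_NODE && node_online(nid)) {
		/* Safe: an online node has an allocated pgdat and
		 * initialized zonelists. */
		pg_data_t *pgdat = NODE_DATA(nid);
		/* ... use pgdat->node_zonelists ... */
	}
]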
> It is hard to follow your reply as your email client is not quoting
> properly. Let me try to reconstruct.
>
> On Tue 02-11-21 08:48:27, Alexey Makhalov wrote:
>> On 02.11.21 08:47, Michal Hocko wrote:
> [...]
>>>>> RIP: 0010:__alloc_pages+0x127/0x290
>>>>
>>>> Could you resolve this into a specific line of the source code please?
>
> This got probably unnoticed. I would be really curious whether this is
> a broken zonelist or something else.

backtrace (including inline functions):

pcpu_alloc_pages()
  alloc_pages_node()
    __alloc_pages_node()
      __alloc_pages()
        prepare_alloc_pages()
          node_zonelist()

The panic happens in node_zonelist(), dereferencing the NULL pointer of
NODE_DATA(nid) in include/linux/gfp.h:514

512 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
513 {
514 	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
515 }

>>>>> A node can be in one of the following states:
>>>>> [...]
>>>>
>>>> I do not think this is a correct approach. We should make sure that the
>>>> proper fallback node is used instead. [...]
>
>> This approach behaves in the same way as if the CPU was not yet added
>> (state #1). So, we can think of state #2 as state #1 when the CPU is not
>> present.
>
>>> I'm a little confused:
>>>
>>> In add_memory_resource() we hotplug the new node if required and set it
>>> online. Memory might get onlined later, via online_pages().
>>
>> You are correct. In case of memory hot add, it is true. But in case of
>> adding a CPU with a memoryless node, try_online_node() will be called only
>> during CPU onlining, see cpu_up().
>>
>> Is there any reason why try_online_node() resides in cpu_up() and not in
>> add_cpu()? I think it would be correct to online the node during CPU hot
>> add to align with memory hot add.
>
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became a problem only now?

This is not CPU-only hotplug, but CPU + NUMA node, and this new node comes
with no memory. We accidentally found it by not onlining the CPU
immediately.

>>> So after add_memory_resource()->__try_online_node() succeeded, we have
>>> an online pgdat -- essentially 3.
>
>> This is the hot adding of both a new CPU and a new _memoryless_ node (with
>> CPU only). And onlining the CPU makes its node online. Disabling CPU
>> onlining puts the new node into state #2, which leads to the repro.
>
>>> Before we online memory for a hotplugged node, all zones are !populated.
>>> So once we online memory for a !populated zone in online_pages(), we
>>> trigger setup_zone_pageset().
>>>
>>> The confusing part is that this patch checks for 3. but says it can be
>>> reproduced by not onlining *memory*. There seems to be something missing.
>>
>> Do we maybe need a proper populated_zone() check before accessing zone data?
>
> No, we need them initialized properly.

Thanks,
--Alexey
On Tue 02-11-21 10:04:23, Michal Hocko wrote:
[...]
> > Is there any reason why try_online_node() resides in cpu_up() and not in
> > add_cpu()? I think it would be correct to online the node during CPU hot
> > add to align with memory hot add.
>
> I am not familiar with cpu hotplug, but this doesn't seem to be anything
> new so how come this became a problem only now?

Just looked at the cpu hotplug part. I do not see add_cpu to add much
here. Here is what I can see in the current Linus tree:

add_cpu
  device_online()	# cpu device - cpu_sys_devices with cpu_subsys bus
    dev->bus->online -> cpu_subsys_online
      cpu_device_up
        cpu_up
          try_online_node

So we should be bringing up the node during add_cpu. Unless something
fails on the way - e.g. cpu_possible check or something similar.
>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>> online. Memory might get onlined later, via online_pages().
>>>
>>> You are correct. In case of memory hot add, it is true. But in case of
>>> adding a CPU with a memoryless node, try_online_node() will be called only
>>> during CPU onlining, see cpu_up().
>>>
>>> Is there any reason why try_online_node() resides in cpu_up() and not in
>>> add_cpu()? I think it would be correct to online the node during CPU hot
>>> add to align with memory hot add.
>>
>> I am not familiar with cpu hotplug, but this doesn't seem to be anything
>> new so how come this became a problem only now?

So IIUC, the issue is that we have a node

a) That has no memory
b) That is offline

This node will get onlined when onlining the CPU as Alexey says. Yet we
have some code that stumbles over the node and goes ahead trying to use
the pgdat -- that code is broken.

You are correct.

> If we take a look at build_zonelists() we indeed skip any
> !node_online(node). Any other code should do the same. If the node is
> not online, it shall be ignored, because we might not even have a pgdat
> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
> stale or non-existent.

Agree, alloc_pages_node() should also do the same. Not exactly skip the
node, but fall back to another node if !node_online(node).
alloc_pages_node() can also be hit while onlining the node, creating a
chicken-and-egg problem, see below.

> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> take a look at try_offline_node(). It checks:
> 1) That no memory is *present*
> 2) That no CPU is *present*
>
> We should online the node when adding the CPU ("present"), not when
> onlining the CPU.

Possible.
Assuming try_online_node was moved under add_cpu(), let's take a look at
this call stack:

add_cpu()
  try_online_node()
    __try_online_node()
      hotadd_new_pgdat()

At line 1190 we'll have a problem:

1183 	pgdat = NODE_DATA(nid);
1184 	if (!pgdat) {
1185 		pgdat = arch_alloc_nodedata(nid);
1186 		if (!pgdat)
1187 			return NULL;
1188
1189 		pgdat->per_cpu_nodestats =
1190 			alloc_percpu(struct per_cpu_nodestat);
1191 		arch_refresh_nodedata(nid, pgdat);

alloc_percpu() will go over all possible CPUs and will eventually end up
calling alloc_pages_node() trying to use the subject nid for the
corresponding CPU, hitting the same state #2 problem, as NODE_DATA(nid)
is still NULL and nid is not yet online.

I like the idea of onlining the node when adding the CPU rather than when
the CPU gets onlined. It will require the current patch or another
solution to resolve the chicken-and-egg problem described above.

PS, earlier this year I initiated a discussion about redesigning the
per_cpu allocator to not allocate/waste memory chunks for not present
CPUs, but it has other complications.

Thanks,
--Alexey
On 02.11.21 11:34, Alexey Makhalov wrote:
>
>>>>> In add_memory_resource() we hotplug the new node if required and set it
>>>>> online. Memory might get onlined later, via online_pages().
>>>> [...]
>>
>> So IIUC, the issue is that we have a node
>>
>> a) That has no memory
>> b) That is offline
>>
>> This node will get onlined when onlining the CPU as Alexey says. Yet we
>> have some code that stumbles over the node and goes ahead trying to use
>> the pgdat -- that code is broken.
>
> You are correct.
>
>> If we take a look at build_zonelists() we indeed skip any
>> !node_online(node). Any other code should do the same. If the node is
>> not online, it shall be ignored, because we might not even have a pgdat
>> yet -- see hotadd_new_pgdat(). Without node_online(), the pgdat might be
>> stale or non-existent.
>
> Agree, alloc_pages_node() should also do the same. Not exactly skip the
> node, but fall back to another node if !node_online(node).
> alloc_pages_node() can also be hit while onlining the node, creating a
> chicken-and-egg problem, see below.

Right, the issue is also a bit involved when calling alloc_pages_node()
on an offline NID. See below.

>> The node onlining logic when onlining a CPU sounds bogus as well: Let's
>> take a look at try_offline_node(). It checks:
>> 1) That no memory is *present*
>> 2) That no CPU is *present*
>>
>> We should online the node when adding the CPU ("present"), not when
>> onlining the CPU.
>
> Possible.
> Assuming try_online_node was moved under add_cpu(), let's take a look at
> this call stack:
>
> add_cpu()
>   try_online_node()
>     __try_online_node()
>       hotadd_new_pgdat()
>
> At line 1190 we'll have a problem:
> [...]
>
> alloc_percpu() will go over all possible CPUs and will eventually end up
> calling alloc_pages_node() trying to use the subject nid for the
> corresponding CPU, hitting the same state #2 problem, as NODE_DATA(nid)
> is still NULL and nid is not yet online.

Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
each possible CPU. We use cpu_to_node() to come up with the NID. I can
only assume that we usually don't get an offline NID for an offline CPU,
but instead either NODE=0 or NODE=NUMA_NO_NODE, because ...

alloc_pages_node()->__alloc_pages_node() will:

	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

BUT:

	prepare_alloc_pages()
		ac->zonelist = node_zonelist(preferred_nid, gfp_mask);

should similarly fail when dereferencing NULL.
On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> On 02.11.21 11:34, Alexey Makhalov wrote:
[...]
> >> The node onlining logic when onlining a CPU sounds bogus as well: Let's
> >> take a look at try_offline_node(). It checks:
> >> 1) That no memory is *present*
> >> 2) That no CPU is *present*
> >>
> >> We should online the node when adding the CPU ("present"), not when
> >> onlining the CPU.
> >
> > Possible.
> > Assuming try_online_node was moved under add_cpu(), let's take a look
> > at this call stack:
> >
> > add_cpu()
> >   try_online_node()
> >     __try_online_node()
> >       hotadd_new_pgdat()
> >
> > At line 1190 we'll have a problem:
> > [...]
> >
> > alloc_percpu() will go over all possible CPUs and will eventually end up
> > calling alloc_pages_node() trying to use the subject nid for the
> > corresponding CPU, hitting the same state #2 problem, as NODE_DATA(nid)
> > is still NULL and nid is not yet online.
>
> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
> each possible CPU. We use cpu_to_node() to come up with the NID.

Shouldn't this be numa_mem_id instead? Memoryless nodes are odd little
critters crafted into the MM code without wider consideration. From time
to time we are struggling with some fallouts, but the primary thing is
that zonelists should be valid for all memoryless nodes. If that is not
the case then there is a problem with the initialization code. If
somebody is providing a bogus node to allocate from then this should be
fixed. It is still not clear to me which case we are hitting here.
On 02.11.21 12:44, Michal Hocko wrote:
> On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
>> On 02.11.21 11:34, Alexey Makhalov wrote:
> [...]
>> Right, we will end up calling pcpu_alloc_pages()->alloc_pages_node() for
>> each possible CPU. We use cpu_to_node() to come up with the NID.
>
> Shouldn't this be numa_mem_id instead? Memoryless nodes are odd little

Hm, good question. Most probably yes for offline nodes.

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 2054c9213c43..c21ff5bb91dc 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
 			    gfp_t gfp)
 {
 	unsigned int cpu, tcpu;
-	int i;
+	int i, nid;
 
 	gfp |= __GFP_HIGHMEM;
 
 	for_each_possible_cpu(cpu) {
+		nid = cpu_to_node(cpu);
+
+		if (nid == NUMA_NO_NODE || !node_online(nid))
+			nid = numa_mem_id();
 		for (i = page_start; i < page_end; i++) {
 			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
 
-			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
+			*pagep = alloc_pages_node(nid, gfp, 0);
 			if (!*pagep)
 				goto err;
 		}

> critters crafted into the MM code without wider consideration. From time
> to time we are struggling with some fallouts, but the primary thing is
> that zonelists should be valid for all memoryless nodes.

Yes, but a zonelist cannot be correct for an offline node, where we might
not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the
zonelists have to be correct. And I can spot a build_all_zonelists() in
hotadd_new_pgdat().

I agree that someone passing an offline NID into an allocator function
should be fixed.

Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
(VM_BUG()) in case we're providing an offline node with an eventually
no/stale pgdat as the preferred nid.
On Tue 02-11-21 13:06:06, David Hildenbrand wrote:
> On 02.11.21 12:44, Michal Hocko wrote:
> > On Tue 02-11-21 12:00:57, David Hildenbrand wrote:
> > [...]
> > Shouldn't this be numa_mem_id instead? Memoryless nodes are odd little
>
> Hm, good question. Most probably yes for offline nodes.
>
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>  			    gfp_t gfp)
>  {
>  	unsigned int cpu, tcpu;
> -	int i;
> +	int i, nid;
>  
>  	gfp |= __GFP_HIGHMEM;
>  
>  	for_each_possible_cpu(cpu) {
> +		nid = cpu_to_node(cpu);
> +
> +		if (nid == NUMA_NO_NODE || !node_online(nid))
> +			nid = numa_mem_id();

or simply nid = cpu_to_mem(cpu)

>  		for (i = page_start; i < page_end; i++) {
>  			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +			*pagep = alloc_pages_node(nid, gfp, 0);
>  			if (!*pagep)
>  				goto err;
>  		}

> > critters crafted into the MM code without wider consideration. From time
> > to time we are struggling with some fallouts, but the primary thing is
> > that zonelists should be valid for all memoryless nodes.
>
> Yes, but a zonelist cannot be correct for an offline node, where we might
> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the
> zonelists have to be correct. And I can spot a build_all_zonelists() in
> hotadd_new_pgdat().

Yes, that is what I had in mind. We are talking about two things here:
memoryless nodes and offline nodes. The latter sounds like a bug to me.

> I agree that someone passing an offline NID into an allocator function
> should be fixed.

Right.

> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
> (VM_BUG()) in case we're providing an offline node with an eventually
> no/stale pgdat as the preferred nid.

Historically, those allocation interfaces were not trying to be robust
against wrong inputs because that adds cpu cycles for everybody for
"what if buggy" code. This has worked (surprisingly) well. Memoryless
nodes have brought in some confusion, but this is still something that we
can address on a higher level. Nobody gives arbitrary nodes as an input.
cpu_to_node might be tricky because it can point to a memoryless node
which, along with __GFP_THISNODE, is very likely not something anybody
wants. Hence cpu_to_mem should be used for allocations. I hate that we
have two very similar APIs...

But something seems wrong in this case. cpu_to_node shouldn't return
offline nodes. That is just a land mine. It is not clear to me how the
cpu has been brought up so that the numa node allocation was left
behind. As pointed out in another email, add_cpu resp. cpu_up is not it.
Is it possible that the cpu bring-up was only halfway done?
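[Note: a sketch contrasting the two APIs Michal mentions, assuming cpu and
gfp are in scope; as discussed later in the thread, without
HAVE_MEMORYLESS_NODES the two mappings end up identical:

	int nid  = cpu_to_node(cpu);	/* node the CPU belongs to; may be
					 * memoryless, or offline in the
					 * hotplug case discussed here */
	int mnid = cpu_to_mem(cpu);	/* nearest node that has memory */

	/* For allocations, the memory node is the safer choice: */
	struct page *page = alloc_pages_node(mnid, gfp, 0);
]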
>> Yes, but a zonelist cannot be correct for an offline node, where we might
>> not even have an allocated pgdat yet. No pgdat, no zonelist. So as soon as
>> we allocate the pgdat and set the node online (->hotadd_new_pgdat()), the
>> zonelists have to be correct. And I can spot a build_all_zonelists() in
>> hotadd_new_pgdat().
>
> Yes, that is what I had in mind. We are talking about two things here:
> memoryless nodes and offline nodes. The latter sounds like a bug to me.

Agreed. Memoryless nodes should just have proper zonelists -- which
seems to be the case.

>> Maybe __alloc_pages_bulk() and alloc_pages_node() should bail out directly
>> (VM_BUG()) in case we're providing an offline node with an eventually
>> no/stale pgdat as the preferred nid.
>
> Historically, those allocation interfaces were not trying to be robust
> against wrong inputs because that adds cpu cycles for everybody for
> "what if buggy" code. This has worked (surprisingly) well. Memoryless
> nodes have brought in some confusion, but this is still something that we
> can address on a higher level. Nobody gives arbitrary nodes as an input.
> cpu_to_node might be tricky because it can point to a memoryless node
> which, along with __GFP_THISNODE, is very likely not something anybody
> wants. Hence cpu_to_mem should be used for allocations. I hate that we
> have two very similar APIs...

To be precise, I'm wondering if we should do:

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f965a..8c49b88336ee 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -565,7 +565,7 @@ static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
-	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
+	VM_WARN_ON(!node_online(nid));
 
 	return __alloc_pages(gfp_mask, order, nid, NULL);
 }

(Or maybe VM_BUG_ON.)

Because it cannot possibly work and we'll dereference NULL later.

> But something seems wrong in this case. cpu_to_node shouldn't return
> offline nodes. That is just a land mine. It is not clear to me how the
> cpu has been brought up so that the numa node allocation was left
> behind. As pointed out in another email, add_cpu resp. cpu_up is not it.
> Is it possible that the cpu bring-up was only halfway done?

I tried to follow the code (what sets a CPU present, what sets a CPU
online, when do we update the cpu_to_node() mapping) and IMHO it's all a
big mess. Maybe it's clearer to people familiar with that code, but CPU
hotplug in general seems to be a confusing piece of (arch-specific) code.

Also, I have no clue if the cpu_to_node() mapping will get invalidated
after unplugging that CPU, or if the mapping will simply stay around for
all eternity ...
On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
> [...]
> To be precise, I'm wondering if we should do:
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 55b2ec1f965a..8c49b88336ee 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -565,7 +565,7 @@ static inline struct page *
>  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  {
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
> +	VM_WARN_ON(!node_online(nid));
>  
>  	return __alloc_pages(gfp_mask, order, nid, NULL);
>  }
>
> (Or maybe VM_BUG_ON.)
>
> Because it cannot possibly work and we'll dereference NULL later.

VM_BUG_ON would be silent for most configurations, and the crash would
happen even without it, so I am not sure about the additional value.
VM_WARN_ON doesn't really add much on top - except it would crash in some
configurations. If we really cared to catch this case, we would have to
do a reasonable fallback with a printk note and a dumped stack. Something
like:

	if (unlikely(!node_online(nid))) {
		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n", nid);
		dump_stack();
		nid = numa_mem_id();
	}

But again, this is adding quite some cycles to a hotpath of the page
allocator. Is this worth it?

> > But something seems wrong in this case. cpu_to_node shouldn't return
> > offline nodes. That is just a land mine. It is not clear to me how the
> > cpu has been brought up so that the numa node allocation was left
> > behind. As pointed out in another email, add_cpu resp. cpu_up is not it.
> > Is it possible that the cpu bring-up was only halfway done?
>
> I tried to follow the code (what sets a CPU present, what sets a CPU
> online, when do we update the cpu_to_node() mapping) and IMHO it's all a
> big mess. Maybe it's clearer to people familiar with that code, but CPU
> hotplug in general seems to be a confusing piece of (arch-specific) code.

Yes, there are different arch-specific parts that make this quite hard to
follow.

I think we want to learn how exactly Alexey brought that cpu up, because
his initial thought on add_cpu resp. cpu_up doesn't seem to be correct.
Or I am just not following the code properly. Once we know all those
details we can get in touch with the cpu hotplug maintainers and see what
we can do.

Btw. do you plan to send a patch for the pcp allocator to use cpu_to_mem?

One last thing: there were some mentions of __GFP_THISNODE, but I fail to
see the connection with the pcp allocator...
On 02.11.21 14:25, Michal Hocko wrote:
> On Tue 02-11-21 13:39:06, David Hildenbrand wrote:
> [...]
>> To be precise, I'm wondering if we should do:
>>
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> [...]
>> -	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
>> +	VM_WARN_ON(!node_online(nid));
>>
>> (Or maybe VM_BUG_ON.)
>>
>> Because it cannot possibly work and we'll dereference NULL later.
>
> VM_BUG_ON would be silent for most configurations, and the crash would
> happen even without it, so I am not sure about the additional value.
> VM_WARN_ON doesn't really add much on top - except it would crash in some
> configurations. If we really cared to catch this case, we would have to
> do a reasonable fallback with a printk note and a dumped stack.

As I learned, VM_BUG_ON and friends are active for e.g. Fedora, which
can catch quite some issues early, before they end up in enterprise
distro kernels. I think it has value.

> Something like:
>
> 	if (unlikely(!node_online(nid))) {
> 		pr_err("%d is an offline numa node and using it is a bug in a caller. Please report...\n", nid);
> 		dump_stack();
> 		nid = numa_mem_id();
> 	}
>
> But again, this is adding quite some cycles to a hotpath of the page
> allocator. Is this worth it?

Don't think a fallback makes sense.

>> I tried to follow the code (what sets a CPU present, what sets a CPU
>> online, when do we update the cpu_to_node() mapping) and IMHO it's all a
>> big mess. Maybe it's clearer to people familiar with that code, but CPU
>> hotplug in general seems to be a confusing piece of (arch-specific) code.
>
> Yes, there are different arch-specific parts that make this quite hard to
> follow.
>
> I think we want to learn how exactly Alexey brought that cpu up, because
> his initial thought on add_cpu resp. cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with the cpu hotplug maintainers and see what
> we can do.

Yes.

> Btw. do you plan to send a patch for the pcp allocator to use cpu_to_mem?

You mean s/cpu_to_node/cpu_to_mem/, or also handling offline nids?

cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
won't help for this very report.

> One last thing: there were some mentions of __GFP_THISNODE, but I fail to
> see the connection with the pcp allocator...

Me too. If pcpu were using __GFP_THISNODE, we'd be hitting the VM_WARN_ON
but still crash.
On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> I think we want to learn how exactly Alexey brought that cpu up, because
> his initial thought on add_cpu resp. cpu_up doesn't seem to be correct.
> Or I am just not following the code properly. Once we know all those
> details we can get in touch with the cpu hotplug maintainers and see what
> we can do.

I am not really familiar with CPU hot-onlining, but I have been taking a
look. As with memory, there are two different stages, hot-adding and
onlining (and the counterparts).

Part of the hot-adding being:

acpi_processor_get_info
  acpi_processor_hotadd_init
    arch_register_cpu
      register_cpu

One of the things that register_cpu() does is to set cpu->dev.bus pointing
to &cpu_subsys, which is:

struct bus_type cpu_subsys = {
	.name = "cpu",
	.dev_name = "cpu",
	.match = cpu_subsys_match,
#ifdef CONFIG_HOTPLUG_CPU
	.online = cpu_subsys_online,
	.offline = cpu_subsys_offline,
#endif
};

Then, the onlining part (in case of a udev rule or someone onlining the
device) would be:

online_store
  device_online
    cpu_subsys_online
      cpu_device_up
        cpu_up
          ...
            online node

Since Alexey disabled the udev rule and no one onlined the CPU,
online_store() -> device_online() wasn't really called.

The following only applies to x86_64:
I think we got confused because cpu_device_up() is also called from
add_cpu(), but that is an exported function, and x86 does not call
add_cpu() except for debugging purposes (check kernel/torture.c and
arch/x86/kernel/topology.c). It does the onlining through online_store()...
So we can take add_cpu() off the equation here.
On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
> On 02.11.21 14:25, Michal Hocko wrote:
[...]
>>> Btw. do you plan to send a patch for the pcp allocator to use cpu_to_mem?
>>
>> You mean s/cpu_to_node/cpu_to_mem/, or also handling offline nids?

Just cpu_to_mem.

>> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
>> won't help for this very report.

Weird, x86 allows memoryless nodes as well. But you are right, there is
nothing selecting HAVE_MEMORYLESS_NODES, neither do I see any arch-specific
implementation. I have to say that I have forgotten all those nasty
details... Sigh.
On Tue 02-11-21 14:52:01, Oscar Salvador wrote:
> On Tue, Nov 02, 2021 at 02:25:03PM +0100, Michal Hocko wrote:
> [...]
> I am not really familiar with CPU hot-onlining, but I have been taking a
> look. As with memory, there are two different stages, hot-adding and
> onlining (and the counterparts).
>
> Part of the hot-adding being:
>
> acpi_processor_get_info
>   acpi_processor_hotadd_init
>     arch_register_cpu
>       register_cpu
> [...]
> Since Alexey disabled the udev rule and no one onlined the CPU,
> online_store() -> device_online() wasn't really called.
> [...]
> So we can take add_cpu() off the equation here.

Yes, so that is the real problem (thanks for pointing me to the acpi
code). The cpu->node association is done in acpi_map_cpu2node, and I
suspect this expects that the node is already present, as it gets the
information from SRAT/PXM tables which are parsed during boot. But I
might be just confused, or maybe VMware injects new entries here somehow.

Another interesting thing is that acpi_map_cpu2node skips over the
association if there is no node found in SRAT, but that should only mean
it would use the default initialization, which should hopefully be 0.

Anyway, I have found in my notes
https://www.spinics.net/lists/kernel/msg3010886.html which is a slightly
different problem, but it has some notes about how the initialization
mess works (that one was boot time though, and hotplug might be different
actually). I have run out of time for this today, so hopefully somebody
can re-learn that from there...
On 02.11.21 15:12, Michal Hocko wrote:
> On Tue 02-11-21 14:41:25, David Hildenbrand wrote:
>> On 02.11.21 14:25, Michal Hocko wrote:
> [...]
>>> Btw. do you plan to send a patch for the pcp allocator to use cpu_to_mem?
>>
>> You mean s/cpu_to_node/cpu_to_mem/, or also handling offline nids?
>
> Just cpu_to_mem.
>
>> cpu_to_mem() corresponds to cpu_to_node() unless on ia64+ppc IIUC, so it
>> won't help for this very report.
>
> Weird, x86 allows memoryless nodes as well. But you are right, there is
> nothing selecting HAVE_MEMORYLESS_NODES, neither do I see any arch-specific
> implementation. I have to say that I have forgotten all those nasty
> details... Sigh.

I assume HAVE_MEMORYLESS_NODES is just an optimization to set a preferred
memory node for memoryless nodes. It doesn't imply that we cannot have
memoryless nodes otherwise. I suspect, just as so often, the config option
name doesn't express what it really does.
I'm going to send a patch v2 with the node_online check moved to the
caller (the pcpu_alloc_pages() function), as was suggested by David. It
seems to be the only place which passes a present but offline node to
alloc_pages_node(). Moving the node online check to the caller keeps the
hot path (alloc_pages) simple and performant.

> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
> index 2054c9213c43..c21ff5bb91dc 100644
> --- a/mm/percpu-vm.c
> +++ b/mm/percpu-vm.c
> @@ -84,15 +84,19 @@ static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
>  			    gfp_t gfp)
>  {
>  	unsigned int cpu, tcpu;
> -	int i;
> +	int i, nid;
>  
>  	gfp |= __GFP_HIGHMEM;
>  
>  	for_each_possible_cpu(cpu) {
> +		nid = cpu_to_node(cpu);
> +
> +		if (nid == NUMA_NO_NODE || !node_online(nid))
> +			nid = numa_mem_id();
>  		for (i = page_start; i < page_end; i++) {
>  			struct page **pagep = &pages[pcpu_page_idx(cpu, i)];
>  
> -			*pagep = alloc_pages_node(cpu_to_node(cpu), gfp, 0);
> +			*pagep = alloc_pages_node(nid, gfp, 0);
>  			if (!*pagep)
>  				goto err;
>  		}

Thanks,
--Alexey
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 55b2ec1f9..34a5a7def 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -551,7 +551,8 @@ alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_arr
 static inline unsigned long
 alloc_pages_bulk_array_node(gfp_t gfp, int nid, unsigned long nr_pages, struct page **page_array)
 {
-	if (nid == NUMA_NO_NODE)
+	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
+	    !(gfp & __GFP_THISNODE)))
 		nid = numa_mem_id();
 
 	return __alloc_pages_bulk(gfp, nid, NULL, nr_pages, NULL, page_array);
@@ -578,7 +579,8 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 					    unsigned int order)
 {
-	if (nid == NUMA_NO_NODE)
+	if (nid == NUMA_NO_NODE || (!node_online(nid) &&
+	    !(gfp_mask & __GFP_THISNODE)))
 		nid = numa_mem_id();
 
 	return __alloc_pages_node(nid, gfp_mask, order);
There is a kernel panic caused by __alloc_pages() accessing
uninitialized NODE_DATA(nid). Uninitialized node data exists
during the time when a CPU with a memoryless node was added but
not onlined yet. The panic can be easily reproduced by disabling
the udev rule for automatic onlining of hot added CPUs, followed
by a hot add of a CPU with a memoryless node.

This is a panic caused by percpu code doing allocations for
all possible CPUs and hitting this issue:

CPU2 has been hot-added
BUG: unable to handle page fault for address: 0000000000001608
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 1 Comm: systemd Tainted: G E 5.15.0-rc7+ #11
Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW

RIP: 0010:__alloc_pages+0x127/0x290
Code: 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 e0 48 8b 55 b8 c1 e8 0c 83 e0 01 88 45 d0 4c 89 c8 48 85 d2 0f 85 1a 01 00 00 <45> 3b 41 08 0f 82 10 01 00 00 48 89 45 c0 48 8b 00 44 89 e2 81 e2
RSP: 0018:ffffc900006f3bc8 EFLAGS: 00010246
RAX: 0000000000001600 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000cc2
RBP: ffffc900006f3c18 R08: 0000000000000001 R09: 0000000000001600
R10: ffffc900006f3a40 R11: ffff88813c9fffe8 R12: 0000000000000cc2
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000cc2
FS: 00007f27ead70500(0000) GS:ffff88807ce00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001608 CR3: 000000000582c003 CR4: 00000000001706b0
Call Trace:
 pcpu_alloc_pages.constprop.0+0xe4/0x1c0
 pcpu_populate_chunk+0x33/0xb0
 pcpu_alloc+0x4d3/0x6f0
 __alloc_percpu_gfp+0xd/0x10
 alloc_mem_cgroup_per_node_info+0x54/0xb0
 mem_cgroup_alloc+0xed/0x2f0
 mem_cgroup_css_alloc+0x33/0x2f0
 css_create+0x3a/0x1f0
 cgroup_apply_control_enable+0x12b/0x150
 cgroup_mkdir+0xdd/0x110
 kernfs_iop_mkdir+0x4f/0x80
 vfs_mkdir+0x178/0x230
 do_mkdirat+0xfd/0x120
 __x64_sys_mkdir+0x47/0x70
 ? syscall_exit_to_user_mode+0x21/0x50
 do_syscall_64+0x43/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae

A node can be in one of the following states:
1. not present (nid == NUMA_NO_NODE)
2. present, but offline (nid > NUMA_NO_NODE, node_online(nid) == 0,
   NODE_DATA(nid) == NULL)
3. present and online (nid > NUMA_NO_NODE, node_online(nid) > 0,
   NODE_DATA(nid) != NULL)

The alloc_pages_node() and alloc_pages_bulk_array_node() functions verify
nid validity only and do not check whether the nid is online. The enhanced
check makes it possible to handle page allocation when the node is in
state 2.

Signed-off-by: Alexey Makhalov <amakhalov@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 include/linux/gfp.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)