Message ID | 20200311110237.5731-2-srikar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Offline memoryless cpuless node 0 | expand |
On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > A Powerpc system with multiple possible nodes and with CONFIG_NUMA > enabled always used to have a node 0, even if node 0 does not any cpus > or memory attached to it. As per PAPR, node affinity of a cpu is only > available once its present / online. For all cpus that are possible but > not present, cpu_to_node() would point to node 0. > > To ensure a cpuless, memoryless dummy node is not online, powerpc need > to make sure all possible but not present cpu_to_node are set to a > proper node. Just curious, is this somehow related to http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-mm@kvack.org > Cc: linux-kernel@vger.kernel.org > Cc: Michal Hocko <mhocko@suse.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Vlastimil Babka <vbabka@suse.cz> > Cc: "Kirill A. Shutemov" <kirill@shutemov.name> > Cc: Christopher Lameter <cl@linux.com> > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > --- > arch/powerpc/mm/numa.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index 8a399db..54dcd49 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void) > > reset_numa_cpu_lookup_table(); > > - for_each_present_cpu(cpu) > - numa_setup_cpu(cpu); > + for_each_possible_cpu(cpu) { > + /* > + * Powerpc with CONFIG_NUMA always used to have a node 0, > + * even if it was memoryless or cpuless. For all cpus that > + * are possible but not present, cpu_to_node() would point > + * to node 0. To remove a cpuless, memoryless dummy node, > + * powerpc need to make sure all possible but not present > + * cpu_to_node are set to a proper node. > + */ > + if (cpu_present(cpu)) > + numa_setup_cpu(cpu); > + else > + set_cpu_numa_node(cpu, first_online_node); > + } > } > > void __init initmem_init(void) > -- > 1.8.3.1
* Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > > A Powerpc system with multiple possible nodes and with CONFIG_NUMA > > enabled always used to have a node 0, even if node 0 does not any cpus > > or memory attached to it. As per PAPR, node affinity of a cpu is only > > available once its present / online. For all cpus that are possible but > > not present, cpu_to_node() would point to node 0. > > > > To ensure a cpuless, memoryless dummy node is not online, powerpc need > > to make sure all possible but not present cpu_to_node are set to a > > proper node. > > Just curious, is this somehow related to > http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > The issue I am trying to fix is a known issue in Powerpc since many years. So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate shrinker_map on appropriate NUMA node"). I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the kernel. Will work with Sachin/Abdul (reporters of the issue). > > Cc: linuxppc-dev@lists.ozlabs.org > > Cc: linux-mm@kvack.org > > Cc: linux-kernel@vger.kernel.org > > Cc: Michal Hocko <mhocko@suse.com> > > Cc: Mel Gorman <mgorman@suse.de> > > Cc: Vlastimil Babka <vbabka@suse.cz> > > Cc: "Kirill A. Shutemov" <kirill@shutemov.name> > > Cc: Christopher Lameter <cl@linux.com> > > Cc: Michael Ellerman <mpe@ellerman.id.au> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Linus Torvalds <torvalds@linux-foundation.org> > > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> > > --- > > arch/powerpc/mm/numa.c | 16 ++++++++++++++-- > > 1 file changed, 14 insertions(+), 2 deletions(-) > > > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > > index 8a399db..54dcd49 100644 > > --- a/arch/powerpc/mm/numa.c > > +++ b/arch/powerpc/mm/numa.c > > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void) > > > > reset_numa_cpu_lookup_table(); > > > > - for_each_present_cpu(cpu) > > - numa_setup_cpu(cpu); > > + for_each_possible_cpu(cpu) { > > + /* > > + * Powerpc with CONFIG_NUMA always used to have a node 0, > > + * even if it was memoryless or cpuless. For all cpus that > > + * are possible but not present, cpu_to_node() would point > > + * to node 0. To remove a cpuless, memoryless dummy node, > > + * powerpc need to make sure all possible but not present > > + * cpu_to_node are set to a proper node. > > + */ > > + if (cpu_present(cpu)) > > + numa_setup_cpu(cpu); > > + else > > + set_cpu_numa_node(cpu, first_online_node); > > + } > > } > > > > void __init initmem_init(void) > > -- > > 1.8.3.1 > > -- > Michal Hocko > SUSE Labs
> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > >> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: >>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA >>> enabled always used to have a node 0, even if node 0 does not any cpus >>> or memory attached to it. As per PAPR, node affinity of a cpu is only >>> available once its present / online. For all cpus that are possible but >>> not present, cpu_to_node() would point to node 0. >>> >>> To ensure a cpuless, memoryless dummy node is not online, powerpc need >>> to make sure all possible but not present cpu_to_node are set to a >>> proper node. >> >> Just curious, is this somehow related to >> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? >> > > The issue I am trying to fix is a known issue in Powerpc since many years. > So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate > shrinker_map on appropriate NUMA node"). > > I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the > kernel. Will work with Sachin/Abdul (reporters of the issue). > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 ) The kernel still fails to boot with same call trace. [ 6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0 [ 6.159363] Faulting instruction address: 0xc0000000003d7174 [ 6.159368] Oops: Kernel access of bad area, sig: 11 [#1] [ 6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 6.159378] Modules linked in: [ 6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1 [ 6.159388] NIP: c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70 [ 6.159393] REGS: c0000008b36836d0 TRAP: 0300 Not tainted (5.6.0-rc5-next-20200311-autotest+) [ 6.159398] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004848 XER: 00000000 [ 6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 [ 6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500 [ 6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620 [ 6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000 [ 6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000 [ 6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002 [ 6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122 [ 6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8 [ 6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40 [ 6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760 [ 6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60 [ 6.159462] Call Trace: [ 6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70 [ 6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490 [ 6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110 [ 6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270 [ 6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0 [ 6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0 [ 6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0 [ 6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0 [ 6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230 [ 6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0 [ 6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68 [ 6.159527] Instruction dump: [ 6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088 [ 6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a [ 6.159545] ---[ end trace 36d65cb66091a5b6 ]— Boot log attached. Thanks -Sachin
On 3/12/20 9:23 AM, Sachin Sant wrote: > > >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: >>>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA >>>> enabled always used to have a node 0, even if node 0 does not any cpus >>>> or memory attached to it. As per PAPR, node affinity of a cpu is only >>>> available once its present / online. For all cpus that are possible but >>>> not present, cpu_to_node() would point to node 0. >>>> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need >>>> to make sure all possible but not present cpu_to_node are set to a >>>> proper node. >>> >>> Just curious, is this somehow related to >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? >>> >> >> The issue I am trying to fix is a known issue in Powerpc since many years. >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate >> shrinker_map on appropriate NUMA node"). >> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the >> kernel. Will work with Sachin/Abdul (reporters of the issue). >> > > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 ) > The kernel still fails to boot with same call trace. Yeah when I skimmed the patches, I don't think they address the issue where node_to_mem_node(0) = 0 [1]. You could reapply the debug print patch to verify, but it seems very likely. So I'm not surprised you get the same trace. [1] https://lore.kernel.org/linux-next/9a86f865-50b5-7483-9257-dbb08fecd62b@suse.cz/ > [ 6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0 > [ 6.159363] Faulting instruction address: 0xc0000000003d7174 > [ 6.159368] Oops: Kernel access of bad area, sig: 11 [#1] > [ 6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > [ 6.159378] Modules linked in: > [ 6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1 > [ 6.159388] NIP: c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70 > [ 6.159393] REGS: c0000008b36836d0 TRAP: 0300 Not tainted (5.6.0-rc5-next-20200311-autotest+) > [ 6.159398] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004848 XER: 00000000 > [ 6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1 > [ 6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500 > [ 6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620 > [ 6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000 > [ 6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000 > [ 6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002 > [ 6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122 > [ 6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8 > [ 6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40 > [ 6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760 > [ 6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60 > [ 6.159462] Call Trace: > [ 6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70 > [ 6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490 > [ 6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110 > [ 6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270 > [ 6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0 > [ 6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0 > [ 6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0 > [ 6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0 > [ 6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230 > [ 6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0 > [ 6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68 > [ 6.159527] Instruction dump: > [ 6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088 > [ 6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a > [ 6.159545] ---[ end trace 36d65cb66091a5b6 ]— > > Boot log attached. > > Thanks > -Sachin >
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: > On 3/12/20 9:23 AM, Sachin Sant wrote: > >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need > >>>> to make sure all possible but not present cpu_to_node are set to a > >>>> proper node. > >>> > >>> Just curious, is this somehow related to > >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > >>> > >> > >> The issue I am trying to fix is a known issue in Powerpc since many years. > >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate > >> shrinker_map on appropriate NUMA node"). > >> > >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the > >> kernel. Will work with Sachin/Abdul (reporters of the issue). I had used v1 and not v2. So my mistake. > > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 ) > > The kernel still fails to boot with same call trace. > While I am not an expert in the slub area, I looked at the patch a75056fc1e7c and had some thoughts on why this could be causing this issue. On the system where the crash happens, the possible number of nodes is much greater than the number of onlined nodes. The pdgat or the NODE_DATA is only available for onlined nodes. With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node for all possible nodes and in ___slab_alloc we end up looking at the node_present_pages which is NODE_DATA(nid)->node_present_pages. i.e for a node whose pdgat struct is not allocated, we are trying to dereference. Also for a memoryless/cpuless node or possible but not present nodes, node_to_mem_node(node) will still end up as node (atleast on powerpc). I tried with this hunk below and it works. But I am not sure if we need to check at other places were node_present_pages is being called. diff --git a/mm/slub.c b/mm/slub.c index 626cbcbd977f..bddb93bed55e 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, if (unlikely(!node_match(page, node))) { int searchnode = node; - if (node != NUMA_NO_NODE && !node_present_pages(node)) - searchnode = node_to_mem_node(node); - + if (node != NUMA_NO_NODE) { + if (!node_online(node) || !node_present_pages(node)) { + searchnode = node_to_mem_node(node); + if (!node_online(searchnode)) + searchnode = first_online_node; + } + } if (unlikely(!node_match(page, searchnode))) { stat(s, ALLOC_NODE_MISMATCH); deactivate_slab(s, page, c->freelist, c); > > >
On 3/12/20 2:14 PM, Srikar Dronamraju wrote: > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: > >> On 3/12/20 9:23 AM, Sachin Sant wrote: >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need >> >>>> to make sure all possible but not present cpu_to_node are set to a >> >>>> proper node. >> >>> >> >>> Just curious, is this somehow related to >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? >> >>> >> >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years. >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate >> >> shrinker_map on appropriate NUMA node"). >> >> >> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the >> >> kernel. Will work with Sachin/Abdul (reporters of the issue). > > I had used v1 and not v2. So my mistake. > >> > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 ) >> > The kernel still fails to boot with same call trace. >> > > While I am not an expert in the slub area, I looked at the patch > a75056fc1e7c and had some thoughts on why this could be causing this issue. > > On the system where the crash happens, the possible number of nodes is much > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only > available for onlined nodes. > > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node > for all possible nodes and in ___slab_alloc we end up looking at the > node_present_pages which is NODE_DATA(nid)->node_present_pages. > i.e for a node whose pdgat struct is not allocated, we are trying to > dereference. From what we saw, the pgdat does exist, the problem is that slab's per-node data doesn't exist for a node that doesn't have present pages, as it would be a waste of memory. Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In Sachin's first report [1] we have [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] [ 0.000000] numa: NODE_DATA(0) on node 1 [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] But in this thread, with your patches Sachin reports: [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] So I assume it's just node 1. In that case, node_present_pages is really dangerous. [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ > Also for a memoryless/cpuless node or possible but not present nodes, > node_to_mem_node(node) will still end up as node (atleast on powerpc). I think that's the place where this would be best to fix. > I tried with this hunk below and it works. > > But I am not sure if we need to check at other places were > node_present_pages is being called. I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it return only nodes that are online with present memory? CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add support for node_to_mem_node() to determine the fallback node") I think we do need well defined and documented rules around node_to_mem_node(), cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so everyone handles it the same, safe way. > diff --git a/mm/slub.c b/mm/slub.c > index 626cbcbd977f..bddb93bed55e 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > if (unlikely(!node_match(page, node))) { > int searchnode = node; > > - if (node != NUMA_NO_NODE && !node_present_pages(node)) > - searchnode = node_to_mem_node(node); > - > + if (node != NUMA_NO_NODE) { > + if (!node_online(node) || !node_present_pages(node)) { > + searchnode = node_to_mem_node(node); > + if (!node_online(searchnode)) > + searchnode = first_online_node; > + } > + } > if (unlikely(!node_match(page, searchnode))) { > stat(s, ALLOC_NODE_MISMATCH); > deactivate_slab(s, page, c->freelist, c); > >> > >> >
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]: > > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: > > > >> On 3/12/20 9:23 AM, Sachin Sant wrote: > >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need > >> >>>> to make sure all possible but not present cpu_to_node are set to a > >> >>>> proper node. > >> >>> > >> >>> Just curious, is this somehow related to > >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > >> >>> > >> >> > >> >> The issue I am trying to fix is a known issue in Powerpc since many years. > >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate > >> >> shrinker_map on appropriate NUMA node"). > >> >> > > > > While I am not an expert in the slub area, I looked at the patch > > a75056fc1e7c and had some thoughts on why this could be causing this issue. > > > > On the system where the crash happens, the possible number of nodes is much > > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only > > available for onlined nodes. > > > > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node > > for all possible nodes and in ___slab_alloc we end up looking at the > > node_present_pages which is NODE_DATA(nid)->node_present_pages. > > i.e for a node whose pdgat struct is not allocated, we are trying to > > dereference. > > From what we saw, the pgdat does exist, the problem is that slab's per-node data > doesn't exist for a node that doesn't have present pages, as it would be a waste > of memory. Just to be clear Before my 3 patches to fix dummy node: srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible 0-31 srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online 0-1 > > Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In > Sachin's first report [1] we have > > [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > [ 0.000000] numa: NODE_DATA(0) on node 1 > [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the rest 30 nodes. > But in this thread, with your patches Sachin reports: and with my patches srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible 0-31 srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online 1 > > [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > so we only see one pgdat. > So I assume it's just node 1. In that case, node_present_pages is really dangerous. > > [1] > https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ > > > Also for a memoryless/cpuless node or possible but not present nodes, > > node_to_mem_node(node) will still end up as node (atleast on powerpc). > > I think that's the place where this would be best to fix. > Maybe. I thought about it but the current set_numa_mem semantics are apt for memoryless cpu node and not for possible nodes. We could have upto 256 possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory. node_to_mem_node seems to return what is set in set_numa_mem(). set_numa_mem() seems to say set my numa_mem node for the current memoryless node to the param passed. But how do we set numa_mem for all the other 253 possible nodes, which probably will have 0 as default? Should we introduce another API such that we could update for all possible nodes? > > I tried with this hunk below and it works. > > > > But I am not sure if we need to check at other places were > > node_present_pages is being called. > > I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it > return only nodes that are online with present memory? > CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add > support for node_to_mem_node() to determine the fallback node") > Agree > I think we do need well defined and documented rules around node_to_mem_node(), > cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so > everyone handles it the same, safe way. > Other option would be to tweak Kirill Tkhai's patch such that we call kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc if the node is offline. > > diff --git a/mm/slub.c b/mm/slub.c > > index 626cbcbd977f..bddb93bed55e 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, > > if (unlikely(!node_match(page, node))) { > > int searchnode = node; > > > > - if (node != NUMA_NO_NODE && !node_present_pages(node)) > > - searchnode = node_to_mem_node(node); > > - > > + if (node != NUMA_NO_NODE) { > > + if (!node_online(node) || !node_present_pages(node)) { > > + searchnode = node_to_mem_node(node); > > + if (!node_online(searchnode)) > > + searchnode = first_online_node; > > + } > > + } > > if (unlikely(!node_match(page, searchnode))) { > > stat(s, ALLOC_NODE_MISMATCH); > > deactivate_slab(s, page, c->freelist, c);
On 3/12/20 5:13 PM, Srikar Dronamraju wrote: > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]: > >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: >> > >> >> On 3/12/20 9:23 AM, Sachin Sant wrote: >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need >> >> >>>> to make sure all possible but not present cpu_to_node are set to a >> >> >>>> proper node. >> >> >>> >> >> >>> Just curious, is this somehow related to >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? >> >> >>> >> >> >> >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years. >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate >> >> >> shrinker_map on appropriate NUMA node"). >> >> >> >> > >> > While I am not an expert in the slub area, I looked at the patch >> > a75056fc1e7c and had some thoughts on why this could be causing this issue. >> > >> > On the system where the crash happens, the possible number of nodes is much >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only >> > available for onlined nodes. >> > >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node >> > for all possible nodes and in ___slab_alloc we end up looking at the >> > node_present_pages which is NODE_DATA(nid)->node_present_pages. >> > i.e for a node whose pdgat struct is not allocated, we are trying to >> > dereference. >> >> From what we saw, the pgdat does exist, the problem is that slab's per-node data >> doesn't exist for a node that doesn't have present pages, as it would be a waste >> of memory. > > Just to be clear > Before my 3 patches to fix dummy node: > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > 0-31 > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > 0-1 OK >> >> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In >> Sachin's first report [1] we have >> >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] >> [ 0.000000] numa: NODE_DATA(0) on node 1 >> [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] >> > > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the > rest 30 nodes. I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should check online first, as you suggested. >> But in this thread, with your patches Sachin reports: > > and with my patches > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > 0-31 > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > 1 > >> >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] >> > > so we only see one pgdat. > >> So I assume it's just node 1. In that case, node_present_pages is really dangerous. >> >> [1] >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ >> >> > Also for a memoryless/cpuless node or possible but not present nodes, >> > node_to_mem_node(node) will still end up as node (atleast on powerpc). >> >> I think that's the place where this would be best to fix. >> > > Maybe. I thought about it but the current set_numa_mem semantics are apt > for memoryless cpu node and not for possible nodes. We could have upto 256 > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory. > node_to_mem_node seems to return what is set in set_numa_mem(). > set_numa_mem() seems to say set my numa_mem node for the current memoryless > node to the param passed. > > But how do we set numa_mem for all the other 253 possible nodes, which > probably will have 0 as default? > > Should we introduce another API such that we could update for all possible > nodes? If we want to rely on node_to_mem_node() to give us something safe for each possible node, then probably it would have to be like that, yeah. >> > I tried with this hunk below and it works. >> > >> > But I am not sure if we need to check at other places were >> > node_present_pages is being called. >> >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it >> return only nodes that are online with present memory? >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add >> support for node_to_mem_node() to determine the fallback node") >> > > Agree > >> I think we do need well defined and documented rules around node_to_mem_node(), >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so >> everyone handles it the same, safe way. So let's try to brainstorm how this would look like? What I mean are some rules like below, even if some details in my current understanding are most likely incorrect: with nid present in: N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online node with memory so that we don't require everyone to search for it in slightly different ways N_ONLINE - pgdat must exist, there doesn't have to be present memory, node_to_mem_node() still has to return something else (?) N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself N_HIGH_MEMORY - node has present high memory > > Other option would be to tweak Kirill Tkhai's patch such that we call > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc > if the node is offline. I really would like a solution that hides these ugly details from callers so they don't have to workaround the APIs we provide. kvmalloc_node() really shouldn't crash, and it should fallback automatically if we don't give it __GFP_THISNODE However, taking a step back, memcg_alloc_shrinker_maps() is probably rather wasteful on systems with 256 possible nodes and only few present, by allocating effectively dead structures for each memcg. SLUB tries to be smart, so it allocates the per-node per-cache structures only when the node goes online in slab_mem_going_online_callback(). This is why there's a crash when such non-existing structures are accessed for a node that's not online, and why they shouldn't be accessed. Perhaps memcg should do the same on-demand allocation, if possible. >> > diff --git a/mm/slub.c b/mm/slub.c >> > index 626cbcbd977f..bddb93bed55e 100644 >> > --- a/mm/slub.c >> > +++ b/mm/slub.c >> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, >> > if (unlikely(!node_match(page, node))) { >> > int searchnode = node; >> > >> > - if (node != NUMA_NO_NODE && !node_present_pages(node)) >> > - searchnode = node_to_mem_node(node); >> > - >> > + if (node != NUMA_NO_NODE) { >> > + if (!node_online(node) || !node_present_pages(node)) { >> > + searchnode = node_to_mem_node(node); >> > + if (!node_online(searchnode)) >> > + searchnode = first_online_node; >> > + } >> > + } >> > if (unlikely(!node_match(page, searchnode))) { >> > stat(s, ALLOC_NODE_MISMATCH); >> > deactivate_slab(s, page, c->freelist, c); >
2020년 3월 13일 (금) 오전 1:42, Vlastimil Babka <vbabka@suse.cz>님이 작성: > > On 3/12/20 5:13 PM, Srikar Dronamraju wrote: > > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]: > > > >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: > >> > > >> >> On 3/12/20 9:23 AM, Sachin Sant wrote: > >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need > >> >> >>>> to make sure all possible but not present cpu_to_node are set to a > >> >> >>>> proper node. > >> >> >>> > >> >> >>> Just curious, is this somehow related to > >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz? > >> >> >>> > >> >> >> > >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years. > >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate > >> >> >> shrinker_map on appropriate NUMA node"). > >> >> >> > >> > > >> > While I am not an expert in the slub area, I looked at the patch > >> > a75056fc1e7c and had some thoughts on why this could be causing this issue. > >> > > >> > On the system where the crash happens, the possible number of nodes is much > >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only > >> > available for onlined nodes. > >> > > >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node > >> > for all possible nodes and in ___slab_alloc we end up looking at the > >> > node_present_pages which is NODE_DATA(nid)->node_present_pages. > >> > i.e for a node whose pdgat struct is not allocated, we are trying to > >> > dereference. > >> > >> From what we saw, the pgdat does exist, the problem is that slab's per-node data > >> doesn't exist for a node that doesn't have present pages, as it would be a waste > >> of memory. > > > > Just to be clear > > Before my 3 patches to fix dummy node: > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 0-1 > > OK > > >> > >> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In > >> Sachin's first report [1] we have > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> [ 0.000000] numa: NODE_DATA(0) on node 1 > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff] > >> > > > > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the > > rest 30 nodes. > > I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should > check online first, as you suggested. > > >> But in this thread, with your patches Sachin reports: > > > > and with my patches > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible > > 0-31 > > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online > > 1 > > > >> > >> [ 0.000000] numa: NODE_DATA [mem 0x8bfedc900-0x8bfee3fff] > >> > > > > so we only see one pgdat. > > > >> So I assume it's just node 1. In that case, node_present_pages is really dangerous. > >> > >> [1] > >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/ > >> > >> > Also for a memoryless/cpuless node or possible but not present nodes, > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc). > >> > >> I think that's the place where this would be best to fix. > >> > > > > Maybe. I thought about it but the current set_numa_mem semantics are apt > > for memoryless cpu node and not for possible nodes. We could have upto 256 > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory. > > node_to_mem_node seems to return what is set in set_numa_mem(). > > set_numa_mem() seems to say set my numa_mem node for the current memoryless > > node to the param passed. > > > > But how do we set numa_mem for all the other 253 possible nodes, which > > probably will have 0 as default? > > > > Should we introduce another API such that we could update for all possible > > nodes? > > If we want to rely on node_to_mem_node() to give us something safe for each > possible node, then probably it would have to be like that, yeah. > > >> > I tried with this hunk below and it works. > >> > > >> > But I am not sure if we need to check at other places were > >> > node_present_pages is being called. > >> > >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it > >> return only nodes that are online with present memory? > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add > >> support for node_to_mem_node() to determine the fallback node") > >> > > > > Agree I lost all the memory about it. :) Anyway, how about this? 1. make node_present_pages() safer static inline node_present_pages(nid) { if (!node_online(nid)) return 0; return (NODE_DATA(nid)->node_present_pages); } 2. make node_to_mem_node() safer for all cases In ppc arch's mem_topology_setup(void) for_each_present_cpu(cpu) { numa_setup_cpu(cpu); mem_node = node_to_mem_node(numa_mem_id()); if (!node_present_pages(mem_node)) { _node_numa_mem_[numa_mem_id()] = first_online_node; } } With these two changes, we can uses node_present_pages() and node_to_mem_node() as intended. Thanks. > >> I think we do need well defined and documented rules around node_to_mem_node(), > >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so > >> everyone handles it the same, safe way. > > So let's try to brainstorm how this would look like? What I mean are some rules > like below, even if some details in my current understanding are most likely > incorrect: > > with nid present in: > N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online > node with memory so that we don't require everyone to search for it in slightly > different ways > N_ONLINE - pgdat must exist, there doesn't have to be present memory, > node_to_mem_node() still has to return something else (?) > N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself > N_HIGH_MEMORY - node has present high memory
* Joonsoo Kim <js1304@gmail.com> [2020-03-13 18:47:49]: > > >> > > >> > Also for a memoryless/cpuless node or possible but not present nodes, > > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc). > > >> > > >> I think that's the place where this would be best to fix. > > >> > > > > > > Maybe. I thought about it but the current set_numa_mem semantics are apt > > > for memoryless cpu node and not for possible nodes. We could have upto 256 > > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory. > > > node_to_mem_node seems to return what is set in set_numa_mem(). > > > set_numa_mem() seems to say set my numa_mem node for the current memoryless > > > node to the param passed. > > > > > > But how do we set numa_mem for all the other 253 possible nodes, which > > > probably will have 0 as default? > > > > > > Should we introduce another API such that we could update for all possible > > > nodes? > > > > If we want to rely on node_to_mem_node() to give us something safe for each > > possible node, then probably it would have to be like that, yeah. > > > > >> > I tried with this hunk below and it works. > > >> > > > >> > But I am not sure if we need to check at other places were > > >> > node_present_pages is being called. > > >> > > >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it > > >> return only nodes that are online with present memory? > > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add > > >> support for node_to_mem_node() to determine the fallback node") > > >> > > > > > > Agree > > I lost all the memory about it. :) > Anyway, how about this? > > 1. make node_present_pages() safer > static inline node_present_pages(nid) > { > if (!node_online(nid)) return 0; > return (NODE_DATA(nid)->node_present_pages); > } > Yes this would help. > 2. make node_to_mem_node() safer for all cases > In ppc arch's mem_topology_setup(void) > for_each_present_cpu(cpu) { > numa_setup_cpu(cpu); > mem_node = node_to_mem_node(numa_mem_id()); > if (!node_present_pages(mem_node)) { > _node_numa_mem_[numa_mem_id()] = first_online_node; > } > } > But here as discussed above, we miss the case of possible but not present nodes. For such nodes, the above change may not update, resulting in they still having 0. And node 0 can be only possible but not present.
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 17:41:58]: > On 3/12/20 5:13 PM, Srikar Dronamraju wrote: > > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]: > > > >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]: > >> > > >> >> On 3/12/20 9:23 AM, Sachin Sant wrote: > >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]: > >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote: > >> I think we do need well defined and documented rules around node_to_mem_node(), > >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so > >> everyone handles it the same, safe way. > > So let's try to brainstorm how this would look like? What I mean are some rules > like below, even if some details in my current understanding are most likely > incorrect: > Agree. > with nid present in: > N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online > node with memory so that we don't require everyone to search for it in slightly > different ways > N_ONLINE - pgdat must exist, there doesn't have to be present memory, > node_to_mem_node() still has to return something else (?) Right, think this has been taken care of at this time. > N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself > N_HIGH_MEMORY - node has present high memory > dont see any problems with the above two to. That leaves us with N_POSSIBLE. > > > > Other option would be to tweak Kirill Tkhai's patch such that we call > > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc > > if the node is offline. > > I really would like a solution that hides these ugly details from callers so > they don't have to workaround the APIs we provide. kvmalloc_node() really > shouldn't crash, and it should fallback automatically if we don't give it > __GFP_THISNODE > Agree thats its better to make API's robust where possible. > However, taking a step back, memcg_alloc_shrinker_maps() is probably rather > wasteful on systems with 256 possible nodes and only few present, by allocating > effectively dead structures for each memcg. > If we dont allocate now, we would have to allocate them when we online the nodes. To me it looks better to allocate as soon as the nodes are onlined, > SLUB tries to be smart, so it allocates the per-node per-cache structures only > when the node goes online in slab_mem_going_online_callback(). This is why > there's a crash when such non-existing structures are accessed for a node that's > not online, and why they shouldn't be accessed. > > Perhaps memcg should do the same on-demand allocation, if possible. > Right.
On 3/13/20 12:04 PM, Srikar Dronamraju wrote: >> I lost all the memory about it. :) >> Anyway, how about this? >> >> 1. make node_present_pages() safer >> static inline node_present_pages(nid) >> { >> if (!node_online(nid)) return 0; >> return (NODE_DATA(nid)->node_present_pages); >> } >> > > Yes this would help. Looks good, yeah. >> 2. make node_to_mem_node() safer for all cases >> In ppc arch's mem_topology_setup(void) >> for_each_present_cpu(cpu) { >> numa_setup_cpu(cpu); >> mem_node = node_to_mem_node(numa_mem_id()); >> if (!node_present_pages(mem_node)) { >> _node_numa_mem_[numa_mem_id()] = first_online_node; >> } >> } >> > > But here as discussed above, we miss the case of possible but not present nodes. > For such nodes, the above change may not update, resulting in they still > having 0. And node 0 can be only possible but not present. So is there other way to do the setup so that node_to_mem_node() returns an online+present node when called for any possible node?
2020년 3월 13일 (금) 오후 8:38, Vlastimil Babka <vbabka@suse.cz>님이 작성: > > On 3/13/20 12:04 PM, Srikar Dronamraju wrote: > >> I lost all the memory about it. :) > >> Anyway, how about this? > >> > >> 1. make node_present_pages() safer > >> static inline node_present_pages(nid) > >> { > >> if (!node_online(nid)) return 0; > >> return (NODE_DATA(nid)->node_present_pages); > >> } > >> > > > > Yes this would help. > > Looks good, yeah. > > >> 2. make node_to_mem_node() safer for all cases > >> In ppc arch's mem_topology_setup(void) > >> for_each_present_cpu(cpu) { > >> numa_setup_cpu(cpu); > >> mem_node = node_to_mem_node(numa_mem_id()); > >> if (!node_present_pages(mem_node)) { > >> _node_numa_mem_[numa_mem_id()] = first_online_node; > >> } > >> } > >> > > > > But here as discussed above, we miss the case of possible but not present nodes. > > For such nodes, the above change may not update, resulting in they still > > having 0. And node 0 can be only possible but not present. Oops, I don't read full thread so miss the case. > So is there other way to do the setup so that node_to_mem_node() returns an > online+present node when called for any possible node? Two changes seems to be sufficient. 1. initialize all node's _node_numa_mem_[] = first_online_node in mem_topology_setup() 2. replace the node with online+present node for _node_to_mem_node_[] in set_cpu_numa_mem(). static inline void set_cpu_numa_mem(int cpu, int node) { per_cpu(_numa_mem_, cpu) = node; + if (!node_present_pages(node)) + node = first_online_node; _node_numa_mem_[cpu_to_node(cpu)] = node; } #endif With these two change, we can safely call node_to_mem_node() anywhere. Thanks.
On Thu 12-03-20 17:41:58, Vlastimil Babka wrote: [...] > with nid present in: > N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online I would rather have a dummy pgdat for those. Have a look at $ git grep "NODE_DATA.*->" | wc -l 63 Who knows how many else we have there. I haven't looked more closely. Besides that what is a real reason to not have pgdat ther and force all users of a $random node from those that the platform considers possible for special casing? Is that a memory overhead? Is that really a thing? Somebody has suggested to tweak some of the low level routines to do the special casing but I really have to say I do not like that. We shouldn't use the first online node or anything like that. We should simply always follow the topology presented by FW and of that we need to have a pgdat.
On 3/16/20 10:06 AM, Michal Hocko wrote: > On Thu 12-03-20 17:41:58, Vlastimil Babka wrote: > [...] >> with nid present in: >> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online > > I would rather have a dummy pgdat for those. Have a look at > $ git grep "NODE_DATA.*->" | wc -l > 63 > > Who knows how many else we have there. I haven't looked more closely. > Besides that what is a real reason to not have pgdat ther and force all > users of a $random node from those that the platform considers possible > for special casing? Is that a memory overhead? Is that really a thing? I guess we can ignore memory overhead. I guess there only might be some concern that for nodes that are initially offline, we will allocate the pgdat on a different node, and after they are online, it will stay on a different node with more access latency from local cpus. If we only allocate for online nodes, it can always be local? But I guess it doesn't matter that much. > Somebody has suggested to tweak some of the low level routines to do the > special casing but I really have to say I do not like that. We shouldn't > use the first online node or anything like that. We should simply always > follow the topology presented by FW and of that we need to have a pgdat. >
On Tue 17-03-20 14:44:45, Vlastimil Babka wrote: > On 3/16/20 10:06 AM, Michal Hocko wrote: > > On Thu 12-03-20 17:41:58, Vlastimil Babka wrote: > > [...] > >> with nid present in: > >> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online > > > > I would rather have a dummy pgdat for those. Have a look at > > $ git grep "NODE_DATA.*->" | wc -l > > 63 > > > > Who knows how many else we have there. I haven't looked more closely. > > Besides that what is a real reason to not have pgdat ther and force all > > users of a $random node from those that the platform considers possible > > for special casing? Is that a memory overhead? Is that really a thing? > > I guess we can ignore memory overhead. I guess there only might be some concern > that for nodes that are initially offline, we will allocate the pgdat on a > different node, and after they are online, it will stay on a different node with > more access latency from local cpus. If we only allocate for online nodes, it > can always be local? But I guess it doesn't matter that much. This is not the case even now because of chicke&egg. You need a memory to allocate from and that memory has to be managed somewhere per node (pgdat). Keep in mind we do not have the bootmem allocator for the hotplug. Have a look at hotadd_new_pgdat and when it is called. There are some attempts to allocate memmap from the hotpluged memory but I am not sure we can do the whole thing without pgdat in place. If we can then can come up with some replace the pgdat magic. But still I am not even sure this is something we really have to optimize for.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index 8a399db..54dcd49 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -931,8 +931,20 @@ void __init mem_topology_setup(void) reset_numa_cpu_lookup_table(); - for_each_present_cpu(cpu) - numa_setup_cpu(cpu); + for_each_possible_cpu(cpu) { + /* + * Powerpc with CONFIG_NUMA always used to have a node 0, + * even if it was memoryless or cpuless. For all cpus that + * are possible but not present, cpu_to_node() would point + * to node 0. To remove a cpuless, memoryless dummy node, + * powerpc need to make sure all possible but not present + * cpu_to_node are set to a proper node. + */ + if (cpu_present(cpu)) + numa_setup_cpu(cpu); + else + set_cpu_numa_node(cpu, first_online_node); + } } void __init initmem_init(void)
A Powerpc system with multiple possible nodes and with CONFIG_NUMA enabled always used to have a node 0, even if node 0 does not any cpus or memory attached to it. As per PAPR, node affinity of a cpu is only available once its present / online. For all cpus that are possible but not present, cpu_to_node() would point to node 0. To ensure a cpuless, memoryless dummy node is not online, powerpc need to make sure all possible but not present cpu_to_node are set to a proper node. Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Christopher Lameter <cl@linux.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> --- arch/powerpc/mm/numa.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-)