diff mbox series

[1/3] powerpc/numa: Set numa_node for all possible cpus

Message ID 20200311110237.5731-2-srikar@linux.vnet.ibm.com (mailing list archive)
State New, archived
Headers show
Series Offline memoryless cpuless node 0 | expand

Commit Message

Srikar Dronamraju March 11, 2020, 11:02 a.m. UTC
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.

To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Comments

Michal Hocko March 11, 2020, 11:57 a.m. UTC | #1
On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> enabled always used to have a node 0, even if node 0 does not any cpus
> or memory attached to it. As per PAPR, node affinity of a cpu is only
> available once its present / online. For all cpus that are possible but
> not present, cpu_to_node() would point to node 0.
> 
> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> to make sure all possible but not present cpu_to_node are set to a
> proper node.

Just curious, is this somehow related to
http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/numa.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8a399db..54dcd49 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
>  
>  	reset_numa_cpu_lookup_table();
>  
> -	for_each_present_cpu(cpu)
> -		numa_setup_cpu(cpu);
> +	for_each_possible_cpu(cpu) {
> +		/*
> +		 * Powerpc with CONFIG_NUMA always used to have a node 0,
> +		 * even if it was memoryless or cpuless. For all cpus that
> +		 * are possible but not present, cpu_to_node() would point
> +		 * to node 0. To remove a cpuless, memoryless dummy node,
> +		 * powerpc need to make sure all possible but not present
> +		 * cpu_to_node are set to a proper node.
> +		 */
> +		if (cpu_present(cpu))
> +			numa_setup_cpu(cpu);
> +		else
> +			set_cpu_numa_node(cpu, first_online_node);
> +	}
>  }
>  
>  void __init initmem_init(void)
> -- 
> 1.8.3.1
Srikar Dronamraju March 12, 2020, 5:27 a.m. UTC | #2
* Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:

> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> > A Powerpc system with multiple possible nodes and with CONFIG_NUMA
> > enabled always used to have a node 0, even if node 0 does not any cpus
> > or memory attached to it. As per PAPR, node affinity of a cpu is only
> > available once its present / online. For all cpus that are possible but
> > not present, cpu_to_node() would point to node 0.
> > 
> > To ensure a cpuless, memoryless dummy node is not online, powerpc need
> > to make sure all possible but not present cpu_to_node are set to a
> > proper node.
> 
> Just curious, is this somehow related to
> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> 

The issue I am trying to fix is a known issue in Powerpc since many years.
So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
shrinker_map on appropriate NUMA node"). 

I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
kernel. Will work with Sachin/Abdul (reporters of the issue).


> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> > Cc: Christopher Lameter <cl@linux.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/mm/numa.c | 16 ++++++++++++++--
> >  1 file changed, 14 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> > index 8a399db..54dcd49 100644
> > --- a/arch/powerpc/mm/numa.c
> > +++ b/arch/powerpc/mm/numa.c
> > @@ -931,8 +931,20 @@ void __init mem_topology_setup(void)
> >  
> >  	reset_numa_cpu_lookup_table();
> >  
> > -	for_each_present_cpu(cpu)
> > -		numa_setup_cpu(cpu);
> > +	for_each_possible_cpu(cpu) {
> > +		/*
> > +		 * Powerpc with CONFIG_NUMA always used to have a node 0,
> > +		 * even if it was memoryless or cpuless. For all cpus that
> > +		 * are possible but not present, cpu_to_node() would point
> > +		 * to node 0. To remove a cpuless, memoryless dummy node,
> > +		 * powerpc need to make sure all possible but not present
> > +		 * cpu_to_node are set to a proper node.
> > +		 */
> > +		if (cpu_present(cpu))
> > +			numa_setup_cpu(cpu);
> > +		else
> > +			set_cpu_numa_node(cpu, first_online_node);
> > +	}
> >  }
> >  
> >  void __init initmem_init(void)
> > -- 
> > 1.8.3.1
> 
> -- 
> Michal Hocko
> SUSE Labs
Sachin Sant March 12, 2020, 8:23 a.m. UTC | #3
> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> 
>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
>>> enabled always used to have a node 0, even if node 0 does not any cpus
>>> or memory attached to it. As per PAPR, node affinity of a cpu is only
>>> available once its present / online. For all cpus that are possible but
>>> not present, cpu_to_node() would point to node 0.
>>> 
>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>>> to make sure all possible but not present cpu_to_node are set to a
>>> proper node.
>> 
>> Just curious, is this somehow related to
>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> 
> 
> The issue I am trying to fix is a known issue in Powerpc since many years.
> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> shrinker_map on appropriate NUMA node"). 
> 
> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> kernel. Will work with Sachin/Abdul (reporters of the issue).
> 

I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
The kernel still fails to boot with same call trace.

[    6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0
[    6.159363] Faulting instruction address: 0xc0000000003d7174
[    6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
[    6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[    6.159378] Modules linked in:
[    6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1
[    6.159388] NIP:  c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70
[    6.159393] REGS: c0000008b36836d0 TRAP: 0300   Not tainted  (5.6.0-rc5-next-20200311-autotest+)
[    6.159398] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004848  XER: 00000000
[    6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
[    6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500
[    6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620
[    6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000
[    6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000
[    6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002
[    6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122
[    6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8
[    6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40
[    6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760
[    6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60
[    6.159462] Call Trace:
[    6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
[    6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
[    6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
[    6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
[    6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
[    6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
[    6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
[    6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
[    6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
[    6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
[    6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
[    6.159527] Instruction dump:
[    6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088
[    6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a
[    6.159545] ---[ end trace 36d65cb66091a5b6 ]—

Boot log attached.

Thanks
-Sachin
Vlastimil Babka March 12, 2020, 9:30 a.m. UTC | #4
On 3/12/20 9:23 AM, Sachin Sant wrote:
> 
> 
>> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> 
>> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> 
>>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>>>> A Powerpc system with multiple possible nodes and with CONFIG_NUMA
>>>> enabled always used to have a node 0, even if node 0 does not any cpus
>>>> or memory attached to it. As per PAPR, node affinity of a cpu is only
>>>> available once its present / online. For all cpus that are possible but
>>>> not present, cpu_to_node() would point to node 0.
>>>> 
>>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>>>> to make sure all possible but not present cpu_to_node are set to a
>>>> proper node.
>>> 
>>> Just curious, is this somehow related to
>>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>>> 
>> 
>> The issue I am trying to fix is a known issue in Powerpc since many years.
>> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> shrinker_map on appropriate NUMA node"). 
>> 
>> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> kernel. Will work with Sachin/Abdul (reporters of the issue).
>> 
> 
> I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
> The kernel still fails to boot with same call trace.

Yeah when I skimmed the patches, I don't think they address the issue where
node_to_mem_node(0) = 0 [1]. You could reapply the debug print patch to verify,
but it seems very likely. So I'm not surprised you get the same trace.

[1] https://lore.kernel.org/linux-next/9a86f865-50b5-7483-9257-dbb08fecd62b@suse.cz/

> [    6.159357] BUG: Kernel NULL pointer dereference on read at 0x000073b0
> [    6.159363] Faulting instruction address: 0xc0000000003d7174
> [    6.159368] Oops: Kernel access of bad area, sig: 11 [#1]
> [    6.159372] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [    6.159378] Modules linked in:
> [    6.159382] CPU: 17 PID: 1 Comm: systemd Not tainted 5.6.0-rc5-next-20200311-autotest+ #1
> [    6.159388] NIP:  c0000000003d7174 LR: c0000000003d7714 CTR: c000000000400e70
> [    6.159393] REGS: c0000008b36836d0 TRAP: 0300   Not tainted  (5.6.0-rc5-next-20200311-autotest+)
> [    6.159398] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004848  XER: 00000000
> [    6.159406] CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
> [    6.159406] GPR00: c0000000003d7714 c0000008b3683960 c00000000155e300 c0000008b301f500
> [    6.159406] GPR04: 0000000000000dc0 0000000000000000 c0000000003456f8 c0000008bb198620
> [    6.159406] GPR08: 00000008ba0f0000 0000000000000001 0000000000000000 0000000000000000
> [    6.159406] GPR12: 0000000024004848 c00000001ec55e00 0000000000000000 0000000000000000
> [    6.159406] GPR16: c0000008b0a82048 c000000001595898 c000000001750ca8 0000000000000002
> [    6.159406] GPR20: c000000001750cb8 c000000001624478 0000000fffffffe0 5deadbeef0000122
> [    6.159406] GPR24: 0000000000000001 0000000000000dc0 0000000000000000 c0000000003456f8
> [    6.159406] GPR28: c0000008b301f500 c0000008bb198620 0000000000000000 c00c000002285a40
> [    6.159453] NIP [c0000000003d7174] ___slab_alloc+0x1f4/0x760
> [    6.159458] LR [c0000000003d7714] __slab_alloc+0x34/0x60
> [    6.159462] Call Trace:
> [    6.159465] [c0000008b3683a40] [c0000008b3683a70] 0xc0000008b3683a70
> [    6.159471] [c0000008b3683a70] [c0000000003d8b20] __kmalloc_node+0x110/0x490
> [    6.159477] [c0000008b3683af0] [c0000000003456f8] kvmalloc_node+0x58/0x110
> [    6.159483] [c0000008b3683b30] [c000000000400f78] mem_cgroup_css_online+0x108/0x270
> [    6.159489] [c0000008b3683b90] [c000000000236ed8] online_css+0x48/0xd0
> [    6.159494] [c0000008b3683bc0] [c00000000023ffac] cgroup_apply_control_enable+0x2ec/0x4d0
> [    6.159501] [c0000008b3683ca0] [c0000000002437c8] cgroup_mkdir+0x228/0x5f0
> [    6.159506] [c0000008b3683d10] [c000000000521780] kernfs_iop_mkdir+0x90/0xf0
> [    6.159512] [c0000008b3683d50] [c00000000043f670] vfs_mkdir+0x110/0x230
> [    6.159517] [c0000008b3683da0] [c000000000443150] do_mkdirat+0xb0/0x1a0
> [    6.159523] [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
> [    6.159527] Instruction dump:
> [    6.159531] 7c421378 e95f0000 714a0001 4082fff0 4bffff64 60000000 60000000 faa10088
> [    6.159538] 3ea2000c 3ab56178 7b4a1f24 7d55502a <e94a73b0> 2faa0000 409e0394 3d02002a
> [    6.159545] ---[ end trace 36d65cb66091a5b6 ]—
> 
> Boot log attached.
> 
> Thanks
> -Sachin
>
Srikar Dronamraju March 12, 2020, 1:14 p.m. UTC | #5
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:

> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >>>> to make sure all possible but not present cpu_to_node are set to a
> >>>> proper node.
> >>> 
> >>> Just curious, is this somehow related to
> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >>> 
> >> 
> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> shrinker_map on appropriate NUMA node"). 
> >> 
> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
> >> kernel. Will work with Sachin/Abdul (reporters of the issue).

I had used v1 and not v2. So my mistake.

> > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
> > The kernel still fails to boot with same call trace.
> 

While I am not an expert in the slub area, I looked at the patch
a75056fc1e7c and had some thoughts on why this could be causing this issue.

On the system where the crash happens, the possible number of nodes is much
greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
available for onlined nodes.

With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
for all possible nodes and in ___slab_alloc we end up looking at the
node_present_pages which is NODE_DATA(nid)->node_present_pages.
i.e for a node whose pdgat struct is not allocated, we are trying to
dereference.

Also for a memoryless/cpuless node or possible but not present nodes,
node_to_mem_node(node) will still end up as node (atleast on powerpc).

I tried with this hunk below and it works.

But I am not sure if we need to check at other places were
node_present_pages is being called.

diff --git a/mm/slub.c b/mm/slub.c
index 626cbcbd977f..bddb93bed55e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	if (unlikely(!node_match(page, node))) {
 		int searchnode = node;
 
-		if (node != NUMA_NO_NODE && !node_present_pages(node))
-			searchnode = node_to_mem_node(node);
-
+		if (node != NUMA_NO_NODE) {
+			if (!node_online(node) || !node_present_pages(node)) {
+				searchnode = node_to_mem_node(node);
+				if (!node_online(searchnode))
+					searchnode = first_online_node;
+			}
+		}
 		if (unlikely(!node_match(page, searchnode))) {
 			stat(s, ALLOC_NODE_MISMATCH);
 			deactivate_slab(s, page, c->freelist, c);

> > 
>
Vlastimil Babka March 12, 2020, 1:51 p.m. UTC | #6
On 3/12/20 2:14 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> 
>> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >>>> proper node.
>> >>> 
>> >>> Just curious, is this somehow related to
>> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> >>> 
>> >> 
>> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> shrinker_map on appropriate NUMA node"). 
>> >> 
>> >> I tried v5.6-rc4 + a75056fc1e7c but didnt face any issues booting the
>> >> kernel. Will work with Sachin/Abdul (reporters of the issue).
> 
> I had used v1 and not v2. So my mistake.
> 
>> > I applied this 3 patch series on top of March 11 next tree (commit d44a64766795 )
>> > The kernel still fails to boot with same call trace.
>> 
> 
> While I am not an expert in the slub area, I looked at the patch
> a75056fc1e7c and had some thoughts on why this could be causing this issue.
> 
> On the system where the crash happens, the possible number of nodes is much
> greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> available for onlined nodes.
> 
> With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> for all possible nodes and in ___slab_alloc we end up looking at the
> node_present_pages which is NODE_DATA(nid)->node_present_pages.
> i.e for a node whose pdgat struct is not allocated, we are trying to
> dereference.

From what we saw, the pgdat does exist, the problem is that slab's per-node data
doesn't exist for a node that doesn't have present pages, as it would be a waste
of memory.

Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
Sachin's first report [1] we have

[    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
[    0.000000] numa:     NODE_DATA(0) on node 1
[    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]

But in this thread, with your patches Sachin reports:

[    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]

So I assume it's just node 1. In that case, node_present_pages is really dangerous.

[1]
https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/

> Also for a memoryless/cpuless node or possible but not present nodes,
> node_to_mem_node(node) will still end up as node (atleast on powerpc).

I think that's the place where this would be best to fix.

> I tried with this hunk below and it works.
> 
> But I am not sure if we need to check at other places were
> node_present_pages is being called.

I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
return only nodes that are online with present memory?
CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
support for node_to_mem_node() to determine the fallback node")

I think we do need well defined and documented rules around node_to_mem_node(),
cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
everyone handles it the same, safe way.

> diff --git a/mm/slub.c b/mm/slub.c
> index 626cbcbd977f..bddb93bed55e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	if (unlikely(!node_match(page, node))) {
>  		int searchnode = node;
>  
> -		if (node != NUMA_NO_NODE && !node_present_pages(node))
> -			searchnode = node_to_mem_node(node);
> -
> +		if (node != NUMA_NO_NODE) {
> +			if (!node_online(node) || !node_present_pages(node)) {
> +				searchnode = node_to_mem_node(node);
> +				if (!node_online(searchnode))
> +					searchnode = first_online_node;
> +			}
> +		}
>  		if (unlikely(!node_match(page, searchnode))) {
>  			stat(s, ALLOC_NODE_MISMATCH);
>  			deactivate_slab(s, page, c->freelist, c);
> 
>> > 
>> 
>
Srikar Dronamraju March 12, 2020, 4:13 p.m. UTC | #7
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:

> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> > 
> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >>>> proper node.
> >> >>> 
> >> >>> Just curious, is this somehow related to
> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >> >>> 
> >> >> 
> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> >> shrinker_map on appropriate NUMA node"). 
> >> >> 
> > 
> > While I am not an expert in the slub area, I looked at the patch
> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> > 
> > On the system where the crash happens, the possible number of nodes is much
> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> > available for onlined nodes.
> > 
> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> > for all possible nodes and in ___slab_alloc we end up looking at the
> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> > i.e for a node whose pdgat struct is not allocated, we are trying to
> > dereference.
> 
> From what we saw, the pgdat does exist, the problem is that slab's per-node data
> doesn't exist for a node that doesn't have present pages, as it would be a waste
> of memory.

Just to be clear
Before my 3 patches to fix dummy node:
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
0-1

> 
> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> Sachin's first report [1] we have
> 
> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> [    0.000000] numa:     NODE_DATA(0) on node 1
> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> 

So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
rest 30 nodes.

> But in this thread, with your patches Sachin reports:

and with my patches
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
0-31
srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
1

> 
> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> 

so we only see one pgdat.

> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
> 
> [1]
> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
> 
> > Also for a memoryless/cpuless node or possible but not present nodes,
> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> 
> I think that's the place where this would be best to fix.
> 

Maybe. I thought about it but the current set_numa_mem semantics are apt
for memoryless cpu node and not for possible nodes.  We could have upto 256
possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
node_to_mem_node seems to return what is set in set_numa_mem().
set_numa_mem() seems to say set my numa_mem node for the current memoryless
node to the param passed.

But how do we set numa_mem for all the other 253 possible nodes, which
probably will have 0 as default?

Should we introduce another API such that we could update for all possible
nodes?

> > I tried with this hunk below and it works.
> > 
> > But I am not sure if we need to check at other places were
> > node_present_pages is being called.
> 
> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> return only nodes that are online with present memory?
> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> support for node_to_mem_node() to determine the fallback node")
> 

Agree 

> I think we do need well defined and documented rules around node_to_mem_node(),
> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> everyone handles it the same, safe way.
> 

Other option would be to tweak Kirill Tkhai's patch such that we call
kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
if the node is offline.

> > diff --git a/mm/slub.c b/mm/slub.c
> > index 626cbcbd977f..bddb93bed55e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  	if (unlikely(!node_match(page, node))) {
> >  		int searchnode = node;
> >  
> > -		if (node != NUMA_NO_NODE && !node_present_pages(node))
> > -			searchnode = node_to_mem_node(node);
> > -
> > +		if (node != NUMA_NO_NODE) {
> > +			if (!node_online(node) || !node_present_pages(node)) {
> > +				searchnode = node_to_mem_node(node);
> > +				if (!node_online(searchnode))
> > +					searchnode = first_online_node;
> > +			}
> > +		}
> >  		if (unlikely(!node_match(page, searchnode))) {
> >  			stat(s, ALLOC_NODE_MISMATCH);
> >  			deactivate_slab(s, page, c->freelist, c);
Vlastimil Babka March 12, 2020, 4:41 p.m. UTC | #8
On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> 
>> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
>> > 
>> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
>> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
>> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
>> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
>> >> >>>> to make sure all possible but not present cpu_to_node are set to a
>> >> >>>> proper node.
>> >> >>> 
>> >> >>> Just curious, is this somehow related to
>> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
>> >> >>> 
>> >> >> 
>> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
>> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
>> >> >> shrinker_map on appropriate NUMA node"). 
>> >> >> 
>> > 
>> > While I am not an expert in the slub area, I looked at the patch
>> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
>> > 
>> > On the system where the crash happens, the possible number of nodes is much
>> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
>> > available for onlined nodes.
>> > 
>> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
>> > for all possible nodes and in ___slab_alloc we end up looking at the
>> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
>> > i.e for a node whose pdgat struct is not allocated, we are trying to
>> > dereference.
>> 
>> From what we saw, the pgdat does exist, the problem is that slab's per-node data
>> doesn't exist for a node that doesn't have present pages, as it would be a waste
>> of memory.
> 
> Just to be clear
> Before my 3 patches to fix dummy node:
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 0-1

OK

>> 
>> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
>> Sachin's first report [1] we have
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> [    0.000000] numa:     NODE_DATA(0) on node 1
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
>> 
> 
> So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> rest 30 nodes.

I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
check online first, as you suggested.

>> But in this thread, with your patches Sachin reports:
> 
> and with my patches
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> 0-31
> srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> 1
> 
>> 
>> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
>> 
> 
> so we only see one pgdat.
> 
>> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>> 
>> > Also for a memoryless/cpuless node or possible but not present nodes,
>> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
>> 
>> I think that's the place where this would be best to fix.
>> 
> 
> Maybe. I thought about it but the current set_numa_mem semantics are apt
> for memoryless cpu node and not for possible nodes.  We could have upto 256
> possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> node_to_mem_node seems to return what is set in set_numa_mem().
> set_numa_mem() seems to say set my numa_mem node for the current memoryless
> node to the param passed.
> 
> But how do we set numa_mem for all the other 253 possible nodes, which
> probably will have 0 as default?
> 
> Should we introduce another API such that we could update for all possible
> nodes?

If we want to rely on node_to_mem_node() to give us something safe for each
possible node, then probably it would have to be like that, yeah.

>> > I tried with this hunk below and it works.
>> > 
>> > But I am not sure if we need to check at other places were
>> > node_present_pages is being called.
>> 
>> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
>> return only nodes that are online with present memory?
>> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
>> support for node_to_mem_node() to determine the fallback node")
>> 
> 
> Agree 
> 
>> I think we do need well defined and documented rules around node_to_mem_node(),
>> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
>> everyone handles it the same, safe way.

So let's try to brainstorm how this would look like? What I mean are some rules
like below, even if some details in my current understanding are most likely
incorrect:

with nid present in:
N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
node with memory so that we don't require everyone to search for it in slightly
different ways
N_ONLINE - pgdat must exist, there doesn't have to be present memory,
node_to_mem_node() still has to return something else (?)
N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
N_HIGH_MEMORY - node has present high memory

> 
> Other option would be to tweak Kirill Tkhai's patch such that we call
> kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> if the node is offline.

I really would like a solution that hides these ugly details from callers so
they don't have to workaround the APIs we provide. kvmalloc_node() really
shouldn't crash, and it should fallback automatically if we don't give it
__GFP_THISNODE

However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
wasteful on systems with 256 possible nodes and only few present, by allocating
effectively dead structures for each memcg.

SLUB tries to be smart, so it allocates the per-node per-cache structures only
when the node goes online in slab_mem_going_online_callback(). This is why
there's a crash when such non-existing structures are accessed for a node that's
not online, and why they shouldn't be accessed.

Perhaps memcg should do the same on-demand allocation, if possible.

>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index 626cbcbd977f..bddb93bed55e 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2571,9 +2571,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> >  	if (unlikely(!node_match(page, node))) {
>> >  		int searchnode = node;
>> >  
>> > -		if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > -			searchnode = node_to_mem_node(node);
>> > -
>> > +		if (node != NUMA_NO_NODE) {
>> > +			if (!node_online(node) || !node_present_pages(node)) {
>> > +				searchnode = node_to_mem_node(node);
>> > +				if (!node_online(searchnode))
>> > +					searchnode = first_online_node;
>> > +			}
>> > +		}
>> >  		if (unlikely(!node_match(page, searchnode))) {
>> >  			stat(s, ALLOC_NODE_MISMATCH);
>> >  			deactivate_slab(s, page, c->freelist, c);
>
Joonsoo Kim March 13, 2020, 9:47 a.m. UTC | #9
2020년 3월 13일 (금) 오전 1:42, Vlastimil Babka <vbabka@suse.cz>님이 작성:
>
> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> >
> >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> >> >
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> >> >>>> To ensure a cpuless, memoryless dummy node is not online, powerpc need
> >> >> >>>> to make sure all possible but not present cpu_to_node are set to a
> >> >> >>>> proper node.
> >> >> >>>
> >> >> >>> Just curious, is this somehow related to
> >> >> >>> http://lkml.kernel.org/r/20200227182650.GG3771@dhcp22.suse.cz?
> >> >> >>>
> >> >> >>
> >> >> >> The issue I am trying to fix is a known issue in Powerpc since many years.
> >> >> >> So this surely not a problem after a75056fc1e7c (mm/memcontrol.c: allocate
> >> >> >> shrinker_map on appropriate NUMA node").
> >> >> >>
> >> >
> >> > While I am not an expert in the slub area, I looked at the patch
> >> > a75056fc1e7c and had some thoughts on why this could be causing this issue.
> >> >
> >> > On the system where the crash happens, the possible number of nodes is much
> >> > greater than the number of onlined nodes. The pdgat or the NODE_DATA is only
> >> > available for onlined nodes.
> >> >
> >> > With a75056fc1e7c memcg_alloc_shrinker_maps, we end up calling kzalloc_node
> >> > for all possible nodes and in ___slab_alloc we end up looking at the
> >> > node_present_pages which is NODE_DATA(nid)->node_present_pages.
> >> > i.e for a node whose pdgat struct is not allocated, we are trying to
> >> > dereference.
> >>
> >> From what we saw, the pgdat does exist, the problem is that slab's per-node data
> >> doesn't exist for a node that doesn't have present pages, as it would be a waste
> >> of memory.
> >
> > Just to be clear
> > Before my 3 patches to fix dummy node:
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> > 0-31
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> > 0-1
>
> OK
>
> >>
> >> Uh actually you are probably right, the NODE_DATA doesn't exist anymore? In
> >> Sachin's first report [1] we have
> >>
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> >> [    0.000000] numa:     NODE_DATA(0) on node 1
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfed5200-0x8bfedc8ff]
> >>
> >
> > So even if pgdat would exist for nodes 0 and 1, there is no pgdat for the
> > rest 30 nodes.
>
> I see. Perhaps node_present_pages(node) is not safe in SLUB then and it should
> check online first, as you suggested.
>
> >> But in this thread, with your patches Sachin reports:
> >
> > and with my patches
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/possible
> > 0-31
> > srikar@ltc-zzci-2 /sys/devices/system/node $ cat $PWD/online
> > 1
> >
> >>
> >> [    0.000000] numa:   NODE_DATA [mem 0x8bfedc900-0x8bfee3fff]
> >>
> >
> > so we only see one pgdat.
> >
> >> So I assume it's just node 1. In that case, node_present_pages is really dangerous.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
> >>
> >> > Also for a memoryless/cpuless node or possible but not present nodes,
> >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> >>
> >> I think that's the place where this would be best to fix.
> >>
> >
> > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > for memoryless cpu node and not for possible nodes.  We could have upto 256
> > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > node_to_mem_node seems to return what is set in set_numa_mem().
> > set_numa_mem() seems to say set my numa_mem node for the current memoryless
> > node to the param passed.
> >
> > But how do we set numa_mem for all the other 253 possible nodes, which
> > probably will have 0 as default?
> >
> > Should we introduce another API such that we could update for all possible
> > nodes?
>
> If we want to rely on node_to_mem_node() to give us something safe for each
> possible node, then probably it would have to be like that, yeah.
>
> >> > I tried with this hunk below and it works.
> >> >
> >> > But I am not sure if we need to check at other places were
> >> > node_present_pages is being called.
> >>
> >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> >> return only nodes that are online with present memory?
> >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> >> support for node_to_mem_node() to determine the fallback node")
> >>
> >
> > Agree

I lost all the memory about it. :)
Anyway, how about this?

1. make node_present_pages() safer
static inline node_present_pages(nid)
{
if (!node_online(nid)) return 0;
return (NODE_DATA(nid)->node_present_pages);
}

2. make node_to_mem_node() safer for all cases
In ppc arch's mem_topology_setup(void)
for_each_present_cpu(cpu) {
 numa_setup_cpu(cpu);
 mem_node = node_to_mem_node(numa_mem_id());
 if (!node_present_pages(mem_node)) {
  _node_numa_mem_[numa_mem_id()] = first_online_node;
 }
}

With these two changes, we can uses node_present_pages() and node_to_mem_node()
as intended.

Thanks.

> >> I think we do need well defined and documented rules around node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
>
> So let's try to brainstorm how this would look like? What I mean are some rules
> like below, even if some details in my current understanding are most likely
> incorrect:
>
> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in slightly
> different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)
> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory
Srikar Dronamraju March 13, 2020, 11:04 a.m. UTC | #10
* Joonsoo Kim <js1304@gmail.com> [2020-03-13 18:47:49]:

> > >>
> > >> > Also for a memoryless/cpuless node or possible but not present nodes,
> > >> > node_to_mem_node(node) will still end up as node (atleast on powerpc).
> > >>
> > >> I think that's the place where this would be best to fix.
> > >>
> > >
> > > Maybe. I thought about it but the current set_numa_mem semantics are apt
> > > for memoryless cpu node and not for possible nodes.  We could have upto 256
> > > possible nodes and only 2 nodes (1,2) with cpu and 1 node (1) with memory.
> > > node_to_mem_node seems to return what is set in set_numa_mem().
> > > set_numa_mem() seems to say set my numa_mem node for the current memoryless
> > > node to the param passed.
> > >
> > > But how do we set numa_mem for all the other 253 possible nodes, which
> > > probably will have 0 as default?
> > >
> > > Should we introduce another API such that we could update for all possible
> > > nodes?
> >
> > If we want to rely on node_to_mem_node() to give us something safe for each
> > possible node, then probably it would have to be like that, yeah.
> >
> > >> > I tried with this hunk below and it works.
> > >> >
> > >> > But I am not sure if we need to check at other places were
> > >> > node_present_pages is being called.
> > >>
> > >> I think this seems to defeat the purpose of node_to_mem_node()? Shouldn't it
> > >> return only nodes that are online with present memory?
> > >> CCing Joonsoo who seems to have introduced this in ad2c8144418c ("topology: add
> > >> support for node_to_mem_node() to determine the fallback node")
> > >>
> > >
> > > Agree
> 
> I lost all the memory about it. :)
> Anyway, how about this?
> 
> 1. make node_present_pages() safer
> static inline node_present_pages(nid)
> {
> if (!node_online(nid)) return 0;
> return (NODE_DATA(nid)->node_present_pages);
> }
> 

Yes this would help.

> 2. make node_to_mem_node() safer for all cases
> In ppc arch's mem_topology_setup(void)
> for_each_present_cpu(cpu) {
>  numa_setup_cpu(cpu);
>  mem_node = node_to_mem_node(numa_mem_id());
>  if (!node_present_pages(mem_node)) {
>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>  }
> }
> 

But here as discussed above, we miss the case of possible but not present nodes.
For such nodes, the above change may not update, resulting in they still
having 0. And node 0 can be only possible but not present.
Srikar Dronamraju March 13, 2020, 11:22 a.m. UTC | #11
* Vlastimil Babka <vbabka@suse.cz> [2020-03-12 17:41:58]:

> On 3/12/20 5:13 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 14:51:38]:
> > 
> >> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-12 10:30:50]:
> >> > 
> >> >> On 3/12/20 9:23 AM, Sachin Sant wrote:
> >> >> >> On 12-Mar-2020, at 10:57 AM, Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >> >> >> * Michal Hocko <mhocko@kernel.org> [2020-03-11 12:57:35]:
> >> >> >>> On Wed 11-03-20 16:32:35, Srikar Dronamraju wrote:
> >> I think we do need well defined and documented rules around node_to_mem_node(),
> >> cpu_to_node(), existence of NODE_DATA, various node_states bitmaps etc so
> >> everyone handles it the same, safe way.
> 
> So let's try to brainstorm how this would look like? What I mean are some rules
> like below, even if some details in my current understanding are most likely
> incorrect:
> 

Agree.

> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> node with memory so that we don't require everyone to search for it in slightly
> different ways
> N_ONLINE - pgdat must exist, there doesn't have to be present memory,
> node_to_mem_node() still has to return something else (?)

Right, think this has been taken care of at this time.

> N_NORMAL_MEMORY - there is present memory, node_to_mem_node() returns itself
> N_HIGH_MEMORY - node has present high memory
> 

dont see any problems with the above two to. That leaves us with N_POSSIBLE.

> > 
> > Other option would be to tweak Kirill Tkhai's patch such that we call
> > kvmalloc_node()/kzalloc_node() if node is online and call kvmalloc/kvzalloc
> > if the node is offline.
> 
> I really would like a solution that hides these ugly details from callers so
> they don't have to workaround the APIs we provide. kvmalloc_node() really
> shouldn't crash, and it should fallback automatically if we don't give it
> __GFP_THISNODE
> 

Agree thats its better to make API's robust where possible.

> However, taking a step back, memcg_alloc_shrinker_maps() is probably rather
> wasteful on systems with 256 possible nodes and only few present, by allocating
> effectively dead structures for each memcg.
> 

If we dont allocate now, we would have to allocate them when we online the
nodes. To me it looks better to allocate as soon as the nodes are onlined,

> SLUB tries to be smart, so it allocates the per-node per-cache structures only
> when the node goes online in slab_mem_going_online_callback(). This is why
> there's a crash when such non-existing structures are accessed for a node that's
> not online, and why they shouldn't be accessed.
> 
> Perhaps memcg should do the same on-demand allocation, if possible.
> 

Right.
Vlastimil Babka March 13, 2020, 11:38 a.m. UTC | #12
On 3/13/20 12:04 PM, Srikar Dronamraju wrote:
>> I lost all the memory about it. :)
>> Anyway, how about this?
>> 
>> 1. make node_present_pages() safer
>> static inline node_present_pages(nid)
>> {
>> if (!node_online(nid)) return 0;
>> return (NODE_DATA(nid)->node_present_pages);
>> }
>> 
> 
> Yes this would help.

Looks good, yeah.

>> 2. make node_to_mem_node() safer for all cases
>> In ppc arch's mem_topology_setup(void)
>> for_each_present_cpu(cpu) {
>>  numa_setup_cpu(cpu);
>>  mem_node = node_to_mem_node(numa_mem_id());
>>  if (!node_present_pages(mem_node)) {
>>   _node_numa_mem_[numa_mem_id()] = first_online_node;
>>  }
>> }
>> 
> 
> But here as discussed above, we miss the case of possible but not present nodes.
> For such nodes, the above change may not update, resulting in they still
> having 0. And node 0 can be only possible but not present.

So is there other way to do the setup so that node_to_mem_node() returns an
online+present node when called for any possible node?
Joonsoo Kim March 16, 2020, 8:15 a.m. UTC | #13
2020년 3월 13일 (금) 오후 8:38, Vlastimil Babka <vbabka@suse.cz>님이 작성:
>
> On 3/13/20 12:04 PM, Srikar Dronamraju wrote:
> >> I lost all the memory about it. :)
> >> Anyway, how about this?
> >>
> >> 1. make node_present_pages() safer
> >> static inline node_present_pages(nid)
> >> {
> >> if (!node_online(nid)) return 0;
> >> return (NODE_DATA(nid)->node_present_pages);
> >> }
> >>
> >
> > Yes this would help.
>
> Looks good, yeah.
>
> >> 2. make node_to_mem_node() safer for all cases
> >> In ppc arch's mem_topology_setup(void)
> >> for_each_present_cpu(cpu) {
> >>  numa_setup_cpu(cpu);
> >>  mem_node = node_to_mem_node(numa_mem_id());
> >>  if (!node_present_pages(mem_node)) {
> >>   _node_numa_mem_[numa_mem_id()] = first_online_node;
> >>  }
> >> }
> >>
> >
> > But here as discussed above, we miss the case of possible but not present nodes.
> > For such nodes, the above change may not update, resulting in they still
> > having 0. And node 0 can be only possible but not present.

Oops, I don't read full thread so miss the case.

> So is there other way to do the setup so that node_to_mem_node() returns an
> online+present node when called for any possible node?

Two changes seems to be sufficient.

1. initialize all node's _node_numa_mem_[] = first_online_node in
mem_topology_setup()
2. replace the node with online+present node for _node_to_mem_node_[]
in set_cpu_numa_mem().

 static inline void set_cpu_numa_mem(int cpu, int node)
 {
        per_cpu(_numa_mem_, cpu) = node;
+       if (!node_present_pages(node))
+               node = first_online_node;
        _node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif

With these two change, we can safely call node_to_mem_node() anywhere.

Thanks.
Michal Hocko March 16, 2020, 9:06 a.m. UTC | #14
On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
[...]
> with nid present in:
> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online

I would rather have a dummy pgdat for those. Have a look at 
$ git grep "NODE_DATA.*->" | wc -l
63

Who knows how many else we have there. I haven't looked more closely.
Besides that what is a real reason to not have pgdat ther and force all
users of a $random node from those that the platform considers possible
for special casing? Is that a memory overhead? Is that really a thing?

Somebody has suggested to tweak some of the low level routines to do the
special casing but I really have to say I do not like that. We shouldn't
use the first online node or anything like that. We should simply always
follow the topology presented by FW and of that we need to have a pgdat.
Vlastimil Babka March 17, 2020, 1:44 p.m. UTC | #15
On 3/16/20 10:06 AM, Michal Hocko wrote:
> On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
> [...]
>> with nid present in:
>> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> 
> I would rather have a dummy pgdat for those. Have a look at 
> $ git grep "NODE_DATA.*->" | wc -l
> 63
> 
> Who knows how many else we have there. I haven't looked more closely.
> Besides that what is a real reason to not have pgdat ther and force all
> users of a $random node from those that the platform considers possible
> for special casing? Is that a memory overhead? Is that really a thing?

I guess we can ignore memory overhead. I guess there only might be some concern
that for nodes that are initially offline, we will allocate the pgdat on a
different node, and after they are online, it will stay on a different node with
more access latency from local cpus. If we only allocate for online nodes, it
can always be local? But I guess it doesn't matter that much.

> Somebody has suggested to tweak some of the low level routines to do the
> special casing but I really have to say I do not like that. We shouldn't
> use the first online node or anything like that. We should simply always
> follow the topology presented by FW and of that we need to have a pgdat.
>
Michal Hocko March 17, 2020, 2:01 p.m. UTC | #16
On Tue 17-03-20 14:44:45, Vlastimil Babka wrote:
> On 3/16/20 10:06 AM, Michal Hocko wrote:
> > On Thu 12-03-20 17:41:58, Vlastimil Babka wrote:
> > [...]
> >> with nid present in:
> >> N_POSSIBLE - pgdat might not exist, node_to_mem_node() must return some online
> > 
> > I would rather have a dummy pgdat for those. Have a look at 
> > $ git grep "NODE_DATA.*->" | wc -l
> > 63
> > 
> > Who knows how many else we have there. I haven't looked more closely.
> > Besides that what is a real reason to not have pgdat ther and force all
> > users of a $random node from those that the platform considers possible
> > for special casing? Is that a memory overhead? Is that really a thing?
> 
> I guess we can ignore memory overhead. I guess there only might be some concern
> that for nodes that are initially offline, we will allocate the pgdat on a
> different node, and after they are online, it will stay on a different node with
> more access latency from local cpus. If we only allocate for online nodes, it
> can always be local? But I guess it doesn't matter that much.

This is not the case even now because of chicke&egg. You need a memory
to allocate from and that memory has to be managed somewhere per node
(pgdat). Keep in mind we do not have the bootmem allocator for the
hotplug. Have a look at hotadd_new_pgdat and when it is called. There
are some attempts to allocate memmap from the hotpluged memory but I am
not sure we can do the whole thing without pgdat in place. If we can
then can come up with some replace the pgdat magic. But still I am not
even sure this is something we really have to optimize for.
diff mbox series

Patch

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8a399db..54dcd49 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -931,8 +931,20 @@  void __init mem_topology_setup(void)
 
 	reset_numa_cpu_lookup_table();
 
-	for_each_present_cpu(cpu)
-		numa_setup_cpu(cpu);
+	for_each_possible_cpu(cpu) {
+		/*
+		 * Powerpc with CONFIG_NUMA always used to have a node 0,
+		 * even if it was memoryless or cpuless. For all cpus that
+		 * are possible but not present, cpu_to_node() would point
+		 * to node 0. To remove a cpuless, memoryless dummy node,
+		 * powerpc need to make sure all possible but not present
+		 * cpu_to_node are set to a proper node.
+		 */
+		if (cpu_present(cpu))
+			numa_setup_cpu(cpu);
+		else
+			set_cpu_numa_node(cpu, first_online_node);
+	}
 }
 
 void __init initmem_init(void)