Message ID | 20200317131753.4074-3-srikar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Fix kmalloc_node on offline nodes |
On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
> Currently while allocating a slab for a offline node, we use its
> associated node_numa_mem to search for a partial slab. If we don't find
> a partial slab, we try allocating a slab from the offline node using
> __alloc_pages_node. However this is bound to fail.
>
> NIP [c00000000039a300] __alloc_pages_nodemask+0x130/0x3b0
> LR [c00000000039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
> Call Trace:
> [c0000008b36837f0] [c00000000039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
> [c0000008b3683870] [c0000000003d1ff8] new_slab+0x128/0xcf0
> [c0000008b3683950] [c0000000003d6060] ___slab_alloc+0x410/0x820
> [c0000008b3683a40] [c0000000003d64a4] __slab_alloc+0x34/0x60
> [c0000008b3683a70] [c0000000003d78b0] __kmalloc_node+0x110/0x490
> [c0000008b3683af0] [c000000000343a08] kvmalloc_node+0x58/0x110
> [c0000008b3683b30] [c0000000003ffd44] mem_cgroup_css_online+0x104/0x270
> [c0000008b3683b90] [c000000000234e08] online_css+0x48/0xd0
> [c0000008b3683bc0] [c00000000023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
> [c0000008b3683ca0] [c0000000002416f8] cgroup_mkdir+0x228/0x5f0
> [c0000008b3683d10] [c000000000520360] kernfs_iop_mkdir+0x90/0xf0
> [c0000008b3683d50] [c00000000043e400] vfs_mkdir+0x110/0x230
> [c0000008b3683da0] [c000000000441ee0] do_mkdirat+0xb0/0x1a0
> [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
>
> Mitigate this by allocating the new slab from the node_numa_mem.

Are you sure this is really needed and the other 3 patches are not enough for
the current SLUB code to work as needed? It seems you are changing the semantics
here...

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>             struct kmem_cache_cpu *c)
> {
>     void *object;
> -   int searchnode = node;
>
> -   if (node == NUMA_NO_NODE)
> -       searchnode = numa_mem_id();
> -   else if (!node_present_pages(node))
> -       searchnode = node_to_mem_node(node);
> -
> -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> +   object = get_partial_node(s, get_node(s, node), c, flags);
>     if (object || node != NUMA_NO_NODE)
>         return object;
>
>     return get_any_partial(s, flags, c);

I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
hunk below), thus the get_any_partial() call becomes dead code?

> @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
>
>     WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
>
> +   if (node == NUMA_NO_NODE)
> +       node = numa_mem_id();
> +   else if (!node_present_pages(node))
> +       node = node_to_mem_node(node);
> +
>     freelist = get_partial(s, flags, node, c);
>
>     if (freelist)
> @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> redo:
>
>     if (unlikely(!node_match(page, node))) {
> -       int searchnode = node;
> -
>         if (node != NUMA_NO_NODE && !node_present_pages(node))
> -           searchnode = node_to_mem_node(node);
> +           node = node_to_mem_node(node);
>
> -       if (unlikely(!node_match(page, searchnode))) {
> +       if (unlikely(!node_match(page, node))) {
>             stat(s, ALLOC_NODE_MISMATCH);
>             deactivate_slab(s, page, c->freelist, c);
>             goto new_slab;
>

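A condensed sketch of the flow in question may help here (simplified from the quoted hunks, not the verbatim mm/slub.c source; the real new_slab_objects() takes a kmem_cache_cpu ** and falls back to new_slab(), omitted below). Once new_slab_objects() rewrites the node before calling get_partial(), get_partial() never sees NUMA_NO_NODE, so its get_any_partial() fallback cannot be reached:

    /* Simplified sketch only, not actual mm/slub.c code. */
    static void *new_slab_objects(struct kmem_cache *s, gfp_t flags, int node,
                                  struct kmem_cache_cpu *c)
    {
        /* The hunk above: node is made a real node id before get_partial(). */
        if (node == NUMA_NO_NODE)
            node = numa_mem_id();
        else if (!node_present_pages(node))
            node = node_to_mem_node(node);

        return get_partial(s, flags, node, c);
    }

    static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
                             struct kmem_cache_cpu *c)
    {
        void *object = get_partial_node(s, get_node(s, node), c, flags);

        /*
         * The caller already replaced NUMA_NO_NODE with numa_mem_id(), so
         * "node != NUMA_NO_NODE" is always true and we return here...
         */
        if (object || node != NUMA_NO_NODE)
            return object;

        /* ...which means this any-node fallback is never reached. */
        return get_any_partial(s, flags, c);
    }
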
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 14:34:25]:

> On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
> > Currently while allocating a slab for a offline node, we use its
> > associated node_numa_mem to search for a partial slab. If we don't find
> > a partial slab, we try allocating a slab from the offline node using
> > __alloc_pages_node. However this is bound to fail.
> >
> > NIP [c00000000039a300] __alloc_pages_nodemask+0x130/0x3b0
> > LR [c00000000039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
> > Call Trace:
> > [c0000008b36837f0] [c00000000039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
> > [c0000008b3683870] [c0000000003d1ff8] new_slab+0x128/0xcf0
> > [c0000008b3683950] [c0000000003d6060] ___slab_alloc+0x410/0x820
> > [c0000008b3683a40] [c0000000003d64a4] __slab_alloc+0x34/0x60
> > [c0000008b3683a70] [c0000000003d78b0] __kmalloc_node+0x110/0x490
> > [c0000008b3683af0] [c000000000343a08] kvmalloc_node+0x58/0x110
> > [c0000008b3683b30] [c0000000003ffd44] mem_cgroup_css_online+0x104/0x270
> > [c0000008b3683b90] [c000000000234e08] online_css+0x48/0xd0
> > [c0000008b3683bc0] [c00000000023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
> > [c0000008b3683ca0] [c0000000002416f8] cgroup_mkdir+0x228/0x5f0
> > [c0000008b3683d10] [c000000000520360] kernfs_iop_mkdir+0x90/0xf0
> > [c0000008b3683d50] [c00000000043e400] vfs_mkdir+0x110/0x230
> > [c0000008b3683da0] [c000000000441ee0] do_mkdirat+0xb0/0x1a0
> > [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
> >
> > Mitigate this by allocating the new slab from the node_numa_mem.
>
> Are you sure this is really needed and the other 3 patches are not enough for
> the current SLUB code to work as needed? It seems you are changing the semantics
> here...
>

The other 3 patches are not enough because we don't carry the searchnode
when the actual alloc_pages_node gets called.

With only the 3 patches, we see the above Panic, its signature is slightly
different from what Sachin first reported and which I have carried in 1st
patch.

> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >             struct kmem_cache_cpu *c)
> > {
> >     void *object;
> > -   int searchnode = node;
> >
> > -   if (node == NUMA_NO_NODE)
> > -       searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -       searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> >     if (object || node != NUMA_NO_NODE)
> >         return object;
> >
> >     return get_any_partial(s, flags, c);
>
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?
>
> > @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
> >
> >     WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
> >
> > +   if (node == NUMA_NO_NODE)
> > +       node = numa_mem_id();
> > +   else if (!node_present_pages(node))
> > +       node = node_to_mem_node(node);
> > +
> >     freelist = get_partial(s, flags, node, c);
> >
> >     if (freelist)
> > @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > redo:
> >
> >     if (unlikely(!node_match(page, node))) {
> > -       int searchnode = node;
> > -
> >         if (node != NUMA_NO_NODE && !node_present_pages(node))
> > -           searchnode = node_to_mem_node(node);
> > +           node = node_to_mem_node(node);
> >
> > -       if (unlikely(!node_match(page, searchnode))) {
> > +       if (unlikely(!node_match(page, node))) {
> >             stat(s, ALLOC_NODE_MISMATCH);
> >             deactivate_slab(s, page, c->freelist, c);
> >             goto new_slab;
> >
>

On 3/17/20 2:45 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 14:34:25]:
>
>> On 3/17/20 2:17 PM, Srikar Dronamraju wrote:
>> > Currently while allocating a slab for a offline node, we use its
>> > associated node_numa_mem to search for a partial slab. If we don't find
>> > a partial slab, we try allocating a slab from the offline node using
>> > __alloc_pages_node. However this is bound to fail.
>> >
>> > NIP [c00000000039a300] __alloc_pages_nodemask+0x130/0x3b0
>> > LR [c00000000039a3c4] __alloc_pages_nodemask+0x1f4/0x3b0
>> > Call Trace:
>> > [c0000008b36837f0] [c00000000039a3b4] __alloc_pages_nodemask+0x1e4/0x3b0 (unreliable)
>> > [c0000008b3683870] [c0000000003d1ff8] new_slab+0x128/0xcf0
>> > [c0000008b3683950] [c0000000003d6060] ___slab_alloc+0x410/0x820
>> > [c0000008b3683a40] [c0000000003d64a4] __slab_alloc+0x34/0x60
>> > [c0000008b3683a70] [c0000000003d78b0] __kmalloc_node+0x110/0x490
>> > [c0000008b3683af0] [c000000000343a08] kvmalloc_node+0x58/0x110
>> > [c0000008b3683b30] [c0000000003ffd44] mem_cgroup_css_online+0x104/0x270
>> > [c0000008b3683b90] [c000000000234e08] online_css+0x48/0xd0
>> > [c0000008b3683bc0] [c00000000023dedc] cgroup_apply_control_enable+0x2ec/0x4d0
>> > [c0000008b3683ca0] [c0000000002416f8] cgroup_mkdir+0x228/0x5f0
>> > [c0000008b3683d10] [c000000000520360] kernfs_iop_mkdir+0x90/0xf0
>> > [c0000008b3683d50] [c00000000043e400] vfs_mkdir+0x110/0x230
>> > [c0000008b3683da0] [c000000000441ee0] do_mkdirat+0xb0/0x1a0
>> > [c0000008b3683e20] [c00000000000b278] system_call+0x5c/0x68
>> >
>> > Mitigate this by allocating the new slab from the node_numa_mem.
>>
>> Are you sure this is really needed and the other 3 patches are not enough for
>> the current SLUB code to work as needed? It seems you are changing the semantics
>> here...
>>
>
> The other 3 patches are not enough because we don't carry the searchnode
> when the actual alloc_pages_node gets called.
>
> With only the 3 patches, we see the above Panic, its signature is slightly
> different from what Sachin first reported and which I have carried in 1st
> patch.

Ah, I see. So that's the missing pgdat after your series [1] right?

That sounds like an argument for Michal's suggestions that pgdats exist and have
correctly populated zonelists for all possible nodes. node_to_mem_node() could be
just a shortcut for the first zone's node in the zonelist, so that fallback
follows the topology.

[1] https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e

>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>> >             struct kmem_cache_cpu *c)
>> > {
>> >     void *object;
>> > -   int searchnode = node;
>> >
>> > -   if (node == NUMA_NO_NODE)
>> > -       searchnode = numa_mem_id();
>> > -   else if (!node_present_pages(node))
>> > -       searchnode = node_to_mem_node(node);
>> > -
>> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> > +   object = get_partial_node(s, get_node(s, node), c, flags);
>> >     if (object || node != NUMA_NO_NODE)
>> >         return object;
>> >
>> >     return get_any_partial(s, flags, c);
>>
>> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
>> hunk below), thus the get_any_partial() call becomes dead code?
>>
>> > @@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
>> >
>> >     WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
>> >
>> > +   if (node == NUMA_NO_NODE)
>> > +       node = numa_mem_id();
>> > +   else if (!node_present_pages(node))
>> > +       node = node_to_mem_node(node);
>> > +
>> >     freelist = get_partial(s, flags, node, c);
>> >
>> >     if (freelist)
>> > @@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>> > redo:
>> >
>> >     if (unlikely(!node_match(page, node))) {
>> > -       int searchnode = node;
>> > -
>> >         if (node != NUMA_NO_NODE && !node_present_pages(node))
>> > -           searchnode = node_to_mem_node(node);
>> > +           node = node_to_mem_node(node);
>> >
>> > -       if (unlikely(!node_match(page, searchnode))) {
>> > +       if (unlikely(!node_match(page, node))) {
>> >             stat(s, ALLOC_NODE_MISMATCH);
>> >             deactivate_slab(s, page, c->freelist, c);
>> >             goto new_slab;
>> >
>>
>

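One concrete reading of the "first zone's node in the zonelist" idea from the message above is the idiom mempolicy_slab_node() already uses. A rough sketch of that reading (illustrative only; the helper name is made up, and it assumes the offline node's pgdat and zonelists actually exist, which is exactly the open question in this thread):

    /* Illustrative only; assumes NODE_DATA(node) and its zonelists are
     * populated even for offline/memoryless nodes. */
    static int zonelist_mem_node(int node)
    {
        struct zonelist *zonelist = node_zonelist(node, GFP_KERNEL);
        struct zoneref *z;

        /* First zone the page allocator would fall back to for this node. */
        z = first_zones_zonelist(zonelist, gfp_zone(GFP_KERNEL), NULL);

        return z->zone ? zone_to_nid(z->zone) : node;
    }
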
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 14:53:26]:

> >> >
> >> > Mitigate this by allocating the new slab from the node_numa_mem.
> >>
> >> Are you sure this is really needed and the other 3 patches are not enough for
> >> the current SLUB code to work as needed? It seems you are changing the semantics
> >> here...
> >>
> >
> > The other 3 patches are not enough because we don't carry the searchnode
> > when the actual alloc_pages_node gets called.
> >
> > With only the 3 patches, we see the above Panic, its signature is slightly
> > different from what Sachin first reported and which I have carried in 1st
> > patch.
>
> Ah, I see. So that's the missing pgdat after your series [1] right?

Yes the pgdat would be missing after my cpuless, memoryless node patchset.
However..

>
> That sounds like an argument for Michal's suggestions that pgdats exist and have
> correctly populated zonelists for all possible nodes.

Only the first patch in this series would be affected by pgdat existing or
not. Even if the pgdat existed, the NODE_DATA[nid]->node_present_pages
would be 0. Right? So it would look at node_to_mem_node(). And since node 0 is
cpuless it would return 0. If we pass this node 0 (which is memoryless/cpuless)
to alloc_pages_node. Please note I am only setting node_numa_mem only
for offline nodes. However we could change this to set for all offline and
memoryless nodes.

> node_to_mem_node() could be just a shortcut for the first zone's node in the
> zonelist, so that fallback follows the topology.
>
> [1] https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
>

On 3/17/20 3:51 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 14:53:26]:
>
>> >> >
>> >> > Mitigate this by allocating the new slab from the node_numa_mem.
>> >>
>> >> Are you sure this is really needed and the other 3 patches are not enough for
>> >> the current SLUB code to work as needed? It seems you are changing the semantics
>> >> here...
>> >>
>> >
>> > The other 3 patches are not enough because we don't carry the searchnode
>> > when the actual alloc_pages_node gets called.
>> >
>> > With only the 3 patches, we see the above Panic, its signature is slightly
>> > different from what Sachin first reported and which I have carried in 1st
>> > patch.
>>
>> Ah, I see. So that's the missing pgdat after your series [1] right?
>
> Yes the pgdat would be missing after my cpuless, memoryless node patchset.
> However..
>>
>> That sounds like an argument for Michal's suggestions that pgdats exist and have
>> correctly populated zonelists for all possible nodes.
>
> Only the first patch in this series would be affected by pgdat existing or
> not. Even if the pgdat existed, the NODE_DATA[nid]->node_present_pages
> would be 0. Right? So it would look at node_to_mem_node(). And since node 0 is
> cpuless it would return 0.

I thought the point was to return 1 for node 0.

> If we pass this node 0 (which is memoryless/cpuless) to
> alloc_pages_node. Please note I am only setting node_numa_mem only
> for offline nodes. However we could change this to set for all offline and
> memoryless nodes.

That would indeed make sense.

But I guess that alloc_pages would still crash as the result of
node_to_mem_node() is not passed down to alloc_pages() without this patch. In
__alloc_pages_node() we currently have "The node must be valid and online" so
offline nodes don't have zonelists. Either they get them, or we indeed need
something like this patch. But in order to not make get_any_partial() dead code,
the final replacement of invalid node with a valid one should be done in
alloc_slab_page() I guess?

>> node_to_mem_node() could be just a shortcut for the first zone's node in the
>> zonelist, so that fallback follows the topology.
>>
>> [1] https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#m76e5b4c4084380b1d4b193d5aa0359b987f2290e
>>
>

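A rough sketch of that last suggestion (illustrative only; the real alloc_slab_page() in mm/slub.c does more, and the eventual v2 patch may look different): leave NUMA_NO_NODE handling to get_partial()/get_any_partial(), and only translate an offline or memoryless node at the point where pages are actually requested.

    /* Sketch of the idea being discussed, not the actual v2 patch. */
    static inline struct page *alloc_slab_page(struct kmem_cache *s,
            gfp_t flags, int node, struct kmem_cache_order_objects oo)
    {
        unsigned int order = oo_order(oo);

        /* Hypothetical last-resort fixup: never hand an offline or
         * memoryless node to the page allocator. */
        if (node != NUMA_NO_NODE && !node_present_pages(node))
            node = node_to_mem_node(node);

        if (node == NUMA_NO_NODE)
            return alloc_pages(flags, order);

        return __alloc_pages_node(node, flags, order);
    }

With the fixup placed here, get_partial() still sees the caller's NUMA_NO_NODE and the get_any_partial() fallback stays reachable, which addresses the dead-code concern raised earlier in the thread.
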
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 14:34:25]:

>
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >             struct kmem_cache_cpu *c)
> > {
> >     void *object;
> > -   int searchnode = node;
> >
> > -   if (node == NUMA_NO_NODE)
> > -       searchnode = numa_mem_id();
> > -   else if (!node_present_pages(node))
> > -       searchnode = node_to_mem_node(node);
> > -
> > -   object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > +   object = get_partial_node(s, get_node(s, node), c, flags);
> >     if (object || node != NUMA_NO_NODE)
> >         return object;
> >
> >     return get_any_partial(s, flags, c);
>
> I.e. here in this if(), now node will never equal NUMA_NO_NODE (thanks to the
> hunk below), thus the get_any_partial() call becomes dead code?

Very true. Would it be okay if we remove the node != NUMA_NO_NODE check?

    if (object || node != NUMA_NO_NODE)
        return object;

will now become

    if (object)
        return object;
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:29:21]:

> > If we pass this node 0 (which is memoryless/cpuless) to
> > alloc_pages_node. Please note I am only setting node_numa_mem only
> > for offline nodes. However we could change this to set for all offline and
> > memoryless nodes.
>
> That would indeed make sense.
>
> But I guess that alloc_pages would still crash as the result of
> node_to_mem_node() is not passed down to alloc_pages() without this patch. In
> __alloc_pages_node() we currently have "The node must be valid and online" so
> offline nodes don't have zonelists. Either they get them, or we indeed need
> something like this patch. But in order to not make get_any_partial() dead code,
> the final replacement of invalid node with a valid one should be done in
> alloc_slab_page() I guess?
>

I am posting v2 with this change.

> >> node_to_mem_node() could be just a shortcut for the first zone's node in the
> >> zonelist, so that fallback follows the topology.

diff --git a/mm/slub.c b/mm/slub.c
index 1c55bf7892bf..fdf7f38f96e6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1970,14 +1970,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
             struct kmem_cache_cpu *c)
 {
     void *object;
-    int searchnode = node;

-    if (node == NUMA_NO_NODE)
-        searchnode = numa_mem_id();
-    else if (!node_present_pages(node))
-        searchnode = node_to_mem_node(node);
-
-    object = get_partial_node(s, get_node(s, searchnode), c, flags);
+    object = get_partial_node(s, get_node(s, node), c, flags);
     if (object || node != NUMA_NO_NODE)
         return object;

@@ -2470,6 +2464,11 @@ static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,

     WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));

+    if (node == NUMA_NO_NODE)
+        node = numa_mem_id();
+    else if (!node_present_pages(node))
+        node = node_to_mem_node(node);
+
     freelist = get_partial(s, flags, node, c);

     if (freelist)
@@ -2569,12 +2568,10 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:

     if (unlikely(!node_match(page, node))) {
-        int searchnode = node;
-
         if (node != NUMA_NO_NODE && !node_present_pages(node))
-            searchnode = node_to_mem_node(node);
+            node = node_to_mem_node(node);

-        if (unlikely(!node_match(page, searchnode))) {
+        if (unlikely(!node_match(page, node))) {
             stat(s, ALLOC_NODE_MISMATCH);
             deactivate_slab(s, page, c->freelist, c);
             goto new_slab;
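
To make the fallback rules being debated easy to play with, here is a small self-contained user-space model. The node layout, the numa_mem table and all helper names below are invented for illustration; none of this is kernel code, it only mimics the numa_mem_id()/node_to_mem_node() decision the patch applies before the slab allocation.

    /* User-space model of the node-fallback decision discussed in this thread. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NUMA_NO_NODE (-1)
    #define NR_NODES 3

    /* Mock topology: node 0 is possible but offline/memoryless (as on the
     * reported PowerPC LPAR), node 1 has memory, node 2 is memoryless. */
    static const bool node_has_memory[NR_NODES] = { false, true, false };
    /* Mock of the per-node node_numa_mem fallback map. */
    static const int node_numa_mem[NR_NODES] = { 1, 1, 1 };

    static int pick_alloc_node(int requested, int local_mem_node)
    {
        if (requested == NUMA_NO_NODE)
            return local_mem_node;            /* numa_mem_id() analogue */
        if (!node_has_memory[requested])
            return node_numa_mem[requested];  /* node_to_mem_node() analogue */
        return requested;
    }

    int main(void)
    {
        int local_mem_node = 1; /* pretend the running CPU's nearest memory is node 1 */
        int requests[] = { NUMA_NO_NODE, 0, 1, 2 };

        for (unsigned int i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
            int req = requests[i];

            printf("request node %2d -> allocate from node %d\n",
                   req, pick_alloc_node(req, local_mem_node));
        }
        return 0;
    }

Running the model shows every request ending up on a node that actually has memory, which is the behaviour the patch is after; what stays contentious in the thread is where in SLUB that translation should live.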