[v2,1/4] mm: Check for node_online in node_present_pages

Message ID 20200318072810.9735-2-srikar@linux.vnet.ibm.com (mailing list archive)
State New, archived
Series Fix kmalloc_node on offline nodes

Commit Message

Srikar Dronamraju March 18, 2020, 7:28 a.m. UTC
Calling kmalloc_node() on a possible node which is not yet online can
lead to a panic. Currently node_present_pages() doesn't verify that the
node is online before accessing its pgdat. However, the pgdat struct may
not yet be allocated, resulting in a crash.

NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
LR [c0000000003d5b94] __slab_alloc+0x34/0x60
Call Trace:
[c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
[c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
[c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
[c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
[c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
[c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
[c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
[c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
[c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
[c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
[c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
[c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68

Fix this by verifying that the node is online before accessing the pgdat
structure. Do the same for node_spanned_pages() too.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mmzone.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Michal Hocko March 18, 2020, 10:02 a.m. UTC | #1
On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
> Calling kmalloc_node() on a possible node which is not yet online can
> lead to a panic. Currently node_present_pages() doesn't verify that the
> node is online before accessing its pgdat. However, the pgdat struct may
> not yet be allocated, resulting in a crash.
> 
> NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
> LR [c0000000003d5b94] __slab_alloc+0x34/0x60
> Call Trace:
> [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
> [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
> [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
> [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
> [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
> [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
> [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
> [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
> [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
> [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
> [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
> [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
> 
> Fix this by verifying the node is online before accessing the pgdat
> structure. Fix the same for node_spanned_pages() too.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> 
> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  include/linux/mmzone.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index f3f264826423..88078a3b95e5 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>  } pg_data_t;
>  
> -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
> +#define node_present_pages(nid)		\
> +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> +#define node_spanned_pages(nid)		\
> +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)

I believe this is a wrong approach. We really do not want to special
case all the places which require NODE_DATA. Can we please go and
allocate pgdat for all possible nodes?

The current state of memoryless-node hacks, with subtle bugs popping up
here and there, just proves that we should have done that from the very
beginning IMHO.

>  #ifdef CONFIG_FLAT_NODE_MEM_MAP
>  #define pgdat_page_nr(pgdat, pagenr)	((pgdat)->node_mem_map + (pagenr))
>  #else
> -- 
> 2.18.1
Srikar Dronamraju March 18, 2020, 11:02 a.m. UTC | #2
* Michal Hocko <mhocko@suse.com> [2020-03-18 11:02:56]:

> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
> > Calling kmalloc_node() on a possible node which is not yet online can
> > lead to a panic. Currently node_present_pages() doesn't verify that the
> > node is online before accessing its pgdat. However, the pgdat struct may
> > not yet be allocated, resulting in a crash.
> >
> > NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
> > LR [c0000000003d5b94] __slab_alloc+0x34/0x60
> > Call Trace:
> > [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
> > [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
> > [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
> > [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
> > [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
> > [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
> > [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
> > [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
> > [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
> > [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
> > [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
> > [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
> >
> > Fix this by verifying the node is online before accessing the pgdat
> > structure. Fix the same for node_spanned_pages() too.
> >
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: linux-mm@kvack.org
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Christopher Lameter <cl@linux.com>
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > Cc: Bharata B Rao <bharata@linux.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> >
> > Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> > Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  include/linux/mmzone.h | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index f3f264826423..88078a3b95e5 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -756,8 +756,10 @@ typedef struct pglist_data {
> >  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
> >  } pg_data_t;
> >
> > -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> > -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
> > +#define node_present_pages(nid)		\
> > +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> > +#define node_spanned_pages(nid)		\
> > +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>
> I believe this is a wrong approach. We really do not want to special
> case all the places which require NODE_DATA. Can we please go and
> allocate pgdat for all possible nodes?
>

I can do that, but the question I had was: should we make this change just
for powerpc, or should it cover other archs as well?

NODE_DATA initialization always seems to be in arch-specific code.

The other archs that are affected seem to be mips, sh and sparc. These
archs seem to assume that NODE_DATA has to be local only.

For example, on sparc, the allocate_node_data() function in
arch/sparc/mm/init_64.c does:

  NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data),
                                       SMP_CACHE_BYTES, nid);
  if (!NODE_DATA(nid)) {
          prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid);
          prom_halt();
  }

  NODE_DATA(nid)->node_id = nid;

So even if I make changes to allocate NODE_DATA from a fallback node, I may
not be able to test them.

Please let me know your thoughts.

> The current state of memoryless-node hacks, with subtle bugs popping up
> here and there, just proves that we should have done that from the very
> beginning IMHO.
>
> >  #ifdef CONFIG_FLAT_NODE_MEM_MAP
> >  #define pgdat_page_nr(pgdat, pagenr)	((pgdat)->node_mem_map + (pagenr))
> >  #else
> > --
> > 2.18.1
>
> --
> Michal Hocko
> SUSE Labs
>

--
Thanks and Regards
Srikar Dronamraju
Michal Hocko March 18, 2020, 11:14 a.m. UTC | #3
On Wed 18-03-20 16:32:15, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@suse.com> [2020-03-18 11:02:56]:
> 
> > On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
[...]
> > > -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> > > -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
> > > +#define node_present_pages(nid)		\
> > > +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
> > > +#define node_spanned_pages(nid)		\
> > > +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
> >
> > I believe this is a wrong approach. We really do not want to special
> > case all the places which require NODE_DATA. Can we please go and
> > allocate pgdat for all possible nodes?
> >
> 
> I can do that, but the question I had was: should we make this change just
> for powerpc, or should it cover other archs as well?

No, we shouldn't, really. If NODE_DATA is non-NULL for all possible
nodes then this shouldn't really be necessary, nor arch-specific.

> NODE_DATA initialization always seems to be in arch specific code.
> 
> The other archs that are affected seem to be mips, sh and sparc. These
> archs seem to assume that NODE_DATA has to be local only.

Which is all good and fine for nodes that hold some memory. If those
architectures support memoryless nodes at all, then I do not see any
problem with having a remote pgdat.

> For example, on sparc, the allocate_node_data() function in
> arch/sparc/mm/init_64.c does:
> 
>   NODE_DATA(nid) = memblock_alloc_node(sizeof(struct pglist_data),
>                                              SMP_CACHE_BYTES, nid);
>         if (!NODE_DATA(nid)) {
>                 prom_printf("Cannot allocate pglist_data for nid[%d]\n", nid);
>                 prom_halt();
>         }
> 
>         NODE_DATA(nid)->node_id = nid;

This code is not about memoryless nodes, is it? It looks more like
panic-style handling of an allocation failure, because there is not enough
memory to hold the pgdat. This also strongly suggests that this platform
doesn't really expect memoryless nodes in the early init path.

> So even if I make changes to allocate NODE_DATA from a fallback node, I may
> not be able to test them.

Please try to focus on the architectures you can test on. From the
existing reports I have seen, this looks mostly to be a problem for x86
and ppc.
Vlastimil Babka March 18, 2020, 11:53 a.m. UTC | #4
On 3/18/20 11:02 AM, Michal Hocko wrote:
> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
>> Calling kmalloc_node() on a possible node which is not yet online can
>> lead to a panic. Currently node_present_pages() doesn't verify that the
>> node is online before accessing its pgdat. However, the pgdat struct may
>> not yet be allocated, resulting in a crash.
>> 
>> NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
>> LR [c0000000003d5b94] __slab_alloc+0x34/0x60
>> Call Trace:
>> [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
>> [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
>> [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
>> [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
>> [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
>> [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
>> [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
>> [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
>> [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
>> [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
>> [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
>> [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
>> 
>> Fix this by verifying the node is online before accessing the pgdat
>> structure. Fix the same for node_spanned_pages() too.
>> 
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: linux-mm@kvack.org
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Christopher Lameter <cl@linux.com>
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Cc: Bharata B Rao <bharata@linux.ibm.com>
>> Cc: Nathan Lynch <nathanl@linux.ibm.com>
>> 
>> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> ---
>>  include/linux/mmzone.h | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index f3f264826423..88078a3b95e5 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>>  } pg_data_t;
>>  
>> -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>> -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
>> +#define node_present_pages(nid)		\
>> +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
>> +#define node_spanned_pages(nid)		\
>> +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
> 
> I believe this is a wrong approach. We really do not want to special
> case all the places which require NODE_DATA. Can we please go and
> allocate pgdat for all possible nodes?
> 
> The current state of memoryless-node hacks, with subtle bugs popping up
> here and there, just proves that we should have done that from the very
> beginning IMHO.

Yes. So here's an alternative proposal for fixing the current situation in SLUB,
before the long-term solution of having all possible nodes provide valid pgdat
with zonelists:

- fix SLUB with the hunk at the end of this mail - the point is to use NUMA_NO_NODE
  as fallback instead of node_to_mem_node()
- this removes all uses of node_to_mem_node (luckily it's just SLUB),
  kill it completely instead of trying to fix it up
- patch 1/4 is not needed with the fix
- perhaps many of your other patches are also not needed
- once we get the long-term solution, some of the !node_online() checks can be removed

----8<----
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..1d4f2d7a0080 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
 	struct page *page;
 	unsigned int order = oo_order(oo);
 
-	if (node == NUMA_NO_NODE)
+	if (node == NUMA_NO_NODE || !node_online(node))
 		page = alloc_pages(flags, order);
 	else
 		page = __alloc_pages_node(node, flags, order);
@@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();
-	else if (!node_present_pages(node))
-		searchnode = node_to_mem_node(node);
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
 	if (unlikely(!node_match(page, node))) {
-		int searchnode = node;
-
-		if (node != NUMA_NO_NODE && !node_present_pages(node))
-			searchnode = node_to_mem_node(node);
-
-		if (unlikely(!node_match(page, searchnode))) {
+		/*
+		 * node_match() false implies node != NUMA_NO_NODE
+		 * but if the node is not online and has no pages, just
+		 * ignore the constraint
+		 */
+		if ((!node_online(node) || !node_present_pages(node))) {
+			node = NUMA_NO_NODE;
+			goto redo;
+		} else {
 			stat(s, ALLOC_NODE_MISMATCH);
 			deactivate_slab(s, page, c->freelist, c);
 			goto new_slab;
Michal Hocko March 18, 2020, 12:52 p.m. UTC | #5
On Wed 18-03-20 12:53:32, Vlastimil Babka wrote:
[...]
> Yes. So here's an alternative proposal for fixing the current situation in SLUB,
> before the long-term solution of having all possible nodes provide valid pgdat
> with zonelists:
> 
> - fix SLUB with the hunk at the end of this mail - the point is to use NUMA_NO_NODE
>   as fallback instead of node_to_mem_node()

I am not familiar enough with SLUB to review that part.

> - this removes all uses of node_to_mem_node (luckily it's just SLUB),
>   kill it completely instead of trying to fix it up

Sounds like a good plan to me. The code shouldn't really care.
Michael Ellerman March 19, 2020, 12:32 a.m. UTC | #6
Vlastimil Babka <vbabka@suse.cz> writes:
> On 3/18/20 11:02 AM, Michal Hocko wrote:
>> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
>>> Calling kmalloc_node() on a possible node which is not yet online can
>>> lead to a panic. Currently node_present_pages() doesn't verify that the
>>> node is online before accessing its pgdat. However, the pgdat struct may
>>> not yet be allocated, resulting in a crash.
>>> 
>>> NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
>>> LR [c0000000003d5b94] __slab_alloc+0x34/0x60
>>> Call Trace:
>>> [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
>>> [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
>>> [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
>>> [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
>>> [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
>>> [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
>>> [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
>>> [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
>>> [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
>>> [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
>>> [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
>>> [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
>>> 
>>> Fix this by verifying the node is online before accessing the pgdat
>>> structure. Fix the same for node_spanned_pages() too.
>>> 
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: linux-mm@kvack.org
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Christopher Lameter <cl@linux.com>
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Cc: Bharata B Rao <bharata@linux.ibm.com>
>>> Cc: Nathan Lynch <nathanl@linux.ibm.com>
>>> 
>>> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> ---
>>>  include/linux/mmzone.h | 6 ++++--
>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index f3f264826423..88078a3b95e5 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>>>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>>>  } pg_data_t;
>>>  
>>> -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>>> -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
>>> +#define node_present_pages(nid)		\
>>> +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
>>> +#define node_spanned_pages(nid)		\
>>> +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>> 
>> I believe this is a wrong approach. We really do not want to special
>> case all the places which require NODE_DATA. Can we please go and
>> allocate pgdat for all possible nodes?
>> 
>> The current state of memoryless-node hacks, with subtle bugs popping up
>> here and there, just proves that we should have done that from the very
>> beginning IMHO.
>
> Yes. So here's an alternative proposal for fixing the current situation in SLUB,
> before the long-term solution of having all possible nodes provide valid pgdat
> with zonelists:
>
> - fix SLUB with the hunk at the end of this mail - the point is to use NUMA_NO_NODE
>   as fallback instead of node_to_mem_node()
> - this removes all uses of node_to_mem_node (luckily it's just SLUB),
>   kill it completely instead of trying to fix it up
> - patch 1/4 is not needed with the fix
>> - perhaps many of your other patches are also not needed
> - once we get the long-term solution, some of the !node_online() checks can be removed

Seems like a nice solution to me :)

> ----8<----
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..1d4f2d7a0080 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
>  	struct page *page;
>  	unsigned int order = oo_order(oo);
>  
> -	if (node == NUMA_NO_NODE)
> +	if (node == NUMA_NO_NODE || !node_online(node))

Why don't we need the node_present_pages() check here?

>  		page = alloc_pages(flags, order);
>  	else
>  		page = __alloc_pages_node(node, flags, order);
> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  
>  	if (node == NUMA_NO_NODE)
>  		searchnode = numa_mem_id();
> -	else if (!node_present_pages(node))
> -		searchnode = node_to_mem_node(node);
>  
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  redo:
>  
>  	if (unlikely(!node_match(page, node))) {
> -		int searchnode = node;
> -
> -		if (node != NUMA_NO_NODE && !node_present_pages(node))
> -			searchnode = node_to_mem_node(node);
> -
> -		if (unlikely(!node_match(page, searchnode))) {
> +		/*
> +		 * node_match() false implies node != NUMA_NO_NODE
> +		 * but if the node is not online and has no pages, just
                                                 ^
                                                 this should be 'or' ?

> +		 * ignore the constraint
> +		 */
> +		if ((!node_online(node) || !node_present_pages(node))) {
> +			node = NUMA_NO_NODE;
> +			goto redo;
> +		} else {
>  			stat(s, ALLOC_NODE_MISMATCH);
>  			deactivate_slab(s, page, c->freelist, c);
>  			goto new_slab;

cheers
Michael Ellerman March 19, 2020, 1:11 a.m. UTC | #7
Michael Ellerman <mpe@ellerman.id.au> writes:
> Vlastimil Babka <vbabka@suse.cz> writes:
>> On 3/18/20 11:02 AM, Michal Hocko wrote:
>>> On Wed 18-03-20 12:58:07, Srikar Dronamraju wrote:
>>>> Calling kmalloc_node() on a possible node which is not yet online can
>>>> lead to a panic. Currently node_present_pages() doesn't verify that the
>>>> node is online before accessing its pgdat. However, the pgdat struct may
>>>> not yet be allocated, resulting in a crash.
>>>> 
>>>> NIP [c0000000003d55f4] ___slab_alloc+0x1f4/0x760
>>>> LR [c0000000003d5b94] __slab_alloc+0x34/0x60
>>>> Call Trace:
>>>> [c0000008b3783960] [c0000000003d5734] ___slab_alloc+0x334/0x760 (unreliable)
>>>> [c0000008b3783a40] [c0000000003d5b94] __slab_alloc+0x34/0x60
>>>> [c0000008b3783a70] [c0000000003d6fa0] __kmalloc_node+0x110/0x490
>>>> [c0000008b3783af0] [c0000000003443d8] kvmalloc_node+0x58/0x110
>>>> [c0000008b3783b30] [c0000000003fee38] mem_cgroup_css_online+0x108/0x270
>>>> [c0000008b3783b90] [c000000000235aa8] online_css+0x48/0xd0
>>>> [c0000008b3783bc0] [c00000000023eaec] cgroup_apply_control_enable+0x2ec/0x4d0
>>>> [c0000008b3783ca0] [c000000000242318] cgroup_mkdir+0x228/0x5f0
>>>> [c0000008b3783d10] [c00000000051e170] kernfs_iop_mkdir+0x90/0xf0
>>>> [c0000008b3783d50] [c00000000043dc00] vfs_mkdir+0x110/0x230
>>>> [c0000008b3783da0] [c000000000441c90] do_mkdirat+0xb0/0x1a0
>>>> [c0000008b3783e20] [c00000000000b278] system_call+0x5c/0x68
>>>> 
>>>> Fix this by verifying the node is online before accessing the pgdat
>>>> structure. Fix the same for node_spanned_pages() too.
>>>> 
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: linux-mm@kvack.org
>>>> Cc: Mel Gorman <mgorman@suse.de>
>>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>>> Cc: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>> Cc: Christopher Lameter <cl@linux.com>
>>>> Cc: linuxppc-dev@lists.ozlabs.org
>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>> Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
>>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>>> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> Cc: Bharata B Rao <bharata@linux.ibm.com>
>>>> Cc: Nathan Lynch <nathanl@linux.ibm.com>
>>>> 
>>>> Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>>> Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
>>>> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>>> ---
>>>>  include/linux/mmzone.h | 6 ++++--
>>>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>> index f3f264826423..88078a3b95e5 100644
>>>> --- a/include/linux/mmzone.h
>>>> +++ b/include/linux/mmzone.h
>>>> @@ -756,8 +756,10 @@ typedef struct pglist_data {
>>>>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
>>>>  } pg_data_t;
>>>>  
>>>> -#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
>>>> -#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
>>>> +#define node_present_pages(nid)		\
>>>> +	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
>>>> +#define node_spanned_pages(nid)		\
>>>> +	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
>>> 
>>> I believe this is a wrong approach. We really do not want to special
>>> case all the places which require NODE_DATA. Can we please go and
>>> allocate pgdat for all possible nodes?
>>> 
>>> The current state of memoryless-node hacks, with subtle bugs popping up
>>> here and there, just proves that we should have done that from the very
>>> beginning IMHO.
>>
>> Yes. So here's an alternative proposal for fixing the current situation in SLUB,
>> before the long-term solution of having all possible nodes provide valid pgdat
>> with zonelists:
>>
>> - fix SLUB with the hunk at the end of this mail - the point is to use NUMA_NO_NODE
>>   as fallback instead of node_to_mem_node()
>> - this removes all uses of node_to_mem_node (luckily it's just SLUB),
>>   kill it completely instead of trying to fix it up
>> - patch 1/4 is not needed with the fix
>> - perhaps many of your other patches are also not needed
>> - once we get the long-term solution, some of the !node_online() checks can be removed
>
> Seems like a nice solution to me :)
>
>> ----8<----
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..1d4f2d7a0080 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
>>  	struct page *page;
>>  	unsigned int order = oo_order(oo);
>>  
>> -	if (node == NUMA_NO_NODE)
>> +	if (node == NUMA_NO_NODE || !node_online(node))
>
> Why don't we need the node_present_pages() check here?
>
>>  		page = alloc_pages(flags, order);
>>  	else
>>  		page = __alloc_pages_node(node, flags, order);
>> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>>  
>>  	if (node == NUMA_NO_NODE)
>>  		searchnode = numa_mem_id();
>> -	else if (!node_present_pages(node))
>> -		searchnode = node_to_mem_node(node);
>>  
>>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>>  	if (object || node != NUMA_NO_NODE)
>> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>  redo:
>>  
>>  	if (unlikely(!node_match(page, node))) {
>> -		int searchnode = node;
>> -
>> -		if (node != NUMA_NO_NODE && !node_present_pages(node))
>> -			searchnode = node_to_mem_node(node);
>> -
>> -		if (unlikely(!node_match(page, searchnode))) {
>> +		/*
>> +		 * node_match() false implies node != NUMA_NO_NODE
>> +		 * but if the node is not online and has no pages, just
>                                                  ^
>                                                  this should be 'or' ?

Sorry, I see you've already fixed this in the version you posted.

cheers
Vlastimil Babka March 19, 2020, 9:38 a.m. UTC | #8
On 3/19/20 1:32 AM, Michael Ellerman wrote:
> Seems like a nice solution to me

Thanks :)

>> ----8<----
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..1d4f2d7a0080 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s,
>>  	struct page *page;
>>  	unsigned int order = oo_order(oo);
>>  
>> -	if (node == NUMA_NO_NODE)
>> +	if (node == NUMA_NO_NODE || !node_online(node))
> 
> Why don't we need the node_present_pages() check here?

The page allocator is fine with a node without present pages, as long as
there's a zonelist, which online nodes must have (ideally all possible nodes
should have one, and then we can remove this check).

SLUB, on the other hand, doesn't allocate cache per-cpu structures for nodes
without present pages (understandably); that's why the other place also
includes the node_present_pages() check.

Thanks

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f3f264826423..88078a3b95e5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -756,8 +756,10 @@  typedef struct pglist_data {
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
 } pg_data_t;
 
-#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
-#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid)		\
+	(node_online(nid) ? NODE_DATA(nid)->node_present_pages : 0)
+#define node_spanned_pages(nid)		\
+	(node_online(nid) ? NODE_DATA(nid)->node_spanned_pages : 0)
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 #define pgdat_page_nr(pgdat, pagenr)	((pgdat)->node_mem_map + (pagenr))
 #else