
[v1,1/1] mm/vmalloc.c: Fix percpu free VM area search criteria

Message ID 20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com (mailing list archive)
State New, archived
Series [v1,1/1] mm/vmalloc.c: Fix percpu free VM area search criteria

Commit Message

Kuppuswamy Sathyanarayanan July 29, 2019, 11:21 p.m. UTC
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Recent changes to the vmalloc code in commit 68ad4a330433
("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
cause spurious percpu allocation failures. These, in turn, can result in
panic()s in the slub code. One such panic was reported by Dave Hansen
at https://lkml.org/lkml/2019/6/19/939. Another related panic observed
is:

 RIP: 0033:0x7f46f7441b9b
 Call Trace:
  dump_stack+0x61/0x80
  pcpu_alloc.cold.30+0x22/0x4f
  mem_cgroup_css_alloc+0x110/0x650
  cgroup_apply_control_enable+0x133/0x330
  cgroup_mkdir+0x41b/0x500
  kernfs_iop_mkdir+0x5a/0x90
  vfs_mkdir+0x102/0x1b0
  do_mkdirat+0x7d/0xf0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The VMALLOC memory manager divides the entire VMALLOC space
(VMALLOC_START to VMALLOC_END) into multiple VM areas (struct
vmap_area), and it mainly uses two lists (vmap_area_list &
free_vmap_area_list) to track the used and free VM areas in VMALLOC
space. The pcpu_get_vm_areas(offsets[], sizes[], nr_vms, align)
function is used to allocate congruent VM areas for the percpu memory
allocator. In order not to conflict with other VMALLOC users,
pcpu_get_vm_areas() allocates VM areas near the end of the VMALLOC
space. So the search for a free vm_area matching the given requirements
starts near VMALLOC_END and moves upwards towards VMALLOC_START.

Prior to commit 68ad4a330433, the search for a free vm_area in
pcpu_get_vm_areas() involved the following two main steps.

Step 1:
    Find an aligned "base" address near VMALLOC_END.
    va = free vm area near VMALLOC_END
Step 2:
    Loop through number of requested vm_areas and check,
        Step 2.1:
           if (base < VMALLOC_START)
              1. fail with error
        Step 2.2:
           // end is offsets[area] + sizes[area]
           if (base + end > va->vm_end)
               1. Move the base downwards and repeat Step 2
        Step 2.3:
           if (base + start < va->vm_start)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

But commit 68ad4a330433 removed Step 2.2 and modified Step 2.3 as below:

        Step 2.3:
           if (base + start < va->vm_start || base + end > va->vm_end)
              1. Move to previous free vm_area node, find aligned
                 base address and repeat Step 2

The above change is the root cause of the spurious percpu memory
allocation failures. For example, consider a case where a relatively
large vm_area (~30 TB) was ignored during the free vm_area search
because it did not satisfy the base + end <= va->va_end boundary check.
Ignoring such large free vm_areas can leave no suitable free vm_area
within the VMALLOC_START to VMALLOC_END boundary, which in turn leads to
allocation failures.

So modify the search algorithm to include Step 2.2.

Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 mm/vmalloc.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Uladzislau Rezki July 30, 2019, 8:46 p.m. UTC | #1
On Mon, Jul 29, 2019 at 04:21:39PM -0700, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4fa8d84599b0..1faa45a38c08 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3269,10 +3269,20 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  		if (va == NULL)
>  			goto overflow;
>  
> +		/*
> +		 * If required width exeeds current VA block, move
> +		 * base downwards and then recheck.
> +		 */
> +		if (base + end > va->va_end) {
> +			base = pvm_determine_end_from_reverse(&va, align) - end;
> +			term_area = area;
> +			continue;
> +		}
> +
>  		/*
>  		 * If this VA does not fit, move base downwards and recheck.
>  		 */
> -		if (base + start < va->va_start || base + end > va->va_end) {
> +		if (base + start < va->va_start) {
>  			va = node_to_va(rb_prev(&va->rb_node));
>  			base = pvm_determine_end_from_reverse(&va, align) - end;
>  			term_area = area;
> -- 
> 2.21.0
> 
I guess it is NUMA related issue, i mean when we have several
areas/sizes/offsets. Is that correct?

Thank you!

--
Vlad Rezki
Dave Hansen July 30, 2019, 8:54 p.m. UTC | #2
On 7/30/19 1:46 PM, Uladzislau Rezki wrote:
>> +		/*
>> +		 * If required width exeeds current VA block, move
>> +		 * base downwards and then recheck.
>> +		 */
>> +		if (base + end > va->va_end) {
>> +			base = pvm_determine_end_from_reverse(&va, align) - end;
>> +			term_area = area;
>> +			continue;
>> +		}
>> +
>>  		/*
>>  		 * If this VA does not fit, move base downwards and recheck.
>>  		 */
>> -		if (base + start < va->va_start || base + end > va->va_end) {
>> +		if (base + start < va->va_start) {
>>  			va = node_to_va(rb_prev(&va->rb_node));
>>  			base = pvm_determine_end_from_reverse(&va, align) - end;
>>  			term_area = area;
>> -- 
>> 2.21.0
>>
> I guess it is NUMA related issue, i mean when we have several
> areas/sizes/offsets. Is that correct?

I don't think NUMA has anything to do with it.  The vmalloc() area
itself doesn't have any NUMA properties I can think of.  We don't, for
instance, partition it into per-node areas that I know of.

I did encounter this issue on a system with ~100 logical CPUs, which is
a moderate amount these days.
Dennis Zhou July 30, 2019, 9:09 p.m. UTC | #3
On Tue, Jul 30, 2019 at 01:54:06PM -0700, Dave Hansen wrote:
> On 7/30/19 1:46 PM, Uladzislau Rezki wrote:
> >> +		/*
> >> +		 * If required width exeeds current VA block, move
> >> +		 * base downwards and then recheck.
> >> +		 */
> >> +		if (base + end > va->va_end) {
> >> +			base = pvm_determine_end_from_reverse(&va, align) - end;
> >> +			term_area = area;
> >> +			continue;
> >> +		}
> >> +
> >>  		/*
> >>  		 * If this VA does not fit, move base downwards and recheck.
> >>  		 */
> >> -		if (base + start < va->va_start || base + end > va->va_end) {
> >> +		if (base + start < va->va_start) {
> >>  			va = node_to_va(rb_prev(&va->rb_node));
> >>  			base = pvm_determine_end_from_reverse(&va, align) - end;
> >>  			term_area = area;
> >> -- 
> >> 2.21.0
> >>
> > I guess it is NUMA related issue, i mean when we have several
> > areas/sizes/offsets. Is that correct?
> 
> I don't think NUMA has anything to do with it.  The vmalloc() area
> itself doesn't have any NUMA properties I can think of.  We don't, for
> instance, partition it into per-node areas that I know of.
> 
> I did encounter this issue on a system with ~100 logical CPUs, which is
> a moderate amount these days.
> 

Percpu memory does have this restriction when we embed the first chunk,
as we need to preserve the offsets. That is when we'd require multiple
areas in the vma.

I didn't see the original patches come through, but this seems like it
restores the original functionality. FWIW, this way of finding space
isn't really smart, so it's possible we want to revisit this.

Acked-by: Dennis Zhou <dennis@kernel.org>

Thanks,
Dennis
Kuppuswamy Sathyanarayanan July 30, 2019, 9:13 p.m. UTC | #4
On 7/30/19 1:54 PM, Dave Hansen wrote:
> On 7/30/19 1:46 PM, Uladzislau Rezki wrote:
>>> +		/*
>>> +		 * If required width exeeds current VA block, move
>>> +		 * base downwards and then recheck.
>>> +		 */
>>> +		if (base + end > va->va_end) {
>>> +			base = pvm_determine_end_from_reverse(&va, align) - end;
>>> +			term_area = area;
>>> +			continue;
>>> +		}
>>> +
>>>   		/*
>>>   		 * If this VA does not fit, move base downwards and recheck.
>>>   		 */
>>> -		if (base + start < va->va_start || base + end > va->va_end) {
>>> +		if (base + start < va->va_start) {
>>>   			va = node_to_va(rb_prev(&va->rb_node));
>>>   			base = pvm_determine_end_from_reverse(&va, align) - end;
>>>   			term_area = area;
>>> -- 
>>> 2.21.0
>>>
>> I guess it is NUMA related issue, i mean when we have several
>> areas/sizes/offsets. Is that correct?
> I don't think NUMA has anything to do with it.  The vmalloc() area
> itself doesn't have any NUMA properties I can think of.  We don't, for
> instance, partition it into per-node areas that I know of.
>
> I did encounter this issue on a system with ~100 logical CPUs, which is
> a moderate amount these days.

I agree with Dave. I don't think this issue is related to NUMA. The 
problem here is about the logic we use to find appropriate vm_area that 
satisfies the offset and size requirements of pcpu memory allocator.

In my test case, I can reproduce this issue if we make request with 
offset (ffff000000) and size (600000).

>
Dennis Zhou July 30, 2019, 9:55 p.m. UTC | #5
On Tue, Jul 30, 2019 at 02:13:25PM -0700, sathyanarayanan kuppuswamy wrote:
> 
> On 7/30/19 1:54 PM, Dave Hansen wrote:
> > On 7/30/19 1:46 PM, Uladzislau Rezki wrote:
> > > > +		/*
> > > > +		 * If required width exeeds current VA block, move
> > > > +		 * base downwards and then recheck.
> > > > +		 */
> > > > +		if (base + end > va->va_end) {
> > > > +			base = pvm_determine_end_from_reverse(&va, align) - end;
> > > > +			term_area = area;
> > > > +			continue;
> > > > +		}
> > > > +
> > > >   		/*
> > > >   		 * If this VA does not fit, move base downwards and recheck.
> > > >   		 */
> > > > -		if (base + start < va->va_start || base + end > va->va_end) {
> > > > +		if (base + start < va->va_start) {
> > > >   			va = node_to_va(rb_prev(&va->rb_node));
> > > >   			base = pvm_determine_end_from_reverse(&va, align) - end;
> > > >   			term_area = area;
> > > > -- 
> > > > 2.21.0
> > > > 
> > > I guess it is NUMA related issue, i mean when we have several
> > > areas/sizes/offsets. Is that correct?
> > I don't think NUMA has anything to do with it.  The vmalloc() area
> > itself doesn't have any NUMA properties I can think of.  We don't, for
> > instance, partition it into per-node areas that I know of.
> > 
> > I did encounter this issue on a system with ~100 logical CPUs, which is
> > a moderate amount these days.
> 
> I agree with Dave. I don't think this issue is related to NUMA. The problem
> here is about the logic we use to find appropriate vm_area that satisfies
> the offset and size requirements of pcpu memory allocator.
> 
> In my test case, I can reproduce this issue if we make request with offset
> (ffff000000) and size (600000).
> 
> > 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux kernel developer
> 

I misspoke earlier. I don't think it's numa related either, but I think
you could trigger this much more easily this way as it could skip more
viable vma space because it'd have to find more holes.

But it seems that pvm_determine_end_from_reverse() will return the free
vma below the address if it is aligned so:

    base + end > va->va_end

will always be true and then push down the searching va instead of using
that va first.

Thanks,
Dennis
Kuppuswamy Sathyanarayanan July 30, 2019, 10:25 p.m. UTC | #6
On 7/30/19 2:55 PM, Dennis Zhou wrote:
> On Tue, Jul 30, 2019 at 02:13:25PM -0700, sathyanarayanan kuppuswamy wrote:
>> On 7/30/19 1:54 PM, Dave Hansen wrote:
>>> On 7/30/19 1:46 PM, Uladzislau Rezki wrote:
>>>>> +		/*
>>>>> +		 * If required width exeeds current VA block, move
>>>>> +		 * base downwards and then recheck.
>>>>> +		 */
>>>>> +		if (base + end > va->va_end) {
>>>>> +			base = pvm_determine_end_from_reverse(&va, align) - end;
>>>>> +			term_area = area;
>>>>> +			continue;
>>>>> +		}
>>>>> +
>>>>>    		/*
>>>>>    		 * If this VA does not fit, move base downwards and recheck.
>>>>>    		 */
>>>>> -		if (base + start < va->va_start || base + end > va->va_end) {
>>>>> +		if (base + start < va->va_start) {
>>>>>    			va = node_to_va(rb_prev(&va->rb_node));
>>>>>    			base = pvm_determine_end_from_reverse(&va, align) - end;
>>>>>    			term_area = area;
>>>>> -- 
>>>>> 2.21.0
>>>>>
>>>> I guess it is NUMA related issue, i mean when we have several
>>>> areas/sizes/offsets. Is that correct?
>>> I don't think NUMA has anything to do with it.  The vmalloc() area
>>> itself doesn't have any NUMA properties I can think of.  We don't, for
>>> instance, partition it into per-node areas that I know of.
>>>
>>> I did encounter this issue on a system with ~100 logical CPUs, which is
>>> a moderate amount these days.
>> I agree with Dave. I don't think this issue is related to NUMA. The problem
>> here is about the logic we use to find appropriate vm_area that satisfies
>> the offset and size requirements of pcpu memory allocator.
>>
>> In my test case, I can reproduce this issue if we make request with offset
>> (ffff000000) and size (600000).
>>
>> -- 
>> Sathyanarayanan Kuppuswamy
>> Linux kernel developer
>>
> I misspoke earlier. I don't think it's numa related either, but I think
> you could trigger this much more easily this way as it could skip more
> viable vma space because it'd have to find more holes.
>
> But it seems that pvm_determine_end_from_reverse() will return the free
> vma below the address if it is aligned so:
>
>      base + end > va->va_end
>
> will always be true and then push down the searching va instead of using
> that va first.

It won't always be true. Initially, the base address is calculated as below:

base = pvm_determine_end_from_reverse(&va, align) - end;

So it will not fail on the first iteration.
>
> Thanks,
> Dennis
>
Uladzislau Rezki July 30, 2019, 10:34 p.m. UTC | #7
Hello, Sathyanarayanan.

> 
> I agree with Dave. I don't think this issue is related to NUMA. The problem
> here is about the logic we use to find appropriate vm_area that satisfies
> the offset and size requirements of pcpu memory allocator.
> 
> In my test case, I can reproduce this issue if we make request with offset
> (ffff000000) and size (600000).
> 
Just to clarify, does it mean that on your setup you have only one area with the
600000 size and 0xffff000000 offset?

Thank you.

--
Vlad Rezki
Kuppuswamy Sathyanarayanan July 30, 2019, 10:37 p.m. UTC | #8
On 7/30/19 3:34 PM, Uladzislau Rezki wrote:
> Hello, Sathyanarayanan.
>
>> I agree with Dave. I don't think this issue is related to NUMA. The problem
>> here is about the logic we use to find appropriate vm_area that satisfies
>> the offset and size requirements of pcpu memory allocator.
>>
>> In my test case, I can reproduce this issue if we make request with offset
>> (ffff000000) and size (600000).
>>
> Just to clarify, does it mean that on your setup you have only one area with the
> 600000 size and 0xffff000000 offset?
No, it's 2 areas, with offsets (0, ffff000000) and sizes (a00000, 600000).
>
> Thank you.
>
> --
> Vlad Rezki
>
Uladzislau Rezki July 31, 2019, 12:04 p.m. UTC | #9
Hello, Sathyanarayanan.

> > Just to clarify, does it mean that on your setup you have only one area with the
> > 600000 size and 0xffff000000 offset?
> No, it's 2 areas, with offsets (0, ffff000000) and sizes (a00000, 600000).
> > 
Thank you for the clarification, that makes sense to me. I can also
reproduce the issue, so I agree with your patch. Basically, we could
skip a free VA block (one that can fit) and examine the previous one
instead (my fault), rather than moving the base downwards and rechecking
the area that did not fit.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Appreciate you for fixing it!

--
Vlad Rezki

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4fa8d84599b0..1faa45a38c08 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3269,10 +3269,20 @@  struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 		if (va == NULL)
 			goto overflow;
 
+		/*
+		 * If required width exeeds current VA block, move
+		 * base downwards and then recheck.
+		 */
+		if (base + end > va->va_end) {
+			base = pvm_determine_end_from_reverse(&va, align) - end;
+			term_area = area;
+			continue;
+		}
+
 		/*
 		 * If this VA does not fit, move base downwards and recheck.
 		 */
-		if (base + start < va->va_start || base + end > va->va_end) {
+		if (base + start < va->va_start) {
 			va = node_to_va(rb_prev(&va->rb_node));
 			base = pvm_determine_end_from_reverse(&va, align) - end;
 			term_area = area;