Message ID | 20200822095328.61306-1-songmuchun@bytedance.com (mailing list archive)
State      | New, archived
Series     | mm/hugetlb: Fix a race between hugetlb sysctl handlers
On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().

Where does __do_proc_doulongvec_minmax() write to table->data?

I think you're saying that there is a race between the assignment of
ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
of the same ctl_table->table in hugetlb_overcommit_handler()?

Or not, maybe I'm being thick. Can you please describe the race more
carefully and completely?

> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.
On 8/24/20 1:59 PM, Andrew Morton wrote:
> On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
>> There is a race between the assignment of `table->data` and write value
>> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
>
> Where does __do_proc_doulongvec_minmax() write to table->data?
>
> I think you're saying that there is a race between the assignment of
> ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> of the same ctl_table->table in hugetlb_overcommit_handler()?
>
> Or not, maybe I'm being thick. Can you please describe the race more
> carefully and completely?

I too am looking at this now and do not completely understand the race.
It could be that:

	hugetlb_sysctl_handler_common
		...
		table->data = &tmp;

and, do_proc_doulongvec_minmax()
	...
	return __do_proc_doulongvec_minmax(table->data, table, write, ...
with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
	...
	i = (unsigned long *) data;
	...
	*i = val;

So, __do_proc_doulongvec_minmax() can be dereferencing and writing to the
pointer in one thread while hugetlb_sysctl_handler_common() is setting it
in another?

Another confusing part of the message is the stack trace, which includes

	? set_max_huge_pages+0x3da/0x4f0
	? alloc_pool_huge_page+0x150/0x150

which are 'downstream' from these routines. I don't understand why these
are in the trace.

If the race is with the pointer set and dereference/write, then this type
of fix is OK. However, if you really have two 'sysadmin type' global
operations racing, then one or both are not going to get what they
expected. Instead of changing the code to 'handle the race', I think it
might be acceptable to just put a big semaphore around it.
Hi Andrew and Mike,

On Sat, Aug 22, 2020 at 5:53 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.

I am sorry, I didn't give more detail about how the race happens:

CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
  table->data = &tmp;                              hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
  proc_doulongvec_minmax
    do_proc_doulongvec_minmax                  sysctl_head_finish
      __do_proc_doulongvec_minmax
        i = table->data;
        *i = val; // corrupt CPU1 stack

>
> The following oops was seen:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> Code: Bad RIP value.

Here we can see the "Bad RIP value", so the stack frame was corrupted by
another writer.

> ...
> Call Trace:
>  ? set_max_huge_pages+0x3da/0x4f0
>  ? alloc_pool_huge_page+0x150/0x150
>  ? proc_doulongvec_minmax+0x46/0x60
>  ? hugetlb_sysctl_handler_common+0x1c7/0x200
>  ? nr_hugepages_store+0x20/0x20
>  ? copy_fd_bitmaps+0x170/0x170
>  ? hugetlb_sysctl_handler+0x1e/0x20
>  ? proc_sys_call_handler+0x2f1/0x300
>  ? unregister_sysctl_table+0xb0/0xb0
>  ? __fd_install+0x78/0x100
>  ? proc_sys_write+0x14/0x20
>  ? __vfs_write+0x4d/0x90
>  ? vfs_write+0xef/0x240
>  ? ksys_write+0xc0/0x160
>  ? __ia32_sys_read+0x50/0x50
>  ? __close_fd+0x129/0x150
>  ? __x64_sys_write+0x43/0x50
>  ? do_syscall_64+0x6c/0x200
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a301c2d672bf..818d6125af49 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
>  }
>
>  #ifdef CONFIG_SYSCTL
> +static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
> +					  void *buffer, size_t *length,
> +					  loff_t *ppos, unsigned long *out)
> +{
> +	struct ctl_table dup_table;
> +
> +	/*
> +	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
> +	 * can duplicate the @table and alter the duplicate of it.
> +	 */
> +	dup_table = *table;
> +	dup_table.data = out;
> +	dup_table.maxlen = sizeof(unsigned long);
> +
> +	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
> +}
> +
>  static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  			 struct ctl_table *table, int write,
>  			 void *buffer, size_t *length, loff_t *ppos)
> @@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  	if (!hugepages_supported())
>  		return -EOPNOTSUPP;
>
> -	table->data = &tmp;
> -	table->maxlen = sizeof(unsigned long);
> -	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +					     &tmp);
>  	if (ret)
>  		goto out;
>
> @@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>  	if (write && hstate_is_gigantic(h))
>  		return -EINVAL;
>
> -	table->data = &tmp;
> -	table->maxlen = sizeof(unsigned long);
> -	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +					     &tmp);
>  	if (ret)
>  		goto out;
>
> --
> 2.11.0
On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 1:59 PM, Andrew Morton wrote:
> > On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> >
> >> There is a race between the assignment of `table->data` and write value
> >> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> >
> > Where does __do_proc_doulongvec_minmax() write to table->data?
> >
> > I think you're saying that there is a race between the assignment of
> > ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> > of the same ctl_table->table in hugetlb_overcommit_handler()?
> >
> > Or not, maybe I'm being thick. Can you please describe the race more
> > carefully and completely?
>
> I too am looking at this now and do not completely understand the race.
> It could be that:
>
> 	hugetlb_sysctl_handler_common
> 		...
> 		table->data = &tmp;
>
> and, do_proc_doulongvec_minmax()
> 	...
> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> 	...
> 	i = (unsigned long *) data;
> 	...
> 	*i = val;
>
> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> in one thread when hugetlb_sysctl_handler_common is setting it in another?

Yes, you are right.

>
> Another confusing part of the message is the stack trace which includes
> 	...
> 	? set_max_huge_pages+0x3da/0x4f0
> 	? alloc_pool_huge_page+0x150/0x150
>
> which are 'downstream' from these routines. I don't understand why these
> are in the trace.

I am also confused. But this issue can be reproduced easily by letting more
than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
the issue can no longer be reproduced.

>
> If the race is with the pointer set and dereference/write, then this type of
> fix is OK.
> However, if you really have two 'sysadmin type' global operations
> racing then one or both are not going to get what they expected. Instead of

In our team, several developers share one test server. I guess the panic
happened when two people wrote to `/proc/sys/vm/nr_hugepages` at the same
time.

> changing the code to 'handle the race', I think it might be acceptable to just
> put a big semaphore around it.

That is also a good way to fix this issue. Thanks.

> --
> Mike Kravetz
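[Editor's note: for contrast with the posted patch, the 'big semaphore' alternative Mike mentions could look roughly like the untested fragment below. The lock name is hypothetical and the elided lines are unchanged handler code; this serializes the handlers instead of removing the shared mutable state.]

```c
/* Hypothetical: one lock serializing all hugetlb sysctl handlers. */
static DEFINE_MUTEX(hugetlb_sysctl_lock);

static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
			 struct ctl_table *table, int write,
			 void *buffer, size_t *length, loff_t *ppos)
{
	...
	mutex_lock(&hugetlb_sysctl_lock);
	table->data = &tmp;
	table->maxlen = sizeof(unsigned long);
	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
	mutex_unlock(&hugetlb_sysctl_lock);
	...
}
```

The trade-off: the lock also keeps two administrative updates from interleaving (part of Mike's point), while the posted patch only removes the shared `table->data` write, leaving the two operations free to race at the policy level.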
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
I don't think the Fixes line is correct. The original patch
just used a global variable and didn't have this race.
It must have been added later in some other patch that added
hugetlb_sysctl_handler_common.
-Andi
On 8/24/20 8:01 PM, Muchun Song wrote:
> On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I too am looking at this now and do not completely understand the race.
>> It could be that:
>>
>> 	hugetlb_sysctl_handler_common
>> 		...
>> 		table->data = &tmp;
>>
>> and, do_proc_doulongvec_minmax()
>> 	...
>> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
>> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
>> 	...
>> 	i = (unsigned long *) data;
>> 	...
>> 	*i = val;
>>
>> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
>> in one thread when hugetlb_sysctl_handler_common is setting it in another?
>
> Yes, you are right.
>
>>
>> Another confusing part of the message is the stack trace which includes
>> 	...
>> 	? set_max_huge_pages+0x3da/0x4f0
>> 	? alloc_pool_huge_page+0x150/0x150
>>
>> which are 'downstream' from these routines. I don't understand why these
>> are in the trace.
>
> I am also confused. But this issue can be reproduced easily by letting more
> than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> the issue can not be reproduced and disappears.

There certainly is an issue here as one thread can modify data in another.
However, I am having a hard time seeing what causes the 'kernel NULL pointer
dereference'.

I tried to reproduce the issue myself but was unsuccessful. I have 16 threads
writing to /proc/sys/vm/nr_hugepages in an infinite loop. After several hours
running, I did not hit the issue. Just curious, what architecture is the
system? Any special config or compiler options?

If you can easily reproduce, can you post the detailed oops message?

The 'NULL pointer' seems strange because after the first assignment to
table->data the value should never be NULL. Certainly it can be modified
by another thread, but I can not see how it can be NULL. At the beginning
of __do_proc_doulongvec_minmax(), there is a check for a NULL pointer:

	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
It appears to use the same value/register for the pointer throughout the
routine. IOW, I do not see how the pointer can be NULL for the assignment
when the routine does:

	*i = val;

Again, your analysis/patch points out a real issue. I just want to get
a better understanding to make sure there is not another issue causing
the NULL pointer dereference.
Hi Andi,

On Tue, Aug 25, 2020 at 11:34 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
>
> I don't think the Fixes line is correct. The original patch
> just used a global variable and didn't have this race.
> It must have been added later in some other patch that added
> hugetlb_sysctl_handler_common.

I don't agree. Commit e5ff215941d5 does not just use a global variable; it
introduced the local (stack) variable tmp and the assignment to table->data.
Below is a snippet of that patch. Thanks.

@@ -1037,8 +1109,19 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 			   struct file *file, void __user *buffer,
 			   size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
+	unsigned long tmp;
+
+	if (!write)
+		tmp = h->max_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-	max_huge_pages = set_max_huge_pages(max_huge_pages);
+
+	if (write)
+		h->max_huge_pages = set_max_huge_pages(h, tmp);
+
 	return 0;
 }

>
> -Andi
On Wed, Aug 26, 2020 at 8:03 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 8:01 PM, Muchun Song wrote:
> > On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> I too am looking at this now and do not completely understand the race.
> >> It could be that:
> >>
> >> 	hugetlb_sysctl_handler_common
> >> 		...
> >> 		table->data = &tmp;
> >>
> >> and, do_proc_doulongvec_minmax()
> >> 	...
> >> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
> >> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> >> 	...
> >> 	i = (unsigned long *) data;
> >> 	...
> >> 	*i = val;
> >>
> >> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> >> in one thread when hugetlb_sysctl_handler_common is setting it in another?
> >
> > Yes, you are right.
> >
> >>
> >> Another confusing part of the message is the stack trace which includes
> >> 	...
> >> 	? set_max_huge_pages+0x3da/0x4f0
> >> 	? alloc_pool_huge_page+0x150/0x150
> >>
> >> which are 'downstream' from these routines. I don't understand why these
> >> are in the trace.
> >
> > I am also confused. But this issue can be reproduced easily by letting more
> > than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> > the issue can not be reproduced and disappears.
>
> There certainly is an issue here as one thread can modify data in another.
> However, I am having a hard time seeing what causes the 'kernel NULL pointer
> dereference'.

If you write 0 to /proc/sys/vm/nr_hugepages, you will get:

	kernel NULL pointer dereference, address: 0000000000000000

If you write 1024 to /proc/sys/vm/nr_hugepages, you will get:

	kernel NULL pointer dereference, address: 0000000000000400

The address of the dereference is the value that you wrote to
/proc/sys/vm/nr_hugepages.

>
> I tried to reproduce the issue myself but was unsuccessful.
> I have 16 threads writing to /proc/sys/vm/nr_hugepages in an infinite loop.
> After several hours running, I did not hit the issue. Just curious, what
> architecture is the system? any special config or compiler options?
>
> If you can easily reproduce, can you post the detailed oops message?
>
> The 'NULL pointer' seems strange because after the first assignment to
> table->data the value should never be NULL. Certainly it can be modified
> by another thread, but I can not see how it can be NULL. At the beginning
> of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:

CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
  table->data = &tmp;                              hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
  proc_doulongvec_minmax
    do_proc_doulongvec_minmax                  sysctl_head_finish
      __do_proc_doulongvec_minmax
        i = table->data;
        *i = val; // corrupt CPU1 stack

If the val is 0, you will see the NULL.

>
> 	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
> 		*lenp = 0;
> 		return 0;
> 	}
>
> I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
> It appears to use the same value/register for the pointer throughout the
> routine. IOW, I do not see how the pointer can be NULL for the assignment
> when the routine does:
>
> 	*i = val;
>
> Again, your analysis/patch points out a real issue. I just want to get
> a better understanding to make sure there is not another issue causing
> the NULL pointer dereference.

Below is my test script; 8 threads execute it in parallel. In my qemu VM,
it panics easily. Thanks.

#!/bin/sh

while :
do
	echo 128 > /proc/sys/vm/nr_hugepages
	echo 0 > /proc/sys/vm/nr_hugepages
done

> --
> Mike Kravetz
On 8/25/20 7:47 PM, Muchun Song wrote:
>
> CPU0:                                        CPU1:
>                                              proc_sys_write
> hugetlb_sysctl_handler                         proc_sys_call_handler
> hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
>   table->data = &tmp;                              hugetlb_sysctl_handler_common
>                                                      table->data = &tmp;
>   proc_doulongvec_minmax
>     do_proc_doulongvec_minmax                  sysctl_head_finish
>       __do_proc_doulongvec_minmax
>         i = table->data;
>         *i = val; // corrupt CPU1 stack

Thanks Muchun!
Can you please add this to the commit message.

Also, when looking closer at the patch I do not think setting table->maxlen
is necessary in these routines. maxlen is set when the hugetlb ctl_table
entries are defined and initialized. This is not something you introduced.
The unnecessary assignments are in the existing code. However, there is no
need to carry them forward.
On Fri, Aug 28, 2020 at 5:51 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/25/20 7:47 PM, Muchun Song wrote:
> >
> > CPU0:                                        CPU1:
> >                                              proc_sys_write
> > hugetlb_sysctl_handler                         proc_sys_call_handler
> > hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
> >   table->data = &tmp;                              hugetlb_sysctl_handler_common
> >                                                      table->data = &tmp;
> >   proc_doulongvec_minmax
> >     do_proc_doulongvec_minmax                  sysctl_head_finish
> >       __do_proc_doulongvec_minmax
> >         i = table->data;
> >         *i = val; // corrupt CPU1 stack
>
> Thanks Muchun!
> Can you please add this to the commit message.

OK, I will do that. Thanks.

>
> Also, when looking closer at the patch I do not think setting table->maxlen
> is necessary in these routines. maxlen is set when the hugetlb ctl_table
> entries are defined and initialized. This is not something you introduced.
> The unnecessary assignments are in the existing code. However, there is no
> need to carry them forward.

Yeah, I agree with you. I will remove the unnecessary assignment of
table->maxlen.

>
> --
> Mike Kravetz
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..818d6125af49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
 }
 
 #ifdef CONFIG_SYSCTL
+static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
+					  void *buffer, size_t *length,
+					  loff_t *ppos, unsigned long *out)
+{
+	struct ctl_table dup_table;
+
+	/*
+	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
+	 * can duplicate the @table and alter the duplicate of it.
+	 */
+	dup_table = *table;
+	dup_table.data = out;
+	dup_table.maxlen = sizeof(unsigned long);
+
+	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
+}
+
 static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 			 struct ctl_table *table, int write,
 			 void *buffer, size_t *length, loff_t *ppos)
@@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 	if (!hugepages_supported())
 		return -EOPNOTSUPP;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
 
@@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
There is a race between the assignment of `table->data` and the write through
that pointer in __do_proc_doulongvec_minmax(). Fix this by duplicating
`table` and only updating the duplicate. Also introduce a helper,
proc_hugetlb_doulongvec_minmax(), to simplify the code.

The following oops was seen:

	BUG: kernel NULL pointer dereference, address: 0000000000000000
	#PF: supervisor instruction fetch in kernel mode
	#PF: error_code(0x0010) - not-present page
	Code: Bad RIP value.
	...
	Call Trace:
	 ? set_max_huge_pages+0x3da/0x4f0
	 ? alloc_pool_huge_page+0x150/0x150
	 ? proc_doulongvec_minmax+0x46/0x60
	 ? hugetlb_sysctl_handler_common+0x1c7/0x200
	 ? nr_hugepages_store+0x20/0x20
	 ? copy_fd_bitmaps+0x170/0x170
	 ? hugetlb_sysctl_handler+0x1e/0x20
	 ? proc_sys_call_handler+0x2f1/0x300
	 ? unregister_sysctl_table+0xb0/0xb0
	 ? __fd_install+0x78/0x100
	 ? proc_sys_write+0x14/0x20
	 ? __vfs_write+0x4d/0x90
	 ? vfs_write+0xef/0x240
	 ? ksys_write+0xc0/0x160
	 ? __ia32_sys_read+0x50/0x50
	 ? __close_fd+0x129/0x150
	 ? __x64_sys_write+0x43/0x50
	 ? do_syscall_64+0x6c/0x200
	 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)