Message ID | 20200822095328.61306-1-songmuchun@bytedance.com (mailing list archive)
State      | New, archived
Series     | mm/hugetlb: Fix a race between hugetlb sysctl handlers
On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().

Where does __do_proc_doulongvec_minmax() write to table->data?

I think you're saying that there is a race between the assignment of
ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
of the same ctl_table->table in hugetlb_overcommit_handler()?

Or not, maybe I'm being thick. Can you please describe the race more
carefully and completely?

> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.
On 8/24/20 1:59 PM, Andrew Morton wrote:
> On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
>> There is a race between the assignment of `table->data` and write value
>> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
>
> Where does __do_proc_doulongvec_minmax() write to table->data?
>
> I think you're saying that there is a race between the assignment of
> ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> of the same ctl_table->table in hugetlb_overcommit_handler()?
>
> Or not, maybe I'm being thick. Can you please describe the race more
> carefully and completely?

I too am looking at this now and do not completely understand the race.
It could be that:

	hugetlb_sysctl_handler_common
		...
		table->data = &tmp;

and, do_proc_doulongvec_minmax()
	...
	return __do_proc_doulongvec_minmax(table->data, table, write, ...
with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
	...
	i = (unsigned long *) data;
	...
	*i = val;

So, __do_proc_doulongvec_minmax() can be dereferencing and writing to the
pointer in one thread while hugetlb_sysctl_handler_common() is setting it
in another?

Another confusing part of the message is the stack trace, which includes

	? set_max_huge_pages+0x3da/0x4f0
	? alloc_pool_huge_page+0x150/0x150

which are 'downstream' from these routines. I don't understand why these
are in the trace.

If the race is with the pointer set and dereference/write, then this type
of fix is OK. However, if you really have two 'sysadmin type' global
operations racing, then one or both are not going to get what they
expected. Instead of changing the code to 'handle the race', I think it
might be acceptable to just put a big semaphore around it.
Hi Andrew and Mike,

On Sat, Aug 22, 2020 at 5:53 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.

I am sorry, I didn't give more detail about how the race happens:

CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
  table->data = &tmp;                              hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
  proc_doulongvec_minmax
    do_proc_doulongvec_minmax                  sysctl_head_finish
      __do_proc_doulongvec_minmax
        i = table->data;
        *i = val; // corrupt CPU1 stack

>
> The following oops was seen:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> Code: Bad RIP value.

Here we can see the "Bad RIP value", so the stack frame was corrupted by
another writer.

> ...
> Call Trace:
>  ? set_max_huge_pages+0x3da/0x4f0
>  ? alloc_pool_huge_page+0x150/0x150
>  ? proc_doulongvec_minmax+0x46/0x60
>  ? hugetlb_sysctl_handler_common+0x1c7/0x200
>  ? nr_hugepages_store+0x20/0x20
>  ? copy_fd_bitmaps+0x170/0x170
>  ? hugetlb_sysctl_handler+0x1e/0x20
>  ? proc_sys_call_handler+0x2f1/0x300
>  ? unregister_sysctl_table+0xb0/0xb0
>  ? __fd_install+0x78/0x100
>  ? proc_sys_write+0x14/0x20
>  ? __vfs_write+0x4d/0x90
>  ? vfs_write+0xef/0x240
>  ? ksys_write+0xc0/0x160
>  ? __ia32_sys_read+0x50/0x50
>  ? __close_fd+0x129/0x150
>  ? __x64_sys_write+0x43/0x50
>  ? do_syscall_64+0x6c/0x200
>  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a301c2d672bf..818d6125af49 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
>  }
>
>  #ifdef CONFIG_SYSCTL
> +static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
> +					  void *buffer, size_t *length,
> +					  loff_t *ppos, unsigned long *out)
> +{
> +	struct ctl_table dup_table;
> +
> +	/*
> +	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
> +	 * can duplicate the @table and alter the duplicate of it.
> +	 */
> +	dup_table = *table;
> +	dup_table.data = out;
> +	dup_table.maxlen = sizeof(unsigned long);
> +
> +	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
> +}
> +
>  static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  			 struct ctl_table *table, int write,
>  			 void *buffer, size_t *length, loff_t *ppos)
> @@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>  	if (!hugepages_supported())
>  		return -EOPNOTSUPP;
>
> -	table->data = &tmp;
> -	table->maxlen = sizeof(unsigned long);
> -	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +					     &tmp);
>  	if (ret)
>  		goto out;
>
> @@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>  	if (write && hstate_is_gigantic(h))
>  		return -EINVAL;
>
> -	table->data = &tmp;
> -	table->maxlen = sizeof(unsigned long);
> -	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +					     &tmp);
>  	if (ret)
>  		goto out;
>
> --
> 2.11.0
On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 1:59 PM, Andrew Morton wrote:
> > On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> >
> >> There is a race between the assignment of `table->data` and write value
> >> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> >
> > Where does __do_proc_doulongvec_minmax() write to table->data?
> >
> > I think you're saying that there is a race between the assignment of
> > ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> > of the same ctl_table->table in hugetlb_overcommit_handler()?
> >
> > Or not, maybe I'm being thick. Can you please describe the race more
> > carefully and completely?
>
> I too am looking at this now and do not completely understand the race.
> It could be that:
>
> 	hugetlb_sysctl_handler_common
> 		...
> 		table->data = &tmp;
>
> and, do_proc_doulongvec_minmax()
> 	...
> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> 	...
> 	i = (unsigned long *) data;
> 	...
> 	*i = val;
>
> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> in one thread when hugetlb_sysctl_handler_common is setting it in another?

Yes, you are right.

>
> Another confusing part of the message is the stack trace which includes
> 	...
> 	? set_max_huge_pages+0x3da/0x4f0
> 	? alloc_pool_huge_page+0x150/0x150
>
> which are 'downstream' from these routines. I don't understand why these
> are in the trace.

I am also confused. But this issue can be reproduced easily by letting more
than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
the issue can no longer be reproduced.

>
> If the race is with the pointer set and dereference/write, then this type of
> fix is OK.
> However, if you really have two 'sysadmin type' global operations
> racing then one or both are not going to get what they expected. Instead of

In our team, several developers share one test server. I guess the panic
happened when two people wrote to `/proc/sys/vm/nr_hugepages` at the same
time.

> changing the code to 'handle the race', I think it might be acceptable to just
> put a big semaphore around it.

That is also a good way to fix this issue. Thanks.

> --
> Mike Kravetz
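[Editor's note: for contrast with the posted patch, the 'big semaphore' alternative Mike mentions could look roughly like the untested fragment below. The lock name is hypothetical and the elided lines are unchanged handler code; this serializes the handlers instead of removing the shared mutable state.]

```c
/* Hypothetical: one lock serializing all hugetlb sysctl handlers. */
static DEFINE_MUTEX(hugetlb_sysctl_lock);

static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
			 struct ctl_table *table, int write,
			 void *buffer, size_t *length, loff_t *ppos)
{
	...
	mutex_lock(&hugetlb_sysctl_lock);
	table->data = &tmp;
	table->maxlen = sizeof(unsigned long);
	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
	mutex_unlock(&hugetlb_sysctl_lock);
	...
}
```

The trade-off: the lock also keeps two administrative updates from interleaving (part of Mike's point), while the posted patch only removes the shared `table->data` write, leaving the two operations free to race at the policy level.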
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
I don't think the Fixes line is correct. The original patch
just used a global variable and didn't have this race.
It must have been added later in some other patch that added
hugetlb_sysctl_handler_common.
-Andi
On 8/24/20 8:01 PM, Muchun Song wrote:
> On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I too am looking at this now and do not completely understand the race.
>> It could be that:
>>
>> 	hugetlb_sysctl_handler_common
>> 		...
>> 		table->data = &tmp;
>>
>> and, do_proc_doulongvec_minmax()
>> 	...
>> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
>> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
>> 	...
>> 	i = (unsigned long *) data;
>> 	...
>> 	*i = val;
>>
>> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
>> in one thread when hugetlb_sysctl_handler_common is setting it in another?
>
> Yes, you are right.
>
>>
>> Another confusing part of the message is the stack trace which includes
>> 	...
>> 	? set_max_huge_pages+0x3da/0x4f0
>> 	? alloc_pool_huge_page+0x150/0x150
>>
>> which are 'downstream' from these routines. I don't understand why these
>> are in the trace.
>
> I am also confused. But this issue can be reproduced easily by letting more
> than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> the issue can not be reproduced and disappears.

There certainly is an issue here as one thread can modify data in another.
However, I am having a hard time seeing what causes the 'kernel NULL pointer
dereference'.

I tried to reproduce the issue myself but was unsuccessful. I have 16 threads
writing to /proc/sys/vm/nr_hugepages in an infinite loop. After several hours
running, I did not hit the issue. Just curious, what architecture is the
system? Any special config or compiler options?

If you can easily reproduce, can you post the detailed oops message?

The 'NULL pointer' seems strange because after the first assignment to
table->data the value should never be NULL. Certainly it can be modified
by another thread, but I can not see how it can be NULL. At the beginning
of __do_proc_doulongvec_minmax(), there is a check for a NULL pointer:

	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
It appears to use the same value/register for the pointer throughout the
routine. IOW, I do not see how the pointer can be NULL for the assignment
when the routine does:

	*i = val;

Again, your analysis/patch points out a real issue. I just want to get
a better understanding to make sure there is not another issue causing
the NULL pointer dereference.
Hi Andi,

On Tue, Aug 25, 2020 at 11:34 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
>
> I don't think the Fixes line is correct. The original patch
> just used a global variable and didn't have this race.
> It must have been added later in some other patch that added
> hugetlb_sysctl_handler_common.

I don't agree. Commit e5ff215941d5 does not just use a global variable; it
introduced the local (stack) variable tmp and the assignment to table->data.
Below is a snippet of that patch. Thanks.

@@ -1037,8 +1109,19 @@ int hugetlb_sysctl_handler(struct ctl_table *table, int write,
 			   struct file *file, void __user *buffer,
 			   size_t *length, loff_t *ppos)
 {
+	struct hstate *h = &default_hstate;
+	unsigned long tmp;
+
+	if (!write)
+		tmp = h->max_huge_pages;
+
+	table->data = &tmp;
+	table->maxlen = sizeof(unsigned long);
 	proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-	max_huge_pages = set_max_huge_pages(max_huge_pages);
+
+	if (write)
+		h->max_huge_pages = set_max_huge_pages(h, tmp);
+
 	return 0;
 }

>
> -Andi
On Wed, Aug 26, 2020 at 8:03 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 8:01 PM, Muchun Song wrote:
> > On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> I too am looking at this now and do not completely understand the race.
> >> It could be that:
> >>
> >> 	hugetlb_sysctl_handler_common
> >> 		...
> >> 		table->data = &tmp;
> >>
> >> and, do_proc_doulongvec_minmax()
> >> 	...
> >> 	return __do_proc_doulongvec_minmax(table->data, table, write, ...
> >> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> >> 	...
> >> 	i = (unsigned long *) data;
> >> 	...
> >> 	*i = val;
> >>
> >> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> >> in one thread when hugetlb_sysctl_handler_common is setting it in another?
> >
> > Yes, you are right.
> >
> >>
> >> Another confusing part of the message is the stack trace which includes
> >> 	...
> >> 	? set_max_huge_pages+0x3da/0x4f0
> >> 	? alloc_pool_huge_page+0x150/0x150
> >>
> >> which are 'downstream' from these routines. I don't understand why these
> >> are in the trace.
> >
> > I am also confused. But this issue can be reproduced easily by letting more
> > than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> > the issue can not be reproduced and disappears.
>
> There certainly is an issue here as one thread can modify data in another.
> However, I am having a hard time seeing what causes the 'kernel NULL pointer
> dereference'.

If you write 0 to /proc/sys/vm/nr_hugepages, you will get:

	kernel NULL pointer dereference, address: 0000000000000000

If you write 1024 to /proc/sys/vm/nr_hugepages, you will get:

	kernel NULL pointer dereference, address: 0000000000000400

The address of the dereference is the value that you wrote to
/proc/sys/vm/nr_hugepages.

>
> I tried to reproduce the issue myself but was unsuccessful.
> I have 16 threads writing to /proc/sys/vm/nr_hugepages in an infinite loop.
> After several hours running, I did not hit the issue. Just curious, what
> architecture is the system? any special config or compiler options?
>
> If you can easily reproduce, can you post the detailed oops message?
>
> The 'NULL pointer' seems strange because after the first assignment to
> table->data the value should never be NULL. Certainly it can be modified
> by another thread, but I can not see how it can be NULL. At the beginning
> of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:

CPU0:                                        CPU1:
                                             proc_sys_write
hugetlb_sysctl_handler                         proc_sys_call_handler
hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
  table->data = &tmp;                              hugetlb_sysctl_handler_common
                                                     table->data = &tmp;
  proc_doulongvec_minmax
    do_proc_doulongvec_minmax                  sysctl_head_finish
      __do_proc_doulongvec_minmax
        i = table->data;
        *i = val; // corrupt CPU1 stack

If the val is 0, you will see the NULL.

>
> 	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
> 		*lenp = 0;
> 		return 0;
> 	}
>
> I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
> It appears to use the same value/register for the pointer throughout the
> routine. IOW, I do not see how the pointer can be NULL for the assignment
> when the routine does:
>
> 	*i = val;
>
> Again, your analysis/patch points out a real issue. I just want to get
> a better understanding to make sure there is not another issue causing
> the NULL pointer dereference.

Below is my test script; 8 threads execute it in parallel. In my qemu VM,
it panics easily. Thanks.

#!/bin/sh

while :
do
	echo 128 > /proc/sys/vm/nr_hugepages
	echo 0 > /proc/sys/vm/nr_hugepages
done

> --
> Mike Kravetz
On 8/25/20 7:47 PM, Muchun Song wrote:
>
> CPU0:                                        CPU1:
>                                              proc_sys_write
> hugetlb_sysctl_handler                         proc_sys_call_handler
> hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
>   table->data = &tmp;                              hugetlb_sysctl_handler_common
>                                                      table->data = &tmp;
>   proc_doulongvec_minmax
>     do_proc_doulongvec_minmax                  sysctl_head_finish
>       __do_proc_doulongvec_minmax
>         i = table->data;
>         *i = val; // corrupt CPU1 stack

Thanks Muchun!
Can you please add this to the commit message.

Also, when looking closer at the patch I do not think setting table->maxlen
is necessary in these routines. maxlen is set when the hugetlb ctl_table
entries are defined and initialized. This is not something you introduced.
The unnecessary assignments are in the existing code. However, there is no
need to carry them forward.
On Fri, Aug 28, 2020 at 5:51 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/25/20 7:47 PM, Muchun Song wrote:
> >
> > CPU0:                                        CPU1:
> >                                              proc_sys_write
> > hugetlb_sysctl_handler                         proc_sys_call_handler
> > hugetlb_sysctl_handler_common                    hugetlb_sysctl_handler
> >   table->data = &tmp;                              hugetlb_sysctl_handler_common
> >                                                      table->data = &tmp;
> >   proc_doulongvec_minmax
> >     do_proc_doulongvec_minmax                  sysctl_head_finish
> >       __do_proc_doulongvec_minmax
> >         i = table->data;
> >         *i = val; // corrupt CPU1 stack
>
> Thanks Muchun!
> Can you please add this to the commit message.

OK, I will do that. Thanks.

>
> Also, when looking closer at the patch I do not think setting table->maxlen
> is necessary in these routines. maxlen is set when the hugetlb ctl_table
> entries are defined and initialized. This is not something you introduced.
> The unnecessary assignments are in the existing code. However, there is no
> need to carry them forward.

Yeah, I agree with you. I will remove the unnecessary assignment of
table->maxlen.

>
> --
> Mike Kravetz
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..818d6125af49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
 }
 
 #ifdef CONFIG_SYSCTL
+static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
+					  void *buffer, size_t *length,
+					  loff_t *ppos, unsigned long *out)
+{
+	struct ctl_table dup_table;
+
+	/*
+	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
+	 * can duplicate the @table and alter the duplicate of it.
+	 */
+	dup_table = *table;
+	dup_table.data = out;
+	dup_table.maxlen = sizeof(unsigned long);
+
+	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
+}
+
 static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 			 struct ctl_table *table, int write,
 			 void *buffer, size_t *length, loff_t *ppos)
@@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 	if (!hugepages_supported())
 		return -EOPNOTSUPP;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
 
@@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
There is a race between the assignment of `table->data` and the write through
that pointer in __do_proc_doulongvec_minmax(). Fix this by duplicating
`table` and only updating the duplicate. Also introduce a helper,
proc_hugetlb_doulongvec_minmax(), to simplify the code.

The following oops was seen:

	BUG: kernel NULL pointer dereference, address: 0000000000000000
	#PF: supervisor instruction fetch in kernel mode
	#PF: error_code(0x0010) - not-present page
	Code: Bad RIP value.
	...
	Call Trace:
	 ? set_max_huge_pages+0x3da/0x4f0
	 ? alloc_pool_huge_page+0x150/0x150
	 ? proc_doulongvec_minmax+0x46/0x60
	 ? hugetlb_sysctl_handler_common+0x1c7/0x200
	 ? nr_hugepages_store+0x20/0x20
	 ? copy_fd_bitmaps+0x170/0x170
	 ? hugetlb_sysctl_handler+0x1e/0x20
	 ? proc_sys_call_handler+0x2f1/0x300
	 ? unregister_sysctl_table+0xb0/0xb0
	 ? __fd_install+0x78/0x100
	 ? proc_sys_write+0x14/0x20
	 ? __vfs_write+0x4d/0x90
	 ? vfs_write+0xef/0x240
	 ? ksys_write+0xc0/0x160
	 ? __ia32_sys_read+0x50/0x50
	 ? __close_fd+0x129/0x150
	 ? __x64_sys_write+0x43/0x50
	 ? do_syscall_64+0x6c/0x200
	 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)