Message ID | 20220923103858.26729-1-fw@strlen.de (mailing list archive)
---|---
State | Not Applicable
Delegated to: | Netdev Maintainers
Series | [mm] mm: fix BUG with kvzalloc+GFP_ATOMIC
Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch
On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> kernel BUG at mm/vmalloc.c:2437!
> invalid opcode: 0000 [#1] SMP
> CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> [..]
> RIP: 0010:__get_vm_area_node+0x120/0x130
>  __vmalloc_node_range+0x96/0x1e0
>  kvmalloc_node+0x92/0xb0
>  bucket_table_alloc.isra.0+0x47/0x140
>  rhashtable_try_insert+0x3a4/0x440
>  rhashtable_insert_slow+0x1b/0x30
> [..]
>
> bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> falls through to vmalloc and hits code paths that assume GFP_KERNEL.
>
> Revert the problematic change and stay with the slab allocator.

Why don't you simply fix the caller?
Michal Hocko <mhocko@suse.com> wrote:
> On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > kernel BUG at mm/vmalloc.c:2437!
> > invalid opcode: 0000 [#1] SMP
> > CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> > [..]
> > RIP: 0010:__get_vm_area_node+0x120/0x130
> >  __vmalloc_node_range+0x96/0x1e0
> >  kvmalloc_node+0x92/0xb0
> >  bucket_table_alloc.isra.0+0x47/0x140
> >  rhashtable_try_insert+0x3a4/0x440
> >  rhashtable_insert_slow+0x1b/0x30
> > [..]
> >
> > bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> >
> > Revert the problematic change and stay with the slab allocator.
>
> Why don't you simply fix the caller?

Uh, not following?

kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?

I can revert 93f976b5190df32793908d49165f78e67fcb66cf instead,
but that change is from 2018.
On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > kernel BUG at mm/vmalloc.c:2437!
> > > invalid opcode: 0000 [#1] SMP
> > > CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> > > [..]
> > > RIP: 0010:__get_vm_area_node+0x120/0x130
> > >  __vmalloc_node_range+0x96/0x1e0
> > >  kvmalloc_node+0x92/0xb0
> > >  bucket_table_alloc.isra.0+0x47/0x140
> > >  rhashtable_try_insert+0x3a4/0x440
> > >  rhashtable_insert_slow+0x1b/0x30
> > > [..]
> > >
> > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > >
> > > Revert the problematic change and stay with the slab allocator.
> >
> > Why don't you simply fix the caller?
>
> Uh, not following?
>
> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
>
<snip>
static struct vm_struct *__get_vm_area_node(unsigned long size,
		unsigned long align, unsigned long shift, unsigned long flags,
		unsigned long start, unsigned long end, int node,
		gfp_t gfp_mask, const void *caller)
{
	struct vmap_area *va;
	struct vm_struct *area;
	unsigned long requested_size = size;

	BUG_ON(in_interrupt());
	...
<snip>

vmalloc is not supposed to be called from IRQ context. If it is
possible, it would be better to fix bucket_table_alloc() by not calling
it from IRQ context.

--
Uladzislau Rezki
Uladzislau Rezki <urezki@gmail.com> wrote:
> On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> > Michal Hocko <mhocko@suse.com> wrote:
> > > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > > kernel BUG at mm/vmalloc.c:2437!
> > > > invalid opcode: 0000 [#1] SMP
> > > > CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> > > > [..]
> > > > RIP: 0010:__get_vm_area_node+0x120/0x130
> > > >  __vmalloc_node_range+0x96/0x1e0
> > > >  kvmalloc_node+0x92/0xb0
> > > >  bucket_table_alloc.isra.0+0x47/0x140
> > > >  rhashtable_try_insert+0x3a4/0x440
> > > >  rhashtable_insert_slow+0x1b/0x30
> > > > [..]
> > > >
> > > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> > > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > >
> > > > Revert the problematic change and stay with the slab allocator.
> > >
> > > Why don't you simply fix the caller?
> >
> > Uh, not following?
> >
> > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> >
> <snip>
> static struct vm_struct *__get_vm_area_node(unsigned long size,
> 		unsigned long align, unsigned long shift, unsigned long flags,
> 		unsigned long start, unsigned long end, int node,
> 		gfp_t gfp_mask, const void *caller)
> {
> 	struct vmap_area *va;
> 	struct vm_struct *area;
> 	unsigned long requested_size = size;
>
> 	BUG_ON(in_interrupt());
> 	...
> <snip>
>
> vmalloc is not supposed to be called from IRQ context.

It uses kvzalloc, not the vmalloc api.

Before 2018, rhashtable used kzalloc OR kvzalloc, depending on gfp_t.

Quote from the 93f976b5190df327939 changelog:
    As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
    incompatible gfp flags") we can simplify the caller
    and trust kvzalloc() to just do the right thing.

I fear that if this isn't allowed it will result in hard-to-spot bugs,
because things will work fine until a fallback to vmalloc happens.

rhashtable may not be the only user of the kvmalloc api that relies on
the ability to call it from (soft)irq context.
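Florian's point is that before 93f976b5190d the caller picked the allocator itself based on the gfp flags. The shape of that pre-2018 pattern can be sketched as a small userspace model. This is purely illustrative: the flag constants and helper names below are stand-ins, not the real kernel GFP machinery.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in flag bits; the real kernel values differ. */
#define M_DIRECT_RECLAIM 0x1u              /* models __GFP_DIRECT_RECLAIM */
#define M_GFP_KERNEL     M_DIRECT_RECLAIM  /* may sleep */
#define M_GFP_ATOMIC     0x0u              /* must not sleep */

static int vmalloc_capable_path;  /* records which allocator was chosen */

static void *model_kzalloc(size_t size)
{
	return calloc(1, size);    /* models slab: usable in atomic context */
}

static void *model_kvzalloc(size_t size)
{
	vmalloc_capable_path = 1;  /* models kvzalloc: may fall back to vmalloc */
	return calloc(1, size);
}

/* Pre-93f976b5190d shape: the caller picks the allocator based on gfp. */
static void *bucket_table_alloc_model(size_t size, unsigned int gfp)
{
	if (gfp & M_DIRECT_RECLAIM)
		return model_kvzalloc(size);  /* sleeping allowed: vmalloc ok */
	return model_kzalloc(size);           /* atomic: stay with slab */
}
```

With this split, an atomic request can never reach a vmalloc-style path, which is exactly the property the 2018 simplification quietly gave up.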
On Fri, Sep 23, 2022 at 04:54:09PM +0200, Florian Westphal wrote:
> Uladzislau Rezki <urezki@gmail.com> wrote:
> > On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> > > Michal Hocko <mhocko@suse.com> wrote:
> > > > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > > > kernel BUG at mm/vmalloc.c:2437!
> > > > > invalid opcode: 0000 [#1] SMP
> > > > > CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> > > > > [..]
> > > > > RIP: 0010:__get_vm_area_node+0x120/0x130
> > > > >  __vmalloc_node_range+0x96/0x1e0
> > > > >  kvmalloc_node+0x92/0xb0
> > > > >  bucket_table_alloc.isra.0+0x47/0x140
> > > > >  rhashtable_try_insert+0x3a4/0x440
> > > > >  rhashtable_insert_slow+0x1b/0x30
> > > > > [..]
> > > > >
> > > > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> > > > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > > >
> > > > > Revert the problematic change and stay with the slab allocator.
> > > >
> > > > Why don't you simply fix the caller?
> > >
> > > Uh, not following?
> > >
> > > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> > >
> > <snip>
> > static struct vm_struct *__get_vm_area_node(unsigned long size,
> > 		unsigned long align, unsigned long shift, unsigned long flags,
> > 		unsigned long start, unsigned long end, int node,
> > 		gfp_t gfp_mask, const void *caller)
> > {
> > 	struct vmap_area *va;
> > 	struct vm_struct *area;
> > 	unsigned long requested_size = size;
> >
> > 	BUG_ON(in_interrupt());
> > 	...
> > <snip>
> >
> > vmalloc is not supposed to be called from IRQ context.
>
> It uses kvzalloc, not the vmalloc api.
>
> Before 2018, rhashtable used kzalloc OR kvzalloc, depending on gfp_t.
>
> Quote from the 93f976b5190df327939 changelog:
>     As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
>     incompatible gfp flags") we can simplify the caller
>     and trust kvzalloc() to just do the right thing.
>
> I fear that if this isn't allowed it will result in hard-to-spot bugs
> because things will work fine until a fallback to vmalloc happens.
>
> rhashtable may not be the only user of the kvmalloc api that relies on
> the ability to call it from (soft)irq context.

Doing "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
is also a problem nowadays. Such code should be fixed across the kernel
because of PREEMPT_RT support.

--
Uladzislau Rezki
On Fri 23-09-22 15:35:12, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > kernel BUG at mm/vmalloc.c:2437!
> > > invalid opcode: 0000 [#1] SMP
> > > CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
> > > [..]
> > > RIP: 0010:__get_vm_area_node+0x120/0x130
> > >  __vmalloc_node_range+0x96/0x1e0
> > >  kvmalloc_node+0x92/0xb0
> > >  bucket_table_alloc.isra.0+0x47/0x140
> > >  rhashtable_try_insert+0x3a4/0x440
> > >  rhashtable_insert_slow+0x1b/0x30
> > > [..]
> > >
> > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
> > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > >
> > > Revert the problematic change and stay with the slab allocator.
> >
> > Why don't you simply fix the caller?
>
> Uh, not following?
>
> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?

kvmalloc has never really supported GFP_ATOMIC semantics.

> I can revert 93f976b5190df32793908d49165f78e67fcb66cf instead
> but that change is from 2018.

Yeah, I would just revert this one as it relies on internal details of
kvmalloc doing or not doing a fallback.
Michal Hocko <mhocko@suse.com> wrote:
> > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
>
> kvmalloc has never really supported GFP_ATOMIC semantics.

It did, you added it:
ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags")
On Mon 26-09-22 09:56:39, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> >
> > kvmalloc has never really supported GFP_ATOMIC semantics.
>
> It did, you added it:
> ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags")

Yes, I am very well aware of this commit and I have to say I wasn't
really super happy about it, TBH. Linus has argued this will result in
saner code, and in some cases this was true. Later on we really had to
add support for some extensions beyond GFP_KERNEL. Your change would
break those GFP_NOFAIL and NOFS usecases. GFP_NOWAIT and GFP_ATOMIC are
explicitly documented as unsupported.

One thing we can do to continue in the ce91f6ee5b3b sense is to do this
instead:

diff --git a/mm/util.c b/mm/util.c
index 0837570c9225..a27b3fce1f0e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -618,6 +618,10 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+
+	/* non-sleeping allocations are not supported by vmalloc */
+	if (!gfpflags_allow_blocking(flags))
+		return NULL;
 
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {

A better option to me seems to be reworking rhashtable_insert_rehash
to not rely on an atomic allocation. I am not familiar with that code,
but it seems to me that the only reason this allocation mode is used is
the rcu locking around rhashtable_try_insert. Is there any reason we
cannot drop the rcu lock, allocate with the full GFP_KERNEL allocation
power and retry with the pre-allocated object? rhashtable_insert_slow is
already doing that to implement its never-fail semantic.
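The effect of the gfpflags_allow_blocking() guard proposed above can be modeled in a few lines of userspace C. This is a hypothetical sketch, not kernel code: the flag constant and helper are stand-ins, and the slab attempt failing is simulated with a parameter rather than real memory pressure.

```c
#include <assert.h>
#include <stdlib.h>

#define M_DIRECT_RECLAIM 0x1u  /* models __GFP_DIRECT_RECLAIM */

/* models gfpflags_allow_blocking(): may this request sleep? */
static int allow_blocking(unsigned int flags)
{
	return (flags & M_DIRECT_RECLAIM) != 0;
}

/*
 * Models the proposed kvmalloc_node() flow: try the kmalloc path first;
 * on failure, fall back to the vmalloc path only if the request may
 * sleep. kmalloc_fails simulates the slab attempt failing.
 */
static void *kvmalloc_node_model(size_t size, unsigned int flags,
				 int kmalloc_fails)
{
	if (!kmalloc_fails)
		return malloc(size);  /* kmalloc succeeded */

	/* non-sleeping allocations are not supported by vmalloc */
	if (!allow_blocking(flags))
		return NULL;

	return malloc(size);          /* vmalloc fallback */
}
```

The point of the guard is visible in the atomic case: instead of reaching a vmalloc path that would trip BUG_ON(in_interrupt()), the call simply fails and the caller sees NULL, the same outcome as a plain kmalloc failure.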
On Mon 26-09-22 10:58:00, Michal Hocko wrote:
[...]
> A better option to me seems to be reworking the rhashtable_insert_rehash
> to not rely on an atomic allocation. I am not familiar with that code
> but it seems to me that the only reason this allocation mode is used is
> due to rcu locking around rhashtable_try_insert. Is there any reason we
> cannot drop the rcu lock, allocate with the full GFP_KERNEL allocation
> power and retry with the pre allocated object? rhashtable_insert_slow is
> already doing that to implement its never fail semantic.

So a very blunt and likely not 100% correct take on this side of things.
But it should give an idea at least.
---
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e12bbfb240b8..fc547c43b05d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -437,31 +437,11 @@ static void rht_deferred_worker(struct work_struct *work)
 }
 
 static int rhashtable_insert_rehash(struct rhashtable *ht,
-				    struct bucket_table *tbl)
+				    struct bucket_table *tbl,
+				    struct bucket_table *new_tbl)
 {
-	struct bucket_table *old_tbl;
-	struct bucket_table *new_tbl;
-	unsigned int size;
 	int err;
 
-	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-
-	size = tbl->size;
-
-	err = -EBUSY;
-
-	if (rht_grow_above_75(ht, tbl))
-		size *= 2;
-	/* Do not schedule more than one rehash */
-	else if (old_tbl != tbl)
-		goto fail;
-
-	err = -ENOMEM;
-
-	new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
-	if (new_tbl == NULL)
-		goto fail;
-
 	err = rhashtable_rehash_attach(ht, tbl, new_tbl);
 	if (err) {
 		bucket_table_free(new_tbl);
@@ -471,17 +451,6 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 	schedule_work(&ht->run_work);
 
 	return err;
-
-fail:
-	/* Do not fail the insert if someone else did a rehash. */
-	if (likely(rcu_access_pointer(tbl->future_tbl)))
-		return 0;
-
-	/* Schedule async rehash to retry allocation in process context. */
-	if (err == -ENOMEM)
-		schedule_work(&ht->run_work);
-
-	return err;
 }
 
 static void *rhashtable_lookup_one(struct rhashtable *ht,
@@ -619,9 +588,33 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 		}
 	} while (!IS_ERR_OR_NULL(new_tbl));
 
-	if (PTR_ERR(data) == -EAGAIN)
-		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl) ?:
+	if (PTR_ERR(data) == -EAGAIN) {
+		struct bucket_table *old_tbl;
+		unsigned int size;
+
+		old_tbl = rht_dereference_rcu(ht->tbl, ht);
+		size = tbl->size;
+
+		data = ERR_PTR(-EBUSY);
+
+		if (rht_grow_above_75(ht, tbl))
+			size *= 2;
+		/* Do not schedule more than one rehash */
+		else if (old_tbl != tbl)
+			return data;
+
+		data = ERR_PTR(-ENOMEM);
+
+		rcu_read_unlock();
+		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
+		rcu_read_lock();
+
+		if (!new_tbl)
+			return data;
+
+		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl, new_tbl) ?:
 			       -EAGAIN);
+	}
 
 	return data;
 }
Michal Hocko <mhocko@suse.com> wrote:
> +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> +		size = tbl->size;
> +
> +		data = ERR_PTR(-EBUSY);
> +
> +		if (rht_grow_above_75(ht, tbl))
> +			size *= 2;
> +		/* Do not schedule more than one rehash */
> +		else if (old_tbl != tbl)
> +			return data;
> +
> +		data = ERR_PTR(-ENOMEM);
> +
> +		rcu_read_unlock();
> +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> +		rcu_read_lock();

I don't think this is going to work, there can be callers that
rely on rcu-protected data structures getting freed.

Also, the network stack assumes synchronize_rcu() means that all
in-flight packets have completed processing.
On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> > +		size = tbl->size;
> > +
> > +		data = ERR_PTR(-EBUSY);
> > +
> > +		if (rht_grow_above_75(ht, tbl))
> > +			size *= 2;
> > +		/* Do not schedule more than one rehash */
> > +		else if (old_tbl != tbl)
> > +			return data;
> > +
> > +		data = ERR_PTR(-ENOMEM);
> > +
> > +		rcu_read_unlock();
> > +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> > +		rcu_read_lock();
>
> I don't think this is going to work, there can be callers that
> rely on rcu-protected data structures getting freed.

The caller of this function drops RCU for each retry, why should the
called function be any special?
Michal Hocko <mhocko@suse.com> wrote:
> On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> > Michal Hocko <mhocko@suse.com> wrote:
> > > +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> > > +		size = tbl->size;
> > > +
> > > +		data = ERR_PTR(-EBUSY);
> > > +
> > > +		if (rht_grow_above_75(ht, tbl))
> > > +			size *= 2;
> > > +		/* Do not schedule more than one rehash */
> > > +		else if (old_tbl != tbl)
> > > +			return data;
> > > +
> > > +		data = ERR_PTR(-ENOMEM);
> > > +
> > > +		rcu_read_unlock();
> > > +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> > > +		rcu_read_lock();
> >
> > I don't think this is going to work, there can be callers that
> > rely on rcu-protected data structures getting freed.
>
> The caller of this function drops RCU for each retry, why should the
> called function be any special?

I was unfortunately never able to fully understand rhashtable.

AFAICS the rcu_read_lock/unlock in the caller is pointless, or at least
dubious. To the best of my knowledge there are users of this interface
that invoke it with the rcu read lock held, and since those always nest,
the rcu_read_unlock() won't move us to GFP_KERNEL territory.

I guess you can add a might_sleep() and ask the kernel to barf at
runtime.
Florian Westphal <fw@strlen.de> wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> > > I don't think this is going to work, there can be callers that
> > > rely on rcu-protected data structures getting freed.
> >
> > The caller of this function drops RCU for each retry, why should the
> > called function be any special?
>
> I was unfortunately never able to fully understand rhashtable.

Obviously.

> AFAICS the rcu_read_lock/unlock in the caller is pointless,
> or at least dubious.

Addendum, I can't read:

void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
			     struct rhash_head *obj)
{
	void *data;

	do {
		rcu_read_lock();
		data = rhashtable_try_insert(ht, key, obj);
		rcu_read_unlock();
	} while (PTR_ERR(data) == -EAGAIN);

	return data;
}

... which is needed to prevent a lockdep splat in
rhashtable_try_insert() -- there is no guarantee the caller already
holds rcu_read_lock().

> To the best of my knowledge there are users of this interface that
> invoke it with the rcu read lock held, and since those always nest, the
> rcu_read_unlock() won't move us to GFP_KERNEL territory.
>
> I guess you can add a might_sleep() and ask the kernel to barf at runtime.

I did, and it triggers. Caller is inet_frag_find(), triggered via
'ping -s 60000 $addr'.
On Mon 26-09-22 15:08:08, Florian Westphal wrote:
[...]
> To the best of my knowledge there are users of this interface that
> invoke it with the rcu read lock held, and since those always nest, the
> rcu_read_unlock() won't move us to GFP_KERNEL territory.

Fair point. I can see a comment above rhashtable_insert_fast, which is
supposed to be used by drivers and is explicitly documented to be
compatible with an atomic context. So the above is clearly a no-go.

In that case I would propose to go in the direction of the patch at
http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz
For that we could use an alternative implementation for the vmalloc
fallback for non-sleeping requests in the future. A single branch for an
unlikely event is not nice but acceptable, I would say, especially if we
want to stay in line with Linus' original "prevent hacks for gfp_mask in
callers".
Michal Hocko <mhocko@suse.com> wrote:
> On Mon 26-09-22 15:08:08, Florian Westphal wrote:
> [...]
> > To the best of my knowledge there are users of this interface that
> > invoke it with the rcu read lock held, and since those always nest, the
> > rcu_read_unlock() won't move us to GFP_KERNEL territory.
>
> Fair point. I can see a comment above rhashtable_insert_fast, which is
> supposed to be used by drivers and is explicitly documented to be
> compatible with an atomic context. So the above is clearly a no-go.
>
> In that case I would propose to go in the direction of the patch at
> http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz

FWIW that patch works for me. Will you do a formal patch submission?
On Mon 26-09-22 16:20:13, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 26-09-22 15:08:08, Florian Westphal wrote:
> > [...]
> > > To the best of my knowledge there are users of this interface that
> > > invoke it with the rcu read lock held, and since those always nest, the
> > > rcu_read_unlock() won't move us to GFP_KERNEL territory.
> >
> > Fair point. I can see a comment above rhashtable_insert_fast, which is
> > supposed to be used by drivers and is explicitly documented to be
> > compatible with an atomic context. So the above is clearly a no-go.
> >
> > In that case I would propose to go in the direction of the patch at
> > http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz
>
> FWIW that patch works for me. Will you do a formal patch submission?

Feel free to use it and glue it to the actual bug report. Thanks!
On 9/23/22 17:10, Uladzislau Rezki wrote:
> On Fri, Sep 23, 2022 at 04:54:09PM +0200, Florian Westphal wrote:
>> Uladzislau Rezki <urezki@gmail.com> wrote:
>>> On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
>>>> Michal Hocko <mhocko@suse.com> wrote:
>>>>> On Fri 23-09-22 12:38:58, Florian Westphal wrote:
>>>>>> Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
>>>>>> kernel BUG at mm/vmalloc.c:2437!
>>>>>> invalid opcode: 0000 [#1] SMP
>>>>>> CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
>>>>>> [..]
>>>>>> RIP: 0010:__get_vm_area_node+0x120/0x130
>>>>>>  __vmalloc_node_range+0x96/0x1e0
>>>>>>  kvmalloc_node+0x92/0xb0
>>>>>>  bucket_table_alloc.isra.0+0x47/0x140
>>>>>>  rhashtable_try_insert+0x3a4/0x440
>>>>>>  rhashtable_insert_slow+0x1b/0x30
>>>>>> [..]
>>>>>>
>>>>>> bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
>>>>>> falls through to vmalloc and hits code paths that assume GFP_KERNEL.
>>>>>>
>>>>>> Revert the problematic change and stay with the slab allocator.
>>>>>
>>>>> Why don't you simply fix the caller?
>>>>
>>>> Uh, not following?
>>>>
>>>> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
>>>>
>>> <snip>
>>> static struct vm_struct *__get_vm_area_node(unsigned long size,
>>> 		unsigned long align, unsigned long shift, unsigned long flags,
>>> 		unsigned long start, unsigned long end, int node,
>>> 		gfp_t gfp_mask, const void *caller)
>>> {
>>> 	struct vmap_area *va;
>>> 	struct vm_struct *area;
>>> 	unsigned long requested_size = size;
>>>
>>> 	BUG_ON(in_interrupt());
>>> 	...
>>> <snip>
>>>
>>> vmalloc is not supposed to be called from IRQ context.
>>
>> It uses kvzalloc, not the vmalloc api.
>>
>> Before 2018, rhashtable used kzalloc OR kvzalloc, depending on gfp_t.
>>
>> Quote from the 93f976b5190df327939 changelog:
>>     As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
>>     incompatible gfp flags") we can simplify the caller
>>     and trust kvzalloc() to just do the right thing.
>>
>> I fear that if this isn't allowed it will result in hard-to-spot bugs
>> because things will work fine until a fallback to vmalloc happens.
>>
>> rhashtable may not be the only user of the kvmalloc api that relies on
>> the ability to call it from (soft)irq context.
>>
> Doing "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
> is also a problem nowadays. Such code should be fixed across the kernel
> because of PREEMPT_RT support.

But the "atomic context" here is different, no? Calling kmalloc() from
IRQ handlers AFAIK is OK, as IRQ handlers are threaded on PREEMPT_RT.
Calling it inside a local_irq_disable() section would be a problem on
the other hand. But then under e.g. spin_lock_irqsave() it could be OK,
as those don't really disable irqs on RT.

> --
> Uladzislau Rezki
On 2022-09-26 17:03:48 [+0200], Vlastimil Babka wrote:
> > Doing "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
> > is also a problem nowadays. Such code should be fixed across the kernel
> > because of PREEMPT_RT support.

You should make sure that the context in question is atomic on
PREEMPT_RT before fixing it. My guess here is that it is the softirq
(NAPI) callback, which is fine.

> But the "atomic context" here is different, no? Calling kmalloc() from IRQ
> handlers AFAIK is OK, as IRQ handlers are threaded on PREEMPT_RT. Calling it
> inside a local_irq_disable() section would be a problem on the other hand.
> But then under e.g. spin_lock_irqsave() it could be OK, as those don't
> really disable irqs on RT.

Correct.

Sebastian
diff --git a/mm/util.c b/mm/util.c
index c9439c66d8cf..ba7fe1cb6846 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -593,6 +593,13 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
+	/*
+	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
+	 * so the given set of flags has to be compatible.
+	 */
+	if ((flags & GFP_KERNEL) != GFP_KERNEL)
+		return kmalloc_node(size, flags, node);
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore
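One side effect of checking against the full GFP_KERNEL mask, sketched below in a userspace model: flags such as GFP_NOFS, which still allow blocking but clear some GFP_KERNEL bits, are also diverted to the kmalloc-only path. This is a hypothetical illustration only; the flag values are stand-ins, not the real kernel constants.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in flag bits; real kernel values differ. */
#define M_RECLAIM    0x1u
#define M_IO         0x2u
#define M_FS         0x4u
#define M_GFP_KERNEL (M_RECLAIM | M_IO | M_FS)
#define M_GFP_NOFS   (M_RECLAIM | M_IO)  /* may sleep, but no FS reclaim */
#define M_GFP_ATOMIC 0x0u

static int took_vmalloc_capable_path;

/* Models the posted patch: anything not fully GFP_KERNEL-compatible
 * goes straight to the kmalloc path and never reaches the vmalloc
 * fallback further down in kvmalloc_node(). */
static void *kvmalloc_node_model(size_t size, unsigned int flags)
{
	if ((flags & M_GFP_KERNEL) != M_GFP_KERNEL)
		return calloc(1, size);  /* kmalloc_node() only */

	took_vmalloc_capable_path = 1;   /* may fall back to vmalloc */
	return calloc(1, size);
}
```

Note how this differs from a plain gfpflags_allow_blocking() test: GFP_NOFS would pass a blocking check but is still diverted here, which lines up with Michal's remark earlier in the thread that NOFS usecases need care.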
Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
 kernel BUG at mm/vmalloc.c:2437!
 invalid opcode: 0000 [#1] SMP
 CPU: 28 PID: 0 Comm: swapper/28 Tainted: G W O 5.19.9 #1
 [..]
 RIP: 0010:__get_vm_area_node+0x120/0x130
  __vmalloc_node_range+0x96/0x1e0
  kvmalloc_node+0x92/0xb0
  bucket_table_alloc.isra.0+0x47/0x140
  rhashtable_try_insert+0x3a4/0x440
  rhashtable_insert_slow+0x1b/0x30
 [..]

bucket_table_alloc uses kvzalloc(GFP_ATOMIC). If kmalloc fails, this now
falls through to vmalloc and hits code paths that assume GFP_KERNEL.

Revert the problematic change and stay with the slab allocator.

Reported-by: Martin Zaharinov <micron10@gmail.com>
Fixes: a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
Cc: Michal Hocko <mhocko@suse.com>
Link: https://lore.kernel.org/netdev/09BE0B8A-3ADF-458E-B75E-931B74996355@gmail.com/T/#u
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 mm/util.c | 7 +++++++
 1 file changed, 7 insertions(+)