
[mm] mm: fix BUG with kvzalloc+GFP_ATOMIC

Message ID 20220923103858.26729-1-fw@strlen.de (mailing list archive)
State New
Series [mm] mm: fix BUG with kvzalloc+GFP_ATOMIC

Commit Message

Florian Westphal Sept. 23, 2022, 10:38 a.m. UTC
Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
 kernel BUG at mm/vmalloc.c:2437!
 invalid opcode: 0000 [#1] SMP
 CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
 [..]
 RIP: 0010:__get_vm_area_node+0x120/0x130
  __vmalloc_node_range+0x96/0x1e0
  kvmalloc_node+0x92/0xb0
  bucket_table_alloc.isra.0+0x47/0x140
  rhashtable_try_insert+0x3a4/0x440
  rhashtable_insert_slow+0x1b/0x30
 [..]

bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
falls through to vmalloc and hits code paths that assume GFP_KERNEL.

Revert the problematic change and stay with slab allocator.

Reported-by: Martin Zaharinov <micron10@gmail.com>
Fixes: a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
Cc: Michal Hocko <mhocko@suse.com>
Link: https://lore.kernel.org/netdev/09BE0B8A-3ADF-458E-B75E-931B74996355@gmail.com/T/#u

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 mm/util.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Michal Hocko Sept. 23, 2022, 1:29 p.m. UTC | #1
On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
>  kernel BUG at mm/vmalloc.c:2437!
>  invalid opcode: 0000 [#1] SMP
>  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
>  [..]
>  RIP: 0010:__get_vm_area_node+0x120/0x130
>   __vmalloc_node_range+0x96/0x1e0
>   kvmalloc_node+0x92/0xb0
>   bucket_table_alloc.isra.0+0x47/0x140
>   rhashtable_try_insert+0x3a4/0x440
>   rhashtable_insert_slow+0x1b/0x30
>  [..]
> 
> bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> 
> Revert the problematic change and stay with slab allocator.

Why don't you simply fix the caller?
Florian Westphal Sept. 23, 2022, 1:35 p.m. UTC | #2
Michal Hocko <mhocko@suse.com> wrote:
> On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> >  kernel BUG at mm/vmalloc.c:2437!
> >  invalid opcode: 0000 [#1] SMP
> >  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
> >  [..]
> >  RIP: 0010:__get_vm_area_node+0x120/0x130
> >   __vmalloc_node_range+0x96/0x1e0
> >   kvmalloc_node+0x92/0xb0
> >   bucket_table_alloc.isra.0+0x47/0x140
> >   rhashtable_try_insert+0x3a4/0x440
> >   rhashtable_insert_slow+0x1b/0x30
> >  [..]
> > 
> > bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > 
> > Revert the problematic change and stay with slab allocator.
> 
> Why don't you simply fix the caller?

Uh, not following?

kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?

I can revert 93f976b5190df32793908d49165f78e67fcb66cf instead
but that change is from 2018.
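
For reference, a rough sketch of what that revert would bring back in
bucket_table_alloc() (from memory of the pre-2018 code, so an approximation,
not the exact hunk):

<snip>
	size = sizeof(*tbl) + nbuckets * sizeof(tbl->buckets[0]);

	/* pick the allocator based on the gfp flags the caller passed in */
	if (gfp != GFP_KERNEL)
		tbl = kzalloc(size, gfp | __GFP_NOWARN | __GFP_NORETRY);
	else
		tbl = kvzalloc(size, gfp);
<snip>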
Uladzislau Rezki Sept. 23, 2022, 2:43 p.m. UTC | #3
On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > >  kernel BUG at mm/vmalloc.c:2437!
> > >  invalid opcode: 0000 [#1] SMP
> > >  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
> > >  [..]
> > >  RIP: 0010:__get_vm_area_node+0x120/0x130
> > >   __vmalloc_node_range+0x96/0x1e0
> > >   kvmalloc_node+0x92/0xb0
> > >   bucket_table_alloc.isra.0+0x47/0x140
> > >   rhashtable_try_insert+0x3a4/0x440
> > >   rhashtable_insert_slow+0x1b/0x30
> > >  [..]
> > > 
> > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > 
> > > Revert the problematic change and stay with slab allocator.
> > 
> > Why don't you simply fix the caller?
> 
> Uh, not following?
> 
> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
>
<snip>
static struct vm_struct *__get_vm_area_node(unsigned long size,
		unsigned long align, unsigned long shift, unsigned long flags,
		unsigned long start, unsigned long end, int node,
		gfp_t gfp_mask, const void *caller)
{
	struct vmap_area *va;
	struct vm_struct *area;
	unsigned long requested_size = size;

	BUG_ON(in_interrupt());
...
<snip>

vmalloc is not supposed to be called from IRQ context. If possible, it would
be better to fix bucket_table_alloc() so that it is not called from IRQ
context.

--
Uladzislau Rezki
Florian Westphal Sept. 23, 2022, 2:54 p.m. UTC | #4
Uladzislau Rezki <urezki@gmail.com> wrote:
> On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> > Michal Hocko <mhocko@suse.com> wrote:
> > > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > >  kernel BUG at mm/vmalloc.c:2437!
> > > >  invalid opcode: 0000 [#1] SMP
> > > >  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
> > > >  [..]
> > > >  RIP: 0010:__get_vm_area_node+0x120/0x130
> > > >   __vmalloc_node_range+0x96/0x1e0
> > > >   kvmalloc_node+0x92/0xb0
> > > >   bucket_table_alloc.isra.0+0x47/0x140
> > > >   rhashtable_try_insert+0x3a4/0x440
> > > >   rhashtable_insert_slow+0x1b/0x30
> > > >  [..]
> > > > 
> > > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> > > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > > 
> > > > Revert the problematic change and stay with slab allocator.
> > > 
> > > Why don't you simply fix the caller?
> > 
> > Uh, not following?
> > 
> > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> >
> <snip>
> static struct vm_struct *__get_vm_area_node(unsigned long size,
> 		unsigned long align, unsigned long shift, unsigned long flags,
> 		unsigned long start, unsigned long end, int node,
> 		gfp_t gfp_mask, const void *caller)
> {
> 	struct vmap_area *va;
> 	struct vm_struct *area;
> 	unsigned long requested_size = size;
> 
> 	BUG_ON(in_interrupt());
> ...
> <snip>
> 
> vmalloc is not supposed to be called from the IRQ context.

It uses kvzalloc, not vmalloc api.

Before 2018, rhashtable did use kzalloc OR kvzalloc, depending on gfp_t.

Quote from 93f976b5190df327939 changelog:
  As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
  incompatible gfp flags") we can simplify the caller
  and trust kvzalloc() to just do the right thing.

I fear that if this isn't allowed it will result in hard-to-spot bugs
because things will work fine until a fallback to vmalloc happens.

rhashtable may not be the only user of the kvmalloc api that relies on the
ability to call it from (soft)irq.
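
To spell out where the fallback bites, a simplified sketch of the
kvmalloc_node() flow after a421ef303008 (not the verbatim source):

<snip>
void *kvmalloc_node(size_t size, gfp_t flags, int node)
{
	void *ret;

	/* optimistic slab attempt first */
	ret = kmalloc_node(size, flags, node);
	if (ret || size <= PAGE_SIZE)
		return ret;

	/*
	 * Otherwise fall through to vmalloc.  With GFP_ATOMIC this is the
	 * __vmalloc_node_range() -> __get_vm_area_node() path from the
	 * backtrace above, which assumes a sleepable context and trips
	 * BUG_ON(in_interrupt()).
	 */
	return __vmalloc_node(size, 1, flags, node,
			      __builtin_return_address(0));
}
<snip>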
Uladzislau Rezki Sept. 23, 2022, 3:10 p.m. UTC | #5
On Fri, Sep 23, 2022 at 04:54:09PM +0200, Florian Westphal wrote:
> Uladzislau Rezki <urezki@gmail.com> wrote:
> > On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
> > > Michal Hocko <mhocko@suse.com> wrote:
> > > > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > > > >  kernel BUG at mm/vmalloc.c:2437!
> > > > >  invalid opcode: 0000 [#1] SMP
> > > > >  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
> > > > >  [..]
> > > > >  RIP: 0010:__get_vm_area_node+0x120/0x130
> > > > >   __vmalloc_node_range+0x96/0x1e0
> > > > >   kvmalloc_node+0x92/0xb0
> > > > >   bucket_table_alloc.isra.0+0x47/0x140
> > > > >   rhashtable_try_insert+0x3a4/0x440
> > > > >   rhashtable_insert_slow+0x1b/0x30
> > > > >  [..]
> > > > > 
> > > > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> > > > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > > > 
> > > > > Revert the problematic change and stay with slab allocator.
> > > > 
> > > > Why don't you simply fix the caller?
> > > 
> > > Uh, not following?
> > > 
> > > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> > >
> > <snip>
> > static struct vm_struct *__get_vm_area_node(unsigned long size,
> > 		unsigned long align, unsigned long shift, unsigned long flags,
> > 		unsigned long start, unsigned long end, int node,
> > 		gfp_t gfp_mask, const void *caller)
> > {
> > 	struct vmap_area *va;
> > 	struct vm_struct *area;
> > 	unsigned long requested_size = size;
> > 
> > 	BUG_ON(in_interrupt());
> > ...
> > <snip>
> > 
> > vmalloc is not supposed to be called from the IRQ context.
> 
> It uses kvzalloc, not vmalloc api.
> 
> Before 2018, rhashtable did use kzalloc OR kvzalloc, depending on gfp_t.
> 
> Quote from 93f976b5190df327939 changelog:
>   As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
>   incompatible gfp flags") we can simplify the caller
>   and trust kvzalloc() to just do the right thing.
> 
> I fear that if this isn't allowed it will result in hard-to-spot bugs
> because things will work fine until a fallback to vmalloc happens.
> 
> rhashtable may not be the only user of kvmalloc api that rely on
> ability to call it from (soft)irq.
>
Doing the "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
is also a problem nowadays. Such code should be fixed across the kernel
because of PREEMPT_RT support.

--
Uladzislau Rezki
Michal Hocko Sept. 26, 2022, 7:49 a.m. UTC | #6
On Fri 23-09-22 15:35:12, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Fri 23-09-22 12:38:58, Florian Westphal wrote:
> > > Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
> > >  kernel BUG at mm/vmalloc.c:2437!
> > >  invalid opcode: 0000 [#1] SMP
> > >  CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
> > >  [..]
> > >  RIP: 0010:__get_vm_area_node+0x120/0x130
> > >   __vmalloc_node_range+0x96/0x1e0
> > >   kvmalloc_node+0x92/0xb0
> > >   bucket_table_alloc.isra.0+0x47/0x140
> > >   rhashtable_try_insert+0x3a4/0x440
> > >   rhashtable_insert_slow+0x1b/0x30
> > >  [..]
> > > 
> > > bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
> > > falls through to vmalloc and hits code paths that assume GFP_KERNEL.
> > > 
> > > Revert the problematic change and stay with slab allocator.
> > 
> > Why don't you simply fix the caller?
> 
> Uh, not following?
> 
> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?

kvmalloc has never really supported GFP_ATOMIC semantics.
 
> I can revert 93f976b5190df32793908d49165f78e67fcb66cf instead
> but that change is from 2018.

Yeah I would just revert this one as it relies on internal details of
kvmalloc doing or not doing a fallback.
Florian Westphal Sept. 26, 2022, 7:56 a.m. UTC | #7
Michal Hocko <mhocko@suse.com> wrote:
> > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> 
> kvmalloc has never really supported GFP_ATOMIC semantic.

It did, you added it:
ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags")
Michal Hocko Sept. 26, 2022, 8:57 a.m. UTC | #8
On Mon 26-09-22 09:56:39, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > > kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
> > 
> > kvmalloc has never really supported GFP_ATOMIC semantic.
> 
> It did, you added it:
> ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags")

Yes, I am very well aware of this commit and I have to say I wasn't
really super happy about it TBH. Linus argued this would result in
saner code, and in some cases this was true.

Later on we had to add support for some extensions beyond GFP_KERNEL.
Your change would break those GFP_NOFAIL and NOFS usecases. GFP_NOWAIT
and GFP_ATOMIC are explicitly documented as unsupported. One thing we
can do to continue in the spirit of ce91f6ee5b3b is this instead:

diff --git a/mm/util.c b/mm/util.c
index 0837570c9225..a27b3fce1f0e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -618,6 +618,10 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
+	
+	/* non-sleeping allocations are not supported by vmalloc */
+	if (!gfpflags_allow_blocking(flags))
+		return NULL;
 
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
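
For clarity, the difference between this check and the
(flags & GFP_KERNEL) != GFP_KERNEL test in the posted patch, as I read it
(illustration only, the helper names below are made up):

<snip>
/* GFP_KERNEL is (__GFP_RECLAIM | __GFP_IO | __GFP_FS);
 * gfpflags_allow_blocking() tests __GFP_DIRECT_RECLAIM. */

static bool fallback_ok_posted_patch(gfp_t flags)
{
	/* anything short of full GFP_KERNEL is diverted to kmalloc_node(),
	 * so GFP_NOFS/GFP_NOIO would lose the vmalloc fallback as well */
	return (flags & GFP_KERNEL) == GFP_KERNEL;
}

static bool fallback_ok_this_hunk(gfp_t flags)
{
	/* only non-sleeping requests (GFP_ATOMIC, GFP_NOWAIT) are refused;
	 * GFP_NOFS/GFP_NOIO keep the vmalloc fallback */
	return gfpflags_allow_blocking(flags);
}
<snip>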

A better option to me seems to be reworking rhashtable_insert_rehash
to not rely on an atomic allocation. I am not familiar with that code,
but it seems to me that the only reason this allocation mode is used is
the rcu locking around rhashtable_try_insert. Is there any reason we
cannot drop the rcu lock, allocate with the full GFP_KERNEL allocation
power and retry with the pre-allocated object? rhashtable_insert_slow is
already doing that to implement its never-fail semantics.
Michal Hocko Sept. 26, 2022, 9:30 a.m. UTC | #9
On Mon 26-09-22 10:58:00, Michal Hocko wrote:
[...]
> A better option to me seems to be reworking the rhashtable_insert_rehash
> to not rely on an atomic allocation. I am not familiar with that code
> but it seems to me that the only reason this allocation mode is used is
> due to rcu locking around rhashtable_try_insert. Is there any reason we
> cannot drop the rcu lock, allocate with the full GFP_KERNEL allocation
> power and retry with the pre allocated object? rhashtable_insert_slow is
> already doing that to implement its never fail semantic.

So a very blunt and likely not 100% correct take on this side of things.
But it should give an idea at least.
--- 
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index e12bbfb240b8..fc547c43b05d 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -437,31 +437,11 @@ static void rht_deferred_worker(struct work_struct *work)
 }
 
 static int rhashtable_insert_rehash(struct rhashtable *ht,
-				    struct bucket_table *tbl)
+				    struct bucket_table *tbl,
+				    struct bucket_table *new_tbl)
 {
-	struct bucket_table *old_tbl;
-	struct bucket_table *new_tbl;
-	unsigned int size;
 	int err;
 
-	old_tbl = rht_dereference_rcu(ht->tbl, ht);
-
-	size = tbl->size;
-
-	err = -EBUSY;
-
-	if (rht_grow_above_75(ht, tbl))
-		size *= 2;
-	/* Do not schedule more than one rehash */
-	else if (old_tbl != tbl)
-		goto fail;
-
-	err = -ENOMEM;
-
-	new_tbl = bucket_table_alloc(ht, size, GFP_ATOMIC | __GFP_NOWARN);
-	if (new_tbl == NULL)
-		goto fail;
-
 	err = rhashtable_rehash_attach(ht, tbl, new_tbl);
 	if (err) {
 		bucket_table_free(new_tbl);
@@ -471,17 +451,6 @@ static int rhashtable_insert_rehash(struct rhashtable *ht,
 		schedule_work(&ht->run_work);
 
 	return err;
-
-fail:
-	/* Do not fail the insert if someone else did a rehash. */
-	if (likely(rcu_access_pointer(tbl->future_tbl)))
-		return 0;
-
-	/* Schedule async rehash to retry allocation in process context. */
-	if (err == -ENOMEM)
-		schedule_work(&ht->run_work);
-
-	return err;
 }
 
 static void *rhashtable_lookup_one(struct rhashtable *ht,
@@ -619,9 +588,33 @@ static void *rhashtable_try_insert(struct rhashtable *ht, const void *key,
 		}
 	} while (!IS_ERR_OR_NULL(new_tbl));
 
-	if (PTR_ERR(data) == -EAGAIN)
-		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl) ?:
+	if (PTR_ERR(data) == -EAGAIN) {
+		struct bucket_table *old_tbl;
+		unsigned int size;
+
+		old_tbl = rht_dereference_rcu(ht->tbl, ht);
+		size = tbl->size;
+
+		data = ERR_PTR(-EBUSY);
+
+		if (rht_grow_above_75(ht, tbl))
+			size *= 2;
+		/* Do not schedule more than one rehash */
+		else if (old_tbl != tbl)
+			return data;
+
+		data = ERR_PTR(-ENOMEM);
+
+		rcu_read_unlock();
+		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
+		rcu_read_lock();
+
+		if (!new_tbl)
+			return data;
+
+		data = ERR_PTR(rhashtable_insert_rehash(ht, tbl, new_tbl) ?:
 			       -EAGAIN);
+	}
 
 	return data;
 }
Florian Westphal Sept. 26, 2022, 10:08 a.m. UTC | #10
Michal Hocko <mhocko@suse.com> wrote:
> +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> +		size = tbl->size;
> +
> +		data = ERR_PTR(-EBUSY);
> +
> +		if (rht_grow_above_75(ht, tbl))
> +			size *= 2;
> +		/* Do not schedule more than one rehash */
> +		else if (old_tbl != tbl)
> +			return data;
> +
> +		data = ERR_PTR(-ENOMEM);
> +
> +		rcu_read_unlock();
> +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> +		rcu_read_lock();

I don't think this is going to work; there can be callers that rely on
rcu-protected data structures not getting freed from under them.

Also, network stack assumes synchronize_rcu() means that all
inflight packets have completed processing.
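
A hypothetical example (made-up names) of the kind of pattern I have in mind:

<snip>
	/* teardown side of some rcu-protected object */
	rcu_assign_pointer(head->obj, NULL);
	synchronize_rcu();	/* "no reader can still see obj" ...       */
	kfree(obj);		/* ... which only holds if readers never
				 * leave their rcu_read_lock() section
				 * while still using obj */
<snip>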
Michal Hocko Sept. 26, 2022, 12:02 p.m. UTC | #11
On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> > +		size = tbl->size;
> > +
> > +		data = ERR_PTR(-EBUSY);
> > +
> > +		if (rht_grow_above_75(ht, tbl))
> > +			size *= 2;
> > +		/* Do not schedule more than one rehash */
> > +		else if (old_tbl != tbl)
> > +			return data;
> > +
> > +		data = ERR_PTR(-ENOMEM);
> > +
> > +		rcu_read_unlock();
> > +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> > +		rcu_read_lock();
> 
> I don't think this is going to work, there can be callers that
> rely on rcu protected data structures getting free'd.

The caller of this function drops RCU for each retry; why should the
called function be any different?
Florian Westphal Sept. 26, 2022, 1:08 p.m. UTC | #12
Michal Hocko <mhocko@suse.com> wrote:
> On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> > Michal Hocko <mhocko@suse.com> wrote:
> > > +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> > > +		size = tbl->size;
> > > +
> > > +		data = ERR_PTR(-EBUSY);
> > > +
> > > +		if (rht_grow_above_75(ht, tbl))
> > > +			size *= 2;
> > > +		/* Do not schedule more than one rehash */
> > > +		else if (old_tbl != tbl)
> > > +			return data;
> > > +
> > > +		data = ERR_PTR(-ENOMEM);
> > > +
> > > +		rcu_read_unlock();
> > > +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> > > +		rcu_read_lock();
> > 
> > I don't think this is going to work, there can be callers that
> > rely on rcu protected data structures getting free'd.
> 
> The caller of this function drops RCU for each retry, why should be the
> called function any special?

I was unfortunately never able to fully understand rhashtable.
AFAICS the rcu_read_lock/unlock in the caller is pointless,
or at least dubious.

To the best of my knowledge there are users of this interface that
invoke it with rcu read lock held, and since those always nest, the
rcu_read_unlock() won't move us to GFP_KERNEL territory.

I guess you can add a might_sleep() and ask the kernel to barf at runtime.
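
E.g. something like this on top of your hunk (sketch only):

<snip>
		rcu_read_unlock();
		/* barf at runtime if we got here from a context that cannot
		 * sleep, e.g. softirq or an outer rcu_read_lock() section
		 * (with the usual debugging options enabled) */
		might_sleep();
		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
		rcu_read_lock();
<snip>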
Florian Westphal Sept. 26, 2022, 1:38 p.m. UTC | #13
Florian Westphal <fw@strlen.de> wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 26-09-22 12:08:00, Florian Westphal wrote:
> > > Michal Hocko <mhocko@suse.com> wrote:
> > > > +		old_tbl = rht_dereference_rcu(ht->tbl, ht);
> > > > +		size = tbl->size;
> > > > +
> > > > +		data = ERR_PTR(-EBUSY);
> > > > +
> > > > +		if (rht_grow_above_75(ht, tbl))
> > > > +			size *= 2;
> > > > +		/* Do not schedule more than one rehash */
> > > > +		else if (old_tbl != tbl)
> > > > +			return data;
> > > > +
> > > > +		data = ERR_PTR(-ENOMEM);
> > > > +
> > > > +		rcu_read_unlock();
> > > > +		new_tbl = bucket_table_alloc(ht, size, GFP_KERNEL);
> > > > +		rcu_read_lock();
> > > 
> > > I don't think this is going to work, there can be callers that
> > > rely on rcu protected data structures getting free'd.
> > 
> > The caller of this function drops RCU for each retry, why should be the
> > called function any special?
> 
> I was unfortunately never able to fully understand rhashtable.

Obviously.

> AFAICS the rcu_read_lock/unlock in the caller is pointless,
> or at least dubious.

Addendum, I can't read:

void *rhashtable_insert_slow(struct rhashtable *ht, const void *key,
			     struct rhash_head *obj)
{
	void *data;

	do {
		rcu_read_lock();
		data = rhashtable_try_insert(ht, key, obj);
		rcu_read_unlock();
	} while (PTR_ERR(data) == -EAGAIN);
}

... which is needed to prevent a lockdep splat in
rhashtable_try_insert() -- there is no guarantee the caller already
has rcu_read_lock().

> To the best of my knowledge there are users of this interface that
> invoke it with rcu read lock held, and since those always nest, the
> rcu_read_unlock() won't move us to GFP_KERNEL territory.
> 
> I guess you can add a might_sleep() and ask kernel to barf at runtime.

I did and it triggers.  Caller is inet_frag_find(), triggered
via 'ping -s 60000 $addr'.
Michal Hocko Sept. 26, 2022, 2:05 p.m. UTC | #14
On Mon 26-09-22 15:08:08, Florian Westphal wrote:
[...]
> To the best of my knowledge there are users of this interface that
> invoke it with rcu read lock held, and since those always nest, the
> rcu_read_unlock() won't move us to GFP_KERNEL territory.

Fair point. I can see a comment above rhashtable_insert_fast(), which is
supposed to be used by drivers and is explicitly documented to be
compatible with an atomic context. So the above is clearly a no-go.

In that case I would propose going in the direction of the patch at
http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz.
On top of that, an alternative implementation of the vmalloc fallback for
non-sleeping requests could be added in the future. A single branch for an
unlikely event is not nice but acceptable, I would say, especially if we
want to stay in line with Linus' original "prevent hack for gfp_mask in
callers".
Florian Westphal Sept. 26, 2022, 2:20 p.m. UTC | #15
Michal Hocko <mhocko@suse.com> wrote:
> On Mon 26-09-22 15:08:08, Florian Westphal wrote:
> [...]
> > To the best of my knowledge there are users of this interface that
> > invoke it with rcu read lock held, and since those always nest, the
> > rcu_read_unlock() won't move us to GFP_KERNEL territory.
> 
> Fair point. I can see a comment above rhashtable_insert_fast which is
> supposed to be used by drivers and explicitly documented to be
> compatible with an atomic context. So the above is clearly a no-go
> 
> In that case I would propose to go with the patch going
> http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz direction.

FWIW that patch works for me. Will you do a formal patch submission?
Michal Hocko Sept. 26, 2022, 2:32 p.m. UTC | #16
On Mon 26-09-22 16:20:13, Florian Westphal wrote:
> Michal Hocko <mhocko@suse.com> wrote:
> > On Mon 26-09-22 15:08:08, Florian Westphal wrote:
> > [...]
> > > To the best of my knowledge there are users of this interface that
> > > invoke it with rcu read lock held, and since those always nest, the
> > > rcu_read_unlock() won't move us to GFP_KERNEL territory.
> > 
> > Fair point. I can see a comment above rhashtable_insert_fast which is
> > supposed to be used by drivers and explicitly documented to be
> > compatible with an atomic context. So the above is clearly a no-go
> > 
> > In that case I would propose to go with the patch going
> > http://lkml.kernel.org/r/YzFplwSxwwsLpzzX@dhcp22.suse.cz direction.
> 
> FWIW that patch works for me. Will you do a formal patch submission?

Feel free to use it and glue it to the actual bug report.

Thanks!
Vlastimil Babka Sept. 26, 2022, 3:03 p.m. UTC | #17
On 9/23/22 17:10, Uladzislau Rezki wrote:
> On Fri, Sep 23, 2022 at 04:54:09PM +0200, Florian Westphal wrote:
>> Uladzislau Rezki <urezki@gmail.com> wrote:
>>> On Fri, Sep 23, 2022 at 03:35:12PM +0200, Florian Westphal wrote:
>>>> Michal Hocko <mhocko@suse.com> wrote:
>>>>> On Fri 23-09-22 12:38:58, Florian Westphal wrote:
>>>>>> Martin Zaharinov reports BUG() in mm land for 5.19.10 kernel:
>>>>>>   kernel BUG at mm/vmalloc.c:2437!
>>>>>>   invalid opcode: 0000 [#1] SMP
>>>>>>   CPU: 28 PID: 0 Comm: swapper/28 Tainted: G        W  O      5.19.9 #1
>>>>>>   [..]
>>>>>>   RIP: 0010:__get_vm_area_node+0x120/0x130
>>>>>>    __vmalloc_node_range+0x96/0x1e0
>>>>>>    kvmalloc_node+0x92/0xb0
>>>>>>    bucket_table_alloc.isra.0+0x47/0x140
>>>>>>    rhashtable_try_insert+0x3a4/0x440
>>>>>>    rhashtable_insert_slow+0x1b/0x30
>>>>>>   [..]
>>>>>>
>>>>>> bucket_table_alloc uses kvzalloc(GFP_ATOMIC).  If kmalloc fails, this now
>>>>>> falls through to vmalloc and hits code paths that assume GFP_KERNEL.
>>>>>>
>>>>>> Revert the problematic change and stay with slab allocator.
>>>>>
>>>>> Why don't you simply fix the caller?
>>>>
>>>> Uh, not following?
>>>>
>>>> kvzalloc(GFP_ATOMIC) was perfectly fine, is this illegal again?
>>>>
>>> <snip>
>>> static struct vm_struct *__get_vm_area_node(unsigned long size,
>>> 		unsigned long align, unsigned long shift, unsigned long flags,
>>> 		unsigned long start, unsigned long end, int node,
>>> 		gfp_t gfp_mask, const void *caller)
>>> {
>>> 	struct vmap_area *va;
>>> 	struct vm_struct *area;
>>> 	unsigned long requested_size = size;
>>>
>>> 	BUG_ON(in_interrupt());
>>> ...
>>> <snip>
>>>
>>> vmalloc is not supposed to be called from the IRQ context.
>>
>> It uses kvzalloc, not vmalloc api.
>>
>> Before 2018, rhashtable did use kzalloc OR kvzalloc, depending on gfp_t.
>>
>> Quote from 93f976b5190df327939 changelog:
>>    As of ce91f6ee5b3b ("mm: kvmalloc does not fallback to vmalloc for
>>    incompatible gfp flags") we can simplify the caller
>>    and trust kvzalloc() to just do the right thing.
>>
>> I fear that if this isn't allowed it will result in hard-to-spot bugs
>> because things will work fine until a fallback to vmalloc happens.
>>
>> rhashtable may not be the only user of kvmalloc api that rely on
>> ability to call it from (soft)irq.
>>
> Doing the "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
> is also a problem nowadays. Such code should be fixed across the kernel
> because of PREEMPT_RT support.

But the "atomic context" here is different, no? Calling kmalloc() from
IRQ handlers AFAIK is ok, as IRQ handlers are threaded on PREEMPT_RT.
Calling it inside a local_irq_disable() section would be a problem, on the
other hand. But calling it under e.g. spin_lock_irqsave() could be ok, as
those don't really disable irqs on RT.
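
Roughly (hypothetical snippets, assuming a regular spinlock_t):

<snip>
	spin_lock_irqsave(&lock, irqflags);	/* sleeping lock on RT, irqs stay on */
	p = kmalloc(sizeof(*p), GFP_ATOMIC);	/* ok-ish on RT                      */
	spin_unlock_irqrestore(&lock, irqflags);

	local_irq_disable();			/* really disables irqs, also on RT  */
	p = kmalloc(sizeof(*p), GFP_ATOMIC);	/* problem: the allocator's internal
						 * locks are sleeping locks on RT    */
	local_irq_enable();
<snip>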

> --
> Uladzislau Rezki
Sebastian Andrzej Siewior Sept. 26, 2022, 3:28 p.m. UTC | #18
On 2022-09-26 17:03:48 [+0200], Vlastimil Babka wrote:
> > Doing the "p = kmalloc(sizeof(*p), GFP_ATOMIC);" from an atomic context
> > is also a problem nowadays. Such code should be fixed across the kernel
> > because of PREEMPT_RT support.

You should make sure that the context in question is atomic on
PREEMPT_RT before fixing it. My guess here is that it is the average
softirq (NAPI) callback, which is fine.

> But the "atomic context" here is different, no? Calling kmalloc() from IRQ
> handlers AFAIK is ok as IRQ handlers are threaded on PREEMPT_RT. Calling it
> inside an local_irq_disable() would be a problem on the other hand. But then
> under e.g. spin_lock_irqsave() could be ok as those don't really disable
> irqs on RT.

correct.

Sebastian

Patch

diff --git a/mm/util.c b/mm/util.c
index c9439c66d8cf..ba7fe1cb6846 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -593,6 +593,13 @@  void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
+	/*
+	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g. page tables)
+	 * so the given set of flags has to be compatible.
+	 */
+	if ((flags & GFP_KERNEL) != GFP_KERNEL)
+		return kmalloc_node(size, flags, node);
+
 	/*
 	 * We want to attempt a large physically contiguous block first because
 	 * it is less likely to fragment multiple larger blocks and therefore