
[v2,09/16] rcu/tree: Maintain separate array for vmalloc ptrs

Message ID 20200525214800.93072-10-urezki@gmail.com (mailing list archive)
State New, archived
Series Introduce kvfree_rcu(1 or 2 arguments)

Commit Message

Uladzislau Rezki May 25, 2020, 9:47 p.m. UTC
To do so, we use an array of kvfree_rcu_bulk_data structures.
It consists of two elements:
 - index number 0 corresponds to slab pointers.
 - index number 1 corresponds to vmalloc pointers.

Keeping vmalloc pointers separated from slab pointers makes
it possible to invoke the right freeing API for the right
kind of pointer.

It also prepares us for future headless support for vmalloc
and SLAB objects. Such objects cannot be queued on a linked
list and are instead placed directly into an array.
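
For illustration, channel selection boils down to a check on the pointer
type; a minimal sketch of what this patch does in
kvfree_call_rcu_add_ptr_to_bulk():

<snip>
	// 0: slab pointer, 1: vmalloc pointer.
	idx = !!is_vmalloc_addr(ptr);
	krcp->bkvhead[idx]->records[krcp->bkvhead[idx]->nr_records++] = ptr;
<snip>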

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 173 +++++++++++++++++++++++++++-------------------
 1 file changed, 100 insertions(+), 73 deletions(-)

Comments

Paul E. McKenney June 17, 2020, 11:46 p.m. UTC | #1
On Mon, May 25, 2020 at 11:47:53PM +0200, Uladzislau Rezki (Sony) wrote:
> To do so, we use an array of kvfree_rcu_bulk_data structures.
> It consists of two elements:
>  - index number 0 corresponds to slab pointers.
>  - index number 1 corresponds to vmalloc pointers.
> 
> Keeping vmalloc pointers separated from slab pointers makes
> it possible to invoke the right freeing API for the right
> kind of pointer.
> 
> It also prepares us for future headless support for vmalloc
> and SLAB objects. Such objects cannot be queued on a linked
> list and are instead directly into an array.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Co-developed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---

[ . . . ]

> +	// Handle two first channels.
> +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> +			bnext = bkvhead[i]->next;
> +			debug_rcu_bhead_unqueue(bkvhead[i]);
> +
> +			rcu_lock_acquire(&rcu_callback_map);
> +			if (i == 0) { // kmalloc() / kfree().
> +				trace_rcu_invoke_kfree_bulk_callback(
> +					rcu_state.name, bkvhead[i]->nr_records,
> +					bkvhead[i]->records);
> +
> +				kfree_bulk(bkvhead[i]->nr_records,
> +					bkvhead[i]->records);
> +			} else { // vmalloc() / vfree().
> +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> +					trace_rcu_invoke_kfree_callback(
> +						rcu_state.name,
> +						bkvhead[i]->records[j], 0);
> +
> +					vfree(bkvhead[i]->records[j]);
> +				}
> +			}
> +			rcu_lock_release(&rcu_callback_map);

Not an emergency, but did you look into replacing this "if" statement
with an array of pointers to functions implementing the legs of the
"if" statement?  If nothing else, this would greatly reduce indentation.

I am taking this as is, but if you have not already done so, could you
please look into this for a follow-up patch?
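
For concreteness, a hypothetical sketch of that approach (the leg function
names are invented here purely for illustration; the patch as posted does
not do this):

<snip>
static void kvfree_leg_slab(struct kvfree_rcu_bulk_data *bnode)
{
	trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
		bnode->nr_records, bnode->records);
	kfree_bulk(bnode->nr_records, bnode->records);
}

static void kvfree_leg_vmalloc(struct kvfree_rcu_bulk_data *bnode)
{
	unsigned long j;

	for (j = 0; j < bnode->nr_records; j++) {
		trace_rcu_invoke_kfree_callback(rcu_state.name,
			bnode->records[j], 0);
		vfree(bnode->records[j]);
	}
}

/* Constant array indexed by the channel number "i". */
static void (*const kvfree_legs[FREE_N_CHANNELS])(struct kvfree_rcu_bulk_data *) = {
	kvfree_leg_slab,	/* index 0: kmalloc()/kfree(). */
	kvfree_leg_vmalloc,	/* index 1: vmalloc()/vfree(). */
};

	/* ... and then in kfree_rcu_work(): kvfree_legs[i](bkvhead[i]); */
<snip>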

							Thanx, Paul
Matthew Wilcox June 18, 2020, 12:52 a.m. UTC | #2
On Wed, Jun 17, 2020 at 04:46:09PM -0700, Paul E. McKenney wrote:
> > +	// Handle two first channels.
> > +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> > +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> > +			bnext = bkvhead[i]->next;
> > +			debug_rcu_bhead_unqueue(bkvhead[i]);
> > +
> > +			rcu_lock_acquire(&rcu_callback_map);
> > +			if (i == 0) { // kmalloc() / kfree().
> > +				trace_rcu_invoke_kfree_bulk_callback(
> > +					rcu_state.name, bkvhead[i]->nr_records,
> > +					bkvhead[i]->records);
> > +
> > +				kfree_bulk(bkvhead[i]->nr_records,
> > +					bkvhead[i]->records);
> > +			} else { // vmalloc() / vfree().
> > +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > +					trace_rcu_invoke_kfree_callback(
> > +						rcu_state.name,
> > +						bkvhead[i]->records[j], 0);
> > +
> > +					vfree(bkvhead[i]->records[j]);
> > +				}
> > +			}
> > +			rcu_lock_release(&rcu_callback_map);
> 
> Not an emergency, but did you look into replacing this "if" statement
> with an array of pointers to functions implementing the legs of the
> "if" statement?  If nothing else, this would greatly reduced indentation.

I don't think that replacing direct function calls with indirect function
calls is a great suggestion with the current state of play around branch
prediction.

I'd suggest:

 			rcu_lock_acquire(&rcu_callback_map);
			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
				bkvhead[i]->nr_records, bkvhead[i]->records);
 			if (i == 0) {
 				kfree_bulk(bkvhead[i]->nr_records,
 					bkvhead[i]->records);
 			} else {
 				for (j = 0; j < bkvhead[i]->nr_records; j++) {
 					vfree(bkvhead[i]->records[j]);
 				}
 			}
 			rcu_lock_release(&rcu_callback_map);

But I'd also suggest a vfree_bulk be added.  There are a few things
which would be better done in bulk as part of the vfree process
(we batch them up already, but i'm sure we could do better).
Paul E. McKenney June 18, 2020, 3:18 a.m. UTC | #3
On Wed, Jun 17, 2020 at 05:52:14PM -0700, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 04:46:09PM -0700, Paul E. McKenney wrote:
> > > +	// Handle two first channels.
> > > +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> > > +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> > > +			bnext = bkvhead[i]->next;
> > > +			debug_rcu_bhead_unqueue(bkvhead[i]);
> > > +
> > > +			rcu_lock_acquire(&rcu_callback_map);
> > > +			if (i == 0) { // kmalloc() / kfree().
> > > +				trace_rcu_invoke_kfree_bulk_callback(
> > > +					rcu_state.name, bkvhead[i]->nr_records,
> > > +					bkvhead[i]->records);
> > > +
> > > +				kfree_bulk(bkvhead[i]->nr_records,
> > > +					bkvhead[i]->records);
> > > +			} else { // vmalloc() / vfree().
> > > +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > > +					trace_rcu_invoke_kfree_callback(
> > > +						rcu_state.name,
> > > +						bkvhead[i]->records[j], 0);
> > > +
> > > +					vfree(bkvhead[i]->records[j]);
> > > +				}
> > > +			}
> > > +			rcu_lock_release(&rcu_callback_map);
> > 
> > Not an emergency, but did you look into replacing this "if" statement
> > with an array of pointers to functions implementing the legs of the
> > "if" statement?  If nothing else, this would greatly reduced indentation.
> 
> I don't think that replacing direct function calls with indirect function
> calls is a great suggestion with the current state of play around branch
> prediction.
> 
> I'd suggest:
> 
>  			rcu_lock_acquire(&rcu_callback_map);
> 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> 				bkvhead[i]->nr_records, bkvhead[i]->records);
>  			if (i == 0) {
>  				kfree_bulk(bkvhead[i]->nr_records,
>  					bkvhead[i]->records);
>  			} else {
>  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
>  					vfree(bkvhead[i]->records[j]);
>  				}
>  			}
>  			rcu_lock_release(&rcu_callback_map);
> 
> But I'd also suggest a vfree_bulk be added.  There are a few things
> which would be better done in bulk as part of the vfree process
> (we batch them up already, but i'm sure we could do better).

I suspect that he would like to keep the tracing.

It might be worth trying the branches, given that they would be constant
and indexed by "i".  The compiler might well remove the indirection.

The compiler guys brag about doing so, which of course might or might
not have any correlation to a given compiler actually doing so.  :-/

Having a vfree_bulk() might well be useful, but I would feel more
confidence in that if there were other callers of kfree_bulk().

But again, either way, future work as far as this series is concerned.

							Thanx, Paul
Uladzislau Rezki June 18, 2020, 5:25 p.m. UTC | #4
> > +	// Handle two first channels.
> > +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> > +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> > +			bnext = bkvhead[i]->next;
> > +			debug_rcu_bhead_unqueue(bkvhead[i]);
> > +
> > +			rcu_lock_acquire(&rcu_callback_map);
> > +			if (i == 0) { // kmalloc() / kfree().
> > +				trace_rcu_invoke_kfree_bulk_callback(
> > +					rcu_state.name, bkvhead[i]->nr_records,
> > +					bkvhead[i]->records);
> > +
> > +				kfree_bulk(bkvhead[i]->nr_records,
> > +					bkvhead[i]->records);
> > +			} else { // vmalloc() / vfree().
> > +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > +					trace_rcu_invoke_kfree_callback(
> > +						rcu_state.name,
> > +						bkvhead[i]->records[j], 0);
> > +
> > +					vfree(bkvhead[i]->records[j]);
> > +				}
> > +			}
> > +			rcu_lock_release(&rcu_callback_map);
> 
> Not an emergency, but did you look into replacing this "if" statement
> with an array of pointers to functions implementing the legs of the
> "if" statement?  If nothing else, this would greatly reduced indentation.
> 
>
> I am taking this as is, but if you have not already done so, could you
> please look into this for a follow-up patch?
> 
I do not think it makes sense, because it would require checking each
pointer in the array, which can lead to a lot of branching, i.e. "if-else"
instructions.

Paul, thank you to take it in!

--
Vlad Rezki
Uladzislau Rezki June 18, 2020, 5:30 p.m. UTC | #5
> > 
> > Not an emergency, but did you look into replacing this "if" statement
> > with an array of pointers to functions implementing the legs of the
> > "if" statement?  If nothing else, this would greatly reduced indentation.
> 
> I don't think that replacing direct function calls with indirect function
> calls is a great suggestion with the current state of play around branch
> prediction.
> 
> I'd suggest:
> 
>  			rcu_lock_acquire(&rcu_callback_map);
> 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> 				bkvhead[i]->nr_records, bkvhead[i]->records);
>  			if (i == 0) {
>  				kfree_bulk(bkvhead[i]->nr_records,
>  					bkvhead[i]->records);
>  			} else {
>  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
>  					vfree(bkvhead[i]->records[j]);
>  				}
>  			}
>  			rcu_lock_release(&rcu_callback_map);
>
There are two different trace functions: one for "bulk" tracing
messages, and another one that is invoked per kfree() call, though here
we use it to indicate a vfree() call.

Probably we can rename it to: trace_rcu_invoke_kvfree_callback();

What do you think?

> 
> But I'd also suggest a vfree_bulk be added.  There are a few things
> which would be better done in bulk as part of the vfree process
> (we batch them up already, but i'm sure we could do better).
> 
I was thinking of implementing a vfree_bulk() API, but i guess it can
be done as future work.

Does that sound good?

--
Vlad Rezki
Paul E. McKenney June 18, 2020, 5:32 p.m. UTC | #6
On Thu, Jun 18, 2020 at 07:25:04PM +0200, Uladzislau Rezki wrote:
> > > +	// Handle two first channels.
> > > +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> > > +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> > > +			bnext = bkvhead[i]->next;
> > > +			debug_rcu_bhead_unqueue(bkvhead[i]);
> > > +
> > > +			rcu_lock_acquire(&rcu_callback_map);
> > > +			if (i == 0) { // kmalloc() / kfree().
> > > +				trace_rcu_invoke_kfree_bulk_callback(
> > > +					rcu_state.name, bkvhead[i]->nr_records,
> > > +					bkvhead[i]->records);
> > > +
> > > +				kfree_bulk(bkvhead[i]->nr_records,
> > > +					bkvhead[i]->records);
> > > +			} else { // vmalloc() / vfree().
> > > +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > > +					trace_rcu_invoke_kfree_callback(
> > > +						rcu_state.name,
> > > +						bkvhead[i]->records[j], 0);
> > > +
> > > +					vfree(bkvhead[i]->records[j]);
> > > +				}
> > > +			}
> > > +			rcu_lock_release(&rcu_callback_map);
> > 
> > Not an emergency, but did you look into replacing this "if" statement
> > with an array of pointers to functions implementing the legs of the
> > "if" statement?  If nothing else, this would greatly reduced indentation.
> > 
> >
> > I am taking this as is, but if you have not already done so, could you
> > please look into this for a follow-up patch?
> > 
> I do not think it makes sense, because it would require to check each
> pointer in the array, what can lead to many branching, i.e. "if-else"
> instructions.

Mightn't the compiler simply unroll the outer loop?  Then the first
unrolled iteration of that loop would contain the then-clause and
the second unrolled iteration would contain the else-clause.  At that
point, there would be no checking, just direct calls.

Or am I missing something?
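
For example, with FREE_N_CHANNELS equal to 2, the fully unrolled loop would
in effect become something like this (illustrative sketch only; tracing,
debug_rcu_bhead_unqueue(), locking and page recycling omitted for brevity):

<snip>
	// Unrolled iteration i == 0: slab channel, freed in bulk.
	for (; bkvhead[0]; bkvhead[0] = bnext) {
		bnext = bkvhead[0]->next;
		kfree_bulk(bkvhead[0]->nr_records, bkvhead[0]->records);
	}

	// Unrolled iteration i == 1: vmalloc channel, freed per pointer.
	for (; bkvhead[1]; bkvhead[1] = bnext) {
		bnext = bkvhead[1]->next;
		for (j = 0; j < bkvhead[1]->nr_records; j++)
			vfree(bkvhead[1]->records[j]);
	}
<snip>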

> Paul, thank you to take it in!

Thank you for persisting!

							Thanx, Paul
Uladzislau Rezki June 18, 2020, 5:35 p.m. UTC | #7
> > 
> > I don't think that replacing direct function calls with indirect function
> > calls is a great suggestion with the current state of play around branch
> > prediction.
> > 
> > I'd suggest:
> > 
> >  			rcu_lock_acquire(&rcu_callback_map);
> > 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> > 				bkvhead[i]->nr_records, bkvhead[i]->records);
> >  			if (i == 0) {
> >  				kfree_bulk(bkvhead[i]->nr_records,
> >  					bkvhead[i]->records);
> >  			} else {
> >  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> >  					vfree(bkvhead[i]->records[j]);
> >  				}
> >  			}
> >  			rcu_lock_release(&rcu_callback_map);
> > 
> > But I'd also suggest a vfree_bulk be added.  There are a few things
> > which would be better done in bulk as part of the vfree process
> > (we batch them up already, but i'm sure we could do better).
> 
> I suspect that he would like to keep the tracing.
> 
> It might be worth trying the branches, given that they would be constant
> and indexed by "i".  The compiler might well remove the indirection.
> 
> The compiler guys brag about doing so, which of course might or might
> not have any correlation to a given compiler actually doing so.  :-/
> 
> Having a vfree_bulk() might well be useful, but I would feel more
> confidence in that if there were other callers of kfree_bulk().
>
Hmm... I think replacing that with vfree_bulk() is a good idea though.

> 
> But again, either way, future work as far as this series is concerned.
> 
What do you mean: is concerned?

We are planning to integrate kfree_rcu() directly into the slab
allocators: SLAB, SLUB, SLOB. So, there is plenty of future work :)

--
Vlad Rezki
Matthew Wilcox June 18, 2020, 5:35 p.m. UTC | #8
On Thu, Jun 18, 2020 at 07:30:49PM +0200, Uladzislau Rezki wrote:
> > I'd suggest:
> > 
> >  			rcu_lock_acquire(&rcu_callback_map);
> > 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> > 				bkvhead[i]->nr_records, bkvhead[i]->records);
> >  			if (i == 0) {
> >  				kfree_bulk(bkvhead[i]->nr_records,
> >  					bkvhead[i]->records);
> >  			} else {
> >  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> >  					vfree(bkvhead[i]->records[j]);
> >  				}
> >  			}
> >  			rcu_lock_release(&rcu_callback_map);
> >
> There are two different trace functions, one for "bulk" tracing
> messages, and another one is per one call of kfree(), though we use 
> to indicate vfree() call.
> 
> Probably we can rename it to: trace_rcu_invoke_kvfree_callback();
> 
> What do you think?

Works for me!

> > But I'd also suggest a vfree_bulk be added.  There are a few things
> > which would be better done in bulk as part of the vfree process
> > (we batch them up already, but i'm sure we could do better).
> 
> I was thinking to implement of vfree_bulk() API, but i guess it can
> be done as future work.
> 
> Does that sound good?

Yes, definitely a future piece of work.
Uladzislau Rezki June 18, 2020, 5:56 p.m. UTC | #9
On Thu, Jun 18, 2020 at 10:32:06AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 18, 2020 at 07:25:04PM +0200, Uladzislau Rezki wrote:
> > > > +	// Handle two first channels.
> > > > +	for (i = 0; i < FREE_N_CHANNELS; i++) {
> > > > +		for (; bkvhead[i]; bkvhead[i] = bnext) {
> > > > +			bnext = bkvhead[i]->next;
> > > > +			debug_rcu_bhead_unqueue(bkvhead[i]);
> > > > +
> > > > +			rcu_lock_acquire(&rcu_callback_map);
> > > > +			if (i == 0) { // kmalloc() / kfree().
> > > > +				trace_rcu_invoke_kfree_bulk_callback(
> > > > +					rcu_state.name, bkvhead[i]->nr_records,
> > > > +					bkvhead[i]->records);
> > > > +
> > > > +				kfree_bulk(bkvhead[i]->nr_records,
> > > > +					bkvhead[i]->records);
> > > > +			} else { // vmalloc() / vfree().
> > > > +				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > > > +					trace_rcu_invoke_kfree_callback(
> > > > +						rcu_state.name,
> > > > +						bkvhead[i]->records[j], 0);
> > > > +
> > > > +					vfree(bkvhead[i]->records[j]);
> > > > +				}
> > > > +			}
> > > > +			rcu_lock_release(&rcu_callback_map);
> > > 
> > > Not an emergency, but did you look into replacing this "if" statement
> > > with an array of pointers to functions implementing the legs of the
> > > "if" statement?  If nothing else, this would greatly reduced indentation.
> > > 
> > >
> > > I am taking this as is, but if you have not already done so, could you
> > > please look into this for a follow-up patch?
> > > 
> > I do not think it makes sense, because it would require to check each
> > pointer in the array, what can lead to many branching, i.e. "if-else"
> > instructions.
> 
> Mightn't the compiler simply unroll the outer loop?  Then the first
> unrolled iteration of that loop would contain the then-clause and
> the second unrolled iteration would contain the else-clause.  At that
> point, there would be no checking, just direct calls.
> 
> Or am I missing something?
> 
If we mix pointers, then we can only free them one pointer at a time. I mean,
in that case we will not be able to use the kfree_bulk() interface for freeing
SLAB memory, and the code would be converted to something like:

<snip>
for (i = 0; i < nr_objects_in_array; i++) {
    if (is_vmalloc_addr(array[i]))
       vfree(array[i]);
    else
       kfree(array[i]);
}
<snip>

> > Paul, thank you to take it in!
> 
> Thank you for persisting!
> 
Welcome :)

--
Vlad Rezki
Paul E. McKenney June 18, 2020, 5:57 p.m. UTC | #10
On Thu, Jun 18, 2020 at 07:35:20PM +0200, Uladzislau Rezki wrote:
> > > 
> > > I don't think that replacing direct function calls with indirect function
> > > calls is a great suggestion with the current state of play around branch
> > > prediction.
> > > 
> > > I'd suggest:
> > > 
> > >  			rcu_lock_acquire(&rcu_callback_map);
> > > 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> > > 				bkvhead[i]->nr_records, bkvhead[i]->records);
> > >  			if (i == 0) {
> > >  				kfree_bulk(bkvhead[i]->nr_records,
> > >  					bkvhead[i]->records);
> > >  			} else {
> > >  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > >  					vfree(bkvhead[i]->records[j]);
> > >  				}
> > >  			}
> > >  			rcu_lock_release(&rcu_callback_map);
> > > 
> > > But I'd also suggest a vfree_bulk be added.  There are a few things
> > > which would be better done in bulk as part of the vfree process
> > > (we batch them up already, but i'm sure we could do better).
> > 
> > I suspect that he would like to keep the tracing.
> > 
> > It might be worth trying the branches, given that they would be constant
> > and indexed by "i".  The compiler might well remove the indirection.
> > 
> > The compiler guys brag about doing so, which of course might or might
> > not have any correlation to a given compiler actually doing so.  :-/
> > 
> > Having a vfree_bulk() might well be useful, but I would feel more
> > confidence in that if there were other callers of kfree_bulk().
> >
> Hmm... I think replacing that with vfree_bulk() is a good idea though.

In other words, get rid of kfree_bulk() in favor of vfree_bulk()?

> > But again, either way, future work as far as this series is concerned.
> > 
> What do you mean: is concerned?

Apologies for the strange English.  How about this?

"This series is OK as is.  Any comments above did not prevent me from
taking these patches, but instead discuss possible future work."

> We are planning to implement kfree_rcu() to be integrated directly into
> SLAB: SLAB, SLUB, SLOB. So, there are plenty of future work :)

And I am glad that this is still the goal.  ;-)

							Thanx, Paul
Matthew Wilcox June 18, 2020, 6:15 p.m. UTC | #11
On Thu, Jun 18, 2020 at 07:56:23PM +0200, Uladzislau Rezki wrote:
> If we mix pointers, then we can do free per pointer only. I mean in that
> case we will not be able to use kfree_bulk() interface for freeing SLAB
> memory and the code would converted to something like:
> 
> <snip>
> while (nr_objects_in_array > 0) {
>     if (is_vmalloc_addr(array[X]))
>        vfree(array[X]);
>     else
>        kfree(array[X]);
> }
> <snip>

[PATCH] Add vfree_bulk interface

This is a useful interface to have for the RCU kvfree code.  There is
scope for more performance gains later, but introducing the interface
now allows us to simplify the RCU code today.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 48bb681e6c2a..dc2bbb61af61 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -119,6 +119,7 @@ void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
+extern void vfree_bulk(size_t count, void **addrs);
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index abe37f09ac42..6042f9b4394a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2366,6 +2366,22 @@ void vfree(const void *addr)
 }
 EXPORT_SYMBOL(vfree);
 
+void vfree_bulk(size_t count, void **addrs)
+{
+	unsigned int i;
+
+	BUG_ON(in_nmi());
+	might_sleep_if(!in_interrupt());
+
+	for (i = 0; i < count; i++) {
+		void *addr = addrs[i];
+		kmemleak_free(addr);
+		if (addr)
+			__vfree(addr);
+	}
+}
+EXPORT_SYMBOL(vfree_bulk);
+
 /**
  * vunmap - release virtual mapping obtained by vmap()
  * @addr:   memory base address
Uladzislau Rezki June 18, 2020, 6:23 p.m. UTC | #12
On Thu, Jun 18, 2020 at 11:15:41AM -0700, Matthew Wilcox wrote:
> On Thu, Jun 18, 2020 at 07:56:23PM +0200, Uladzislau Rezki wrote:
> > If we mix pointers, then we can do free per pointer only. I mean in that
> > case we will not be able to use kfree_bulk() interface for freeing SLAB
> > memory and the code would converted to something like:
> > 
> > <snip>
> > while (nr_objects_in_array > 0) {
> >     if (is_vmalloc_addr(array[X]))
> >        vfree(array[X]);
> >     else
> >        kfree(array[X]);
> > }
> > <snip>
> 
> [PATCH] Add vfree_bulk interface
> 
> This is a useful interface to have for the RCU kvfree code.  There is
> scope for more performance gains later, but introducing the interface
> now allows us to simplify the RCU code today.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 48bb681e6c2a..dc2bbb61af61 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -119,6 +119,7 @@ void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
>  
>  extern void vfree(const void *addr);
>  extern void vfree_atomic(const void *addr);
> +extern void vfree_bulk(size_t count, void **addrs);
>  
>  extern void *vmap(struct page **pages, unsigned int count,
>  			unsigned long flags, pgprot_t prot);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index abe37f09ac42..6042f9b4394a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2366,6 +2366,22 @@ void vfree(const void *addr)
>  }
>  EXPORT_SYMBOL(vfree);
>  
> +void vfree_bulk(size_t count, void **addrs)
> +{
> +	unsigned int i;
> +
> +	BUG_ON(in_nmi());
> +	might_sleep_if(!in_interrupt());
> +
> +	for (i = 0; i < count; i++) {
> +		void *addr = addrs[i];
> +		kmemleak_free(addr);
> +		if (addr)
> +			__vfree(addr);
> +	}
> +}
> +EXPORT_SYMBOL(vfree_bulk);
> +
>
Can we just use addrs[i] directly throughout the loop?

Also, we can just call vfree() instead, which has all the checking we
need: NMI, kmemleak, might_sleep.

Thanks!

--
Vlad Rezki
Uladzislau Rezki June 18, 2020, 6:34 p.m. UTC | #13
> > > 
> > > I suspect that he would like to keep the tracing.
> > > 
> > > It might be worth trying the branches, given that they would be constant
> > > and indexed by "i".  The compiler might well remove the indirection.
> > > 
> > > The compiler guys brag about doing so, which of course might or might
> > > not have any correlation to a given compiler actually doing so.  :-/
> > > 
> > > Having a vfree_bulk() might well be useful, but I would feel more
> > > confidence in that if there were other callers of kfree_bulk().
> > >
> > Hmm... I think replacing that with vfree_bulk() is a good idea though.
> 
> In other words, get rid of kfree_bulk() in favor of vfree_bulk()?
> 
kfree_bulk() does not understand vmalloc memory. vfree_bulk() should
be implemented to release vmalloc pointers. At a high level it will
be used the same way as kfree_bulk(), but for vmalloc pointers only.

> > > But again, either way, future work as far as this series is concerned.
> > > 
> > What do you mean: is concerned?
> 
> Apologies for the strange English.  How about this?
> 
> "This series is OK as is.  Any comments above did not prevent me from
> taking these patches, but instead discuss possible future work."
> 
That is perfectly clear to me :)

> > We are planning to implement kfree_rcu() to be integrated directly into
> > SLAB: SLAB, SLUB, SLOB. So, there are plenty of future work :)
>
> And I am glad that this is still the goal.  ;-)
>
:)

--
Vlad Rezki
Matthew Wilcox June 18, 2020, 6:37 p.m. UTC | #14
On Thu, Jun 18, 2020 at 08:23:33PM +0200, Uladzislau Rezki wrote:
> > +void vfree_bulk(size_t count, void **addrs)
> > +{
> > +	unsigned int i;
> > +
> > +	BUG_ON(in_nmi());
> > +	might_sleep_if(!in_interrupt());
> > +
> > +	for (i = 0; i < count; i++) {
> > +		void *addr = addrs[i];
> > +		kmemleak_free(addr);
> > +		if (addr)
> > +			__vfree(addr);
> > +	}
> > +}
> > +EXPORT_SYMBOL(vfree_bulk);
> > +
> >
> Can we just do addrs[i] all over the loop?
> 
> Also, we can just call vfree() instead that has all checking we
> need: NMI, kmemleak, might_sleep.

Of course we _can_.  But would we want to?  This way, we only do these
checks once instead of once per pointer, which is rather the point
of batching.

I might actually go further and hoist the in_interrupt() check into
this function ... I suspect the RCU code always runs in_interrupt()
and so we always call vfree_deferred().
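
Roughly, hoisting the check could look like this inside mm/vmalloc.c (a
sketch only, not a posted patch, assuming the existing __vfree_deferred()
and __vunmap() internals):

<snip>
void vfree_bulk(size_t count, void **addrs)
{
	/* Evaluate the context once per batch instead of per pointer. */
	bool deferred = in_interrupt();
	unsigned int i;

	BUG_ON(in_nmi());
	might_sleep_if(!deferred);

	for (i = 0; i < count; i++) {
		void *addr = addrs[i];

		kmemleak_free(addr);
		if (!addr)
			continue;

		if (deferred)
			__vfree_deferred(addr);
		else
			__vunmap(addr, 1);
	}
}
<snip>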
Uladzislau Rezki June 18, 2020, 6:48 p.m. UTC | #15
On Thu, Jun 18, 2020 at 11:37:51AM -0700, Matthew Wilcox wrote:
> On Thu, Jun 18, 2020 at 08:23:33PM +0200, Uladzislau Rezki wrote:
> > > +void vfree_bulk(size_t count, void **addrs)
> > > +{
> > > +	unsigned int i;
> > > +
> > > +	BUG_ON(in_nmi());
> > > +	might_sleep_if(!in_interrupt());
> > > +
> > > +	for (i = 0; i < count; i++) {
> > > +		void *addr = addrs[i];
> > > +		kmemleak_free(addr);
> > > +		if (addr)
> > > +			__vfree(addr);
> > > +	}
> > > +}
> > > +EXPORT_SYMBOL(vfree_bulk);
> > > +
> > >
> > Can we just do addrs[i] all over the loop?
> > 
> > Also, we can just call vfree() instead that has all checking we
> > need: NMI, kmemleak, might_sleep.
> 
> Of course we _can_.  But would we want to?  This way, we only do these
> checks once instead of once per pointer, which is rather the point
> of batching.
>
Ahh, right. I briefly looked at it and missed that point. You are
right, we do not want the vfree() here!

> 
> I might actually go further and hoist the in_interrupt() check into
> this function ...
>
Why do you need it? Just to inline the code below:

<snip>
 if (unlikely(in_interrupt()))
  __vfree_deferred(addr);
 else
  __vunmap(addr, 1);
<snip>

and bypass the __vfree() call (which is not marked as inline)?
I mean to inline the above into vfree_bulk().

>
> I suspect the RCU code always runs in_interrupt()
> and so we always call vfree_deferred().
>
No. We release the memory from workqueue context.

--
Vlad Rezki
Paul E. McKenney June 18, 2020, 7:03 p.m. UTC | #16
On Thu, Jun 18, 2020 at 08:34:48PM +0200, Uladzislau Rezki wrote:
> > > > 
> > > > I suspect that he would like to keep the tracing.
> > > > 
> > > > It might be worth trying the branches, given that they would be constant
> > > > and indexed by "i".  The compiler might well remove the indirection.
> > > > 
> > > > The compiler guys brag about doing so, which of course might or might
> > > > not have any correlation to a given compiler actually doing so.  :-/
> > > > 
> > > > Having a vfree_bulk() might well be useful, but I would feel more
> > > > confidence in that if there were other callers of kfree_bulk().
> > > >
> > > Hmm... I think replacing that with vfree_bulk() is a good idea though.
> > 
> > In other words, get rid of kfree_bulk() in favor of vfree_bulk()?
> > 
> kfree_bulk() does not understand vmalloc memory. vfree_bulk() should
> be implemented to release vmalloc's pointers. On i high level it will
> be used the same way as kfree_bulk() but for vmalloc ptrs. only.

Ah, I thought that you guys were proposing something that did bulk
free of both kmalloc and vmalloc memory.

							Thanx, Paul

> > > > But again, either way, future work as far as this series is concerned.
> > > > 
> > > What do you mean: is concerned?
> > 
> > Apologies for the strange English.  How about this?
> > 
> > "This series is OK as is.  Any comments above did not prevent me from
> > taking these patches, but instead discuss possible future work."
> > 
> That is perfectly clear to me :)
> 
> > > We are planning to implement kfree_rcu() to be integrated directly into
> > > SLAB: SLAB, SLUB, SLOB. So, there are plenty of future work :)
> >
> > And I am glad that this is still the goal.  ;-)
> >
> :)
> 
> --
> Vlad Rezki
Uladzislau Rezki June 18, 2020, 8:03 p.m. UTC | #17
On Thu, Jun 18, 2020 at 10:35:27AM -0700, Matthew Wilcox wrote:
> On Thu, Jun 18, 2020 at 07:30:49PM +0200, Uladzislau Rezki wrote:
> > > I'd suggest:
> > > 
> > >  			rcu_lock_acquire(&rcu_callback_map);
> > > 			trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
> > > 				bkvhead[i]->nr_records, bkvhead[i]->records);
> > >  			if (i == 0) {
> > >  				kfree_bulk(bkvhead[i]->nr_records,
> > >  					bkvhead[i]->records);
> > >  			} else {
> > >  				for (j = 0; j < bkvhead[i]->nr_records; j++) {
> > >  					vfree(bkvhead[i]->records[j]);
> > >  				}
> > >  			}
> > >  			rcu_lock_release(&rcu_callback_map);
> > >
> > There are two different trace functions, one for "bulk" tracing
> > messages, and another one is per one call of kfree(), though we use 
> > to indicate vfree() call.
> > 
> > Probably we can rename it to: trace_rcu_invoke_kvfree_callback();
> > 
> > What do you think?
> 
> Works for me!
> 
OK. I will send out a patch renaming that trace function, making it
clear that the pointer being freed can belong to SLAB or vmalloc.

> > > But I'd also suggest a vfree_bulk be added.  There are a few things
> > > which would be better done in bulk as part of the vfree process
> > > (we batch them up already, but i'm sure we could do better).
> > 
> > I was thinking to implement of vfree_bulk() API, but i guess it can
> > be done as future work.
> > 
> > Does that sound good?
> 
> Yes, definitely a future piece of work.
>
You have already been doing it.

Thank you, Matthew :)

--
Vlad Rezki
Uladzislau Rezki June 18, 2020, 8:35 p.m. UTC | #18
On Thu, Jun 18, 2020 at 12:03:59PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 18, 2020 at 08:34:48PM +0200, Uladzislau Rezki wrote:
> > > > > 
> > > > > I suspect that he would like to keep the tracing.
> > > > > 
> > > > > It might be worth trying the branches, given that they would be constant
> > > > > and indexed by "i".  The compiler might well remove the indirection.
> > > > > 
> > > > > The compiler guys brag about doing so, which of course might or might
> > > > > not have any correlation to a given compiler actually doing so.  :-/
> > > > > 
> > > > > Having a vfree_bulk() might well be useful, but I would feel more
> > > > > confidence in that if there were other callers of kfree_bulk().
> > > > >
> > > > Hmm... I think replacing that with vfree_bulk() is a good idea though.
> > > 
> > > In other words, get rid of kfree_bulk() in favor of vfree_bulk()?
> > > 
> > kfree_bulk() does not understand vmalloc memory. vfree_bulk() should
> > be implemented to release vmalloc's pointers. On i high level it will
> > be used the same way as kfree_bulk() but for vmalloc ptrs. only.
> 
> Ah, I thought that you guys were proposing something that did bulk
> free of both kmalloc and vmalloc memory.
> 
I see your point. We could introduce something like:

	kvfree_bulk(slab_array, vmalloc_array);

but i do not have a strong opinion here, even though i tend to
think that it would be odd. Having just vfree_bulk() would, i think,
be enough; as a result the code will look like:

<snip>
    trace_rcu_invoke_kfree_bulk_callback(
        rcu_state.name, bkvhead[i]->nr_records,
            bkvhead[i]->records);
    if (i == 0)
        kfree_bulk(bkvhead[i]->nr_records,
            bkvhead[i]->records);
    else
        vfree_bulk(bkvhead[i]->nr_records,
            bkvhead[i]->records);
<snip>

Matthew, what is your thought?

Thanks!

--
Vlad rezki
Matthew Wilcox June 18, 2020, 8:38 p.m. UTC | #19
On Thu, Jun 18, 2020 at 10:35:57PM +0200, Uladzislau Rezki wrote:
> On Thu, Jun 18, 2020 at 12:03:59PM -0700, Paul E. McKenney wrote:
> but i do not have a strong opinion here, even though i tend to
> say that it would be odd. Having just vfree_bulk(), i think
> would be enough, as a result the code will look like:
> 
> <snip>
>     trace_rcu_invoke_kfree_bulk_callback(
>         rcu_state.name, bkvhead[i]->nr_records,
>             bkvhead[i]->records);
>     if (i == 0)
>         kfree_bulk(bkvhead[i]->nr_records,
>             bkvhead[i]->records);
>     else
>         vfree_bulk(bkvhead[i]->nr_records,
>             bkvhead[i]->records);
> <snip>
> 
> Matthew, what is your thought?

That was my thinking too.  If we had a kvfree_bulk(), I would expect it to
handle a mixture of vfree and kfree, but you've segregated them already.
So I think this is better.
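
For reference, a mixed-array kvfree_bulk() would presumably just amount to
the per-pointer dispatch shown earlier in the thread (hypothetical sketch,
not something being proposed here):

<snip>
void kvfree_bulk(size_t count, void **addrs)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (is_vmalloc_addr(addrs[i]))
			vfree(addrs[i]);
		else
			kfree(addrs[i]);
	}
}
<snip>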
Uladzislau Rezki June 18, 2020, 9:17 p.m. UTC | #20
> > <snip>
> >     trace_rcu_invoke_kfree_bulk_callback(
> >         rcu_state.name, bkvhead[i]->nr_records,
> >             bkvhead[i]->records);
> >     if (i == 0)
> >         kfree_bulk(bkvhead[i]->nr_records,
> >             bkvhead[i]->records);
> >     else
> >         vfree_bulk(bkvhead[i]->nr_records,
> >             bkvhead[i]->records);
> > <snip>
> > 
> > Matthew, what is your thought?
> 
> That was my thinking too.  If we had a kvfree_bulk(), I would expect it to
> handle a mixture of vfree and kfree, but you've segregated them already.
> So I think this is better.
>
Yes, the segregation is done. Having vfree_bulk() is enough then.
We are on the same page :)

Thanks!

--
Vlad Rezki
Paul E. McKenney June 18, 2020, 9:34 p.m. UTC | #21
On Thu, Jun 18, 2020 at 11:17:09PM +0200, Uladzislau Rezki wrote:
> > > <snip>
> > >     trace_rcu_invoke_kfree_bulk_callback(
> > >         rcu_state.name, bkvhead[i]->nr_records,
> > >             bkvhead[i]->records);
> > >     if (i == 0)
> > >         kfree_bulk(bkvhead[i]->nr_records,
> > >             bkvhead[i]->records);
> > >     else
> > >         vfree_bulk(bkvhead[i]->nr_records,
> > >             bkvhead[i]->records);
> > > <snip>
> > > 
> > > Matthew, what is your thought?
> > 
> > That was my thinking too.  If we had a kvfree_bulk(), I would expect it to
> > handle a mixture of vfree and kfree, but you've segregated them already.
> > So I think this is better.
> >
> Yes, the segregation is done. Having vfree_bulk() is enough then.
> We are on the same page :)

Very good.  When does kfree_rcu() and friends move out of kernel/rcu?

							Thanx, Paul
Uladzislau Rezki June 19, 2020, 3:46 p.m. UTC | #22
On Thu, Jun 18, 2020 at 02:34:27PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 18, 2020 at 11:17:09PM +0200, Uladzislau Rezki wrote:
> > > > <snip>
> > > >     trace_rcu_invoke_kfree_bulk_callback(
> > > >         rcu_state.name, bkvhead[i]->nr_records,
> > > >             bkvhead[i]->records);
> > > >     if (i == 0)
> > > >         kfree_bulk(bkvhead[i]->nr_records,
> > > >             bkvhead[i]->records);
> > > >     else
> > > >         vfree_bulk(bkvhead[i]->nr_records,
> > > >             bkvhead[i]->records);
> > > > <snip>
> > > > 
> > > > Matthew, what is your thought?
> > > 
> > > That was my thinking too.  If we had a kvfree_bulk(), I would expect it to
> > > handle a mixture of vfree and kfree, but you've segregated them already.
> > > So I think this is better.
> > >
> > Yes, the segregation is done. Having vfree_bulk() is enough then.
> > We are on the same page :)
> 
> Very good.  When does kfree_rcu() and friends move out of kernel/rcu?
> 
Do you mean to move the whole logic of kfree_rcu() down into mm/?

Thanks!

--
Vlad Rezki
Paul E. McKenney June 19, 2020, 4:25 p.m. UTC | #23
On Fri, Jun 19, 2020 at 05:46:52PM +0200, Uladzislau Rezki wrote:
> On Thu, Jun 18, 2020 at 02:34:27PM -0700, Paul E. McKenney wrote:
> > On Thu, Jun 18, 2020 at 11:17:09PM +0200, Uladzislau Rezki wrote:
> > > > > <snip>
> > > > >     trace_rcu_invoke_kfree_bulk_callback(
> > > > >         rcu_state.name, bkvhead[i]->nr_records,
> > > > >             bkvhead[i]->records);
> > > > >     if (i == 0)
> > > > >         kfree_bulk(bkvhead[i]->nr_records,
> > > > >             bkvhead[i]->records);
> > > > >     else
> > > > >         vfree_bulk(bkvhead[i]->nr_records,
> > > > >             bkvhead[i]->records);
> > > > > <snip>
> > > > > 
> > > > > Matthew, what is your thought?
> > > > 
> > > > That was my thinking too.  If we had a kvfree_bulk(), I would expect it to
> > > > handle a mixture of vfree and kfree, but you've segregated them already.
> > > > So I think this is better.
> > > >
> > > Yes, the segregation is done. Having vfree_bulk() is enough then.
> > > We are on the same page :)
> > 
> > Very good.  When does kfree_rcu() and friends move out of kernel/rcu?
> > 
> Do you mean to move the whole logic of kfree_rcu() from top to down to mm/?

I do mean exactly that.

That was my goal some years back when Rao Shoaib was making the first
attempt along these lines, and it remains my goal.  After all, if this
effort is at all successful, the coupling between kfree_rcu() and
slab/slob/slub will become much tighter than that between kfree_rcu()
and RCU.

There will need to be some additional RCU APIs used by kfree_rcu(),
for example, something to tell RCU how many blocks are awaiting a
grace period.  But these are narrow and easily defined APIs.

							Thanx, Paul
Uladzislau Rezki June 22, 2020, 7:04 p.m. UTC | #24
> > > 
> > > Very good.  When does kfree_rcu() and friends move out of kernel/rcu?
> > > 
> > Do you mean to move the whole logic of kfree_rcu() from top to down to mm/?
> 
> I do mean exactly that.
> 
> That was my goal some years back when Rao Shoaib was making the first
> attempt along these lines, and it remains my goal.  After all, if this
> effort is at all successful, the coupling between kfree_rcu() with
> slab/slob/slub will become much tighter than that between kfree_rcu()
> and RCU.
> 
> There will need to be some additional RCU APIs used by kfree_rcu(),
> for example, something to tell RCU how many blocks are awaiting a
> grace period.  But these are narrow and easily defined APIs.
>
I also think that k[v]free_rcu() should reside somewhere under "mm/".
Currently they are defined as macros under "linux/rcupdate.h". So i
am not sure whether the definitions should stay there or be moved as well.

The implementation of k[v]free_rcu() is under rcu/tree.c, and the tiny
variant is under rcutiny.h. It can be moved to mm/slab_common.c, or
independent files can be created. I think the mm people should advise
on the best way to go :)

Any thoughts on it?

Thank you!

--
Vlad Rezki
Paul E. McKenney June 22, 2020, 7:53 p.m. UTC | #25
On Mon, Jun 22, 2020 at 09:04:06PM +0200, Uladzislau Rezki wrote:
> > > > 
> > > > Very good.  When does kfree_rcu() and friends move out of kernel/rcu?
> > > > 
> > > Do you mean to move the whole logic of kfree_rcu() from top to down to mm/?
> > 
> > I do mean exactly that.
> > 
> > That was my goal some years back when Rao Shoaib was making the first
> > attempt along these lines, and it remains my goal.  After all, if this
> > effort is at all successful, the coupling between kfree_rcu() with
> > slab/slob/slub will become much tighter than that between kfree_rcu()
> > and RCU.
> > 
> > There will need to be some additional RCU APIs used by kfree_rcu(),
> > for example, something to tell RCU how many blocks are awaiting a
> > grace period.  But these are narrow and easily defined APIs.
> >
> I also think that k[v]free_rcu() should reside somewhere under "mm/".
> Currently they are defined as macros under "linux/rcupdate.h". So i
> am not sure if definitions should stay there or moved also.

I am not as worried about the high-level macros as I am about the code
that does the bulk of the work, but they should still move as well.
Otherwise, changes involving both the macros and the underlying
implementation are harder than needed.

> Implementation of the k[v]free_rcu() is under rcu/tree.c and for tiny
> variant is under rcutiny.h. It can be moved to the mm/slab_common.c
> or independent files can be created. I think, mm people should consult
> what is the best way to go :)
> 
> Any thoughts on it?

I don't have any opinion on exactly where in mm the underlying
implementation code should go.  Your suggestion of mm/slab_common.c
seems fine to me.  ;-)

							Thanx, Paul
Uladzislau Rezki June 30, 2020, 5:46 p.m. UTC | #26
On Mon, Jun 22, 2020 at 12:53:29PM -0700, Paul E. McKenney wrote:
> On Mon, Jun 22, 2020 at 09:04:06PM +0200, Uladzislau Rezki wrote:
> > > > > 
> > > > > Very good.  When does kfree_rcu() and friends move out of kernel/rcu?
> > > > > 
> > > > Do you mean to move the whole logic of kfree_rcu() from top to down to mm/?
> > > 
> > > I do mean exactly that.
> > > 
> > > That was my goal some years back when Rao Shoaib was making the first
> > > attempt along these lines, and it remains my goal.  After all, if this
> > > effort is at all successful, the coupling between kfree_rcu() with
> > > slab/slob/slub will become much tighter than that between kfree_rcu()
> > > and RCU.
> > > 
> > > There will need to be some additional RCU APIs used by kfree_rcu(),
> > > for example, something to tell RCU how many blocks are awaiting a
> > > grace period.  But these are narrow and easily defined APIs.
> > >
> > I also think that k[v]free_rcu() should reside somewhere under "mm/".
> > Currently they are defined as macros under "linux/rcupdate.h". So i
> > am not sure if definitions should stay there or moved also.
> 
> I am not as worried about the high-level macros as I am about the code
> that does the bulk of the work, but they should still move as well.
> Otherwise, changes involving both the macros and the underlying
> implementation are harder than needed.
> 
> > Implementation of the k[v]free_rcu() is under rcu/tree.c and for tiny
> > variant is under rcutiny.h. It can be moved to the mm/slab_common.c
> > or independent files can be created. I think, mm people should consult
> > what is the best way to go :)
> > 
> > Any thoughts on it?
> 
> I don't have any opinion on exactly where in mm the underlying
> implementation code should go.  You suggestion of mm/slab_common.c
> seems fine to me.  ;-)
> 
OK :)

Then i would like to hear an opinion from the "mm" people on where
kfree_rcu() and friends could potentially be moved.

Matthew, Michal, Vlastimil could you please share your view?

Thanks!

--
Vlad Rezki

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e2267e92de5d..9f84ff80bc25 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -57,6 +57,8 @@ 
 #include <linux/slab.h>
 #include <linux/sched/isolation.h>
 #include <linux/sched/clock.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
 #include "../time/tick-internal.h"
 
 #include "tree.h"
@@ -2832,46 +2834,47 @@  EXPORT_SYMBOL_GPL(call_rcu);
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (HZ / 50)
 #define KFREE_N_BATCHES 2
+#define FREE_N_CHANNELS 2
 
 /**
- * struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers
+ * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
  * @nr_records: Number of active pointers in the array
- * @records: Array of the kfree_rcu() pointers
  * @next: Next bulk object in the block chain
+ * @records: Array of the kvfree_rcu() pointers
  */
-struct kfree_rcu_bulk_data {
+struct kvfree_rcu_bulk_data {
 	unsigned long nr_records;
-	struct kfree_rcu_bulk_data *next;
+	struct kvfree_rcu_bulk_data *next;
 	void *records[];
 };
 
 /*
  * This macro defines how many entries the "records" array
  * will contain. It is based on the fact that the size of
- * kfree_rcu_bulk_data structure becomes exactly one page.
+ * kvfree_rcu_bulk_data structure becomes exactly one page.
  */
-#define KFREE_BULK_MAX_ENTR \
-	((PAGE_SIZE - sizeof(struct kfree_rcu_bulk_data)) / sizeof(void *))
+#define KVFREE_BULK_MAX_ENTR \
+	((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
 
 /**
  * struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
  * @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
  * @head_free: List of kfree_rcu() objects waiting for a grace period
- * @bhead_free: Bulk-List of kfree_rcu() objects waiting for a grace period
+ * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
  * @krcp: Pointer to @kfree_rcu_cpu structure
  */
 
 struct kfree_rcu_cpu_work {
 	struct rcu_work rcu_work;
 	struct rcu_head *head_free;
-	struct kfree_rcu_bulk_data *bhead_free;
+	struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu *krcp;
 };
 
 /**
  * struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
  * @head: List of kfree_rcu() objects not yet waiting for a grace period
- * @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period
+ * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
  * @lock: Synchronize access to this structure
  * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
@@ -2886,7 +2889,7 @@  struct kfree_rcu_cpu_work {
  */
 struct kfree_rcu_cpu {
 	struct rcu_head *head;
-	struct kfree_rcu_bulk_data *bhead;
+	struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
@@ -2910,7 +2913,7 @@  static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
 };
 
 static __always_inline void
-debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead)
+debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
 {
 #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
 	int i;
@@ -2939,20 +2942,20 @@  krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
 	local_irq_restore(flags);
 }
 
-static inline struct kfree_rcu_bulk_data *
+static inline struct kvfree_rcu_bulk_data *
 get_cached_bnode(struct kfree_rcu_cpu *krcp)
 {
 	if (!krcp->nr_bkv_objs)
 		return NULL;
 
 	krcp->nr_bkv_objs--;
-	return (struct kfree_rcu_bulk_data *)
+	return (struct kvfree_rcu_bulk_data *)
 		llist_del_first(&krcp->bkvcache);
 }
 
 static inline bool
 put_cached_bnode(struct kfree_rcu_cpu *krcp,
-	struct kfree_rcu_bulk_data *bnode)
+	struct kvfree_rcu_bulk_data *bnode)
 {
 	// Check the limit.
 	if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
@@ -2971,43 +2974,63 @@  put_cached_bnode(struct kfree_rcu_cpu *krcp,
 static void kfree_rcu_work(struct work_struct *work)
 {
 	unsigned long flags;
+	struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext;
 	struct rcu_head *head, *next;
-	struct kfree_rcu_bulk_data *bhead, *bnext;
 	struct kfree_rcu_cpu *krcp;
 	struct kfree_rcu_cpu_work *krwp;
+	int i, j;
 
 	krwp = container_of(to_rcu_work(work),
 			    struct kfree_rcu_cpu_work, rcu_work);
 	krcp = krwp->krcp;
+
 	raw_spin_lock_irqsave(&krcp->lock, flags);
+	// Channels 1 and 2.
+	for (i = 0; i < FREE_N_CHANNELS; i++) {
+		bkvhead[i] = krwp->bkvhead_free[i];
+		krwp->bkvhead_free[i] = NULL;
+	}
+
+	// Channel 3.
 	head = krwp->head_free;
 	krwp->head_free = NULL;
-	bhead = krwp->bhead_free;
-	krwp->bhead_free = NULL;
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
-	/* "bhead" is now private, so traverse locklessly. */
-	for (; bhead; bhead = bnext) {
-		bnext = bhead->next;
-
-		debug_rcu_bhead_unqueue(bhead);
-
-		rcu_lock_acquire(&rcu_callback_map);
-		trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
-			bhead->nr_records, bhead->records);
-
-		kfree_bulk(bhead->nr_records, bhead->records);
-		rcu_lock_release(&rcu_callback_map);
+	// Handle two first channels.
+	for (i = 0; i < FREE_N_CHANNELS; i++) {
+		for (; bkvhead[i]; bkvhead[i] = bnext) {
+			bnext = bkvhead[i]->next;
+			debug_rcu_bhead_unqueue(bkvhead[i]);
+
+			rcu_lock_acquire(&rcu_callback_map);
+			if (i == 0) { // kmalloc() / kfree().
+				trace_rcu_invoke_kfree_bulk_callback(
+					rcu_state.name, bkvhead[i]->nr_records,
+					bkvhead[i]->records);
+
+				kfree_bulk(bkvhead[i]->nr_records,
+					bkvhead[i]->records);
+			} else { // vmalloc() / vfree().
+				for (j = 0; j < bkvhead[i]->nr_records; j++) {
+					trace_rcu_invoke_kfree_callback(
+						rcu_state.name,
+						bkvhead[i]->records[j], 0);
+
+					vfree(bkvhead[i]->records[j]);
+				}
+			}
+			rcu_lock_release(&rcu_callback_map);
 
-		krcp = krc_this_cpu_lock(&flags);
-		if (put_cached_bnode(krcp, bhead))
-			bhead = NULL;
-		krc_this_cpu_unlock(krcp, flags);
+			krcp = krc_this_cpu_lock(&flags);
+			if (put_cached_bnode(krcp, bkvhead[i]))
+				bkvhead[i] = NULL;
+			krc_this_cpu_unlock(krcp, flags);
 
-		if (bhead)
-			free_page((unsigned long) bhead);
+			if (bkvhead[i])
+				free_page((unsigned long) bkvhead[i]);
 
-		cond_resched_tasks_rcu_qs();
+			cond_resched_tasks_rcu_qs();
+		}
 	}
 
 	/*
@@ -3025,7 +3048,7 @@  static void kfree_rcu_work(struct work_struct *work)
 		trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset);
 
 		if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset)))
-			kfree(ptr);
+			kvfree(ptr);
 
 		rcu_lock_release(&rcu_callback_map);
 		cond_resched_tasks_rcu_qs();
@@ -3042,7 +3065,7 @@  static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
 {
 	struct kfree_rcu_cpu_work *krwp;
 	bool repeat = false;
-	int i;
+	int i, j;
 
 	lockdep_assert_held(&krcp->lock);
 
@@ -3050,21 +3073,25 @@  static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
 		krwp = &(krcp->krw_arr[i]);
 
 		/*
-		 * Try to detach bhead or head and attach it over any
+		 * Try to detach bkvhead or head and attach it over any
 		 * available corresponding free channel. It can be that
 		 * a previous RCU batch is in progress, it means that
 		 * immediately to queue another one is not possible so
 		 * return false to tell caller to retry.
 		 */
-		if ((krcp->bhead && !krwp->bhead_free) ||
+		if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
+			(krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
 				(krcp->head && !krwp->head_free)) {
-			/* Channel 1. */
-			if (!krwp->bhead_free) {
-				krwp->bhead_free = krcp->bhead;
-				krcp->bhead = NULL;
+			// Channel 1 corresponds to SLAB ptrs.
+			// Channel 2 corresponds to vmalloc ptrs.
+			for (j = 0; j < FREE_N_CHANNELS; j++) {
+				if (!krwp->bkvhead_free[j]) {
+					krwp->bkvhead_free[j] = krcp->bkvhead[j];
+					krcp->bkvhead[j] = NULL;
+				}
 			}
 
-			/* Channel 2. */
+			// Channel 3 corresponds to emergency path.
 			if (!krwp->head_free) {
 				krwp->head_free = krcp->head;
 				krcp->head = NULL;
@@ -3073,16 +3100,17 @@  static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
 			WRITE_ONCE(krcp->count, 0);
 
 			/*
-			 * One work is per one batch, so there are two "free channels",
-			 * "bhead_free" and "head_free" the batch can handle. It can be
-			 * that the work is in the pending state when two channels have
-			 * been detached following each other, one by one.
+			 * One work is per one batch, so there are three
+			 * "free channels" that the batch can handle. It
+			 * can be that the work is in the pending state
+			 * when channels have been detached one after
+			 * another.
 			 */
 			queue_rcu_work(system_wq, &krwp->rcu_work);
 		}
 
-		/* Repeat if any "free" corresponding channel is still busy. */
-		if (krcp->bhead || krcp->head)
+		// Repeat if any "free" corresponding channel is still busy.
+		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
 			repeat = true;
 	}
 
@@ -3124,23 +3152,22 @@  static void kfree_rcu_monitor(struct work_struct *work)
 }
 
 static inline bool
-kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
-	struct rcu_head *head, rcu_callback_t func)
+kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 {
-	struct kfree_rcu_bulk_data *bnode;
+	struct kvfree_rcu_bulk_data *bnode;
+	int idx;
 
 	if (unlikely(!krcp->initialized))
 		return false;
 
 	lockdep_assert_held(&krcp->lock);
+	idx = !!is_vmalloc_addr(ptr);
 
 	/* Check if a new block is required. */
-	if (!krcp->bhead ||
-			krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) {
+	if (!krcp->bkvhead[idx] ||
+			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
 		bnode = get_cached_bnode(krcp);
 		if (!bnode) {
-			WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
-
 			/*
 			 * To keep this path working on raw non-preemptible
 			 * sections, prevent the optional entry into the
@@ -3153,7 +3180,7 @@  kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
 			if (IS_ENABLED(CONFIG_PREEMPT_RT))
 				return false;
 
-			bnode = (struct kfree_rcu_bulk_data *)
+			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		}
 
@@ -3163,30 +3190,30 @@  kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
 
 		/* Initialize the new block. */
 		bnode->nr_records = 0;
-		bnode->next = krcp->bhead;
+		bnode->next = krcp->bkvhead[idx];
 
 		/* Attach it to the head. */
-		krcp->bhead = bnode;
+		krcp->bkvhead[idx] = bnode;
 	}
 
 	/* Finally insert. */
-	krcp->bhead->records[krcp->bhead->nr_records++] =
-		(void *) head - (unsigned long) func;
+	krcp->bkvhead[idx]->records
+		[krcp->bkvhead[idx]->nr_records++] = ptr;
 
 	return true;
 }
 
 /*
- * Queue a request for lazy invocation of kfree_bulk()/kfree() after a grace
- * period. Please note there are two paths are maintained, one is the main one
- * that uses kfree_bulk() interface and second one is emergency one, that is
- * used only when the main path can not be maintained temporary, due to memory
- * pressure.
+ * Queue a request for lazy invocation of the appropriate free routine after
+ * a grace period. Please note that three paths are maintained: two are the
+ * main ones that use the array-of-pointers interface, and the third one is
+ * an emergency path, used only when the main path cannot be maintained
+ * temporarily due to memory pressure.
  *
  * Each kfree_call_rcu() request is added to a batch. The batch will be drained
  * every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will
  * be free'd in workqueue context. This allows us to: batch requests together to
- * reduce the number of grace periods during heavy kfree_rcu() load.
+ * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
  */
 void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 {
@@ -3209,7 +3236,7 @@  void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	 * Under high memory pressure GFP_NOWAIT can fail,
 	 * in that case the emergency path is maintained.
 	 */
-	if (unlikely(!kfree_call_rcu_add_ptr_to_bulk(krcp, head, func))) {
+	if (unlikely(!kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr))) {
 		head->func = func;
 		head->next = krcp->head;
 		krcp->head = head;
@@ -4187,7 +4214,7 @@  static void __init kfree_rcu_batch_init(void)
 
 	for_each_possible_cpu(cpu) {
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
-		struct kfree_rcu_bulk_data *bnode;
+		struct kvfree_rcu_bulk_data *bnode;
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
 			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
@@ -4195,7 +4222,7 @@  static void __init kfree_rcu_batch_init(void)
 		}
 
 		for (i = 0; i < rcu_min_cached_objs; i++) {
-			bnode = (struct kfree_rcu_bulk_data *)
+			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 
 			if (bnode)