diff mbox series

[V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.

Message ID 1680266529-28429-1-git-send-email-ziwei.dai@unisoc.com (mailing list archive)
State Accepted
Commit e222f9a512539c3f4093a55d16624d9da614800b
Headers show
Series [V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. | expand

Commit Message

Ziwei Dai (代子为), March 31, 2023, 12:42 p.m. UTC
In kfree_rcu_monitor(), new free business at krcp is attached to any free
channel at krwp. kfree_rcu_monitor() is responsible for making sure new free
business is handled after the rcu grace period. But if any channel at krwp is
already non-free, that means there is on-going rcu work, which can cause the
kvfree_call_rcu()-triggered free business to be done before the wanted rcu
grace period ends.

This commit makes kfree_rcu_monitor() ignore any krwp that has a non-free
channel, fixing the issue of kvfree_call_rcu() losing effectiveness.

Below is the css_set obj "from_cset" use-after-free case caused by
kvfree_call_rcu() losing effectiveness.
CPU 0 calls rcu_read_lock(), then uses "from_cset"; then a hard irq comes and
the task is scheduled out.
CPU 1 calls kfree_rcu(cset, rcu_head), intending to free "from_cset" after a new gp.
But "from_cset" is freed right after the current gp ends, and is then reallocated.
CPU 0's task is scheduled back in and references a member of "from_cset", which causes a crash.

CPU 0					CPU 1
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// css_set_ptr is the "from_cset" mentioned on CPU 1
 |css_set_ptr = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

					cgroup_attach_task()
					|cgroup_migrate()
					|cgroup_migrate_execute()
					|css_set_move_task(task, from_cset, to_cset, true)
					|cgroup_move_task(task, to_cset)
					|rcu_assign_pointer(.., to_cset)
					|...
					|cgroup_migrate_finish()
					|put_css_set_locked(from_cset)
					|from_cset->refcount return 0
					|kfree_rcu(cset, rcu_head) // means to free from_cset after new gp
					|add_ptr_to_bulk_krc_lock()
					|schedule_delayed_work(&krcp->monitor_work, ..)

					kfree_rcu_monitor()
					|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
					|queue_rcu_work(system_wq, &krwp->rcu_work)
					|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
					|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp

					// There is a previous call_rcu(.., rcu_work_rcufn).
					// That gp ends, and rcu_work_rcufn() is called.
					rcu_work_rcufn()
					|__queue_work(.., rwork->wq, &rwork->work);

					|kfree_rcu_work()
					|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
					|The "from_cset" is freed before new gp end.

// the task is scheduled in after many ms.
 |css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.

v2: Use helper function instead of inserted code block at kfree_rcu_monitor().

Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
---
 kernel/rcu/tree.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Comments

Uladzislau Rezki March 31, 2023, 3:01 p.m. UTC | #1
On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> [ ... ]
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..7b95ee9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work)
>  	return !!READ_ONCE(krcp->head);
>  }
>  
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> +{
> +	int i;
> +
> +	for (i = 0; i < FREE_N_CHANNELS; i++)
> +		if (!list_empty(&krwp->bulk_head_free[i]))
> +			return true;
> +
> +	return !!krwp->head_free;
> +}
> +
>  static int krc_count(struct kfree_rcu_cpu *krcp)
>  {
>  	int sum = atomic_read(&krcp->head_count);
> @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  	for (i = 0; i < KFREE_N_BATCHES; i++) {
>  		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
>  
> -		// Try to detach bulk_head or head and attach it over any
> -		// available corresponding free channel. It can be that
> -		// a previous RCU batch is in progress, it means that
> -		// immediately to queue another one is not possible so
> -		// in that case the monitor work is rearmed.
> -		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> -			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> -				(READ_ONCE(krcp->head) && !krwp->head_free)) {
> +		// Try to detach bulk_head or head and attach it, only when
> +		// all channels are free.  Any channel is not free means at krwp
> +		// there is on-going rcu work to handle krwp's free business.
> +		if (need_wait_for_krwp_work(krwp))
> +			continue;
>  
> +		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
> +		if (need_offload_krc(krcp)) {
>  			// Channel 1 corresponds to the SLAB-pointer bulk path.
>  			// Channel 2 corresponds to vmalloc-pointer bulk path.
>  			for (j = 0; j < FREE_N_CHANNELS; j++) {
> -- 
> 1.9.1
> 
It looks correct to me. I will test it over weekend.

--
Uladzislau Rezki
Uladzislau Rezki April 3, 2023, 7:28 a.m. UTC | #2
On Fri, Mar 31, 2023 at 05:01:52PM +0200, Uladzislau Rezki wrote:
> On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> > [ ... ]
> It looks correct to me. I will test it over weekend.
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki
Mukesh Ojha April 3, 2023, 2:01 p.m. UTC | #3
On 3/31/2023 6:12 PM, Ziwei Dai wrote:
> [ ... ]
>   |css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.
> 

I too have observed this issue on 5.15, where I see the below callstack.
And I see two different views of the memory: one is this callstack, and in
the other I see a proper value of memcg in the ramdump.

It looks like a different view of the memory across the rcu critical section.

[119653.424434][T730913] binder: 30858:30913 ioctl c0306201 7633b56a78 returned -14
[119653.426095][T730918] Unable to handle kernel paging request at virtual address 000000034088e7be

-000|mem_cgroup_disabled(inline)
-000|__count_memcg_events([X0] memcg = 0x000000034088D9AE, [X1] idx = PGFAULT = 0x11, [X2] count = 0x1)
     |  [X0] memcg = 0x000000034088D9AE
     |  [X1] idx = PGFAULT = 0x11
     |  [X2] count = 0x1
     |  [locdesc] __ptr = 0x0
-001|arch_local_irq_restore(inline)
     |  [X26] flags = 0x0
-001|count_memcg_events(inline)
     |  [X26] flags = 0x0
-001|count_memcg_event_mm(inline)
     |  [locdesc] idx = PGFAULT = 0x11
-001|do_handle_mm_fault([X24] vma = 0xFFFFFF89CC704870, [X20] address = 0x000000763395AA78, [X22] flags =
     |  [X24] vma = 0xFFFFFF89CC704870
     |  [X20] address = 0x000000763395AA78
     |  [X22] flags = 0x0215
     |  [X23] seq = 0x0
     |  [X19] regs = 0xFFFFFFC05EA73AD0
     |  [locdesc] ret = 0x0
-002|handle_mm_fault(inline)
     |  [X22] address = 0x000000763395AA78
     |  [X24] flags = 0x0215
     |  [X21] regs = 0xFFFFFFC05EA73AD0
-002|__do_page_fault(inline)
     |  [X23] mm = 0xFFFFFF8812F65400
     |  [X22] addr = 0x000000763395AA78
     |  [X24] mm_flags = 0x0215
     |  [X27] vm_flags = 0x2
     |  [X21] regs = 0xFFFFFFC05EA73AD0
[119653.427011][T730918]  __count_memcg_events+0x2c/0x274
[119653.427014][T730918]  do_handle_mm_fault+0xe4/0x2c8
[119653.427017][T730918]  do_page_fault+0x4c0/0x688
[119653.427022][T730918]  do_translation_fault+0x48/0x64
[119653.427025][T730918]  do_mem_abort+0x68/0x148
[119653.427028][T730918]  el1_abort+0x40/0x64
[119653.427032][T730918]  el1h_64_sync_handler+0x60/0xa0
[119653.427035][T730918]  el1h_64_sync+0x7c/0x80
[119653.427037][T730918]  __arch_copy_to_user+0x60/0x224
[119653.427040][T730918]  binder_ioctl_write_read+0x28c/0x590
[119653.427046][T730918]  binder_ioctl+0x1b0/0xf00
[119653.427049][T730918]  __arm64_sys_ioctl+0x184/0x210
[119653.427052][T730918]  invoke_syscall+0x60/0x150
[119653.427055][T730918]  el0_svc_common+0x8c/0xf8
[119653.427057][T730918]  do_el0_svc+0x28/0xa0
[119653.427059][T730918]  el0_svc+0x24/0x84
[119653.427061][T730918]  el0t_64_sync_handler+0x88/0xec
[119653.427064][T730918]  el0t_64_sync+0x1b4/0x1b8
[119653.427067][T730918] Code: a9044ff4 d503201f 71016c3f 54000d42 (f9470808)
[119653.427069][T730918] ---[ end trace a3882cb531ca3dd0 ]---


Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>

I will try to test this patch.

Thanks,
--Mukesh

> [ ... ]
Paul E. McKenney April 3, 2023, 10:58 p.m. UTC | #4
On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> [ ... ]

Good catch, thank you!!!

How difficult was this to trigger?  If it can be triggered easily,
this of course needs to go into mainline sooner rather than later.

Longer term, would it make sense to run the three channels through RCU
separately, in order to avoid one channel refraining from starting a grace
period just because some other channel has callbacks waiting for a grace
period to complete?  One argument against might be energy efficiency, but
perhaps the ->gp_snap field could be used to get the best of both worlds.

Either way, this fixes only one bug of two.  The second bug is in the
kfree_rcu() tests, which should have caught this bug.  Thoughts on a
good fix for those tests?

I have applied Uladzislau's and Mukesh's tags, and done the usual
wordsmithing as shown at the end of this message.  Please let me know
if I messed anything up.

> ---
>  kernel/rcu/tree.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..7b95ee9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work)
>  	return !!READ_ONCE(krcp->head);
>  }
>  
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> +{
> +	int i;
> +
> +	for (i = 0; i < FREE_N_CHANNELS; i++)
> +		if (!list_empty(&krwp->bulk_head_free[i]))
> +			return true;
> +
> +	return !!krwp->head_free;

This is fixed from v1, good!

> +}
> [ ... ]

------------------------------------------------------------------------

commit e222f9a512539c3f4093a55d16624d9da614800b
Author: Ziwei Dai <ziwei.dai@unisoc.com>
Date:   Fri Mar 31 20:42:09 2023 +0800

    rcu: Avoid freeing new kfree_rcu() memory after old grace period
    
    Memory passed to kvfree_rcu() that is to be freed is tracked by a
    per-CPU kfree_rcu_cpu structure, which in turn contains pointers
    to kvfree_rcu_bulk_data structures that contain pointers to memory
    that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
    structure that tracks the memory that has already been handed to RCU.
    These structures track three categories of memory: (1) Memory for
    kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
    during an OOM episode.  The first two categories are tracked in a
    cache-friendly manner involving a dynamically allocated page of pointers
    (the aforementioned kvfree_rcu_bulk_data structures), while the third
    uses a simple (but decidedly cache-unfriendly) linked list through the
    rcu_head structures in each block of memory.
    
    On a given CPU, these three categories are handled as a unit, with that
    CPU's kfree_rcu_cpu_work structure having one pointer for each of the
    three categories.  Clearly, new memory for a given category cannot be
    placed in the corresponding kfree_rcu_cpu_work structure until any old
    memory has had its grace period elapse and thus has been removed.  And
    the kfree_rcu_monitor() function does in fact check for this.
    
    Except that the kfree_rcu_monitor() function checks these pointers one
    at a time.  This means that if the previous kfree_rcu() memory passed
    to RCU had only category 1 and the current one has only category 2, the
    kfree_rcu_monitor() function will send that current category-2 memory
    along immediately.  This can result in memory being freed too soon,
    that is, out from under unsuspecting RCU readers.
    
    To see this, consider the following sequence of events, in which:
    
    o       Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
            then is preempted.
    
    o       CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
            after a later grace period.  Except that "from_cset" is freed
            right after the previous grace period ended, so that "from_cset"
            is immediately freed.  Task A resumes and references "from_cset"'s
            member, after which nothing good happens.
    
    In full detail:
    
    CPU 0                                   CPU 1
    ----------------------                  ----------------------
    count_memcg_event_mm()
    |rcu_read_lock()  <---
    |mem_cgroup_from_task()
     |// css_set_ptr is the "from_cset" mentioned on CPU 1
     |css_set_ptr = rcu_dereference((task)->cgroups)
     |// Hard irq comes, current task is scheduled out.
    
                                            cgroup_attach_task()
                                            |cgroup_migrate()
                                            |cgroup_migrate_execute()
                                            |css_set_move_task(task, from_cset, to_cset, true)
                                            |cgroup_move_task(task, to_cset)
                                            |rcu_assign_pointer(.., to_cset)
                                            |...
                                            |cgroup_migrate_finish()
                                            |put_css_set_locked(from_cset)
                                            |from_cset->refcount return 0
                                            |kfree_rcu(cset, rcu_head) // free from_cset after new gp
                                            |add_ptr_to_bulk_krc_lock()
                                            |schedule_delayed_work(&krcp->monitor_work, ..)
    
                                            kfree_rcu_monitor()
                                            |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
                                            |queue_rcu_work(system_wq, &krwp->rcu_work)
                                            |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
                                            |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
    
                                            // There is a previous call_rcu(.., rcu_work_rcufn).
                                            // That gp ends, and rcu_work_rcufn() is called.
                                            rcu_work_rcufn()
                                            |__queue_work(.., rwork->wq, &rwork->work);
    
                                            |kfree_rcu_work()
                                            |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
                                            |The "from_cset" is freed before new gp end.
    
    // the task resumes some time later.
     |css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.
    
    This commit therefore causes kfree_rcu_monitor() to refrain from moving
    kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
    grace period has completed for all three categories.
    
    v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
    
    Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
    Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
    Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 859ee02f6614..e2dbea6cee4b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
 	return !!READ_ONCE(krcp->head);
 }
 
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+	int i;
+
+	for (i = 0; i < FREE_N_CHANNELS; i++)
+		if (!list_empty(&krwp->bulk_head_free[i]))
+			return true;
+
+	return !!krwp->head_free;
+}
+
 static int krc_count(struct kfree_rcu_cpu *krcp)
 {
 	int sum = atomic_read(&krcp->head_count);
@@ -3134,15 +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-				(READ_ONCE(krcp->head) && !krwp->head_free)) {
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free.  Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (need_wait_for_krwp_work(krwp))
+			continue;
 
+		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
+		if (need_offload_krc(krcp)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
代子为 (Ziwei Dai) April 4, 2023, 2:49 a.m. UTC | #5
Hello Paul!

> -----Original Message-----
> From: Paul E. McKenney <paulmck@kernel.org>
> Sent: April 4, 2023 6:58
> To: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>
> Cc: urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com;
> josh@joshtriplett.org; rostedt@goodmis.org;
> mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com;
> joel@joelfernandes.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org;
> 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin)
> <Yifan.Xin@unisoc.com>; 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文
> (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu)
> <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang)
> <zhaoyang.huang@unisoc.com>
> Subject: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after
> the wanted rcu grace period.
> 
> 
> CAUTION: This email originated from outside of the organization. Do not click
> links or open attachments unless you recognize the sender and know the
> content is safe.
> 
> 
> 
> On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> > In kfree_rcu_monitor(), new free business at krcp is attached to any
> > free channel at krwp. kfree_rcu_monitor() is responsible to make sure
> > new free business is handled after the rcu grace period. But if there
> > is any none-free channel at krwp already, that means there is an
> > on-going rcu work, which will cause the kvfree_call_rcu()-triggered
> > free business is done before the wanted rcu grace period ends.
> >
> > This commit ignore krwp which has non-free channel at
> > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses
> effectiveness.
> >
> > Below is the css_set obj "from_cset" use-after-free case caused by
> > kvfree_call_rcu() losing effectiveness.
> > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq
> > comes, the task is schedule out.
> > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new
> gp.
> > But "from_cset" is freed right after current gp end. "from_cset" is
> reallocated.
> > CPU 0 's task arrives back, references "from_cset"'s member, which causes
> crash.
> >
> > CPU 0                                 CPU 1
> > count_memcg_event_mm()
> > |rcu_read_lock()  <---
> > |mem_cgroup_from_task()
> >  |// css_set_ptr is the "from_cset" mentioned on CPU 1  |css_set_ptr =
> > rcu_dereference((task)->cgroups)  |// Hard irq comes, current task is
> > scheduled out.
> >
> >                                       cgroup_attach_task()
> >                                       |cgroup_migrate()
> >                                       |cgroup_migrate_execute()
> >                                       |css_set_move_task(task,
> from_cset, to_cset, true)
> >                                       |cgroup_move_task(task,
> to_cset)
> >                                       |rcu_assign_pointer(.., to_cset)
> >                                       |...
> >                                       |cgroup_migrate_finish()
> >
> |put_css_set_locked(from_cset)
> >                                       |from_cset->refcount return 0
> >                                       |kfree_rcu(cset, rcu_head) //
> means to free from_cset after new gp
> >                                       |add_ptr_to_bulk_krc_lock()
> >
> > |schedule_delayed_work(&krcp->monitor_work, ..)
> >
> >                                       kfree_rcu_monitor()
> >                                       |krcp->bulk_head[0]'s work
> attached to krwp->bulk_head_free[]
> >                                       |queue_rcu_work(system_wq,
> &krwp->rcu_work)
> >                                       |if rwork->rcu.work is not in
> WORK_STRUCT_PENDING_BIT state,
> >                                       |call_rcu(&rwork->rcu,
> > rcu_work_rcufn) <--- request a new gp
> >
> >                                       // There is a previous call_rcu(..,
> rcu_work_rcufn)
> >                                       // gp end, rcu_work_rcufn() is
> called.
> >                                       rcu_work_rcufn()
> >                                       |__queue_work(.., rwork->wq,
> > &rwork->work);
> >
> >                                       |kfree_rcu_work()
> >                                       |krwp->bulk_head_free[0] bulk
> is freed before new gp end!!!
> >                                       |The "from_cset" is freed
> before new gp end.
> >
> > // the task is scheduled in after many ms.
> >  |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because
> css_set_ptr is freed.
> >
> > v2: Use helper function instead of inserted code block at
> kfree_rcu_monitor().
> >
> > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of
> > kfree_rcu() work")
> > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> 
> Good catch, thank you!!!
> 
> How difficult was this to trigger?  If it can be triggered easily, this of course
> needs to go into mainline sooner rather than later.

Roughly, we can reproduce this issue within two rounds of a 48-hour stress test,
using 20 devices running the 5.15 kernel. If KASAN is enabled, the reproduction
rate is higher. So I think sooner is better.

> 
> Longer term, would it make sense to run the three channels through RCU
> separately, in order to avoid one channel refraining from starting a grace
> period just because some other channel has callbacks waiting for a grace
> period to complete?  One argument against might be energy efficiency, but
> perhaps the ->gp_snap field could be used to get the best of both worlds.

I see kvfree_rcu_drain_ready(krcp) is already called at the beginning of
kfree_rcu_monitor(). It polls the ->gp_snap field to decide whether to free
channel objects immediately or only after a grace period.
So both energy efficiency and timing seem to be considered?

> Either way, this fixes only one bug of two.  The second bug is in the
> kfree_rcu() tests, which should have caught this bug.  Thoughts on a good fix
> for those tests?

In the RCU reader scenario, I inserted an msleep() between "rcu_read_lock(), get
the pointer via rcu_dereference()" and "dereference the pointer, using the member";
with that, we can reproduce this issue very quickly in the stress test. Can the
kfree_rcu() tests insert an msleep()?

> I have applied Uladzislau's and Mukesh's tags, and done the usual
> wordsmithing as shown at the end of this message.  Please let me know if I
> messed anything up.

Thank you for the improvement on the patch! It seems better now.

> > ---
> >  kernel/rcu/tree.c | 27 +++++++++++++++++++--------
> >  1 file changed, 19 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> > 8e880c0..7b95ee9 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct
> *work)
> >       return !!READ_ONCE(krcp->head);
> >  }
> >
> > +static bool
> > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) {
> > +     int i;
> > +
> > +     for (i = 0; i < FREE_N_CHANNELS; i++)
> > +             if (!list_empty(&krwp->bulk_head_free[i]))
> > +                     return true;
> > +
> > +     return !!krwp->head_free;
> 
> This is fixed from v1, good!
> 
> > +}
> > +
> >  static int krc_count(struct kfree_rcu_cpu *krcp)  {
> >       int sum = atomic_read(&krcp->head_count); @@ -3107,15
> +3119,14
> > @@ static void kfree_rcu_monitor(struct work_struct *work)
> >       for (i = 0; i < KFREE_N_BATCHES; i++) {
> >               struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> >
> > -             // Try to detach bulk_head or head and attach it over any
> > -             // available corresponding free channel. It can be that
> > -             // a previous RCU batch is in progress, it means that
> > -             // immediately to queue another one is not possible so
> > -             // in that case the monitor work is rearmed.
> > -             if ((!list_empty(&krcp->bulk_head[0]) &&
> list_empty(&krwp->bulk_head_free[0])) ||
> > -                     (!list_empty(&krcp->bulk_head[1]) &&
> list_empty(&krwp->bulk_head_free[1])) ||
> > -                             (READ_ONCE(krcp->head)
> && !krwp->head_free)) {
> > +             // Try to detach bulk_head or head and attach it, only when
> > +             // all channels are free.  Any channel is not free means at
> krwp
> > +             // there is on-going rcu work to handle krwp's free
> business.
> > +             if (need_wait_for_krwp_work(krwp))
> > +                     continue;
> >
> > +             // kvfree_rcu_drain_ready() might handle this krcp, if so
> give up.
> > +             if (need_offload_krc(krcp)) {
> >                       // Channel 1 corresponds to the SLAB-pointer
> bulk path.
> >                       // Channel 2 corresponds to vmalloc-pointer bulk
> path.
> >                       for (j = 0; j < FREE_N_CHANNELS; j++) {
> > --
> > 1.9.1
> 
> ------------------------------------------------------------------------
> 
> commit e222f9a512539c3f4093a55d16624d9da614800b
> Author: Ziwei Dai <ziwei.dai@unisoc.com>
> Date:   Fri Mar 31 20:42:09 2023 +0800
> 
>     rcu: Avoid freeing new kfree_rcu() memory after old grace period
> 
>     Memory passed to kvfree_rcu() that is to be freed is tracked by a
>     per-CPU kfree_rcu_cpu structure, which in turn contains pointers
>     to kvfree_rcu_bulk_data structures that contain pointers to memory
>     that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
>     structure that tracks the memory that has already been handed to RCU.
>     These structures track three categories of memory: (1) Memory for
>     kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
>     during an OOM episode.  The first two categories are tracked in a
>     cache-friendly manner involving a dynamically allocated page of pointers
>     (the aforementioned kvfree_rcu_bulk_data structures), while the third
>     uses a simple (but decidedly cache-unfriendly) linked list through the
>     rcu_head structures in each block of memory.
> 
>     On a given CPU, these three categories are handled as a unit, with that
>     CPU's kfree_rcu_cpu_work structure having one pointer for each of the
>     three categories.  Clearly, new memory for a given category cannot be
>     placed in the corresponding kfree_rcu_cpu_work structure until any old
>     memory has had its grace period elapse and thus has been removed.
> And
>     the kfree_rcu_monitor() function does in fact check for this.
> 
>     Except that the kfree_rcu_monitor() function checks these pointers one
>     at a time.  This means that if the previous kfree_rcu() memory passed
>     to RCU had only category 1 and the current one has only category 2, the
>     kfree_rcu_monitor() function will send that current category-2 memory
>     along immediately.  This can result in memory being freed too soon,
>     that is, out from under unsuspecting RCU readers.
> 
>     To see this, consider the following sequence of events, in which:
> 
>     o       Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
>             then is preempted.
> 
>     o       CPU 1 calls kfree_rcu(cset, rcu_head) in order to free
> "from_cset"
>             after a later grace period.  Except that "from_cset" is freed
>             right after the previous grace period ended, so that
> "from_cset"
>             is immediately freed.  Task A resumes and references
> "from_cset"'s
>             member, after which nothing good happens.
> 
>     In full detail:
> 
>     CPU 0                                   CPU 1
>     ----------------------                  ----------------------
>     count_memcg_event_mm()
>     |rcu_read_lock()  <---
>     |mem_cgroup_from_task()
>      |// css_set_ptr is the "from_cset" mentioned on CPU 1
>      |css_set_ptr = rcu_dereference((task)->cgroups)
>      |// Hard irq comes, current task is scheduled out.
> 
>                                             cgroup_attach_task()
>                                             |cgroup_migrate()
> 
> |cgroup_migrate_execute()
>                                             |css_set_move_task(task,
> from_cset, to_cset, true)
>                                             |cgroup_move_task(task,
> to_cset)
>                                             |rcu_assign_pointer(..,
> to_cset)
>                                             |...
>                                             |cgroup_migrate_finish()
> 
> |put_css_set_locked(from_cset)
>                                             |from_cset->refcount
> return 0
>                                             |kfree_rcu(cset, rcu_head)
> // free from_cset after new gp
> 
> |add_ptr_to_bulk_krc_lock()
> 
> |schedule_delayed_work(&krcp->monitor_work, ..)
> 
>                                             kfree_rcu_monitor()
>                                             |krcp->bulk_head[0]'s
> work attached to krwp->bulk_head_free[]
> 
> |queue_rcu_work(system_wq, &krwp->rcu_work)
>                                             |if rwork->rcu.work is not
> in WORK_STRUCT_PENDING_BIT state,
>                                             |call_rcu(&rwork->rcu,
> rcu_work_rcufn) <--- request new gp
> 
>                                             // There is a perious
> call_rcu(.., rcu_work_rcufn)
>                                             // gp end,
> rcu_work_rcufn() is called.
>                                             rcu_work_rcufn()
>                                             |__queue_work(..,
> rwork->wq, &rwork->work);
> 
>                                             |kfree_rcu_work()
>                                             |krwp->bulk_head_free[0]
> bulk is freed before new gp end!!!
>                                             |The "from_cset" is freed
> before new gp end.
> 
>     // the task resumes some time later.
>      |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because
> css_set_ptr is freed.
> 
>     This commit therefore causes kfree_rcu_monitor() to refrain from
> moving
>     kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
>     grace period has completed for all three categories.
> 
>     v2: Use helper function instead of inserted code block at
> kfree_rcu_monitor().
> 
>     Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu()
> work")
>     Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
>     Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
>     Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> 859ee02f6614..e2dbea6cee4b 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
>         return !!READ_ONCE(krcp->head);
>  }
> 
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) {
> +       int i;
> +
> +       for (i = 0; i < FREE_N_CHANNELS; i++)
> +               if (!list_empty(&krwp->bulk_head_free[i]))
> +                       return true;
> +
> +       return !!krwp->head_free;
> +}
> +
>  static int krc_count(struct kfree_rcu_cpu *krcp)  {
>         int sum = atomic_read(&krcp->head_count); @@ -3134,15
> +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
>         for (i = 0; i < KFREE_N_BATCHES; i++) {
>                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> 
> -               // Try to detach bulk_head or head and attach it over any
> -               // available corresponding free channel. It can be that
> -               // a previous RCU batch is in progress, it means that
> -               // immediately to queue another one is not possible so
> -               // in that case the monitor work is rearmed.
> -               if ((!list_empty(&krcp->bulk_head[0]) &&
> list_empty(&krwp->bulk_head_free[0])) ||
> -                       (!list_empty(&krcp->bulk_head[1]) &&
> list_empty(&krwp->bulk_head_free[1])) ||
> -                               (READ_ONCE(krcp->head)
> && !krwp->head_free)) {
> +               // Try to detach bulk_head or head and attach it, only
> when
> +               // all channels are free.  Any channel is not free means at
> krwp
> +               // there is on-going rcu work to handle krwp's free
> business.
> +               if (need_wait_for_krwp_work(krwp))
> +                       continue;
> 
> +               // kvfree_rcu_drain_ready() might handle this krcp, if so
> give up.
> +               if (need_offload_krc(krcp)) {
>                         // Channel 1 corresponds to the SLAB-pointer
> bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk
> path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
Paul E. McKenney April 4, 2023, 3:23 a.m. UTC | #6
On Tue, Apr 04, 2023 at 02:49:15AM +0000, 代子为 (Ziwei Dai) wrote:
> Hello Paul!
> 
> > -----Original Message-----
> > From: Paul E. McKenney <paulmck@kernel.org>
> > Sent: April 4, 2023 6:58
> > To: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>
> > Cc: urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com;
> > josh@joshtriplett.org; rostedt@goodmis.org;
> > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com;
> > joel@joelfernandes.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org;
> > 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin)
> > <Yifan.Xin@unisoc.com>; 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文
> > (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu)
> > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang)
> > <zhaoyang.huang@unisoc.com>
> > Subject: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after
> > the wanted rcu grace period.
> > 
> > 
> > CAUTION: This email originated from outside of the organization. Do not click
> > links or open attachments unless you recognize the sender and know the
> > content is safe.
> > 
> > 
> > 
> > On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> > > In kfree_rcu_monitor(), new free business at krcp is attached to any
> > > free channel at krwp. kfree_rcu_monitor() is responsible to make sure
> > > new free business is handled after the rcu grace period. But if there
> > > is any none-free channel at krwp already, that means there is an
> > > on-going rcu work, which will cause the kvfree_call_rcu()-triggered
> > > free business is done before the wanted rcu grace period ends.
> > >
> > > This commit ignore krwp which has non-free channel at
> > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses
> > effectiveness.
> > >
> > > Below is the css_set obj "from_cset" use-after-free case caused by
> > > kvfree_call_rcu() losing effectiveness.
> > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq
> > > comes, the task is schedule out.
> > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new
> > gp.
> > > But "from_cset" is freed right after current gp end. "from_cset" is
> > reallocated.
> > > CPU 0 's task arrives back, references "from_cset"'s member, which causes
> > crash.
> > >
> > > CPU 0                                 CPU 1
> > > count_memcg_event_mm()
> > > |rcu_read_lock()  <---
> > > |mem_cgroup_from_task()
> > >  |// css_set_ptr is the "from_cset" mentioned on CPU 1  |css_set_ptr =
> > > rcu_dereference((task)->cgroups)  |// Hard irq comes, current task is
> > > scheduled out.
> > >
> > >                                       cgroup_attach_task()
> > >                                       |cgroup_migrate()
> > >                                       |cgroup_migrate_execute()
> > >                                       |css_set_move_task(task,
> > from_cset, to_cset, true)
> > >                                       |cgroup_move_task(task,
> > to_cset)
> > >                                       |rcu_assign_pointer(.., to_cset)
> > >                                       |...
> > >                                       |cgroup_migrate_finish()
> > >
> > |put_css_set_locked(from_cset)
> > >                                       |from_cset->refcount return 0
> > >                                       |kfree_rcu(cset, rcu_head) //
> > means to free from_cset after new gp
> > >                                       |add_ptr_to_bulk_krc_lock()
> > >
> > > |schedule_delayed_work(&krcp->monitor_work, ..)
> > >
> > >                                       kfree_rcu_monitor()
> > >                                       |krcp->bulk_head[0]'s work
> > attached to krwp->bulk_head_free[]
> > >                                       |queue_rcu_work(system_wq,
> > &krwp->rcu_work)
> > >                                       |if rwork->rcu.work is not in
> > WORK_STRUCT_PENDING_BIT state,
> > >                                       |call_rcu(&rwork->rcu,
> > > rcu_work_rcufn) <--- request a new gp
> > >
> > >                                       // There is a previous call_rcu(..,
> > rcu_work_rcufn)
> > >                                       // gp end, rcu_work_rcufn() is
> > called.
> > >                                       rcu_work_rcufn()
> > >                                       |__queue_work(.., rwork->wq,
> > > &rwork->work);
> > >
> > >                                       |kfree_rcu_work()
> > >                                       |krwp->bulk_head_free[0] bulk
> > is freed before new gp end!!!
> > >                                       |The "from_cset" is freed
> > before new gp end.
> > >
> > > // the task is scheduled in after many ms.
> > >  |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because
> > css_set_ptr is freed.
> > >
> > > v2: Use helper function instead of inserted code block at
> > kfree_rcu_monitor().
> > >
> > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of
> > > kfree_rcu() work")
> > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> > 
> > Good catch, thank you!!!
> > 
> > How difficult was this to trigger?  If it can be triggered easily, this of course
> > needs to go into mainline sooner rather than later.
> 
> Roughly we can reproduce this issue within two rounds of 48h stress test, 
> with 20 k5.15 devices. If KASAN is enabled, the reproduce rate is higher.
> So I think sooner is better.

Thank you for the info!  This is in theory an old bug, but if you can
easily find out, does it trigger for you on v6.2 or earlier?

> > Longer term, would it make sense to run the three channels through RCU
> > separately, in order to avoid one channel refraining from starting a grace
> > period just because some other channel has callbacks waiting for a grace
> > period to complete?  One argument against might be energy efficiency, but
> > perhaps the ->gp_snap field could be used to get the best of both worlds.
> 
> I see kvfree_rcu_drain_ready(krcp) is already called at the beginning of
> kfree_rcu_monitor(), which polls the ->gp_snap field, to decide
> whether to free channel objects immediately or after gp.
> Both energy efficiency and timing seems be considered?

My concern is that running the channels separately might mean more grace
periods (and thus more energy draw) on nearly idle devices, such devices
usually being the ones for which energy efficiency matters most.

But perhaps Vlad, Neeraj, or Joel has some insight on this, given
that they are the ones working on battery-powered devices.

> > Either way, this fixes only one bug of two.  The second bug is in the
> > kfree_rcu() tests, which should have caught this bug.  Thoughts on a good fix
> > for those tests?
> 
> I inserted a msleep() between "rcu_read_lock(), get pointer via rcu_dereference()"
> and "reference pointer, using the member", at the rcu scenario, then we can
> reproduce this issue very soon in stress test. Can kfree_rcu() tests insert msleep()?

Another approach is to separate concerns, so that readers interact with
grace periods in the rcutorture.c tests, and to add the interaction
of to-be-freed memory with grace periods in the rcuscale kvfree tests.
I took a step in this direction with this commit on the -rcu tree's
"dev" branch:

efbe7927f479 ("rcu/kvfree: Add debug to check grace periods")

Given this, might it be possible to make rcuscale.c's kfree_rcu()
testing create patterns of usage of the three channels so as to
catch this bug that way?

> > I have applied Uladzislau's and Mukesh's tags, and done the usual
> > wordsmithing as shown at the end of this message.  Please let me know if I
> > messed anything up.
> 
> Thank you for the improvement on the patch! It seems better now.

No problem and thank you again for the debugging and the fix!

							Thanx, Paul

> > > ---
> > >  kernel/rcu/tree.c | 27 +++++++++++++++++++--------
> > >  1 file changed, 19 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index
> > > 8e880c0..7b95ee9 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct
> > *work)
> > >       return !!READ_ONCE(krcp->head);
> > >  }
> > >
> > > +static bool
> > > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) {
> > > +     int i;
> > > +
> > > +     for (i = 0; i < FREE_N_CHANNELS; i++)
> > > +             if (!list_empty(&krwp->bulk_head_free[i]))
> > > +                     return true;
> > > +
> > > +     return !!krwp->head_free;
> > 
> > This is fixed from v1, good!
> > 
> > > +}
> > > +
> > >  static int krc_count(struct kfree_rcu_cpu *krcp)  {
> > >       int sum = atomic_read(&krcp->head_count); @@ -3107,15
> > +3119,14
> > > @@ static void kfree_rcu_monitor(struct work_struct *work)
> > >       for (i = 0; i < KFREE_N_BATCHES; i++) {
> > >               struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> > >
> > > -             // Try to detach bulk_head or head and attach it over any
> > > -             // available corresponding free channel. It can be that
> > > -             // a previous RCU batch is in progress, it means that
> > > -             // immediately to queue another one is not possible so
> > > -             // in that case the monitor work is rearmed.
> > > -             if ((!list_empty(&krcp->bulk_head[0]) &&
> > list_empty(&krwp->bulk_head_free[0])) ||
> > > -                     (!list_empty(&krcp->bulk_head[1]) &&
> > list_empty(&krwp->bulk_head_free[1])) ||
> > > -                             (READ_ONCE(krcp->head)
> > && !krwp->head_free)) {
> > > +             // Try to detach bulk_head or head and attach it, only when
> > > +             // all channels are free.  Any channel is not free means at
> > krwp
> > > +             // there is on-going rcu work to handle krwp's free
> > business.
> > > +             if (need_wait_for_krwp_work(krwp))
> > > +                     continue;
> > >
> > > +             // kvfree_rcu_drain_ready() might handle this krcp, if so
> > give up.
> > > +             if (need_offload_krc(krcp)) {
> > >                       // Channel 1 corresponds to the SLAB-pointer
> > bulk path.
> > >                       // Channel 2 corresponds to vmalloc-pointer bulk
> > path.
> > >                       for (j = 0; j < FREE_N_CHANNELS; j++) {
> > > --
> > > 1.9.1
> > 
> > ------------------------------------------------------------------------
> > 
> > commit e222f9a512539c3f4093a55d16624d9da614800b
> > Author: Ziwei Dai <ziwei.dai@unisoc.com>
> > Date:   Fri Mar 31 20:42:09 2023 +0800
> > 
> >     rcu: Avoid freeing new kfree_rcu() memory after old grace period
> > 
> >     Memory passed to kvfree_rcu() that is to be freed is tracked by a
> >     per-CPU kfree_rcu_cpu structure, which in turn contains pointers
> >     to kvfree_rcu_bulk_data structures that contain pointers to memory
> >     that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
> >     structure that tracks the memory that has already been handed to RCU.
> >     These structures track three categories of memory: (1) Memory for
> >     kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
> >     during an OOM episode.  The first two categories are tracked in a
> >     cache-friendly manner involving a dynamically allocated page of pointers
> >     (the aforementioned kvfree_rcu_bulk_data structures), while the third
> >     uses a simple (but decidedly cache-unfriendly) linked list through the
> >     rcu_head structures in each block of memory.
> > 
> >     On a given CPU, these three categories are handled as a unit, with that
> >     CPU's kfree_rcu_cpu_work structure having one pointer for each of the
> >     three categories.  Clearly, new memory for a given category cannot be
> >     placed in the corresponding kfree_rcu_cpu_work structure until any old
> >     memory has had its grace period elapse and thus has been removed.
> > And
> >     the kfree_rcu_monitor() function does in fact check for this.
> > 
> >     Except that the kfree_rcu_monitor() function checks these pointers one
> >     at a time.  This means that if the previous kfree_rcu() memory passed
> >     to RCU had only category 1 and the current one has only category 2, the
> >     kfree_rcu_monitor() function will send that current category-2 memory
> >     along immediately.  This can result in memory being freed too soon,
> >     that is, out from under unsuspecting RCU readers.
> > 
> >     To see this, consider the following sequence of events, in which:
> > 
> >     o       Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
> >             then is preempted.
> > 
> >     o       CPU 1 calls kfree_rcu(cset, rcu_head) in order to free
> >             "from_cset" after a later grace period.  Except that
> >             "from_cset" is freed right after the previous grace period
> >             ended, so that "from_cset" is immediately freed.  Task A
> >             resumes and references "from_cset"'s member, after which
> >             nothing good happens.
> > 
> >     In full detail:
> > 
> >     CPU 0                                   CPU 1
> >     ----------------------                  ----------------------
> >     count_memcg_event_mm()
> >     |rcu_read_lock()  <---
> >     |mem_cgroup_from_task()
> >      |// css_set_ptr is the "from_cset" mentioned on CPU 1
> >      |css_set_ptr = rcu_dereference((task)->cgroups)
> >      |// Hard irq comes, current task is scheduled out.
> > 
> >                                             cgroup_attach_task()
> >                                             |cgroup_migrate()
> >                                             |cgroup_migrate_execute()
> >                                             |css_set_move_task(task, from_cset, to_cset, true)
> >                                             |cgroup_move_task(task, to_cset)
> >                                             |rcu_assign_pointer(.., to_cset)
> >                                             |...
> >                                             |cgroup_migrate_finish()
> >                                             |put_css_set_locked(from_cset)
> >                                             |from_cset->refcount return 0
> >                                             |kfree_rcu(cset, rcu_head) // free from_cset after new gp
> >                                             |add_ptr_to_bulk_krc_lock()
> >                                             |schedule_delayed_work(&krcp->monitor_work, ..)
> > 
> >                                             kfree_rcu_monitor()
> >                                             |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
> >                                             |queue_rcu_work(system_wq, &krwp->rcu_work)
> >                                             |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
> >                                             |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
> > 
> >                                             // There is a previous call_rcu(.., rcu_work_rcufn)
> >                                             // gp end, rcu_work_rcufn() is called.
> >                                             rcu_work_rcufn()
> >                                             |__queue_work(.., rwork->wq, &rwork->work);
> > 
> >                                             |kfree_rcu_work()
> >                                             |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
> >                                             |The "from_cset" is freed before new gp end.
> > 
> >     // the task resumes some time later.
> >      |css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.
> > 
> >     This commit therefore causes kfree_rcu_monitor() to refrain from
> >     moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until
> >     the RCU grace period has completed for all three categories.
> > 
> >     v2: Use helper function instead of inserted code block at
> >     kfree_rcu_monitor().
> > 
> >     Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> >     Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
> >     Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> >     Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 859ee02f6614..e2dbea6cee4b 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
> >         return !!READ_ONCE(krcp->head);
> >  }
> > 
> > +static bool
> > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> > +{
> > +       int i;
> > +
> > +       for (i = 0; i < FREE_N_CHANNELS; i++)
> > +               if (!list_empty(&krwp->bulk_head_free[i]))
> > +                       return true;
> > +
> > +       return !!krwp->head_free;
> > +}
> > +
> >  static int krc_count(struct kfree_rcu_cpu *krcp)
> >  {
> >         int sum = atomic_read(&krcp->head_count);
> > @@ -3134,15 +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
> >         for (i = 0; i < KFREE_N_BATCHES; i++) {
> >                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> > 
> > -               // Try to detach bulk_head or head and attach it over any
> > -               // available corresponding free channel. It can be that
> > -               // a previous RCU batch is in progress, it means that
> > -               // immediately to queue another one is not possible so
> > -               // in that case the monitor work is rearmed.
> > -               if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> > -                       (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> > -                               (READ_ONCE(krcp->head) && !krwp->head_free)) {
> > +               // Try to detach bulk_head or head and attach it, only when
> > +               // all channels are free.  Any channel is not free means at krwp
> > +               // there is on-going rcu work to handle krwp's free business.
> > +               if (need_wait_for_krwp_work(krwp))
> > +                       continue;
> > 
> > +               // kvfree_rcu_drain_ready() might handle this krcp, if so give up.
> > +               if (need_offload_krc(krcp)) {
> >                         // Channel 1 corresponds to the SLAB-pointer bulk path.
> >                         // Channel 2 corresponds to vmalloc-pointer bulk path.
> >                         for (j = 0; j < FREE_N_CHANNELS; j++) {
Joel Fernandes April 5, 2023, 5:39 p.m. UTC | #7
On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote:
>
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp. kfree_rcu_monitor() is responsible for making sure new free
> business is handled after the rcu grace period. But if there is any non-free
> channel at krwp already, that means there is on-going rcu work, which will
> cause the kvfree_call_rcu()-triggered free business to be done before the
> wanted rcu grace period ends.
>
> This commit ignores any krwp which has a non-free channel at
> kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses
> effectiveness.
>
> Below is the css_set obj "from_cset" use-after-free case caused by
> kvfree_call_rcu() losing effectiveness.
> CPU 0 calls rcu_read_lock(), then uses "from_cset", then a hard irq comes,
> and the task is scheduled out.
> CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after a new gp.
> But "from_cset" is freed right after the current gp ends. "from_cset" is reallocated.
> CPU 0's task arrives back and references "from_cset"'s member, which causes a crash.
>
> CPU 0                                   CPU 1
> count_memcg_event_mm()
> |rcu_read_lock()  <---
> |mem_cgroup_from_task()
>  |// css_set_ptr is the "from_cset" mentioned on CPU 1
>  |css_set_ptr = rcu_dereference((task)->cgroups)
>  |// Hard irq comes, current task is scheduled out.
>
>                                         cgroup_attach_task()
>                                         |cgroup_migrate()
>                                         |cgroup_migrate_execute()
>                                         |css_set_move_task(task, from_cset, to_cset, true)
>                                         |cgroup_move_task(task, to_cset)
>                                         |rcu_assign_pointer(.., to_cset)
>                                         |...
>                                         |cgroup_migrate_finish()
>                                         |put_css_set_locked(from_cset)
>                                         |from_cset->refcount return 0
>                                         |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp
>                                         |add_ptr_to_bulk_krc_lock()
>                                         |schedule_delayed_work(&krcp->monitor_work, ..)
>
>                                         kfree_rcu_monitor()
>                                         |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
>                                         |queue_rcu_work(system_wq, &krwp->rcu_work)
>                                         |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
>                                         |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
>
>                                         // There is a previous call_rcu(.., rcu_work_rcufn)
>                                         // gp end, rcu_work_rcufn() is called.
>                                         rcu_work_rcufn()
>                                         |__queue_work(.., rwork->wq, &rwork->work);
>
>                                         |kfree_rcu_work()
>                                         |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
>                                         |The "from_cset" is freed before new gp end.
>
> // the task is scheduled in after many ms.
>  |css_set_ptr->subsys[subsys_id] <--- Caused kernel crash, because css_set_ptr is freed.
>
> v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
>
> Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>

Please update the fixes tag to:
5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")

The issue happened when 5f3c8d620447 started looking at multiple
channels at the same time in the same work handler function.

I think a better fix might be to separate out the work handler
functions for each channel separately. That way we get more
parallelism.

but since this is urgent,
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel



> ---
>  kernel/rcu/tree.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..7b95ee9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work)
>         return !!READ_ONCE(krcp->head);
>  }
>
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> +{
> +       int i;
> +
> +       for (i = 0; i < FREE_N_CHANNELS; i++)
> +               if (!list_empty(&krwp->bulk_head_free[i]))
> +                       return true;
> +
> +       return !!krwp->head_free;
> +}
> +
>  static int krc_count(struct kfree_rcu_cpu *krcp)
>  {
>         int sum = atomic_read(&krcp->head_count);
> @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
>         for (i = 0; i < KFREE_N_BATCHES; i++) {
>                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
>
> -               // Try to detach bulk_head or head and attach it over any
> -               // available corresponding free channel. It can be that
> -               // a previous RCU batch is in progress, it means that
> -               // immediately to queue another one is not possible so
> -               // in that case the monitor work is rearmed.
> -               if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> -                       (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> -                               (READ_ONCE(krcp->head) && !krwp->head_free)) {
> +               // Try to detach bulk_head or head and attach it, only when
> +               // all channels are free.  Any channel is not free means at krwp
> +               // there is on-going rcu work to handle krwp's free business.
> +               if (need_wait_for_krwp_work(krwp))
> +                       continue;
>
> +               // kvfree_rcu_drain_ready() might handle this krcp, if so give up.
> +               if (need_offload_krc(krcp)) {
>                         // Channel 1 corresponds to the SLAB-pointer bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1
>
Joel Fernandes April 5, 2023, 6:12 p.m. UTC | #8
On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote:
> >
> > [...]
> >
> > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
>
> Please update the fixes tag to:
> 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")

Vlad pointed out in another thread that the fix is actually to 34c881745549.

So just to be sure, it could be updated to:
Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")

thanks,

 - Joel
Paul E. McKenney April 5, 2023, 6:46 p.m. UTC | #9
On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote:
> On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote:
> > >
> > > [...]
> > >
> > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> >
> > Please update the fixes tag to:
> > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
> 
> Vlad pointed out in another thread that the fix is actually to 34c881745549.
> 
> So just to be sure, it could be updated to:
> Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
> Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")

Ziwei Dai, does this change in Fixes look good to you?

If so, I will update the commit log in this commit that I am planning
to submit into v6.3.  It is strictly speaking not a v6.3 regression,
but it is starting to show up in the wild and the patch is contained
enough to be considered an urgent fix.

							Thanx, Paul
代子为 (Ziwei Dai) April 6, 2023, 1:38 a.m. UTC | #10
> -----Original Message-----
> From: Paul E. McKenney <paulmck@kernel.org>
> Sent: April 6, 2023, 2:46
> To: Joel Fernandes <joel@joelfernandes.org>
> Cc: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com;
> josh@joshtriplett.org; rostedt@goodmis.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; rcu@vger.kernel.org;
> linux-kernel@vger.kernel.org; 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>;
> 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) <Zhiguo.Niu@unisoc.com>;
> 黄朝阳 (Zhaoyang Huang) <zhaoyang.huang@unisoc.com>
> Subject: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
> 
> 
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments
> unless you recognize the sender and know the content is safe.
> 
> 
> 
> On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote:
> > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote:
> > > >
> > > > [...]
> > > >
> > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> > >
> > > Please update the fixes tag to:
> > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
> >
> > Vlad pointed out in another thread that the fix is actually to 34c881745549.
> >
> > So just to be sure, it could be updated to:
> > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
> > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
> 
> Ziwei Dai, does this change in Fixes look good to you?
> 
> If so, I will update the commit log in this commit that I am planning to submit into v6.3.  It is strictly speaking not a v6.3 regression,
> but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix.
> 
>                                                         Thanx, Paul

Hi Paul, it looks good to me and thanks!
Paul E. McKenney April 6, 2023, 3:46 a.m. UTC | #11
On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote:
> 
> 
> > [...]
> > 
> > On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote:
> > > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote:
> > > > >
> > > > > In kfree_rcu_monitor(), new free business at krcp is attached to
> > > > > any free channel at krwp. kfree_rcu_monitor() is responsible to
> > > > > make sure new free business is handled after the rcu grace period.
> > > > > But if there is any none-free channel at krwp already, that means
> > > > > there is an on-going rcu work, which will cause the
> > > > > kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends.
> > > > >
> > > > > This commit ignore krwp which has non-free channel at
> > > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness.
> > > > >
> > > > > Below is the css_set obj "from_cset" use-after-free case caused by
> > > > > kvfree_call_rcu() losing effectiveness.
> > > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq
> > > > > comes, the task is schedule out.
> > > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
> > > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated.
> > > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash.
> > > > >
> > > > > CPU 0                                   CPU 1
> > > > > count_memcg_event_mm()
> > > > > |rcu_read_lock()  <---
> > > > > |mem_cgroup_from_task()
> > > > >  |// css_set_ptr is the "from_cset" mentioned on CPU 1
> > > > > |css_set_ptr = rcu_dereference((task)->cgroups)  |// Hard irq
> > > > > comes, current task is scheduled out.
> > > > >
> > > > >                                         cgroup_attach_task()
> > > > >                                         |cgroup_migrate()
> > > > >                                         |cgroup_migrate_execute()
> > > > >                                         |css_set_move_task(task, from_cset, to_cset, true)
> > > > >                                         |cgroup_move_task(task, to_cset)
> > > > >                                         |rcu_assign_pointer(.., to_cset)
> > > > >                                         |...
> > > > >                                         |cgroup_migrate_finish()
> > > > >                                         |put_css_set_locked(from_cset)
> > > > >                                         |from_cset->refcount return 0
> > > > >                                         |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp
> > > > >                                         |add_ptr_to_bulk_krc_lock()
> > > > >
> > > > > |schedule_delayed_work(&krcp->monitor_work, ..)
> > > > >
> > > > >                                         kfree_rcu_monitor()
> > > > >                                         |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
> > > > >                                         |queue_rcu_work(system_wq, &krwp->rcu_work)
> > > > >                                         |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
> > > > >                                         |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
> > > > >
> > > > >                                         // There is a previous call_rcu(.., rcu_work_rcufn).
> > > > >                                         // That gp ends, so rcu_work_rcufn() is called.
> > > > >                                         rcu_work_rcufn()
> > > > >                                         |__queue_work(.., rwork->wq, &rwork->work);
> > > > >
> > > > >                                         |kfree_rcu_work()
> > > > >                                         |krwp->bulk_head_free[0] bulk is freed before the new gp ends!!!
> > > > >                                         |The "from_cset" is freed before the new gp ends.
> > > > >
> > > > > // The task is scheduled back in after many ms.
> > > > >  |css_set_ptr->subsys[subsys_id] <--- Causes a kernel crash, because css_set_ptr is freed.
> > > > >
> > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
> > > > >
> > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> > > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> > > >
> > > > Please update the fixes tag to:
> > > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
> > >
> > > Vlad pointed out in another thread that the fix is actually to 34c881745549.
> > >
> > > So just to be sure, it could be updated to:
> > > > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
> > > > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
> > 
> > Ziwei Dai, does this change in Fixes look good to you?
> > 
> > If so, I will update the commit log in this commit that I am planning to submit into v6.3.  It is strictly speaking not a v6.3 regression,
> > but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix.
> > 
> >                                                         Thanx, Paul
> 
> Hi Paul, it looks good to me and thanks!

Thank you, and I will fix on my next rebase.

							Thanx, Paul
Uladzislau Rezki April 6, 2023, 4:44 a.m. UTC | #12
On Wed, Apr 05, 2023 at 08:46:15PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote:
> > 
> > [...]
> > 
> > Hi Paul, it looks good to me and thanks!
> 
> Thank you, and I will fix on my next rebase.
> 
After heavy testing overnight, I do not see that any warnings
are triggered:

Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki
Paul E. McKenney April 6, 2023, 2:13 p.m. UTC | #13
On Thu, Apr 06, 2023 at 06:44:15AM +0200, Uladzislau Rezki wrote:
> On Wed, Apr 05, 2023 at 08:46:15PM -0700, Paul E. McKenney wrote:
> > On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote:
> > > 
> > > [...]
> > 
> After heavy testing over night i do not see that any warnings
> are triggered:
> 
> Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Thank you very much!!!  I will apply this during my upcoming rebase.

							Thanx, Paul
diff mbox series

Patch

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8e880c0..7b95ee9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3024,6 +3024,18 @@  static void kfree_rcu_work(struct work_struct *work)
 	return !!READ_ONCE(krcp->head);
 }
 
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+	int i;
+
+	for (i = 0; i < FREE_N_CHANNELS; i++)
+		if (!list_empty(&krwp->bulk_head_free[i]))
+			return true;
+
+	return !!krwp->head_free;
+}
+
 static int krc_count(struct kfree_rcu_cpu *krcp)
 {
 	int sum = atomic_read(&krcp->head_count);
@@ -3107,15 +3119,14 @@  static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-				(READ_ONCE(krcp->head) && !krwp->head_free)) {
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free.  Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (need_wait_for_krwp_work(krwp))
+			continue;
 
+		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
+		if (need_offload_krc(krcp)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {