Message ID | 1680266529-28429-1-git-send-email-ziwei.dai@unisoc.com (mailing list archive) |
---|---|
State | Accepted |
Commit | e222f9a512539c3f4093a55d16624d9da614800b |
Headers | show |
Series | [V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. | expand |
On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote: > In kfree_rcu_monitor(), new free business at krcp is attached to any free > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > business is handled after the rcu grace period. But if there is any none-free > channel at krwp already, that means there is an on-going rcu work, > which will cause the kvfree_call_rcu()-triggered free business is done > before the wanted rcu grace period ends. > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > to fix the issue that kvfree_call_rcu() loses effectiveness. > > Below is the css_set obj "from_cset" use-after-free case caused by > kvfree_call_rcu() losing effectiveness. > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > the task is schedule out. > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > CPU 0 CPU 1 > count_memcg_event_mm() > |rcu_read_lock() <--- > |mem_cgroup_from_task() > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > |css_set_ptr = rcu_dereference((task)->cgroups) > |// Hard irq comes, current task is scheduled out. > > cgroup_attach_task() > |cgroup_migrate() > |cgroup_migrate_execute() > |css_set_move_task(task, from_cset, to_cset, true) > |cgroup_move_task(task, to_cset) > |rcu_assign_pointer(.., to_cset) > |... > |cgroup_migrate_finish() > |put_css_set_locked(from_cset) > |from_cset->refcount return 0 > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > |add_ptr_to_bulk_krc_lock() > |schedule_delayed_work(&krcp->monitor_work, ..) > > kfree_rcu_monitor() > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > |queue_rcu_work(system_wq, &krwp->rcu_work) > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > // There is a perious call_rcu(.., rcu_work_rcufn) > // gp end, rcu_work_rcufn() is called. > rcu_work_rcufn() > |__queue_work(.., rwork->wq, &rwork->work); > > |kfree_rcu_work() > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > |The "from_cset" is freed before new gp end. > > // the task is scheduled in after many ms. > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > --- > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 8e880c0..7b95ee9 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) > return !!READ_ONCE(krcp->head); > } > > +static bool > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) > +{ > + int i; > + > + for (i = 0; i < FREE_N_CHANNELS; i++) > + if (!list_empty(&krwp->bulk_head_free[i])) > + return true; > + > + return !!krwp->head_free; > +} > + > static int krc_count(struct kfree_rcu_cpu *krcp) > { > int sum = atomic_read(&krcp->head_count); > @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bulk_head or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || > - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only when > + // all channels are free. Any channel is not free means at krwp > + // there is on-going rcu work to handle krwp's free business. > + if (need_wait_for_krwp_work(krwp)) > + continue; > > + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. > + if (need_offload_krc(krcp)) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) { > -- > 1.9.1 > It looks correct to me. I will test it over weekend. -- Uladzislau Rezki
On Fri, Mar 31, 2023 at 05:01:52PM +0200, Uladzislau Rezki wrote: > On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote: > > In kfree_rcu_monitor(), new free business at krcp is attached to any free > > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > > business is handled after the rcu grace period. But if there is any none-free > > channel at krwp already, that means there is an on-going rcu work, > > which will cause the kvfree_call_rcu()-triggered free business is done > > before the wanted rcu grace period ends. > > > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > > to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > kvfree_call_rcu() losing effectiveness. > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > > the task is schedule out. > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > CPU 0 CPU 1 > > count_memcg_event_mm() > > |rcu_read_lock() <--- > > |mem_cgroup_from_task() > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > |css_set_ptr = rcu_dereference((task)->cgroups) > > |// Hard irq comes, current task is scheduled out. > > > > cgroup_attach_task() > > |cgroup_migrate() > > |cgroup_migrate_execute() > > |css_set_move_task(task, from_cset, to_cset, true) > > |cgroup_move_task(task, to_cset) > > |rcu_assign_pointer(.., to_cset) > > |... > > |cgroup_migrate_finish() > > |put_css_set_locked(from_cset) > > |from_cset->refcount return 0 > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > |add_ptr_to_bulk_krc_lock() > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > kfree_rcu_monitor() > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > // gp end, rcu_work_rcufn() is called. > > rcu_work_rcufn() > > |__queue_work(.., rwork->wq, &rwork->work); > > > > |kfree_rcu_work() > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > |The "from_cset" is freed before new gp end. > > > > // the task is scheduled in after many ms. > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > --- > > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > > 1 file changed, 19 insertions(+), 8 deletions(-) > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index 8e880c0..7b95ee9 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) > > return !!READ_ONCE(krcp->head); > > } > > > > +static bool > > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) > > +{ > > + int i; > > + > > + for (i = 0; i < FREE_N_CHANNELS; i++) > > + if (!list_empty(&krwp->bulk_head_free[i])) > > + return true; > > + > > + return !!krwp->head_free; > > +} > > + > > static int krc_count(struct kfree_rcu_cpu *krcp) > > { > > int sum = atomic_read(&krcp->head_count); > > @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > > for (i = 0; i < KFREE_N_BATCHES; i++) { > > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > > > - // Try to detach bulk_head or head and attach it over any > > - // available corresponding free channel. It can be that > > - // a previous RCU batch is in progress, it means that > > - // immediately to queue another one is not possible so > > - // in that case the monitor work is rearmed. > > - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || > > - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || > > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > > + // Try to detach bulk_head or head and attach it, only when > > + // all channels are free. Any channel is not free means at krwp > > + // there is on-going rcu work to handle krwp's free business. > > + if (need_wait_for_krwp_work(krwp)) > > + continue; > > > > + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. > > + if (need_offload_krc(krcp)) { > > // Channel 1 corresponds to the SLAB-pointer bulk path. > > // Channel 2 corresponds to vmalloc-pointer bulk path. > > for (j = 0; j < FREE_N_CHANNELS; j++) { > > -- > > 1.9.1 > > > It looks correct to me. I will test it over weekend. > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> -- Uladzislau Rezki
On 3/31/2023 6:12 PM, Ziwei Dai wrote: > In kfree_rcu_monitor(), new free business at krcp is attached to any free > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > business is handled after the rcu grace period. But if there is any none-free > channel at krwp already, that means there is an on-going rcu work, > which will cause the kvfree_call_rcu()-triggered free business is done > before the wanted rcu grace period ends. > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > to fix the issue that kvfree_call_rcu() loses effectiveness. > > Below is the css_set obj "from_cset" use-after-free case caused by > kvfree_call_rcu() losing effectiveness. > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > the task is schedule out. > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > CPU 0 CPU 1 > count_memcg_event_mm() > |rcu_read_lock() <--- > |mem_cgroup_from_task() > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > |css_set_ptr = rcu_dereference((task)->cgroups) > |// Hard irq comes, current task is scheduled out. > > cgroup_attach_task() > |cgroup_migrate() > |cgroup_migrate_execute() > |css_set_move_task(task, from_cset, to_cset, true) > |cgroup_move_task(task, to_cset) > |rcu_assign_pointer(.., to_cset) > |... > |cgroup_migrate_finish() > |put_css_set_locked(from_cset) > |from_cset->refcount return 0 > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > |add_ptr_to_bulk_krc_lock() > |schedule_delayed_work(&krcp->monitor_work, ..) > > kfree_rcu_monitor() > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > |queue_rcu_work(system_wq, &krwp->rcu_work) > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > // There is a perious call_rcu(.., rcu_work_rcufn) > // gp end, rcu_work_rcufn() is called. > rcu_work_rcufn() > |__queue_work(.., rwork->wq, &rwork->work); > > |kfree_rcu_work() > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > |The "from_cset" is freed before new gp end. > > // the task is scheduled in after many ms. > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > I too have observed this issue on 5.15, where i see below callstack. And i see different view of memory and one is this callstack and the other is, i see proper value of memcg from the ramdump. Looks to be different view of memory between rcu critical section. 119653.424434][T730913] binder: 30858:30913 ioctl c0306201 7633b56a78 returned -14 [119653.426095][T730918] Unable to handle kernel paging request at virtual address 000000034088e7be -000|mem_cgroup_disabled(inline) -000|__count_memcg_events([X0] memcg = 0x000000034088D9AE, [X1] idx = PGFAULT = 0x11, [X2] count = 0x1) | [X0] memcg = 0x000000034088D9AE | [X1] idx = PGFAULT = 0x11 | [X2] count = 0x1 | [locdesc] __ptr = 0x0 -001|arch_local_irq_restore(inline) | [X26] flags = 0x0 -001|count_memcg_events(inline) | [X26] flags = 0x0 -001|count_memcg_event_mm(inline) | [locdesc] idx = PGFAULT = 0x11 -001|do_handle_mm_fault([X24] vma = 0xFFFFFF89CC704870, [X20] address = 0x000000763395AA78, [X22] flags = | [X24] vma = 0xFFFFFF89CC704870 | [X20] address = 0x000000763395AA78 | [X22] flags = 0x0215 | [X23] seq = 0x0 | [X19] regs = 0xFFFFFFC05EA73AD0 | [locdesc] ret = 0x0 -002|handle_mm_fault(inline) | [X22] address = 0x000000763395AA78 | [X24] flags = 0x0215 | [X21] regs = 0xFFFFFFC05EA73AD0 -002|__do_page_fault(inline) | [X23] mm = 0xFFFFFF8812F65400 | [X22] addr = 0x000000763395AA78 | [X24] mm_flags = 0x0215 | [X27] vm_flags = 0x2 | [X21] regs = 0xFFFFFFC05EA73AD0 [119653.427011][T730918] __count_memcg_events+0x2c/0x274 [119653.427014][T730918] do_handle_mm_fault+0xe4/0x2c8 [119653.427017][T730918] do_page_fault+0x4c0/0x688 [119653.427022][T730918] do_translation_fault+0x48/0x64 [119653.427025][T730918] do_mem_abort+0x68/0x148 [119653.427028][T730918] el1_abort+0x40/0x64 [119653.427032][T730918] el1h_64_sync_handler+0x60/0xa0 [119653.427035][T730918] el1h_64_sync+0x7c/0x80 [119653.427037][T730918] __arch_copy_to_user+0x60/0x224 [119653.427040][T730918] binder_ioctl_write_read+0x28c/0x590 [119653.427046][T730918] binder_ioctl+0x1b0/0xf00 [119653.427049][T730918] __arm64_sys_ioctl+0x184/0x210 [119653.427052][T730918] invoke_syscall+0x60/0x150 [119653.427055][T730918] el0_svc_common+0x8c/0xf8 [119653.427057][T730918] do_el0_svc+0x28/0xa0 [119653.427059][T730918] el0_svc+0x24/0x84 [119653.427061][T730918] el0t_64_sync_handler+0x88/0xec [119653.427064][T730918] el0t_64_sync+0x1b4/0x1b8 [119653.427067][T730918] Code: a9044ff4 d503201f 71016c3f 54000d42 (f9470808) [119653.427069][T730918] ---[ end trace a3882cb531ca3dd0 ]--- Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> I will try to test this patch. Thanks, --Mukesh > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > --- > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 8e880c0..7b95ee9 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) > return !!READ_ONCE(krcp->head); > } > > +static bool > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) > +{ > + int i; > + > + for (i = 0; i < FREE_N_CHANNELS; i++) > + if (!list_empty(&krwp->bulk_head_free[i])) > + return true; > + > + return !!krwp->head_free; > +} > + > static int krc_count(struct kfree_rcu_cpu *krcp) > { > int sum = atomic_read(&krcp->head_count); > @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bulk_head or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || > - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only when > + // all channels are free. Any channel is not free means at krwp > + // there is on-going rcu work to handle krwp's free business. > + if (need_wait_for_krwp_work(krwp)) > + continue; > > + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. > + if (need_offload_krc(krcp)) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) {
On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote: > In kfree_rcu_monitor(), new free business at krcp is attached to any free > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > business is handled after the rcu grace period. But if there is any none-free > channel at krwp already, that means there is an on-going rcu work, > which will cause the kvfree_call_rcu()-triggered free business is done > before the wanted rcu grace period ends. > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > to fix the issue that kvfree_call_rcu() loses effectiveness. > > Below is the css_set obj "from_cset" use-after-free case caused by > kvfree_call_rcu() losing effectiveness. > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > the task is schedule out. > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > CPU 0 CPU 1 > count_memcg_event_mm() > |rcu_read_lock() <--- > |mem_cgroup_from_task() > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > |css_set_ptr = rcu_dereference((task)->cgroups) > |// Hard irq comes, current task is scheduled out. > > cgroup_attach_task() > |cgroup_migrate() > |cgroup_migrate_execute() > |css_set_move_task(task, from_cset, to_cset, true) > |cgroup_move_task(task, to_cset) > |rcu_assign_pointer(.., to_cset) > |... > |cgroup_migrate_finish() > |put_css_set_locked(from_cset) > |from_cset->refcount return 0 > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > |add_ptr_to_bulk_krc_lock() > |schedule_delayed_work(&krcp->monitor_work, ..) > > kfree_rcu_monitor() > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > |queue_rcu_work(system_wq, &krwp->rcu_work) > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > // There is a perious call_rcu(.., rcu_work_rcufn) > // gp end, rcu_work_rcufn() is called. > rcu_work_rcufn() > |__queue_work(.., rwork->wq, &rwork->work); > > |kfree_rcu_work() > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > |The "from_cset" is freed before new gp end. > > // the task is scheduled in after many ms. > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> Good catch, thank you!!! How difficult was this to trigger? If it can be triggered easily, this of course needs to go into mainline sooner rather than later. Longer term, would it make sense to run the three channels through RCU separately, in order to avoid one channel refraining from starting a grace period just because some other channel has callbacks waiting for a grace period to complete? One argument against might be energy efficiency, but perhaps the ->gp_snap field could be used to get the best of both worlds. Either way, this fixes only one bug of two. The second bug is in the kfree_rcu() tests, which should have caught this bug. Thoughts on a good fix for those tests? I have applied Uladzislau's and Mukesh's tags, and done the usual wordsmithing as shown at the end of this message. Please let me know if I messed anything up. > --- > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 8e880c0..7b95ee9 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) > return !!READ_ONCE(krcp->head); > } > > +static bool > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) > +{ > + int i; > + > + for (i = 0; i < FREE_N_CHANNELS; i++) > + if (!list_empty(&krwp->bulk_head_free[i])) > + return true; > + > + return !!krwp->head_free; This is fixed from v1, good! > +} > + > static int krc_count(struct kfree_rcu_cpu *krcp) > { > int sum = atomic_read(&krcp->head_count); > @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bulk_head or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || > - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only when > + // all channels are free. Any channel is not free means at krwp > + // there is on-going rcu work to handle krwp's free business. > + if (need_wait_for_krwp_work(krwp)) > + continue; > > + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. > + if (need_offload_krc(krcp)) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) { > -- > 1.9.1 ------------------------------------------------------------------------ commit e222f9a512539c3f4093a55d16624d9da614800b Author: Ziwei Dai <ziwei.dai@unisoc.com> Date: Fri Mar 31 20:42:09 2023 +0800 rcu: Avoid freeing new kfree_rcu() memory after old grace period Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU kfree_rcu_cpu structure, which in turn contains pointers to kvfree_rcu_bulk_data structures that contain pointers to memory that has not yet been handed to RCU, along with an kfree_rcu_cpu_work structure that tracks the memory that has already been handed to RCU. These structures track three categories of memory: (1) Memory for kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived during an OOM episode. The first two categories are tracked in a cache-friendly manner involving a dynamically allocated page of pointers (the aforementioned kvfree_rcu_bulk_data structures), while the third uses a simple (but decidedly cache-unfriendly) linked list through the rcu_head structures in each block of memory. On a given CPU, these three categories are handled as a unit, with that CPU's kfree_rcu_cpu_work structure having one pointer for each of the three categories. Clearly, new memory for a given category cannot be placed in the corresponding kfree_rcu_cpu_work structure until any old memory has had its grace period elapse and thus has been removed. And the kfree_rcu_monitor() function does in fact check for this. Except that the kfree_rcu_monitor() function checks these pointers one at a time. This means that if the previous kfree_rcu() memory passed to RCU had only category 1 and the current one has only category 2, the kfree_rcu_monitor() function will send that current category-2 memory along immediately. This can result in memory being freed too soon, that is, out from under unsuspecting RCU readers. To see this, consider the following sequence of events, in which: o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is preempted. o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset" after a later grace period. Except that "from_cset" is freed right after the previous grace period ended, so that "from_cset" is immediately freed. Task A resumes and references "from_cset"'s member, after which nothing good happens. In full detail: CPU 0 CPU 1 ---------------------- ---------------------- count_memcg_event_mm() |rcu_read_lock() <--- |mem_cgroup_from_task() |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq comes, current task is scheduled out. cgroup_attach_task() |cgroup_migrate() |cgroup_migrate_execute() |css_set_move_task(task, from_cset, to_cset, true) |cgroup_move_task(task, to_cset) |rcu_assign_pointer(.., to_cset) |... |cgroup_migrate_finish() |put_css_set_locked(from_cset) |from_cset->refcount return 0 |kfree_rcu(cset, rcu_head) // free from_cset after new gp |add_ptr_to_bulk_krc_lock() |schedule_delayed_work(&krcp->monitor_work, ..) kfree_rcu_monitor() |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] |queue_rcu_work(system_wq, &krwp->rcu_work) |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp // There is a perious call_rcu(.., rcu_work_rcufn) // gp end, rcu_work_rcufn() is called. rcu_work_rcufn() |__queue_work(.., rwork->wq, &rwork->work); |kfree_rcu_work() |krwp->bulk_head_free[0] bulk is freed before new gp end!!! |The "from_cset" is freed before new gp end. // the task resumes some time later. |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. This commit therefore causes kfree_rcu_monitor() to refrain from moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU grace period has completed for all three categories. v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 859ee02f6614..e2dbea6cee4b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp) return !!READ_ONCE(krcp->head); } +static bool +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) +{ + int i; + + for (i = 0; i < FREE_N_CHANNELS; i++) + if (!list_empty(&krwp->bulk_head_free[i])) + return true; + + return !!krwp->head_free; +} + static int krc_count(struct kfree_rcu_cpu *krcp) { int sum = atomic_read(&krcp->head_count); @@ -3134,15 +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work) for (i = 0; i < KFREE_N_BATCHES; i++) { struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); - // Try to detach bulk_head or head and attach it over any - // available corresponding free channel. It can be that - // a previous RCU batch is in progress, it means that - // immediately to queue another one is not possible so - // in that case the monitor work is rearmed. - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || - (READ_ONCE(krcp->head) && !krwp->head_free)) { + // Try to detach bulk_head or head and attach it, only when + // all channels are free. Any channel is not free means at krwp + // there is on-going rcu work to handle krwp's free business. + if (need_wait_for_krwp_work(krwp)) + continue; + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. + if (need_offload_krc(krcp)) { // Channel 1 corresponds to the SLAB-pointer bulk path. // Channel 2 corresponds to vmalloc-pointer bulk path. for (j = 0; j < FREE_N_CHANNELS; j++) {
Hello Paul! > -----邮件原件----- > 发件人: Paul E. McKenney <paulmck@kernel.org> > 发送时间: 2023年4月4日 6:58 > 收件人: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com> > 抄送: urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > josh@joshtriplett.org; rostedt@goodmis.org; > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; > joel@joelfernandes.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org; > 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) > <Yifan.Xin@unisoc.com>; 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 > (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) > <zhaoyang.huang@unisoc.com> > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after > the wanted rcu grace period. > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任 > 何链接和附件。 > CAUTION: This email originated from outside of the organization. Do not click > links or open attachments unless you recognize the sender and know the > content is safe. > > > > On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote: > > In kfree_rcu_monitor(), new free business at krcp is attached to any > > free channel at krwp. kfree_rcu_monitor() is responsible to make sure > > new free business is handled after the rcu grace period. But if there > > is any none-free channel at krwp already, that means there is an > > on-going rcu work, which will cause the kvfree_call_rcu()-triggered > > free business is done before the wanted rcu grace period ends. > > > > This commit ignore krwp which has non-free channel at > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses > effectiveness. > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > kvfree_call_rcu() losing effectiveness. > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > comes, the task is schedule out. > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new > gp. > > But "from_cset" is freed right after current gp end. "from_cset" is > reallocated. > > CPU 0 's task arrives back, references "from_cset"'s member, which causes > crash. > > > > CPU 0 CPU 1 > > count_memcg_event_mm() > > |rcu_read_lock() <--- > > |mem_cgroup_from_task() > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = > > rcu_dereference((task)->cgroups) |// Hard irq comes, current task is > > scheduled out. > > > > cgroup_attach_task() > > |cgroup_migrate() > > |cgroup_migrate_execute() > > |css_set_move_task(task, > from_cset, to_cset, true) > > |cgroup_move_task(task, > to_cset) > > |rcu_assign_pointer(.., to_cset) > > |... > > |cgroup_migrate_finish() > > > |put_css_set_locked(from_cset) > > |from_cset->refcount return 0 > > |kfree_rcu(cset, rcu_head) // > means to free from_cset after new gp > > |add_ptr_to_bulk_krc_lock() > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > kfree_rcu_monitor() > > |krcp->bulk_head[0]'s work > attached to krwp->bulk_head_free[] > > |queue_rcu_work(system_wq, > &krwp->rcu_work) > > |if rwork->rcu.work is not in > WORK_STRUCT_PENDING_BIT state, > > |call_rcu(&rwork->rcu, > > rcu_work_rcufn) <--- request a new gp > > > > // There is a perious call_rcu(.., > rcu_work_rcufn) > > // gp end, rcu_work_rcufn() is > called. > > rcu_work_rcufn() > > |__queue_work(.., rwork->wq, > > &rwork->work); > > > > |kfree_rcu_work() > > |krwp->bulk_head_free[0] bulk > is freed before new gp end!!! > > |The "from_cset" is freed > before new gp end. > > > > // the task is scheduled in after many ms. > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because > css_set_ptr is freed. > > > > v2: Use helper function instead of inserted code block at > kfree_rcu_monitor(). > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > kfree_rcu() work") > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > Good catch, thank you!!! > > How difficult was this to trigger? If it can be triggered easily, this of course > needs to go into mainline sooner rather than later. Roughly we can reproduce this issue within two rounds of 48h stress test, with 20 k5.15 devices. If KASAN is enabled, the reproduce rate is higher. So I think sooner is better. > > Longer term, would it make sense to run the three channels through RCU > separately, in order to avoid one channel refraining from starting a grace > period just because some other channel has callbacks waiting for a grace > period to complete? One argument against might be energy efficiency, but > perhaps the ->gp_snap field could be used to get the best of both worlds. I see kvfree_rcu_drain_ready(krcp) is already called at the beginning of kfree_rcu_monitor(), which polls the ->gp_snap field, to decide whether to free channel objects immediately or after gp. Both energy efficiency and timing seems be considered? > Either way, this fixes only one bug of two. The second bug is in the > kfree_rcu() tests, which should have caught this bug. Thoughts on a good fix > for those tests? I inserted a msleep() between "rcu_read_lock(), get pointer via rcu_dereference()" and "reference pointer, using the member", at the rcu scenario, then we can reproduce this issue very soon in stress test. Can kfree_rcu() tests insert msleep()? > I have applied Uladzislau's and Mukesh's tags, and done the usual > wordsmithing as shown at the end of this message. Please let me know if I > messed anything up. Thank you for the improvement on the patch! It seems better now. > > --- > > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > > 1 file changed, 19 insertions(+), 8 deletions(-) > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index > > 8e880c0..7b95ee9 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct > *work) > > return !!READ_ONCE(krcp->head); > > } > > > > +static bool > > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) { > > + int i; > > + > > + for (i = 0; i < FREE_N_CHANNELS; i++) > > + if (!list_empty(&krwp->bulk_head_free[i])) > > + return true; > > + > > + return !!krwp->head_free; > > This is fixed from v1, good! > > > +} > > + > > static int krc_count(struct kfree_rcu_cpu *krcp) { > > int sum = atomic_read(&krcp->head_count); @@ -3107,15 > +3119,14 > > @@ static void kfree_rcu_monitor(struct work_struct *work) > > for (i = 0; i < KFREE_N_BATCHES; i++) { > > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > > > - // Try to detach bulk_head or head and attach it over any > > - // available corresponding free channel. It can be that > > - // a previous RCU batch is in progress, it means that > > - // immediately to queue another one is not possible so > > - // in that case the monitor work is rearmed. > > - if ((!list_empty(&krcp->bulk_head[0]) && > list_empty(&krwp->bulk_head_free[0])) || > > - (!list_empty(&krcp->bulk_head[1]) && > list_empty(&krwp->bulk_head_free[1])) || > > - (READ_ONCE(krcp->head) > && !krwp->head_free)) { > > + // Try to detach bulk_head or head and attach it, only when > > + // all channels are free. Any channel is not free means at > krwp > > + // there is on-going rcu work to handle krwp's free > business. > > + if (need_wait_for_krwp_work(krwp)) > > + continue; > > > > + // kvfree_rcu_drain_ready() might handle this krcp, if so > give up. > > + if (need_offload_krc(krcp)) { > > // Channel 1 corresponds to the SLAB-pointer > bulk path. > > // Channel 2 corresponds to vmalloc-pointer bulk > path. > > for (j = 0; j < FREE_N_CHANNELS; j++) { > > -- > > 1.9.1 > > ------------------------------------------------------------------------ > > commit e222f9a512539c3f4093a55d16624d9da614800b > Author: Ziwei Dai <ziwei.dai@unisoc.com> > Date: Fri Mar 31 20:42:09 2023 +0800 > > rcu: Avoid freeing new kfree_rcu() memory after old grace period > > Memory passed to kvfree_rcu() that is to be freed is tracked by a > per-CPU kfree_rcu_cpu structure, which in turn contains pointers > to kvfree_rcu_bulk_data structures that contain pointers to memory > that has not yet been handed to RCU, along with an kfree_rcu_cpu_work > structure that tracks the memory that has already been handed to RCU. > These structures track three categories of memory: (1) Memory for > kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived > during an OOM episode. The first two categories are tracked in a > cache-friendly manner involving a dynamically allocated page of pointers > (the aforementioned kvfree_rcu_bulk_data structures), while the third > uses a simple (but decidedly cache-unfriendly) linked list through the > rcu_head structures in each block of memory. > > On a given CPU, these three categories are handled as a unit, with that > CPU's kfree_rcu_cpu_work structure having one pointer for each of the > three categories. Clearly, new memory for a given category cannot be > placed in the corresponding kfree_rcu_cpu_work structure until any old > memory has had its grace period elapse and thus has been removed. > And > the kfree_rcu_monitor() function does in fact check for this. > > Except that the kfree_rcu_monitor() function checks these pointers one > at a time. This means that if the previous kfree_rcu() memory passed > to RCU had only category 1 and the current one has only category 2, the > kfree_rcu_monitor() function will send that current category-2 memory > along immediately. This can result in memory being freed too soon, > that is, out from under unsuspecting RCU readers. > > To see this, consider the following sequence of events, in which: > > o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", > then is preempted. > > o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free > "from_cset" > after a later grace period. Except that "from_cset" is freed > right after the previous grace period ended, so that > "from_cset" > is immediately freed. Task A resumes and references > "from_cset"'s > member, after which nothing good happens. > > In full detail: > > CPU 0 CPU 1 > ---------------------- ---------------------- > count_memcg_event_mm() > |rcu_read_lock() <--- > |mem_cgroup_from_task() > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > |css_set_ptr = rcu_dereference((task)->cgroups) > |// Hard irq comes, current task is scheduled out. > > cgroup_attach_task() > |cgroup_migrate() > > |cgroup_migrate_execute() > |css_set_move_task(task, > from_cset, to_cset, true) > |cgroup_move_task(task, > to_cset) > |rcu_assign_pointer(.., > to_cset) > |... > |cgroup_migrate_finish() > > |put_css_set_locked(from_cset) > |from_cset->refcount > return 0 > |kfree_rcu(cset, rcu_head) > // free from_cset after new gp > > |add_ptr_to_bulk_krc_lock() > > |schedule_delayed_work(&krcp->monitor_work, ..) > > kfree_rcu_monitor() > |krcp->bulk_head[0]'s > work attached to krwp->bulk_head_free[] > > |queue_rcu_work(system_wq, &krwp->rcu_work) > |if rwork->rcu.work is not > in WORK_STRUCT_PENDING_BIT state, > |call_rcu(&rwork->rcu, > rcu_work_rcufn) <--- request new gp > > // There is a perious > call_rcu(.., rcu_work_rcufn) > // gp end, > rcu_work_rcufn() is called. > rcu_work_rcufn() > |__queue_work(.., > rwork->wq, &rwork->work); > > |kfree_rcu_work() > |krwp->bulk_head_free[0] > bulk is freed before new gp end!!! > |The "from_cset" is freed > before new gp end. > > // the task resumes some time later. > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because > css_set_ptr is freed. > > This commit therefore causes kfree_rcu_monitor() to refrain from > moving > kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU > grace period has completed for all three categories. > > v2: Use helper function instead of inserted code block at > kfree_rcu_monitor(). > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() > work") > Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index > 859ee02f6614..e2dbea6cee4b 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp) > return !!READ_ONCE(krcp->head); > } > > +static bool > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) { > + int i; > + > + for (i = 0; i < FREE_N_CHANNELS; i++) > + if (!list_empty(&krwp->bulk_head_free[i])) > + return true; > + > + return !!krwp->head_free; > +} > + > static int krc_count(struct kfree_rcu_cpu *krcp) { > int sum = atomic_read(&krcp->head_count); @@ -3134,15 > +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bulk_head or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((!list_empty(&krcp->bulk_head[0]) && > list_empty(&krwp->bulk_head_free[0])) || > - (!list_empty(&krcp->bulk_head[1]) && > list_empty(&krwp->bulk_head_free[1])) || > - (READ_ONCE(krcp->head) > && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only > when > + // all channels are free. Any channel is not free means at > krwp > + // there is on-going rcu work to handle krwp's free > business. > + if (need_wait_for_krwp_work(krwp)) > + continue; > > + // kvfree_rcu_drain_ready() might handle this krcp, if so > give up. > + if (need_offload_krc(krcp)) { > // Channel 1 corresponds to the SLAB-pointer > bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk > path. > for (j = 0; j < FREE_N_CHANNELS; j++) {
On Tue, Apr 04, 2023 at 02:49:15AM +0000, 代子为 (Ziwei Dai) wrote: > Hello Paul! > > > -----邮件原件----- > > 发件人: Paul E. McKenney <paulmck@kernel.org> > > 发送时间: 2023年4月4日 6:58 > > 收件人: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com> > > 抄送: urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > > josh@joshtriplett.org; rostedt@goodmis.org; > > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; > > joel@joelfernandes.org; rcu@vger.kernel.org; linux-kernel@vger.kernel.org; > > 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) > > <Yifan.Xin@unisoc.com>; 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 > > (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) > > <zhaoyang.huang@unisoc.com> > > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after > > the wanted rcu grace period. > > > > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任 > > 何链接和附件。 > > CAUTION: This email originated from outside of the organization. Do not click > > links or open attachments unless you recognize the sender and know the > > content is safe. > > > > > > > > On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote: > > > In kfree_rcu_monitor(), new free business at krcp is attached to any > > > free channel at krwp. kfree_rcu_monitor() is responsible to make sure > > > new free business is handled after the rcu grace period. But if there > > > is any none-free channel at krwp already, that means there is an > > > on-going rcu work, which will cause the kvfree_call_rcu()-triggered > > > free business is done before the wanted rcu grace period ends. > > > > > > This commit ignore krwp which has non-free channel at > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses > > effectiveness. > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > kvfree_call_rcu() losing effectiveness. > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > > comes, the task is schedule out. > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new > > gp. > > > But "from_cset" is freed right after current gp end. "from_cset" is > > reallocated. > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes > > crash. > > > > > > CPU 0 CPU 1 > > > count_memcg_event_mm() > > > |rcu_read_lock() <--- > > > |mem_cgroup_from_task() > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = > > > rcu_dereference((task)->cgroups) |// Hard irq comes, current task is > > > scheduled out. > > > > > > cgroup_attach_task() > > > |cgroup_migrate() > > > |cgroup_migrate_execute() > > > |css_set_move_task(task, > > from_cset, to_cset, true) > > > |cgroup_move_task(task, > > to_cset) > > > |rcu_assign_pointer(.., to_cset) > > > |... > > > |cgroup_migrate_finish() > > > > > |put_css_set_locked(from_cset) > > > |from_cset->refcount return 0 > > > |kfree_rcu(cset, rcu_head) // > > means to free from_cset after new gp > > > |add_ptr_to_bulk_krc_lock() > > > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > kfree_rcu_monitor() > > > |krcp->bulk_head[0]'s work > > attached to krwp->bulk_head_free[] > > > |queue_rcu_work(system_wq, > > &krwp->rcu_work) > > > |if rwork->rcu.work is not in > > WORK_STRUCT_PENDING_BIT state, > > > |call_rcu(&rwork->rcu, > > > rcu_work_rcufn) <--- request a new gp > > > > > > // There is a perious call_rcu(.., > > rcu_work_rcufn) > > > // gp end, rcu_work_rcufn() is > > called. > > > rcu_work_rcufn() > > > |__queue_work(.., rwork->wq, > > > &rwork->work); > > > > > > |kfree_rcu_work() > > > |krwp->bulk_head_free[0] bulk > > is freed before new gp end!!! > > > |The "from_cset" is freed > > before new gp end. > > > > > > // the task is scheduled in after many ms. > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because > > css_set_ptr is freed. > > > > > > v2: Use helper function instead of inserted code block at > > kfree_rcu_monitor(). > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > > kfree_rcu() work") > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > Good catch, thank you!!! > > > > How difficult was this to trigger? If it can be triggered easily, this of course > > needs to go into mainline sooner rather than later. > > Roughly we can reproduce this issue within two rounds of 48h stress test, > with 20 k5.15 devices. If KASAN is enabled, the reproduce rate is higher. > So I think sooner is better. Thank you for the info! This is in theory an old bug, but if you can easily find out, does it trigger for you on v6.2 or earlier? > > Longer term, would it make sense to run the three channels through RCU > > separately, in order to avoid one channel refraining from starting a grace > > period just because some other channel has callbacks waiting for a grace > > period to complete? One argument against might be energy efficiency, but > > perhaps the ->gp_snap field could be used to get the best of both worlds. > > I see kvfree_rcu_drain_ready(krcp) is already called at the beginning of > kfree_rcu_monitor(), which polls the ->gp_snap field, to decide > whether to free channel objects immediately or after gp. > Both energy efficiency and timing seems be considered? My concern is that running the channels separately might mean more grace periods (and thus more energy draw) on nearly idle devices, such devices usually being the ones for which energy efficiency matters most. But perhaps Vlad, Neeraj, or Joel has some insight on this, given that they are the ones working on battery-powered devices. > > Either way, this fixes only one bug of two. The second bug is in the > > kfree_rcu() tests, which should have caught this bug. Thoughts on a good fix > > for those tests? > > I inserted a msleep() between "rcu_read_lock(), get pointer via rcu_dereference()" > and "reference pointer, using the member", at the rcu scenario, then we can > reproduce this issue very soon in stress test. Can kfree_rcu() tests insert msleep()? Another approach is to separate concerns, so that readers interact with grace periods in the rcutorture.c tests, and to add the interaction of to-be-freed memory with grace periods in the rcuscale kvfree tests. I took a step in this direction with this commit on the -rcu tree's "dev" branch: efbe7927f479 ("rcu/kvfree: Add debug to check grace periods") Given this, might it be possible to make rcuscale.c's kfree_rcu() testing create patterns of usage of the three channels so as to catch this bug that way? > > I have applied Uladzislau's and Mukesh's tags, and done the usual > > wordsmithing as shown at the end of this message. Please let me know if I > > messed anything up. > > Thank you for the improvement on the patch! It seems better now. No problem and thank you again for the debugging and the fix! Thanx, Paul > > > --- > > > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > > > 1 file changed, 19 insertions(+), 8 deletions(-) > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index > > > 8e880c0..7b95ee9 100644 > > > --- a/kernel/rcu/tree.c > > > +++ b/kernel/rcu/tree.c > > > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct > > *work) > > > return !!READ_ONCE(krcp->head); > > > } > > > > > > +static bool > > > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) { > > > + int i; > > > + > > > + for (i = 0; i < FREE_N_CHANNELS; i++) > > > + if (!list_empty(&krwp->bulk_head_free[i])) > > > + return true; > > > + > > > + return !!krwp->head_free; > > > > This is fixed from v1, good! > > > > > +} > > > + > > > static int krc_count(struct kfree_rcu_cpu *krcp) { > > > int sum = atomic_read(&krcp->head_count); @@ -3107,15 > > +3119,14 > > > @@ static void kfree_rcu_monitor(struct work_struct *work) > > > for (i = 0; i < KFREE_N_BATCHES; i++) { > > > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > > > > > - // Try to detach bulk_head or head and attach it over any > > > - // available corresponding free channel. It can be that > > > - // a previous RCU batch is in progress, it means that > > > - // immediately to queue another one is not possible so > > > - // in that case the monitor work is rearmed. > > > - if ((!list_empty(&krcp->bulk_head[0]) && > > list_empty(&krwp->bulk_head_free[0])) || > > > - (!list_empty(&krcp->bulk_head[1]) && > > list_empty(&krwp->bulk_head_free[1])) || > > > - (READ_ONCE(krcp->head) > > && !krwp->head_free)) { > > > + // Try to detach bulk_head or head and attach it, only when > > > + // all channels are free. Any channel is not free means at > > krwp > > > + // there is on-going rcu work to handle krwp's free > > business. > > > + if (need_wait_for_krwp_work(krwp)) > > > + continue; > > > > > > + // kvfree_rcu_drain_ready() might handle this krcp, if so > > give up. > > > + if (need_offload_krc(krcp)) { > > > // Channel 1 corresponds to the SLAB-pointer > > bulk path. > > > // Channel 2 corresponds to vmalloc-pointer bulk > > path. > > > for (j = 0; j < FREE_N_CHANNELS; j++) { > > > -- > > > 1.9.1 > > > > ------------------------------------------------------------------------ > > > > commit e222f9a512539c3f4093a55d16624d9da614800b > > Author: Ziwei Dai <ziwei.dai@unisoc.com> > > Date: Fri Mar 31 20:42:09 2023 +0800 > > > > rcu: Avoid freeing new kfree_rcu() memory after old grace period > > > > Memory passed to kvfree_rcu() that is to be freed is tracked by a > > per-CPU kfree_rcu_cpu structure, which in turn contains pointers > > to kvfree_rcu_bulk_data structures that contain pointers to memory > > that has not yet been handed to RCU, along with an kfree_rcu_cpu_work > > structure that tracks the memory that has already been handed to RCU. > > These structures track three categories of memory: (1) Memory for > > kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived > > during an OOM episode. The first two categories are tracked in a > > cache-friendly manner involving a dynamically allocated page of pointers > > (the aforementioned kvfree_rcu_bulk_data structures), while the third > > uses a simple (but decidedly cache-unfriendly) linked list through the > > rcu_head structures in each block of memory. > > > > On a given CPU, these three categories are handled as a unit, with that > > CPU's kfree_rcu_cpu_work structure having one pointer for each of the > > three categories. Clearly, new memory for a given category cannot be > > placed in the corresponding kfree_rcu_cpu_work structure until any old > > memory has had its grace period elapse and thus has been removed. > > And > > the kfree_rcu_monitor() function does in fact check for this. > > > > Except that the kfree_rcu_monitor() function checks these pointers one > > at a time. This means that if the previous kfree_rcu() memory passed > > to RCU had only category 1 and the current one has only category 2, the > > kfree_rcu_monitor() function will send that current category-2 memory > > along immediately. This can result in memory being freed too soon, > > that is, out from under unsuspecting RCU readers. > > > > To see this, consider the following sequence of events, in which: > > > > o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", > > then is preempted. > > > > o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free > > "from_cset" > > after a later grace period. Except that "from_cset" is freed > > right after the previous grace period ended, so that > > "from_cset" > > is immediately freed. Task A resumes and references > > "from_cset"'s > > member, after which nothing good happens. > > > > In full detail: > > > > CPU 0 CPU 1 > > ---------------------- ---------------------- > > count_memcg_event_mm() > > |rcu_read_lock() <--- > > |mem_cgroup_from_task() > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > |css_set_ptr = rcu_dereference((task)->cgroups) > > |// Hard irq comes, current task is scheduled out. > > > > cgroup_attach_task() > > |cgroup_migrate() > > > > |cgroup_migrate_execute() > > |css_set_move_task(task, > > from_cset, to_cset, true) > > |cgroup_move_task(task, > > to_cset) > > |rcu_assign_pointer(.., > > to_cset) > > |... > > |cgroup_migrate_finish() > > > > |put_css_set_locked(from_cset) > > |from_cset->refcount > > return 0 > > |kfree_rcu(cset, rcu_head) > > // free from_cset after new gp > > > > |add_ptr_to_bulk_krc_lock() > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > kfree_rcu_monitor() > > |krcp->bulk_head[0]'s > > work attached to krwp->bulk_head_free[] > > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > |if rwork->rcu.work is not > > in WORK_STRUCT_PENDING_BIT state, > > |call_rcu(&rwork->rcu, > > rcu_work_rcufn) <--- request new gp > > > > // There is a perious > > call_rcu(.., rcu_work_rcufn) > > // gp end, > > rcu_work_rcufn() is called. > > rcu_work_rcufn() > > |__queue_work(.., > > rwork->wq, &rwork->work); > > > > |kfree_rcu_work() > > |krwp->bulk_head_free[0] > > bulk is freed before new gp end!!! > > |The "from_cset" is freed > > before new gp end. > > > > // the task resumes some time later. > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because > > css_set_ptr is freed. > > > > This commit therefore causes kfree_rcu_monitor() to refrain from > > moving > > kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU > > grace period has completed for all three categories. > > > > v2: Use helper function instead of inserted code block at > > kfree_rcu_monitor(). > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() > > work") > > Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> > > Signed-off-by: Paul E. McKenney <paulmck@kernel.org> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index > > 859ee02f6614..e2dbea6cee4b 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp) > > return !!READ_ONCE(krcp->head); > > } > > > > +static bool > > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) { > > + int i; > > + > > + for (i = 0; i < FREE_N_CHANNELS; i++) > > + if (!list_empty(&krwp->bulk_head_free[i])) > > + return true; > > + > > + return !!krwp->head_free; > > +} > > + > > static int krc_count(struct kfree_rcu_cpu *krcp) { > > int sum = atomic_read(&krcp->head_count); @@ -3134,15 > > +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > > for (i = 0; i < KFREE_N_BATCHES; i++) { > > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > > > - // Try to detach bulk_head or head and attach it over any > > - // available corresponding free channel. It can be that > > - // a previous RCU batch is in progress, it means that > > - // immediately to queue another one is not possible so > > - // in that case the monitor work is rearmed. > > - if ((!list_empty(&krcp->bulk_head[0]) && > > list_empty(&krwp->bulk_head_free[0])) || > > - (!list_empty(&krcp->bulk_head[1]) && > > list_empty(&krwp->bulk_head_free[1])) || > > - (READ_ONCE(krcp->head) > > && !krwp->head_free)) { > > + // Try to detach bulk_head or head and attach it, only > > when > > + // all channels are free. Any channel is not free means at > > krwp > > + // there is on-going rcu work to handle krwp's free > > business. > > + if (need_wait_for_krwp_work(krwp)) > > + continue; > > > > + // kvfree_rcu_drain_ready() might handle this krcp, if so > > give up. > > + if (need_offload_krc(krcp)) { > > // Channel 1 corresponds to the SLAB-pointer > > bulk path. > > // Channel 2 corresponds to vmalloc-pointer bulk > > path. > > for (j = 0; j < FREE_N_CHANNELS; j++) {
On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > In kfree_rcu_monitor(), new free business at krcp is attached to any free > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > business is handled after the rcu grace period. But if there is any none-free > channel at krwp already, that means there is an on-going rcu work, > which will cause the kvfree_call_rcu()-triggered free business is done > before the wanted rcu grace period ends. > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > to fix the issue that kvfree_call_rcu() loses effectiveness. > > Below is the css_set obj "from_cset" use-after-free case caused by > kvfree_call_rcu() losing effectiveness. > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > the task is schedule out. > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > CPU 0 CPU 1 > count_memcg_event_mm() > |rcu_read_lock() <--- > |mem_cgroup_from_task() > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > |css_set_ptr = rcu_dereference((task)->cgroups) > |// Hard irq comes, current task is scheduled out. > > cgroup_attach_task() > |cgroup_migrate() > |cgroup_migrate_execute() > |css_set_move_task(task, from_cset, to_cset, true) > |cgroup_move_task(task, to_cset) > |rcu_assign_pointer(.., to_cset) > |... > |cgroup_migrate_finish() > |put_css_set_locked(from_cset) > |from_cset->refcount return 0 > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > |add_ptr_to_bulk_krc_lock() > |schedule_delayed_work(&krcp->monitor_work, ..) > > kfree_rcu_monitor() > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > |queue_rcu_work(system_wq, &krwp->rcu_work) > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > // There is a perious call_rcu(.., rcu_work_rcufn) > // gp end, rcu_work_rcufn() is called. > rcu_work_rcufn() > |__queue_work(.., rwork->wq, &rwork->work); > > |kfree_rcu_work() > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > |The "from_cset" is freed before new gp end. > > // the task is scheduled in after many ms. > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> Please update the fixes tag to: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") The issue happened when 5f3c8d620447 started looking at multiple channels at the same time in the same work handler function. I think a better fix might be to separate out the work handler functions for each channel separately. That way we get more parallelism. but since this is urgent, Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> thanks, - Joel > --- > kernel/rcu/tree.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > index 8e880c0..7b95ee9 100644 > --- a/kernel/rcu/tree.c > +++ b/kernel/rcu/tree.c > @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) > return !!READ_ONCE(krcp->head); > } > > +static bool > +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) > +{ > + int i; > + > + for (i = 0; i < FREE_N_CHANNELS; i++) > + if (!list_empty(&krwp->bulk_head_free[i])) > + return true; > + > + return !!krwp->head_free; > +} > + > static int krc_count(struct kfree_rcu_cpu *krcp) > { > int sum = atomic_read(&krcp->head_count); > @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) > for (i = 0; i < KFREE_N_BATCHES; i++) { > struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); > > - // Try to detach bulk_head or head and attach it over any > - // available corresponding free channel. It can be that > - // a previous RCU batch is in progress, it means that > - // immediately to queue another one is not possible so > - // in that case the monitor work is rearmed. > - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || > - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || > - (READ_ONCE(krcp->head) && !krwp->head_free)) { > + // Try to detach bulk_head or head and attach it, only when > + // all channels are free. Any channel is not free means at krwp > + // there is on-going rcu work to handle krwp's free business. > + if (need_wait_for_krwp_work(krwp)) > + continue; > > + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. > + if (need_offload_krc(krcp)) { > // Channel 1 corresponds to the SLAB-pointer bulk path. > // Channel 2 corresponds to vmalloc-pointer bulk path. > for (j = 0; j < FREE_N_CHANNELS; j++) { > -- > 1.9.1 >
On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > In kfree_rcu_monitor(), new free business at krcp is attached to any free > > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > > business is handled after the rcu grace period. But if there is any none-free > > channel at krwp already, that means there is an on-going rcu work, > > which will cause the kvfree_call_rcu()-triggered free business is done > > before the wanted rcu grace period ends. > > > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > > to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > kvfree_call_rcu() losing effectiveness. > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > > the task is schedule out. > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > CPU 0 CPU 1 > > count_memcg_event_mm() > > |rcu_read_lock() <--- > > |mem_cgroup_from_task() > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > |css_set_ptr = rcu_dereference((task)->cgroups) > > |// Hard irq comes, current task is scheduled out. > > > > cgroup_attach_task() > > |cgroup_migrate() > > |cgroup_migrate_execute() > > |css_set_move_task(task, from_cset, to_cset, true) > > |cgroup_move_task(task, to_cset) > > |rcu_assign_pointer(.., to_cset) > > |... > > |cgroup_migrate_finish() > > |put_css_set_locked(from_cset) > > |from_cset->refcount return 0 > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > |add_ptr_to_bulk_krc_lock() > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > kfree_rcu_monitor() > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > // gp end, rcu_work_rcufn() is called. > > rcu_work_rcufn() > > |__queue_work(.., rwork->wq, &rwork->work); > > > > |kfree_rcu_work() > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > |The "from_cset" is freed before new gp end. > > > > // the task is scheduled in after many ms. > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > Please update the fixes tag to: > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") Vlad pointed out in another thread that the fix is actually to 34c881745549. So just to be sure, it could be updated to: Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()") Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") thanks, - Joel
On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote: > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > > > In kfree_rcu_monitor(), new free business at krcp is attached to any free > > > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free > > > business is handled after the rcu grace period. But if there is any none-free > > > channel at krwp already, that means there is an on-going rcu work, > > > which will cause the kvfree_call_rcu()-triggered free business is done > > > before the wanted rcu grace period ends. > > > > > > This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), > > > to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > kvfree_call_rcu() losing effectiveness. > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, > > > the task is schedule out. > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > > > CPU 0 CPU 1 > > > count_memcg_event_mm() > > > |rcu_read_lock() <--- > > > |mem_cgroup_from_task() > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > > |css_set_ptr = rcu_dereference((task)->cgroups) > > > |// Hard irq comes, current task is scheduled out. > > > > > > cgroup_attach_task() > > > |cgroup_migrate() > > > |cgroup_migrate_execute() > > > |css_set_move_task(task, from_cset, to_cset, true) > > > |cgroup_move_task(task, to_cset) > > > |rcu_assign_pointer(.., to_cset) > > > |... > > > |cgroup_migrate_finish() > > > |put_css_set_locked(from_cset) > > > |from_cset->refcount return 0 > > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > > |add_ptr_to_bulk_krc_lock() > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > kfree_rcu_monitor() > > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > > |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp > > > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > > // gp end, rcu_work_rcufn() is called. > > > rcu_work_rcufn() > > > |__queue_work(.., rwork->wq, &rwork->work); > > > > > > |kfree_rcu_work() > > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > > |The "from_cset" is freed before new gp end. > > > > > > // the task is scheduled in after many ms. > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > Please update the fixes tag to: > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") > > Vlad pointed out in another thread that the fix is actually to 34c881745549. > > So just to be sure, it could be updated to: > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()") > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") Ziwei Dai, does this change in Fixes look good to you? If so, I will update the commit log in this commit that I am planning to submit into v6.3. It is strictly speaking not a v6.3 regression, but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix. Thanx, Paul
> -----邮件原件----- > 发件人: Paul E. McKenney <paulmck@kernel.org> > 发送时间: 2023年4月6日 2:46 > 收件人: Joel Fernandes <joel@joelfernandes.org> > 抄送: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > josh@joshtriplett.org; rostedt@goodmis.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; rcu@vger.kernel.org; > linux-kernel@vger.kernel.org; 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>; 王科 > (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) <zhaoyang.huang@unisoc.com> > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任何链接和附件。 > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the > sender and know the content is safe. > > > > On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote: > > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > > > > > In kfree_rcu_monitor(), new free business at krcp is attached to > > > > any free channel at krwp. kfree_rcu_monitor() is responsible to > > > > make sure new free business is handled after the rcu grace period. > > > > But if there is any none-free channel at krwp already, that means > > > > there is an on-going rcu work, which will cause the > > > > kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends. > > > > > > > > This commit ignore krwp which has non-free channel at > > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > > kvfree_call_rcu() losing effectiveness. > > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > > > comes, the task is schedule out. > > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > > > > > CPU 0 CPU 1 > > > > count_memcg_event_mm() > > > > |rcu_read_lock() <--- > > > > |mem_cgroup_from_task() > > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > > > |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq > > > > comes, current task is scheduled out. > > > > > > > > cgroup_attach_task() > > > > |cgroup_migrate() > > > > |cgroup_migrate_execute() > > > > |css_set_move_task(task, from_cset, to_cset, true) > > > > |cgroup_move_task(task, to_cset) > > > > |rcu_assign_pointer(.., to_cset) > > > > |... > > > > |cgroup_migrate_finish() > > > > |put_css_set_locked(from_cset) > > > > |from_cset->refcount return 0 > > > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > > > |add_ptr_to_bulk_krc_lock() > > > > > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > > > kfree_rcu_monitor() > > > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > > > |call_rcu(&rwork->rcu, > > > > rcu_work_rcufn) <--- request a new gp > > > > > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > > > // gp end, rcu_work_rcufn() is called. > > > > rcu_work_rcufn() > > > > |__queue_work(.., > > > > rwork->wq, &rwork->work); > > > > > > > > |kfree_rcu_work() > > > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > > > |The "from_cset" is freed before new gp end. > > > > > > > > // the task is scheduled in after many ms. > > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > > > kfree_rcu() work") > > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > > > Please update the fixes tag to: > > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") > > > > Vlad pointed out in another thread that the fix is actually to 34c881745549. > > > > So just to be sure, it could be updated to: > > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in > > kfree_rcu()") > > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc > > ptrs") > > Ziwei Dai, does this change in Fixes look good to you? > > If so, I will update the commit log in this commit that I am planning to submit into v6.3. It is strictly speaking not a v6.3 regression, > but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix. > > Thanx, Paul Hi Paul, it looks good to me and thanks!
On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote: > > > > -----邮件原件----- > > 发件人: Paul E. McKenney <paulmck@kernel.org> > > 发送时间: 2023年4月6日 2:46 > > 收件人: Joel Fernandes <joel@joelfernandes.org> > > 抄送: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > > josh@joshtriplett.org; rostedt@goodmis.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; rcu@vger.kernel.org; > > linux-kernel@vger.kernel.org; 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>; 王科 > > (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) <zhaoyang.huang@unisoc.com> > > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. > > > > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任何链接和附件。 > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the > > sender and know the content is safe. > > > > > > > > On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote: > > > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > > > > > > > In kfree_rcu_monitor(), new free business at krcp is attached to > > > > > any free channel at krwp. kfree_rcu_monitor() is responsible to > > > > > make sure new free business is handled after the rcu grace period. > > > > > But if there is any none-free channel at krwp already, that means > > > > > there is an on-going rcu work, which will cause the > > > > > kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends. > > > > > > > > > > This commit ignore krwp which has non-free channel at > > > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > > > kvfree_call_rcu() losing effectiveness. > > > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > > > > comes, the task is schedule out. > > > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > > > > > > > CPU 0 CPU 1 > > > > > count_memcg_event_mm() > > > > > |rcu_read_lock() <--- > > > > > |mem_cgroup_from_task() > > > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > > > > |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq > > > > > comes, current task is scheduled out. > > > > > > > > > > cgroup_attach_task() > > > > > |cgroup_migrate() > > > > > |cgroup_migrate_execute() > > > > > |css_set_move_task(task, from_cset, to_cset, true) > > > > > |cgroup_move_task(task, to_cset) > > > > > |rcu_assign_pointer(.., to_cset) > > > > > |... > > > > > |cgroup_migrate_finish() > > > > > |put_css_set_locked(from_cset) > > > > > |from_cset->refcount return 0 > > > > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > > > > |add_ptr_to_bulk_krc_lock() > > > > > > > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > > > > > kfree_rcu_monitor() > > > > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > > > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > > > > |call_rcu(&rwork->rcu, > > > > > rcu_work_rcufn) <--- request a new gp > > > > > > > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > > > > // gp end, rcu_work_rcufn() is called. > > > > > rcu_work_rcufn() > > > > > |__queue_work(.., > > > > > rwork->wq, &rwork->work); > > > > > > > > > > |kfree_rcu_work() > > > > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > > > > |The "from_cset" is freed before new gp end. > > > > > > > > > > // the task is scheduled in after many ms. > > > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > > > > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > > > > kfree_rcu() work") > > > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > > > > > Please update the fixes tag to: > > > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") > > > > > > Vlad pointed out in another thread that the fix is actually to 34c881745549. > > > > > > So just to be sure, it could be updated to: > > > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in > > > kfree_rcu()") > > > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc > > > ptrs") > > > > Ziwei Dai, does this change in Fixes look good to you? > > > > If so, I will update the commit log in this commit that I am planning to submit into v6.3. It is strictly speaking not a v6.3 regression, > > but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix. > > > > Thanx, Paul > > Hi Paul, it looks good to me and thanks! Thank you, and I will fix on my next rebase. Thanx, Paul
On Wed, Apr 05, 2023 at 08:46:15PM -0700, Paul E. McKenney wrote: > On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote: > > > > > > > -----邮件原件----- > > > 发件人: Paul E. McKenney <paulmck@kernel.org> > > > 发送时间: 2023年4月6日 2:46 > > > 收件人: Joel Fernandes <joel@joelfernandes.org> > > > 抄送: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > > > josh@joshtriplett.org; rostedt@goodmis.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; rcu@vger.kernel.org; > > > linux-kernel@vger.kernel.org; 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>; 王科 > > > (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > > > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) <zhaoyang.huang@unisoc.com> > > > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. > > > > > > > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任何链接和附件。 > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the > > > sender and know the content is safe. > > > > > > > > > > > > On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote: > > > > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > > > > > > > > > In kfree_rcu_monitor(), new free business at krcp is attached to > > > > > > any free channel at krwp. kfree_rcu_monitor() is responsible to > > > > > > make sure new free business is handled after the rcu grace period. > > > > > > But if there is any none-free channel at krwp already, that means > > > > > > there is an on-going rcu work, which will cause the > > > > > > kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends. > > > > > > > > > > > > This commit ignore krwp which has non-free channel at > > > > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > > > > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > > > > kvfree_call_rcu() losing effectiveness. > > > > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > > > > > comes, the task is schedule out. > > > > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > > > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > > > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > count_memcg_event_mm() > > > > > > |rcu_read_lock() <--- > > > > > > |mem_cgroup_from_task() > > > > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > > > > > |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq > > > > > > comes, current task is scheduled out. > > > > > > > > > > > > cgroup_attach_task() > > > > > > |cgroup_migrate() > > > > > > |cgroup_migrate_execute() > > > > > > |css_set_move_task(task, from_cset, to_cset, true) > > > > > > |cgroup_move_task(task, to_cset) > > > > > > |rcu_assign_pointer(.., to_cset) > > > > > > |... > > > > > > |cgroup_migrate_finish() > > > > > > |put_css_set_locked(from_cset) > > > > > > |from_cset->refcount return 0 > > > > > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > > > > > |add_ptr_to_bulk_krc_lock() > > > > > > > > > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > > > > > > > kfree_rcu_monitor() > > > > > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > > > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > > > > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > > > > > |call_rcu(&rwork->rcu, > > > > > > rcu_work_rcufn) <--- request a new gp > > > > > > > > > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > > > > > // gp end, rcu_work_rcufn() is called. > > > > > > rcu_work_rcufn() > > > > > > |__queue_work(.., > > > > > > rwork->wq, &rwork->work); > > > > > > > > > > > > |kfree_rcu_work() > > > > > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > > > > > |The "from_cset" is freed before new gp end. > > > > > > > > > > > > // the task is scheduled in after many ms. > > > > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > > > > > > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > > > > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > > > > > kfree_rcu() work") > > > > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > > > > > > > Please update the fixes tag to: > > > > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") > > > > > > > > Vlad pointed out in another thread that the fix is actually to 34c881745549. > > > > > > > > So just to be sure, it could be updated to: > > > > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in > > > > kfree_rcu()") > > > > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc > > > > ptrs") > > > > > > Ziwei Dai, does this change in Fixes look good to you? > > > > > > If so, I will update the commit log in this commit that I am planning to submit into v6.3. It is strictly speaking not a v6.3 regression, > > > but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix. > > > > > > Thanx, Paul > > > > Hi Paul, it looks good to me and thanks! > > Thank you, and I will fix on my next rebase. > After heavy testing over night i do not see that any warnings are triggered: Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> -- Uladzislau Rezki
On Thu, Apr 06, 2023 at 06:44:15AM +0200, Uladzislau Rezki wrote: > On Wed, Apr 05, 2023 at 08:46:15PM -0700, Paul E. McKenney wrote: > > On Thu, Apr 06, 2023 at 01:38:09AM +0000, 代子为 (Ziwei Dai) wrote: > > > > > > > > > > -----邮件原件----- > > > > 发件人: Paul E. McKenney <paulmck@kernel.org> > > > > 发送时间: 2023年4月6日 2:46 > > > > 收件人: Joel Fernandes <joel@joelfernandes.org> > > > > 抄送: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; urezki@gmail.com; frederic@kernel.org; quic_neeraju@quicinc.com; > > > > josh@joshtriplett.org; rostedt@goodmis.org; mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com; rcu@vger.kernel.org; > > > > linux-kernel@vger.kernel.org; 王双 (Shuang Wang) <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>; 王科 > > > > (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan) <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) > > > > <Zhiguo.Niu@unisoc.com>; 黄朝阳 (Zhaoyang Huang) <zhaoyang.huang@unisoc.com> > > > > 主题: Re: [PATCH V2] rcu: Make sure new krcp free business is handled after the wanted rcu grace period. > > > > > > > > > > > > 注意: 这封邮件来自于外部。除非你确定邮件内容安全,否则不要点击任何链接和附件。 > > > > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the > > > > sender and know the content is safe. > > > > > > > > > > > > > > > > On Wed, Apr 05, 2023 at 02:12:02PM -0400, Joel Fernandes wrote: > > > > > On Wed, Apr 5, 2023 at 1:39 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > > > > > > > > > On Fri, Mar 31, 2023 at 8:43 AM Ziwei Dai <ziwei.dai@unisoc.com> wrote: > > > > > > > > > > > > > > In kfree_rcu_monitor(), new free business at krcp is attached to > > > > > > > any free channel at krwp. kfree_rcu_monitor() is responsible to > > > > > > > make sure new free business is handled after the rcu grace period. > > > > > > > But if there is any none-free channel at krwp already, that means > > > > > > > there is an on-going rcu work, which will cause the > > > > > > > kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends. > > > > > > > > > > > > > > This commit ignore krwp which has non-free channel at > > > > > > > kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness. > > > > > > > > > > > > > > Below is the css_set obj "from_cset" use-after-free case caused by > > > > > > > kvfree_call_rcu() losing effectiveness. > > > > > > > CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq > > > > > > > comes, the task is schedule out. > > > > > > > CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. > > > > > > > But "from_cset" is freed right after current gp end. "from_cset" is reallocated. > > > > > > > CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. > > > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > > count_memcg_event_mm() > > > > > > > |rcu_read_lock() <--- > > > > > > > |mem_cgroup_from_task() > > > > > > > |// css_set_ptr is the "from_cset" mentioned on CPU 1 > > > > > > > |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq > > > > > > > comes, current task is scheduled out. > > > > > > > > > > > > > > cgroup_attach_task() > > > > > > > |cgroup_migrate() > > > > > > > |cgroup_migrate_execute() > > > > > > > |css_set_move_task(task, from_cset, to_cset, true) > > > > > > > |cgroup_move_task(task, to_cset) > > > > > > > |rcu_assign_pointer(.., to_cset) > > > > > > > |... > > > > > > > |cgroup_migrate_finish() > > > > > > > |put_css_set_locked(from_cset) > > > > > > > |from_cset->refcount return 0 > > > > > > > |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp > > > > > > > |add_ptr_to_bulk_krc_lock() > > > > > > > > > > > > > > |schedule_delayed_work(&krcp->monitor_work, ..) > > > > > > > > > > > > > > kfree_rcu_monitor() > > > > > > > |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] > > > > > > > |queue_rcu_work(system_wq, &krwp->rcu_work) > > > > > > > |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, > > > > > > > |call_rcu(&rwork->rcu, > > > > > > > rcu_work_rcufn) <--- request a new gp > > > > > > > > > > > > > > // There is a perious call_rcu(.., rcu_work_rcufn) > > > > > > > // gp end, rcu_work_rcufn() is called. > > > > > > > rcu_work_rcufn() > > > > > > > |__queue_work(.., > > > > > > > rwork->wq, &rwork->work); > > > > > > > > > > > > > > |kfree_rcu_work() > > > > > > > |krwp->bulk_head_free[0] bulk is freed before new gp end!!! > > > > > > > |The "from_cset" is freed before new gp end. > > > > > > > > > > > > > > // the task is scheduled in after many ms. > > > > > > > |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. > > > > > > > > > > > > > > v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). > > > > > > > > > > > > > > Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of > > > > > > > kfree_rcu() work") > > > > > > > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> > > > > > > > > > > > > Please update the fixes tag to: > > > > > > 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs") > > > > > > > > > > Vlad pointed out in another thread that the fix is actually to 34c881745549. > > > > > > > > > > So just to be sure, it could be updated to: > > > > > Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in > > > > > kfree_rcu()") > > > > > Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc > > > > > ptrs") > > > > > > > > Ziwei Dai, does this change in Fixes look good to you? > > > > > > > > If so, I will update the commit log in this commit that I am planning to submit into v6.3. It is strictly speaking not a v6.3 regression, > > > > but it is starting to show up in the wild and the patch is contained enough to be considered an urgent fix. > > > > > > > > Thanx, Paul > > > > > > Hi Paul, it looks good to me and thanks! > > > > Thank you, and I will fix on my next rebase. > > > After heavy testing over night i do not see that any warnings > are triggered: > > Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Thank you very much!!! I will apply this during my upcoming rebase. Thanx, Paul
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8e880c0..7b95ee9 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work) return !!READ_ONCE(krcp->head); } +static bool +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp) +{ + int i; + + for (i = 0; i < FREE_N_CHANNELS; i++) + if (!list_empty(&krwp->bulk_head_free[i])) + return true; + + return !!krwp->head_free; +} + static int krc_count(struct kfree_rcu_cpu *krcp) { int sum = atomic_read(&krcp->head_count); @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work) for (i = 0; i < KFREE_N_BATCHES; i++) { struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]); - // Try to detach bulk_head or head and attach it over any - // available corresponding free channel. It can be that - // a previous RCU batch is in progress, it means that - // immediately to queue another one is not possible so - // in that case the monitor work is rearmed. - if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) || - (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) || - (READ_ONCE(krcp->head) && !krwp->head_free)) { + // Try to detach bulk_head or head and attach it, only when + // all channels are free. Any channel is not free means at krwp + // there is on-going rcu work to handle krwp's free business. + if (need_wait_for_krwp_work(krwp)) + continue; + // kvfree_rcu_drain_ready() might handle this krcp, if so give up. + if (need_offload_krc(krcp)) { // Channel 1 corresponds to the SLAB-pointer bulk path. // Channel 2 corresponds to vmalloc-pointer bulk path. for (j = 0; j < FREE_N_CHANNELS; j++) {
In kfree_rcu_monitor(), new free business at krcp is attached to any free channel at krwp. kfree_rcu_monitor() is responsible to make sure new free business is handled after the rcu grace period. But if there is any none-free channel at krwp already, that means there is an on-going rcu work, which will cause the kvfree_call_rcu()-triggered free business is done before the wanted rcu grace period ends. This commit ignore krwp which has non-free channel at kfree_rcu_monitor(), to fix the issue that kvfree_call_rcu() loses effectiveness. Below is the css_set obj "from_cset" use-after-free case caused by kvfree_call_rcu() losing effectiveness. CPU 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes, the task is schedule out. CPU 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp. But "from_cset" is freed right after current gp end. "from_cset" is reallocated. CPU 0 's task arrives back, references "from_cset"'s member, which causes crash. CPU 0 CPU 1 count_memcg_event_mm() |rcu_read_lock() <--- |mem_cgroup_from_task() |// css_set_ptr is the "from_cset" mentioned on CPU 1 |css_set_ptr = rcu_dereference((task)->cgroups) |// Hard irq comes, current task is scheduled out. cgroup_attach_task() |cgroup_migrate() |cgroup_migrate_execute() |css_set_move_task(task, from_cset, to_cset, true) |cgroup_move_task(task, to_cset) |rcu_assign_pointer(.., to_cset) |... |cgroup_migrate_finish() |put_css_set_locked(from_cset) |from_cset->refcount return 0 |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp |add_ptr_to_bulk_krc_lock() |schedule_delayed_work(&krcp->monitor_work, ..) kfree_rcu_monitor() |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[] |queue_rcu_work(system_wq, &krwp->rcu_work) |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state, |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp // There is a perious call_rcu(.., rcu_work_rcufn) // gp end, rcu_work_rcufn() is called. rcu_work_rcufn() |__queue_work(.., rwork->wq, &rwork->work); |kfree_rcu_work() |krwp->bulk_head_free[0] bulk is freed before new gp end!!! |The "from_cset" is freed before new gp end. // the task is scheduled in after many ms. |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed. v2: Use helper function instead of inserted code block at kfree_rcu_monitor(). Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work") Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com> --- kernel/rcu/tree.c | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-)