Message ID | 20220915055825.21525-2-kernelfans@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | rcu: Enhance the capability to cope with concurrent cpu offlining/onlining |
Hi Paul,

If you pick up this one from my V1, could you drop it? I have rephrased the
commit log and the comment in the code. Sorry for the inconvenience.

On Thu, Sep 15, 2022 at 1:58 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> rcutree_online_cpu() is concurrent, so there is the following race
> scenario:
>
>         CPU 1                               CPU 2
>     mask_old = rcu_rnp_online_cpus(rnp);
>     ...
>                                         mask_new = rcu_rnp_online_cpus(rnp);
>                                         ...
>                                         set_cpus_allowed_ptr(t, cm);
>     set_cpus_allowed_ptr(t, cm);
>
> Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.
>
> Since there is a mutex ->boost_kthread_mutex, use it to build an
> order, so that the latest ->qsmaskinitnext is fetched when updating cpus_ptr.
>
> But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
> of rcutree_offline_cpu(). That is another story and comes in the following
> patch.
>
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> To: rcu@vger.kernel.org
> ---
>  kernel/rcu/tree_plugin.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 438ecae6bd7e..ef6d3ae239b9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>  {
>         struct task_struct *t = rnp->boost_kthread_task;
> -       unsigned long mask = rcu_rnp_online_cpus(rnp);
> +       unsigned long mask;
>         cpumask_var_t cm;
>         int cpu;
>
> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>         if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>                 return;
>         mutex_lock(&rnp->boost_kthread_mutex);
> +       /*
> +        * Relying on the lock to serialize, so when onlining, the latest
> +        * qsmaskinitnext is for cpus_ptr.
> +        */

In the comment, I put the accent on 'onlining', which is different from V1.

Thanks,

Pingfan

> +       mask = rcu_rnp_online_cpus(rnp);
>         for_each_leaf_node_possible_cpu(rnp, cpu)
>                 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>                     cpu != outgoingcpu)
> --
> 2.31.1
>
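[Editor's aside] For readers who want to see the lost-update pattern from the
commit log in isolation, here is a minimal userspace sketch. It is not kernel
code and not part of the patch: fake_qsmaskinitnext, applied_mask and
online_one_cpu() are illustrative stand-ins, under the simplifying assumption
that the affinity update just installs whatever mask its caller sampled.
Build with, e.g., gcc -O2 -pthread.

/* Two "onlining CPUs" each publish their bit, then propagate the online
 * mask to a shared affinity setting.  The racy variant samples the mask
 * before serializing; the fixed variant samples it under the mutex, which
 * is what moving rcu_rnp_online_cpus() below mutex_lock() corresponds to. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong fake_qsmaskinitnext;   /* stand-in for rnp->qsmaskinitnext */
static unsigned long applied_mask;         /* stand-in for the kthread's cpus_ptr */
static pthread_mutex_t boost_mutex = PTHREAD_MUTEX_INITIALIZER;
static int use_fix;

static void *online_one_cpu(void *arg)
{
	unsigned long bit = 1UL << (long)arg;
	unsigned long mask = 0;

	atomic_fetch_or(&fake_qsmaskinitnext, bit);  /* publish "CPU is online" */

	if (!use_fix)                 /* racy: sample before taking the mutex */
		mask = atomic_load(&fake_qsmaskinitnext);

	pthread_mutex_lock(&boost_mutex);
	if (use_fix)                  /* fixed: sample while holding the mutex */
		mask = atomic_load(&fake_qsmaskinitnext);
	applied_mask = mask;          /* stand-in for set_cpus_allowed_ptr() */
	pthread_mutex_unlock(&boost_mutex);
	return NULL;
}

int main(void)
{
	for (use_fix = 0; use_fix <= 1; use_fix++) {
		long lost = 0;

		for (int i = 0; i < 10000; i++) {
			pthread_t t1, t2;

			atomic_store(&fake_qsmaskinitnext, 0);
			applied_mask = 0;
			pthread_create(&t1, NULL, online_one_cpu, (void *)1L);
			pthread_create(&t2, NULL, online_one_cpu, (void *)2L);
			pthread_join(t1, NULL);
			pthread_join(t2, NULL);
			/* The last applier should have seen both online bits. */
			if (applied_mask != atomic_load(&fake_qsmaskinitnext))
				lost++;
		}
		printf("%s: %ld lost update(s) in 10000 runs\n",
		       use_fix ? "fixed" : "racy", lost);
	}
	return 0;
}

On a multi-core machine the racy variant usually reports a non-zero count while
the fixed variant reports zero, which is the effect the patch aims for by
reading rcu_rnp_online_cpus() only after ->boost_kthread_mutex is held.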
On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> rcutree_online_cpu() is concurrent, so there is the following race
> scenario:
[...]
> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>         if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>                 return;
>         mutex_lock(&rnp->boost_kthread_mutex);
> +       /*
> +        * Relying on the lock to serialize, so when onlining, the latest
> +        * qsmaskinitnext is for cpus_ptr.
> +        */
> +       mask = rcu_rnp_online_cpus(rnp);
>         for_each_leaf_node_possible_cpu(rnp, cpu)
>                 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>                     cpu != outgoingcpu)

Right but you still race against concurrent rcu_report_dead() doing:

        WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)

Thanks.
On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
[...]
> >         mutex_lock(&rnp->boost_kthread_mutex);
> > +       /*
> > +        * Relying on the lock to serialize, so when onlining, the latest
> > +        * qsmaskinitnext is for cpus_ptr.
> > +        */
> > +       mask = rcu_rnp_online_cpus(rnp);
> >         for_each_leaf_node_possible_cpu(rnp, cpu)
> >                 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> >                     cpu != outgoingcpu)
>
> Right but you still race against concurrent rcu_report_dead() doing:
>
>         WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
>

Ah. Indeed, my commit log is not precisely worded.

In fact, either speedup smp_init [1] or fast kexec reboot [2] still
uses the model: one hotplug initiator and several reactors. The
initiators are still excluded from each other by the pair
cpu_maps_update_begin()/cpu_maps_update_done(). So there cannot be a
cpu-up and a cpu-down event at the same time.

Thanks,

Pingfan
On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> <frederic@kernel.org> wrote:
[...]
> > Right but you still race against concurrent rcu_report_dead() doing:
> >
> >         WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> >
>
> Ah. Indeed, my commit log is not precisely worded.
>
> In fact, either speedup smp_init [1] or fast kexec reboot [2] still
> uses the model: one hotplug initiator and several reactors. The
> initiators are still excluded from each other by the pair
> cpu_maps_update_begin()/cpu_maps_update_done(). So there cannot be a
> cpu-up and a cpu-down event at the same time.

Yes, but two downing CPUs may race, right?

So isn't it possible to have rcu_report_dead() racing against
rcutree_offline_cpu()?

Thanks.
On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
> On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> > On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> > <frederic@kernel.org> wrote:
[...]
> > > Right but you still race against concurrent rcu_report_dead() doing:
> > >
> > >         WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> > >
> >
> > Ah. Indeed, my commit log is not precisely worded.
> >
> > In fact, either speedup smp_init [1] or fast kexec reboot [2] still
> > uses the model: one hotplug initiator and several reactors. The
> > initiators are still excluded from each other by the pair
> > cpu_maps_update_begin()/cpu_maps_update_done(). So there cannot be a
> > cpu-up and a cpu-down event at the same time.
>
> Yes, but two downing CPUs may race, right?
>
> So isn't it possible to have rcu_report_dead() racing against
> rcutree_offline_cpu()?
>

Yes, two downing cpus can be in a batch.

There is a nuance between onlining and offlining:
-1. onlining relies on qsmaskinitnext, which is stable at the point of
    rcutree_online_cpu(). And it is what this patch aims for.
-2. offlining relies on cpu_dying_mask [2/3], instead of qsmaskinitnext,
    which is not cleared yet at the point of rcutree_offline_cpu().

It is asymmetric owing to the point at which qsmaskinitnext is updated.

Now considering the race between rcu_report_dead() and
rcutree_offline_cpu():

No matter whether rcutree_offline_cpu() reads out a stale or a fresh value
of qsmaskinitnext, it is intersected with cpu_dying_mask.

Thanks,

Pingfan
On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
> On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
[...]
> > Yes, but two downing CPUs may race, right?
> >
> > So isn't it possible to have rcu_report_dead() racing against
> > rcutree_offline_cpu()?
> >
>
> Yes, two downing cpus can be in a batch.
>
> There is a nuance between onlining and offlining:
> -1. onlining relies on qsmaskinitnext, which is stable at the point of
>     rcutree_online_cpu(). And it is what this patch aims for.
> -2. offlining relies on cpu_dying_mask [2/3], instead of qsmaskinitnext,
>     which is not cleared yet at the point of rcutree_offline_cpu().
>
> It is asymmetric owing to the point at which qsmaskinitnext is updated.
>
> Now considering the race between rcu_report_dead() and
> rcutree_offline_cpu():
>
> No matter whether rcutree_offline_cpu() reads out a stale or a fresh value
> of qsmaskinitnext, it is intersected with cpu_dying_mask.

Ah, I see now, so the race can happen but it is then cancelled by the
stable READ on cpu_dying_mask. Ok.

> Thanks,
>
> Pingfan
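[Editor's aside] The set arithmetic behind this argument can be made explicit
with a toy illustration. It assumes, as described above for patch 2/3, that
the offlining path filters the boost kthread's affinity with the complement of
cpu_dying_mask; the exact masking in that patch is not shown in this thread,
so treat the snippet as a sketch of the reasoning rather than of the real code.

#include <stdio.h>

int main(void)
{
	/* CPUs 2 and 3 are going down in this batch; cpu_dying_mask is stable. */
	unsigned long dying      = 0x0cUL;
	/* Stale snapshot: rcu_report_dead() has not cleared their bits yet. */
	unsigned long online_old = 0x0fUL;
	/* Fresh snapshot: the bits are already cleared. */
	unsigned long online_new = 0x03UL;

	/* Affinity candidates once the dying CPUs are filtered out. */
	printf("stale snapshot: %#lx\n", online_old & ~dying);  /* 0x3 */
	printf("fresh snapshot: %#lx\n", online_new & ~dying);  /* 0x3 */
	return 0;
}

Either snapshot yields the same candidate set, so the downing CPUs never end
up in the boost kthread's affinity regardless of which value was observed.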
On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> rcutree_online_cpu() is concurrent, so there is the following race
> scenario:
>
>         CPU 1                               CPU 2
>     mask_old = rcu_rnp_online_cpus(rnp);
>     ...
>                                         mask_new = rcu_rnp_online_cpus(rnp);
>                                         ...
>                                         set_cpus_allowed_ptr(t, cm);
>     set_cpus_allowed_ptr(t, cm);

Just for clarity, may I suggest:

      CPU 0                                          CPU 1
      -----                                          -----
      rcutree_online_cpu()
          rcu_boost_kthread_setaffinity()
              mask = rcu_rnp_online_cpus(rnp);
                                                     rcu_cpu_starting()
                                                         WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
                                                     rcutree_online_cpu()
                                                         rcu_boost_kthread_setaffinity()
                                                             mask = rcu_rnp_online_cpus(rnp);
                                                             set_cpus_allowed_ptr(t, cm);
              set_cpus_allowed_ptr(t, cm);

> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>         if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>                 return;
>         mutex_lock(&rnp->boost_kthread_mutex);
> +       /*
> +        * Relying on the lock to serialize, so when onlining, the latest
> +        * qsmaskinitnext is for cpus_ptr.
> +        */

Given how tricky the situation is, I suggest to also add more details to the
comment:

	/*
	 * Rely on the boost mutex to serialize ->qsmaskinitnext and
	 * set_cpus_allowed_ptr(), so when concurrently onlining, racing against
	 * rcu_cpu_starting() makes sure that the latest update eventually
	 * wins with the right affinity setting.
	 */

Thanks!

> +       mask = rcu_rnp_online_cpus(rnp);
>         for_each_leaf_node_possible_cpu(rnp, cpu)
>                 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>                     cpu != outgoingcpu)
> --
> 2.31.1
>
On Tue, Sep 20, 2022 at 12:31:43PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
[...]
> Just for clarity, may I suggest:
>
>       CPU 0                                          CPU 1
>       -----                                          -----
>       rcutree_online_cpu()
>           rcu_boost_kthread_setaffinity()
>               mask = rcu_rnp_online_cpus(rnp);
>                                                      rcu_cpu_starting()
>                                                          WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>                                                      rcutree_online_cpu()
>                                                          rcu_boost_kthread_setaffinity()
>                                                              mask = rcu_rnp_online_cpus(rnp);
>                                                              set_cpus_allowed_ptr(t, cm);
>               set_cpus_allowed_ptr(t, cm);
>

OK.

> Given how tricky the situation is, I suggest to also add more details to the
> comment:
>
> /*
>  * Rely on the boost mutex to serialize ->qsmaskinitnext and
>  * set_cpus_allowed_ptr(), so when concurrently onlining, racing against
>  * rcu_cpu_starting() makes sure that the latest update eventually
>  * wins with the right affinity setting.
>  */
>

Appreciate your rephrasing. I will use it in the next version.

Thanks,

Pingfan
On 9/20/22 05:23, Frederic Weisbecker wrote:
> On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
>> On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
[...]
>>> Yes, but two downing CPUs may race, right?
>>>
>>> So isn't it possible to have rcu_report_dead() racing against
>>> rcutree_offline_cpu()?
>>
>> Yes, two downing cpus can be in a batch.
>>
>> There is a nuance between onlining and offlining:
>> -1. onlining relies on qsmaskinitnext, which is stable at the point of
>>     rcutree_online_cpu(). And it is what this patch aims for.
>> -2. offlining relies on cpu_dying_mask [2/3], instead of qsmaskinitnext,
>>     which is not cleared yet at the point of rcutree_offline_cpu().
>>
>> It is asymmetric owing to the point at which qsmaskinitnext is updated.
>>
>> Now considering the race between rcu_report_dead() and
>> rcutree_offline_cpu():
>>
>> No matter whether rcutree_offline_cpu() reads out a stale or a fresh value
>> of qsmaskinitnext, it is intersected with cpu_dying_mask.
>
> Ah, I see now, so the race can happen but it is then cancelled by the
> stable READ on cpu_dying_mask. Ok.
>

Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
of the same node can corrupt ->qsmaskinitnext as Frederic showed. That
may not be a problem for affinities (due to the cpu_dying_mask), but is
that not an issue for RCU operation itself? I.e., due to the corruption,
->qsmaskinitnext is in an inconsistent state, causing *RCU* issues. Or is
there a reason that that is not an issue?

thanks,

- Joel
On Sat, Oct 1, 2022 at 10:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On 9/20/22 05:23, Frederic Weisbecker wrote:
> > On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
[...]
> > Ah, I see now, so the race can happen but it is then cancelled by the
> > stable READ on cpu_dying_mask. Ok.
> >
>
> Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
> of the same node can corrupt ->qsmaskinitnext as Frederic showed. That
> may not be a problem for affinities (due to the cpu_dying_mask), but is
> that not an issue for RCU operation itself? I.e., due to the corruption,
> ->qsmaskinitnext is in an inconsistent state, causing *RCU* issues. Or is
> there a reason that that is not an issue?

There is a pair of lock ops around
WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask), i.e.
raw_spin_lock_irqsave_rcu_node(rnp, flags) and
raw_spin_unlock_irqrestore_rcu_node(rnp, flags), so it should not be a
problem, right?

Thanks,

Pingfan
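[Editor's aside] The serialization Pingfan points to can be mimicked in a
self-contained userspace sketch: several "dying CPUs" clear their own bit in a
shared word, and because each read-modify-write happens under a lock, no clear
is lost. A pthread mutex stands in for the rcu_node ->lock here; the names are
analogues chosen for illustration, not kernel APIs. Build with gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>

static unsigned long qsmaskinitnext = 0xffUL;       /* 8 "online" CPUs */
static pthread_mutex_t rnp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Rough analogue of rcu_report_dead(): clear this CPU's bit under the lock. */
static void *report_dead(void *arg)
{
	unsigned long mask = 1UL << (long)arg;

	pthread_mutex_lock(&rnp_lock);      /* ~ raw_spin_lock_irqsave_rcu_node() */
	qsmaskinitnext &= ~mask;            /* the serialized read-modify-write */
	pthread_mutex_unlock(&rnp_lock);    /* ~ raw_spin_unlock_irqrestore_rcu_node() */
	return NULL;
}

int main(void)
{
	pthread_t t[4];

	/* Four CPUs of the same leaf node going down in one batch. */
	for (long cpu = 4; cpu < 8; cpu++)
		pthread_create(&t[cpu - 4], NULL, report_dead, (void *)cpu);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);

	/* No clear is ever lost: 0xff with bits 4-7 removed is always 0x0f. */
	printf("qsmaskinitnext = %#lx\n", qsmaskinitnext);
	return 0;
}

Without the lock, two concurrent read-modify-writes could interleave and one
CPU's clear could be lost, which is the corruption Joel is asking about; with
the lock, the final value here is always 0x0f.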
On 10/2/2022 8:34 AM, Pingfan Liu wrote:
> On Sat, Oct 1, 2022 at 10:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>> On 9/20/22 05:23, Frederic Weisbecker wrote:
>>> On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
[...]
>> Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
>> of the same node can corrupt ->qsmaskinitnext as Frederic showed. That
>> may not be a problem for affinities (due to the cpu_dying_mask), but is
>> that not an issue for RCU operation itself? I.e., due to the corruption,
>> ->qsmaskinitnext is in an inconsistent state, causing *RCU* issues. Or is
>> there a reason that that is not an issue?
>
> There is a pair of lock ops around
> WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask), i.e.
> raw_spin_lock_irqsave_rcu_node(rnp, flags) and
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags), so it should not be a
> problem, right?

Correct, I think it should not be a problem. I was thrown off a bit by the
comment "Right but you still race against concurrent rcu_report_dead()".

Even with concurrent offlining, a stable ->qsmaskinitnext should ensure
that the hotplugged CPUs are properly considered in rcu_gp_init(), so I
don't see a problem with your patch so far... but I'll continue to dig
when you send the next patch.

My fear was that if a bit in ->qsmaskinitnext is not cleared during an
offlining race, then the grace period will hang.

thanks,

- Joel
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 438ecae6bd7e..ef6d3ae239b9 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 {
 	struct task_struct *t = rnp->boost_kthread_task;
-	unsigned long mask = rcu_rnp_online_cpus(rnp);
+	unsigned long mask;
 	cpumask_var_t cm;
 	int cpu;
 
@@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 	if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
 		return;
 	mutex_lock(&rnp->boost_kthread_mutex);
+	/*
+	 * Relying on the lock to serialize, so when onlining, the latest
+	 * qsmaskinitnext is for cpus_ptr.
+	 */
+	mask = rcu_rnp_online_cpus(rnp);
 	for_each_leaf_node_possible_cpu(rnp, cpu)
 		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
 		    cpu != outgoingcpu)
rcutree_online_cpu() is concurrent, so there is the following race
scenario:

        CPU 1                               CPU 2
    mask_old = rcu_rnp_online_cpus(rnp);
    ...
                                        mask_new = rcu_rnp_online_cpus(rnp);
                                        ...
                                        set_cpus_allowed_ptr(t, cm);
    set_cpus_allowed_ptr(t, cm);

Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.

Since there is a mutex ->boost_kthread_mutex, use it to build an
order, so that the latest ->qsmaskinitnext is fetched when updating cpus_ptr.

But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
of rcutree_offline_cpu(). That is another story and comes in the following
patch.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: rcu@vger.kernel.org
---
 kernel/rcu/tree_plugin.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)