Message ID | 20230818015244.1176929-1-ming.lei@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [V2] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly |
Hi,

On 2023/8/18 09:52, Ming Lei wrote:
> group_cpus_evenly() could be part of a storage driver's error handler
> (the nvme driver, for example), which may run during CPU hotplug, where
> the storage queue has to drain its pending IOs because all CPUs
> associated with the queue are offline and the queue is becoming
> inactive. Handling that IO needs the error handler to provide forward
> progress.
>
> Then a deadlock is caused:
>
> 1) inside the CPU hotplug handler, the CPU hotplug lock is held, and
> blk-mq's handler is waiting for inflight IO
>
> 2) the error handler is waiting for the CPU hotplug lock
>
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler,
> because error handling can't provide forward progress.
>
> Solve the deadlock by not holding the CPU hotplug lock in
> group_cpus_evenly(), where the spread is done in two stages: 1) the
> 1st stage is over all present CPUs; 2) the 2nd stage is over all other
> CPUs.
>
> It turns out the two-stage spread just needs a consistent
> 'cpu_present_mask', so remove the CPU hotplug lock by storing the mask
> into one local cache. This doesn't change correctness, because all
> CPUs are still covered.
>
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: linux-nvme@lists.infradead.org
> Cc: linux-block@vger.kernel.org
> Reported-by: Yi Zhang <yi.zhang@redhat.com>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Tested-by: Guangwu Zhang <guazhang@redhat.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> 	- fix "Cc: block list"
> 	- add tested-by tag
>
>  lib/group_cpus.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..15006e79196f 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -348,6 +348,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  {
>  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
>  	cpumask_var_t *node_to_cpumask;
> +	cpumask_var_t local_cpu_present_mask;
>  	cpumask_var_t nmsk, npresmsk;
>  	int ret = -ENOMEM;
>  	struct cpumask *masks = NULL;
> @@ -355,6 +356,16 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
>  		return NULL;
>
> +	if (!zalloc_cpumask_var(&local_cpu_present_mask, GFP_KERNEL))
> +		goto fail_local_pres_mask;
> +
> +	/*
> +	 * Make a local cache of 'cpu_present_mask', so the two stages
> +	 * spread can observe consistent 'cpu_present_mask' without holding
> +	 * cpu hotplug lock.
> +	 */
> +	cpumask_copy(local_cpu_present_mask, cpu_present_mask);
> +

Maybe we can reuse npresmsk instead of allocating another cpumask?

In the first stage:  npresmsk = cpu_present_mask
In the second stage: npresmsk = cpu_possible_mask & ~npresmsk

>  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
>  		goto fail_nmsk;
>
> @@ -366,13 +377,11 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!masks)
>  		goto fail_node_to_cpumask;
>
> -	/* Stabilize the cpumasks */
> -	cpus_read_lock();
>  	build_node_to_cpumask(node_to_cpumask);
>
>  	/* grouping present CPUs first */
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> -				  cpu_present_mask, nmsk, masks);
> +				  local_cpu_present_mask, nmsk, masks);
>  	if (ret < 0)
>  		goto fail_build_affinity;
>  	nr_present = ret;
> @@ -387,15 +396,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  		curgrp = 0;
>  	else
>  		curgrp = nr_present;
> -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> +	cpumask_andnot(npresmsk, cpu_possible_mask, local_cpu_present_mask);
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>  				  npresmsk, nmsk, masks);
>  	if (ret >= 0)
>  		nr_others = ret;
>
>  fail_build_affinity:
> -	cpus_read_unlock();
> -
>  	if (ret >= 0)
>  		WARN_ON(nr_present + nr_others < numgrps);

This fail_build_affinity label seems unneeded now.

The patch looks good to me:

Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>

Thanks.

>
> @@ -406,6 +413,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	free_cpumask_var(npresmsk);
>
>  fail_nmsk:
> +	free_cpumask_var(local_cpu_present_mask);
> +
> + fail_local_pres_mask:
>  	free_cpumask_var(nmsk);
>  	if (ret < 0) {
>  		kfree(masks);
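To make the suggestion concrete, here is a minimal sketch of the reuse inside group_cpus_evenly(), using the variable names from the patch above (npresmsk, nmsk, masks, node_to_cpumask, curgrp, numgrps, nr_present, nr_others); it illustrates the idea only and is not the patch as actually posted:

	/*
	 * Sketch: snapshot cpu_present_mask into npresmsk once, so both
	 * stages see the same set without holding the CPU hotplug lock
	 * and without allocating a second cpumask.
	 */
	cpumask_copy(npresmsk, cpu_present_mask);

	/* stage 1: group the (snapshotted) present CPUs first */
	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);
	if (ret < 0)
		goto fail_build_affinity;
	nr_present = ret;

	/* rotate the starting group between the two stages, as in the patch */
	if (nr_present >= numgrps)
		curgrp = 0;
	else
		curgrp = nr_present;

	/*
	 * stage 2: flip the same mask to the remaining possible CPUs;
	 * the bitmap helpers allow the destination to alias a source.
	 */
	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
				  npresmsk, nmsk, masks);
	if (ret >= 0)
		nr_others = ret;

The trade-off is one fewer allocation and one fewer error label, at the cost of npresmsk doing double duty across the two stages.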
On Fri, Aug 18, 2023 at 02:59:13PM +0800, Chengming Zhou wrote:
> Hi,
>
> On 2023/8/18 09:52, Ming Lei wrote:
> > group_cpus_evenly() could be part of a storage driver's error handler
> > (the nvme driver, for example), which may run during CPU hotplug,
> > where the storage queue has to drain its pending IOs because all CPUs
> > associated with the queue are offline and the queue is becoming
> > inactive. Handling that IO needs the error handler to provide forward
> > progress.
> >
> > Then a deadlock is caused:
> >
> > 1) inside the CPU hotplug handler, the CPU hotplug lock is held, and
> > blk-mq's handler is waiting for inflight IO
> >
> > 2) the error handler is waiting for the CPU hotplug lock
> >
> > 3) inflight IO can't be completed in blk-mq's CPU hotplug handler,
> > because error handling can't provide forward progress.
> >
> > Solve the deadlock by not holding the CPU hotplug lock in
> > group_cpus_evenly(), where the spread is done in two stages: 1) the
> > 1st stage is over all present CPUs; 2) the 2nd stage is over all
> > other CPUs.
> >
> > It turns out the two-stage spread just needs a consistent
> > 'cpu_present_mask', so remove the CPU hotplug lock by storing the
> > mask into one local cache. This doesn't change correctness, because
> > all CPUs are still covered.
> >
> > Cc: Keith Busch <kbusch@kernel.org>
> > Cc: linux-nvme@lists.infradead.org
> > Cc: linux-block@vger.kernel.org
> > Reported-by: Yi Zhang <yi.zhang@redhat.com>
> > Reported-by: Guangwu Zhang <guazhang@redhat.com>
> > Tested-by: Guangwu Zhang <guazhang@redhat.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > V2:
> > 	- fix "Cc: block list"
> > 	- add tested-by tag
> >
> >  lib/group_cpus.c | 22 ++++++++++++++++------
> >  1 file changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> > index aa3f6815bb12..15006e79196f 100644
> > --- a/lib/group_cpus.c
> > +++ b/lib/group_cpus.c
> > @@ -348,6 +348,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> >  {
> >  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
> >  	cpumask_var_t *node_to_cpumask;
> > +	cpumask_var_t local_cpu_present_mask;
> >  	cpumask_var_t nmsk, npresmsk;
> >  	int ret = -ENOMEM;
> >  	struct cpumask *masks = NULL;
> > @@ -355,6 +356,16 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
> >  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
> >  		return NULL;
> >
> > +	if (!zalloc_cpumask_var(&local_cpu_present_mask, GFP_KERNEL))
> > +		goto fail_local_pres_mask;
> > +
> > +	/*
> > +	 * Make a local cache of 'cpu_present_mask', so the two stages
> > +	 * spread can observe consistent 'cpu_present_mask' without holding
> > +	 * cpu hotplug lock.
> > +	 */
> > +	cpumask_copy(local_cpu_present_mask, cpu_present_mask);
> > +
>
> Maybe we can reuse npresmsk instead of allocating another cpumask?
>
> In the first stage:  npresmsk = cpu_present_mask
> In the second stage: npresmsk = cpu_possible_mask & ~npresmsk

Good idea!


Thanks,
Ming
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index aa3f6815bb12..15006e79196f 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -348,6 +348,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t local_cpu_present_mask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
 	struct cpumask *masks = NULL;
@@ -355,6 +356,16 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
 		return NULL;
 
+	if (!zalloc_cpumask_var(&local_cpu_present_mask, GFP_KERNEL))
+		goto fail_local_pres_mask;
+
+	/*
+	 * Make a local cache of 'cpu_present_mask', so the two stages
+	 * spread can observe consistent 'cpu_present_mask' without holding
+	 * cpu hotplug lock.
+	 */
+	cpumask_copy(local_cpu_present_mask, cpu_present_mask);
+
 	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
 		goto fail_nmsk;
 
@@ -366,13 +377,11 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	if (!masks)
 		goto fail_node_to_cpumask;
 
-	/* Stabilize the cpumasks */
-	cpus_read_lock();
 	build_node_to_cpumask(node_to_cpumask);
 
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  cpu_present_mask, nmsk, masks);
+				  local_cpu_present_mask, nmsk, masks);
 	if (ret < 0)
 		goto fail_build_affinity;
 	nr_present = ret;
@@ -387,15 +396,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 		curgrp = 0;
 	else
 		curgrp = nr_present;
-	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
+	cpumask_andnot(npresmsk, cpu_possible_mask, local_cpu_present_mask);
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
 	if (ret >= 0)
 		nr_others = ret;
 
 fail_build_affinity:
-	cpus_read_unlock();
-
 	if (ret >= 0)
 		WARN_ON(nr_present + nr_others < numgrps);
 
@@ -406,6 +413,9 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	free_cpumask_var(npresmsk);
 
 fail_nmsk:
+	free_cpumask_var(local_cpu_present_mask);
+
+ fail_local_pres_mask:
 	free_cpumask_var(nmsk);
 	if (ret < 0) {
 		kfree(masks);
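For readers outside the thread: group_cpus_evenly() returns a kcalloc()'d array of numgrps cpumasks (or NULL on failure), which callers such as blk-mq's queue-mapping code walk to assign CPUs to groups. A minimal caller sketch follows; nr_queues and mq_map[] are hypothetical, illustrative names, not anything defined by this patch:

	/*
	 * Illustrative caller (names are hypothetical): spread nr_queues
	 * groups over the CPUs and record each CPU's group index.
	 */
	struct cpumask *masks;
	unsigned int queue, cpu;

	masks = group_cpus_evenly(nr_queues);
	if (!masks)
		return -ENOMEM;		/* assumed caller error convention */

	for (queue = 0; queue < nr_queues; queue++)
		for_each_cpu(cpu, &masks[queue])
			mq_map[cpu] = queue;	/* hypothetical per-CPU table */

	kfree(masks);	/* the returned array is kcalloc()'d by the callee */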