| Message ID | 20190405215920.27085-1-keith.busch@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | blk-mq: Wait for for hctx requests on CPU unplug |
On 4/5/19 3:59 PM, Keith Busch wrote:
> Managed interrupts can not migrate affinity when their CPUs are offline.
> If the CPU is allowed to shutdown before they're returned, commands
> dispatched to managed queues won't be able to complete through their
> irq handlers.
>
> Introduce per-hctx reference counting so we can block the CPU dead
> notification for all allocated requests to complete if an hctx's last
> CPU is being taken offline.

What does this do to performance? We're doing a map per request...
On Fri, Apr 05, 2019 at 04:23:27PM -0600, Jens Axboe wrote:
> On 4/5/19 3:59 PM, Keith Busch wrote:
> > Managed interrupts can not migrate affinity when their CPUs are offline.
> > If the CPU is allowed to shutdown before they're returned, commands
> > dispatched to managed queues won't be able to complete through their
> > irq handlers.
> >
> > Introduce per-hctx reference counting so we can block the CPU dead
> > notification for all allocated requests to complete if an hctx's last
> > CPU is being taken offline.
>
> What does this do to performance? We're doing a map per request...

It should be the same cost as the blk_queue_enter/blk_queue_exit that's
also done per request, which is pretty cheap way to count users. I
don't think I'm measuring a difference, but my test sample size so far
is just one over-powered machine.
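For readers following the comparison: blk_queue_enter()/blk_queue_exit() guard
the request_queue with a percpu_ref (q->q_usage_counter), so the per-request
fast path is roughly one per-CPU increment plus one per-CPU decrement. A
minimal sketch of that pattern, simplified from what the real helpers do (the
actual blk_queue_enter() also handles preempt-only and dying-queue cases):

/*
 * Simplified sketch only, not the kernel's actual implementation.  The
 * point of comparison: entering/exiting the queue costs one percpu_ref
 * get/put per request, which stays on per-CPU counters in the fast path.
 */
static inline int queue_enter_sketch(struct request_queue *q)
{
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return -ENODEV;		/* queue is being torn down */
	return 0;
}

static inline void queue_exit_sketch(struct request_queue *q)
{
	percpu_ref_put(&q->q_usage_counter);	/* matching per-CPU decrement */
}

Keith's patch adds a second get/put pair of the same kind on hctx->mapped (see
the diff later in this thread), which is what Jens's "double that cycle
counter" estimate below refers to.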
On 4/5/19 4:37 PM, Keith Busch wrote:
> On Fri, Apr 05, 2019 at 04:23:27PM -0600, Jens Axboe wrote:
>> On 4/5/19 3:59 PM, Keith Busch wrote:
>>> Managed interrupts can not migrate affinity when their CPUs are offline.
>>> If the CPU is allowed to shutdown before they're returned, commands
>>> dispatched to managed queues won't be able to complete through their
>>> irq handlers.
>>>
>>> Introduce per-hctx reference counting so we can block the CPU dead
>>> notification for all allocated requests to complete if an hctx's last
>>> CPU is being taken offline.
>>
>> What does this do to performance? We're doing a map per request...
>
> It should be the same cost as the blk_queue_enter/blk_queue_exit that's
> also done per request, which is pretty cheap way to count users. I
> don't think I'm measuring a difference, but my test sample size so far
> is just one over-powered machine.

Looking at current peak testing, I've got around 1.2% in queue enter
and exit. It's definitely not free, hence my question. Probably safe
to assume that we'll double that cycle counter, per IO.
On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> Looking at current peak testing, I've got around 1.2% in queue enter
> and exit. It's definitely not free, hence my question. Probably safe
> to assume that we'll double that cycle counter, per IO.

Okay, that's not negligible at all. I don't know of a faster reference
than the percpu_ref, but that much overhead would have to rule out
having a per hctx counter.
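On the "faster reference" point: while a percpu_ref is in per-CPU mode, get
and put are essentially per-CPU counter updates under RCU, only falling back
to a shared atomic after the ref is killed. A conceptual sketch with
illustrative names and types, not the real percpu-refcount implementation
(which lives in include/linux/percpu-refcount.h):

/* Illustrative only -- shows why the per-CPU fast path is hard to beat. */
struct ref_sketch {
	unsigned long __percpu	*pcpu_count;	/* fast path: one counter per CPU */
	atomic_long_t		count;		/* slow path after kill/switch */
	bool			percpu_mode;
};

static inline void ref_sketch_get(struct ref_sketch *ref)
{
	rcu_read_lock();
	if (ref->percpu_mode)
		this_cpu_inc(*ref->pcpu_count);	/* no shared cacheline bounced */
	else
		atomic_long_inc(&ref->count);
	rcu_read_unlock();
}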
On 04/06/2019 07:36 AM, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Looking at current peak testing, I've got around 1.2% in queue enter
>> and exit. It's definitely not free, hence my question. Probably safe
>> to assume that we'll double that cycle counter, per IO.
>
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.
>

If there is no faster reference to enable waiting for all inflight requests
to complete, is it possible to re-map (migrate) those requests to other hctx
whose cpus (ctx) are still online, e.g., to extract the bio and re-map those
bio to other ctx (e.g., cpu0)?

One drawback I can see is if 63 out of 64 cpus are suddenly offline, cpu0
would be stuck.

Dongli Zhang
On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > Looking at current peak testing, I've got around 1.2% in queue enter
> > and exit. It's definitely not free, hence my question. Probably safe
> > to assume that we'll double that cycle counter, per IO.
>
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.

Or not using any refcount in fast path, how about the following one?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..6fe334e12236 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
+		int dead_cpu)
+{
+	unsigned long msecs_left = 1000 * 10;
+
+	while (msecs_left > 0) {
+		if (blk_mq_hctx_idle(hctx))
+			break;
+		msleep(5);
+		msecs_left -= 5;
+	}
+
+	if (msecs_left > 0)
+		printk(KERN_WARNING "requests not completed from "
+			"CPU %d\n", dead_cpu);
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
+
+	/*
+	 * Interrupt for this queue will be shutdown, so wait until all
+	 * requests from this hctx is done or timeout.
+	 */
+	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) >= nr_cpu_ids)
+		blk_mq_wait_hctx_become_idle(hctx, cpu);
+
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..935cf8519bf2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -240,4 +240,15 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
+static inline bool blk_mq_hctx_idle(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
+
+	if (!tags)
+		return true;
+
+	return !sbitmap_any_bit_set(&tags->bitmap_tags.sb) &&
+		!sbitmap_any_bit_set(&tags->bitmap_tags.sb);
+}
+
 #endif

Thanks,
Ming
On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > Looking at current peak testing, I've got around 1.2% in queue enter
> > and exit. It's definitely not free, hence my question. Probably safe
> > to assume that we'll double that cycle counter, per IO.
>
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.

Can we just replace queue_enter/exit with the per-hctx reference
entirely?
Hi Keith and Ming,

On 04/07/2019 05:27 AM, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
>> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> Looking at current peak testing, I've got around 1.2% in queue enter
>>> and exit. It's definitely not free, hence my question. Probably safe
>>> to assume that we'll double that cycle counter, per IO.
>>
>> Okay, that's not negligible at all. I don't know of a faster reference
>> than the percpu_ref, but that much overhead would have to rule out
>> having a per hctx counter.
>
> Or not using any refcount in fast path, how about the following one?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (msecs_left > 0)
> +		printk(KERN_WARNING "requests not completed from "
> +			"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>

While debugging the blk_mq_hctx_notify_dead(), I found that
blk_mq_hctx_notify_dead() is called once for each hctx for the cpu to
offline.

As shown below between linux 2216 and line 2217, ctx might not be mapped to
hctx. Why would we flush ctx->rq_lists[type] to hctx->dispatch?

2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
2208 {
2209 	struct blk_mq_hw_ctx *hctx;
2210 	struct blk_mq_ctx *ctx;
2211 	LIST_HEAD(tmp);
2212 	enum hctx_type type;
2213
2214 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
2215 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
2216 	type = hctx->type;

===>>>>> ctx might not be mappped to hctx, that is,
         cpumask_test_cpu(cpu, hctx->cpumask) = false

2217
2218 	spin_lock(&ctx->lock);
2219 	if (!list_empty(&ctx->rq_lists[type])) {
2220 		list_splice_init(&ctx->rq_lists[type], &tmp);
2221 		blk_mq_hctx_clear_pending(hctx, ctx);
2222 	}
2223 	spin_unlock(&ctx->lock);
2224
2225 	if (list_empty(&tmp))
2226 		return 0;
2227
2228 	spin_lock(&hctx->lock);
2229 	list_splice_tail_init(&tmp, &hctx->dispatch);
2230 	spin_unlock(&hctx->lock);
2231
2232 	blk_mq_run_hw_queue(hctx, true);
2233 	return 0;
2234 }

For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will however be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

Is this expected or a bug?

If it is a bug, would the below help?
[PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
is not mapped to hctx

When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
hctx for the offline cpu.

While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
to hctx->dispatch, it never checks whether the ctx is already mapped to the
hctx.

For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will however be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

This patch would return and go ahead to next call of
blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b..b8ef489 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
On Sun, Apr 07, 2019 at 09:55:20PM +0800, Dongli Zhang wrote:
> Hi Keith and Ming,
>
> On 04/07/2019 05:27 AM, Ming Lei wrote:
> > On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> >> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>> Looking at current peak testing, I've got around 1.2% in queue enter
> >>> and exit. It's definitely not free, hence my question. Probably safe
> >>> to assume that we'll double that cycle counter, per IO.
> >>
> >> Okay, that's not negligible at all. I don't know of a faster reference
> >> than the percpu_ref, but that much overhead would have to rule out
> >> having a per hctx counter.
> >
> > Or not using any refcount in fast path, how about the following one?
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3ff3d7b49969..6fe334e12236 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >  	return -ENOMEM;
> >  }
> >
> > +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> > +		int dead_cpu)
> > +{
> > +	unsigned long msecs_left = 1000 * 10;
> > +
> > +	while (msecs_left > 0) {
> > +		if (blk_mq_hctx_idle(hctx))
> > +			break;
> > +		msleep(5);
> > +		msecs_left -= 5;
> > +	}
> > +
> > +	if (msecs_left > 0)
> > +		printk(KERN_WARNING "requests not completed from "
> > +			"CPU %d\n", dead_cpu);
> > +}
> > +
> >  /*
> >   * 'cpu' is going away. splice any existing rq_list entries from this
> >   * software queue to the hw queue dispatch list, and ensure that it
> > @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  	spin_unlock(&hctx->lock);
> >
>
> While debugging the blk_mq_hctx_notify_dead(), I found that
> blk_mq_hctx_notify_dead() is called once for each hctx for the cpu to
> offline.
>
> As shown below between linux 2216 and line 2217, ctx might not be mapped to
> hctx. Why would we flush ctx->rq_lists[type] to hctx->dispatch?
>
> 2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> 2208 {
> 2209 	struct blk_mq_hw_ctx *hctx;
> 2210 	struct blk_mq_ctx *ctx;
> 2211 	LIST_HEAD(tmp);
> 2212 	enum hctx_type type;
> 2213
> 2214 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> 2215 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> 2216 	type = hctx->type;
>
> ===>>>>> ctx might not be mappped to hctx, that is,
> cpumask_test_cpu(cpu, hctx->cpumask) = false
>
> 2217
> 2218 	spin_lock(&ctx->lock);
> 2219 	if (!list_empty(&ctx->rq_lists[type])) {
> 2220 		list_splice_init(&ctx->rq_lists[type], &tmp);
> 2221 		blk_mq_hctx_clear_pending(hctx, ctx);
> 2222 	}
> 2223 	spin_unlock(&ctx->lock);
> 2224
> 2225 	if (list_empty(&tmp))
> 2226 		return 0;
> 2227
> 2228 	spin_lock(&hctx->lock);
> 2229 	list_splice_tail_init(&tmp, &hctx->dispatch);
> 2230 	spin_unlock(&hctx->lock);
> 2231
> 2232 	blk_mq_run_hw_queue(hctx, true);
> 2233 	return 0;
> 2234 }
>
>
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
>
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
>
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
>
> Is this expected or a bug?
>
> If it is a bug, would the below help?
>
> [PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
> is not mapped to hctx
>
> When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
> hctx for the offline cpu.
>
> While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
> to hctx->dispatch, it never checks whether the ctx is already mapped to the
> hctx.
>
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
>
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
>
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
>
> This patch would return and go ahead to next call of
> blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.
>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  block/blk-mq.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b..b8ef489 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu,
> struct hlist_node *node)
>  	enum hctx_type type;
>
>  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>  	type = hctx->type;

Looks correct:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming
On Sat, Apr 06, 2019 at 02:27:10PM -0700, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> > On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > Looking at current peak testing, I've got around 1.2% in queue enter
> > > and exit. It's definitely not free, hence my question. Probably safe
> > > to assume that we'll double that cycle counter, per IO.
> >
> > Okay, that's not negligible at all. I don't know of a faster reference
> > than the percpu_ref, but that much overhead would have to rule out
> > having a per hctx counter.
>
> Or not using any refcount in fast path, how about the following one?

Sure, I don't think we need a high precision completion wait in this
path, so a delay-spin seems okay to me.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (msecs_left > 0)
> +		printk(KERN_WARNING "requests not completed from "
> +			"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>
>  	blk_mq_run_hw_queue(hctx, true);
> +
> +	/*
> +	 * Interrupt for this queue will be shutdown, so wait until all
> +	 * requests from this hctx is done or timeout.
> +	 */
> +	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) >= nr_cpu_ids)
> +		blk_mq_wait_hctx_become_idle(hctx, cpu);
> +
>  	return 0;
>  }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index d704fc7766f4..935cf8519bf2 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -240,4 +240,15 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>  		qmap->mq_map[cpu] = 0;
>  }
>
> +static inline bool blk_mq_hctx_idle(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
> +
> +	if (!tags)
> +		return true;
> +
> +	return !sbitmap_any_bit_set(&tags->bitmap_tags.sb) &&
> +		!sbitmap_any_bit_set(&tags->bitmap_tags.sb);
> +}
> +
> #endif
>
> Thanks,
> Ming
On Sun, Apr 07, 2019 at 12:51:23AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> > On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > Looking at current peak testing, I've got around 1.2% in queue enter
> > > and exit. It's definitely not free, hence my question. Probably safe
> > > to assume that we'll double that cycle counter, per IO.
> >
> > Okay, that's not negligible at all. I don't know of a faster reference
> > than the percpu_ref, but that much overhead would have to rule out
> > having a per hctx counter.
>
> Can we just replace queue_enter/exit with the per-hctx reference
> entirely?

I don't think that we can readily do that. We still need to protect a
request_queue access prior to selecting the hctx.
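The ordering constraint Keith is pointing at can be seen in the allocation
path: the hctx is only known after the per-CPU ctx has been mapped, and that
lookup already requires the queue to be pinned. A rough sketch with a
hypothetical helper name (blk_mq_get_ctx()/blk_mq_map_queue() are the
block-internal helpers touched by the patch later in this thread):

/* Sketch of the ordering only, not real kernel code. */
static int hctx_lookup_order_sketch(struct request_queue *q, struct bio *bio)
{
	struct blk_mq_ctx *ctx;
	struct blk_mq_hw_ctx *hctx;

	if (blk_queue_enter(q, 0))		/* queue-wide protection first */
		return -ENODEV;

	ctx = blk_mq_get_ctx(q);		/* needs the queue to stay alive */
	hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);	/* hctx known only here */

	/*
	 * A per-hctx reference (e.g. percpu_ref_get(&hctx->mapped)) could
	 * only be taken from this point on, so it cannot replace the
	 * blk_queue_enter() above.
	 */
	blk_queue_exit(q);
	return hctx ? 0 : -ENODEV;
}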
On Sun, Apr 07, 2019 at 06:55:20AM -0700, Dongli Zhang wrote:
> [PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
> is not mapped to hctx
>
> When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
> hctx for the offline cpu.
>
> While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
> to hctx->dispatch, it never checks whether the ctx is already mapped to the
> hctx.
>
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
>
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
>
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
>
> This patch would return and go ahead to next call of
> blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.

Ha, I think you're right. It would be a bit more work, but it might be
best if we could avoid calling the notify for each hctx that doesn't
apply to the CPU. We might get that by registering a single callback for
the request_queue and loop only the affected hctx's. But this patch looks
good to me too.

Reviewed-by: Keith Busch <keith.busch@intel.com>

> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  block/blk-mq.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b..b8ef489 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu,
> struct hlist_node *node)
>  	enum hctx_type type;
>
>  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>  	type = hctx->type;
>
> --
> 2.7.4
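A rough sketch of the per-request_queue alternative Keith floats above
(hypothetical code, not a tested patch: request_queue has no cpuhp_dead
member today, and the splice/run logic is elided):

static int blk_mq_queue_notify_dead_sketch(unsigned int cpu,
					   struct hlist_node *node)
{
	/* hypothetical per-queue hlist node instead of one node per hctx */
	struct request_queue *q = hlist_entry_safe(node, struct request_queue,
						   cpuhp_dead);
	struct blk_mq_hw_ctx *hctx;
	int i;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (!cpumask_test_cpu(cpu, hctx->cpumask))
			continue;	/* this hctx never served the dead CPU */

		/* splice ctx->rq_lists[] and run the hw queue, as today */
	}
	return 0;
}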
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 40905539afed..d1179e3d0fd1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -326,6 +326,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 	enum hctx_type type;
 
 	if (e && e->type->ops.bio_merge) {
+		blk_mq_unmap_queue(hctx);
 		blk_mq_put_ctx(ctx);
 		return e->type->ops.bio_merge(hctx, bio);
 	}
@@ -339,6 +340,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 		spin_unlock(&ctx->lock);
 	}
 
+	blk_mq_unmap_queue(hctx);
 	blk_mq_put_ctx(ctx);
 	return ret;
 }
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3f9c3f4ac44c..e85e702fbaaf 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -34,6 +34,7 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj)
 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj);
 
 	free_cpumask_var(hctx->cpumask);
+	percpu_ref_exit(&hctx->mapped);
 	kfree(hctx->ctxs);
 	kfree(hctx);
 }
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a4931fc7be8a..df36af944e4a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -162,6 +162,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 
 		if (data->ctx)
 			blk_mq_put_ctx(data->ctx);
+		blk_mq_unmap_queue(data->hctx);
 
 		bt_prev = bt;
 		io_schedule();
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..6b2fbe895c6b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -385,6 +385,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 
 	tag = blk_mq_get_tag(data);
 	if (tag == BLK_MQ_TAG_FAIL) {
+		blk_mq_unmap_queue(data->hctx);
 		if (put_ctx_on_error) {
 			blk_mq_put_ctx(data->ctx);
 			data->ctx = NULL;
@@ -516,6 +517,7 @@ void blk_mq_free_request(struct request *rq)
 	ctx->rq_completed[rq_is_sync(rq)]++;
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
 		atomic_dec(&hctx->nr_active);
+	blk_mq_unmap_queue(hctx);
 
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->backing_dev_info);
@@ -2222,14 +2224,19 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
-	if (list_empty(&tmp))
-		return 0;
-
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	if (!list_empty(&tmp)) {
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&tmp, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+	}
 
 	blk_mq_run_hw_queue(hctx, true);
+
+	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) >= nr_cpu_ids) {
+		percpu_ref_kill(&hctx->mapped);
+		wait_event(hctx->mapped_wq, percpu_ref_is_zero(&hctx->mapped));
+		percpu_ref_reinit(&hctx->mapped);
+	}
 	return 0;
 }
@@ -2275,6 +2282,14 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static void hctx_mapped_release(struct percpu_ref *ref)
+{
+	struct blk_mq_hw_ctx *hctx =
+		container_of(ref, struct blk_mq_hw_ctx, mapped);
+
+	wake_up(&hctx->mapped_wq);
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -2323,14 +2338,19 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto exit_hctx;
 
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
+	init_waitqueue_head(&hctx->mapped_wq);
+	if (percpu_ref_init(&hctx->mapped, hctx_mapped_release, 0, GFP_KERNEL))
 		goto free_fq;
 
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
+		goto free_pcpu;
+
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
 
 	return 0;
-
+ free_pcpu:
+	percpu_ref_exit(&hctx->mapped);
 free_fq:
 	kfree(hctx->fq);
 exit_hctx:
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..1adee26a7b96 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -105,6 +105,7 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 						     unsigned int flags,
 						     struct blk_mq_ctx *ctx)
 {
+	struct blk_mq_hw_ctx *hctx;
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
 	/*
@@ -115,7 +116,14 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 	else if ((flags & REQ_OP_MASK) == REQ_OP_READ)
 		type = HCTX_TYPE_READ;
 
-	return ctx->hctxs[type];
+	hctx = ctx->hctxs[type];
+	percpu_ref_get(&hctx->mapped);
+	return hctx;
+}
+
+static inline void blk_mq_unmap_queue(struct blk_mq_hw_ctx *hctx)
+{
+	percpu_ref_put(&hctx->mapped);
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cb2aa7ecafff..66e19611a46d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,9 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	wait_queue_head_t	mapped_wq;
+	struct percpu_ref	mapped;
+
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
Managed interrupts can not migrate affinity when their CPUs are offline.
If the CPU is allowed to shutdown before they're returned, commands
dispatched to managed queues won't be able to complete through their
irq handlers.

Introduce per-hctx reference counting so we can block the CPU dead
notification for all allocated requests to complete if an hctx's last
CPU is being taken offline.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq-sched.c   |  2 ++
 block/blk-mq-sysfs.c   |  1 +
 block/blk-mq-tag.c     |  1 +
 block/blk-mq.c         | 36 ++++++++++++++++++++++++++++--------
 block/blk-mq.h         | 10 +++++++++-
 include/linux/blk-mq.h |  3 +++
 6 files changed, 44 insertions(+), 9 deletions(-)