blk-mq: Wait for for hctx requests on CPU unplug

Message ID	20190405215920.27085-1-keith.busch@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Keith Busch <keith.busch@intel.com> To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, Jens Axboe <axboe@kernel.dk> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>, Bart Van Assche <bvanassche@acm.org>, Keith Busch <keith.busch@intel.com>, Ming Lei <ming.lei@redhat.com>, Thomas Gleixner <tglx@linutronix.de> Subject: [PATCH] blk-mq: Wait for for hctx requests on CPU unplug Date: Fri, 5 Apr 2019 15:59:20 -0600 Message-Id: <20190405215920.27085-1-keith.busch@intel.com> Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	blk-mq: Wait for for hctx requests on CPU unplug \| expand blk-mq: Wait for for hctx requests on CPU unplug

Keith Busch April 5, 2019, 9:59 p.m. UTC

Managed interrupts can not migrate affinity when their CPUs are offline.
If the CPU is allowed to shutdown before they're returned, commands
dispatched to managed queues won't be able to complete through their
irq handlers.

Introduce per-hctx reference counting so we can block the CPU dead
notification for all allocated requests to complete if an hctx's last
CPU is being taken offline.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq-sched.c   |  2 ++
 block/blk-mq-sysfs.c   |  1 +
 block/blk-mq-tag.c     |  1 +
 block/blk-mq.c         | 36 ++++++++++++++++++++++++++++--------
 block/blk-mq.h         | 10 +++++++++-
 include/linux/blk-mq.h |  3 +++
 6 files changed, 44 insertions(+), 9 deletions(-)

Jens Axboe April 5, 2019, 10:23 p.m. UTC | #1

On 4/5/19 3:59 PM, Keith Busch wrote:
> Managed interrupts can not migrate affinity when their CPUs are offline.
> If the CPU is allowed to shutdown before they're returned, commands
> dispatched to managed queues won't be able to complete through their
> irq handlers.
> 
> Introduce per-hctx reference counting so we can block the CPU dead
> notification for all allocated requests to complete if an hctx's last
> CPU is being taken offline.

What does this do to performance? We're doing a map per request...

Keith Busch April 5, 2019, 10:37 p.m. UTC | #2

On Fri, Apr 05, 2019 at 04:23:27PM -0600, Jens Axboe wrote:
> On 4/5/19 3:59 PM, Keith Busch wrote:
> > Managed interrupts can not migrate affinity when their CPUs are offline.
> > If the CPU is allowed to shutdown before they're returned, commands
> > dispatched to managed queues won't be able to complete through their
> > irq handlers.
> > 
> > Introduce per-hctx reference counting so we can block the CPU dead
> > notification for all allocated requests to complete if an hctx's last
> > CPU is being taken offline.
> 
> What does this do to performance? We're doing a map per request...

It should be the same cost as the blk_queue_enter/blk_queue_exit that's
also done per request, which is pretty cheap way to count users. I
don't think I'm measuring a difference, but my test sample size so far
is just one over-powered machine.

Jens Axboe April 5, 2019, 11:04 p.m. UTC | #3

On 4/5/19 4:37 PM, Keith Busch wrote:
> On Fri, Apr 05, 2019 at 04:23:27PM -0600, Jens Axboe wrote:
>> On 4/5/19 3:59 PM, Keith Busch wrote:
>>> Managed interrupts can not migrate affinity when their CPUs are offline.
>>> If the CPU is allowed to shutdown before they're returned, commands
>>> dispatched to managed queues won't be able to complete through their
>>> irq handlers.
>>>
>>> Introduce per-hctx reference counting so we can block the CPU dead
>>> notification for all allocated requests to complete if an hctx's last
>>> CPU is being taken offline.
>>
>> What does this do to performance? We're doing a map per request...
> 
> It should be the same cost as the blk_queue_enter/blk_queue_exit that's
> also done per request, which is pretty cheap way to count users. I
> don't think I'm measuring a difference, but my test sample size so far
> is just one over-powered machine.

Looking at current peak testing, I've got around 1.2% in queue enter
and exit. It's definitely not free, hence my question. Probably safe
to assume that we'll double that cycle counter, per IO.

Keith Busch April 5, 2019, 11:36 p.m. UTC | #4

On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> Looking at current peak testing, I've got around 1.2% in queue enter
> and exit. It's definitely not free, hence my question. Probably safe
> to assume that we'll double that cycle counter, per IO.

Okay, that's not negligible at all. I don't know of a faster reference
than the percpu_ref, but that much overhead would have to rule out
having a per hctx counter.

Dongli Zhang April 6, 2019, 9:44 a.m. UTC | #5

On 04/06/2019 07:36 AM, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Looking at current peak testing, I've got around 1.2% in queue enter
>> and exit. It's definitely not free, hence my question. Probably safe
>> to assume that we'll double that cycle counter, per IO.
> 
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.
> 

If there is no faster reference to enable waiting for all inflight requests to
complete, is it possible to re-map (migrate) those requests to other hctx whose
cpus (ctx) are still online, e.g., to extract the bio and re-map those bio to
other ctx (e.g., cpu0)?

One drawback I can see is if 63 out of 64 cpus are suddenly offline, cpu0 would
be stuck.

Dongli Zhang

Ming Lei April 6, 2019, 9:27 p.m. UTC | #6

On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > Looking at current peak testing, I've got around 1.2% in queue enter
> > and exit. It's definitely not free, hence my question. Probably safe
> > to assume that we'll double that cycle counter, per IO.
> 
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.

Or not using any refcount in fast path, how about the following one?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..6fe334e12236 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
+		int dead_cpu)
+{
+	unsigned long msecs_left = 1000 * 10;
+
+	while (msecs_left > 0) {
+		if (blk_mq_hctx_idle(hctx))
+			break;
+		msleep(5);
+		msecs_left -= 5;
+	}
+
+	if (msecs_left > 0)
+		printk(KERN_WARNING "requests not completed from "
+			"CPU %d\n", dead_cpu);
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
+
+	/*
+	 * Interrupt for this queue will be shutdown, so wait until all
+	 * requests from this hctx is done or timeout.
+	 */
+	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) >= nr_cpu_ids)
+		blk_mq_wait_hctx_become_idle(hctx, cpu);
+
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d704fc7766f4..935cf8519bf2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -240,4 +240,15 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 		qmap->mq_map[cpu] = 0;
 }
 
+static inline bool blk_mq_hctx_idle(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
+
+	if (!tags)
+		return true;
+
+	return !sbitmap_any_bit_set(&tags->bitmap_tags.sb) &&
+                       !sbitmap_any_bit_set(&tags->bitmap_tags.sb);
+}
+
 #endif

Thanks,
Ming

Christoph Hellwig April 7, 2019, 7:51 a.m. UTC | #7

On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > Looking at current peak testing, I've got around 1.2% in queue enter
> > and exit. It's definitely not free, hence my question. Probably safe
> > to assume that we'll double that cycle counter, per IO.
> 
> Okay, that's not negligible at all. I don't know of a faster reference
> than the percpu_ref, but that much overhead would have to rule out
> having a per hctx counter.

Can we just replace queue_enter/exit with the per-hctx reference
entirely?

Dongli Zhang April 7, 2019, 1:55 p.m. UTC | #8

Hi Keith and Ming,

On 04/07/2019 05:27 AM, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
>> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> Looking at current peak testing, I've got around 1.2% in queue enter
>>> and exit. It's definitely not free, hence my question. Probably safe
>>> to assume that we'll double that cycle counter, per IO.
>>
>> Okay, that's not negligible at all. I don't know of a faster reference
>> than the percpu_ref, but that much overhead would have to rule out
>> having a per hctx counter.
> 
> Or not using any refcount in fast path, how about the following one?
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (msecs_left > 0)
> +		printk(KERN_WARNING "requests not completed from "
> +			"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>  

While debugging the blk_mq_hctx_notify_dead(), I found that
blk_mq_hctx_notify_dead() is called once for each hctx for the cpu to offline.

As shown below between linux 2216 and line 2217, ctx might not be mapped to
hctx. Why would we flush ctx->rq_lists[type] to hctx->dispatch?

2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
2208 {
2209         struct blk_mq_hw_ctx *hctx;
2210         struct blk_mq_ctx *ctx;
2211         LIST_HEAD(tmp);
2212         enum hctx_type type;
2213
2214         hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
2215         ctx = __blk_mq_get_ctx(hctx->queue, cpu);
2216         type = hctx->type;

===>>>>> ctx might not be mappped to hctx, that is,
cpumask_test_cpu(cpu, hctx->cpumask) = false

2217
2218         spin_lock(&ctx->lock);
2219         if (!list_empty(&ctx->rq_lists[type])) {
2220                 list_splice_init(&ctx->rq_lists[type], &tmp);
2221                 blk_mq_hctx_clear_pending(hctx, ctx);
2222         }
2223         spin_unlock(&ctx->lock);
2224
2225         if (list_empty(&tmp))
2226                 return 0;
2227
2228         spin_lock(&hctx->lock);
2229         list_splice_tail_init(&tmp, &hctx->dispatch);
2230         spin_unlock(&hctx->lock);
2231
2232         blk_mq_run_hw_queue(hctx, true);
2233         return 0;
2234 }


For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will however be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().


Is this expected or a bug?

If it is a bug, would the below help?

[PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
is not mapped to hctx

When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
hctx for the offline cpu.

While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
to hctx->dispatch, it never checks whether the ctx is already mapped to the
hctx.

For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
hctx:

1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0

Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
in this case, its ctx->rq_lists[type] will however be moved to
blk_mq_hw_ctx->queue_num = 3 during the 1st call of
blk_mq_hctx_notify_dead().

This patch would return and go ahead to next call of
blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 block/blk-mq.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b..b8ef489 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu,
struct hlist_node *node)
 	enum hctx_type type;

 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;

Ming Lei April 8, 2019, 9:49 a.m. UTC | #9

On Sun, Apr 07, 2019 at 09:55:20PM +0800, Dongli Zhang wrote:
> Hi Keith and Ming,
> 
> On 04/07/2019 05:27 AM, Ming Lei wrote:
> > On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> >> On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>> Looking at current peak testing, I've got around 1.2% in queue enter
> >>> and exit. It's definitely not free, hence my question. Probably safe
> >>> to assume that we'll double that cycle counter, per IO.
> >>
> >> Okay, that's not negligible at all. I don't know of a faster reference
> >> than the percpu_ref, but that much overhead would have to rule out
> >> having a per hctx counter.
> > 
> > Or not using any refcount in fast path, how about the following one?
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3ff3d7b49969..6fe334e12236 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >  	return -ENOMEM;
> >  }
> >  
> > +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> > +		int dead_cpu)
> > +{
> > +	unsigned long msecs_left = 1000 * 10;
> > +
> > +	while (msecs_left > 0) {
> > +		if (blk_mq_hctx_idle(hctx))
> > +			break;
> > +		msleep(5);
> > +		msecs_left -= 5;
> > +	}
> > +
> > +	if (msecs_left > 0)
> > +		printk(KERN_WARNING "requests not completed from "
> > +			"CPU %d\n", dead_cpu);
> > +}
> > +
> >  /*
> >   * 'cpu' is going away. splice any existing rq_list entries from this
> >   * software queue to the hw queue dispatch list, and ensure that it
> > @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  	spin_unlock(&hctx->lock);
> >  
> 
> While debugging the blk_mq_hctx_notify_dead(), I found that
> blk_mq_hctx_notify_dead() is called once for each hctx for the cpu to offline.
> 
> As shown below between linux 2216 and line 2217, ctx might not be mapped to
> hctx. Why would we flush ctx->rq_lists[type] to hctx->dispatch?
> 
> 2207 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> 2208 {
> 2209         struct blk_mq_hw_ctx *hctx;
> 2210         struct blk_mq_ctx *ctx;
> 2211         LIST_HEAD(tmp);
> 2212         enum hctx_type type;
> 2213
> 2214         hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> 2215         ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> 2216         type = hctx->type;
> 
> ===>>>>> ctx might not be mappped to hctx, that is,
> cpumask_test_cpu(cpu, hctx->cpumask) = false
> 
> 2217
> 2218         spin_lock(&ctx->lock);
> 2219         if (!list_empty(&ctx->rq_lists[type])) {
> 2220                 list_splice_init(&ctx->rq_lists[type], &tmp);
> 2221                 blk_mq_hctx_clear_pending(hctx, ctx);
> 2222         }
> 2223         spin_unlock(&ctx->lock);
> 2224
> 2225         if (list_empty(&tmp))
> 2226                 return 0;
> 2227
> 2228         spin_lock(&hctx->lock);
> 2229         list_splice_tail_init(&tmp, &hctx->dispatch);
> 2230         spin_unlock(&hctx->lock);
> 2231
> 2232         blk_mq_run_hw_queue(hctx, true);
> 2233         return 0;
> 2234 }
> 
> 
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
> 
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
> 
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
> 
> 
> Is this expected or a bug?
> 
> If it is a bug, would the below help?
> 
> [PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
> is not mapped to hctx
> 
> When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
> hctx for the offline cpu.
> 
> While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
> to hctx->dispatch, it never checks whether the ctx is already mapped to the
> hctx.
> 
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
> 
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
> 
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
> 
> This patch would return and go ahead to next call of
> blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.
> 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  block/blk-mq.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b..b8ef489 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu,
> struct hlist_node *node)
>  	enum hctx_type type;
> 
>  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>  	type = hctx->type;

Looks correct:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming

Keith Busch April 8, 2019, 3:21 p.m. UTC | #10

On Sat, Apr 06, 2019 at 02:27:10PM -0700, Ming Lei wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> > On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > Looking at current peak testing, I've got around 1.2% in queue enter
> > > and exit. It's definitely not free, hence my question. Probably safe
> > > to assume that we'll double that cycle counter, per IO.
> > 
> > Okay, that's not negligible at all. I don't know of a faster reference
> > than the percpu_ref, but that much overhead would have to rule out
> > having a per hctx counter.
> 
> Or not using any refcount in fast path, how about the following one?

Sure, I don't think we need a high precision completion wait in this path,
so a delay-spin seems okay to me.

 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b49969..6fe334e12236 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2199,6 +2199,23 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> +static void blk_mq_wait_hctx_become_idle(struct blk_mq_hw_ctx *hctx,
> +		int dead_cpu)
> +{
> +	unsigned long msecs_left = 1000 * 10;
> +
> +	while (msecs_left > 0) {
> +		if (blk_mq_hctx_idle(hctx))
> +			break;
> +		msleep(5);
> +		msecs_left -= 5;
> +	}
> +
> +	if (msecs_left > 0)
> +		printk(KERN_WARNING "requests not completed from "
> +			"CPU %d\n", dead_cpu);
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it
> @@ -2230,6 +2247,14 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	spin_unlock(&hctx->lock);
>  
>  	blk_mq_run_hw_queue(hctx, true);
> +
> +	/*
> +	 * Interrupt for this queue will be shutdown, so wait until all
> +	 * requests from this hctx is done or timeout.
> +	 */
> +	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) >= nr_cpu_ids)
> +		blk_mq_wait_hctx_become_idle(hctx, cpu);
> +
>  	return 0;
>  }
>  
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index d704fc7766f4..935cf8519bf2 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -240,4 +240,15 @@ static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
>  		qmap->mq_map[cpu] = 0;
>  }
>  
> +static inline bool blk_mq_hctx_idle(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?: hctx->tags;
> +
> +	if (!tags)
> +		return true;
> +
> +	return !sbitmap_any_bit_set(&tags->bitmap_tags.sb) &&
> +                       !sbitmap_any_bit_set(&tags->bitmap_tags.sb);
> +}
> +
>  #endif
> 
> Thanks,
> Ming

Keith Busch April 8, 2019, 3:23 p.m. UTC | #11

On Sun, Apr 07, 2019 at 12:51:23AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 05, 2019 at 05:36:32PM -0600, Keith Busch wrote:
> > On Fri, Apr 5, 2019 at 5:04 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > Looking at current peak testing, I've got around 1.2% in queue enter
> > > and exit. It's definitely not free, hence my question. Probably safe
> > > to assume that we'll double that cycle counter, per IO.
> > 
> > Okay, that's not negligible at all. I don't know of a faster reference
> > than the percpu_ref, but that much overhead would have to rule out
> > having a per hctx counter.
> 
> Can we just replace queue_enter/exit with the per-hctx reference
> entirely?

I don't think that we can readily do that. We still need to protect a
request_queue access prior to selecting the hctx.

Keith Busch April 8, 2019, 3:36 p.m. UTC | #12

On Sun, Apr 07, 2019 at 06:55:20AM -0700, Dongli Zhang wrote:
> [PATCH 1/1] blk-mq: do not splice ctx->rq_lists[type] to hctx->dispatch if ctx
> is not mapped to hctx
> 
> When a cpu is offline, blk_mq_hctx_notify_dead() is called once for each
> hctx for the offline cpu.
> 
> While blk_mq_hctx_notify_dead() is used to splice all ctx->rq_lists[type]
> to hctx->dispatch, it never checks whether the ctx is already mapped to the
> hctx.
> 
> For example, on a VM (with nvme) of 4 cpu, to offline cpu 2 out of the
> 4 cpu (0-3), blk_mq_hctx_notify_dead() is called once for each io queue
> hctx:
> 
> 1st: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 3
> 2nd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 2
> 3rd: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 1
> 4th: blk_mq_ctx->cpu = 2 for blk_mq_hw_ctx->queue_num = 0
> 
> Although blk_mq_ctx->cpu = 2 is only mapped to blk_mq_hw_ctx->queue_num = 2
> in this case, its ctx->rq_lists[type] will however be moved to
> blk_mq_hw_ctx->queue_num = 3 during the 1st call of
> blk_mq_hctx_notify_dead().
> 
> This patch would return and go ahead to next call of
> blk_mq_hctx_notify_dead() if ctx is not mapped to hctx.

Ha, I think you're right. 

It would be a bit more work, but it might be best if we could avoid
calling the notify for each hctx that doesn't apply to the CPU. We might
get that by registering a single callback for the request_queue and loop
only the affected hctx's.

But this patch looks good to me too.

Reviewed-by: Keith Busch <keith.busch@intel.com>
 
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  block/blk-mq.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3ff3d7b..b8ef489 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2212,6 +2212,10 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu,
> struct hlist_node *node)
>  	enum hctx_type type;
> 
>  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>  	type = hctx->type;
> 
> -- 
> 2.7.4

blk-mq: Wait for for hctx requests on CPU unplug

Commit Message

Comments

Patch