diff mbox series

[v3,15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled

Message ID 20240806-isolcpus-io-queues-v3-15-da0eecfeaf8b@suse.de (mailing list archive)
State New, archived
Headers show
Series honor isolcpus configuration | expand

Commit Message

Daniel Wagner Aug. 6, 2024, 12:06 p.m. UTC
When isolcpus=io_queue is enabled all hardware queues should run on the
housekeeping CPUs only. Thus ignore the affinity mask provided by the
driver. Also we can't use blk_mq_map_queues because it will map all CPUs
to first hctx unless, the CPU is the same as the hctx has the affinity
set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config

  queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1

  PCI name is 00:05.0: nvme0n1
        irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 61 affinity 1 effective 1 is_managed:1 nvme0q4

where as with blk_mq_hk_map_queues we get

  queue mapping for /dev/nvme0n1
        hctx0: default 2 4
        hctx1: default 3 5
        hctx2: default 0 6
        hctx3: default 1 7

  PCI name is 00:05.0: nvme0n1
        irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
        irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
        irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
        irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
        irq 64 affinity 1 effective 1 is_managed:1 nvme0q4

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)

Comments

Ming Lei Aug. 6, 2024, 2:55 p.m. UTC | #1
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> to first hctx unless, the CPU is the same as the hctx has the affinity
> set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config

What is the expected behavior if someone still tries to submit IO on isolated
CPUs?

BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
patchset, that means one random hctx(or even NULL) may be used for submitting
IO from isolated CPUs, then there can be io hang risk during cpu hotplug, or
kernel panic when submitting bio.

Thanks,
Ming
Daniel Wagner Aug. 7, 2024, 12:40 p.m. UTC | #2
On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > When isolcpus=io_queue is enabled all hardware queues should run on the
> > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > to first hctx unless, the CPU is the same as the hctx has the affinity
> > set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> 
> What is the expected behavior if someone still tries to submit IO on isolated
> CPUs?

If a user thread is issuing an IO the IO is handled by the housekeeping
CPU, which will cause some noise on the submitting CPU. As far I was
told this is acceptable. Our customers really don't want to have any
IO not from their application ever hitting the isolcpus. When their
application is issuing an IO.

> BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
> patchset,

I was trying to figure out what you tried to explain last time with
hangs, but didn't really understand what the conditions are for this
problem to occur.

> that means one random hctx(or even NULL) may be used for submitting
> IO from isolated CPUs,
> then there can be io hang risk during cpu hotplug, or
> kernel panic when submitting bio.

Can you elaborate a bit more? I must miss something important here.

Anyway, my understanding is that when the last CPU of a hctx goes
offline the affinity is broken and assigned to an online HK CPU. And we
ensure all flight IO have finished and also ensure we don't submit any
new IO to a CPU which goes offline.

FWIW, I tried really hard to get an IO hang with cpu hotplug.
Ming Lei Aug. 8, 2024, 5:26 a.m. UTC | #3
On Wed, Aug 07, 2024 at 02:40:11PM +0200, Daniel Wagner wrote:
> On Tue, Aug 06, 2024 at 10:55:09PM GMT, Ming Lei wrote:
> > On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> > > When isolcpus=io_queue is enabled all hardware queues should run on the
> > > housekeeping CPUs only. Thus ignore the affinity mask provided by the
> > > driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> > > to first hctx unless, the CPU is the same as the hctx has the affinity
> > > set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> > 
> > What is the expected behavior if someone still tries to submit IO on isolated
> > CPUs?
> 
> If a user thread is issuing an IO the IO is handled by the housekeeping
> CPU, which will cause some noise on the submitting CPU. As far I was
> told this is acceptable. Our customers really don't want to have any
> IO not from their application ever hitting the isolcpus. When their
> application is issuing an IO.
> 
> > BTW, I don't see any change in blk_mq_get_ctx()/blk_mq_map_queue() in this
> > patchset,
> 
> I was trying to figure out what you tried to explain last time with
> hangs, but didn't really understand what the conditions are for this
> problem to occur.

Isolated CPUs are removed from queue mapping in this patchset, when someone
submit IOs from the isolated CPU, what is the correct hctx used for handling
these IOs?

From current implementation, it depends on implied zero filled
tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.

During CPU offline, in blk_mq_hctx_notify_offline(),
blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
hctx 0 won't be drained.

However managed irq core code still shutdowns the hw queue's irq because all
CPUs in this hctx are offline now. Then IO hang is triggered, isn't it?

The current blk-mq takes static & global queue/CPUs mapping, in which all CPUs
are covered. This patchset removes isolated CPUs from the mapping, and the
change is big from viewpoint of blk-mq queue mapping.

> 
> > that means one random hctx(or even NULL) may be used for submitting
> > IO from isolated CPUs,
> > then there can be io hang risk during cpu hotplug, or
> > kernel panic when submitting bio.
> 
> Can you elaborate a bit more? I must miss something important here.
> 
> Anyway, my understanding is that when the last CPU of a hctx goes
> offline the affinity is broken and assigned to an online HK CPU. And we
> ensure all flight IO have finished and also ensure we don't submit any
> new IO to a CPU which goes offline.
> 
> FWIW, I tried really hard to get an IO hang with cpu hotplug.
 
Please see above.


thanks,
Ming
Daniel Wagner Aug. 9, 2024, 7:22 a.m. UTC | #4
On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> Isolated CPUs are removed from queue mapping in this patchset, when someone
> submit IOs from the isolated CPU, what is the correct hctx used for handling
> these IOs?

No, every possible CPU gets a mapping. What this patch series does, is
to limit/aligns the number of hardware context to the number of
housekeeping CPUs. There is still a complete ctx-hctc mapping. So
whenever an user thread on an isolated CPU is issuing an IO a
housekeeping CPU will also be involved (with the additional overhead,
which seems to be okay for these users).

Without hardware queue on the isolated CPUs ensures we really never get
any unexpected IO on those CPUs unless userspace does it own its own.
It's a safety net.

Just to illustrate it, the non isolcpus configuration (default) map
for an 8 CPU setup:

queue mapping for /dev/vda
        hctx0: default 0
        hctx1: default 1
        hctx2: default 2
        hctx3: default 3
        hctx4: default 4
        hctx5: default 5
        hctx6: default 6
        hctx7: default 7

and with isolcpus=io_queue,2-3,6-7

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

> From current implementation, it depends on implied zero filled
> tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> 
> During CPU offline, in blk_mq_hctx_notify_offline(),
> blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> hctx 0 won't be drained.
> 
> However managed irq core code still shutdowns the hw queue's irq because all
> CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> it?

Thanks for the explanation. I was able to reproduce this scenario, that
is a hardware context with two CPUs which go offline. Initially, I used
fio for creating the workload but this never hit the hanger. Instead
some background workload from systemd-journald is pretty reliable to
trigger the hanger you describe.

Example:

  hctx2: default 4 6

CPU 0 stays online, CPU 1-5 are offline. CPU 6 is offlined:

  smpboot: CPU 5 is now offline
  blk_mq_hctx_has_online_cpu:3537 hctx3 offline
  blk_mq_hctx_has_online_cpu:3537 hctx2 offline

and there is no forward progress anymore, the cpuhotplug state machine
is blocked and an IO is hanging:

  # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
  /sys/kernel/debug/block/vda/hctx2/tags:busy=61

and blk_mq_hctx_notify_offline busy loops forever:

   task:cpuhp/6         state:D stack:0     pid:439   tgid:439   ppid:2      flags:0x00004000
   Call Trace:
    <TASK>
    __schedule+0x79d/0x15c0
    ? lockdep_hardirqs_on_prepare+0x152/0x210
    ? kvm_sched_clock_read+0xd/0x20
    ? local_clock_noinstr+0x28/0xb0
    ? local_clock+0x11/0x30
    ? lock_release+0x122/0x4a0
    schedule+0x3d/0xb0
    schedule_timeout+0x88/0xf0
    ? __pfx_process_timeout+0x10/0x10d
    msleep+0x28/0x40
    blk_mq_hctx_notify_offline+0x1b5/0x200
    ? cpuhp_thread_fun+0x41/0x1f0
    cpuhp_invoke_callback+0x27e/0x780
    ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
    ? cpuhp_thread_fun+0x42/0x1f0
    cpuhp_thread_fun+0x178/0x1f0
    smpboot_thread_fn+0x12e/0x1c0
    ? __pfx_smpboot_thread_fn+0x10/0x10
    kthread+0xe8/0x110
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x33/0x40
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>

I don't think this is a new problem this code introduces. This problem
exists for any hardware context which has more than one CPU. As far I
understand it, the problem is that there is no forward progress possible
for the IO itself (I assume the corresponding resources for the CPU
going offline have already been shutdown, thus no progress?) and
blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.

Couldn't we do something like:

+static bool blk_mq_hctx_timeout_rq(struct request *rq, void *data)
+{
+       blk_mq_rq_timed_out(rq);
+       return true;
+}
+
+static void blk_mq_hctx_timeout_rqs(struct blk_mq_hw_ctx *hctx)
+{
+       struct blk_mq_tags *tags = hctx->sched_tags ?
+                       hctx->sched_tags : hctx->tags;
+       blk_mq_all_tag_iter(tags, blk_mq_hctx_timeout_rq, NULL);
+}
+
+
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
        struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
                        struct blk_mq_hw_ctx, cpuhp_online);
+       int i;

        if (blk_mq_hctx_has_online_cpu(hctx, cpu))
                return 0;
@@ -3551,9 +3589,16 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
         * requests.  If we could not grab a reference the queue has been
         * frozen and there are no requests.
         */
+       i = 0;
        if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
-               while (blk_mq_hctx_has_requests(hctx))
+               while (blk_mq_hctx_has_requests(hctx) && i++ < 10)
                        msleep(5);
+               if (blk_mq_hctx_has_requests(hctx)) {
+                       pr_info("%s:%d hctx %d force timeout request\n",
+                               __func__, __LINE__, hctx->queue_num);
+                       blk_mq_hctx_timeout_rqs(hctx);
+               }
+

This guarantees forward progress and it worked in my test scenario, got
the corresponding log entries

  blk_mq_hctx_notify_offline:3598 hctx 2 force timeout request

and the hotplug state machine continued. Didn't see an IO error either,
but I haven't looked closely, this is just a POC.

BTW, when looking at the tag allocator, I didn't see any hctx state
checks for the batched alloction path. Don't we need to check if the
corresponding hardware context is active there too?

@ -486,6 +487,15 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
        if (data->nr_tags > 1) {
                rq = __blk_mq_alloc_requests_batch(data);
                if (rq) {
+                       if (unlikely(test_bit(BLK_MQ_S_INACTIVE,
+                                             &data->hctx->state))) {
+                               blk_mq_put_tag(blk_mq_tags_from_data(data),
+                                              rq->mq_ctx, rq->tag);
+                               msleep(3);
+                               goto retry;
+                       }
                        blk_mq_rq_time_init(rq, alloc_time_ns);
                        return rq;
                }

But given this is the hotpath and the hotplug path is very unlikely to
be used at all, at least for the majority of users, I would suggest to
try to get blk_mq_hctx_notify_offline to guarantee forward progress?.
This would make the hotpath an 'if' less.
Ming Lei Aug. 9, 2024, 2:53 p.m. UTC | #5
On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > Isolated CPUs are removed from queue mapping in this patchset, when someone
> > submit IOs from the isolated CPU, what is the correct hctx used for handling
> > these IOs?
> 
> No, every possible CPU gets a mapping. What this patch series does, is
> to limit/aligns the number of hardware context to the number of
> housekeeping CPUs. There is still a complete ctx-hctc mapping. So

OK, then I guess patch 1~7 aren't supposed to belong to this series,
cause you just want to reduce nr_hw_queues, meantime spread
house-keeping CPUs first for avoiding queues with all isolated cpu mask.

> whenever an user thread on an isolated CPU is issuing an IO a
> housekeeping CPU will also be involved (with the additional overhead,
> which seems to be okay for these users).
> 
> Without hardware queue on the isolated CPUs ensures we really never get
> any unexpected IO on those CPUs unless userspace does it own its own.
> It's a safety net.
> 
> Just to illustrate it, the non isolcpus configuration (default) map
> for an 8 CPU setup:
> 
> queue mapping for /dev/vda
>         hctx0: default 0
>         hctx1: default 1
>         hctx2: default 2
>         hctx3: default 3
>         hctx4: default 4
>         hctx5: default 5
>         hctx6: default 6
>         hctx7: default 7
> 
> and with isolcpus=io_queue,2-3,6-7
> 
> queue mapping for /dev/vda
>         hctx0: default 0 2
>         hctx1: default 1 3
>         hctx2: default 4 6
>         hctx3: default 5 7

OK, Looks I missed the point in patch 15 in which you added isolated cpu
into mapping manually, just wondering why not take the current two-stage
policy to cover both house-keeping and isolated CPUs in group_cpus_evenly()?

Such as spread house-keeping CPUs first, then isolated CPUs, just like
what we did for present & non-present cpus.

Then the whole patchset can be simplified a lot.

> 
> > From current implementation, it depends on implied zero filled
> > tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> > 
> > During CPU offline, in blk_mq_hctx_notify_offline(),
> > blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> > hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> > hctx 0 won't be drained.
> > 
> > However managed irq core code still shutdowns the hw queue's irq because all
> > CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> > it?
> 
> Thanks for the explanation. I was able to reproduce this scenario, that
> is a hardware context with two CPUs which go offline. Initially, I used
> fio for creating the workload but this never hit the hanger. Instead
> some background workload from systemd-journald is pretty reliable to
> trigger the hanger you describe.
> 
> Example:
> 
>   hctx2: default 4 6
> 
> CPU 0 stays online, CPU 1-5 are offline. CPU 6 is offlined:
> 
>   smpboot: CPU 5 is now offline
>   blk_mq_hctx_has_online_cpu:3537 hctx3 offline
>   blk_mq_hctx_has_online_cpu:3537 hctx2 offline
> 
> and there is no forward progress anymore, the cpuhotplug state machine
> is blocked and an IO is hanging:
> 
>   # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
>   /sys/kernel/debug/block/vda/hctx2/tags:busy=61
> 
> and blk_mq_hctx_notify_offline busy loops forever:
> 
>    task:cpuhp/6         state:D stack:0     pid:439   tgid:439   ppid:2      flags:0x00004000
>    Call Trace:
>     <TASK>
>     __schedule+0x79d/0x15c0
>     ? lockdep_hardirqs_on_prepare+0x152/0x210
>     ? kvm_sched_clock_read+0xd/0x20
>     ? local_clock_noinstr+0x28/0xb0
>     ? local_clock+0x11/0x30
>     ? lock_release+0x122/0x4a0
>     schedule+0x3d/0xb0
>     schedule_timeout+0x88/0xf0
>     ? __pfx_process_timeout+0x10/0x10d
>     msleep+0x28/0x40
>     blk_mq_hctx_notify_offline+0x1b5/0x200
>     ? cpuhp_thread_fun+0x41/0x1f0
>     cpuhp_invoke_callback+0x27e/0x780
>     ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
>     ? cpuhp_thread_fun+0x42/0x1f0
>     cpuhp_thread_fun+0x178/0x1f0
>     smpboot_thread_fn+0x12e/0x1c0
>     ? __pfx_smpboot_thread_fn+0x10/0x10
>     kthread+0xe8/0x110
>     ? __pfx_kthread+0x10/0x10
>     ret_from_fork+0x33/0x40
>     ? __pfx_kthread+0x10/0x10
>     ret_from_fork_asm+0x1a/0x30
>     </TASK>
> 
> I don't think this is a new problem this code introduces. This problem
> exists for any hardware context which has more than one CPU. As far I
> understand it, the problem is that there is no forward progress possible
> for the IO itself (I assume the corresponding resources for the CPU

When blk_mq_hctx_notify_offline() is running, the current CPU isn't
offline yet, and the hctx is active, same with the managed irq, so it is fine
to wait until all in-flight IOs originated from this hctx completed there.

The reason is why these requests can't be completed? And the forward
progress is provided by blk-mq. And these requests are very likely
allocated & submitted from CPU6.

Can you figure out what is effective mask for irq of hctx2?  It is
supposed to be cpu6. And block debugfs for vda should provide helpful
hint.

> going offline have already been shutdown, thus no progress?) and
> blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.

RH has internal cpu hotplug stress test, but not see such report so far.

I will try to setup such kind of setting and see if it can be
reproduced.

> 
> Couldn't we do something like:

I usually won't thinking about any solution until root-cause is figured
out, :-)
 

Thanks, 
Ming
Ming Lei Aug. 9, 2024, 3:23 p.m. UTC | #6
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> to first hctx unless, the CPU is the same as the hctx has the affinity
> set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 3 4 6 7
>         hctx1: default 5
>         hctx2: default 0
>         hctx3: default 1
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> where as with blk_mq_hk_map_queues we get
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 4
>         hctx1: default 3 5
>         hctx2: default 0 6
>         hctx3: default 1 7
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index c1277763aeeb..7e026c2ffa02 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -60,11 +60,64 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>  
> +static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
> +{
> +	struct cpumask *hk_masks;
> +	cpumask_var_t isol_mask;
> +
> +	unsigned int queue, cpu;
> +
> +	if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		return false;
> +
> +	/* map housekeeping cpus to matching hardware context */
> +	hk_masks = group_cpus_evenly(qmap->nr_queues);
> +	if (!hk_masks)
> +		goto fallback;
> +
> +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> +		for_each_cpu(cpu, &hk_masks[queue])
> +			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +	}
> +
> +	kfree(hk_masks);
> +
> +	/* map isolcpus to hardware context */
> +	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
> +		goto fallback;
> +
> +	queue = 0;
> +	cpumask_andnot(isol_mask,
> +		       cpu_possible_mask,
> +		       housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +
> +	for_each_cpu(cpu, isol_mask) {
> +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		queue = (queue + 1) % qmap->nr_queues;
> +	}
> +
> +	free_cpumask_var(isol_mask);
> +
> +	return true;
> +
> +fallback:
> +	/* map all cpus to hardware context ignoring any affinity */
> +	queue = 0;
> +	for_each_possible_cpu(cpu) {
> +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		queue = (queue + 1) % qmap->nr_queues;
> +	}
> +	return true;
> +}
> +
>  void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
>  {
>  	const struct cpumask *masks;
>  	unsigned int queue, cpu;
>  
> +	if (blk_mq_hk_map_queues(qmap))
> +		return;
> +
>  	masks = group_cpus_evenly(qmap->nr_queues);
>  	if (!masks) {
>  		for_each_possible_cpu(cpu)
> @@ -118,6 +171,9 @@ void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
>  	const struct cpumask *mask;
>  	unsigned int queue, cpu;
>  
> +	if (blk_mq_hk_map_queues(qmap))
> +		return;
> +
>  	for (queue = 0; queue < qmap->nr_queues; queue++) {
>  		mask = get_queue_affinity(dev_data, dev_off, queue);
>  		if (!mask)

From above implementation, "isolcpus=io_queue" is actually just one
optimization on "isolcpus=managed_irq", and there isn't essential
difference between the two.

And I'd suggest to optimize 'isolcpus=managed_irq' directly, such as:

- reduce nr_queues or numgrps for group_cpus_evenly() according to
house-keeping cpu mask

- spread house-keeping & isolate cpu mask evenly on each queue, and
you can use the existed two-stage spread for doing that


thanks,
Ming
Daniel Wagner Aug. 13, 2024, 12:17 p.m. UTC | #7
On Fri, Aug 09, 2024 at 10:53:16PM GMT, Ming Lei wrote:
> On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> > On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > > Isolated CPUs are removed from queue mapping in this patchset, when someone
> > > submit IOs from the isolated CPU, what is the correct hctx used for handling
> > > these IOs?
> > 
> > No, every possible CPU gets a mapping. What this patch series does, is
> > to limit/aligns the number of hardware context to the number of
> > housekeeping CPUs. There is still a complete ctx-hctc mapping. So
> 
> OK, then I guess patch 1~7 aren't supposed to belong to this series,
> cause you just want to reduce nr_hw_queues, meantime spread
> house-keeping CPUs first for avoiding queues with all isolated cpu
> mask.

I tried to explain the reason for these patches in the cover letter. The
idea here is that it makes the later changes simpler, because we only
have to touch one place. Furthermore, the caller just needs to provide
an affinity mask the rest of the code then is generic. This allows to
replace the open coded mapping code in hisi for example. Overall I think
the resulting code is nicer and cleaner.

> OK, Looks I missed the point in patch 15 in which you added isolated cpu
> into mapping manually, just wondering why not take the current two-stage
> policy to cover both house-keeping and isolated CPUs in
> group_cpus_evenly()?

Patch #15 explains why this approach didn't work in the current form.
blk_mq_map_queues will map all isolated CPUs to the first hctx.

> Such as spread house-keeping CPUs first, then isolated CPUs, just like
> what we did for present & non-present cpus.

I've experimented with this approach and it didn't work (see above).

> When blk_mq_hctx_notify_offline() is running, the current CPU isn't
> offline yet, and the hctx is active, same with the managed irq, so it is fine
> to wait until all in-flight IOs originated from this hctx completed
> there.

But if the if for some reason these never complete (as in my case),
this blocks forever. Wouldn't it make sense to abort the wait after a
while?

> The reason is why these requests can't be completed? And the forward
> progress is provided by blk-mq. And these requests are very likely
> allocated & submitted from CPU6.

Yes, I can confirm that the in flight request have been allocated and
submitted by the CPU which is offlined.

Here a log snipped from a different debug session. CPU 1 and 2 are
already offline, CPU 3 is offlined. The CPU mapping for hctx1 is

        hctx1: default 1 3

I've added a printk to my hack timeout handler:

 blk_mq_hctx_notify_offline:3600 hctx 1 force timeout request
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3

that means these request have been allocated on CPU 3 and are still
marked as in flight. I am trying to figure out why they are not
completed as next step.

> Can you figure out what is effective mask for irq of hctx2?  It is
> supposed to be cpu6. And block debugfs for vda should provide helpful
> hint.

The effective mask for the above debug output is

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

PCI name is 00:02.0: vda
        irq 27 affinity 0-1 effective 0  virtio0-config
        irq 28 affinity 0 effective 0  virtio0-req.0
        irq 29 affinity 1 effective 1  virtio0-req.1
        irq 30 affinity 4 effective 4  virtio0-req.2
        irq 31 affinity 5 effective 5  virtio0-req.3

Maybe there is still something off with qemu and the IRQ routing and the
interrupts have been delivered to the wrong CPU.

> > going offline have already been shutdown, thus no progress?) and
> > blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.
> 
> RH has internal cpu hotplug stress test, but not see such report so
> far.

Is this stress test running on real hardware? If so, it adds to my
theory that the interrupt might be lost in certain situation when
running qemu.

 > Couldn't we do something like:
> 
> I usually won't thinking about any solution until root-cause is figured
> out, :-)

I agree, though sometimes is also is okay to have some defensive
programming in place, such an upper limit when until giving up the wait.

But yeah, let's focus figuring out what's wrong.
Daniel Wagner Aug. 13, 2024, 12:53 p.m. UTC | #8
On Fri, Aug 09, 2024 at 11:23:58PM GMT, Ming Lei wrote:
> From above implementation, "isolcpus=io_queue" is actually just one
> optimization on "isolcpus=managed_irq", and there isn't essential
> difference between the two.

Indeed, the two versions do not differ so much. I understood, that you
really want to keep managed_irq as it currently is and that's why I
thought we need io_queue.

> And I'd suggest to optimize 'isolcpus=managed_irq' directly, such as:
> 
> - reduce nr_queues or numgrps for group_cpus_evenly() according to
> house-keeping cpu mask

Okay.

> - spread house-keeping & isolate cpu mask evenly on each queue, and
> you can use the existed two-stage spread for doing that

Sure if we can get the spreading sorted out so that not all isolcpus are
mapped to the first hctx.

Thanks,
Daniel
Ming Lei Aug. 13, 2024, 12:56 p.m. UTC | #9
On Tue, Aug 06, 2024 at 02:06:47PM +0200, Daniel Wagner wrote:
> When isolcpus=io_queue is enabled all hardware queues should run on the
> housekeeping CPUs only. Thus ignore the affinity mask provided by the
> driver. Also we can't use blk_mq_map_queues because it will map all CPUs
> to first hctx unless, the CPU is the same as the hctx has the affinity
> set to, e.g. 8 CPUs with isolcpus=io_queue,2-3,6-7 config
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 3 4 6 7
>         hctx1: default 5
>         hctx2: default 0
>         hctx3: default 1
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 57 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 58 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 59 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 60 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 61 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> where as with blk_mq_hk_map_queues we get
> 
>   queue mapping for /dev/nvme0n1
>         hctx0: default 2 4
>         hctx1: default 3 5
>         hctx2: default 0 6
>         hctx3: default 1 7
> 
>   PCI name is 00:05.0: nvme0n1
>         irq 56 affinity 0-1 effective 1 is_managed:0 nvme0q0
>         irq 61 affinity 4 effective 4 is_managed:1 nvme0q1
>         irq 62 affinity 5 effective 5 is_managed:1 nvme0q2
>         irq 63 affinity 0 effective 0 is_managed:1 nvme0q3
>         irq 64 affinity 1 effective 1 is_managed:1 nvme0q4
> 
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>  block/blk-mq-cpumap.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 56 insertions(+)
> 
> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index c1277763aeeb..7e026c2ffa02 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -60,11 +60,64 @@ unsigned int blk_mq_num_online_queues(unsigned int max_queues)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
>  
> +static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
> +{
> +	struct cpumask *hk_masks;
> +	cpumask_var_t isol_mask;
> +
> +	unsigned int queue, cpu;
> +
> +	if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
> +		return false;
> +
> +	/* map housekeeping cpus to matching hardware context */
> +	hk_masks = group_cpus_evenly(qmap->nr_queues);
> +	if (!hk_masks)
> +		goto fallback;
> +
> +	for (queue = 0; queue < qmap->nr_queues; queue++) {
> +		for_each_cpu(cpu, &hk_masks[queue])
> +			qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +	}
> +
> +	kfree(hk_masks);
> +
> +	/* map isolcpus to hardware context */
> +	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
> +		goto fallback;
> +
> +	queue = 0;
> +	cpumask_andnot(isol_mask,
> +		       cpu_possible_mask,
> +		       housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> +
> +	for_each_cpu(cpu, isol_mask) {
> +		qmap->mq_map[cpu] = qmap->queue_offset + queue;
> +		queue = (queue + 1) % qmap->nr_queues;
> +	}
> +

With patch 14 and the above change, managed irq's affinity becomes not
matched with blk-mq mapping any more.

If the last CPU in managed irq's affinity becomes offline, blk-mq
mapping may have other isolated CPUs, so IOs in this hctx won't be
drained from blk_mq_hctx_notify_offline() in case of CPU offline,
but genirq still shutdowns this manage irq.

So IO hang risk is introduced here, it should be the reason of your
hang observation.


Thanks, 
Ming
Daniel Wagner Aug. 13, 2024, 1:11 p.m. UTC | #10
On Tue, Aug 13, 2024 at 08:56:02PM GMT, Ming Lei wrote:
> With patch 14 and the above change, managed irq's affinity becomes not
> matched with blk-mq mapping any more.

Ah, got it. The problem here is that I need to update also the irq
affinity mask for the hctx when offlining a CPU.
diff mbox series

Patch

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index c1277763aeeb..7e026c2ffa02 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -60,11 +60,64 @@  unsigned int blk_mq_num_online_queues(unsigned int max_queues)
 }
 EXPORT_SYMBOL_GPL(blk_mq_num_online_queues);
 
+static bool blk_mq_hk_map_queues(struct blk_mq_queue_map *qmap)
+{
+	struct cpumask *hk_masks;
+	cpumask_var_t isol_mask;
+
+	unsigned int queue, cpu;
+
+	if (!housekeeping_enabled(HK_TYPE_IO_QUEUE))
+		return false;
+
+	/* map housekeeping cpus to matching hardware context */
+	hk_masks = group_cpus_evenly(qmap->nr_queues);
+	if (!hk_masks)
+		goto fallback;
+
+	for (queue = 0; queue < qmap->nr_queues; queue++) {
+		for_each_cpu(cpu, &hk_masks[queue])
+			qmap->mq_map[cpu] = qmap->queue_offset + queue;
+	}
+
+	kfree(hk_masks);
+
+	/* map isolcpus to hardware context */
+	if (!alloc_cpumask_var(&isol_mask, GFP_KERNEL))
+		goto fallback;
+
+	queue = 0;
+	cpumask_andnot(isol_mask,
+		       cpu_possible_mask,
+		       housekeeping_cpumask(HK_TYPE_IO_QUEUE));
+
+	for_each_cpu(cpu, isol_mask) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+
+	free_cpumask_var(isol_mask);
+
+	return true;
+
+fallback:
+	/* map all cpus to hardware context ignoring any affinity */
+	queue = 0;
+	for_each_possible_cpu(cpu) {
+		qmap->mq_map[cpu] = qmap->queue_offset + queue;
+		queue = (queue + 1) % qmap->nr_queues;
+	}
+	return true;
+}
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
 	unsigned int queue, cpu;
 
+	if (blk_mq_hk_map_queues(qmap))
+		return;
+
 	masks = group_cpus_evenly(qmap->nr_queues);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
@@ -118,6 +171,9 @@  void blk_mq_dev_map_queues(struct blk_mq_queue_map *qmap,
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	if (blk_mq_hk_map_queues(qmap))
+		return;
+
 	for (queue = 0; queue < qmap->nr_queues; queue++) {
 		mask = get_queue_affinity(dev_data, dev_off, queue);
 		if (!mask)