[RFC,v7,00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

Message ID 1591810159-240929-1-git-send-email-john.garry@huawei.com (mailing list archive)
Series blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs

Message

John Garry June 10, 2020, 5:29 p.m. UTC
Hi all,

Here is v7 of the patchset.

In this version of the series, we keep the shared sbitmap for driver tags,
and introduce changes to fix up the tag budgeting across request queues (
and associated debugfs changes).

Some performance figures:

Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
but it is not always an appropriate scheduler to use.

Tag depth 		4000 (default)			260**

Baseline:
none sched:		2290K IOPS			894K
mq-deadline sched:	2341K IOPS			2313K

Final, host_tagset=0 in LLDD*
none sched:		2289K IOPS			703K
mq-deadline sched:	2337K IOPS			2291K

Final:
none sched:		2281K IOPS			1101K
mq-deadline sched:	2322K IOPS			1278K

* this is relevant as it shows the performance when supporting but not
  enabling the feature
** depth=260 is relevant as roughly the point where we are regularly waiting
   for tags to be available. Figures were a bit unstable here in testing.

A copy of the patches can be found here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7

And to progress this series, we need the following to go in first, when ready:
https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

Comments welcome, thanks!

Differences to v6:
- tag budgeting across request queues and associated changes
- add any reviewed tags
- rebase
- I did not include any change related to updating shared sbitmap per-hctx
  wait pointer, based on lack of evidence of performance improvement. This
  was discussed here originally:
  https://lore.kernel.org/linux-scsi/ecaeccf029c6fe377ebd4f30f04df9f1@mail.gmail.com/
  I may revisit.

Differences to v5:
- For now, drop the shared scheduler tags
- Fix megaraid SAS queue selection and rebase
- Omit minor unused arguments patch, which has now been merged
- Add separate patch to introduce sbitmap pointer
- Fixed hctx_tags_bitmap_show() for shared sbitmap
- Various tidying

Hannes Reinecke (5):
  blk-mq: rename blk_mq_update_tag_set_depth()
  scsi: Add template flag 'host_tagset'
  megaraid_sas: switch fusion adapters to MQ
  smartpqi: enable host tagset
  hpsa: enable host_tagset and switch to MQ

John Garry (6):
  blk-mq: Use pointers for blk_mq_tags bitmap tags
  blk-mq: Facilitate a shared sbitmap per tagset
  blk-mq: Record nr_active_requests per queue for when using shared
    sbitmap
  blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
    shared sbitmap
  blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  scsi: hisi_sas: Switch v3 hw to MQ

Ming Lei (1):
  blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

 block/bfq-iosched.c                         |   4 +-
 block/blk-core.c                            |   2 +
 block/blk-mq-debugfs.c                      | 120 ++++++++++++++-
 block/blk-mq-tag.c                          | 157 ++++++++++++++------
 block/blk-mq-tag.h                          |  21 ++-
 block/blk-mq.c                              |  64 +++++---
 block/blk-mq.h                              |  33 +++-
 block/kyber-iosched.c                       |   4 +-
 drivers/scsi/hisi_sas/hisi_sas.h            |   3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c       |  36 ++---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c      |  87 +++++------
 drivers/scsi/hpsa.c                         |  44 +-----
 drivers/scsi/hpsa.h                         |   1 -
 drivers/scsi/megaraid/megaraid_sas.h        |   1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   |  59 +++-----
 drivers/scsi/megaraid/megaraid_sas_fusion.c |  24 +--
 drivers/scsi/scsi_lib.c                     |   2 +
 drivers/scsi/smartpqi/smartpqi_init.c       |  38 +++--
 include/linux/blk-mq.h                      |   9 +-
 include/linux/blkdev.h                      |   3 +
 include/scsi/scsi_host.h                    |   6 +-
 21 files changed, 463 insertions(+), 255 deletions(-)

Comments

Ming Lei June 11, 2020, 3:07 a.m. UTC | #1
On Thu, Jun 11, 2020 at 01:29:07AM +0800, John Garry wrote:
> Hi all,
> 
> Here is v7 of the patchset.
> 
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues (
> and associated debugfs changes).
> 
> Some performance figures:
> 
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
> 
> Tag depth 		4000 (default)			260**
> 
> Baseline:
> none sched:		2290K IOPS			894K
> mq-deadline sched:	2341K IOPS			2313K
> 
> Final, host_tagset=0 in LLDD*
> none sched:		2289K IOPS			703K
> mq-deadline sched:	2337K IOPS			2291K
> 
> Final:
> none sched:		2281K IOPS			1101K
> mq-deadline sched:	2322K IOPS			1278K
> 
> * this is relevant as it shows the performance when supporting but not
>   enabling the feature
> ** depth=260 is relevant as roughly the point where we are regularly waiting
>    for tags to be available. Figures were a bit unstable here in testing.
> 
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> 
> And to progress this series, we need the following to go in first, when ready:
> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

I'd suggest to add options to enable shared tags for null_blk & scsi_debug in V8, so
that it is easier to verify the changes without real hardware.

Thanks,
Ming
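[For illustration, here is a minimal sketch of the kind of test-only option being suggested for null_blk. The module parameter name, the simplified helper shown, and the BLK_MQ_F_TAG_HCTX_SHARED flag name are assumptions made for this sketch rather than code from the series; a real patch would hook into null_blk's existing tag set setup and use whatever flag the series finally introduces.]

/*
 * Sketch only: a null_blk-style module parameter that opts the tag set
 * into a single tag bitmap shared by all hardware queues, so the feature
 * can be exercised without real hardware. Names are illustrative.
 */
static bool g_shared_tag_bitmap;
module_param_named(shared_tag_bitmap, g_shared_tag_bitmap, bool, 0444);
MODULE_PARM_DESC(shared_tag_bitmap, "Use shared tag bitmap for all submission queues");

static int null_init_tag_set(struct blk_mq_tag_set *set)
{
	set->ops = &null_mq_ops;		/* existing null_blk blk-mq ops */
	set->nr_hw_queues = g_submit_queues;	/* existing null_blk parameters */
	set->queue_depth = g_hw_queue_depth;
	set->numa_node = g_home_node;
	set->flags = BLK_MQ_F_SHOULD_MERGE;
	if (g_shared_tag_bitmap)
		set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;	/* assumed flag name */
	return blk_mq_alloc_tag_set(set);
}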
John Garry June 11, 2020, 9:35 a.m. UTC | #2
On 11/06/2020 04:07, Ming Lei wrote:
>> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
>> but it is not always an appropriate scheduler to use.
>>
>> Tag depth 		4000 (default)			260**
>>
>> Baseline:
>> none sched:		2290K IOPS			894K
>> mq-deadline sched:	2341K IOPS			2313K
>>
>> Final, host_tagset=0 in LLDD*
>> none sched:		2289K IOPS			703K
>> mq-deadline sched:	2337K IOPS			2291K
>>
>> Final:
>> none sched:		2281K IOPS			1101K
>> mq-deadline sched:	2322K IOPS			1278K
>>
>> * this is relevant as it shows the performance when supporting but not
>>    enabling the feature
>> ** depth=260 is relevant as roughly the point where we are regularly waiting
>>     for tags to be available. Figures were a bit unstable here in testing.
>>
>> A copy of the patches can be found here:
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>
>> And to progress this series, we need the following to go in first, when ready:
>> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/
> I'd suggest to add options to enable shared tags for null_blk & scsi_debug in V8, so
> that it is easier to verify the changes without real hardware.
> 

ok, fine, I can look at including that. To stop the series getting too 
large, I might spin off the early patches, which are not strictly related.

Thanks,
John
Kashyap Desai June 12, 2020, 6:47 p.m. UTC | #3
> On 11/06/2020 04:07, Ming Lei wrote:
> >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> >> included, but it is not always an appropriate scheduler to use.
> >>
> >> Tag depth 		4000 (default)			260**
> >>
> >> Baseline:
> >> none sched:		2290K IOPS			894K
> >> mq-deadline sched:	2341K IOPS			2313K
> >>
> >> Final, host_tagset=0 in LLDD*
> >> none sched:		2289K IOPS			703K
> >> mq-deadline sched:	2337K IOPS			2291K
> >>
> >> Final:
> >> none sched:		2281K IOPS			1101K
> >> mq-deadline sched:	2322K IOPS			1278K
> >>
> >> * this is relevant as it shows the performance when supporting but not
> >>    enabling the feature
> >> ** depth=260 is relevant as roughly the point where we are regularly
> >> waiting for tags to be available. Figures were a bit unstable here in
> >> testing.

John -

I tried the V7 series and debugged further on the mq-deadline interface. This
time I have used another setup, since an HDD based setup is not readily
available to me.
In fact, I was able to simulate the issue very easily using a single
scsi_device as well. BTW, this is not an issue with this RFC, but a generic
issue. Since I have converted nr_hw_queue > 1 for a Broadcom product using
this RFC, it becomes noticeable now.

Problem - Using the below command I see heavy CPU utilization on
"native_queued_spin_lock_slowpath". This is because the kblockd work queue is
submitting IO from all the CPUs even though fio is bound to a single CPU.
Lock contention from "dd_dispatch_request" is causing this issue.

numactl -C 13  fio
single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none --numjobs=1
 --cpus_allowed_policy=split --ioscheduler=mq-deadline
--group_reporting --filename=/dev/sdd

While running the above command, ideally we expect only kworker/13 to be
active. But you can see below - all the CPUs are attempting submission, and
lots of the CPU consumption is due to lock contention.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
kworker/13:1H-k
 7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
 2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
kworker/18:1H-k
 2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
kworker/19:1H-k
 1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
kworker/20:1H-k
 2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
kworker/21:1H-k
 1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
kworker/22:1H-k
 2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
kworker/26:1H-k
 2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
kworker/23:1H-k
 2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
kworker/24:1H-k
 2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
kworker/27:1H-k
 1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
kworker/30:1H-k
 2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
kworker/31:1H-k
 2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
kworker/25:1H-k
 2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
kworker/28:1H-k
 1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
kworker/34:1H-k
 2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
kworker/32:1H-k
 2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
kworker/29:1H-k
 2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
kworker/33:1H-k
 1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
kworker/54:1H-k
 1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
kworker/55:1H-k
 2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
kworker/35:1H-k
 2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
kworker/56:1H-k
 1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
kworker/57:1H-k
 1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
kworker/59:1H-k
 1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
kworker/62:1H-k
 2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
kworker/58:1H-k
 2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
kworker/61:1H-k
 2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
kworker/60:1H-k


I root caused this issue -

The block layer always queues IO on the hctx mapped to CPU-13, but the hw
queue is run from all the hctx contexts.
I noticed in my test that hctx48 has queued all the IOs. No other hctx has
queued IO, but all the hctxs are counting for "run".

# cat hctx48/queued
2087058

#cat hctx*/run
151318
30038
83110
50680
69907
60391
111239
18036
33935
91648
34582
22853
61286
19489

The below patch has the fix - "Run the hctx queue for which the request was
completed instead of running all the hardware queues."
If this looks like a valid fix, please include it in V8, or I can post a
separate patch for it. I just want to have some level of review from this
discussion.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0652acd..f52118f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);
+               //blk_mq_run_hw_queues(q, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;


After the above patch, only kworker/13 is actively doing submission.

3858 root       0 -20       0      0      0 I  22.9  0.0   3:24.04
kworker/13:1H-k
16768 root      20   0  712008  14968   2180 R  21.6  0.0   0:03.27 fio
16769 root      20   0  712012  14968   2180 R  21.6  0.0   0:03.27 fio

Without the above patch, the driver with 24 SSDs can give 1.5M IOPS, and after
the above patch 3.2M IOPS.

I will continue my testing.

Thanks, Kashyap

> >>
> >> A copy of the patches can be found here:
> >> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-
> >> shared-tags-rfc-v7
> >>
> >> And to progress this series, we need the following to go in first, when
> >> ready:
> >> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de
> >> /
> > I'd suggest to add options to enable shared tags for null_blk &
> > scsi_debug in V8, so that it is easier to verify the changes without
> > real
> hardware.
> >
>
> ok, fine, I can look at including that. To stop the series getting too
> large, I
> might spin off the early patches, which are not strictly related.
>
> Thanks,
> John
Ming Lei June 15, 2020, 2:13 a.m. UTC | #4
On Sat, Jun 13, 2020 at 12:17:37AM +0530, Kashyap Desai wrote:
> > On 11/06/2020 04:07, Ming Lei wrote:
> > >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> > >> included, but it is not always an appropriate scheduler to use.
> > >>
> > >> Tag depth 		4000 (default)			260**
> > >>
> > >> Baseline:
> > >> none sched:		2290K IOPS			894K
> > >> mq-deadline sched:	2341K IOPS			2313K
> > >>
> > >> Final, host_tagset=0 in LLDD*
> > >> none sched:		2289K IOPS			703K
> > >> mq-deadline sched:	2337K IOPS			2291K
> > >>
> > >> Final:
> > >> none sched:		2281K IOPS			1101K
> > >> mq-deadline sched:	2322K IOPS			1278K
> > >>
> > >> * this is relevant as it shows the performance when supporting but not
> > >>    enabling the feature
> > >> ** depth=260 is relevant as roughly the point where we are regularly
> > >> waiting for tags to be available. Figures were a bit unstable here in
> > >> testing.
> 
> John -
> 
> I tried V7 series and debug further on mq-deadline interface. This time I
> have used another setup since HDD based setup is not readily available for
> me.
> In fact, I was able to simulate issue very easily using single scsi_device
> as well. BTW, this is not an issue with this RFC, but generic issue.
> Since I have converted nr_hw_queue > 1 for Broadcom product using this RFC,
> It becomes noticeable now.
> 
> Problem - Using below command  I see heavy CPU utilization on "
> native_queued_spin_lock_slowpath". This is because kblockd work queue is
> submitting IO from all the CPUs even though fio is bound to single CPU.
> Lock contention from " dd_dispatch_request" is causing this issue.
> 
> numactl -C 13  fio
> single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none --numjobs=1
>  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> --group_reporting --filename=/dev/sdd
> 
> While running above command, ideally we expect only kworker/13 to be active.
> But you can see below - All the CPU is attempting submission and lots of CPU
> consumption is due to lock contention.
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> kworker/13:1H-k
>  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
>  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> kworker/18:1H-k
>  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> kworker/19:1H-k
>  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> kworker/20:1H-k
>  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> kworker/21:1H-k
>  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> kworker/22:1H-k
>  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> kworker/26:1H-k
>  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> kworker/23:1H-k
>  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> kworker/24:1H-k
>  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> kworker/27:1H-k
>  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> kworker/30:1H-k
>  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> kworker/31:1H-k
>  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> kworker/25:1H-k
>  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> kworker/28:1H-k
>  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> kworker/34:1H-k
>  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> kworker/32:1H-k
>  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> kworker/29:1H-k
>  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> kworker/33:1H-k
>  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> kworker/54:1H-k
>  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> kworker/55:1H-k
>  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> kworker/35:1H-k
>  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> kworker/56:1H-k
>  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> kworker/57:1H-k
>  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> kworker/59:1H-k
>  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> kworker/62:1H-k
>  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> kworker/58:1H-k
>  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> kworker/61:1H-k
>  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> kworker/60:1H-k
> 
> 
> I root cause this issue -
> 
> Block layer always queue IO on hctx context mapped to CPU-13, but hw queue
> run from all the hctx context.
> I noticed in my test hctx48 has queued all the IOs. No other hctx has queued
> IO. But all the hctx is counting for "run".
> 
> # cat hctx48/queued
> 2087058
> 
> #cat hctx*/run
> 151318
> 30038
> 83110
> 50680
> 69907
> 60391
> 111239
> 18036
> 33935
> 91648
> 34582
> 22853
> 61286
> 19489
> 
> Below patch has fix - "Run the hctx queue for which request was completed
> instead of running all the hardware queue."
> If this looks valid fix, please include in V8 OR I can post separate patch
> for this. Just want to have some level of review from this discussion.
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0652acd..f52118f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>         struct scsi_device *sdev = cmd->device;
>         struct request_queue *q = sdev->request_queue;
> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> 
>         if (blk_update_request(req, error, bytes))
>                 return true;
> @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>             !list_empty(&sdev->host->starved_list))
>                 kblockd_schedule_work(&sdev->requeue_work);
>         else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);
> +               //blk_mq_run_hw_queues(q, true);

This way may cause IO hang because ->device_busy is shared by all hctxs.

Thanks,
Ming
Kashyap Desai June 15, 2020, 6:57 a.m. UTC | #5
> >
> > John -
> >
> > I tried V7 series and debug further on mq-deadline interface. This
> > time I have used another setup since HDD based setup is not readily
> > available for me.
> > In fact, I was able to simulate issue very easily using single
> > scsi_device as well. BTW, this is not an issue with this RFC, but
generic issue.
> > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > RFC, It becomes noticeable now.
> >
> > Problem - Using below command  I see heavy CPU utilization on "
> > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > is submitting IO from all the CPUs even though fio is bound to single
CPU.
> > Lock contention from " dd_dispatch_request" is causing this issue.
> >
> > numactl -C 13  fio
> > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > --group_reporting --filename=/dev/sdd
> >
> > While running above command, ideally we expect only kworker/13 to be
> active.
> > But you can see below - All the CPU is attempting submission and lots
> > of CPU consumption is due to lock contention.
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
COMMAND
> >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > kworker/13:1H-k
> >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
fio
> >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > kworker/18:1H-k
> >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > kworker/19:1H-k
> >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > kworker/20:1H-k
> >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > kworker/21:1H-k
> >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > kworker/22:1H-k
> >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > kworker/26:1H-k
> >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > kworker/23:1H-k
> >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > kworker/24:1H-k
> >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > kworker/27:1H-k
> >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > kworker/30:1H-k
> >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > kworker/31:1H-k
> >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > kworker/25:1H-k
> >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > kworker/28:1H-k
> >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > kworker/34:1H-k
> >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > kworker/32:1H-k
> >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > kworker/29:1H-k
> >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > kworker/33:1H-k
> >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > kworker/54:1H-k
> >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > kworker/55:1H-k
> >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > kworker/35:1H-k
> >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > kworker/56:1H-k
> >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > kworker/57:1H-k
> >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > kworker/59:1H-k
> >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > kworker/62:1H-k
> >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > kworker/58:1H-k
> >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > kworker/61:1H-k
> >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > kworker/60:1H-k
> >
> >
> > I root cause this issue -
> >
> > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > queue run from all the hctx context.
> > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > queued IO. But all the hctx is counting for "run".
> >
> > # cat hctx48/queued
> > 2087058
> >
> > #cat hctx*/run
> > 151318
> > 30038
> > 83110
> > 50680
> > 69907
> > 60391
> > 111239
> > 18036
> > 33935
> > 91648
> > 34582
> > 22853
> > 61286
> > 19489
> >
> > Below patch has fix - "Run the hctx queue for which request was
> > completed instead of running all the hardware queue."
> > If this looks valid fix, please include in V8 OR I can post separate
> > patch for this. Just want to have some level of review from this
discussion.
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > 0652acd..f52118f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> >         struct scsi_device *sdev = cmd->device;
> >         struct request_queue *q = sdev->request_queue;
> > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> >
> >         if (blk_update_request(req, error, bytes))
> >                 return true;
> > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >             !list_empty(&sdev->host->starved_list))
> >                 kblockd_schedule_work(&sdev->requeue_work);
> >         else
> > -               blk_mq_run_hw_queues(q, true);
> > +               blk_mq_run_hw_queue(mq_hctx, true);
> > +               //blk_mq_run_hw_queues(q, true);
>
> This way may cause IO hang because ->device_busy is shared by all hctxs.

From the SCSI stack, if we attempt to run all h/w queues, is it possible
that the block layer actually runs a hw_queue which has really not queued
any IO?
Currently, in the case of mq-deadline, IOs are inserted using
"dd_insert_request". This function will add IOs to the elevator data, which
is per request queue and not per hctx.
When there is an attempt to run a hctx, "blk_mq_sched_has_work" will check
for pending work, which is per request queue and not per hctx.
Because of this, IOs queued on only one hctx will be run from all the hctxs,
and this will create unnecessary lock contention.

How about the below patch?

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..1d30bd3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!hctx->queued)
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

Kashyap

>
> Thanks,
> Ming
Ming Lei June 16, 2020, 1 a.m. UTC | #6
On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > >
> > > John -
> > >
> > > I tried V7 series and debug further on mq-deadline interface. This
> > > time I have used another setup since HDD based setup is not readily
> > > available for me.
> > > In fact, I was able to simulate issue very easily using single
> > > scsi_device as well. BTW, this is not an issue with this RFC, but
> generic issue.
> > > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > > RFC, It becomes noticeable now.
> > >
> > > Problem - Using below command  I see heavy CPU utilization on "
> > > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > > is submitting IO from all the CPUs even though fio is bound to single
> CPU.
> > > Lock contention from " dd_dispatch_request" is causing this issue.
> > >
> > > numactl -C 13  fio
> > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > --group_reporting --filename=/dev/sdd
> > >
> > > While running above command, ideally we expect only kworker/13 to be
> > active.
> > > But you can see below - All the CPU is attempting submission and lots
> > > of CPU consumption is due to lock contention.
> > >
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
> > >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > > kworker/13:1H-k
> > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
> fio
> > >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > > kworker/18:1H-k
> > >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > > kworker/19:1H-k
> > >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > > kworker/20:1H-k
> > >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > > kworker/21:1H-k
> > >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > > kworker/22:1H-k
> > >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > > kworker/26:1H-k
> > >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > > kworker/23:1H-k
> > >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > > kworker/24:1H-k
> > >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > > kworker/27:1H-k
> > >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > > kworker/30:1H-k
> > >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > > kworker/31:1H-k
> > >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > > kworker/25:1H-k
> > >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > > kworker/28:1H-k
> > >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > > kworker/34:1H-k
> > >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > > kworker/32:1H-k
> > >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > > kworker/29:1H-k
> > >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > > kworker/33:1H-k
> > >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > > kworker/54:1H-k
> > >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > > kworker/55:1H-k
> > >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > > kworker/35:1H-k
> > >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > > kworker/56:1H-k
> > >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > > kworker/57:1H-k
> > >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > > kworker/59:1H-k
> > >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > > kworker/62:1H-k
> > >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > > kworker/58:1H-k
> > >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > > kworker/61:1H-k
> > >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > > kworker/60:1H-k
> > >
> > >
> > > I root cause this issue -
> > >
> > > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > > queue run from all the hctx context.
> > > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > > queued IO. But all the hctx is counting for "run".
> > >
> > > # cat hctx48/queued
> > > 2087058
> > >
> > > #cat hctx*/run
> > > 151318
> > > 30038
> > > 83110
> > > 50680
> > > 69907
> > > 60391
> > > 111239
> > > 18036
> > > 33935
> > > 91648
> > > 34582
> > > 22853
> > > 61286
> > > 19489
> > >
> > > Below patch has fix - "Run the hctx queue for which request was
> > > completed instead of running all the hardware queue."
> > > If this looks valid fix, please include in V8 OR I can post separate
> > > patch for this. Just want to have some level of review from this
> discussion.
> > >
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > > 0652acd..f52118f 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > >         struct scsi_device *sdev = cmd->device;
> > >         struct request_queue *q = sdev->request_queue;
> > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > >
> > >         if (blk_update_request(req, error, bytes))
> > >                 return true;
> > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >             !list_empty(&sdev->host->starved_list))
> > >                 kblockd_schedule_work(&sdev->requeue_work);
> > >         else
> > > -               blk_mq_run_hw_queues(q, true);
> > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > +               //blk_mq_run_hw_queues(q, true);
> >
> > This way may cause IO hang because ->device_busy is shared by all hctxs.
> 
> From SCSI stack, if we attempt to run all h/w queue, is it possible that
> block layer actually run hw_queue which has really not queued any IO.
> Currently, in case of mq-deadline, IOS are inserted using
> "dd_insert_request". This function will add IOs on elevator data which is
> per request queue and not per hctx.
> When there is an attempt to run hctx, "blk_mq_sched_has_work" will check
> pending work which is per request queue and not per hctx.
> Because of this, IOs queued on only one hctx will be run from all the hctx
> and this will create unnecessary lock contention.

Deadline is intended for HDDs/slow disks, so the lock contention shouldn't
have been a problem, given MQ HDDs were seldom seen before this patchset.

Another related issue is the default scheduler: I guess deadline should still
have been the default io sched for HDDs attached to this kind of HBA, with
multiple reply queues and a single submission queue.

> 
> How about below patch - ?
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..1d30bd3 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!hctx->queued)
> +               return false;
> +
>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);

->queued is only increased and never decreased, just for debug purposes so
far, so it can't be relied on for this purpose.

One approach is to add one similar counter, and maintain it via the
scheduler's insert/dispatch callbacks.

Thanks,
Ming
Kashyap Desai June 17, 2020, 11:26 a.m. UTC | #7
> On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > > >
> > > > John -
> > > >
> > > > I tried V7 series and debug further on mq-deadline interface. This
> > > > time I have used another setup since HDD based setup is not
> > > > readily available for me.
> > > > In fact, I was able to simulate issue very easily using single
> > > > scsi_device as well. BTW, this is not an issue with this RFC, but
> > generic issue.
> > > > Since I have converted nr_hw_queue > 1 for Broadcom product using
> > > > this RFC, It becomes noticeable now.
> > > >
> > > > Problem - Using below command  I see heavy CPU utilization on "
> > > > native_queued_spin_lock_slowpath". This is because kblockd work
> > > > queue is submitting IO from all the CPUs even though fio is bound
> > > > to single
> > CPU.
> > > > Lock contention from " dd_dispatch_request" is causing this issue.
> > > >
> > > > numactl -C 13  fio
> > > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > > --group_reporting --filename=/dev/sdd
> > > >
> > > > While running above command, ideally we expect only kworker/13 to
> > > > be
> > > active.
> > > > But you can see below - All the CPU is attempting submission and
> > > > lots of CPU consumption is due to lock contention.
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM
TIME+
> > COMMAND
> > > >  2726 root       0 -20       0      0      0 R  56.5  0.0
0:53.20
> > > > kworker/13:1H-k
> > > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0
0:05.03
> > fio
> > > >  2792 root       0 -20       0      0      0 I  26.6  0.0
0:22.19
> > > > kworker/18:1H-k
> > > >  2791 root       0 -20       0      0      0 I  19.9  0.0
0:17.17
> > > > kworker/19:1H-k
> > > >  1419 root       0 -20       0      0      0 I  19.6  0.0
0:17.03
> > > > kworker/20:1H-k
> > > >  2793 root       0 -20       0      0      0 I  18.3  0.0
0:15.64
> > > > kworker/21:1H-k
> > > >  1424 root       0 -20       0      0      0 I  17.3  0.0
0:14.99
> > > > kworker/22:1H-k
> > > >  2626 root       0 -20       0      0      0 I  16.9  0.0
0:14.68
> > > > kworker/26:1H-k
> > > >  2794 root       0 -20       0      0      0 I  16.9  0.0
0:14.87
> > > > kworker/23:1H-k
> > > >  2795 root       0 -20       0      0      0 I  16.9  0.0
0:14.81
> > > > kworker/24:1H-k
> > > >  2797 root       0 -20       0      0      0 I  16.9  0.0
0:14.62
> > > > kworker/27:1H-k
> > > >  1415 root       0 -20       0      0      0 I  16.6  0.0
0:14.44
> > > > kworker/30:1H-k
> > > >  2669 root       0 -20       0      0      0 I  16.6  0.0
0:14.38
> > > > kworker/31:1H-k
> > > >  2796 root       0 -20       0      0      0 I  16.6  0.0
0:14.74
> > > > kworker/25:1H-k
> > > >  2799 root       0 -20       0      0      0 I  16.6  0.0
0:14.56
> > > > kworker/28:1H-k
> > > >  1425 root       0 -20       0      0      0 I  16.3  0.0
0:14.21
> > > > kworker/34:1H-k
> > > >  2746 root       0 -20       0      0      0 I  16.3  0.0
0:14.33
> > > > kworker/32:1H-k
> > > >  2798 root       0 -20       0      0      0 I  16.3  0.0
0:14.50
> > > > kworker/29:1H-k
> > > >  2800 root       0 -20       0      0      0 I  16.3  0.0
0:14.27
> > > > kworker/33:1H-k
> > > >  1423 root       0 -20       0      0      0 I  15.9  0.0
0:14.10
> > > > kworker/54:1H-k
> > > >  1784 root       0 -20       0      0      0 I  15.9  0.0
0:14.03
> > > > kworker/55:1H-k
> > > >  2801 root       0 -20       0      0      0 I  15.9  0.0
0:14.15
> > > > kworker/35:1H-k
> > > >  2815 root       0 -20       0      0      0 I  15.9  0.0
0:13.97
> > > > kworker/56:1H-k
> > > >  1484 root       0 -20       0      0      0 I  15.6  0.0
0:13.90
> > > > kworker/57:1H-k
> > > >  1485 root       0 -20       0      0      0 I  15.6  0.0
0:13.82
> > > > kworker/59:1H-k
> > > >  1519 root       0 -20       0      0      0 I  15.6  0.0
0:13.64
> > > > kworker/62:1H-k
> > > >  2315 root       0 -20       0      0      0 I  15.6  0.0
0:13.87
> > > > kworker/58:1H-k
> > > >  2627 root       0 -20       0      0      0 I  15.6  0.0
0:13.69
> > > > kworker/61:1H-k
> > > >  2816 root       0 -20       0      0      0 I  15.6  0.0
0:13.75
> > > > kworker/60:1H-k
> > > >
> > > >
> > > > I root cause this issue -
> > > >
> > > > Block layer always queue IO on hctx context mapped to CPU-13, but
> > > > hw queue run from all the hctx context.
> > > > I noticed in my test hctx48 has queued all the IOs. No other hctx
> > > > has queued IO. But all the hctx is counting for "run".
> > > >
> > > > # cat hctx48/queued
> > > > 2087058
> > > >
> > > > #cat hctx*/run
> > > > 151318
> > > > 30038
> > > > 83110
> > > > 50680
> > > > 69907
> > > > 60391
> > > > 111239
> > > > 18036
> > > > 33935
> > > > 91648
> > > > 34582
> > > > 22853
> > > > 61286
> > > > 19489
> > > >
> > > > Below patch has fix - "Run the hctx queue for which request was
> > > > completed instead of running all the hardware queue."
> > > > If this looks valid fix, please include in V8 OR I can post
> > > > separate patch for this. Just want to have some level of review
> > > > from this
> > discussion.
> > > >
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index 0652acd..f52118f 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > > >         struct scsi_device *sdev = cmd->device;
> > > >         struct request_queue *q = sdev->request_queue;
> > > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > > >
> > > >         if (blk_update_request(req, error, bytes))
> > > >                 return true;
> > > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >             !list_empty(&sdev->host->starved_list))
> > > >                 kblockd_schedule_work(&sdev->requeue_work);
> > > >         else
> > > > -               blk_mq_run_hw_queues(q, true);
> > > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > > +               //blk_mq_run_hw_queues(q, true);
> > >
> > > This way may cause IO hang because ->device_busy is shared by all
hctxs.
> >
> > From SCSI stack, if we attempt to run all h/w queue, is it possible
> > that block layer actually run hw_queue which has really not queued any
IO.
> > Currently, in case of mq-deadline, IOS are inserted using
> > "dd_insert_request". This function will add IOs on elevator data which
> > is per request queue and not per hctx.
> > When there is an attempt to run hctx, "blk_mq_sched_has_work" will
> > check pending work which is per request queue and not per hctx.
> > Because of this, IOs queued on only one hctx will be run from all the
> > hctx and this will create unnecessary lock contention.
>
> Deadline is supposed for HDD. slow disks, so the lock contention
shouldn't
> have been one problem given there is seldom MQ HDD. before this
patchset.
>
> Another related issue is default scheduler, I guess deadline still
should have
> been the default io sched for HDDs. attached to this kind HBA with
multiple
> reply queues and single submission queue.
>
> >
> > How about below patch - ?
> >
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index
> > 126021f..1d30bd3 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx)  {
> >         struct elevator_queue *e = hctx->queue->elevator;
> >
> > +       /* If current hctx has not queued any request, there is no
> > + need to
> > run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will
handle
> > +        * running specific hctx.
> > +        */
> > +       if (!hctx->queued)
> > +               return false;
> > +
> >         if (e && e->type->ops.has_work)
> >                 return e->type->ops.has_work(hctx);
>
> ->queued is increased only and not decreased just for debug purpose so
> ->far, so
> it can't be relied for this purpose.

Thanks. I overlooked that it is only an incremental counter.

>
> One approach is to add one similar counter, and maintain it by
scheduler's
> insert/dispatch callback.

I tried the below and I see performance is in the expected range.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fdcc2c1..ea201d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,

                list_add(&rq->queuelist, &list);
                e->type->ops.insert_requests(hctx, &list, at_head);
+               atomic_inc(&hctx->elevator_queued);
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
@@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
        percpu_ref_get(&q->q_usage_counter);

        e = hctx->queue->elevator;
-       if (e && e->type->ops.insert_requests)
+       if (e && e->type->ops.insert_requests) {
                e->type->ops.insert_requests(hctx, list, false);
+               atomic_inc(&hctx->elevator_queued);
+       }
        else {
                /*
                 * try to issue requests directly if the hw queue isn't
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..48f1824 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (rq->rq_flags & RQF_ELVPRIV) {
-               if (e && e->type->ops.finish_request)
+               if (e && e->type->ops.finish_request) {
                        e->type->ops.finish_request(rq);
+                       atomic_dec(&hctx->elevator_queued);
+               }
                if (rq->elv.icq) {
                        put_io_context(rq->elv.icq->ioc);
                        rq->elv.icq = NULL;
@@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+       atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;



>
> Thanks,
> Ming
Hannes Reinecke June 22, 2020, 6:24 a.m. UTC | #8
On 6/17/20 1:26 PM, Kashyap Desai wrote:
>>
>> ->queued is increased only and not decreased just for debug purpose so
>> ->far, so
>> it can't be relied for this purpose.
> 
> Thanks. I overlooked that that it is only incremental counter.
> 
>>
>> One approach is to add one similar counter, and maintain it by
> scheduler's
>> insert/dispatch callback.
> 
> I tried below  and I see performance is on expected range.
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index fdcc2c1..ea201d0 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> bool at_head,
> 
>                  list_add(&rq->queuelist, &list);
>                  e->type->ops.insert_requests(hctx, &list, at_head);
> +               atomic_inc(&hctx->elevator_queued);
>          } else {
>                  spin_lock(&ctx->lock);
>                  __blk_mq_insert_request(hctx, rq, at_head);
> @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> blk_mq_hw_ctx *hctx,
>          percpu_ref_get(&q->q_usage_counter);
> 
>          e = hctx->queue->elevator;
> -       if (e && e->type->ops.insert_requests)
> +       if (e && e->type->ops.insert_requests) {
>                  e->type->ops.insert_requests(hctx, list, false);
> +               atomic_inc(&hctx->elevator_queued);
> +       }
>          else {
>                  /*
>                   * try to issue requests directly if the hw queue isn't
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>   {
>          struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +
>          if (e && e->type->ops.has_work)
>                  return e->type->ops.has_work(hctx);
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f73a2f9..48f1824 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
>          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> 
>          if (rq->rq_flags & RQF_ELVPRIV) {
> -               if (e && e->type->ops.finish_request)
> +               if (e && e->type->ops.finish_request) {
>                          e->type->ops.finish_request(rq);
> +                       atomic_dec(&hctx->elevator_queued);
> +               }
>                  if (rq->elv.icq) {
>                          put_io_context(rq->elv.icq->ioc);
>                          rq->elv.icq = NULL;
> @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> blk_mq_tag_set *set,
>                  goto free_hctx;
> 
>          atomic_set(&hctx->nr_active, 0);
> +       atomic_set(&hctx->elevator_queued, 0);
>          if (node == NUMA_NO_NODE)
>                  node = set->numa_node;
>          hctx->numa_node = node;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 66711c7..ea1ddb1 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
>           * shared across request queues.
>           */
>          atomic_t                nr_active;
> +       /**
> +        * @elevator_queued: Number of queued requests on hctx.
> +        */
> +       atomic_t                elevator_queued;
> 
>          /** @cpuhp_online: List to store request if CPU is going to die */
>          struct hlist_node       cpuhp_online;
> 
> 
> 
Would it make sense to move it into the elevator itself?
It's a bit odd having a value 'elevator_queued' with no direct reference 
to the elevator...

Cheers,

Hannes
Ming Lei June 23, 2020, 12:55 a.m. UTC | #9
On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > 
> > > ->queued is increased only and not decreased just for debug purpose so
> > > ->far, so
> > > it can't be relied for this purpose.
> > 
> > Thanks. I overlooked that that it is only incremental counter.
> > 
> > > 
> > > One approach is to add one similar counter, and maintain it by
> > scheduler's
> > > insert/dispatch callback.
> > 
> > I tried below  and I see performance is on expected range.
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index fdcc2c1..ea201d0 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> > bool at_head,
> > 
> >                  list_add(&rq->queuelist, &list);
> >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > +               atomic_inc(&hctx->elevator_queued);
> >          } else {
> >                  spin_lock(&ctx->lock);
> >                  __blk_mq_insert_request(hctx, rq, at_head);
> > @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> > blk_mq_hw_ctx *hctx,
> >          percpu_ref_get(&q->q_usage_counter);
> > 
> >          e = hctx->queue->elevator;
> > -       if (e && e->type->ops.insert_requests)
> > +       if (e && e->type->ops.insert_requests) {
> >                  e->type->ops.insert_requests(hctx, list, false);
> > +               atomic_inc(&hctx->elevator_queued);
> > +       }
> >          else {
> >                  /*
> >                   * try to issue requests directly if the hw queue isn't
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > index 126021f..946b47a 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx)
> >   {
> >          struct elevator_queue *e = hctx->queue->elevator;
> > 
> > +       /* If current hctx has not queued any request, there is no need to
> > run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > +        * running specific hctx.
> > +        */
> > +       if (!atomic_read(&hctx->elevator_queued))
> > +               return false;
> > +
> >          if (e && e->type->ops.has_work)
> >                  return e->type->ops.has_work(hctx);
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f73a2f9..48f1824 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > 
> >          if (rq->rq_flags & RQF_ELVPRIV) {
> > -               if (e && e->type->ops.finish_request)
> > +               if (e && e->type->ops.finish_request) {
> >                          e->type->ops.finish_request(rq);
> > +                       atomic_dec(&hctx->elevator_queued);
> > +               }
> >                  if (rq->elv.icq) {
> >                          put_io_context(rq->elv.icq->ioc);
> >                          rq->elv.icq = NULL;
> > @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> > blk_mq_tag_set *set,
> >                  goto free_hctx;
> > 
> >          atomic_set(&hctx->nr_active, 0);
> > +       atomic_set(&hctx->elevator_queued, 0);
> >          if (node == NUMA_NO_NODE)
> >                  node = set->numa_node;
> >          hctx->numa_node = node;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 66711c7..ea1ddb1 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> >           * shared across request queues.
> >           */
> >          atomic_t                nr_active;
> > +       /**
> > +        * @elevator_queued: Number of queued requests on hctx.
> > +        */
> > +       atomic_t                elevator_queued;
> > 
> >          /** @cpuhp_online: List to store request if CPU is going to die */
> >          struct hlist_node       cpuhp_online;
> > 
> > 
> > 
> Would it make sense to move it into the elevator itself?

That was my initial suggestion: if the counter is maintained just for bfq &
mq-deadline, then we needn't pay the cost for the others.

Thanks,
Ming
Kashyap Desai June 23, 2020, 11:50 a.m. UTC | #10
>
> On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> > On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > >
> > > > ->queued is increased only and not decreased just for debug
> > > > ->purpose so far, so
> > > > it can't be relied for this purpose.
> > >
> > > Thanks. I overlooked that that it is only incremental counter.
> > >
> > > >
> > > > One approach is to add one similar counter, and maintain it by
> > > scheduler's
> > > > insert/dispatch callback.
> > >
> > > I tried below  and I see performance is on expected range.
> > >
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index
> > > fdcc2c1..ea201d0 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request
> > > *rq, bool at_head,
> > >
> > >                  list_add(&rq->queuelist, &list);
> > >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > > +               atomic_inc(&hctx->elevator_queued);
> > >          } else {
> > >                  spin_lock(&ctx->lock);
> > >                  __blk_mq_insert_request(hctx, rq, at_head);
> > > @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> > >          percpu_ref_get(&q->q_usage_counter);
> > >
> > >          e = hctx->queue->elevator;
> > > -       if (e && e->type->ops.insert_requests)
> > > +       if (e && e->type->ops.insert_requests) {
> > >                  e->type->ops.insert_requests(hctx, list, false);
> > > +               atomic_inc(&hctx->elevator_queued);
> > > +       }
> > >          else {
> > >                  /*
> > >                   * try to issue requests directly if the hw queue
> > >                   * try to issue requests directly if the hw queue isn't
> > > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > > index 126021f..946b47a 100644
> > > +++ b/block/blk-mq-sched.h
> > > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> > >   {
> > >          struct elevator_queue *e = hctx->queue->elevator;
> > >
> > > +       /* If the current hctx has no queued requests, there is no need to
> > > +        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
> > > +        * will handle running that specific hctx.
> > > +        */
> > > +       if (!atomic_read(&hctx->elevator_queued))
> > > +               return false;
> > > +
> > >          if (e && e->type->ops.has_work)
> > >                  return e->type->ops.has_work(hctx);
> > >
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index f73a2f9..48f1824 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> > >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > >
> > >          if (rq->rq_flags & RQF_ELVPRIV) {
> > > -               if (e && e->type->ops.finish_request)
> > > +               if (e && e->type->ops.finish_request) {
> > >                          e->type->ops.finish_request(rq);
> > > +                       atomic_dec(&hctx->elevator_queued);
> > > +               }
> > >                  if (rq->elv.icq) {
> > >                          put_io_context(rq->elv.icq->ioc);
> > >                          rq->elv.icq = NULL;
> > > @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
> > >                  goto free_hctx;
> > >
> > >          atomic_set(&hctx->nr_active, 0);
> > > +       atomic_set(&hctx->elevator_queued, 0);
> > >          if (node == NUMA_NO_NODE)
> > >                  node = set->numa_node;
> > >          hctx->numa_node = node;
> > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > > index 66711c7..ea1ddb1 100644
> > > --- a/include/linux/blk-mq.h
> > > +++ b/include/linux/blk-mq.h
> > > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> > >           * shared across request queues.
> > >           */
> > >          atomic_t                nr_active;
> > > +       /**
> > > +        * @elevator_queued: Number of queued requests on hctx.
> > > +        */
> > > +       atomic_t                elevator_queued;
> > >
> > >          /** @cpuhp_online: List to store request if CPU is going to die */
> > >          struct hlist_node       cpuhp_online;
> > >
> > >
> > >
> > Would it make sense to move it into the elevator itself?

I am not sure where exactly I should add this counter, since I need a
counter per hctx. Elevator private data is per request queue, not per hctx.
Please suggest.
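
(One possible answer, as a rough sketch rather than anything from the posted
series: a scheduler can keep its own per-hctx state via hctx->sched_data,
allocated from the elevator's ->init_hctx() callback the way kyber does, so
the counter need not live in struct blk_mq_hw_ctx at all. The dd_hctx_data,
dd_init_hctx and dd_exit_hctx names below are hypothetical; the snippet
assumes <linux/slab.h> and <linux/blk-mq.h>.)

struct dd_hctx_data {
        atomic_t        queued;         /* requests inserted on this hctx */
};

/* Sketch: allocate per-hctx scheduler data and hang it off
 * hctx->sched_data, the same mechanism kyber uses for its per-hctx state. */
static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
        struct dd_hctx_data *dhd;

        dhd = kzalloc_node(sizeof(*dhd), GFP_KERNEL, hctx->numa_node);
        if (!dhd)
                return -ENOMEM;

        atomic_set(&dhd->queued, 0);
        hctx->sched_data = dhd;
        return 0;
}

static void dd_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
        kfree(hctx->sched_data);
        hctx->sched_data = NULL;
}

With that, dd_insert_requests()/dd_finish_request() would inc/dec
dhd->queued through hctx->sched_data instead of adding a new field to
struct blk_mq_hw_ctx.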

>
> That is my initial suggestion: the counter is only maintained for bfq &
> mq-deadline, so we needn't pay the cost for the others.

I have updated the patch -

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4..3e0005c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

+       /* If the current hctx has no queued requests, there is no need to
+        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
+        * will handle running that specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
        /*
         * Avoiding lock: a race on bfqd->busy_queues should cause at
         * most a call to dispatch for nothing
@@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                bfq_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
 }

@@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct request *rq)

        if (likely(rq->rq_flags & RQF_STARTED)) {
                unsigned long flags;
+              struct blk_mq_hw_ctx *mq_hctx = rq->mq_hctx;

                spin_lock_irqsave(&bfqd->lock, flags);

@@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct request *rq)
                bfq_completed_request(bfqq, bfqd);
                bfq_finish_requeue_request_body(bfqq);

+              atomic_dec(&mq_hctx->elevator_queued);
                spin_unlock_irqrestore(&bfqd->lock, flags);
        } else {
                /*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If the current hctx has no queued requests, there is no need to
+        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
+        * will handle running that specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..82dd152 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2571,6 +2571,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+      atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e..703ac55 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                dd_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
        spin_unlock(&dd->lock);
 }
@@ -562,6 +563,7 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
        struct request_queue *q = rq->q;
+      struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (blk_queue_is_zoned(q)) {
                struct deadline_data *dd = q->elevator->elevator_data;
@@ -570,15 +572,23 @@ static void dd_finish_request(struct request *rq)
                spin_lock_irqsave(&dd->zone_lock, flags);
                blk_req_zone_write_unlock(rq);
                if (!list_empty(&dd->fifo_list[WRITE]))
-                       blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
+                       blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
+       atomic_dec(&hctx->elevator_queued);
 }

 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;

+       /* If the current hctx has no queued requests, there is no need to
+        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
+        * will handle running that specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        return !list_empty_careful(&dd->dispatch) ||
                !list_empty_careful(&dd->fifo_list[0]) ||
                !list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;


>
> Thanks,
> Ming
Kashyap Desai June 23, 2020, 12:11 p.m. UTC | #11
> > > >
> > > Would it make sense to move it into the elevator itself?
>
> I am not sure where exactly I should add this counter, since I need a
> counter per hctx. Elevator private data is per request queue, not per hctx.
> Please suggest.
>
> >
> > That is my initial suggestion: the counter is only maintained for bfq &
> > mq-deadline, so we needn't pay the cost for the others.
>
> I have updated the patch -
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index a1123d4..3e0005c 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>         struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>
> +       /* If the current hctx has no queued requests, there is no need to
> +        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
> +        * will handle running that specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
>         /*
>          * Avoiding lock: a race on bfqd->busy_queues should cause at
>          * most a call to dispatch for nothing
> @@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
>                 rq = list_first_entry(list, struct request, queuelist);
>                 list_del_init(&rq->queuelist);
>                 bfq_insert_request(hctx, rq, at_head);
> +              atomic_inc(&hctx->elevator_queued);
>         }
>  }
>
> @@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct request *rq)
>
>         if (likely(rq->rq_flags & RQF_STARTED)) {
>                 unsigned long flags;
> +              struct blk_mq_hw_ctx *mq_hctx = rq->mq_hctx;
>
>                 spin_lock_irqsave(&bfqd->lock, flags);
>
> @@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct request *rq)
>                 bfq_completed_request(bfqq, bfqd);
>                 bfq_finish_requeue_request_body(bfqq);
>
> +              atomic_dec(&mq_hctx->elevator_queued);
>                 spin_unlock_irqrestore(&bfqd->lock, flags);
>         } else {
>                 /*
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
>
> +       /* If the current hctx has no queued requests, there is no need to
> +        * run it. blk_mq_run_hw_queue() on an hctx which has queued IO
> +        * will handle running that specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +

I had missed this. I will remove the above code, since the check is now
handled within the mq-deadline and bfq-iosched *has_work* callbacks.

>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);
>
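
(For reference, a sketch of what blk_mq_sched_has_work() would look like once
the generic elevator_queued check is dropped, reconstructed from the hunk
context quoted above; the trailing return false is assumed from the existing
function, which the quoted hunk does not show in full.)

static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
{
        struct elevator_queue *e = hctx->queue->elevator;

        /* The "has this hctx queued anything" test is left to the
         * scheduler's own ->has_work(), i.e. dd_has_work()/bfq_has_work(). */
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

        return false;
}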