[RFT,0/3] blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags

Message ID: 1635852455-39935-1-git-send-email-john.garry@huawei.com

Message

John Garry Nov. 2, 2021, 11:27 a.m. UTC
In [0], Kashyap reports high CPU usage for blk_mq_queue_tag_busy_iter()
and callees for shared tags.

Indeed, blk_mq_queue_tag_busy_iter() becomes less optimal with the move
to shared tags, but it was not optimal previously either.
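
To illustrate the idea, a rough sketch only - simplified, with an
invented bt_tags_for_each() helper standing in for the real bitmap
walker, so not the literal diff: with shared tags every hw queue points
at the same tag bitmap, so the old per-hctx walk visits that one bitmap
nr_hw_queues times, while the series walks it once:

/*
 * Sketch of the patch 3/3 idea - NOT the actual kernel change;
 * bt_tags_for_each() is illustrative only.
 */
static void queue_tag_busy_iter(struct request_queue *q,
				busy_tag_iter_fn *fn, void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;

	if (blk_mq_is_shared_tags(q->tag_set->flags)) {
		/* All hctxs share one tags struct: walk it once. */
		bt_tags_for_each(q->tag_set->shared_tags, fn, priv);
		return;
	}

	/* Unshared case: walk each hctx's own bitmap, as before. */
	queue_for_each_hw_ctx(q, hctx, i)
		bt_tags_for_each(hctx->tags, fn, priv);
}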

So I would like this series tested, and also to know what is triggering
blk_mq_queue_tag_busy_iter() from userspace to cause such high CPU
loading.

As suggested by Ming, reading /proc/diskstats in a while true loop
can trigger blk_mq_queue_tag_busy_iter(); I do so in a test with 2x
separate consoles, and here are the results:

v5.15
blk_mq_queue_tag_busy_iter() 6.2%
part_stat_read_all() 6.7%

pre-v5.16 (Linus' master branch @ commit bfc484fe6abb)
blk_mq_queue_tag_busy_iter() 4.5%
part_stat_read_all() 6.2%

pre-v5.16 + this series
blk_mq_queue_tag_busy_iter() not shown in top users
part_stat_read_all() 7.5%

These results are from perf top, on a system with 7x disks attached via
hisi_sas, which has 16x HW queues.
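
For reference, this is the kind of loop run in each console - a minimal
sketch; a plain `while true; do cat /proc/diskstats > /dev/null; done`
shell loop works just as well, since each read of /proc/diskstats ends
up in part_stat_read_all() and, via the in-flight request accounting,
in blk_mq_queue_tag_busy_iter():

/* Minimal reproducer sketch: hammer /proc/diskstats in a loop.
 * Build: gcc -O2 -o diskstats_loop diskstats_loop.c
 * Run one instance per console.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[8192];

	for (;;) {
		int fd = open("/proc/diskstats", O_RDONLY);

		if (fd < 0) {
			perror("open /proc/diskstats");
			return 1;
		}
		/* Drain the whole file so the kernel emits stats for
		 * every disk on each pass. */
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
}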

[0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

John Garry (3):
  blk-mq: Drop busy_iter_fn blk_mq_hw_ctx argument
  blk-mq: Delete busy_iter_fn
  blk-mq: Optimise blk_mq_queue_tag_busy_iter() for shared tags

 block/blk-mq-tag.c     | 58 +++++++++++++++++++++++++++---------------
 block/blk-mq-tag.h     |  2 +-
 block/blk-mq.c         | 17 ++++++-------
 include/linux/blk-mq.h |  2 --
 4 files changed, 47 insertions(+), 32 deletions(-)

Comments

John Garry Nov. 15, 2021, 3:56 p.m. UTC | #1
On 02/11/2021 11:27, John Garry wrote:

Hi Kashyap,

Any chance you can try this series or give an update on the issue 
reported earlier?

Thanks to Ming for the reviews.

Cheers,
John

> In [0], Kashyap reports high CPU usage for blk_mq_queue_tag_busy_iter()
> and callees for shared tags.
> 
> [...]
Kashyap Desai Nov. 15, 2021, 4:31 p.m. UTC | #2
> Hi Kashyap,
>
> Any chance you can try this series or give an update on the issue reported
> earlier?

John -

I will try something this week and let you know.

Kashyap Desai Nov. 25, 2021, 1:46 p.m. UTC | #3
> > Hi Kashyap,
> >
> > Any chance you can try this series or give an update on the issue
> > reported earlier?
>
> John -
>
> I will try something this week and let you know.

John - I tested the patchset and it looks good. The issue reported in
the thread below is fixed:
https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

Here is the perf top data:

   5.70%  [megaraid_sas]                [k] complete_cmd_fusion
   3.80%  [megaraid_sas]                [k] megasas_build_and_issue_cmd_fusion
   2.75%  [kernel]                      [k] syscall_return_via_sysret
   2.68%  [kernel]                      [k] entry_SYSCALL_64
   2.22%  [kernel]                      [k] io_submit_one
   2.19%  [megaraid_sas]                [k] megasas_build_ldio_fusion
   1.95%  [kernel]                      [k] llist_add_batch
   1.80%  [kernel]                      [k] llist_reverse_order
   1.79%  [kernel]                      [k] scsi_complete
   1.73%  [kernel]                      [k] scsi_queue_rq
   1.66%  [kernel]                      [k] check_preemption_disabled
   1.37%  [megaraid_sas]                [k] megasas_queue_command
   1.16%  [kernel]                      [k] native_irq_return_iret
   1.11%  [kernel]                      [k] aio_complete_rw
   1.07%  [kernel]                      [k] read_tsc
   1.06%  fio                           [.] __fio_gettime
   1.05%  [kernel]                      [k] flush_smp_call_function_queue
   1.03%  [kernel]                      [k] blk_complete_reqs
   1.00%  [kernel]                      [k] blk_mq_free_request
   0.98%  [kernel]                      [k] sbitmap_get


I will continue testing and let you know how it goes.


Kashyap

John Garry Nov. 25, 2021, 2:09 p.m. UTC | #4
On 25/11/2021 13:46, Kashyap Desai wrote:
>> I will try something this week and let you know.
> John - I tested the patchset and it looks good. The issue reported in
> the thread below is fixed:
> https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/
> 
> [...]
> 
> I will continue testing and let you know how it goes.

OK, good to know, thanks. But I would still like to know what is
triggering blk_mq_queue_tag_busy_iter() so often. Indeed, as mentioned
in the cover letter, this function was hardly optimised for shared
sbitmap before.

And do you have any opinion on whether we would want this as a fix? The
information requested above would help explain why we would need it as
a fix.

Cheers,
John
Kashyap Desai Nov. 26, 2021, 11:25 a.m. UTC | #5
> >
> >
> > I will continue testing and let you know how it goes.
>
> OK, good to know, thanks. But I would still like to know what is
> triggering blk_mq_queue_tag_busy_iter() so often. Indeed, as mentioned
> in the cover letter, this function was hardly optimised for shared
> sbitmap before.

If I pass the "--disk_util=0" option to my fio run, the number of calls
to blk_mq_queue_tag_busy_iter() drops drastically.
As part of a fio run, the application performs disk-utilisation reads,
which is almost the same as doing "cat /proc/diskstats" in a loop.
Looking at the fio code, it reads diskstats every 250 msec. Here are
sample fio logs -

diskutil 87720 /sys/block/sdb/stat: stat read ok? 0
diskutil 87720 update io ticks
diskutil 87720 open stat file: /sys/block/sdb/stat
diskutil 87720 /sys/block/sdb/stat: 127853173 0 1022829056 241827073 0 0 0 0 255 984012 241827073 0 0 0 0 0 0

There is one more call trace, but I am not sure why it is getting
executed in my test. The below path does not execute so frequently, but
it consumes CPU (not noticeable on my setup):

kthread
        worker_thread
        process_one_work
        blk_mq_timeout_work
        blk_mq_queue_tag_busy_iter
        bt_iter
        blk_mq_find_and_get_req
        _raw_spin_lock_irqsave
        native_queued_spin_lock_slowpath
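
For reference, the contended leaf of that trace is roughly the
v5.15-era helper below - a sketch from the public sources, not a
verbatim quote, so details may differ. bt_iter() calls it for every set
tag bit, and the spinlock serialises all iterators of one tag set, so
concurrent walks of the same shared tags pile up in the slowpath:

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
					       unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	/* Contended when many CPUs walk the same shared tags at once. */
	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	/* Skip tags whose request is gone or already being freed. */
	if (!rq || !refcount_inc_not_zero(&rq->ref))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}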


This patch set improves the above call trace even after --disk_util=0
is set.

Kashyap

John Garry Nov. 26, 2021, 11:51 a.m. UTC | #6
On 26/11/2021 11:25, Kashyap Desai wrote:
> If I pass the "--disk_util=0" option to my fio run, the number of
> calls to blk_mq_queue_tag_busy_iter() drops drastically.
> 
> [...]
> 
> There is one more call trace, but I am not sure why it is getting
> executed in my test. The below path does not execute so frequently,
> but it consumes CPU (not noticeable on my setup):
> 
> kthread
>         worker_thread
>         process_one_work
>         blk_mq_timeout_work
>         blk_mq_queue_tag_busy_iter
>         bt_iter
>         blk_mq_find_and_get_req
>         _raw_spin_lock_irqsave
>         native_queued_spin_lock_slowpath

It would still be nice to know where this is coming from.

> This patch set improves the above call trace even after --disk_util=0
> is set.
OK, fine. Thanks for testing.

So I guess that this is a regression, and you would want this series for 
v5.16, right? My changes were made with v5.17 in mind.

I am not sure how Jens feels about it, since the changes are 
significant. It would be a lot easier to argue for v5.16 if we got to 
this point earlier in the cycle...

Anyway, it would be good to have full review first, so please help with 
that.

@Ming, can you please give feedback on 3/3 here?

BTW, I am on vacation next week and can't help progress then, so any 
assistance would be good.

Thanks,
John
John Garry Dec. 6, 2021, 9:57 a.m. UTC | #7
On 26/11/2021 11:51, John Garry wrote:
>> This patch set improves above call trace even after disk_util=0 is set.
> ok, fine. Thanks for testing.
> 
> So I guess that this is a regression, and you would want this series for 
> v5.16, right? My changes were made with v5.17 in mind.
> 
> I am not sure how Jens feels about it, since the changes are 
> significant. It would be a lot easier to argue for v5.16 if we got to 
> this point earlier in the cycle...
> 

Note: I will now resend for v5.17 and add your Tested-by; please let me
know if you are unhappy with that.

> Anyway, it would be good to have full review first, so please help with 
> that.
> 
> @Ming, can you please give feedback on 3/3 here?

Thanks and also to Hannes for the reviews.

John
Kashyap Desai Dec. 6, 2021, 10:03 a.m. UTC | #8
> Note: I will now resend for v5.17 and add your Tested-by; please let
> me know if you are unhappy with that.

John - v5.17 should be OK. I am doing additional testing and so far
there are no issues.

Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>

John Garry Dec. 8, 2021, 1:04 p.m. UTC | #9
On 06/12/2021 10:03, Kashyap Desai wrote:
>> Note: I will now resend for v5.17 and add your Tested-by; please let
>> me know if you are unhappy with that.
> John - v5.17 should be OK. I am doing additional testing and so far
> there are no issues.
> 
> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>

Hi Kashyap,

On a related topic: in an earlier discussion you mentioned another
possible regression, seen around v5.11 or v5.12 and unrelated to the
shared tags transition:
https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/

Did you come to any conclusion on that?

Thanks,
John