Message ID | 20211221123157.14052-1-kashyap.desai@broadcom.com (mailing list archive)
---|---
State | New, archived
Series | [RFT] blk-mq: optimize queue tag busy iter for shared_tags
On 21/12/2021 12:31, Kashyap Desai wrote:

Hi Kashyap,

What kernel is this for? 5.17 or 5.16 + stable? Your intention is not
clear to me.

> In [0], CPU usage for blk_mq_queue_tag_busy_iter() was optimized, but
> there are still periodic calls of blk_mq_queue_tag_busy_iter() from the
> context below. This context is the block layer timer, which looks for
> potentially expired commands (per request queue) and requires a tag
> iteration roughly every 5 seconds (BLK_MAX_TIMEOUT) for each request
> queue.
>
> kthread
>  worker_thread
>   process_one_work
>    blk_mq_timeout_work
>     blk_mq_queue_tag_busy_iter
>      bt_iter
>       blk_mq_find_and_get_req
>        _raw_spin_lock_irqsave
>         native_queued_spin_lock_slowpath
>
> This patch avoids the extra tag iterations in the shared_tags case. A
> single iteration of the shared tags gives the iterator function the
> expected results.
>
> Setup - AMD64 Gen-4.0 server.
> 64 virtual drives created using 16 NVMe drives + mpi3mr driver (in
> shared_tags mode)
>
> Test command -
> fio 64.fio --rw=randread --bs=4K --iodepth=32 --numjobs=2 --ioscheduler=mq-deadline --disk_util=0
>
> Without this patch on 5.16.0-rc5, the mpi3mr driver in shared_tags mode
> gives 4.0M IOPS vs. the expected ~6.0M.
>
> Snippet of perf top
>
>  25.42%  [kernel]  [k] native_queued_spin_lock_slowpath
>   3.95%  [kernel]  [k] cpupri_set
>   2.05%  [kernel]  [k] __blk_mq_get_driver_tag
>   1.67%  [kernel]  [k] __rcu_read_unlock
>   1.63%  [kernel]  [k] check_preemption_disabled
>
> After applying this patch on 5.16.0-rc5, the mpi3mr driver in shared_tags
> mode reaches up to 5.8M IOPS.
>
> Snippet of perf top
>
>   7.95%  [kernel]  [k] native_queued_spin_lock_slowpath
>   5.61%  [kernel]  [k] cpupri_set
>   2.98%  [kernel]  [k] acpi_processor_ffh_cstate_enter
>   2.49%  [kernel]  [k] read_tsc
>   2.15%  [kernel]  [k] check_preemption_disabled
>
> [0] https://lore.kernel.org/all/9b092ca49e9b5415772cd950a3c12584@mail.gmail.com/
>
> Cc: linux-block@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: john.garry@huawei.com
> Cc: ming.lei@redhat.com
> Cc: sathya.prakash@broadcom.com
> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> ---
>  block/blk-mq-tag.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 995336abee33..3e0a8e79f966 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -253,7 +253,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	if (!rq)
>  		return true;
>
> -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
> +	    blk_mq_is_shared_tags(hctx->flags)))
>  		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	blk_mq_put_rq_ref(rq);
>  	return ret;
> @@ -484,6 +485,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		if (tags->nr_reserved_tags)
>  			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>  		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
> +
> +		/* In case of shared bitmap if shared_tags is allocated, it is not required
> +		 * to iterate all the hctx. Looping one hctx is good enough.
> +		 */
> +		if (blk_mq_is_shared_tags(hctx->flags)) {
> +			blk_queue_exit(q);
> +			return;

this looks like v5.16-rc6 code

> +		}
>  	}
>  	blk_queue_exit(q);
>  }

Thanks,
John
> On 21/12/2021 12:31, Kashyap Desai wrote:
>
> Hi Kashyap,
>
> What kernel is this for? 5.17 or 5.16 + stable? Your intention is not
> clear to me.

Hi John,

This is for current/5.17. This patch is meaningful only on top of [1].

[1] "blk-mq: Use shared tags for shared sbitmap support"
Commit - e155b0c238b20f0a866f4334d292656665836c8a

While doing additional testing for [1], I noticed a performance issue.
Along with the performance issue, I noticed a CPU lockup as well.

Lockup trace -

 _raw_spin_lock_irqsave+0x42/0x50
 blk_mq_find_and_get_req+0x20/0xa0
 bt_iter+0x2d/0x80
 blk_mq_queue_tag_busy_iter+0x1aa/0x2f0
 ? blk_mq_complete_request+0x30/0x30
 ? blk_mq_complete_request+0x30/0x30
 ? __schedule+0x360/0x850
 blk_mq_timeout_work+0x5e/0x120
 process_one_work+0x1a8/0x380
 worker_thread+0x30/0x380
 ? wq_calc_node_cpumask.isra.30+0x100/0x100
 kthread+0x167/0x190
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30

It is a generic performance issue if a driver uses "shost->host_tagset = 1".
In fact, I found that [1] is useful to fix the performance issue and
provided this additional patch.

I changed my setup to have 64 scsi_devices (earlier I kept only 16 or 24
drives, so I did not notice this issue). The performance/CPU lockup issue
is not due to [1]. More scsi_devices, more hardware contexts per host and
a higher queue depth all increase the chances of a lockup and a
performance drop.

Do you think it is good to have these changes in 5.16 + stable?
I don't know if this patch will create any side effects. Can you review
and let me know your feedback?

Kashyap
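The host-wide tag mode referenced above is something the SCSI LLD opts
into at probe time. Below is a rough, illustrative sketch of that, not
code from mpi3mr or megaraid_sas: the function name, queue-depth value and
hardware-queue policy are made up for illustration, while shost->host_tagset,
shost->nr_hw_queues and shost->can_queue are the real Scsi_Host fields
involved.

#include <linux/cpumask.h>
#include <scsi/scsi_host.h>

/* Illustrative values only; not the real mpi3mr/megaraid_sas settings. */
static void example_enable_host_tagset(struct Scsi_Host *shost)
{
	/* One hardware queue per online CPU (placeholder policy). */
	shost->nr_hw_queues = num_online_cpus();

	/* A single host-wide queue depth shared by all hardware queues. */
	shost->can_queue = 1024;

	/*
	 * Ask the SCSI midlayer to build one shared tag space for all
	 * hardware queues; the midlayer then marks the tag set as
	 * host-shared, which is what blk_mq_is_shared_tags() checks in
	 * the patch under discussion.
	 */
	shost->host_tagset = 1;
}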
Hi Kashyap,

> This is for current/5.17. This patch is meaningful only on top of [1].
>
> [1] "blk-mq: Use shared tags for shared sbitmap support"
> Commit - e155b0c238b20f0a866f4334d292656665836c8a

But your change seems effectively the same as in
https://lore.kernel.org/all/1638794990-137490-4-git-send-email-john.garry@huawei.com/,
which is now merged in Jens' 5.17 queue:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.17/block&id=fea9f92f1748083cb82049ed503be30c3d3a9b69

> While doing additional testing for [1], I noticed a performance issue.
> Along with the performance issue, I noticed a CPU lockup as well.
>
> Lockup trace -
>
>  _raw_spin_lock_irqsave+0x42/0x50
>  blk_mq_find_and_get_req+0x20/0xa0
>  bt_iter+0x2d/0x80
>  blk_mq_queue_tag_busy_iter+0x1aa/0x2f0
>  ? blk_mq_complete_request+0x30/0x30
>  ? blk_mq_complete_request+0x30/0x30
>  ? __schedule+0x360/0x850
>  blk_mq_timeout_work+0x5e/0x120
>  process_one_work+0x1a8/0x380
>  worker_thread+0x30/0x380
>  ? wq_calc_node_cpumask.isra.30+0x100/0x100
>  kthread+0x167/0x190
>  ? set_kthread_struct+0x40/0x40
>  ret_from_fork+0x22/0x30
>
> It is a generic performance issue if a driver uses "shost->host_tagset = 1".
> In fact, I found that [1] is useful to fix the performance issue and
> provided this additional patch.
>
> I changed my setup to have 64 scsi_devices (earlier I kept only 16 or 24
> drives, so I did not notice this issue). The performance/CPU lockup issue
> is not due to [1]. More scsi_devices, more hardware contexts per host and
> a higher queue depth all increase the chances of a lockup and a
> performance drop.
>
> Do you think it is good to have these changes in 5.16 + stable?
> I don't know if this patch will create any side effects. Can you review
> and let me know your feedback?

Can you test my merged change again for this scenario?

I will also note that I mentioned previously that
blk_mq_queue_tag_busy_iter() was not optimum for shared sbitmap, i.e.
before shared tags, but no one said performance was bad for shared
sbitmap.

Thanks,
John
> But your change seems effectively the same as in
> https://lore.kernel.org/all/1638794990-137490-4-git-send-email-john.garry@huawei.com/,
> which is now merged in Jens' 5.17 queue:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.17/block&id=fea9f92f1748083cb82049ed503be30c3d3a9b69

John -

Yes, the above is the same change I was looking for. I made a very basic
mistake: I applied your above commit while doing megaraid_sas testing,
but when I moved to mpi3mr testing I did not apply your patch set.

We can drop the request for this RFT since I tested the above series and
it serves the same purpose.

Kashyap
On 22/12/2021 11:20, Kashyap Desai wrote:

> Yes, the above is the same change I was looking for. I made a very basic
> mistake: I applied your above commit while doing megaraid_sas testing,
> but when I moved to mpi3mr testing I did not apply your patch set.

But I did not think that my patch would help mpi3mr, since it does not
use host_tagset.

> We can drop the request for this RFT since I tested the above series and
> it serves the same purpose.

ok, fine.

And just to confirm, do you now think that we need to fix any older
kernel with some backport of my changes? I think that we would just need
to consider 5.16 (when it becomes stable), 5.15, and 5.10.

Thanks,
John
> On 22/12/2021 11:20, Kashyap Desai wrote:
>
> > Yes, the above is the same change I was looking for. I made a very basic
> > mistake: I applied your above commit while doing megaraid_sas testing,
> > but when I moved to mpi3mr testing I did not apply your patch set.
>
> But I did not think that my patch would help mpi3mr, since it does not
> use host_tagset.

Internally we are running performance checks for mpi3mr. We want to make
sure performance is not impacted by shared host_tagset before we enable
the feature.

> > We can drop the request for this RFT since I tested the above series and
> > it serves the same purpose.
>
> ok, fine.
>
> And just to confirm, do you now think that we need to fix any older
> kernel with some backport of my changes? I think that we would just need
> to consider 5.16 (when it becomes stable), 5.15, and 5.10.

It needs the full patch set (below) + associated fixes.

[1] "blk-mq: Use shared tags for shared sbitmap support"
Commit - e155b0c238b20f0a866f4334d292656665836c8a

The commit below cannot go to stable without [1].
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.17/block&id=fea9f92f1748083cb82049ed503be30c3d3a9b69

I am not sure whether the stable requirements fit in this case. I mean,
is a large patch set OK?

Kashyap

> Thanks,
> John
On 22/12/2021 12:06, Kashyap Desai wrote:

>> But I did not think that my patch would help mpi3mr, since it does not
>> use host_tagset.
>
> Internally we are running performance checks for mpi3mr. We want to make
> sure performance is not impacted by shared host_tagset before we enable
> the feature.

Hmmm... I thought that you said previously that it was not necessary for
this HW. But I am not familiar with the driver or HW.

As an aside, I assume that this driver uses the "none" IO scheduler by
default, but drivers using host_tagset use mq-deadline; I'm not sure if
that is what you want.

>>> We can drop the request for this RFT since I tested the above series
>>> and it serves the same purpose.
>>
>> ok, fine.
>>
>> And just to confirm, do you now think that we need to fix any older
>> kernel with some backport of my changes? I think that we would just
>> need to consider 5.16 (when it becomes stable), 5.15, and 5.10.
>
> It needs the full patch set (below) + associated fixes.
>
> [1] "blk-mq: Use shared tags for shared sbitmap support"
> Commit - e155b0c238b20f0a866f4334d292656665836c8a
>
> The commit below cannot go to stable without [1].
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.17/block&id=fea9f92f1748083cb82049ed503be30c3d3a9b69
>
> I am not sure whether the stable requirements fit in this case. I mean,
> is a large patch set OK?

The two other patches in the series wouldn't need to be backported, so it
should be possible. You would just need a very good reason, though. And
we would need to know whether 5.10 and 5.15 are required - they use
shared sbitmap.

As I mentioned a few times, contention in blk_mq_queue_tag_busy_iter()
and callees would not be so high there, as blk_mq_tags.lock and
blk_mq_tags.rqs are not shared.

Thanks,
John
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 995336abee33..3e0a8e79f966 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -253,7 +253,8 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (!rq)
 		return true;
 
-	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
+	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
+	    blk_mq_is_shared_tags(hctx->flags)))
 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
 	blk_mq_put_rq_ref(rq);
 	return ret;
@@ -484,6 +485,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+
+		/* In case of shared bitmap if shared_tags is allocated, it is not required
+		 * to iterate all the hctx. Looping one hctx is good enough.
+		 */
+		if (blk_mq_is_shared_tags(hctx->flags)) {
+			blk_queue_exit(q);
+			return;
+		}
 	}
 	blk_queue_exit(q);
 }
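To see what a single pass buys, here is a hedged sketch of how a caller
drives the iterator the hunks above modify. The counting callback and
wrapper (example_count_busy, example_queue_busy) are hypothetical;
blk_mq_queue_tag_busy_iter() and the callback signature are taken from
the diff, and the iterator's declaration is internal to the block layer
(block/blk-mq-tag.h at the time), so a sketch like this only builds
inside block/.

#include <linux/blk-mq.h>

#include "blk-mq-tag.h"	/* private block/ header declaring the iterator */

/*
 * Hypothetical callback with the signature bt_iter() invokes above:
 * (hctx, rq, priv, reserved) -> bool, where returning true continues
 * the iteration.  It just counts busy requests.
 */
static bool example_count_busy(struct blk_mq_hw_ctx *hctx,
			       struct request *rq, void *priv, bool reserved)
{
	unsigned int *count = priv;

	(*count)++;
	return true;
}

static unsigned int example_queue_busy(struct request_queue *q)
{
	unsigned int count = 0;

	/*
	 * Without the early return added above, a shared-tags queue walks the
	 * same shared bitmap once per hctx and relies on the rq->mq_hctx
	 * filter in bt_iter() to avoid double counting.  With it, the first
	 * hctx pass visits every busy request of the queue and the loop
	 * exits, so the bitmap (and the lock taken in
	 * blk_mq_find_and_get_req()) is touched once per queue instead of
	 * once per hctx.
	 */
	blk_mq_queue_tag_busy_iter(q, example_count_busy, &count);
	return count;
}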
In [0], CPU usage for blk_mq_queue_tag_busy_iter() was optimized, but
there are still periodic calls of blk_mq_queue_tag_busy_iter() from the
context below. This context is the block layer timer, which looks for
potentially expired commands (per request queue) and requires a tag
iteration roughly every 5 seconds (BLK_MAX_TIMEOUT) for each request
queue.

kthread
 worker_thread
  process_one_work
   blk_mq_timeout_work
    blk_mq_queue_tag_busy_iter
     bt_iter
      blk_mq_find_and_get_req
       _raw_spin_lock_irqsave
        native_queued_spin_lock_slowpath

This patch avoids the extra tag iterations in the shared_tags case. A
single iteration of the shared tags gives the iterator function the
expected results.

Setup - AMD64 Gen-4.0 server.
64 virtual drives created using 16 NVMe drives + mpi3mr driver (in
shared_tags mode)

Test command -
fio 64.fio --rw=randread --bs=4K --iodepth=32 --numjobs=2 --ioscheduler=mq-deadline --disk_util=0

Without this patch on 5.16.0-rc5, the mpi3mr driver in shared_tags mode
gives 4.0M IOPS vs. the expected ~6.0M.

Snippet of perf top

 25.42%  [kernel]  [k] native_queued_spin_lock_slowpath
  3.95%  [kernel]  [k] cpupri_set
  2.05%  [kernel]  [k] __blk_mq_get_driver_tag
  1.67%  [kernel]  [k] __rcu_read_unlock
  1.63%  [kernel]  [k] check_preemption_disabled

After applying this patch on 5.16.0-rc5, the mpi3mr driver in shared_tags
mode reaches up to 5.8M IOPS.

Snippet of perf top

  7.95%  [kernel]  [k] native_queued_spin_lock_slowpath
  5.61%  [kernel]  [k] cpupri_set
  2.98%  [kernel]  [k] acpi_processor_ffh_cstate_enter
  2.49%  [kernel]  [k] read_tsc
  2.15%  [kernel]  [k] check_preemption_disabled

[0] https://lore.kernel.org/all/9b092ca49e9b5415772cd950a3c12584@mail.gmail.com/

Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: john.garry@huawei.com
Cc: ming.lei@redhat.com
Cc: sathya.prakash@broadcom.com
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
---
 block/blk-mq-tag.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
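The periodic path in the call chain above can be pictured with the
following simplified sketch. It is not the actual blk-mq source: the
names example_timeout_work and example_check_expired are stand-ins, the
expiry callback is a stub (the real one compares request deadlines with
jiffies), and BLK_MAX_TIMEOUT plus the iterator declaration come from
the block layer's internal headers (block/blk.h, block/blk-mq-tag.h).

#include <linux/blk-mq.h>
#include <linux/jiffies.h>
#include <linux/timer.h>
#include <linux/workqueue.h>

#include "blk.h"		/* BLK_MAX_TIMEOUT (5 * HZ) */
#include "blk-mq-tag.h"		/* blk_mq_queue_tag_busy_iter() */

/* Stub for the real expiry check; it only asks for another poll later. */
static bool example_check_expired(struct blk_mq_hw_ctx *hctx,
				  struct request *rq, void *priv, bool reserved)
{
	unsigned long *next = priv;

	*next = jiffies + BLK_MAX_TIMEOUT;
	return true;
}

/*
 * Simplified shape of the timeout work: walk every busy tag of the queue
 * looking for expired requests, then re-arm the timeout timer, so the walk
 * repeats roughly every BLK_MAX_TIMEOUT while requests are in flight.  With
 * many request queues, this is the walk that shows up as
 * native_queued_spin_lock_slowpath in the perf profiles above.
 */
static void example_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);
	unsigned long next = 0;

	blk_mq_queue_tag_busy_iter(q, example_check_expired, &next);

	if (next)
		mod_timer(&q->timeout, next);
}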