
blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

Message ID 1634114459-143003-1-git-send-email-john.garry@huawei.com (mailing list archive)
State New, archived

Commit Message

John Garry Oct. 13, 2021, 8:40 a.m. UTC
Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of #hw queues in
that case. Rather it should just iter once.

Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>

Comments

Ming Lei Oct. 13, 2021, 9:22 a.m. UTC | #1
On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
> Since it is now possible for a tagset to share a single set of tags, the
> iter function should not re-iter the tags for the count of #hw queues in
> that case. Rather it should just iter once.
> 
> Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 72a2724a4eee..c943b6529619 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
> -	int i;
> +	unsigned int flags = tagset->flags;
> +	int i, nr_tags;
> +
> +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>  
> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> +	for (i = 0; i < nr_tags; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>  					      BT_TAG_ITER_STARTED);

blk_mq_queue_tag_busy_iter() needn't such change?


Thanks,
Ming
John Garry Oct. 13, 2021, 10:01 a.m. UTC | #2
On 13/10/2021 10:22, Ming Lei wrote:
> On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
>> Since it is now possible for a tagset to share a single set of tags, the
>> iter function should not re-iter the tags for the count of #hw queues in
>> that case. Rather it should just iter once.
>>
>> Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
>> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 72a2724a4eee..c943b6529619 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>>   void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>   		busy_tag_iter_fn *fn, void *priv)
>>   {
>> -	int i;
>> +	unsigned int flags = tagset->flags;
>> +	int i, nr_tags;
>> +
>> +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>>   
>> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
>> +	for (i = 0; i < nr_tags; i++) {
>>   		if (tagset->tags && tagset->tags[i])
>>   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>   					      BT_TAG_ITER_STARTED);
> blk_mq_queue_tag_busy_iter() needn't such change?

I didn't think so.

blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. 
However in bt_iter(), we check rq->mq_hctx == hctx for calling the iter 
callback:

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	...

	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);

And this would only pass for the correct hctx which we're iter'ing for. 
Indeed, it would be nice not to iter excessive times, but I didn't see a 
straightforward way to change that.
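
For reference, the loop in question looks roughly like this (a simplified
sketch of blk_mq_queue_tag_busy_iter() from blk-mq-tag.c, quoted from memory
and trimmed, not verbatim; comments are mine):

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
		void *priv)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		struct blk_mq_tags *tags = hctx->tags;

		/*
		 * With a shared tagset every hctx points at the same 'tags',
		 * so the same bitmap is walked nr_hw_queues times; only the
		 * rq->mq_hctx == hctx check inside bt_iter() stops fn() from
		 * running more than once per request.
		 */
		if (tags->nr_reserved_tags)
			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
	}
	blk_queue_exit(q);
}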

There is also blk_mq_all_tag_iter():

void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
		void *priv)
{
	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
}

But then the only user is blk_mq_hctx_has_requests():

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tags *tags = hctx->sched_tags ?
			hctx->sched_tags : hctx->tags;
	struct rq_iter_data data = {
		.hctx	= hctx,
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}

But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:

static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	if (rq->mq_hctx != iter_data->hctx)
		return true;
	iter_data->has_rq = true;
	return false;
}

Thanks,
John
Ming Lei Oct. 13, 2021, 10:20 a.m. UTC | #3
On Wed, Oct 13, 2021 at 11:01:11AM +0100, John Garry wrote:
> On 13/10/2021 10:22, Ming Lei wrote:
> > On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
> > > Since it is now possible for a tagset to share a single set of tags, the
> > > iter function should not re-iter the tags for the count of #hw queues in
> > > that case. Rather it should just iter once.
> > > 
> > > Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
> > > Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
> > > Signed-off-by: John Garry <john.garry@huawei.com>
> > > 
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 72a2724a4eee..c943b6529619 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> > >   void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> > >   		busy_tag_iter_fn *fn, void *priv)
> > >   {
> > > -	int i;
> > > +	unsigned int flags = tagset->flags;
> > > +	int i, nr_tags;
> > > +
> > > +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
> > > -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> > > +	for (i = 0; i < nr_tags; i++) {
> > >   		if (tagset->tags && tagset->tags[i])
> > >   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> > >   					      BT_TAG_ITER_STARTED);
> > blk_mq_queue_tag_busy_iter() needn't such change?
> 
> I didn't think so.
> 
> blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
> in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
> 
> static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> {
> 	...
> 
> 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> 
> And this would only pass for the correct hctx which we're iter'ing for.

It is true for both shared and non-shared sbitmap since we don't share
hctx, so what does matter? With single shared tags, you can iterate over
all requests originated from all hw queues, right?

> Indeed, it would be nice not to iter excessive times, but I didn't see a
> straightforward way to change that.

In Kashyap's report, the lock contention is actually from
blk_mq_queue_tag_busy_iter(), see:

https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/

> 
> There is also blk_mq_all_tag_iter():
> 
> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> 		void *priv)
> {
> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
> }
> 
> But then the only user is blk_mq_hctx_has_requests():
> 
> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> {
> 	struct blk_mq_tags *tags = hctx->sched_tags ?
> 			hctx->sched_tags : hctx->tags;
> 	struct rq_iter_data data = {
> 		.hctx	= hctx,
> 	};
> 
> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> 	return data.has_rq;
> }

This above one only iterates over the specified hctx/tags, it won't be
affected.

> 
> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:

Not see what matters wrt. checking hctx.



Thanks,
Ming
John Garry Oct. 13, 2021, 11:11 a.m. UTC | #4
>>> blk_mq_queue_tag_busy_iter() needn't such change?
>> I didn't think so.
>>
>> blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
>> in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
>>
>> static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> {
>> 	...
>>
>> 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>> 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>>
>> And this would only pass for the correct hctx which we're iter'ing for.
> It is true for both shared and non-shared sbitmap since we don't share
> hctx, so what does matter?

It matters that we are doing the right thing for shared tags. My point 
is we iter but don't call the callback unless it's for the correct hctx.

As I see, this has not changed in transitioning from shared sbitmap to 
shared tags.

> With single shared tags, you can iterate over
> all requests originated from all hw queues, right?
> 
Right, for the same request queue, we should do that.

>> Indeed, it would be nice not to iter excessive times, but I didn't see a
>> straightforward way to change that.


> In Kashyap's report, the lock contention is actually from
> blk_mq_queue_tag_busy_iter(), see:
> 
> https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/
> 

As I understand, Kashyap mentioned no throughput regression with my 
series, but just higher cpu usage in blk_mq_find_and_get_req().

I'll see if I can see such a thing in my setup.

But could it be that since we only have a single set of requests per 
tagset, and not a set of requests per HW queue, there is more contention 
on the common set of requests in the refcount_inc_not_zero() call ***, 
below:

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
unsigned int bitnr)
{
	...

	rq = tags->rqs[bitnr];
	if (... || !refcount_inc_not_zero(&rq->ref)) ***
	...
}

But I wonder why this function is even called often...

>> There is also blk_mq_all_tag_iter():
>>
>> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>> 		void *priv)
>> {
>> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
>> }
>>
>> But then the only user is blk_mq_hctx_has_requests():
>>
>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct blk_mq_tags *tags = hctx->sched_tags ?
>> 			hctx->sched_tags : hctx->tags;
>> 	struct rq_iter_data data = {
>> 		.hctx	= hctx,
>> 	};
>>
>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>> 	return data.has_rq;
>> }
> This above one only iterates over the specified hctx/tags, it won't be
> affected.
> 
>> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
> Not see what matters wrt. checking hctx.

I'm just saying that something like the following would be broken for 
shared tags:

static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	iter_data->has_rq = true;
	return true;
}

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	struct rq_iter_data data = {
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}

As it ignores that we want to check for a specific hctx.

Thanks,
John
Ming Lei Oct. 13, 2021, 2:29 p.m. UTC | #5
On Wed, Oct 13, 2021 at 12:11:12PM +0100, John Garry wrote:
> > > > blk_mq_queue_tag_busy_iter() needn't such change?
> > > I didn't think so.
> > >
> > > blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
> > > in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
> > > 
> > > static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> > > {
> > > 	...
> > > 
> > > 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> > > 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> > > 
> > > And this would only pass for the correct hctx which we're iter'ing for.
> > It is true for both shared and non-shared sbitmap since we don't share
> > hctx, so what does matter?
> 
> It matters that we are doing the right thing for shared tags. My point is we
> iter but don't call the callback unless it's for the correct hctx.
> 
> As I see, this has not changed in transitioning from shared sbitmap to
> shared tags.
> 
> > With single shared tags, you can iterate over
> > all requests originated from all hw queues, right?
> > 
> Right, for the same request queue, we should do that.
> 
> > > Indeed, it would be nice not to iter excessive times, but I didn't see a
> > > straightforward way to change that.
> 
> 
> > In Kashyap's report, the lock contention is actually from
> > blk_mq_queue_tag_busy_iter(), see:
> > 
> > https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/
> > 
> 
> As I understand, Kashyap mentioned no throughput regression with my series,
> but just higher cpu usage in blk_mq_find_and_get_req().
> 
> I'll see if I can see such a thing in my setup.
> 
> But could it be that since we only have a single set of requests per
> tagset, and not a set of requests per HW queue, there is more contention on
> the common set of requests in the refcount_inc_not_zero() call ***, below:
> 
> static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> unsigned int bitnr)
> {
> 	...
> 
> 	rq = tags->rqs[bitnr];
> 	if (... || !refcount_inc_not_zero(&rq->ref)) ***
> 	...
> }

Kashyap's log shows that contention on tags->lock is increased, which is
likely caused by the nr_hw_queues iterations. blk_mq_find_and_get_req()
will be run nr_hw_queues times compared with pre-shared-sbitmap, since it
is done before checking rq->mq_hctx.
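
To illustrate the ordering, a rough sketch of bt_iter() (from memory, not
verbatim; comments are mine):

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	struct bt_iter_data *iter_data = data;
	struct blk_mq_hw_ctx *hctx = iter_data->hctx;
	struct blk_mq_tags *tags = hctx->tags;
	bool reserved = iter_data->reserved;
	struct request *rq;
	bool ret = true;

	if (!reserved)
		bitnr += tags->nr_reserved_tags;

	/*
	 * This takes tags->lock and bumps rq->ref for every set bit, once
	 * per hw queue walked, before the hctx check below gets a chance
	 * to reject the request.
	 */
	rq = blk_mq_find_and_get_req(tags, bitnr);
	if (!rq)
		return true;

	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
	blk_mq_put_rq_ref(rq);
	return ret;
}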

> 
> But I wonder why this function is even called often...
> 
> > > There is also blk_mq_all_tag_iter():
> > > 
> > > void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> > > 		void *priv)
> > > {
> > > 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
> > > }
> > > 
> > > But then the only user is blk_mq_hctx_has_requests():
> > > 
> > > static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> > > {
> > > 	struct blk_mq_tags *tags = hctx->sched_tags ?
> > > 			hctx->sched_tags : hctx->tags;
> > > 	struct rq_iter_data data = {
> > > 		.hctx	= hctx,
> > > 	};
> > > 
> > > 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> > > 	return data.has_rq;
> > > }
> > This above one only iterates over the specified hctx/tags, it won't be
> > affected.
> > 
> > > But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
> > Not see what matters wrt. checking hctx.
> 
> I'm just saying that something like the following would be broken for shared
> tags:
> 
> static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> {
> 	struct rq_iter_data *iter_data = data;
> 
> 	iter_data->has_rq = true;
> 	return true;
> }
> 
> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> {
> 	struct rq_iter_data data = {
> 	};
> 
> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> 	return data.has_rq;
> }
> 
> As it ignores that we want to check for a specific hctx.

No, that isn't what I meant; the change I suggested follows:


diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..2a2ad6dfcc33 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (!rq)
 		return true;
 
-	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
-		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
+	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
+				blk_mq_is_shared_tags(hctx->flags)))
+		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
 	blk_mq_put_rq_ref(rq);
 	return ret;
 }
@@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+
+		if (blk_mq_is_shared_tags(hctx->flags))
+			break;
 	}
 	blk_queue_exit(q);
 }


Thanks,
Ming
John Garry Oct. 13, 2021, 3:13 p.m. UTC | #6
On 13/10/2021 15:29, Ming Lei wrote:
>> As I understand, Kashyap mentioned no throughput regression with my series,
>> but just higher cpu usage in blk_mq_find_and_get_req().
>>
>> I'll see if I can see such a thing in my setup.
>>
>> But could it be that since we only have a single set of requests per
>> tagset, and not a set of requests per HW queue, there is more contention on
>> the common set of requests in the refcount_inc_not_zero() call ***, below:
>>
>> static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>> unsigned int bitnr)
>> {
>> 	...
>>
>> 	rq = tags->rqs[bitnr];
>> 	if (... || !refcount_inc_not_zero(&rq->ref)) ***
>> 	...
>> }
> Kashyap's log shows that contention on tags->lock is increased, which is
> likely caused by the nr_hw_queues iterations.

If the lock contention increases on tags->lock then I am not totally 
surprised. For shared sbitmap, each HW queue had its own tags (and tags 
lock). Now with shared tags, we have a single lock over the tagset, and 
so we would have more contention. That's on the basis that we have many 
parallel callers to blk_mq_queue_tag_busy_iter().
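
blk_mq_find_and_get_req() takes that lock for every bit it visits; roughly
(from memory, not verbatim):

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
		unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	/* with shared tags this is one lock for the whole tagset */
	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}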

> blk_mq_find_and_get_req()
> will be run nr_hw_queues times compared with pre-shared-sbitmap, since it
> is done before checking rq->mq_hctx.

Isn't shared sbitmap older than blk_mq_find_and_get_req()?

Anyway, for 5.14 shared sbitmap support, we iter nr_hw_queue times. And 
now, for shared tags, we still do that. I don't see what's changed in 
that regard.

> 
>> But I wonder why this function is even called often...
>>
>>>> There is also blk_mq_all_tag_iter():
>>>>
>>>> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>>>> 		void *priv)
>>>> {
>>>> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
>>>> }
>>>>
>>>> But then the only user is blk_mq_hctx_has_requests():
>>>>
>>>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>>>> {
>>>> 	struct blk_mq_tags *tags = hctx->sched_tags ?
>>>> 			hctx->sched_tags : hctx->tags;
>>>> 	struct rq_iter_data data = {
>>>> 		.hctx	= hctx,
>>>> 	};
>>>>
>>>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>>>> 	return data.has_rq;
>>>> }
>>> This above one only iterates over the specified hctx/tags, it won't be
>>> affected.
>>>
>>>> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
>>> Not see what matters wrt. checking hctx.
>> I'm just saying that something like the following would be broken for shared
>> tags:
>>
>> static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
>> {
>> 	struct rq_iter_data *iter_data = data;
>>
>> 	iter_data->has_rq = true;
>> 	return true;
>> }
>>
>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct rq_iter_data data = {
>> 	};
>>
>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>> 	return data.has_rq;
>> }
>>
>> As it ignores that we want to check for a specific hctx.
> No, that isn't what I meant; the change I suggested follows:

I didn't mean that this was your suggestion. I am just saying that we 
need to be careful iter'ing tags for shared tags now, as in that example.

> 
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 72a2724a4eee..2a2ad6dfcc33 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	if (!rq)
>   		return true;
>   
> -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> -		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
> +				blk_mq_is_shared_tags(hctx->flags)))
> +		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
>   	blk_mq_put_rq_ref(rq);
>   	return ret;
>   }
> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>   		if (tags->nr_reserved_tags)
>   			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>   		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
> +
> +		if (blk_mq_is_shared_tags(hctx->flags))
> +			break;
>   	}
>   	blk_queue_exit(q);
>   }
> 

I suppose that is ok, and means that we iter once.

However, I have to ask, where is the big user of 
blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's mail:

 > 1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
 >       native_queued_spin_lock_slowpath
 >       ret_from_fork
 >       kthread
 >       worker_thread
 >       process_one_work
 >       blk_mq_timeout_work
 >       blk_mq_queue_tag_busy_iter
 >       bt_iter
 >       blk_mq_find_and_get_req
 >       _raw_spin_lock_irqsave
 >       native_queued_spin_lock_slowpath

How or why blk_mq_timeout_work()?

Thanks,
John
John Garry Oct. 18, 2021, 8:08 a.m. UTC | #7
On 13/10/2021 16:13, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 72a2724a4eee..2a2ad6dfcc33 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, 
>> unsigned int bitnr, void *data)
>>       if (!rq)
>>           return true;
>> -    if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>> -        ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>> +    if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
>> +                blk_mq_is_shared_tags(hctx->flags)))
>> +        ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
>>       blk_mq_put_rq_ref(rq);
>>       return ret;
>>   }
>> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct 
>> request_queue *q, busy_iter_fn *fn,
>>           if (tags->nr_reserved_tags)
>>               bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>>           bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>> +
>> +        if (blk_mq_is_shared_tags(hctx->flags))
>> +            break;
>>       }
>>       blk_queue_exit(q);
>>   }
>>
> 
> I suppose that is ok, and means that we iter once.
> 
> However, I have to ask, where is the big user of 
> blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's mail:
> 
>  > 1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
>  >       native_queued_spin_lock_slowpath
>  >       ret_from_fork
>  >       kthread
>  >       worker_thread
>  >       process_one_work
>  >       blk_mq_timeout_work
>  >       blk_mq_queue_tag_busy_iter
>  >       bt_iter
>  >       blk_mq_find_and_get_req
>  >       _raw_spin_lock_irqsave
>  >       native_queued_spin_lock_slowpath
> 
> How or why blk_mq_timeout_work()?

Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 
1x SATA HDD (that's all I have), and null blk with lots of devices, and 
I still can't see high usage of blk_mq_queue_tag_busy_iter().

So how about we get this patch processed (to fix 
blk_mq_tagset_busy_iter()), as it is independent of 
blk_mq_queue_tag_busy_iter()? And then wait for some update or some more 
info from Kashyap regarding blk_mq_queue_tag_busy_iter()

Thanks,
John
Ming Lei Oct. 18, 2021, 9:07 a.m. UTC | #8
On Mon, Oct 18, 2021 at 09:08:57AM +0100, John Garry wrote:
> On 13/10/2021 16:13, John Garry wrote:
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 72a2724a4eee..2a2ad6dfcc33 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > >       if (!rq)
> > >           return true;
> > > -    if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> > > -        ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> > > +    if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
> > > +                blk_mq_is_shared_tags(hctx->flags)))
> > > +        ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
> > >       blk_mq_put_rq_ref(rq);
> > >       return ret;
> > >   }
> > > @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct
> > > request_queue *q, busy_iter_fn *fn,
> > >           if (tags->nr_reserved_tags)
> > >               bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
> > >           bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
> > > +
> > > +        if (blk_mq_is_shared_tags(hctx->flags))
> > > +            break;
> > >       }
> > >       blk_queue_exit(q);
> > >   }
> > > 
> > 
> > I suppose that is ok, and means that we iter once.
> > 
> > However, I have to ask, where is the big user of
> > blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's
> > mail:
> > 
> >  > 1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
> >  >       native_queued_spin_lock_slowpath
> >  >       ret_from_fork
> >  >       kthread
> >  >       worker_thread
> >  >       process_one_work
> >  >       blk_mq_timeout_work
> >  >       blk_mq_queue_tag_busy_iter
> >  >       bt_iter
> >  >       blk_mq_find_and_get_req
> >  >       _raw_spin_lock_irqsave
> >  >       native_queued_spin_lock_slowpath
> > 
> > How or why blk_mq_timeout_work()?
> 
> Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 1x
> SATA HDD (that's all I have), and null blk with lots of devices, and I still
> can't see high usage of blk_mq_queue_tag_busy_iter().

It should be triggered easily in case of heavy io accounting:

while true; do cat /proc/diskstats; done
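
That path goes through blk_mq_in_flight(), roughly (from memory, not
verbatim; struct mq_inflight and blk_mq_check_inflight are the existing
in-flight accounting helpers in block/blk-mq.c):

unsigned int blk_mq_in_flight(struct request_queue *q,
		struct block_device *part)
{
	struct mq_inflight mi = { .part = part };

	/* one full walk of the busy tags per read of /proc/diskstats */
	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);

	return mi.inflight[0] + mi.inflight[1];
}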


> So how about we get this patch processed (to fix blk_mq_tagset_busy_iter()),
> as it is independent of blk_mq_queue_tag_busy_iter()? And then wait for some
> update or some more info from Kashyap regarding blk_mq_queue_tag_busy_iter()

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming
John Garry Oct. 18, 2021, 9:33 a.m. UTC | #9
On 18/10/2021 10:07, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 09:08:57AM +0100, John Garry wrote:
>> On 13/10/2021 16:13, John Garry wrote:
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index 72a2724a4eee..2a2ad6dfcc33 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap,
>>>> unsigned int bitnr, void *data)
>>>>        if (!rq)
>>>>            return true;
>>>> -    if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>>>> -        ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>> +    if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
>>>> +                blk_mq_is_shared_tags(hctx->flags)))
>>>> +        ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
>>>>        blk_mq_put_rq_ref(rq);
>>>>        return ret;
>>>>    }
>>>> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct
>>>> request_queue *q, busy_iter_fn *fn,
>>>>            if (tags->nr_reserved_tags)
>>>>                bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>>>>            bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>>>> +
>>>> +        if (blk_mq_is_shared_tags(hctx->flags))
>>>> +            break;
>>>>        }
>>>>        blk_queue_exit(q);
>>>>    }
>>>>
>>> I suppose that is ok, and means that we iter once.
>>>
>>> However, I have to ask, where is the big user of
>>> blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's
>>> mail:
>>>
>>>   > 1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
>>>   >       native_queued_spin_lock_slowpath
>>>   >       ret_from_fork
>>>   >       kthread
>>>   >       worker_thread
>>>   >       process_one_work
>>>   >       blk_mq_timeout_work
>>>   >       blk_mq_queue_tag_busy_iter
>>>   >       bt_iter
>>>   >       blk_mq_find_and_get_req
>>>   >       _raw_spin_lock_irqsave
>>>   >       native_queued_spin_lock_slowpath
>>>
>>> How or why blk_mq_timeout_work()?
>> Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 1x
>> SATA HDD (that's all I have), and null blk with lots of devices, and I still
>> can't see high usage of blk_mq_queue_tag_busy_iter().
> It should be triggered easily in case of heavy io accounting:
> 
> while true; do cat /proc/diskstats; done
> 

Let me check that.

> 
>> So how about we get this patch processed (to fix blk_mq_tagset_busy_iter()),
>> as it is independent of blk_mq_queue_tag_busy_iter()? And then wait for some
>> update or some more info from Kashyap regarding blk_mq_queue_tag_busy_iter()
> Looks fine:
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks, I'll just send a v2 with your tag for clarity, as there has been 
much discussion here.

John

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..c943b6529619 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -378,9 +378,12 @@  void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
-	int i;
+	unsigned int flags = tagset->flags;
+	int i, nr_tags;
+
+	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
 
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	for (i = 0; i < nr_tags; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
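
For reference, the blk_mq_is_shared_tags() check used above is the flag test
introduced earlier in the series (in block/blk-mq.h, quoted from memory):

/* true when all hw queues of the tagset share a single blk_mq_tags */
static inline bool blk_mq_is_shared_tags(unsigned int flags)
{
	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
}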