Message ID | 1634114459-143003-1-git-send-email-john.garry@huawei.com
---|---
State | New, archived
Series | blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags
On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
> Since it is now possible for a tagset to share a single set of tags, the
> iter function should not re-iter the tags for the count of #hw queues in
> that case. Rather it should just iter once.
>
> Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 72a2724a4eee..c943b6529619 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
> -	int i;
> +	unsigned int flags = tagset->flags;
> +	int i, nr_tags;
> +
> +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>
> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> +	for (i = 0; i < nr_tags; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>  					      BT_TAG_ITER_STARTED);

blk_mq_queue_tag_busy_iter() needn't such change?

Thanks,
Ming
On 13/10/2021 10:22, Ming Lei wrote:
> On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
>> Since it is now possible for a tagset to share a single set of tags, the
>> iter function should not re-iter the tags for the count of #hw queues in
>> that case. Rather it should just iter once.
>>
>> Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
>> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 72a2724a4eee..c943b6529619 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>>  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  		busy_tag_iter_fn *fn, void *priv)
>>  {
>> -	int i;
>> +	unsigned int flags = tagset->flags;
>> +	int i, nr_tags;
>> +
>> +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
>>
>> -	for (i = 0; i < tagset->nr_hw_queues; i++) {
>> +	for (i = 0; i < nr_tags; i++) {
>>  		if (tagset->tags && tagset->tags[i])
>>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>  					      BT_TAG_ITER_STARTED);
> blk_mq_queue_tag_busy_iter() needn't such change?

I didn't think so.

blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
	...

	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);

And this would only pass for the correct hctx which we're iter'ing for.

Indeed, it would be nice not to iter excessive times, but I didn't see a
straightforward way to change that.

There is also blk_mq_all_tag_iter():

void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
		void *priv)
{
	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
}

But then the only user is blk_mq_hctx_has_requests():

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tags *tags = hctx->sched_tags ?
			hctx->sched_tags : hctx->tags;
	struct rq_iter_data data = {
		.hctx = hctx,
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}

But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:

static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	if (rq->mq_hctx != iter_data->hctx)
		return true;
	iter_data->has_rq = true;
	return false;
}

Thanks,
John
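The point above can be seen in a few lines of standalone C. The sketch below uses invented names (toy_request, has_request_checked, and so on) rather than the kernel's: with one set of tags shared by every hw queue, a per-hctx query stays correct only because the callback filters on the request's hctx; a callback without that filter reports requests that belong to other hw queues.

/*
 * Standalone toy model -- invented names, not kernel code.
 * Shows why the hctx check in the callback matters once all hw queues
 * share one set of tags.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_TAGS 8

struct toy_request {
	bool busy;
	int mq_hctx;	/* hw queue the request was started on */
};

/* one set of tags shared by every hw queue */
static struct toy_request shared_tags[NR_TAGS];

/* models blk_mq_hctx_has_requests(): the callback filters on the hctx */
static bool has_request_checked(int hctx)
{
	for (int i = 0; i < NR_TAGS; i++)
		if (shared_tags[i].busy && shared_tags[i].mq_hctx == hctx)
			return true;
	return false;
}

/* the shared-tags-unaware variant: no hctx filter in the callback */
static bool has_request_unchecked(int hctx)
{
	(void)hctx;
	for (int i = 0; i < NR_TAGS; i++)
		if (shared_tags[i].busy)
			return true;
	return false;
}

int main(void)
{
	/* only hw queue 0 has an in-flight request */
	shared_tags[3].busy = true;
	shared_tags[3].mq_hctx = 0;

	printf("hctx 1 with the check:    %d (correctly idle)\n",
	       has_request_checked(1));
	printf("hctx 1 without the check: %d (wrongly reported busy)\n",
	       has_request_unchecked(1));
	return 0;
}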
On Wed, Oct 13, 2021 at 11:01:11AM +0100, John Garry wrote:
> On 13/10/2021 10:22, Ming Lei wrote:
> > On Wed, Oct 13, 2021 at 04:40:59PM +0800, John Garry wrote:
> > > Since it is now possible for a tagset to share a single set of tags, the
> > > iter function should not re-iter the tags for the count of #hw queues in
> > > that case. Rather it should just iter once.
> > >
> > > Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
> > > Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
> > > Signed-off-by: John Garry <john.garry@huawei.com>
> > >
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 72a2724a4eee..c943b6529619 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> > >  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> > >  		busy_tag_iter_fn *fn, void *priv)
> > >  {
> > > -	int i;
> > > +	unsigned int flags = tagset->flags;
> > > +	int i, nr_tags;
> > > +
> > > +	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
> > >
> > > -	for (i = 0; i < tagset->nr_hw_queues; i++) {
> > > +	for (i = 0; i < nr_tags; i++) {
> > >  		if (tagset->tags && tagset->tags[i])
> > >  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> > >  					      BT_TAG_ITER_STARTED);
> > blk_mq_queue_tag_busy_iter() needn't such change?
>
> I didn't think so.
>
> blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
> in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
>
> static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> {
> 	...
>
> 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>
> And this would only pass for the correct hctx which we're iter'ing for.

It is true for both shared and non-shared sbitmap since we don't share
hctx, so what does matter? With single shared tags, you can iterate over
all requests originated from all hw queues, right?

> Indeed, it would be nice not to iter excessive times, but I didn't see a
> straightforward way to change that.

In Kashyap's report, the lock contention is actually from
blk_mq_queue_tag_busy_iter(), see:

https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/

>
> There is also blk_mq_all_tag_iter():
>
> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> 		void *priv)
> {
> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
> }
>
> But then the only user is blk_mq_hctx_has_requests():
>
> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> {
> 	struct blk_mq_tags *tags = hctx->sched_tags ?
> 			hctx->sched_tags : hctx->tags;
> 	struct rq_iter_data data = {
> 		.hctx = hctx,
> 	};
>
> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> 	return data.has_rq;
> }

This above one only iterates over the specified hctx/tags, it won't be
affected.

>
> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:

Not see what matters wrt. checking hctx.

Thanks,
Ming
>>> blk_mq_queue_tag_busy_iter() needn't such change?
>> I didn't think so.
>>
>> blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
>> in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
>>
>> static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> {
>> 	...
>>
>> 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>> 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>>
>> And this would only pass for the correct hctx which we're iter'ing for.
> It is true for both shared and non-shared sbitmap since we don't share
> hctx, so what does matter?

It matters that we are doing the right thing for shared tags. My point is we
iter but don't call the callback unless the correct hctx.

As I see, this has not changed in transitioning from shared sbitmap to
shared tags.

> With single shared tags, you can iterate over
> all requests originated from all hw queues, right?
>
Right, for the same request queue, we should do that.

>> Indeed, it would be nice not to iter excessive times, but I didn't see a
>> straightforward way to change that.
> In Kashyap's report, the lock contention is actually from
> blk_mq_queue_tag_busy_iter(), see:
>
> https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/
>
As I understand, Kashyap mentioned no throughput regression with my series,
but just higher cpu usage in blk_mq_find_and_get_req().

I'll see if I can see such a thing in my setup.

But could it be that since we only have a single sets of requests per
tagset, and not a set of requests per HW queue, there is more contention on
the common set of requests in the refcount_inc_not_zero() call ***, below:

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
		unsigned int bitnr)
{
	...

	rq = tags->rqs[bitnr];
	if (... || !refcount_inc_not_zero(&rq->ref)) ***
	...
}

But I wonder why this function is even called often...

>> There is also blk_mq_all_tag_iter():
>>
>> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>> 		void *priv)
>> {
>> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
>> }
>>
>> But then the only user is blk_mq_hctx_has_requests():
>>
>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct blk_mq_tags *tags = hctx->sched_tags ?
>> 			hctx->sched_tags : hctx->tags;
>> 	struct rq_iter_data data = {
>> 		.hctx = hctx,
>> 	};
>>
>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>> 	return data.has_rq;
>> }
> This above one only iterates over the specified hctx/tags, it won't be
> affected.
>
>> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
> Not see what matters wrt. checking hctx.

I'm just saying that something like the following would be broken for shared
tags:

static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	iter_data->has_rq = true;
	return true;
}

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	struct rq_iter_data data = {
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}

As it ignores that we want to check for a specific hctx.

Thanks,
John
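As a rough model of the cost John is wondering about here, the standalone sketch below (invented names, not kernel code) mimics what the thread describes: one bitmap walk per hw queue, with the reference grab done before the rq->mq_hctx check, the way blk_mq_find_and_get_req() is called from bt_iter(). With a single shared set of tags, every busy request's refcount is touched once per hw queue on each full iteration even though the callback only runs once per request.

/*
 * Standalone toy model -- invented names, not kernel code.
 * Mirrors the pattern above: the reference is taken before the
 * rq->mq_hctx check, and the whole bitmap is walked once per hw queue.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_HW_QUEUES	4
#define NR_TAGS		8

struct toy_request {
	atomic_int ref;		/* stands in for rq->ref */
	int mq_hctx;		/* stands in for rq->mq_hctx */
};

static struct toy_request shared_tags[NR_TAGS];	/* one set of tags for all hw queues */
static long ref_grabs;		/* reference grab attempts (all succeed here) */
static long callback_runs;	/* how often the per-request callback runs */

/* models blk_mq_find_and_get_req(): take a reference if the request is live */
static struct toy_request *find_and_get_req(int bitnr)
{
	struct toy_request *rq = &shared_tags[bitnr];
	int old = atomic_load(&rq->ref);

	ref_grabs++;
	do {
		if (old == 0)
			return NULL;
	} while (!atomic_compare_exchange_weak(&rq->ref, &old, old + 1));
	return rq;
}

/* models bt_iter(): reference first, filter on the hctx afterwards */
static void scan_tags_for_hctx(int hctx)
{
	for (int bitnr = 0; bitnr < NR_TAGS; bitnr++) {
		struct toy_request *rq = find_and_get_req(bitnr);

		if (!rq)
			continue;
		if (rq->mq_hctx == hctx)
			callback_runs++;	/* the iter callback would run here */
		atomic_fetch_sub(&rq->ref, 1);	/* models blk_mq_put_rq_ref() */
	}
}

int main(void)
{
	for (int i = 0; i < NR_TAGS; i++) {
		atomic_store(&shared_tags[i].ref, 1);	/* every tag in flight */
		shared_tags[i].mq_hctx = i % NR_HW_QUEUES;
	}

	/* models blk_mq_queue_tag_busy_iter(): one full pass per hw queue */
	for (int hctx = 0; hctx < NR_HW_QUEUES; hctx++)
		scan_tags_for_hctx(hctx);

	/* callback ran 8 times, but the shared requests were referenced 32 times */
	printf("callback runs: %ld, reference grabs: %ld (%d tags, %d passes)\n",
	       callback_runs, ref_grabs, NR_TAGS, NR_HW_QUEUES);
	return 0;
}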
On Wed, Oct 13, 2021 at 12:11:12PM +0100, John Garry wrote:
> > > > blk_mq_queue_tag_busy_iter() needn't such change?
> > > I didn't think so.
> > >
> > > blk_mq_queue_tag_busy_iter() will indeed re-iter the tags per hctx. However
> > > in bt_iter(), we check rq->mq_hctx == hctx for calling the iter callback:
> > >
> > > static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> > > {
> > > 	...
> > >
> > > 	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> > > 		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> > >
> > > And this would only pass for the correct hctx which we're iter'ing for.
> > It is true for both shared and non-shared sbitmap since we don't share
> > hctx, so what does matter?
>
> It matters that we are doing the right thing for shared tags. My point is we
> iter but don't call the callback unless the correct hctx.
>
> As I see, this has not changed in transitioning from shared sbitmap to
> shared tags.
>
> > With single shared tags, you can iterate over
> > all requests originated from all hw queues, right?
> >
> Right, for the same request queue, we should do that.
>
> > > Indeed, it would be nice not to iter excessive times, but I didn't see a
> > > straightforward way to change that.
> >
> > In Kashyap's report, the lock contention is actually from
> > blk_mq_queue_tag_busy_iter(), see:
> >
> > https://lore.kernel.org/linux-block/8867352d-2107-1f8a-0f1c-ef73450bf256@huawei.com/
> >
>
> As I understand, Kashyap mentioned no throughput regression with my series,
> but just higher cpu usage in blk_mq_find_and_get_req().
>
> I'll see if I can see such a thing in my setup.
>
> But could it be that since we only have a single sets of requests per
> tagset, and not a set of requests per HW queue, there is more contention on
> the common set of requests in the refcount_inc_not_zero() call ***, below:
>
> static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> 		unsigned int bitnr)
> {
> 	...
>
> 	rq = tags->rqs[bitnr];
> 	if (... || !refcount_inc_not_zero(&rq->ref)) ***
> 	...
> }

Kashyap's log shows that contention on tags->lock is increased, that
should be caused by nr_hw_queues iterating. blk_mq_find_and_get_req()
will be run nr_hw_queue times compared with pre-shared-sbitmap, since it
is done before checking rq->mq_hctx.

>
> But I wonder why this function is even called often...
>
> > > There is also blk_mq_all_tag_iter():
> > >
> > > void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> > > 		void *priv)
> > > {
> > > 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
> > > }
> > >
> > > But then the only user is blk_mq_hctx_has_requests():
> > >
> > > static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> > > {
> > > 	struct blk_mq_tags *tags = hctx->sched_tags ?
> > > 			hctx->sched_tags : hctx->tags;
> > > 	struct rq_iter_data data = {
> > > 		.hctx = hctx,
> > > 	};
> > >
> > > 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> > > 	return data.has_rq;
> > > }
> > This above one only iterates over the specified hctx/tags, it won't be
> > affected.
> >
> > > But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
> > Not see what matters wrt. checking hctx.
>
> I'm just saying that something like the following would be broken for shared
> tags:
>
> static bool blk_mq_has_request(struct request *rq, void *data, bool
> reserved)
> {
> 	struct rq_iter_data *iter_data = data;
>
> 	iter_data->has_rq = true;
> 	return true;
> }
>
> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> {
> 	struct rq_iter_data data = {
> 	};
>
> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> 	return data.has_rq;
> }
>
> As it ignores that we want to check for a specific hctx.

No, that isn't what I meant, follows the change I suggested:

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..2a2ad6dfcc33 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (!rq)
 		return true;
 
-	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
-		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
+	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
+				blk_mq_is_shared_tags(hctx->flags)))
+		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
 	blk_mq_put_rq_ref(rq);
 	return ret;
 }
@@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		if (tags->nr_reserved_tags)
 			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
 		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+
+		if (blk_mq_is_shared_tags(hctx->flags))
+			break;
 	}
 	blk_queue_exit(q);
 }

Thanks,
Ming
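A standalone sketch of the behaviour Ming's diff aims for (invented names, not the kernel code): when tags are shared, the callback accepts a request from any hw queue and the outer per-hw-queue loop stops after the first pass, so one walk visits each busy request exactly once instead of nr_hw_queues times.

/*
 * Standalone toy model -- invented names, not kernel code.
 * Mirrors the suggested change: accept any hctx when tags are shared and
 * break out of the per-hw-queue loop after the first pass.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_HW_QUEUES	4
#define NR_TAGS		8

struct toy_request {
	bool busy;
	int mq_hctx;
};

static struct toy_request shared_tags[NR_TAGS];
static bool tags_are_shared = true;	/* stands in for blk_mq_is_shared_tags() */
static int visits[NR_TAGS];		/* how often the callback saw each request */

/* models bt_iter() with the relaxed hctx check */
static void scan_pass(int iter_hctx)
{
	for (int bitnr = 0; bitnr < NR_TAGS; bitnr++) {
		struct toy_request *rq = &shared_tags[bitnr];

		if (!rq->busy)
			continue;
		if (rq->mq_hctx == iter_hctx || tags_are_shared)
			visits[bitnr]++;	/* callback runs, given rq->mq_hctx */
	}
}

int main(void)
{
	for (int i = 0; i < NR_TAGS; i++) {
		shared_tags[i].busy = true;
		shared_tags[i].mq_hctx = i % NR_HW_QUEUES;
	}

	/* models blk_mq_queue_tag_busy_iter() with the early break */
	for (int hctx = 0; hctx < NR_HW_QUEUES; hctx++) {
		scan_pass(hctx);
		if (tags_are_shared)
			break;
	}

	for (int i = 0; i < NR_TAGS; i++)
		printf("tag %d visited %d time(s)\n", i, visits[i]);	/* each exactly once */
	return 0;
}

Flipping tags_are_shared to false (keeping all four passes and the strict hctx filter) gives the same per-tag counts; only the amount of scanning differs, which is where the reported lock and refcount traffic comes from.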
On 13/10/2021 15:29, Ming Lei wrote:
>> As I understand, Kashyap mentioned no throughput regression with my series,
>> but just higher cpu usage in blk_mq_find_and_get_req().
>>
>> I'll see if I can see such a thing in my setup.
>>
>> But could it be that since we only have a single sets of requests per
>> tagset, and not a set of requests per HW queue, there is more contention on
>> the common set of requests in the refcount_inc_not_zero() call ***, below:
>>
>> static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>> 		unsigned int bitnr)
>> {
>> 	...
>>
>> 	rq = tags->rqs[bitnr];
>> 	if (... || !refcount_inc_not_zero(&rq->ref)) ***
>> 	...
>> }
> Kashyap's log shows that contention on tags->lock is increased, that
> should be caused by nr_hw_queues iterating.

If the lock contention increases on tags->lock then I am not totally
surprised.

For shared sbitmap, each HW queue had its own tags (and tags lock). Now
with shared tags, we have a single lock over the tagset, and so we would
have more contention. That's on the basis that we have many parallel
callers to blk_mq_queue_tag_busy_iter().

> blk_mq_find_and_get_req()
> will be run nr_hw_queue times compared with pre-shared-sbitmap, since it
> is done before checking rq->mq_hctx.

Isn't shared sbitmap older than blk_mq_find_and_get_req()?

Anyway, for 5.14 shared sbitmap support, we iter nr_hw_queue times. And
now, for shared tags, we still do that. I don't see what's changed in
that regard.

>
>> But I wonder why this function is even called often...
>>
>>>> There is also blk_mq_all_tag_iter():
>>>>
>>>> void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
>>>> 		void *priv)
>>>> {
>>>> 	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STATIC_RQS);
>>>> }
>>>>
>>>> But then the only user is blk_mq_hctx_has_requests():
>>>>
>>>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>>>> {
>>>> 	struct blk_mq_tags *tags = hctx->sched_tags ?
>>>> 			hctx->sched_tags : hctx->tags;
>>>> 	struct rq_iter_data data = {
>>>> 		.hctx = hctx,
>>>> 	};
>>>>
>>>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>>>> 	return data.has_rq;
>>>> }
>>> This above one only iterates over the specified hctx/tags, it won't be
>>> affected.
>>>
>>>> But, again like bt_iter(), blk_mq_has_request() will check the hctx matches:
>>> Not see what matters wrt. checking hctx.
>> I'm just saying that something like the following would be broken for shared
>> tags:
>>
>> static bool blk_mq_has_request(struct request *rq, void *data, bool
>> reserved)
>> {
>> 	struct rq_iter_data *iter_data = data;
>>
>> 	iter_data->has_rq = true;
>> 	return true;
>> }
>>
>> static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct rq_iter_data data = {
>> 	};
>>
>> 	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
>> 	return data.has_rq;
>> }
>>
>> As it ignores that we want to check for a specific hctx.
> No, that isn't what I meant, follows the change I suggested:

I didn't mean that this was your suggestion. I am just saying that we need
to be careful iter'ing tags for shared tags now, as in that example.

>
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 72a2724a4eee..2a2ad6dfcc33 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> 	if (!rq)
> 		return true;
>
> -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> -		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
> +				blk_mq_is_shared_tags(hctx->flags)))
> +		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
> 	blk_mq_put_rq_ref(rq);
> 	return ret;
> }
> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> 	if (tags->nr_reserved_tags)
> 		bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
> 	bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
> +
> +	if (blk_mq_is_shared_tags(hctx->flags))
> +		break;
> 	}
> 	blk_queue_exit(q);
> }

I suppose that is ok, and means that we iter once.

However, I have to ask, where is the big user of
blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's mail:

> 1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
>   native_queued_spin_lock_slowpath
>   ret_from_fork
>   kthread
>   worker_thread
>   process_one_work
>   blk_mq_timeout_work
>   blk_mq_queue_tag_busy_iter
>   bt_iter
>   blk_mq_find_and_get_req
>   _raw_spin_lock_irqsave
>   native_queued_spin_lock_slowpath

How or why blk_mq_timeout_work()?

Thanks,
john
On 13/10/2021 16:13, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 72a2724a4eee..2a2ad6dfcc33 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> 	if (!rq)
>> 		return true;
>> -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>> -		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>> +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
>> +				blk_mq_is_shared_tags(hctx->flags)))
>> +		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
>> 	blk_mq_put_rq_ref(rq);
>> 	return ret;
>> }
>> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>> 	if (tags->nr_reserved_tags)
>> 		bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>> 	bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>> +
>> +	if (blk_mq_is_shared_tags(hctx->flags))
>> +		break;
>> 	}
>> 	blk_queue_exit(q);
>> }
>>
>
> I suppose that is ok, and means that we iter once.
>
> However, I have to ask, where is the big user of
> blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's mail:
>
> > 1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
> >   native_queued_spin_lock_slowpath
> >   ret_from_fork
> >   kthread
> >   worker_thread
> >   process_one_work
> >   blk_mq_timeout_work
> >   blk_mq_queue_tag_busy_iter
> >   bt_iter
> >   blk_mq_find_and_get_req
> >   _raw_spin_lock_irqsave
> >   native_queued_spin_lock_slowpath
>
> How or why blk_mq_timeout_work()?

Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 1x
SATA HDD (that's all I have), and null blk with lots of devices, and I
still can't see high usage of blk_mq_queue_tag_busy_iter().

So how about we get this patch processed (to fix blk_mq_tagset_busy_iter()),
as it is independent of blk_mq_queue_tag_busy_iter()? And then wait for
some update or some more info from Kashyap regarding
blk_mq_queue_tag_busy_iter()

Thanks,
John
On Mon, Oct 18, 2021 at 09:08:57AM +0100, John Garry wrote:
> On 13/10/2021 16:13, John Garry wrote:
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 72a2724a4eee..2a2ad6dfcc33 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> > > 	if (!rq)
> > > 		return true;
> > > -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
> > > -		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
> > > +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
> > > +				blk_mq_is_shared_tags(hctx->flags)))
> > > +		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
> > > 	blk_mq_put_rq_ref(rq);
> > > 	return ret;
> > > }
> > > @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> > > 	if (tags->nr_reserved_tags)
> > > 		bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
> > > 	bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
> > > +
> > > +	if (blk_mq_is_shared_tags(hctx->flags))
> > > +		break;
> > > 	}
> > > 	blk_queue_exit(q);
> > > }
> > >
> >
> > I suppose that is ok, and means that we iter once.
> >
> > However, I have to ask, where is the big user of
> > blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's
> > mail:
> >
> > > 1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
> > >   native_queued_spin_lock_slowpath
> > >   ret_from_fork
> > >   kthread
> > >   worker_thread
> > >   process_one_work
> > >   blk_mq_timeout_work
> > >   blk_mq_queue_tag_busy_iter
> > >   bt_iter
> > >   blk_mq_find_and_get_req
> > >   _raw_spin_lock_irqsave
> > >   native_queued_spin_lock_slowpath
> >
> > How or why blk_mq_timeout_work()?
>
> Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 1x
> SATA HDD (that's all I have), and null blk with lots of devices, and I still
> can't see high usage of blk_mq_queue_tag_busy_iter().

It should be triggered easily in case of heavy io accounting:

	while true; do cat /proc/diskstats; done

> So how about we get this patch processed (to fix blk_mq_tagset_busy_iter()),
> as it is independent of blk_mq_queue_tag_busy_iter()? And then wait for some
> update or some more info from Kashyap regarding blk_mq_queue_tag_busy_iter()

Looks fine:

Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks,
Ming
On 18/10/2021 10:07, Ming Lei wrote:
> On Mon, Oct 18, 2021 at 09:08:57AM +0100, John Garry wrote:
>> On 13/10/2021 16:13, John Garry wrote:
>>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>>> index 72a2724a4eee..2a2ad6dfcc33 100644
>>>> --- a/block/blk-mq-tag.c
>>>> +++ b/block/blk-mq-tag.c
>>>> @@ -232,8 +232,9 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>> 	if (!rq)
>>>> 		return true;
>>>> -	if (rq->q == hctx->queue && rq->mq_hctx == hctx)
>>>> -		ret = iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>> +	if (rq->q == hctx->queue && (rq->mq_hctx == hctx ||
>>>> +				blk_mq_is_shared_tags(hctx->flags)))
>>>> +		ret = iter_data->fn(rq->mq_hctx, rq, iter_data->data, reserved);
>>>> 	blk_mq_put_rq_ref(rq);
>>>> 	return ret;
>>>> }
>>>> @@ -460,6 +461,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>>>> 	if (tags->nr_reserved_tags)
>>>> 		bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
>>>> 	bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
>>>> +
>>>> +	if (blk_mq_is_shared_tags(hctx->flags))
>>>> +		break;
>>>> 	}
>>>> 	blk_queue_exit(q);
>>>> }
>>>>
>>> I suppose that is ok, and means that we iter once.
>>>
>>> However, I have to ask, where is the big user of
>>> blk_mq_queue_tag_busy_iter() coming from? I saw this from Kashyap's
>>> mail:
>>>
>>> > 1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
>>> >   native_queued_spin_lock_slowpath
>>> >   ret_from_fork
>>> >   kthread
>>> >   worker_thread
>>> >   process_one_work
>>> >   blk_mq_timeout_work
>>> >   blk_mq_queue_tag_busy_iter
>>> >   bt_iter
>>> >   blk_mq_find_and_get_req
>>> >   _raw_spin_lock_irqsave
>>> >   native_queued_spin_lock_slowpath
>>>
>>> How or why blk_mq_timeout_work()?
>> Just some update: I tried hisi_sas with 10x SAS SSDs, megaraid sas with 1x
>> SATA HDD (that's all I have), and null blk with lots of devices, and I still
>> can't see high usage of blk_mq_queue_tag_busy_iter().
> It should be triggered easily in case of heavy io accounting:
>
> 	while true; do cat /proc/diskstats; done
>
Let me check that.

>
>> So how about we get this patch processed (to fix blk_mq_tagset_busy_iter()),
>> as it is independent of blk_mq_queue_tag_busy_iter()? And then wait for some
>> update or some more info from Kashyap regarding blk_mq_queue_tag_busy_iter()
> Looks fine:
>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks, I'll just send a v2 with your tag for clarity, as there has been
much discussion here.

John
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..c943b6529619 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -378,9 +378,12 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
-	int i;
+	unsigned int flags = tagset->flags;
+	int i, nr_tags;
+
+	nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;
 
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	for (i = 0; i < nr_tags; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of #hw queues in
that case. Rather it should just iter once.

Fixes: e0fdf846c7bb ("blk-mq: Use shared tags for shared sbitmap support")
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>