
blk-mq: don't queue passthrough request into scheduler

Message ID 20230512150328.192908-1-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei May 12, 2023, 3:03 p.m. UTC
Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
some schedulers (such as bfq) assume that req->bio is always available and
that the blk-cgroup can be retrieved via the bio.

Sometimes a pt request can be part of error handling, so it is better to
always queue it to hctx->dispatch directly.

Fix this issue by queuing pt requests from the plug list to hctx->dispatch
directly.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
Guang Wu, please test this patch and report the result.

 block/blk-mq.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
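
For context, a minimal sketch of the failure mode described above: a scheduler
hook that unconditionally dereferences the request's bio will oops on a
bio-less passthrough request. The hook below is hypothetical; bfq's actual
insertion path differs in detail.

	/* Hypothetical scheduler hook illustrating the crash: data-less
	 * passthrough requests have rq->bio == NULL, so an unconditional
	 * bio dereference is a NULL pointer dereference. */
	static void sched_insert_rq(struct request *rq)
	{
		struct bio *bio = rq->bio;		/* NULL for bio-less pt requests */
		struct blkcg_gq *blkg = bio->bi_blkg;	/* oops when bio is NULL */

		(void)blkg;
	}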

Comments

Jens Axboe May 12, 2023, 3:08 p.m. UTC | #1
On 5/12/23 9:03 AM, Ming Lei wrote:
> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> some schedulers (such as bfq) assume that req->bio is always available and
> that the blk-cgroup can be retrieved via the bio.
> 
> Sometimes a pt request can be part of error handling, so it is better to
> always queue it to hctx->dispatch directly.
> 
> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> directly.

Why not just add the check to the BFQ insertion? That would be a lot
simpler and would not pollute the core with this stuff.
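
For illustration, the narrowest form of that suggestion would be a guard at
the elevator insertion boundary, diverting pt requests to hctx->dispatch
before the scheduler ever sees them. A rough sketch, not the merged fix; the
dispatch-splice names come from the patch quoted below, and the final
elevator call is a placeholder:

	static void insert_rq_or_bypass(struct blk_mq_hw_ctx *hctx,
					struct request *rq)
	{
		if (blk_rq_is_passthrough(rq)) {
			/* never let the elevator see a pt request */
			spin_lock(&hctx->lock);
			list_add_tail(&rq->queuelist, &hctx->dispatch);
			spin_unlock(&hctx->lock);
			blk_mq_run_hw_queue(hctx, false);
			return;
		}
		/* normal IO: hand it to the scheduler as before */
		elevator_insert(hctx, rq);	/* placeholder for the elevator path */
	}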
Ming Lei May 12, 2023, 3:19 p.m. UTC | #2
On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> On 5/12/23 9:03 AM, Ming Lei wrote:
> > Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> > some schedulers (such as bfq) assume that req->bio is always available and
> > that the blk-cgroup can be retrieved via the bio.
> > 
> > Sometimes a pt request can be part of error handling, so it is better to
> > always queue it to hctx->dispatch directly.
> > 
> > Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> > directly.
> 
> Why not just add the check to the BFQ insertion? That would be a lot
> simpler and would not pollute the core with this stuff.

A pt request is supposed to be issued to the device directly, and we never
queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
passthrough plugging").

Some pt requests might be part of error handling, and adding them to the
scheduler could cause an IO hang.


Thanks,
Ming
Jens Axboe May 12, 2023, 3:25 p.m. UTC | #3
On 5/12/23 9:19 AM, Ming Lei wrote:
> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
>> On 5/12/23 9:03 AM, Ming Lei wrote:
>>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
>>> some schedulers (such as bfq) assume that req->bio is always available and
>>> that the blk-cgroup can be retrieved via the bio.
>>>
>>> Sometimes a pt request can be part of error handling, so it is better to
>>> always queue it to hctx->dispatch directly.
>>>
>>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
>>> directly.
>>
>> Why not just add the check to the BFQ insertion? That would be a lot
>> simpler and would not pollute the core with this stuff.
> 
> A pt request is supposed to be issued to the device directly, and we never
> queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
> passthrough plugging").
> 
> Some pt requests might be part of error handling, and adding them to the
> scheduler could cause an IO hang.

I'm not suggesting adding it to the scheduler, just having the bypass
"add to dispatch" in a different spot.

Let me take a look at it... Do we have a reproducer for this issue?
Ming Lei May 12, 2023, 3:34 p.m. UTC | #4
On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> On 5/12/23 9:19 AM, Ming Lei wrote:
> > On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> >> On 5/12/23 9:03 AM, Ming Lei wrote:
> >>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> >>> some schedulers (such as bfq) assume that req->bio is always available and
> >>> that the blk-cgroup can be retrieved via the bio.
> >>>
> >>> Sometimes a pt request can be part of error handling, so it is better to
> >>> always queue it to hctx->dispatch directly.
> >>>
> >>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> >>> directly.
> >>
> >> Why not just add the check to the BFQ insertion? That would be a lot
> >> simpler and would not pollute the core with this stuff.
> > 
> > A pt request is supposed to be issued to the device directly, and we never
> > queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
> > passthrough plugging").
> > 
> > Some pt requests might be part of error handling, and adding them to the
> > scheduler could cause an IO hang.
> 
> I'm not suggesting adding it to the scheduler, just having the bypass
> "add to dispatch" in a different spot.

Originally each request was added to dispatch in blk_execute_rq_nowait(), but
now we support plugging for pt requests; that is why I add the bypass in
blk_mq_dispatch_plug_list() and grab the lock just once per batch, given that
blk_execute_rq_nowait() is now the fast path for the nvme uring pt IO feature.
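
For reference, this is the hunk of the patch that implements that batched
bypass: a single lock round-trip splices the whole plug batch onto
hctx->dispatch, instead of taking the lock once per request as per-request
insertion would.

	spin_lock(&this_hctx->lock);
	list_splice_tail_init(&list, &this_hctx->dispatch);
	spin_unlock(&this_hctx->lock);
	blk_mq_run_hw_queue(this_hctx, from_sched);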

> 
> Let me take a look at it... Do we have a reproducer for this issue?

Guang Wu and Yu Kuai should have one; I didn't succeed in reproducing it
by setting bfq & the io.bfq.weight cgroup in my test VM.


Thanks, 
Ming
Jens Axboe May 12, 2023, 3:43 p.m. UTC | #5
On 5/12/23 9:34 AM, Ming Lei wrote:
> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
>> On 5/12/23 9:19 AM, Ming Lei wrote:
>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
>>>> On 5/12/23 9:03 AM, Ming Lei wrote:
>>>>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
>>>>> some schedulers (such as bfq) assume that req->bio is always available and
>>>>> that the blk-cgroup can be retrieved via the bio.
>>>>>
>>>>> Sometimes a pt request can be part of error handling, so it is better to
>>>>> always queue it to hctx->dispatch directly.
>>>>>
>>>>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
>>>>> directly.
>>>>
>>>> Why not just add the check to the BFQ insertion? That would be a lot
>>>> simpler and would not pollute the core with this stuff.
>>>
>>> A pt request is supposed to be issued to the device directly, and we never
>>> queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
>>> passthrough plugging").
>>>
>>> Some pt requests might be part of error handling, and adding them to the
>>> scheduler could cause an IO hang.
>>
>> I'm not suggesting adding it to the scheduler, just having the bypass
>> "add to dispatch" in a different spot.
> 
> Originally each request was added to dispatch in blk_execute_rq_nowait(), but
> now we support plugging for pt requests; that is why I add the bypass in
> blk_mq_dispatch_plug_list() and grab the lock just once per batch, given that
> blk_execute_rq_nowait() is now the fast path for the nvme uring pt IO feature.

We really have two types of passthrough - normal IO, and potential
error-recovery-style IO. The former can plug just fine, and I don't
think we should treat it differently. It might make more sense to just
bypass plugging for error-handling IO, or for pt that doesn't transfer
any data, to avoid having a NULL bio inserted into the scheduler.

>> Let me take a look at it... Do we have a reproducer for this issue?
> 
> Guang Wu and Yu Kuai should have one; I didn't succeed in reproducing it
> by setting bfq & the io.bfq.weight cgroup in my test VM.

I didn't either, but most likely because all the pt testing I did was
mapped IO. So there would be a bio there.
Ming Lei May 12, 2023, 3:55 p.m. UTC | #6
On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> On 5/12/23 9:34 AM, Ming Lei wrote:
> > On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> >> On 5/12/23 9:19 AM, Ming Lei wrote:
> >>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> >>>> On 5/12/23 9:03 AM, Ming Lei wrote:
> >>>>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> >>>>> some schedulers (such as bfq) assume that req->bio is always available and
> >>>>> that the blk-cgroup can be retrieved via the bio.
> >>>>>
> >>>>> Sometimes a pt request can be part of error handling, so it is better to
> >>>>> always queue it to hctx->dispatch directly.
> >>>>>
> >>>>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> >>>>> directly.
> >>>>
> >>>> Why not just add the check to the BFQ insertion? That would be a lot
> >>>> simpler and would not pollute the core with this stuff.
> >>>
> >>> A pt request is supposed to be issued to the device directly, and we never
> >>> queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
> >>> passthrough plugging").
> >>>
> >>> Some pt requests might be part of error handling, and adding them to the
> >>> scheduler could cause an IO hang.
> >>
> >> I'm not suggesting adding it to the scheduler, just having the bypass
> >> "add to dispatch" in a different spot.
> > 
> > Originally each request was added to dispatch in blk_execute_rq_nowait(), but
> > now we support plugging for pt requests; that is why I add the bypass in
> > blk_mq_dispatch_plug_list() and grab the lock just once per batch, given that
> > blk_execute_rq_nowait() is now the fast path for the nvme uring pt IO feature.
> 
> We really have two types of passthrough - normal IO, and potential
> error-recovery-style IO. The former can plug just fine, and I don't
> think we should treat it differently. It might make more sense to just
> bypass plugging for error-handling IO, or for pt that doesn't transfer
> any data, to avoid having a NULL bio inserted into the scheduler.

So far, I guess we can't distinguish the two types.

The simplest change could be the previous one[1], which avoids plugging
requests with !rq->bio, but a pt request with user IO can still be added to
the scheduler, so the question is whether a pt request with user IO should
be queued to the scheduler at all.

[1] https://lore.kernel.org/linux-block/CAGS2=Yob_Ud9A-aTu5hQt8+kW4cyrLX12hNJTrRkJYigFT-AmA@mail.gmail.com/T/#m8ac66a4294707cf1ba301ef2bd2a70f9a0e2c8cb
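
A sketch of that earlier variant, as a hypothetical reconstruction of [1]:
only requests that carry a bio are plugged, so a bio-less pt request never
reaches the plug list. The bypass-insert signature is approximate for this
kernel era.

	/* In blk_execute_rq_nowait(): plug only requests that carry data,
	 * so a bio-less pt request cannot reach a scheduler via the plug
	 * list. */
	if (current->plug && rq->bio)
		blk_add_rq_to_plug(current->plug, rq);
	else
		blk_mq_request_bypass_insert(rq, false, true);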

Thanks, 
Ming
Jens Axboe May 12, 2023, 4:10 p.m. UTC | #7
On 5/12/23 9:55 AM, Ming Lei wrote:
> On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
>> On 5/12/23 9:34 AM, Ming Lei wrote:
>>> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
>>>> On 5/12/23 9:19 AM, Ming Lei wrote:
>>>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
>>>>>> On 5/12/23 9:03 AM, Ming Lei wrote:
>>>>>>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
>>>>>>> some schedulers (such as bfq) assume that req->bio is always available and
>>>>>>> that the blk-cgroup can be retrieved via the bio.
>>>>>>>
>>>>>>> Sometimes a pt request can be part of error handling, so it is better to
>>>>>>> always queue it to hctx->dispatch directly.
>>>>>>>
>>>>>>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
>>>>>>> directly.
>>>>>>
>>>>>> Why not just add the check to the BFQ insertion? That would be a lot
>>>>>> simpler and would not pollute the core with this stuff.
>>>>>
>>>>> A pt request is supposed to be issued to the device directly, and we never
>>>>> queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
>>>>> passthrough plugging").
>>>>>
>>>>> Some pt requests might be part of error handling, and adding them to the
>>>>> scheduler could cause an IO hang.
>>>>
>>>> I'm not suggesting adding it to the scheduler, just having the bypass
>>>> "add to dispatch" in a different spot.
>>>
>>> Originally each request was added to dispatch in blk_execute_rq_nowait(), but
>>> now we support plugging for pt requests; that is why I add the bypass in
>>> blk_mq_dispatch_plug_list() and grab the lock just once per batch, given that
>>> blk_execute_rq_nowait() is now the fast path for the nvme uring pt IO feature.
>>
>> We really have two types of passthrough - normal IO, and potential
>> error-recovery-style IO. The former can plug just fine, and I don't
>> think we should treat it differently. It might make more sense to just
>> bypass plugging for error-handling IO, or for pt that doesn't transfer
>> any data, to avoid having a NULL bio inserted into the scheduler.
> 
> So far, I guess we can't distinguish the two types.

Right, and that seems to be the real problem here. What is true for any
pt request is that normal sorting by sector doesn't work, but it really
would be nice to distinguish the two for other reasons. E.g. "true" pt
error handling should definitely just go to the dispatch list, and we
don't need to apply any batching etc. for it. For uring_cmd based
passthrough, we most certainly should.

> The simplest change could be the previous one[1], which avoids plugging
> requests with !rq->bio, but a pt request with user IO can still be added to
> the scheduler, so the question is whether a pt request with user IO should
> be queued to the scheduler at all.

I'm fine bypassing the scheduler insertion for that; my main objection
to your original patch is that it's a bit messy, and I was hoping we
could do a cleaner separation earlier. But I _think_ that'd get us into
a full logic bypass for pt, which again would not be ideal for
performance-oriented pt.
Christoph Hellwig May 12, 2023, 8:20 p.m. UTC | #8
On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> On 5/12/23 9:03 AM, Ming Lei wrote:
> > Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> > some schedulers (such as bfq) assume that req->bio is always available and
> > that the blk-cgroup can be retrieved via the bio.
> > 
> > Sometimes a pt request can be part of error handling, so it is better to
> > always queue it to hctx->dispatch directly.
> > 
> > Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> > directly.
> 
> Why not just add the check to the BFQ insertion? That would be a lot
> simpler and would not pollute the core with this stuff.

Because we really need to keep the passthrough code separate.  The
fact that a passthrough request can leak into common code in various
places is really a bit of a problem.  We have most of these nicely
separate with two exceptions:

 - the plug list
 - the requeue list

The higher up and the more obviously we special-case passthrough requests
there, the better for debuggability and maintainability.
Christoph Hellwig May 12, 2023, 8:21 p.m. UTC | #9
On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> We really have two types of passthrough - normal IO, and potential
> error-recovery-style IO. The former can plug just fine, and I don't
> think we should treat it differently. It might make more sense to just
> bypass plugging for error-handling IO, or for pt that doesn't transfer
> any data, to avoid having a NULL bio inserted into the scheduler.

I'd suggest we distinguish by ... having a plug or not.  Anything
internal like EH or probing should never come from a context with
a plug.
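
Sketched as code, that heuristic might look like the following; this is an
illustration only, and the helper signatures are approximate for this kernel
era:

	/* Internal submitters (EH, probing) never run under a plug, so
	 * current->plug separates fast-path pt IO from internal pt
	 * commands. */
	static void submit_pt_rq(struct request *rq, bool at_head)
	{
		if (current->plug)	/* uring_cmd-style fast path */
			blk_add_rq_to_plug(current->plug, rq);
		else			/* EH/probing: straight to dispatch */
			blk_mq_request_bypass_insert(rq, at_head, true);
	}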
Jens Axboe May 12, 2023, 9:31 p.m. UTC | #10
On 5/12/23 2:21 PM, Christoph Hellwig wrote:
> On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
>> We really have two types of passthrough - normal IO, and potential
>> error-recovery-style IO. The former can plug just fine, and I don't
>> think we should treat it differently. It might make more sense to just
>> bypass plugging for error-handling IO, or for pt that doesn't transfer
>> any data, to avoid having a NULL bio inserted into the scheduler.
> 
> I'd suggest we distinguish by ... having a plug or not.  Anything
> internal like EH or probing should never come from a context with
> a plug.

That's fine for telling error handling, probing, etc. apart from "normal"
IO, but it doesn't solve the problem at hand, which is the assumption that
anything inserted into the scheduler has a bio. This is obviously true
for actual IO done via pt, but you can do data-less commands through
that too.
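
As a concrete example of such a data-less command: an NVMe Flush sent through
the passthrough ioctl carries no payload, so the resulting kernel request has
no bio. The device path and namespace id below are assumptions.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	int main(void)
	{
		struct nvme_passthru_cmd cmd;
		int fd = open("/dev/nvme0n1", O_RDONLY);	/* assumed device */

		if (fd < 0)
			return 1;
		memset(&cmd, 0, sizeof(cmd));
		cmd.opcode = 0x00;	/* nvme_cmd_flush */
		cmd.nsid = 1;		/* assumed namespace id */
		/* addr/data_len stay 0: no data transfer, hence no bio */
		return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
	}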
Christoph Hellwig May 12, 2023, 11:12 p.m. UTC | #11
On Fri, May 12, 2023 at 03:31:39PM -0600, Jens Axboe wrote:
> That's fine for telling error handling, probing, etc. apart from "normal"
> IO, but it doesn't solve the problem at hand, which is the assumption that
> anything inserted into the scheduler has a bio.

I think this is fine.  Passthrough requests should never go to the
scheduler; that's not where they belong.  Non-passthrough requests always
have a bio.
Christoph Hellwig May 12, 2023, 11:14 p.m. UTC | #12
On Fri, May 12, 2023 at 04:12:42PM -0700, Christoph Hellwig wrote:
> On Fri, May 12, 2023 at 03:31:39PM -0600, Jens Axboe wrote:
> > That's fine for telling error handling, probing, etc. apart from "normal"
> > IO, but it doesn't solve the problem at hand, which is the assumption that
> > anything inserted into the scheduler has a bio.
> 
> I think this is fine.  Passthrough requests should never go to the
> scheduler; that's not where they belong.  Non-passthrough requests always
> have a bio.

.. except for the flushes generated by block/blk-flush.c.  Those do not
enter the I/O scheduler either.  That being said, requiring every request
to have a bio might be a good thing some day.  That is totally independent
of the fact that schedulers have no business touching passthrough requests.
Ming Lei May 13, 2023, 10:55 a.m. UTC | #13
On Fri, May 12, 2023 at 10:10:55AM -0600, Jens Axboe wrote:
> On 5/12/23 9:55 AM, Ming Lei wrote:
> > On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> >> On 5/12/23 9:34 AM, Ming Lei wrote:
> >>> On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> >>>> On 5/12/23 9:19 AM, Ming Lei wrote:
> >>>>> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> >>>>>> On 5/12/23 9:03 AM, Ming Lei wrote:
> >>>>>>> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> >>>>>>> some schedulers (such as bfq) assume that req->bio is always available and
> >>>>>>> that the blk-cgroup can be retrieved via the bio.
> >>>>>>>
> >>>>>>> Sometimes a pt request can be part of error handling, so it is better to
> >>>>>>> always queue it to hctx->dispatch directly.
> >>>>>>>
> >>>>>>> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> >>>>>>> directly.
> >>>>>>
> >>>>>> Why not just add the check to the BFQ insertion? That would be a lot
> >>>>>> simpler and would not pollute the core with this stuff.
> >>>>>
> >>>>> A pt request is supposed to be issued to the device directly, and we never
> >>>>> queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up support for
> >>>>> passthrough plugging").
> >>>>>
> >>>>> Some pt requests might be part of error handling, and adding them to the
> >>>>> scheduler could cause an IO hang.
> >>>>
> >>>> I'm not suggesting adding it to the scheduler, just having the bypass
> >>>> "add to dispatch" in a different spot.
> >>>
> >>> Originally each request was added to dispatch in blk_execute_rq_nowait(), but
> >>> now we support plugging for pt requests; that is why I add the bypass in
> >>> blk_mq_dispatch_plug_list() and grab the lock just once per batch, given that
> >>> blk_execute_rq_nowait() is now the fast path for the nvme uring pt IO feature.
> >>
> >> We really have two types of passthrough - normal IO, and potential
> >> error-recovery-style IO. The former can plug just fine, and I don't
> >> think we should treat it differently. It might make more sense to just
> >> bypass plugging for error-handling IO, or for pt that doesn't transfer
> >> any data, to avoid having a NULL bio inserted into the scheduler.
> > 
> > So far, I guess we can't distinguish the two types.
> 
> Right, and that seems to be the real problem here. What is true for any
> pt request is that normal sorting by sector doesn't work, but it really
> would be nice to distinguish the two for other reasons. E.g. "true" pt
> error handling should definitely just go to the dispatch list, and we
> don't need to apply any batching etc. for it. For uring_cmd based
> passthrough, we most certainly should.
> 
> > The simplest change could be the previous one[1], which avoids plugging
> > requests with !rq->bio, but a pt request with user IO can still be added to
> > the scheduler, so the question is whether a pt request with user IO should
> > be queued to the scheduler at all.
> 
> I'm fine bypassing the scheduler insertion for that; my main objection
> to your original patch is that it's a bit messy, and I was hoping we
> could do a cleaner separation earlier. But I _think_ that'd get us into
> a full logic bypass for pt, which again would not be ideal for
> performance-oriented pt.

Yeah, that is the reason why this patch batches pt requests and bypasses
them all, so nvme uring_cmd pt IO performance won't be hurt.

So I guess you are now fine with this patch? If so, I will post a V2 with a
revised commit log, e.g. including Christoph's related comments.

Otherwise, any suggestions w.r.t. an early bypass?

- bypass in the scheduler case, but that is still a bit tricky: why does pt
  need to care about the elevator? Meanwhile, performance is still affected
  for any IOs sharing the current plug


Thanks, 
Ming
Ming Lei May 13, 2023, 11:15 a.m. UTC | #14
On Fri, May 12, 2023 at 01:20:15PM -0700, Christoph Hellwig wrote:
> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> > On 5/12/23 9:03 AM, Ming Lei wrote:
> > > Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> > > some schedulers (such as bfq) assume that req->bio is always available and
> > > that the blk-cgroup can be retrieved via the bio.
> > > 
> > > Sometimes a pt request can be part of error handling, so it is better to
> > > always queue it to hctx->dispatch directly.
> > > 
> > > Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> > > directly.
> > 
> > Why not just add the check to the BFQ insertion? That would be a lot
> > simpler and would not pollute the core with this stuff.
> 
> Because we really need to keep the passthrough code separate.  The
> fact that a passthrough request can leak into common code in various
> places is really a bit of a problem.  We have most of these nicely
> separate with two exceptions:
> 
>  - the plug list
>  - the requeue list
> 
> The higher up and the more obviously we special-case passthrough requests
> there, the better for debuggability and maintainability.

Agree, and there could be more things involved:

1) pt requests share the same tags, so scheduler tags are used for
allocating pt requests

2) RQF_ELV is set for pt requests, and e->type->ops.requeue_request()/
e->type->ops.completed_request() might still be called for pt requests.

Probably, in the long term, one dedicated request queue
(BLK_MQ_F_TAG_QUEUE_SHARED/BLK_MQ_F_NO_SCHED) could be used for handling
passthrough only if the device exposes a pt interface. Then I guess the core
code could be cleaned up a bit and it would become easier to improve both
paths, but plug handling would still have to cover both.
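
To make point 2) concrete, a defensive guard in an elevator requeue hook
might look like this; the hook body is hypothetical:

	static void sched_requeue_request(struct request *rq)
	{
		/* pt requests never entered the scheduler, so skip them */
		if (blk_rq_is_passthrough(rq))
			return;
		/* normal per-scheduler requeue accounting would go here */
	}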


thanks,
Ming
Jens Axboe May 13, 2023, 4:50 p.m. UTC | #15
On 5/12/23 9:03 AM, Ming Lei wrote:
> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> some schedulers (such as bfq) assume that req->bio is always available and
> that the blk-cgroup can be retrieved via the bio.
> 
> Sometimes a pt request can be part of error handling, so it is better to
> always queue it to hctx->dispatch directly.
> 
> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> directly.
> 
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> Guang Wu, please test this patch and report the result.
> 
>  block/blk-mq.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f6dad0886a2f..11efaefa26c3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	struct request *requeue_list = NULL;
>  	struct request **requeue_lastp = &requeue_list;
>  	unsigned int depth = 0;
> +	bool pt = false;
>  	LIST_HEAD(list);
>  
>  	do {
> @@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  		if (!this_hctx) {
>  			this_hctx = rq->mq_hctx;
>  			this_ctx = rq->mq_ctx;
> -		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
> +			pt = blk_rq_is_passthrough(rq);
> +		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> +				pt != blk_rq_is_passthrough(rq)) {
>  			rq_list_add_tail(&requeue_lastp, rq);
>  			continue;
>  		}
> @@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	trace_block_unplug(this_hctx->queue, depth, !from_sched);
>  
>  	percpu_ref_get(&this_hctx->queue->q_usage_counter);
> -	if (this_hctx->queue->elevator) {
> +	if (this_hctx->queue->elevator && !pt) {
>  		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
>  				&list, 0);
>  		blk_mq_run_hw_queue(this_hctx, from_sched);
> +	} else if (pt) {
> +		spin_lock(&this_hctx->lock);
> +		list_splice_tail_init(&list, &this_hctx->dispatch);
> +		spin_unlock(&this_hctx->lock);
> +		blk_mq_run_hw_queue(this_hctx, from_sched);
>  	} else {
>  		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
>  	}

I think this would look a lot better as:

	if (pt) {
		...
	} else if (this_hctx->queue->elevator) {
		...
	} else {
		...
	}

and add a comment above that first if condition on why this distinction
is being made.
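
Concretely, the reordered branch could read as follows; this is a sketch
against the patch above, with illustrative comment wording:

	/*
	 * Passthrough requests must not go through the I/O scheduler:
	 * schedulers such as bfq may dereference req->bio, which pt
	 * requests may not have, and pt requests can be part of error
	 * handling.
	 */
	if (pt) {
		spin_lock(&this_hctx->lock);
		list_splice_tail_init(&list, &this_hctx->dispatch);
		spin_unlock(&this_hctx->lock);
		blk_mq_run_hw_queue(this_hctx, from_sched);
	} else if (this_hctx->queue->elevator) {
		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
				&list, 0);
		blk_mq_run_hw_queue(this_hctx, from_sched);
	} else {
		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
	}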
Guangwu Zhang May 14, 2023, 12:11 p.m. UTC | #16
Hi,
I can't reproduce the issue with your patch, but I hit another issue;
please have a look.

[  136.002469] qla2xxx [0000:af:00.6]-800e:2: DEVICE RESET SUCCEEDED
nexus:2:0:0 cmd=000000008d9455f0.
[  136.011591] qla2xxx [0000:af:00.6]-8009:2: TARGET RESET ISSUED
nexus=2:0 cmd=000000008d9455f0.
[  136.206701] qla2xxx [0000:af:00.6]-800e:2: TARGET RESET SUCCEEDED
nexus:2:0 cmd=000000008d9455f0.
[  168.768518] qla2xxx [0000:af:00.7]-801c:4: Abort command issued
nexus=4:0:0 -- 2003.
[  168.776324] qla2xxx [0000:af:00.7]-8009:4: DEVICE RESET ISSUED
nexus=4:0:0 cmd=000000005642c8ad.
[  168.836726] qla2xxx [0000:af:00.7]-5032:4: ABT_IOCB: Invalid
completion handle (21) -- timed-out.
[  168.845758] qla2xxx [0000:af:00.7]-800e:4: DEVICE RESET SUCCEEDED
nexus:4:0:0 cmd=000000005642c8ad.
[  168.854881] qla2xxx [0000:af:00.7]-8018:4: ADAPTER RESET ISSUED nexus=4:0:0.
[  168.861994] qla2xxx [0000:af:00.7]-00af:4: Performing ISP error
recovery - ha=00000000f7f90a13.
[  170.944477] qla2xxx [0000:af:00.7]-00b4:4: Done chip reset cleanup.
[  170.960311] qla2xxx [0000:af:00.7]-00d2:4: Init Firmware **** FAILED ****.
[  170.967241] qla2xxx [0000:af:00.7]-8016:4: qla82xx_restart_isp ****
FAILED ****.
[  170.974693] qla2xxx [0000:af:00.7]-b02f:4: HW State: NEED RESET
[  170.980671] qla2xxx [0000:af:00.7]-009b:4: Device state is 0x4 = Need Reset.
[  170.987784] qla2xxx [0000:af:00.7]-009d:4: Device state is 0x4 = Need Reset.
[  171.968416] qla2xxx [0000:af:00.6]-6001:2: Adapter reset needed.
[  171.974549] qla2xxx [0000:af:00.6]-b031:2: Device state is 0x4 = Need Reset.
[  171.981680] qla2xxx [0000:af:00.6]-009b:2: Device state is 0x4 = Need Reset.
[  171.988776] qla2xxx [0000:af:00.6]-009d:2: Device state is 0x4 = Need Reset.
[  171.995872] qla2xxx [0000:af:00.6]-00af:2: Performing ISP error
recovery - ha=0000000071956e30.
[  174.080409] qla2xxx [0000:af:00.6]-00b4:2: Done chip reset cleanup.
[  174.086822] qla2xxx [0000:af:00.6]-801c:2: Abort command issued
nexus=2:0:0 -- 2002.
[  174.094635] qla2xxx [0000:af:00.6]-8018:2: ADAPTER RESET ISSUED nexus=2:0:0.
[  174.101728] qla2xxx [0000:af:00.6]-8017:2: ADAPTER RESET FAILED nexus=2:0:0.
[  174.108910] scsi 2:0:0:0: rejecting I/O to offline device
[  175.104377] qla2xxx [0000:af:00.7]-00b6:4: Device state is 0x4 = Need Reset.
[  175.111480] qla2xxx [0000:af:00.7]-00b7:4: HW State: COLD/RE-INIT.
[  177.184324] ------------[ cut here ]------------
[  177.188972] WARNING: CPU: 7 PID: 813 at
drivers/scsi/qla2xxx/qla_nx.c:507
qla82xx_need_reset_handler+0x1a4/0x470 [qla2xxx]
[  177.200131] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat
dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency
intel_uncore_frequency_common isst_if_common nfit libnvdimm mgag200
x86_pkg_temp_thermal i2c_algo_bit intel_powerclamp ipmi_ssif coretemp
drm_shmem_helper drm_kms_helper rapl syscopyarea mei_me sysfillrect
ses sysimgblt intel_cstate enclosure acpi_ipmi ioatdma acpi_tad
ipmi_si lpc_ich intel_pch_thermal intel_uncore acpi_power_meter mei
ipmi_devintf pcspkr dca hpilo ipmi_msghandler drm fuse xfs sd_mod sg
qla2xxx qla4xxx bnx2x nvme_fc nvme nvme_fabrics crct10dif_pclmul
crc32_pclmul nvme_core libiscsi smartpqi ghash_clmulni_intel
scsi_transport_iscsi nvme_common uas scsi_transport_fc t10_pi mdio
iscsi_boot_sysfs libcrc32c usb_storage scsi_transport_sas crc32c_intel
tg3 hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[  177.281274] CPU: 7 PID: 813 Comm: qla2xxx_2_dpc Kdump: loaded Not
tainted 6.4.0-rc1+ #1
[  177.289328] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380
Gen10, BIOS U30 03/09/2020
[  177.297903] RIP: 0010:qla82xx_need_reset_handler+0x1a4/0x470 [qla2xxx]
[  177.304513] Code: a6 cd e8 8f 49 4f ce eb 0a bf 64 00 00 00 e8 63
0d a6 cd be 28 c0 11 06 48 89 ef e8 16 a9 ff ff 83 f8 01 74 07 83 eb
01 75 df <0f> 0b be 44 21 20 08 48 89 ef e8 fd a8 ff ff be 38 21 20 08
48 89
[  177.323398] RSP: 0018:ffffaf0b0508be48 EFLAGS: 00010246
[  177.328656] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000130000
[  177.335834] RDX: ffffaf0b0508be00 RSI: ffffaf0b0593c028 RDI: ffff9b2420022000
[  177.343013] RBP: ffff9b2420022000 R08: 0000000000120000 R09: ffff9b272fff2878
[  177.350193] R10: 0000000000000162 R11: 0000000003c23c49 R12: 00000000fffe3bdb
[  177.357373] R13: 0000000000000000 R14: ffff9b274fca6840 R15: 0000000001000000
[  177.364554] FS:  0000000000000000(0000) GS:ffff9b272ffc0000(0000)
knlGS:0000000000000000
[  177.372695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  177.378476] CR2: 000055b7803d43d8 CR3: 000000079c220001 CR4: 00000000007706e0
[  177.385656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  177.392836] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  177.400015] PKRU: 55555554
[  177.402738] Call Trace:
[  177.405201]  <TASK>
[  177.407316]  qla82xx_device_state_handler+0x285/0x4a0 [qla2xxx]
[  177.413303]  ? __pfx_qla2x00_do_dpc+0x10/0x10 [qla2xxx]
[  177.418588]  qla82xx_abort_isp+0x190/0x290 [qla2xxx]
[  177.423613]  qla2x00_do_dpc+0x743/0xa60 [qla2xxx]
[  177.428374]  ? __pfx_qla2x00_do_dpc+0x10/0x10 [qla2xxx]
[  177.433655]  kthread+0xdf/0x110
[  177.436821]  ? __pfx_kthread+0x10/0x10
[  177.440595]  ret_from_fork+0x29/0x50
[  177.444201]  </TASK>
[  177.446402] ---[ end trace 0000000000000000 ]---
[  177.451052] qla2xxx [0000:af:00.6]-00b6:2: Device state is 0x1 =
Cold/Re-init.
[  177.458319] qla2xxx [0000:af:00.6]-009d:2: Device state is 0x1 =
Cold/Re-init.
[  177.465588] qla2xxx [0000:af:00.6]-009e:2: HW State: INITIALIZING.
[    0.020959] APIC: Disabling requested cpu. Processor 0/0x0 ignored.

Ming Lei <ming.lei@redhat.com> wrote on Fri, May 12, 2023 at 23:06:
>
> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> some schedulers (such as bfq) assume that req->bio is always available and
> that the blk-cgroup can be retrieved via the bio.
>
> Sometimes a pt request can be part of error handling, so it is better to
> always queue it to hctx->dispatch directly.
>
> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> directly.
>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> Guang Wu, please test this patch and report the result.
>
>  block/blk-mq.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f6dad0886a2f..11efaefa26c3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>         struct request *requeue_list = NULL;
>         struct request **requeue_lastp = &requeue_list;
>         unsigned int depth = 0;
> +       bool pt = false;
>         LIST_HEAD(list);
>
>         do {
> @@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>                 if (!this_hctx) {
>                         this_hctx = rq->mq_hctx;
>                         this_ctx = rq->mq_ctx;
> -               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
> +                       pt = blk_rq_is_passthrough(rq);
> +               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> +                               pt != blk_rq_is_passthrough(rq)) {
>                         rq_list_add_tail(&requeue_lastp, rq);
>                         continue;
>                 }
> @@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>         trace_block_unplug(this_hctx->queue, depth, !from_sched);
>
>         percpu_ref_get(&this_hctx->queue->q_usage_counter);
> -       if (this_hctx->queue->elevator) {
> +       if (this_hctx->queue->elevator && !pt) {
>                 this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
>                                 &list, 0);
>                 blk_mq_run_hw_queue(this_hctx, from_sched);
> +       } else if (pt) {
> +               spin_lock(&this_hctx->lock);
> +               list_splice_tail_init(&list, &this_hctx->dispatch);
> +               spin_unlock(&this_hctx->lock);
> +               blk_mq_run_hw_queue(this_hctx, from_sched);
>         } else {
>                 blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
>         }
> --
> 2.38.1
>
Ming Lei May 14, 2023, 12:24 p.m. UTC | #17
Hi Guangwu,

On Sun, May 14, 2023 at 08:11:34PM +0800, Guangwu Zhang wrote:
> Hi.
> Don't reproduce the issue with your patch, but hit other issue, please
> have a look.

Can you share your test case a bit so that we can see if the warning on
qla2xxx is related to this patch?


Thanks,
Ming
Guangwu Zhang May 15, 2023, 2:33 p.m. UTC | #18
After reinstalling the upstream kernel, I also hit the reset issue, so
the issue isn't related to Ming's patch.
The reset issue was reported before; please have a look:
https://lore.kernel.org/linux-scsi/CAGS2=YrmwbhMpNA2REnBybvm5dehGRyKBX5Sq5BqY=ex=mwaUg@mail.gmail.com/


Ming Lei <ming.lei@redhat.com> wrote on Fri, May 12, 2023 at 23:06:
>
> Passthrough (pt) requests shouldn't be queued to the scheduler; in particular,
> some schedulers (such as bfq) assume that req->bio is always available and
> that the blk-cgroup can be retrieved via the bio.
>
> Sometimes a pt request can be part of error handling, so it is better to
> always queue it to hctx->dispatch directly.
>
> Fix this issue by queuing pt requests from the plug list to hctx->dispatch
> directly.
>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> Guang Wu, please test this patch and report the result.
>
>  block/blk-mq.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f6dad0886a2f..11efaefa26c3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>         struct request *requeue_list = NULL;
>         struct request **requeue_lastp = &requeue_list;
>         unsigned int depth = 0;
> +       bool pt = false;
>         LIST_HEAD(list);
>
>         do {
> @@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>                 if (!this_hctx) {
>                         this_hctx = rq->mq_hctx;
>                         this_ctx = rq->mq_ctx;
> -               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
> +                       pt = blk_rq_is_passthrough(rq);
> +               } else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> +                               pt != blk_rq_is_passthrough(rq)) {
>                         rq_list_add_tail(&requeue_lastp, rq);
>                         continue;
>                 }
> @@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>         trace_block_unplug(this_hctx->queue, depth, !from_sched);
>
>         percpu_ref_get(&this_hctx->queue->q_usage_counter);
> -       if (this_hctx->queue->elevator) {
> +       if (this_hctx->queue->elevator && !pt) {
>                 this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
>                                 &list, 0);
>                 blk_mq_run_hw_queue(this_hctx, from_sched);
> +       } else if (pt) {
> +               spin_lock(&this_hctx->lock);
> +               list_splice_tail_init(&list, &this_hctx->dispatch);
> +               spin_unlock(&this_hctx->lock);
> +               blk_mq_run_hw_queue(this_hctx, from_sched);
>         } else {
>                 blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
>         }
> --
> 2.38.1
>

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6dad0886a2f..11efaefa26c3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2711,6 +2711,7 @@  static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 	struct request *requeue_list = NULL;
 	struct request **requeue_lastp = &requeue_list;
 	unsigned int depth = 0;
+	bool pt = false;
 	LIST_HEAD(list);
 
 	do {
@@ -2719,7 +2720,9 @@  static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 		if (!this_hctx) {
 			this_hctx = rq->mq_hctx;
 			this_ctx = rq->mq_ctx;
-		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
+			pt = blk_rq_is_passthrough(rq);
+		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
+				pt != blk_rq_is_passthrough(rq)) {
 			rq_list_add_tail(&requeue_lastp, rq);
 			continue;
 		}
@@ -2731,10 +2734,15 @@  static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 	trace_block_unplug(this_hctx->queue, depth, !from_sched);
 
 	percpu_ref_get(&this_hctx->queue->q_usage_counter);
-	if (this_hctx->queue->elevator) {
+	if (this_hctx->queue->elevator && !pt) {
 		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
 				&list, 0);
 		blk_mq_run_hw_queue(this_hctx, from_sched);
+	} else if (pt) {
+		spin_lock(&this_hctx->lock);
+		list_splice_tail_init(&list, &this_hctx->dispatch);
+		spin_unlock(&this_hctx->lock);
+		blk_mq_run_hw_queue(this_hctx, from_sched);
 	} else {
 		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
 	}