Message ID: 20230512150328.192908-1-ming.lei@redhat.com (mailing list archive)
State: New, archived
Series: blk-mq: don't queue passthrough request into scheduler
On 5/12/23 9:03 AM, Ming Lei wrote:
> Passthrough (pt) requests shouldn't be queued to the scheduler; in
> particular, some schedulers (such as BFQ) assume that req->bio is
> always available and that the blk-cgroup can be retrieved via the bio.
>
> Sometimes a pt request can be part of error handling, so it is better
> to always queue it into hctx->dispatch directly.
>
> Fix this issue by queuing pt requests from the plug list to
> hctx->dispatch directly.

Why not just add the check to the BFQ insertion? That would be a lot
more trivial and would not be polluting the core with this stuff.
On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> On 5/12/23 9:03 AM, Ming Lei wrote:
> > Passthrough (pt) requests shouldn't be queued to the scheduler; in
> > particular, some schedulers (such as BFQ) assume that req->bio is
> > always available and that the blk-cgroup can be retrieved via the bio.
> > [...]
>
> Why not just add the check to the BFQ insertion? That would be a lot
> more trivial and would not be polluting the core with this stuff.

A pt request is supposed to be issued to the device directly, and we
never queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up
support for passthrough plugging").

Some pt requests might be part of error handling, and adding them to
the scheduler could cause an IO hang.

Thanks,
Ming
On 5/12/23 9:19 AM, Ming Lei wrote:
> On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
>> Why not just add the check to the BFQ insertion? That would be a lot
>> more trivial and would not be polluting the core with this stuff.
>
> A pt request is supposed to be issued to the device directly, and we
> never queued it to the scheduler before 1c2d2fff6dc0 ("block: wire-up
> support for passthrough plugging").
>
> Some pt requests might be part of error handling, and adding them to
> the scheduler could cause an IO hang.

I'm not suggesting adding it to the scheduler, just having the bypass
"add to dispatch" in a different spot. Let me take a look at it... Do
we have a reproducer for this issue?
On Fri, May 12, 2023 at 09:25:18AM -0600, Jens Axboe wrote:
> I'm not suggesting adding it to the scheduler, just having the bypass
> "add to dispatch" in a different spot.

Originally it was added to dispatch in blk_execute_rq_nowait() for each
request, but now we support plugging for pt requests; that is why I add
the bypass in blk_mq_dispatch_plug_list(), and just grab the lock once
per batch, given that blk_execute_rq_nowait() is now the fast path for
the NVMe uring passthrough IO feature.

> Let me take a look at it... Do we have a reproducer for this issue?

Guang Wu and Yu Kuai should have one; I didn't succeed in reproducing
it by setting bfq & io.bfq.weight cgroup in my test VM.

Thanks,
Ming
On 5/12/23 9:34 AM, Ming Lei wrote:
> Originally it was added to dispatch in blk_execute_rq_nowait() for each
> request, but now we support plugging for pt requests; that is why I add
> the bypass in blk_mq_dispatch_plug_list(), and just grab the lock once
> per batch, given that blk_execute_rq_nowait() is now the fast path for
> the NVMe uring passthrough IO feature.

We really have two types of passthrough - normal kind of IO, and
potential error recovery etc IO. The former can plug just fine, and I
don't think we should treat it differently. Might make more sense to
just bypass plugging for the error handling type of IO, or for pt that
doesn't transfer any data, to avoid having a NULL bio inserted into the
scheduler.

> Guang Wu and Yu Kuai should have one; I didn't succeed in reproducing
> it by setting bfq & io.bfq.weight cgroup in my test VM.

I didn't either, but most likely because all the pt testing I did was
mapped IO. So there would be a bio there.
On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> We really have two types of passthrough - normal kind of IO, and
> potential error recovery etc IO. The former can plug just fine, and I
> don't think we should treat it differently. Might make more sense to
> just bypass plugging for the error handling type of IO, or for pt that
> doesn't transfer any data, to avoid having a NULL bio inserted into the
> scheduler.

So far, I guess we can't distinguish the two types.

The simplest change could be the previous one[1], by not plugging
requests with !rq->bio, but a pt request with user IO could still be
added to the scheduler, so the question is whether pt requests with
user IO should be queued to the scheduler at all.

[1] https://lore.kernel.org/linux-block/CAGS2=Yob_Ud9A-aTu5hQt8+kW4cyrLX12hNJTrRkJYigFT-AmA@mail.gmail.com/T/#m8ac66a4294707cf1ba301ef2bd2a70f9a0e2c8cb

Thanks,
Ming
On 5/12/23 9:55 AM, Ming Lei wrote:
> So far, I guess we can't distinguish the two types.

Right, and that seems to be the real problem here. What is true for any
pt request is that normal sorting by sector doesn't work, but it really
would be nice to distinguish the two for other reasons. Eg "true" pt
error handling should definitely just go to the dispatch list, and we
don't need to apply any batching etc for them. For uring_cmd based
passthrough, we most certainly should.

> The simplest change could be the previous one[1], by not plugging
> requests with !rq->bio, but a pt request with user IO could still be
> added to the scheduler, so the question is whether pt requests with
> user IO should be queued to the scheduler at all.

I'm fine bypassing the scheduler insertion for that; my main objection
to your original patch is that it's a bit messy, and I was hoping we
could do a cleaner separation earlier. But I _think_ that'd get us into
a full logic bypass for pt, which again would not be ideal for
performance oriented pt.
On Fri, May 12, 2023 at 09:08:54AM -0600, Jens Axboe wrote:
> Why not just add the check to the BFQ insertion? That would be a lot
> more trivial and would not be polluting the core with this stuff.

Because we really need to keep the passthrough code separate. The
fact that a passthrough request can leak into common code in various
places is really a bit of a problem. We have most of these nicely
separated, with two exceptions:

 - the plug list
 - the requeue list

The higher level and the more obvious we special case the passthrough
requests there, the better for debuggability and maintainability.
On Fri, May 12, 2023 at 09:43:32AM -0600, Jens Axboe wrote:
> We really have two types of passthrough - normal kind of IO, and
> potential error recovery etc IO. The former can plug just fine, and I
> don't think we should treat it differently. Might make more sense to
> just bypass plugging for the error handling type of IO, or for pt that
> doesn't transfer any data, to avoid having a NULL bio inserted into the
> scheduler.

I'd suggest we distinguish by ... having a plug or not. Anything
internal like EH or probing should never come from a context with
a plug.
On 5/12/23 2:21 PM, Christoph Hellwig wrote:
> I'd suggest we distinguish by ... having a plug or not. Anything
> internal like EH or probing should never come from a context with
> a plug.

That's fine for telling error handling, probing etc apart from "normal"
IO, but it doesn't solve the problem at hand, which is assuming that
anything inserted into the scheduler has a bio. This is obviously true
for actual IO done via pt, but you can do data-less commands through
that too.
On Fri, May 12, 2023 at 03:31:39PM -0600, Jens Axboe wrote:
> That's fine for telling error handling, probing etc apart from "normal"
> IO, but it doesn't solve the problem at hand, which is assuming that
> anything inserted into the scheduler has a bio.

I think this is fine. Passthrough requests should never go to the
scheduler; that's not where they belong. Non-passthrough requests
always have a bio.
On Fri, May 12, 2023 at 04:12:42PM -0700, Christoph Hellwig wrote:
> I think this is fine. Passthrough requests should never go to the
> scheduler; that's not where they belong. Non-passthrough requests
> always have a bio.

.. except for the flushes generated by block/blk-flush.c. Those do not
enter the I/O scheduler either.

That being said, saying every request has to have a bio might be a good
thing some day. Totally independent of the fact that schedulers have no
business touching passthrough requests.
On Fri, May 12, 2023 at 10:10:55AM -0600, Jens Axboe wrote:
> Right, and that seems to be the real problem here. What is true for any
> pt request is that normal sorting by sector doesn't work, but it really
> would be nice to distinguish the two for other reasons. Eg "true" pt
> error handling should definitely just go to the dispatch list, and we
> don't need to apply any batching etc for them. For uring_cmd based
> passthrough, we most certainly should.
>
> I'm fine bypassing the scheduler insertion for that; my main objection
> to your original patch is that it's a bit messy, and I was hoping we
> could do a cleaner separation earlier. But I _think_ that'd get us into
> a full logic bypass for pt, which again would not be ideal for
> performance oriented pt.

Yeah, that is the reason why this patch batches pt requests and bypasses
them all together, so NVMe uring cmd pt IO performance won't be hurt.

So I guess you are fine with this patch now? If so, I will post V2 with
a revised commit log, including Christoph's related comments. Otherwise,
any suggestion wrt. an earlier bypass?

- bypassing earlier in case of a scheduler is still a bit tricky: why
  should pt need to care about the elevator at all? Meanwhile,
  performance is still affected for any IOs sharing the current plug.

Thanks,
Ming
On Fri, May 12, 2023 at 01:20:15PM -0700, Christoph Hellwig wrote:
> Because we really need to keep the passthrough code separate. The
> fact that a passthrough request can leak into common code in various
> places is really a bit of a problem. We have most of these nicely
> separated, with two exceptions:
>
>  - the plug list
>  - the requeue list
>
> The higher level and the more obvious we special case the passthrough
> requests there, the better for debuggability and maintainability.

Agreed, and there could be more things involved:

1) pt requests share the same tags, so scheduler tags are used for
   allocating pt requests

2) RQF_ELV is set for pt requests, and e->type->ops.requeue_request()/
   e->type->ops.completed_request() might still be called for pt
   requests.

Probably, long-term, one dedicated request queue
(BLK_MQ_F_TAG_QUEUE_SHARED/BLK_MQ_F_NO_SCHED) could be used for handling
passthrough only, if the device exposes a pt interface. Then I guess the
core code could be cleaned up a bit, and it would become easier to
improve both paths, but plug handling would still have to cover both.

Thanks,
Ming
On 5/12/23 9:03 AM, Ming Lei wrote:
> Passthrough (pt) requests shouldn't be queued to the scheduler; in
> particular, some schedulers (such as BFQ) assume that req->bio is
> always available and that the blk-cgroup can be retrieved via the bio.
>
> Sometimes a pt request can be part of error handling, so it is better
> to always queue it into hctx->dispatch directly.
>
> Fix this issue by queuing pt requests from the plug list to
> hctx->dispatch directly.
>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> Guang Wu, please test this patch and provide us the result.
>
>  block/blk-mq.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f6dad0886a2f..11efaefa26c3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	struct request *requeue_list = NULL;
>  	struct request **requeue_lastp = &requeue_list;
>  	unsigned int depth = 0;
> +	bool pt = false;
>  	LIST_HEAD(list);
>
>  	do {
> @@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  		if (!this_hctx) {
>  			this_hctx = rq->mq_hctx;
>  			this_ctx = rq->mq_ctx;
> +			pt = blk_rq_is_passthrough(rq);
> -		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
> +		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> +				pt != blk_rq_is_passthrough(rq)) {
>  			rq_list_add_tail(&requeue_lastp, rq);
>  			continue;
>  		}
> @@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	trace_block_unplug(this_hctx->queue, depth, !from_sched);
>
>  	percpu_ref_get(&this_hctx->queue->q_usage_counter);
> -	if (this_hctx->queue->elevator) {
> +	if (this_hctx->queue->elevator && !pt) {
>  		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
>  				&list, 0);
>  		blk_mq_run_hw_queue(this_hctx, from_sched);
> +	} else if (pt) {
> +		spin_lock(&this_hctx->lock);
> +		list_splice_tail_init(&list, &this_hctx->dispatch);
> +		spin_unlock(&this_hctx->lock);
> +		blk_mq_run_hw_queue(this_hctx, from_sched);
>  	} else {
>  		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
>  	}

I think this would look a lot better as:

	if (pt) {
		...
	} else if (this_hctx->queue->elevator) {
		...
	} else {
		...
	}

and add a comment above that first if condition on why this distinction
is being made.
Hi,

I don't reproduce the issue with your patch, but I hit another issue;
please have a look:

[ 136.002469] qla2xxx [0000:af:00.6]-800e:2: DEVICE RESET SUCCEEDED nexus:2:0:0 cmd=000000008d9455f0.
[ 136.011591] qla2xxx [0000:af:00.6]-8009:2: TARGET RESET ISSUED nexus=2:0 cmd=000000008d9455f0.
[ 136.206701] qla2xxx [0000:af:00.6]-800e:2: TARGET RESET SUCCEEDED nexus:2:0 cmd=000000008d9455f0.
[ 168.768518] qla2xxx [0000:af:00.7]-801c:4: Abort command issued nexus=4:0:0 -- 2003.
[ 168.776324] qla2xxx [0000:af:00.7]-8009:4: DEVICE RESET ISSUED nexus=4:0:0 cmd=000000005642c8ad.
[ 168.836726] qla2xxx [0000:af:00.7]-5032:4: ABT_IOCB: Invalid completion handle (21) -- timed-out.
[ 168.845758] qla2xxx [0000:af:00.7]-800e:4: DEVICE RESET SUCCEEDED nexus:4:0:0 cmd=000000005642c8ad.
[ 168.854881] qla2xxx [0000:af:00.7]-8018:4: ADAPTER RESET ISSUED nexus=4:0:0.
[ 168.861994] qla2xxx [0000:af:00.7]-00af:4: Performing ISP error recovery - ha=00000000f7f90a13.
[ 170.944477] qla2xxx [0000:af:00.7]-00b4:4: Done chip reset cleanup.
[ 170.960311] qla2xxx [0000:af:00.7]-00d2:4: Init Firmware **** FAILED ****.
[ 170.967241] qla2xxx [0000:af:00.7]-8016:4: qla82xx_restart_isp **** FAILED ****.
[ 170.974693] qla2xxx [0000:af:00.7]-b02f:4: HW State: NEED RESET
[ 170.980671] qla2xxx [0000:af:00.7]-009b:4: Device state is 0x4 = Need Reset.
[ 170.987784] qla2xxx [0000:af:00.7]-009d:4: Device state is 0x4 = Need Reset.
[ 171.968416] qla2xxx [0000:af:00.6]-6001:2: Adapter reset needed.
[ 171.974549] qla2xxx [0000:af:00.6]-b031:2: Device state is 0x4 = Need Reset.
[ 171.981680] qla2xxx [0000:af:00.6]-009b:2: Device state is 0x4 = Need Reset.
[ 171.988776] qla2xxx [0000:af:00.6]-009d:2: Device state is 0x4 = Need Reset.
[ 171.995872] qla2xxx [0000:af:00.6]-00af:2: Performing ISP error recovery - ha=0000000071956e30.
[ 174.080409] qla2xxx [0000:af:00.6]-00b4:2: Done chip reset cleanup.
[ 174.086822] qla2xxx [0000:af:00.6]-801c:2: Abort command issued nexus=2:0:0 -- 2002.
[ 174.094635] qla2xxx [0000:af:00.6]-8018:2: ADAPTER RESET ISSUED nexus=2:0:0.
[ 174.101728] qla2xxx [0000:af:00.6]-8017:2: ADAPTER RESET FAILED nexus=2:0:0.
[ 174.108910] scsi 2:0:0:0: rejecting I/O to offline device
[ 175.104377] qla2xxx [0000:af:00.7]-00b6:4: Device state is 0x4 = Need Reset.
[ 175.111480] qla2xxx [0000:af:00.7]-00b7:4: HW State: COLD/RE-INIT.
[ 177.184324] ------------[ cut here ]------------
[ 177.188972] WARNING: CPU: 7 PID: 813 at drivers/scsi/qla2xxx/qla_nx.c:507 qla82xx_need_reset_handler+0x1a4/0x470 [qla2xxx]
[ 177.200131] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit libnvdimm mgag200 x86_pkg_temp_thermal i2c_algo_bit intel_powerclamp ipmi_ssif coretemp drm_shmem_helper drm_kms_helper rapl syscopyarea mei_me sysfillrect ses sysimgblt intel_cstate enclosure acpi_ipmi ioatdma acpi_tad ipmi_si lpc_ich intel_pch_thermal intel_uncore acpi_power_meter mei ipmi_devintf pcspkr dca hpilo ipmi_msghandler drm fuse xfs sd_mod sg qla2xxx qla4xxx bnx2x nvme_fc nvme nvme_fabrics crct10dif_pclmul crc32_pclmul nvme_core libiscsi smartpqi ghash_clmulni_intel scsi_transport_iscsi nvme_common uas scsi_transport_fc t10_pi mdio iscsi_boot_sysfs libcrc32c usb_storage scsi_transport_sas crc32c_intel tg3 hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod
[ 177.281274] CPU: 7 PID: 813 Comm: qla2xxx_2_dpc Kdump: loaded Not tainted 6.4.0-rc1+ #1
[ 177.289328] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 03/09/2020
[ 177.297903] RIP: 0010:qla82xx_need_reset_handler+0x1a4/0x470 [qla2xxx]
[ 177.304513] Code: a6 cd e8 8f 49 4f ce eb 0a bf 64 00 00 00 e8 63 0d a6 cd be 28 c0 11 06 48 89 ef e8 16 a9 ff ff 83 f8 01 74 07 83 eb 01 75 df <0f> 0b be 44 21 20 08 48 89 ef e8 fd a8 ff ff be 38 21 20 08 48 89
[ 177.323398] RSP: 0018:ffffaf0b0508be48 EFLAGS: 00010246
[ 177.328656] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000130000
[ 177.335834] RDX: ffffaf0b0508be00 RSI: ffffaf0b0593c028 RDI: ffff9b2420022000
[ 177.343013] RBP: ffff9b2420022000 R08: 0000000000120000 R09: ffff9b272fff2878
[ 177.350193] R10: 0000000000000162 R11: 0000000003c23c49 R12: 00000000fffe3bdb
[ 177.357373] R13: 0000000000000000 R14: ffff9b274fca6840 R15: 0000000001000000
[ 177.364554] FS:  0000000000000000(0000) GS:ffff9b272ffc0000(0000) knlGS:0000000000000000
[ 177.372695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 177.378476] CR2: 000055b7803d43d8 CR3: 000000079c220001 CR4: 00000000007706e0
[ 177.385656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 177.392836] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 177.400015] PKRU: 55555554
[ 177.402738] Call Trace:
[ 177.405201]  <TASK>
[ 177.407316]  qla82xx_device_state_handler+0x285/0x4a0 [qla2xxx]
[ 177.413303]  ? __pfx_qla2x00_do_dpc+0x10/0x10 [qla2xxx]
[ 177.418588]  qla82xx_abort_isp+0x190/0x290 [qla2xxx]
[ 177.423613]  qla2x00_do_dpc+0x743/0xa60 [qla2xxx]
[ 177.428374]  ? __pfx_qla2x00_do_dpc+0x10/0x10 [qla2xxx]
[ 177.433655]  kthread+0xdf/0x110
[ 177.436821]  ? __pfx_kthread+0x10/0x10
[ 177.440595]  ret_from_fork+0x29/0x50
[ 177.444201]  </TASK>
[ 177.446402] ---[ end trace 0000000000000000 ]---
[ 177.451052] qla2xxx [0000:af:00.6]-00b6:2: Device state is 0x1 = Cold/Re-init.
[ 177.458319] qla2xxx [0000:af:00.6]-009d:2: Device state is 0x1 = Cold/Re-init.
[ 177.465588] qla2xxx [0000:af:00.6]-009e:2: HW State: INITIALIZING.
[ 0.020959] APIC: Disabling requested cpu. Processor 0/0x0 ignored.

Ming Lei <ming.lei@redhat.com> wrote on Fri, May 12, 2023 at 23:06:
>
> Passthrough (pt) requests shouldn't be queued to the scheduler; in
> particular, some schedulers (such as BFQ) assume that req->bio is
> always available and that the blk-cgroup can be retrieved via the bio.
> [...]
Hi Guangwu,

On Sun, May 14, 2023 at 08:11:34PM +0800, Guangwu Zhang wrote:
> Hi.
> Don't reproduce the issue with your patch, but hit other issue, please
> have a look.

Can you share your test case a bit so that we can see if the warning on
qla2xxx is related to this patch?

Thanks,
Ming
After reinstalling the upstream kernel, I also hit the reset issue, so the
issue isn't related to mingl's patch.

The reset issue has been opened before, please have a look.
https://lore.kernel.org/linux-scsi/CAGS2=YrmwbhMpNA2REnBybvm5dehGRyKBX5Sq5BqY=ex=mwaUg@mail.gmail.com/

Ming Lei <ming.lei@redhat.com> wrote on Fri, May 12, 2023 at 23:06:
>
> Passthrough(pt) request shouldn't be queued to scheduler, especially some
> schedulers(such as bfq) supposes that req->bio is always available and
> blk-cgroup can be retrieved via bio.
>
> Sometimes pt request could be part of error handling, so it is better to always
> queue it into hctx->dispatch directly.
>
> Fix this issue by queuing pt request from plug list to hctx->dispatch
> directly.
>
> Reported-by: Guangwu Zhang <guazhang@redhat.com>
> Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> Guang Wu, please test this patch and provide us the result.
>
>  block/blk-mq.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f6dad0886a2f..11efaefa26c3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	struct request *requeue_list = NULL;
>  	struct request **requeue_lastp = &requeue_list;
>  	unsigned int depth = 0;
> +	bool pt = false;
>  	LIST_HEAD(list);
>
>  	do {
> @@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  		if (!this_hctx) {
>  			this_hctx = rq->mq_hctx;
>  			this_ctx = rq->mq_ctx;
> -		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
> +			pt = blk_rq_is_passthrough(rq);
> +		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
> +			   pt != blk_rq_is_passthrough(rq)) {
>  			rq_list_add_tail(&requeue_lastp, rq);
>  			continue;
>  		}
> @@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
>  	trace_block_unplug(this_hctx->queue, depth, !from_sched);
>
>  	percpu_ref_get(&this_hctx->queue->q_usage_counter);
> -	if (this_hctx->queue->elevator) {
> +	if (this_hctx->queue->elevator && !pt) {
>  		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
>  				&list, 0);
>  		blk_mq_run_hw_queue(this_hctx, from_sched);
> +	} else if (pt) {
> +		spin_lock(&this_hctx->lock);
> +		list_splice_tail_init(&list, &this_hctx->dispatch);
> +		spin_unlock(&this_hctx->lock);
> +		blk_mq_run_hw_queue(this_hctx, from_sched);
>  	} else {
>  		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
>  	}
> --
> 2.38.1
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6dad0886a2f..11efaefa26c3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2711,6 +2711,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 	struct request *requeue_list = NULL;
 	struct request **requeue_lastp = &requeue_list;
 	unsigned int depth = 0;
+	bool pt = false;
 	LIST_HEAD(list);
 
 	do {
@@ -2719,7 +2720,9 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 		if (!this_hctx) {
 			this_hctx = rq->mq_hctx;
 			this_ctx = rq->mq_ctx;
-		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx) {
+			pt = blk_rq_is_passthrough(rq);
+		} else if (this_hctx != rq->mq_hctx || this_ctx != rq->mq_ctx ||
+			   pt != blk_rq_is_passthrough(rq)) {
 			rq_list_add_tail(&requeue_lastp, rq);
 			continue;
 		}
@@ -2731,10 +2734,15 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 	trace_block_unplug(this_hctx->queue, depth, !from_sched);
 
 	percpu_ref_get(&this_hctx->queue->q_usage_counter);
-	if (this_hctx->queue->elevator) {
+	if (this_hctx->queue->elevator && !pt) {
 		this_hctx->queue->elevator->type->ops.insert_requests(this_hctx,
 				&list, 0);
 		blk_mq_run_hw_queue(this_hctx, from_sched);
+	} else if (pt) {
+		spin_lock(&this_hctx->lock);
+		list_splice_tail_init(&list, &this_hctx->dispatch);
+		spin_unlock(&this_hctx->lock);
+		blk_mq_run_hw_queue(this_hctx, from_sched);
 	} else {
 		blk_mq_insert_requests(this_hctx, this_ctx, &list, from_sched);
 	}
Passthrough(pt) request shouldn't be queued to scheduler, especially some
schedulers(such as bfq) supposes that req->bio is always available and
blk-cgroup can be retrieved via bio.

Sometimes pt request could be part of error handling, so it is better to always
queue it into hctx->dispatch directly.

Fix this issue by queuing pt request from plug list to hctx->dispatch
directly.

Reported-by: Guangwu Zhang <guazhang@redhat.com>
Investigated-by: Yu Kuai <yukuai1@huaweicloud.com>
Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
Guang Wu, please test this patch and provide us the result.

 block/blk-mq.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)