diff mbox series

blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

Message ID CAHsXFKFOBSmZeT9j7U2S6J4Af2k2ODe_pkFVGJg73v0Wp7O38A@mail.gmail.com (mailing list archive)
State New, archived
Headers show
Series blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag | expand

Commit Message

Kashyap Desai Dec. 4, 2018, 10 a.m. UTC
Problem statement :
Whenever try to get outstanding request via scsi_host_find_tag,
block layer will return stale entries instead of actual outstanding
request. Kernel panic if stale entry is inaccessible or memory is reused.
Fix :
Undo request mapping in blk_mq_put_driver_tag  nce request is return.

More detail :
Whenever each SDEV entry is created, block layer allocate separate tags
and static requestis.Those requests are not valid after SDEV is deleted
from the system. On the fly, block layer maps static rqs to rqs as below
from blk_mq_get_driver_tag()

data.hctx->tags->rqs[rq->tag] = rq;

Above mapping is active in-used requests and it is the same mapping which
is referred in function scsi_host_find_tag().
After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
entries which will never be reset in block layer.

There would be a kernel panic, If request pointing to
“data.hctx->tags->rqs[rq->tag]” is part of “sdev” which is removed
and as part of that all the memory allocation of request associated with
that sdev might be reused or inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ffffff8000000010
IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops: 0000 [#1] SMP
Call Trace:
 [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
 [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: <stable@vger.kernel.org>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>


---
 block/blk-mq.h | 1 +
 1 file changed, 1 insertion(+)

Comments

Ming Lei Dec. 4, 2018, 11:35 a.m. UTC | #1
On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> Problem statement :
> Whenever try to get outstanding request via scsi_host_find_tag,
> block layer will return stale entries instead of actual outstanding
> request. Kernel panic if stale entry is inaccessible or memory is reused.
> Fix :
> Undo request mapping in blk_mq_put_driver_tag  nce request is return.
> 
> More detail :
> Whenever each SDEV entry is created, block layer allocate separate tags
> and static requestis.Those requests are not valid after SDEV is deleted
> from the system. On the fly, block layer maps static rqs to rqs as below
> from blk_mq_get_driver_tag()
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> Above mapping is active in-used requests and it is the same mapping which
> is referred in function scsi_host_find_tag().
> After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
> entries which will never be reset in block layer.

However, if rq & rq->tag is valid, data.hctx->tags->rqs[rq->tag] should
have pointed to one active request instead of the stale one, right?

Thanks,
Ming
Bart Van Assche Dec. 4, 2018, 2:48 p.m. UTC | #2
On 12/4/18 2:00 AM, Kashyap Desai wrote:
> Problem statement :
> Whenever try to get outstanding request via scsi_host_find_tag,
> block layer will return stale entries instead of actual outstanding
> request. Kernel panic if stale entry is inaccessible or memory is reused.
> Fix :
> Undo request mapping in blk_mq_put_driver_tag  nce request is return.
> 
> More detail :
> Whenever each SDEV entry is created, block layer allocate separate tags
> and static requestis.Those requests are not valid after SDEV is deleted
> from the system. On the fly, block layer maps static rqs to rqs as below
> from blk_mq_get_driver_tag()
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> Above mapping is active in-used requests and it is the same mapping which
> is referred in function scsi_host_find_tag().
> After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
> entries which will never be reset in block layer.
> 
> There would be a kernel panic, If request pointing to
> “data.hctx->tags->rqs[rq->tag]” is part of “sdev” which is removed
> and as part of that all the memory allocation of request associated with
> that sdev might be reused or inaccessible to the driver.
> Kernel panic snippet -
> 
> BUG: unable to handle kernel paging request at ffffff8000000010
> IP: [<ffffffffc048306c>] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
> PGD aa4414067 PUD 0
> Oops: 0000 [#1] SMP
> Call Trace:
>   [<ffffffffc046f72f>] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
>   [<ffffffffc047e125>] scsih_shutdown+0x55/0x100 [mpt3sas]
> 
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
> 
> 
> ---
>   block/blk-mq.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9497b47..57432be 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -175,6 +175,7 @@ static inline bool
> blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>                          struct request *rq)
>   {
> +    hctx->tags->rqs[rq->tag] = NULL;
>       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
>       rq->tag = -1;

No SCSI driver should call scsi_host_find_tag() after a request has 
finished. The above patch introduces yet another race and hence can't be 
a proper fix.

Bart.
Kashyap Desai Dec. 4, 2018, 4:47 p.m. UTC | #3
+ Linux-scsi

> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 9497b47..57432be 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -175,6 +175,7 @@ static inline bool
> > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> >                          struct request *rq)
> >   {
> > +    hctx->tags->rqs[rq->tag] = NULL;
> >       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> >       rq->tag = -1;
>
> No SCSI driver should call scsi_host_find_tag() after a request has
> finished. The above patch introduces yet another race and hence can't be
> a proper fix.

Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id to
find out pending IO in firmware.
One of the use case is -  HBA firmware recovery.  In case of firmware
recovery, driver may require to traverse the list and return back pending
scsi command to SML for retry.
I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
mpt3sas are using API scsi_host_find_tag for the same purpose.

Without this patch, we hit very basic kernel panic due to page fault.  This
is not an issue in non-mq code path. Non-mq path use
blk_map_queue_find_tag() and that particular API does not provide stale
requests.

Kashyap

>
> Bart.
Kashyap Desai Dec. 4, 2018, 4:51 p.m. UTC | #4
> On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> > Problem statement :
> > Whenever try to get outstanding request via scsi_host_find_tag,
> > block layer will return stale entries instead of actual outstanding
> > request. Kernel panic if stale entry is inaccessible or memory is
> > reused.
> > Fix :
> > Undo request mapping in blk_mq_put_driver_tag  nce request is return.
> >
> > More detail :
> > Whenever each SDEV entry is created, block layer allocate separate tags
> > and static requestis.Those requests are not valid after SDEV is deleted
> > from the system. On the fly, block layer maps static rqs to rqs as below
> > from blk_mq_get_driver_tag()
> >
> > data.hctx->tags->rqs[rq->tag] = rq;
> >
> > Above mapping is active in-used requests and it is the same mapping
> > which
> > is referred in function scsi_host_find_tag().
> > After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
> > entries which will never be reset in block layer.
>
> However, if rq & rq->tag is valid, data.hctx->tags->rqs[rq->tag] should
> have pointed to one active request instead of the stale one, right?

Yes that is my understanding and learning from this issue.
Side note -

At driver load whenever driver does scsi_add_host_with_dma(), it follows
below code path in block layer.

scsi_mq_setup_tags
  ->blk_mq_alloc_tag_set
          -> blk_mq_alloc_rq_maps
                     -> __blk_mq_alloc_rq_maps

SML create two set of request pool. One is per HBA and other is per SDEV. I
was confused why SML creates request pool per HBA.

>
> Thanks,
> Ming
Bart Van Assche Dec. 4, 2018, 5:14 p.m. UTC | #5
On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> + Linux-scsi
> 
> > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > index 9497b47..57432be 100644
> > > --- a/block/blk-mq.h
> > > +++ b/block/blk-mq.h
> > > @@ -175,6 +175,7 @@ static inline bool
> > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> > >                          struct request *rq)
> > >   {
> > > +    hctx->tags->rqs[rq->tag] = NULL;
> > >       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > >       rq->tag = -1;
> > 
> > No SCSI driver should call scsi_host_find_tag() after a request has
> > finished. The above patch introduces yet another race and hence can't be
> > a proper fix.
> 
> Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id to
> find out pending IO in firmware.
> One of the use case is -  HBA firmware recovery.  In case of firmware
> recovery, driver may require to traverse the list and return back pending
> scsi command to SML for retry.
> I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> mpt3sas are using API scsi_host_find_tag for the same purpose.
> 
> Without this patch, we hit very basic kernel panic due to page fault.  This
> is not an issue in non-mq code path. Non-mq path use
> blk_map_queue_find_tag() and that particular API does not provide stale
> requests.

As I wrote before, your patch doesn't fix the race you described but only
makes the race window smaller. If you want an example of how to use
scsi_host_find_tag() properly, have a look at the SRP initiator driver
(drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag() without
triggering any NULL pointer dereferences. The approach used in that driver
also works when having to support HBA firmware recovery.

Bart.
Kashyap Desai Dec. 4, 2018, 6:18 p.m. UTC | #6
> -----Original Message-----
> From: Bart Van Assche [mailto:bvanassche@acm.org]
> Sent: Tuesday, December 4, 2018 10:45 PM
> To: Kashyap Desai; linux-block; Jens Axboe; Ming Lei; linux-scsi
> Cc: Suganath Prabu Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in
> blk_mq_put_driver_tag
>
> On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> > + Linux-scsi
> >
> > > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > > index 9497b47..57432be 100644
> > > > --- a/block/blk-mq.h
> > > > +++ b/block/blk-mq.h
> > > > @@ -175,6 +175,7 @@ static inline bool
> > > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx
> *hctx,
> > > >                          struct request *rq)
> > > >   {
> > > > +    hctx->tags->rqs[rq->tag] = NULL;
> > > >       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > > >       rq->tag = -1;
> > >
> > > No SCSI driver should call scsi_host_find_tag() after a request has
> > > finished. The above patch introduces yet another race and hence can't
> > > be
> > > a proper fix.
> >
> > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id
> > to
> > find out pending IO in firmware.
> > One of the use case is -  HBA firmware recovery.  In case of firmware
> > recovery, driver may require to traverse the list and return back
> > pending
> > scsi command to SML for retry.
> > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> > mpt3sas are using API scsi_host_find_tag for the same purpose.
> >
> > Without this patch, we hit very basic kernel panic due to page fault.
> > This
> > is not an issue in non-mq code path. Non-mq path use
> > blk_map_queue_find_tag() and that particular API does not provide stale
> > requests.
>
> As I wrote before, your patch doesn't fix the race you described but only
> makes the race window smaller.
Hi Bart,

Let me explain the issue. It is not a race, but very straight issue.  Let's
say we have one scsi_device /dev/sda and total IO submitted + completed are
some number 100.
All the 100 IO is *completed*.   Now, As part of Firmware recovery, driver
tries to find our outstanding IOs using scsi_host_find_tag().
Block layer will return all the 100 commands to the driver but really those
100 commands are not outstanding. This patch will return *actual*
outstanding commands.
If scsi_device /dev/sda is not removed in OS, driver accessing scmd of those
100 commands are safe memory access.

Now consider a case where scsi_device /dev/sda is removed and driver
performs firmware recovery. This time driver will crash while accessing scmd
(randomly based on memory reused.).

Along with this patch, low level driver should make sure that all request
queue at block layer is quiesce.

If you want an example of how to use
> scsi_host_find_tag() properly, have a look at the SRP initiator driver
> (drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag()
> without
> triggering any NULL pointer dereferences.

I am not able to find right context from srp, but I check the srp code and
looks like that driver is getting scmd using scsi_host_find_tag() for live
command.

> The approach used in that driver
> also works when having to support HBA firmware recovery.
>
> Bart.
Bart Van Assche Dec. 4, 2018, 7:35 p.m. UTC | #7
On Tue, 2018-12-04 at 23:48 +0530, Kashyap Desai wrote:
> Let me explain the issue. It is not a race, but very straight issue.

Please clarify what makes you think that iterating over all requests does not
race with request completion. Is request submission perhaps blocked during the
firmware recovery process? Does the firmware recovery process wait for
completion interrupts that are in progress to finish?

> Let's
> say we have one scsi_device /dev/sda and total IO submitted + completed are
> some number 100.
> All the 100 IO is *completed*.   Now, As part of Firmware recovery, driver
> tries to find our outstanding IOs using scsi_host_find_tag().
> Block layer will return all the 100 commands to the driver but really those
> 100 commands are not outstanding. This patch will return *actual*
> outstanding commands.

Would iterating over started requests be a good alternative to the patch
that you had posted? The header above blk_mq_tagset_busy_iter() is as follows
(this function is used by e.g. the skd driver for firmware recovery):

/**
 * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
 * @tagset:	Tag set to iterate over.
 * @fn:		Pointer to the function that will be called for each started
 *		request. @fn will be called as follows: @fn(rq, @priv,
 *		reserved) where rq is a pointer to a request. 'reserved'
 *		indicates whether or not @rq is a reserved request.
 * @priv:	Will be passed as second argument to @fn.
 */

> I am not able to find right context from srp, but I check the srp code and
> looks like that driver is getting scmd using scsi_host_find_tag() for live
> command.

Sorry, my e-mail was a bit too short to be comprehensible. When running
sg_reset -d /dev/sd... in a loop it is possible that the SCSI reset handler
terminates a request while the SRP driver is handling a request completion
for one of the terminated requests. That is why both srp_process_srp() and
the reset handler call srp_claim_req() to make sure that only one of these
two contexts completes a request.

Bart.
Ming Lei Dec. 6, 2018, 12:33 a.m. UTC | #8
On Tue, Dec 04, 2018 at 11:48:48PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Bart Van Assche [mailto:bvanassche@acm.org]
> > Sent: Tuesday, December 4, 2018 10:45 PM
> > To: Kashyap Desai; linux-block; Jens Axboe; Ming Lei; linux-scsi
> > Cc: Suganath Prabu Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> > Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in
> > blk_mq_put_driver_tag
> >
> > On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> > > + Linux-scsi
> > >
> > > > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > > > index 9497b47..57432be 100644
> > > > > --- a/block/blk-mq.h
> > > > > +++ b/block/blk-mq.h
> > > > > @@ -175,6 +175,7 @@ static inline bool
> > > > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > > > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx
> > *hctx,
> > > > >                          struct request *rq)
> > > > >   {
> > > > > +    hctx->tags->rqs[rq->tag] = NULL;
> > > > >       blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > > > >       rq->tag = -1;
> > > >
> > > > No SCSI driver should call scsi_host_find_tag() after a request has
> > > > finished. The above patch introduces yet another race and hence can't
> > > > be
> > > > a proper fix.
> > >
> > > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id
> > > to
> > > find out pending IO in firmware.
> > > One of the use case is -  HBA firmware recovery.  In case of firmware
> > > recovery, driver may require to traverse the list and return back
> > > pending
> > > scsi command to SML for retry.
> > > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> > > mpt3sas are using API scsi_host_find_tag for the same purpose.
> > >
> > > Without this patch, we hit very basic kernel panic due to page fault.
> > > This
> > > is not an issue in non-mq code path. Non-mq path use
> > > blk_map_queue_find_tag() and that particular API does not provide stale
> > > requests.
> >
> > As I wrote before, your patch doesn't fix the race you described but only
> > makes the race window smaller.
> Hi Bart,
> 
> Let me explain the issue. It is not a race, but very straight issue.  Let's
> say we have one scsi_device /dev/sda and total IO submitted + completed are
> some number 100.
> All the 100 IO is *completed*.   Now, As part of Firmware recovery, driver
> tries to find our outstanding IOs using scsi_host_find_tag().

If the 'tag' passed to scsi_host_find_tag() is valid, I think there
shouldn't have such issue.

If you want to find outstanding IOs, maybe you can try blk_mq_queue_tag_busy_iter()
or blk_mq_tagset_busy_iter(), because you may not know if the passed 'tag' to
scsi_host_find_tag() is valid or not.


Thanks,
Ming
Kashyap Desai Dec. 6, 2018, 5:45 a.m. UTC | #9
>
> If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> shouldn't have such issue.
>
> If you want to find outstanding IOs, maybe you can try
> blk_mq_queue_tag_busy_iter()
> or blk_mq_tagset_busy_iter(), because you may not know if the passed
'tag'
> to
> scsi_host_find_tag() is valid or not.

We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter and
it returns/callback for valid requests (no stale entries are returned).
Expected.
Above two APIs are only for blk-mq.  What about non-mq case ? Driver
should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
blk-mq case ?
I don't see that will be good interface. Also, blk_mq_tagset_busy_iter API
does not provide control if driver wants to quit in-between or do some
retry logic etc.

Why can't we add single API which provides the correct output.

scsi_host_find_tag () API works well in non-mq case because,
blk_queue_end_tag() set bqt->tag_index[tag] = NULL;.
We are missing similar reset upon request completion in blk-mq case. This
patch has similar approach as non-mq and there is no race condition I can
foresee.

BTW - My original patch is half fix. We also need below changes -

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e..d8f53ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
        const int sched_tag = rq->internal_tag;

        blk_pm_mark_last_busy(rq);
-       if (rq->tag != -1)
+       if (rq->tag != -1) {
+               hctx->tags->rqs[rq->tag] = NULL;
                blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+       }
        if (sched_tag != -1)
                blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
        blk_mq_sched_restart(hctx);


>
>
> Thanks,
> Ming
Jens Axboe Dec. 6, 2018, 3:22 p.m. UTC | #10
On 12/5/18 10:45 PM, Kashyap Desai wrote:
>>
>> If the 'tag' passed to scsi_host_find_tag() is valid, I think there
>> shouldn't have such issue.
>>
>> If you want to find outstanding IOs, maybe you can try
>> blk_mq_queue_tag_busy_iter()
>> or blk_mq_tagset_busy_iter(), because you may not know if the passed
> 'tag'
>> to
>> scsi_host_find_tag() is valid or not.
> 
> We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter and
> it returns/callback for valid requests (no stale entries are returned).
> Expected.
> Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
> blk-mq case ?
> I don't see that will be good interface. Also, blk_mq_tagset_busy_iter API
> does not provide control if driver wants to quit in-between or do some
> retry logic etc.
> 
> Why can't we add single API which provides the correct output.

From 4.21 and forward, there will only be blk/scsi-mq. This is exactly
the problem with having to maintain two stacks, it's a huge pain.
Kashyap Desai Dec. 7, 2018, 7:16 a.m. UTC | #11
> On 12/5/18 10:45 PM, Kashyap Desai wrote:
> >>
> >> If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> >> shouldn't have such issue.
> >>
> >> If you want to find outstanding IOs, maybe you can try
> >> blk_mq_queue_tag_busy_iter()
> >> or blk_mq_tagset_busy_iter(), because you may not know if the passed
> > 'tag'
> >> to
> >> scsi_host_find_tag() is valid or not.
> >
> > We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter
> > and
> > it returns/callback for valid requests (no stale entries are returned).
> > Expected.
> > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
> > blk-mq case ?
> > I don't see that will be good interface. Also, blk_mq_tagset_busy_iter
> > API
> > does not provide control if driver wants to quit in-between or do some
> > retry logic etc.
> >
> > Why can't we add single API which provides the correct output.
>
> From 4.21 and forward, there will only be blk/scsi-mq. This is exactly
> the problem with having to maintain two stacks, it's a huge pain.

Hi Jens, Fix for this issue also required to be back ported to stable
kernels (which still use non-mq + mq stack).
We have multiple choices to fix this.

1. Use " blk_mq_tagset_busy_iter" in *all* the affected drivers. This API
has certain limitation as explained and also fix only blk-mq part. Using
this API may need more code in low level drivers to handle non-mq and mq
separately.
2. Driver can use internal memory for scsiio_track (driver private) and
track all the outstanding IO within a driver. This is mostly a scsi mid
layer interface. All the affected driver require a changes.
3. Fix blk-mq code around  blk_mq_put_tag and driver can still use "
scsi_host_find_tag". No driver changes are required.
This is smooth to back port stable kernel and Linux Distribution which
normally pick critical fixes from stable can pick the fix. This is better
fix and did not change any design.
In fact, I mimic the same flow as non-mq code is doing.

I don't see any design or functional issue with #3 (PATCH provided in this
thread.) What is your feedback for this patch ?

Kashyap

>
> --
> Jens Axboe
Ming Lei Dec. 7, 2018, 10:20 a.m. UTC | #12
On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> >
> > If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> > shouldn't have such issue.
> >
> > If you want to find outstanding IOs, maybe you can try
> > blk_mq_queue_tag_busy_iter()
> > or blk_mq_tagset_busy_iter(), because you may not know if the passed
> 'tag'
> > to
> > scsi_host_find_tag() is valid or not.
> 
> We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter and
> it returns/callback for valid requests (no stale entries are returned).
> Expected.
> Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
> blk-mq case ?

But your patch is only for blk-mq, is there same issue on non-mq case?

Thanks,
Ming
Kashyap Desai Dec. 7, 2018, 10:34 a.m. UTC | #13
> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Friday, December 7, 2018 3:50 PM
> To: Kashyap Desai
> Cc: Bart Van Assche; linux-block; Jens Axboe; linux-scsi; Suganath Prabu
> Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: +AFs-PATCH+AF0- blk-mq: Set request mapping to NULL in
> blk+AF8-mq+AF8-put+AF8-driver+AF8-tag
>
> On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > >
> > > If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> > > shouldn't have such issue.
> > >
> > > If you want to find outstanding IOs, maybe you can try
> > > blk_mq_queue_tag_busy_iter()
> > > or blk_mq_tagset_busy_iter(), because you may not know if the passed
> > 'tag'
> > > to
> > > scsi_host_find_tag() is valid or not.
> >
> > We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter
and
> > it returns/callback for valid requests (no stale entries are
returned).
> > Expected.
> > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter
for
> > blk-mq case ?
>
> But your patch is only for blk-mq, is there same issue on non-mq case?

Problematic part from below function is code path which goes from "
shost_use_blk_mq(shost))".
Non-mq path works fine because every IO completion set bqt->tag_index[tag]
= NULL from blk_queue_end_tag().

I did similar things for mq path in this patch.

static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host
*shost,
                int tag)
{
        struct request *req = NULL;

        if (tag == SCSI_NO_TAG)
                return NULL;

        if (shost_use_blk_mq(shost)) {
                u16 hwq = blk_mq_unique_tag_to_hwq(tag);

                if (hwq < shost->tag_set.nr_hw_queues) {
                        req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
                                blk_mq_unique_tag_to_tag(tag));
                }
        } else {
                req = blk_map_queue_find_tag(shost->bqt, tag);
        }

Kashyap
>
> Thanks,
> Ming
Kashyap Desai Dec. 11, 2018, 3:06 p.m. UTC | #14
> > On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > > >
> > > > If the 'tag' passed to scsi_host_find_tag() is valid, I think
there
> > > > shouldn't have such issue.
> > > >
> > > > If you want to find outstanding IOs, maybe you can try
> > > > blk_mq_queue_tag_busy_iter()
> > > > or blk_mq_tagset_busy_iter(), because you may not know if the
passed
> > > 'tag'
> > > > to
> > > > scsi_host_find_tag() is valid or not.
> > >
> > > We tried quick change in mpt3sas driver using
blk_mq_tagset_busy_iter
> and
> > > it returns/callback for valid requests (no stale entries are
returned).
> > > Expected.
> > > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter
> for
> > > blk-mq case ?
> >
> > But your patch is only for blk-mq, is there same issue on non-mq case?
>
> Problematic part from below function is code path which goes from "
> shost_use_blk_mq(shost))".
> Non-mq path works fine because every IO completion set
bqt->tag_index[tag]
> = NULL from blk_queue_end_tag().
>
> I did similar things for mq path in this patch.
>
> static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host
*shost,
>                 int tag)
> {
>         struct request *req = NULL;
>
>         if (tag == SCSI_NO_TAG)
>                 return NULL;
>
>         if (shost_use_blk_mq(shost)) {
>                 u16 hwq = blk_mq_unique_tag_to_hwq(tag);
>
>                 if (hwq < shost->tag_set.nr_hw_queues) {
>                         req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
>                                 blk_mq_unique_tag_to_tag(tag));
>                 }
>         } else {
>                 req = blk_map_queue_find_tag(shost->bqt, tag);
>         }


Hi Jens,

Any conclusion/feedback on this topic/patch ?  As discussed, This is a
safe change and  good to have if no design issue.

Kashyap
Kashyap Desai Dec. 14, 2018, 6:22 a.m. UTC | #15
> > > On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > > > >
> > > > > If the 'tag' passed to scsi_host_find_tag() is valid, I think
there
> > > > > shouldn't have such issue.
> > > > >
> > > > > If you want to find outstanding IOs, maybe you can try
> > > > > blk_mq_queue_tag_busy_iter()
> > > > > or blk_mq_tagset_busy_iter(), because you may not know if the
passed
> > > > 'tag'
> > > > > to
> > > > > scsi_host_find_tag() is valid or not.
> > > >
> > > > We tried quick change in mpt3sas driver using
blk_mq_tagset_busy_iter
> > and
> > > > it returns/callback for valid requests (no stale entries are
returned).
> > > > Expected.
> > > > Above two APIs are only for blk-mq.  What about non-mq case ?
Driver
> > > > should use scsi_host_find_tag for non-mq and
blk_mq_tagset_busy_iter
> > for
> > > > blk-mq case ?
> > >
> > > But your patch is only for blk-mq, is there same issue on non-mq
case?
> >
> > Problematic part from below function is code path which goes from "
> > shost_use_blk_mq(shost))".
> > Non-mq path works fine because every IO completion set bqt-
> >tag_index[tag]
> > = NULL from blk_queue_end_tag().
> >
> > I did similar things for mq path in this patch.
> >
> > static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host
*shost,
> >                 int tag)
> > {
> >         struct request *req = NULL;
> >
> >         if (tag == SCSI_NO_TAG)
> >                 return NULL;
> >
> >         if (shost_use_blk_mq(shost)) {
> >                 u16 hwq = blk_mq_unique_tag_to_hwq(tag);
> >
> >                 if (hwq < shost->tag_set.nr_hw_queues) {
> >                         req =
blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
> >                                 blk_mq_unique_tag_to_tag(tag));
> >                 }
> >         } else {
> >                 req = blk_map_queue_find_tag(shost->bqt, tag);
> >         }
>
>
> Hi Jens,
>
> Any conclusion/feedback on this topic/patch ?  As discussed, This is a
safe
> change and  good to have if no design issue.

Hi, Since we have not concluded on fix, we can resume discussion on next
revision of the patch. I will be posting V2 patch with complete fix.

Kashyap

>
> Kashyap
diff mbox series

Patch

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@  static inline bool
blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
                        struct request *rq)
 {
+    hctx->tags->rqs[rq->tag] = NULL;
     blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
     rq->tag = -1;