[0/8] blk-mq: use static_rqs to iterate busy tags

Message ID 1552640264-26101-1-git-send-email-jianchao.w.wang@oracle.com (mailing list archive)

Message

jianchao.wang March 15, 2019, 8:57 a.m. UTC
Hi Jens

As we know, there is a risk of accessing stale requests when iterating
in-flight requests with tags->rqs[], and this has been discussed in the
following threads:
[1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
[2] https://marc.info/?l=linux-block&m=154526189023236&w=2

A typical scenario could be:
blk_mq_get_request         blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                             -> bt_for_each
                               -> bt_iter
                                 -> rq = tags->rqs[]
                                 -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;

The root cause is that there is a window between setting the bit in the tag
sbitmap and setting tags->rqs[].
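
To make the window concrete, here is a stripped-down sketch of the two sides
(pseudo-code condensed from the call trace above, not the literal kernel code):

/* allocation side, roughly blk_mq_get_request() -> blk_mq_rq_ctx_init() */
tag = blk_mq_get_tag(data);             /* the tag bit in the sbitmap is set here */
rq = tags->static_rqs[tag];
/*
 * <-- window: an iterator can already see the tag bit, but tags->rqs[tag]
 *     still points at whatever request used this tag last time
 */
data->hctx->tags->rqs[rq->tag] = rq;    /* the new request is published only now */

/* iteration side, roughly bt_iter() */
rq = tags->rqs[bitnr];                  /* may load a stale pointer */
if (rq && rq->q == hctx->queue)         /* may dereference a freed request */
        iter_data->fn(hctx, rq, iter_data->data, reserved);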

This series fixes the issue by iterating requests through tags->static_rqs[]
instead of tags->rqs[], which is changed dynamically. Moreover, we try to grab
a non-zero q_usage_counter before accessing hctxs and tags, which avoids races
with updating nr_hw_queues, switching the io scheduler and even queue cleanup,
all of which are done under a frozen and drained queue.
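
For the q_usage_counter part, the intended shape is roughly the following
sketch (illustration only, the actual patch differs in detail):

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
                                void *priv)
{
        struct blk_mq_hw_ctx *hctx;
        int i;

        /*
         * A non-zero q_usage_counter means the queue cannot be frozen and
         * drained under us, so nr_hw_queues updates, elevator switches and
         * queue cleanup (which all freeze the queue first) are excluded.
         */
        if (!percpu_ref_tryget(&q->q_usage_counter))
                return;

        queue_for_each_hw_ctx(q, hctx, i) {
                /* walk the allocated entries of hctx->tags->static_rqs[] */
        }

        blk_queue_exit(q);
}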

The 1st patch gets rid of the useless synchronize_rcu in __blk_mq_update_nr_hw_queues.

The 2nd patch modifies blk_mq_queue_tag_busy_iter to use tags->static_rqs[]
instead of tags->rqs[] when iterating the busy tags.

The 3rd ~ 7th patches change users of blk_mq_tagset_busy_iter over to
blk_mq_queue_tag_busy_iter, which is safer.

The 8th patch gets rid of blk_mq_tagset_busy_iter.

Jianchao Wang (8):
	blk-mq: get rid of the synchronize_rcu in
	blk-mq: change the method of iterating busy tags of a
	blk-mq: use blk_mq_queue_tag_busy_iter in debugfs
	mtip32xx: use blk_mq_queue_tag_busy_iter
	nbd: use blk_mq_queue_tag_busy_iter
	skd: use blk_mq_queue_tag_busy_iter
	nvme: use blk_mq_queue_tag_busy_iter
	blk-mq: remove blk_mq_tagset_busy_iter

diff stat
 block/blk-mq-debugfs.c            |   4 +-
 block/blk-mq-tag.c                | 173 +++++++++++++++++++++++++-------------------------------------------------------------
 block/blk-mq-tag.h                |   2 -
 block/blk-mq.c                    |  35 ++++++------------
 drivers/block/mtip32xx/mtip32xx.c |   8 ++--
 drivers/block/nbd.c               |   2 +-
 drivers/block/skd_main.c          |   4 +-
 drivers/nvme/host/core.c          |  12 ++++++
 drivers/nvme/host/fc.c            |  12 +++---
 drivers/nvme/host/nvme.h          |   2 +
 drivers/nvme/host/pci.c           |   5 ++-
 drivers/nvme/host/rdma.c          |   6 +--
 drivers/nvme/host/tcp.c           |   5 ++-
 drivers/nvme/target/loop.c        |   6 +--
 include/linux/blk-mq.h            |   7 ++--
 15 files changed, 105 insertions(+), 178 deletions(-)

Thanks
Jianchao

Comments

Christoph Hellwig March 15, 2019, 9:20 a.m. UTC | #1
On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> Hi Jens
> 
> As we know, there is a risk of accessing stale requests when iterating
> in-flight requests with tags->rqs[], and this has been discussed in the
> following threads:
> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2

I'd rather take one step back and figure out why we are iterating
the busy requests.  There really shouldn't be any reason why a driver
is even doing that (vs some error handling helpers in the core
block code that can properly synchronize).
jianchao.wang March 15, 2019, 9:44 a.m. UTC | #2
On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
>> Hi Jens
>>
>> As we know, there is a risk of accessing stale requests when iterating
>> in-flight requests with tags->rqs[], and this has been discussed in the
>> following threads:
>> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> 
> I'd rather take one step back and figure out why we are iterating
> the busy requests.  There really shouldn't be any reason why a driver
> is even doing that (vs some error handling helpers in the core
> block code that can properly synchronize).
> 

A typical scenario is blk_mq_in_flight:
                          
blk_mq_get_request         blk_mq_in_flight
  -> blk_mq_get_tag        -> blk_mq_queue_tag_busy_iter
                             -> bt_for_each
                               -> bt_iter
                                 -> rq = tags->rqs[]
                                 -> rq->q       //---> get a stale request
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq

This stale request may be something that has been freed because the io scheduler
was detached or because a queue using a shared tagset is gone.

blk_mq_timeout_work also uses it to pick up expired requests, and the driver
would also use it to requeue the in-flight requests when the device is dead.

Compared with adding more synchronization, using static_rqs[] directly may be simpler :)

Thanks
Jianchao
Josef Bacik March 15, 2019, 1:30 p.m. UTC | #3
On Fri, Mar 15, 2019 at 10:20:20AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > Hi Jens
> > 
> > As we know, there is a risk of accessing stale requests when iterating
> > in-flight requests with tags->rqs[], and this has been discussed in the
> > following threads:
> > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> 
> I'd rather take one step back and figure out why we are iterating
> the busy requests.  There really shouldn't be any reason why a driver
> is even doing that (vs some error handling helpers in the core
> block code that can properly synchronize).

I use it in NBD to error out any pending requests for forced disconnects.  I'd
be happy to use some other blessed interface, I'm not married to this one.
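
For reference, the nbd pattern is roughly the following (paraphrased from
drivers/block/nbd.c from memory, so details may be off):

static bool nbd_clear_req(struct request *req, void *data, bool reserved)
{
        struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);

        /* fail the command and complete it back to the block layer */
        cmd->status = BLK_STS_IOERR;
        blk_mq_complete_request(req);
        return true;
}

static void nbd_clear_que(struct nbd_device *nbd)
{
        blk_mq_quiesce_queue(nbd->disk->queue);
        blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
        blk_mq_unquiesce_queue(nbd->disk->queue);
}
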
Thanks,

Josef
Bart Van Assche March 15, 2019, 4:19 p.m. UTC | #4
On Fri, 2019-03-15 at 17:44 +0800, jianchao.wang wrote:
> On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> > On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > > Hi Jens
> > > 
> > > As we know, there is a risk of accessing stale requests when iterating
> > > in-flight requests with tags->rqs[], and this has been discussed in the
> > > following threads:
> > > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> > 
> > I'd rather take one step back and figure out why we are iterating
> > the busy requests.  There really shouldn't be any reason why a driver
> > is even doing that (vs some error handling helpers in the core
> > block code that can properly synchronize).
> > 
> 
> A typical scenario is blk_mq_in_flight:
>                           
> blk_mq_get_request         blk_mq_in_flight
>   -> blk_mq_get_tag        -> blk_mq_queue_tag_busy_iter
>                              -> bt_for_each
>                                -> bt_iter
>                                  -> rq = tags->rqs[]
>                                  -> rq->q       //---> get a stale request
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq
> 
> This stale request may be something that has been freed because the io scheduler
> was detached or because a queue using a shared tagset is gone.
> 
> blk_mq_timeout_work also uses it to pick up expired requests, and the driver
> would also use it to requeue the in-flight requests when the device is dead.
> 
> Compared with adding more synchronization, using static_rqs[] directly may be simpler :)

Hi Jianchao,

Although I appreciate your work: I agree with Christoph that we should avoid races
like this rather than modifying the block layer to make sure that such races are
handled safely.

Thanks,

Bart.
jianchao.wang March 18, 2019, 2:47 a.m. UTC | #5
Hi Bart

On 3/16/19 12:19 AM, Bart Van Assche wrote:
>> This stale request may be something that has been freed because the io scheduler
>> was detached or because a queue using a shared tagset is gone.
>>
>> blk_mq_timeout_work also uses it to pick up expired requests, and the driver
>> would also use it to requeue the in-flight requests when the device is dead.
>>
>> Compared with adding more synchronization, using static_rqs[] directly may be simpler :)
> Hi Jianchao,
> 
> Although I appreciate your work: I agree with Christoph that we should avoid races
> like this rather than modifying the block layer to make sure that such races are
> handled safely.

The root cause here is that there is a window between setting the bit in the tag
sbitmap and setting tags->rqs[], and we don't clear tags->rqs[] in the tag free
path. So when we iterate the busy tags, we could see stale requests in
tags->rqs[], and these stale requests may have been freed.

It looks difficult to close the window above, so we have previously tried to
clear tags->rqs[] in two ways:

1. clear the tags->rqs[] in the request free path
Jens doesn't like it
https://marc.info/?l=linux-block&m=154515671524877&w=2
"
It's an extra store, and it's a store to an area that's then now shared
between issue and completion. Those are never a good idea. Besides, it's
the kind of issue you solve in the SLOW path, not in the fast path. Since
that's doable, it would be silly to do it for every IO.
"

2. clear the associated slots in tags->rqs[] when blk_mq_free_rqs
https://marc.info/?l=linux-block&m=154534605914798&w=2

+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();

But the busy_iter_fn could sleep:
blk_mq_check_expired
  -> blk_mq_rq_timed_out
    -> q->mq_ops->timeout
       nvme_timeout
         -> nvme_dev_disable
            -> mutex_lock dev->shutdown_lock

Since it is not so easy to fix this on the tags->rqs[] side, why not try to use tags->static_rqs[]?
Then we would never need to care about stale requests any more.
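
An untested sketch of what the bt_iter() side could look like with
static_rqs[], just to show the idea:

static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
        struct bt_iter_data *iter_data = data;
        struct blk_mq_hw_ctx *hctx = iter_data->hctx;
        struct blk_mq_tags *tags = hctx->tags;
        struct request *rq;

        if (!iter_data->reserved)
                bitnr += tags->nr_reserved_tags;

        /*
         * static_rqs[] entries are only set up and torn down while the
         * queue is frozen, so with q_usage_counter held this can never be
         * a dangling pointer, no matter when the tag bit was set.
         */
        rq = tags->static_rqs[bitnr];
        if (rq && blk_mq_request_started(rq))
                return iter_data->fn(hctx, rq, iter_data->data, iter_data->reserved);

        return true;
}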

Thanks
Jianchao
Bart Van Assche March 18, 2019, 5:28 p.m. UTC | #6
On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2

Hi Jianchao,

That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
I think that issue can be fixed in another way than modifying all code that
iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
used by the tag iteration functions into kfree_rcu() or call_rcu() calls.
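
Roughly (untested sketch):

static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
                        busy_iter_fn *fn, void *data, bool reserved)
{
        struct bt_iter_data iter_data = {
                .hctx = hctx,
                .fn = fn,
                .data = data,
                .reserved = reserved,
        };

        rcu_read_lock();
        sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
        rcu_read_unlock();
}

/*
 * ... plus freeing the request pages in blk_mq_free_rqs() and the flush
 * queue in blk_free_flush_queue() via call_rcu() / kfree_rcu(), so that a
 * concurrent iterator never dereferences freed memory.
 */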

Thanks,

Bart.
jianchao.wang March 19, 2019, 1:25 a.m. UTC | #7
Hi Bart

Thanks for your kind and detailed comments on this.

On 3/19/19 1:28 AM, Bart Van Assche wrote:
> On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> 
> Hi Jianchao,
> 
> That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
> I think that issue can be fixed in another way than modifying all code that
> iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
> pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
> blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
> used by the tag iteration functions into kfree_rcu() or call_rcu() calls.

Do you mean this patch from Jens ?
https://marc.info/?l=linux-block&m=154534605914798&w=2

+	rcu_read_lock();
 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();

The busy_iter_fn could sleep for nvme
blk_mq_check_expired
  -> blk_mq_rq_timed_out
    -> q->mq_ops->timeout
       nvme_timeout
         -> nvme_dev_disable
            -> mutex_lock dev->shutdown_lock

Thanks
Jianchao
Bart Van Assche March 19, 2019, 3:10 p.m. UTC | #8
On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> Do you mean this patch from Jens ?
> https://marc.info/?l=linux-block&m=154534605914798&w=2
> 
> +	rcu_read_lock();
>  	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> +	rcu_read_unlock();
> 
> The busy_iter_fn could sleep for nvme
> blk_mq_check_expired
>   -> blk_mq_rq_timed_out
>     -> q->mq_ops->timeout
>        nvme_timeout
>          -> nvme_dev_disable
>             -> mutex_lock dev->shutdown_lock

Hi Jianchao,

I think that's an additional reason to rewrite NVMe error handling ...

Best regards,

Bart.
Keith Busch March 19, 2019, 3:25 p.m. UTC | #9
On Tue, Mar 19, 2019 at 08:10:22AM -0700, Bart Van Assche wrote:
> On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> > Do you mean this patch from Jens ?
> > https://marc.info/?l=linux-block&m=154534605914798&w=2
> > 
> > +	rcu_read_lock();
> >  	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> > +	rcu_read_unlock();
> > 
> > The busy_iter_fn could sleep for nvme
> > blk_mq_check_expired
> >   -> blk_mq_rq_timed_out
> >     -> q->mq_ops->timeout
> >        nvme_timeout
> >          -> nvme_dev_disable
> >             -> mutex_lock dev->shutdown_lock
> 
> Hi Jianchao,
> 
> I think that's an additional reason to rewrite NVMe error handling ...

Nonsense. Block timeout handling runs in a work queue precisely so
handlers can actually do useful work in line with the notification.
Sagi Grimberg March 20, 2019, 6:38 p.m. UTC | #10
>> Hi Jianchao,
>>
>> I think that's an additional reason to rewrite NVMe error handling ...
> 
> Nonsense. Block timeout handling runs in a work queue precisely so
> handlers can actually do useful work in line with the notification.

I have to agree with Keith on this one, there is absolutely no reason to
force this constraint on the error handler. If we want to tear down stuff
to guarantee that we can free the request safely, we may very well need
to block on queue flushing.