Message ID: 1552640264-26101-1-git-send-email-jianchao.w.wang@oracle.com (mailing list archive)
Series: blk-mq: use static_rqs to iterate busy tags
On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> Hi Jens
>
> As we know, there is a risk of accessing stale requests when iterating
> in-flight requests with tags->rqs[], and this has been discussed in the
> following threads,
> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2

I'd rather take one step back and figure out why we are iterating the
busy requests. There really shouldn't be any reason why a driver is even
doing that (vs some error handling helpers in the core block code that
can properly synchronize).
On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
>> Hi Jens
>>
>> As we know, there is a risk of accessing stale requests when iterating
>> in-flight requests with tags->rqs[], and this has been discussed in the
>> following threads,
>> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> I'd rather take one step back and figure out why we are iterating the
> busy requests. There really shouldn't be any reason why a driver is even
> doing that (vs some error handling helpers in the core block code that
> can properly synchronize).

A typical scene is blk_mq_in_flight:

blk_mq_get_request                blk_mq_in_flight
  -> blk_mq_get_tag                 -> blk_mq_queue_tag_busy_iter
                                      -> bt_for_each
                                        -> bt_iter
                                          -> rq = tags->rqs[]
                                          -> rq->q // ---> get a stale request
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq

This stale request may be something that has already been freed because an
io scheduler was detached or because a queue sharing the tagset is gone.

blk_mq_timeout_work could also use tags->rqs[] to pick up an expired
request, and a driver may use it to requeue the in-flight requests when
the device is dead.

Compared with adding more synchronization, using static_rqs[] directly may
be simpler :)

Thanks
Jianchao
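A condensed sketch of the window described in the call chain above, based
on the v5.0-era blk_mq_rq_ctx_init() and bt_iter() (illustrative C
fragments, not a quote of the actual code):

	/* Allocation side (blk_mq_get_request): the tag bit becomes
	 * visible to iterators before tags->rqs[] is updated. */
	tag = blk_mq_get_tag(data);           /* 1. bit set in the sbitmap */
	rq = tags->static_rqs[tag];
	/* <-- window: an iterator can already see this tag, but
	 *     tags->rqs[tag] still holds the previous user ... */
	data->hctx->tags->rqs[rq->tag] = rq;  /* 2. ... until this store */

	/* Iteration side (bt_iter), running concurrently: */
	rq = tags->rqs[bitnr];                /* may load a stale pointer */
	if (rq && rq->q == hctx->queue)       /* dereferences rq->q even if
	                                       * the old request was freed */
		return iter_data->fn(hctx, rq, iter_data->data, reserved);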
On Fri, Mar 15, 2019 at 10:20:20AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > Hi Jens
> >
> > As we know, there is a risk of accessing stale requests when iterating
> > in-flight requests with tags->rqs[], and this has been discussed in the
> > following threads,
> > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> I'd rather take one step back and figure out why we are iterating the
> busy requests. There really shouldn't be any reason why a driver is even
> doing that (vs some error handling helpers in the core block code that
> can properly synchronize).

I use it in NBD to error out any pending requests for forced disconnects.
I'd be happy to use some other blessed interface, I'm not married to this
one.

Thanks,

Josef
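The NBD teardown Josef mentions follows the pattern sketched below, modeled
on the nbd_clear_que()/nbd_clear_req() pair in drivers/block/nbd.c of that
era (simplified for illustration; the nbd_cmd and nbd_device layouts come
from that driver):

	#include <linux/blk-mq.h>

	static bool nbd_clear_req(struct request *req, void *data, bool reserved)
	{
		struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);

		cmd->status = BLK_STS_IOERR;	/* fail the command ...     */
		blk_mq_complete_request(req);	/* ... and end the I/O      */
		return true;			/* keep iterating           */
	}

	static void nbd_clear_que(struct nbd_device *nbd)
	{
		blk_mq_quiesce_queue(nbd->disk->queue);
		/* walk every started request in the tag set and error it out */
		blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
		blk_mq_unquiesce_queue(nbd->disk->queue);
	}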
On Fri, 2019-03-15 at 17:44 +0800, jianchao.wang wrote:
> On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> > On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > > Hi Jens
> > >
> > > As we know, there is a risk of accessing stale requests when iterating
> > > in-flight requests with tags->rqs[], and this has been discussed in the
> > > following threads,
> > > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> >
> > I'd rather take one step back and figure out why we are iterating the
> > busy requests. There really shouldn't be any reason why a driver is even
> > doing that (vs some error handling helpers in the core block code that
> > can properly synchronize).
>
> A typical scene is blk_mq_in_flight:
>
> blk_mq_get_request                blk_mq_in_flight
>   -> blk_mq_get_tag                 -> blk_mq_queue_tag_busy_iter
>                                       -> bt_for_each
>                                         -> bt_iter
>                                           -> rq = tags->rqs[]
>                                           -> rq->q // ---> get a stale request
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq
>
> This stale request may be something that has already been freed because an
> io scheduler was detached or because a queue sharing the tagset is gone.
>
> blk_mq_timeout_work could also use tags->rqs[] to pick up an expired
> request, and a driver may use it to requeue the in-flight requests when
> the device is dead.
>
> Compared with adding more synchronization, using static_rqs[] directly may
> be simpler :)

Hi Jianchao,

Although I appreciate your work, I agree with Christoph that we should
avoid races like this rather than modifying the block layer to make sure
that such races are handled safely.

Thanks,

Bart.
Hi Bart

On 3/16/19 12:19 AM, Bart Van Assche wrote:
>> This stale request may be something that has already been freed because
>> an io scheduler was detached or because a queue sharing the tagset is
>> gone.
>>
>> blk_mq_timeout_work could also use tags->rqs[] to pick up an expired
>> request, and a driver may use it to requeue the in-flight requests when
>> the device is dead.
>>
>> Compared with adding more synchronization, using static_rqs[] directly
>> may be simpler :)
>
> Hi Jianchao,
>
> Although I appreciate your work, I agree with Christoph that we should
> avoid races like this rather than modifying the block layer to make sure
> that such races are handled safely.

The root cause here is that there is a window between setting the tag bit
in the sbitmap and setting tags->rqs[], and we don't clear tags->rqs[] in
the tag free path. So when we iterate the busy tags, we can see stale
requests in tags->rqs[], and those stale requests may have been freed.

It looks difficult to close the window above, so we have tried to clear
tags->rqs[] in two ways:

1. Clear the tags->rqs[] entry in the request free path.
   Jens didn't like it:
   https://marc.info/?l=linux-block&m=154515671524877&w=2
   "It's an extra store, and it's a store to an area that's then now
   shared between issue and completion. Those are never a good idea.
   Besides, it's the kind of issue you solve in the SLOW path, not in the
   fast path. Since that's doable, it would be silly to do it for every
   IO."

2. Clear the associated entries in tags->rqs[] when blk_mq_free_rqs runs:
   https://marc.info/?l=linux-block&m=154534605914798&w=2

   +	rcu_read_lock();
   	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
   +	rcu_read_unlock();

   The busy_iter_fn could sleep:
   blk_mq_check_expired
     -> blk_mq_rq_timed_out
       -> q->mq_ops->timeout
          nvme_timeout
            -> nvme_dev_disable
              -> mutex_lock dev->shutdown_lock

Since it is not so flexible to fix things on the tags->rqs[] side, why not
try to use tags->static_rqs[]? Then we would never have to care about
stale requests any more.

Thanks
Jianchao
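A sketch of what the series proposes, as described above (illustrative,
not the actual patch; bt_iter_data and the surrounding types are the
v5.0-era ones from block/blk-mq-tag.c):

	static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
	{
		struct bt_iter_data *iter_data = data;
		struct blk_mq_hw_ctx *hctx = iter_data->hctx;
		struct blk_mq_tags *tags = hctx->tags;
		struct request *rq;

		if (!iter_data->reserved)
			bitnr += tags->nr_reserved_tags;

		/* static_rqs[] lives as long as the tag set itself, so it
		 * can never point at an already-freed request. */
		rq = tags->static_rqs[bitnr];

		/* Filter out requests that were allocated but not yet
		 * started, and requests owned by another queue sharing
		 * this tag set. */
		if (rq && blk_mq_request_started(rq) && rq->q == hctx->queue)
			return iter_data->fn(hctx, rq, iter_data->data,
					     iter_data->reserved);
		return true;
	}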
On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
Hi Jianchao,
That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
I think that issue can be fixed in another way than modifying all code that
iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
used by the tag iteration functions into kfree_rcu() or call_rcu() calls.
Thanks,
Bart.
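Concretely, that suggestion would look something like the sketch below
(names follow the v5.0-era block/blk-mq-tag.c; the freeing side, not
shown, would have to defer the kfree() of tags->rqs[] and of the flush
queue with kfree_rcu() or call_rcu() for the read lock to be sufficient):

	static void bt_for_each(struct blk_mq_hw_ctx *hctx,
				struct sbitmap_queue *bt,
				busy_iter_fn *fn, void *data, bool reserved)
	{
		struct bt_iter_data iter_data = {
			.hctx = hctx,
			.fn = fn,
			.data = data,
			.reserved = reserved,
		};

		/* Pin tags->rqs[] and the request pages for the walk. */
		rcu_read_lock();
		sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
		rcu_read_unlock();
	}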
Hi Bart

Thanks for your kind and detailed comments on this.

On 3/19/19 1:28 AM, Bart Van Assche wrote:
> On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> Hi Jianchao,
>
> That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
> I think that issue can be fixed in another way than modifying all code that
> iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
> pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
> blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
> used by the tag iteration functions into kfree_rcu() or call_rcu() calls.

Do you mean this patch from Jens?
https://marc.info/?l=linux-block&m=154534605914798&w=2

+	rcu_read_lock();
	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+	rcu_read_unlock();

The busy_iter_fn could sleep for nvme:
blk_mq_check_expired
  -> blk_mq_rq_timed_out
    -> q->mq_ops->timeout
       nvme_timeout
         -> nvme_dev_disable
           -> mutex_lock dev->shutdown_lock

Thanks
Jianchao
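Spelled out, the conflict is that mutex_lock() may sleep, and sleeping
inside an RCU read-side critical section is illegal, so with the proposed
read lock the timeout walk would look like this (sketch):

	rcu_read_lock();
	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
		/* -> blk_mq_check_expired()
		 *    -> blk_mq_rq_timed_out()
		 *       -> q->mq_ops->timeout == nvme_timeout()
		 *          -> nvme_dev_disable()
		 *             -> mutex_lock(&dev->shutdown_lock)  may sleep! */
	rcu_read_unlock();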
On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> Do you mean this patch from Jens?
> https://marc.info/?l=linux-block&m=154534605914798&w=2
>
> +	rcu_read_lock();
> 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> +	rcu_read_unlock();
>
> The busy_iter_fn could sleep for nvme:
> blk_mq_check_expired
>   -> blk_mq_rq_timed_out
>     -> q->mq_ops->timeout
>        nvme_timeout
>          -> nvme_dev_disable
>            -> mutex_lock dev->shutdown_lock

Hi Jianchao,

I think that's an additional reason to rewrite NVMe error handling ...

Best regards,

Bart.
On Tue, Mar 19, 2019 at 08:10:22AM -0700, Bart Van Assche wrote:
> On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> > Do you mean this patch from Jens?
> > https://marc.info/?l=linux-block&m=154534605914798&w=2
> >
> > +	rcu_read_lock();
> > 	sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> > +	rcu_read_unlock();
> >
> > The busy_iter_fn could sleep for nvme:
> > blk_mq_check_expired
> >   -> blk_mq_rq_timed_out
> >     -> q->mq_ops->timeout
> >        nvme_timeout
> >          -> nvme_dev_disable
> >            -> mutex_lock dev->shutdown_lock
>
> Hi Jianchao,
>
> I think that's an additional reason to rewrite NVMe error handling ...

Nonsense. Block timeout handling runs in a work queue precisely so
handlers can actually do useful work in line with the notification.
>> Hi Jianchao,
>>
>> I think that's an additional reason to rewrite NVMe error handling ...
>
> Nonsense. Block timeout handling runs in a work queue precisely so
> handlers can actually do useful work in line with the notification.

I have to agree with Keith on this one, there is absolutely no reason to
force this constraint on the error handler. If we want to tear down stuff
to guarantee that we can free the request safely, we may very well need to
block on queue flushing.