
NVMe induced NULL deref in bt_iter()

Message ID 7138df5a-b1ce-7f46-281f-ae15172c61e5@grimberg.me (mailing list archive)
State New, archived

Commit Message

Sagi Grimberg July 2, 2017, 11:56 a.m. UTC
On 02/07/17 13:45, Max Gurtovoy wrote:
> 
> 
> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>> Hi Max,
> 
> Hi Jens,
> 
>>
>> I remembered you reporting this. I think this is a regression introduced
>> with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
>> is, but that's not indexable by the tag we find. So I think we need to
>> guard those with a NULL check. The actual requests themselves are
>> static, so we know the memory itself isn't going away. But if we race
>> with completion, we could find a NULL there, validly.
>>
>> Since you could reproduce it, can you try the below?
> 
> I still can repro the null deref with this patch applied.
> 
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>          bitnr += tags->nr_reserved_tags;
>>      rq = tags->rqs[bitnr];
>>
>> -    if (rq->q == hctx->queue)
>> +    if (rq && rq->q == hctx->queue)
>>          iter_data->fn(hctx, rq, iter_data->data, reserved);
>>      return true;
>>  }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>      if (!reserved)
>>          bitnr += tags->nr_reserved_tags;
>>      rq = tags->rqs[bitnr];
>> -
>> -    iter_data->fn(rq, iter_data->data, reserved);
>> +    if (rq)
>> +        iter_data->fn(rq, iter_data->data, reserved);
>>      return true;
>>  }
> 
> see the attached file for dmesg output.
> 
> output of gdb:
> 
> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs 
> (./include/linux/sbitmap.h:234).
> 229
> 230             for (i = 0; i < sb->map_nr; i++) {
> 231                     struct sbitmap_word *word = &sb->map[i];
> 232                     unsigned int off, nr;
> 233
> 234                     if (!word->word)
> 235                             continue;
> 236
> 237                     nr = 0;
> 238                     off = i << sb->shift;
> 
> 
> when I change the "if (!word->word)" to  "if (word && !word->word)"
> I can get null deref at "nr = find_next_bit(&word->word, word->depth, 
> nr);". Seems like somehow word becomes NULL.
> 
> Adding the linux-nvme guys too.
> Sagi has mentioned that this can be NULL only if we remove the tagset
> while I/O is trying to get a tag. When killing the target we get into
> error recovery and periodic reconnects, which do _NOT_ include freeing
> the tagset, so this is probably the admin tagset.
> 
> Sagi,
> you've mentioned a patch for centralizing the treatment of the admin
> tagset in the nvme core. I think I missed this patch, so can you please
> send a pointer to it and I'll check if it helps?

Hmm,

In the above flow we should not be freeing the tag_set, not even for the
admin queue. The target keeps removing namespaces and finally removes the
subsystem, which generates an error recovery flow. What we at least try
to do is:

1. mark rdma queues as not live
2. stop all the sw queues (admin and io)
3. fail inflight I/Os
4. restart all sw queues (to fast fail until we recover)

We shouldn't be freeing the tagsets (although we might update them
when we recover and cpu map changed - which I don't think is happening).

However, I do see a difference between bt_tags_for_each
and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).

Unrelated to this I think we should quiesce/unquiesce the admin_q
instead of stop/start because it respects the submission path rcu [1].

It might hide the issue, but given that we never free the tagset it
seems like the bug is not in nvme-rdma (Max, can you see if this makes
the issue go away?)

[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e3996db22738..094873a4ee38 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)

        if (ctrl->ctrl.queue_count > 1)
                nvme_stop_queues(&ctrl->ctrl);
-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);

        /* We must take care of fastfail/requeue all our inflight requests */
        if (ctrl->ctrl.queue_count > 1)
@@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
         * queues are not a live anymore, so restart the queues to fail fast
         * new IO
         */
-       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
+       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
+       blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
        nvme_start_queues(&ctrl->ctrl);

        nvme_rdma_reconnect_or_remove(ctrl);
@@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl)
        if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
                nvme_shutdown_ctrl(&ctrl->ctrl);

-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
        blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
                                nvme_cancel_request, &ctrl->ctrl);
        nvme_rdma_destroy_admin_queue(ctrl);
--

Comments

Max Gurtovoy July 2, 2017, 2:37 p.m. UTC | #1
On 7/2/2017 2:56 PM, Sagi Grimberg wrote:
>
>
> On 02/07/17 13:45, Max Gurtovoy wrote:
>>
>>
>> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>>> Hi Max,
>>
>> Hi Jens,
>>
>>>
>>> I remembered you reporting this. I think this is a regression introduced
>>> with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
>>> is, but that's not indexable by the tag we find. So I think we need to
>>> guard those with a NULL check. The actual requests themselves are
>>> static, so we know the memory itself isn't going away. But if we race
>>> with completion, we could find a NULL there, validly.
>>>
>>> Since you could reproduce it, can you try the below?
>>
>> I still can repro the null deref with this patch applied.
>>
>>>
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index d0be72ccb091..b856b2827157 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap,
>>> unsigned int bitnr, void *data)
>>>          bitnr += tags->nr_reserved_tags;
>>>      rq = tags->rqs[bitnr];
>>>
>>> -    if (rq->q == hctx->queue)
>>> +    if (rq && rq->q == hctx->queue)
>>>          iter_data->fn(hctx, rq, iter_data->data, reserved);
>>>      return true;
>>>  }
>>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap,
>>> unsigned int bitnr, void *data)
>>>      if (!reserved)
>>>          bitnr += tags->nr_reserved_tags;
>>>      rq = tags->rqs[bitnr];
>>> -
>>> -    iter_data->fn(rq, iter_data->data, reserved);
>>> +    if (rq)
>>> +        iter_data->fn(rq, iter_data->data, reserved);
>>>      return true;
>>>  }
>>
>> see the attached file for dmesg output.
>>
>> output of gdb:
>>
>> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
>> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
>> (./include/linux/sbitmap.h:234).
>> 229
>> 230             for (i = 0; i < sb->map_nr; i++) {
>> 231                     struct sbitmap_word *word = &sb->map[i];
>> 232                     unsigned int off, nr;
>> 233
>> 234                     if (!word->word)
>> 235                             continue;
>> 236
>> 237                     nr = 0;
>> 238                     off = i << sb->shift;
>>
>>
>> when I change the "if (!word->word)" to  "if (word && !word->word)"
>> I can get null deref at "nr = find_next_bit(&word->word, word->depth,
>> nr);". Seems like somehow word becomes NULL.
>>
>> Adding the linux-nvme guys too.
>> Sagi has mentioned that this can be NULL only if we remove the tagset
>> while I/O is trying to get a tag. When killing the target we get into
>> error recovery and periodic reconnects, which do _NOT_ include freeing
>> the tagset, so this is probably the admin tagset.
>>
>> Sagi,
>> you've mentioned a patch for centralizing the treatment of the admin
>> tagset in the nvme core. I think I missed this patch, so can you
>> please send a pointer to it and I'll check if it helps?
>
> Hmm,
>
> In the above flow we should not be freeing the tag_set, not even for the
> admin queue. The target keeps removing namespaces and finally removes the
> subsystem, which generates an error recovery flow. What we at least try
> to do is:
>
> 1. mark rdma queues as not live
> 2. stop all the sw queues (admin and io)
> 3. fail inflight I/Os
> 4. restart all sw queues (to fast fail until we recover)
>
> We shouldn't be freeing the tagsets (although we might update them
> when we recover and cpu map changed - which I don't think is happening).
>
> However, I do see a difference between bt_tags_for_each
> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
>
> Unrelated to this I think we should quiesce/unquiesce the admin_q
> instead of stop/start because it respects the submission path rcu [1].
>
> It might hide the issue, but given that we never free the tagset it
> seems like the bug is not in nvme-rdma (Max, can you see if this makes
> the issue go away?)

Yes, this fixes the null deref issue.
I ran some additional login/logout tests that passed too.
This fix is also important for stable kernels (with the needed backports
of the blk_mq_quiesce_queue/blk_mq_unquiesce_queue functions).
You can add my:
Tested-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>

Let me know if you want me to push this fix to the mailing list to save
time (can we make it into 4.12?)

>
> [1]:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e3996db22738..094873a4ee38 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>
>         if (ctrl->ctrl.queue_count > 1)
>                 nvme_stop_queues(&ctrl->ctrl);
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>
>         /* We must take care of fastfail/requeue all our inflight
> requests */
>         if (ctrl->ctrl.queue_count > 1)
> @@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>          * queues are not a live anymore, so restart the queues to fail
> fast
>          * new IO
>          */
> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> +       blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
>         nvme_start_queues(&ctrl->ctrl);
>
>         nvme_rdma_reconnect_or_remove(ctrl);
> @@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct
> nvme_rdma_ctrl *ctrl)
>         if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
>                 nvme_shutdown_ctrl(&ctrl->ctrl);
>
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
>         nvme_rdma_destroy_admin_queue(ctrl);
> --
Sagi Grimberg July 2, 2017, 3:08 p.m. UTC | #2
>> Hmm,
>>
>> In the above flow we should not be freeing the tag_set, not even for the
>> admin queue. The target keeps removing namespaces and finally removes the
>> subsystem, which generates an error recovery flow. What we at least try
>> to do is:
>>
>> 1. mark rdma queues as not live
>> 2. stop all the sw queues (admin and io)
>> 3. fail inflight I/Os
>> 4. restart all sw queues (to fast fail until we recover)
>>
>> We shouldn't be freeing the tagsets (although we might update them
>> when we recover and cpu map changed - which I don't think is happening).
>>
>> However, I do see a difference between bt_tags_for_each
>> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
>>
>> Unrelated to this I think we should quiesce/unquiesce the admin_q
>> instead of stop/start because it respects the submission path rcu [1].
>>
>> It might hide the issue, but given that we never free the tagset it
>> seems like the bug is not in nvme-rdma (Max, can you see if this makes
>> the issue go away?)
> 
> Yes, this fixes the null deref issue.
> I ran some additional login/logout tests that passed too.
> This fix is also important for stable kernels (with the needed backports
> of the blk_mq_quiesce_queue/blk_mq_unquiesce_queue functions).
> You can add my:
> Tested-by: Max Gurtovoy <maxg@mellanox.com>
> Reviewed-by: Max Gurtovoy <maxg@mellanox.com>

Thanks for clarifying, Max.

However, I still think it's not the root cause (unless I don't
understand it).

As I said, we do not free the tagset, so I'm not sure why we get a
NULL deref in the sbitmap code. Jens, can you explain why changing
blk_mq_stop_hw_queues to blk_mq_quiesce_queue makes the issue go away?
I know that quiesce respects the RCU grace period, but I still do not
understand why without it we get a NULL sb->map.

> Let me know if you want me to push this fix to the mailing list to save
> time (can we make it into 4.12?)

I can send patches, we need it in pci, fc and loop too..

I don't think it's 4.12 material as we are way too late for this sort
of fix.
Ming Lei July 3, 2017, 9:40 a.m. UTC | #3
On Sun, Jul 02, 2017 at 02:56:56PM +0300, Sagi Grimberg wrote:
> 
> 
> On 02/07/17 13:45, Max Gurtovoy wrote:
> > 
> > 
> > On 6/30/2017 8:26 PM, Jens Axboe wrote:
> > > Hi Max,
> > 
> > Hi Jens,
> > 
> > > 
> > > I remembered you reporting this. I think this is a regression introduced
> > > with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
> > > is, but that's not indexable by the tag we find. So I think we need to
> > > guard those with a NULL check. The actual requests themselves are
> > > static, so we know the memory itself isn't going away. But if we race
> > > with completion, we could find a NULL there, validly.
> > > 
> > > Since you could reproduce it, can you try the below?
> > 
> > I still can repro the null deref with this patch applied.
> > 
> > > 
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index d0be72ccb091..b856b2827157 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > >          bitnr += tags->nr_reserved_tags;
> > >      rq = tags->rqs[bitnr];
> > > 
> > > -    if (rq->q == hctx->queue)
> > > +    if (rq && rq->q == hctx->queue)
> > >          iter_data->fn(hctx, rq, iter_data->data, reserved);
> > >      return true;
> > >  }
> > > @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap,
> > > unsigned int bitnr, void *data)
> > >      if (!reserved)
> > >          bitnr += tags->nr_reserved_tags;
> > >      rq = tags->rqs[bitnr];
> > > -
> > > -    iter_data->fn(rq, iter_data->data, reserved);
> > > +    if (rq)
> > > +        iter_data->fn(rq, iter_data->data, reserved);
> > >      return true;
> > >  }
> > 
> > see the attached file for dmesg output.
> > 
> > output of gdb:
> > 
> > (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> > 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
> > (./include/linux/sbitmap.h:234).
> > 229
> > 230             for (i = 0; i < sb->map_nr; i++) {
> > 231                     struct sbitmap_word *word = &sb->map[i];
> > 232                     unsigned int off, nr;
> > 233
> > 234                     if (!word->word)
> > 235                             continue;
> > 236
> > 237                     nr = 0;
> > 238                     off = i << sb->shift;
> > 
> > 
> > when I change the "if (!word->word)" to  "if (word && !word->word)"
> > I can get null deref at "nr = find_next_bit(&word->word, word->depth,
> > nr);". Seems like somehow word becomes NULL.
> > 
> > Adding the linux-nvme guys too.
> > Sagi has mentioned that this can be NULL only if we remove the tagset
> > while I/O is trying to get a tag. When killing the target we get into
> > error recovery and periodic reconnects, which do _NOT_ include freeing
> > the tagset, so this is probably the admin tagset.
> > 
> > Sagi,
> > you've mentioned a patch for centralizing the treatment of the admin
> > tagset in the nvme core. I think I missed this patch, so can you please
> > send a pointer to it and I'll check if it helps?
> 
> Hmm,
> 
> In the above flow we should not be freeing the tag_set, not even for the
> admin queue. The target keeps removing namespaces and finally removes the
> subsystem, which generates an error recovery flow. What we at least try
> to do is:
> 
> 1. mark rdma queues as not live
> 2. stop all the sw queues (admin and io)
> 3. fail inflight I/Os
> 4. restart all sw queues (to fast fail until we recover)
> 
> We shouldn't be freeing the tagsets (although we might update them
> when we recover and cpu map changed - which I don't think is happening).
> 
> However, I do see a difference between bt_tags_for_each
> and blk_mq_flush_busy_ctxs (checks tags->rqs not being NULL).
> 
> Unrelated to this I think we should quiesce/unquiesce the admin_q
> instead of stop/start because it respects the submission path rcu [1].
> 
> It might hide the issue, but given that we never free the tagset it
> seems like the bug is not in nvme-rdma (Max, can you see if this makes
> the issue go away?)
> 
> [1]:
> --
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index e3996db22738..094873a4ee38 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
> 
>         if (ctrl->ctrl.queue_count > 1)
>                 nvme_stop_queues(&ctrl->ctrl);
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
> 
>         /* We must take care of fastfail/requeue all our inflight requests
> */
>         if (ctrl->ctrl.queue_count > 1)
> @@ -798,7 +798,8 @@ static void nvme_rdma_error_recovery_work(struct
> work_struct *work)
>          * queues are not a live anymore, so restart the queues to fail fast
>          * new IO
>          */
> -       blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +       blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
> +       blk_mq_kick_requeue_list(ctrl->ctrl.admin_q);
>         nvme_start_queues(&ctrl->ctrl);
> 
>         nvme_rdma_reconnect_or_remove(ctrl);
> @@ -1651,7 +1652,7 @@ static void nvme_rdma_shutdown_ctrl(struct
> nvme_rdma_ctrl *ctrl)
>         if (test_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags))
>                 nvme_shutdown_ctrl(&ctrl->ctrl);
> 
> -       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
> +       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
>         blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>                                 nvme_cancel_request, &ctrl->ctrl);
>         nvme_rdma_destroy_admin_queue(ctrl);

Yeah, the above change is correct, for any canceling requests in this
way we should use blk_mq_quiesce_queue().

Thanks,
Ming
Sagi Grimberg July 3, 2017, 10:07 a.m. UTC | #4
Hi Ming,

> Yeah, the above change is correct, for any canceling requests in this
> way we should use blk_mq_quiesce_queue().

I still don't understand why blk_mq_flush_busy_ctxs should hit a NULL
deref if we don't touch the tagset...

Also, I'm wondering in what cases we shouldn't use
blk_mq_quiesce_queue(). Maybe we should unexport blk_mq_stop_hw_queues()
and blk_mq_start_stopped_hw_queues() and always use the
quiesce/unquiesce equivalents?

The only fishy usage is in nvme_fc_start_fcp_op(), where if submission
fails the code stops the hw queues and delays the retry, but I think
that should be handled differently...
Ming Lei July 3, 2017, 12:03 p.m. UTC | #5
On Mon, Jul 03, 2017 at 01:07:44PM +0300, Sagi Grimberg wrote:
> Hi Ming,
> 
> > Yeah, the above change is correct, for any canceling requests in this
> > way we should use blk_mq_quiesce_queue().
> 
> I still don't understand why blk_mq_flush_busy_ctxs should hit a NULL
> deref if we don't touch the tagset...

It looks like no one has mentioned the steps for reproduction, so it
isn't easy to understand the related use case. Could anyone share the
steps for reproduction?

> 
> Also, I'm wondering in what cases we shouldn't use
> blk_mq_quiesce_queue(). Maybe we should unexport blk_mq_stop_hw_queues()
> and blk_mq_start_stopped_hw_queues() and always use the
> quiesce/unquiesce equivalents?

There is at least one case in which we have to use stop queues:

	- when QUEUE_BUSY (now BLK_STS_RESOURCE) happens, some drivers need
	to stop queues to avoid hurting the CPU, such as virtio-blk, ...

> 
> The only fishy usage is in nvme_fc_start_fcp_op() where if submission
> failed the code stop the hw queues and delays it, but I think it should
> be handled differently..

It looks like the old scsi-mq approach, but SCSI has since removed it
and avoids stopping the queue.


Thanks,
Ming
Max Gurtovoy July 3, 2017, 12:46 p.m. UTC | #6
On 7/3/2017 3:03 PM, Ming Lei wrote:
> On Mon, Jul 03, 2017 at 01:07:44PM +0300, Sagi Grimberg wrote:
>> Hi Ming,
>>
>>> Yeah, the above change is correct, for any canceling requests in this
>>> way we should use blk_mq_quiesce_queue().
>>
>> I still don't understand why blk_mq_flush_busy_ctxs should hit a NULL
>> deref if we don't touch the tagset...
>
> It looks like no one has mentioned the steps for reproduction, so it
> isn't easy to understand the related use case. Could anyone share the
> steps for reproduction?

Hi Ming,
I create 500 namespaces per subsystem (seen with a CX4 target and C-IB
initiator, but also in a CX5 vs. CX5 setup).
The null deref happens when I remove all configuration on the target (1
port, 1 subsystem, 500 namespaces, and an nvmet module unload) during
traffic to 1 nvme device/ns from the initiator.
I get the NULL deref in the blk_mq_flush_busy_ctxs function, which calls
sbitmap_for_each_set, on the initiator. It seems like "struct
sbitmap_word *word = &sb->map[i];" is NULL. It might actually be
non-NULL at the beginning of the function and become NULL while running
the while loop there.

>
>>
>> Also, I'm wondering in what cases we shouldn't use
>> blk_mq_quiesce_queue(). Maybe we should unexport blk_mq_stop_hw_queues()
>> and blk_mq_start_stopped_hw_queues() and always use the
>> quiesce/unquiesce equivalents?
>
> There is at least one case in which we have to use stop queues:
>
> 	- when QUEUE_BUSY (now BLK_STS_RESOURCE) happens, some drivers need
> 	to stop queues to avoid hurting the CPU, such as virtio-blk, ...
>
>>
>> The only fishy usage is in nvme_fc_start_fcp_op() where if submission
>> failed the code stop the hw queues and delays it, but I think it should
>> be handled differently..
>
> It looks like the old way of scsi-mq, but scsi has removed this way and
> avoids to stop queue.
>
>
> Thanks,
> Ming
>
Ming Lei July 3, 2017, 3:54 p.m. UTC | #7
On Mon, Jul 03, 2017 at 03:46:34PM +0300, Max Gurtovoy wrote:
> 
> 
> On 7/3/2017 3:03 PM, Ming Lei wrote:
> > On Mon, Jul 03, 2017 at 01:07:44PM +0300, Sagi Grimberg wrote:
> > > Hi Ming,
> > > 
> > > > Yeah, the above change is correct, for any canceling requests in this
> > > > way we should use blk_mq_quiesce_queue().
> > > 
> > > I still don't understand why blk_mq_flush_busy_ctxs should hit a NULL
> > > deref if we don't touch the tagset...
> > 
> > It looks like no one has mentioned the steps for reproduction, so it
> > isn't easy to understand the related use case. Could anyone share the
> > steps for reproduction?
> 
> Hi Ming,
> I create 500 ns per 1 subsystem (using with CX4 target and C-IB initiator
> but also saw it in CX5 vs. CX5 setup).
> The null deref happens when I remove all configuration in the target (1 port
> 1 subsystem and 500 namespaces and nvmet modules unload) during traffic to 1
> nvme device/ns from the initiator.
> I get Null deref in blk_mq_flush_busy_ctxs function that calls
> sbitmap_for_each_set in the initiator. seems like the "struct sbitmap_word
> *word = &sb->map[i];" is null. It's actually might be not null in the
> beginning of the func and become null during running the while loop there.

So it looks like it is still a normal release on the initiator.

In my experience, without quiescing the queue before
blk_mq_tagset_busy_iter() for canceling requests, a request double free
can be caused: one submitted req in .queue_rq can be completed in
blk_mq_end_request(), and meanwhile it can be completed in
nvme_cancel_request(). That is why we have to quiesce the queue
first before canceling requests in this way. Besides NVMe, it looks
like NBD and mtip32xx need the fix too.

This race might cause blk_cleanup_queue() to complete early, and then
the NULL deref can be triggered in blk_mq_flush_busy_ctxs(). But in my
previous debugging on PCI NVMe, this wasn't seen yet.

Whether the above is true should be verified by adding some debug
messages inside blk_cleanup_queue().


Thanks,
Ming
Sagi Grimberg July 4, 2017, 6:58 a.m. UTC | #8
> So it looks like it is still a normal release on the initiator.
> 
> In my experience, without quiescing the queue before
> blk_mq_tagset_busy_iter() for canceling requests, a request double free
> can be caused: one submitted req in .queue_rq can be completed in
> blk_mq_end_request(), and meanwhile it can be completed in
> nvme_cancel_request(). That is why we have to quiesce the queue
> first before canceling requests in this way. Besides NVMe, it looks
> like NBD and mtip32xx need the fix too.

Let me cook some patches for those as well...
Sagi Grimberg July 4, 2017, 7:56 a.m. UTC | #9
> There is at least one case in which we have to use stop queues:
> 
> 	- when QUEUE_BUSY (now BLK_STS_RESOURCE) happens, some drivers need
> 	to stop queues to avoid hurting the CPU, such as virtio-blk, ...

Why isn't virtio_blk using blk_mq_delay_run_hw_queue like scsi does?
Ming Lei July 4, 2017, 8:08 a.m. UTC | #10
On Tue, Jul 04, 2017 at 10:56:23AM +0300, Sagi Grimberg wrote:
> 
> > There is at least one case in which we have to use stop queues:
> > 
> > 	- when QUEUE_BUSY (now BLK_STS_RESOURCE) happens, some drivers need
> > 	to stop queues to avoid hurting the CPU, such as virtio-blk, ...
> 
> Why isn't virtio_blk using blk_mq_delay_run_hw_queue like scsi does?

IMO it isn't easy to figure out one perfect delay time, and it
should be self-adaptive.

Also I think it might be possible to move this kind of stop action into
the blk-mq core code, and not let drivers touch the stop state. Finally
we could kill all the stopping in drivers.

Thanks,
Ming
Sagi Grimberg July 4, 2017, 9:14 a.m. UTC | #11
On 04/07/17 11:08, Ming Lei wrote:
> On Tue, Jul 04, 2017 at 10:56:23AM +0300, Sagi Grimberg wrote:
>>
>>> There is at least one case in which we have to use stop queues:
>>>
>>> 	- when QUEUE_BUSY (now BLK_STS_RESOURCE) happens, some drivers need
>>> 	to stop queues to avoid hurting the CPU, such as virtio-blk, ...
>>
>> Why isn't virtio_blk using blk_mq_delay_run_hw_queue like scsi does?
> 
> IMO it shouldn't be easy to figure out one perfect delay time,

It doesn't need to be perfect, just something that is sufficient
to not hog the CPU and won't have noticeable effects...

> and it should have been self-adaptive.

But IMO always starting the queues on *every* completion is a waste...
why iterate over all the hw queues on each completion?

> Also I think it might be possible to move this kind of stop action into
> blk-mq core code, and not let drivers touch stop state. Finally we
> may kill all stopping in drivers.

That's a good idea!

Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index e3996db22738..094873a4ee38 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -785,7 +785,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)

         if (ctrl->ctrl.queue_count > 1)
                 nvme_stop_queues(&ctrl->ctrl);
-       blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+       blk_mq_quiesce_queue(ctrl->ctrl.admin_q);

         /* We must take care of fastfail/requeue all our inflight