
blk-mq: Fix cpu indexing error in blk_mq_alloc_request_hctx()

Message ID 20191023175700.18615-1-jsmart2021@gmail.com (mailing list archive)
State New, archived
Series blk-mq: Fix cpu indexing error in blk_mq_alloc_request_hctx()

Commit Message

James Smart Oct. 23, 2019, 5:57 p.m. UTC
During the following test scenario:
- Offline a cpu
- load lpfc driver, which auto-discovers NVMe devices. For a new
  nvme device, the lpfc/nvme_fc transport can request up to
  num_online_cpus() worth of nr_hw_queues. The target in
  this case allowed at least that many nvme queues.
The system encountered the following crash:

 BUG: unable to handle kernel paging request at 00003659d33953a8
 ...
 Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
 RIP: 0010:blk_mq_get_request+0x21d/0x3c0
 ...
 blk_mq_alloc_request_hctx+0xef/0x140
 nvme_alloc_request+0x32/0x80 [nvme_core]
 __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
 nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
 nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
 nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
 nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
 ...

A while ago, a commit that simplified queue mapping replaced the use of
cpumask_first() with cpumask_first_and().
The issue is that if cpumask_first_and() does not find any _intersecting_
cpus, it returns nr_cpu_ids. nr_cpu_ids is not a valid per_cpu_ptr index,
which is how the value is used in __blk_mq_get_ctx().
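
For reference, __blk_mq_get_ctx() in block/blk-mq.h is roughly the
following at the time of this patch, so an out-of-range cpu goes straight
into per_cpu_ptr():

static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
						  unsigned int cpu)
{
	return per_cpu_ptr(q->queue_ctx, cpu);
}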

We considered reverting back to cpumask_first(), but instead followed the
logic in blk_mq_first_mapped_cpu(): check for nr_cpu_ids and, if hit, fall
back to cpumask_first().
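
For reference, blk_mq_first_mapped_cpu() in block/blk-mq.h does roughly
the following, which is the pattern this patch copies:

static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
{
	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(hctx->cpumask);
	return cpu;
}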

Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Signed-off-by: Shagun Agrawal <shagun.agrawal@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
CC: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Ming Lei Oct. 24, 2019, 9:28 a.m. UTC | #1
On Thu, Oct 24, 2019 at 4:53 PM James Smart <jsmart2021@gmail.com> wrote:
>
> During the following test scenario:
> - Offline a cpu
> - load lpfc driver, which auto-discovers NVMe devices. For a new
>   nvme device, the lpfc/nvme_fc transport can request up to
>   num_online_cpus() worth of nr_hw_queues. The target in
>   this case allowed at least that many of nvme queues.
> The system encountered the following crash:
>
>  BUG: unable to handle kernel paging request at 00003659d33953a8
>  ...
>  Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
>  RIP: 0010:blk_mq_get_request+0x21d/0x3c0
>  ...
>  blk_mq_alloc_request_hctx+0xef/0x140
>  nvme_alloc_request+0x32/0x80 [nvme_core]
>  __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
>  nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
>  nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
>  nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
>  nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
>  ...
>
> A while ago, a commit that simplified queue mapping replaced the use of
> cpumask_first() with cpumask_first_and().
> The issue is that if cpumask_first_and() does not find any _intersecting_
> cpus, it returns nr_cpu_ids. nr_cpu_ids is not a valid per_cpu_ptr index,
> which is how the value is used in __blk_mq_get_ctx().
>
> We considered reverting back to cpumask_first(), but instead followed the
> logic in blk_mq_first_mapped_cpu(): check for nr_cpu_ids and, if hit, fall
> back to cpumask_first().
>
> Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
> Signed-off-by: Shagun Agrawal <shagun.agrawal@broadcom.com>
> Signed-off-by: James Smart <jsmart2021@gmail.com>
> CC: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8538dc415499..0b06b4ea57f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>                 return ERR_PTR(-EXDEV);
>         }
>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +       if (cpu >= nr_cpu_ids)
> +               cpu = cpumask_first(alloc_data.hctx->cpumask);

The first cpu may be offline too, and then a kernel warning or timeout may
be triggered later when this allocated request is dispatched.

To be honest, blk_mq_alloc_request_hctx() is really a weird interface,
given that the hctx may become dead just when calling into
blk_mq_alloc_request_hctx().

Given that only NVMe FC/RDMA uses this interface, could you provide some
background about this kind of usage?

The normal usage is that the user doesn't specify the hctx to allocate the
request from, since blk-mq can figure out which hctx to use for allocation
via the queue mapping. Just wondering why NVMe FC/RDMA can't do it that way?
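
For context, the regular allocation path in blk_mq_get_request() derives
both the sw ctx and the hctx from whatever cpu it happens to run on,
roughly:

	data->ctx = blk_mq_get_ctx(q);		/* per-cpu software queue of the running cpu */
	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);	/* mapped hardware queue */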


Thanks,
Ming Lei
Jens Axboe Oct. 24, 2019, 1:02 p.m. UTC | #2
On 10/24/19 3:28 AM, Ming Lei wrote:
> On Thu, Oct 24, 2019 at 4:53 PM James Smart <jsmart2021@gmail.com> wrote:
>>
>> During the following test scenario:
>> - Offline a cpu
>> - load lpfc driver, which auto-discovers NVMe devices. For a new
>>    nvme device, the lpfc/nvme_fc transport can request up to
>>    num_online_cpus() worth of nr_hw_queues. The target in
>>    this case allowed at least that many of nvme queues.
>> The system encountered the following crash:
>>
>>   BUG: unable to handle kernel paging request at 00003659d33953a8
>>   ...
>>   Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
>>   RIP: 0010:blk_mq_get_request+0x21d/0x3c0
>>   ...
>>   blk_mq_alloc_request_hctx+0xef/0x140
>>   nvme_alloc_request+0x32/0x80 [nvme_core]
>>   __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
>>   nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
>>   nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
>>   nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
>>   nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
>>   ...
>>
>> A while ago, a commit that simplified queue mapping replaced the use of
>> cpumask_first() with cpumask_first_and().
>> The issue is that if cpumask_first_and() does not find any _intersecting_
>> cpus, it returns nr_cpu_ids. nr_cpu_ids is not a valid per_cpu_ptr index,
>> which is how the value is used in __blk_mq_get_ctx().
>>
>> We considered reverting back to cpumask_first(), but instead followed the
>> logic in blk_mq_first_mapped_cpu(): check for nr_cpu_ids and, if hit, fall
>> back to cpumask_first().
>>
>> Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
>> Signed-off-by: Shagun Agrawal <shagun.agrawal@broadcom.com>
>> Signed-off-by: James Smart <jsmart2021@gmail.com>
>> CC: Christoph Hellwig <hch@lst.de>
>> ---
>>   block/blk-mq.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8538dc415499..0b06b4ea57f1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>                  return ERR_PTR(-EXDEV);
>>          }
>>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>> +       if (cpu >= nr_cpu_ids)
>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> 
> The first cpu may be offline too, then kernel warning or timeout may
> be triggered later
> when this allocated request is dispatched.
> 
> To be honest, blk_mq_alloc_request_hctx() is really a weird interface,
> given the hctx may become
> dead just when calling into blk_mq_alloc_request_hctx().
> 
> Given only NVMe FC/RDMA uses this interface, could you provide some
> background about
> this kind of usage?
> 
> The normal usage is that user doesn't specify the hctx for allocating
> request from, since blk-mq
> can figure out which hctx is used for allocation via queue mapping.
> Just wondering why NVMe
> FC/RDMA can't do that way?

Fully agree, it'd be much nicer if that weird interface could just
die.
James Smart Oct. 24, 2019, 6:53 p.m. UTC | #3
On 10/24/2019 6:02 AM, Jens Axboe wrote:
> On 10/24/19 3:28 AM, Ming Lei wrote:
>> The normal usage is that user doesn't specify the hctx for allocating
>> request from, since blk-mq
>> can figure out which hctx is used for allocation via queue mapping.
>> Just wondering why NVMe
>> FC/RDMA can't do that way?
> 
> Fully agree, it'd be much nicer if that weird interface could just
> die.
> 

Well, it all depends on what you think a hctx corresponds to, which
relates to why the caller originally set nr_hw_queues to what it did. I
think it is reasonable for the caller to say: this must go via this
specific context.

To the nvme fabric transports (rdma, fc, tcp) the hctx corresponds to
the nvme controller queue to issue a request on. In the single case
where the hctx is explicitly specified, it is the 1st command on a new
nvme controller queue. The command *must* be issued on the queue it is
to initialize (this is different from pci nvme). The hctx is specified
so the correct nvme queue is selected when the command comes down the
request path. Saying "don't do that" means one of the following: a)
snooping every rq on the request path to spot initialization ios and
move them to the right queue; or b) creating a duplicate non-blk-mq
request path for this one initialization io. Both of those are ugly.
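
For reference, the transports end up in blk_mq_alloc_request_hctx() via
the qid-based branch in nvme_alloc_request(), roughly (abridged):

	if (qid == NVME_QID_ANY)
		req = blk_mq_alloc_request(q, op, flags);
	else
		/* pin the command to the hw queue backing nvme queue 'qid' */
		req = blk_mq_alloc_request_hctx(q, op, flags,
				qid ? qid - 1 : 0);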

-- james
Ming Lei Oct. 25, 2019, 7:22 a.m. UTC | #4
On Thu, Oct 24, 2019 at 11:53:26AM -0700, James Smart wrote:
> On 10/24/2019 6:02 AM, Jens Axboe wrote:
> > On 10/24/19 3:28 AM, Ming Lei wrote:
> > > The normal usage is that user doesn't specify the hctx for allocating
> > > request from, since blk-mq
> > > can figure out which hctx is used for allocation via queue mapping.
> > > Just wondering why NVMe
> > > FC/RDMA can't do that way?
> > 
> > Fully agree, it'd be much nicer if that weird interface could just
> > die.
> > 
> 
> Well.  All depends on what you think a hctx corresponds to, which relates to
> why the caller originally set nr_hw_queues to what it did. I think it is
> reasonable for the caller to say - this must be via this specific context.

Usually MQ is only for handling IO efficiently, and we seldom use MQ for
management purposes; for example, so far all admin queues are SQ.

> 
> To the nvme fabric transports (rdma, fc, tcp) the hctx corresponds to the
> nvme controller queue to issue a request on. In the single case where the

Could you explain a bit about what the purpose of the nvme controller is?

> hctx is specified specifically, it is the 1st command on a new nvme
> controller queue. The command *must* be issued on the queue it is to
> initialize (this is different from pci nvme).  The hctx is specified so the
> correct nvme queue is selected when the command comes down the request path.
> Saying "don't do that" means one of the following: a) snooping every rq on
> the request path to spot initialization ios and move them to the right
> queue; or b) creating a duplicate non-blk-mq request path for this 1
> initialization io. Both of those are ugly.

In nvmf_connect_io_queue(), 'qid' is encoded into the instance of 'struct
nvme_command', which means the 'nvme controller' should be able to identify
the specified queue by parsing the command. So I still don't understand why
you have to submit the command via the specified queue.
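
The encoding in question is roughly this part of nvmf_connect_io_queue():

	struct nvme_command cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.connect.opcode = nvme_fabrics_command;
	cmd.connect.fctype = nvme_fabrics_type_connect;
	cmd.connect.qid = cpu_to_le16(qid);
	cmd.connect.sqsize = cpu_to_le16(ctrl->sqsize);

so the target sees the intended queue id regardless of which queue carries
the command.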

Thanks,
Ming
Sagi Grimberg Oct. 25, 2019, 8:26 p.m. UTC | #5
>> hctx is specified specifically, it is the 1st command on a new nvme
>> controller queue. The command *must* be issued on the queue it is to
>> initialize (this is different from pci nvme).  The hctx is specified so the
>> correct nvme queue is selected when the command comes down the request path.
>> Saying "don't do that" means one of the following: a) snooping every rq on
>> the request path to spot initialization ios and move them to the right
>> queue; or b) creating a duplicate non-blk-mq request path for this 1
>> initialization io. Both of those are ugly.
> 
> In nvmf_connect_io_queue(), 'qid' has been encoded into instance of 'struct
> nvme_command', that means the 'nvme controller' should know the
> specified queue by parsing the command. So still not understand why you
> have to submit the command via the specified queue.

The connect command must be sent on the queue that it is connecting; the
qid tells the controller the id of the queue, but the controller
still expects the connect to be issued on the queue that it is meant
to connect (or rather initialize).

In queue_rq we take the queue from hctx->driver_data and use it to issue
the command. The connect is different in that it is invoked on a context
that is not necessarily running on a cpu that maps to this specific
hctx. So in essence what is needed is a tag from the specific queue's tags
without any running-cpu consideration.
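
To illustrate, the .queue_rq side looks roughly like this (paraphrased,
not the actual function body; the rdma case is shown):

static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	/* the nvme queue was bound to this hctx at init_hctx time */
	struct nvme_rdma_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;

	/* ... build the command and post it on 'queue' ... */
	return BLK_STS_OK;
}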
Sagi Grimberg Oct. 25, 2019, 8:51 p.m. UTC | #6
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8538dc415499..0b06b4ea57f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   		return ERR_PTR(-EXDEV);
>   	}
>   	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +	if (cpu >= nr_cpu_ids)
> +		cpu = cpumask_first(alloc_data.hctx->cpumask);

I'd just drop the cpu_online_mask test; we don't care about the cpu
being online, as this request is running already.
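
i.e. something like this (untested sketch):

-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	cpu = cpumask_first(alloc_data.hctx->cpumask);
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);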
Ming Lei Oct. 25, 2019, 10:20 p.m. UTC | #7
On Fri, Oct 25, 2019 at 01:26:46PM -0700, Sagi Grimberg wrote:
> 
> > > hctx is specified specifically, it is the 1st command on a new nvme
> > > controller queue. The command *must* be issued on the queue it is to
> > > initialize (this is different from pci nvme).  The hctx is specified so the
> > > correct nvme queue is selected when the command comes down the request path.
> > > Saying "don't do that" means one of the following: a) snooping every rq on
> > > the request path to spot initialization ios and move them to the right
> > > queue; or b) creating a duplicate non-blk-mq request path for this 1
> > > initialization io. Both of those are ugly.
> > 
> > In nvmf_connect_io_queue(), 'qid' has been encoded into instance of 'struct
> > nvme_command', that means the 'nvme controller' should know the
> > specified queue by parsing the command. So still not understand why you
> > have to submit the command via the specified queue.
> 
> The connect command must be send on the queue that it is connecting, the
> qid is telling the controller the id of the queue, but the controller
> still expects the connect to be issued on the queue that it is designed
> to connect (or rather initialize).
> 
> in queue_rq we take queue from hctx->driver_data and use it to issue
> the command. The connect is different that it is invoked on a context
> that is not necessarily running from a cpu that maps to this specific
> hctx. So in essence what is needed is a tag from the specific queue tags
> without running cpu consideration.

OK, got it.

If nvmf_connect_io_queue() is only run before setting up IO queues, the
shared tag problem could be solved easily, for example by using a
standalone tagset?


Thanks,
Ming
Sagi Grimberg Oct. 25, 2019, 10:33 p.m. UTC | #8
>>>> hctx is specified specifically, it is the 1st command on a new nvme
>>>> controller queue. The command *must* be issued on the queue it is to
>>>> initialize (this is different from pci nvme).  The hctx is specified so the
>>>> correct nvme queue is selected when the command comes down the request path.
>>>> Saying "don't do that" means one of the following: a) snooping every rq on
>>>> the request path to spot initialization ios and move them to the right
>>>> queue; or b) creating a duplicate non-blk-mq request path for this 1
>>>> initialization io. Both of those are ugly.
>>>
>>> In nvmf_connect_io_queue(), 'qid' has been encoded into instance of 'struct
>>> nvme_command', that means the 'nvme controller' should know the
>>> specified queue by parsing the command. So still not understand why you
>>> have to submit the command via the specified queue.
>>
>> The connect command must be send on the queue that it is connecting, the
>> qid is telling the controller the id of the queue, but the controller
>> still expects the connect to be issued on the queue that it is designed
>> to connect (or rather initialize).
>>
>> in queue_rq we take queue from hctx->driver_data and use it to issue
>> the command. The connect is different that it is invoked on a context
>> that is not necessarily running from a cpu that maps to this specific
>> hctx. So in essence what is needed is a tag from the specific queue tags
>> without running cpu consideration.
> 
> OK, got it.
> 
> If nvmf_connect_io_queue() is only run before setting up IO queues, the
> shared tag problem could be solved easily, such as, use a standalone
> tagset?

Not sure what you mean exactly...

Also, keep in mind that this sequence also runs on reconnects, where
we already have our tagset allocated (potentially with pending
requests).
Ming Lei Oct. 27, 2019, 7:23 a.m. UTC | #9
On Fri, Oct 25, 2019 at 03:33:15PM -0700, Sagi Grimberg wrote:
> 
> > > > > hctx is specified specifically, it is the 1st command on a new nvme
> > > > > controller queue. The command *must* be issued on the queue it is to
> > > > > initialize (this is different from pci nvme).  The hctx is specified so the
> > > > > correct nvme queue is selected when the command comes down the request path.
> > > > > Saying "don't do that" means one of the following: a) snooping every rq on
> > > > > the request path to spot initialization ios and move them to the right
> > > > > queue; or b) creating a duplicate non-blk-mq request path for this 1
> > > > > initialization io. Both of those are ugly.
> > > > 
> > > > In nvmf_connect_io_queue(), 'qid' has been encoded into instance of 'struct
> > > > nvme_command', that means the 'nvme controller' should know the
> > > > specified queue by parsing the command. So still not understand why you
> > > > have to submit the command via the specified queue.
> > > 
> > > The connect command must be send on the queue that it is connecting, the
> > > qid is telling the controller the id of the queue, but the controller
> > > still expects the connect to be issued on the queue that it is designed
> > > to connect (or rather initialize).
> > > 
> > > in queue_rq we take queue from hctx->driver_data and use it to issue
> > > the command. The connect is different that it is invoked on a context
> > > that is not necessarily running from a cpu that maps to this specific
> > > hctx. So in essence what is needed is a tag from the specific queue tags
> > > without running cpu consideration.
> > 
> > OK, got it.
> > 
> > If nvmf_connect_io_queue() is only run before setting up IO queues, the
> > shared tag problem could be solved easily, such as, use a standalone
> > tagset?
> 
> Not sure what you mean exactly...
> 
> Also, keep in mind that this sequence also goes into reconnects, where
> we already have our tagset allocated (with pending requests
> potentially).

I just found that the connect command is always submitted via the unique
reserved tag 0, so it no longer has anything to do with IO requests.

You can use the reserved tag 0 for the connect command as before, just by
not sharing the tagset between the connect queue and the IO queues.

Here is the detailed idea (a rough sketch follows below):

1) still reserve 1 tag for the connect command in the IO tagset.

2) in each driver, create a conn_tagset for .connect_q only, create
the .connect_q from this 'conn_tagset', set conn_tagset.nr_hw_queues
to 1, and set conn_tagset.queue_depth to 'ctrl.queue_count - 1'

3) in .queue_rq of conn_tagset.ops:

- parse the index of the queue to be connected from nvme_command.connect.qid
- set the connect command's tag to 0
- then do everything else just like before
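
A rough, hypothetical sketch of 2) above -- the names (conn_tag_set,
nvme_fc_conn_mq_ops) are made up for illustration, not existing kernel
code:

	struct blk_mq_tag_set *set = &ctrl->conn_tag_set;

	memset(set, 0, sizeof(*set));
	set->ops = &nvme_fc_conn_mq_ops;	/* hypothetical connect-only ops */
	set->nr_hw_queues = 1;
	set->queue_depth = ctrl->ctrl.queue_count - 1;
	set->numa_node = NUMA_NO_NODE;
	set->cmd_size = sizeof(struct nvme_fc_fcp_op);	/* driver specific */
	set->driver_data = ctrl;

	ret = blk_mq_alloc_tag_set(set);
	if (ret)
		return ret;

	ctrl->ctrl.connect_q = blk_mq_init_queue(set);
	if (IS_ERR(ctrl->ctrl.connect_q)) {
		blk_mq_free_tag_set(set);
		return PTR_ERR(ctrl->ctrl.connect_q);
	}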


Thanks,
Ming

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8538dc415499..0b06b4ea57f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -461,6 +461,8 @@  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	rq = blk_mq_get_request(q, NULL, &alloc_data);