| Field | Value |
|---|---|
| Message ID | 20191023175700.18615-1-jsmart2021@gmail.com (mailing list archive) |
| State | New, archived |
| Series | blk-mq: Fix cpu indexing error in blk_mq_alloc_request_hctx() |

On Thu, Oct 24, 2019 at 4:53 PM James Smart <jsmart2021@gmail.com> wrote:
>
> During the following test scenario:
> - Offline a cpu
> - Load the lpfc driver, which auto-discovers NVMe devices. For a new
>   nvme device, the lpfc/nvme_fc transport can request up to
>   num_online_cpus() worth of nr_hw_queues. The target in this case
>   allowed at least that many nvme queues.
> The system encountered the following crash:
>
> BUG: unable to handle kernel paging request at 00003659d33953a8
> ...
> Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
> RIP: 0010:blk_mq_get_request+0x21d/0x3c0
> ...
> blk_mq_alloc_request_hctx+0xef/0x140
> nvme_alloc_request+0x32/0x80 [nvme_core]
> __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
> nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
> nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
> nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
> nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
> ...
>
> There was a commit a while ago to simplify queue mapping which replaced
> the use of cpumask_first() with cpumask_first_and(). The issue is that if
> cpumask_first_and() does not find any _intersecting_ cpus, it returns
> nr_cpu_ids. nr_cpu_ids is not a valid index for the per_cpu_ptr lookup
> done in __blk_mq_get_ctx().
>
> Considered reverting back to cpumask_first(), but instead followed the
> logic in blk_mq_first_mapped_cpu() and check for nr_cpu_ids before
> calling cpumask_first().
>
> Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
> Signed-off-by: Shagun Agrawal <shagun.agrawal@broadcom.com>
> Signed-off-by: James Smart <jsmart2021@gmail.com>
> CC: Christoph Hellwig <hch@lst.de>
> ---
>  block/blk-mq.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8538dc415499..0b06b4ea57f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>                  return ERR_PTR(-EXDEV);
>          }
>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +        if (cpu >= nr_cpu_ids)
> +                cpu = cpumask_first(alloc_data.hctx->cpumask);

The first cpu may be offline too, and then a kernel warning or timeout may be triggered later when this allocated request is dispatched.

To be honest, blk_mq_alloc_request_hctx() is really a weird interface, given that the hctx may become dead just as blk_mq_alloc_request_hctx() is being called.

Given that only NVMe FC/RDMA use this interface, could you provide some background on this kind of usage?

The normal usage is that the user doesn't specify the hctx to allocate the request from, since blk-mq can figure out which hctx to use via the queue mapping. Just wondering why NVMe FC/RDMA can't do it that way?

Thanks,
Ming Lei

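To make the failure mode concrete, here is a small standalone userspace model of the masks involved. The plain bitmaps and the `mask_first`/`mask_first_and` helpers are stand-ins for the kernel's cpumask API, with `NR_CPU_IDS` playing the role of nr_cpu_ids; none of this is kernel code.

```c
#include <stdio.h>

#define NR_CPU_IDS 8   /* stand-in for nr_cpu_ids */

/* Toy stand-ins for cpumask_first()/cpumask_first_and(): masks are plain
 * bitmaps, and "no bit set" is reported as NR_CPU_IDS, just like the real
 * helpers report nr_cpu_ids. */
static unsigned int mask_first(unsigned long mask)
{
	for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return NR_CPU_IDS;
}

static unsigned int mask_first_and(unsigned long a, unsigned long b)
{
	return mask_first(a & b);
}

int main(void)
{
	unsigned long hctx_mask   = 1UL << 2;           /* hctx mapped to cpu 2 only */
	unsigned long online_mask = ~(1UL << 2) & 0xff; /* ... and cpu 2 is offline  */

	unsigned int cpu = mask_first_and(hctx_mask, online_mask);
	printf("cpumask_first_and() -> %u (out of range, would index per-cpu data)\n", cpu);

	/* The patch's fallback: pick any mapped cpu rather than indexing
	 * per-cpu data with an out-of-range value. */
	if (cpu >= NR_CPU_IDS)
		cpu = mask_first(hctx_mask);
	printf("after fallback      -> %u\n", cpu);
	return 0;
}
```

Built with `cc model.c && ./a.out`, this prints 8 (the out-of-range sentinel) for the unpatched lookup and 2 after the fallback, which is exactly the distinction the two added lines make.
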
On 10/24/19 3:28 AM, Ming Lei wrote:
> On Thu, Oct 24, 2019 at 4:53 PM James Smart <jsmart2021@gmail.com> wrote:
>>
>> During the following test scenario:
>> - Offline a cpu
>> - Load the lpfc driver, which auto-discovers NVMe devices. For a new
>>   nvme device, the lpfc/nvme_fc transport can request up to
>>   num_online_cpus() worth of nr_hw_queues. The target in this case
>>   allowed at least that many nvme queues.
>> The system encountered the following crash:
>>
>> BUG: unable to handle kernel paging request at 00003659d33953a8
>> ...
>> Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
>> RIP: 0010:blk_mq_get_request+0x21d/0x3c0
>> ...
>> blk_mq_alloc_request_hctx+0xef/0x140
>> nvme_alloc_request+0x32/0x80 [nvme_core]
>> __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
>> nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
>> nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
>> nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
>> nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
>> ...
>>
>> There was a commit a while ago to simplify queue mapping which replaced
>> the use of cpumask_first() with cpumask_first_and(). The issue is that if
>> cpumask_first_and() does not find any _intersecting_ cpus, it returns
>> nr_cpu_ids. nr_cpu_ids is not a valid index for the per_cpu_ptr lookup
>> done in __blk_mq_get_ctx().
>>
>> Considered reverting back to cpumask_first(), but instead followed the
>> logic in blk_mq_first_mapped_cpu() and check for nr_cpu_ids before
>> calling cpumask_first().
>>
>> Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
>> Signed-off-by: Shagun Agrawal <shagun.agrawal@broadcom.com>
>> Signed-off-by: James Smart <jsmart2021@gmail.com>
>> CC: Christoph Hellwig <hch@lst.de>
>> ---
>>  block/blk-mq.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8538dc415499..0b06b4ea57f1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>                  return ERR_PTR(-EXDEV);
>>          }
>>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>> +        if (cpu >= nr_cpu_ids)
>> +                cpu = cpumask_first(alloc_data.hctx->cpumask);
>
> The first cpu may be offline too, and then a kernel warning or timeout may be triggered later when this allocated request is dispatched.
>
> To be honest, blk_mq_alloc_request_hctx() is really a weird interface, given that the hctx may become dead just as blk_mq_alloc_request_hctx() is being called.
>
> Given that only NVMe FC/RDMA use this interface, could you provide some background on this kind of usage?
>
> The normal usage is that the user doesn't specify the hctx to allocate the request from, since blk-mq can figure out which hctx to use via the queue mapping. Just wondering why NVMe FC/RDMA can't do it that way?

Fully agree, it'd be much nicer if that weird interface could just die.

On 10/24/2019 6:02 AM, Jens Axboe wrote:
> On 10/24/19 3:28 AM, Ming Lei wrote:
>> The normal usage is that the user doesn't specify the hctx to allocate the request from, since blk-mq can figure out which hctx to use via the queue mapping. Just wondering why NVMe FC/RDMA can't do it that way?
>
> Fully agree, it'd be much nicer if that weird interface could just die.

Well, it all depends on what you think a hctx corresponds to, which relates to why the caller originally set nr_hw_queues to what it did. I think it is reasonable for the caller to say: this must go via this specific context.

To the nvme fabric transports (rdma, fc, tcp) the hctx corresponds to the nvme controller queue to issue a request on. In the single case where the hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path.

Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.

-- james

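For readers less familiar with the fabrics code, the shape of what James describes is roughly the following. This is a simplified sketch, not the actual nvme_alloc_request()/nvmf_connect_io_queue() code; the function name, op, and flags chosen here are illustrative.

```c
#include <linux/blk-mq.h>

/*
 * Simplified sketch (not actual driver code): the fabrics transports map
 * hctx index N to controller I/O queue N + 1, and a controller queue only
 * becomes usable after a Connect command has completed on it.  So the
 * Connect for controller queue 'qid' has to be allocated against hctx
 * 'qid - 1', which is what forces the blk_mq_alloc_request_hctx() usage.
 */
static struct request *connect_alloc_rq(struct request_queue *connect_q,
					unsigned int qid)
{
	/* op and flags are illustrative; the real path uses a reserved tag */
	return blk_mq_alloc_request_hctx(connect_q, REQ_OP_DRV_OUT,
					 BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
					 qid - 1);
}
```
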
On Thu, Oct 24, 2019 at 11:53:26AM -0700, James Smart wrote:
> On 10/24/2019 6:02 AM, Jens Axboe wrote:
>> On 10/24/19 3:28 AM, Ming Lei wrote:
>>> The normal usage is that the user doesn't specify the hctx to allocate the request from, since blk-mq can figure out which hctx to use via the queue mapping. Just wondering why NVMe FC/RDMA can't do it that way?
>>
>> Fully agree, it'd be much nicer if that weird interface could just die.
>
> Well, it all depends on what you think a hctx corresponds to, which relates to why the caller originally set nr_hw_queues to what it did. I think it is reasonable for the caller to say: this must go via this specific context.

Usually MQ is only for handling IO efficiently, and we seldom use MQ for management purposes; for example, so far the admin queue is always a single queue.

> To the nvme fabric transports (rdma, fc, tcp) the hctx corresponds to the nvme controller queue to issue a request on. In the single case where the

Could you explain a bit what the purpose of the nvme controller queue is here?

> hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path. Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.

In nvmf_connect_io_queue(), 'qid' has been encoded into the instance of 'struct nvme_command', which means the nvme controller should be able to identify the specified queue by parsing the command. So I still don't understand why you have to submit the command via the specified queue.

Thanks,
Ming

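Ming's point about the qid living in the command itself can be illustrated with a rough paraphrase of what nvmf_connect_io_queue() builds. The field and enum names follow include/linux/nvme.h, but this helper is a simplified illustration, not the actual fabrics code.

```c
#include <linux/nvme.h>
#include <linux/string.h>

/*
 * Rough paraphrase (not the real nvmf_connect_io_queue()): the queue being
 * connected is named inside the Connect command itself, independent of
 * which hw queue ends up carrying the command to the controller.
 */
static void fill_connect_cmd(struct nvme_command *cmd, u16 qid, u16 sqsize)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->connect.opcode = nvme_fabrics_command;
	cmd->connect.fctype = nvme_fabrics_type_connect;
	cmd->connect.qid = cpu_to_le16(qid);	/* which controller queue */
	cmd->connect.sqsize = cpu_to_le16(sqsize);
}
```
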
>> hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path. Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.
>
> In nvmf_connect_io_queue(), 'qid' has been encoded into the instance of 'struct nvme_command', which means the nvme controller should be able to identify the specified queue by parsing the command. So I still don't understand why you have to submit the command via the specified queue.

The connect command must be sent on the queue that it is connecting; the qid tells the controller the id of the queue, but the controller still expects the connect to be issued on the queue that it is designated to connect (or rather initialize).

In queue_rq we take the queue from hctx->driver_data and use it to issue the command. The connect is different in that it is invoked from a context that is not necessarily running on a cpu that maps to this specific hctx. So in essence what is needed is a tag from the specific queue's tags, without consideration of the running cpu.

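What Sagi describes, the hctx carrying the transport queue in its driver_data, looks roughly like the sketch below. The xport_* names and the struct are invented for illustration; the real bindings live in the nvme_fc/nvme_rdma .init_hctx and .queue_rq callbacks.

```c
#include <linux/blk-mq.h>

/* Invented stand-in for the per-queue state the fabrics transports keep. */
struct xport_queue {
	int qid;
	/* ... transport state for this controller queue ... */
};

/* Each hctx gets bound to exactly one transport queue at init time
 * (here 'data' is assumed to be the driver's queue array) ... */
static int xport_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			   unsigned int hctx_idx)
{
	struct xport_queue *queues = data;

	/* queue 0 is the admin queue, so I/O hctx N maps to queue N + 1 */
	hctx->driver_data = &queues[hctx_idx + 1];
	return 0;
}

/* ... so the hctx a request was allocated from decides which controller
 * queue it is sent on, including for the Connect command itself. */
static blk_status_t xport_queue_rq(struct blk_mq_hw_ctx *hctx,
				   const struct blk_mq_queue_data *bd)
{
	struct xport_queue *queue = hctx->driver_data;

	blk_mq_start_request(bd->rq);
	/* hand bd->rq to the transport on 'queue' here */
	(void)queue;
	return BLK_STS_OK;
}
```
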
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8538dc415499..0b06b4ea57f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>                  return ERR_PTR(-EXDEV);
>          }
>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +        if (cpu >= nr_cpu_ids)
> +                cpu = cpumask_first(alloc_data.hctx->cpumask);

I'd just drop the cpu_online_mask test; we don't care about the cpu being online, as this request is running already.

On Fri, Oct 25, 2019 at 01:26:46PM -0700, Sagi Grimberg wrote:
>>> hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path. Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.
>>
>> In nvmf_connect_io_queue(), 'qid' has been encoded into the instance of 'struct nvme_command', which means the nvme controller should be able to identify the specified queue by parsing the command. So I still don't understand why you have to submit the command via the specified queue.
>
> The connect command must be sent on the queue that it is connecting; the qid tells the controller the id of the queue, but the controller still expects the connect to be issued on the queue that it is designated to connect (or rather initialize).
>
> In queue_rq we take the queue from hctx->driver_data and use it to issue the command. The connect is different in that it is invoked from a context that is not necessarily running on a cpu that maps to this specific hctx. So in essence what is needed is a tag from the specific queue's tags, without consideration of the running cpu.

OK, got it.

If nvmf_connect_io_queue() is only run before setting up the IO queues, the shared tag problem could be solved easily, for example by using a standalone tagset?

Thanks,
Ming

>>>> hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path. Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.
>>>
>>> In nvmf_connect_io_queue(), 'qid' has been encoded into the instance of 'struct nvme_command', which means the nvme controller should be able to identify the specified queue by parsing the command. So I still don't understand why you have to submit the command via the specified queue.
>>
>> The connect command must be sent on the queue that it is connecting; the qid tells the controller the id of the queue, but the controller still expects the connect to be issued on the queue that it is designated to connect (or rather initialize).
>>
>> In queue_rq we take the queue from hctx->driver_data and use it to issue the command. The connect is different in that it is invoked from a context that is not necessarily running on a cpu that maps to this specific hctx. So in essence what is needed is a tag from the specific queue's tags, without consideration of the running cpu.
>
> OK, got it.
>
> If nvmf_connect_io_queue() is only run before setting up the IO queues, the shared tag problem could be solved easily, for example by using a standalone tagset?

Not sure what you mean exactly...

Also, keep in mind that this sequence also runs on reconnects, where we already have our tagset allocated (potentially with pending requests).

On Fri, Oct 25, 2019 at 03:33:15PM -0700, Sagi Grimberg wrote:
>>>>> hctx is specified explicitly, it is the first command on a new nvme controller queue. The command *must* be issued on the queue it is to initialize (this is different from pci nvme). The hctx is specified so the correct nvme queue is selected when the command comes down the request path. Saying "don't do that" means one of the following: a) snooping every rq on the request path to spot initialization ios and move them to the right queue; or b) creating a duplicate non-blk-mq request path for this one initialization io. Both of those are ugly.
>>>>
>>>> In nvmf_connect_io_queue(), 'qid' has been encoded into the instance of 'struct nvme_command', which means the nvme controller should be able to identify the specified queue by parsing the command. So I still don't understand why you have to submit the command via the specified queue.
>>>
>>> The connect command must be sent on the queue that it is connecting; the qid tells the controller the id of the queue, but the controller still expects the connect to be issued on the queue that it is designated to connect (or rather initialize).
>>>
>>> In queue_rq we take the queue from hctx->driver_data and use it to issue the command. The connect is different in that it is invoked from a context that is not necessarily running on a cpu that maps to this specific hctx. So in essence what is needed is a tag from the specific queue's tags, without consideration of the running cpu.
>>
>> OK, got it.
>>
>> If nvmf_connect_io_queue() is only run before setting up the IO queues, the shared tag problem could be solved easily, for example by using a standalone tagset?
>
> Not sure what you mean exactly...
>
> Also, keep in mind that this sequence also runs on reconnects, where we already have our tagset allocated (potentially with pending requests).

I just found that the connect command is always submitted via the unique reserved tag 0, so it has nothing to do with IO requests any more.

You can use the reserved tag 0 for the connect command as before just by not sharing the tagset between the connect queue and the IO queues. The detailed idea is:

1) still reserve 1 tag for the connect command in the IO tagset.

2) in each driver, create a conn_tagset for .connect_q only, and create the .connect_q from this 'conn_tagset'; set conn_tagset.nr_hw_queues to 1 and conn_tagset.queue_depth to 'ctrl.queue_count - 1'.

3) in .queue_rq of conn_tagset.ops:
   - parse the index of the queue to be connected from nvme_command.connect.qid
   - set the connect command's tag to 0
   - then do everything else just like before

Thanks,
Ming

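A minimal sketch of what the proposal above might look like follows. Every name here (conn_mq_ops, conn_init_tagset, xport_submit_connect) is invented for illustration, error handling is omitted, and it assumes ->connect_q's queuedata points at the controller; treat it as the shape of the idea rather than working driver code.

```c
#include <linux/blk-mq.h>
#include <linux/nvme.h>
#include <linux/string.h>
#include "nvme.h"	/* drivers/nvme/host/nvme.h, for nvme_req() */

/* Hypothetical transport hook: issue 'rq' as the Connect for queue 'qid'. */
blk_status_t xport_submit_connect(void *ctrl, u16 qid, struct request *rq);

/* Step 3 of the proposal: recover the target controller queue from the
 * Connect command itself rather than from hctx->driver_data. */
static blk_status_t conn_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nvme_command *cmd = nvme_req(bd->rq)->cmd;
	u16 qid = le16_to_cpu(cmd->connect.qid);

	blk_mq_start_request(bd->rq);
	/* queuedata is assumed to be the controller, set at connect_q creation */
	return xport_submit_connect(hctx->queue->queuedata, qid, bd->rq);
}

static const struct blk_mq_ops conn_mq_ops = {
	.queue_rq = conn_queue_rq,
};

/* Step 2: a standalone tag set backing only ->connect_q: a single hw queue,
 * and one tag per I/O queue that may need connecting. */
static int conn_init_tagset(struct blk_mq_tag_set *set, unsigned int queue_count)
{
	memset(set, 0, sizeof(*set));
	set->ops = &conn_mq_ops;
	set->nr_hw_queues = 1;
	set->queue_depth = queue_count - 1;
	set->cmd_size = sizeof(struct nvme_request);
	set->numa_node = NUMA_NO_NODE;
	return blk_mq_alloc_tag_set(set);
}
```

The point of the single-hw-queue tag set is that tag allocation for a Connect no longer depends on any per-cpu software context tied to the queue being connected, which is what blk_mq_alloc_request_hctx() currently has to fight with.
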
```diff
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8538dc415499..0b06b4ea57f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 	rq = blk_mq_get_request(q, NULL, &alloc_data);
```