Message ID | 20221108150252.2123727-12-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Series | [01/12] nvme-pci: don't call nvme_init_ctrl_finish from nvme_passthru_end |
Nice

On 11/8/22 17:02, Christoph Hellwig wrote:
> nvme_reset_work is a little fragile as it needs to handle both resetting
> a live controller and initializing one during probe. Split out the initial
> probe and open code it in nvme_probe and leave nvme_reset_work to just do
> the live controller reset.
>
> This fixes a recently introduced bug where nvme_dev_disable causes a NULL
> pointer dereference in blk_mq_quiesce_tagset because the tagset pointer
> is not set when the reset state is entered directly from the new state.
> The separate probe code can skip the reset state and probe directly and
> fixes this.
>
> To make sure the system isn't single threaded on enabling nvme
> controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
> structure so that the driver core probes in parallel.
>
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/pci.c | 139 ++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1b3c96a4b7c90..1c8c70767cb8a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
[...]
> -        if (dev->online_queues > 1) {
> -            nvme_pci_alloc_tag_set(dev);
> -            nvme_dbbuf_set(dev);
> -        } else {
> -            dev_warn(dev->ctrl.device, "IO queues not created\n");
> -        }
> +        dev_warn(dev->ctrl.device, "IO queues lost\n");
> +        nvme_mark_namespaces_dead(&dev->ctrl);
> +        nvme_start_queues(&dev->ctrl);
> +        nvme_remove_namespaces(&dev->ctrl);

Is this needed? isn't nvme_remove coming soon?
In fact, shouldn't all these calls be in nvme_remove?
On Wed, Nov 09, 2022 at 05:14:02AM +0200, Sagi Grimberg wrote:
>> -        if (dev->online_queues > 1) {
>> -            nvme_pci_alloc_tag_set(dev);
>> -            nvme_dbbuf_set(dev);
>> -        } else {
>> -            dev_warn(dev->ctrl.device, "IO queues not created\n");
>> -        }
>> +        dev_warn(dev->ctrl.device, "IO queues lost\n");
>> +        nvme_mark_namespaces_dead(&dev->ctrl);
>> +        nvme_start_queues(&dev->ctrl);
>> +        nvme_remove_namespaces(&dev->ctrl);
>
> Is this needed? isn't nvme_remove coming soon?
> In fact, shouldn't all these calls be in nvme_remove?

This handles the case where a controller does not have I/O queues.  For
controllers that never had them (e.g. admin controllers) none of the three
calls above is needed, but they deal with the case where a controller had
queues, but they are going away.  I'm not sure if that can happen, but it
keeps the behavior of the existing code.  Keith wrote it to deal with Intel
controllers that can be in a degraded state where having the admin queue
live even without I/O queues allows updating the firmware, so he might know
more.
Hi Christoph,

On Tue, 2022-11-08 at 16:02 +0100, Christoph Hellwig wrote:
> nvme_reset_work is a little fragile as it needs to handle both resetting
> a live controller and initializing one during probe. Split out the initial
> probe and open code it in nvme_probe and leave nvme_reset_work to just do
> the live controller reset.
>
> This fixes a recently introduced bug where nvme_dev_disable causes a NULL
> pointer dereference in blk_mq_quiesce_tagset because the tagset pointer
> is not set when the reset state is entered directly from the new state.
> The separate probe code can skip the reset state and probe directly and
> fixes this.
>
> To make sure the system isn't single threaded on enabling nvme
> controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
> structure so that the driver core probes in parallel.
>
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/pci.c | 139 ++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 56 deletions(-)

I have successfully tested the patch series as proposed here on top of
next-20220908. The small test script that I used to expose the race
condition in my initial bug report
https://lore.kernel.org/linux-nvme/20221108091609.1020-1-hdanton@sina.com/T/#t
no longer reproduced the kernel panic. Even repeated
unbind/bind/remove/rescan cycles worked without any issue.

Out of curiosity I ran a little traffic to the NVMe drive after the
repeated cycles, too. No issues.

While the patch series did apply fine on next-20220909, I was unable to do
any testing with that, as it had different severe issues to boot.

So feel free to add my
Tested-by: Gerd Bayer <gbayer@linux.ibm.com>
for the whole series.

Thank you,
Gerd Bayer
On Tue, Nov 08, 2022 at 04:02:51PM +0100, Christoph Hellwig wrote:
> nvme_reset_work is a little fragile as it needs to handle both resetting
> a live controller and initializing one during probe. Split out the initial
> probe and open code it in nvme_probe and leave nvme_reset_work to just do
> the live controller reset.

By enabling the controller in probe, you are blocking subsequent
controllers from probing in parallel. Some devices take a very long time to
complete enable, so serializing this process may significantly increase a
system reset time for one with even a modest number of nvme drives.
On Tue, Nov 08, 2022 at 04:02:51PM +0100, Christoph Hellwig wrote:
> To make sure the system isn't single threaded on enabling nvme
> controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
> structure so that the driver core probes in parallel.

Please ignore my previous message; I replied before getting to this part. :)
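For readers unfamiliar with the flag: PROBE_PREFER_ASYNCHRONOUS is a generic
field of struct device_driver, not something NVMe-specific. Below is a
minimal sketch of a PCI driver opting into asynchronous probing; the driver
name, ID table, and callbacks are placeholders for illustration, not code
from the patch.

// SPDX-License-Identifier: GPL-2.0
/*
 * Sketch only: a PCI driver that asks the driver core to probe its devices
 * asynchronously, so several controllers can be enabled in parallel instead
 * of serializing behind one slow device.
 */
#include <linux/module.h>
#include <linux/pci.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* enable the device, allocate queues, register block devices, ... */
        return 0;
}

static void example_remove(struct pci_dev *pdev)
{
        /* tear down whatever example_probe() set up */
}

static const struct pci_device_id example_ids[] = {
        { }     /* terminating entry; real IDs would go before it */
};

static struct pci_driver example_driver = {
        .name           = "example",
        .id_table       = example_ids,
        .probe          = example_probe,
        .remove         = example_remove,
        .driver         = {
                /* let the driver core probe devices in parallel */
                .probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
};
module_pci_driver(example_driver);

MODULE_LICENSE("GPL");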
On Wed, Nov 09, 2022 at 07:31:19AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 09, 2022 at 05:14:02AM +0200, Sagi Grimberg wrote:
> >> -        if (dev->online_queues > 1) {
> >> -            nvme_pci_alloc_tag_set(dev);
> >> -            nvme_dbbuf_set(dev);
> >> -        } else {
> >> -            dev_warn(dev->ctrl.device, "IO queues not created\n");
> >> -        }
> >> +        dev_warn(dev->ctrl.device, "IO queues lost\n");
> >> +        nvme_mark_namespaces_dead(&dev->ctrl);
> >> +        nvme_start_queues(&dev->ctrl);
> >> +        nvme_remove_namespaces(&dev->ctrl);
> >
> > Is this needed? isn't nvme_remove coming soon?
> > In fact, shouldn't all these calls be in nvme_remove?
>
> This handles the case where a controller does not have I/O queues.  For
> controllers that never had them (e.g. admin controllers) none of the
> three calls above is needed, but they deal with the case where a
> controller had queues, but they are going away.  I'm not sure if that
> can happen, but it keeps the behavior of the existing code.  Keith wrote
> it to deal with Intel controllers that can be in a degraded state where
> having the admin queue live even without I/O queues allows updating the
> firmware, so he might know more.

Right, firmware assert and other types of errors can put the controller
into a degraded state with only an admin-queue when it previously had
working IO queues. Keeping the admin queue active allows an admin to pull
logs for their bug reports.
On 2022/11/8 23:02, Christoph Hellwig wrote:
> nvme_reset_work is a little fragile as it needs to handle both resetting
> a live controller and initializing one during probe. Split out the initial
> probe and open code it in nvme_probe and leave nvme_reset_work to just do
> the live controller reset.
>
> This fixes a recently introduced bug where nvme_dev_disable causes a NULL
> pointer dereference in blk_mq_quiesce_tagset because the tagset pointer
> is not set when the reset state is entered directly from the new state.
> The separate probe code can skip the reset state and probe directly and
> fixes this.
>
> To make sure the system isn't single threaded on enabling nvme
> controllers, set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver
> structure so that the driver core probes in parallel.
>
> Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
> Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/pci.c | 139 ++++++++++++++++++++++++----------------
>  1 file changed, 83 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1b3c96a4b7c90..1c8c70767cb8a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
[...]
> +        dev_warn(dev->ctrl.device, "IO queues lost\n");
> +        nvme_mark_namespaces_dead(&dev->ctrl);
> +        nvme_start_queues(&dev->ctrl);
> +        nvme_remove_namespaces(&dev->ctrl);
> +        nvme_free_tagset(dev);

nvme_free_tagset is not necessary when IO queues are lost.
nvme_free_tagset can be called in nvme_pci_free_ctrl.
If we call nvme_free_tagset here, nvme_dev_disable will still cause a NULL
pointer reference in blk_mq_quiesce_tagset.
On Thu, Nov 10, 2022 at 11:17:22AM +0800, Chao Leng wrote:
> nvme_free_tagset is not necessary when IO queues are lost.
> nvme_free_tagset can be called in nvme_pci_free_ctrl.

Yes, it could, but that isn't really required here.

> If we call nvme_free_tagset here, nvme_dev_disable will still cause a
> NULL pointer reference in blk_mq_quiesce_tagset.

True.  I have another series as a follow-up to sort out some more of the
tagset mess.
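For context, the NULL pointer dereference being discussed comes from
quiescing an I/O tag set that was never allocated. The fragment below is a
minimal sketch of the kind of guard that avoids it; the helper name and its
placement are illustrative assumptions, not the follow-up series mentioned
above.

#include <linux/blk-mq.h>
#include "nvme.h"       /* struct nvme_ctrl; assumed in-tree driver header */

/*
 * Sketch only: quiesce a controller's I/O queues while tolerating the case
 * where the I/O tag set was never allocated (e.g. a controller that came up
 * degraded with only an admin queue). This illustrates the problem being
 * discussed, not the actual upstream fix.
 */
static void example_quiesce_io_queues(struct nvme_ctrl *ctrl)
{
        if (ctrl->tagset)
                blk_mq_quiesce_tagset(ctrl->tagset);
}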
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1b3c96a4b7c90..1c8c70767cb8a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2821,15 +2821,7 @@ static void nvme_reset_work(struct work_struct *work)
 	result = nvme_pci_enable(dev);
 	if (result)
 		goto out_unlock;
-
-	if (!dev->ctrl.admin_q) {
-		result = nvme_pci_alloc_admin_tag_set(dev);
-		if (result)
-			goto out_unlock;
-	} else {
-		nvme_start_admin_queue(&dev->ctrl);
-	}
-
+	nvme_start_admin_queue(&dev->ctrl);
 	mutex_unlock(&dev->shutdown_lock);
 
 	/*
@@ -2854,9 +2846,6 @@
 		 */
 		memset(dev->dbbuf_dbs, 0, nvme_dbbuf_size(dev));
 		memset(dev->dbbuf_eis, 0, nvme_dbbuf_size(dev));
-	} else {
-		if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP)
-			nvme_dbbuf_dma_alloc(dev);
 	}
 
 	if (dev->ctrl.hmpre) {
@@ -2869,37 +2858,23 @@
 	if (result)
 		goto out;
 
-	if (dev->ctrl.tagset) {
-		/*
-		 * This is a controller reset and we already have a tagset.
-		 * Freeze and update the number of I/O queues as thos might have
-		 * changed. If there are no I/O queues left after this reset,
-		 * keep the controller around but remove all namespaces.
-		 */
-		if (dev->online_queues > 1) {
-			nvme_start_queues(&dev->ctrl);
-			nvme_wait_freeze(&dev->ctrl);
-			nvme_pci_update_nr_queues(dev);
-			nvme_dbbuf_set(dev);
-			nvme_unfreeze(&dev->ctrl);
-		} else {
-			dev_warn(dev->ctrl.device, "IO queues lost\n");
-			nvme_mark_namespaces_dead(&dev->ctrl);
-			nvme_start_queues(&dev->ctrl);
-			nvme_remove_namespaces(&dev->ctrl);
-			nvme_free_tagset(dev);
-		}
+	/*
+	 * Freeze and update the number of I/O queues as thos might have
+	 * changed. If there are no I/O queues left after this reset, keep the
+	 * controller around but remove all namespaces.
+	 */
+	if (dev->online_queues > 1) {
+		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_pci_update_nr_queues(dev);
+		nvme_dbbuf_set(dev);
+		nvme_unfreeze(&dev->ctrl);
 	} else {
-		/*
-		 * First probe. Still allow the controller to show up even if
-		 * there are no namespaces.
-		 */
-		if (dev->online_queues > 1) {
-			nvme_pci_alloc_tag_set(dev);
-			nvme_dbbuf_set(dev);
-		} else {
-			dev_warn(dev->ctrl.device, "IO queues not created\n");
-		}
+		dev_warn(dev->ctrl.device, "IO queues lost\n");
+		nvme_mark_namespaces_dead(&dev->ctrl);
+		nvme_start_queues(&dev->ctrl);
+		nvme_remove_namespaces(&dev->ctrl);
+		nvme_free_tagset(dev);
 	}
 
 	/*
@@ -3055,15 +3030,6 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev)
 	return 0;
 }
 
-static void nvme_async_probe(void *data, async_cookie_t cookie)
-{
-	struct nvme_dev *dev = data;
-
-	flush_work(&dev->ctrl.reset_work);
-	flush_work(&dev->ctrl.scan_work);
-	nvme_put_ctrl(&dev->ctrl);
-}
-
 static struct nvme_dev *nvme_pci_alloc_ctrl(struct pci_dev *pdev,
 		const struct pci_device_id *id)
 {
@@ -3155,12 +3121,72 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto out_release_prp_pools;
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
+
+	result = nvme_pci_enable(dev);
+	if (result)
+		goto out_release_iod_mempool;
+
+	result = nvme_pci_alloc_admin_tag_set(dev);
+	if (result)
+		goto out_disable;
+
+	/*
+	 * Mark the controller as connecting before sending admin commands to
+	 * allow the timeout handler to do the right thing.
+	 */
+	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) {
+		dev_warn(dev->ctrl.device,
+			"failed to mark controller CONNECTING\n");
+		result = -EBUSY;
+		goto out_disable;
+	}
+
+	result = nvme_init_ctrl_finish(&dev->ctrl, false);
+	if (result)
+		goto out_disable;
+
+	if (dev->ctrl.oacs & NVME_CTRL_OACS_DBBUF_SUPP)
+		nvme_dbbuf_dma_alloc(dev);
+
+	if (dev->ctrl.hmpre) {
+		result = nvme_setup_host_mem(dev);
+		if (result < 0)
+			goto out_disable;
+	}
+
+	result = nvme_setup_io_queues(dev);
+	if (result)
+		goto out_disable;
+
+	if (dev->online_queues > 1) {
+		nvme_pci_alloc_tag_set(dev);
+		nvme_dbbuf_set(dev);
+	} else {
+		dev_warn(dev->ctrl.device, "IO queues not created\n");
+	}
+
+	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
+		dev_warn(dev->ctrl.device,
+			"failed to mark controller live state\n");
+		result = -ENODEV;
+		goto out_disable;
+	}
+
 	pci_set_drvdata(pdev, dev);
 
-	nvme_reset_ctrl(&dev->ctrl);
-	async_schedule(nvme_async_probe, dev);
+	nvme_start_ctrl(&dev->ctrl);
+	nvme_put_ctrl(&dev->ctrl);
 	return 0;
 
+out_disable:
+	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
+	nvme_dev_disable(dev, true);
+	nvme_free_host_mem(dev);
+	nvme_dev_remove_admin(dev);
+	nvme_dbbuf_dma_free(dev);
+	nvme_free_queues(dev, 0);
+out_release_iod_mempool:
+	mempool_destroy(dev->iod_mempool);
 out_release_prp_pools:
 	nvme_release_prp_pools(dev);
 out_dev_unmap:
@@ -3556,11 +3582,12 @@ static struct pci_driver nvme_driver = {
 	.probe		= nvme_probe,
 	.remove		= nvme_remove,
 	.shutdown	= nvme_shutdown,
-#ifdef CONFIG_PM_SLEEP
 	.driver		= {
-		.pm	= &nvme_dev_pm_ops,
-	},
+		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
+#ifdef CONFIG_PM_SLEEP
+		.pm		= &nvme_dev_pm_ops,
 #endif
+	},
 	.sriov_configure = pci_sriov_configure_simple,
 	.err_handler	= &nvme_err_handler,
 };
nvme_reset_work is a little fragile as it needs to handle both resetting a
live controller and initializing one during probe. Split out the initial
probe and open code it in nvme_probe and leave nvme_reset_work to just do
the live controller reset.

This fixes a recently introduced bug where nvme_dev_disable causes a NULL
pointer dereference in blk_mq_quiesce_tagset because the tagset pointer is
not set when the reset state is entered directly from the new state. The
separate probe code can skip the reset state and probe directly and fixes
this.

To make sure the system isn't single threaded on enabling nvme controllers,
set the PROBE_PREFER_ASYNCHRONOUS flag in the device_driver structure so
that the driver core probes in parallel.

Fixes: 98d81f0df70c ("nvme: use blk_mq_[un]quiesce_tagset")
Reported-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 139 ++++++++++++++++++++++++----------------
 1 file changed, 83 insertions(+), 56 deletions(-)