Message ID | 20180409220444.6632-5-keith.busch@intel.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Mon, Apr 09, 2018 at 04:04:44PM -0600, Keith Busch wrote: > The side effects of surprise removal may trigger AER handling. The AER > handling walks the pci topology and may access a pci_dev that is being > freed by the hotplug handler. > > This patch fixes that use-after-free by locking the PCI topology in > the AER handler so it isn't racing with the pciehp removal. > > Since the AER handler now runs under a global PCI lock, the rpc specific > mutex is no longer necessary. > > Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> > Signed-off-by: Keith Busch <keith.busch@intel.com> > --- > drivers/pci/pcie/aer/aerdrv.c | 1 - > drivers/pci/pcie/aer/aerdrv.h | 5 ----- > drivers/pci/pcie/aer/aerdrv_core.c | 4 ++-- > 3 files changed, 2 insertions(+), 8 deletions(-) > > diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c > index 0b2eb88c422b..b88e5e2f3700 100644 > --- a/drivers/pci/pcie/aer/aerdrv.c > +++ b/drivers/pci/pcie/aer/aerdrv.c > @@ -237,7 +237,6 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev) > rpc->rpd = pci_dev_get(dev->port); > kref_init(&rpc->ref); > INIT_WORK(&rpc->dpc_handler, aer_isr); > - mutex_init(&rpc->rpc_mutex); > > /* Use PCIe bus function to store rpc into PCIe device */ > set_service_data(dev, rpc); > diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h > index f886521e2c7b..b90fc5d4cda2 100644 > --- a/drivers/pci/pcie/aer/aerdrv.h > +++ b/drivers/pci/pcie/aer/aerdrv.h > @@ -70,11 +70,6 @@ struct aer_rpc { > * Lock access to Error Status/ID Regs > * and error producer/consumer index > */ > - struct mutex rpc_mutex; /* > - * only one thread could do > - * recovery on the same > - * root port hierarchy > - */ > }; > > struct aer_broadcast_data { > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index e4059d7fa7fa..de210b7439eb 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work) > struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); > struct aer_err_source uninitialized_var(e_src); > > - mutex_lock(&rpc->rpc_mutex); > + pci_lock_rescan_remove(); > while (get_e_source(rpc, &e_src)) > aer_isr_one_error(rpc, &e_src); > - mutex_unlock(&rpc->rpc_mutex); > + pci_unlock_rescan_remove(); I think this needs to be updated after Oza's patches, doesn't it? It looks like this would deadlock if I applied it to my current "next" branch as-is: aer_isr pci_lock_rescan_remove aer_isr_one_error aer_process_err_devices handle_error_source pcie_do_fatal_recovery pci_lock_rescan_remove <-- deadlock > aer_release(rpc); > } > -- > 2.14.3 >
On Tue, Jun 05, 2018 at 05:09:11PM -0500, Bjorn Helgaas wrote: > > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work) > > struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); > > struct aer_err_source uninitialized_var(e_src); > > > > - mutex_lock(&rpc->rpc_mutex); > > + pci_lock_rescan_remove(); > > while (get_e_source(rpc, &e_src)) > > aer_isr_one_error(rpc, &e_src); > > - mutex_unlock(&rpc->rpc_mutex); > > + pci_unlock_rescan_remove(); > > I think this needs to be updated after Oza's patches, doesn't it? > > It looks like this would deadlock if I applied it to my current "next" > branch as-is: > > aer_isr > pci_lock_rescan_remove > aer_isr_one_error > aer_process_err_devices > handle_error_source > pcie_do_fatal_recovery > pci_lock_rescan_remove <-- deadlock > > > aer_release(rpc); > > } Yes, looks like you are right about that. I fully intended to have this rebased on that by now, but nvme issues took way more time than I anticipated. Things appear to have calmed down on that front, and I should be able to rebase appropriately this week (famous last words...).
On Tue, Jun 05, 2018 at 04:18:26PM -0600, Keith Busch wrote: > On Tue, Jun 05, 2018 at 05:09:11PM -0500, Bjorn Helgaas wrote: > > > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work) > > > struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); > > > struct aer_err_source uninitialized_var(e_src); > > > > > > - mutex_lock(&rpc->rpc_mutex); > > > + pci_lock_rescan_remove(); > > > while (get_e_source(rpc, &e_src)) > > > aer_isr_one_error(rpc, &e_src); > > > - mutex_unlock(&rpc->rpc_mutex); > > > + pci_unlock_rescan_remove(); > > > > I think this needs to be updated after Oza's patches, doesn't it? > > > > It looks like this would deadlock if I applied it to my current "next" > > branch as-is: > > > > aer_isr > > pci_lock_rescan_remove > > aer_isr_one_error > > aer_process_err_devices > > handle_error_source > > pcie_do_fatal_recovery > > pci_lock_rescan_remove <-- deadlock > > > > > aer_release(rpc); > > > } > > Yes, looks like you are right about that. > > I fully intended to have this rebased on that by now, but nvme issues > took way more time than I anticipated. Things appear to have calmed down > on that front, and I should be able to rebase appropriately this week > (famous last words...). No worries, it's doubtful that I can still squeeze it into v4.18, so I think it's better if we target this for v4.19. I have some questions unrelated to the rebase: - What does the use-after-free look like to a user? Panic, corruption, etc? It's nice if we have breadcrumbs that connect a symptom to the fix. - I'm not super comfortable with the AER tree traversal in find_source_device(). Obviously this has always been there and is not your issue. But it's dubious when a driver for device X also peeks at devices Y, Z, etc. That always leads to locking issues. - I'm not clear on how pci_bus_sem and pci_rescan_remove_lock should be used. pci_bus_sem is acquired by pci_walk_bus() and a few other PCI core paths. pci_rescan_remove_lock is acquired (via pci_lock_rescan_remove()) by hotplug drivers and a few other add/remove/rescan paths. Most users of pci_walk_bus() do not use pci_lock_rescan_remove(), so it's not clear to me how to decide whether they *all* should, or why this AER path is different from the ones that don't need to. Bjorn
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c index 0b2eb88c422b..b88e5e2f3700 100644 --- a/drivers/pci/pcie/aer/aerdrv.c +++ b/drivers/pci/pcie/aer/aerdrv.c @@ -237,7 +237,6 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev) rpc->rpd = pci_dev_get(dev->port); kref_init(&rpc->ref); INIT_WORK(&rpc->dpc_handler, aer_isr); - mutex_init(&rpc->rpc_mutex); /* Use PCIe bus function to store rpc into PCIe device */ set_service_data(dev, rpc); diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h index f886521e2c7b..b90fc5d4cda2 100644 --- a/drivers/pci/pcie/aer/aerdrv.h +++ b/drivers/pci/pcie/aer/aerdrv.h @@ -70,11 +70,6 @@ struct aer_rpc { * Lock access to Error Status/ID Regs * and error producer/consumer index */ - struct mutex rpc_mutex; /* - * only one thread could do - * recovery on the same - * root port hierarchy - */ }; struct aer_broadcast_data { diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index e4059d7fa7fa..de210b7439eb 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work) struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler); struct aer_err_source uninitialized_var(e_src); - mutex_lock(&rpc->rpc_mutex); + pci_lock_rescan_remove(); while (get_e_source(rpc, &e_src)) aer_isr_one_error(rpc, &e_src); - mutex_unlock(&rpc->rpc_mutex); + pci_unlock_rescan_remove(); aer_release(rpc); }
The side effects of surprise removal may trigger AER handling. The AER handling walks the pci topology and may access a pci_dev that is being freed by the hotplug handler. This patch fixes that use-after-free by locking the PCI topology in the AER handler so it isn't racing with the pciehp removal. Since the AER handler now runs under a global PCI lock, the rpc specific mutex is no longer necessary. Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> Signed-off-by: Keith Busch <keith.busch@intel.com> --- drivers/pci/pcie/aer/aerdrv.c | 1 - drivers/pci/pcie/aer/aerdrv.h | 5 ----- drivers/pci/pcie/aer/aerdrv_core.c | 4 ++-- 3 files changed, 2 insertions(+), 8 deletions(-)