Message ID | 20151217143243.GA9654@linutronix.de (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Hi Sebastian,

On Thu, Dec 17, 2015 at 03:32:43PM +0100, Sebastian Andrzej Siewior wrote:
> I start a binary which should flash the FPGA and re-enumerate the PCI-BUS
> and find a new device. It works most of the time. With SLUB debug it
> crashes on each iteration with something like this (compressed output):
>
> | pcieport 0000:00:00.0: AER: Multiple Corrected error received: id=0000
> | Unable to handle kernel paging request for data at address 0x27ef9e3e
> | Faulting instruction address: 0x602f5328
> | Oops: Kernel access of bad area, sig: 11 [#1]
> | Workqueue: events aer_isr
> | GPR24: dd6aa000 6b6b6b6b 605f8378 605f8360 d99b12c0 604fc674 606b1704 d99b12c0
> | NIP [602f5328] pci_walk_bus+0xd4/0x104
>
> Register 25 has the use-after-free magic. As it turns out, the old PCIe
> device is leaving, generates an error before it left, aer_irq() is fired,
> it schedules a work item. What happens now is that free_irq() is
> invoked, all resources are gone *before* the aer_isr() work item is
> completed.
> So to fix this, I flush the workqueue to ensure that there is no more
> work pending.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> Bjorn, this could deserve a stable tag. However it seems to have been
> like that even in v2.6.20.
>
>  drivers/pci/pcie/aer/aerdrv.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 0bf82a20a0fb..7acd27348098 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -282,8 +282,10 @@ static void aer_remove(struct pcie_device *dev)
>
>  	if (rpc) {
>  		/* If register interrupt service, it must be free. */
> -		if (rpc->isr)
> +		if (rpc->isr) {
>  			free_irq(dev->irq, dev);
> +			flush_work(&rpc->dpc_handler);
> +		}
>
>  		wait_event(rpc->wait_release, rpc->prod_idx == rpc->cons_idx);

Your change looks reasonable. But I'm curious about the wait_event()
just below it. That *looks* like it's intended to do the same thing
as your flush_work().

Can you explain why the wait_event() isn't working? If we add the
flush_work(), can we remove the wait_event() stuff?

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Bjorn Helgaas | 2016-01-06 17:27:58 [-0600]:

>Hi Sebastian,

Hi Bjorn,

>Your change looks reasonable. But I'm curious about the wait_event()
>just below it. That *looks* like it's intended to do the same thing
>as your flush_work().

Indeed.

>Can you explain why the wait_event() isn't working? If we add the

aer_isr() invokes get_e_source(), which increments rpc->cons_idx. So the
condition is already true after that, although the function has not
terminated yet: it still goes on to invoke aer_isr_one_error(). That
means if we have one CPU doing the ISR + workqueue task and another CPU
doing the aer_remove() removal thingy, then the latter CPU evaluates the
condition to true and continues the cleanup while the former is still in
aer_isr_one_error(), wondering where the memory went.

>flush_work(), can we remove the wait_event() stuff?

I think so, since its only purpose is to sync against removal, which does
not work on SMP. So let me remove this and the wait_release member.

>Bjorn

Sebastian
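The race described in this reply can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the driver code: `struct rpc_model`, the `model_*` helpers and the `work_running` flag are all hypothetical names, the ring indices are reduced to plain counters, and the two "CPUs" are interleaved by hand in the order the reply describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-root-port state: only the two indices
 * that the wait_event() condition compares, plus a flag that models
 * "the work item is still executing". */
struct rpc_model {
	unsigned int prod_idx;	/* advanced when the IRQ handler queues a source */
	unsigned int cons_idx;	/* advanced when the work item picks a source up */
	bool work_running;	/* true while the work item still uses the device */
};

/* Models the IRQ handler queueing one error source. */
static void model_aer_irq(struct rpc_model *rpc)
{
	rpc->prod_idx++;
}

/* Models the start of the work item: the source is consumed first ... */
static void model_work_begin(struct rpc_model *rpc)
{
	rpc->work_running = true;
	rpc->cons_idx++;
}

/* ... and only later does the per-error handling finish. */
static void model_work_end(struct rpc_model *rpc)
{
	rpc->work_running = false;
}

/* The condition the removal path waits for with wait_event(). */
static bool model_wait_event_passes(const struct rpc_model *rpc)
{
	return rpc->prod_idx == rpc->cons_idx;
}

/* What flush_work() guarantees on return: the work item is finished.
 * (The real call blocks; the model only reports whether it could return.) */
static bool model_flush_work_passes(const struct rpc_model *rpc)
{
	return !rpc->work_running;
}

/* Returns true if the removal path would free resources while the work
 * item is still running, for the chosen synchronisation method. */
static bool model_remove_races(bool use_flush)
{
	struct rpc_model rpc = { 0, 0, false };
	bool may_free, raced;

	model_aer_irq(&rpc);		/* CPU A: interrupt arrives */
	model_work_begin(&rpc);		/* CPU A: work item consumes the source */
	assert(rpc.prod_idx == 1 && rpc.cons_idx == 1);

	/* CPU B: removal path checks whether it may continue. */
	may_free = use_flush ? model_flush_work_passes(&rpc)
			     : model_wait_event_passes(&rpc);
	raced = may_free && rpc.work_running;

	model_work_end(&rpc);		/* CPU A: error handling completes */
	return raced;
}
```

In this model, `model_remove_races(false)` reports a race because the indices already match while the work item is mid-flight, whereas `model_remove_races(true)` does not, which is the argument for replacing the index comparison with flush_work().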
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 0bf82a20a0fb..7acd27348098 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -282,8 +282,10 @@ static void aer_remove(struct pcie_device *dev)

 	if (rpc) {
 		/* If register interrupt service, it must be free. */
-		if (rpc->isr)
+		if (rpc->isr) {
 			free_irq(dev->irq, dev);
+			flush_work(&rpc->dpc_handler);
+		}

 		wait_event(rpc->wait_release, rpc->prod_idx == rpc->cons_idx);
I start a binary which should flash the FPGA and re-enumerate the PCI-BUS
and find a new device. It works most of the time. With SLUB debug it
crashes on each iteration with something like this (compressed output):

| pcieport 0000:00:00.0: AER: Multiple Corrected error received: id=0000
| Unable to handle kernel paging request for data at address 0x27ef9e3e
| Faulting instruction address: 0x602f5328
| Oops: Kernel access of bad area, sig: 11 [#1]
| Workqueue: events aer_isr
| GPR24: dd6aa000 6b6b6b6b 605f8378 605f8360 d99b12c0 604fc674 606b1704 d99b12c0
| NIP [602f5328] pci_walk_bus+0xd4/0x104

Register 25 has the use-after-free magic. As it turns out, the old PCIe
device is leaving, generates an error before it left, aer_irq() is fired,
it schedules a work item. What happens now is that free_irq() is
invoked, all resources are gone *before* the aer_isr() work item is
completed.
So to fix this, I flush the workqueue to ensure that there is no more
work pending.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
Bjorn, this could deserve a stable tag. However it seems to have been
like that even in v2.6.20.

 drivers/pci/pcie/aer/aerdrv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
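For readers unfamiliar with the "magic" mentioned above: with SLUB poisoning enabled, freed objects are filled with the byte 0x6b (POISON_FREE in include/linux/poison.h), so a register holding 6b6b6b6b suggests a pointer was loaded from already-freed memory. A minimal sketch of that check follows; the helper names are hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte that SLUB writes over freed objects when poisoning is enabled
 * (mirrors POISON_FREE from include/linux/poison.h). */
#define MODEL_POISON_FREE 0x6bu

/* Hypothetical helper: true if every byte of a 32-bit register value is the
 * free-poison byte, i.e. the value was probably read from freed memory. */
static bool looks_like_freed_memory(uint32_t reg)
{
	for (int i = 0; i < 4; i++) {
		if (((reg >> (8 * i)) & 0xffu) != MODEL_POISON_FREE)
			return false;
	}
	return true;
}

/* Applies the check to the first two values of the GPR24 line in the oops. */
static void poison_self_check(void)
{
	assert(!looks_like_freed_memory(0xdd6aa000u));	/* GPR24 */
	assert(looks_like_freed_memory(0x6b6b6b6bu));	/* GPR25 */
}
```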