Message ID   | 20180918221501.13112-1-mr.nuke.me@gmail.com (mailing list archive)
-------------|--------------------------------------------------------------------
State        | New, archived
Delegated to | Bjorn Helgaas
Series       | [v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected
ping

On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:
> When a PCI device is gone, we don't want to send IO to it if we can
> avoid it. We expose functionality via the irq_chip structure. As
> users of that structure may not know about the underlying PCI device,
> it's our responsibility to guard against removed devices.
>
> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> .irq_mask/unmask() are not. Guard them for completeness.
>
> For example, surprise removal of a PCIe device triggers teardown. This
> touches the irq_chip ops at some point to disable the interrupts. I/O
> generated here can crash the system on firmware-first machines.
> Not triggering the IO in the first place greatly reduces the
> possibility of the problem occurring.
>
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/pci/msi.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index f2ef896464b3..f31058fd2260 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
>  {
>  	struct msi_desc *desc = irq_data_get_msi_desc(data);
>
> +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> +		return;
> +
>  	if (desc->msi_attrib.is_msix) {
>  		msix_mask_irq(desc, flag);
>  		readl(desc->mask_base);		/* Flush write to device */
I found the same issue:

https://patchwork.ozlabs.org/patch/989272/

Tested-by: Jon Derrick <jonathan.derrick@intel.com>

On Mon, 2018-11-05 at 18:32 -0600, Alex G. wrote:
> ping
>
> On 09/18/2018 05:15 PM, Alexandru Gagniuc wrote:
> > When a PCI device is gone, we don't want to send IO to it if we can
> > avoid it. We expose functionality via the irq_chip structure. As
> > users of that structure may not know about the underlying PCI
> > device, it's our responsibility to guard against removed devices.
> >
> > .irq_write_msi_msg() is already guarded inside
> > __pci_write_msi_msg().
> > .irq_mask/unmask() are not. Guard them for completeness.
> >
> > For example, surprise removal of a PCIe device triggers teardown.
> > This touches the irq_chip ops at some point to disable the
> > interrupts. I/O generated here can crash the system on
> > firmware-first machines. Not triggering the IO in the first place
> > greatly reduces the possibility of the problem occurring.
> >
> > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> > ---
> >  drivers/pci/msi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index f2ef896464b3..f31058fd2260 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data
> > *data, u32 flag)
> >  {
> >  	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >
> > +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> > +		return;
> > +
> >  	if (desc->msi_attrib.is_msix) {
> >  		msix_mask_irq(desc, flag);
> >  		readl(desc->mask_base);		/* Flush write to device */
On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> When a PCI device is gone, we don't want to send IO to it if we can
> avoid it. We expose functionality via the irq_chip structure. As
> users of that structure may not know about the underlying PCI device,
> it's our responsibility to guard against removed devices.
>
> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> .irq_mask/unmask() are not. Guard them for completeness.
>
> For example, surprise removal of a PCIe device triggers teardown. This
> touches the irq_chip ops at some point to disable the interrupts. I/O
> generated here can crash the system on firmware-first machines.
> Not triggering the IO in the first place greatly reduces the
> possibility of the problem occurring.
>
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>

Applied to pci/misc for v4.21, thanks!

> ---
>  drivers/pci/msi.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index f2ef896464b3..f31058fd2260 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
>  {
>  	struct msi_desc *desc = irq_data_get_msi_desc(data);
>
> +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> +		return;
> +
>  	if (desc->msi_attrib.is_msix) {
>  		msix_mask_irq(desc, flag);
>  		readl(desc->mask_base);		/* Flush write to device */
> --
> 2.17.1
[+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
PCI error recovery in general]

On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > When a PCI device is gone, we don't want to send IO to it if we can
> > avoid it. We expose functionality via the irq_chip structure. As
> > users of that structure may not know about the underlying PCI device,
> > it's our responsibility to guard against removed devices.
> >
> > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > .irq_mask/unmask() are not. Guard them for completeness.
> >
> > For example, surprise removal of a PCIe device triggers teardown. This
> > touches the irq_chip ops at some point to disable the interrupts. I/O
> > generated here can crash the system on firmware-first machines.
> > Not triggering the IO in the first place greatly reduces the
> > possibility of the problem occurring.
> >
> > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>
> Applied to pci/misc for v4.21, thanks!

I'm having second thoughts about this.  One thing I'm uncomfortable
with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
instead of systematic, in the sense that I don't know how we convince
ourselves that this (and only this) is the correct place to put it.

Another is that the only place we call pci_dev_set_disconnected() is
in pciehp and acpiphp, so the only "disconnected" case we catch is if
hotplug happens to be involved.  Every MMIO read from the device is an
opportunity to learn whether it is reachable (a read from an
unreachable device typically returns ~0 data), but we don't do
anything at all with those.

The config accessors already check pci_dev_is_disconnected(), so this
patch is really aimed at MMIO accesses.  I think it would be more
robust if we added wrappers for readl() and writel() so we could
notice read errors and avoid future reads and writes.

Two compiled but untested patches below for your comments.  You can
mostly ignore the first; it's just boring interface changes.  The
important part is the second, which adds the wrappers.

These would be an alternative to the (admittedly much shorter) patch
here.  The wrappers are independent of MSI and could potentially be
exported from the PCI core for use by drivers.  I think it would be
better for drivers to use something like this instead of calling
pci_device_is_present() or pci_dev_is_disconnected() directly.

> > ---
> >  drivers/pci/msi.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index f2ef896464b3..f31058fd2260 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> >  {
> >  	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >
> > +	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
> > +		return;
> > +
> >  	if (desc->msi_attrib.is_msix) {
> >  		msix_mask_irq(desc, flag);
> >  		readl(desc->mask_base);		/* Flush write to device */

commit 150346e09edbcaedc520a6d7dec2b16f3a53afa1
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu Nov 8 09:17:51 2018 -0600

    PCI/MSI: Pass pci_dev into IRQ mask interfaces

    Add the struct pci_dev pointer to these interfaces:

      __pci_msix_desc_mask_irq()
      __pci_msi_desc_mask_irq()
      msix_mask_irq()
      msi_mask_irq()

    The pci_dev pointer is currently unused, so there's no functional
    change intended with this patch.  A subsequent patch will use the
    pointer to improve error detection.

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 9f6f392a4461..56bbee2cf761 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -462,9 +462,9 @@ void arch_teardown_msi_irqs(struct pci_dev *pdev)
 		if (!msi->irq)
 			continue;
 		if (msi->msi_attrib.is_msix)
-			__pci_msix_desc_mask_irq(msi, 1);
+			__pci_msix_desc_mask_irq(pdev, msi, 1);
 		else
-			__pci_msi_desc_mask_irq(msi, 1, 1);
+			__pci_msi_desc_mask_irq(pdev, msi, 1, 1);
 		irq_set_msi_desc(msi->irq, NULL);
 		irq_free_desc(msi->irq);
 		msi->msg.address_lo = 0;
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index af24ed50a245..d46ae506e296 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -170,7 +170,8 @@ static inline __attribute_const__ u32 msi_mask(unsigned x)
  * reliably as devices without an INTx disable bit will then generate a
  * level IRQ which will never be cleared.
  */
-u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+u32 __pci_msi_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
+			    u32 mask, u32 flag)
 {
 	u32 mask_bits = desc->masked;
 
@@ -179,15 +180,15 @@ u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
 	mask_bits &= ~mask;
 	mask_bits |= flag;
-	pci_write_config_dword(msi_desc_to_pci_dev(desc), desc->mask_pos,
-			       mask_bits);
+	pci_write_config_dword(dev, desc->mask_pos, mask_bits);
 
 	return mask_bits;
 }
 
-static void msi_mask_irq(struct msi_desc *desc, u32 mask, u32 flag)
+static void msi_mask_irq(struct pci_dev *dev, struct msi_desc *desc, u32 mask,
+			 u32 flag)
 {
-	desc->masked = __pci_msi_desc_mask_irq(desc, mask, flag);
+	desc->masked = __pci_msi_desc_mask_irq(dev, desc, mask, flag);
 }
 
 static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
@@ -203,7 +204,8 @@ static void __iomem *pci_msix_desc_addr(struct msi_desc *desc)
  * file.  This saves a few milliseconds when initialising devices with lots
  * of MSI-X interrupts.
  */
-u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
+u32 __pci_msix_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
+			     u32 flag)
 {
 	u32 mask_bits = desc->masked;
 
@@ -218,21 +220,22 @@ u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag)
 	return mask_bits;
 }
 
-static void msix_mask_irq(struct msi_desc *desc, u32 flag)
+static void msix_mask_irq(struct pci_dev *dev, struct msi_desc *desc, u32 flag)
 {
-	desc->masked = __pci_msix_desc_mask_irq(desc, flag);
+	desc->masked = __pci_msix_desc_mask_irq(dev, desc, flag);
 }
 
 static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
+	struct pci_dev *dev = msi_desc_to_pci_dev(desc);
 
 	if (desc->msi_attrib.is_msix) {
-		msix_mask_irq(desc, flag);
+		msix_mask_irq(dev, desc, flag);
 		readl(desc->mask_base);		/* Flush write to device */
 	} else {
 		unsigned offset = data->irq - desc->irq;
-		msi_mask_irq(desc, 1 << offset, flag << offset);
+		msi_mask_irq(dev, desc, 1 << offset, flag << offset);
 	}
 }
 
@@ -401,7 +404,7 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
 	arch_restore_msi_irqs(dev);
 
 	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &control);
-	msi_mask_irq(entry, msi_mask(entry->msi_attrib.multi_cap),
+	msi_mask_irq(dev, entry, msi_mask(entry->msi_attrib.multi_cap),
 		     entry->masked);
 	control &= ~PCI_MSI_FLAGS_QSIZE;
 	control |= (entry->msi_attrib.multiple << 4) | PCI_MSI_FLAGS_ENABLE;
@@ -423,7 +426,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	arch_restore_msi_irqs(dev);
 
 	for_each_pci_msi_entry(entry, dev)
-		msix_mask_irq(entry, entry->masked);
+		msix_mask_irq(dev, entry, entry->masked);
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_MASKALL, 0);
 }
@@ -612,28 +615,28 @@ static int msi_capability_init(struct pci_dev *dev, int nvec,
 	/* All MSIs are unmasked by default, Mask them all */
 	mask = msi_mask(entry->msi_attrib.multi_cap);
-	msi_mask_irq(entry, mask, mask);
+	msi_mask_irq(dev, entry, mask, mask);
 
 	list_add_tail(&entry->list, dev_to_msi_list(&dev->dev));
 
 	/* Configure MSI capability structure */
 	ret = pci_msi_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
 
 	ret = msi_verify_entries(dev);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
 
 	ret = populate_msi_sysfs(dev);
 	if (ret) {
-		msi_mask_irq(entry, mask, ~mask);
+		msi_mask_irq(dev, entry, mask, ~mask);
 		free_msi_irqs(dev);
 		return ret;
 	}
@@ -721,7 +724,7 @@ static void msix_program_entries(struct pci_dev *dev,
 			entries[i++].vector = entry->irq;
 		entry->masked = readl(pci_msix_desc_addr(entry) +
 				PCI_MSIX_ENTRY_VECTOR_CTRL);
-		msix_mask_irq(entry, 1);
+		msix_mask_irq(dev, entry, 1);
 	}
 }
 
@@ -895,7 +898,7 @@ static void pci_msi_shutdown(struct pci_dev *dev)
 	/* Return the device with MSI unmasked as initial states */
 	mask = msi_mask(desc->msi_attrib.multi_cap);
 	/* Keep cached state to be restored */
-	__pci_msi_desc_mask_irq(desc, mask, ~mask);
+	__pci_msi_desc_mask_irq(dev, desc, mask, ~mask);
 
 	/* Restore dev->irq to its default pin-assertion irq */
 	dev->irq = desc->msi_attrib.default_irq;
@@ -982,7 +985,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
 	/* Return the device with MSI-X masked as initial states */
 	for_each_pci_msi_entry(entry, dev) {
 		/* Keep cached states to be restored */
-		__pci_msix_desc_mask_irq(entry, 1);
+		__pci_msix_desc_mask_irq(dev, entry, 1);
 	}
 
 	pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0);
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 0e9c50052ff3..1525ec9457a4 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -141,8 +141,9 @@ void free_msi_entry(struct msi_desc *entry);
 void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 
-u32 __pci_msix_desc_mask_irq(struct msi_desc *desc, u32 flag);
-u32 __pci_msi_desc_mask_irq(struct msi_desc *desc, u32 mask, u32 flag);
+u32 __pci_msix_desc_mask_irq(struct pci_dev *, struct msi_desc *desc, u32 flag);
+u32 __pci_msi_desc_mask_irq(struct pci_dev *, struct msi_desc *desc, u32 mask,
+			    u32 flag);
 void pci_msi_mask_irq(struct irq_data *data);
 void pci_msi_unmask_irq(struct irq_data *data);

commit 9b51617cc910b5facdb4ab0442d6562422c916d7
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Thu Nov 8 08:34:07 2018 -0600

    PCI/MSI: Avoid MMIO accesses to device if it's unreachable

    Add wrappers around MMIO accesses (readl/writel) to the device.  If we
    already know the device is unreachable because of hot-removal or a bus
    error, skip the MMIO access.

    In addition, if we're doing an MMIO read, check whether it returns ~0
    data.  If a read fails because of a bus error, host bridges typically
    synthesize ~0 data to complete the CPU's read.  If a read returns ~0,
    do a config read to determine whether the device is still reachable.
    If the config read is successful, the ~0 data is probably valid data
    from the device.  If the config read fails, the device is unreachable,
    so mark it as disconnected.

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d46ae506e296..0cf095eef69d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -31,6 +31,38 @@ int pci_msi_ignore_mask;
 #define msix_table_size(flags)	((flags & PCI_MSIX_FLAGS_QSIZE) + 1)
 
+static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
+{
+	u32 val, id;
+
+	if (pci_dev_is_disconnected(dev))
+		return ~0;
+
+	val = readl(addr);
+
+	/*
+	 * If an MMIO read from the device returns ~0 data, that data may
+	 * be valid, or it may indicate a bus error.  If config space is
+	 * readable, assume it's valid data; otherwise, assume a bus error.
+	 */
+	if (val == ~0) {
+		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
+		if (id == ~0)
+			pci_dev_set_disconnected(dev, NULL);
+	}
+
+	return val;
+}
+
+static void mmio_writel(struct pci_dev *dev, u32 val,
+			volatile void __iomem *addr)
+{
+	if (pci_dev_is_disconnected(dev))
+		return;
+
+	writel(val, addr);
+}
+
 #ifdef CONFIG_PCI_MSI_IRQ_DOMAIN
 static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 {
@@ -215,7 +247,8 @@ u32 __pci_msix_desc_mask_irq(struct pci_dev *dev, struct msi_desc *desc,
 		mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	if (flag)
 		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
-	writel(mask_bits, pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
+	mmio_writel(dev, mask_bits,
+		    pci_msix_desc_addr(desc) + PCI_MSIX_ENTRY_VECTOR_CTRL);
 
 	return mask_bits;
 }
@@ -232,7 +265,7 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(dev, desc, flag);
-		readl(desc->mask_base);		/* Flush write to device */
+		mmio_readl(dev, desc->mask_base); /* Flush write to device */
 	} else {
 		unsigned offset = data->irq - desc->irq;
 		msi_mask_irq(dev, desc, 1 << offset, flag << offset);
@@ -276,9 +309,11 @@ void __pci_read_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
-		msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR);
-		msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR);
-		msg->data = readl(base + PCI_MSIX_ENTRY_DATA);
+		msg->address_lo = mmio_readl(dev,
+					     base + PCI_MSIX_ENTRY_LOWER_ADDR);
+		msg->address_hi = mmio_readl(dev,
+					     base + PCI_MSIX_ENTRY_UPPER_ADDR);
+		msg->data = mmio_readl(dev, base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 data;
@@ -306,9 +341,11 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
 	} else if (entry->msi_attrib.is_msix) {
 		void __iomem *base = pci_msix_desc_addr(entry);
 
-		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
-		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
-		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
+		mmio_writel(dev, msg->address_lo,
+			    base + PCI_MSIX_ENTRY_LOWER_ADDR);
+		mmio_writel(dev, msg->address_hi,
+			    base + PCI_MSIX_ENTRY_UPPER_ADDR);
+		mmio_writel(dev, msg->data, base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 msgctl;
@@ -722,8 +759,8 @@ static void msix_program_entries(struct pci_dev *dev,
 	for_each_pci_msi_entry(entry, dev) {
 		if (entries)
 			entries[i++].vector = entry->irq;
-		entry->masked = readl(pci_msix_desc_addr(entry) +
-				PCI_MSIX_ENTRY_VECTOR_CTRL);
+		entry->masked = mmio_readl(dev, pci_msix_desc_addr(entry) +
+					   PCI_MSIX_ENTRY_VECTOR_CTRL);
 		msix_mask_irq(dev, entry, 1);
 	}
 }
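If the PCI core did export these wrappers for drivers, as the cover text
above suggests, a caller might end up looking something like this.  This
is only a sketch: the foo_* names and register offsets are hypothetical,
and mmio_readl()/mmio_writel() are the static helpers from the patch
above, not an existing exported API.

#include <linux/pci.h>

#define FOO_REG_STATUS		0x10	/* hypothetical device registers */
#define FOO_REG_DOORBELL	0x14

static int foo_ring_doorbell(struct pci_dev *pdev, void __iomem *bar)
{
	u32 status = mmio_readl(pdev, bar + FOO_REG_STATUS);

	/*
	 * ~0 means the wrapper either skipped the read because the device
	 * was already marked disconnected, or the read itself failed and
	 * the device has now been marked disconnected.
	 */
	if (status == ~0)
		return -ENODEV;

	mmio_writel(pdev, 1, bar + FOO_REG_DOORBELL);
	return 0;
}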
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
>
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

You know how the kernel provides ZERO_PAGE; wouldn't it be cool if we
also had a ONES_PAGE and could remap all virtual addresses from a
memory-mapped device to that page on an ungraceful disconnect?

I do not know how to accomplish that, so this might just be crazy
talk... But if it is possible, that would be a pretty nifty way to
solve this problem.
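For what it's worth, the fault-handler half of that idea might be
sketched as below.  Everything here is hypothetical -- no such kernel
interface exists -- and revoking a device's existing ioremap()/mmap()
translations on removal is the genuinely hard part that this sketch
does not attempt.

#include <linux/mm.h>

static struct page *ones_page;	/* allocated once and memset to 0xff */

static vm_fault_t ones_page_fault(struct vm_fault *vmf)
{
	/*
	 * Back every fault with the shared all-ones page, mimicking
	 * what the host bridge synthesizes for a removed device.
	 */
	get_page(ones_page);
	vmf->page = ones_page;
	return 0;
}

static const struct vm_operations_struct ones_page_vm_ops = {
	.fault = ones_page_fault,
};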
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]
>
> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> > On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
> > > When a PCI device is gone, we don't want to send IO to it if we can
> > > avoid it. We expose functionality via the irq_chip structure. As
> > > users of that structure may not know about the underlying PCI device,
> > > it's our responsibility to guard against removed devices.
> > >
> > > .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
> > > .irq_mask/unmask() are not. Guard them for completeness.
> > >
> > > For example, surprise removal of a PCIe device triggers teardown. This
> > > touches the irq_chip ops at some point to disable the interrupts. I/O
> > > generated here can crash the system on firmware-first machines.
> > > Not triggering the IO in the first place greatly reduces the
> > > possibility of the problem occurring.
> > >
> > > Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> >
> > Applied to pci/misc for v4.21, thanks!
>
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

I think my stance always has been that this call is not good at all
because once you call it you never really know if it is still true, as
the device could have been removed right afterward.

So almost any code that relies on it is broken; there is no locking and
it can and will race, and you will lose.

I think your patch suffers from this race:

> +static u32 mmio_readl(struct pci_dev *dev, const volatile void __iomem *addr)
> +{
> +	u32 val, id;
> +
> +	if (pci_dev_is_disconnected(dev))
> +		return ~0;

Great, but what happens if I yank the device out right here?

> +	val = readl(addr);

This value could now be all FF, if the device is gone, so what did the
check above help with?

> +	/*
> +	 * If an MMIO read from the device returns ~0 data, that data may
> +	 * be valid, or it may indicate a bus error.  If config space is
> +	 * readable, assume it's valid data; otherwise, assume a bus error.
> +	 */
> +	if (val == ~0) {
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +		if (id == ~0)
> +			pci_dev_set_disconnected(dev, NULL);

So why do the check above for "is disconnected"?  What does this buy us
here, just short-circuiting the readl()?

> +	}
> +
> +	return val;
> +}
> +
> +static void mmio_writel(struct pci_dev *dev, u32 val,
> +			volatile void __iomem *addr)
> +{
> +	if (pci_dev_is_disconnected(dev))
> +		return;
> +
> +	writel(val, addr);

Why even check, what's wrong with always doing the write?

I understand the wish to make this easier, but I think the only way is
that the drivers themselves should be checking on their reads.  And
they have to check on all reads, or at least on some subset of reads,
and be able to handle 0xff for the other ones without going crazy.

I _think_ the xhci driver does this, given that it is hot added/removed
all the time dynamically due to the way modern laptops are made, where
the BIOS adds/removes the xhci controller when a USB device is
added/removed.

thanks,

greg k-h
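The driver-side discipline Greg describes might look roughly like this.
The foo_* helper and register offset are hypothetical; pci_device_is_present()
is the existing helper that confirms reachability with a config read.

#include <linux/pci.h>

static int foo_read_status(struct pci_dev *pdev, void __iomem *bar, u32 *out)
{
	u32 val = readl(bar + 0x10);	/* hypothetical status register */

	/*
	 * ~0 can be a legitimate register value on some devices, so
	 * confirm via config space before concluding the device is
	 * gone.  Even then, the device may vanish right after this
	 * check; callers must tolerate stale data.
	 */
	if (val == ~0 && !pci_device_is_present(pdev))
		return -ENODEV;

	*out = val;
	return 0;
}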
On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for
> sensitive information.
>
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have been
floating around the past few months? -- HPX, SFI, CER.  Without
divulging too much before publication, I'm curious on opinions on how
well (or not well) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this.  One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved.  Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses.  I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality
control of unquestionable moral integrity :).

In theory ~0 can be a great indicator that something may be wrong,
though I think it's about as ad hoc as pci_dev_is_disconnected().  I
slightly like the idea of wrapping the MMIO accessors.  There's still
memcpy() and DMA that cause the same MemRd/MemWr PCIe transactions, and
the same sort of errors in PCIe land, and it would be good to have more
testing on this.  Since this patch is tested and confirmed to fix a
known failure case, I would keep it, and then look at fixing the
problem in a more generic way.

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first error handling.  Do we reach a point where we draw a
line in handling new problems introduced by FFS?  So, if something is a
problem with FFS, but not with native handling, do we commit to
supporting it?

Alex
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
>
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true, as
> the device could have been removed right afterward.
>
> So almost any code that relies on it is broken; there is no locking and
> it can and will race, and you will lose.

AIUI, we're not trying to create code to rely on this.  This is more
about reducing reliance on hardware.  If the software misses the race
once and accesses disconnected device memory, that's usually not a big
deal to let hardware sort it out, but the point is not to push our luck.

Surprise hot remove is empirically more reliable the less we interact
with hardware and firmware.  That shouldn't be necessary, but is just
an unfortunate reality.
On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > instead of systematic, in the sense that I don't know how we convince
> > > ourselves that this (and only this) is the correct place to put it.
> >
> > I think my stance always has been that this call is not good at all
> > because once you call it you never really know if it is still true, as
> > the device could have been removed right afterward.
> >
> > So almost any code that relies on it is broken; there is no locking and
> > it can and will race, and you will lose.
>
> AIUI, we're not trying to create code to rely on this.  This is more
> about reducing reliance on hardware.  If the software misses the race
> once and accesses disconnected device memory, that's usually not a big
> deal to let hardware sort it out, but the point is not to push our luck.

Then why even care about this call at all?  If you need to really know
if the read worked, you have to check the value.  If the value is FF
then you have a huge hint that the hardware is now gone.  And you can
rely on it being gone; you can never rely on the function that checks
whether the hardware is there still being valid at any point in time
after the call returns.

> Surprise hot remove is empirically more reliable the less we interact
> with hardware and firmware.  That shouldn't be necessary, but is just
> an unfortunate reality.

You are not "interacting", you are reading/writing to the hardware, as
you have to do so.  So I really don't understand what you are talking
about here, sorry.

greg k-h
On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for
> sensitive information.
>
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
>> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
>>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
>>>> I'm having second thoughts about this.  One thing I'm uncomfortable
>>>> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
>>>> instead of systematic, in the sense that I don't know how we convince
>>>> ourselves that this (and only this) is the correct place to put it.
>>>
>>> I think my stance always has been that this call is not good at all
>>> because once you call it you never really know if it is still true, as
>>> the device could have been removed right afterward.
>>>
>>> So almost any code that relies on it is broken; there is no locking and
>>> it can and will race, and you will lose.
>>
>> AIUI, we're not trying to create code to rely on this.  This is more
>> about reducing reliance on hardware.  If the software misses the race
>> once and accesses disconnected device memory, that's usually not a big
>> deal to let hardware sort it out, but the point is not to push our luck.
>
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone; you can never rely on the function that checks
> whether the hardware is there still being valid at any point in time
> after the call returns.

In the case that we're trying to fix, this code executing is a result
of the device being gone, so we can guarantee race-free operation.  I
agree that there is a race in the general case.  As far as checking the
result for all F's, that's not an option when firmware crashes the
system as a result of the MMIO read/write.  It's never pretty when
firmware gets involved.

Alex
On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/08/2018 04:43 PM, Greg Kroah-Hartman wrote:
> >
> > [EXTERNAL EMAIL]
> > Please report any suspicious attachments, links, or requests for
> > sensitive information.
> >
> > On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> >> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> >>> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> >>>> I'm having second thoughts about this.  One thing I'm uncomfortable
> >>>> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> >>>> instead of systematic, in the sense that I don't know how we convince
> >>>> ourselves that this (and only this) is the correct place to put it.
> >>>
> >>> I think my stance always has been that this call is not good at all
> >>> because once you call it you never really know if it is still true, as
> >>> the device could have been removed right afterward.
> >>>
> >>> So almost any code that relies on it is broken; there is no locking and
> >>> it can and will race, and you will lose.
> >>
> >> AIUI, we're not trying to create code to rely on this.  This is more
> >> about reducing reliance on hardware.  If the software misses the race
> >> once and accesses disconnected device memory, that's usually not a big
> >> deal to let hardware sort it out, but the point is not to push our luck.
> >
> > Then why even care about this call at all?  If you need to really know
> > if the read worked, you have to check the value.  If the value is FF
> > then you have a huge hint that the hardware is now gone.  And you can
> > rely on it being gone; you can never rely on the function that checks
> > whether the hardware is there still being valid at any point in time
> > after the call returns.
>
> In the case that we're trying to fix, this code executing is a result of
> the device being gone, so we can guarantee race-free operation.  I agree
> that there is a race in the general case.  As far as checking the result
> for all F's, that's not an option when firmware crashes the system as a
> result of the MMIO read/write.  It's never pretty when firmware gets
> involved.

If you have firmware that crashes the system when you try to read from
a PCI device that was hot-removed, that is broken firmware and needs to
be fixed.  The kernel can not work around that as, again, you will
never win that race.

thanks,

greg k-h
On Thu, Nov 08, 2018 at 02:42:55PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 03:32:58PM -0700, Keith Busch wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > > > instead of systematic, in the sense that I don't know how we convince
> > > > ourselves that this (and only this) is the correct place to put it.
> > >
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true, as
> > > the device could have been removed right afterward.
> > >
> > > So almost any code that relies on it is broken; there is no locking and
> > > it can and will race, and you will lose.
> >
> > AIUI, we're not trying to create code to rely on this.  This is more
> > about reducing reliance on hardware.  If the software misses the race
> > once and accesses disconnected device memory, that's usually not a big
> > deal to let hardware sort it out, but the point is not to push our luck.
>
> Then why even care about this call at all?  If you need to really know
> if the read worked, you have to check the value.  If the value is FF
> then you have a huge hint that the hardware is now gone.  And you can
> rely on it being gone; you can never rely on the function that checks
> whether the hardware is there still being valid at any point in time
> after the call returns.
>
> > Surprise hot remove is empirically more reliable the less we interact
> > with hardware and firmware.  That shouldn't be necessary, but is just
> > an unfortunate reality.
>
> You are not "interacting", you are reading/writing to the hardware, as
> you have to do so.  So I really don't understand what you are talking
> about here, sorry.

We're reading hardware memory, yes, but the hardware isn't there.
Something obviously needs to return FF, so we are indirectly
interacting with whatever mechanism handles that.  Sometimes that
mechanism doesn't handle it gracefully, and instead of having FF to
consider, you have a machine check rebooting your system.
On 11/08/2018 04:51 PM, Greg KH wrote:
> On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> In the case that we're trying to fix, this code executing is a result of
>> the device being gone, so we can guarantee race-free operation.  I agree
>> that there is a race in the general case.  As far as checking the result
>> for all F's, that's not an option when firmware crashes the system as a
>> result of the MMIO read/write.  It's never pretty when firmware gets
>> involved.
>
> If you have firmware that crashes the system when you try to read from a
> PCI device that was hot-removed, that is broken firmware and needs to be
> fixed.  The kernel can not work around that as, again, you will never win
> that race.

But it's not the firmware that crashes.  It's Linux, as a result of a
fatal error message from the firmware.  And we can't fix that, because
FFS handling requires that the system reboots [1].

If we're going to say that we don't want to support FFS because it's a
separate code path and a different flow, that's fine.  I am, myself, not
a fan of FFS.  But if we're going to continue supporting it, I think
we'll continue to have to resolve these sorts of unintended
consequences.

Alex

[1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> +	/*
> +	 * If an MMIO read from the device returns ~0 data, that data may
> +	 * be valid, or it may indicate a bus error.  If config space is
> +	 * readable, assume it's valid data; otherwise, assume a bus error.
> +	 */
> +	if (val == ~0) {
> +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> +		if (id == ~0)
> +			pci_dev_set_disconnected(dev, NULL);
> +	}

This isn't safe unfortunately, because "all ones" may occur for other
reasons besides disconnectedness.  E.g. on an Uncorrectable Error, the
device may likewise respond with all ones, but revert to valid
responses if the error can be recovered through a Secondary Bus Reset.
In such a case, marking the device disconnected would be inappropriate.

Accessing a device in D3cold would be another example where all ones is
returned both from MMIO and config space despite the device still being
present and future accesses having a chance to succeed.

In fact, in v2 of Keith's patches adding pci_dev_set_disconnected() he
attempted the same as what you're doing here and caused issues for me
with devices in D3cold:

https://spinics.net/lists/linux-pci/msg54337.html

> One thing I'm uncomfortable with is that [...].  Another is that the
> only place we call pci_dev_set_disconnected() is in pciehp and acpiphp,
> so the only "disconnected" case we catch is if hotplug happens to be
> involved.

Yes, that's because the hotplug drivers are the only ones who can
identify removal authoritatively and unambiguously.  They *know* when
the device is gone and don't have to resort to heuristics such as all
ones.  (ISTR that dpc also marks devices disconnected.)

> sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.

We need to add documentation for driver authors on how to deal with
surprise removal.  Briefly (a sketch of the first two points follows
below):

* If (pdev->error_state == pci_channel_io_perm_failure), the device
  is definitely gone and any further device access can be skipped.
  Otherwise presence of the device is likely, but not guaranteed.

* If a device access can significantly delay device removal due to
  Completion Timeouts, or can cause an infinite loop, MCE or crash,
  do check pdev->error_state before carrying out the device access.

* Always be prepared that a device access may fail due to surprise
  removal; do not blindly trust MMIO or config space reads or assume
  success of writes.

I'm sure this can be extended quite a bit.  There's more information in
this LWN article in the "Surprise removal" section:

https://lwn.net/Articles/767885/

Thanks,

Lukas
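The sketch referenced above, with the caveat from the first bullet baked
in: a perm_failure check is authoritative only in the positive
direction.  Everything here except error_state and
pci_channel_io_perm_failure is hypothetical.

#include <linux/pci.h>

static u32 foo_read_reg(struct pci_dev *pdev, void __iomem *addr)
{
	/* Definitely gone: skip the access and its Completion Timeout. */
	if (pdev->error_state == pci_channel_io_perm_failure)
		return ~0;

	/*
	 * Probably present, but this can still race with removal, so
	 * the caller must tolerate an all-ones result either way.
	 */
	return readl(addr);
}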
On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > I'm having second thoughts about this.  One thing I'm uncomfortable
> > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
>
> I think my stance always has been that this call is not good at all
> because once you call it you never really know if it is still true, as
> the device could have been removed right afterward.
>
> So almost any code that relies on it is broken; there is no locking and
> it can and will race, and you will lose.

Hm, to be honest, if that's your impression I think you must have
missed a large portion of the discussion we've been having over the
past 2 years.

Please consider reading this LWN article, particularly the "Surprise
removal" section, to get up to speed:

https://lwn.net/Articles/767885/

You seem to be assuming that all we care about is the *return value* of
an MMIO read.  However, a transaction to a surprise-removed device has
side effects beyond returning all ones, such as a Completion Timeout
which, with thousands of transactions in flight, added up to many
seconds to handle removal of an NVMe array and occasionally caused
MCEs.

It is not an option to just blindly carry out device accesses even
though it is known the device is gone, Completion Timeouts be damned.
However, there is more to it than just Completion Timeouts; this is all
detailed in the LWN article.

Thanks,

Lukas
On Fri, Nov 09, 2018 at 08:29:53AM +0100, Lukas Wunner wrote:
> On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> >
> > I think my stance always has been that this call is not good at all
> > because once you call it you never really know if it is still true, as
> > the device could have been removed right afterward.
> >
> > So almost any code that relies on it is broken; there is no locking and
> > it can and will race, and you will lose.
>
> Hm, to be honest, if that's your impression I think you must have
> missed a large portion of the discussion we've been having over the
> past 2 years.
>
> Please consider reading this LWN article, particularly the "Surprise
> removal" section, to get up to speed:
>
> https://lwn.net/Articles/767885/
>
> You seem to be assuming that all we care about is the *return value* of
> an MMIO read.  However, a transaction to a surprise-removed device has
> side effects beyond returning all ones, such as a Completion Timeout
> which, with thousands of transactions in flight, added up to many
> seconds to handle removal of an NVMe array and occasionally caused MCEs.

Again, I still claim this is broken hardware/firmware :)

> It is not an option to just blindly carry out device accesses even
> though it is known the device is gone, Completion Timeouts be damned.

I don't disagree with you at all, and your other email is great at
summarizing the issues here.

What I do object to is somehow relying on that function call as knowing
whether the device really is present or not.  It's a good hint, yes,
but driver authors still have to be able to handle the bad data coming
back when the call races with the device being removed.

> However, there is more to it than just Completion Timeouts; this is all
> detailed in the LWN article.

And that's a great article, and your work here is much appreciated.  I
think we are in violent agreement :)

thanks,

greg k-h
On Fri, Nov 09, 2018 at 03:32:57AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 09, 2018 at 08:29:53AM +0100, Lukas Wunner wrote:
> > On Thu, Nov 08, 2018 at 02:01:17PM -0800, Greg Kroah-Hartman wrote:
> > > On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > > > I'm having second thoughts about this.  One thing I'm uncomfortable
> > > > with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> > >
> > > I think my stance always has been that this call is not good at all
> > > because once you call it you never really know if it is still true, as
> > > the device could have been removed right afterward.
> > >
> > > So almost any code that relies on it is broken; there is no locking and
> > > it can and will race, and you will lose.
> >
> > Hm, to be honest, if that's your impression I think you must have
> > missed a large portion of the discussion we've been having over the
> > past 2 years.
> >
> > Please consider reading this LWN article, particularly the "Surprise
> > removal" section, to get up to speed:
> >
> > https://lwn.net/Articles/767885/
> >
> > You seem to be assuming that all we care about is the *return value* of
> > an MMIO read.  However, a transaction to a surprise-removed device has
> > side effects beyond returning all ones, such as a Completion Timeout
> > which, with thousands of transactions in flight, added up to many
> > seconds to handle removal of an NVMe array and occasionally caused MCEs.
>
> Again, I still claim this is broken hardware/firmware :)

Indeed it is, but I don't want to abandon people with hardware in hand
if we can make it work despite being broken.  Perfection is the enemy
of good. :)

> > It is not an option to just blindly carry out device accesses even
> > though it is known the device is gone, Completion Timeouts be damned.
>
> I don't disagree with you at all, and your other email is great at
> summarizing the issues here.
>
> What I do object to is somehow relying on that function call as knowing
> whether the device really is present or not.  It's a good hint, yes,
> but driver authors still have to be able to handle the bad data coming
> back when the call races with the device being removed.

The function has always been a private interface.  It is not available
for drivers to rely on.  The only thing we're trying to accomplish is
to not start a transaction if software knows it will not succeed.
There are certainly times when a transaction will fail in ways software
does not foresee, but we're not suggesting this is intended to handle
those either.
On Fri, 2018-11-09 at 08:11 +0100, Lukas Wunner wrote:
> On Thu, Nov 08, 2018 at 02:09:17PM -0600, Bjorn Helgaas wrote:
> > +	/*
> > +	 * If an MMIO read from the device returns ~0 data, that data may
> > +	 * be valid, or it may indicate a bus error.  If config space is
> > +	 * readable, assume it's valid data; otherwise, assume a bus error.
> > +	 */
> > +	if (val == ~0) {
> > +		pci_read_config_dword(dev, PCI_VENDOR_ID, &id);
> > +		if (id == ~0)
> > +			pci_dev_set_disconnected(dev, NULL);
> > +	}
>
> This isn't safe unfortunately, because "all ones" may occur for other
> reasons besides disconnectedness.  E.g. on an Uncorrectable Error, the
> device may likewise respond with all ones, but revert to valid
> responses if the error can be recovered through a Secondary Bus Reset.
> In such a case, marking the device disconnected would be inappropriate.

I don't really see why we're trying to make a distinction between
recoverable errors and disconnected devices at this stage.  In either
case we should assume the device is broken and shouldn't be accessed
until we perform some kind of recovery action.

Bjorn's MMIO wrappers are more or less an opt-in software emulation of
the freeze-MMIO-on-error behaviour that the EEH mechanism provides on
IBM hardware, so I think it makes sense.  It also has the nice side
effect of giving driver writers an error-injection mechanism, so they
might actually test how their drivers deal with errors.

> Accessing a device in D3cold would be another example where all ones is
> returned both from MMIO and config space despite the device still being
> present and future accesses having a chance to succeed.

Is doing an MMIO to a device in D3cold (or hot) ever a valid thing to
do?

> In fact, in v2 of Keith's patches adding pci_dev_set_disconnected() he
> attempted the same as what you're doing here and caused issues for me
> with devices in D3cold:
>
> https://spinics.net/lists/linux-pci/msg54337.html
>
> > One thing I'm uncomfortable with is that [...].  Another is that the
> > only place we call pci_dev_set_disconnected() is in pciehp and acpiphp,
> > so the only "disconnected" case we catch is if hotplug happens to be
> > involved.
>
> Yes, that's because the hotplug drivers are the only ones who can
> identify removal authoritatively and unambiguously.  They *know* when
> the device is gone and don't have to resort to heuristics such as all
> ones.  (ISTR that dpc also marks devices disconnected.)

The heuristics shouldn't be used to work out when the device is gone;
rather, they should be used to work out when we need to check on the
device.  One of the grosser bits of EEH support is a hook in readl()
and friends that checks for a 0xFF response.  If it finds one, it looks
at the EEH state and will start the recovery process if the device is
marked as frozen.  (Don't look at the code, you won't like what you
find.)

> > sprinkling pci_dev_is_disconnected() around feels ad hoc
> > instead of systematic, in the sense that I don't know how we convince
> > ourselves that this (and only this) is the correct place to put it.
>
> We need to add documentation for driver authors on how to deal with
> surprise removal.  Briefly:
>
> * If (pdev->error_state == pci_channel_io_perm_failure), the device
>   is definitely gone and any further device access can be skipped.
>   Otherwise presence of the device is likely, but not guaranteed.
>
> * If a device access can significantly delay device removal due to
>   Completion Timeouts, or can cause an infinite loop, MCE or crash,
>   do check pdev->error_state before carrying out the device access.
>
> * Always be prepared that a device access may fail due to surprise
>   removal; do not blindly trust MMIO or config space reads or assume
>   success of writes.

Completely agree.  We really need better documentation of what drivers
should be doing.

Oliver
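The readl() hook Oliver mentions has roughly this shape, greatly
simplified; the recovery helper below is a hypothetical stand-in, not
the real EEH API, which lives in arch/powerpc.

#include <linux/pci.h>

static inline u32 eeh_style_readl(struct pci_dev *pdev,
				  const volatile void __iomem *addr)
{
	u32 val = readl(addr);

	/*
	 * All-ones may mean a frozen PE rather than real data: consult
	 * the isolation state and kick off recovery if needed.
	 */
	if (val == ~0U)
		foo_check_and_recover(pdev);	/* hypothetical stand-in */

	return val;
}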
On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/08/2018 04:51 PM, Greg KH wrote:
> > On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > > In the case that we're trying to fix, this code executing is a result of
> > > the device being gone, so we can guarantee race-free operation.  I agree
> > > that there is a race in the general case.  As far as checking the result
> > > for all F's, that's not an option when firmware crashes the system as a
> > > result of the MMIO read/write.  It's never pretty when firmware gets
> > > involved.
> >
> > If you have firmware that crashes the system when you try to read from a
> > PCI device that was hot-removed, that is broken firmware and needs to be
> > fixed.  The kernel can not work around that as, again, you will never win
> > that race.
>
> But it's not the firmware that crashes.  It's Linux, as a result of a
> fatal error message from the firmware.  And we can't fix that, because
> FFS handling requires that the system reboots [1].

Do we know the exact circumstances that result in firmware requesting a
reboot?  If it happens on any PCIe error, I don't see what we can do to
prevent that beyond masking UEs entirely (are we even allowed to do
that on FFS systems?).

> If we're going to say that we don't want to support FFS because it's a
> separate code path and a different flow, that's fine.  I am, myself, not
> a fan of FFS.  But if we're going to continue supporting it, I think
> we'll continue to have to resolve these sorts of unintended
> consequences.
>
> Alex
>
> [1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for
> sensitive information.
>
> On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/08/2018 04:51 PM, Greg KH wrote:
>>> On Thu, Nov 08, 2018 at 10:49:08PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>>>> In the case that we're trying to fix, this code executing is a result of
>>>> the device being gone, so we can guarantee race-free operation.  I agree
>>>> that there is a race in the general case.  As far as checking the result
>>>> for all F's, that's not an option when firmware crashes the system as a
>>>> result of the MMIO read/write.  It's never pretty when firmware gets
>>>> involved.
>>>
>>> If you have firmware that crashes the system when you try to read from a
>>> PCI device that was hot-removed, that is broken firmware and needs to be
>>> fixed.  The kernel can not work around that as, again, you will never win
>>> that race.
>>
>> But it's not the firmware that crashes.  It's Linux, as a result of a
>> fatal error message from the firmware.  And we can't fix that, because
>> FFS handling requires that the system reboots [1].
>
> Do we know the exact circumstances that result in firmware requesting a
> reboot?  If it happens on any PCIe error, I don't see what we can do to
> prevent that beyond masking UEs entirely (are we even allowed to do
> that on FFS systems?).

Pull a drive out at an angle, push two drives in at the same time, pull
out a drive really slowly.  Whether an error is even reported to the OS
depends on PD state, and on proprietary mechanisms and logic in the HW
and FW.  The OS is not supposed to mask errors (touch AER bits) on FFS.

Sadly, with FFS, behavior can and does change from BIOS version to BIOS
version.  On one product, for example, we eliminated a lot of crashes
by simply not reporting some classes of PCIe errors to the OS.

Alex

>> If we're going to say that we don't want to support FFS because it's a
>> separate code path and a different flow, that's fine.  I am, myself, not
>> a fan of FFS.  But if we're going to continue supporting it, I think
>> we'll continue to have to resolve these sorts of unintended
>> consequences.
>>
>> Alex
>>
>> [1] ACPI 6.2, 18.1 - Hardware Errors and Error Sources
[+cc Jon, for related VMD firmware-first error enable issue]

On Mon, Nov 12, 2018 at 08:05:41PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
> > On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
> >> But it's not the firmware that crashes.  It's Linux, as a result of a
> >> fatal error message from the firmware.  And we can't fix that, because
> >> FFS handling requires that the system reboots [1].
> >
> > Do we know the exact circumstances that result in firmware requesting a
> > reboot?  If it happens on any PCIe error, I don't see what we can do to
> > prevent that beyond masking UEs entirely (are we even allowed to do
> > that on FFS systems?).
>
> Pull a drive out at an angle, push two drives in at the same time, pull
> out a drive really slowly.  Whether an error is even reported to the OS
> depends on PD state, and on proprietary mechanisms and logic in the HW
> and FW.  The OS is not supposed to mask errors (touch AER bits) on FFS.

PD?

Do you think Linux observes the rule about not touching AER bits on
FFS?  I'm not sure it does.  I'm not even sure what section of the spec
is relevant.

The whole issue of firmware-first, the mechanism by which firmware gets
control, the System Error enables in Root Port Root Control registers,
etc., is very murky to me.  Jon has a sort of similar issue with VMD
where he needs to leave System Errors enabled instead of disabling them
as we currently do.

Bjorn

[1] https://lore.kernel.org/linux-pci/20181029210651.GB13681@bhelgaas-glaptop.roam.corp.google.com
On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for
> sensitive information.
>
> [+cc Jon, for related VMD firmware-first error enable issue]
>
> On Mon, Nov 12, 2018 at 08:05:41PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/11/2018 11:50 PM, Oliver O'Halloran wrote:
>>> On Thu, 2018-11-08 at 23:06 +0000, Alex_Gagniuc@Dellteam.com wrote:
>
>>>> But it's not the firmware that crashes.  It's Linux, as a result of a
>>>> fatal error message from the firmware.  And we can't fix that, because
>>>> FFS handling requires that the system reboots [1].
>>>
>>> Do we know the exact circumstances that result in firmware requesting a
>>> reboot?  If it happens on any PCIe error, I don't see what we can do to
>>> prevent that beyond masking UEs entirely (are we even allowed to do
>>> that on FFS systems?).
>>
>> Pull a drive out at an angle, push two drives in at the same time, pull
>> out a drive really slowly.  Whether an error is even reported to the OS
>> depends on PD state, and on proprietary mechanisms and logic in the HW
>> and FW.  The OS is not supposed to mask errors (touch AER bits) on FFS.
>
> PD?

Presence Detect

> Do you think Linux observes the rule about not touching AER bits on
> FFS?  I'm not sure it does.  I'm not even sure what section of the spec
> is relevant.

I haven't found any place where Linux breaks this rule.  I'm very
confident that, unless otherwise instructed, we follow it.

> The whole issue of firmware-first, the mechanism by which firmware gets
> control, the System Error enables in Root Port Root Control registers,
> etc., is very murky to me.  Jon has a sort of similar issue with VMD
> where he needs to leave System Errors enabled instead of disabling them
> as we currently do.

Well, the OS gets control via the _OSC method, and based on that it
should touch/not touch the AER bits.  The bits that get set/cleared
come from the _HPX method, and there's more about FFS in the ACPI spec.
It seems that if the platform wants to enable VMD, it should pass the
correct bits via _HPX.  I'm curious to know in what new twisted way FFS
doesn't work as intended.

Alex

> Bjorn
>
> [1] https://lore.kernel.org/linux-pci/20181029210651.GB13681@bhelgaas-glaptop.roam.corp.google.com
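The ownership rule Alex describes ends up recorded on the host bridge
after _OSC negotiation, so a sketch of "may the OS touch the AER bits"
can be as short as the following.  The helper name is hypothetical;
native_aer is the real struct pci_host_bridge field.

#include <linux/pci.h>

static bool foo_os_owns_aer(struct pci_dev *dev)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);

	/*
	 * Set only if _OSC granted AER control to the OS; clear means
	 * firmware-first, so leave the AER registers alone.
	 */
	return bridge->native_aer;
}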
On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> > The whole issue of firmware-first, the mechanism by which firmware gets
> > control, the System Error enables in Root Port Root Control registers,
> > etc., is very murky to me.  Jon has a sort of similar issue with VMD
> > where he needs to leave System Errors enabled instead of disabling them
> > as we currently do.
>
> Well, the OS gets control via the _OSC method, and based on that it
> should touch/not touch the AER bits.  The bits that get set/cleared
> come from the _HPX method, and there's more about FFS in the ACPI spec.
> It seems that if the platform wants to enable VMD, it should pass the
> correct bits via _HPX.  I'm curious to know in what new twisted way FFS
> doesn't work as intended.

When VMD is enabled, the platform sees a VMD endpoint.  It doesn't see
any of the root ports on that domain, so ACPI can't provide policies
for them, nor AER registers for the platform to consider controlling.
On 11/13/2018 04:56 PM, Keith Busch wrote:
> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>>> The whole issue of firmware-first, the mechanism by which firmware
>>> gets control, the System Error enables in Root Port Root Control
>>> registers, etc., is very murky to me.  Jon has a sort of similar issue
>>> with VMD where he needs to leave System Errors enabled instead of
>>> disabling them as we currently do.
>>
>> Well, the OS gets control via the _OSC method, and based on that it
>> should touch/not touch the AER bits. The bits that get set/cleared come
>> from the _HPX method, and there's more about FFS described in the ACPI
>> spec. It seems that if the platform wants to enable VMD, it should pass
>> the correct bits via _HPX. I'm curious to know in what new twisted way
>> FFS doesn't work as intended.
>
> When VMD is enabled, the platform sees a VMD endpoint. It doesn't see
> any of the root ports on that domain, so ACPI can't provide policies for
> them, nor AER registers for the platform to consider controlling.

I'm not understanding the interdependency between RP AER settings and
VMD. My understanding of VMD is quite rudimentary though, so I'll take
your word for it.

Alex
On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> >
> > [EXTERNAL EMAIL]
> > Please report any suspicious attachments, links, or requests for sensitive information.

It looks like Dell's email system adds the above in such a way that the
email quoting convention suggests that *I* wrote it, when I did not.

> ...
> > Do you think Linux observes the rule about not touching AER bits on
> > FFS?  I'm not sure it does.  I'm not even sure what section of the
> > spec is relevant.
>
> I haven't found any place where Linux breaks this rule. I'm very
> confident that, unless otherwise instructed, we follow this rule.

Just to make sure we're on the same page, can you point me to this
rule?  I do see that OSPM must request control of AER using _OSC
before it touches the AER registers.  What I don't see is the
connection between firmware-first and the AER registers.

The closest I can find is the "Enabled" field in the HEST PCIe
AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:

    If the field value is 1, indicates this error source is
    to be enabled.

    If the field value is 0, indicates that the error source
    is not to be enabled.

    If FIRMWARE_FIRST is set in the flags field, the Enabled
    field is ignored by the OSPM.

AFAICT, Linux completely ignores the Enabled field in these
structures.

These structures also contain values the OS is apparently supposed to
write to Device Control and several AER registers (in struct
acpi_hest_aer_common).  Linux ignores these as well.

These seem like fairly serious omissions in Linux.

> > The whole issue of firmware-first, the mechanism by which firmware
> > gets control, the System Error enables in Root Port Root Control
> > registers, etc., is very murky to me.  Jon has a sort of similar issue
> > with VMD where he needs to leave System Errors enabled instead of
> > disabling them as we currently do.
>
> Well, the OS gets control via the _OSC method, and based on that it
> should touch/not touch the AER bits.

I agree so far.

> The bits that get set/cleared come from the _HPX method,

_HPX tells us about some AER registers, Device Control, Link Control,
and some bridge registers.  It doesn't say anything about the Root
Control register that Jon is concerned with.

For firmware-first to work, firmware has to get control.  How does it
get control?  How does OSPM know to either set up that mechanism or
keep its mitts off something firmware set up before handoff?

In Jon's VMD case, I think firmware-first relies on the System Error
controlled by the Root Control register.  Linux thinks it owns that,
and I don't know how to learn otherwise.

> and there's more about FFS described in the ACPI spec.  It seems that
> if the platform wants to enable VMD, it should pass the correct bits
> via _HPX.  I'm curious to know in what new twisted way FFS doesn't
> work as intended.

Bjorn
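For reference, the structure Bjorn mentions, roughly as the kernel
declares it in include/acpi/actbl1.h (field layout quoted from memory of
v4.20-era headers; verify against your tree):

	/* Common HEST sub-structure for the three PCIe AER source types
	 * (root port, endpoint, bridge); follows the acpi_hest_header. */
	struct acpi_hest_aer_common {
		u16 reserved1;
		u8 flags;		/* ACPI_HEST_FIRMWARE_FIRST, ACPI_HEST_GLOBAL */
		u8 enabled;		/* the "Enabled" field discussed above */
		u32 records_to_preallocate;
		u32 max_sections_per_record;
		u32 bus;		/* bus and segment numbers */
		u16 device;
		u16 function;
		u16 device_control;	/* suggested Device Control value */
		u16 reserved2;
		u32 uncorrectable_mask;	/* suggested AER register values... */
		u32 uncorrectable_severity;
		u32 correctable_mask;
		u32 advanced_capabilities;
	};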
On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>>>
>>> [EXTERNAL EMAIL]
>>> Please report any suspicious attachments, links, or requests for sensitive information.
>
> It looks like Dell's email system adds the above in such a way that the
> email quoting convention suggests that *I* wrote it, when I did not.

I was wondering why you thought I was suspicious. It's a recent
(server-side) change. You used to be able to disable this sort of
notice. I'm told back in the day people were asked to delete emails
before reading them.

>> ...
>>> Do you think Linux observes the rule about not touching AER bits on
>>> FFS?  I'm not sure it does.  I'm not even sure what section of the
>>> spec is relevant.
>>
>> I haven't found any place where Linux breaks this rule. I'm very
>> confident that, unless otherwise instructed, we follow this rule.
>
> Just to make sure we're on the same page, can you point me to this
> rule?  I do see that OSPM must request control of AER using _OSC
> before it touches the AER registers.  What I don't see is the
> connection between firmware-first and the AER registers.

ACPI 6.2 - 6.2.11.3, Table 6-197:

PCI Express Advanced Error Reporting control:
  * The firmware sets this bit to 1 to grant control over PCI Express
Advanced Error Reporting. If firmware allows the OS control of this
feature, then in the context of the _OSC method it must ensure that
error messages are routed to device interrupts as described in the PCI
Express Base Specification[...]

Now I'm confused too:
  * HEST -> __aer_firmware_first
    This is used for touching/not touching AER bits
  * _OSC -> bridge->native_aer
    Used to enable/not enable the AER portdrv service

Maybe Keith knows better why we're doing it this way. From the ACPI
text, it doesn't seem that control of AER would be tied to HEST entries,
although in practice, it is.

> The closest I can find is the "Enabled" field in the HEST PCIe
> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>
>     If the field value is 1, indicates this error source is
>     to be enabled.
>
>     If the field value is 0, indicates that the error source
>     is not to be enabled.
>
>     If FIRMWARE_FIRST is set in the flags field, the Enabled
>     field is ignored by the OSPM.
>
> AFAICT, Linux completely ignores the Enabled field in these
> structures.

I don't think ignoring the field is a problem:
  * With FFS, the OS should ignore it.
  * Without FFS, we have control, and we get to make the decisions
    anyway.

In the latter case we decide whether to use AER, independent of the crap
in ACPI. I'm not even sure why "Enabled" matters in native AER handling.
Probably one of the check-boxes in the "Binary table designer's
handbook"?

> These structures also contain values the OS is apparently supposed to
> write to Device Control and several AER registers (in struct
> acpi_hest_aer_common).  Linux ignores these as well.
>
> These seem like fairly serious omissions in Linux.

I think _HPX carries the same sort of information (except for the Root
Command reg). FW is supposed to program those registers anyway, so even
if the OS doesn't touch them, I'd expect things to just work.

>>> The whole issue of firmware-first, the mechanism by which firmware
>>> gets control, the System Error enables in Root Port Root Control
>>> registers, etc., is very murky to me.  Jon has a sort of similar issue
>>> with VMD where he needs to leave System Errors enabled instead of
>>> disabling them as we currently do.
>> Well, the OS gets control via the _OSC method, and based on that it
>> should touch/not touch the AER bits.
>
> I agree so far.
>
>> The bits that get set/cleared come from the _HPX method,
>
> _HPX tells us about some AER registers, Device Control, Link Control,
> and some bridge registers.  It doesn't say anything about the Root
> Control register that Jon is concerned with.

_HPX type 3 (yay!!!) got approved recently, and that will have more
fine-grained control. It will be able to handle the Root Control reg.

> For firmware-first to work, firmware has to get control.  How does it
> get control?  How does OSPM know to either set up that mechanism or
> keep its mitts off something firmware set up before handoff?

My understanding is that, if FW keeps control of AER in _OSC, then it
will have set things up to get notified instead of the OS. OSPM not
touching AER bits is to make sure it doesn't mess up FW's setup. I think
there are some proprietary bits in the root port to route interrupts to
SMIs instead of the AER vectors.

> In Jon's VMD case, I think firmware-first relies on the System Error
> controlled by the Root Control register.  Linux thinks it owns that,
> and I don't know how to learn otherwise.

Didn't Keith say the root port is not visible to the OS?

Alex
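For readers wondering where Linux "disables them as we currently do":
the v4.20-era AER service clears the Root Control System Error enables
when it claims a root port. An excerpt from drivers/pci/pcie/aer.c,
quoted from memory and trimmed (the register constants are real, from
include/uapi/linux/pci_regs.h):

	#define SYSTEM_ERROR_INTR_ON_MESG_MASK	(PCI_EXP_RTCTL_SECEE |	\
						 PCI_EXP_RTCTL_SENFEE |	\
						 PCI_EXP_RTCTL_SEFEE)

	static void aer_enable_rootport(struct aer_rpc *rpc)
	{
		struct pci_dev *pdev = rpc->rpd;

		/* Disable system error generation in response to error messages */
		pcie_capability_clear_word(pdev, PCI_EXP_RTCTL,
					   SYSTEM_ERROR_INTR_ON_MESG_MASK);

		/* ... clearing of error status registers and enabling of
		 * Root Error Command interrupts elided ... */
	}

This is exactly the write that breaks a platform which relies on System
Errors for its firmware-first notification.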
On Wed, 2018-11-14 at 19:22 +0000, Alex_Gagniuc@Dellteam.com wrote:
[snip]
> > > > The whole issue of firmware-first, the mechanism by which firmware
> > > > gets control, the System Error enables in Root Port Root Control
> > > > registers, etc., is very murky to me.  Jon has a sort of similar
> > > > issue with VMD where he needs to leave System Errors enabled
> > > > instead of disabling them as we currently do.
> > >
> > > Well, the OS gets control via the _OSC method, and based on that it
> > > should touch/not touch the AER bits.
> >
> > I agree so far.
> >
> > > The bits that get set/cleared come from the _HPX method,
> >
> > _HPX tells us about some AER registers, Device Control, Link Control,
> > and some bridge registers.  It doesn't say anything about the Root
> > Control register that Jon is concerned with.
>
> _HPX type 3 (yay!!!) got approved recently, and that will have more
> fine-grained control. It will be able to handle the Root Control reg.
>
> > For firmware-first to work, firmware has to get control.  How does it
> > get control?  How does OSPM know to either set up that mechanism or
> > keep its mitts off something firmware set up before handoff?
>
> My understanding is that, if FW keeps control of AER in _OSC, then it
> will have set things up to get notified instead of the OS. OSPM not
> touching AER bits is to make sure it doesn't mess up FW's setup. I
> think there are some proprietary bits in the root port to route
> interrupts to SMIs instead of the AER vectors.
>
> > In Jon's VMD case, I think firmware-first relies on the System Error
> > controlled by the Root Control register.  Linux thinks it owns that,
> > and I don't know how to learn otherwise.
>
> Didn't Keith say the root port is not visible to the OS?
>
> Alex

That's correct. OS visibility wrt ACPI is limited to the VMD
endpoint/host bridge device which exposes the root ports. The root ports
aren't described by ACPI. VMD is the unusual case.

In the VMD case, we might or might not need to pass control back to AER
for further error handling post-FFS. I can see that's normally done by
GHES, but it will probably need some shimming to support the VMD case. I
can't rely on AER, because if any other devices use APEI, then the AER
module won't be initialized (aer_service_init::aer_acpi_firmware_first).
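The function Jon names, as it looks in v4.20-era drivers/pci/pcie/aer.c
(quoted from memory): if any HEST entry marks PCIe AER as
firmware-first, the AER service driver never registers at all, for every
port in the system, VMD root ports included:

	static int __init aer_service_init(void)
	{
		/* Bail out globally: no native AER, or any FFS HEST entry */
		if (!pci_aer_available() || aer_acpi_firmware_first())
			return -ENXIO;
		return pcie_port_service_register(&aerdriver);
	}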
On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> > Just to make sure we're on the same page, can you point me to this
> > rule?  I do see that OSPM must request control of AER using _OSC
> > before it touches the AER registers.  What I don't see is the
> > connection between firmware-first and the AER registers.
>
> ACPI 6.2 - 6.2.11.3, Table 6-197:
>
> PCI Express Advanced Error Reporting control:
>   * The firmware sets this bit to 1 to grant control over PCI Express
> Advanced Error Reporting. If firmware allows the OS control of this
> feature, then in the context of the _OSC method it must ensure that
> error messages are routed to device interrupts as described in the PCI
> Express Base Specification[...]
>
> Now I'm confused too:
>   * HEST -> __aer_firmware_first
>     This is used for touching/not touching AER bits
>   * _OSC -> bridge->native_aer
>     Used to enable/not enable the AER portdrv service
>
> Maybe Keith knows better why we're doing it this way. From the ACPI
> text, it doesn't seem that control of AER would be tied to HEST
> entries, although in practice, it is.

I'm not sure; that predates me. HEST does have a FIRMWARE_FIRST flag,
but the spec doesn't say any more about its relation to _OSC control or
the AER capability. There's nothing in the PCIe spec either.

I also don't know why Linux disables the AER driver if only one device
has a FIRMWARE_FIRST HEST entry. Shouldn't that just be a per-device
decision?

> > The closest I can find is the "Enabled" field in the HEST PCIe
> > AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> >
> >     If the field value is 1, indicates this error source is
> >     to be enabled.
> >
> >     If the field value is 0, indicates that the error source
> >     is not to be enabled.
> >
> >     If FIRMWARE_FIRST is set in the flags field, the Enabled
> >     field is ignored by the OSPM.
> >
> > AFAICT, Linux completely ignores the Enabled field in these
> > structures.
>
> I don't think ignoring the field is a problem:
>   * With FFS, the OS should ignore it.
>   * Without FFS, we have control, and we get to make the decisions
>     anyway.
>
> In the latter case we decide whether to use AER, independent of the
> crap in ACPI. I'm not even sure why "Enabled" matters in native AER
> handling. Probably one of the check-boxes in the "Binary table
> designer's handbook"?

And why doesn't Linux do anything with the _OSC response other than
logging it? If OS control wasn't granted, shouldn't that take priority
over HEST?
On 11/14/2018 02:27 PM, Keith Busch wrote:
> On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
>>> Just to make sure we're on the same page, can you point me to this
>>> rule?  I do see that OSPM must request control of AER using _OSC
>>> before it touches the AER registers.  What I don't see is the
>>> connection between firmware-first and the AER registers.
>>
>> ACPI 6.2 - 6.2.11.3, Table 6-197:
>> [...]
>> Maybe Keith knows better why we're doing it this way. From the ACPI
>> text, it doesn't seem that control of AER would be tied to HEST
>> entries, although in practice, it is.
>
> I'm not sure; that predates me. HEST does have a FIRMWARE_FIRST flag,
> but the spec doesn't say any more about its relation to _OSC control or
> the AER capability. There's nothing in the PCIe spec either.

Speaking to one of the PCIe (and _HPX type 3) spec authors: ownership of
AER should be determined by _OSC, period. The result of _OSC applies to
every device under the root port. This thing we do where we check HEST
is crap. If I'm not stepping on anyone's toes, and there are no known
unintended consequences, I can look at patching this up. I'm not
promising a patch, though, but it's exactly the sort of thing I like to
fix.

> I also don't know why Linux disables the AER driver if only one device
> has a FIRMWARE_FIRST HEST entry. Shouldn't that just be a per-device
> decision?

I think the logic is that if one HEST entry has both the FFS and GLOBAL
flags set, then AER services are disabled for all devices. It works
better in practice than it works in theory. I think _OSC should be the
determining factor here, not HEST.

>>> The closest I can find is the "Enabled" field in the HEST PCIe
>>> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>>> [...]
>>> AFAICT, Linux completely ignores the Enabled field in these
>>> structures.
>>
>> I don't think ignoring the field is a problem:
>>   * With FFS, the OS should ignore it.
>>   * Without FFS, we have control, and we get to make the decisions
>>     anyway.
>>
>> In the latter case we decide whether to use AER, independent of the
>> crap in ACPI. I'm not even sure why "Enabled" matters in native AER
>> handling. Probably one of the check-boxes in the "Binary table
>> designer's handbook"?
>
> And why doesn't Linux do anything with the _OSC response other than
> logging it? If OS control wasn't granted, shouldn't that take priority
> over HEST?

But it does, in portdrv_core.c:

	if (dev->aer_cap && pci_aer_available() &&
	    (pcie_ports_native || host->native_aer)) {
		services |= PCIE_PORT_SERVICE_AER;

That flag later creates a PCIe service device that aerdrv can attach to.

Alex
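The HEST-scanning logic being criticized, approximately as it appears in
v4.20-era drivers/pci/pcie/aer.c (trimmed; the hest_source_is_pcie_aer(),
hest_match_type() and hest_match_pci() helpers are elided). Note that the
no-device scan used by aer_acpi_firmware_first() sets firmware-first if
*any* PCIe AER entry carries the flag, whether or not GLOBAL is set:

	struct aer_hest_parse_info {
		struct pci_dev *pci_dev;
		int firmware_first;
	};

	static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
	{
		struct aer_hest_parse_info *info = data;
		struct acpi_hest_aer_common *p;
		int ff;

		if (!hest_source_is_pcie_aer(hest_hdr))
			return 0;

		p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
		ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);

		/*
		 * If no specific device is supplied, determine whether
		 * FIRMWARE_FIRST is set for *any* PCIe device.
		 */
		if (!info->pci_dev) {
			info->firmware_first |= ff;
			return 0;
		}

		/* Otherwise, check the specific device */
		if (p->flags & ACPI_HEST_GLOBAL) {
			if (hest_match_type(hest_hdr, info->pci_dev))
				info->firmware_first = ff;
		} else if (hest_match_pci(p, info->pci_dev))
			info->firmware_first = ff;

		return 0;
	}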
On Wed, Nov 14, 2018 at 08:52:10PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> But it does, in portdrv_core.c:
>
>	if (dev->aer_cap && pci_aer_available() &&
>	    (pcie_ports_native || host->native_aer)) {
>		services |= PCIE_PORT_SERVICE_AER;
>
> That flag later creates a PCIe service device that aerdrv can attach to.

Oh, right. I saw that negotiate_os_control() just uses a stack variable
for the _OSC response, but if I had looked one level deeper, I'd have
seen it cached in a different structure.
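That "different structure" is the pci_host_bridge. From memory of
v4.20-era drivers/acpi/pci_root.c, acpi_pci_root_create() translates the
negotiated _OSC result into the bridge's native_* flags (trimmed
excerpt; comments are editorial):

	struct pci_host_bridge *host_bridge = to_pci_host_bridge(bus->bridge);

	/* Only keep native (OS-driven) services for which _OSC actually
	 * granted the OS control. */
	if (!(root->osc_control_set & OSC_PCI_EXPRESS_AER_CONTROL))
		host_bridge->native_aer = 0;
	if (!(root->osc_control_set & OSC_PCI_EXPRESS_PME_CONTROL))
		host_bridge->native_pme = 0;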
On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> > On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> >> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
> >> ...
> >>> Do you think Linux observes the rule about not touching AER bits on
> >>> FFS?  I'm not sure it does.  I'm not even sure what section of the
> >>> spec is relevant.
> >>
> >> I haven't found any place where Linux breaks this rule. I'm very
> >> confident that, unless otherwise instructed, we follow this rule.
> >
> > Just to make sure we're on the same page, can you point me to this
> > rule?  I do see that OSPM must request control of AER using _OSC
> > before it touches the AER registers.  What I don't see is the
> > connection between firmware-first and the AER registers.
>
> ACPI 6.2 - 6.2.11.3, Table 6-197:
>
> PCI Express Advanced Error Reporting control:
>   * The firmware sets this bit to 1 to grant control over PCI Express
> Advanced Error Reporting. If firmware allows the OS control of this
> feature, then in the context of the _OSC method it must ensure that
> error messages are routed to device interrupts as described in the PCI
> Express Base Specification[...]

The PCIe Base Spec is pretty big, so I wish this reference were a
little more explicit.  I *guess* maybe it's referring to PCIe r4.0,
figure 6-3 in sec 6.2.6, where PCIe ERR_* Messages can be routed to
"INTx or MSI Error Interrupts" and/or "platform-specific System Error"
interrupts.

"Device interrupts" seems like it refers to the "INTx or MSI"
interrupts, not the platform-specific System Errors, so I would read
that as saying "if firmware grants OS control of AER via _OSC,
firmware must set the AER Reporting Enables in the AER Root Error
Command register."  But that seems a little silly because the OS now
*owns* the AER capability and it can set the AER Root Error Command
register itself if it wants to.

And I still don't see the connection here with Firmware-First.  I'm
pretty sure firmware could not be notified via INTx or MSI interrupts
because those are totally managed by OSPM.

> > The closest I can find is the "Enabled" field in the HEST PCIe
> > AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
> >
> >     If the field value is 1, indicates this error source is
> >     to be enabled.
> >
> >     If the field value is 0, indicates that the error source
> >     is not to be enabled.
> >
> >     If FIRMWARE_FIRST is set in the flags field, the Enabled
> >     field is ignored by the OSPM.
> >
> > AFAICT, Linux completely ignores the Enabled field in these
> > structures.
>
> I don't think ignoring the field is a problem:
>   * With FFS, the OS should ignore it.
>   * Without FFS, we have control, and we get to make the decisions
>     anyway.
>
> In the latter case we decide whether to use AER, independent of the
> crap in ACPI. I'm not even sure why "Enabled" matters in native AER
> handling.

It seems like these HEST structures are "here's how firmware thinks
you should set up AER on this device".  But I agree, I have no idea
how to interpret "Enabled".  The rest of the HEST fields cover all the
useful AER registers, including the Reporting Enables in the AER Root
Error Command register *and* the Error Reporting Enables in the Device
Control register.  So I don't know what the "Enabled" field adds to
all that.

What a mess.

> > For firmware-first to work, firmware has to get control.
> > How does OSPM know to either set up that mechanism or keep its
> > mitts off something firmware set up before handoff?
>
> My understanding is that, if FW keeps control of AER in _OSC, then
> it will have set things up to get notified instead of the OS. OSPM
> not touching AER bits is to make sure it doesn't mess up FW's setup.
> I think there are some proprietary bits in the root port to route
> interrupts to SMIs instead of the AER vectors.

It makes good sense that if OSPM doesn't have AER control, firmware
does all AER handling, including any setup for firmware-first
notification.  If we can assume that firmware-first notification is
done in some way the OS doesn't know about and can't mess up, that
would be awesome.

But I think the VMD model really has nothing to do with the APEI
firmware-first model.  With VMD, it sounds like OSPM owns the AER
capability and doesn't know firmware exists *except* that it has to be
careful not to step on firmware's interrupt.  So maybe we can handle
it separately.

Bjorn
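A minimal sketch of the ownership point: when the OS owns AER, nothing
stops it from setting the Root Error Command reporting enables itself.
The register constants are real (include/uapi/linux/pci_regs.h); the
surrounding function is illustrative only:

	#include <linux/pci.h>

	/* Illustrative only: enable interrupt generation for correctable,
	 * non-fatal and fatal error messages at a root port whose AER
	 * capability the OS owns. */
	static void enable_root_error_reporting(struct pci_dev *rp)
	{
		int aer = rp->aer_cap;	/* AER capability offset */
		u32 cmd;

		pci_read_config_dword(rp, aer + PCI_ERR_ROOT_COMMAND, &cmd);
		cmd |= PCI_ERR_ROOT_CMD_COR_EN | PCI_ERR_ROOT_CMD_NONFATAL_EN |
		       PCI_ERR_ROOT_CMD_FATAL_EN;
		pci_write_config_dword(rp, aer + PCI_ERR_ROOT_COMMAND, cmd);
	}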
+ Borislav (ACPI guy, maybe he can answer more about HEST)

On 11/15/2018 12:24 AM, Bjorn Helgaas wrote:
> On Wed, Nov 14, 2018 at 07:22:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
>>> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> ACPI 6.2 - 6.2.11.3, Table 6-197:
>>
>> PCI Express Advanced Error Reporting control:
>>   * The firmware sets this bit to 1 to grant control over PCI Express
>> Advanced Error Reporting. If firmware allows the OS control of this
>> feature, then in the context of the _OSC method it must ensure that
>> error messages are routed to device interrupts as described in the PCI
>> Express Base Specification[...]
>
> The PCIe Base Spec is pretty big, so I wish this reference were a
> little more explicit.  I *guess* maybe it's referring to PCIe r4.0,
> figure 6-3 in sec 6.2.6, where PCIe ERR_* Messages can be routed to
> "INTx or MSI Error Interrupts" and/or "platform-specific System Error"
> interrupts.

It's engineering slang for "I don't know what the spec says, but do what
the spec says." What it means is that FW should set up the HW in such a
way that a PCIe-compliant OS which enables interrupts will get
interrupts the way it expects. For example, some FW might set the root
port to trigger an SMI instead of firing up the proper MSI vector. In
this case FW must ensure that AER interrupts fire up the MSI vector
instead of triggering an SMI.

> "Device interrupts" seems like it refers to the "INTx or MSI"
> interrupts, not the platform-specific System Errors, so I would read
> that as saying "if firmware grants OS control of AER via _OSC,
> firmware must set the AER Reporting Enables in the AER Root Error
> Command register."  But that seems a little silly because the OS now
> *owns* the AER capability and it can set the AER Root Error Command
> register itself if it wants to.

When the OS takes control of AER, it is responsible for setting things
up, at least as far as standard PCIe registers are concerned. So setting
up the Root Command register would still be the OS's responsibility.
Proprietary registers, like routing to SMI/SCI/MSI, would have to be
handled by FW.

> And I still don't see the connection here with Firmware-First.  I'm
> pretty sure firmware could not be notified via INTx or MSI interrupts
> because those are totally managed by OSPM.

The connection with FFS is that FFS needs to be notified by some other
method. FW can't commandeer interrupt vectors willy-nilly. FW gets
notified by platform-specific mechanisms. In my case, it happens to be
SMM, but it could be a number of other proprietary mechanisms.

>>> The closest I can find is the "Enabled" field in the HEST PCIe
>>> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>>> [...]
>>> AFAICT, Linux completely ignores the Enabled field in these
>>> structures.
>>
>> I don't think ignoring the field is a problem:
>>   * With FFS, the OS should ignore it.
>>   * Without FFS, we have control, and we get to make the decisions
>>     anyway.
>>
>> In the latter case we decide whether to use AER, independent of the
>> crap in ACPI. I'm not even sure why "Enabled" matters in native AER
>> handling.
>
> It seems like these HEST structures are "here's how firmware thinks
> you should set up AER on this device".  But I agree, I have no idea
> how to interpret "Enabled".
> The rest of the HEST fields cover all the useful AER registers,
> including the Reporting Enables in the AER Root Error Command register
> *and* the Error Reporting Enables in the Device Control register.  So
> I don't know what the "Enabled" field adds to all that.

"This table contains information platform firmware supplies to OSPM for
configuring AER support on a given PCI Express device."

I don't fully get the point of the HEST structures for PCIe. I'm hoping
Borislav can shed some light on this. I think we can ignore them in the
PCI core, with a note to revisit them if something ever goes horribly
wrong. I've posted some patches to reduce reliance on HEST [1].

> What a mess.

It's ACPI. What else did you expect?

>>> For firmware-first to work, firmware has to get control.  How does
>>> it get control?  How does OSPM know to either set up that
>>> mechanism or keep its mitts off something firmware set up before
>>> handoff?
>>
>> My understanding is that, if FW keeps control of AER in _OSC, then
>> it will have set things up to get notified instead of the OS. OSPM
>> not touching AER bits is to make sure it doesn't mess up FW's setup.
>> I think there are some proprietary bits in the root port to route
>> interrupts to SMIs instead of the AER vectors.
>
> It makes good sense that if OSPM doesn't have AER control, firmware
> does all AER handling, including any setup for firmware-first
> notification.  If we can assume that firmware-first notification is
> done in some way the OS doesn't know about and can't mess up, that
> would be awesome.

I think that's a reasonable assumption as long as the OS respects AER
ownership.

> But I think the VMD model really has nothing to do with the APEI
> firmware-first model.  With VMD, it sounds like OSPM owns the AER
> capability and doesn't know firmware exists *except* that it has to be
> careful not to step on firmware's interrupt.  So maybe we can handle
> it separately.

<sarcasm_alert> Sounds like VMD is qualified to become part of the ACPI
spec. </sarcasm_alert>

I'll have to look into VMD in more detail, but this sounds like a design
flaw. It's possible to use the values in HEST to set up AER, but that
might conflict with the OS's wishes on how to set up AER.

Alex

[1] https://lkml.org/lkml/2018/11/16/202
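For completeness, the native-side counterpart of those HEST Device
Control hints: when the OS owns AER, it enables error reporting in
Device Control itself. From memory of v4.20-era drivers/pci/pcie/aer.c
(note the HEST firmware-first check guarding it, which is the very
coupling under discussion):

	#define PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
					 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

	int pci_enable_pcie_error_reporting(struct pci_dev *dev)
	{
		/* Refuse if HEST marks this device firmware-first */
		if (pcie_aer_get_firmware_first(dev))
			return -EIO;

		if (!dev->aer_cap)
			return -EIO;

		return pcie_capability_set_word(dev, PCI_EXP_DEVCTL,
						PCI_EXP_AER_FLAGS);
	}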
On 11/8/18 2:09 PM, Bjorn Helgaas wrote:
>
> [EXTERNAL EMAIL]
> Please report any suspicious attachments, links, or requests for sensitive information.
>
>
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]
>
> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
>> On Tue, Sep 18, 2018 at 05:15:00PM -0500, Alexandru Gagniuc wrote:
>>> When a PCI device is gone, we don't want to send IO to it if we can
>>> avoid it. We expose functionality via the irq_chip structure. As
>>> users of that structure may not know about the underlying PCI device,
>>> it's our responsibility to guard against removed devices.
>>>
>>> .irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
>>> .irq_mask/unmask() are not. Guard them for completeness.
>>>
>>> For example, surprise removal of a PCIe device triggers teardown. This
>>> touches the irq_chip ops at some point to disable the interrupts. I/O
>>> generated here can crash the system on firmware-first machines.
>>> Not triggering the IO in the first place greatly reduces the
>>> possibility of the problem occurring.
>>>
>>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>>
>> Applied to pci/misc for v4.21, thanks!
>
> I'm having second thoughts about this.

Do we have a verdict on this? If you don't like this approach, then I'll
have to fix the problem some other way, but the problem still needs to
be fixed.

Alex
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..f31058fd2260 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -227,6 +227,9 @@ static void msi_set_mask_bit(struct irq_data *data, u32 flag)
 {
 	struct msi_desc *desc = irq_data_get_msi_desc(data);
 
+	if (pci_dev_is_disconnected(msi_desc_to_pci_dev(desc)))
+		return;
+
 	if (desc->msi_attrib.is_msix) {
 		msix_mask_irq(desc, flag);
 		readl(desc->mask_base);		/* Flush write to device */
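For readers of the diff: pci_dev_is_disconnected() is a cheap flag test,
not a config access. It lives in drivers/pci/pci.h (quoted from memory
of v4.20-era code):

	static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
	{
		return dev->error_state == pci_channel_io_perm_failure;
	}

The permanent-failure state is set during surprise-removal teardown
(e.g. via pci_dev_set_disconnected()), so the check is inherently racy
against the actual removal -- which matches the commit message's
"greatly reduces the possibility" wording rather than claiming to
eliminate the problem.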
When a PCI device is gone, we don't want to send IO to it if we can
avoid it. We expose functionality via the irq_chip structure. As users
of that structure may not know about the underlying PCI device, it's our
responsibility to guard against removed devices.

.irq_write_msi_msg() is already guarded inside __pci_write_msi_msg().
.irq_mask/unmask() are not. Guard them for completeness.

For example, surprise removal of a PCIe device triggers teardown. This
touches the irq_chip ops at some point to disable the interrupts. I/O
generated here can crash the system on firmware-first machines. Not
triggering the IO in the first place greatly reduces the possibility of
the problem occurring.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/pci/msi.c | 3 +++
 1 file changed, 3 insertions(+)