Message ID | 1445894704-28277-1-git-send-email-okaya@codeaurora.org (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
[+cc Taku] Hi Sinan, On Mon, Oct 26, 2015 at 05:25:02PM -0400, Sinan Kaya wrote: > A PCIe card behind a PCIe switch is unable to report its errors > when SERR# forwarding is not enabled on the PCIe switch's > secondary interface. This is required by the PCIe spec. I think enabling SERR# forwarding is only "required by the spec" in the sense that bridges do not forward errors upstream unless the SERR# Enable bit is set, right? So I would say this is just an ordinary enable bit in the bridge, not a special spec requirement. The Linux AER code doesn't poll for errors; it assumes errors will be propagated upstream to the Root Port, where they will cause an interrupt, so I agree that it doesn't really make sense to enable AER for a device unless we also enable SERR# forwarding in all the bridges leading to it. I assume this patch fixes a problem, e.g., errors reported by a device are not forwarded upstream, so AER never logs them, and with this patch, AER does log them? I suppose we didn't notice this before because some firmware enables SERR# forwarding for us, but yours doesn't? > This patch > enables SERR# forwarding and also cleans out compatibility > mode so that AER reporting is enabled. > > Tested with PEX8749-CA RDK. > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > --- > drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- > 1 file changed, 55 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index 9803e3d..acd22d7 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); > > int pci_enable_pcie_error_reporting(struct pci_dev *dev) > { > + u8 header_type; > + int pos; > + > if (pcie_aer_get_firmware_first(dev)) > return -EIO; > > - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) > + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); > + if (!pos) > return -EIO; > > + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); > + > + /* needs to be a bridge/switch */ > + if (header_type == PCI_HEADER_TYPE_BRIDGE) { > + u32 status; > + u16 control; > + > + /* > + * A switch will not forward ERR_ messages coming from an > + * endpoint if SERR# forwarding is not enabled. > + * AER driver is checking the errors at the root only. > + */ > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); > + control |= PCI_BRIDGE_CTL_SERR; > + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); Does this work for hot-added devices? I don't see a path that calls pci_enable_pcie_error_reporting() for hot-added devices, so I don't know how the PCI_EXP_AER_FLAGS bits would get set for them. Taku, we recently merged your patch: b07461a8e45b ("PCI/AER: Clear error status registers during enumeration and restore"). The changelog mentions errors recorded by hot-added devices. Did you test AER on hot-added devices, or did you just notice that they recorded errors? Do you think we could enable PCI_BRIDGE_CTL_SERR and set the PCI_EXP_AER_FLAGS bits from pci_init_capabilities()? That path would work for all devices, whether present at boot or hot-added later. I know the AER driver won't attach to the Root Port and set up its interrupts until after enumeration, so AER would be enabled before we set up an ISR for the AER interrupts and turn them on in the Root Port. But I think any errors in the interim should be logged and shouldn't cause a problem. > + /* > + * Need to inform hardware that we support > + * Role-Based Error Reporting. > + */ > + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); > + status &= ~PCI_ERR_COR_ADV_NFAT; > + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); Let's do this in a separate patch. The SERR# thing is related to propagating error messages upstream. PCI_ERR_COR_ADV_NFAT is a bit different. I don't understand all the implications of Role-Based Error Reporting. Can you include a pointer to the Linux code that comprehends it? It looks like the section 6.2.7 implementation note is relevant, but I don't understand it yet. Do we need to pay attention to the Role-Based Error Reporting bit in the Device Capabilities register for any reason? I guess maybe it doesn't matter because Role-Base Error Reporting appeared in PCIe 1.1, and the PCI_ERR_COR_ADV_NFAT bit was reserved prior to that, so it shouldn't do anything if we clear it. Bjorn > + } > + > return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); > } > EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting); > > int pci_disable_pcie_error_reporting(struct pci_dev *dev) > { > + int pos; > + u8 header_type; > + > if (pcie_aer_get_firmware_first(dev)) > return -EIO; > > + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); > + if (!pos) > + return -EIO; > + > + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); > + > + /* needs to be a bridge/switch */ > + if (header_type == PCI_HEADER_TYPE_BRIDGE) { > + u32 status; > + u16 control; > + > + /* clear serr forwarding */ > + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); > + control &= ~PCI_BRIDGE_CTL_SERR; > + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); > + > + /* set compatibility mode */ > + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); > + status |= PCI_ERR_COR_ADV_NFAT; > + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); > + } > + > return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, > PCI_EXP_AER_FLAGS); > } > -- > Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc. > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project > > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/01/2015 01:51 PM, Bjorn Helgaas wrote: > [+cc Taku] It looks to me like Bjorn intended to add Taku to the distribution list, but accidentally left him off, so I'm adding him to the To field in this reply. Regards, Christopher Covington > Hi Sinan, > > On Mon, Oct 26, 2015 at 05:25:02PM -0400, Sinan Kaya wrote: >> A PCIe card behind a PCIe switch is unable to report its errors >> when SERR# forwarding is not enabled on the PCIe switch's >> secondary interface. This is required by the PCIe spec. > > I think enabling SERR# forwarding is only "required by the spec" in > the sense that bridges do not forward errors upstream unless the > SERR# Enable bit is set, right? So I would say this is just an > ordinary enable bit in the bridge, not a special spec requirement. > > The Linux AER code doesn't poll for errors; it assumes errors will be > propagated upstream to the Root Port, where they will cause an > interrupt, so I agree that it doesn't really make sense to enable AER > for a device unless we also enable SERR# forwarding in all the bridges > leading to it. > > I assume this patch fixes a problem, e.g., errors reported by a device > are not forwarded upstream, so AER never logs them, and with this > patch, AER does log them? I suppose we didn't notice this before > because some firmware enables SERR# forwarding for us, but yours > doesn't? > >> This patch >> enables SERR# forwarding and also cleans out compatibility >> mode so that AER reporting is enabled. >> >> Tested with PEX8749-CA RDK. >> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> >> --- >> drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- >> 1 file changed, 55 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c >> index 9803e3d..acd22d7 100644 >> --- a/drivers/pci/pcie/aer/aerdrv_core.c >> +++ b/drivers/pci/pcie/aer/aerdrv_core.c >> @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); >> >> int pci_enable_pcie_error_reporting(struct pci_dev *dev) >> { >> + u8 header_type; >> + int pos; >> + >> if (pcie_aer_get_firmware_first(dev)) >> return -EIO; >> >> - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) >> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); >> + if (!pos) >> return -EIO; >> >> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); >> + >> + /* needs to be a bridge/switch */ >> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { >> + u32 status; >> + u16 control; >> + >> + /* >> + * A switch will not forward ERR_ messages coming from an >> + * endpoint if SERR# forwarding is not enabled. >> + * AER driver is checking the errors at the root only. >> + */ >> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); >> + control |= PCI_BRIDGE_CTL_SERR; >> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); > > Does this work for hot-added devices? I don't see a path that calls > pci_enable_pcie_error_reporting() for hot-added devices, so I don't > know how the PCI_EXP_AER_FLAGS bits would get set for them. > > Taku, we recently merged your patch: b07461a8e45b ("PCI/AER: Clear > error status registers during enumeration and restore"). The > changelog mentions errors recorded by hot-added devices. Did you test > AER on hot-added devices, or did you just notice that they recorded > errors? > > Do you think we could enable PCI_BRIDGE_CTL_SERR and set the > PCI_EXP_AER_FLAGS bits from pci_init_capabilities()? That path would > work for all devices, whether present at boot or hot-added later. > > I know the AER driver won't attach to the Root Port and set up its > interrupts until after enumeration, so AER would be enabled before we > set up an ISR for the AER interrupts and turn them on in the Root > Port. But I think any errors in the interim should be logged and > shouldn't cause a problem. > >> + /* >> + * Need to inform hardware that we support >> + * Role-Based Error Reporting. >> + */ >> + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); >> + status &= ~PCI_ERR_COR_ADV_NFAT; >> + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); > > Let's do this in a separate patch. The SERR# thing is related to > propagating error messages upstream. PCI_ERR_COR_ADV_NFAT is a bit > different. > > I don't understand all the implications of Role-Based Error Reporting. > Can you include a pointer to the Linux code that comprehends it? > It looks like the section 6.2.7 implementation note is relevant, but I > don't understand it yet. > > Do we need to pay attention to the Role-Based Error Reporting bit in > the Device Capabilities register for any reason? I guess maybe it > doesn't matter because Role-Base Error Reporting appeared in PCIe 1.1, > and the PCI_ERR_COR_ADV_NFAT bit was reserved prior to that, so it > shouldn't do anything if we clear it. > > Bjorn > >> + } >> + >> return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); >> } >> EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting); >> >> int pci_disable_pcie_error_reporting(struct pci_dev *dev) >> { >> + int pos; >> + u8 header_type; >> + >> if (pcie_aer_get_firmware_first(dev)) >> return -EIO; >> >> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); >> + if (!pos) >> + return -EIO; >> + >> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); >> + >> + /* needs to be a bridge/switch */ >> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { >> + u32 status; >> + u16 control; >> + >> + /* clear serr forwarding */ >> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); >> + control &= ~PCI_BRIDGE_CTL_SERR; >> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); >> + >> + /* set compatibility mode */ >> + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); >> + status |= PCI_ERR_COR_ADV_NFAT; >> + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); >> + } >> + >> return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, >> PCI_EXP_AER_FLAGS); >> } >> -- >> Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc. >> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-pci" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/1/2015 2:21 PM, Christopher Covington wrote: > On 12/01/2015 01:51 PM, Bjorn Helgaas wrote: >> [+cc Taku] > > It looks to me like Bjorn intended to add Taku to the distribution list, > but accidentally left him off, so I'm adding him to the To field in this > reply. > > Regards, > Christopher Covington > >> Hi Sinan, >> >> On Mon, Oct 26, 2015 at 05:25:02PM -0400, Sinan Kaya wrote: >>> A PCIe card behind a PCIe switch is unable to report its errors >>> when SERR# forwarding is not enabled on the PCIe switch's >>> secondary interface. This is required by the PCIe spec. >> >> I think enabling SERR# forwarding is only "required by the spec" in >> the sense that bridges do not forward errors upstream unless the >> SERR# Enable bit is set, right? So I would say this is just an >> ordinary enable bit in the bridge, not a special spec requirement. >> Bottom line is that AER errors won't be forwarded if this bit is not set. >> The Linux AER code doesn't poll for errors; it assumes errors will be >> propagated upstream to the Root Port, where they will cause an >> interrupt, so I agree that it doesn't really make sense to enable AER >> for a device unless we also enable SERR# forwarding in all the bridges >> leading to it. >> >> I assume this patch fixes a problem, e.g., errors reported by a device >> are not forwarded upstream, so AER never logs them, and with this >> patch, AER does log them? Correct. I'm not observing the AER errors without this patch. After this patch, I'm seeing the AER errors coming from the endpoints. >> I suppose we didn't notice this before >> because some firmware enables SERR# forwarding for us, but yours >> doesn't? Possible..., I could have done this in the UEFI BIOS but I'm also worried about hotplug case. That's why, I chose to submit a patch for the kernel. For instance, what happens after hotplug insertion. Will anybody set these bits? We need some kernel support for some PCIe features to reconfigure the hardware. >> >>> This patch >>> enables SERR# forwarding and also cleans out compatibility >>> mode so that AER reporting is enabled. >>> >>> Tested with PEX8749-CA RDK. >>> >>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> >>> --- >>> drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- >>> 1 file changed, 55 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c >>> index 9803e3d..acd22d7 100644 >>> --- a/drivers/pci/pcie/aer/aerdrv_core.c >>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c >>> @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); >>> >>> int pci_enable_pcie_error_reporting(struct pci_dev *dev) >>> { >>> + u8 header_type; >>> + int pos; >>> + >>> if (pcie_aer_get_firmware_first(dev)) >>> return -EIO; >>> >>> - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) >>> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); >>> + if (!pos) >>> return -EIO; >>> >>> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); >>> + >>> + /* needs to be a bridge/switch */ >>> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { >>> + u32 status; >>> + u16 control; >>> + >>> + /* >>> + * A switch will not forward ERR_ messages coming from an >>> + * endpoint if SERR# forwarding is not enabled. >>> + * AER driver is checking the errors at the root only. >>> + */ >>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); >>> + control |= PCI_BRIDGE_CTL_SERR; >>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); >> >> Does this work for hot-added devices? I don't see a path that calls >> pci_enable_pcie_error_reporting() for hot-added devices, so I don't >> know how the PCI_EXP_AER_FLAGS bits would get set for them. Yes for me. We remove the root port along with the endpoint during hotplug. On the next insertion, AER driver get reprobed for the root port. >> Let's do this in a separate patch. The SERR# thing is related to >> propagating error messages upstream. PCI_ERR_COR_ADV_NFAT is a bit >> different. >> >> I don't understand all the implications of Role-Based Error Reporting. >> Can you include a pointer to the Linux code that comprehends it? >> It looks like the section 6.2.7 implementation note is relevant, but I >> don't understand it yet. My understanding of the spec is that you can't have AER enabled when PCI_ERR_COR_ADV_NFAT is 1. >> >> Do we need to pay attention to the Role-Based Error Reporting bit in >> the Device Capabilities register for any reason? I guess maybe it >> doesn't matter because Role-Base Error Reporting appeared in PCIe 1.1, >> and the PCI_ERR_COR_ADV_NFAT bit was reserved prior to that, so it >> shouldn't do anything if we clear it. >> >> Bjorn >> >>> + } >>> + >>> return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); >>> } >>> EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting); >>> >>> int pci_disable_pcie_error_reporting(struct pci_dev *dev) >>> { >>> + int pos; >>> + u8 header_type; >>> + >>> if (pcie_aer_get_firmware_first(dev)) >>> return -EIO; >>> >>> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); >>> + if (!pos) >>> + return -EIO; >>> + >>> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); >>> + >>> + /* needs to be a bridge/switch */ >>> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { >>> + u32 status; >>> + u16 control; >>> + >>> + /* clear serr forwarding */ >>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); >>> + control &= ~PCI_BRIDGE_CTL_SERR; >>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); >>> + >>> + /* set compatibility mode */ >>> + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); >>> + status |= PCI_ERR_COR_ADV_NFAT; >>> + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); >>> + } >>> + >>> return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, >>> PCI_EXP_AER_FLAGS); >>> } >>> -- >>> Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc. >>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >
On Tue, Dec 01, 2015 at 03:08:48PM -0500, Sinan Kaya wrote: > On 12/1/2015 2:21 PM, Christopher Covington wrote: > > On 12/01/2015 01:51 PM, Bjorn Helgaas wrote: > >> [+cc Taku] > > > > It looks to me like Bjorn intended to add Taku to the distribution list, > > but accidentally left him off, so I'm adding him to the To field in this > > reply. > > > > Regards, > > Christopher Covington > > > >> Hi Sinan, > >> > >> On Mon, Oct 26, 2015 at 05:25:02PM -0400, Sinan Kaya wrote: > >>> A PCIe card behind a PCIe switch is unable to report its errors > >>> when SERR# forwarding is not enabled on the PCIe switch's > >>> secondary interface. This is required by the PCIe spec. > >> > >> I think enabling SERR# forwarding is only "required by the spec" in > >> the sense that bridges do not forward errors upstream unless the > >> SERR# Enable bit is set, right? So I would say this is just an > >> ordinary enable bit in the bridge, not a special spec requirement. > > Bottom line is that AER errors won't be forwarded if this bit is not set. Right. So it's required by the spec if you want error forwarding, in the same way the spec requires the Hot-Plug Interrupt Enable bit to be set if you want interrupts for hot-plug events. > >> The Linux AER code doesn't poll for errors; it assumes errors will be > >> propagated upstream to the Root Port, where they will cause an > >> interrupt, so I agree that it doesn't really make sense to enable AER > >> for a device unless we also enable SERR# forwarding in all the bridges > >> leading to it. > >> > >> I assume this patch fixes a problem, e.g., errors reported by a device > >> are not forwarded upstream, so AER never logs them, and with this > >> patch, AER does log them? > > Correct. I'm not observing the AER errors without this patch. After this > patch, I'm seeing the AER errors coming from the endpoints. Thanks for confirming this. > >> I suppose we didn't notice this before > >> because some firmware enables SERR# forwarding for us, but yours > >> doesn't? > > Possible..., I could have done this in the UEFI BIOS but I'm also > worried about hotplug case. That's why, I chose to submit a patch for > the kernel. Thanks for confirming that your firmware does not enable SERR# forwarding. That explains why this patch makes a difference for you. There must be firmware that does enable SERR# forwarding; otherwise AER would never have worked for anybody. I think it makes sense for Linux to make sure SERR# forwarding is enabled when enabling AER. We shouldn't rely on firmware setting it for us. > For instance, what happens after hotplug insertion. Will anybody set > these bits? We need some kernel support for some PCIe features to > reconfigure the hardware. ACPI systems that support hotplug may supply _HPP or _HPX methods. If _HPP or _HPX indicates that SERR# forwarding should be enabled, Linux does enable it for hot-added devices (and I think we now do it for all devices at boot, too). That would explain how this could work on ACPI systems today. But obviously AER should work even if we don't have ACPI, so we need something like this patch. > >>> This patch > >>> enables SERR# forwarding and also cleans out compatibility > >>> mode so that AER reporting is enabled. > >>> > >>> Tested with PEX8749-CA RDK. > >>> > >>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> > >>> --- > >>> drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- > >>> 1 file changed, 55 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > >>> index 9803e3d..acd22d7 100644 > >>> --- a/drivers/pci/pcie/aer/aerdrv_core.c > >>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c > >>> @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); > >>> > >>> int pci_enable_pcie_error_reporting(struct pci_dev *dev) > >>> { > >>> + u8 header_type; > >>> + int pos; > >>> + > >>> if (pcie_aer_get_firmware_first(dev)) > >>> return -EIO; > >>> > >>> - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) > >>> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); > >>> + if (!pos) > >>> return -EIO; > >>> > >>> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); > >>> + > >>> + /* needs to be a bridge/switch */ > >>> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { > >>> + u32 status; > >>> + u16 control; > >>> + > >>> + /* > >>> + * A switch will not forward ERR_ messages coming from an > >>> + * endpoint if SERR# forwarding is not enabled. > >>> + * AER driver is checking the errors at the root only. > >>> + */ > >>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); > >>> + control |= PCI_BRIDGE_CTL_SERR; > >>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); > >> > >> Does this work for hot-added devices? I don't see a path that calls > >> pci_enable_pcie_error_reporting() for hot-added devices, so I don't > >> know how the PCI_EXP_AER_FLAGS bits would get set for them. > > Yes for me. We remove the root port along with the endpoint during > hotplug. On the next insertion, AER driver get reprobed for the root port. The case I'm wondering about is when we hot-add an endpoint (not a root port). In that case, I think the AER driver is not re-probed because aerdriver.port_type == PCI_EXP_TYPE_ROOT_PORT, so pcie_port_bus_match() will only match the AER driver with a Root Port. On your system (which I assume doesn't have _HPP or _HPX), if you hot-add a bridge (not a Root Port) or an endpoint, I don't know what would enable SERR# forwarding for a bridge or AER reporting for an endpoint. > >> Let's do this in a separate patch. The SERR# thing is related to > >> propagating error messages upstream. PCI_ERR_COR_ADV_NFAT is a bit > >> different. > >> > >> I don't understand all the implications of Role-Based Error Reporting. > >> Can you include a pointer to the Linux code that comprehends it? > >> It looks like the section 6.2.7 implementation note is relevant, but I > >> don't understand it yet. > > My understanding of the spec is that you can't have AER enabled when > PCI_ERR_COR_ADV_NFAT is 1. Can you point out the reasoning here in a little more detail? This may well be true, but it wasn't that obvious to me. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/1/2015 6:07 PM, Bjorn Helgaas wrote: > On Tue, Dec 01, 2015 at 03:08:48PM -0500, Sinan Kaya wrote: >> On 12/1/2015 2:21 PM, Christopher Covington wrote: >>> On 12/01/2015 01:51 PM, Bjorn Helgaas wrote: >>>> [+cc Taku] >>> >>> It looks to me like Bjorn intended to add Taku to the distribution list, >>> but accidentally left him off, so I'm adding him to the To field in this >>> reply. >>> >>> Regards, >>> Christopher Covington >>> >>>> Hi Sinan, >>>> >>>> On Mon, Oct 26, 2015 at 05:25:02PM -0400, Sinan Kaya wrote: >>>>> A PCIe card behind a PCIe switch is unable to report its errors >>>>> when SERR# forwarding is not enabled on the PCIe switch's >>>>> secondary interface. This is required by the PCIe spec. >>>> >>>> I think enabling SERR# forwarding is only "required by the spec" in >>>> the sense that bridges do not forward errors upstream unless the >>>> SERR# Enable bit is set, right? So I would say this is just an >>>> ordinary enable bit in the bridge, not a special spec requirement. >> >> Bottom line is that AER errors won't be forwarded if this bit is not set. > > Right. So it's required by the spec if you want error forwarding, in > the same way the spec requires the Hot-Plug Interrupt Enable bit to be > set if you want interrupts for hot-plug events. > >>>> The Linux AER code doesn't poll for errors; it assumes errors will be >>>> propagated upstream to the Root Port, where they will cause an >>>> interrupt, so I agree that it doesn't really make sense to enable AER >>>> for a device unless we also enable SERR# forwarding in all the bridges >>>> leading to it. >>>> >>>> I assume this patch fixes a problem, e.g., errors reported by a device >>>> are not forwarded upstream, so AER never logs them, and with this >>>> patch, AER does log them? >> >> Correct. I'm not observing the AER errors without this patch. After this >> patch, I'm seeing the AER errors coming from the endpoints. > > Thanks for confirming this. > >>>> I suppose we didn't notice this before >>>> because some firmware enables SERR# forwarding for us, but yours >>>> doesn't? >> >> Possible..., I could have done this in the UEFI BIOS but I'm also >> worried about hotplug case. That's why, I chose to submit a patch for >> the kernel. > > Thanks for confirming that your firmware does not enable SERR# > forwarding. That explains why this patch makes a difference for you. > There must be firmware that does enable SERR# forwarding; otherwise > AER would never have worked for anybody. > > I think it makes sense for Linux to make sure SERR# forwarding is > enabled when enabling AER. We shouldn't rely on firmware setting it > for us. > >> For instance, what happens after hotplug insertion. Will anybody set >> these bits? We need some kernel support for some PCIe features to >> reconfigure the hardware. > > ACPI systems that support hotplug may supply _HPP or _HPX methods. If > _HPP or _HPX indicates that SERR# forwarding should be enabled, Linux > does enable it for hot-added devices (and I think we now do it for all > devices at boot, too). That would explain how this could work on ACPI > systems today. Are we sure about this? The name of the field in HPP is "Enable SERR". Will this also enable SERR# forwarding? I do have _HPP in ACPI with SERR enabled but I do not have HPX. I rely on the hardware defaults for which errors to be enabled and masked etc. So, I don't need HPX. > > But obviously AER should work even if we don't have ACPI, so we need > something like this patch. > >>>>> This patch >>>>> enables SERR# forwarding and also cleans out compatibility >>>>> mode so that AER reporting is enabled. >>>>> >>>>> Tested with PEX8749-CA RDK. >>>>> >>>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> >>>>> --- >>>>> drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- >>>>> 1 file changed, 55 insertions(+), 1 deletion(-) >>>>> >>>>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c >>>>> index 9803e3d..acd22d7 100644 >>>>> --- a/drivers/pci/pcie/aer/aerdrv_core.c >>>>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c >>>>> @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); >>>>> >>>>> int pci_enable_pcie_error_reporting(struct pci_dev *dev) >>>>> { >>>>> + u8 header_type; >>>>> + int pos; >>>>> + >>>>> if (pcie_aer_get_firmware_first(dev)) >>>>> return -EIO; >>>>> >>>>> - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) >>>>> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); >>>>> + if (!pos) >>>>> return -EIO; >>>>> >>>>> + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); >>>>> + >>>>> + /* needs to be a bridge/switch */ >>>>> + if (header_type == PCI_HEADER_TYPE_BRIDGE) { >>>>> + u32 status; >>>>> + u16 control; >>>>> + >>>>> + /* >>>>> + * A switch will not forward ERR_ messages coming from an >>>>> + * endpoint if SERR# forwarding is not enabled. >>>>> + * AER driver is checking the errors at the root only. >>>>> + */ >>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); >>>>> + control |= PCI_BRIDGE_CTL_SERR; >>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); >>>> >>>> Does this work for hot-added devices? I don't see a path that calls >>>> pci_enable_pcie_error_reporting() for hot-added devices, so I don't >>>> know how the PCI_EXP_AER_FLAGS bits would get set for them. >> >> Yes for me. We remove the root port along with the endpoint during >> hotplug. On the next insertion, AER driver get reprobed for the root port. > > The case I'm wondering about is when we hot-add an endpoint (not a > root port). In that case, I think the AER driver is not re-probed > because aerdriver.port_type == PCI_EXP_TYPE_ROOT_PORT, so > pcie_port_bus_match() will only match the AER driver with a Root Port. > > On your system (which I assume doesn't have _HPP or _HPX), if you > hot-add a bridge (not a Root Port) or an endpoint, I don't know what > would enable SERR# forwarding for a bridge or AER reporting for an > endpoint. > >>>> Let's do this in a separate patch. The SERR# thing is related to >>>> propagating error messages upstream. PCI_ERR_COR_ADV_NFAT is a bit >>>> different. >>>> >>>> I don't understand all the implications of Role-Based Error Reporting. >>>> Can you include a pointer to the Linux code that comprehends it? >>>> It looks like the section 6.2.7 implementation note is relevant, but I >>>> don't understand it yet. >> >> My understanding of the spec is that you can't have AER enabled when >> PCI_ERR_COR_ADV_NFAT is 1. > > Can you point out the reasoning here in a little more detail? This > may well be true, but it wasn't that obvious to me. > I looked at the code again and also tried to refresh my memory. The role based error reporting is in the device capability register. It is a read-only register and declares which version of the error reporting SW (1.0a or newer generation) the hardware supports. This code, on the other hand; is just acknowledging an existing error by clearing the advisory nonfatal error to W1C type register. I don't think this part is necessary. Setting the SERR# forwarding must have made the trick. This part was just an additional clearing of the errors. I'll retest without this bit. > Bjorn > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
On 12/1/2015 11:43 PM, Sinan Kaya wrote: > Setting the SERR# forwarding must have made the trick. This part was > just an additional clearing of the errors. > Nope, I was just enabling non-advisory fatal error from the mask register. Not clearing it. > I'll retest without this bit. Here we go. /#lspci 00:00.0 Class 0604: 17cb:0400 01:00.0 Class 0604: 10b5:8732 02:08.0 Class 0604: 10b5:8732 03:00.0 Class 0604: 10b5:8732 04:00.0 Class 0604: 10b5:8732 05:00.0 Class 0604: 10b5:8749 05:00.1 Class 0880: 10b5:87d0 05:00.2 Class 0880: 10b5:87d0 05:00.3 Class 0880: 10b5:87d0 05:00.4 Class 0880: 10b5:87d0 06:08.0 Class 0604: 10b5:8749 06:09.0 Class 0604: 10b5:8749 06:10.0 Class 0604: 10b5:8749 06:11.0 Class 0604: 10b5:8749 06:12.0 Class 0604: 10b5:8749 07:00.0 Class ff00: 1172:e001 This is after removing the PCI_ERR_COR_ADV_NFAT setting which looks much better to me. I'll post a new patch without PCI_ERR_COR_ADV_NFAT. /#[24.358445]pcieport_0006:00:00.0:_AER:_Multiple_Corrected_error_received:_id=0640 [ 24.358559] pcieport 0006:06:08.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=06 [ 24.358571] pcieport 0006:06:08.0: device [10b5:8749] error status/mask=00002081/0000e000 [ 24.358583] pcieport 0006:06:08.0: [ 0] Receiver Error (First) [ 24.358593] pcieport 0006:06:08.0: [ 7] Bad DLLP [ 24.358616] pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 [ 24.358708] pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 [ 24.358800] pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 [ 24.358892] pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 Below is the test result with the original code. <remove card> pcieport_0006:00:00.0:_AER:_Multiple_Corrected_error_received:_id=0640 pcieport 0006:01:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0100(Receiver ID) pcieport 0006:01:00.0: device [10b5:8732] error status/mask=00002000/0000c000 pcieport 0006:01:00.0: [13] Advisory Non-Fatal pcieport 0006:02:08.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0240(Receiver ID) pcieport 0006:02:08.0: device [10b5:8732] error status/mask=00002000/0000c000 pcieport 0006:02:08.0: [13] Advisory Non-Fatal pcieport 0006:03:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0300(Receiver ID) pcieport 0006:03:00.0: device [10b5:8732] error status/mask=00002000/0000c000 pcieport 0006:03:00.0: [13] Advisory Non-Fatal pcieport 0006:04:00.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0400(Receiver ID) pcieport 0006:04:00.0: device [10b5:8732] error status/mask=00002000/0000c000 pcieport 0006:04:00.0: [13] Advisory Non-Fatal pcieport 0006:06:08.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0640(Receiver ID) pcieport 0006:06:08.0: device [10b5:8749] error status/mask=00002001/0000c000 pcieport 0006:06:08.0: [ 0] Receiver Error pcieport 0006:06:08.0: [13] Advisory Non-Fatal pcieport 0006:06:08.0: Error of this Agent(0640) is reported first pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 pcieport 0006:06:09.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0648(Receiver ID) pcieport 0006:06:09.0: device [10b5:8749] error status/mask=00002000/00008000 pcieport 0006:06:09.0: [13] Advisory Non-Fatal pcieport 0006:06:10.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0680(Receiver ID) pcieport 0006:06:10.0: device [10b5:8749] error status/mask=00002000/0000c000 pcieport 0006:06:10.0: [13] Advisory Non-Fatal pcieport 0006:06:11.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0688(Receiver ID) pcieport 0006:06:11.0: device [10b5:8749] error status/mask=00002000/00008000 pcieport 0006:06:11.0: [13] Advisory Non-Fatal pcieport 0006:06:12.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0690(Receiver ID) pcieport 0006:06:12.0: device [10b5:8749] error status/mask=00002000/00008000 pcieport 0006:06:12.0: [13] Advisory Non-Fatal pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 pcieport 0006:00:00.0: AER: Multiple Corrected error received: id=0640 / #
On Tue, Dec 01, 2015 at 11:43:21PM -0500, Sinan Kaya wrote: > On 12/1/2015 6:07 PM, Bjorn Helgaas wrote: > > On Tue, Dec 01, 2015 at 03:08:48PM -0500, Sinan Kaya wrote: > >> For instance, what happens after hotplug insertion. Will anybody set > >> these bits? We need some kernel support for some PCIe features to > >> reconfigure the hardware. > > > > ACPI systems that support hotplug may supply _HPP or _HPX methods. If > > _HPP or _HPX indicates that SERR# forwarding should be enabled, Linux > > does enable it for hot-added devices (and I think we now do it for all > > devices at boot, too). That would explain how this could work on ACPI > > systems today. > > Are we sure about this? The name of the field in HPP is "Enable SERR". > Will this also enable SERR# forwarding? I *think* so. In program_hpp_type0(), if hpp->enable_serr is set, we turn on PCI_COMMAND_SERR. For bridges, we also turn on PCI_BRIDGE_CTL_SERR, which enables SERR# forwarding. Of course, I don't have a machine I'm testing this on; I'm just reading the code. > I do have _HPP in ACPI with SERR enabled but I do not have HPX. I rely > on the hardware defaults for which errors to be enabled and masked etc. > So, I don't need HPX. Huh. If you have _HPP with "Enable SERR" set, I wonder why the pci_configure_device() path isn't turning on PCI_COMMAND_SERR and PCI_BRIDGE_CTL_SERR for you. Can you instrument that path and see what's going on? I reworked that recently, and my *intent* was that we apply _HPP & _HPX settings to every device we enumerate, not just hot-added ones. If that isn't happening, I'd like to know about it. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 9803e3d..acd22d7 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -37,21 +37,75 @@ module_param(nosourceid, bool, 0); int pci_enable_pcie_error_reporting(struct pci_dev *dev) { + u8 header_type; + int pos; + if (pcie_aer_get_firmware_first(dev)) return -EIO; - if (!pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR)) + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); + if (!pos) return -EIO; + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); + + /* needs to be a bridge/switch */ + if (header_type == PCI_HEADER_TYPE_BRIDGE) { + u32 status; + u16 control; + + /* + * A switch will not forward ERR_ messages coming from an + * endpoint if SERR# forwarding is not enabled. + * AER driver is checking the errors at the root only. + */ + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); + control |= PCI_BRIDGE_CTL_SERR; + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); + + /* + * Need to inform hardware that we support + * Role-Based Error Reporting. + */ + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); + status &= ~PCI_ERR_COR_ADV_NFAT; + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); + } + return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); } EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting); int pci_disable_pcie_error_reporting(struct pci_dev *dev) { + int pos; + u8 header_type; + if (pcie_aer_get_firmware_first(dev)) return -EIO; + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR); + if (!pos) + return -EIO; + + pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type); + + /* needs to be a bridge/switch */ + if (header_type == PCI_HEADER_TYPE_BRIDGE) { + u32 status; + u16 control; + + /* clear serr forwarding */ + pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &control); + control &= ~PCI_BRIDGE_CTL_SERR; + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, control); + + /* set compatibility mode */ + pci_read_config_dword(dev, pos + PCI_ERR_COR_MASK, &status); + status |= PCI_ERR_COR_ADV_NFAT; + pci_write_config_dword(dev, pos + PCI_ERR_COR_MASK, status); + } + return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS); }
A PCIe card behind a PCIe switch is unable to report its errors when SERR# forwarding is not enabled on the PCIe switch's secondary interface. This is required by the PCIe spec. This patch enables SERR# forwarding and also cleans out compatibility mode so that AER reporting is enabled. Tested with PEX8749-CA RDK. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> --- drivers/pci/pcie/aer/aerdrv_core.c | 56 +++++++++++++++++++++++++++++++++++++- 1 file changed, 55 insertions(+), 1 deletion(-)