Message ID | 20201130211145.3012-6-james.quinlan@broadcom.com (mailing list archive) |
---|---|
State | Superseded, archived |
Delegated to: | Lorenzo Pieralisi |
Series | brcmstb: add EP regulators and panic handler |
On 11/30/2020 1:11 PM, Jim Quinlan wrote:
> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> by default Broadcom's STB PCIe controller effects an abort. This simple
> handler determines if the PCIe controller was the cause of the abort and if
> so, prints out diagnostic info.
>
> Example output:
>   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
>   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
>
> Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

On Mon, Nov 30, 2020 at 04:11:42PM -0500, Jim Quinlan wrote:
> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> by default Broadcom's STB PCIe controller effects an abort. This simple
> handler determines if the PCIe controller was the cause of the abort and if
> so, prints out diagnostic info.

What happens during enumeration? pci_bus_generic_read_dev_vendor_id()
assumes a read of Vendor ID returns 0xffffffff if the device doesn't
exist.

I assume this case doesn't cause the abort you're referring to here,
or nothing would work. I think this enumeration case results in PCIe
Unsupported Request errors (PCIe r5.0, sec 2.3.2 implementation note).

Bjorn

On Tue, Dec 1, 2020 at 1:05 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Mon, Nov 30, 2020 at 04:11:42PM -0500, Jim Quinlan wrote:
> > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > by default Broadcom's STB PCIe controller effects an abort. This simple
> > handler determines if the PCIe controller was the cause of the abort and if
> > so, prints out diagnostic info.
>
> What happens during enumeration? pci_bus_generic_read_dev_vendor_id()
> assumes a read of Vendor ID returns 0xffffffff if the device doesn't
> exist.
>
> I assume this case doesn't cause the abort you're referring to here,
> or nothing would work. I think this enumeration case results in PCIe
> Unsupported Request errors (PCIe r5.0, sec 2.3.2 implementation note).

Hi Bjorn,

Yes, our controller makes a special case to allow for config-space
accesses to the dev_id and vendor_id registers, even if the device is
missing. That being said, it will abort on any access if the link is
down.

However, the 7216-type SOCs bring PCIe error-reporting HW but also have
a mode where 0xffffffff is returned on improper accesses, just like
many other controllers. We are debating whether we should turn this on
by default.

Regards,
Jim Quinlan
Broadcom STB

>
> Bjorn

On Mon, Nov 30, 2020 at 04:11:42PM -0500, Jim Quinlan wrote:
> Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> by default Broadcom's STB PCIe controller effects an abort. This simple
> handler determines if the PCIe controller was the cause of the abort and if
> so, prints out diagnostic info.
>
> Example output:
>   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
>   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0

What does this mean for all the other PCI core code that expects
0xffffffff data returns? Does it work? Does it break differently on
STB than on other platforms?

> +/*
> + * Dump out pcie errors on die or panic.

s/pcie/PCIe/
This could be a single-line comment.

> + */

On Wed, Jan 6, 2021 at 2:42 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
>
> ---------- Forwarded message ---------
> From: Bjorn Helgaas <helgaas@kernel.org>
> Date: Wed, Jan 6, 2021 at 2:19 PM
> Subject: Re: [PATCH v2 5/6] PCI: brcmstb: Add panic/die handler to RC driver
> To: Jim Quinlan <james.quinlan@broadcom.com>
> Cc: <linux-pci@vger.kernel.org>, Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de>, <broonie@kernel.org>,
> <bcm-kernel-feedback-list@broadcom.com>, Lorenzo Pieralisi
> <lorenzo.pieralisi@arm.com>, Rob Herring <robh@kernel.org>, Bjorn
> Helgaas <bhelgaas@google.com>, Florian Fainelli
> <f.fainelli@gmail.com>, moderated list:BROADCOM BCM2711/BCM2835 ARM
> ARCHITECTURE <linux-rpi-kernel@lists.infradead.org>, moderated
> list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE
> <linux-arm-kernel@lists.infradead.org>, open list
> <linux-kernel@vger.kernel.org>
>
>
> On Mon, Nov 30, 2020 at 04:11:42PM -0500, Jim Quinlan wrote:
> > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > by default Broadcom's STB PCIe controller effects an abort. This simple
> > handler determines if the PCIe controller was the cause of the abort and if
> > so, prints out diagnostic info.
> >
> > Example output:
> >   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
> >   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
>
> What does this mean for all the other PCI core code that expects
> 0xffffffff data returns? Does it work? Does it break differently on
> STB than on other platforms?

Hi Bjorn,

Our PCIe HW causes a CPU abort when this happens. Occasionally a
customer will have a fault handler try to fix up the abort and
continue on, but we recommend solving the root problem. This commit
just gives us a chance to glean info about the problem. Our newer
SOCs have a mode that doesn't abort and instead returns 0xffffffff.

BTW, can you point me to example files where "PCI core code that
expects 0xffffffff data returns" [on bad accesses]?

Regards,
Jim Quinlan
Broadcom STB

>
> > +/*
> > + * Dump out pcie errors on die or panic.
>
> s/pcie/PCIe/
> This could be a single-line comment.
>
> > + */
>

On Wed, Jan 06, 2021 at 02:57:19PM -0500, Jim Quinlan wrote:
> On Wed, Jan 6, 2021 at 2:42 PM Jim Quinlan <james.quinlan@broadcom.com> wrote:
> >
> > ---------- Forwarded message ---------
> > From: Bjorn Helgaas <helgaas@kernel.org>
> > Date: Wed, Jan 6, 2021 at 2:19 PM
> > Subject: Re: [PATCH v2 5/6] PCI: brcmstb: Add panic/die handler to RC driver
> > To: Jim Quinlan <james.quinlan@broadcom.com>
> > Cc: <linux-pci@vger.kernel.org>, Nicolas Saenz Julienne
> > <nsaenzjulienne@suse.de>, <broonie@kernel.org>,
> > <bcm-kernel-feedback-list@broadcom.com>, Lorenzo Pieralisi
> > <lorenzo.pieralisi@arm.com>, Rob Herring <robh@kernel.org>, Bjorn
> > Helgaas <bhelgaas@google.com>, Florian Fainelli
> > <f.fainelli@gmail.com>, moderated list:BROADCOM BCM2711/BCM2835 ARM
> > ARCHITECTURE <linux-rpi-kernel@lists.infradead.org>, moderated
> > list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE
> > <linux-arm-kernel@lists.infradead.org>, open list
> > <linux-kernel@vger.kernel.org>
> >
> >
> > On Mon, Nov 30, 2020 at 04:11:42PM -0500, Jim Quinlan wrote:
> > > Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
> > > by default Broadcom's STB PCIe controller effects an abort. This simple
> > > handler determines if the PCIe controller was the cause of the abort and if
> > > so, prints out diagnostic info.
> > >
> > > Example output:
> > >   brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
> > >   brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0
> >
> > What does this mean for all the other PCI core code that expects
> > 0xffffffff data returns? Does it work? Does it break differently on
> > STB than on other platforms?
> Hi Bjorn,
>
> Our PCIe HW causes a CPU abort when this happens. Occasionally a
> customer will have a fault handler try to fix up the abort and
> continue on, but we recommend solving the root problem. This commit
> just gives us a chance to glean info about the problem. Our newer
> SOCs have a mode that doesn't abort and instead returns 0xffffffff.
>
> BTW, can you point me to example files where "PCI core code that
> expects 0xffffffff data returns" [on bad accesses]?

The most important case is during enumeration. A config read to a
device that doesn't exist normally terminates as an Unsupported
Request, and pci_bus_generic_read_dev_vendor_id() depends on reading
0xffffffff in that case. I assume this particular case does work that
way for brcm-pcie, because I assume enumeration does work.

pci_cfg_space_size_ext() is similar. I assume this also works for
brcm-pcie for the same reason.

pci_raw_set_power_state() looks for ~0, which it may see if it does a
config read to a device in D3cold. pci_dev_wait(), dpc_irq(),
pcie_pme_work_fn(), pcie_pme_irq() are all similar.

Yes, this is ugly and we should check for these more consistently.

The above are all for config reads. The PCI core doesn't do MMIO
accesses except for a few cases like MSI-X. But drivers do, and if
they check for PCIe errors on MMIO reads, they do it by looking for
0xffffffff, e.g., pci_mmio_enabled() (in hfi1),
qib_pci_mmio_enabled(), bnx2x_get_hwinfo(), etc.

Bjorn

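The all-ones convention Bjorn describes can be sketched in plain userspace C. This is only an illustration under stated assumptions: `read_is_all_ones()` and `device_seems_present()` are hypothetical names, not kernel functions, and the real checks in the PCI core and in drivers may handle additional cases.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a 32-bit config or MMIO read returned
 * all ones, the value most host bridges synthesize for a failed read.
 */
static inline bool read_is_all_ones(uint32_t val)
{
	return val == UINT32_MAX;
}

/*
 * Hypothetical helper: treat the first config dword (Vendor ID in the
 * low 16 bits, Device ID in the high 16 bits) as "device present"
 * unless the read came back as all ones.
 */
static inline bool device_seems_present(uint32_t vendor_dword)
{
	return !read_is_all_ones(vendor_dword);
}
```

A caller that reads a register through a controller which aborts instead of returning all ones never reaches a check like this, which is the concern raised in the thread.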
diff --git a/drivers/pci/controller/pcie-brcmstb.c b/drivers/pci/controller/pcie-brcmstb.c
index 989e4231d136..3983d6c80769 100644
--- a/drivers/pci/controller/pcie-brcmstb.c
+++ b/drivers/pci/controller/pcie-brcmstb.c
@@ -12,11 +12,13 @@
 #include <linux/ioport.h>
 #include <linux/irqchip/chained_irq.h>
 #include <linux/irqdomain.h>
+#include <linux/kdebug.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/log2.h>
 #include <linux/module.h>
 #include <linux/msi.h>
+#include <linux/notifier.h>
 #include <linux/of_address.h>
 #include <linux/of_irq.h>
 #include <linux/of_pci.h>
@@ -187,6 +189,39 @@
 #define PCIE_DVT_PMU_PCIE_PHY_CTRL_DAST_PWRDN_MASK	0x1
 #define PCIE_DVT_PMU_PCIE_PHY_CTRL_DAST_PWRDN_SHIFT	0x0
 
+/* Error report registers */
+#define PCIE_OUTB_ERR_TREAT				0x6000
+#define PCIE_OUTB_ERR_TREAT_CONFIG_MASK			0x1
+#define PCIE_OUTB_ERR_TREAT_MEM_MASK			0x2
+#define PCIE_OUTB_ERR_VALID				0x6004
+#define PCIE_OUTB_ERR_CLEAR				0x6008
+#define PCIE_OUTB_ERR_ACC_INFO				0x600c
+#define PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK		0x01
+#define PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK		0x02
+#define PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK		0x04
+#define PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK		0x10
+#define PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK		0xff00
+#define PCIE_OUTB_ERR_ACC_ADDR				0x6010
+#define PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK			0xff00000
+#define PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK			0xf8000
+#define PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK		0x7000
+#define PCIE_OUTB_ERR_ACC_ADDR_REG_MASK			0xfff
+#define PCIE_OUTB_ERR_CFG_CAUSE				0x6014
+#define PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK		0x40
+#define PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK		0x20
+#define PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK		0x10
+#define PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK	0x4
+#define PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK	0x2
+#define PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK		0x1
+#define PCIE_OUTB_ERR_MEM_ADDR_LO			0x6018
+#define PCIE_OUTB_ERR_MEM_ADDR_HI			0x601c
+#define PCIE_OUTB_ERR_MEM_CAUSE				0x6020
+#define PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK		0x40
+#define PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK		0x20
+#define PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK		0x10
+#define PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK	0x2
+#define PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK		0x1
+
 /* Forward declarations */
 struct brcm_pcie;
 static inline void brcm_pcie_bridge_sw_init_set_7278(struct brcm_pcie *pcie, u32 val);
@@ -221,6 +256,7 @@ struct pcie_cfg_data {
 	const enum pcie_type type;
 	void (*perst_set)(struct brcm_pcie *pcie, u32 val);
 	void (*bridge_sw_init_set)(struct brcm_pcie *pcie, u32 val);
+	const bool has_err_report;
 };
 
 static const int pcie_offsets[] = {
@@ -261,6 +297,7 @@ static const struct pcie_cfg_data bcm7216_cfg = {
 	.type = BCM7278,
 	.perst_set = brcm_pcie_perst_set_7278,
 	.bridge_sw_init_set = brcm_pcie_bridge_sw_init_set_7278,
+	.has_err_report = true,
 };
 
 struct brcm_msi {
@@ -302,8 +339,89 @@ struct brcm_pcie {
 	void (*bridge_sw_init_set)(struct brcm_pcie *pcie, u32 val);
 	struct regulator_bulk_data supplies[ARRAY_SIZE(ep_regulator_names)];
 	bool ep_wakeup_capable;
+	bool has_err_report;
+	struct notifier_block die_notifier;
 };
 
+/*
+ * Dump out pcie errors on die or panic.
+ */
+static int dump_pcie_error(struct notifier_block *self, unsigned long v, void *p)
+{
+	const struct brcm_pcie *pcie = container_of(self, struct brcm_pcie, die_notifier);
+	void __iomem *base = pcie->base;
+	int i, is_cfg_err, is_mem_err, lanes;
+	char *width_str, *direction_str, lanes_str[9];
+	u32 info;
+
+	if (readl(base + PCIE_OUTB_ERR_VALID) == 0)
+		return NOTIFY_DONE;
+	info = readl(base + PCIE_OUTB_ERR_ACC_INFO);
+
+
+	is_cfg_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_CFG_ERR_MASK);
+	is_mem_err = !!(info & PCIE_OUTB_ERR_ACC_INFO_MEM_ERR_MASK);
+	width_str = (info & PCIE_OUTB_ERR_ACC_INFO_TYPE_64_MASK) ? "64bit" : "32bit";
+	direction_str = (info & PCIE_OUTB_ERR_ACC_INFO_DIR_WRITE_MASK) ? "Write" : "Read";
+	lanes = FIELD_GET(PCIE_OUTB_ERR_ACC_INFO_BYTE_LANES_MASK, info);
+	for (i = 0, lanes_str[8] = 0; i < 8; i++)
+		lanes_str[i] = (lanes & (1 << i)) ? '1' : '0';
+
+	if (is_cfg_err) {
+		u32 cfg_addr = readl(base + PCIE_OUTB_ERR_ACC_ADDR);
+		u32 cause = readl(base + PCIE_OUTB_ERR_CFG_CAUSE);
+		int bus = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_BUS_MASK, cfg_addr);
+		int dev = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_DEV_MASK, cfg_addr);
+		int func = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_FUNC_MASK, cfg_addr);
+		int reg = FIELD_GET(PCIE_OUTB_ERR_ACC_ADDR_REG_MASK, cfg_addr);
+
+		dev_err(pcie->dev, "Error: CFG Acc, %s, %s, Bus=%d, Dev=%d, Fun=%d, Reg=0x%x, lanes=%s\n",
+			width_str, direction_str, bus, dev, func, reg, lanes_str);
+		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccTO=%d AccDsbld=%d Acc64bit=%d\n",
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_TIMEOUT_MASK),
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_ABORT_MASK),
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_UNSUPP_REQ_MASK),
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_TIMEOUT_MASK),
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_DISABLED_MASK),
+			!!(cause & PCIE_OUTB_ERR_CFG_CAUSE_ACC_64BIT__MASK));
+	}
+
+	if (is_mem_err) {
+		u32 cause = readl(base + PCIE_OUTB_ERR_MEM_CAUSE);
+		u32 lo = readl(base + PCIE_OUTB_ERR_MEM_ADDR_LO);
+		u32 hi = readl(base + PCIE_OUTB_ERR_MEM_ADDR_HI);
+		u64 addr = ((u64)hi << 32) | (u64)lo;
+
+		dev_err(pcie->dev, "Error: Mem Acc, %s, %s, @0x%llx, lanes=%s\n",
+			width_str, direction_str, addr, lanes_str);
+		dev_err(pcie->dev, " Type: TO=%d Abt=%d UnsupReq=%d AccDsble=%d BadAddr=%d\n",
+			!!(cause & PCIE_OUTB_ERR_MEM_CAUSE_TIMEOUT_MASK),
+			!!(cause & PCIE_OUTB_ERR_MEM_CAUSE_ABORT_MASK),
+			!!(cause & PCIE_OUTB_ERR_MEM_CAUSE_UNSUPP_REQ_MASK),
+			!!(cause & PCIE_OUTB_ERR_MEM_CAUSE_ACC_DISABLED_MASK),
+			!!(cause & PCIE_OUTB_ERR_MEM_CAUSE_BAD_ADDR_MASK));
+	}
+
+	/* Clear the error */
+	writel(1, base + PCIE_OUTB_ERR_CLEAR);
+
+	return NOTIFY_DONE;
+}
+
+static void brcm_register_die_notifiers(struct brcm_pcie *pcie)
+{
+	pcie->die_notifier.notifier_call = dump_pcie_error;
+	register_die_notifier(&pcie->die_notifier);
+	atomic_notifier_chain_register(&panic_notifier_list, &pcie->die_notifier);
+}
+
+static void brcm_unregister_die_notifiers(struct brcm_pcie *pcie)
+{
+	unregister_die_notifier(&pcie->die_notifier);
+	atomic_notifier_chain_unregister(&panic_notifier_list, &pcie->die_notifier);
+	pcie->die_notifier.notifier_call = NULL;
+}
+
 static int pci_dev_may_wakeup(struct pci_dev *dev, void *data)
 {
 	bool *ret = data;
@@ -1273,6 +1391,8 @@ static int brcm_pcie_remove(struct platform_device *pdev)
 	struct pci_host_bridge *bridge = pci_host_bridge_from_priv(pcie);
 
 	pci_stop_root_bus(bridge->bus);
+	if (pcie->has_err_report)
+		brcm_unregister_die_notifiers(pcie);
 	pci_remove_root_bus(bridge->bus);
 	__brcm_pcie_remove(pcie);
 
@@ -1311,6 +1431,7 @@ static int brcm_pcie_probe(struct platform_device *pdev)
 	pcie->np = np;
 	pcie->reg_offsets = data->offsets;
 	pcie->type = data->type;
+	pcie->has_err_report = data->has_err_report;
 	pcie->perst_set = data->perst_set;
 	pcie->bridge_sw_init_set = data->bridge_sw_init_set;
 
@@ -1380,6 +1501,9 @@ static int brcm_pcie_probe(struct platform_device *pdev)
 
 	platform_set_drvdata(pdev, pcie);
 
+	if (pcie->has_err_report)
+		brcm_register_die_notifiers(pcie);
+
 	return pci_host_probe(bridge);
 fail:
 	__brcm_pcie_remove(pcie);

Whereas most PCIe HW returns 0xffffffff on illegal accesses and the like,
by default Broadcom's STB PCIe controller effects an abort. This simple
handler determines if the PCIe controller was the cause of the abort and if
so, prints out diagnostic info.

Example output:
  brcm-pcie 8b20000.pcie: Error: Mem Acc: 32bit, Read, @0x38000000
  brcm-pcie 8b20000.pcie:  Type: TO=0 Abt=0 UnspReq=1 AccDsble=0 BadAddr=0

Signed-off-by: Jim Quinlan <james.quinlan@broadcom.com>
---
 drivers/pci/controller/pcie-brcmstb.c | 124 ++++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
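
For readers who want to follow the register decoding without kernel headers, here is a minimal userspace sketch of the field extraction that the handler performs. The mask values are copied from the patch; `field_get()`, `struct cfg_err_addr`, `decode_cfg_addr()` and `lanes_to_str()` are hypothetical stand-ins written for this illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values copied from the patch; the shortened names are illustrative. */
#define ACC_INFO_BYTE_LANES_MASK	0xff00u
#define ACC_ADDR_BUS_MASK		0xff00000u
#define ACC_ADDR_DEV_MASK		0xf8000u
#define ACC_ADDR_FUNC_MASK		0x7000u
#define ACC_ADDR_REG_MASK		0xfffu

/* Hypothetical userspace stand-in for the kernel's FIELD_GET(). */
static inline uint32_t field_get(uint32_t mask, uint32_t val)
{
	assert(mask != 0);
	/* Dividing by the lowest set bit of the mask shifts the field down. */
	return (val & mask) / (mask & (~mask + 1u));
}

/* Hypothetical container for a decoded config-access error address. */
struct cfg_err_addr {
	uint32_t bus, dev, func, reg;
};

static inline struct cfg_err_addr decode_cfg_addr(uint32_t cfg_addr)
{
	struct cfg_err_addr a = {
		.bus  = field_get(ACC_ADDR_BUS_MASK, cfg_addr),
		.dev  = field_get(ACC_ADDR_DEV_MASK, cfg_addr),
		.func = field_get(ACC_ADDR_FUNC_MASK, cfg_addr),
		.reg  = field_get(ACC_ADDR_REG_MASK, cfg_addr),
	};
	return a;
}

/* Render the byte-lane bits as '0'/'1' characters, lane 0 first. */
static inline void lanes_to_str(uint32_t info, char out[9])
{
	uint32_t lanes = field_get(ACC_INFO_BYTE_LANES_MASK, info);
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (lanes & (1u << i)) ? '1' : '0';
	out[8] = '\0';
}
```

Under these assumptions, a raw address value of 0x113010 would decode to bus 1, device 2, function 3, register 0x10, in the same order the handler prints them.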