Message ID | 1359399397-29729-20-git-send-email-thomas.petazzoni@free-electrons.com
---|---
State | New, archived
Delegated to: | Bjorn Helgaas
On 01/28/2013 11:56 AM, Thomas Petazzoni wrote: > This driver implements the support for the PCIe interfaces on the > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to > cover earlier families of Marvell SoCs, such as Dove, Orion and > Kirkwood. > > The driver implements the hw_pci operations needed by the core ARM PCI > code to setup PCI devices and get their corresponding IRQs, and the > pci_ops operations that are used by the PCI core to read/write the > configuration space of PCI devices. > > Since the PCIe interfaces of Marvell SoCs are completely separate and > not linked together in a bus, this driver sets up an emulated PCI host > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe > interface. > > In addition, this driver enumerates the different PCIe slots, and for > those having a device plugged in, it sets up the necessary address > decoding windows, using the new armada_370_xp_alloc_pcie_window() > function from mach-mvebu/addr-map.c. > diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt > +Mandatory properties: > +- compatible: must be "marvell,armada-370-xp-pcie" > +- status: either "disabled" or "okay" status is a standard DT property; I certainly wouldn't expect its presence to be mandatory (there's a defined default), nor would I expect each device's binding to redefine this property. > +In addition, the Device Tree node must have sub-nodes describing each > +PCIe interface, having the following mandatory properties: > +- marvell,pcie-port: the physical PCIe port number Should the standardized cell-index property be used here instead? Or, perhaps that property is deprecated/discouraged... > +- status: either "disabled" or "okay" Similar comment as above. > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o > +ccflags-$(CONFIG_PCI_MVEBU) += \ > + -I$(srctree)/arch/arm/plat-orion/include \ > + -I$(srctree)/arch/arm/mach-mvebu/include That seems a little dangerous w.r.t. multi-platform zImage. Can the required headers be moved out to somewhere more public to avoid this? > diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c > +/* > + * Those are the product IDs used for the emulated PCI Host bridge and > + * emulated PCI-to-PCI bridges. They are temporary until we get > + * official IDs assigned. > + */ > +#define MARVELL_EMULATED_HOST_BRIDGE_ID 4141 > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242 I assume that means we can't merge this driver yet. The cover letter mentioned a desire to merge this for 3.9; there's not much time to get official IDs assigned, then. > +static int mvebu_pcie_init(void) > +{ > + return platform_driver_probe(&mvebu_pcie_driver, > + mvebu_pcie_probe); > +} > + > +subsys_initcall(mvebu_pcie_init); Why isn't that just platform_driver_register()? > +MODULE_LICENSE("GPL"); "GPL v2".
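For reference, the plain registration Stephen is asking about is normally spelled with the module_platform_driver() helper. The sketch below reuses the names from the patch but is not code from this thread; note that the patch marks mvebu_pcie_probe() as __init, which is only safe with platform_driver_probe(), so a .probe-based variant would also have to drop that annotation:

static struct platform_driver mvebu_pcie_driver = {
	/* .probe must not live in __init memory when the driver is
	 * registered this way, since binding can happen at any time.
	 */
	.probe = mvebu_pcie_probe,
	.driver = {
		.owner = THIS_MODULE,
		.name = "mvebu-pcie",
		.of_match_table = of_match_ptr(mvebu_pcie_of_match_table),
	},
};
module_platform_driver(mvebu_pcie_driver);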
On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote: > This driver implements the support for the PCIe interfaces on the > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to > cover earlier families of Marvell SoCs, such as Dove, Orion and > Kirkwood. > > The driver implements the hw_pci operations needed by the core ARM PCI > code to setup PCI devices and get their corresponding IRQs, and the > pci_ops operations that are used by the PCI core to read/write the > configuration space of PCI devices. > > Since the PCIe interfaces of Marvell SoCs are completely separate and > not linked together in a bus, this driver sets up an emulated PCI host > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe > interface. There's no Linux requirement that multiple PCIe interfaces appear to be in the same hierarchy. You can just use pci_scan_root_bus() separately on each interface. Each interface can be in its own domain if necessary. > In addition, this driver enumerates the different PCIe slots, and for > those having a device plugged in, it sets up the necessary address > decoding windows, using the new armada_370_xp_alloc_pcie_window() > function from mach-mvebu/addr-map.c. > > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> > --- > .../devicetree/bindings/pci/armada-370-xp-pcie.txt | 175 +++++++ > drivers/pci/host/Kconfig | 6 + > drivers/pci/host/Makefile | 4 + > drivers/pci/host/pci-mvebu.c | 500 ++++++++++++++++++++ > 4 files changed, 685 insertions(+) > create mode 100644 Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt > create mode 100644 drivers/pci/host/Makefile > create mode 100644 drivers/pci/host/pci-mvebu.c > > diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt > new file mode 100644 > index 0000000..9313e92 > --- /dev/null > +++ b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt > @@ -0,0 +1,175 @@ > +* Marvell Armada 370/XP PCIe interfaces > + > +Mandatory properties: > +- compatible: must be "marvell,armada-370-xp-pcie" > +- status: either "disabled" or "okay" > +- #address-cells, set to <3> > +- #size-cells, set to <2> > +- #interrupt-cells, set to <1> > +- bus-range: PCI bus numbers covered > +- ranges: standard PCI-style address ranges, describing the PCIe > + registers for each PCIe interface, and then ranges for the PCI > + memory and I/O regions. > +- interrupt-map-mask and interrupt-map are standard PCI Device Tree > + properties to describe the interrupts associated to each PCI > + interface. > + > +In addition, the Device Tree node must have sub-nodes describing each > +PCIe interface, having the following mandatory properties: > +- reg: the address and size of the PCIe registers (translated > + addresses according to the ranges property of the parent) > +- clocks: the clock associated to this PCIe interface > +- marvell,pcie-port: the physical PCIe port number > +- status: either "disabled" or "okay" > + > +and the following optional properties: > +- marvell,pcie-lane: the physical PCIe lane number, for ports having > + multiple lanes. If this property is not found, we assume that the > + value is 0. 
> + > +Example: > + > +pcie-controller { > + compatible = "marvell,armada-370-xp-pcie"; > + status = "disabled"; > + > + #address-cells = <3>; > + #size-cells = <2>; > + > + bus-range = <0x00 0xff>; > + > + ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000 /* port 0.0 registers */ > + 0x00004800 0 0xd0042000 0xd0042000 0 0x00002000 /* port 2.0 registers */ > + 0x00001000 0 0xd0044000 0xd0044000 0 0x00002000 /* port 0.1 registers */ > + 0x00001800 0 0xd0048000 0xd0048000 0 0x00002000 /* port 0.2 registers */ > + 0x00002000 0 0xd004C000 0xd004C000 0 0x00002000 /* port 0.3 registers */ > + 0x00002800 0 0xd0080000 0xd0080000 0 0x00002000 /* port 1.0 registers */ > + 0x00005000 0 0xd0082000 0xd0082000 0 0x00002000 /* port 3.0 registers */ > + 0x00003000 0 0xd0084000 0xd0084000 0 0x00002000 /* port 1.1 registers */ > + 0x00003800 0 0xd0088000 0xd0088000 0 0x00002000 /* port 1.2 registers */ > + 0x00004000 0 0xd008C000 0xd008C000 0 0x00002000 /* port 1.3 registers */ > + 0x81000000 0 0 0xc0000000 0 0x00100000 /* downstream I/O */ > + 0x82000000 0 0 0xc1000000 0 0x08000000>; /* non-prefetchable memory */ > + > + #interrupt-cells = <1>; > + interrupt-map-mask = <0xf800 0 0 1>; > + interrupt-map = <0x0800 0 0 1 &mpic 58 > + 0x1000 0 0 1 &mpic 59 > + 0x1800 0 0 1 &mpic 60 > + 0x2000 0 0 1 &mpic 61 > + 0x2800 0 0 1 &mpic 62 > + 0x3000 0 0 1 &mpic 63 > + 0x3800 0 0 1 &mpic 64 > + 0x4000 0 0 1 &mpic 65 > + 0x4800 0 0 1 &mpic 99 > + 0x5000 0 0 1 &mpic 103>; > + > + pcie@0,0 { > + device_type = "pciex"; > + reg = <0x0800 0 0xd0040000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <0>; > + marvell,pcie-lane = <0>; > + clocks = <&gateclk 5>; > + status = "disabled"; > + }; > + > + pcie@0,1 { > + device_type = "pciex"; > + reg = <0x1000 0 0xd0044000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <0>; > + marvell,pcie-lane = <1>; > + clocks = <&gateclk 6>; > + status = "disabled"; > + }; > + > + pcie@0,2 { > + device_type = "pciex"; > + reg = <0x1800 0 0xd0048000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <0>; > + marvell,pcie-lane = <2>; > + clocks = <&gateclk 7>; > + status = "disabled"; > + }; > + > + pcie@0,3 { > + device_type = "pciex"; > + reg = <0x2000 0 0xd004C000 0 0xC000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <0>; > + marvell,pcie-lane = <3>; > + clocks = <&gateclk 8>; > + status = "disabled"; > + }; > + > + pcie@1,0 { > + device_type = "pciex"; > + reg = <0x2800 0 0xd0080000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <1>; > + marvell,pcie-lane = <0>; > + clocks = <&gateclk 9>; > + status = "disabled"; > + }; > + > + pcie@1,1 { > + device_type = "pciex"; > + reg = <0x3000 0 0xd0084000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <1>; > + marvell,pcie-lane = <1>; > + clocks = <&gateclk 10>; > + status = "disabled"; > + }; > + > + pcie@1,2 { > + device_type = "pciex"; > + reg = <0x3800 0 0xd0088000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <1>; > + marvell,pcie-lane = <2>; > + clocks = <&gateclk 11>; > + status = "disabled"; > + }; > + > + pcie@1,3 { > + device_type = "pciex"; > + reg = <0x4000 0 0xd008C000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <1>; > + marvell,pcie-lane = <3>; > + clocks = <&gateclk 12>; > + status = "disabled"; > + }; > + pcie@2,0 { > + device_type = "pciex"; > 
+ reg = <0x4800 0 0xd0042000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <2>; > + marvell,pcie-lane = <0>; > + clocks = <&gateclk 26>; > + status = "disabled"; > + }; > + > + pcie@3,0 { > + device_type = "pciex"; > + reg = <0x5000 0 0xd0082000 0 0x2000>; > + #address-cells = <3>; > + #size-cells = <2>; > + marvell,pcie-port = <3>; > + marvell,pcie-lane = <0>; > + clocks = <&gateclk 27>; > + status = "disabled"; > + }; > +}; > diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig > index cc3a1af..03e15e7 100644 > --- a/drivers/pci/host/Kconfig > +++ b/drivers/pci/host/Kconfig > @@ -1,4 +1,10 @@ > menu "PCI host controller drivers" > depends on PCI > > +config PCI_MVEBU > + bool "Marvell EBU PCIe controller" > + depends on ARCH_MVEBU > + select PCI_SW_HOST_BRIDGE > + select PCI_SW_PCI_PCI_BRIDGE > + > endmenu > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile > new file mode 100644 > index 0000000..34d6057 > --- /dev/null > +++ b/drivers/pci/host/Makefile > @@ -0,0 +1,4 @@ > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o > +ccflags-$(CONFIG_PCI_MVEBU) += \ > + -I$(srctree)/arch/arm/plat-orion/include \ > + -I$(srctree)/arch/arm/mach-mvebu/include > diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c > new file mode 100644 > index 0000000..4db09e1 > --- /dev/null > +++ b/drivers/pci/host/pci-mvebu.c > @@ -0,0 +1,500 @@ > +/* > + * PCIe driver for Marvell Armada 370 and Armada XP SoCs > + * > + * This file is licensed under the terms of the GNU General Public > + * License version 2. This program is licensed "as is" without any > + * warranty of any kind, whether express or implied. > + */ > + > +#include <linux/kernel.h> > +#include <linux/pci.h> > +#include <linux/clk.h> > +#include <linux/module.h> > +#include <linux/slab.h> > +#include <linux/platform_device.h> > +#include <linux/of_address.h> > +#include <linux/of_pci.h> > +#include <linux/of_irq.h> > +#include <linux/of_platform.h> > +#include <plat/pcie.h> > +#include <mach/addr-map.h> > + > +/* > + * Those are the product IDs used for the emulated PCI Host bridge and > + * emulated PCI-to-PCI bridges. They are temporary until we get > + * official IDs assigned. > + */ > +#define MARVELL_EMULATED_HOST_BRIDGE_ID 4141 > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242 > + > +struct mvebu_pcie_port; > + > +/* Structure representing all PCIe interfaces */ > +struct mvebu_pcie { > + struct pci_sw_host_bridge bridge; > + struct platform_device *pdev; > + struct mvebu_pcie_port *ports; > + struct resource io; > + struct resource mem; > + struct resource busn; > + int nports; > +}; > + > +/* Structure representing one PCIe interface */ > +struct mvebu_pcie_port { > + void __iomem *base; > + spinlock_t conf_lock; > + int haslink; > + u32 port; > + u32 lane; > + int devfn; > + struct clk *clk; > + struct pci_sw_pci_bridge bridge; > + struct device_node *dn; > +}; > + > +static inline struct mvebu_pcie *sys_to_pcie(struct pci_sys_data *sys) > +{ > + return sys->private_data; > +} > + > +/* PCI configuration space write function */ > +static int mvebu_pcie_wr_conf(struct pci_bus *bus, u32 devfn, > + int where, int size, u32 val) > +{ > + struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata); > + > + if (bus->number != 0) { > + /* > + * Accessing a real PCIe interface, where the Linux > + * virtual bus number is equal to the hardware PCIe > + * interface number + 1 > + */ This is really weird. 
It doesn't seem like a good idea to me, but I don't understand the whole architecture. > + struct mvebu_pcie_port *port; > + unsigned long flags; > + int porti, ret; > + > + porti = bus->number - 1; > + if (porti >= pcie->nports) > + return PCIBIOS_DEVICE_NOT_FOUND; > + > + port = &pcie->ports[porti]; > + > + if (!port->haslink) > + return PCIBIOS_DEVICE_NOT_FOUND; > + > + if (PCI_SLOT(devfn) != 0) > + return PCIBIOS_DEVICE_NOT_FOUND; > + > + spin_lock_irqsave(&port->conf_lock, flags); > + ret = orion_pcie_wr_conf_bus(port->base, bus->number - 1, > + PCI_DEVFN(1, PCI_FUNC(devfn)), > + where, size, val); > + spin_unlock_irqrestore(&port->conf_lock, flags); > + > + return ret; > + } else { > + /* > + * Accessing the emulated PCIe devices. In the first > + * slot, the emulated host bridge, and in the next > + * slots, the PCI-to-PCI bridges that correspond to > + * each PCIe hardware interface > + */ > + if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0) > + return pci_sw_host_bridge_write(&pcie->bridge, where, > + size, val); > + else if (PCI_SLOT(devfn) >= 1 && > + PCI_SLOT(devfn) <= pcie->nports) { > + struct mvebu_pcie_port *port; > + int porti = PCI_SLOT(devfn) - 1; > + port = &pcie->ports[porti]; > + return pci_sw_pci_bridge_write(&port->bridge, where, > + size, val); > + } else { > + return PCIBIOS_DEVICE_NOT_FOUND; > + } > + } > + > + return PCIBIOS_SUCCESSFUL; > +} > + > +/* PCI configuration space read function */ > +static int mvebu_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where, > + int size, u32 *val) > +{ > + struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata); > + > + if (bus->number != 0) { > + /* > + * Accessing a real PCIe interface, where the Linux > + * virtual bus number is equal to the hardware PCIe > + * interface number + 1 > + */ > + struct mvebu_pcie_port *port; > + unsigned long flags; > + int porti, ret; > + > + porti = bus->number - 1; > + if (porti >= pcie->nports) { > + *val = 0xffffffff; > + return PCIBIOS_DEVICE_NOT_FOUND; > + } > + > + port = &pcie->ports[porti]; > + > + if (!port->haslink || PCI_SLOT(devfn) != 0) { > + *val = 0xffffffff; > + return PCIBIOS_DEVICE_NOT_FOUND; > + } > + > + spin_lock_irqsave(&port->conf_lock, flags); > + ret = orion_pcie_rd_conf_bus(port->base, bus->number - 1, > + PCI_DEVFN(1, PCI_FUNC(devfn)), > + where, size, val); > + spin_unlock_irqrestore(&port->conf_lock, flags); > + > + return ret; > + } else { > + /* > + * Accessing the emulated PCIe devices. 
In the first > + * slot, the emulated host bridge, and in the next > + * slots, the PCI-to-PCI bridges that correspond to > + * each PCIe hardware interface > + */ > + if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0) > + return pci_sw_host_bridge_read(&pcie->bridge, where, > + size, val); > + else if (PCI_SLOT(devfn) >= 1 && > + PCI_SLOT(devfn) <= pcie->nports) { > + struct mvebu_pcie_port *port; > + int porti = PCI_SLOT(devfn) - 1; > + port = &pcie->ports[porti]; > + return pci_sw_pci_bridge_read(&port->bridge, where, > + size, val); > + } else { > + *val = 0xffffffff; > + return PCIBIOS_DEVICE_NOT_FOUND; > + } > + } > +} > + > +static struct pci_ops mvebu_pcie_ops = { > + .read = mvebu_pcie_rd_conf, > + .write = mvebu_pcie_wr_conf, > +}; > + > +static int __init mvebu_pcie_setup(int nr, struct pci_sys_data *sys) > +{ > + struct mvebu_pcie *pcie = sys_to_pcie(sys); > + int i; > + > + pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset); > + pci_add_resource_offset(&sys->resources, &pcie->mem, sys->mem_offset); > + pci_add_resource(&sys->resources, &pcie->busn); > + > + pci_ioremap_io(nr * SZ_64K, pcie->io.start); > + > + for (i = 0; i < pcie->nports; i++) { > + struct mvebu_pcie_port *port = &pcie->ports[i]; > + orion_pcie_set_local_bus_nr(port->base, i); > + orion_pcie_setup(port->base); > + } > + > + return 1; > +} > + > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin) > +{ > + struct mvebu_pcie *pcie = sys_to_pcie(dev->bus->sysdata); > + struct mvebu_pcie_port *port; > + struct of_irq oirq; > + u32 laddr[3]; > + int ret; > + __be32 intspec; > + > + /* > + * Ignore requests related to the emulated host bridge or the > + * emulated pci-to-pci bridges > + */ > + if (!dev->bus->number) > + return -1; > + > + port = &pcie->ports[dev->bus->number - 1]; > + > + /* > + * Build an laddr array that describes the PCI device in a DT > + * way > + */ > + laddr[0] = cpu_to_be32(port->devfn << 8); > + laddr[1] = laddr[2] = 0; > + intspec = cpu_to_be32(pin); > + > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > + if (ret) { > + dev_err(&pcie->pdev->dev, > + "%s: of_irq_map_raw() failed, %d\n", > + __func__, ret); > + return ret; > + } > + > + return irq_create_of_mapping(oirq.controller, oirq.specifier, > + oirq.size); > +} > + > +/* > + * For a given PCIe interface (represented by a mvebu_pcie_port > + * structure), we read the PCI configuration space of the > + * corresponding PCI-to-PCI bridge in order to find out which range of > + * I/O addresses and memory addresses have been assigned to this PCIe > + * interface. Using these informations, we set up the appropriate > + * address decoding windows so that the physical address are actually > + * resolved to the right PCIe interface. > + */ Are you inferring the host bridge apertures by using the resources assigned to devices under the bridge, i.e., taking the union of all the BARs and PCI-to-PCI bridge apertures of devices on the root bus? If so, it would be much better to learn the host bridge apertures via some non-PCI mechanism that corresponds to the actual hardware configuration. The config of devices below the host bridge gives you hints, of course, but usually they don't consume all the available space. 
> +static int mvebu_pcie_window_config_port(struct mvebu_pcie *pcie, > + struct mvebu_pcie_port *port) > +{ > + unsigned long iobase = 0; > + int ret; > + > + if (port->bridge.iolimit >= port->bridge.iobase) { > + unsigned long iolimit = 0xFFF | > + ((port->bridge.iolimit & 0xF0) << 8) | > + (port->bridge.iolimitupper << 16); > + iobase = ((port->bridge.iobase & 0xF0) << 8) | > + (port->bridge.iobaseupper << 16); > + ret = armada_370_xp_alloc_pcie_window(port->port, port->lane, > + iobase, iolimit-iobase, > + IORESOURCE_IO); > + if (ret) { > + dev_err(&pcie->pdev->dev, > + "%s: could not alloc PCIe %d:%d window for I/O [0x%lx; 0x%lx]\n", > + __func__, port->port, port->lane, > + iobase, iolimit); > + goto out_io; > + } > + } > + > + if (port->bridge.memlimit >= port->bridge.membase) { > + unsigned long membase = > + ((port->bridge.membase & 0xFFF0) << 16); > + unsigned long memlimit = > + ((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF; > + ret = armada_370_xp_alloc_pcie_window(port->port, port->lane, > + membase, memlimit-membase, > + IORESOURCE_MEM); > + if (ret) { > + dev_err(&pcie->pdev->dev, > + "%s: could not alloc PCIe %d:%d window for MEM [0x%lx; 0x%lx]\n", > + __func__, port->port, port->lane, > + membase, memlimit); > + goto out_mem; > + } > + } > + > +out_mem: > + if (port->bridge.iolimit >= port->bridge.iobase) > + armada_370_xp_free_pcie_window(iobase); > +out_io: > + return ret; > +} > + > +/* > + * Set up the address decoding windows for all PCIe interfaces. > + */ > +static int mvebu_pcie_window_config(struct mvebu_pcie *pcie) > +{ > + int i, ret; > + > + for (i = 0; i < pcie->nports; i++) { > + struct mvebu_pcie_port *port = &pcie->ports[i]; > + if (!port->haslink) > + continue; > + > + ret = mvebu_pcie_window_config_port(pcie, port); > + if (ret) > + return ret; > + } > + > + return 0; > +} > + > +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev, > + const struct resource *res, > + resource_size_t start, > + resource_size_t size, > + resource_size_t align) > +{ > + if (!(res->flags & IORESOURCE_IO)) > + return start; > + > + /* > + * The I/O regions must be 64K aligned, because the > + * granularity of PCIe I/O address decoding windows is 64 K > + */ > + return round_up(start, SZ_64K); > +} > + > +static int mvebu_pcie_enable(struct mvebu_pcie *pcie) > +{ > + struct hw_pci hw; > + > + memset(&hw, 0, sizeof(hw)); > + > + hw.nr_controllers = 1; > + hw.private_data = (void **)&pcie; > + hw.setup = mvebu_pcie_setup; > + hw.map_irq = mvebu_pcie_map_irq; > + hw.align_resource = mvebu_pcie_align_resource; > + hw.ops = &mvebu_pcie_ops; > + > + pci_common_init(&hw); > + > + return mvebu_pcie_window_config(pcie); > +} > + > +static int __init mvebu_pcie_probe(struct platform_device *pdev) > +{ > + struct mvebu_pcie *pcie; > + struct device_node *np = pdev->dev.of_node; > + struct device_node *child; > + const __be32 *range = NULL; > + struct resource res; > + int i, ret; > + > + pcie = devm_kzalloc(&pdev->dev, sizeof(struct mvebu_pcie), > + GFP_KERNEL); > + if (!pcie) > + return -ENOMEM; > + > + pcie->pdev = pdev; > + > + pci_sw_host_bridge_init(&pcie->bridge); > + pcie->bridge.vendor = PCI_VENDOR_ID_MARVELL; > + pcie->bridge.device = MARVELL_EMULATED_HOST_BRIDGE_ID; > + > + /* Get the I/O and memory ranges from DT */ > + while ((range = of_pci_process_ranges(np, &res, range)) != NULL) { > + if (resource_type(&res) == IORESOURCE_IO) { > + memcpy(&pcie->io, &res, sizeof(res)); > + pcie->io.name = "I/O"; > + } > + if (resource_type(&res) == IORESOURCE_MEM) { > 
+ memcpy(&pcie->mem, &res, sizeof(res)); > + pcie->mem.name = "MEM"; > + } > + } > + > + /* Get the bus range */ > + ret = of_pci_parse_bus_range(np, &pcie->busn); > + if (ret) { > + dev_err(&pdev->dev, "failed to parse bus-range property: %d\n", > + ret); > + return ret; > + } > + > + for_each_child_of_node(pdev->dev.of_node, child) { > + if (!of_device_is_available(child)) > + continue; > + pcie->nports++; > + } > + > + pcie->ports = devm_kzalloc(&pdev->dev, pcie->nports * > + sizeof(struct mvebu_pcie_port), > + GFP_KERNEL); > + if (!pcie->ports) > + return -ENOMEM; > + > + i = 0; > + for_each_child_of_node(pdev->dev.of_node, child) { > + struct mvebu_pcie_port *port = &pcie->ports[i]; > + > + if (!of_device_is_available(child)) > + continue; > + > + if (of_property_read_u32(child, "marvell,pcie-port", > + &port->port)) { > + dev_warn(&pdev->dev, > + "ignoring PCIe DT node, missing pcie-port property\n"); > + continue; > + } > + > + if (of_property_read_u32(child, "marvell,pcie-lane", > + &port->lane)) > + port->lane = 0; > + > + port->devfn = of_pci_get_devfn(child); > + if (port->devfn < 0) > + continue; > + > + port->base = of_iomap(child, 0); > + if (!port->base) { > + dev_err(&pdev->dev, "PCIe%d.%d: cannot map registers\n", > + port->port, port->lane); > + continue; > + } > + > + if (orion_pcie_link_up(port->base)) { > + port->haslink = 1; > + dev_info(&pdev->dev, "PCIe%d.%d: link up\n", > + port->port, port->lane); > + } else { > + port->haslink = 0; > + dev_info(&pdev->dev, "PCIe%d.%d: link down\n", > + port->port, port->lane); > + } > + > + port->clk = of_clk_get_by_name(child, NULL); > + if (!port->clk) { > + dev_err(&pdev->dev, "PCIe%d.%d: cannot get clock\n", > + port->port, port->lane); > + iounmap(port->base); > + port->haslink = 0; > + continue; > + } > + > + port->dn = child; > + > + clk_prepare_enable(port->clk); > + spin_lock_init(&port->conf_lock); > + > + pci_sw_pci_bridge_init(&port->bridge); > + port->bridge.vendor = PCI_VENDOR_ID_MARVELL; > + port->bridge.device = MARVELL_EMULATED_PCI_PCI_BRIDGE_ID; > + port->bridge.primary_bus = 0; > + port->bridge.secondary_bus = PCI_SLOT(port->devfn); > + port->bridge.subordinate_bus = PCI_SLOT(port->devfn); > + > + i++; > + } > + > + mvebu_pcie_enable(pcie); > + > + return 0; > +} > + > +static const struct of_device_id mvebu_pcie_of_match_table[] = { > + { .compatible = "marvell,armada-370-xp-pcie", }, > + {}, > +}; > +MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table); > + > +static struct platform_driver mvebu_pcie_driver = { > + .driver = { > + .owner = THIS_MODULE, > + .name = "mvebu-pcie", > + .of_match_table = > + of_match_ptr(mvebu_pcie_of_match_table), > + }, > +}; > + > +static int mvebu_pcie_init(void) > +{ > + return platform_driver_probe(&mvebu_pcie_driver, > + mvebu_pcie_probe); > +} > + > +subsys_initcall(mvebu_pcie_init); > + > +MODULE_AUTHOR("Thomas Petazzoni <thomas.petazzoni@free-electrons.com>"); > +MODULE_DESCRIPTION("Marvell EBU PCIe driver"); > +MODULE_LICENSE("GPL"); > -- > 1.7.9.5 >
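For reference, the per-interface alternative Bjorn suggests at the top of this message would look roughly like the sketch below on the 3.8-era API. This is hypothetical code, not part of the patch: mvebu_pcie_scan_port() is a made-up helper, and the per-port aperture setup it would require is exactly what the follow-up messages discuss.

/* Hypothetical: scan each PCIe interface as its own root bus. Each
 * port needs its own pre-assigned apertures in 'resources', and could
 * be placed in its own PCI domain to keep bus numbers independent.
 */
static struct pci_bus *mvebu_pcie_scan_port(struct mvebu_pcie_port *port,
					    struct list_head *resources)
{
	/* sysdata (here the port) is what the pci_ops config
	 * accessors get back through bus->sysdata.
	 */
	return pci_scan_root_bus(NULL, 0, &mvebu_pcie_ops, port, resources);
}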
On Mon, Jan 28, 2013 at 08:29:24PM -0700, Bjorn Helgaas wrote: > On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni > <thomas.petazzoni@free-electrons.com> wrote: > > This driver implements the support for the PCIe interfaces on the > > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to > > cover earlier families of Marvell SoCs, such as Dove, Orion and > > Kirkwood. > > > > The driver implements the hw_pci operations needed by the core ARM PCI > > code to setup PCI devices and get their corresponding IRQs, and the > > pci_ops operations that are used by the PCI core to read/write the > > configuration space of PCI devices. > > > > Since the PCIe interfaces of Marvell SoCs are completely separate and > > not linked together in a bus, this driver sets up an emulated PCI host > > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe > > interface. > > There's no Linux requirement that multiple PCIe interfaces appear to > be in the same hierarchy. You can just use pci_scan_root_bus() > separately on each interface. Each interface can be in its own domain > if necessary. What you suggest is basically what the Marvell driver did originally, the problem is that Linux requires a pre-assigned aperture for each PCI domain/root bus, and these new chips have so many PCI-E ports that they can exhaust the physical address space, and also a limited internal HW resource for setting address routing. Thus they require resource allocation that is sensitive to the devices present downstream. By far the simplest solution is to merge all the physical links into a single domain and rely on existing PCI resource allocation code to drive allocation of scarce physical address space and demand allocate the HW routing resource (specifically there are enough resources to accommodate MMIO only devices on every bus, but not enough to accommodate MMIO and IO on every bus). > > +/* > > + * For a given PCIe interface (represented by a mvebu_pcie_port > > + * structure), we read the PCI configuration space of the > > + * corresponding PCI-to-PCI bridge in order to find out which range of > > + * I/O addresses and memory addresses have been assigned to this PCIe > > + * interface. Using these informations, we set up the appropriate > > + * address decoding windows so that the physical address are actually > > + * resolved to the right PCIe interface. > > + */ > > Are you inferring the host bridge apertures by using the resources > assigned to devices under the bridge, i.e., taking the union of all The flow is different, a portion of physical address space is set aside for use by PCI-E (via DT) and that portion is specified in the struct resource's ultimately attached to the PCI domain for the bus scan. You could call that the 'host bridge aperture' though it doesn't reflect any HW configuration at all. The values come from the device tree. During the bus scan the Linux core code splits up that contiguous space and assigns to the PCI-PCI bridges and devices under that domain. Each physical PCI-E link on the chip is seen by Linux through the SW emulated PCI-PCI bridge attached to bus 0. When Linux configures the bridge windows it triggers this code here to copy that window information from the PCI config space into non-standard internal HW registers. The purpose of the SW PCI-PCI bridge and this code here is to give the Linux PCI core control over the window (MMIO,IO,busnr) assigned to the PCI-E link.
This arrangement with PCI-PCI bridges controlling address routing is part of the PCI-E standard, in this instance Marvell did not implement the required config space in HW so the driver is working around that deficiency. Other drivers, like tegra have a similar design, but their hardware does implement PCI-PCI bridge configuration space and does drive address decoding through the HW PCI-PCI window registers. Having PCI-E links be bridges, not domains/root_bus's is in-line with the standard and works better with the Linux PCI resource allocator. Jason
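The decoding Jason refers to is the standard type 1 header layout from the PCI-to-PCI Bridge Architecture Specification: the 8-bit I/O base/limit registers carry address bits 15:12 in their top nibble (with bits 31:16 in the separate upper-16 registers), and the 16-bit memory base/limit registers carry bits 31:20. Below is a sketch of that decode, with illustrative helper names but field widths matching the emulated bridge registers the patch reads in mvebu_pcie_window_config_port():

/* Illustrative decode of a type 1 bridge's forwarding windows. The
 * I/O limit's low 12 bits and the memory limit's low 20 bits read as
 * all-ones (4K and 1M granularity); limit < base means the window is
 * disabled.
 */
static void bridge_io_window(u8 iobase, u8 iolimit,
			     u16 iobaseupper, u16 iolimitupper,
			     u32 *base, u32 *limit)
{
	*base = ((iobase & 0xf0) << 8) | ((u32)iobaseupper << 16);
	*limit = ((iolimit & 0xf0) << 8) | 0xfff | ((u32)iolimitupper << 16);
}

static void bridge_mem_window(u16 membase, u16 memlimit,
			      u32 *base, u32 *limit)
{
	*base = ((u32)membase & 0xfff0) << 16;
	*limit = (((u32)memlimit & 0xfff0) << 16) | 0xfffff;
}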
Bjorn, Jason, On Mon, 28 Jan 2013 22:55:08 -0700, Jason Gunthorpe wrote: > > There's no Linux requirement that multiple PCIe interfaces appear to > > be in the same hierarchy. You can just use pci_scan_root_bus() > > separately on each interface. Each interface can be in its own domain > > if necessary. > > What you suggest is basically what the Marvell driver did originally, > the problem is that Linux requires a pre-assigned aperture for each > PCI domain/root bus, and these new chips have so many PCI-E ports that > they can exhaust the physical address space, and also a limited > internal HW resource for setting address routing. > > Thus they require resource allocation that is sensitive to the devices > present downstream. > > By far the simplest solution is to merge all the physical links into a > single domain and rely on existing PCI resource allocation code to > drive allocation of scarce physical address space and demand allocate > the HW routing resource (specifically there are enough resources to > accommodate MMIO only devices on every bus, but not enough to > accommodate MMIO and IO on every bus). > > > > +/* > > > + * For a given PCIe interface (represented by a mvebu_pcie_port > > > + * structure), we read the PCI configuration space of the > > > + * corresponding PCI-to-PCI bridge in order to find out which range of > > > + * I/O addresses and memory addresses have been assigned to this PCIe > > > + * interface. Using these informations, we set up the appropriate > > > + * address decoding windows so that the physical address are actually > > > + * resolved to the right PCIe interface. > > > + */ > > > > Are you inferring the host bridge apertures by using the resources > > assigned to devices under the bridge, i.e., taking the union of all > > The flow is different, a portion of physical address space is set > aside for use by PCI-E (via DT) and that portion is specified in the > struct resource's ultimately attached to the PCI domain for the bus > scan. You could call that the 'host bridge aperture' though it doesn't > reflect any HW configuration at all. The values come from the device > tree. > > During the bus scan the Linux core code splits up that contiguous > space and assigns to the PCI-PCI bridges and devices under that domain. > > Each physical PCI-E link on the chip is seen by Linux through the SW > emulated PCI-PCI bridge attached to bus 0. When Linux configures the > bridge windows it triggers this code here to copy that window > information from the PCI config space into non-standard internal HW > registers. > > The purpose of the SW PCI-PCI bridge and this code here is to give > the Linux PCI core control over the window (MMIO,IO,busnr) assigned > to the PCI-E link. > > This arrangement with PCI-PCI bridges controlling address routing is > part of the PCI-E standard, in this instance Marvell did not implement > the required config space in HW so the driver is working around that > deficiency. > > Other drivers, like tegra have a similar design, but their hardware > does implement PCI-PCI bridge configuration space and does drive > address decoding through the HW PCI-PCI window registers. > > Having PCI-E links be bridges, not domains/root_bus's is in-line with > the standard and works better with the Linux PCI resource allocator. Thanks a lot Jason for this explanation, I couldn't have explained it as clearly as you did. Bjorn, does Jason's reply answer your questions? Or do you need other details? Thanks! Thomas
Dear Stephen Warren, On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote: > > +Mandatory properties: > > +- compatible: must be "marvell,armada-370-xp-pcie" > > +- status: either "disabled" or "okay" > > status is a standard DT property; I certainly wouldn't expect its > presence to be mandatory (there's a defined default), nor would I expect > each device's binding to redefine this property. Ok. > > +- marvell,pcie-port: the physical PCIe port number > > Should the standardized cell-index property be used here instead? Or, > perhaps that property is deprecated/discouraged... The problem is that I need two identifiers, the pcie-port and pcie-lane, and it would be strange to have one referenced as cell-index, and the other one as marvell,pcie-lane, no? Unless of course we can put two numbers in the cell-index property, but a quick grep in Documentation/devicetree/bindings/ seems to indicate that all users of cell-index use it with a single identifier. Just tell me what to do here, I don't have a strong opinion on this. > > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile > > > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o > > +ccflags-$(CONFIG_PCI_MVEBU) += \ > > + -I$(srctree)/arch/arm/plat-orion/include \ > > + -I$(srctree)/arch/arm/mach-mvebu/include > > That seems a little dangerous w.r.t. multi-platform zImage. Can the > required headers be moved out to somewhere more public to avoid this? Why is this dangerous for multi-platform zImage? For this specific driver only, some SoC-specific headers are used. I don't think it prevents another PCI driver (such as the Tegra one) from being built into the same kernel image, no? Also, this is kind of a temporary solution, because I can't fix all the problems in just the PCI patch series without making it horribly large. The reason why we need those include paths at the moment are: * For plat-orion/include, because of pcie.h that provides functions common to all Marvell SoCs regarding PCI. Ultimately, all the Marvell SoCs that use this common code should be migrated over to the DT-capable PCI driver that is submitted through this patch series. This will take a bit of time, and is too complex to do in one shot, together with the introduction of the driver itself. So ultimately, all the code in plat-orion/pcie.c will migrate into drivers/pci/host/pci-mvebu.c, and this plat-orion/include/plat/pcie.h file should disappear, removing the need for this special header path. * For mach-mvebu/include, because of the addr-map.h header that provides functions related to address decoding windows. This is also likely to evolve quite significantly when we'll make the PCI driver being used by the other Marvell SoC families (Kirkwood, Dove, Orion5x), and when this work is done, we can think of having a public header in include/linux that exposes the address decoding APIs, once it has stabilized a bit across the different Marvell SoC families. So, the bottom line is: yes I know those include paths are not nice, but I don't think they prevent multiplatform builds, and they are a temporary solution until we convert more Marvell SoC families to this new PCI driver. > > +/* > > + * Those are the product IDs used for the emulated PCI Host bridge and > > + * emulated PCI-to-PCI bridges. They are temporary until we get > > + * official IDs assigned. > > + */ > > +#define MARVELL_EMULATED_HOST_BRIDGE_ID 4141 > > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242 > > I assume that means we can't merge this driver yet. 
The cover letter > mentioned a desire to merge this for 3.9; there's not much time to get > official IDs assigned, then. I am working on getting real IDs assigned. For now, I'd like to work on getting all other issues fixed, and have this only problem remaining. > > > +static int mvebu_pcie_init(void) > > +{ > > + return platform_driver_probe(&mvebu_pcie_driver, > > + mvebu_pcie_probe); > > +} > > + > > +subsys_initcall(mvebu_pcie_init); > > Why isn't that just platform_driver_register()? I didn't test recently, but with my first version of the patch set, having an initialization as late as module_init() was too late. Some PCI fixup code was being executed *before* we get the opportunity of initializing the PCI driver, and it was crashing the kernel. I can provide more details if you want. > > +MODULE_LICENSE("GPL"); > > "GPL v2". Sure. Thomas
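For context on the init-order point: for built-in code, initcalls run in a fixed sequence (see include/linux/init.h), and module_init() maps to the device_initcall level. A sketch of why subsys_initcall() registers the host controller earlier:

/* Abridged built-in initcall order:
 *   core_initcall -> postcore_initcall -> arch_initcall ->
 *   subsys_initcall -> fs_initcall -> device_initcall -> late_initcall
 * Registering at subsys_initcall level means the PCI host driver is
 * probed before device_initcall-level code (anything registered with
 * module_init() in a built-in kernel) gets to touch PCI.
 */
static int __init mvebu_pcie_init(void)
{
	return platform_driver_probe(&mvebu_pcie_driver, mvebu_pcie_probe);
}
subsys_initcall(mvebu_pcie_init);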
On Tue, Jan 29, 2013 at 09:41:43AM +0100, Thomas Petazzoni wrote: > On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote: [...] > > > +static int mvebu_pcie_init(void) > > > +{ > > > + return platform_driver_probe(&mvebu_pcie_driver, > > > + mvebu_pcie_probe); > > > +} > > > + > > > +subsys_initcall(mvebu_pcie_init); > > > > Why isn't that just platform_driver_register()? > > I didn't test recently, but with my first version of the patch set, > having an initialization as late as module_init() was too late. Some > PCI fixup code was being executed *before* we get the opportunity of > initializing the PCI driver, and it was crashing the kernel. I can > provide more details if you want. Does this patch perhaps fix this crash? http://patchwork.ozlabs.org/patch/210870/ Thierry
Dear Thierry Reding, On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote: > Does this patch perhaps fix this crash? > > http://patchwork.ozlabs.org/patch/210870/ I'll test it, thanks for the notice! Thomas
On Mon, Jan 28, 2013 at 06:56:28PM +0000, Thomas Petazzoni wrote: > This driver implements the support for the PCIe interfaces on the > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to > cover earlier families of Marvell SoCs, such as Dove, Orion and > Kirkwood. > > The driver implements the hw_pci operations needed by the core ARM PCI > code to setup PCI devices and get their corresponding IRQs, and the > pci_ops operations that are used by the PCI core to read/write the > configuration space of PCI devices. > > Since the PCIe interfaces of Marvell SoCs are completely separate and > not linked together in a bus, this driver sets up an emulated PCI host > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe > interface. > > In addition, this driver enumerates the different PCIe slots, and for > those having a device plugged in, it sets up the necessary address > decoding windows, using the new armada_370_xp_alloc_pcie_window() > function from mach-mvebu/addr-map.c. > > Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> [snip] > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin) > +{ [snip] > + > + /* > + * Build an laddr array that describes the PCI device in a DT > + * way > + */ > + laddr[0] = cpu_to_be32(port->devfn << 8); > + laddr[1] = laddr[2] = 0; > + intspec = cpu_to_be32(pin); > + > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > + if (ret) { > + dev_err(&pcie->pdev->dev, > + "%s: of_irq_map_raw() failed, %d\n", > + __func__, ret); > + return ret; > + } Are you able to replace the above code with a call to of_irq_map_pci? I'm not sure which approach is better. The of_irq_map_pci function doesn't require the pin argument and instead uses the DT and/or performs its own pin swizzling. I guess this means that if there are PCIe devices in the DT tree that does any thing strange with pins then it would be reflected in the IRQ you get. I've found that you will also need to provide an implementation of pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch). > + > + return irq_create_of_mapping(oirq.controller, oirq.specifier, > + oirq.size); > +} > +static int mvebu_pcie_enable(struct mvebu_pcie *pcie) > +{ > + struct hw_pci hw; [snip] > + pci_common_init(&hw); > + > + return mvebu_pcie_window_config(pcie); > +} > + > +static int __init mvebu_pcie_probe(struct platform_device *pdev) > +{ [snip] > + > + mvebu_pcie_enable(pcie); > + > + return 0; > +} > + > +static const struct of_device_id mvebu_pcie_of_match_table[] = { > + { .compatible = "marvell,armada-370-xp-pcie", }, > + {}, > +}; > +MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table); > + > +static struct platform_driver mvebu_pcie_driver = { > + .driver = { > + .owner = THIS_MODULE, > + .name = "mvebu-pcie", > + .of_match_table = > + of_match_ptr(mvebu_pcie_of_match_table), > + }, > +}; > + > +static int mvebu_pcie_init(void) > +{ > + return platform_driver_probe(&mvebu_pcie_driver, > + mvebu_pcie_probe); > +} If you have multiple 'mvebu-pcie' in your DT then you will end up with multiple calls to mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init. However pci_common_init/pcibios_init_hw assumes it will only ever be called once, and will thus result in trying to create multiple busses with the same bus number. (The first root bus it creates is always zero provided you haven't implemented hw->scan). 
I noticed this in Thierry's patch set and posted an RFC patch which overcomes this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would want to include this in your patchset? Andrew Murray
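A sketch of the of_irq_map_pci()-based variant Andrew is suggesting, against the 3.8-era API (it assumes every pci_dev can be traced back to a DT node, which is what the pcibios_get_phb_of_node() implementation he mentions provides):

static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot,
				     u8 pin)
{
	struct of_irq oirq;
	int ret;

	/* Walks the DT interrupt-map for us, applying the standard
	 * pin swizzling for devices that have no DT node of their own.
	 */
	ret = of_irq_map_pci(dev, &oirq);
	if (ret)
		return ret;

	return irq_create_of_mapping(oirq.controller, oirq.specifier,
				     oirq.size);
}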
Dear Andrew Murray, On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote: > > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin) > > +{ > > [snip] > > > + > > + /* > > + * Build an laddr array that describes the PCI device in a DT > > + * way > > + */ > > + laddr[0] = cpu_to_be32(port->devfn << 8); > > + laddr[1] = laddr[2] = 0; > > + intspec = cpu_to_be32(pin); > > + > > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > > + if (ret) { > > + dev_err(&pcie->pdev->dev, > > + "%s: of_irq_map_raw() failed, %d\n", > > + __func__, ret); > > + return ret; > > + } > > Are you able to replace the above code with a call to of_irq_map_pci? I'm not > sure which approach is better. The of_irq_map_pci function doesn't require the > pin argument and instead uses the DT and/or performs its own pin swizzling. I > guess this means that if there are PCIe devices in the DT tree that does any > thing strange with pins then it would be reflected in the IRQ you get. I've > found that you will also need to provide an implementation of > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch). I did try using the of_irq_map_pci() function, but unfortunately, it didn't work. IIRC, it didn't work because none of the pci_dev in my PCI tree had any 'struct device_node' associated to them, or at least not the one that had the right pdev->bus->number and pdev->devfn. But, I guess that your patch that implements pcibios_get_phb_of_node() should fix this problem. I'll try this. Thanks! > > +static int mvebu_pcie_init(void) > > +{ > > + return platform_driver_probe(&mvebu_pcie_driver, > > + mvebu_pcie_probe); > > +} > > If you have multiple 'mvebu-pcie' in your DT then you will end up > with multiple calls to > mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init. Right. In practice, there will only ever be a single DT node, since all PCIe interfaces are sub-nodes of the PCI controller node. But I understand the theoretical problem. > However pci_common_init/pcibios_init_hw assumes it will only ever be called > once, and will thus result in trying to create multiple busses with the same > bus number. (The first root bus it creates is always zero provided you haven't > implemented hw->scan). > > I noticed this in Thierry's patch set and posted an RFC patch which overcomes > this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would > want to include this in your patchset? Sure, I'll give it a test, and report if it works for me. Thanks a lot! Thomas
On Tue, Jan 29, 2013 at 01:45:22PM +0000, Thomas Petazzoni wrote: > Dear Andrew Murray, > > On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote: > > > > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin) > > > > +{ > > > > > > [snip] > > > > > > > + > > > > + /* > > > > + * Build an laddr array that describes the PCI device in a DT > > > > + * way > > > > + */ > > > > + laddr[0] = cpu_to_be32(port->devfn << 8); > > > > + laddr[1] = laddr[2] = 0; > > > > + intspec = cpu_to_be32(pin); > > > > + > > > > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > > > > + if (ret) { > > > > + dev_err(&pcie->pdev->dev, > > > > + "%s: of_irq_map_raw() failed, %d\n", > > > > + __func__, ret); > > > > + return ret; > > > > + } > > > > > > Are you able to replace the above code with a call to of_irq_map_pci? I'm not > > > sure which approach is better. The of_irq_map_pci function doesn't require the > > > pin argument and instead uses the DT and/or performs its own pin swizzling. I > > > guess this means that if there are PCIe devices in the DT tree that does any > > > thing strange with pins then it would be reflected in the IRQ you get. I've > > > found that you will also need to provide an implementation of > > > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch). > > > > I did try using the of_irq_map_pci() function, but unfortunately, it > > didn't work. IIRC, it didn't work because none of the pci_dev in my PCI > > tree had any 'struct device_node' associated to them, or at least not > > the one that had the right pdev->bus->number and pdev->devfn. > > > > But, I guess that your patch that implements pcibios_get_phb_of_node() > > should fix this problem. I'll try this. Thanks! My bios32 patch departs slightly from your v2 04/27 patch in that it updates hw_pci to contain a device node rather than opaque private data and my pcibios_get_phb_of_node implementation relies on this. If you wanted to stick with the implementation you and Thierry share then you'd have to find another way to get to the device node from the void **private_data. > > > +static int mvebu_pcie_init(void) > > > +{ > > > + return platform_driver_probe(&mvebu_pcie_driver, > > > + mvebu_pcie_probe); > > > +} > > > > If you have multiple 'mvebu-pcie' in your DT then you will end up > > with multiple calls to > > mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init. > > Right. In practice, there will only ever be a single DT node, since all > PCIe interfaces are sub-nodes of the PCI controller node. But I > understand the theoretical problem. > > > However pci_common_init/pcibios_init_hw assumes it will only ever be called > > once, and will thus result in trying to create multiple busses with the same > > bus number. (The first root bus it creates is always zero provided you haven't > > implemented hw->scan). > > > > I noticed this in Thierry's patch set and posted an RFC patch which overcomes > > this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would > > want to include this in your patchset? > > Sure, I'll give it a test, and report if it works for me. > > Thanks a lot! > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com >
On Tue, Jan 29, 2013 at 02:05:22PM +0000, Andrew Murray wrote: > On Tue, Jan 29, 2013 at 01:45:22PM +0000, Thomas Petazzoni wrote: > > Dear Andrew Murray, > > > > On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote: > > > > > > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin) > > > > +{ > > > > > > [snip] > > > > > > > + > > > > + /* > > > > + * Build an laddr array that describes the PCI device in a DT > > > > + * way > > > > + */ > > > > + laddr[0] = cpu_to_be32(port->devfn << 8); > > > > + laddr[1] = laddr[2] = 0; > > > > + intspec = cpu_to_be32(pin); > > > > + > > > > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > > > > + if (ret) { > > > > + dev_err(&pcie->pdev->dev, > > > > + "%s: of_irq_map_raw() failed, %d\n", > > > > + __func__, ret); > > > > + return ret; > > > > + } > > > > > > Are you able to replace the above code with a call to of_irq_map_pci? I'm not > > > sure which approach is better. The of_irq_map_pci function doesn't require the > > > pin argument and instead uses the DT and/or performs its own pin swizzling. I > > > guess this means that if there are PCIe devices in the DT tree that does any > > > thing strange with pins then it would be reflected in the IRQ you get. I've > > > found that you will also need to provide an implementation of > > > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch). > > > > I did try using the of_irq_map_pci() function, but unfortunately, it > > didn't work. IIRC, it didn't work because none of the pci_dev in my PCI > > tree had any 'struct device_node' associated to them, or at least not > > the one that had the right pdev->bus->number and pdev->devfn. > > > > But, I guess that your patch that implements pcibios_get_phb_of_node() > > should fix this problem. I'll try this. Thanks! > > My bios32 patch departs slightly from your v2 04/27 patch in that it updates > hw_pci to contain a device node rather than opaque private data and my > pcibios_get_phb_of_node implementation relies on this. If you wanted to stick > with the implementation you and Thierry share then you'd have to find another > way to get to the device node from the void **private_data. If at all possible I think the right thing to do is reuse the generic pcibios_get_phb_of_node() implementation. On Tegra this turned out to require a minimal change to the DT bindings of the root port nodes to make sure they provide the correct address in the reg property. Thierry
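For context, the generic helper Thierry refers to lives in drivers/pci/of.c and has roughly the following shape (paraphrased from memory, so treat the details as approximate). It only finds a node if the host driver handed a parent struct device with a valid of_node to pci_create_root_bus(), which is what the next message walks through:

struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus)
{
	/* Only host bridges (root buses) are expected here. */
	if (WARN_ON(bus->self || bus->parent))
		return NULL;

	/* The node comes either from the bridge device itself or
	 * from the parent device passed to pci_create_root_bus().
	 */
	if (bus->bridge->of_node)
		return of_node_get(bus->bridge->of_node);
	if (bus->bridge->parent && bus->bridge->parent->of_node)
		return of_node_get(bus->bridge->parent->of_node);
	return NULL;
}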
Dear Thierry Reding, On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote: > If at all possible I think the right thing to do is reuse the generic > pcibios_get_phb_of_node() implementation. On Tegra this turned out to > require a minimal change to the DT bindings of the root port nodes to > make sure they provide the correct address in the reg property. Could you detail the change that was needed? The DT bindings I use for the Marvell PCIe driver are very, very similar to the ones you use for Tegra, since I basically inspired my entire DT binding on your work. And still, I think of_irq_map_pci() wasn't working for me. Best regards, Thomas
On Tue, Jan 29, 2013 at 03:29:37PM +0100, Thomas Petazzoni wrote: > Dear Thierry Reding, > > On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote: > > > If at all possible I think the right thing to do is reuse the generic > > pcibios_get_phb_of_node() implementation. On Tegra this turned out to > > require a minimal change to the DT bindings of the root port nodes to > > make sure they provide the correct address in the reg property. > > Could you detail the change that was needed? The DT bindings I use for > the Marvell PCIe driver are very, very similar to the ones you use for > Tegra, since I basically inspired my entire DT binding on your work. > And still, I think of_irq_map_pci() wasn't working for me. Now that I think about it, there were a few more changes needed. For one, the reg property of the root port nodes need to be in the format specified by the PCI DT binding. That is, 3 cells for the address and 2 cells for the size. So I end up with something like this: pcie-controller { ... ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */ 0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */ ...>; pci@1,0 { reg = <0x000800 0 0x80000000 0 0x1000>; ... }; pci@2,0 { reg = <0x001000 0 0x80001000 0 0x1000>; ... }; }; So what happens here is that for each root port (pci@1,0 and pci@2,0), the reg property is translated into the parent address space via the pcie-controller's ranges property. pci@1,0 gets the memory region 0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are actually windows through which the configuration space of the root ports is accessed.) At the same time this reg property maps both devices into the PCI address space at addresses 0:01.0 and 0:02.0 respectively. The second change is that you can't rely on ARM's default implementation of the bus scan operation, which calls pci_scan_root_bus(), passing in a NULL as the struct device which acts as the bus' parent. So on Tegra I added a custom implementation which calls pci_create_root_bus(), passing in the struct device of the PCI host bridge, whose .of_node field will be set to the pcie-controller node above. Incidentally this also fixed another issue where the PCI core and ARM's pci_common_init() both eventually end up calling pci_bus_add_devices(). I don't remember the exact symptoms but I think this was causing resource conflicts during the second enumeration or so. Because a proper struct device with the correct .of_node field is passed into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will know how to find it by looking at bus->bridge->parent->of_node. After that the generic matching code will search the bridge (pcie-controller) node's children and relate them to the struct pci_dev by devfn. This is done in pci_set_of_node() defined in drivers/pci/of.c, which calls of_pci_find_child_device() from drivers/of/of_pci.c. This is quite convoluted, but I hope it helps. Thierry
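Put together, the custom scan operation Thierry describes would look roughly like this sketch when adapted to the names in this patch (hypothetical code; the Tegra driver does the equivalent). Passing the platform device as parent is what populates bus->bridge->parent->of_node for the generic pcibios_get_phb_of_node():

static struct pci_bus * __init mvebu_pcie_scan_bus(int nr,
						   struct pci_sys_data *sys)
{
	struct mvebu_pcie *pcie = sys_to_pcie(sys);
	struct pci_bus *bus;

	/* Unlike ARM's default bus scan, pass a parent device so the
	 * root bus can be traced back to the pcie-controller node.
	 */
	bus = pci_create_root_bus(&pcie->pdev->dev, sys->busnr,
				  &mvebu_pcie_ops, sys, &sys->resources);
	if (!bus)
		return NULL;

	pci_scan_child_bus(bus);
	return bus;
}

This would then be hooked up via hw.scan = mvebu_pcie_scan_bus in mvebu_pcie_enable().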
On Tue, Jan 29, 2013 at 03:02:01PM +0000, Thierry Reding wrote: > On Tue, Jan 29, 2013 at 03:29:37PM +0100, Thomas Petazzoni wrote: > > Dear Thierry Reding, > > > > On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote: > > > > > If at all possible I think the right thing to do is reuse the generic > > > pcibios_get_phb_of_node() implementation. On Tegra this turned out to > > > require a minimal change to the DT bindings of the root port nodes to > > > make sure they provide the correct address in the reg property. > > > > Could you detail the change that was needed? The DT bindings I use for > > the Marvell PCIe driver are very, very similar to the ones you use for > > Tegra, since I basically inspired my entire DT binding on your work. > > And still, I think of_irq_map_pci() wasn't working for me. > > Now that I think about it, there were a few more changes needed. > For one, the reg property of the root port nodes need to be in the > format specified by the PCI DT binding. That is, 3 cells for the > address and 2 cells for the size. > > So I end up with something like this: > > pcie-controller { > ... > > ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */ > 0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */ > ...>; > > pci@1,0 { > reg = <0x000800 0 0x80000000 0 0x1000>; > ... > }; > > pci@2,0 { > reg = <0x001000 0 0x80001000 0 0x1000>; > ... > }; > }; > > So what happens here is that for each root port (pci@1,0 and pci@2,0), > the reg property is translated into the parent address space via the > pcie-controller's ranges property. pci@1,0 gets the memory region > 0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are > actually windows through which the configuration space of the root ports > is accessed.) > > At the same time this reg property maps both devices into the PCI > address space at addresses 0:01.0 and 0:02.0 respectively. > > The second change is that you can't rely on ARM's default implementation > of the bus scan operation, which calls pci_scan_root_bus(), passing in a > NULL as the struct device which acts as the bus' parent. So on Tegra I > added a custom implementation which calls pci_create_root_bus(), passing > in the struct device of the PCI host bridge, whose .of_node field will > be set to the pcie-controller node above. Incidentally this also fixed > another issue where the PCI core and ARM's pci_common_init() both > eventually end up calling pci_bus_add_devices(). I don't remember the > exact symptoms but I think this was causing resource conflicts during > the second enumeration or so. > > Because a proper struct device with the correct .of_node field is passed > into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will > know how to find it by looking at bus->bridge->parent->of_node. After > that the generic matching code will search the bridge (pcie-controller) > node's children and relate them to the struct pci_dev by devfn. This is > done in pci_set_of_node() defined in drivers/pci/of.c, which calls > of_pci_find_child_device() from drivers/of/of_pci.c. > > This is quite convoluted, but I hope it helps. Thanks, this is very helpful. I will see if this lets me avoid implementing pcibios_get_phb_of_node. Thanks, Andrew Murray
Dear Thierry Reding,

On Tue, 29 Jan 2013 16:02:01 +0100, Thierry Reding wrote:

> Now that I think about it, there were a few more changes needed. For one, the reg property of the root port nodes needs to be in the format specified by the PCI DT binding. That is, 3 cells for the address and 2 cells for the size.
>
> So I end up with something like this:
>
> 	pcie-controller {
> 		...
>
> 		ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */
> 			  0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */
> 			  ...>;
>
> 		pci@1,0 {
> 			reg = <0x000800 0 0x80000000 0 0x1000>;
> 			...
> 		};
>
> 		pci@2,0 {
> 			reg = <0x001000 0 0x80001000 0 0x1000>;
> 			...
> 		};
> 	};
>
> So what happens here is that for each root port (pci@1,0 and pci@2,0), the reg property is translated into the parent address space via the pcie-controller's ranges property. pci@1,0 gets the memory region 0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are actually windows through which the configuration space of the root ports is accessed.)
>
> At the same time this reg property maps both devices into the PCI address space at addresses 0:01.0 and 0:02.0 respectively.

For this part, I think I've done exactly the same thing in the Marvell PCIe DT binding.

> The second change is that you can't rely on ARM's default implementation of the bus scan operation, which calls pci_scan_root_bus(), passing in NULL as the struct device which acts as the bus' parent. So on Tegra I added a custom implementation which calls pci_create_root_bus(), passing in the struct device of the PCI host bridge, whose .of_node field will be set to the pcie-controller node above. Incidentally, this also fixed another issue where the PCI core and ARM's pci_common_init() both eventually end up calling pci_bus_add_devices(). I don't remember the exact symptoms, but I think this was causing resource conflicts during the second enumeration or so.
>
> Because a proper struct device with the correct .of_node field is passed into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will know how to find it by looking at bus->bridge->parent->of_node. After that, the generic matching code will search the bridge (pcie-controller) node's children and relate them to the struct pci_dev by devfn. This is done in pci_set_of_node(), defined in drivers/pci/of.c, which calls of_pci_find_child_device() from drivers/of/of_pci.c.

This is quite certainly the part that I was missing. I'll try this and let you know.

Thanks a lot for the lengthy, but very useful explanation!

Thomas
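To make Thierry's suggestion concrete, here is a minimal sketch of such a custom bus-scan operation, assuming a driver-private struct mvebu_pcie and a pci_ops named mvebu_pcie_ops (both names hypothetical, not taken from the posted patch). The key point is that a real struct device, whose .of_node points at the pcie-controller node, is handed to pci_create_root_bus():

	/*
	 * Sketch only: a custom hw_pci .scan operation. Because the host
	 * bridge's struct device is passed to pci_create_root_bus(), the
	 * generic pcibios_get_phb_of_node() can later find the DT node
	 * through bus->bridge->parent->of_node.
	 */
	static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys)
	{
		struct mvebu_pcie *pcie = sys->private_data;
		struct pci_bus *bus;

		/* pcie->dev->of_node is the pcie-controller node */
		bus = pci_create_root_bus(pcie->dev, sys->busnr,
					  &mvebu_pcie_ops, sys,
					  &sys->resources);
		if (!bus)
			return NULL;

		pci_scan_child_bus(bus);

		return bus;
	}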
On Mon, Jan 28, 2013 at 10:55 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Jan 28, 2013 at 08:29:24PM -0700, Bjorn Helgaas wrote:
>> On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni
>> <thomas.petazzoni@free-electrons.com> wrote:
>> > This driver implements the support for the PCIe interfaces on the
>> > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
>> > cover earlier families of Marvell SoCs, such as Dove, Orion and
>> > Kirkwood.
>> >
>> > The driver implements the hw_pci operations needed by the core ARM PCI
>> > code to setup PCI devices and get their corresponding IRQs, and the
>> > pci_ops operations that are used by the PCI core to read/write the
>> > configuration space of PCI devices.
>> >
>> > Since the PCIe interfaces of Marvell SoCs are completely separate and
>> > not linked together in a bus, this driver sets up an emulated PCI host
>> > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
>> > interface.
>>
>> There's no Linux requirement that multiple PCIe interfaces appear to
>> be in the same hierarchy. You can just use pci_scan_root_bus()
>> separately on each interface. Each interface can be in its own domain
>> if necessary.
>
> What you suggest is basically what the Marvell driver did originally; the problem is that Linux requires a pre-assigned aperture for each PCI domain/root bus, and these new chips have so many PCI-E ports that they can exhaust the physical address space, and also a limited internal HW resource for setting address routing.
>
> Thus they require resource allocation that is sensitive to the devices present downstream.
>
> By far the simplest solution is to merge all the physical links into a single domain and rely on existing PCI resource allocation code to drive allocation of scarce physical address space and demand allocate the HW routing resource (specifically, there are enough resources to accommodate MMIO-only devices on every bus, but not enough to accommodate MMIO and IO on every bus).
>
>> > +/*
>> > + * For a given PCIe interface (represented by a mvebu_pcie_port
>> > + * structure), we read the PCI configuration space of the
>> > + * corresponding PCI-to-PCI bridge in order to find out which range of
>> > + * I/O addresses and memory addresses have been assigned to this PCIe
>> > + * interface. Using these informations, we set up the appropriate
>> > + * address decoding windows so that the physical address are actually
>> > + * resolved to the right PCIe interface.
>> > + */
>>
>> Are you inferring the host bridge apertures by using the resources
>> assigned to devices under the bridge, i.e., taking the union of all
>
> The flow is different: a portion of physical address space is set aside for use by PCI-E (via DT) and that portion is specified in the struct resource's ultimately attached to the PCI domain for the bus scan. You could call that the 'host bridge aperture', though it doesn't reflect any HW configuration at all. The values come from the device tree.

I think I would understand this better if we had a concrete example to talk about, say a dmesg log and corresponding lspci -v output.

As I understand it, the DT is a description of the hardware, so in that sense, the DT can't set aside physical address space. It can describe what the hardware does with the address space, and I assume that's what you mean. Maybe the hardware isn't configurable, e.g., it is hard-wired to route certain address ranges to PCIe?

> During the bus scan the Linux core code splits up that contiguous space and assigns it to the PCI-PCI bridges and devices under that domain.
>
> Each physical PCI-E link on the chip is seen by Linux through the SW emulated PCI-PCI bridge attached to bus 0. When Linux configures the bridge windows, it triggers this code here to copy that window information from the PCI config space into non-standard internal HW registers.
>
> The purpose of the SW PCI-PCI bridge and this code here is to give the Linux PCI core control over the window (MMIO, IO, busnr) assigned to the PCI-E link.
>
> This arrangement with PCI-PCI bridges controlling address routing is part of the PCI-E standard; in this instance Marvell did not implement the required config space in HW, so the driver is working around that deficiency.
>
> Other drivers, like tegra, have a similar design, but their hardware does implement PCI-PCI bridge configuration space and does drive address decoding through the HW PCI-PCI window registers.
>
> Having PCI-E links be bridges, not domains/root_bus's, is in line with the standard and works better with the Linux PCI resource allocator.
>
> Jason
Dear Bjorn Helgaas,

On Tue, 29 Jan 2013 10:47:09 -0700, Bjorn Helgaas wrote:

> I think I would understand this better if we had a concrete example to
> talk about, say a dmesg log and corresponding lspci -v output.

Please note that the cover letter of this patch series has the lspci -vvv output.

Best regards,

Thomas
On Tue, Jan 29, 2013 at 10:47:09AM -0700, Bjorn Helgaas wrote:
> As I understand it, the DT is a description of the hardware, so in
> that sense, the DT can't set aside physical address space. It can
> describe what the hardware does with the address space, and I assume
> that's what you mean. Maybe the hardware isn't configurable, e.g., it
> is hard-wired to route certain address ranges to PCIe?

The DT is largely a description of the hardware, but when it comes to addresses, particularly HW-programmable addresses, there is a general expectation that the driver/bootloader will program HW address decoders to either match the addresses given in the DT, or to new values guided by the DT addresses. In a real sense that means the DT also describes the physical address map the kernel should use.

In the PCI-E case the DT PCI-E HW description includes physical address ranges to use for the MMIO/IO/PREFETCH PCI-E interface windows, and the driver is expected to program the internal HW address decoders based on those address ranges.

The catch is that the hardware decoders are on a link-by-link basis, not on a root-complex basis, so the programming can only take place once the Linux kernel has done PCI resource assignment.

So when I say set aside, I mean for instance, the PCI-E entry in DT has 128M of physical address space marked for PCI MMIO use. The kernel does PCI resource allocation, and the HW decoders in each link will be set to claim some portion of the 128M, based on the MMIO windows programmed on the PCI-PCI root port bridges. The remaining part of the 128M is dead address space, not claimed by any hardware block at all.

Jason
On Tue, Jan 29, 2013 at 11:41 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Jan 29, 2013 at 10:47:09AM -0700, Bjorn Helgaas wrote:
>
>> As I understand it, the DT is a description of the hardware, so in
>> that sense, the DT can't set aside physical address space. It can
>> describe what the hardware does with the address space, and I assume
>> that's what you mean. Maybe the hardware isn't configurable, e.g., it
>> is hard-wired to route certain address ranges to PCIe?
>
> The DT is largely a description of the hardware, but when it comes to addresses, particularly HW-programmable addresses, there is a general expectation that the driver/bootloader will program HW address decoders to either match the addresses given in the DT, or to new values guided by the DT addresses.
>
> In a real sense that means the DT also describes the physical address map the kernel should use.
>
> In the PCI-E case the DT PCI-E HW description includes physical address ranges to use for the MMIO/IO/PREFETCH PCI-E interface windows, and the driver is expected to program the internal HW address decoders based on those address ranges.
>
> The catch is that the hardware decoders are on a link-by-link basis, not on a root-complex basis, so the programming can only take place once the Linux kernel has done PCI resource assignment.
>
> So when I say set aside, I mean for instance, the PCI-E entry in DT has 128M of physical address space marked for PCI MMIO use. The kernel does PCI resource allocation, and the HW decoders in each link will be set to claim some portion of the 128M, based on the MMIO windows programmed on the PCI-PCI root port bridges. The remaining part of the 128M is dead address space, not claimed by any hardware block at all.

Thanks, this really helps get to the issue that the PCI core will care about. The root ports look like normal bridges, so the core assumes it can manage their windows as needed, as long as the windows stay inside the host bridge apertures that are logically upstream from the root ports.

In your example, it sounds like the 128M should be treated as the host bridge aperture. Is there any reason not to do that? It sounds like there's no place you can actually program that 128M region into the hardware, and you would just program pieces of that region as root port windows. But that should be OK from the core's perspective.
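In code terms, treating the set-aside region as the host bridge aperture might look like the following minimal sketch (the addresses are illustrative, not taken from a real board, and mvebu_pcie_setup is a hypothetical hw_pci .setup hook):

	/* Sketch only: the DT-provided PCI MMIO range handed to the PCI
	 * core as the host bridge aperture; the core then carves the root
	 * port windows out of this region during resource assignment. */
	static struct resource mvebu_pcie_mem = {
		.name	= "PCIe MEM",
		.start	= 0xe0000000,		/* illustrative */
		.end	= 0xe0000000 + SZ_128M - 1,
		.flags	= IORESOURCE_MEM,
	};

	static int mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
	{
		pci_add_resource_offset(&sys->resources, &mvebu_pcie_mem, 0);
		return 1;
	}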
On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
> > So when I say set aside, I mean for instance, the PCI-E entry in DT
> > has 128M of physical address space marked for PCI MMIO use. The kernel
> > does PCI resource allocation, and the HW decoders in each link will be
> > set to claim some portion of the 128M, based on the MMIO windows
> > programmed on the PCI-PCI root port bridges. The remaining part of the
> > 128M is dead address space, not claimed by any hardware block at all.
>
> Thanks, this really helps get to the issue that the PCI core will care
> about. The root ports look like normal bridges, so the core assumes
> it can manage their windows as needed, as long as the windows stay
> inside the host bridge apertures that are logically upstream from the
> root ports.

Yes, that is basically correct. This is what the PCI-E specification says the root complex/root port should look like, and this is what some SOC hardware implements fully in hardware. The small wrinkle with Marvell is that the PCI-PCI bridge config space is created by the driver, since the HW does not expose a standard config space.

> In your example, it sounds like the 128M should be treated as the host
> bridge aperture. Is there any reason not to do that? It sounds like
> there's no place you can actually program that 128M region into the
> hardware, and you would just program pieces of that region as root
> port windows. But that should be OK from the core's perspective.

AFAIK this is already what Thomas's driver is doing.

Regards,
Jason
On Tue, Jan 29, 2013 at 12:18 PM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
>> > So when I say set aside, I mean for instance, the PCI-E entry in DT
>> > has 128M of physical address space marked for PCI MMIO use. The kernel
>> > does PCI resource allocation, and the HW decoders in each link will be
>> > set to claim some portion of the 128M, based on the MMIO windows
>> > programmed on the PCI-PCI root port bridges. The remaining part of the
>> > 128M is dead address space, not claimed by any hardware block at all.
>>
>> Thanks, this really helps get to the issue that the PCI core will care
>> about. The root ports look like normal bridges, so the core assumes
>> it can manage their windows as needed, as long as the windows stay
>> inside the host bridge apertures that are logically upstream from the
>> root ports.
>
> Yes, that is basically correct. This is what the PCI-E specification says the root complex/root port should look like, and this is what some SOC hardware implements fully in hardware. The small wrinkle with Marvell is that the PCI-PCI bridge config space is created by the driver, since the HW does not expose a standard config space.

Oh, so the actual *root port* itself doesn't conform to the spec? Wow, that's worse than I expected.

Then I guess you have to emulate it and make sure its config space is complete enough and functional enough so that all the link management, power management, AER, etc., code in the core works as well as it would with a conforming device.

>> In your example, it sounds like the 128M should be treated as the host
>> bridge aperture. Is there any reason not to do that? It sounds like
>> there's no place you can actually program that 128M region into the
>> hardware, and you would just program pieces of that region as root
>> port windows. But that should be OK from the core's perspective.
>
> AFAIK this is already what Thomas's driver is doing.
>
> Regards,
> Jason
On 01/29/2013 01:41 AM, Thomas Petazzoni wrote:
> Dear Stephen Warren,
>
> On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote:
>
>>> +Mandatory properties:
>>> +- compatible: must be "marvell,armada-370-xp-pcie"
>>> +- status: either "disabled" or "okay"
>>
>> status is a standard DT property; I certainly wouldn't expect its
>> presence to be mandatory (there's a defined default), nor would I expect
>> each device's binding to redefine this property.
>
> Ok.
>
>>> +- marvell,pcie-port: the physical PCIe port number
>>
>> Should the standardized cell-index property be used here instead? Or,
>> perhaps that property is deprecated/discouraged...
>
> The problem is that I need two identifiers, the pcie-port and pcie-lane, and it would be strange to have one referenced as cell-index, and the other one as marvell,pcie-lane, no?

Yes, using a custom property for half of the information and a standard property for the other half would be odd.

> Unless of course we can put two numbers in the cell-index property, but a quick grep in Documentation/devicetree/bindings/ seems to indicate that all users of cell-index use it with a single identifier.
>
> Just tell me what to do here, I don't have a strong opinion on this.

It's probably fine as-is then. Although I wasn't sure exactly what port/lane meant; is there some kind of mux/cross-bar between the PCIe root ports and the physical lanes/balls/pins on the chip?

>>> diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
>>>
>>> +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
>>> +ccflags-$(CONFIG_PCI_MVEBU) += \
>>> +	-I$(srctree)/arch/arm/plat-orion/include \
>>> +	-I$(srctree)/arch/arm/mach-mvebu/include
>>
>> That seems a little dangerous w.r.t. multi-platform zImage. Can the
>> required headers be moved out to somewhere more public to avoid this?
>
> Why is this dangerous for multi-platform zImage? For this specific driver only, some SoC-specific headers are used. I don't think it prevents another PCI driver (such as the Tegra one) from being built into the same kernel image, no?

Aren't those ccflags applied to everything that's built by that Makefile? If they were applied only to one .o file, it'd probably be OK, but I don't see how that's specified.

I'm not especially bothered by reaching into the mach/plat include directories, especially since you're well aware it needs cleaning up; I just don't think that Tegra's PCIe driver is going to compile too well against an Orion/mvebu header if one were to get picked up first.
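Stephen's point is that ccflags-y applies directory-wide. A minimal sketch of the per-object alternative he alludes to, using kbuild's per-file CFLAGS (assuming the SoC headers stay where they are for now):

	# Sketch only: scope the SoC include paths to pci-mvebu.o instead
	# of everything built from drivers/pci/host/.
	CFLAGS_pci-mvebu.o := -I$(srctree)/arch/arm/plat-orion/include \
			      -I$(srctree)/arch/arm/mach-mvebu/include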
On Tue, Jan 29, 2013 at 12:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jan 29, 2013 at 12:18 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
>> On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
>>> > So when I say set aside, I mean for instance, the PCI-E entry in DT
>>> > has 128M of physical address space marked for PCI MMIO use. The kernel
>>> > does PCI resource allocation, and the HW decoders in each link will be
>>> > set to claim some portion of the 128M, based on the MMIO windows
>>> > programmed on the PCI-PCI root port bridges. The remaining part of the
>>> > 128M is dead address space, not claimed by any hardware block at all.
>>>
>>> Thanks, this really helps get to the issue that the PCI core will care
>>> about. The root ports look like normal bridges, so the core assumes
>>> it can manage their windows as needed, as long as the windows stay
>>> inside the host bridge apertures that are logically upstream from the
>>> root ports.
>>
>> Yes, that is basically correct. This is what the PCI-E specification
>> says the root complex/root port should look like, and this is what some
>> SOC hardware implements fully in hardware. The small wrinkle with
>> Marvell is that the PCI-PCI bridge config space is created by the
>> driver, since the HW does not expose a standard config space.
>
> Oh, so the actual *root port* itself doesn't conform to the spec?
> Wow, that's worse than I expected.
>
> Then I guess you have to emulate it and make sure its config space is
> complete enough and functional enough so that all the link management,
> power management, AER, etc., code in the core works as well as it
> would with a conforming device.

I'm not sure the existing emulation in these patches is sufficient. For example, pci_sw_pci_bridge_write() updates bridge->membase when we write to the window register, but I don't see anything that updates the actual hardware decoder. That might be done in mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(), but that looks like it's only done once. If the PCI core updates a root port window later, I don't see where the hardware decoder will be updated.

Maybe you're counting on the window assignments to be static? The PCI core doesn't guarantee anything like that, though in the absence of hotplug I don't know any reason why it would change things.

I also forgot about the bus number munging in mvebu_pcie_rd_conf(). The PCI core can update the bridge secondary/subordinate registers. It looks like you don't support writing to them, and the read path (pci_sw_pci_bridge_read()) looks like it doesn't do any translation between the hardware and Linux bus numbers. I don't understand the system well enough to know if this is an issue.

>> In your example, it sounds like the 128M should be treated as the host
>> bridge aperture. Is there any reason not to do that?
>
> AFAIK this is already what Thomas's driver is doing.
On Tue, Jan 29, 2013 at 03:27:43PM -0700, Bjorn Helgaas wrote:
> I'm not sure the existing emulation in these patches is sufficient.
> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
> write to the window register, but I don't see anything that updates
> the actual hardware decoder. That might be done in
> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
> but that looks like it's only done once. If the PCI core updates a
> root port window later, I don't see where the hardware decoder will be
> updated.
>
> Maybe you're counting on the window assignments to be static? The PCI
> core doesn't guarantee anything like that, though in the absence of
> hotplug I don't know any reason why it would change things.

Agreed. Thomas, I think you need to directly update the Marvell hardware registers when config writes are made to the SW bridge. If this means it is too hard/complex to keep the code general, then I'd say make it part of the Marvell host driver.

> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> The PCI core can update the bridge secondary/subordinate registers.
> It looks like you don't support writing to them, and the read path
> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> between the hardware and Linux bus numbers. I don't understand the
> system well enough to know if this is an issue.

I was chatting with Thomas on this subject; it looks like there is a HW register that needs to be set to the subordinate bus number of the bridge, which will solve this weirdness.

Jason
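What Jason describes might look like the following minimal sketch: a write to the emulated bridge's 32-bit memory base/limit register is propagated straight into the Marvell address decoder instead of being latched once after enumeration. mvebu_pcie_set_mem_window() is a hypothetical helper wrapping the mvebu window registers:

	/* Sketch only: keep the HW decoder in sync with the emulated
	 * PCI-to-PCI bridge's memory window on every config write. */
	static void mvebu_sw_bridge_write_mem_window(struct mvebu_sw_pci_bridge *bridge,
						     u32 value)
	{
		phys_addr_t base, limit;

		bridge->membase = value & 0xffff;
		bridge->memlimit = value >> 16;

		/* per the PCI-to-PCI bridge spec, bits 15:4 of the base
		 * and limit registers are address bits 31:20 */
		base = (phys_addr_t)(bridge->membase & 0xfff0) << 16;
		limit = ((phys_addr_t)(bridge->memlimit & 0xfff0) << 16) | 0xfffff;

		mvebu_pcie_set_mem_window(bridge->port, base, limit - base + 1);
	}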
Dear Bjorn Helgaas,

On Tue, 29 Jan 2013 15:27:43 -0700, Bjorn Helgaas wrote:

> I'm not sure the existing emulation in these patches is sufficient.
> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
> write to the window register, but I don't see anything that updates
> the actual hardware decoder. That might be done in
> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
> but that looks like it's only done once.

That's correct. I currently let the Linux PCI core enumerate the real PCIe devices, allocate the resources, and set the appropriate values in the emulated PCI-to-PCI bridge registers. Once this is all done, the Marvell PCIe driver looks at each PCI-to-PCI bridge, reads the membase and iobase registers, and creates address decoding windows so that the physical addresses assigned by the Linux PCI core actually resolve to the right PCIe interface. This is done once and for all.

> If the PCI core updates a root port window later, I don't see where
> the hardware decoder will be updated.

It will not be updated.

> Maybe you're counting on the window assignments to be static? The PCI
> core doesn't guarantee anything like that, though in the absence of
> hotplug I don't know any reason why it would change things.

Right. Is supporting hotplug a show-stopper to get this included? I think it could be added later, if it happens to be needed, no?

I could of course do it, but the patch series is already quite large and complicated, so if we could merge a simple, but working, version first, and then improve on top of it when needed, it would be nice.

> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> The PCI core can update the bridge secondary/subordinate registers.
> It looks like you don't support writing to them, and the read path
> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> between the hardware and Linux bus numbers. I don't understand the
> system well enough to know if this is an issue.

Right. Could you explain a little bit for what reasons the PCI core could update the secondary/subordinate registers, and to what values it sets them?

For now, I statically assign the secondary bus register value to be X+1, where X is the number of the PCIe interface, since X=0 is reserved for the root bus (which has the host bridge and the PCI-to-PCI bridges).

Also, could you detail what kind of translation I should be doing when reading the hardware and Linux bus numbers?

I apologize for asking so many, probably silly, questions, but I am still learning all those internal PCI mechanisms.

Thanks,

Thomas
On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> +						 const struct resource *res,
> +						 resource_size_t start,
> +						 resource_size_t size,
> +						 resource_size_t align)
> +{
> +	if (!(res->flags & IORESOURCE_IO))
> +		return start;
> +
> +	/*
> +	 * The I/O regions must be 64K aligned, because the
> +	 * granularity of PCIe I/O address decoding windows is 64 K
> +	 */
> +	return round_up(start, SZ_64K);
> +}

You do realise that this will result in all PCI I/O BARs being rounded up to 64K.

I've just been digging through the PCI code and have come across a function - pcibios_window_alignment() - which the PCI code allows to be overridden and which allows you to increase the alignment requirement of bridge windows. It takes the PCI bus and window type as arguments.

I'd suggest using that, and checking whether the bus which is passed corresponds with a bus which gives you problems, so that you don't impose the 64K requirement on downstream bridges.
Russell,

On Wed, 30 Jan 2013 11:32:46 +0000, Russell King - ARM Linux wrote:

> You do realise that this will result in all PCI I/O BARs being rounded
> up to 64K.

Hum, yes, correct.

> I've just been digging through the PCI code and have come across a
> function - pcibios_window_alignment() - which the PCI code allows to be
> overridden and which allows you to increase the alignment requirement of
> bridge windows. It takes the PCI bus and window type as arguments.
>
> I'd suggest using that, and checking whether the bus which is passed
> corresponds with a bus which gives you problems, so that you don't
> impose the 64K requirement on downstream bridges.

Seems interesting indeed, I'll look into this idea!

Thanks!

Thomas
On Wed, Jan 30, 2013 at 11:32:46AM +0000, Russell King - ARM Linux wrote:
> On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> > +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> > +						 const struct resource *res,
> > +						 resource_size_t start,
> > +						 resource_size_t size,
> > +						 resource_size_t align)
> > +{
> > +	if (!(res->flags & IORESOURCE_IO))
> > +		return start;
> > +
> > +	/*
> > +	 * The I/O regions must be 64K aligned, because the
> > +	 * granularity of PCIe I/O address decoding windows is 64 K
> > +	 */
> > +	return round_up(start, SZ_64K);
> > +}
>
> You do realise that this will result in all PCI I/O BARs being rounded
> up to 64K.
>
> I've just been digging through the PCI code and have come across a
> function - pcibios_window_alignment() - which the PCI code allows to be
> overridden and which allows you to increase the alignment requirement of
> bridge windows. It takes the PCI bus and window type as arguments.
>
> I'd suggest using that, and checking whether the bus which is passed
> corresponds with a bus which gives you problems, so that you don't
> impose the 64K requirement on downstream bridges.

That approach isn't going to work very well with multi-platform, though, since the function can only be overridden on a per-architecture basis.

Thierry
Dear Thierry Reding,

On Wed, 30 Jan 2013 13:03:44 +0100, Thierry Reding wrote:

> That approach isn't going to work very well with multi-platform,
> though, since the function can only be overridden on a
> per-architecture basis.

I can do the same as is done for pcibios_align_resource(): put a single implementation of the function in arch/arm/kernel/bios32.c, and make it call a hook registered in the hw_pci structure, or even directly use a numerical value in hw_pci, as Russell suggested earlier.

Best regards,

Thomas
On Wed, Jan 30, 2013 at 01:03:44PM +0100, Thierry Reding wrote:
> On Wed, Jan 30, 2013 at 11:32:46AM +0000, Russell King - ARM Linux wrote:
> > On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> > > +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> > > +						 const struct resource *res,
> > > +						 resource_size_t start,
> > > +						 resource_size_t size,
> > > +						 resource_size_t align)
> > > +{
> > > +	if (!(res->flags & IORESOURCE_IO))
> > > +		return start;
> > > +
> > > +	/*
> > > +	 * The I/O regions must be 64K aligned, because the
> > > +	 * granularity of PCIe I/O address decoding windows is 64 K
> > > +	 */
> > > +	return round_up(start, SZ_64K);
> > > +}
> >
> > You do realise that this will result in all PCI I/O BARs being rounded
> > up to 64K.
> >
> > I've just been digging through the PCI code and have come across a
> > function - pcibios_window_alignment() - which the PCI code allows to be
> > overridden and which allows you to increase the alignment requirement of
> > bridge windows. It takes the PCI bus and window type as arguments.
> >
> > I'd suggest using that, and checking whether the bus which is passed
> > corresponds with a bus which gives you problems, so that you don't
> > impose the 64K requirement on downstream bridges.
>
> That approach isn't going to work very well with multi-platform, though,
> since the function can only be overridden on a per-architecture basis.

The same can be said of all the various other functions which the PCI stuff expects the arch to provide, yet we seem to cope just fine...
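Combining Russell's suggestion with the hook Thomas proposes, a minimal sketch could look like this (the window_align field in pci_sys_data is hypothetical; returning 1 leaves the PCI core's default alignment untouched):

	/* Sketch only: a single ARM-wide override in
	 * arch/arm/kernel/bios32.c dispatching to a per-driver hook. */
	resource_size_t pcibios_window_alignment(struct pci_bus *bus,
						 unsigned long type)
	{
		struct pci_sys_data *sys = bus->sysdata;

		if (sys->window_align)
			return sys->window_align(bus, type);

		return 1;
	}

	/* ...and the Marvell hook: only I/O windows of the emulated root
	 * port bridges (buses directly behind bus 0) map onto 64K-granular
	 * decoders, so downstream bridges are left alone. */
	static resource_size_t mvebu_pcie_window_align(struct pci_bus *bus,
						       unsigned long type)
	{
		if (bus->parent && pci_is_root_bus(bus->parent) &&
		    (type & IORESOURCE_IO))
			return SZ_64K;

		return 1;
	}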
On Wed, Jan 30, 2013 at 2:35 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Tue, 29 Jan 2013 15:27:43 -0700, Bjorn Helgaas wrote:
>
>> I'm not sure the existing emulation in these patches is sufficient.
>> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
>> write to the window register, but I don't see anything that updates
>> the actual hardware decoder. That might be done in
>> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
>> but that looks like it's only done once.
>
> That's correct. I currently let the Linux PCI core enumerate the real PCIe devices, allocate the resources, and set the appropriate values in the emulated PCI-to-PCI bridge registers. Once this is all done, the Marvell PCIe driver looks at each PCI-to-PCI bridge, reads the membase and iobase registers, and creates address decoding windows so that the physical addresses assigned by the Linux PCI core actually resolve to the right PCIe interface. This is done once and for all.
>
>> If the PCI core updates a root port window later, I don't see where
>> the hardware decoder will be updated.
>
> It will not be updated.
>
>> Maybe you're counting on the window assignments to be static? The PCI
>> core doesn't guarantee anything like that, though in the absence of
>> hotplug I don't know any reason why it would change things.
>
> Right. Is supporting hotplug a show-stopper to get this included? I think it could be added later, if it happens to be needed, no?
>
> I could of course do it, but the patch series is already quite large and complicated, so if we could merge a simple, but working, version first, and then improve on top of it when needed, it would be nice.

I'm most concerned about the stuff in drivers/pci. I hesitate to merge drivers/pci/sw-pci-pci-bridge.c as-is because it's a model that's not connected to hardware and only works in a completely static situation, and the rest of the PCI core can't really deal with that.

But I don't think supporting hotplug should be a show-stopper at this point, either. It sounds like we might be heading towards hooking this up more directly to the Marvell hardware, which will make it more arch-dependent. Something like that could either go in arch/arm, or in some not-quite-so-generic spot under drivers/pci.

>> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
>> The PCI core can update the bridge secondary/subordinate registers.
>> It looks like you don't support writing to them, and the read path
>> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
>> between the hardware and Linux bus numbers. I don't understand the
>> system well enough to know if this is an issue.
>
> Right. Could you explain a little bit for what reasons the PCI core could update the secondary/subordinate registers, and to what values it sets them?

The secondary/subordinate registers effectively define a bus number aperture that tells the bridge which transactions to claim and forward downstream. When enumerating devices, we may update the subordinate bus number to widen the aperture so we can enumerate an arbitrary tree behind the bridge. When we're finished, we'll probably narrow it by updating the subordinate again, so the unused bus number space can be used for other bridges. I don't know the exact details of the algorithm, and they're likely to change anyway, but pci_scan_bridge() is where most of it happens.

It looks like your current system doesn't support trees below the bridges, but hopefully we can make it so the generic enumeration algorithms still work.

> For now, I statically assign the secondary bus register value to be
> X+1, where X is the number of the PCIe interface, since X=0 is reserved
> for the root bus (which has the host bridge and the PCI-to-PCI
> bridges).

That makes sense but limits you to a single bus (and really, a single device, since this is PCIe) below the bridge.

> Also, could you detail what kind of translation I should be doing when
> reading the hardware and Linux bus numbers?

I'm hoping that the register Jason mentioned is enough to avoid the need for translation. If it's not, we can explore this a bit more.

Bjorn
Dear Bjorn Helgaas,

On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:

> I'm most concerned about the stuff in drivers/pci. I hesitate to
> merge drivers/pci/sw-pci-pci-bridge.c as-is because it's a model
> that's not connected to hardware and only works in a completely static
> situation, and the rest of the PCI core can't really deal with that.
>
> But I don't think supporting hotplug should be a show-stopper at this
> point, either. It sounds like we might be heading towards hooking
> this up more directly to the Marvell hardware, which will make it more
> arch-dependent. Something like that could either go in arch/arm, or
> in some not-quite-so-generic spot under drivers/pci.

If you really don't want sw-pci-pci-bridge.c in drivers/pci, then I can make it a part of the drivers/pci/host/pci-mvebu.c driver itself. I initially followed the idea started by Thierry Reding for the emulated host bridge, but if you feel that this emulated PCI-to-PCI bridge is too specific to this driver, then I'm fine with keeping it inside the driver itself.

> >> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> >> The PCI core can update the bridge secondary/subordinate registers.
> >> It looks like you don't support writing to them, and the read path
> >> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> >> between the hardware and Linux bus numbers. I don't understand the
> >> system well enough to know if this is an issue.
> >
> > Right. Could you explain a little bit for what reasons the PCI core
> > could update the secondary/subordinate registers, and to what
> > values it sets them?
>
> The secondary/subordinate registers effectively define a bus number
> aperture that tells the bridge which transactions to claim and forward
> downstream. When enumerating devices, we may update the subordinate
> bus number to widen the aperture so we can enumerate an arbitrary tree
> behind the bridge. When we're finished, we'll probably narrow it by
> updating the subordinate again, so the unused bus number space can be
> used for other bridges. I don't know the exact details of the
> algorithm, and they're likely to change anyway, but pci_scan_bridge()
> is where most of it happens.
>
> It looks like your current system doesn't support trees below the
> bridges, but hopefully we can make it so the generic enumeration
> algorithms still work.

In practice, in our situation, there isn't a tree below the bridge. There is one single device. I'd prefer not to implement features that I cannot effectively test, and to leave the implementation of those additional features to whoever will need them, and who will therefore be able to test them.

I guess that if I integrate the PCI-to-PCI bridge emulation code within the Marvell driver, then I can keep it fairly limited to whatever the Marvell PCI driver requires, no?

> > For now, I statically assign the secondary bus register value to be
> > X+1, where X is the number of the PCIe interface, since X=0 is
> > reserved for the root bus (which has the host bridge and the
> > PCI-to-PCI bridges).
>
> That makes sense but limits you to a single bus (and really, a single
> device, since this is PCIe) below the bridge.

Which is exactly what is happening here.

> > Also, could you detail what kind of translation I should be doing
> > when reading the hardware and Linux bus numbers?
>
> I'm hoping that the register Jason mentioned is enough to avoid the
> need for translation. If it's not, we can explore this a bit more.

Ok.

Thanks,

Thomas
On Wed, Jan 30, 2013 at 11:28:36PM +0100, Thomas Petazzoni wrote:
> > It looks like your current system doesn't support trees below the
> > bridges, but hopefully we can make it so the generic enumeration
> > algorithms still work.
>
> In practice, in our situation, there isn't a tree below the bridge. There is one single device. I'd prefer not to implement features that I cannot effectively test, and to leave the implementation of those additional features to whoever will need them, and who will therefore be able to test them.

Agreed it is hard to test, but be aware that any system that has PCI-E slots can host an add-in card that has a bridge on it. These are mildly common in some areas, like high port count ethernet cards.

If you aren't going to attempt the implementation, then a really big FIXME noting that the config access routing is not correct and needs to be based on the bus range assigned to the bridge would be friendly :)

Jason
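Jason's FIXME hints at what bus-range-based config routing would look like. A minimal sketch, with hypothetical structure and field names mirroring the emulated bridge state:

	/* Sketch only: route a config access to the link whose emulated
	 * bridge claims the bus number, instead of assuming that bus X+1
	 * always maps to PCIe interface X. */
	static struct mvebu_pcie_port *mvebu_pcie_find_port(struct mvebu_pcie *pcie,
							    int busnr)
	{
		int i;

		for (i = 0; i < pcie->nports; i++) {
			struct mvebu_sw_pci_bridge *br = &pcie->ports[i].bridge;

			/* a bridge forwards config cycles for buses in
			 * [secondary, subordinate], exactly like real HW */
			if (busnr >= br->secondary_bus &&
			    busnr <= br->subordinate_bus)
				return &pcie->ports[i];
		}

		return NULL;
	}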
On Wed, Jan 30, 2013 at 3:28 PM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:
>> It looks like your current system doesn't support trees below the
>> bridges, but hopefully we can make it so the generic enumeration
>> algorithms still work.
>
> In practice, in our situation, there isn't a tree below the bridge. There is one single device. I'd prefer not to implement features that I cannot effectively test, and to leave the implementation of those additional features to whoever will need them, and who will therefore be able to test them.

I understand the concern about testing, but my advice is to not use that as an excuse to artificially limit the functionality of the code you're writing :)

You're talking about emulating a bridge, and the bridge really doesn't know or care what's downstream. If it works with a single device downstream, it should work with another bridge downstream. Many aspects of bridge configuration can be tested with creative application of setpci, so that might be a possibility, too.

> I guess that if I integrate the PCI-to-PCI bridge emulation code within
> the Marvell driver, then I can keep it fairly limited to whatever the
> Marvell PCI driver requires, no?

Yeah, it's just that the pci_dev for this emulated bridge is used by the generic PCI core code, e.g., pci_scan_bridge(), so whatever emulation you do has to be robust enough that the core won't notice it's emulated.

Bjorn
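As a concrete example of Bjorn's setpci suggestion, the emulated bridge's window registers could be exercised from userspace with raw config-space accesses (device address and values are illustrative; offsets 0x20/0x22 are the memory base/limit registers per the PCI-to-PCI bridge spec):

	# Sketch only: read and rewrite the emulated bridge's memory window
	setpci -s 00:01.0 20.w          # read the memory base register
	setpci -s 00:01.0 20.w=e010     # move the window base to 0xe0100000
	setpci -s 00:01.0 22.w=e020     # and the limit to 0xe02fffff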
Dear Bjorn Helgaas,

On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:

> The secondary/subordinate registers effectively define a bus number
> aperture that tells the bridge which transactions to claim and forward
> downstream. When enumerating devices, we may update the subordinate
> bus number to widen the aperture so we can enumerate an arbitrary tree
> behind the bridge. When we're finished, we'll probably narrow it by
> updating the subordinate again, so the unused bus number space can be
> used for other bridges. I don't know the exact details of the
> algorithm, and they're likely to change anyway, but pci_scan_bridge()
> is where most of it happens.
>
> It looks like your current system doesn't support trees below the
> bridges, but hopefully we can make it so the generic enumeration
> algorithms still work.

The PCI-to-PCI bridge specification says that the Primary Bus Number Register, Secondary Bus Number Register and Subordinate Bus Number Register of the PCI configuration space of a PCI-to-PCI bridge should all be set to 0 after reset.

Until now, I was forcing a specific value for the Secondary Bus Number and Subordinate Bus Number (1 for my first bridge, 2 for my second bridge, etc.).

Following your recommendation, I've changed this, and left those values initialized to 0 by default, in order to let Linux set correct values. Yes, Linux does assign appropriate values in the Secondary Bus Number Register. But before that, Linux also complains loudly that the bridge configuration is invalid:

pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring

Looking at the code, we have:

	/* Check if setup is sensible at all */
	if (!pass &&
	    (primary != bus->number || secondary <= bus->number ||
	     secondary > subordinate)) {
		dev_info(&dev->dev, "bridge configuration invalid ([bus %02x-%02x]), reconfiguring\n",
			 secondary, subordinate);
		broken = 1;
	}

Due to the default values of the Primary Bus Number Register, Secondary Bus Number Register and Subordinate Bus Number Register, we have:

  primary = 0
  secondary = 0
  subordinate = 0

We are enumerating the root bus, so bus->number = 0. Therefore:

 * The test primary != bus->number is false, so it's not the problem.
 * secondary <= bus->number is true, because secondary = 0 and bus->number = 0. It is the problem.
 * secondary > subordinate is false.

So I'm not sure what to do with this...

Thomas
On Thu, Jan 31, 2013 at 9:04 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:
>
>> The secondary/subordinate registers effectively define a bus number
>> aperture that tells the bridge which transactions to claim and forward
>> downstream. When enumerating devices, we may update the subordinate
>> bus number to widen the aperture so we can enumerate an arbitrary tree
>> behind the bridge. When we're finished, we'll probably narrow it by
>> updating the subordinate again, so the unused bus number space can be
>> used for other bridges. I don't know the exact details of the
>> algorithm, and they're likely to change anyway, but pci_scan_bridge()
>> is where most of it happens.
>>
>> It looks like your current system doesn't support trees below the
>> bridges, but hopefully we can make it so the generic enumeration
>> algorithms still work.
>
> The PCI-to-PCI bridge specification says that the Primary Bus Number Register, Secondary Bus Number Register and Subordinate Bus Number Register of the PCI configuration space of a PCI-to-PCI bridge should all be set to 0 after reset.
>
> Until now, I was forcing a specific value for the Secondary Bus Number and Subordinate Bus Number (1 for my first bridge, 2 for my second bridge, etc.).
>
> Following your recommendation, I've changed this, and left those values initialized to 0 by default, in order to let Linux set correct values. Yes, Linux does assign appropriate values in the Secondary Bus Number Register. But before that, Linux also complains loudly that the bridge configuration is invalid:
>
> pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring

Linux makes the unwarranted assumption that the PCI hierarchy has already been configured by firmware. If the only problem is the messages above, I think we could just rework the message so it doesn't look like an error. I would guess that we probably also see the same distressing message when we hot-add a card with a bridge on it, because firmware won't have initialized the bridge.

My rule of thumb is that I like to note something in dmesg about the initial configuration of bus/mem/io apertures and BARs, as well as indications when we update them. That way, the dmesg log should contain enough information to debug most enumeration and configuration defects. pci_scan_bridge() is somewhat lacking in this regard.

Bjorn
Dear Bjorn Helgaas,

On Thu, 31 Jan 2013 09:30:07 -0700, Bjorn Helgaas wrote:

> > Following your recommendation, I've changed this, and left those
> > values initialized to 0 by default, in order to let Linux set correct
> > values. Yes, Linux does assign appropriate values in the Secondary Bus
> > Number Register. But before that, Linux also complains loudly that the
> > bridge configuration is invalid:
> >
> > pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>
> Linux makes the unwarranted assumption that the PCI hierarchy has
> already been configured by firmware. If the only problem is the
> messages above, I think we could just rework the message so it doesn't
> look like an error. I would guess that we probably also see the same
> distressing message when we hot-add a card with a bridge on it,
> because firmware won't have initialized the bridge.
>
> My rule of thumb is that I like to note something in dmesg about the
> initial configuration of bus/mem/io apertures and BARs, as well as
> indications when we update them. That way, the dmesg log should
> contain enough information to debug most enumeration and configuration
> defects. pci_scan_bridge() is somewhat lacking in this regard.

Ok. Would something like:

"bridge configuration with unassigned bus numbers ([bus 00-00]), reconfiguring"

be an acceptable replacement for this one?

Thanks for your quick feedback,

Thomas
On Thu, Jan 31, 2013 at 09:30:07AM -0700, Bjorn Helgaas wrote:
> Linux makes the unwarranted assumption that the PCI hierarchy has
> already been configured by firmware. If the only problem is the
> messages above, I think we could just rework the message so it doesn't
> look like an error.

That's not a safe assumption, especially on platforms where there's no BIOS (like ARM platforms). Thankfully, for the platforms I care about, the boot loaders I wrote for them _do_ do a full bus setup, so I don't see a problem. :)

However, I have historically had the kernel over many years (probably around 14 or 15 now) reassign all resources so that things are how the kernel wants them, and not how my half-hearted attempt at setting them up did (which only does a limited job, enough to get the system to a state where we can load the kernel.)
On Thu, Jan 31, 2013 at 9:33 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Thu, 31 Jan 2013 09:30:07 -0700, Bjorn Helgaas wrote:
>
>> > Following your recommendation, I've changed this, and left those
>> > values initialized to 0 by default, in order to let Linux set correct
>> > values. Yes, Linux does assign appropriate values in the Secondary Bus
>> > Number Register. But before that, Linux also complains loudly that the
>> > bridge configuration is invalid:
>> >
>> > pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>
>> Linux makes the unwarranted assumption that the PCI hierarchy has
>> already been configured by firmware. If the only problem is the
>> messages above, I think we could just rework the message so it doesn't
>> look like an error. I would guess that we probably also see the same
>> distressing message when we hot-add a card with a bridge on it,
>> because firmware won't have initialized the bridge.
>>
>> My rule of thumb is that I like to note something in dmesg about the
>> initial configuration of bus/mem/io apertures and BARs, as well as
>> indications when we update them. That way, the dmesg log should
>> contain enough information to debug most enumeration and configuration
>> defects. pci_scan_bridge() is somewhat lacking in this regard.
>
> Ok. Would something like:
>
> "bridge configuration with unassigned bus numbers ([bus 00-00]), reconfiguring"
>
> be an acceptable replacement for this one?

Seems reasonable.
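Against the pci_scan_bridge() check quoted earlier in the thread, the agreed rewording would amount to a one-line change, roughly as follows (a sketch of the eventual patch, not the patch itself):

	/* Check if setup is sensible at all */
	if (!pass &&
	    (primary != bus->number || secondary <= bus->number ||
	     secondary > subordinate)) {
		dev_info(&dev->dev, "bridge configuration with unassigned bus numbers ([bus %02x-%02x]), reconfiguring\n",
			 secondary, subordinate);
		broken = 1;
	}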
On 01/28/2013 11:56 AM, Thomas Petazzoni wrote:
> This driver implements the support for the PCIe interfaces on the
> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> cover earlier families of Marvell SoCs, such as Dove, Orion and
> Kirkwood.

Bjorn and I happen to live very close, so we got together today and talked about PCIe on ARM. One of the questions he asked is: why does the window management on the Marvell SoCs need to be dynamic?

(Sorry if this was covered earlier; I vaguely recall some discussion on the topic, but couldn't find it quickly.)

As background, PCIe enumeration in Linux usually works like:

1) You start off with some CPU physical address regions that generate transactions on the PCIe bus.

2) You enumerate all the PCIe devices, and assign an address to each BAR found, carved out of the PCIe address range corresponding to the regions you knew from (1).

However, it sounds like the Marvell code wants to:

1) Start off with no real knowledge of the CPU physical address that will generate transactions on the PCIe bus, since you want to assign that later.

2) You enumerate all the PCIe devices, and assign an address. But, what address range do you use?

3) Then you program the SoC's windows to set up the CPU->PCIe address translations.

Am I recalling what you're trying to do correctly, or am I completely confused?

Now, I recall that a related issue was that you are tight on CPU physical address space, and the second algorithm above would allow the size of the PCIe controller's window configuration to be as small as possible, and hence there would be more CPU physical address space available to fit in other peripherals.

However, why does this need to be dynamic? On a particular board, you know all the other (non-PCIe) peripherals that you need to fit into the CPU physical address space, so you know how much is left over for PCIe, so why not always make the PCIe window fill up all the available space, and use the first algorithm I described above? And also, I think you always know the exact set of PCIe devices that are attached to the boards, so you know the exact BAR size requirements there (or are there user-accessible PCIe slots? I don't think so, from your recent comments about PCIe<->PCIe bridges not needing to be supported since the user couldn't plug one in).

Note that with DT, you can easily specify the window location/size in the board .dts file rather than the SoC .dtsi file, so it can easily be customized based on how much physical address space is taken up by RAM, directly mapped NOR flash, etc.

With a static window configuration in DT, you'd end up with a system that worked much like any x86 system or Tegra, with some static memory range available to PCIe. It's just that in your case, the region location/size could change from boot to boot based on DT, whereas it's hard-coded in HW for Tegra and I assume x86 too.
On Thu, Jan 31, 2013 at 05:34:36PM -0700, Stephen Warren wrote: > On 01/28/2013 11:56 AM, Thomas Petazzoni wrote: > > This driver implements the support for the PCIe interfaces on the > > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to > > cover earlier families of Marvell SoCs, such as Dove, Orion and > > Kirkwood. > > Bjorn and I happen to live very close, so we got together today and > talked about PCIe on ARM. > > One of the questions he asked is: why does the window management on the > Marvell SoCs need to be dynamic? > (Sorry if this was covered earlier; I vaguely recall some discussion on > the topic, but couldn't find it quickly) Well I've answered it several times, so has Thomas.. Lets try again, please save for future reference :) Lets seperate two things. The CPU physical address ranges reserved for PCI bus access are not dynamic. This is set in DT, or whatever, statically. Just like every other PCI case on ARM. Just like Tegra. That is not the issue. What is required is that the division of this space amongst the 10 physical PCI-E links must be dynamic. Just like x86. Just like the new tegra driver. [1] Is that clear? > As background, PCIe enumeration in Linux usually works like: > > 1) You start off with some CPU physical address regions that generate > transactions on the PCIe bus. > > 2) You enumerate all the PCIe devices, and assign an address to each BAR > found, carved out of the PCIe address range corresponding to the regions > you knew from (1). Step 2 also includes 'assign address windows to all the physical PCI-E links'. This is very important because it is what this entire discussion is about. Look at how tegra or x86 works, the CPU physical addresses for PCI-E do nothing until the PCI-to-PCI bridge window registers in each link's configuration space are setup. Until that is done the SOC doesn't know which link to send the transaction to. Marvell is the same, until the link's window registers are setup the CPU addresses don't go anywhere. Notice this has absolutely no effect on the host bridge aperture. This is a link-by-link configuration of what addresses *go down that link*. The big difference is the link window registers for Marvell do not conform to the PCI configuration space specification. They are Marvell specific. This is what the glue code in the host driver does, it converts the Marvell specificness into something the kernel can undertstand and control. There are countless ways to do this, but please accept it is necessary that it be done... > Now, I recall that a related issue was that you are tight on CPU > physical address space, and the second algorithm above would allow the > size of the PCIe controller's window configuration to be as small as > possible, and hence there would be more CPU physical address space > available to fit in other peripherals. Physical address space is certainly a concern, but availability of decoder windows is the major one. Each link requires one decoder window for MMIO and one for IO, and possibly one for prefetch. The chip doesn't have 30 decoder windows. So the allocation of decoders to links must be dynamic, based on the requirements of the downstream endports on the link. > However, why does this need to be dynamic? 
> On a particular board, you > know all the other (non-PCIe) peripherals that you need to fit into the > CPU physical address space, so you know how much is left over for PCIe, > so why not always make the PCIe window fill up all the available > space, Because there is no such thing as an all-links PCIe window on this hardware. Each link has a separate window. If you get rid of all the dynamic allocation then every link must statically reserve some portion of physical address space and some number of decoder windows. That more or less means you need to know what is going to be on the other side of every link when you write the DT. > With a static window configuration in DT, you'd end up with a system > that worked much like any x86 system or Tegra, with some static memory > range available to PCIe. It's just that in your case, the region > location/size could change from boot to boot based on DT, whereas it's > hard-coded in HW for Tegra and I assume x86 too. How is this better? Now you have a system where you have to customize the DT before you connect a PCI-E device. What if someone uses this chip in a configuration with physical slots? How does that work? What about hotplug? What about a unified kernel? That is *not* like x86 or tegra. IMHO Thomas's direction in his proposed driver ends up working very close to the new tegra driver, and has the sort of dynamic allocation and discovery people expect from PCI-E. Jason 1 - The new tegra driver switches from calling ARM's pci_common_init once for every physical link, to once for the SOC. It does this by fixing the routing of config transactions so that the kernel sees the per-link PCI-PCI root port bridge config space provided by the hardware at the correct place. By doing this it changes from statically allocating a physical memory region for each link to statically allocating a region for all the links, and dynamically dividing that region amongst the links.
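To make the decoder-window scarcity described above concrete, here is a minimal sketch of the kind of shared-pool bookkeeping it implies (the window count and the mvebu_window_alloc()/mvebu_window_free() names are made up for illustration, not taken from mach-mvebu):

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define MVEBU_NR_WINDOWS 20 /* shared by PCIe links, BootROM, NOR/SPI flash, ... */

static DECLARE_BITMAP(mvebu_window_in_use, MVEBU_NR_WINDOWS);

/* Grab one SoC-wide decoder window, or fail once the pool is exhausted. */
static int mvebu_window_alloc(void)
{
	int win = find_first_zero_bit(mvebu_window_in_use, MVEBU_NR_WINDOWS);

	if (win >= MVEBU_NR_WINDOWS)
		return -ENOSPC;
	__set_bit(win, mvebu_window_in_use);
	return win;
}

static void mvebu_window_free(int win)
{
	__clear_bit(win, mvebu_window_in_use);
}

With 10 links each potentially wanting an MMIO window and an I/O window, plus the non-PCIe users of the same pool, a static carve-up cannot work in general; windows have to be handed out only for links that actually have something behind them.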
On 01/31/2013 06:41 PM, Jason Gunthorpe wrote: > On Thu, Jan 31, 2013 at 05:34:36PM -0700, Stephen Warren wrote: >> On 01/28/2013 11:56 AM, Thomas Petazzoni wrote: >>> This driver implements the support for the PCIe interfaces on the >>> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to >>> cover earlier families of Marvell SoCs, such as Dove, Orion and >>> Kirkwood. >> >> Bjorn and I happen to live very close, so we got together today and >> talked about PCIe on ARM. >> >> One of the questions he asked is: why does the window management on the >> Marvell SoCs need to be dynamic? > >> (Sorry if this was covered earlier; I vaguely recall some discussion on >> the topic, but couldn't find it quickly) > > Well, I've answered it several times, so has Thomas.. Let's try again, > please save for future reference :) > > Let's separate two things. > > The CPU physical address ranges reserved for PCI bus access are not > dynamic. This is set in DT, or whatever, statically. Just like every > other PCI case on ARM. Just like Tegra. That is not the issue. > > What is required is that the division of this space amongst the 10 > physical PCI-E links must be dynamic. Just like x86. Just like the new > tegra driver. [1] > > Is that clear? Yes. >> As background, PCIe enumeration in Linux usually works like: >> >> 1) You start off with some CPU physical address regions that generate >> transactions on the PCIe bus. >> >> 2) You enumerate all the PCIe devices, and assign an address to each BAR >> found, carved out of the PCIe address range corresponding to the regions >> you knew from (1). > > Step 2 also includes 'assign address windows to all the physical PCI-E > links'. This is very important because it is what this entire > discussion is about. OK. > Look at how tegra or x86 works: the CPU physical addresses for PCI-E > do nothing until the PCI-to-PCI bridge window registers in each link's > configuration space are set up. Until that is done the SOC doesn't know > which link to send the transaction to. From my perspective, this is slightly the wrong way of describing the issue, but I see what you mean: At least on Tegra and I think x86, any transaction that goes to the physical PCIe aperture is translated onto (internal) PCIe bus 0, so the lack of window or PCIe/PCIe bridge BAR register programming doesn't prevent the transaction going /somewhere/ (even if "somewhere" is only half way to where it's useful!). The difference is pretty subtle. The issue is that without the PCIe/PCIe bridge BARs programmed, the PCIe transactions won't get off bus 0 and onto a downstream bus of one of the PCIe/PCIe bridges, or put another way, no PCIe/PCIe bridge will claim the transaction that happens on PCIe bus 0. (Using "PCIe/PCIe bridge" above to mean "PCIe root port".) > Marvell is the same: until the link's window registers are set up, the > CPU addresses don't go anywhere. > > Notice this has absolutely no effect on the host bridge aperture. > This is a link-by-link configuration of what addresses *go down that > link*. Right. > The big difference is that the link window registers for Marvell do not > conform to the PCI configuration space specification. They are Marvell > specific. > > This is what the glue code in the host driver does: it converts the > Marvell specificness into something the kernel can understand and > control. There are countless ways to do this, but please accept it > is necessary that it be done... Sure.
>> Now, I recall that a related issue was that you are tight on CPU >> physical address space, and the second algorithm above would allow the >> size of the PCIe controller's window configuration to be as small as >> possible, and hence there would be more CPU physical address space >> available to fit in other peripherals. > > Physical address space is certainly a concern, but availability of > decoder windows is the major one. Each link requires one decoder > window for MMIO and one for IO, and possibly one for prefetch. The > chip doesn't have 30 decoder windows. So the allocation of decoders to > links must be dynamic, based on the requirements of the downstream > end ports on the link. Oh I see... I originally thought the issue was that the windows were between CPU physical address space and the PCIe host controller itself. But in fact, the windows are between PCIe bus 0 and the root ports, so they're the equivalent of the standard PCIe root port (or PCIe/PCIe bridge) BAR registers. And related, these BAR/window registers are usually part of each PCIe root port itself, and hence there's a whole number dedicated to each root port, but on Marvell there's a *global* pool of these BARs/windows instead. Now I think I finally understand the architecture of your HW. >> However, why does this need to be dynamic? On a particular board, you >> know all the other (non-PCIe) peripherals that you need to fit into the >> CPU physical address space, so you know how much is left over for PCIe, >> so why not always make the PCIe window fill up all the available >> space, > > Because there is no such thing as an all-links PCIe window on this > hardware. > > Each link has a separate window. > > If you get rid of all the dynamic allocation then every link must > statically reserve some portion of physical address space and some > number of decoder windows. > > That more or less means you need to know what is going to be on the > other side of every link when you write the DT. So, the dynamic programming of the windows on Marvell HW is the exact logical equivalent of programming a standard PCIe root port's BAR registers. It makes perfect sense that it should be dynamic. Presumably this is something you can make work inside your emulated PCIe/PCIe bridge module, simply by capturing writes to the BAR registers, and translating them into writes to the Marvell window registers. Now, I do have one follow-on question: You said you don't have 30 windows, but how many do you have free after allocating windows to any other peripherals that need them, relative to (3 * number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.) The thing here is that when the PCIe core writes to a root port BAR window to configure/enable it the first time, you'll need to capture that transaction and dynamically allocate a window and program it in a way equivalent to what the BAR register write would have achieved on standard HW. Later, the window might need resizing, or even to be completely disabled, if the PCIe core were to change the standard BAR register. Dynamically allocating a window when the BAR is written seems a little heavy-weight. So while it's obvious that window base address and size shouldn't be static, I wonder whether the assignment of a specific window ID to a specific root port ID should be dynamic or static. For example, if your HW configuration leaves you with 6 windows available, you could support 2 PCIe root ports by statically assigning 3 windows to serve each of those 2 root ports.
Would that work, or are there systems where over-commit is needed, e.g. if there's no IO space behind a root port, you could get away with two windows per root port, and hence be able to run 3 root ports rather than just 2? Still, if you know which PCIe devices are behind the root ports, you could still represent the over-commit statically in DT. Still, I suppose doing it dynamically in the driver does end up being a lot less to think about for someone creating the DT for a new board. Having to translate standard root port BAR register writes to Marvell window register allocation/writes would imply that the emulated root port code has to be very closely tied into the Marvell PCIe driver, and not something that could be at all generic for the most part. >> With a static window configuration in DT, you'd end up with a system >> that worked much like any x86 system or Tegra, with some static memory >> range available to PCIe. It's just that in your case, the region >> location/size could change from boot to boot based on DT, whereas it's >> hard-coded in HW for Tegra and I assume x86 too. > > How is this better? Now you have a system where you have to customize > the DT before you connect a PCI-E device. What if someone uses this > chip in a configuration with physical slots? How does that work? What > about hotplug? What about a unified kernel? That is *not* like x86 or > tegra. Right. Now that I really understand what the windows are doing, I can see that a static window configuration (address/size, and perhaps which windows are used) would not be appropriate. > IMHO Thomas's direction in his proposed driver ends up working very > close to the new tegra driver, and has the sort of dynamic allocation > and discovery people expect from PCI-E. > > Jason > > 1 - The new tegra driver switches from calling ARM's pci_common_init > once for every physical link, to once for the SOC. It does this by > fixing the routing of config transactions so that the kernel sees > the per-link PCI-PCI root port bridge config space provided by the > hardware at the correct place. By doing this it changes from > statically allocating a physical memory region for each link to > statically allocating a region for all the links, and dynamically > dividing that region amongst the links. Right, we have both (or all 3) root ports show up in the same PCIe domain.
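As a rough illustration of the "capture the BAR write" idea being discussed here, assuming a hypothetical per-link structure and mbus helpers (these names are stand-ins for whatever the real driver provides, not its actual API):

#include <linux/types.h>

/* Hypothetical stand-ins, not the driver's real definitions. */
struct mvebu_pcie_port {
	int port, lane;
	int memwin;		/* token for the allocated decoder window */
	bool memwin_active;
};

int mvebu_mbus_add_window(int port, int lane, phys_addr_t base, size_t size);
void mvebu_mbus_del_window(int win);

/*
 * Called when the PCI core writes the emulated bridge's
 * PCI_MEMORY_BASE/PCI_MEMORY_LIMIT pair.  The registers carry bits
 * [31:20] of the range in their upper 12 bits.
 */
static void mvebu_sw_bridge_update_mem_window(struct mvebu_pcie_port *port,
					      u16 mem_base, u16 mem_limit)
{
	phys_addr_t base = ((phys_addr_t)(mem_base & 0xfff0)) << 16;
	phys_addr_t end = (((phys_addr_t)(mem_limit & 0xfff0)) << 16) | 0xfffff;

	if (port->memwin_active) {
		mvebu_mbus_del_window(port->memwin);
		port->memwin_active = false;
	}

	/* The core closes a bridge window by programming base > limit. */
	if (base > end)
		return;

	port->memwin = mvebu_mbus_add_window(port->port, port->lane,
					     base, end - base + 1);
	port->memwin_active = true;
}

The I/O window would be handled the same way, and a failure to obtain a window from the shared pool would surface at exactly this point.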
On Thu, Jan 31, 2013 at 07:21:02PM -0700, Stephen Warren wrote: > I originally thought the issue was that the windows were between CPU > physical address space and the PCIe host controller itself. But in fact, > the windows are between PCIe bus 0 and the root ports, so they're > the Dunno what exactly is inside tegra, but on Marvell the docs describe an internal bus crossbar/whatever and each PCI-E link gets a port on that structure. There is no such physical thing as a 'PCI bus 0', and they didn't arrange the hardware in a way that makes it easy for the host driver to create one like tegra did :( > equivalent of the standard PCIe root port (or PCIe/PCIe bridge) BAR > registers. And related, these BAR/window registers are usually part of > each PCIe root port itself, and hence there's a whole number dedicated > to each root port, but on Marvell there's a *global* pool of these > BARs/windows instead. Right, that is all following the PCI-E spec, and is reasonable and sane. What Marvell did is take an *end port core* and jammed it on their internal bus without adjusting things to follow the PCI-E spec regarding config space and what not. > > That more or less means you need to know what is going to be on the > > other side of every link when you write the DT. > > So, the dynamic programming of the windows on Marvell HW is the exact > logical equivalent of programming a standard PCIe root port's BAR > registers. It makes perfect sense that it should be dynamic. Presumably > this is something you can make work inside your emulated PCIe/PCIe > bridge module, simply by capturing writes to the BAR registers, and > translating them into writes to the Marvell window registers. Yes, that is exactly the idea. > Now, I do have one follow-on question: You said you don't have 30 > windows, but how many do you have free after allocating windows to any > other peripherals that need them, relative to (3 * > number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.) Thomas will have to answer this; it varies depending on the SOC, and what other on-chip peripherals are in use. For instance Kirkwood has the same design but there are plenty of windows for the two PCI-E links. Still, how would you even connect a limited number of regions on a link-by-link basis to the common PCI code? > The thing here is that when the PCIe core writes to a root port BAR > window to configure/enable it the first time, you'll need to capture > that transaction and dynamically allocate a window and program it in a > way equivalent to what the BAR register write would have achieved on > standard HW. Later, the window might need resizing, or even to be > completely disabled, if the PCIe core were to change the standard > BAR Right. This is pretty straightforward except for the need to hook the alignment fixup.. > register. Dynamically allocating a window when the BAR is written > seems a little heavy-weight. I think what Thomas had here was pretty small, and the windows need to be shared with other on-chip peripherals beyond PCI-E.. > needed, e.g. if there's no IO space behind a root port, you could get > away with two windows per root port, and hence be able to run 3 root > ports rather than just 2? Right, this is the main point. If you plug in 3 devices and they all only use MMIO regions then you only need to grab 3 windows. The kernel disables the unused windows on the bridge so it is easy to tell when they are disused.
> Still, I suppose doing it dynamically in the driver does end up being a > lot less to think about for someone creating the DT for a new board. Agreed, these restrictions are all so HW specific, subtle and have nothing to do with the PCI-E spec. Codifying them once in the driver seems like the way to keep this craziness out of the PCI core and away from users of the SOC. > Having to translate standard root port BAR register writes to Marvell > window register allocation/writes would imply that the emulated root > port code has to be very closely tied into the Marvell PCIe driver, and > not something that could be at all generic for the most part. Agreed.. At the very least generic code would need callback functions to the driver... It has a fair bit to do for Marvell:
- Translate MMIO, prefetch and IO ranges to mbus windows
- Keep track of the secondary/subordinate bus numbers and fiddle with other hardware registers to set those up
- Copy the link state/control registers from the end port config space into the bridge express root port capability
- Probably ditto for AER as well..
Probably simpler just to make one for marvell than mess excessively with callbacks.. Cheers, Jason
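For a sense of scale, the state such an emulated bridge has to carry is modest; something along these lines (the field set is a plausible sketch, not the actual structure from Thomas's patches):

#include <linux/types.h>

struct mvebu_sw_pci_bridge {
	u16 vendor;		/* Marvell */
	u16 device;		/* emulated PCI-to-PCI bridge ID */
	u16 command;
	u8 primary_bus;
	u8 secondary_bus;
	u8 subordinate_bus;	/* mirrored into SoC registers on write */
	u16 iobase, iolimit;	/* translated to an I/O mbus window */
	u16 membase, memlimit;	/* translated to a memory mbus window */
	u16 prefmembase, prefmemlimit;
};

Config reads would be served out of this structure; config writes update it, and the window and bus-number fields are additionally mirrored into the Marvell-specific registers, which is exactly the list of duties above.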
Dear Stephen Warren, Thanks for this great discussion. I see that Jason has already answered most of the questions on why we need this "dynamicity" in the window configuration. A few comments below. On Thu, 31 Jan 2013 19:21:02 -0700, Stephen Warren wrote: > So, the dynamic programming of the windows on Marvell HW is the exact > logical equivalent of programming a standard PCIe root port's BAR > registers. It makes perfect sense that it should be dynamic. Presumably > this is something you can make work inside your emulated PCIe/PCIe > bridge module, simply by capturing writes to the BAR registers, and > translating them into writes to the Marvell window registers. That's what I'm doing. In the PATCHv2, my PCIe host driver was reading back the BARs in the PCI-to-PCI bridges' configuration space, and was setting up the windows according to the addresses that had been assigned to each bridge. I am currently working on making this more dynamic: it is directly when the BAR is being written to in the PCI-to-PCI bridge configuration space that a window will be set up. > Now, I do have one follow-on question: You said you don't have 30 > windows, but how many do you have free after allocating windows to any > other peripherals that need them, relative to (3 * > number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.) We have 20 windows on Armada XP if I remember correctly, and they are not only used for PCIe, but also to map the BootROM (needed to boot secondary CPUs), to map SPI flashes or NOR flashes, for example. So they are really shared between many uses. In terms of PCIe, there are only two types of windows: I/O and Memory; there is no notion of a Prefetchable Memory window as far as I could see. We have up to 10 PCIe interfaces, and only 20 windows. It means that you basically can't use all PCIe interfaces; there will necessarily be some limit, due to the limited number of windows. Also, I'd like to point out that the dynamic configuration is needed for two reasons: * The number of windows, as we are discussing now. * The amount of physical address space available. If you don't dynamically configure those windows, then you have to account for the "worst case", i.e. the PCIe devices that require very large memory areas. So you end up creating static windows that reserve 32M or 64M or 128M *per* PCIe link. You can see that it pretty quickly "consumes" a large part of the 4G physical address space that we have. Thanks to the dynamic window configuration that we do with the PCI-to-PCI bridge, we can size the windows to exactly the size needed by the downstream device on each PCIe interface. > The thing here is that when the PCIe core writes to a root port BAR > window to configure/enable it the first time, you'll need to capture > that transaction and dynamically allocate a window and program it in a > way equivalent to what the BAR register write would have achieved on > standard HW. Later, the window might need resizing, or even to be > completely disabled, if the PCIe core were to change the standard BAR > register. Dynamically allocating a window when the BAR is written > seems a little heavy-weight. Why? > So while it's obvious that window base address and size shouldn't be > static, I wonder whether the assignment of a specific window ID to a > specific root port ID should be dynamic or static. For example, if > your HW configuration leaves you with 6 windows available, you could > support 2 PCIe root ports by statically assigning 3 windows to serve > each of those 2 root ports.
Would that work, or are there systems > where over-commit is needed, e.g. if there's no IO space behind a > root port, you could get away with two windows per root port, and > hence be able to run 3 root ports rather than just 2? Still, if you > know which PCIe devices are behind the root ports, you could still > represent the over-commit statically in DT. For now, I haven't figured out how not to allocate an I/O window if the downstream device doesn't use I/O, but I'd like to achieve that, as it would save one of the two windows needed per PCIe interface... and many PCIe devices don't need the I/O window. > Still, I suppose doing it dynamically in the driver does end up being a > lot less to think about for someone creating the DT for a new board. Indeed. > Having to translate standard root port BAR register writes to Marvell > window register allocation/writes would imply that the emulated root > port code has to be very closely tied into the Marvell PCIe driver, > and not something that could be at all generic for the most part. Right. I've already moved the PCI-to-PCI bridge code from the generic drivers/pci/sw-pci-pci-bridge.c location directly into the driver. This also makes it possible to integrate things like window allocation more tightly. > Right. Now that I really understand what the windows are doing, I can > see that a static window configuration (address/size, and perhaps > which windows are used) would not be appropriate. Glad to see we reached the same conclusion :-) Best regards, Thomas
Dear Jason Gunthorpe, Thanks again for continuing this discussion while I was sleeping :-) On Thu, 31 Jan 2013 20:51:15 -0700, Jason Gunthorpe wrote: > > Now, I do have one follow-on question: You said you don't have 30 > > windows, but how many do you have free after allocating windows to > > any other peripherals that need them, relative to (3 * > > number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.) > > Thomas will have to answer this, it varies depending on the SOC, and > what other on chip peripherals are in use. For instance Kirkwood has > the same design but there are plenty of windows for the two PCI-E > links. Right. I already answered this point directly to Stephen. On Kirkwood, there are many windows and two PCIe links, so the windows were statically allocated. On Armada XP, there are 20 windows and 10 PCIe links. Static allocation is no longer reasonable. > > The thing here is that when the PCIe core writes to a root port BAR > > window to configure/enable it the first time, you'll need to capture > > that transaction and dynamically allocate a window and program it > > in a way equivalent to what the BAR register write would have > > achieved on standard HW. Later, the window might need resizing, or > > even to be completely disabled, if the PCIe core were to change the > > standard BAR > > Right. This is pretty straightforward except for the need to hook the > alignment fixup.. > > > register. Dynamically allocating a window when the BAR is written > > seems a little heavy-weight. > > I think what Thomas had here was pretty small, and the windows need to > be shared with other on chip periphals beyond PCI-E.. Yes, it is not very complicated. We already have some common code that creates/removes those windows, so it is just a matter of calling the right thing at the right time. Definitely not hundreds of line of crap. > Right, this is the main point. If you plug in 3 devices and they all > only use MMIO regions then you only need to grab 3 windows. The kernel > disables the unused windows on the bridge so it is easy to tell when > they are disused. Ah, I'm interested in further discussing this. I currently have a setup with one SATA PCIe card and one NIC PCIe card. On the NIC, the I/O ports are said to be "disabled", but still an I/O region gets allocated in the PCI-to-PCI bridge that gives access to this particular device. The device in question is: 05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06) Subsystem: Intel Corporation PRO/1000 PT Server Adapter Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 106 Region 0: Memory at c1200000 (32-bit, non-prefetchable) [size=128K] Region 1: Memory at c1220000 (32-bit, non-prefetchable) [size=128K] Region 2: I/O ports at c0010000 [disabled] [size=32] [virtual] Expansion ROM at c1300000 [disabled] [size=128K] So the Region 2 is disabled. But, in the corresponding bridge: 00:05.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF+ FastB2B+ ParErr+ DEVSEL=?? 
>TAbort+ <TAbort+ <MAbort+ >SERR+ <PERR+ INTx+ Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=05, subordinate=05, sec-latency=0 I/O behind bridge: c0010000-c001ffff Memory behind bridge: c1200000-c12fffff Prefetchable memory behind bridge: c1300000-c13fffff So there is really a range of I/O addresses associated to it, even though the device will apparently not use it. Would it be possible to detect that the I/O range is not used by the device, and therefore avoid the allocation of an address decoding window for this I/O range? > Agreed.. At the very least generic code would need call back > functions to the driver... It has a fair bit to do for Marvell: > - Translate MMIO, prefetch and IO ranges to mbus windows > - Keep track of the secondary/subordinate bus numbers and fiddle > with other hardware registers to set those up > - Copy the link state/control regsiters from the end port config > space into the bridge express root port capability > - Probably ditto for AER as well.. > > Probably simpler just to make one for marvell then mess excessively > with callbacks.. As replied to Stephen, I've chosen to bring the PCI-to-PCI bridge emulation code directly into the driver, specifically for this reason. Best regards, Thomas
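One conceivable way to attack the I/O-window question Thomas raises would be to check, before committing a window, whether any function behind the link actually implements an I/O BAR, along these lines (a hypothetical helper, not existing code; and as the following replies show, the [disabled] flag in the lspci output above turns out to have a different cause):

#include <linux/pci.h>

static bool mvebu_pcie_bus_needs_io(struct pci_bus *bus)
{
	struct pci_dev *dev;
	int i;

	list_for_each_entry(dev, &bus->devices, bus_list) {
		for (i = 0; i < PCI_NUM_RESOURCES; i++) {
			struct resource *res = &dev->resource[i];

			if ((res->flags & IORESOURCE_IO) && resource_size(res))
				return true;
		}
	}
	return false;
}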
On Friday 01 February 2013, Thomas Petazzoni wrote: > > So while it's obvious that window base address and size shouldn't be > > static, I wonder whether the assignment of a specific window ID to a > > specific root port ID should be dynamic or static. For example, if > > your HW configuration leaves you with 6 windows available, you could > > support 2 PCIe root ports by statically assigning 3 windows to serve > > each of those 2 root ports. Would that work, or are there systems > > where over-commit is needed, e.g. if there's no IO space behind a > > root port, you could get away with two windows per root port, and > > hence be able to run 3 root ports rather than just 2? Still, if you > > know which PCIe devices are behind the root ports, you could still > > represent the over-commit statically in DT. > > For now, I haven't figured out how not to allocate an I/O window if the > downstream device doesn't use I/O, but I'd like to achieve that, as it > would save one of the two windows needed per PCIe interface... and many > PCIe devices don't need the I/O window. The easiest hack would be to only ever allow one I/O window on exactly one of the ports, and not do PIO on the other ports at all. Given the various troubles of making any other combination work, that sounds like a good enough compromise to me. A lot of the add-on cards would not work in the remaining ports anyway, and worrying too much about that legacy hardware may just not be worth it. That way, you only need 11 windows for PCIe (10*mem, 1*IO), which will always fit. Do you actually have /any/ PCIe cards with PIO BARs that you can test with? Arnd
On Friday 01 February 2013, Thomas Petazzoni wrote: > So there is really a range of I/O addresses associated to it, even > though the device will apparently not use it. Would it be possible to > detect that the I/O range is not used by the device, and therefore > avoid the allocation of an address decoding window for this I/O range? I suspect it just gets disabled because the port number 0xc0010000 is larger than IO_PORT_LIMIT and we cannot access that offset inside of the virtual memory window we use for PIO. Arnd
On Fri, Feb 01, 2013 at 04:07:49PM +0000, Arnd Bergmann wrote: > On Friday 01 February 2013, Thomas Petazzoni wrote: > > So there is really a range of I/O addresses associated to it, even > > though the device will apparently not use it. Would it be possible to > > detect that the I/O range is not used by the device, and therefore > > avoid the allocation of an address decoding window for this I/O range? > > I suspect it just gets disabled because the port number 0xc0010000 is > larger than IO_PORT_LIMIT and we cannot access that offset inside > of the virtual memory window we use for PIO. You're running into that trap again which you fall into on other architectures. If you arrange for your PCI IO space to start at 0 rather than the physical address at which it appears on your CPU, then you shouldn't end up with it extending up to something at 0xcXXXXXXX. Remember that we should be ensuring that inb(0) hits the first address of the cross-subarch PCI IO area - this always requires that any sub-arch taking part in a multiplatform kernel must start its IO space addresses at 0 and not the physical address on the local CPU.
On Friday 01 February 2013, Russell King - ARM Linux wrote: > On Fri, Feb 01, 2013 at 04:07:49PM +0000, Arnd Bergmann wrote: > > On Friday 01 February 2013, Thomas Petazzoni wrote: > > > So there is really a range of I/O addresses associated to it, even > > > though the device will apparently not use it. Would it be possible to > > > detect that the I/O range is not used by the device, and therefore > > > avoid the allocation of an address decoding window for this I/O range? > > > > I suspect it just gets disabled because the port number 0xc0010000 is > > larger than IO_PORT_LIMIT and we cannot access that offset inside > > of the virtual memory window we use for PIO. > > You're running into that trap again which you fall into on other > architectures. > > If you arrange for your PCI IO space to start at 0 rather than the > physical address at which it appears on your CPU, then you shouldn't > end up with it extending up to something at 0xcXXXXXXX. > > Remember that we should be ensuring that inb(0) hits the first address > of the cross-subarch PCI IO area - this always requires that any sub-arch > taking part in a multiplatform kernel must start its IO space addresses > at 0 and not the physical address on the local CPU. Yes, that was my point. I think in this case, the bug is in the new of_pci_process_ranges function, which returns a 'struct resource' translated into IORESOURCE_MEM space, but with the type set to IORESOURCE_IO. This resource then gets passed to pci_add_resource_offset(). Arnd
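Concretely, the direction implied by Russell's and Arnd's remarks is for the host driver to hand the PCI core an I/O resource expressed in port numbers starting at 0, and to map the CPU physical window separately. A minimal sketch (using the 0xc0000000 base from Thomas's DT; this is an illustration, not the actual pci-mvebu code):

#include <linux/pci.h>
#include <linux/sizes.h>
#include <asm/mach/pci.h>
#include <asm/io.h>

static int mvebu_pcie_setup_io(struct pci_sys_data *sys)
{
	static struct resource io = {
		.name	= "PCI I/O",
		.flags	= IORESOURCE_IO,
		.start	= 0,		/* so inb(0) hits the window base */
		.end	= SZ_64K - 1,
	};

	pci_add_resource(&sys->resources, &io);

	/* Map the CPU physical window into the fixed ARM PCI I/O area. */
	return pci_ioremap_io(0, 0xc0000000);
}

With the resource kept in bus/port space, assigned port numbers stay below the I/O space limit and the BAR no longer ends up flagged [disabled].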
On 02/01/2013 01:46 AM, Thomas Petazzoni wrote: > Dear Stephen Warren, > > Thanks for this great discussion. I see that Jason has already answered > most of the questions on the why we need this "dynamicity" in the window > configuration. A few comments below. > > On Thu, 31 Jan 2013 19:21:02 -0700, Stephen Warren wrote: > >> So, the dynamic programming of the windows on Marvell HW is the exact >> logical equivalent of programming a standard PCIe root port's BAR >> registers. It makes perfect sense that should be dynamic. Presumably >> this is something you can make work inside your emulated PCIe/PCIe >> bridge module, simply by capturing writes to the BAR registers, and >> translating them into writes to the Marvell window registers. > > That's what I'm doing. In the PATCHv2, my PCIe host driver was reading > back the BARs in the PCI-to-PCI bridges configuration space, and was > setting up the windows according to the addresses that had been > assigned to each bridge. > > I am currently working in making this more dynamic: it is directly when > the BAR is being written to in the PCI-to-PCI bridge configuration > space that a window will be setup. > >> Now, I do have one follow-on question: You said you don't have 30 >> windows, but how many do you have free after allocating windows to any >> other peripherals that need them, relative to (3 * >> number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.) > > We have 20 windows on Armada XP if I remember correctly, and they are > not only used for PCIe, but also to map the BootROM (needed to boot > secondary CPUs), to map SPI flashes or NOR flashes, for example. So > they are really shared between many uses. In terms of PCIe, there are > only two types of windows: I/O and Memory, there is no notion of > Prefetchable Memory window as far as I could see. In Tegra, we end up having separate MMIO vs. Prefetchable MMIO chunks of our overall PCIe aperture. However, the HW setup appears the same for both of those. I'm not sure if it's a bug in the driver, or if it's just to separate the two address spaces so that the page tables can be configured for those two regions with large rather than small granularity. I need to go investigate that. > We have up to 10 PCIe interfaces, and only 20 windows. It means that > you basically can't use all PCIe interfaces, there will necessarily be > some limit, due to the limited number of windows. So there are 10 PCIe interfaces (root ports). That's on the SoC itself right. Are all 10 (or a large number of them) actually used at once on any given board design? I suppose this must be the case, or Marvell wouldn't have wasted the silicon space on 10 root ports... Still, that's a rather large number of ports! If only a few PCIe ports are ever in use at once on a design and/or the PCIe ports generally contain soldered-down devices rather than user-accessible ports, the statically assigning window *IDs* to individual ports would make for easier code in the driver, since the BAR register emulation would never have to allocate/de-allocate windows, but rather only ever have to enable/disable/configure them. However, if many PCIe ports are in use at once and there are user-accessible ports, you can't know ahead of time which ports will need MMIO vs. MMIO prefetch vs. IO, so you'd have to dynamically allocate window IDs to ports, in addition to dynamically setting up the address/size of windows. 
> Also, I'd like to point out that the dynamic configuration is needed > for two reasons: > > * The number of windows, as we are discussing now. OK. > * The amount of physical address space available. If you don't > dynamically configure those windows, then you have to account for the > "worst case", i.e. the PCIe devices that require very large memory > areas. So you end up creating static windows that reserve 32M or 64M > or 128M *per* PCIe link. You can see that it pretty quickly "consumes" > a large part of the 4G physical address space that we have. > Thanks to the dynamic window configuration that we do with the > PCI-to-PCI bridge, we can size the windows to exactly the size needed > by the downstream device on each PCIe interface. That aspect is applicable to any PCIe system; there's always some chunk of physical address space that maps to PCIe, and which must be divided into per-root-port chunks. I think the only difference on the Marvell HW is: * The overall total size of the physical address space is dynamic rather than fixed, because it's programmed through windows rather than hard-coded into HW. * Hence, the windows /both/ define the physical address space layout /and/ define the routing of transactions to individual root ports. On regular PCIe, the root port BARs only divide up the overall physical (PCIe bus 0 really) address space and hence perform routing; they have no influence over the CPU physical address space. So I think the crux of the problem is that you really have 10 PCIe root ports, each of which is nominally a separate PCIe domain (since the windows connect CPU physical address space to an individual/specific root port's PCIe address space, rather than having separate connections from CPU -> PCIe bus 0 address space, then BARs/windows connecting PCIe bus 0 address space to individual root port/subordinate bus address space), but you're attempting to treat all 10 ports as a single PCIe domain so that you don't have to dedicate separate physical CPU address space to each port, which you would have to do if they were actually treated as separate domains. >> The thing here is that when the PCIe core writes to a root port BAR >> window to configure/enable it the first time, you'll need to capture >> that transaction and dynamically allocate a window and program it in a >> way equivalent to what the BAR register write would have achieved on >> standard HW. Later, the window might need resizing, or even to be >> completely disabled, if the PCIe core were to change the standard BAR >> register. Dynamically allocating a window when the BAR is written >> seems a little heavy-weight. > > Why? Well, it's just a bunch more code; much more than a simple writel().
On Fri, Feb 01, 2013 at 10:57:20AM -0700, Stephen Warren wrote: > > We have 20 windows on Armada XP if I remember correctly, and they are > > not only used for PCIe, but also to map the BootROM (needed to boot > > secondary CPUs), to map SPI flashes or NOR flashes, for example. So > > they are really shared between many uses. In terms of PCIe, there are > > only two types of windows: I/O and Memory; there is no notion of a > > Prefetchable Memory window as far as I could see. > > In Tegra, we end up having separate MMIO vs. Prefetchable MMIO > chunks of our overall PCIe aperture. However, the HW setup appears > the same for both of those. I'm not sure if it's a bug in the > driver, or if it's just to separate the two address spaces so that > the page tables can be configured for those two regions with large > rather than small granularity. I need to go investigate that. The only purpose of prefetchable space is for legacy PCI. When a P2P bridge targets legacy PCI it has different behavior for its prefetchable memory window compared to the non-prefetchable memory window. IIRC (though it has been a long time since I looked really closely at this) PCI-X and PCI-E did away with this special bridge behaviour but kept the prefetchable memory space for compatibility. These days it is typically used to mark cacheable memory on an end device. From a SOC perspective, there is no need to treat MMIO and prefetch areas any differently. ARM's per-page cacheability flags can be used to deal with the differing caching requirements. However, the bus tree downstream of each root port will require the prefetch window to be contiguous. On Marvell, today, this means you need to burn two mbus windows to get this. If the Linux pci core could allocate the prefetch space for each root port bridge contiguously with the mmio space for the same root port then this could be reduced to one window covering both spaces for the port. > So there are 10 PCIe interfaces (root ports). That's on the SoC itself > right. Are all 10 (or a large number of them) actually used at once on > any given board design? I suppose this must be the case, or Marvell > wouldn't have wasted the silicon space on 10 root ports... Still, that's > a rather large number of ports! Agreed.. I have no idea what the target is for this.. > I think the only difference on the Marvell HW is: > > * The overall total size of the physical address space is dynamic rather > than fixed, because it's programmed through windows rather than > hard-coded into HW. Is it hard-coded on tegra? I thought there was a register set that was used to set the overall PCI-E MMIO window location and size. I know even on x86 the PCI window is set via register, though that typically isn't disclosed except to bios writers.. Jason
On Fri, Feb 01, 2013 at 05:45:29PM +0000, Arnd Bergmann wrote: > Yes, that was my point. I think in this case, the bug is in the new > of_pci_process_ranges function, which returns a 'struct resource' > translated into IORESOURCE_MEM space, but with the type set > to IORESOURCE_IO. This resource then gets passed to > pci_add_resource_offset(). A standard way to express the required address translation from CPU physical address to IO bus address might help other people avoid this trap?? Jason
On 02/01/2013 12:39 PM, Jason Gunthorpe wrote: > On Fri, Feb 01, 2013 at 10:57:20AM -0700, Stephen Warren wrote: ... >> I think the only difference on the Marvell HW is: >> >> * The overall total size of the physical address space is dynamic rather >> than fixed, because it's programmed through windows rather than >> hard-coded into HW. > > Is it hard-coded on tegra? I thought there was a register set that was > used to set the overall PCI-E MMIO window location and size. I know > even on x86 the PCI window is set via register, though that typically > isn't disclosed except to bios writers.. There is a fixed (in HW) 1 GiB physical address window dedicated to PCIe. That window is divided between host controller registers, PCIe root port registers (since our root ports don't respond to configuration transactions), and regular PCIe accesses; config/MMIO/IO. There are registers in the host controller that configure the division of this space into config/MMIO/IO, so that can be dynamic. The DT bindings for the driver Thierry proposed hard-code those divisions in DT.
Dear Thierry Reding, On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote: > > I didn't test recently, but with my first version of the patch set, > > having an initialization as late as module_init() was too late. Some > > PCI fixup code was being executed *before* we get the opportunity of > > initializing the PCI driver, and it was crashing the kernel. I can > > provide more details if you want. > > Does this patch perhaps fix this crash? > > http://patchwork.ozlabs.org/patch/210870/ I investigated a bit more, and managed to reproduce my crash even with your patch applied. And indeed, my crash is really unrelated to the pcibios function disappearing. Here is the kernel panic (and a short analysis afterwards) : Unhandled fault: external abort on non-linefetch (0x1008) at 0xe0910010 Internal error: : 1008 [#1] SMP ARM Modules linked in: CPU: 0 Not tainted (3.8.0-rc5-00029-g80e55fd-dirty #1303) PC is at quirk_usb_handoff_xhci+0x5c/0x284 LR is at ioremap_pte_range+0x84/0xdc pc : [<c022717c>] lr : [<c0150944>] psr: a0000013 sp : df82bce8 ip : df81c000 fp : c0e09dac r10: 00008000 r9 : 00000000 r8 : de935000 r7 : e0910000 r6 : c03f0ce0 r5 : de935000 r4 : de935000 r3 : 01c801c8 r2 : 00000000 r1 : 42007e13 r0 : e0910000 Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 1e95c019 DAC: 00000015 Process swapper/0 (pid: 1, stack limit = 0xdf82a238) Stack: (0xdf82bce8 to 0xdf82c000) bce0: de935060 c01702f4 de935000 de935000 de935000 c03f0ce0 bd00: c0e220dc c02277f8 de935060 c02277d4 c03f0ce0 c0227858 c03f0ce0 c0178d98 bd20: c01a855c c0312dd0 00000000 00000000 de936f6c c0312ed0 c0e0f7cc de935000 bd40: de936c14 de936c00 df82bde8 00000001 de916668 c016a958 00000000 de935000 bd60: de936c14 c016ab5c de935800 de910014 de936c00 c016abcc 00000000 00000001 bd80: de910000 c016d234 de916668 de9166d0 de916640 00000000 df82be14 c017a93c bda0: de916668 df82be14 df82bde8 c0012ef8 f3cec23f c0251c90 c0019da0 df805340 bdc0: f3cec23f df82bde8 c0e426e8 df82be14 df802b00 de912a90 00000060 df89e400 bde0: 00000002 c00131dc df82bde8 df82bde8 c0e454f0 00000000 de9166d0 de912af0 be00: df802b00 c0442010 df89e410 00000000 00000000 c0e0fcac 00000001 df82be54 be20: c04420b4 c017a90c 00000000 00000000 00000000 c0442054 c1000000 c8ffffff be40: c124b5f0 00000200 00000000 00000000 00000000 de9166d0 df89e410 c0e43e48 be60: c0e0fc6c df89e410 00000000 c0e0fc6c c04563d4 c017a7c4 00000000 c01a8c24 be80: c01a8c0c c01a78e8 df89e410 c0e0fc6c df89e444 00000000 c043023c c01a7bd8 bea0: c0e0fc6c 00000000 c01a7b4c c01a632c df8067d8 df85a474 c0e0fc6c c0e170c8 bec0: de916740 c01a7228 c03d28bc c0e0fc6c c0e0fc6c df82a000 c0e220c0 00000000 bee0: c043023c c01a80d8 00000000 c0e0fc58 df82a000 c0e220c0 00000000 c043023c bf00: c017a7c4 c01a8e1c c044fc80 df82a000 c0e220c0 c00086d4 c040e208 00000006 bf20: 0000007b c017a7c4 0000007b 00000006 00000006 c043023c c124bb15 00000000 bf40: c0e08194 c044fc80 00000006 c044fc60 c0e220c0 c043023c c04563d4 0000007b bf60: 00000000 c04308b0 00000006 00000006 c043023c 00000000 c0456128 c0456128 bf80: 00000000 00000000 00000000 00000000 00000000 c0430958 00000000 00000000 bfa0: c03156c8 c03156d0 00000000 c000dfd8 00000000 00000000 00000000 00000000 bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 2254c9ef c6885425 [<c022717c>] (quirk_usb_handoff_xhci+0x5c/0x284) from [<c02277d4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) [<c02277d4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) from [<c0178d98>] 
(pci_do_fixups+0xc4/0x17c) [<c0178d98>] (pci_do_fixups+0xc4/0x17c) from [<c016a958>] (pci_bus_add_device+0x14/0x60) [<c016a958>] (pci_bus_add_device+0x14/0x60) from [<c016ab5c>] (pci_bus_add_devices+0x44/0x128) [<c016ab5c>] (pci_bus_add_devices+0x44/0x128) from [<c016abcc>] (pci_bus_add_devices+0xb4/0x128) [<c016abcc>] (pci_bus_add_devices+0xb4/0x128) from [<c016d234>] (pci_scan_root_bus+0x7c/0xcc) [<c016d234>] (pci_scan_root_bus+0x7c/0xcc) from [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) from [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) from [<c00131dc>] (pci_common_init+0x44/0xc4) [<c00131dc>] (pci_common_init+0x44/0xc4) from [<c0442010>] (mvebu_pcie_probe+0x360/0x3a4) [<c0442010>] (mvebu_pcie_probe+0x360/0x3a4) from [<c01a8c24>] (platform_drv_probe+0x18/0x1c) [<c01a8c24>] (platform_drv_probe+0x18/0x1c) from [<c01a78e8>] (really_probe+0x60/0x1e0) [<c01a78e8>] (really_probe+0x60/0x1e0) from [<c01a7bd8>] (__driver_attach+0x8c/0x90) [<c01a7bd8>] (__driver_attach+0x8c/0x90) from [<c01a632c>] (bus_for_each_dev+0x50/0x7c) [<c01a632c>] (bus_for_each_dev+0x50/0x7c) from [<c01a7228>] (bus_add_driver+0x168/0x22c) [<c01a7228>] (bus_add_driver+0x168/0x22c) from [<c01a80d8>] (driver_register+0x78/0x144) [<c01a80d8>] (driver_register+0x78/0x144) from [<c01a8e1c>] (platform_driver_probe+0x18/0xac) [<c01a8e1c>] (platform_driver_probe+0x18/0xac) from [<c00086d4>] (do_one_initcall+0x34/0x174) [<c00086d4>] (do_one_initcall+0x34/0x174) from [<c04308b0>] (do_basic_setup+0x90/0xc4) [<c04308b0>] (do_basic_setup+0x90/0xc4) from [<c0430958>] (kernel_init_freeable+0x74/0x10c) [<c0430958>] (kernel_init_freeable+0x74/0x10c) from [<c03156d0>] (kernel_init+0x8/0xe4) [<c03156d0>] (kernel_init+0x8/0xe4) from [<c000dfd8>] (ret_from_fork+0x14/0x3c) Code: e3a02000 ebf7c092 e2507000 0afffff7 (e5973010) ---[ end trace 834a6081748c17ef ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b Basically, the problem comes from the fact that the USB XHCI code has an early PCI fixup that ioremap()s the PCI device's memory BAR and accesses it. Unfortunately, this fixup is called during pcibios_init_hw(), at a point where the Marvell PCIe driver hasn't yet had a chance to set up the address decoding windows (the Linux PCI core hasn't even configured the emulated PCI-to-PCI bridges, so I don't know where the PCI devices will sit). Due to this, the first access to the PCI device memory by this early fixup triggers an exception, and the kernel panics. For some reason, moving the Marvell PCIe driver initialization to the subsys_initcall() level works around the problem, but I'm not sure why, since it is actually the driver initialization that ends up calling the PCI fixup code. But clearly, with the PCIe initialization done at subsys_initcall() time, the PCIe controller is initialized, and the PCI fixup is executed much later. Ideas welcome. Best regards, Thomas
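For context, the quirk that fires here is registered as a *final* fixup, which the core runs from pci_bus_add_device() for every device it adds, i.e. from within pci_common_init(), before the host driver gets control back. Schematically (heavily simplified from drivers/usb/host/pci-quirks.c of that era; the real function also handles UHCI/OHCI/EHCI):

#define PCI_CLASS_SERIAL_USB_XHCI 0x0c0330 /* local define in pci-quirks.c */

static void quirk_usb_early_handoff(struct pci_dev *pdev)
{
	/* ... UHCI/OHCI/EHCI cases elided ... */
	if (pdev->class == PCI_CLASS_SERIAL_USB_XHCI)
		quirk_usb_handoff_xhci(pdev);	/* ioremaps BAR 0 and reads it */
}
DECLARE_PCI_FIXUP_FINAL(PCI_ANY_ID, PCI_ANY_ID, quirk_usb_early_handoff);

So whatever initcall level is chosen, the address decoding windows must be programmed before pci_bus_add_devices() runs; why subsys_initcall() achieves that here while module_init() does not is, as Thomas says, not obvious from the trace alone.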
Dear Andrew Murray, On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote: > > + /* > > + * Build an laddr array that describes the PCI device in a > > DT > > + * way > > + */ > > + laddr[0] = cpu_to_be32(port->devfn << 8); > > + laddr[1] = laddr[2] = 0; > > + intspec = cpu_to_be32(pin); > > + > > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > > + if (ret) { > > + dev_err(&pcie->pdev->dev, > > + "%s: of_irq_map_raw() failed, %d\n", > > + __func__, ret); > > + return ret; > > + } > > Are you able to replace the above code with a call to of_irq_map_pci? > I'm not sure which approach is better. The of_irq_map_pci function > doesn't require the pin argument and instead uses the DT and/or > performs its own pin swizzling. I guess this means that if there are > PCIe devices in the DT tree that does any thing strange with pins > then it would be reflected in the IRQ you get. I've found that you > will also need to provide an implementation of > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 > patch). I tried to do so, but it doesn't work properly. Let me explain what I did and the behavior that I observe. First of all, I didn't reimplement the pcibios_get_phb_of_node(), but instead, as Thierry Reding suggested, simply implemented the hw_pci.scan() function as follows: static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys) { struct mvebu_pcie *pcie = sys_to_pcie(sys); return pci_scan_root_bus(&pcie->pdev->dev, sys->busnr, &mvebu_pcie_ops, sys, &sys->resources); } This allows to pass the "struct device *" pointer, which ultimately allows the PCI devices to carry a pointer to the corresponding DT node. The DT representing my PCIe controller and its interfaces is the following: pcie-controller { compatible = "marvell,armada-370-xp-pcie"; status = "disabled"; #address-cells = <3>; #size-cells = <2>; bus-range = <0x00 0xff>; ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000 /* port 0.0 registers */ 0x00001000 0 0xd0080000 0xd0080000 0 0x00002000 /* port 1.0 registers */ 0x81000000 0 0 0xc0000000 0 0x00010000 /* downstream I/O */ 0x82000000 0 0 0xc1000000 0 0x08000000>; /* non-prefetchable memory */ #interrupt-cells = <1>; interrupt-map-mask = <0xf800 0 0 1>; interrupt-map = <0x0800 0 0 1 &mpic 58 /* port 0.0 */ 0x1000 0 0 1 &mpic 62>; /* port 1.0 */ pcie@0,0 { device_type = "pciex"; reg = <0x0800 0 0xd0040000 0 0x2000>; #address-cells = <3>; #size-cells = <2>; marvell,pcie-port = <0>; marvell,pcie-lane = <0>; clocks = <&gateclk 5>; status = "disabled"; }; pcie@1,0 { device_type = "pciex"; reg = <0x1000 0 0xd0080000 0 0x2000>; #address-cells = <3>; #size-cells = <2>; marvell,pcie-port = <1>; marvell,pcie-lane = <0>; clocks = <&gateclk 9>; status = "disabled"; }; }; So we have two PCIe interfaces. lspci shows the following output: 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 01:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300 02:00.0 USB controller: ASMedia Technology Inc. Device 1040 So: * On bus 0, we have two PCI-to-PCI bridges. Those are emulated in software and allow the Marvell PCIe driver to get dynamic assignment of address decoding windows. The entries in the interrupt-map DT property match the bus number and slot number of those PCI-to-PCI bridges. * On bus 1, we have the real PCIe device connected to the first PCIe interface. This bus 1 is made "visible" thanks to the 00:01.0 PCI-to-PCI bridge. 
* On bus 2, we have the real PCIe device connected to the second PCIe interface. This bus 2 is made "visible" thanks to the 00:02.0 PCI-to-PCI bridge. Now, when I call the of_irq_map_pci() function, the problem is that the "struct pci_dev" that it receives is the one corresponding to the particular PCIe device we're interested in. And this "struct pci_dev" has a non-NULL pointer to the "struct device_node" representing "pcie@0,0" or "pcie@1,0" above. Since the "struct device_node" is non-NULL, of_irq_map_pci() builds a laddr[] with the bus number and devfn of this device: bus number is 1, devfn is 0. And this doesn't match the interrupt-map that is in my DT, which associates the interrupt numbers with the PCI-to-PCI bridges rather than the devices themselves. To me, the "struct pci_dev" representing the real PCIe devices should have a NULL "struct device_node" pointer, because those devices are not represented in the DT. If this was the case, then of_irq_map_pci() would go one level up in the PCI hierarchy, find the "struct pci_dev" that corresponds to the PCI-to-PCI bridge, and generate an laddr[] having a bus number and devfn value matching the interrupt-map property. If I do the following hack in of_irq_map_pci(), then everything works nicely:

    } else {
            /* We found a P2P bridge, check if it has a node */
            ppnode = pci_device_to_OF_node(ppdev);
+           if (pdev->bus->number != 0)
+                   ppnode = NULL;
    }

What this hack does is that if we are not on bus 0, it means that the pci_dev is a real PCIe device, and therefore we force the code to assume it does not have a DT reference. Isn't there a problem here in the PCI/DT code? Is it normal that a PCIe device that isn't described in the DT carries a non-NULL struct device_node pointer? If you need some more details about the problem, do not hesitate to ask. Thanks a lot for your help, Thomas
On Thu, Feb 7, 2013 at 3:24 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote: > Dear Thierry Reding, > > On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote: > >> > I didn't test recently, but with my first version of the patch set, >> > having an initialization as late as module_init() was too late. Some >> > PCI fixup code was being executed *before* we get the opportunity of >> > initializing the PCI driver, and it was crashing the kernel. I can >> > provide more details if you want. >> >> Does this patch perhaps fix this crash? >> >> http://patchwork.ozlabs.org/patch/210870/ > > I investigated a bit more, and managed to reproduce my crash even with > your patch applied. And indeed, my crash is really unrelated to the > pcibios function disappearing. Here is the kernel panic (and a short > analysis afterwards) : Hi Thomas, Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y? That should have more information about the enumeration process, including what we think the XHCI BARs are and the apertures leading to them. The PCI core assumes that we know the host bridge apertures up front, and I'm not sure that is true on your platform, so maybe we'll need some changes to accommodate that. Bjorn
On Thu, Feb 07, 2013 at 02:37:50PM +0000, Thomas Petazzoni wrote: > Dear Andrew Murray, > > On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote: > > > > + /* > > > + * Build an laddr array that describes the PCI device in a > > > DT > > > + * way > > > + */ > > > + laddr[0] = cpu_to_be32(port->devfn << 8); > > > + laddr[1] = laddr[2] = 0; > > > + intspec = cpu_to_be32(pin); > > > + > > > + ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq); > > > + if (ret) { > > > + dev_err(&pcie->pdev->dev, > > > + "%s: of_irq_map_raw() failed, %d\n", > > > + __func__, ret); > > > + return ret; > > > + } > > > > Are you able to replace the above code with a call to of_irq_map_pci? > > I'm not sure which approach is better. The of_irq_map_pci function > > doesn't require the pin argument and instead uses the DT and/or > > performs its own pin swizzling. I guess this means that if there are > > PCIe devices in the DT tree that does any thing strange with pins > > then it would be reflected in the IRQ you get. I've found that you > > will also need to provide an implementation of > > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 > > patch). > > I tried to do so, but it doesn't work properly. Let me explain what I > did and the behavior that I observe. > > First of all, I didn't reimplement the pcibios_get_phb_of_node(), but > instead, as Thierry Reding suggested, simply implemented the > hw_pci.scan() function as follows: I've not had any time to test Thierry's solution to avoid implementing pcibios_get_phb_of_node - but it did seem to work for him and seem correct at the time. > > static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys) > { > struct mvebu_pcie *pcie = sys_to_pcie(sys); > return pci_scan_root_bus(&pcie->pdev->dev, sys->busnr, > &mvebu_pcie_ops, sys, &sys->resources); > } > > This allows to pass the "struct device *" pointer, which ultimately > allows the PCI devices to carry a pointer to the corresponding DT node. > > The DT representing my PCIe controller and its interfaces is the > following: > > pcie-controller { > compatible = "marvell,armada-370-xp-pcie"; > status = "disabled"; > > #address-cells = <3>; > #size-cells = <2>; > > bus-range = <0x00 0xff>; > > ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000 /* port 0.0 registers */ > 0x00001000 0 0xd0080000 0xd0080000 0 0x00002000 /* port 1.0 registers */ > 0x81000000 0 0 0xc0000000 0 0x00010000 /* downstream I/O */ > 0x82000000 0 0 0xc1000000 0 0x08000000>; /* non-prefetchable memory */ > > #interrupt-cells = <1>; > interrupt-map-mask = <0xf800 0 0 1>; > interrupt-map = <0x0800 0 0 1 &mpic 58 /* port 0.0 */ > 0x1000 0 0 1 &mpic 62>; /* port 1.0 */ > > pcie@0,0 { > device_type = "pciex"; > reg = <0x0800 0 0xd0040000 0 0x2000>; > #address-cells = <3>; > #size-cells = <2>; > marvell,pcie-port = <0>; > marvell,pcie-lane = <0>; > clocks = <&gateclk 5>; > status = "disabled"; > }; > > pcie@1,0 { > device_type = "pciex"; > reg = <0x1000 0 0xd0080000 0 0x2000>; > #address-cells = <3>; > #size-cells = <2>; > marvell,pcie-port = <1>; > marvell,pcie-lane = <0>; > clocks = <&gateclk 9>; > status = "disabled"; > }; > }; > > So we have two PCIe interfaces. lspci shows the following output: > > 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 > 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 > 01:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300 > 02:00.0 USB controller: ASMedia Technology Inc. 
Device 1040 > > So: > > * On bus 0, we have two PCI-to-PCI bridges. Those are emulated in > software and allow the Marvell PCIe driver to get dynamic assignment > of address decoding windows. The entries in the interrupt-map DT > property match the bus number and slot number of those PCI-to-PCI > bridges. > > * On bus 1, we have the real PCIe device connected to the first PCIe > interface. This bus 1 is made "visible" thanks to the 00:01.0 > PCI-to-PCI bridge. > > * On bus 2, we have the real PCIe device connected to the second PCIe > interface. This bus 2 is made "visible" thanks to the 00:02.0 > PCI-to-PCI bridge. > > Now, when I call the of_irq_map_pci() function, the problem is that the > "struct pci_dev" that it receives is the one corresponding to the > particular PCIe device we're interested in. And this "struct pci_dev" > has a non-NULL pointer to the "struct device_node" representing > "pcie0,0" or "pci0,1" above. Since the "struct device_node" is > non-NULL, of_irq_map_pci() builds a laddr[] with the bus number and > devfn of this device: bus number is 1, devfn is 0. And this doesn't > match with the interrupt-map that is in my DT, which associates the > interrupts numbers with the PCI-to-PCI bridges rather than the devices > themselves. > > To me, the "struct pci_dev" representing the real PCIe devices should > have a NULL "struct device_node" pointer, because those device are not > represented in the DT. If this was the case, then the of_irq_map_pci() > would go one level up in the PCI hierarchy, find the "struct pci_dev" > that corresponds to the PCI-to-PCI bridge, which generates an laddr[] > having a bus number a devfn value matching the interrupt-map property. Yes this is my current understanding. > > If I do the following hack in of_irq_map_pci(), then everything works > nicely: > > } else { > /* We found a P2P bridge, check if it has a node */ > ppnode = pci_device_to_OF_node(ppdev); > + if (pdev->bus->number != 0) > + ppnode = NULL; > } > > > What this hack does is that if we are not on bus 0, it means that the > pci_dev is a real PCIe device, and therefore we force the code to > asssume it does not have a DT reference. > > Isn't there a problem here in the PCI/DT code ? Is it normal that a > PCIe device that isn't described in the DT carries a non-NULL struct > device_node pointer? I would suggest the issue isn't in the PCI/DT code. This is what I see with my implementation (which uses an implementation of pcibios_get_phb_of_node) - I'd like to believe this is correct behaviour: - of_irq_map_pci is called for an endpoint (5:0:0), its pdev->dev.of_node is NULL. As I don't have a representation of this endpoint in my DT the of_irq_map_pci code proceeds to walk the fabric. - Starting with the pdev for 5:0:0, of_irq_map_pci sees it has a parent (downstream switch bridge), in this case pdev(5:0:0)->bus->self->dev.of_node is NULL, due to this swizzling occurs and we walk the parent's bus (bus 2) - This continues to bus 1 and then bus 0 where of_irq_map_pci realises that it has no parent (its the host bridge). Due to the implementation of pcibios_get_phb_of_node a of_node is produced, this is then used to construct a lspec (for bus number 0). The of_irq_map_pci code stops walking as soon as it finds a function in the tree that has a device node. This suggests that if you represent a bridge you must also include an interrupt-map. The problem here is that you have represented a bridge but not included a map. I can think of three solutions: 1. 
Something like this: } else { /* We found a P2P bridge, check if it has a node */ ppnode = pci_device_to_OF_node(ppdev); + if (ppnode doesnt have an interrupt-map)//if (pdev->bus->number != 0) + ppnode = NULL; } 2. Remove the bridges from the DT? Or remove the map from pcie-controller and add a map each to pcie@0,1 and pcie@1,1? 3. Change the mask of your map so that it doesn't care about bus numbers. I have a map that looks like this: interrupt-map-mask = <0 0 0 7>; interrupt-map = <0 0 0 1 ... //anything coming in on INTA 0 0 0 2 ... //anything coming in on INTB 0 0 0 3 ... ... 0 0 0 4 ... ... Andrew Murray > > If you need some more details about the problem, do not hesitate to ask. > > Thanks a lot for your help, > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dear Bjorn Helgaas, On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote: > Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y? > That should have more information about the enumeration process, > including what we think the XHCI BARs are and the apertures leading to > them. Sure, see below. > The PCI core assumes that we know the host bridge apertures up front, > and I'm not sure that is true on your platform, so maybe we'll need > some changes to accommodate that. In this hardware, we need to set up the address decoding windows. So there shouldn't be any access to a PCI device memory or I/O region until the addresses have been assigned in the PCI-to-PCI bridge. Note that I am know setting up the address decoding window as soon as the address is written into the PCI-to-PCI bridge. I am no longer waiting the end of enumeration process, and then go through the PCI-to-PCI bridge registers to configure them. The system tested below is an Armada 370, it has only two PCIe links. One is connected to a XHCI USB controller, the other one to an Intel Wireless NIC. First the dmesg when module_init() is used, which shows the crash: =================================================================== Booting Linux on physical CPU 0x0 Linux version 3.8.0-rc5-00029-g80e55fd-dirty (thomas@skate) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #1313 SMP Thu Feb 7 16:53:32 CET 2013 CPU: ARMv7 Processor [561f5811] revision 1 (ARMv7), cr=10c53c7d CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache Machine: Marvell Armada 370/XP (Device Tree), model: Globalscale Mirabox bootconsole [earlycon0] enabled Memory policy: ECC disabled, Data cache writeback PERCPU: Embedded 7 pages/cpu @c128b000 s6464 r8192 d14016 u32768 Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048 Kernel command line: console=ttyS0,115200 earlyprintk PID hash table entries: 2048 (order: 1, 8192 bytes) Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) __ex_table already sorted, skipping sort Memory: 512MB = 512MB total Memory: 504860k/504860k available, 19428k reserved, 0K highmem Virtual kernel memory layout: vector : 0xffff0000 - 0xffff1000 ( 4 kB) fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB) vmalloc : 0xe0800000 - 0xff000000 ( 488 MB) lowmem : 0xc0000000 - 0xe0000000 ( 512 MB) pkmap : 0xbfe00000 - 0xc0000000 ( 2 MB) modules : 0xbf000000 - 0xbfe00000 ( 14 MB) .text : 0xc0008000 - 0xc0467700 (4478 kB) .init : 0xc0468000 - 0xc0e30940 (10019 kB) .data : 0xc0e32000 - 0xc0e5f9c0 ( 183 kB) .bss : 0xc0e5f9c0 - 0xc0e84fdc ( 150 kB) Hierarchical RCU implementation. RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1. NR_IRQS:16 nr_irqs:16 16 Aurora cache controller enabled l2x0: 4 ways, CACHE_ID 0x00000100, AUX_CTRL 0x1a086302, Cache size: 262144 B sched_clock: 32 bits at 18MHz, resolution 53ns, wraps every 229064ms Console: colour dummy device 80x30 Calibrating delay loop... 1196.85 BogoMIPS (lpj=5984256) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 512 CPU: Testing write buffer coherency: ok CPU0: thread -1, cpu 0, socket -1, mpidr 0 Setting up static identity map for 0x351a28 - 0x351a80 Brought up 1 CPUs SMP: Total of 1 processors activated (1196.85 BogoMIPS). 
devtmpfs: initialized pinctrl core: initialized pinctrl subsystem NET: Registered protocol family 16 DMA: preallocated 1024 KiB pool for atomic coherent allocations irq: Cannot allocate irq_descs @ IRQ27, assuming pre-allocated irq: Cannot allocate irq_descs @ IRQ63, assuming pre-allocated irq: Cannot allocate irq_descs @ IRQ96, assuming pre-allocated Initializing Coherency fabric bio: create slab <bio-0> at 0 vgaarb: loaded SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb cfg80211: Calling CRDA to update world regulatory domain Switching to clocksource armada_370_xp_clocksource NET: Registered protocol family 2 TCP established hash table entries: 4096 (order: 3, 32768 bytes) TCP bind hash table entries: 4096 (order: 3, 32768 bytes) TCP: Hash tables configured (established 4096 bind 4096) TCP: reno registered UDP hash table entries: 256 (order: 1, 8192 bytes) UDP-Lite hash table entries: 256 (order: 1, 8192 bytes) msgmni has been set to 986 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253) io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) armada-370-pinctrl d0018000.pinctrl: registered pinctrl driver mvebu-pcie pcie-controller.1: PCIe0.0: link up mvebu-pcie pcie-controller.1: PCIe1.0: link up mvebu-pcie pcie-controller.1: PCI host bridge to bus 0000:00 pci_bus 0000:00: root bus resource [io 0xc0000000-0xc000ffff] pci_bus 0000:00: root bus resource [mem 0xc1000000-0xc8ffffff] pci_bus 0000:00: root bus resource [bus 00-ff] PCI: bus0: Fast back to back transfers disabled pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring PCI: bus1: Fast back to back transfers disabled PCI: bus2: Fast back to back transfers disabled Unhandled fault: external abort on non-linefetch (0x1008) at 0xe0910010 Internal error: : 1008 [#1] SMP ARM Modules linked in: CPU: 0 Not tainted (3.8.0-rc5-00029-g80e55fd-dirty #1313) PC is at quirk_usb_handoff_xhci+0x5c/0x284 LR is at ioremap_pte_range+0x84/0xdc pc : [<c023ba8c>] lr : [<c0150944>] psr: a0000013 sp : df82bce8 ip : df81c000 fp : c0e43dac r10: 00008000 r9 : 00000000 r8 : de933000 r7 : e0910000 r6 : c04267a0 r5 : de933000 r4 : de933000 r3 : 01c801c8 r2 : 00000000 r1 : 42007e13 r0 : e0910000 Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 1e960019 DAC: 00000015 Process swapper/0 (pid: 1, stack limit = 0xdf82a238) Stack: (0xdf82bce8 to 0xdf82c000) bce0: de933060 c01702f4 de933000 de933000 de933000 c04267a0 bd00: c0e5f9dc c023c108 de933060 c023c0e4 c04267a0 c023c168 c04267a0 c0178d98 bd20: c01a855c c033cab4 00000000 00000000 de935f6c c033cbb4 c0e497cc de933000 bd40: de935c14 de935c00 df82bde8 00000001 de919368 c016a958 00000000 de933000 bd60: de935c14 c016ab5c de933800 de918014 de935c00 c016abcc 00000000 00000001 bd80: de918000 c016d234 de919368 de9193d0 de919340 00000000 df82be14 c017a93c bda0: de919368 df82be14 df82bde8 c0012ef8 f3cec23f c027b974 c0019da0 df805340 bdc0: f3cec23f df82bde8 c0e7ffe8 df82be14 df802b00 de917890 00000060 df89e400 bde0: 00000002 c00131dc df82bde8 df82bde8 c0e83318 00000000 de9193d0 de9178f0 be00: df802b00 c047a010 df89e410 00000000 00000000 c0e49cac 00000001 df82be54 be20: c047a0b4 c017a90c 00000000 00000000 00000000 c047a054 c1000000 c8ffffff be40: c12885f0 00000200 00000000 00000000 00000000 de9193d0 df89e410 c0e81748 be60: 
c0e49c6c df89e410 00000000 c0e49c6c c048e9b8 c017a7c4 00000000 c01a8c24 be80: c01a8c0c c01a78e8 df89e410 c0e49c6c df89e444 00000000 c046823c c01a7bd8 bea0: c0e49c6c 00000000 c01a7b4c c01a632c df8067d8 df85a474 c0e49c6c c0e510c8 bec0: de919440 c01a7228 c0402548 c0e49c6c c0e49c6c df82a000 c0e5f9c0 00000000 bee0: c046823c c01a80d8 00000000 c0e49c58 df82a000 c0e5f9c0 00000000 c046823c bf00: c017a7c4 c01a8e1c c0488260 df82a000 c0e5f9c0 c00086d4 c0444b18 00000006 bf20: 0000008b c017a7c4 0000008b 00000006 00000006 c046823c c1288b15 00000000 bf40: c0e42194 c0488260 00000006 c0488240 c0e5f9c0 c046823c c048e9b8 0000008b bf60: 00000000 c04688b0 00000006 00000006 c046823c 00000000 c048e708 c048e708 bf80: 00000000 00000000 00000000 00000000 00000000 c0468958 00000000 00000000 bfa0: c033f3ac c033f3b4 00000000 c000dfd8 00000000 00000000 00000000 00000000 bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 2254c9ef c6885425 [<c023ba8c>] (quirk_usb_handoff_xhci+0x5c/0x284) from [<c023c0e4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) [<c023c0e4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) from [<c0178d98>] (pci_do_fixups+0xc4/0x17c) [<c0178d98>] (pci_do_fixups+0xc4/0x17c) from [<c016a958>] (pci_bus_add_device+0x14/0x60) [<c016a958>] (pci_bus_add_device+0x14/0x60) from [<c016ab5c>] (pci_bus_add_devices+0x44/0x128) [<c016ab5c>] (pci_bus_add_devices+0x44/0x128) from [<c016abcc>] (pci_bus_add_devices+0xb4/0x128) [<c016abcc>] (pci_bus_add_devices+0xb4/0x128) from [<c016d234>] (pci_scan_root_bus+0x7c/0xcc) [<c016d234>] (pci_scan_root_bus+0x7c/0xcc) from [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) from [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) from [<c00131dc>] (pci_common_init+0x44/0xc4) [<c00131dc>] (pci_common_init+0x44/0xc4) from [<c047a010>] (mvebu_pcie_probe+0x360/0x3a4) [<c047a010>] (mvebu_pcie_probe+0x360/0x3a4) from [<c01a8c24>] (platform_drv_probe+0x18/0x1c) [<c01a8c24>] (platform_drv_probe+0x18/0x1c) from [<c01a78e8>] (really_probe+0x60/0x1e0) [<c01a78e8>] (really_probe+0x60/0x1e0) from [<c01a7bd8>] (__driver_attach+0x8c/0x90) [<c01a7bd8>] (__driver_attach+0x8c/0x90) from [<c01a632c>] (bus_for_each_dev+0x50/0x7c) [<c01a632c>] (bus_for_each_dev+0x50/0x7c) from [<c01a7228>] (bus_add_driver+0x168/0x22c) [<c01a7228>] (bus_add_driver+0x168/0x22c) from [<c01a80d8>] (driver_register+0x78/0x144) [<c01a80d8>] (driver_register+0x78/0x144) from [<c01a8e1c>] (platform_driver_probe+0x18/0xac) [<c01a8e1c>] (platform_driver_probe+0x18/0xac) from [<c00086d4>] (do_one_initcall+0x34/0x174) [<c00086d4>] (do_one_initcall+0x34/0x174) from [<c04688b0>] (do_basic_setup+0x90/0xc4) [<c04688b0>] (do_basic_setup+0x90/0xc4) from [<c0468958>] (kernel_init_freeable+0x74/0x10c) [<c0468958>] (kernel_init_freeable+0x74/0x10c) from [<c033f3b4>] (kernel_init+0x8/0xe4) [<c033f3b4>] (kernel_init+0x8/0xe4) from [<c000dfd8>] (ret_from_fork+0x14/0x3c) Code: e3a02000 ebf76e4e e2507000 0afffff7 (e5973010) ---[ end trace 7097ba2281051df7 ]--- Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b =================================================================== And now, the dmesg when the PCIe driver is initialized at the subsys_initcall() level, no crash happens, and everything works fine. 
=================================================================== Booting Linux on physical CPU 0x0 Linux version 3.8.0-rc5-00029-g80e55fd-dirty (thomas@skate) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #1314 SMP Thu Feb 7 16:57:39 CET 2013 CPU: ARMv7 Processor [561f5811] revision 1 (ARMv7), cr=10c53c7d CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache Machine: Marvell Armada 370/XP (Device Tree), model: Globalscale Mirabox bootconsole [earlycon0] enabled Memory policy: ECC disabled, Data cache writeback PERCPU: Embedded 7 pages/cpu @c128b000 s6464 r8192 d14016 u32768 Built 1 zonelists in Zone order, mobility grouping on. Total pages: 130048 Kernel command line: console=ttyS0,115200 earlyprintk PID hash table entries: 2048 (order: 1, 8192 bytes) Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) __ex_table already sorted, skipping sort Memory: 512MB = 512MB total Memory: 504860k/504860k available, 19428k reserved, 0K highmem Virtual kernel memory layout: vector : 0xffff0000 - 0xffff1000 ( 4 kB) fixmap : 0xfff00000 - 0xfffe0000 ( 896 kB) vmalloc : 0xe0800000 - 0xff000000 ( 488 MB) lowmem : 0xc0000000 - 0xe0000000 ( 512 MB) pkmap : 0xbfe00000 - 0xc0000000 ( 2 MB) modules : 0xbf000000 - 0xbfe00000 ( 14 MB) .text : 0xc0008000 - 0xc0467700 (4478 kB) .init : 0xc0468000 - 0xc0e30940 (10019 kB) .data : 0xc0e32000 - 0xc0e5f9c0 ( 183 kB) .bss : 0xc0e5f9c0 - 0xc0e84fdc ( 150 kB) Hierarchical RCU implementation. RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1. NR_IRQS:16 nr_irqs:16 16 Aurora cache controller enabled l2x0: 4 ways, CACHE_ID 0x00000100, AUX_CTRL 0x1a086302, Cache size: 262144 B sched_clock: 32 bits at 18MHz, resolution 53ns, wraps every 229064ms Console: colour dummy device 80x30 Calibrating delay loop... 1196.85 BogoMIPS (lpj=5984256) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 512 CPU: Testing write buffer coherency: ok CPU0: thread -1, cpu 0, socket -1, mpidr 0 Setting up static identity map for 0x351a28 - 0x351a80 Brought up 1 CPUs SMP: Total of 1 processors activated (1196.85 BogoMIPS). devtmpfs: initialized pinctrl core: initialized pinctrl subsystem NET: Registered protocol family 16 DMA: preallocated 1024 KiB pool for atomic coherent allocations irq: Cannot allocate irq_descs @ IRQ27, assuming pre-allocated irq: Cannot allocate irq_descs @ IRQ63, assuming pre-allocated irq: Cannot allocate irq_descs @ IRQ96, assuming pre-allocated Initializing Coherency fabric bio: create slab <bio-0> at 0 mvebu-pcie pcie-controller.1: PCIe0.0: link up mvebu-pcie pcie-controller.1: PCIe1.0: link up mvebu-pcie pcie-controller.1: PCI host bridge to bus 0000:00 pci_bus 0000:00: root bus resource [io 0xc0000000-0xc000ffff] pci_bus 0000:00: root bus resource [mem 0xc1000000-0xc8ffffff] pci_bus 0000:00: root bus resource [bus 00-ff] PCI: bus0: Fast back to back transfers disabled pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring PCI: bus1: Fast back to back transfers disabled PCI: bus2: Fast back to back transfers disabled Getting IRQ slot 1, pin 1... Cannot get irq... 135 Getting IRQ slot 2, pin 1... Cannot get irq... 135 Getting IRQ slot 1, pin 1... Getting IRQ slot 2, pin 1... 
pci 0000:00:01.0: BAR 8: assigned [mem 0xc1000000-0xc10fffff] pci 0000:00:02.0: BAR 8: assigned [mem 0xc1100000-0xc11fffff] pci 0000:01:00.0: BAR 0: assigned [mem 0xc1000000-0xc1001fff 64bit] pci 0000:00:01.0: PCI bridge to [bus 01] pci 0000:00:01.0: bridge window [mem 0xc1000000-0xc10fffff] pci 0000:02:00.0: BAR 0: assigned [mem 0xc1100000-0xc1107fff 64bit] pci 0000:00:02.0: PCI bridge to [bus 02] pci 0000:00:02.0: bridge window [mem 0xc1100000-0xc11fffff] PCI: enabling device 0000:00:01.0 (0140 -> 0143) PCI: enabling device 0000:00:02.0 (0140 -> 0143) vgaarb: loaded SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb cfg80211: Calling CRDA to update world regulatory domain Switching to clocksource armada_370_xp_clocksource NET: Registered protocol family 2 TCP established hash table entries: 4096 (order: 3, 32768 bytes) TCP bind hash table entries: 4096 (order: 3, 32768 bytes) TCP: Hash tables configured (established 4096 bind 4096) TCP: reno registered UDP hash table entries: 256 (order: 1, 8192 bytes) UDP-Lite hash table entries: 256 (order: 1, 8192 bytes) msgmni has been set to 986 Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253) io scheduler noop registered io scheduler deadline registered io scheduler cfq registered (default) armada-370-pinctrl d0018000.pinctrl: registered pinctrl driver mv_xor d0060800.xor: Marvell XOR driver mv_xor d0060800.xor: Marvell XOR: ( xor cpy ) mv_xor d0060800.xor: Marvell XOR: ( xor fill cpy ) mv_xor d0060900.xor: Marvell XOR driver mv_xor d0060900.xor: Marvell XOR: ( xor cpy ) mv_xor d0060900.xor: Marvell XOR: ( xor fill cpy ) Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled d0012000.serial: ttyS0 at MMIO 0xd0012000 (irq = 17) is a 8250 console [ttyS0] enabled, bootconsole disabled console [ttyS0] enabled, bootconsole disabled e1000e: Intel(R) PRO/1000 Network Driver - 2.1.4-k e1000e: Copyright(c) 1999 - 2012 Intel Corporation. libphy: orion_mdio_bus: probed mvneta d0070000.ethernet eth0: mac: fe:4b:b2:e2:64:ca mvneta d0074000.ethernet eth1: mac: 56:dd:1a:9e:e0:aa Intel(R) Wireless WiFi driver for Linux, in-tree: Copyright(c) 2003-2012 Intel Corporation iwlwifi 0000:01:00.0: pci_enable_msi failed(0Xffffffff) iwlwifi 0000:01:00.0: loaded firmware version 8.83.5.1 build 33692 iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEBUG disabled iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEBUGFS disabled iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEVICE_TRACING disabled iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEVICE_TESTMODE disabled iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_P2P enabled iwlwifi 0000:01:00.0: Detected Intel(R) Ultimate N WiFi Link 5300 AGN, REV=0x24 iwlwifi 0000:01:00.0: L1 Disabled; Enabling L0S xhci_hcd 0000:02:00.0: xHCI Host Controller xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 1 xhci_hcd 0000:02:00.0: irq 99, io mem 0xc1100000 hub 1-0:1.0: USB hub found hub 1-0:1.0: 2 ports detected xhci_hcd 0000:02:00.0: xHCI Host Controller xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 2 hub 2-0:1.0: USB hub found hub 2-0:1.0: 2 ports detected Initializing USB Mass Storage driver... usbcore: registered new interface driver usb-storage USB Mass Storage support registered. 
mousedev: PS/2 mouse device common for all mice usbcore: registered new interface driver usbhid usbhid: USB HID core driver TCP: cubic registered VFP support v0.3: implementor 56 architecture 2 part 20 variant 9 rev 6 /home/thomas/projets/linux-2.6/drivers/rtc/hctosys.c: unable to open rtc device (rtc0) Freeing init memory: 10016K Starting logging: OK Starting mdev... usb 1-1: new high-speed USB device number 2 using xhci_hcd scsi0 : usb-storage 1-1:1.0 Initializing random number generator... done. Starting network... Welcome to Buildroot buildroot login: =================================================================== Do not hesitate to ask if you need more details about this. Thanks! Thomas
Dear Andrew Murray, On Thu, 7 Feb 2013 15:51:17 +0000, Andrew Murray wrote: > > First of all, I didn't reimplement the pcibios_get_phb_of_node(), but > > instead, as Thierry Reding suggested, simply implemented the > > hw_pci.scan() function as follows: > > I've not had any time to test Thierry's solution to avoid implementing > pcibios_get_phb_of_node - but it did seem to work for him and seem correct > at the time. At least there are device tree nodes associated to PCI devices, so it looks good from this perspective. > > To me, the "struct pci_dev" representing the real PCIe devices should > > have a NULL "struct device_node" pointer, because those device are not > > represented in the DT. If this was the case, then the of_irq_map_pci() > > would go one level up in the PCI hierarchy, find the "struct pci_dev" > > that corresponds to the PCI-to-PCI bridge, which generates an laddr[] > > having a bus number a devfn value matching the interrupt-map property. > > Yes this is my current understanding. Ok. But that's not what happens: the "struct pci_dev" representing the real PCIe device *DOES* have a non-NULL struct device_node pointer. And this is what makes the entire thing fail. > I would suggest the issue isn't in the PCI/DT code. This is what I see I believe it is, because as I said above, the struct pci_dev associated to a real PCIe device should not have a struct device_node pointer, because this PCIe device has been dynamically enumerated and is therefore not part of the device tree. > with my implementation (which uses an implementation of > pcibios_get_phb_of_node) - I'd like to believe this is correct behaviour: > > - of_irq_map_pci is called for an endpoint (5:0:0), its pdev->dev.of_node > is NULL. As I don't have a representation of this endpoint in my DT the > of_irq_map_pci code proceeds to walk the fabric. I believe this should be the behavior. But this not what happens: the pdev->dev.of_node of an endpoint pci_dev is not NULL. > - Starting with the pdev for 5:0:0, of_irq_map_pci sees it has a parent > (downstream switch bridge), in this case pdev(5:0:0)->bus->self->dev.of_node > is NULL, due to this swizzling occurs and we walk the parent's bus (bus 2) > - This continues to bus 1 and then bus 0 where of_irq_map_pci realises that > it has no parent (its the host bridge). Due to the implementation of > pcibios_get_phb_of_node a of_node is produced, this is then used to construct > a lspec (for bus number 0). > > The of_irq_map_pci code stops walking as soon as it finds a function in the > tree that has a device node. This suggests that if you represent a bridge > you must also include an interrupt-map. The problem here is that you have > represented a bridge but not included a map. I understood that it walks up the PCI hierarchy, and that's fine. As I've shown in my previous e-mail, the only problem is that this pdev->dev.of_node should be NULL for the PCIe endpoint device. If it were NULL, then everything would work correctly, as I could confirmed by the hack I did in of_irq_map_pci(). > I can think of three solutions: > > 1. Something like this: > > } else { > /* We found a P2P bridge, check if it has a node */ > ppnode = pci_device_to_OF_node(ppdev); > + if (ppnode doesnt have an interrupt-map)//if (pdev->bus->number != 0) > + ppnode = NULL; > } > > 2. Remove the bridges from the DT? Or remove the map from pcie-controller and > add a map each to pcie@0,1 and pcie@1,1? > > 3. Change the mask of your map so that it doesn't care about bus numbers. 
I > have a map that looks like this: > > interrupt-map-mask = <0 0 0 7>; > interrupt-map = <0 0 0 1 ... //anything coming in on INTA > 0 0 0 2 ... //anything coming in on INTB > 0 0 0 3 ... ... > 0 0 0 4 ... ... Unfortunately, I don't quite agree with any of your three solutions. I still do believe the root problem is that pdev->dev.of_node should be NULL for the PCIe endpoints, since those devices are not probed with the Device Tree. Best regards, Thomas
Dear Thomas Petazzoni, On Thu, 7 Feb 2013 17:19:04 +0100, Thomas Petazzoni wrote: > Unfortunately, I don't quite agree with any of your three solutions. I > still do believe the root problem is that pdev->dev.of_node should be > NULL for the PCIe endpoints, since those devices are not probed with > the Device Tree. Looking more at this, the pdev->dev.of_node is in fact NULL, but the code uses the of_node of the parent PCI device. So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0 are seen as corresponding to the PCI-to-PCI bridges. Thanks, Thomas
On Thu, Feb 07, 2013 at 04:40:40PM +0000, Thomas Petazzoni wrote: > Dear Thomas Petazzoni, > > On Thu, 7 Feb 2013 17:19:04 +0100, Thomas Petazzoni wrote: > > > Unfortunately, I don't quite agree with any of your three solutions. I > > still do believe the root problem is that pdev->dev.of_node should be > > NULL for the PCIe endpoints, since those devices are not probed with > > the Device Tree. > > Looking more at this, the pdev->dev.of_node is in fact NULL, but the > code uses the of_node of the parent PCI device. > > So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0 > are seen as corresponding to the PCI-to-PCI bridges. I would suggest changing the interrupt-mask to match any bus number. (Don't forget that the secondary bus number of each of your emulated bridges will vary depending on how many devices are detected underneath each root port, assuming you don't try and partition bus numbers or use domains between ports). Andrew Murray > > Thanks, > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dear Andrew Murray, On Thu, 7 Feb 2013 16:53:47 +0000, Andrew Murray wrote: > > So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0 > > are seen as corresponding to the PCI-to-PCI bridges. > > I would suggest changing the interrupt-mask to match any bus number. (Don't > forget that the secondary bus number of each of your emulated bridges will > vary depending on how many devices are detected underneath each root port, > assuming you don't try and partition bus numbers or use domains between ports). I don't think this would work. Currently, the interrupt-map associates the interrupts with the PCI-to-PCI bridges, i.e devices 00:01, 00:02, 00:03, 00:04, 00:05, etc. The real PCIe devices themselves are at 01:00, 02:00, 03:00, 04:00, 05:00. Each of them sit on a different bus, at devfn = 0. So if I ignore the bus number, how could the PCI code find what is the matching interrupt? Thanks, Thomas
On Thu, Feb 07, 2013 at 05:14:18PM +0000, Thomas Petazzoni wrote: > Dear Andrew Murray, > > On Thu, 7 Feb 2013 16:53:47 +0000, Andrew Murray wrote: > > > > So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0 > > > are seen as corresponding to the PCI-to-PCI bridges. > > > > I would suggest changing the interrupt-mask to match any bus number. (Don't > > forget that the secondary bus number of each of your emulated bridges will > > vary depending on how many devices are detected underneath each root port, > > assuming you don't try and partition bus numbers or use domains between ports). > > I don't think this would work. Currently, the interrupt-map associates > the interrupts with the PCI-to-PCI bridges, i.e devices 00:01, 00:02, > 00:03, 00:04, 00:05, etc. > > The real PCIe devices themselves are at 01:00, 02:00, 03:00, 04:00, > 05:00. Each of them sit on a different bus, at devfn = 0. > > So if I ignore the bus number, how could the PCI code find what is the > matching interrupt? Apologies if I've missed information about your hardware in the other discussion (I've tried to keep up) - does your hardware raise a single host interrupt for each pin regardless to which bridge they come in on - or do you separate A,B,C,D host interrupts for each bridge? If you have only 4 interrupt sources for legacy interrupts then you shouldn't need to care which bus/device/function they were generated on (of_pci_map_irq takes care of this for you). During enumeration an interrupt number should be assigned to each requesting device which reflects the pin (after swizzling) which will arrive at the host bridges. That interrupt number should be shared across all devices that requested the same pin (after swizzling) - i.e. shared interrupts. So all you need to do is map A,B,C,D interrupts with the interrupt they come into the CPU. Andrew Murray > > Thanks, > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Dear Andrew Murray, On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote: > > So if I ignore the bus number, how could the PCI code find what is the > > matching interrupt? > > Apologies if I've missed information about your hardware in the other > discussion (I've tried to keep up) - does your hardware raise a single host > interrupt for each pin regardless to which bridge they come in on - or do you > separate A,B,C,D host interrupts for each bridge? There are separate A,B,C,D interrupts for each PCIe interface, and each PCIe interface is represented by an emulated PCI-to-PCI bridge. See my interrupt-map: interrupt-map = <0x0800 0 0 1 &mpic 58 0x1000 0 0 1 &mpic 59 0x1800 0 0 1 &mpic 60 0x2000 0 0 1 &mpic 61 0x2800 0 0 1 &mpic 62 0x3000 0 0 1 &mpic 63 0x3800 0 0 1 &mpic 64 0x4000 0 0 1 &mpic 65 0x4800 0 0 1 &mpic 99 0x5000 0 0 1 &mpic 103>; Here I have 10 PCIe interfaces, and therefore 10 interrupts. There is only one interrupt per PCIe interface, and for now, I don't distinguish A,B,C,D (I will do it later, it requires reading a register to know if the interrupt came from A, B, C or D, but that's a different problem). > If you have only 4 interrupt sources for legacy interrupts then you shouldn't > need to care which bus/device/function they were generated on (of_pci_map_irq > takes care of this for you). No, I have interrupts per PCIe interface, so I really need to take care of the relation between the PCIe device and the PCIe interface it is connected to. Thomas
On Thu, Feb 7, 2013 at 9:00 AM, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote: > Dear Bjorn Helgaas, > > On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote: > >> Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y? >> That should have more information about the enumeration process, >> including what we think the XHCI BARs are and the apertures leading to >> them. > > Sure, see below. > >> The PCI core assumes that we know the host bridge apertures up front, >> and I'm not sure that is true on your platform, so maybe we'll need >> some changes to accommodate that. > > In this hardware, we need to set up the address decoding windows. So > there shouldn't be any access to a PCI device memory or I/O region > until the addresses have been assigned in the PCI-to-PCI bridge. I think this is the path where the crash happens (this is the same as the backtrace you included below): mvebu_pcie_scan_bus pci_scan_root_bus pci_create_root_bus pci_scan_child_bus pci_bus_add_devices pci_bus_add_device pci_fixup_device(pci_fixup_final) quirk_usb_early_handoff # pci_fixup_final quirk_usb_handoff_xhci The problem is that we haven't assigned resources anywhere. Normally this is done by pci_bus_assign_resources() or pci_assign_unassigned_bus_resources(), but I don't think there's anything in the path above that does this. This is not really a problem in your code; it's a generic PCI core problem. pci_scan_root_bus() does everything including creating the root bus, scanning it, and adding the devices we find. At the point where we add a device (pci_bus_add_device()), it should be ready for a driver to claim it -- all resource assignment should already be done. I don't think it's completely trivial to fix this in the PCI core yet (but we're moving in that direction) because we have some boot-time ordering issues, e.g., x86 scans the root buses before we know about the address space consumed by ACPI devices, so we can't just assign the resources when we scan the bus. I think the way you'll have to fix this in the meantime is to use pci_create_root_bus() directly so you can do things in this sequence: pci_create_root_bus pci_scan_child_bus pci_bus_assign_resources pci_bus_add_devices dino_probe() is an example of similar code that does this. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote: > I think the way you'll have to fix this in the meantime is to use > pci_create_root_bus() directly so you can do things in this sequence: > > pci_create_root_bus > pci_scan_child_bus > pci_bus_assign_resources > pci_bus_add_devices > > dino_probe() is an example of similar code that does this. Erm, but isn't this problem going to strike any system that doesn't have resources assigned by the firmware? If so, a common 'pci_scan_bus_from_scratch' would at least highlight the problem and centralize the work around until it is fixed... Many embedded systems will require this.. Regards, Jason -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 07, 2013 at 06:37:43PM +0100, Thomas Petazzoni wrote: > Dear Andrew Murray, > > On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote: > > > > So if I ignore the bus number, how could the PCI code find what is the > > > matching interrupt? > > > > Apologies if I've missed information about your hardware in the other > > discussion (I've tried to keep up) - does your hardware raise a single host > > interrupt for each pin regardless to which bridge they come in on - or do you > > separate A,B,C,D host interrupts for each bridge? > > There are separate A,B,C,D interrupts for each PCIe interface, and each > PCIe interface is represented by an emulated PCI-to-PCI bridge. See my > interrupt-map: > > interrupt-map = <0x0800 0 0 1 &mpic 58 > 0x1000 0 0 1 &mpic 59 > 0x1800 0 0 1 &mpic 60 > 0x2000 0 0 1 &mpic 61 > 0x2800 0 0 1 &mpic 62 > 0x3000 0 0 1 &mpic 63 > 0x3800 0 0 1 &mpic 64 > 0x4000 0 0 1 &mpic 65 > 0x4800 0 0 1 &mpic 99 > 0x5000 0 0 1 &mpic 103>; > > Here I have 10 PCIe interfaces, and therefore 10 interrupts. > > There is only one interrupt per PCIe interface, and for now, I don't > distinguish A,B,C,D (I will do it later, it requires reading a register > to know if the interrupt came from A, B, C or D, but that's a different > problem). Right, someday you can have all 40 interrupts :) This interrupt-map stuff is ugly, I looks like it was designed as a way for the firmware to communicate the per-device interrupt assignment to the OS - using it to describe a HW setup is a bit more tortured for sure.. Frankly, I think it should not be used in PCI-E drivers *at all* beacuse there is no need for an external interrupt routing description. [1] PCI-E *only* has inband interrupt delivery, so the host driver has 100% of the information it needs to convert a INTx signal received on a link to a Linux interrupt number. So in a PCI-E world translation should look like this: - Start at the source device INTx - Traverse up bridges performing INTx swizzling according to the PCI-E spec - When the host bridge is reached call into the host driver and pass - The INTx number - The bus/device/function of last bridge traversed - The host bridge driver returns a Linux interrupt number. This *always* works for (compliant) PCI-E. An OF interrupt map is never, ever, needed. Providing a common mechanism can remove this complex OF stuff and ultimately help everyone :) However, if interrupt-map is used, I strongly suspect it should be placed on each bridge node (as Andrew mentioned), and it should be very simple: link@0 { reg = <0x800 0 0 0 0>; // Bus 0, Dev 0x10, Fn 0 interrupt-mask = <0x0 0 0 7>; interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA 0x0000 0 0 2 &mpic 58 // INTB 0x0000 0 0 3 &mpic 58 // INTC 0x0000 0 0 4 &mpic 58>; // INTD } Which hopefully says 'when you reach this bridge node, look only at the INTx pin number and convert it'. I don't think you can make interrupt-map work were you have placed it.. Jason 1 - For PCI/PCI-X the physical INTx interrupt pins on any device, anywhere in the system could by physically tied to any interrupt ingress in the system. Even if they are behind bridges. It is a total free for all. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 7, 2013 at 11:15 AM, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote: > >> I think the way you'll have to fix this in the meantime is to use >> pci_create_root_bus() directly so you can do things in this sequence: >> >> pci_create_root_bus >> pci_scan_child_bus >> pci_bus_assign_resources >> pci_bus_add_devices >> >> dino_probe() is an example of similar code that does this. > > Erm, but isn't this problem going to strike any system that doesn't > have resources assigned by the firmware? > > If so, a common 'pci_scan_bus_from_scratch' would at least highlight > the problem and centralize the work around until it is fixed... > > Many embedded systems will require this.. Yep, agreed on all counts. -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 7 February 2013 17:37, Thomas Petazzoni <thomas.petazzoni@free-electrons.com> wrote: > Dear Andrew Murray, > > On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote: > >> > So if I ignore the bus number, how could the PCI code find what is the >> > matching interrupt? >> >> Apologies if I've missed information about your hardware in the other >> discussion (I've tried to keep up) - does your hardware raise a single host >> interrupt for each pin regardless to which bridge they come in on - or do you >> separate A,B,C,D host interrupts for each bridge? > > There are separate A,B,C,D interrupts for each PCIe interface, and each > PCIe interface is represented by an emulated PCI-to-PCI bridge. See my > interrupt-map: > > interrupt-map = <0x0800 0 0 1 &mpic 58 > 0x1000 0 0 1 &mpic 59 > 0x1800 0 0 1 &mpic 60 > 0x2000 0 0 1 &mpic 61 > 0x2800 0 0 1 &mpic 62 > 0x3000 0 0 1 &mpic 63 > 0x3800 0 0 1 &mpic 64 > 0x4000 0 0 1 &mpic 65 > 0x4800 0 0 1 &mpic 99 > 0x5000 0 0 1 &mpic 103>; > > Here I have 10 PCIe interfaces, and therefore 10 interrupts. > > There is only one interrupt per PCIe interface, and for now, I don't > distinguish A,B,C,D (I will do it later, it requires reading a register > to know if the interrupt came from A, B, C or D, but that's a different > problem). > >> If you have only 4 interrupt sources for legacy interrupts then you shouldn't >> need to care which bus/device/function they were generated on (of_pci_map_irq >> takes care of this for you). > > No, I have interrupts per PCIe interface, so I really need to take care > of the relation between the PCIe device and the PCIe interface it is > connected to. In that case, I think you can create a mask that only checks for the device number and INT pin (i.e. ignore bus and function). Looking at your mask - it already does this... interrupt-map = <0x0800 0 0 1 &mpic 58 0x1000 0 0 1 &mpic 59 0x1800 0 0 1 &mpic 60 0x2000 0 0 1 &mpic 61 0x2800 0 0 1 &mpic 62 0x3000 0 0 1 &mpic 63 0x3800 0 0 1 &mpic 64 0x4000 0 0 1 &mpic 65 0x4800 0 0 1 &mpic 99 0x5000 0 0 1 &mpic 103>; I'm not sure if the device pin part of the map is correct (I always forget how this works) - but I know this would definately work: interrupt-map-mask = <0xf800 0 0 7> interrupt-map = <0x0800 0 0 1 &mpic 58 0x0800 0 0 2 &mpic 58 0x0800 0 0 3 &mpic 58 0x0800 0 0 4 &mpic 58 0x1000 0 0 1 &mpic 59 0x1000 0 0 2 &mpic 59 0x1000 0 0 3 &mpic 59 0x1000 0 0 4 &mpic 59 .... In any case, I've realized that my original suggestion of changing the map won't quite do (apologies). This is because the OF code won't even look at this map as it stops at the emulated bridge below. In addition to this type of mapping - you'll also need to investigate my solution 1 and 2. Andrew Murray > > Thomas > -- > Thomas Petazzoni, Free Electrons > Kernel, drivers, real-time and embedded Linux > development, consulting, training and support. > http://free-electrons.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-pci" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote: > On Thu, Feb 7, 2013 at 9:00 AM, Thomas Petazzoni > <thomas.petazzoni@free-electrons.com> wrote: > > Dear Bjorn Helgaas, > > > > On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote: > > > >> Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y? > >> That should have more information about the enumeration process, > >> including what we think the XHCI BARs are and the apertures leading to > >> them. > > > > Sure, see below. > > > >> The PCI core assumes that we know the host bridge apertures up front, > >> and I'm not sure that is true on your platform, so maybe we'll need > >> some changes to accommodate that. > > > > In this hardware, we need to set up the address decoding windows. So > > there shouldn't be any access to a PCI device memory or I/O region > > until the addresses have been assigned in the PCI-to-PCI bridge. > > I think this is the path where the crash happens (this is the same as > the backtrace you included below): > > mvebu_pcie_scan_bus > pci_scan_root_bus > pci_create_root_bus > pci_scan_child_bus > pci_bus_add_devices > pci_bus_add_device > pci_fixup_device(pci_fixup_final) > quirk_usb_early_handoff # pci_fixup_final > quirk_usb_handoff_xhci > > The problem is that we haven't assigned resources anywhere. Normally > this is done by pci_bus_assign_resources() or > pci_assign_unassigned_bus_resources(), but I don't think there's > anything in the path above that does this. > > This is not really a problem in your code; it's a generic PCI core > problem. pci_scan_root_bus() does everything including creating the > root bus, scanning it, and adding the devices we find. At the point > where we add a device (pci_bus_add_device()), it should be ready for a > driver to claim it -- all resource assignment should already be done. > > I don't think it's completely trivial to fix this in the PCI core yet > (but we're moving in that direction) because we have some boot-time > ordering issues, e.g., x86 scans the root buses before we know about > the address space consumed by ACPI devices, so we can't just assign > the resources when we scan the bus. > > I think the way you'll have to fix this in the meantime is to use > pci_create_root_bus() directly so you can do things in this sequence: > > pci_create_root_bus > pci_scan_child_bus > pci_bus_assign_resources > pci_bus_add_devices The last two are already done by ARM's pci_common_init(). On Tegra I've got by having a custom .scan_bus() implementation that does the first two. Back when I used the pci_scan_root_bus() we were seeing some issues with resource conflicts and such and there was a similar discussion at that time. Thomas, have you tried using the same .scan_bus() that I use on Tegra? It should be easy to port it to Marvell, the only Tegra-specific bit is how to get at the struct tegra_pcie. Thierry
On Thursday 07 February 2013, Jason Gunthorpe wrote: > On Thu, Feb 07, 2013 at 06:37:43PM +0100, Thomas Petazzoni wrote: > Right, someday you can have all 40 interrupts :) > > This interrupt-map stuff is ugly, I looks like it was designed as a > way for the firmware to communicate the per-device interrupt > assignment to the OS - using it to describe a HW setup is a bit more > tortured for sure.. It is usually used for PCI (non-PCIe) systems, where you can have multiple slots on the same bus and each slot has its own four interrupt lines routed to an upstream interrupt controller, not necessarily the same one for each slot. > Frankly, I think it should not be used in PCI-E drivers *at all* > beacuse there is no need for an external interrupt routing > description. [1] > > PCI-E *only* has inband interrupt delivery, so the host driver has > 100% of the information it needs to convert a INTx signal received on > a link to a Linux interrupt number. > > So in a PCI-E world translation should look like this: > - Start at the source device INTx > - Traverse up bridges performing INTx swizzling according to the > PCI-E spec > - When the host bridge is reached call into the host driver and pass > - The INTx number > - The bus/device/function of last bridge traversed > - The host bridge driver returns a Linux interrupt number. It's the same as on PCI, except that there you stop when you reach the slot that is physically connected to the interrupt controller. > This *always* works for (compliant) PCI-E. An OF interrupt map is > never, ever, needed. Providing a common mechanism can remove this > complex OF stuff and ultimately help everyone :) > > However, if interrupt-map is used, I strongly suspect it should be > placed on each bridge node (as Andrew mentioned), and it should be > very simple: > > link@0 { > reg = <0x800 0 0 0 0>; // Bus 0, Dev 0x10, Fn 0 > interrupt-mask = <0x0 0 0 7>; > interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA > 0x0000 0 0 2 &mpic 58 // INTB > 0x0000 0 0 3 &mpic 58 // INTC > 0x0000 0 0 4 &mpic 58>; // INTD > } The interrupt-map property only makes sense for the host bridge, not for bridges below it, which don't normally get represented in the device tree. > Jason > 1 - For PCI/PCI-X the physical INTx interrupt pins on any device, > anywhere in the system could by physically tied to any interrupt > ingress in the system. Even if they are behind bridges. It is a > total free for all. I would not be surprised to see system that have a PCIe to PCI bridge and slots behind that which are directly connected to an upstream interrupt controller. In fact, I believe I have worked on a system like that (IBM QS21 and QS22), and thankfully the interrupt-map property did let us model this correctly. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thursday 07 February 2013, Andrew Murray wrote: > I'm not sure if the device pin part of the map is correct (I always > forget how this works) - but I know this would definately work: > > interrupt-map-mask = <0xf800 0 0 7> > interrupt-map = <0x0800 0 0 1 &mpic 58 > 0x0800 0 0 2 &mpic 58 > 0x0800 0 0 3 &mpic 58 > 0x0800 0 0 4 &mpic 58 > 0x1000 0 0 1 &mpic 59 > 0x1000 0 0 2 &mpic 59 > 0x1000 0 0 3 &mpic 59 > 0x1000 0 0 4 &mpic 59 > .... If all the pins are routed to the same interrupt, you can leave the pin out of the map-mask and save a bunch of lines in the interrupt-map, at least by the spec. I have not looked at the source code to see if that is how Linux implements the lookup, but that can be fixed if necessary. Arnd -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Feb 07, 2013 at 11:25:23PM +0000, Arnd Bergmann wrote: > > link@0 { > > reg = <0x800 0 0 0 0>; // Bus 0, Dev 0x10, Fn 0 > > interrupt-mask = <0x0 0 0 7>; > > interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA > > 0x0000 0 0 2 &mpic 58 // INTB > > 0x0000 0 0 3 &mpic 58 // INTC > > 0x0000 0 0 4 &mpic 58>; // INTD > > } > > The interrupt-map property only makes sense for the host bridge, > not for bridges below it, which don't normally get represented > in the device tree. Linux scans up the PCI bus until it finds a PCI device with a matching OF node. It then constructs an interrupt map 'laddr' (ie the bus:dev.fn) for the child device of this OF node. If you don't have any DT PCI nodes then this should always fold down to doing a lookup with bus=0, and device representing the 'slot' in legacy PCI. However, as soon as you provide a node for a bridge in DT this halts the 'fold down' and goes to the interrupt-map with a device on the subordinate bus number of the bridge. This makes *lots* of sense, if you have bridges providing bus slots then you include the bridge in DT to stop the 'fold down' at that known bridge, giving you a chance to see the interrupt wiring behind the bridge. This matches the design of PCI - if you know how interrupts are hooked up then use that information, otherwise assume the INTx interrupts swizzle and search upward. This is how add-in cards with PCI bridges are supported. This behavior seems complex, but sane to me. I wouldn't change it as Andrew suggested. Thomas's problem is the presence of the static DT node for the root port bridge. Since the node is static you can't know what the runtime determined subordinate bus numbers will be, so there is no possible way to write an interrupt-map at the host bridge. Putting the map in the bridge's DT node seems elegant and correct to me - the map is describing the actual hardware - the root port bridge is actually terminating INTx from downstream devices and converting them to CPU interrupts. (FWIW discrete HT to PCIe bridges do something similar) If you imagine the case you alluded to, a PCI-E root port, connected to a PCI-E to PCI bridge, with 2 physical PCI bus slots. The interrupts for the 2 slots are routed to the CPU directly: link@0 { reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge // Match on INTx (not used since the pci-bridge doesn't create inband INTx) interrupt-mask = <0x0 0 0 7>; interrupt-map = <0x0000 0 0 1 &pic 0 // Inband INTA 0x0000 0 0 2 &pic 1 // Inband INTB .. pci_bridge@0 { reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge // Match on the device/slot and INTx pin interrupt-mask = <0x7f 0 0 7>; interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA 0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA .. } } To me, this seems to be a much more accurate description of how the hardware is constructed then trying to cram all this information into the host bridge's interrupt map. It shows clearly where inband INTA messages arriving at the root port are directed as well as where the slot by slot out-of-band interrupt wires on the PCI bus are directed. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday 08 February 2013, Jason Gunthorpe wrote:
> On Thu, Feb 07, 2013 at 11:25:23PM +0000, Arnd Bergmann wrote:
> > > link@0 {
> > > 	reg = <0x800 0 0 0 0>; // Bus 0, Dev 0x10, Fn 0
> > > 	interrupt-map-mask = <0x0 0 0 7>;
> > > 	interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA
> > > 			 0x0000 0 0 2 &mpic 58 // INTB
> > > 			 0x0000 0 0 3 &mpic 58 // INTC
> > > 			 0x0000 0 0 4 &mpic 58>; // INTD
> > > }
> >
> > The interrupt-map property only makes sense for the host bridge,
> > not for bridges below it, which don't normally get represented
> > in the device tree.
>
> Linux scans up the PCI bus until it finds a PCI device with a matching
> OF node. It then constructs an interrupt map 'laddr' (ie the
> bus:dev.fn) for the child device of this OF node.
>
> If you don't have any DT PCI nodes then this should always fold down
> to doing a lookup with bus=0, and device representing the 'slot' in
> legacy PCI.
>
> However, as soon as you provide a node for a bridge in DT this halts
> the 'fold down' and goes to the interrupt-map with a device on the
> subordinate bus number of the bridge.
>
> This makes *lots* of sense: if you have bridges providing bus slots
> then you include the bridge in DT to stop the 'fold down' at that
> known bridge, giving you a chance to see the interrupt wiring behind
> the bridge.

I would argue that it matters not so much what Linux does but what the
standard says, but it seems they both agree with you in this case:
http://www.openfirmware.org/1275/practice/imap/imap0_9d.pdf defines
that "At any level in the interrupt tree, a mapping may need to take
place between the child interrupt domain and the parent's. This is
represented by a new property called 'interrupt-map'".

> This matches the design of PCI - if you know how interrupts are hooked
> up then use that information, otherwise assume the INTx interrupts
> swizzle and search upward. This is how add-in cards with PCI bridges
> are supported.

Note that the implicit swizzling was not part of the original PCI
binding, which assumed that all devices were explicitly represented in
the device tree, and we don't normally do that any more because PCI can
be probed easily, and we cannot assume that all PCI BARs have been
correctly assigned by the firmware before the OS is booting. Having the
interrupt-map at the PCI host controller node is convenient because it
lets us define unit interrupt specifiers for devices that are not
represented in the device tree themselves.

I think the key question here is whether there is just one interrupt
domain across all bridges because the hardware requires the unit
address to be unique, or whether each PCIe port has its own unit
address space, and thereby its own interrupt domain that requires its
own interrupt-map.

If there is just one domain, we have the choice whether to have one
interrupt-map for the entire domain, or to have one interrupt-map per
PCIe port for the devices under that port. I would consider it more
logical to have a single interrupt-map for the interrupt domain,
because that is essentially what lets us describe the interrupt domain
as a whole. Of course, if each port has its own domain, we have to
have a separate interrupt-map for each one.

> Thomas's problem is the presence of the static DT node for the root
> port bridge. Since the node is static you can't know what the runtime
> determined subordinate bus numbers will be, so there is no possible
> way to write an interrupt-map at the host bridge.

Right, that is a problem if there are additional bridges. I guess we
could represent all devices on bus 0 easily because their address would
be fixed, but we can't uniquely identify anything below them.

> If you imagine the case you alluded to, a PCI-E root port, connected
> to a PCI-E to PCI bridge, with 2 physical PCI bus slots, the
> interrupts for the 2 slots are routed to the CPU directly:
>
> link@0 {
> 	reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge
>
> 	// Match on INTx (not used since the pci-bridge doesn't create inband INTx)
> 	interrupt-map-mask = <0x0 0 0 7>;
> 	interrupt-map = <0x0000 0 0 1 &pic 0 // Inband INTA
> 			 0x0000 0 0 2 &pic 1 // Inband INTB

What are these two interrupts in the example then?

> 	pci_bridge@0 {
> 		reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge

The device would be "pci@10", right?

> 		// Match on the device/slot and INTx pin
> 		interrupt-map-mask = <0x7f 0 0 7>;
> 		interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA
> 				 0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA
> 				 ..
> 	}
> }

You are accidentally matching on the register number, not the device
number here, right? The interrupt-map-mask should be <0xf800 0 0 7> to
match the device.

> To me, this seems to be a much more accurate description of how the
> hardware is constructed than trying to cram all this information into
> the host bridge's interrupt map. It shows clearly where inband INTA
> messages arriving at the root port are directed as well as where the
> slot by slot out-of-band interrupt wires on the PCI bus are directed.

Yes, I guess you're right.

	Arnd
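For reference on the mask: the first cell of a PCI unit interrupt
specifier (phys.hi) encodes the bus number in bits [23:16], the device
number in bits [15:11] and the function in bits [10:8], with the
register number in bits [7:0] - which is why <0xf800 0 0 7> selects the
device while <0x7f 0 0 7> compares register bits. A hedged sketch of
the masked match (hypothetical types; the real parsing lives in
drivers/of/irq.c):

#include <stdbool.h>
#include <stdint.h>

/* One interrupt-map entry: a masked child unit interrupt specifier
 * plus the parent interrupt it maps to.  Illustration only.
 */
struct imap_entry {
	uint32_t child[4];	/* phys.hi phys.mid phys.lo pin */
	uint32_t parent_irq;
};

static bool imap_match(const uint32_t unit[4], const uint32_t mask[4],
		       const struct imap_entry *e)
{
	for (int i = 0; i < 4; i++)
		if ((unit[i] & mask[i]) != e->child[i])
			return false;
	return true;
}

/* With interrupt-map-mask = <0xf800 0 0 7>, a device in slot 2 has
 * phys.hi = 2 << 11 = 0x1000, so only its device number and INTx pin
 * take part in the comparison; a mask of <0x7f 0 0 7> would instead
 * compare the register-number bits of phys.hi, which is the mistake
 * pointed out above.
 */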
On Sat, Feb 09, 2013 at 10:23:11PM +0000, Arnd Bergmann wrote:

> > This makes *lots* of sense: if you have bridges providing bus slots
> > then you include the bridge in DT to stop the 'fold down' at that
> > known bridge, giving you a chance to see the interrupt wiring behind
> > the bridge.
>
> I would argue that it matters not so much what Linux does but what
> the standard says, but it seems they both agree with you in this
> case: http://www.openfirmware.org/1275/practice/imap/imap0_9d.pdf
> defines that "At any level in the interrupt tree, a mapping may
> need to take place between the child interrupt domain and the
> parent's. This is represented by a new property called
> 'interrupt-map'".

Right, the standard (and Linux) allows the property in any node.

> > This matches the design of PCI - if you know how interrupts are
> > hooked up then use that information, otherwise assume the INTx
> > interrupts swizzle and search upward. This is how add-in cards
> > with PCI bridges are supported.
>
> Note that the implicit swizzling was not part of the original PCI
> binding, which assumed that all devices were explicitly represented
> in the device tree, and we don't normally do that any more because

Yes, if the tree includes all PCI devices then you don't need
interrupt-map, just place an interrupt property directly on the end
nodes. However the implicit swizzling and 'fold down' is pretty much
essential to support the PCI standards for hot plug behind bridges.

> PCI can be probed easily, and we cannot assume that all PCI BARs
> have been correctly assigned by the firmware before the OS is
> booting. Having the interrupt-map at the PCI host controller node
> is convenient because it lets us define unit interrupt specifiers
> for devices that are not represented in the device tree themselves.

Right, interrupt-map is actually only needed when the DT is incomplete,
or to support hot pluggable ports.

> I think the key question here is whether there is just one interrupt
> domain across all bridges because the hardware requires the unit
> address to be unique, or whether each PCIe port has its own unit
> address space, and thereby its own interrupt domain that requires
> its own interrupt-map.

IMHO, the interrupt domains should describe the underlying hardware,
and in many cases the HW is designed so that every PCI bus bridge has
its own downstream interrupt layout - and thus is an interrupt domain.

> > If you imagine the case you alluded to, a PCI-E root port, connected
> > to a PCI-E to PCI bridge, with 2 physical PCI bus slots, the
> > interrupts for the 2 slots are routed to the CPU directly:
> >
> > link@0 {
> > 	reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge
> >
> > 	// Match on INTx (not used since the pci-bridge doesn't create inband INTx)
> > 	interrupt-map-mask = <0x0 0 0 7>;
> > 	interrupt-map = <0x0000 0 0 1 &pic 0 // Inband INTA
> > 			 0x0000 0 0 2 &pic 1 // Inband INTB
>
> What are these two interrupts in the example then?

This shows that the HW block 'link@0' - which is a PCI Express root
port bridge - accepts inband INTx messages and converts them to CPU
interrupts pic 0/1/... Since this is a general function, and fully
self contained, it can be placed in the general SOC's dtsi.

However, the board has a hard-wired PCIe to PCI bridge with PCI slots,
and never generates inband INTx. We can then describe that chip via
the following stanza in the board specific dts:

> > pci_bridge@0 {
> > 	reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge
>
> The device would be "pci@10", right?

Probably best to use the hex version of the regs value
/* Bus 1, Dev 0x10, Fn 0 */, but nothing inspects that, right?

> > 	// Match on the device/slot and INTx pin
> > 	interrupt-map-mask = <0x7f 0 0 7>;
> > 	interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA
> > 			 0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA
> > 			 ..
> > 	}
> > }
>
> You are accidentally matching on the register number, not the device
> number here, right? The interrupt-map-mask should be <0xf800 0 0 7>
> to match the device.

Right, only match the device, ignore the bus.

There is also another variant: if the PCIe to PCI bridge provides its
own interrupt pins and converts those to inband PCIe INTx messages,
then the PCB can wire up the PCI bus slots to the bridge's INTx pins
according to some pattern and describe that pattern in DT:

pci_bridge@0 {
	reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge

	interrupt-map-mask = <0xf800 0 0 7>;
	interrupt-map = <0x00xx 0 0 1 &pci_bridge0 0 0 0 1 // Slot 0 physical INTA to inband INTA
			 0x00xx 0 0 1 &pci_bridge0 0 0 0 2 // Slot 1 physical INTA to inband INTB
			 ...
}

(minus errors, haven't tried this one, but the standard says it should
be OK)

Which would be processed as:
- pci_bridge@0 converts out-of-band interrupts into in-band interrupts
  according to its interrupt-map, and then sends those upstream.
- link@0 converts in-band interrupts into CPU interrupts according to
  its interrupt-map.

In my experience the above is a common case. Boot firmware could fold
all this down to a single interrupt map, and hide the programming of
the IOAPIC/etc from the OS, but the HW is still undertaking these
transformations.

Anyhow, it sounds like Thomas has had success using this approach, so
it works.

Cheers,
Jason
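The two-level translation described above is the same masked match
applied once per bridge, walking upward until an interrupt controller
is reached. A compact C sketch under the simplifying assumption of one
map entry per level (the real traversal, with multiple entries and
variable cell counts, lives in drivers/of/irq.c; the types here are
hypothetical):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of one bridge level's interrupt-map:
 * match the masked child specifier, rewrite it as a parent-domain
 * specifier, then hand it to the next level up (NULL at the PIC).
 */
struct imap_level {
	uint32_t mask[4];		/* interrupt-map-mask */
	uint32_t child[4];		/* pre-masked child specifier */
	uint32_t parent[4];		/* resulting parent specifier */
	const struct imap_level *up;
};

static bool resolve_intx(const struct imap_level *lvl, uint32_t spec[4])
{
	while (lvl) {
		for (int i = 0; i < 4; i++)
			if ((spec[i] & lvl->mask[i]) != lvl->child[i])
				return false;	/* no entry matched */
		for (int i = 0; i < 4; i++)	/* e.g. slot INTA -> inband INTB */
			spec[i] = lvl->parent[i];
		lvl = lvl->up;	/* pci_bridge@0 first, then link@0 */
	}
	return true;	/* spec[] now names a CPU-level interrupt */
}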
diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
new file mode 100644
index 0000000..9313e92
--- /dev/null
+++ b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
@@ -0,0 +1,175 @@
+* Marvell Armada 370/XP PCIe interfaces
+
+Mandatory properties:
+- compatible: must be "marvell,armada-370-xp-pcie"
+- status: either "disabled" or "okay"
+- #address-cells, set to <3>
+- #size-cells, set to <2>
+- #interrupt-cells, set to <1>
+- bus-range: PCI bus numbers covered
+- ranges: standard PCI-style address ranges, describing the PCIe
+  registers for each PCIe interface, and then ranges for the PCI
+  memory and I/O regions.
+- interrupt-map-mask and interrupt-map are standard PCI Device Tree
+  properties to describe the interrupts associated to each PCI
+  interface.
+
+In addition, the Device Tree node must have sub-nodes describing each
+PCIe interface, having the following mandatory properties:
+- reg: the address and size of the PCIe registers (translated
+  addresses according to the ranges property of the parent)
+- clocks: the clock associated to this PCIe interface
+- marvell,pcie-port: the physical PCIe port number
+- status: either "disabled" or "okay"
+
+and the following optional properties:
+- marvell,pcie-lane: the physical PCIe lane number, for ports having
+  multiple lanes. If this property is not found, we assume that the
+  value is 0.
+
+Example:
+
+pcie-controller {
+	compatible = "marvell,armada-370-xp-pcie";
+	status = "disabled";
+
+	#address-cells = <3>;
+	#size-cells = <2>;
+
+	bus-range = <0x00 0xff>;
+
+	ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000 /* port 0.0 registers */
+		  0x00004800 0 0xd0042000 0xd0042000 0 0x00002000 /* port 2.0 registers */
+		  0x00001000 0 0xd0044000 0xd0044000 0 0x00002000 /* port 0.1 registers */
+		  0x00001800 0 0xd0048000 0xd0048000 0 0x00002000 /* port 0.2 registers */
+		  0x00002000 0 0xd004C000 0xd004C000 0 0x00002000 /* port 0.3 registers */
+		  0x00002800 0 0xd0080000 0xd0080000 0 0x00002000 /* port 1.0 registers */
+		  0x00005000 0 0xd0082000 0xd0082000 0 0x00002000 /* port 3.0 registers */
+		  0x00003000 0 0xd0084000 0xd0084000 0 0x00002000 /* port 1.1 registers */
+		  0x00003800 0 0xd0088000 0xd0088000 0 0x00002000 /* port 1.2 registers */
+		  0x00004000 0 0xd008C000 0xd008C000 0 0x00002000 /* port 1.3 registers */
+		  0x81000000 0 0 0xc0000000 0 0x00100000 /* downstream I/O */
+		  0x82000000 0 0 0xc1000000 0 0x08000000>; /* non-prefetchable memory */
+
+	#interrupt-cells = <1>;
+	interrupt-map-mask = <0xf800 0 0 1>;
+	interrupt-map = <0x0800 0 0 1 &mpic 58
+			 0x1000 0 0 1 &mpic 59
+			 0x1800 0 0 1 &mpic 60
+			 0x2000 0 0 1 &mpic 61
+			 0x2800 0 0 1 &mpic 62
+			 0x3000 0 0 1 &mpic 63
+			 0x3800 0 0 1 &mpic 64
+			 0x4000 0 0 1 &mpic 65
+			 0x4800 0 0 1 &mpic 99
+			 0x5000 0 0 1 &mpic 103>;
+
+	pcie@0,0 {
+		device_type = "pciex";
+		reg = <0x0800 0 0xd0040000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 5>;
+		status = "disabled";
+	};
+
+	pcie@0,1 {
+		device_type = "pciex";
+		reg = <0x1000 0 0xd0044000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 6>;
+		status = "disabled";
+	};
+
+	pcie@0,2 {
+		device_type = "pciex";
+		reg = <0x1800 0 0xd0048000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 7>;
+		status = "disabled";
+	};
+
+	pcie@0,3 {
+		device_type = "pciex";
+		reg = <0x2000 0 0xd004C000 0 0xC000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 8>;
+		status = "disabled";
+	};
+
+	pcie@1,0 {
+		device_type = "pciex";
+		reg = <0x2800 0 0xd0080000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 9>;
+		status = "disabled";
+	};
+
+	pcie@1,1 {
+		device_type = "pciex";
+		reg = <0x3000 0 0xd0084000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 10>;
+		status = "disabled";
+	};
+
+	pcie@1,2 {
+		device_type = "pciex";
+		reg = <0x3800 0 0xd0088000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 11>;
+		status = "disabled";
+	};
+
+	pcie@1,3 {
+		device_type = "pciex";
+		reg = <0x4000 0 0xd008C000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 12>;
+		status = "disabled";
+	};
+	pcie@2,0 {
+		device_type = "pciex";
+		reg = <0x4800 0 0xd0042000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <2>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 26>;
+		status = "disabled";
+	};
+
+	pcie@3,0 {
+		device_type = "pciex";
+		reg = <0x5000 0 0xd0082000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <3>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 27>;
+		status = "disabled";
+	};
+};
diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig
index cc3a1af..03e15e7 100644
--- a/drivers/pci/host/Kconfig
+++ b/drivers/pci/host/Kconfig
@@ -1,4 +1,10 @@
 menu "PCI host controller drivers"
 	depends on PCI
 
+config PCI_MVEBU
+	bool "Marvell EBU PCIe controller"
+	depends on ARCH_MVEBU
+	select PCI_SW_HOST_BRIDGE
+	select PCI_SW_PCI_PCI_BRIDGE
+
 endmenu
diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
new file mode 100644
index 0000000..34d6057
--- /dev/null
+++ b/drivers/pci/host/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
+ccflags-$(CONFIG_PCI_MVEBU) += \
+	-I$(srctree)/arch/arm/plat-orion/include \
+	-I$(srctree)/arch/arm/mach-mvebu/include
diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
new file mode 100644
index 0000000..4db09e1
--- /dev/null
+++ b/drivers/pci/host/pci-mvebu.c
@@ -0,0 +1,500 @@
+/*
+ * PCIe driver for Marvell Armada 370 and Armada XP SoCs
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/clk.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/platform_device.h>
+#include <linux/of_address.h>
+#include <linux/of_pci.h>
+#include <linux/of_irq.h>
+#include <linux/of_platform.h>
+#include <plat/pcie.h>
+#include <mach/addr-map.h>
+
+/*
+ * Those are the product IDs used for the emulated PCI Host bridge and
+ * emulated PCI-to-PCI bridges. They are temporary until we get
+ * official IDs assigned.
+ */
+#define MARVELL_EMULATED_HOST_BRIDGE_ID		4141
+#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID	4242
+
+struct mvebu_pcie_port;
+
+/* Structure representing all PCIe interfaces */
+struct mvebu_pcie {
+	struct pci_sw_host_bridge bridge;
+	struct platform_device *pdev;
+	struct mvebu_pcie_port *ports;
+	struct resource io;
+	struct resource mem;
+	struct resource busn;
+	int nports;
+};
+
+/* Structure representing one PCIe interface */
+struct mvebu_pcie_port {
+	void __iomem *base;
+	spinlock_t conf_lock;
+	int haslink;
+	u32 port;
+	u32 lane;
+	int devfn;
+	struct clk *clk;
+	struct pci_sw_pci_bridge bridge;
+	struct device_node *dn;
+};
+
+static inline struct mvebu_pcie *sys_to_pcie(struct pci_sys_data *sys)
+{
+	return sys->private_data;
+}
+
+/* PCI configuration space write function */
+static int mvebu_pcie_wr_conf(struct pci_bus *bus, u32 devfn,
+			      int where, int size, u32 val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface, where the Linux
+		 * virtual bus number is equal to the hardware PCIe
+		 * interface number + 1
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int porti, ret;
+
+		porti = bus->number - 1;
+		if (porti >= pcie->nports)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		port = &pcie->ports[porti];
+
+		if (!port->haslink)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		if (PCI_SLOT(devfn) != 0)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_wr_conf_bus(port->base, bus->number - 1,
+					     PCI_DEVFN(1, PCI_FUNC(devfn)),
+					     where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Accessing the emulated PCIe devices. In the first
+		 * slot, the emulated host bridge, and in the next
+		 * slots, the PCI-to-PCI bridges that correspond to
+		 * each PCIe hardware interface
+		 */
+		if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
+			return pci_sw_host_bridge_write(&pcie->bridge, where,
+							size, val);
+		else if (PCI_SLOT(devfn) >= 1 &&
+			 PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return pci_sw_pci_bridge_write(&port->bridge, where,
+						       size, val);
+		} else {
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/* PCI configuration space read function */
+static int mvebu_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where,
+			      int size, u32 *val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface, where the Linux
+		 * virtual bus number is equal to the hardware PCIe
+		 * interface number + 1
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int porti, ret;
+
+		porti = bus->number - 1;
+		if (porti >= pcie->nports) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		port = &pcie->ports[porti];
+
+		if (!port->haslink || PCI_SLOT(devfn) != 0) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_rd_conf_bus(port->base, bus->number - 1,
+					     PCI_DEVFN(1, PCI_FUNC(devfn)),
+					     where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Accessing the emulated PCIe devices. In the first
+		 * slot, the emulated host bridge, and in the next
+		 * slots, the PCI-to-PCI bridges that correspond to
+		 * each PCIe hardware interface
+		 */
+		if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
+			return pci_sw_host_bridge_read(&pcie->bridge, where,
+						       size, val);
+		else if (PCI_SLOT(devfn) >= 1 &&
+			 PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return pci_sw_pci_bridge_read(&port->bridge, where,
+						      size, val);
+		} else {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+}
+
+static struct pci_ops mvebu_pcie_ops = {
+	.read = mvebu_pcie_rd_conf,
+	.write = mvebu_pcie_wr_conf,
+};
+
+static int __init mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(sys);
+	int i;
+
+	pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset);
+	pci_add_resource_offset(&sys->resources, &pcie->mem, sys->mem_offset);
+	pci_add_resource(&sys->resources, &pcie->busn);
+
+	pci_ioremap_io(nr * SZ_64K, pcie->io.start);
+
+	for (i = 0; i < pcie->nports; i++) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+		orion_pcie_set_local_bus_nr(port->base, i);
+		orion_pcie_setup(port->base);
+	}
+
+	return 1;
+}
+
+static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(dev->bus->sysdata);
+	struct mvebu_pcie_port *port;
+	struct of_irq oirq;
+	u32 laddr[3];
+	int ret;
+	__be32 intspec;
+
+	/*
+	 * Ignore requests related to the emulated host bridge or the
+	 * emulated pci-to-pci bridges
+	 */
+	if (!dev->bus->number)
+		return -1;
+
+	port = &pcie->ports[dev->bus->number - 1];
+
+	/*
+	 * Build an laddr array that describes the PCI device in a DT
+	 * way
+	 */
+	laddr[0] = cpu_to_be32(port->devfn << 8);
+	laddr[1] = laddr[2] = 0;
+	intspec = cpu_to_be32(pin);
+
+	ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
+	if (ret) {
+		dev_err(&pcie->pdev->dev,
+			"%s: of_irq_map_raw() failed, %d\n",
+			__func__, ret);
+		return ret;
+	}
+
+	return irq_create_of_mapping(oirq.controller, oirq.specifier,
+				     oirq.size);
+}
+
+/*
+ * For a given PCIe interface (represented by a mvebu_pcie_port
+ * structure), we read the PCI configuration space of the
+ * corresponding PCI-to-PCI bridge in order to find out which range of
+ * I/O addresses and memory addresses have been assigned to this PCIe
+ * interface. Using these informations, we set up the appropriate
+ * address decoding windows so that the physical address are actually
+ * resolved to the right PCIe interface.
+ */
+static int mvebu_pcie_window_config_port(struct mvebu_pcie *pcie,
+					 struct mvebu_pcie_port *port)
+{
+	unsigned long iobase = 0;
+	int ret;
+
+	if (port->bridge.iolimit >= port->bridge.iobase) {
+		unsigned long iolimit = 0xFFF |
+			((port->bridge.iolimit & 0xF0) << 8) |
+			(port->bridge.iolimitupper << 16);
+		iobase = ((port->bridge.iobase & 0xF0) << 8) |
+			(port->bridge.iobaseupper << 16);
+		ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						      iobase, iolimit-iobase,
+						      IORESOURCE_IO);
+		if (ret) {
+			dev_err(&pcie->pdev->dev,
+				"%s: could not alloc PCIe %d:%d window for I/O [0x%lx; 0x%lx]\n",
+				__func__, port->port, port->lane,
+				iobase, iolimit);
+			goto out_io;
+		}
+	}
+
+	if (port->bridge.memlimit >= port->bridge.membase) {
+		unsigned long membase =
+			((port->bridge.membase & 0xFFF0) << 16);
+		unsigned long memlimit =
+			((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF;
+		ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						      membase, memlimit-membase,
+						      IORESOURCE_MEM);
+		if (ret) {
+			dev_err(&pcie->pdev->dev,
+				"%s: could not alloc PCIe %d:%d window for MEM [0x%lx; 0x%lx]\n",
+				__func__, port->port, port->lane,
+				membase, memlimit);
+			goto out_mem;
+		}
+	}
+
+out_mem:
+	if (port->bridge.iolimit >= port->bridge.iobase)
+		armada_370_xp_free_pcie_window(iobase);
+out_io:
+	return ret;
+}
+
+/*
+ * Set up the address decoding windows for all PCIe interfaces.
+ */
+static int mvebu_pcie_window_config(struct mvebu_pcie *pcie)
+{
+	int i, ret;
+
+	for (i = 0; i < pcie->nports; i++) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+		if (!port->haslink)
+			continue;
+
+		ret = mvebu_pcie_window_config_port(pcie, port);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
+						 const struct resource *res,
+						 resource_size_t start,
+						 resource_size_t size,
+						 resource_size_t align)
+{
+	if (!(res->flags & IORESOURCE_IO))
+		return start;
+
+	/*
+	 * The I/O regions must be 64K aligned, because the
+	 * granularity of PCIe I/O address decoding windows is 64 K
+	 */
+	return round_up(start, SZ_64K);
+}
+
+static int mvebu_pcie_enable(struct mvebu_pcie *pcie)
+{
+	struct hw_pci hw;
+
+	memset(&hw, 0, sizeof(hw));
+
+	hw.nr_controllers = 1;
+	hw.private_data = (void **)&pcie;
+	hw.setup = mvebu_pcie_setup;
+	hw.map_irq = mvebu_pcie_map_irq;
+	hw.align_resource = mvebu_pcie_align_resource;
+	hw.ops = &mvebu_pcie_ops;
+
+	pci_common_init(&hw);
+
+	return mvebu_pcie_window_config(pcie);
+}
+
+static int __init mvebu_pcie_probe(struct platform_device *pdev)
+{
+	struct mvebu_pcie *pcie;
+	struct device_node *np = pdev->dev.of_node;
+	struct device_node *child;
+	const __be32 *range = NULL;
+	struct resource res;
+	int i, ret;
+
+	pcie = devm_kzalloc(&pdev->dev, sizeof(struct mvebu_pcie),
+			    GFP_KERNEL);
+	if (!pcie)
+		return -ENOMEM;
+
+	pcie->pdev = pdev;
+
+	pci_sw_host_bridge_init(&pcie->bridge);
+	pcie->bridge.vendor = PCI_VENDOR_ID_MARVELL;
+	pcie->bridge.device = MARVELL_EMULATED_HOST_BRIDGE_ID;
+
+	/* Get the I/O and memory ranges from DT */
+	while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
+		if (resource_type(&res) == IORESOURCE_IO) {
+			memcpy(&pcie->io, &res, sizeof(res));
+			pcie->io.name = "I/O";
+		}
+		if (resource_type(&res) == IORESOURCE_MEM) {
+			memcpy(&pcie->mem, &res, sizeof(res));
+			pcie->mem.name = "MEM";
+		}
+	}
+
+	/* Get the bus range */
+	ret = of_pci_parse_bus_range(np, &pcie->busn);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to parse bus-range property: %d\n",
+			ret);
+		return ret;
+	}
+
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		if (!of_device_is_available(child))
+			continue;
+		pcie->nports++;
+	}
+
+	pcie->ports = devm_kzalloc(&pdev->dev, pcie->nports *
+				   sizeof(struct mvebu_pcie_port),
+				   GFP_KERNEL);
+	if (!pcie->ports)
+		return -ENOMEM;
+
+	i = 0;
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+
+		if (!of_device_is_available(child))
+			continue;
+
+		if (of_property_read_u32(child, "marvell,pcie-port",
+					 &port->port)) {
+			dev_warn(&pdev->dev,
+				 "ignoring PCIe DT node, missing pcie-port property\n");
+			continue;
+		}
+
+		if (of_property_read_u32(child, "marvell,pcie-lane",
+					 &port->lane))
+			port->lane = 0;
+
+		port->devfn = of_pci_get_devfn(child);
+		if (port->devfn < 0)
+			continue;
+
+		port->base = of_iomap(child, 0);
+		if (!port->base) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot map registers\n",
+				port->port, port->lane);
+			continue;
+		}
+
+		if (orion_pcie_link_up(port->base)) {
+			port->haslink = 1;
+			dev_info(&pdev->dev, "PCIe%d.%d: link up\n",
+				 port->port, port->lane);
+		} else {
+			port->haslink = 0;
+			dev_info(&pdev->dev, "PCIe%d.%d: link down\n",
+				 port->port, port->lane);
+		}
+
+		port->clk = of_clk_get_by_name(child, NULL);
+		if (!port->clk) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot get clock\n",
+				port->port, port->lane);
+			iounmap(port->base);
+			port->haslink = 0;
+			continue;
+		}
+
+		port->dn = child;
+
+		clk_prepare_enable(port->clk);
+		spin_lock_init(&port->conf_lock);
+
+		pci_sw_pci_bridge_init(&port->bridge);
+		port->bridge.vendor = PCI_VENDOR_ID_MARVELL;
+		port->bridge.device = MARVELL_EMULATED_PCI_PCI_BRIDGE_ID;
+		port->bridge.primary_bus = 0;
+		port->bridge.secondary_bus = PCI_SLOT(port->devfn);
+		port->bridge.subordinate_bus = PCI_SLOT(port->devfn);
+
+		i++;
+	}
+
+	mvebu_pcie_enable(pcie);
+
+	return 0;
+}
+
+static const struct of_device_id mvebu_pcie_of_match_table[] = {
+	{ .compatible = "marvell,armada-370-xp-pcie", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table);
+
+static struct platform_driver mvebu_pcie_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "mvebu-pcie",
+		.of_match_table =
+		of_match_ptr(mvebu_pcie_of_match_table),
+	},
+};
+
+static int mvebu_pcie_init(void)
+{
+	return platform_driver_probe(&mvebu_pcie_driver,
+				     mvebu_pcie_probe);
+}
+
+subsys_initcall(mvebu_pcie_init);
+
+MODULE_AUTHOR("Thomas Petazzoni <thomas.petazzoni@free-electrons.com>");
+MODULE_DESCRIPTION("Marvell EBU PCIe driver");
+MODULE_LICENSE("GPL");