[v2,19/27] pci: PCIe driver for Marvell Armada 370/XP systems

Message ID 1359399397-29729-20-git-send-email-thomas.petazzoni@free-electrons.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Thomas Petazzoni Jan. 28, 2013, 6:56 p.m. UTC
This driver implements the support for the PCIe interfaces on the
Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
cover earlier families of Marvell SoCs, such as Dove, Orion and
Kirkwood.

The driver implements the hw_pci operations needed by the core ARM PCI
code to set up PCI devices and get their corresponding IRQs, and the
pci_ops operations that are used by the PCI core to read/write the
configuration space of PCI devices.

Since the PCIe interfaces of Marvell SoCs are completely separate and
not linked together in a bus, this driver sets up an emulated PCI host
bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
interface.

In addition, this driver enumerates the different PCIe slots, and for
those having a device plugged in, it sets up the necessary address
decoding windows, using the new armada_370_xp_alloc_pcie_window()
function from mach-mvebu/addr-map.c.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
---
 .../devicetree/bindings/pci/armada-370-xp-pcie.txt |  175 +++++++
 drivers/pci/host/Kconfig                           |    6 +
 drivers/pci/host/Makefile                          |    4 +
 drivers/pci/host/pci-mvebu.c                       |  500 ++++++++++++++++++++
 4 files changed, 685 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
 create mode 100644 drivers/pci/host/Makefile
 create mode 100644 drivers/pci/host/pci-mvebu.c

Comments

Stephen Warren Jan. 28, 2013, 10:21 p.m. UTC | #1
On 01/28/2013 11:56 AM, Thomas Petazzoni wrote:
> This driver implements the support for the PCIe interfaces on the
> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> cover earlier families of Marvell SoCs, such as Dove, Orion and
> Kirkwood.
> 
> The driver implements the hw_pci operations needed by the core ARM PCI
> code to set up PCI devices and get their corresponding IRQs, and the
> pci_ops operations that are used by the PCI core to read/write the
> configuration space of PCI devices.
> 
> Since the PCIe interfaces of Marvell SoCs are completely separate and
> not linked together in a bus, this driver sets up an emulated PCI host
> bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
> interface.
> 
> In addition, this driver enumerates the different PCIe slots, and for
> those having a device plugged in, it sets up the necessary address
> decoding windows, using the new armada_370_xp_alloc_pcie_window()
> function from mach-mvebu/addr-map.c.

> diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt

> +Mandatory properties:
> +- compatible: must be "marvell,armada-370-xp-pcie"
> +- status: either "disabled" or "okay"

status is a standard DT property; I certainly wouldn't expect its
presence to be mandatory (there's a defined default), nor would I expect
each device's binding to redefine this property.
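
For reference, the kernel already honors that default: of_device_is_available()
returns true when the "status" property is absent, "okay", or "ok". A minimal
sketch of the usual idiom (the port setup itself is elided):

    #include <linux/of.h>

    static void probe_available_ports(struct device_node *np)
    {
            struct device_node *child;

            for_each_child_of_node(np, child) {
                    /* Skips nodes with status = "disabled"; a missing
                     * status property counts as available. */
                    if (!of_device_is_available(child))
                            continue;
                    /* ... set up this PCIe port ... */
            }
    }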

> +In addition, the Device Tree node must have sub-nodes describing each
> +PCIe interface, having the following mandatory properties:

> +- marvell,pcie-port: the physical PCIe port number

Should the standardized cell-index property be used here instead? Or,
perhaps that property is deprecated/discouraged...

> +- status: either "disabled" or "okay"

Similar comment as above.

> diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile

> +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> +ccflags-$(CONFIG_PCI_MVEBU) += \
> +	-I$(srctree)/arch/arm/plat-orion/include \
> +	-I$(srctree)/arch/arm/mach-mvebu/include

That seems a little dangerous w.r.t. multi-platform zImage. Can the
required headers be moved out to somewhere more public to avoid this?

> diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c

> +/*
> + * Those are the product IDs used for the emulated PCI Host bridge and
> + * emulated PCI-to-PCI bridges. They are temporary until we get
> + * official IDs assigned.
> + */
> +#define MARVELL_EMULATED_HOST_BRIDGE_ID    4141
> +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242

I assume that means we can't merge this driver yet. The cover letter
mentioned a desire to merge this for 3.9; there's not much time to get
official IDs assigned, then.

> +static int mvebu_pcie_init(void)
> +{
> +	return platform_driver_probe(&mvebu_pcie_driver,
> +				     mvebu_pcie_probe);
> +}
> +
> +subsys_initcall(mvebu_pcie_init);

Why isn't that just platform_driver_register()?
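
For context, a minimal sketch of the conventional registration being suggested
(the probe routine would also have to drop its __init annotation, since the
driver core may bind it at any time):

    /* Registers at device_initcall (module_init) time and lets the
     * driver core call the probe routine. Functionally equivalent,
     * but it runs later in boot than subsys_initcall does. */
    static struct platform_driver mvebu_pcie_driver = {
            .probe = mvebu_pcie_probe,
            .driver = {
                    .owner = THIS_MODULE,
                    .name = "mvebu-pcie",
                    .of_match_table = of_match_ptr(mvebu_pcie_of_match_table),
            },
    };
    module_platform_driver(mvebu_pcie_driver);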

> +MODULE_LICENSE("GPL");

"GPL v2".
Bjorn Helgaas Jan. 29, 2013, 3:29 a.m. UTC | #2
On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> This driver implements the support for the PCIe interfaces on the
> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> cover earlier families of Marvell SoCs, such as Dove, Orion and
> Kirkwood.
>
> The driver implements the hw_pci operations needed by the core ARM PCI
> code to set up PCI devices and get their corresponding IRQs, and the
> pci_ops operations that are used by the PCI core to read/write the
> configuration space of PCI devices.
>
> Since the PCIe interfaces of Marvell SoCs are completely separate and
> not linked together in a bus, this driver sets up an emulated PCI host
> bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
> interface.

There's no Linux requirement that multiple PCIe interfaces appear to
be in the same hierarchy.  You can just use pci_scan_root_bus()
separately on each interface.  Each interface can be in its own domain
if necessary.
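
A minimal sketch of that per-interface alternative, assuming each interface
carries its own pci_ops, sysdata, and resource list (names are illustrative):

    #include <linux/pci.h>

    /* One root bus per PCIe interface, each potentially in its own
     * domain; 'resources' holds the apertures set aside for this
     * interface alone. */
    static struct pci_bus *scan_one_interface(struct device *parent,
                                              struct pci_ops *ops,
                                              void *sysdata,
                                              struct list_head *resources)
    {
            struct pci_bus *bus;

            bus = pci_scan_root_bus(parent, 0, ops, sysdata, resources);
            if (!bus)
                    return NULL;
            pci_bus_add_devices(bus);
            return bus;
    }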

> In addition, this driver enumerates the different PCIe slots, and for
> those having a device plugged in, it sets up the necessary address
> decoding windows, using the new armada_370_xp_alloc_pcie_window()
> function from mach-mvebu/addr-map.c.
>
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
> ---
>  .../devicetree/bindings/pci/armada-370-xp-pcie.txt |  175 +++++++
>  drivers/pci/host/Kconfig                           |    6 +
>  drivers/pci/host/Makefile                          |    4 +
>  drivers/pci/host/pci-mvebu.c                       |  500 ++++++++++++++++++++
>  4 files changed, 685 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
>  create mode 100644 drivers/pci/host/Makefile
>  create mode 100644 drivers/pci/host/pci-mvebu.c
>
> diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
> new file mode 100644
> index 0000000..9313e92
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
> @@ -0,0 +1,175 @@
> +* Marvell Armada 370/XP PCIe interfaces
> +
> +Mandatory properties:
> +- compatible: must be "marvell,armada-370-xp-pcie"
> +- status: either "disabled" or "okay"
> +- #address-cells, set to <3>
> +- #size-cells, set to <2>
> +- #interrupt-cells, set to <1>
> +- bus-range: PCI bus numbers covered
> +- ranges: standard PCI-style address ranges, describing the PCIe
> +  registers for each PCIe interface, and then ranges for the PCI
> +  memory and I/O regions.
> +- interrupt-map-mask and interrupt-map are standard PCI Device Tree
> +  properties to describe the interrupts associated with each PCI
> +  interface.
> +
> +In addition, the Device Tree node must have sub-nodes describing each
> +PCIe interface, having the following mandatory properties:
> +- reg: the address and size of the PCIe registers (translated
> +  addresses according to the ranges property of the parent)
> +- clocks: the clock associated with this PCIe interface
> +- marvell,pcie-port: the physical PCIe port number
> +- status: either "disabled" or "okay"
> +
> +and the following optional properties:
> +- marvell,pcie-lane: the physical PCIe lane number, for ports having
> +  multiple lanes. If this property is not found, we assume that the
> +  value is 0.
> +
> +Example:
> +
> +pcie-controller {
> +       compatible = "marvell,armada-370-xp-pcie";
> +       status = "disabled";
> +
> +       #address-cells = <3>;
> +       #size-cells = <2>;
> +
> +       bus-range = <0x00 0xff>;
> +
> +       ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
> +                 0x00004800 0 0xd0042000 0xd0042000 0 0x00002000   /* port 2.0 registers */
> +                 0x00001000 0 0xd0044000 0xd0044000 0 0x00002000   /* port 0.1 registers */
> +                 0x00001800 0 0xd0048000 0xd0048000 0 0x00002000   /* port 0.2 registers */
> +                 0x00002000 0 0xd004C000 0xd004C000 0 0x00002000   /* port 0.3 registers */
> +                 0x00002800 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
> +                 0x00005000 0 0xd0082000 0xd0082000 0 0x00002000   /* port 3.0 registers */
> +                 0x00003000 0 0xd0084000 0xd0084000 0 0x00002000   /* port 1.1 registers */
> +                 0x00003800 0 0xd0088000 0xd0088000 0 0x00002000   /* port 1.2 registers */
> +                 0x00004000 0 0xd008C000 0xd008C000 0 0x00002000   /* port 1.3 registers */
> +                 0x81000000 0 0          0xc0000000 0 0x00100000   /* downstream I/O */
> +                 0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */
> +
> +       #interrupt-cells = <1>;
> +       interrupt-map-mask = <0xf800 0 0 1>;
> +       interrupt-map = <0x0800 0 0 1 &mpic 58
> +                        0x1000 0 0 1 &mpic 59
> +                        0x1800 0 0 1 &mpic 60
> +                        0x2000 0 0 1 &mpic 61
> +                        0x2800 0 0 1 &mpic 62
> +                        0x3000 0 0 1 &mpic 63
> +                        0x3800 0 0 1 &mpic 64
> +                        0x4000 0 0 1 &mpic 65
> +                        0x4800 0 0 1 &mpic 99
> +                        0x5000 0 0 1 &mpic 103>;
> +
> +       pcie@0,0 {
> +               device_type = "pciex";
> +               reg = <0x0800 0 0xd0040000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <0>;
> +               marvell,pcie-lane = <0>;
> +               clocks = <&gateclk 5>;
> +               status = "disabled";
> +       };
> +
> +       pcie@0,1 {
> +               device_type = "pciex";
> +               reg = <0x1000 0 0xd0044000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <0>;
> +               marvell,pcie-lane = <1>;
> +               clocks = <&gateclk 6>;
> +               status = "disabled";
> +       };
> +
> +       pcie@0,2 {
> +               device_type = "pciex";
> +               reg = <0x1800 0 0xd0048000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <0>;
> +               marvell,pcie-lane = <2>;
> +               clocks = <&gateclk 7>;
> +               status = "disabled";
> +       };
> +
> +       pcie@0,3 {
> +               device_type = "pciex";
> +               reg = <0x2000 0 0xd004C000 0 0xC000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <0>;
> +               marvell,pcie-lane = <3>;
> +               clocks = <&gateclk 8>;
> +               status = "disabled";
> +       };
> +
> +       pcie@1,0 {
> +               device_type = "pciex";
> +               reg = <0x2800 0 0xd0080000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <1>;
> +               marvell,pcie-lane = <0>;
> +               clocks = <&gateclk 9>;
> +               status = "disabled";
> +       };
> +
> +       pcie@1,1 {
> +               device_type = "pciex";
> +               reg = <0x3000 0 0xd0084000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <1>;
> +               marvell,pcie-lane = <1>;
> +               clocks = <&gateclk 10>;
> +               status = "disabled";
> +       };
> +
> +       pcie@1,2 {
> +               device_type = "pciex";
> +               reg = <0x3800 0 0xd0088000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <1>;
> +               marvell,pcie-lane = <2>;
> +               clocks = <&gateclk 11>;
> +               status = "disabled";
> +       };
> +
> +       pcie@1,3 {
> +               device_type = "pciex";
> +               reg = <0x4000 0 0xd008C000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <1>;
> +               marvell,pcie-lane = <3>;
> +               clocks = <&gateclk 12>;
> +               status = "disabled";
> +       };
> +       pcie@2,0 {
> +               device_type = "pciex";
> +               reg = <0x4800 0 0xd0042000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <2>;
> +               marvell,pcie-lane = <0>;
> +               clocks = <&gateclk 26>;
> +               status = "disabled";
> +       };
> +
> +       pcie@3,0 {
> +               device_type = "pciex";
> +               reg = <0x5000 0 0xd0082000 0 0x2000>;
> +               #address-cells = <3>;
> +               #size-cells = <2>;
> +               marvell,pcie-port = <3>;
> +               marvell,pcie-lane = <0>;
> +               clocks = <&gateclk 27>;
> +               status = "disabled";
> +       };
> +};
> diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig
> index cc3a1af..03e15e7 100644
> --- a/drivers/pci/host/Kconfig
> +++ b/drivers/pci/host/Kconfig
> @@ -1,4 +1,10 @@
>  menu "PCI host controller drivers"
>         depends on PCI
>
> +config PCI_MVEBU
> +       bool "Marvell EBU PCIe controller"
> +       depends on ARCH_MVEBU
> +       select PCI_SW_HOST_BRIDGE
> +       select PCI_SW_PCI_PCI_BRIDGE
> +
>  endmenu
> diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
> new file mode 100644
> index 0000000..34d6057
> --- /dev/null
> +++ b/drivers/pci/host/Makefile
> @@ -0,0 +1,4 @@
> +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> +ccflags-$(CONFIG_PCI_MVEBU) += \
> +       -I$(srctree)/arch/arm/plat-orion/include \
> +       -I$(srctree)/arch/arm/mach-mvebu/include
> diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
> new file mode 100644
> index 0000000..4db09e1
> --- /dev/null
> +++ b/drivers/pci/host/pci-mvebu.c
> @@ -0,0 +1,500 @@
> +/*
> + * PCIe driver for Marvell Armada 370 and Armada XP SoCs
> + *
> + * This file is licensed under the terms of the GNU General Public
> + * License version 2.  This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/pci.h>
> +#include <linux/clk.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/platform_device.h>
> +#include <linux/of_address.h>
> +#include <linux/of_pci.h>
> +#include <linux/of_irq.h>
> +#include <linux/of_platform.h>
> +#include <plat/pcie.h>
> +#include <mach/addr-map.h>
> +
> +/*
> + * Those are the product IDs used for the emulated PCI Host bridge and
> + * emulated PCI-to-PCI bridges. They are temporary until we get
> + * official IDs assigned.
> + */
> +#define MARVELL_EMULATED_HOST_BRIDGE_ID    4141
> +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242
> +
> +struct mvebu_pcie_port;
> +
> +/* Structure representing all PCIe interfaces */
> +struct mvebu_pcie {
> +       struct pci_sw_host_bridge bridge;
> +       struct platform_device *pdev;
> +       struct mvebu_pcie_port *ports;
> +       struct resource io;
> +       struct resource mem;
> +       struct resource busn;
> +       int nports;
> +};
> +
> +/* Structure representing one PCIe interface */
> +struct mvebu_pcie_port {
> +       void __iomem *base;
> +       spinlock_t conf_lock;
> +       int haslink;
> +       u32 port;
> +       u32 lane;
> +       int devfn;
> +       struct clk *clk;
> +       struct pci_sw_pci_bridge bridge;
> +       struct device_node *dn;
> +};
> +
> +static inline struct mvebu_pcie *sys_to_pcie(struct pci_sys_data *sys)
> +{
> +       return sys->private_data;
> +}
> +
> +/* PCI configuration space write function */
> +static int mvebu_pcie_wr_conf(struct pci_bus *bus, u32 devfn,
> +                             int where, int size, u32 val)
> +{
> +       struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
> +
> +       if (bus->number != 0) {
> +               /*
> +                * Accessing a real PCIe interface, where the Linux
> +                * virtual bus number is equal to the hardware PCIe
> +                * interface number + 1
> +                */

This is really weird.  It doesn't seem like a good idea to me, but I
don't understand the whole architecture.

> +               struct mvebu_pcie_port *port;
> +               unsigned long flags;
> +               int porti, ret;
> +
> +               porti = bus->number - 1;
> +               if (porti >= pcie->nports)
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +               port = &pcie->ports[porti];
> +
> +               if (!port->haslink)
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +               if (PCI_SLOT(devfn) != 0)
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +
> +               spin_lock_irqsave(&port->conf_lock, flags);
> +               ret = orion_pcie_wr_conf_bus(port->base, bus->number - 1,
> +                                            PCI_DEVFN(1, PCI_FUNC(devfn)),
> +                                            where, size, val);
> +               spin_unlock_irqrestore(&port->conf_lock, flags);
> +
> +               return ret;
> +       } else {
> +               /*
> +                * Accessing the emulated PCIe devices. In the first
> +                * slot, the emulated host bridge, and in the next
> +                * slots, the PCI-to-PCI bridges that correspond to
> +                * each PCIe hardware interface
> +                */
> +               if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
> +                       return pci_sw_host_bridge_write(&pcie->bridge, where,
> +                                                       size, val);
> +               else if (PCI_SLOT(devfn) >= 1 &&
> +                        PCI_SLOT(devfn) <= pcie->nports) {
> +                       struct mvebu_pcie_port *port;
> +                       int porti = PCI_SLOT(devfn) - 1;
> +                       port = &pcie->ports[porti];
> +                       return pci_sw_pci_bridge_write(&port->bridge, where,
> +                                                      size, val);
> +               } else {
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +               }
> +       }
> +
> +       return PCIBIOS_SUCCESSFUL;
> +}
> +
> +/* PCI configuration space read function */
> +static int mvebu_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where,
> +                             int size, u32 *val)
> +{
> +       struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
> +
> +       if (bus->number != 0) {
> +               /*
> +                * Accessing a real PCIe interface, where the Linux
> +                * virtual bus number is equal to the hardware PCIe
> +                * interface number + 1
> +                */
> +               struct mvebu_pcie_port *port;
> +               unsigned long flags;
> +               int porti, ret;
> +
> +               porti = bus->number - 1;
> +               if (porti >= pcie->nports) {
> +                       *val = 0xffffffff;
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +               }
> +
> +               port = &pcie->ports[porti];
> +
> +               if (!port->haslink || PCI_SLOT(devfn) != 0) {
> +                       *val = 0xffffffff;
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +               }
> +
> +               spin_lock_irqsave(&port->conf_lock, flags);
> +               ret = orion_pcie_rd_conf_bus(port->base, bus->number - 1,
> +                                            PCI_DEVFN(1, PCI_FUNC(devfn)),
> +                                            where, size, val);
> +               spin_unlock_irqrestore(&port->conf_lock, flags);
> +
> +               return ret;
> +       } else {
> +               /*
> +                * Accessing the emulated PCIe devices. In the first
> +                * slot, the emulated host bridge, and in the next
> +                * slots, the PCI-to-PCI bridges that correspond to
> +                * each PCIe hardware interface
> +                */
> +               if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
> +                       return pci_sw_host_bridge_read(&pcie->bridge, where,
> +                                                      size, val);
> +               else if (PCI_SLOT(devfn) >= 1 &&
> +                        PCI_SLOT(devfn) <= pcie->nports) {
> +                       struct mvebu_pcie_port *port;
> +                       int porti = PCI_SLOT(devfn) - 1;
> +                       port = &pcie->ports[porti];
> +                       return pci_sw_pci_bridge_read(&port->bridge, where,
> +                                                     size, val);
> +               } else {
> +                       *val = 0xffffffff;
> +                       return PCIBIOS_DEVICE_NOT_FOUND;
> +               }
> +       }
> +}
> +
> +static struct pci_ops mvebu_pcie_ops = {
> +       .read = mvebu_pcie_rd_conf,
> +       .write = mvebu_pcie_wr_conf,
> +};
> +
> +static int __init mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
> +{
> +       struct mvebu_pcie *pcie = sys_to_pcie(sys);
> +       int i;
> +
> +       pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset);
> +       pci_add_resource_offset(&sys->resources, &pcie->mem, sys->mem_offset);
> +       pci_add_resource(&sys->resources, &pcie->busn);
> +
> +       pci_ioremap_io(nr * SZ_64K, pcie->io.start);
> +
> +       for (i = 0; i < pcie->nports; i++) {
> +               struct mvebu_pcie_port *port = &pcie->ports[i];
> +               orion_pcie_set_local_bus_nr(port->base, i);
> +               orion_pcie_setup(port->base);
> +       }
> +
> +       return 1;
> +}
> +
> +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
> +{
> +       struct mvebu_pcie *pcie = sys_to_pcie(dev->bus->sysdata);
> +       struct mvebu_pcie_port *port;
> +       struct of_irq oirq;
> +       u32 laddr[3];
> +       int ret;
> +       __be32 intspec;
> +
> +       /*
> +        * Ignore requests related to the emulated host bridge or the
> +        * emulated pci-to-pci bridges
> +        */
> +       if (!dev->bus->number)
> +               return -1;
> +
> +       port = &pcie->ports[dev->bus->number - 1];
> +
> +       /*
> +        * Build an laddr array that describes the PCI device in a DT
> +        * way
> +        */
> +       laddr[0] = cpu_to_be32(port->devfn << 8);
> +       laddr[1] = laddr[2] = 0;
> +       intspec = cpu_to_be32(pin);
> +
> +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> +       if (ret) {
> +               dev_err(&pcie->pdev->dev,
> +                       "%s: of_irq_map_raw() failed, %d\n",
> +                       __func__, ret);
> +               return ret;
> +       }
> +
> +       return irq_create_of_mapping(oirq.controller, oirq.specifier,
> +                                    oirq.size);
> +}
> +
> +/*
> + * For a given PCIe interface (represented by a mvebu_pcie_port
> + * structure), we read the PCI configuration space of the
> + * corresponding PCI-to-PCI bridge in order to find out which range of
> + * I/O addresses and memory addresses have been assigned to this PCIe
> + * interface. Using this information, we set up the appropriate
> + * address decoding windows so that the physical addresses are actually
> + * resolved to the right PCIe interface.
> + */

Are you inferring the host bridge apertures by using the resources
assigned to devices under the bridge, i.e., taking the union of all
the BARs and PCI-to-PCI bridge apertures of devices on the root bus?
If so, it would be much better to learn the host bridge apertures via
some non-PCI mechanism that corresponds to the actual hardware
configuration.  The config of devices below the host bridge gives you
hints, of course, but usually they don't consume all the available
space.

> +static int mvebu_pcie_window_config_port(struct mvebu_pcie *pcie,
> +                                        struct mvebu_pcie_port *port)
> +{
> +       unsigned long iobase = 0;
> +       int ret;
> +
> +       if (port->bridge.iolimit >= port->bridge.iobase) {
> +               unsigned long iolimit = 0xFFF |
> +                       ((port->bridge.iolimit & 0xF0) << 8) |
> +                       (port->bridge.iolimitupper << 16);
> +               iobase = ((port->bridge.iobase & 0xF0) << 8) |
> +                       (port->bridge.iobaseupper << 16);
> +               ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
> +                                                     iobase, iolimit-iobase,
> +                                                     IORESOURCE_IO);
> +               if (ret) {
> +                       dev_err(&pcie->pdev->dev,
> +                               "%s: could not alloc PCIe %d:%d window for I/O [0x%lx; 0x%lx]\n",
> +                               __func__, port->port, port->lane,
> +                               iobase, iolimit);
> +                       goto out_io;
> +               }
> +       }
> +
> +       if (port->bridge.memlimit >= port->bridge.membase) {
> +               unsigned long membase =
> +                       ((port->bridge.membase & 0xFFF0) << 16);
> +               unsigned long memlimit =
> +                       ((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF;
> +               ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
> +                                                     membase, memlimit-membase,
> +                                                     IORESOURCE_MEM);
> +               if (ret) {
> +                       dev_err(&pcie->pdev->dev,
> +                               "%s: could not alloc PCIe %d:%d window for MEM [0x%lx; 0x%lx]\n",
> +                               __func__, port->port, port->lane,
> +                               membase, memlimit);
> +                       goto out_mem;
> +               }
> +       }
> +
> +       return 0;
> +
> +out_mem:
> +       /* unwind the I/O window only on failure */
> +       if (port->bridge.iolimit >= port->bridge.iobase)
> +               armada_370_xp_free_pcie_window(iobase);
> +out_io:
> +       return ret;
> +}
> +
> +/*
> + * Set up the address decoding windows for all PCIe interfaces.
> + */
> +static int mvebu_pcie_window_config(struct mvebu_pcie *pcie)
> +{
> +       int i, ret;
> +
> +       for (i = 0; i < pcie->nports; i++) {
> +               struct mvebu_pcie_port *port = &pcie->ports[i];
> +               if (!port->haslink)
> +                       continue;
> +
> +               ret = mvebu_pcie_window_config_port(pcie, port);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> +                                                const struct resource *res,
> +                                                resource_size_t start,
> +                                                resource_size_t size,
> +                                                resource_size_t align)
> +{
> +       if (!(res->flags & IORESOURCE_IO))
> +               return start;
> +
> +       /*
> +        * The I/O regions must be 64K aligned, because the
> +        * granularity of PCIe I/O address decoding windows is 64 K
> +        */
> +       return round_up(start, SZ_64K);
> +}
> +
> +static int mvebu_pcie_enable(struct mvebu_pcie *pcie)
> +{
> +       struct hw_pci hw;
> +
> +       memset(&hw, 0, sizeof(hw));
> +
> +       hw.nr_controllers = 1;
> +       hw.private_data   = (void **)&pcie;
> +       hw.setup          = mvebu_pcie_setup;
> +       hw.map_irq        = mvebu_pcie_map_irq;
> +       hw.align_resource = mvebu_pcie_align_resource;
> +       hw.ops            = &mvebu_pcie_ops;
> +
> +       pci_common_init(&hw);
> +
> +       return mvebu_pcie_window_config(pcie);
> +}
> +
> +static int __init mvebu_pcie_probe(struct platform_device *pdev)
> +{
> +       struct mvebu_pcie *pcie;
> +       struct device_node *np = pdev->dev.of_node;
> +       struct device_node *child;
> +       const __be32 *range = NULL;
> +       struct resource res;
> +       int i, ret;
> +
> +       pcie = devm_kzalloc(&pdev->dev, sizeof(struct mvebu_pcie),
> +                           GFP_KERNEL);
> +       if (!pcie)
> +               return -ENOMEM;
> +
> +       pcie->pdev = pdev;
> +
> +       pci_sw_host_bridge_init(&pcie->bridge);
> +       pcie->bridge.vendor = PCI_VENDOR_ID_MARVELL;
> +       pcie->bridge.device = MARVELL_EMULATED_HOST_BRIDGE_ID;
> +
> +       /* Get the I/O and memory ranges from DT */
> +       while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
> +               if (resource_type(&res) == IORESOURCE_IO) {
> +                       memcpy(&pcie->io, &res, sizeof(res));
> +                       pcie->io.name = "I/O";
> +               }
> +               if (resource_type(&res) == IORESOURCE_MEM) {
> +                       memcpy(&pcie->mem, &res, sizeof(res));
> +                       pcie->mem.name = "MEM";
> +               }
> +       }
> +
> +       /* Get the bus range */
> +       ret = of_pci_parse_bus_range(np, &pcie->busn);
> +       if (ret) {
> +               dev_err(&pdev->dev, "failed to parse bus-range property: %d\n",
> +                       ret);
> +               return ret;
> +       }
> +
> +       for_each_child_of_node(pdev->dev.of_node, child) {
> +               if (!of_device_is_available(child))
> +                       continue;
> +               pcie->nports++;
> +       }
> +
> +       pcie->ports = devm_kzalloc(&pdev->dev, pcie->nports *
> +                                  sizeof(struct mvebu_pcie_port),
> +                                  GFP_KERNEL);
> +       if (!pcie->ports)
> +               return -ENOMEM;
> +
> +       i = 0;
> +       for_each_child_of_node(pdev->dev.of_node, child) {
> +               struct mvebu_pcie_port *port = &pcie->ports[i];
> +
> +               if (!of_device_is_available(child))
> +                       continue;
> +
> +               if (of_property_read_u32(child, "marvell,pcie-port",
> +                                        &port->port)) {
> +                       dev_warn(&pdev->dev,
> +                                "ignoring PCIe DT node, missing pcie-port property\n");
> +                       continue;
> +               }
> +
> +               if (of_property_read_u32(child, "marvell,pcie-lane",
> +                                        &port->lane))
> +                       port->lane = 0;
> +
> +               port->devfn = of_pci_get_devfn(child);
> +               if (port->devfn < 0)
> +                       continue;
> +
> +               port->base = of_iomap(child, 0);
> +               if (!port->base) {
> +                       dev_err(&pdev->dev, "PCIe%d.%d: cannot map registers\n",
> +                               port->port, port->lane);
> +                       continue;
> +               }
> +
> +               if (orion_pcie_link_up(port->base)) {
> +                       port->haslink = 1;
> +                       dev_info(&pdev->dev, "PCIe%d.%d: link up\n",
> +                                port->port, port->lane);
> +               } else {
> +                       port->haslink = 0;
> +                       dev_info(&pdev->dev, "PCIe%d.%d: link down\n",
> +                                port->port, port->lane);
> +               }
> +
> +               port->clk = of_clk_get_by_name(child, NULL);
> +               if (IS_ERR(port->clk)) {
> +                       dev_err(&pdev->dev, "PCIe%d.%d: cannot get clock\n",
> +                              port->port, port->lane);
> +                       iounmap(port->base);
> +                       port->haslink = 0;
> +                       continue;
> +               }
> +
> +               port->dn = child;
> +
> +               clk_prepare_enable(port->clk);
> +               spin_lock_init(&port->conf_lock);
> +
> +               pci_sw_pci_bridge_init(&port->bridge);
> +               port->bridge.vendor = PCI_VENDOR_ID_MARVELL;
> +               port->bridge.device = MARVELL_EMULATED_PCI_PCI_BRIDGE_ID;
> +               port->bridge.primary_bus = 0;
> +               port->bridge.secondary_bus = PCI_SLOT(port->devfn);
> +               port->bridge.subordinate_bus = PCI_SLOT(port->devfn);
> +
> +               i++;
> +       }
> +
> +       mvebu_pcie_enable(pcie);
> +
> +       return 0;
> +}
> +
> +static const struct of_device_id mvebu_pcie_of_match_table[] = {
> +       { .compatible = "marvell,armada-370-xp-pcie", },
> +       {},
> +};
> +MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table);
> +
> +static struct platform_driver mvebu_pcie_driver = {
> +       .driver = {
> +               .owner = THIS_MODULE,
> +               .name = "mvebu-pcie",
> +               .of_match_table =
> +                  of_match_ptr(mvebu_pcie_of_match_table),
> +       },
> +};
> +
> +static int mvebu_pcie_init(void)
> +{
> +       return platform_driver_probe(&mvebu_pcie_driver,
> +                                    mvebu_pcie_probe);
> +}
> +
> +subsys_initcall(mvebu_pcie_init);
> +
> +MODULE_AUTHOR("Thomas Petazzoni <thomas.petazzoni@free-electrons.com>");
> +MODULE_DESCRIPTION("Marvell EBU PCIe driver");
> +MODULE_LICENSE("GPL");
> --
> 1.7.9.5
>
Jason Gunthorpe Jan. 29, 2013, 5:55 a.m. UTC | #3
On Mon, Jan 28, 2013 at 08:29:24PM -0700, Bjorn Helgaas wrote:
> On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
> > This driver implements the support for the PCIe interfaces on the
> > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> > cover earlier families of Marvell SoCs, such as Dove, Orion and
> > Kirkwood.
> >
> > The driver implements the hw_pci operations needed by the core ARM PCI
> > code to set up PCI devices and get their corresponding IRQs, and the
> > pci_ops operations that are used by the PCI core to read/write the
> > configuration space of PCI devices.
> >
> > Since the PCIe interfaces of Marvell SoCs are completely separate and
> > not linked together in a bus, this driver sets up an emulated PCI host
> > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
> > interface.
> 
> There's no Linux requirement that multiple PCIe interfaces appear to
> be in the same hierarchy.  You can just use pci_scan_root_bus()
> separately on each interface.  Each interface can be in its own domain
> if necessary.

What you suggest is basically what the Marvell driver did originally.
The problem is that Linux requires a pre-assigned aperture for each
PCI domain/root bus, and these new chips have so many PCI-E ports that
they can exhaust the physical address space; they also have only a
limited internal HW resource for setting up address routing.

Thus they require resource allocation that is sensitive to the devices
present downstream.

By far the simplest solution is to merge all the physical links into a
single domain and rely on the existing PCI resource allocation code to
drive allocation of the scarce physical address space and demand-allocate
the HW routing resource (specifically, there are enough resources to
accommodate MMIO-only devices on every bus, but not enough to
accommodate MMIO and IO on every bus).

> > +/*
> > + * For a given PCIe interface (represented by a mvebu_pcie_port
> > + * structure), we read the PCI configuration space of the
> > + * corresponding PCI-to-PCI bridge in order to find out which range of
> > + * I/O addresses and memory addresses have been assigned to this PCIe
> > + * interface. Using this information, we set up the appropriate
> > + * address decoding windows so that the physical addresses are actually
> > + * resolved to the right PCIe interface.
> > + */
> 
> Are you inferring the host bridge apertures by using the resources
> assigned to devices under the bridge, i.e., taking the union of all

The flow is different: a portion of the physical address space is set
aside for use by PCI-E (via the DT), and that portion is specified in
the struct resources ultimately attached to the PCI domain for the bus
scan. You could call that the 'host bridge aperture', though it doesn't
reflect any HW configuration at all. The values come from the device
tree.

During the bus scan, the Linux core code splits up that contiguous
space and assigns it to the PCI-PCI bridges and devices under that domain.

Each physical PCI-E link on the chip is seen by Linux through the SW
emulated PCI-PCI bridge attached to bus 0. When Linux configures the
bridge windows, it triggers this code to copy that window
information from the PCI config space into non-standard internal HW
registers.

The purpose of the SW PCI-PCI bridge and this code here is to give
the Linux PCI core control over the window (MMIO,IO,busnr) assigned
to the PCI-E link.

This arrangement, with PCI-PCI bridges controlling address routing, is
part of the PCI-E standard; in this instance Marvell did not implement
the required config space in HW, so the driver is working around that
deficiency.

Other drivers, like tegra, have a similar design, but their hardware
does implement PCI-PCI bridge configuration space and does drive
address decoding through the HW PCI-PCI window registers.

Having PCI-E links be bridges, not domains/root buses, is in line with
the standard and works better with the Linux PCI resource allocator.

Jason
Thomas Petazzoni Jan. 29, 2013, 8 a.m. UTC | #4
Bjorn, Jason,

On Mon, 28 Jan 2013 22:55:08 -0700, Jason Gunthorpe wrote:

> > There's no Linux requirement that multiple PCIe interfaces appear to
> > be in the same hierarchy.  You can just use pci_scan_root_bus()
> > separately on each interface.  Each interface can be in its own domain
> > if necessary.
> 
> What you suggest is basically what the Marvell driver did originally.
> The problem is that Linux requires a pre-assigned aperture for each
> PCI domain/root bus, and these new chips have so many PCI-E ports that
> they can exhaust the physical address space; they also have only a
> limited internal HW resource for setting up address routing.
> 
> Thus they require resource allocation that is sensitive to the devices
> present downstream.
> 
> By far the simplest solution is to merge all the physical links into a
> single domain and rely on the existing PCI resource allocation code to
> drive allocation of the scarce physical address space and demand-allocate
> the HW routing resource (specifically, there are enough resources to
> accommodate MMIO-only devices on every bus, but not enough to
> accommodate MMIO and IO on every bus).
> 
> > > +/*
> > > + * For a given PCIe interface (represented by a mvebu_pcie_port
> > > + * structure), we read the PCI configuration space of the
> > > + * corresponding PCI-to-PCI bridge in order to find out which range of
> > > + * I/O addresses and memory addresses have been assigned to this PCIe
> > > + * interface. Using this information, we set up the appropriate
> > > + * address decoding windows so that the physical addresses are actually
> > > + * resolved to the right PCIe interface.
> > > + */
> > 
> > Are you inferring the host bridge apertures by using the resources
> > assigned to devices under the bridge, i.e., taking the union of all
> 
> The flow is different: a portion of the physical address space is set
> aside for use by PCI-E (via the DT), and that portion is specified in
> the struct resources ultimately attached to the PCI domain for the bus
> scan. You could call that the 'host bridge aperture', though it doesn't
> reflect any HW configuration at all. The values come from the device
> tree.
> 
> During the bus scan, the Linux core code splits up that contiguous
> space and assigns it to the PCI-PCI bridges and devices under that domain.
> 
> Each physical PCI-E link on the chip is seen by Linux through the SW
> emulated PCI-PCI bridge attached to bus 0. When Linux configures the
> bridge windows, it triggers this code to copy that window
> information from the PCI config space into non-standard internal HW
> registers.
> 
> The purpose of the SW PCI-PCI bridge and this code here is to give
> the Linux PCI core control over the window (MMIO,IO,busnr) assigned
> to the PCI-E link.
> 
> This arrangement, with PCI-PCI bridges controlling address routing, is
> part of the PCI-E standard; in this instance Marvell did not implement
> the required config space in HW, so the driver is working around that
> deficiency.
> 
> Other drivers, like tegra, have a similar design, but their hardware
> does implement PCI-PCI bridge configuration space and does drive
> address decoding through the HW PCI-PCI window registers.
> 
> Having PCI-E links be bridges, not domains/root buses, is in line with
> the standard and works better with the Linux PCI resource allocator.

Thanks a lot Jason for this explanation, I couldn't have explained it
as clearly as you did.

Bjorn, does Jason's reply answer your questions? Or do you need further
details?

Thanks!

Thomas
Thomas Petazzoni Jan. 29, 2013, 8:41 a.m. UTC | #5
Dear Stephen Warren,

On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote:

> > +Mandatory properties:
> > +- compatible: must be "marvell,armada-370-xp-pcie"
> > +- status: either "disabled" or "okay"
> 
> status is a standard DT property; I certainly wouldn't expect its
> presence to be mandatory (there's a defined default), nor would I expect
> each device's binding to redefine this property.

Ok.

> > +- marvell,pcie-port: the physical PCIe port number
> 
> Should the standardized cell-index property be used here instead? Or,
> perhaps that property is deprecated/discouraged...

The problem is that I need two identifiers, the pcie-port and
pcie-lane, and it would be strange to have one referenced as
cell-index, and the other one as marvell,pcie-lane, no? Unless of
course we can put two numbers in the cell-index property, but a quick
grep in Documentation/devicetree/bindings/ seems to indicate that all
users of cell-index use it with a single identifier.

Just tell me what to do here, I don't have a strong opinion on this.

> > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
> 
> > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> > +ccflags-$(CONFIG_PCI_MVEBU) += \
> > +	-I$(srctree)/arch/arm/plat-orion/include \
> > +	-I$(srctree)/arch/arm/mach-mvebu/include
> 
> That seems a little dangerous w.r.t. multi-platform zImage. Can the
> required headers be moved out to somewhere more public to avoid this?

Why is this dangerous for multi-platform zImage? For this specific
driver only, some SoC-specific headers are used. I don't think it
prevents another PCI driver (such as the Tegra one) from being built
into the same kernel image, no?

Also, this is kind of a temporary solution, because I can't fix all the
problems in just the PCI patch series without making it horribly large.
The reasons we need those include paths at the moment are:

 * For plat-orion/include, because of pcie.h, which provides PCI-related
   functions common to all Marvell SoCs. Ultimately, all the
   Marvell SoCs that use this common code should be migrated over to
   the DT-capable PCI driver that is submitted through this patch
   series. This will take a bit of time, and is too complex to do in
   one shot, together with the introduction of the driver itself. So
   ultimately, all the code in plat-orion/pcie.c will migrate into
   drivers/pci/host/pci-mvebu.c, and this
   plat-orion/include/plat/pcie.h file should disappear, removing the
   need for this special header path.

 * For mach-mvebu/include, because of the addr-map.h header that
   provides functions related to address decoding windows. This is also
   likely to evolve quite significantly when we make the PCI driver
   usable by the other Marvell SoC families (Kirkwood, Dove,
   Orion5x), and when this work is done, we can think of having a
   public header in include/linux that exposes the address decoding
   APIs, once it has stabilized a bit across the different Marvell SoC
   families.

So, the bottom line is: yes I know those include paths are not nice,
but I don't think they prevent multiplatform builds, and they are a
temporary solution until we convert more Marvell SoC families to this
new PCI driver.

> > +/*
> > + * Those are the product IDs used for the emulated PCI Host bridge and
> > + * emulated PCI-to-PCI bridges. They are temporary until we get
> > + * official IDs assigned.
> > + */
> > +#define MARVELL_EMULATED_HOST_BRIDGE_ID    4141
> > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242
> 
> I assume that means we can't merge this driver yet. The cover letter
> mentioned a desire to merge this for 3.9; there's not much time to get
> official IDs assigned, then.

I am working on getting real IDs assigned. For now, I'd like to work on
getting all the other issues fixed, so that only this problem remains.

> 
> > +static int mvebu_pcie_init(void)
> > +{
> > +	return platform_driver_probe(&mvebu_pcie_driver,
> > +				     mvebu_pcie_probe);
> > +}
> > +
> > +subsys_initcall(mvebu_pcie_init);
> 
> Why isn't that just platform_driver_register()?

I didn't test recently, but with my first version of the patch set,
having an initialization as late as module_init() was too late: some
PCI fixup code was being executed *before* we got the opportunity to
initialize the PCI driver, and it was crashing the kernel. I can
provide more details if you want.
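
For reference, a sketch of the initcall ordering involved (the levels are
standard; the fixup timing is as described above):

    /* Built-in initcalls run strictly in level order, so a driver
     * registered from subsys_initcall (level 4) is probed before
     * anything hooked at fs_initcall (level 5) or at
     * module_init/device_initcall (level 6). */
    subsys_initcall(mvebu_pcie_init);   /* level 4: early enough */
    /* module_init(mvebu_pcie_init);       level 6: too late here */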

> > +MODULE_LICENSE("GPL");
> 
> "GPL v2".

Sure.

Thomas
Thierry Reding Jan. 29, 2013, 9:20 a.m. UTC | #6
On Tue, Jan 29, 2013 at 09:41:43AM +0100, Thomas Petazzoni wrote:
> On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote:
[...]
> > > +static int mvebu_pcie_init(void)
> > > +{
> > > +	return platform_driver_probe(&mvebu_pcie_driver,
> > > +				     mvebu_pcie_probe);
> > > +}
> > > +
> > > +subsys_initcall(mvebu_pcie_init);
> > 
> > Why isn't that just platform_driver_register()?
> 
> I didn't test recently, but with my first version of the patch set,
> having an initialization as late as module_init() was too late: some
> PCI fixup code was being executed *before* we got the opportunity to
> initialize the PCI driver, and it was crashing the kernel. I can
> provide more details if you want.

Does this patch perhaps fix this crash?

	http://patchwork.ozlabs.org/patch/210870/

Thierry
Thomas Petazzoni Jan. 29, 2013, 9:21 a.m. UTC | #7
Dear Thierry Reding,

On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote:

> Does this patch perhaps fix this crash?
> 
> 	http://patchwork.ozlabs.org/patch/210870/

I'll test it, thanks for the notice!

Thomas
Andrew Murray Jan. 29, 2013, 1:22 p.m. UTC | #8
On Mon, Jan 28, 2013 at 06:56:28PM +0000, Thomas Petazzoni wrote:
> This driver implements the support for the PCIe interfaces on the
> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> cover earlier families of Marvell SoCs, such as Dove, Orion and
> Kirkwood.
> 
> The driver implements the hw_pci operations needed by the core ARM PCI
> code to set up PCI devices and get their corresponding IRQs, and the
> pci_ops operations that are used by the PCI core to read/write the
> configuration space of PCI devices.
> 
> Since the PCIe interfaces of Marvell SoCs are completely separate and
> not linked together in a bus, this driver sets up an emulated PCI host
> bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
> interface.
> 
> In addition, this driver enumerates the different PCIe slots, and for
> those having a device plugged in, it sets up the necessary address
> decoding windows, using the new armada_370_xp_alloc_pcie_window()
> function from mach-mvebu/addr-map.c.
> 
> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>

[snip]

> +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
> +{

[snip]

> +
> +       /*
> +        * Build an laddr array that describes the PCI device in a DT
> +        * way
> +        */
> +       laddr[0] = cpu_to_be32(port->devfn << 8);
> +       laddr[1] = laddr[2] = 0;
> +       intspec = cpu_to_be32(pin);
> +
> +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> +       if (ret) {
> +               dev_err(&pcie->pdev->dev,
> +                       "%s: of_irq_map_raw() failed, %d\n",
> +                       __func__, ret);
> +               return ret;
> +       }

Are you able to replace the above code with a call to of_irq_map_pci()? I'm not
sure which approach is better. The of_irq_map_pci() function doesn't require the
pin argument and instead uses the DT and/or performs its own pin swizzling. I
guess this means that if there are PCIe devices in the DT tree that do anything
strange with pins, it would be reflected in the IRQ you get. I've
found that you will also need to provide an implementation of
pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch).
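
A minimal sketch of the suggested variant, against the of_irq_map_pci() API of
that era (error handling is illustrative; slot and pin go unused, as in the
driver's own map_irq):

    #include <linux/of_irq.h>
    #include <linux/of_pci.h>

    static int map_irq_via_of(const struct pci_dev *dev, u8 slot, u8 pin)
    {
            struct of_irq oirq;
            int ret;

            /* Walks up the PCI hierarchy itself, applying the standard
             * swizzle wherever the DT has no explicit interrupt-map. */
            ret = of_irq_map_pci(dev, &oirq);
            if (ret)
                    return ret;

            return irq_create_of_mapping(oirq.controller, oirq.specifier,
                                         oirq.size);
    }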

> +
> +       return irq_create_of_mapping(oirq.controller, oirq.specifier,
> +                                    oirq.size);
> +}



> +static int mvebu_pcie_enable(struct mvebu_pcie *pcie)
> +{
> +       struct hw_pci hw;

[snip]

> +       pci_common_init(&hw);
> +
> +       return mvebu_pcie_window_config(pcie);
> +}
> +
> +static int __init mvebu_pcie_probe(struct platform_device *pdev)
> +{

[snip]

> +
> +       mvebu_pcie_enable(pcie);
> +
> +       return 0;
> +}
> +
> +static const struct of_device_id mvebu_pcie_of_match_table[] = {
> +       { .compatible = "marvell,armada-370-xp-pcie", },
> +       {},
> +};
> +MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table);
> +
> +static struct platform_driver mvebu_pcie_driver = {
> +       .driver = {
> +               .owner = THIS_MODULE,
> +               .name = "mvebu-pcie",
> +               .of_match_table =
> +                  of_match_ptr(mvebu_pcie_of_match_table),
> +       },
> +};
> +
> +static int mvebu_pcie_init(void)
> +{
> +       return platform_driver_probe(&mvebu_pcie_driver,
> +                                    mvebu_pcie_probe);
> +}

If you have multiple 'mvebu-pcie' in your DT then you will end up
with multiple calls to
mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init.

However, pci_common_init/pcibios_init_hw assumes it will only ever be called
once, and will thus result in trying to create multiple buses with the same
bus number. (The first root bus it creates is always zero, provided you haven't
implemented hw->scan.)

I noticed this in Thierry's patch set and posted an RFC patch which overcomes
this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would
want to include this in your patchset?

Andrew Murray

Thomas Petazzoni Jan. 29, 2013, 1:45 p.m. UTC | #9
Dear Andrew Murray,

On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote:

> > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
> > +{
> 
> [snip]
> 
> > +
> > +       /*
> > +        * Build an laddr array that describes the PCI device in a DT
> > +        * way
> > +        */
> > +       laddr[0] = cpu_to_be32(port->devfn << 8);
> > +       laddr[1] = laddr[2] = 0;
> > +       intspec = cpu_to_be32(pin);
> > +
> > +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> > +       if (ret) {
> > +               dev_err(&pcie->pdev->dev,
> > +                       "%s: of_irq_map_raw() failed, %d\n",
> > +                       __func__, ret);
> > +               return ret;
> > +       }
> 
> Are you able to replace the above code with a call to of_irq_map_pci()? I'm not
> sure which approach is better. The of_irq_map_pci() function doesn't require the
> pin argument and instead uses the DT and/or performs its own pin swizzling. I
> guess this means that if there are PCIe devices in the DT tree that do anything
> strange with pins, it would be reflected in the IRQ you get. I've
> found that you will also need to provide an implementation of
> pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch).

I did try using the of_irq_map_pci() function, but unfortunately it
didn't work. IIRC, it didn't work because none of the pci_dev structures
in my PCI tree had any 'struct device_node' associated with them, or at
least not the one that had the right pdev->bus->number and pdev->devfn.

But, I guess that your patch that implements pcibios_get_phb_of_node()
should fix this problem. I'll try this. Thanks!

> > +static int mvebu_pcie_init(void)
> > +{
> > +       return platform_driver_probe(&mvebu_pcie_driver,
> > +                                    mvebu_pcie_probe);
> > +}
> 
> If you have multiple 'mvebu-pcie' in your DT then you will end up
> with multiple calls to
> mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init.

Right. In practice, there will only ever be a single DT node, since all
PCIe interfaces are sub-nodes of the PCI controller node. But I
understand the theoretical problem.

> However, pci_common_init/pcibios_init_hw assumes it will only ever be called
> once, and will thus result in trying to create multiple buses with the same
> bus number. (The first root bus it creates is always zero, provided you haven't
> implemented hw->scan.)
> 
> I noticed this in Thierry's patch set and posted an RFC patch which overcomes
> this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would
> want to include this in your patchset?

Sure, I'll give it a test, and report if it works for me.

Thanks a lot!

Thomas
Andrew Murray Jan. 29, 2013, 2:05 p.m. UTC | #10
On Tue, Jan 29, 2013 at 01:45:22PM +0000, Thomas Petazzoni wrote:
> Dear Andrew Murray,
> 
> On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote:
> 
> > > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
> > > +{
> > 
> > [snip]
> > 
> > > +
> > > +       /*
> > > +        * Build an laddr array that describes the PCI device in a DT
> > > +        * way
> > > +        */
> > > +       laddr[0] = cpu_to_be32(port->devfn << 8);
> > > +       laddr[1] = laddr[2] = 0;
> > > +       intspec = cpu_to_be32(pin);
> > > +
> > > +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> > > +       if (ret) {
> > > +               dev_err(&pcie->pdev->dev,
> > > +                       "%s: of_irq_map_raw() failed, %d\n",
> > > +                       __func__, ret);
> > > +               return ret;
> > > +       }
> > 
> > Are you able to replace the above code with a call to of_irq_map_pci()? I'm not
> > sure which approach is better. The of_irq_map_pci() function doesn't require the
> > pin argument and instead uses the DT and/or performs its own pin swizzling. I
> > guess this means that if there are PCIe devices in the DT tree that do anything
> > strange with pins, it would be reflected in the IRQ you get. I've
> > found that you will also need to provide an implementation of
> > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch).
> 
> I did try using the of_irq_map_pci() function, but unfortunately, it
> didn't work. IIRC, it didn't work because none of the pci_dev in my PCI
> tree had any 'struct device_node' associated with them, or at least not
> the one that had the right pdev->bus->number and pdev->devfn.
> 
> But, I guess that your patch that implements pcibios_get_phb_of_node()
> should fix this problem. I'll try this. Thanks!

My bios32 patch departs slightly from your v2 04/27 patch in that it updates
hw_pci to contain a device node rather than opaque private data and my
pcibios_get_phb_of_node implementation relies on this. If you wanted to stick
with the implementation you and Thierry share then you'd have to find another
way to get to the device node from the void **private_data.
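
For illustration, a sketch of the kind of override this implies (the
dt_node field here is an assumption borrowed from the RFC, not a
mainline member of struct pci_sys_data):

	#include <linux/of.h>
	#include <linux/pci.h>
	#include <asm/mach/pci.h>

	struct device_node *pcibios_get_phb_of_node(struct pci_bus *bus)
	{
		struct pci_sys_data *sys = bus->sysdata;

		/* dt_node: hypothetical field carrying the controller node */
		return sys ? of_node_get(sys->dt_node) : NULL;
	}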
 
> > > +static int mvebu_pcie_init(void)
> > > +{
> > > +       return platform_driver_probe(&mvebu_pcie_driver,
> > > +                                    mvebu_pcie_probe);
> > > +}
> > 
> > If you have multiple 'mvebu-pcie' in your DT then you will end up
> > with multiple calls to
> > mvebu_pcie_probe/mvebu_pcie_enable/pci_common_init.
> 
> Right. In practice, there will only ever be a single DT node, since all
> PCIe interfaces are sub-nodes of the PCI controller node. But I
> understand the theoretical problem.
> 
> > However pci_common_init/pcibios_init_hw assumes it will only ever be called
> > once, and will thus result in trying to create multiple busses with the same
> > bus number. (The first root bus it creates is always zero provided you haven't
> > implemented hw->scan).
> > 
> > I noticed this in Thierry's patch set and posted an RFC patch which overcomes
> > this issue (patchwork.kernel.org/patch/2001171) and others. Perhaps you would
> > want to include this in your patchset?
> 
> Sure, I'll give it a test, and report if it works for me.
> 
> Thanks a lot!
> 
> Thomas
> -- 
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
> 

Thierry Reding Jan. 29, 2013, 2:20 p.m. UTC | #11
On Tue, Jan 29, 2013 at 02:05:22PM +0000, Andrew Murray wrote:
> On Tue, Jan 29, 2013 at 01:45:22PM +0000, Thomas Petazzoni wrote:
> > Dear Andrew Murray,
> > 
> > On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote:
> > 
> > > > +static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
> > > > +{
> > > 
> > > [snip]
> > > 
> > > > +
> > > > +       /*
> > > > +        * Build an laddr array that describes the PCI device in a DT
> > > > +        * way
> > > > +        */
> > > > +       laddr[0] = cpu_to_be32(port->devfn << 8);
> > > > +       laddr[1] = laddr[2] = 0;
> > > > +       intspec = cpu_to_be32(pin);
> > > > +
> > > > +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> > > > +       if (ret) {
> > > > +               dev_err(&pcie->pdev->dev,
> > > > +                       "%s: of_irq_map_raw() failed, %d\n",
> > > > +                       __func__, ret);
> > > > +               return ret;
> > > > +       }
> > > 
> > > Are you able to replace the above code with a call to of_irq_map_pci? I'm not
> > > sure which approach is better. The of_irq_map_pci function doesn't require the
> > > pin argument and instead uses the DT and/or performs its own pin swizzling. I
> > > guess this means that if there are PCIe devices in the DT tree that do
> > > anything strange with pins then it would be reflected in the IRQ you get. I've
> > > found that you will also need to provide an implementation of
> > > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32 patch).
> > 
> > I did try using the of_irq_map_pci() function, but unfortunately, it
> > didn't work. IIRC, it didn't work because none of the pci_dev in my PCI
> > tree had any 'struct device_node' associated with them, or at least not
> > the one that had the right pdev->bus->number and pdev->devfn.
> > 
> > But, I guess that your patch that implements pcibios_get_phb_of_node()
> > should fix this problem. I'll try this. Thanks!
> 
> My bios32 patch departs slightly from your v2 04/27 patch in that it updates
> hw_pci to contain a device node rather than opaque private data and my
> pcibios_get_phb_of_node implementation relies on this. If you wanted to stick
> with the implementation you and Thierry share then you'd have to find another
> way to get to the device node from the void **private_data.

If at all possible I think the right thing to do is reuse the generic
pcibios_get_phb_of_node() implementation. On Tegra this turned out to
require a minimal change to the DT bindings of the root port nodes to
make sure they provide the correct address in the reg property.

Thierry
Thomas Petazzoni Jan. 29, 2013, 2:29 p.m. UTC | #12
Dear Thierry Reding,

On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote:

> If at all possible I think the right thing to do is reuse the generic
> pcibios_get_phb_of_node() implementation. On Tegra this turned out to
> require a minimal change to the DT bindings of the root port nodes to
> make sure they provide the correct address in the reg property.

Could you detail the change that was needed? The DT bindings I use for
the Marvell PCIe driver are very, very similar to the ones you use for
Tegra, since I basically inspired my entire DT binding on your work.
And still, I think of_irq_map_pci() wasn't working for me.

Best regards,

Thomas
Thierry Reding Jan. 29, 2013, 3:02 p.m. UTC | #13
On Tue, Jan 29, 2013 at 03:29:37PM +0100, Thomas Petazzoni wrote:
> Dear Thierry Reding,
> 
> On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote:
> 
> > If at all possible I think the right thing to do is reuse the generic
> > pcibios_get_phb_of_node() implementation. On Tegra this turned out to
> > require a minimal change to the DT bindings of the root port nodes to
> > make sure they provide the correct address in the reg property.
> 
> Could you detail the change that was needed? The DT bindings I use for
> the Marvell PCIe driver are very, very similar to the ones you use for
> Tegra, since I basically inspired my entire DT binding on your work.
> And still, I think of_irq_map_pci() wasn't working for me.

Now that I think about it, there were a few more changes needed.
For one, the reg property of the root port nodes needs to be in the
format specified by the PCI DT binding. That is, 3 cells for the
address and 2 cells for the size.

So I end up with something like this:

	pcie-controller {
		...

		ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */
			  0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */
			  ...>;

		pci@1,0 {
			reg = <0x000800 0 0x80000000 0 0x1000>;
			...
		};

		pci@2,0 {
			reg = <0x001000 0 0x80001000 0 0x1000>;
			...
		};
	};

So what happens here is that for each root port (pci@1,0 and pci@2,0),
the reg property is translated into the parent address space via the
pcie-controller's ranges property. pci@1,0 gets the memory region
0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are
actually windows through which the configuration space of the root ports
is accessed.)

At the same time this reg property maps both devices into the PCI
address space at addresses 0:01.0 and 0:02.0 respectively.

The second change is that you can't rely on ARM's default implementation
of the bus scan operation, which calls pci_scan_root_bus(), passing in a
NULL as the struct device which acts as the bus' parent. So on Tegra I
added a custom implementation which calls pci_create_root_bus(), passing
in the struct device of the PCI host bridge, whose .of_node field will
be set to the pcie-controller node above. Incidentally this also fixed
another issue where the PCI core and ARM's pci_common_init() both
eventually end up calling pci_bus_add_devices(). I don't remember the
exact symptoms but I think this was causing resource conflicts during
the second enumeration or so.

Because a proper struct device with the correct .of_node field is passed
into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will
know how to find it by looking at bus->bridge->parent->of_node. After
that the generic matching code will search the bridge (pcie-controller)
node's children and relate them to the struct pci_dev by devfn. This is
done in pci_set_of_node() defined in drivers/pci/of.c, which calls
of_pci_find_child_device() from drivers/of/of_pci.c.
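
A minimal sketch of such a scan callback, under those assumptions (the
tegra_pcie structure, the ops pointer and the private_data usage are
illustrative here, not the actual driver code):

	static struct pci_bus *tegra_pcie_scan_bus(int nr,
						   struct pci_sys_data *sys)
	{
		struct tegra_pcie *pcie = sys->private_data;	/* assumption */
		struct pci_bus *bus;

		/* Passing the host controller's struct device is what makes
		 * bus->bridge->parent->of_node point at pcie-controller. */
		bus = pci_create_root_bus(pcie->dev, sys->busnr,
					  &tegra_pcie_ops, sys,
					  &sys->resources);
		if (!bus)
			return NULL;

		pci_scan_child_bus(bus);

		return bus;
	}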

This is quite convoluted, but I hope it helps.

Thierry
Andrew Murray Jan. 29, 2013, 3:08 p.m. UTC | #14
On Tue, Jan 29, 2013 at 03:02:01PM +0000, Thierry Reding wrote:
> On Tue, Jan 29, 2013 at 03:29:37PM +0100, Thomas Petazzoni wrote:
> > Dear Thierry Reding,
> > 
> > On Tue, 29 Jan 2013 15:20:15 +0100, Thierry Reding wrote:
> > 
> > > If at all possible I think the right thing to do is reuse the generic
> > > pcibios_get_phb_of_node() implementation. On Tegra this turned out to
> > > require a minimal change to the DT bindings of the root port nodes to
> > > make sure they provide the correct address in the reg property.
> > 
> > Could you detail the change that was needed? The DT bindings I use for
> > the Marvell PCIe driver are very, very similar to the ones you use for
> > Tegra, since I basically inspired my entire DT binding on your work.
> > And still, I think of_irq_map_pci() wasn't working for me.
> 
> Now that I think about it, there were a few more changes needed.
> For one, the reg property of the root port nodes needs to be in the
> format specified by the PCI DT binding. That is, 3 cells for the
> address and 2 cells for the size.
> 
> So I end up with something like this:
> 
> 	pcie-controller {
> 		...
> 
> 		ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */
> 			  0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */
> 			  ...>;
> 
> 		pci@1,0 {
> 			reg = <0x000800 0 0x80000000 0 0x1000>;
> 			...
> 		};
> 
> 		pci@2,0 {
> 			reg = <0x001000 0 0x80001000 0 0x1000>;
> 			...
> 		};
> 	};
> 
> So what happens here is that for each root port (pci@1,0 and pci@2,0),
> the reg property is translated into the parent address space via the
> pcie-controller's ranges property. pci@1,0 gets the memory region
> 0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are
> actually windows through which the configuration space of the root ports
> is accessed.)
> 
> At the same time this reg property maps both devices into the PCI
> address space at addresses 0:01.0 and 0:02.0 respectively.
> 
> The second change is that you can't rely on ARM's default implementation
> of the bus scan operation, which calls pci_scan_root_bus(), passing in a
> NULL as the struct device which acts as the bus' parent. So on Tegra I
> added a custom implementation which calls pci_create_root_bus(), passing
> in the struct device of the PCI host bridge, whose .of_node field will
> be set to the pcie-controller node above. Incidentally this also fixed
> another issue where the PCI core and ARM's pci_common_init() both
> eventually end up calling pci_bus_add_devices(). I don't remember the
> exact symptoms but I think this was causing resource conflicts during
> the second enumeration or so.
> 
> Because a proper struct device with the correct .of_node field is passed
> into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will
> know how to find it by looking at bus->bridge->parent->of_node. After
> that the generic matching code will search the bridge (pcie-controller)
> node's children and relate them to the struct pci_dev by devfn. This is
> done in pci_set_of_node() defined in drivers/pci/of.c, which calls
> of_pci_find_child_device() from drivers/of/of_pci.c.
> 
> This is quite convoluted, but I hope it helps.

Thanks, this is very helpful. I will see if this lets me avoid implementing
pcibios_get_phb_of_node.

Thanks,

Andrew Murray

Thomas Petazzoni Jan. 29, 2013, 3:10 p.m. UTC | #15
Dear Thierry Reding,

On Tue, 29 Jan 2013 16:02:01 +0100, Thierry Reding wrote:

> Now that I think about it, there were a few more changes needed.
> For one, the reg property of the root port nodes needs to be in the
> format specified by the PCI DT binding. That is, 3 cells for the
> address and 2 cells for the size.
> 
> So I end up with something like this:
> 
> 	pcie-controller {
> 		...
> 
> 		ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000 /* port 0 registers */
> 			  0x00001000 0 0x80001000 0x80001000 0 0x00001000 /* port 1 registers */
> 			  ...>;
> 
> 		pci@1,0 {
> 			reg = <0x000800 0 0x80000000 0 0x1000>;
> 			...
> 		};
> 
> 		pci@2,0 {
> 			reg = <0x001000 0 0x80001000 0 0x1000>;
> 			...
> 		};
> 	};
> 
> So what happens here is that for each root port (pci@1,0 and pci@2,0),
> the reg property is translated into the parent address space via the
> pcie-controller's ranges property. pci@1,0 gets the memory region
> 0x80000000-0x80000fff and pci@2,0 gets 0x80001000-0x80001fff. (These are
> actually windows through which the configuration space of the root ports
> is accessed.)
> 
> At the same time this reg property maps both devices into the PCI
> address space at addresses 0:01.0 and 0:02.0 respectively.

This part I think I've done exactly the same thing in the Marvell PCIe
DT binding.

> The second change is that you can't rely on ARM's default implementation
> of the bus scan operation, which calls pci_scan_root_bus(), passing in a
> NULL as the struct device which acts as the bus' parent. So on Tegra I
> added a custom implementation which calls pci_create_root_bus(), passing
> in the struct device of the PCI host bridge, whose .of_node field will
> be set to the pcie-controller node above. Incidentally this also fixed
> another issue where the PCI core and ARM's pci_common_init() both
> eventually end up calling pci_bus_add_devices(). I don't remember the
> exact symptoms but I think this was causing resource conflicts during
> the second enumeration or so.
> 
> Because a proper struct device with the correct .of_node field is passed
> into pci_create_root_bus(), the generic pcibios_get_phb_of_node() will
> know how to find it by looking at bus->bridge->parent->of_node. After
> that the generic matching code will search the bridge (pcie-controller)
> node's children and relate them to the struct pci_dev by devfn. This is
> done in pci_set_of_node() defined in drivers/pci/of.c, which calls
> of_pci_find_child_device() from drivers/of/of_pci.c.

This is quite certainly the part that I was missing. I'll try this and
let you know.

Thanks a lot for the lengthy, but very useful explanation!

Thomas
Bjorn Helgaas Jan. 29, 2013, 5:47 p.m. UTC | #16
On Mon, Jan 28, 2013 at 10:55 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Jan 28, 2013 at 08:29:24PM -0700, Bjorn Helgaas wrote:
>> On Mon, Jan 28, 2013 at 11:56 AM, Thomas Petazzoni
>> <thomas.petazzoni@free-electrons.com> wrote:
>> > This driver implements the support for the PCIe interfaces on the
>> > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
>> > cover earlier families of Marvell SoCs, such as Dove, Orion and
>> > Kirkwood.
>> >
>> > The driver implements the hw_pci operations needed by the core ARM PCI
>> > code to setup PCI devices and get their corresponding IRQs, and the
>> > pci_ops operations that are used by the PCI core to read/write the
>> > configuration space of PCI devices.
>> >
>> > Since the PCIe interfaces of Marvell SoCs are completely separate and
>> > not linked together in a bus, this driver sets up an emulated PCI host
>> > bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
>> > interface.
>>
>> There's no Linux requirement that multiple PCIe interfaces appear to
>> be in the same hierarchy.  You can just use pci_scan_root_bus()
>> separately on each interface.  Each interface can be in its own domain
>> if necessary.
>
> What you suggest is basically what the Marvell driver did originally,
> the problem is that Linux requires a pre-assigned aperture for each
> PCI domain/root bus, and these new chips have so many PCI-E ports that
> they can exhaust the physical address space, and also a limited
> internal HW resource for setting address routing.
>
> Thus they require resource allocation that is sensitive to the devices
> present downstream.
>
> By far the simplest solution is to merge all the physical links into a
> single domain and rely on existing PCI resource allocation code to
> drive allocation of scarce physical address space and demand allocate
> the HW routing resource (specifically there are enough resources to
> accommodate MMIO-only devices on every bus, but not enough to
> accommodate MMIO and IO on every bus).
>
>> > +/*
>> > + * For a given PCIe interface (represented by a mvebu_pcie_port
>> > + * structure), we read the PCI configuration space of the
>> > + * corresponding PCI-to-PCI bridge in order to find out which range of
>> > + * I/O addresses and memory addresses have been assigned to this PCIe
>> > + * interface. Using this information, we set up the appropriate
>> > + * address decoding windows so that the physical addresses are actually
>> > + * resolved to the right PCIe interface.
>> > + */
>>
>> Are you inferring the host bridge apertures by using the resources
>> assigned to devices under the bridge, i.e., taking the union of all
>
> The flow is different, a portion of physical address space is set
> aside for use by PCI-E (via DT) and that portion is specified in the
> struct resource's ultimately attached to the PCI domain for the bus
> scan. You could call that the 'host bridge aperture' though it doesn't
> reflect any HW configuration at all. The values come from the device
> tree.

I think I would understand this better if we had a concrete example to
talk about, say a dmesg log and corresponding lspci -v output.

As I understand it, the DT is a description of the hardware, so in
that sense, the DT can't set aside physical address space.  It can
describe what the hardware does with the address space, and I assume
that's what you mean.  Maybe the hardware isn't configurable, e.g., it
is hard-wired to route certain address ranges to PCIe?

> During the bus scan the Linux core code splits up that contiguous
> space and assigns to the PCI-PCI bridges and devices under that domain.
>
> Each physical PCI-E link on the chip is seen by Linux through the SW
> emulated PCI-PCI bridge attached to bus 0. When Linux configures the
> bridge windows it triggers this code here to copy that window
> information from the PCI config space into non-standard internal HW
> registers.
>
> The purpose of the SW PCI-PCI bridge and this code here is to give
> the Linux PCI core control over the window (MMIO,IO,busnr) assigned
> to the PCI-E link.
>
> This arrangement with PCI-PCI bridges controlling address routing is
> part of the PCI-E standard, in this instance Marvell did not implement
> the required config space in HW so the driver is working around that
> deficiency.
>
> Other drivers, like tegra have a similar design, but their hardware
> does implement PCI-PCI bridge configuration space and does drive
> address decoding through the HW PCI-PCI window registers.
>
> Having PCI-E links be bridges, not domains/root_bus's is in-line with
> the standard and works better with the Linux PCI resource allocator.
>
> Jason
Thomas Petazzoni Jan. 29, 2013, 6:14 p.m. UTC | #17
Dear Bjorn Helgaas,

On Tue, 29 Jan 2013 10:47:09 -0700, Bjorn Helgaas wrote:

> I think I would understand this better if we had a concrete example to
> talk about, say a dmesg log and corresponding lspci -v output.

Please note that the cover letter of this patch series has the lspci
-vvv output.

Best regards,

Thomas
Jason Gunthorpe Jan. 29, 2013, 6:41 p.m. UTC | #18
On Tue, Jan 29, 2013 at 10:47:09AM -0700, Bjorn Helgaas wrote:

> As I understand it, the DT is a description of the hardware, so in
> that sense, the DT can't set aside physical address space.  It can
> describe what the hardware does with the address space, and I assume
> that's what you mean.  Maybe the hardware isn't configurable, e.g., it
> is hard-wired to route certain address ranges to PCIe?

The DT is largely a description of the hardware, but when it comes to
addresses, particularly HW-programmable addresses, there is a general
expectation that the driver/bootloader will program HW address
decoders to either match the addresses given in the DT, or to new
values guided by the DT addresses.

In a real sense that means the DT also describes the physical address
map the kernel should use.

In the PCI-E case the DT PCI-E HW description includes physical
address ranges to use for the MMIO/IO/PREFETCH PCI-E interface windows
and the driver is expected to program the internal HW address decoders
based on those address ranges.

The catch is that the hardware decoders are on a link-by-link basis,
not on a root-complex basis, so the programming can only take place
once the Linux kernel has done PCI resource assignment.

So when I say set aside, I mean for instance, the PCI-E entry in DT
has 128M of physical address space marked for PCI MMIO use. The kernel
does PCI resource allocation and the HW decoders in each link will be
set to claim some portion of the 128M - based on the MMIO windows
programmed on the PCI-PCI root port bridges. The remaining part of the
128M is dead address space, not claimed by any hardware block at all.
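
To make that concrete, the set-aside could look roughly like this in the
DT (addresses are illustrative, not from an actual board file):

	pcie-controller {
		/* 128M of CPU physical address space reserved for PCI MMIO;
		 * each link's HW decoder later claims a piece of it */
		ranges = <0x82000000 0 0xe0000000 0xe0000000 0 0x08000000>;
	};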

Jason
Bjorn Helgaas Jan. 29, 2013, 7:07 p.m. UTC | #19
On Tue, Jan 29, 2013 at 11:41 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Jan 29, 2013 at 10:47:09AM -0700, Bjorn Helgaas wrote:
>
>> As I understand it, the DT is a description of the hardware, so in
>> that sense, the DT can't set aside physical address space.  It can
>> describe what the hardware does with the address space, and I assume
>> that's what you mean.  Maybe the hardware isn't configurable, e.g., it
>> is hard-wired to route certain address ranges to PCIe?
>
> The DT is largely a description of the hardware, but when it comes to
> addresses, particularly HW-programmable addresses, there is a general
> expectation that the driver/bootloader will program HW address
> decoders to either match the addresses given in the DT, or to new
> values guided by the DT addresses.
>
> In a real sense that means the DT also describes the physical address
> map the kernel should use.
>
> In the PCI-E case the DT PCI-E HW description includes physical
> address ranges to use for the MMIO/IO/PREFETCH PCI-E interface windows
> and the driver is expected to program the internal HW address decoders
> based on those address ranges.
>
> The catch is that the hardware decoders are on a link-by-link basis,
> not on a root-complex basis, so the programming can only take place
> once the Linux kernel has done PCI resource assignment.
>
> So when I say set aside, I mean for instance, the PCI-E entry in DT
> has 128M of physical address space marked for PCI MMIO use. The kernel
> does PCI resource allocation and the HW decoders in each link will be
> set to claim some portion of the 128M - based on the MMIO windows
> programmed on the PCI-PCI root port bridges. The remaining part of the
> 128M is dead address space, not claimed by any hardware block at all.

Thanks, this really helps get to the issue that the PCI core will care
about.  The root ports look like normal bridges, so the core assumes
it can manage their windows as needed, as long as the windows stay
inside the host bridge apertures that are logically upstream from the
root ports.

In your example, it sounds like the 128M should be treated as the host
bridge aperture.  Is there any reason not to do that?  It sounds like
there's no place you can actually program that 128M region into the
hardware, and you would just program pieces of that region as root
port windows.  But that should be OK from the core's perspective.
Jason Gunthorpe Jan. 29, 2013, 7:18 p.m. UTC | #20
On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
> > So when I say set aside, I mean for instance, the PCI-E entry in DT
> > has 128M of physical address space marked for PCI MMIO use. The kernel
> > does PCI resource allocation and the HW decoders in each link will be
> > set to claim some portion of the 128M - based on the MMIO windows
> > programmed on the PCI-PCI root port bridges. The remaining part of the
> > 128M is dead address space, not claimed by any hardware block at all.
> 
> Thanks, this really helps get to the issue that the PCI core will care
> about.  The root ports look like normal bridges, so the core assumes
> it can manage their windows as needed, as long as the windows stay
> inside the host bridge apertures that are logically upstream from the
> root ports.

Yes, that is basically correct. This is what the PCI-E specification
says the root complex/root port should look like and this is what some
SOC hardware implements fully in hardware. The small wrinkle with
Marvell is that the PCI-PCI bridge config space is created by the
driver since the HW does not expose a standard config space.

> In your example, it sounds like the 128M should be treated as the host
> bridge aperture.  Is there any reason not to do that?  It sounds like
> there's no place you can actually program that 128M region into the
> hardware, and you would just program pieces of that region as root
> port windows.  But that should be OK from the core's perspective.

AFAIK this is already what Thomas's driver is doing..

Regards,
Jason
Bjorn Helgaas Jan. 29, 2013, 7:38 p.m. UTC | #21
On Tue, Jan 29, 2013 at 12:18 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
>> > So when I say set aside, I mean for instance, the PCI-E entry in DT
>> > has 128M of physical address space marked for PCI MMIO use. The kernel
>> > does PCI resource allocation and the HW decoders in each link will be
>> > set to claim some portion of the 128M - based on the MMIO windows
>> > programmed on the PCI-PCI root port bridges. The remaining part of the
>> > 128M is dead address space, not claimed by any hardware block at all.
>>
>> Thanks, this really helps get to the issue that the PCI core will care
>> about.  The root ports look like normal bridges, so the core assumes
>> it can manage their windows as needed, as long as the windows stay
>> inside the host bridge apertures that are logically upstream from the
>> root ports.
>
> Yes, that is basically correct. This is what the PCI-E specification
> says the root complex/root port should look like and this is what some
> SOC hardware implements fully in hardware. The small wrinkle with
> Marvell is that the PCI-PCI bridge config space is created by the
> driver since the HW does not expose a standard config space.

Oh, so the actual *root port* itself doesn't conform to the spec?
Wow, that's worse than I expected.

Then I guess you have emulate it and make sure its config space is
complete enough and functional enough so that all the link management,
power management, AER, etc., code in the core works as well as it
would with a conforming device.

>> In your example, it sounds like the 128M should be treated as the host
>> bridge aperture.  Is there any reason not to do that?  It sounds like
>> there's no place you can actually program that 128M region into the
>> hardware, and you would just program pieces of that region as root
>> port windows.  But that should be OK from the core's perspective.
>
> AFAIK this is already what Thomas's driver is doing..
>
> Regards,
> Jason
Stephen Warren Jan. 29, 2013, 7:47 p.m. UTC | #22
On 01/29/2013 01:41 AM, Thomas Petazzoni wrote:
> Dear Stephen Warren,
> 
> On Mon, 28 Jan 2013 15:21:45 -0700, Stephen Warren wrote:
> 
>>> +Mandatory properties:
>>> +- compatible: must be "marvell,armada-370-xp-pcie"
>>> +- status: either "disabled" or "okay"
>>
>> status is a standard DT property; I certainly wouldn't expect its
>> presence to be mandatory (there's a defined default), nor would I expect
>> each device's binding to redefine this property.
> 
> Ok.
> 
>>> +- marvell,pcie-port: the physical PCIe port number
>>
>> Should the standardized cell-index property be used here instead? Or,
>> perhaps that property is deprecated/discouraged...
> 
> The problem is that I need two identifiers, the pcie-port and
> pcie-lane, and it would be strange to have one referenced as
> cell-index, and the other one as marvell,pcie-lane, no?

Yes, using a custom property for half of the information and a standard
property for the other half would be odd.

> Unless of
> course we can put two numbers in the cell-index property, but a quick
> grep in Documentation/devicetree/bindings/ seems to indicate that all
> users of cell-index use it with a single identifier.
> 
> Just tell me what to do here, I don't have a strong opinion on this.

It's probably fine as-is then. Although I wasn't sure exactly what
port/lane meant; is there some kind of mux/cross-bar between the PCIe
root ports and the physical lanes/balls/pins on the chip?

>>> diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
>>
>>> +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
>>> +ccflags-$(CONFIG_PCI_MVEBU) += \
>>> +	-I$(srctree)/arch/arm/plat-orion/include \
>>> +	-I$(srctree)/arch/arm/mach-mvebu/include
>>
>> That seems a little dangerous w.r.t. multi-platform zImage. Can the
>> required headers be moved out to somewhere more public to avoid this?
> 
> Why is this dangerous for multi-platform zImage? For this specific
> driver only, some SoC-specific headers are used. I don't think it
> prevents another PCI driver (such as the Tegra one) from being built
> into the same kernel image, no?

Aren't those ccflags applied to everything that's built by that
Makefile? If they were applied only to one .o file, it'd probably be OK,
but I don't see how that's specified.

I'm not especially bothered by reaching into the mach/plat include
directories, especially since you're well aware it needs cleaning up; I
just don't think that Tegra's PCIe driver is going to compile too well
against an Orion/mvebu header if one were to get picked up first.
Bjorn Helgaas Jan. 29, 2013, 10:27 p.m. UTC | #23
On Tue, Jan 29, 2013 at 12:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jan 29, 2013 at 12:18 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
>> On Tue, Jan 29, 2013 at 12:07:00PM -0700, Bjorn Helgaas wrote:
>>> > So when I say set aside, I mean for instance, the PCI-E entry in DT
>>> > has 128M of physical address space marked for PCI MMIO use. The kernel
>>> > does PCI resource allocation and the HW decoders in each link will be
>>> > set to claim some portion of the 128M - based on the MMIO windows
>>> > programmed on the PCI-PCI root port bridges. The remaining part of the
>>> > 128M is dead address space, not claimed by any hardware block at all.
>>>
>>> Thanks, this really helps get to the issue that the PCI core will care
>>> about.  The root ports look like normal bridges, so the core assumes
>>> it can manage their windows as needed, as long as the windows stay
>>> inside the host bridge apertures that are logically upstream from the
>>> root ports.
>>
>> Yes, that is basically correct. This is what the PCI-E specification
>> says the root complex/root port should look like and this is what some
>> SOC hardware implements fully in hardware. The small wrinkle with
>> Marvell is that the PCI-PCI bridge config space is created by the
>> driver since the HW does not expose a standard config space.
>
> Oh, so the actual *root port* itself doesn't conform to the spec?
> Wow, that's worse than I expected.
>
> Then I guess you have emulate it and make sure its config space is
> complete enough and functional enough so that all the link management,
> power management, AER, etc., code in the core works as well as it
> would with a conforming device.

I'm not sure the existing emulation in these patches is sufficient.
For example, pci_sw_pci_bridge_write() updates bridge->membase when we
write to the window register, but I don't see anything that updates
the actual hardware decoder.  That might be done in
mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
but that looks like it's only done once.  If the PCI core updates a
root port window later, I don't see where the hardware decoder will be
updated.

Maybe you're counting on the window assignments to be static?  The PCI
core doesn't guarantee anything like that, though in the absence of
hotplug I don't know any reason why it would change things.

I also forgot about the bus number munging in mvebu_pcie_rd_conf().
The PCI core can update the bridge secondary/subordinate registers.
It looks like you don't support writing to them, and the read path
(pci_sw_pci_bridge_read()) looks like it doesn't do any translation
between the hardware and Linux bus numbers.  I don't understand the
system well enough to know if this is an issue.
Jason Gunthorpe Jan. 30, 2013, 4:24 a.m. UTC | #24
On Tue, Jan 29, 2013 at 03:27:43PM -0700, Bjorn Helgaas wrote:

> I'm not sure the existing emulation in these patches is sufficient.
> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
> write to the window register, but I don't see anything that updates
> the actual hardware decoder.  That might be done in
> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
> but that looks like it's only done once.  If the PCI core updates a
> root port window later, I don't see where the hardware decoder will be
> updated.
> 
> Maybe you're counting on the window assignments to be static?  The PCI
> core doesn't guarantee anything like that, though in the absence of
> hotplug I don't know any reason why it would change things.
 
Agree..

Thomas, I think you need to directly update the Marvell hardware
registers when config writes are made to the SW bridge. If this means
it is too hard/complex to keep the code general then I'd say make it
part of the Marvell host driver.

> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> The PCI core can update the bridge secondary/subordinate registers.
> It looks like you don't support writing to them, and the read path
> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> between the hardware and Linux bus numbers.  I don't understand the
> system well enough to know if this is an issue.

I was chatting with Thomas on this subject; it looks like there is a
HW register that needs to be set to the subordinate bus number of the
bridge, which will solve this weirdness.
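
Something along these lines, presumably (a sketch; the field and helper
names are assumptions, not the actual driver):

	/* Called when the PCI core writes the bridge's PCI_PRIMARY_BUS
	 * dword (primary/secondary/subordinate bus numbers) */
	static void mvebu_sw_bridge_set_busnr(struct mvebu_pcie_port *port,
					      u32 value)
	{
		port->bridge.primary_bus     = value & 0xff;
		port->bridge.secondary_bus   = (value >> 8) & 0xff;
		port->bridge.subordinate_bus = (value >> 16) & 0xff;

		/* Mirror into the HW register mentioned above, so the link
		 * claims config cycles for its assigned bus range */
		mvebu_pcie_set_hw_bus_nr(port, port->bridge.subordinate_bus);
	}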

Jason
Thomas Petazzoni Jan. 30, 2013, 9:35 a.m. UTC | #25
Dear Bjorn Helgaas,

On Tue, 29 Jan 2013 15:27:43 -0700, Bjorn Helgaas wrote:

> I'm not sure the existing emulation in these patches is sufficient.
> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
> write to the window register, but I don't see anything that updates
> the actual hardware decoder.  That might be done in
> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
> but that looks like it's only done once.

That's correct. I currently let the Linux PCI core enumerate the
real PCIe devices, allocate the resources, and set the appropriate
values in the emulated PCI-to-PCI bridge registers. Once this is all
done, the Marvell PCIe driver looks at each PCI-to-PCI bridge, reads
the membase and iobase registers, and creates address decoding windows
so that the physical addresses assigned by the Linux PCI core actually
resolve to the right PCIe interface. This is done once and for all.
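
Roughly, that one-shot setup amounts to something like this (the field
names and the window-allocation signature are assumptions for
illustration):

	static void mvebu_pcie_window_config_port(struct mvebu_pcie_port *port)
	{
		struct mvebu_sw_pci_bridge *bridge = &port->bridge;
		phys_addr_t base;
		size_t size;

		/* Decode the MMIO window the PCI core wrote into the emulated
		 * bridge: bits 31:20 live in the 16-bit base/limit registers */
		base = ((phys_addr_t)(bridge->membase & 0xfff0)) << 16;
		size = ((((phys_addr_t)(bridge->memlimit & 0xfff0)) << 16)
			| 0xfffff) - base + 1;

		/* Route this physical range to the port's PCIe link */
		armada_370_xp_alloc_pcie_window(port->port, port->lane,
						base, size);
	}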

> If the PCI core updates a root port window later, I don't see where the hardware
> decoder will be updated.

It will not be updated.

> Maybe you're counting on the window assignments to be static?  The PCI
> core doesn't guarantee anything like that, though in the absence of
> hotplug I don't know any reason why it would change things.

Right. Is supporting hotplug a show-stopper to get this included? I
think it could be added later, if it happens to be needed, no?

I could of course do it, but the patch series is already quite large
and complicated, so if we could merge a simple, but working, version
first, and then improve on top of it when needed, it would be nice.

> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> The PCI core can update the bridge secondary/subordinate registers.
> It looks like you don't support writing to them, and the read path
> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> between the hardware and Linux bus numbers.  I don't understand the
> system well enough to know if this is an issue.

Right. Could you explain a little bit why the PCI core
could update the secondary/subordinate registers, and to what values it
sets them?

For now, I statically assign the secondary bus register value to be
X+1, where X is the number of the PCIe interface, since X=0 is reserved
for the root bus (which has the host bridge and the PCI-to-PCI
bridges).

Also, could you detail what kind of translation I should be doing between
the hardware and Linux bus numbers?

I apologize for asking so many, probably silly, questions, but I am
still learning all those internal PCI mechanisms.

Thanks,

Thomas
Russell King - ARM Linux Jan. 30, 2013, 11:32 a.m. UTC | #26
On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> +						 const struct resource *res,
> +						 resource_size_t start,
> +						 resource_size_t size,
> +						 resource_size_t align)
> +{
> +	if (!(res->flags & IORESOURCE_IO))
> +		return start;
> +
> +	/*
> +	 * The I/O regions must be 64K aligned, because the
> +	 * granularity of PCIe I/O address decoding windows is 64 K
> +	 */
> +	return round_up(start, SZ_64K);
> +}

You do realise that this will result in all PCI I/O BARs being rounded
up to 64K.

I've just been digging through the PCI code and have come across a
function - pcibios_window_alignment() - which the PCI code allows to be
overriden which allows you to increase the alignment requirement of
bridge windows.  It takes the PCI bus and window type as arguments.

I'd suggest using that, and checking whether the bus which is passed
corresponds with a bus which gives you problems, so that you don't
impose the 64K requirement on downstream bridges.
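
For reference, a minimal sketch of such an override (the weak default in
the PCI core simply returns 1; the bus check is only illustrative):

	resource_size_t pcibios_window_alignment(struct pci_bus *bus,
						 unsigned long type)
	{
		/* Only the emulated root-port bridges on the root bus need
		 * 64K-aligned I/O windows; leave downstream bridges alone */
		if ((type & IORESOURCE_IO) && bus->primary == 0)
			return SZ_64K;

		return 1;
	}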
Thomas Petazzoni Jan. 30, 2013, 11:37 a.m. UTC | #27
Russell,

On Wed, 30 Jan 2013 11:32:46 +0000, Russell King - ARM Linux wrote:

> You do realise that this will result in all PCI I/O BARs being rounded
> up to 64K.

Hum, yes, correct.

> I've just been digging through the PCI code and have come across a
> function - pcibios_window_alignment() - which the PCI code allows to be
> overriden which allows you to increase the alignment requirement of
> bridge windows.  It takes the PCI bus and window type as arguments.
> 
> I'd suggest using that, and checking whether the bus which is passed
> corresponds with a bus which gives you problems, so that you don't
> impose the 64K requirement on downstream bridges.

Seems interesting indeed, I'll look into this idea! Thanks!

Thomas
Thierry Reding Jan. 30, 2013, 12:03 p.m. UTC | #28
On Wed, Jan 30, 2013 at 11:32:46AM +0000, Russell King - ARM Linux wrote:
> On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> > +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> > +						 const struct resource *res,
> > +						 resource_size_t start,
> > +						 resource_size_t size,
> > +						 resource_size_t align)
> > +{
> > +	if (!(res->flags & IORESOURCE_IO))
> > +		return start;
> > +
> > +	/*
> > +	 * The I/O regions must be 64K aligned, because the
> > +	 * granularity of PCIe I/O address decoding windows is 64 K
> > +	 */
> > +	return round_up(start, SZ_64K);
> > +}
> 
> You do realise that this will result in all PCI I/O BARs being rounded
> up to 64K.
> 
> I've just been digging through the PCI code and have come across a
> function - pcibios_window_alignment() - which the PCI code allows to be
> overridden, which allows you to increase the alignment requirement of
> bridge windows.  It takes the PCI bus and window type as arguments.
> 
> I'd suggest using that, and checking whether the bus which is passed
> corresponds with a bus which gives you problems, so that you don't
> impose the 64K requirement on downstream bridges.

That approach isn't going to work very well with multi-platform, though,
since the function can only be overridden on a per-architecture basis.

Thierry
Thomas Petazzoni Jan. 30, 2013, 1:07 p.m. UTC | #29
Dear Thierry Reding,

On Wed, 30 Jan 2013 13:03:44 +0100, Thierry Reding wrote:

> That approach isn't going to work very well with multi-platform,
> though, since the function can only be overridden on a
> per-architecture basis.

I can do what is done for pcibios_align_resource(): put a single
implementation of the function in arch/arm/kernel/bios32.c, and make it
call a hook registered in the hw_pci structure, or even directly use a
numerical value in hw_pci, as Russell suggested earlier.
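
Something along those lines (a sketch only; the window_alignment hook in
pci_sys_data is hypothetical, mirroring the existing align_resource
plumbing):

	/* In arch/arm/kernel/bios32.c */
	resource_size_t pcibios_window_alignment(struct pci_bus *bus,
						 unsigned long type)
	{
		struct pci_sys_data *sys = bus->sysdata;

		if (sys && sys->window_alignment)	/* hypothetical hook */
			return sys->window_alignment(bus, type);

		return 1;	/* default: no additional constraint */
	}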

Best regards,

Thomas
Russell King - ARM Linux Jan. 30, 2013, 3:08 p.m. UTC | #30
On Wed, Jan 30, 2013 at 01:03:44PM +0100, Thierry Reding wrote:
> On Wed, Jan 30, 2013 at 11:32:46AM +0000, Russell King - ARM Linux wrote:
> > On Mon, Jan 28, 2013 at 07:56:28PM +0100, Thomas Petazzoni wrote:
> > > +static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
> > > +						 const struct resource *res,
> > > +						 resource_size_t start,
> > > +						 resource_size_t size,
> > > +						 resource_size_t align)
> > > +{
> > > +	if (!(res->flags & IORESOURCE_IO))
> > > +		return start;
> > > +
> > > +	/*
> > > +	 * The I/O regions must be 64K aligned, because the
> > > +	 * granularity of PCIe I/O address decoding windows is 64 K
> > > +	 */
> > > +	return round_up(start, SZ_64K);
> > > +}
> > 
> > You do realise that this will result in all PCI I/O BARs being rounded
> > up to 64K.
> > 
> > I've just been digging through the PCI code and have come across a
> > function - pcibios_window_alignment() - which the PCI code allows to be
> > overridden, which allows you to increase the alignment requirement of
> > bridge windows.  It takes the PCI bus and window type as arguments.
> > 
> > I'd suggest using that, and checking whether the bus which is passed
> > corresponds with a bus which gives you problems, so that you don't
> > impose the 64K requirement on downstream bridges.
> 
> That approach isn't going to work very well with multi-platform, though,
> since the function can only be overridden on a per-architecture basis.

The same can be said of all the various other functions which the PCI
stuff expects the arch to provide, yet we seem to cope just fine...
Bjorn Helgaas Jan. 30, 2013, 6:52 p.m. UTC | #31
On Wed, Jan 30, 2013 at 2:35 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Tue, 29 Jan 2013 15:27:43 -0700, Bjorn Helgaas wrote:
>
>> I'm not sure the existing emulation in these patches is sufficient.
>> For example, pci_sw_pci_bridge_write() updates bridge->membase when we
>> write to the window register, but I don't see anything that updates
>> the actual hardware decoder.  That might be done in
>> mvebu_pcie_window_config_port() via armada_370_xp_alloc_pcie_window(),
>> but that looks like it's only done once.
>
> That's correct. I currently let the Linux PCI core enumerate the
> real PCIe devices, allocate the resources, and set the appropriate
> values in the emulated PCI-to-PCI bridge registers. Once this is all
> done, the Marvell PCIe driver looks at each PCI-to-PCI bridge, reads
> the membase and iobase registers, and creates address decoding windows
> so that the physical addresses assigned by the Linux PCI core actually
> resolve to the right PCIe interface. This is done once and for all.
>
>> If the PCI core updates a root port window later, I don't see where the hardware
>> decoder will be updated.
>
> It will not be updated.
>
>> Maybe you're counting on the window assignments to be static?  The PCI
>> core doesn't guarantee anything like that, though in the absence of
>> hotplug I don't know any reason why it would change things.
>
> Right. Is supporting hotplug a show-stopper to get this included? I
> think it could be added later, if it happens to be needed, no?
>
> I could of course do it, but the patch series is already quite large
> and complicated, so if we could merge a simple, but working, version
> first, and then improve on top of it when needed, it would be nice.

I'm most concerned about the stuff in drivers/pci.  I hesitate to
merge drivers/pci/sw-pci-pci-bridge.c as-is because it's a model
that's not connected to hardware and only works in a completely static
situation, and the rest of the PCI core can't really deal with that.

But I don't think supporting hotplug should be a show-stopper at this
point, either.  It sounds like we might be heading towards hooking
this up more directly to the Marvell hardware, which will make it more
arch-dependent.  Something like that could either go in arch/arm, or
in some not-quite-so-generic spot under drivers/pci.

>> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
>> The PCI core can update the bridge secondary/subordinate registers.
>> It looks like you don't support writing to them, and the read path
>> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
>> between the hardware and Linux bus numbers.  I don't understand the
>> system well enough to know if this is an issue.
>
> Right. Could you explain a little bit why the PCI core
> could update the secondary/subordinate registers, and to what values it
> sets them?

The secondary/subordinate registers effectively define a bus number
aperture that tells the bridge which transactions to claim and forward
downstream.  When enumerating devices, we may update the subordinate
bus number to widen the aperture so we can enumerate an arbitrary tree
behind the bridge.  When we're finished, we'll probably narrow it by
updating the subordinate again, so the unused bus number space can be
used for other bridges.  I don't know the exact details of the
algorithm, and they're likely to change anyway, but pci_scan_bridge()
is where most of it happens.

It looks like your current system doesn't support trees below the
bridges, but hopefully we can make it so the generic enumeration
algorithms still work.

> For now, I statically assign the secondary bus register value to be
> X+1, where X is the number of the PCIe interface, since X=0 is reserved
> for the root bus (which has the host bridge and the PCI-to-PCI
> bridges).

That makes sense but limits you to a single bus (and really, a single
device since this is PCIe) below the bridge.

> Also, could you detail what kind of translation I should be doing between
> the hardware and Linux bus numbers?

I'm hoping that the register Jason mentioned is enough to avoid the
need for translation.  If it's not, we can explore this a bit more.

Bjorn
Thomas Petazzoni Jan. 30, 2013, 10:28 p.m. UTC | #32
Dear Bjorn Helgaas,

On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:

> I'm most concerned about the stuff in drivers/pci.  I hesitate to
> merge drivers/pci/sw-pci-pci-bridge.c as-is because it's a model
> that's not connected to hardware and only works in a completely static
> situation, and the rest of the PCI core can't really deal with that.
> 
> But I don't think supporting hotplug should be a show-stopper at this
> point, either.  It sounds like we might be heading towards hooking
> this up more directly to the Marvell hardware, which will make it more
> arch-dependent.  Something like that could either go in arch/arm, or
> in some not-quite-so-generic spot under drivers/pci.

If you really don't want sw-pci-pci-bridge.c in drivers/pci, then I can
make it a part of the drivers/pci/host/pci-mvebu.c driver itself. I
initially followed the idea started by Thierry Reding for the emulated
host bridge, but if you feel that this emulated PCI-to-PCI bridge is
too specific to this driver, then I'm fine with keeping it inside the
driver itself.

> >> I also forgot about the bus number munging in mvebu_pcie_rd_conf().
> >> The PCI core can update the bridge secondary/subordinate registers.
> >> It looks like you don't support writing to them, and the read path
> >> (pci_sw_pci_bridge_read()) looks like it doesn't do any translation
> >> between the hardware and Linux bus numbers.  I don't understand the
> >> system well enough to know if this is an issue.
> >
> > Right. Could you explain a little bit why the PCI core
> > could update the secondary/subordinate registers, and to what
> > values it sets them?
> 
> The secondary/subordinate registers effectively define a bus number
> aperture that tells the bridge which transactions to claim and forward
> downstream.  When enumerating devices, we may update the subordinate
> bus number to widen the aperture so we can enumerate an arbitrary tree
> behind the bridge.  When we're finished, we'll probably narrow it by
> updating the subordinate again, so the unused bus number space can be
> used for other bridges.  I don't know the exact details of the
> algorithm, and they're likely to change anyway, but pci_scan_bridge()
> is where most of it happens.
> 
> It looks like your current system doesn't support trees below the
> bridges, but hopefully we can make it so the generic enumeration
> algorithms still work.

In practice, in our situation, there isn't a tree below the bridge.
There is one single device. I'd prefer to not implement features that I
cannot effectively test, and leave the implementation of those additional
features to whoever will need them and will therefore be able to test them.

I guess that if I integrate the PCI-to-PCI bridge emulation code within
the Marvell driver, then I can keep it fairly limited to whatever the
Marvell PCI driver requires, no?

> > For now, I statically assign the secondary bus register value to be
> > X+1, where X is the number of the PCIe interface, since X=0 is
> > reserved for the root bus (which has the host bridge and the
> > PCI-to-PCI bridges).
> 
> That makes sense but limits you to a single bus (and really, a single
> device since this is PCIe) below the bridge.

Which is exactly what is happening here.

> > Also, could you detail what kind of translation I should be doing
> > between the hardware and Linux bus numbers?
> 
> I'm hoping that the register Jason mentioned is enough to avoid the
> need for translation.  If it's not, we can explore this a bit more.

Ok.

Thanks,

Thomas
Jason Gunthorpe Jan. 30, 2013, 11:10 p.m. UTC | #33
On Wed, Jan 30, 2013 at 11:28:36PM +0100, Thomas Petazzoni wrote:

> > It looks like your current system doesn't support trees below the
> > bridges, but hopefully we can make it so the generic enumeration
> > algorithms still work.
> 
> In practice, in our situation, there isn't a tree below the bridge.
> There is one single device. I'd prefer to not implement features that I
> cannot effectively test, and leave the implementation of those additional
> features to whoever will need them and will therefore be able to test
> them.

Agreed it is hard to test, but be aware that any system that has PCI-E
slots can host an add-in card that has a bridge on it. These are midly
common in some areas like high port count ethernet cards.

If you aren't going to attempt the implementation then a really big
FIXME that the config access routing is not correct and needs to be
based on the bus range assigned to the bridge would be friendly :)
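
Something like the following would make the routing explicit (a sketch;
the port array and emulated-bridge field names are assumptions):

	/* Pick the port whose emulated bridge claims 'busnr' */
	static struct mvebu_pcie_port *
	mvebu_pcie_find_port(struct mvebu_pcie *pcie, int busnr)
	{
		int i;

		for (i = 0; i < pcie->nports; i++) {
			struct mvebu_sw_pci_bridge *b = &pcie->ports[i].bridge;

			if (busnr >= b->secondary_bus &&
			    busnr <= b->subordinate_bus)
				return &pcie->ports[i];
		}

		return NULL;
	}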

Jason
Bjorn Helgaas Jan. 30, 2013, 11:48 p.m. UTC | #34
On Wed, Jan 30, 2013 at 3:28 PM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:

>> It looks like your current system doesn't support trees below the
>> bridges, but hopefully we can make it so the generic enumeration
>> algorithms still work.
>
> In practice, in our situation, there isn't a tree below the bridge.
> There is one single device. I'd prefer to not implement features that I
> cannot effectively test, and let the implementation of those additional
> features to whoever will need them, and therefore be able to test them.

I understand the concern about testing, but my advice is to not use
that as an excuse to artificially limit the functionality of the code
you're writing :)

You're talking about emulating a bridge, and the bridge really doesn't
know or care what's downstream.  If it works with a single device
downstream, it should work with another bridge downstream.  Many
aspects of bridge configuration can be tested with creative
application of setpci, so that might be a possibility, too.

> I guess that if I integrate the PCI-to-PCI bridge emulation code within
> the Marvell driver, then I can keep it fairly limited to whatever the
> Marvell PCI driver requires, no?

Yeah, it's just that the pci_dev for this emulated bridge is used by
the generic PCI core code, e.g., pci_scan_bridge(), so whatever
emulation you do has to be robust enough that the core won't notice
it's emulated.

Bjorn
Thomas Petazzoni Jan. 31, 2013, 4:04 p.m. UTC | #35
Dear Bjorn Helgaas,

On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:

> The secondary/subordinate registers effectively define a bus number
> aperture that tells the bridge which transactions to claim and forward
> downstream.  When enumerating devices, we may update the subordinate
> bus number to widen the aperture so we can enumerate an arbitrary tree
> behind the bridge.  When we're finished, we'll probably narrow it by
> updating the subordinate again, so the unused bus number space can be
> used for other bridges.  I don't know the exact details of the
> algorithm, and they're likely to change anyway, but pci_scan_bridge()
> is where most of it happens.
> 
> It looks like your current system doesn't support trees below the
> bridges, but hopefully we can make it so the generic enumeration
> algorithms still work.

The PCI-to-PCI bridge specification says that the Primary Bus Number
Register, Secondary Bus Number Register and Subordinate Bus Number
Register of the PCI configuration space of a PCI-to-PCI bridge should
all be set to 0 after reset.

Until now, I was forcing a specific value of the Secondary Bus Number
and Subordinate Bus Number (1 for my first bridge, 2 for my second
bridge, etc.).

Following your recommendation, I've changed this, and left those
values initialized to 0 by default, in order to let Linux set correct
values. Yes, Linux does assign appropriate values in the Secondary Bus
Number Register. But before that Linux also complains loudly that the
bridge configuration is invalid:

pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring

Looking at the code, we have:

        /* Check if setup is sensible at all */
        if (!pass &&
            (primary != bus->number || secondary <= bus->number ||
             secondary > subordinate)) {
                dev_info(&dev->dev, "bridge configuration invalid ([bus %02x-%02x]), reconfiguring\n",
                         secondary, subordinate);
                broken = 1;
        }

Due to the default values of the Primary Bus Number Register, Secondary
Bus Number Register and Subordinate Bus Number Register, we have:

 primary = 0
 secondary = 0
 subordinate = 0

We are enumerating the root bus, so bus->number = 0. Therefore:

 * The test primary != bus->number is false, so it's not the problem.

 * secondary <= bus->number is true, because secondary = 0, and
   bus->number = 0. It is the problem.

 * secondary > subordinate is false.

So I'm not sure what to do with this...
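For reference, here is a rough sketch of the kind of emulated read path
I mean (untested, and the structure and function names are only
illustrative): the bus number registers live in software and read back
as the reset defaults until Linux assigns real values.

        #include <linux/pci.h>

        struct mvebu_sw_pci_bridge {
                /* All zero after "reset", as the PCI-to-PCI bridge
                 * specification requires; pci_scan_bridge() assigns
                 * real values later. */
                u8 primary_bus;
                u8 secondary_bus;
                u8 subordinate_bus;
        };

        /* Read of the dword at PCI_PRIMARY_BUS (0x18), which packs the
         * primary, secondary and subordinate bus numbers together with
         * the secondary latency timer (always 0 here). */
        static int mvebu_sw_pci_bridge_read_bus_regs(
                struct mvebu_sw_pci_bridge *bridge, u32 *value)
        {
                *value = (bridge->subordinate_bus << 16) |
                         (bridge->secondary_bus << 8) |
                         bridge->primary_bus;
                return PCIBIOS_SUCCESSFUL;
        }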

Thomas
Bjorn Helgaas Jan. 31, 2013, 4:30 p.m. UTC | #36
On Thu, Jan 31, 2013 at 9:04 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Wed, 30 Jan 2013 11:52:15 -0700, Bjorn Helgaas wrote:
>
>> The secondary/subordinate registers effectively define a bus number
>> aperture that tells the bridge which transactions to claim and forward
>> downstream.  When enumerating devices, we may update the subordinate
>> bus number to widen the aperture so we can enumerate an arbitrary tree
>> behind the bridge.  When we're finished, we'll probably narrow it by
>> updating the subordinate again, so the unused bus number space can be
>> used for other bridges.  I don't know the exact details of the
>> algorithm, and they're likely to change anyway, but pci_scan_bridge()
>> is where most of it happens.
>>
>> It looks like your current system doesn't support trees below the
>> bridges, but hopefully we can make it so the generic enumeration
>> algorithms still work.
>
> The PCI-to-PCI bridge specification says that the Primary Bus Number
> Register, Secondary Bus Number Register and Subordinate Bus Number
> Register of the PCI configuration space of a PCI-to-PCI bridge should
> all be set to 0 after reset.
>
> Until now, I was forcing a specific value of the Secondary Bus Number
> and Subordinate Bus Number (1 for my first bridge, 2 for my second
> bridge, etc.).
>
> Following your recommendation, I've changed this, and left those
> values initialized to 0 by default, in order to let Linux set correct
> values. Yes, Linux does assign appropriate values in the Secondary Bus
> Number Register. But before that Linux also complains loudly that the
> bridge configuration is invalid:
>
> pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring

Linux makes the unwarranted assumption that the PCI hierarchy has
already been configured by firmware.  If the only problem is the
messages above, I think we could just rework the message so it doesn't
look like an error.  I would guess that we probably also see the same
distressing message when we hot-add a card with a bridge on it,
because firmware won't have initialized the bridge.

My rule of thumb is that I like to note something in dmesg about the
initial configuration of bus/mem/io apertures and BARs, as well as
indications when we update them.  That way, the dmesg log should
contain enough information to debug most enumeration and configuration
defects.  pci_scan_bridge() is somewhat lacking in this regard.

Bjorn
Thomas Petazzoni Jan. 31, 2013, 4:33 p.m. UTC | #37
Dear Bjorn Helgaas,

On Thu, 31 Jan 2013 09:30:07 -0700, Bjorn Helgaas wrote:

> > Following your recommendation, I've changed this, and left those
> > values initialized to 0 by default, in order to let Linux set correct
> > values. Yes, Linux does assign appropriate values in the Secondary Bus
> > Number Register. But before that Linux also complains loudly that the
> > bridge configuration is invalid:
> >
> > pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> > pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> 
> Linux makes the unwarranted assumption that the PCI hierarchy has
> already been configured by firmware.  If the only problem is the
> messages above, I think we could just rework the message so it doesn't
> look like an error.  I would guess that we probably also see the same
> distressing message when we hot-add a card with a bridge on it,
> because firmware won't have initialized the bridge.
> 
> My rule of thumb is that I like to note something in dmesg about the
> initial configuration of bus/mem/io apertures and BARs, as well as
> indications when we update them.  That way, the dmesg log should
> contain enough information to debug most enumeration and configuration
> defects.  pci_scan_bridge() is somewhat lacking in this regard.

Ok. Would something like:

 "bridge configuration with unassigned bus numbers ([bus 00-00]), reconfiguring"

be an acceptable replacement for this one?

Thanks for your quick feedback,

Thomas
Russell King - ARM Linux Jan. 31, 2013, 4:42 p.m. UTC | #38
On Thu, Jan 31, 2013 at 09:30:07AM -0700, Bjorn Helgaas wrote:
> Linux makes the unwarranted assumption that the PCI hierarchy has
> already been configured by firmware.  If the only problem is the
> messages above, I think we could just rework the message so it doesn't
> look like an error.

That's not a safe assumption, especially on platforms where there's no
BIOS (like ARM platforms).  Thankfully, for the platforms I care about,
the boot loaders I wrote for them _do_ do a full bus setup, so I don't
see a problem. :)

However, I have historically had the kernel over many years (probably
around 14 or 15 now) reassign all resources so that things are how the
kernel wants them, and not how my half-hearted attempt at setting them
up did (which only does a limited job, just enough to get the system to
a state where we can load the kernel.)
Bjorn Helgaas Jan. 31, 2013, 5:03 p.m. UTC | #39
On Thu, Jan 31, 2013 at 9:33 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Thu, 31 Jan 2013 09:30:07 -0700, Bjorn Helgaas wrote:
>
>> > Following your recommendation, I've changed this, and left those
>> > values initialized to 0 by default, in order to let Linux set correct
>> > values. Yes, Linux does assign appropriate values in the Secondary Bus
>> > Number Register. But before that Linux also complains loudly that the
>> > bridge configuration is invalid:
>> >
>> > pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:05.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>> > pci 0000:00:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
>>
>> Linux makes the unwarranted assumption that the PCI hierarchy has
>> already been configured by firmware.  If the only problem is the
>> messages above, I think we could just rework the message so it doesn't
>> look like an error.  I would guess that we probably also see the same
>> distressing message when we hot-add a card with a bridge on it,
>> because firmware won't have initialized the bridge.
>>
>> My rule of thumb is that I like to note something in dmesg about the
>> initial configuration of bus/mem/io apertures and BARs, as well as
>> indications when we update them.  That way, the dmesg log should
>> contain enough information to debug most enumeration and configuration
>> defects.  pci_scan_bridge() is somewhat lacking in this regard.
>
> Ok. Would something like:
>
>  "bridge configuration with unassigned bus numbers ([bus 00-00]), reconfiguring"
>
> be an acceptable replacement for this one?

Seems reasonable.
Stephen Warren Feb. 1, 2013, 12:34 a.m. UTC | #40
On 01/28/2013 11:56 AM, Thomas Petazzoni wrote:
> This driver implements the support for the PCIe interfaces on the
> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> cover earlier families of Marvell SoCs, such as Dove, Orion and
> Kirkwood.

Bjorn and I happen to live very close, so we got together today and
talked about PCIe on ARM.

One of the questions he asked is: why does the window management on the
Marvell SoCs need to be dynamic?

(Sorry if this was covered earlier; I vaguely recall some discussion on
the topic, but couldn't find it quickly)

As background, PCIe enumeration in Linux usually works like:

1) You start off with some CPU physical address regions that generate
transactions on the PCIe bus.

2) You enumerate all the PCIe devices, and assign an address to each BAR
found, carved out of the PCIe address range corresponding to the regions
you knew from (1).

However, it sounds like the Marvell code wants to:

1) Start off with no real knowledge of the CPU physical address that
will generate transactions on the PCIe bus, since you want to assign
that later.

2) You enumerate all the PCIe devices, and assign an address. But, what
address range do you use?

3) Then you program the SoC's windows to set up the CPU->PCIe address
translations.

Am I recalling what you're trying to do correctly, or am I completely
confused?

Now, I recall that a related issue was that you are tight on CPU
physical address space, and the second algorithm above would allow the
size of the PCIe controller's window configuration to be as small as
possible, and hence there would be more CPU physical address space
available to fit in other peripherals.

However, why does this need to be dynamic? On a particular board, you
know all the other (non-PCIe) peripherals that you need to fit into the
CPU physical address space, so you know how much is left over for PCIe,
so why not always make the PCIe window fill up all the available space,
and use the first algorithm I described above? And also, I think you
always know the exact set of PCIe devices that are attached to the
boards, so you know the exact BAR size requirements there (or are there
user-accessible PCIe devices; I don't think so from your recent comments
about PCIe<->PCIe bridges not needing to be supported since the user
couldn't plug one in?)

Note that with DT, you can easily specify the window location/size in
the board .dts file rather than the SoC .dtsi file, so it can easily be
customized based on how much physical address space is taken up by RAM,
directly mapped NOR flash, etc.

With a static window configuration in DT, you'd end up with a system
that worked much like any x86 system or Tegra, with some static memory
range available to PCIe. It's just that in your case, the region
location/size could change from boot to boot based on DT, whereas it's
hard-coded in HW for Tegra and I assume x86 too.
Jason Gunthorpe Feb. 1, 2013, 1:41 a.m. UTC | #41
On Thu, Jan 31, 2013 at 05:34:36PM -0700, Stephen Warren wrote:
> On 01/28/2013 11:56 AM, Thomas Petazzoni wrote:
> > This driver implements the support for the PCIe interfaces on the
> > Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
> > cover earlier families of Marvell SoCs, such as Dove, Orion and
> > Kirkwood.
> 
> Bjorn and I happen to live very close, so we got together today and
> talked about PCIe on ARM.
> 
> One of the questions he asked is: why does the window management on the
> Marvell SoCs need to be dynamic?

> (Sorry if this was covered earlier; I vaguely recall some discussion on
> the topic, but couldn't find it quickly)

Well, I've answered it several times, and so has Thomas. Let's try
again, please save for future reference :)

Let's separate two things.

The CPU physical address ranges reserved for PCI bus access are not
dynamic. This is set in DT, or whatever, statically. Just like every
other PCI case on ARM. Just like Tegra. That is not the issue.

What is required is that the division of this space amongst the 10
physical PCI-E links must be dynamic. Just like x86. Just like the new
tegra driver. [1]

Is that clear?

> As background, PCIe enumeration in Linux usually works like:
> 
> 1) You start off with some CPU physical address regions that generate
> transactions on the PCIe bus.
> 
> 2) You enumerate all the PCIe devices, and assign an address to each BAR
> found, carved out of the PCIe address range corresponding to the regions
> you knew from (1).

Step 2 also includes 'assign address windows to all the physical PCI-E
links'. This is very important because it is what this entire
discussion is about.

Look at how tegra or x86 works, the CPU physical addresses for PCI-E
do nothing until the PCI-to-PCI bridge window registers in each link's
configuration space are setup. Until that is done the SOC doesn't know
which link to send the transaction to.

Marvell is the same, until the link's window registers are setup the
CPU addresses don't go anywhere.

Notice this has absolutely no effect on the host bridge aperture.
This is a link-by-link configuration of what addresses *go down that
link*.

The big difference is the link window registers for Marvell do not
conform to the PCI configuration space specification. They are Marvell
specific.

This is what the glue code in the host driver does, it converts the
Marvell specificness into something the kernel can undertstand and
control. There are countless ways to do this, but please accept it
is necessary that it be done...

> Now, I recall that a related issue was that you are tight on CPU
> physical address space, and the second algorithm above would allow the
> size of the PCIe controller's window configuration to be as small as
> possible, and hence there would be more CPU physical address space
> available to fit in other peripherals.

Physical address space is certainly a concern, but availability of
decoder windows is the major one. Each link requires one decoder
window for MMIO and one for IO, and possibly one for prefetch. The
chip doesn't have 30 decoder windows. So the allocation of decoders to
links must be dynamic, based on the requirements of the downstream
end ports on the link.

> However, why does this need to be dynamic? On a particular board, you
> know all the other (non-PCIe) peripherals that you need to fit into the
> CPU physical address space, so you know how much is left over for PCIe,
> so why not always make the PCIe window fill up all the available
> space,

Because there is no such thing as an all-links PCIe window on this
hardware.

Each link has a seperate window.

If you get rid of all the dynamic allocation then every link must
statically reserve some portion of physical address space and some
number of decoder windows.

That more or less means you need to know what is going to be on the
other side of every link when you write the DT.

> With a static window configuration in DT, you'd end up with a system
> that worked much like any x86 system or Tegra, with some static memory
> range available to PCIe. It's just that in your case, the region
> location/size could change from boot to boot based on DT, whereas it's
> hard-coded in HW for Tegra and I assume x86 too.

How is this better? Now you have a system where you have to customize
the DT before you connect a PCI-E device. What if someone uses this
chip in a configuration with physical slots? How does that work? What
about hotplug? What about a unified kernel? That is *not* like x86 or
tegra.

IMHO Thomas's direction in his proposed driver ends up working very
close to the new tegra driver, and has the sort of dynamic allocation
and discovery people expect from PCI-E.

Jason

1 - The new tegra driver switches from calling ARM's pci_common_init
    once for every physical link, to once for the SOC. It does this by
    fixing the routing of config transactions so that the kernel sees
    the per-link PCI-PCI root port bridge config space provided by the
    hardware at the correct place. By doing this it changes from
    statically allocating a physical memory region for each link to
    statically allocating a region for all the links, and dynamically
    dividing that region amongst the links.
Stephen Warren Feb. 1, 2013, 2:21 a.m. UTC | #42
On 01/31/2013 06:41 PM, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2013 at 05:34:36PM -0700, Stephen Warren wrote:
>> On 01/28/2013 11:56 AM, Thomas Petazzoni wrote:
>>> This driver implements the support for the PCIe interfaces on the
>>> Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
>>> cover earlier families of Marvell SoCs, such as Dove, Orion and
>>> Kirkwood.
>>
>> Bjorn and I happen to live very close, so we got together today and
>> talked about PCIe on ARM.
>>
>> One of the questions he asked is: why does the window management on the
>> Marvell SoCs need to be dynamic?
> 
>> (Sorry if this was covered earlier; I vaguely recall some discussion on
>> the topic, but couldn't find it quickly)
> 
> Well, I've answered it several times, and so has Thomas. Let's try
> again, please save for future reference :)
> 
> Let's separate two things.
> 
> The CPU physical address ranges reserved for PCI bus access are not
> dynamic. This is set in DT, or whatever, statically. Just like every
> other PCI case on ARM. Just like Tegra. That is not the issue.
> 
> What is required is that the division of this space amongst the 10
> physical PCI-E links must be dynamic. Just like x86. Just like the new
> tegra driver. [1]
> 
> Is that clear?

Yes.

>> As background, PCIe enumeration in Linux usually works like:
>>
>> 1) You start off with some CPU physical address regions that generate
>> transactions on the PCIe bus.
>>
>> 2) You enumerate all the PCIe devices, and assign an address to each BAR
>> found, carved out of the PCIe address range corresponding to the regions
>> you knew from (1).
> 
> Step 2 also includes 'assign address windows to all the physical PCI-E
> links'. This is very important because it is what this entire
> discussion is about.

OK.

> Look at how tegra or x86 works, the CPU physical addresses for PCI-E
> do nothing until the PCI-to-PCI bridge window registers in each link's
> configuration space are setup. Until that is done the SOC doesn't know
> which link to send the transaction to.

From my perspective, this is slightly the wrong way of describing the
issue, but I see what you mean:

At least on Tegra and I think x86, any transaction that goes to the
physical PCIe aperture is translated onto (internal) PCIe bus 0, so the
lack of window or PCIe/PCIe bridge BAR register programming doesn't
prevent the transaction going /somewhere/ (even if "somewhere" is only
half way to where it's useful!). The difference is pretty subtle. The
issue is that without the PCIe/PCIe bridge BARs programmed, the PCIe
transactions won't get off bus 0 and onto a downstream bus of one of the
PCIe/PCIe bridges, or put another way, no PCIe/PCIe bridge will claim the
transaction that happens on PCIe bus 0.

(Using "PCIe/PCIe bridge" above to mean "PCIe root port")

> Marvell is the same, until the link's window registers are setup the
> CPU addresses don't go anywhere.
> 
> Notice this has absolutely no effect on the host bridge aperture.
> This is a link-by-link configuration of what addresses *go down that
> link*.

Right.

> The big difference is the link window registers for Marvell do not
> conform to the PCI configuration space specification. They are Marvell
> specific.
> 
> This is what the glue code in the host driver does, it converts the
> Marvell specificness into something the kernel can understand and
> control. There are countless ways to do this, but please accept it
> is necessary that it be done...

Sure.

>> Now, I recall that a related issue was that you are tight on CPU
>> physical address space, and the second algorithm above would allow the
>> size of the PCIe controller's window configuration to be as small as
>> possible, and hence there would be more CPU physical address space
>> available to fit in other peripherals.
> 
> Physical address space is certainly a concern, but availability of
> decoder windows is the major one. Each link requires one decoder
> window for MMIO and one for IO, and possibly one for prefetch. The
> chip doesn't have 30 decoder windows. So the allocation of decoders to
> links must be dynamic, based on the requirements of the downstream
> end ports on the link.

Oh I see...

I originally thought the issue was that the windows were between CPU
physical address space and the PCIe host controller itself. But in fact,
the windows are between PCIe bus 0 and the root ports, so they're the
equivalent of the standard PCIe root port (or PCIe/PCIe bridge) BAR
registers. And related, these BAR/window registers are usually part of
each PCIe root port itself, and hence there's a whole number dedicated
to each root port, but on Marvell there's a *global* pool of these
BARs/windows instead.

Now I think I finally understand the architecture of your HW.

>> However, why does this need to be dynamic? On a particular board, you
>> know all the other (non-PCIe) peripherals that you need to fit into the
>> CPU physical address space, so you know how much is left over for PCIe,
>> so why not always make the PCIe window fill up all the available
>> space,
> 
> Because there is no such thing as an all-links PCIe window on this
> hardware.
> 
> Each link has a seperate window.
> 
> If you get rid of all the dynamic allocation then every link must
> statically reserve some portion of physical address space and some
> number of decoder windows.
> 
> That more or less means you need to know what is going to be on the
> other side of every link when you write the DT.

So, the dynamic programming of the windows on Marvell HW is the exact
logical equivalent of programming a standard PCIe root port's BAR
registers. It makes perfect sense that should be dynamic. Presumably
this is something you can make work inside your emulated PCIe/PCIe
bridge module, simply by capturing writes to the BAR registers, and
translating them into writes to the Marvell window registers.

Now, I do have one follow-on question: You said you don't have 30
windows, but how many do you have free after allocating windows to any
other peripherals that need them, relative to (3 *
number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.)

The thing here is that when the PCIe core writes to a root port BAR
window to configure/enable it the first time, you'll need to capture
that transaction and dynamically allocate a window and program it in a
way equivalent to what the BAR register write would have achieved on
standard HW. Later, the window might need resizing, or even to be
completely disabled, if the PCIe core were to change the standard BAR
register. Dynamically allocating a window when the BAR is written seems
a little heavy-weight.

So while it's obvious that window base address and size shouldn't be
static, I wonder if the assignment of a specific window ID to a specific
root port ID should be dynamic or static. For example, if your HW
configuration leaves you with 6 windows available, you could support 2
PCIe root ports by statically assigning 3 windows to serve each of those
2 root ports. Would that work, or are there systems where over-commit is
needed, e.g. if there's no IO space behind a root port, you could get
away with two windows per root port, and hence be able to run 3 root
ports rather than just 2? Still, if you know which PCIe devices are
behind the root ports, you could still represent the over-commit
statically in DT.

Still, I suppose doing it dynamically in the driver does end up being a
lot less to think about for someone creating the DT for a new board.

Having to translate standard root port BAR register writes to Marvell
window register allocation/writes would imply that the emulated root
port code has to be very closely tied into the Marvell PCIe driver, and
not something that could be at all generic for the most part.

>> With a static window configuration in DT, you'd end up with a system
>> that worked much like any x86 system or Tegra, with some static memory
>> range available to PCIe. It's just that in your case, the region
>> location/size could change from boot to boot based on DT, whereas it's
>> hard-coded in HW for Tegra and I assume x86 too.
> 
> How is this better? Now you have a system where you have to customize
> the DT before you connect a PCI-E device. What if someone uses this
> chip in a configuration with physical slots? How does that work? What
> about hotplug? What about a unified kernel? That is *not* like x86 or
> tegra.

Right. Now that I really understand what the windows are doing, I can
see that a static window configuration (address/size, perhaps rather
than which windows are used) would not be appropriate.

> IMHO Thomas's direction in his proposed driver ends up working very
> close to the new tegra driver, and has the sort of dynamic allocation
> and discovery people expect from PCI-E.
> 
> Jason
> 
> 1 - The new tegra driver switches from calling ARM's pci_common_init
>     once for every physical link, to once for the SOC. It does this by
>     fixing the routing of config transactions so that the kernel sees
>     the per-link PCI-PCI root port bridge config space provided by the
>     hardware at the correct place. By doing this it changes from
>     statically allocating a physical memory region for each link to
>     statically allocating a region for all the links, and dynamically
>     dividing that region amongst the links.

Right, we have both (or all 3) root ports show up in the same PCIe domain.
Jason Gunthorpe Feb. 1, 2013, 3:51 a.m. UTC | #43
On Thu, Jan 31, 2013 at 07:21:02PM -0700, Stephen Warren wrote:

> I originally thought the issue was that the windows were between CPU
> physical address space and the PCIe host controller itself. But in fact,
> the windows are between PCIe bus 0 and the root ports, so they're
> the

Donno what exactly is inside tegra, but on Marvell the docs describe
an internal bus crossbar/whatever and each PCI-E link gets a port
on that structure. There is no such physical thing as a 'PCI bus 0',
and they didn't arrange the hardware in a way that makes it easy for
the host driver to create one like tegra did :(

> equivalent of the standard PCIe root port (or PCIe/PCIe bridge) BAR
> registers. And related, these BAR/window registers are usually part of
> each PCIe root port itself, and hence there's a whole number dedicated
> to each root port, but on Marvell there's a *global* pool of these
> BARs/windows instead.

Right, that is all following the PCI-E spec, and is reasonable and
sane. What Marvell did is take an *end port core* and jam it on
their internal bus without adjusting things to follow the PCI-E spec
regarding config space and what not.

> > That more or less means you need to know what is going to be on the
> > other side of every link when you write the DT.
> 
> So, the dynamic programming of the windows on Marvell HW is the exact
> logical equivalent of programming a standard PCIe root port's BAR
> registers. It makes perfect sense that should be dynamic. Presumably
> this is something you can make work inside your emulated PCIe/PCIe
> bridge module, simply by capturing writes to the BAR registers, and
> translating them into writes to the Marvell window registers.

Yes, that is exactly the idea.
 
> Now, I do have one follow-on question: You said you don't have 30
> windows, but how many do you have free after allocating windows to any
> other peripherals that need them, relative to (3 *
> number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.)

Thomas will have to answer this, it varies depending on the SOC, and
what other on chip peripherals are in use. For instance Kirkwood has
the same design but there are plenty of windows for the two PCI-E
links.

Still, how would you even connect a limited number of regions on a
link-by-link basis to the common PCI code?

> The thing here is that when the PCIe core writes to a root port BAR
> window to configure/enable it the first time, you'll need to capture
> that transaction and dynamically allocate a window and program it in a
> way equivalent to what the BAR register write would have achieved on
> standard HW. Later, the window might need resizing, or even to be
> completely disabled, if the PCIe core were to change the standard
> BAR

Right. This is pretty straightforward except for the need to hook the
alignment fixup..

> register. Dynamically allocating a window when the BAR is written seems
> a little heavy-weight.

I think what Thomas had here was pretty small, and the windows need to
be shared with other on chip peripherals beyond PCI-E..
 
> needed, e.g. if there's no IO space behind a root port, you could get
> away with two windows per root port, and hence be able to run 3 root
> ports rather than just 2?

Right, this is the main point. If you plug in 3 devices and they all
only use MMIO regions then you only need to grab 3 windows. The kernel
disables the unused windows on the bridge so it is easy to tell when
they are disused.

> Still, I suppose doing it dynamically in the driver does end up being a
> lot less to think about for someone creating the DT for a new board.

Agreed, these restrictions are all so HW specific, subtle and have
nothing to do with the PCI-E spec. Codifying them once in the driver
seems like the way to keep this craziness out of the PCI core and away
from users of the SOC.
 
> Having to translate standard root port BAR register writes to Marvell
> window register allocation/writes would imply that the emulated root
> port code has to be very closely tied into the Marvell PCIe driver, and
> not something that could be at all generic for the most part.

Agreed.. At the very least generic code would need call back
functions to the driver... It has a fair bit to do for Marvell:
 - Translate MMIO, prefetch and IO ranges to mbus windows
 - Keep track of the secondary/subordinate bus numbers and fiddle
   with other hardware registers to set those up
 - Copy the link state/control regsiters from the end port config
   space into the bridge express root port capability
 - Probably ditto for AER as well..

Probably simpler just to make one for Marvell than to mess excessively
with callbacks..

Cheers,
Jason
Thomas Petazzoni Feb. 1, 2013, 8:46 a.m. UTC | #44
Dear Stephen Warren,

Thanks for this great discussion. I see that Jason has already answered
most of the questions on why we need this "dynamicity" in the window
configuration. A few comments below.

On Thu, 31 Jan 2013 19:21:02 -0700, Stephen Warren wrote:

> So, the dynamic programming of the windows on Marvell HW is the exact
> logical equivalent of programming a standard PCIe root port's BAR
> registers. It makes perfect sense that should be dynamic. Presumably
> this is something you can make work inside your emulated PCIe/PCIe
> bridge module, simply by capturing writes to the BAR registers, and
> translating them into writes to the Marvell window registers.

That's what I'm doing. In the PATCHv2, my PCIe host driver was reading
back the BARs in the PCI-to-PCI bridges configuration space, and was
setting up the windows according to the addresses that had been
assigned to each bridge.

I am currently working on making this more dynamic: the window will be
set up directly when the BAR is written in the PCI-to-PCI bridge
configuration space.
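To make the idea concrete, here is a sketch of the shape of that write
path (the mvebu_pcie_port structure, the membase/memlimit fields and
the two window helpers are only illustrative placeholders for our
existing window management code):

        static int mvebu_sw_pci_bridge_write(struct mvebu_pcie_port *port,
                                             unsigned int where, u32 value)
        {
                struct mvebu_sw_pci_bridge *bridge = &port->bridge;

                switch (where & ~3) {
                case PCI_MEMORY_BASE:
                        /* This dword holds the memory base (low 16
                         * bits) and memory limit (high 16 bits) of
                         * the bridge window. */
                        bridge->membase = value & 0xffff;
                        bridge->memlimit = value >> 16;

                        /* The window is enabled when base <= limit,
                         * so create or tear down the matching Marvell
                         * address decoding window right away (the
                         * helpers stand in for our window code). */
                        if (bridge->membase <= bridge->memlimit)
                                mvebu_pcie_setup_mem_window(port);
                        else
                                mvebu_pcie_teardown_mem_window(port);
                        break;
                default:
                        break;
                }

                return PCIBIOS_SUCCESSFUL;
        }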

> Now, I do have one follow-on question: You said you don't have 30
> windows, but how many do you have free after allocating windows to any
> other peripherals that need them, relative to (3 *
> number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.)

We have 20 windows on Armada XP if I remember correctly, and they are
not only used for PCIe, but also to map the BootROM (needed to boot
secondary CPUs), to map SPI flashes or NOR flashes, for example. So
they are really shared between many uses. In terms of PCIe, there are
only two types of windows: I/O and Memory, there is no notion of
Prefetchable Memory window as far as I could see.

We have up to 10 PCIe interfaces, and only 20 windows. It means that
you basically can't use all PCIe interfaces, there will necessarily be
some limit, due to the limited number of windows.

Also, I'd like to point out that the dynamic configuration is needed
for two reasons:

 * The number of windows, as we are discussing now.

 * The amount of physical address space available. If you don't
   dynamically configure those windows, then you have to account the
   "worst case", i.e the PCIe devices that require very large memory
   areas. So you end up creating static windows that reserve 32M or 64M
   or 128M *per* PCIe link. You can see that it "consumes" pretty
   quickly a large part of the 4G physical address space that we have.
   Thanks to the dynamic window configuration that we do with the
   PCI-to-PCI bridge, we can size the windows exactly the size needed
   by the downstream device on each PCIe interface.

> The thing here is that when the PCIe core writes to a root port BAR
> window to configure/enable it the first time, you'll need to capture
> that transaction and dynamically allocate a window and program it in a
> way equivalent to what the BAR register write would have achieved on
> standard HW. Later, the window might need resizing, or even to be
> completely disabled, if the PCIe core were to change the standard BAR
> register. Dynamically allocating a window when the BAR is written
> seems a little heavy-weight.

Why?

> So while it's obvious that window base address and size shouldn't be
> static, I wonder if the assignment of a specific window ID to a
> specific root port ID should be dynamic or static. For example, if
> your HW configuration leaves you with 6 windows available, you could
> support 2 PCIe root ports by statically assigning 3 windows to serve
> each of those 2 root ports. Would that work, or are there systems
> where over-commit is needed, e.g. if there's no IO space behind a
> root port, you could get away with two windows per root port, and
> hence be able to run 3 root ports rather than just 2? Still, if you
> know which PCIe devices are behind the root ports, you could still
> represent the over-commit statically in DT.

For now, I haven't figured out how not to allocate an I/O window if the
downstream device doesn't use I/O, but I'd like to achieve that, as it
would save one of the two windows needed per PCIe interface... and many
PCIe devices don't need the I/O window.
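If, as Jason says, the kernel disables the unused windows on the
bridge, then maybe something as simple as this check in the emulated
bridge would do (just a sketch; the field and function names are
illustrative):

        /* Only grab an mbus window for I/O space once the PCI core
         * has programmed a non-empty I/O range into the emulated
         * bridge; a disabled I/O window has iobase > iolimit. */
        static bool mvebu_sw_pci_bridge_io_enabled(struct mvebu_sw_pci_bridge *bridge)
        {
                return bridge->iobase <= bridge->iolimit;
        }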

> Still, I suppose doing it dynamically in the driver does end up being a
> lot less to think about for someone creating the DT for a new board.

Indeed.

> Having to translate standard root port BAR register writes to Marvell
> window register allocation/writes would imply that the emulated root
> port code has to be very closely tied into the Marvell PCIe driver,
> and not something that could be at all generic for the most part.

Right. I've already moved the PCI-to-PCI bridge code from the generic
drivers/pci/sw-pci-pci-bridge.c location directly into the driver. This
also allows integrating things like window allocation more tightly.

> Right. Now that I really understand what the windows are doing, I can
> see that a static window configuration (address/size, perhaps rather
> than which windows are used) would not be appropriate.

Glad to see we reached the same conclusion :-)

Best regards,

Thomas
Thomas Petazzoni Feb. 1, 2013, 9:03 a.m. UTC | #45
Dear Jason Gunthorpe,

Thanks again for continuing this discussion while I was sleeping :-)

On Thu, 31 Jan 2013 20:51:15 -0700, Jason Gunthorpe wrote:

> > Now, I do have one follow-on question: You said you don't have 30
> > windows, but how many do you have free after allocating windows to
> > any other peripherals that need them, relative to (3 *
> > number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.)
> 
> Thomas will have to answer this, it varies depending on the SOC, and
> what other on chip peripherals are in use. For instance Kirkwood has
> the same design but there are plenty of windows for the two PCI-E
> links.

Right. I already answered this point directly to Stephen. On Kirkwood,
there are many windows and two PCIe links, so the windows were
statically allocated. On Armada XP, there are 20 windows and 10 PCIe
links. Static allocation is no longer reasonable.

> > The thing here is that when the PCIe core writes to a root port BAR
> > window to configure/enable it the first time, you'll need to capture
> > that transaction and dynamically allocate a window and program it
> > in a way equivalent to what the BAR register write would have
> > achieved on standard HW. Later, the window might need resizing, or
> > even to be completely disabled, if the PCIe core were to change the
> > standard BAR
> 
> Right. This is pretty straightforward except for the need to hook the
> alignment fixup..
> 
> > register. Dynamically allocating a window when the BAR is written
> > seems a little heavy-weight.
> 
> I think what Thomas had here was pretty small, and the windows need to
> be shared with other on chip peripherals beyond PCI-E..

Yes, it is not very complicated. We already have some common code that
creates/removes those windows, so it is just a matter of calling the
right thing at the right time. Definitely not hundreds of lines of crap.

> Right, this is the main point. If you plug in 3 devices and they all
> only use MMIO regions then you only need to grab 3 windows. The kernel
> disables the unused windows on the bridge so it is easy to tell when
> they are disused.

Ah, I'm interested in further discussing this. I currently have a setup
with one SATA PCIe card and one NIC PCIe card. On the NIC, the I/O
ports are said to be "disabled", but still an I/O region gets allocated
in the PCI-to-PCI bridge that gives access to this particular device.

The device in question is:

05:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
	Subsystem: Intel Corporation PRO/1000 PT Server Adapter
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 106
	Region 0: Memory at c1200000 (32-bit, non-prefetchable) [size=128K]
	Region 1: Memory at c1220000 (32-bit, non-prefetchable) [size=128K]
	Region 2: I/O ports at c0010000 [disabled] [size=32]
	[virtual] Expansion ROM at c1300000 [disabled] [size=128K]

So the Region 2 is disabled. But, in the corresponding bridge:

00:05.0 PCI bridge: Marvell Technology Group Ltd. Device 1092 (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF+ FastB2B+ ParErr+ DEVSEL=?? >TAbort+ <TAbort+ <MAbort+ >SERR+ <PERR+ INTx+
	Latency: 0, Cache Line Size: 64 bytes
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	I/O behind bridge: c0010000-c001ffff
	Memory behind bridge: c1200000-c12fffff
	Prefetchable memory behind bridge: c1300000-c13fffff

So there is really a range of I/O addresses associated with it, even
though the device will apparently not use it. Would it be possible to
detect that the I/O range is not used by the device, and therefore
avoid the allocation of an address decoding window for this I/O range?

> Agreed.. At the very least generic code would need call back
> functions to the driver... It has a fair bit to do for Marvell:
>  - Translate MMIO, prefetch and IO ranges to mbus windows
>  - Keep track of the secondary/subordinate bus numbers and fiddle
>    with other hardware registers to set those up
>  - Copy the link state/control regsiters from the end port config
>    space into the bridge express root port capability
>  - Probably ditto for AER as well..
> 
> Probably simpler just to make one for marvell then mess excessively
> with callbacks..

As replied to Stephen, I've chosen to bring the PCI-to-PCI bridge
emulation code directly into the driver, specifically for this reason.

Best regards,

Thomas
Arnd Bergmann Feb. 1, 2013, 4:02 p.m. UTC | #46
On Friday 01 February 2013, Thomas Petazzoni wrote:
> > So while it's obvious that window base address and size shouldn't be
> > static, I wonder if the assignment of a specific window ID to a
> > specific root port ID should be dynamic or static. For example, if
> > your HW configuration leaves you with 6 windows available, you could
> > support 2 PCIe root ports by statically assigning 3 windows to serve
> > each of those 2 root ports. Would that work, or are there systems
> > where over-commit is needed, e.g. if there's no IO space behind a
> > root port, you could get away with two windows per root port, and
> > hence be able to run 3 root ports rather than just 2? Still, if you
> > know which PCIe devices are behind the root ports, you could still
> > represent the over-commit statically in DT.
> 
> For now, I haven't figured out how not to allocate an I/O window if the
> downstream device doesn't use I/O, but I'd like to achieve that, as it
> would save one of the two windows needed per PCIe interface... and many
> PCIe devices don't need the I/O window.

The easiest hack would be to only ever allow one I/O window on exactly
one of the ports, and not do PIO on the other ports at all. Given the
various troubles of making any other combination work, that sounds like
a good enough compromise to me. A lot of the add-on cards would not
work in the remaining ports anyway, and worrying too much about that
legacy hardware may just not be worth it. That way, you only need 11
windows for PCIe (10*mem, 1*IO), which will always fit.
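In code, the compromise could be as simple as this (a sketch; the
structure and field names are made up):

        /* Only port 0 ever gets the single I/O decoding window; all
         * other ports are MMIO-only, so PCIe never needs more than
         * 10 mem + 1 I/O mbus windows. */
        static bool mvebu_pcie_port_has_io(struct mvebu_pcie_port *port)
        {
                return port->index == 0;
        }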

Do you actually have /any/ PCIe cards with PIO BARs that you can test
with?

	Arnd
Arnd Bergmann Feb. 1, 2013, 4:07 p.m. UTC | #47
On Friday 01 February 2013, Thomas Petazzoni wrote:
> So there is really a range of I/O addresses associated with it, even
> though the device will apparently not use it. Would it be possible to
> detect that the I/O range is not used by the device, and therefore
> avoid the allocation of an address decoding window for this I/O range?

I suspect it just gets disabled because the port number 0xc0010000 is
larger than IO_PORT_LIMIT and we cannot access that offset inside
of the virtual memory window we use for PIO.

	Arnd
Russell King - ARM Linux Feb. 1, 2013, 4:26 p.m. UTC | #48
On Fri, Feb 01, 2013 at 04:07:49PM +0000, Arnd Bergmann wrote:
> On Friday 01 February 2013, Thomas Petazzoni wrote:
> > So there is really a range of I/O addresses associated with it, even
> > though the device will apparently not use it. Would it be possible to
> > detect that the I/O range is not used by the device, and therefore
> > avoid the allocation of an address decoding window for this I/O range?
> 
> I suspect it just gets disabled because the port number 0xc0010000 is
> larger than IO_PORT_LIMIT and we cannot access that offset inside
> of the virtual memory window we use for PIO.

You're running into that trap again which you fall into on other
architectures.

If you arrange for your PCI IO space to start at 0 rather than the
physical address at which it appears on your CPU, then you shouldn't
end up with it extending up to something at 0xcXXXXXXX.

Remember that we should be ensuring that inb(0) hits the first address
of the cross-subarch PCI IO area - this always requires that any sub-arch
taking part in a multiplatform kernel must start its IO space addresses
at 0 and not the physical address on the local CPU.
Arnd Bergmann Feb. 1, 2013, 5:45 p.m. UTC | #49
On Friday 01 February 2013, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 04:07:49PM +0000, Arnd Bergmann wrote:
> > On Friday 01 February 2013, Thomas Petazzoni wrote:
> > > So there is really a range of I/O addresses associated with it, even
> > > though the device will apparently not use it. Would it be possible to
> > > detect that the I/O range is not used by the device, and therefore
> > > avoid the allocation of an address decoding window for this I/O range?
> > 
> > I suspect it just gets disabled because the port number 0xc0010000 is
> > larger than IO_PORT_LIMIT and we cannot access that offset inside
> > of the virtual memory window we use for PIO.
> 
> You're running into that trap again which you fall into on other
> architectures.
> 
> If you arrange for your PCI IO space to start at 0 rather than the
> physical address at which it appears on your CPU, then you shouldn't
> end up with it extending up to something at 0xcXXXXXXX.
> 
> Remember that we should be ensuring that inb(0) hits the first address
> of the cross-subarch PCI IO area - this always requires that any sub-arch
> taking part in a multiplatform kernel must start its IO space addresses
> at 0 and not the physical address on the local CPU.

Yes, that was my point. I think in this case, the bug is in the new
of_pci_process_ranges functions, which returns a 'struct resource'
translated into IORESOURCE_MEM space, but with the type set
to IORESOURCE_IO. This resource then gets passed to 
pci_add_resource_offset().
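Something along these lines in the host driver setup would register the
I/O aperture in Linux I/O port space starting at 0, rather than at the
CPU physical address of the decoding window (a sketch only; the
resource values are illustrative):

        #include <linux/pci.h>
        #include <asm/mach/pci.h>

        static int mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
        {
                /* Bus/port addresses start at 0, not at the
                 * 0xc001xxxx CPU physical address of the window. */
                static struct resource realio = {
                        .name   = "PCIe I/O",
                        .start  = 0,
                        .end    = IO_SPACE_LIMIT,
                        .flags  = IORESOURCE_IO,
                };

                pci_add_resource_offset(&sys->resources, &realio,
                                        sys->io_offset);
                return 1;
        }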

	Arnd
Stephen Warren Feb. 1, 2013, 5:57 p.m. UTC | #50
On 02/01/2013 01:46 AM, Thomas Petazzoni wrote:
> Dear Stephen Warren,
> 
> Thanks for this great discussion. I see that Jason has already answered
> most of the questions on the why we need this "dynamicity" in the window
> configuration. A few comments below.
> 
> On Thu, 31 Jan 2013 19:21:02 -0700, Stephen Warren wrote:
> 
>> So, the dynamic programming of the windows on Marvell HW is the exact
>> logical equivalent of programming a standard PCIe root port's BAR
>> registers. It makes perfect sense that should be dynamic. Presumably
>> this is something you can make work inside your emulated PCIe/PCIe
>> bridge module, simply by capturing writes to the BAR registers, and
>> translating them into writes to the Marvell window registers.
> 
> That's what I'm doing. In the PATCHv2, my PCIe host driver was reading
> back the BARs in the PCI-to-PCI bridges configuration space, and was
> setting up the windows according to the addresses that had been
> assigned to each bridge.
> 
> I am currently working on making this more dynamic: the window will be
> set up directly when the BAR is written in the PCI-to-PCI bridge
> configuration space.
> 
>> Now, I do have one follow-on question: You said you don't have 30
>> windows, but how many do you have free after allocating windows to any
>> other peripherals that need them, relative to (3 *
>> number-of-root-ports-in-the-SoC)? (3 being IO+Mem+PrefetchableMem.)
> 
> We have 20 windows on Armada XP if I remember correctly, and they are
> not only used for PCIe, but also to map the BootROM (needed to boot
> secondary CPUs), to map SPI flashes or NOR flashes, for example. So
> they are really shared between many uses. In terms of PCIe, there are
> only two types of windows: I/O and Memory, there is no notion of
> Prefetchable Memory window as far as I could see.

In Tegra, we end up having separate MMIO vs. Prefetchable MMIO chunks of
our overall PCIe aperture. However, the HW setup appears the same for
both of those. I'm not sure if it's a bug in the driver, or if it's just
to separate the two address spaces so that the page tables can be
configured for those two regions with large rather than small
granularity. I need to go investigate that.

> We have up to 10 PCIe interfaces, and only 20 windows. It means that
> you basically can't use all PCIe interfaces, there will necessarily be
> some limit, due to the limited number of windows.

So there are 10 PCIe interfaces (root ports). That's on the SoC itself
right. Are all 10 (or a large number of them) actually used at once on
any given board design? I suppose this must be the case, or Marvell
wouldn't have wasted the silicon space on 10 root ports... Still, that's
a rather large number of ports!

If only a few PCIe ports are ever in use at once on a design and/or the
PCIe ports generally contain soldered-down devices rather than
user-accessible ports, then statically assigning window *IDs* to
individual ports would make for easier code in the driver, since the BAR
register emulation would never have to allocate/de-allocate windows, but
rather only ever have to enable/disable/configure them.

However, if many PCIe ports are in use at once and there are
user-accessible ports, you can't know ahead of time which ports will
need MMIO vs. MMIO prefetch vs. IO, so you'd have to dynamically
allocate window IDs to ports, in addition to dynamically setting up the
address/size of windows.

> Also, I'd like to point out that the dynamic configuration is needed
> for two reasons:
> 
>  * The number of windows, as we are discussing now.

OK.

>  * The amount of physical address space available. If you don't
>    dynamically configure those windows, then you have to account the
>    "worst case", i.e the PCIe devices that require very large memory
>    areas. So you end up creating static windows that reserve 32M or 64M
>    or 128M *per* PCIe link. You can see that it "consumes" pretty
>    quickly a large part of the 4G physical address space that we have.
>    Thanks to the dynamic window configuration that we do with the
>    PCI-to-PCI bridge, we can size the windows exactly the size needed
>    by the downstream device on each PCIe interface.

That aspect is applicable to any PCIe system; there's always some chunk
of physical address space that maps to PCIe, and which must be divided
into per-root-port chunks.

I think the only difference on the Marvell HW is:

* The overall total size of the physical address space is dynamic rather
than fixed, because it's programmed through windows rather than
hard-coded into HW.

* Hence, the windows /both/ define the physical address space layout
/and/ define the routing of transactions to individual root ports. On
regular PCIe, the root port BARs only divide up the overall physical
(PCIe bus 0 really) address space and hence perform routing; they have
no influence over the CPU physical address space.

So I think the crux of the problem is that you really have 10 PCIe root
ports, each of which is nominally a separate PCIe domain (since the
windows connect CPU physical address space to an individual/specific
root port's PCIe address space, rather than having separate connections
from CPU -> PCIe bus 0 address space, then BARs/windows connecting PCIe
bus 0 address space to individual root port/subordinate bus address
space), but you're attempting to treat all 10 ports as a single PCIe
domain so that you don't have to dedicate separate physical CPU address
space to each port, which you would have to do if they were actually
treated as separate domains.

>> The thing here is that when the PCIe core writes to a root port BAR
>> window to configure/enable it the first time, you'll need to capture
>> that transaction and dynamically allocate a window and program it in a
>> way equivalent to what the BAR register write would have achieved on
>> standard HW. Later, the window might need resizing, or even to be
>> completely disabled, if the PCIe core were to change the standard BAR
>> register. Dynamically allocating a window when the BAR is written
>> seems a little heavy-weight.
> 
> Why?

Well, it's just a bunch more code; much more than a simple writel().
Jason Gunthorpe Feb. 1, 2013, 7:39 p.m. UTC | #51
On Fri, Feb 01, 2013 at 10:57:20AM -0700, Stephen Warren wrote:

> > We have 20 windows on Armada XP if I remember correctly, and they are
> > not only used for PCIe, but also to map the BootROM (needed to boot
> > secondary CPUs), to map SPI flashes or NOR flashes, for example. So
> > they are really shared between many uses. In terms of PCIe, there are
> > only two types of windows: I/O and Memory, there is no notion of
> > Prefetchable Memory window as far as I could see.
> 
> In Tegra, we end up having separate MMIO vs. Prefetchable MMIO
> chunks of our overall PCIe aperture. However, the HW setup appears
> the same for both of those. I'm not sure if it's a bug in the
> driver, or if it's just to separate the two address spaces so that
> the page tables can be configured for those two regions with large
> rather than small granularity. I need to go investigate that.

The only purpose of prefetchable space is for legacy PCI. When a P2P
bridge targets legacy PCI it has different behavior for its
prefetchable memory window compared to the non-prefetchable memory
window.

IIRC (though it has been a long time since I looked really close at
this) PCI-X and PCI-E did away with this special bridge behaviour but
kept the prefetchable memory space for compatibility.

These days it is typically used to mark cacheable memory on an end
device.

From a SOC perspective, there is no need to treat MMIO and prefetch
areas any differently. ARM's per-page cacheability flags can be used to
deal with the differing caching requirements.

However, the bus tree downstream of each root port will require the
prefetch window to be contiguous. On Marvell, today, this means you
need to burn two mbus windows to get this. If the Linux pci core could
allocate the prefetch space for each root port bridge contiguously
with the mmio space for the same root port then this could be reduced
to one window covering both spaces for the port.
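
(As an aside, in the DT "ranges" notation used later in this thread,
such a contiguous allocation would simply place the prefetchable
aperture - type code 0xc2000000, i.e. the 0x82000000 32-bit memory
code plus the 0x40000000 prefetchable bit - directly after the
non-prefetchable one; the addresses below are made up:)

	0x82000000 0 0  0xc1000000 0 0x08000000   /* non-prefetchable memory */
	0xc2000000 0 0  0xc9000000 0 0x08000000   /* prefetchable memory */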

> So there are 10 PCIe interfaces (root ports). That's on the SoC itself
> right. Are all 10 (or a large number of them) actually used at once on
> any given board design? I suppose this must be the case, or Marvell
> wouldn't have wasted the silicon space on 10 root ports... Still, that's
> a rather large number of ports!

Agreed.. I have no idea what the target is for this..
 
> I think the only difference on the Marvell HW is:
> 
> * The overall total size of the physical address space is dynamic rather
> than fixed, because it's programmed through windows rather than
> hard-coded into HW.

Is it hard coded on tegra? I thought there was a register set that was
used to set the overall PCI-E MMIO window location and size. I know
even on x86 the PCI window is set via register, though that typically
isn't disclosed except to bios writers.. 

Jason
Jason Gunthorpe Feb. 1, 2013, 7:58 p.m. UTC | #52
On Fri, Feb 01, 2013 at 05:45:29PM +0000, Arnd Bergmann wrote:

> Yes, that was my point. I think in this case, the bug is in the new
> of_pci_process_ranges functions, which returns a 'struct resource'
> translated into IORESOURCE_MEM space, but with the type set
> to IORESOURCE_IO. This resource then gets passed to 
> pci_add_resource_offset().

A standard way to express the required address translation from CPU
physical address to IO bus address might help other people avoid this
trap??

Jason
Stephen Warren Feb. 1, 2013, 8:30 p.m. UTC | #53
On 02/01/2013 12:39 PM, Jason Gunthorpe wrote:
> On Fri, Feb 01, 2013 at 10:57:20AM -0700, Stephen Warren wrote:
...
>> I think the only difference on the Marvell HW is:
>>
>> * The overall total size of the physical address space is dynamic rather
>> than fixed, because it's programmed through windows rather than
>> hard-coded into HW.
> 
> Is it hard coded on tegra? I thought there was a register set that was
> used to set the overall PCI-E MMIO window location and size. I know
> even on x86 the PCI window is set via register, though that typically
> isn't disclosed except to bios writers.. 

There is a fixed (in HW) 1 GiB physical address window dedicated to
PCIe. That window is divided between host controller registers, PCIe
root port registers (since our root ports don't respond to configuration
transactions), and regular PCIe accesses: config/MMIO/IO. There are
registers in the host controller that configure the division of this
space into config/MMIO/IO, so that can be dynamic. The DT bindings for
the driver Thierry proposed hard-code those divisions in DT.
Thomas Petazzoni Feb. 7, 2013, 10:24 a.m. UTC | #54
Dear Thierry Reding,

On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote:

> > I didn't test recently, but with my first version of the patch set,
> > having an initialization as late as module_init() was too late. Some
> > PCI fixup code was being executed *before* we get the opportunity of
> > initializing the PCI driver, and it was crashing the kernel. I can
> > provide more details if you want.
> 
> Does this patch perhaps fix this crash?
> 
> 	http://patchwork.ozlabs.org/patch/210870/

I investigated a bit more, and managed to reproduce my crash even with
your patch applied. And indeed, my crash is really unrelated to the
pcibios function disappearing. Here is the kernel panic (and a short
analysis afterwards):

Unhandled fault: external abort on non-linefetch (0x1008) at 0xe0910010
Internal error: : 1008 [#1] SMP ARM
Modules linked in:
CPU: 0    Not tainted  (3.8.0-rc5-00029-g80e55fd-dirty #1303)
PC is at quirk_usb_handoff_xhci+0x5c/0x284
LR is at ioremap_pte_range+0x84/0xdc
pc : [<c022717c>]    lr : [<c0150944>]    psr: a0000013
sp : df82bce8  ip : df81c000  fp : c0e09dac
r10: 00008000  r9 : 00000000  r8 : de935000
r7 : e0910000  r6 : c03f0ce0  r5 : de935000  r4 : de935000
r3 : 01c801c8  r2 : 00000000  r1 : 42007e13  r0 : e0910000
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 10c5387d  Table: 1e95c019  DAC: 00000015
Process swapper/0 (pid: 1, stack limit = 0xdf82a238)
Stack: (0xdf82bce8 to 0xdf82c000)
bce0:                   de935060 c01702f4 de935000 de935000 de935000 c03f0ce0
bd00: c0e220dc c02277f8 de935060 c02277d4 c03f0ce0 c0227858 c03f0ce0 c0178d98
bd20: c01a855c c0312dd0 00000000 00000000 de936f6c c0312ed0 c0e0f7cc de935000
bd40: de936c14 de936c00 df82bde8 00000001 de916668 c016a958 00000000 de935000
bd60: de936c14 c016ab5c de935800 de910014 de936c00 c016abcc 00000000 00000001
bd80: de910000 c016d234 de916668 de9166d0 de916640 00000000 df82be14 c017a93c
bda0: de916668 df82be14 df82bde8 c0012ef8 f3cec23f c0251c90 c0019da0 df805340
bdc0: f3cec23f df82bde8 c0e426e8 df82be14 df802b00 de912a90 00000060 df89e400
bde0: 00000002 c00131dc df82bde8 df82bde8 c0e454f0 00000000 de9166d0 de912af0
be00: df802b00 c0442010 df89e410 00000000 00000000 c0e0fcac 00000001 df82be54
be20: c04420b4 c017a90c 00000000 00000000 00000000 c0442054 c1000000 c8ffffff
be40: c124b5f0 00000200 00000000 00000000 00000000 de9166d0 df89e410 c0e43e48
be60: c0e0fc6c df89e410 00000000 c0e0fc6c c04563d4 c017a7c4 00000000 c01a8c24
be80: c01a8c0c c01a78e8 df89e410 c0e0fc6c df89e444 00000000 c043023c c01a7bd8
bea0: c0e0fc6c 00000000 c01a7b4c c01a632c df8067d8 df85a474 c0e0fc6c c0e170c8
bec0: de916740 c01a7228 c03d28bc c0e0fc6c c0e0fc6c df82a000 c0e220c0 00000000
bee0: c043023c c01a80d8 00000000 c0e0fc58 df82a000 c0e220c0 00000000 c043023c
bf00: c017a7c4 c01a8e1c c044fc80 df82a000 c0e220c0 c00086d4 c040e208 00000006
bf20: 0000007b c017a7c4 0000007b 00000006 00000006 c043023c c124bb15 00000000
bf40: c0e08194 c044fc80 00000006 c044fc60 c0e220c0 c043023c c04563d4 0000007b
bf60: 00000000 c04308b0 00000006 00000006 c043023c 00000000 c0456128 c0456128
bf80: 00000000 00000000 00000000 00000000 00000000 c0430958 00000000 00000000
bfa0: c03156c8 c03156d0 00000000 c000dfd8 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 2254c9ef c6885425
[<c022717c>] (quirk_usb_handoff_xhci+0x5c/0x284) from [<c02277d4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4)
[<c02277d4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) from [<c0178d98>] (pci_do_fixups+0xc4/0x17c)
[<c0178d98>] (pci_do_fixups+0xc4/0x17c) from [<c016a958>] (pci_bus_add_device+0x14/0x60)
[<c016a958>] (pci_bus_add_device+0x14/0x60) from [<c016ab5c>] (pci_bus_add_devices+0x44/0x128)
[<c016ab5c>] (pci_bus_add_devices+0x44/0x128) from [<c016abcc>] (pci_bus_add_devices+0xb4/0x128)
[<c016abcc>] (pci_bus_add_devices+0xb4/0x128) from [<c016d234>] (pci_scan_root_bus+0x7c/0xcc)
[<c016d234>] (pci_scan_root_bus+0x7c/0xcc) from [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c)
[<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) from [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c)
[<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) from [<c00131dc>] (pci_common_init+0x44/0xc4)
[<c00131dc>] (pci_common_init+0x44/0xc4) from [<c0442010>] (mvebu_pcie_probe+0x360/0x3a4)
[<c0442010>] (mvebu_pcie_probe+0x360/0x3a4) from [<c01a8c24>] (platform_drv_probe+0x18/0x1c)
[<c01a8c24>] (platform_drv_probe+0x18/0x1c) from [<c01a78e8>] (really_probe+0x60/0x1e0)
[<c01a78e8>] (really_probe+0x60/0x1e0) from [<c01a7bd8>] (__driver_attach+0x8c/0x90)
[<c01a7bd8>] (__driver_attach+0x8c/0x90) from [<c01a632c>] (bus_for_each_dev+0x50/0x7c)
[<c01a632c>] (bus_for_each_dev+0x50/0x7c) from [<c01a7228>] (bus_add_driver+0x168/0x22c)
[<c01a7228>] (bus_add_driver+0x168/0x22c) from [<c01a80d8>] (driver_register+0x78/0x144)
[<c01a80d8>] (driver_register+0x78/0x144) from [<c01a8e1c>] (platform_driver_probe+0x18/0xac)
[<c01a8e1c>] (platform_driver_probe+0x18/0xac) from [<c00086d4>] (do_one_initcall+0x34/0x174)
[<c00086d4>] (do_one_initcall+0x34/0x174) from [<c04308b0>] (do_basic_setup+0x90/0xc4)
[<c04308b0>] (do_basic_setup+0x90/0xc4) from [<c0430958>] (kernel_init_freeable+0x74/0x10c)
[<c0430958>] (kernel_init_freeable+0x74/0x10c) from [<c03156d0>] (kernel_init+0x8/0xe4)
[<c03156d0>] (kernel_init+0x8/0xe4) from [<c000dfd8>] (ret_from_fork+0x14/0x3c)
Code: e3a02000 ebf7c092 e2507000 0afffff7 (e5973010) 
---[ end trace 834a6081748c17ef ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Basically, the problem comes from the fact that the USB XHCI code has
an early PCI fixup that ioremap()s the PCI device memory BAR and
accesses it. Unfortunately, this fixup is called during
pcibios_init_hw(), at a point where the Marvell PCIe driver hasn't yet
had a chance to set up the address decoding windows (the Linux PCI core
hasn't even configured the emulated PCI-to-PCI bridges, so I don't know
where the PCI devices will sit).

Due to this, the first access to the PCI device memory by this early
fixup triggers an exception, and the kernel panics.
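
For reference, the fixup that crashes does roughly the following
(paraphrased from quirk_usb_handoff_xhci() in
drivers/usb/host/pci-quirks.c; the readl() of HCC_PARAMS at offset
0x10 matches the faulting address 0xe0910010 in the trace above):

	static void quirk_usb_handoff_xhci(struct pci_dev *pdev)
	{
		void __iomem *base;
		int ext_cap_offset;

		/* ioremap() BAR 0, which no window routes yet on our HW */
		base = ioremap_nocache(pci_resource_start(pdev, 0),
				       pci_resource_len(pdev, 0));
		if (base == NULL)
			return;

		/* first MMIO read: HCC_PARAMS at offset 0x10 -> abort */
		ext_cap_offset = xhci_find_next_cap_offset(base,
					XHCI_HCC_PARAMS_OFFSET);
		/* ... rest of the BIOS/OS handoff logic ... */
	}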

For some reason, moving the Marvell PCIe driver initialization to the
subsys_initcall() level works around the problem, but I'm not sure why,
since it is actually the driver initialization that ends up calling the
PCI fixup code. But clearly, with the PCIe initialization done at
subsys_initcall() time, the PCIe driver is initialized first, and the
PCI fixup is executed much later.

Ideas welcome.

Best regards,

Thomas
Thomas Petazzoni Feb. 7, 2013, 2:37 p.m. UTC | #55
Dear Andrew Murray,

On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote:

> > +       /*
> > +        * Build an laddr array that describes the PCI device in a
> > DT
> > +        * way
> > +        */
> > +       laddr[0] = cpu_to_be32(port->devfn << 8);
> > +       laddr[1] = laddr[2] = 0;
> > +       intspec = cpu_to_be32(pin);
> > +
> > +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> > +       if (ret) {
> > +               dev_err(&pcie->pdev->dev,
> > +                       "%s: of_irq_map_raw() failed, %d\n",
> > +                       __func__, ret);
> > +               return ret;
> > +       }
> 
> Are you able to replace the above code with a call to of_irq_map_pci?
> I'm not sure which approach is better. The of_irq_map_pci function
> doesn't require the pin argument and instead uses the DT and/or
> performs its own pin swizzling. I guess this means that if there are
> PCIe devices in the DT tree that does any thing strange with pins
> then it would be reflected in the IRQ you get. I've found that you
> will also need to provide an implementation of
> pcibios_get_phb_of_node for this to work correctly (see my RFC bios32
> patch).

I tried to do so, but it doesn't work properly. Let me explain what I
did and the behavior that I observe.

First of all, I didn't reimplement the pcibios_get_phb_of_node(), but
instead, as Thierry Reding suggested, simply implemented the
hw_pci.scan() function as follows:

static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys)
{
	struct mvebu_pcie *pcie = sys_to_pcie(sys);
	return pci_scan_root_bus(&pcie->pdev->dev, sys->busnr,
				 &mvebu_pcie_ops, sys, &sys->resources);
}

This makes it possible to pass the "struct device *" pointer, which
ultimately allows the PCI devices to carry a pointer to the
corresponding DT node.

The DT representing my PCIe controller and its interfaces is the
following:

		pcie-controller {
			compatible = "marvell,armada-370-xp-pcie";
			status = "disabled";

			#address-cells = <3>;
			#size-cells = <2>;

			bus-range = <0x00 0xff>;

			ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
				  0x00001000 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
				  0x81000000 0 0	  0xc0000000 0 0x00010000   /* downstream I/O */
				  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */

			#interrupt-cells = <1>;
			interrupt-map-mask = <0xf800 0 0 1>;
			interrupt-map = <0x0800 0 0 1 &mpic 58 /* port 0.0 */
					 0x1000 0 0 1 &mpic 62>; /* port 1.0 */

			pcie@0,0 {
				device_type = "pciex";
				reg = <0x0800 0 0xd0040000 0 0x2000>;
				#address-cells = <3>;
				#size-cells = <2>;
				marvell,pcie-port = <0>;
				marvell,pcie-lane = <0>;
				clocks = <&gateclk 5>;
				status = "disabled";
			};

			pcie@1,0 {
				device_type = "pciex";
				reg = <0x1000 0 0xd0080000 0 0x2000>;
				#address-cells = <3>;
				#size-cells = <2>;
				marvell,pcie-port = <1>;
				marvell,pcie-lane = <0>;
				clocks = <&gateclk 9>;
				status = "disabled";
			};
		};

So we have two PCIe interfaces. lspci shows the following output:

00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
01:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
02:00.0 USB controller: ASMedia Technology Inc. Device 1040

So:

 * On bus 0, we have two PCI-to-PCI bridges. Those are emulated in
   software and allow the Marvell PCIe driver to get dynamic assignment
   of address decoding windows. The entries in the interrupt-map DT
   property match the bus number and slot number of those PCI-to-PCI
   bridges.

 * On bus 1, we have the real PCIe device connected to the first PCIe
   interface. This bus 1 is made "visible" thanks to the 00:01.0
   PCI-to-PCI bridge.

 * On bus 2, we have the real PCIe device connected to the second PCIe
   interface. This bus 2 is made "visible" thanks to the 00:02.0
   PCI-to-PCI bridge.

Now, when I call the of_irq_map_pci() function, the problem is that the
"struct pci_dev" that it receives is the one corresponding to the
particular PCIe device we're interested in. And this "struct pci_dev"
has a non-NULL pointer to the "struct device_node" representing
"pcie0,0" or "pci0,1" above. Since the "struct device_node" is
non-NULL, of_irq_map_pci() builds a laddr[] with the bus number and
devfn of this device: bus number is 1, devfn is 0. And this doesn't
match with the interrupt-map that is in my DT, which associates the
interrupts numbers with the PCI-to-PCI bridges rather than the devices
themselves.

To me, the "struct pci_dev" representing the real PCIe devices should
have a NULL "struct device_node" pointer, because those devices are not
represented in the DT. If this was the case, then the of_irq_map_pci()
would go one level up in the PCI hierarchy, find the "struct pci_dev"
that corresponds to the PCI-to-PCI bridge, which generates an laddr[]
having a bus number and devfn value matching the interrupt-map property.

If I do the following hack in of_irq_map_pci(), then everything works
nicely:

                } else {
                        /* We found a P2P bridge, check if it has a node */
                        ppnode = pci_device_to_OF_node(ppdev);
+                       if (pdev->bus->number != 0)
+                               ppnode = NULL;
                }


What this hack does is that if we are not on bus 0, it means that the
pci_dev is a real PCIe device, and therefore we force the code to
assume it does not have a DT reference.

Isn't there a problem here in the PCI/DT code? Is it normal that a
PCIe device that isn't described in the DT carries a non-NULL struct
device_node pointer?

If you need some more details about the problem, do not hesitate to ask.

Thanks a lot for your help,

Thomas
Bjorn Helgaas Feb. 7, 2013, 3:46 p.m. UTC | #56
On Thu, Feb 7, 2013 at 3:24 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Thierry Reding,
>
> On Tue, 29 Jan 2013 10:20:06 +0100, Thierry Reding wrote:
>
>> > I didn't test recently, but with my first version of the patch set,
>> > having an initialization as late as module_init() was too late. Some
>> > PCI fixup code was being executed *before* we get the opportunity of
>> > initializing the PCI driver, and it was crashing the kernel. I can
>> > provide more details if you want.
>>
>> Does this patch perhaps fix this crash?
>>
>>       http://patchwork.ozlabs.org/patch/210870/
>
> I investigated a bit more, and managed to reproduce my crash even with
> your patch applied. And indeed, my crash is really unrelated to the
> pcibios function disappearing. Here is the kernel panic (and a short
> analysis afterwards):

Hi Thomas,

Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y?
That should have more information about the enumeration process,
including what we think the XHCI BARs are and the apertures leading to
them.

The PCI core assumes that we know the host bridge apertures up front,
and I'm not sure that is true on your platform, so maybe we'll need
some changes to accommodate that.

Bjorn
Andrew Murray Feb. 7, 2013, 3:51 p.m. UTC | #57
On Thu, Feb 07, 2013 at 02:37:50PM +0000, Thomas Petazzoni wrote:
> Dear Andrew Murray,
> 
> On Tue, 29 Jan 2013 13:22:04 +0000, Andrew Murray wrote:
> 
> > > +       /*
> > > +        * Build an laddr array that describes the PCI device in a
> > > DT
> > > +        * way
> > > +        */
> > > +       laddr[0] = cpu_to_be32(port->devfn << 8);
> > > +       laddr[1] = laddr[2] = 0;
> > > +       intspec = cpu_to_be32(pin);
> > > +
> > > +       ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
> > > +       if (ret) {
> > > +               dev_err(&pcie->pdev->dev,
> > > +                       "%s: of_irq_map_raw() failed, %d\n",
> > > +                       __func__, ret);
> > > +               return ret;
> > > +       }
> > 
> > Are you able to replace the above code with a call to of_irq_map_pci?
> > I'm not sure which approach is better. The of_irq_map_pci function
> > doesn't require the pin argument and instead uses the DT and/or
> > performs its own pin swizzling. I guess this means that if there are
> > PCIe devices in the DT tree that does any thing strange with pins
> > then it would be reflected in the IRQ you get. I've found that you
> > will also need to provide an implementation of
> > pcibios_get_phb_of_node for this to work correctly (see my RFC bios32
> > patch).
> 
> I tried to do so, but it doesn't work properly. Let me explain what I
> did and the behavior that I observe.
> 
> First of all, I didn't reimplement the pcibios_get_phb_of_node(), but
> instead, as Thierry Reding suggested, simply implemented the
> hw_pci.scan() function as follows:

I've not had any time to test Thierry's solution to avoid implementing
pcibios_get_phb_of_node - but it did seem to work for him and seem correct
at the time.

> 
> static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys)
> {
> 	struct mvebu_pcie *pcie = sys_to_pcie(sys);
> 	return pci_scan_root_bus(&pcie->pdev->dev, sys->busnr,
> 				 &mvebu_pcie_ops, sys, &sys->resources);
> }
> 
> This makes it possible to pass the "struct device *" pointer, which
> ultimately allows the PCI devices to carry a pointer to the
> corresponding DT node.
> 
> The DT representing my PCIe controller and its interfaces is the
> following:
> 
> 		pcie-controller {
> 			compatible = "marvell,armada-370-xp-pcie";
> 			status = "disabled";
> 
> 			#address-cells = <3>;
> 			#size-cells = <2>;
> 
> 			bus-range = <0x00 0xff>;
> 
> 			ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
> 				  0x00001000 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
> 				  0x81000000 0 0	  0xc0000000 0 0x00010000   /* downstream I/O */
> 				  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */
> 
> 			#interrupt-cells = <1>;
> 			interrupt-map-mask = <0xf800 0 0 1>;
> 			interrupt-map = <0x0800 0 0 1 &mpic 58 /* port 0.0 */
> 					 0x1000 0 0 1 &mpic 62>; /* port 1.0 */
> 
> 			pcie@0,0 {
> 				device_type = "pciex";
> 				reg = <0x0800 0 0xd0040000 0 0x2000>;
> 				#address-cells = <3>;
> 				#size-cells = <2>;
> 				marvell,pcie-port = <0>;
> 				marvell,pcie-lane = <0>;
> 				clocks = <&gateclk 5>;
> 				status = "disabled";
> 			};
> 
> 			pcie@1,0 {
> 				device_type = "pciex";
> 				reg = <0x1000 0 0xd0080000 0 0x2000>;
> 				#address-cells = <3>;
> 				#size-cells = <2>;
> 				marvell,pcie-port = <1>;
> 				marvell,pcie-lane = <0>;
> 				clocks = <&gateclk 9>;
> 				status = "disabled";
> 			};
> 		};
> 
> So we have two PCIe interfaces. lspci shows the following output:
> 
> 00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 1092
> 01:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
> 02:00.0 USB controller: ASMedia Technology Inc. Device 1040
> 
> So:
> 
>  * On bus 0, we have two PCI-to-PCI bridges. Those are emulated in
>    software and allow the Marvell PCIe driver to get dynamic assignment
>    of address decoding windows. The entries in the interrupt-map DT
>    property match the bus number and slot number of those PCI-to-PCI
>    bridges.
> 
>  * On bus 1, we have the real PCIe device connected to the first PCIe
>    interface. This bus 1 is made "visible" thanks to the 00:01.0
>    PCI-to-PCI bridge.
> 
>  * On bus 2, we have the real PCIe device connected to the second PCIe
>    interface. This bus 2 is made "visible" thanks to the 00:02.0
>    PCI-to-PCI bridge.
> 
> Now, when I call the of_irq_map_pci() function, the problem is that the
> "struct pci_dev" that it receives is the one corresponding to the
> particular PCIe device we're interested in. And this "struct pci_dev"
> has a non-NULL pointer to the "struct device_node" representing
> "pcie0,0" or "pci0,1" above. Since the "struct device_node" is
> non-NULL, of_irq_map_pci() builds a laddr[] with the bus number and
> devfn of this device: bus number is 1, devfn is 0. And this doesn't
> match with the interrupt-map that is in my DT, which associates the
> interrupts numbers with the PCI-to-PCI bridges rather than the devices
> themselves.
> 
> To me, the "struct pci_dev" representing the real PCIe devices should
> have a NULL "struct device_node" pointer, because those devices are not
> represented in the DT. If this was the case, then the of_irq_map_pci()
> would go one level up in the PCI hierarchy, find the "struct pci_dev"
> that corresponds to the PCI-to-PCI bridge, which generates an laddr[]
> having a bus number and devfn value matching the interrupt-map property.

Yes this is my current understanding.

> 
> If I do the following hack in of_irq_map_pci(), then everything works
> nicely:
> 
>                 } else {
>                         /* We found a P2P bridge, check if it has a node */
>                         ppnode = pci_device_to_OF_node(ppdev);
> +                       if (pdev->bus->number != 0)
> +                               ppnode = NULL;
>                 }
> 
> 
> What this hack does is that if we are not on bus 0, it means that the
> pci_dev is a real PCIe device, and therefore we force the code to
> assume it does not have a DT reference.
> 
> Isn't there a problem here in the PCI/DT code? Is it normal that a
> PCIe device that isn't described in the DT carries a non-NULL struct
> device_node pointer?

I would suggest the issue isn't in the PCI/DT code. This is what I see
with my implementation (which uses an implementation of
pcibios_get_phb_of_node) - I'd like to believe this is correct behaviour:

- of_irq_map_pci is called for an endpoint (5:0:0), its pdev->dev.of_node
  is NULL. As I don't have a representation of this endpoint in my DT the
  of_irq_map_pci code proceeds to walk the fabric.
- Starting with the pdev for 5:0:0, of_irq_map_pci sees it has a parent
  (downstream switch bridge); in this case pdev(5:0:0)->bus->self->dev.of_node
  is NULL, so swizzling occurs and we walk up to the parent's bus (bus 2)
- This continues to bus 1 and then bus 0, where of_irq_map_pci realises that
  it has no parent (it's the host bridge). Due to the implementation of
  pcibios_get_phb_of_node an of_node is produced, which is then used to
  construct an lspec (for bus number 0).

The of_irq_map_pci code stops walking as soon as it finds a function in the
tree that has a device node. This suggests that if you represent a bridge
you must also include an interrupt-map. The problem here is that you have
represented a bridge but not included a map.

I can think of three solutions:

1. Something like this (see the concrete sketch after this list):

                 } else {
                         /* We found a P2P bridge, check if it has a node */
                         ppnode = pci_device_to_OF_node(ppdev);
 +                       if (ppnode doesnt have an interrupt-map)//if (pdev->bus->number != 0)
 +                               ppnode = NULL;
                 }

2. Remove the bridges from the DT? Or remove the map from pcie-controller and
   add a map each to pcie@0,1 and pcie@1,1?

3. Change the mask of your map so that it doesn't care about bus numbers. I
   have a map that looks like this:

                interrupt-map-mask = <0 0 0 7>;
                interrupt-map = <0 0 0 1 ... //anything coming in on INTA
                                 0 0 0 2 ... //anything coming in on INTB
                                 0 0 0 3 ... ...
                                 0 0 0 4 ... ...
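
Fleshing out option 1, the check could read something like this (just a
sketch - of_get_property() is the existing OF helper; whether the rest
of the of_irq_map_pci() walk then does the right thing is untested):

                } else {
                        /* We found a P2P bridge, check if it has a node */
                        ppnode = pci_device_to_OF_node(ppdev);
                        /* Only trust the bridge node if it can actually
                         * translate the interrupt itself */
                        if (ppnode &&
                            !of_get_property(ppnode, "interrupt-map", NULL))
                                ppnode = NULL;
                }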

Andrew Murray

> 
> If you need some more details about the problem, do not hesitate to ask.
> 
> Thanks a lot for your help,
> 
> Thomas
> -- 
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
> 

Thomas Petazzoni Feb. 7, 2013, 4 p.m. UTC | #58
Dear Bjorn Helgaas,

On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote:

> Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y?
> That should have more information about the enumeration process,
> including what we think the XHCI BARs are and the apertures leading to
> them.

Sure, see below.

> The PCI core assumes that we know the host bridge apertures up front,
> and I'm not sure that is true on your platform, so maybe we'll need
> some changes to accommodate that.

In this hardware, we need to set up the address decoding windows. So
there shouldn't be any access to a PCI device memory or I/O region
until the addresses have been assigned in the PCI-to-PCI bridge.

Note that I am now setting up the address decoding window as soon as
the address is written into the PCI-to-PCI bridge. I am no longer
waiting for the end of the enumeration process and then going through
the PCI-to-PCI bridge registers to configure them.

The system tested below is an Armada 370; it has only two PCIe links.
One is connected to a XHCI USB controller, the other one to an Intel
Wireless NIC.

First the dmesg when module_init() is used, which shows the crash:

===================================================================
Booting Linux on physical CPU 0x0
Linux version 3.8.0-rc5-00029-g80e55fd-dirty (thomas@skate) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #1313 SMP Thu Feb 7 16:53:32 CET 2013
CPU: ARMv7 Processor [561f5811] revision 1 (ARMv7), cr=10c53c7d
CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
Machine: Marvell Armada 370/XP (Device Tree), model: Globalscale Mirabox
bootconsole [earlycon0] enabled
Memory policy: ECC disabled, Data cache writeback
PERCPU: Embedded 7 pages/cpu @c128b000 s6464 r8192 d14016 u32768
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 130048
Kernel command line: console=ttyS0,115200 earlyprintk
PID hash table entries: 2048 (order: 1, 8192 bytes)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
__ex_table already sorted, skipping sort
Memory: 512MB = 512MB total
Memory: 504860k/504860k available, 19428k reserved, 0K highmem
Virtual kernel memory layout:
    vector  : 0xffff0000 - 0xffff1000   (   4 kB)
    fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
    vmalloc : 0xe0800000 - 0xff000000   ( 488 MB)
    lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
    pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
    modules : 0xbf000000 - 0xbfe00000   (  14 MB)
      .text : 0xc0008000 - 0xc0467700   (4478 kB)
      .init : 0xc0468000 - 0xc0e30940   (10019 kB)
      .data : 0xc0e32000 - 0xc0e5f9c0   ( 183 kB)
       .bss : 0xc0e5f9c0 - 0xc0e84fdc   ( 150 kB)
Hierarchical RCU implementation.
	RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1.
NR_IRQS:16 nr_irqs:16 16
Aurora cache controller enabled
l2x0: 4 ways, CACHE_ID 0x00000100, AUX_CTRL 0x1a086302, Cache size: 262144 B
sched_clock: 32 bits at 18MHz, resolution 53ns, wraps every 229064ms
Console: colour dummy device 80x30
Calibrating delay loop... 1196.85 BogoMIPS (lpj=5984256)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
CPU0: thread -1, cpu 0, socket -1, mpidr 0
Setting up static identity map for 0x351a28 - 0x351a80
Brought up 1 CPUs
SMP: Total of 1 processors activated (1196.85 BogoMIPS).
devtmpfs: initialized
pinctrl core: initialized pinctrl subsystem
NET: Registered protocol family 16
DMA: preallocated 1024 KiB pool for atomic coherent allocations
irq: Cannot allocate irq_descs @ IRQ27, assuming pre-allocated
irq: Cannot allocate irq_descs @ IRQ63, assuming pre-allocated
irq: Cannot allocate irq_descs @ IRQ96, assuming pre-allocated
Initializing Coherency fabric
bio: create slab <bio-0> at 0
vgaarb: loaded
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
cfg80211: Calling CRDA to update world regulatory domain
Switching to clocksource armada_370_xp_clocksource
NET: Registered protocol family 2
TCP established hash table entries: 4096 (order: 3, 32768 bytes)
TCP bind hash table entries: 4096 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 4096)
TCP: reno registered
UDP hash table entries: 256 (order: 1, 8192 bytes)
UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
msgmni has been set to 986
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
armada-370-pinctrl d0018000.pinctrl: registered pinctrl driver
mvebu-pcie pcie-controller.1: PCIe0.0: link up
mvebu-pcie pcie-controller.1: PCIe1.0: link up
mvebu-pcie pcie-controller.1: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0xc0000000-0xc000ffff]
pci_bus 0000:00: root bus resource [mem 0xc1000000-0xc8ffffff]
pci_bus 0000:00: root bus resource [bus 00-ff]
PCI: bus0: Fast back to back transfers disabled
pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
PCI: bus1: Fast back to back transfers disabled
PCI: bus2: Fast back to back transfers disabled
Unhandled fault: external abort on non-linefetch (0x1008) at 0xe0910010
Internal error: : 1008 [#1] SMP ARM
Modules linked in:
CPU: 0    Not tainted  (3.8.0-rc5-00029-g80e55fd-dirty #1313)
PC is at quirk_usb_handoff_xhci+0x5c/0x284
LR is at ioremap_pte_range+0x84/0xdc
pc : [<c023ba8c>]    lr : [<c0150944>]    psr: a0000013
sp : df82bce8  ip : df81c000  fp : c0e43dac
r10: 00008000  r9 : 00000000  r8 : de933000
r7 : e0910000  r6 : c04267a0  r5 : de933000  r4 : de933000
r3 : 01c801c8  r2 : 00000000  r1 : 42007e13  r0 : e0910000
Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
Control: 10c5387d  Table: 1e960019  DAC: 00000015
Process swapper/0 (pid: 1, stack limit = 0xdf82a238)
Stack: (0xdf82bce8 to 0xdf82c000)
bce0:                   de933060 c01702f4 de933000 de933000 de933000 c04267a0
bd00: c0e5f9dc c023c108 de933060 c023c0e4 c04267a0 c023c168 c04267a0 c0178d98
bd20: c01a855c c033cab4 00000000 00000000 de935f6c c033cbb4 c0e497cc de933000
bd40: de935c14 de935c00 df82bde8 00000001 de919368 c016a958 00000000 de933000
bd60: de935c14 c016ab5c de933800 de918014 de935c00 c016abcc 00000000 00000001
bd80: de918000 c016d234 de919368 de9193d0 de919340 00000000 df82be14 c017a93c
bda0: de919368 df82be14 df82bde8 c0012ef8 f3cec23f c027b974 c0019da0 df805340
bdc0: f3cec23f df82bde8 c0e7ffe8 df82be14 df802b00 de917890 00000060 df89e400
bde0: 00000002 c00131dc df82bde8 df82bde8 c0e83318 00000000 de9193d0 de9178f0
be00: df802b00 c047a010 df89e410 00000000 00000000 c0e49cac 00000001 df82be54
be20: c047a0b4 c017a90c 00000000 00000000 00000000 c047a054 c1000000 c8ffffff
be40: c12885f0 00000200 00000000 00000000 00000000 de9193d0 df89e410 c0e81748
be60: c0e49c6c df89e410 00000000 c0e49c6c c048e9b8 c017a7c4 00000000 c01a8c24
be80: c01a8c0c c01a78e8 df89e410 c0e49c6c df89e444 00000000 c046823c c01a7bd8
bea0: c0e49c6c 00000000 c01a7b4c c01a632c df8067d8 df85a474 c0e49c6c c0e510c8
bec0: de919440 c01a7228 c0402548 c0e49c6c c0e49c6c df82a000 c0e5f9c0 00000000
bee0: c046823c c01a80d8 00000000 c0e49c58 df82a000 c0e5f9c0 00000000 c046823c
bf00: c017a7c4 c01a8e1c c0488260 df82a000 c0e5f9c0 c00086d4 c0444b18 00000006
bf20: 0000008b c017a7c4 0000008b 00000006 00000006 c046823c c1288b15 00000000
bf40: c0e42194 c0488260 00000006 c0488240 c0e5f9c0 c046823c c048e9b8 0000008b
bf60: 00000000 c04688b0 00000006 00000006 c046823c 00000000 c048e708 c048e708
bf80: 00000000 00000000 00000000 00000000 00000000 c0468958 00000000 00000000
bfa0: c033f3ac c033f3b4 00000000 c000dfd8 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 2254c9ef c6885425
[<c023ba8c>] (quirk_usb_handoff_xhci+0x5c/0x284) from [<c023c0e4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4)
[<c023c0e4>] (quirk_usb_early_handoff.part.1+0xc0/0xe4) from [<c0178d98>] (pci_do_fixups+0xc4/0x17c)
[<c0178d98>] (pci_do_fixups+0xc4/0x17c) from [<c016a958>] (pci_bus_add_device+0x14/0x60)
[<c016a958>] (pci_bus_add_device+0x14/0x60) from [<c016ab5c>] (pci_bus_add_devices+0x44/0x128)
[<c016ab5c>] (pci_bus_add_devices+0x44/0x128) from [<c016abcc>] (pci_bus_add_devices+0xb4/0x128)
[<c016abcc>] (pci_bus_add_devices+0xb4/0x128) from [<c016d234>] (pci_scan_root_bus+0x7c/0xcc)
[<c016d234>] (pci_scan_root_bus+0x7c/0xcc) from [<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c)
[<c017a93c>] (mvebu_pcie_scan_bus+0x30/0x3c) from [<c0012ef8>] (pcibios_init_hw+0x5c/0x15c)
[<c0012ef8>] (pcibios_init_hw+0x5c/0x15c) from [<c00131dc>] (pci_common_init+0x44/0xc4)
[<c00131dc>] (pci_common_init+0x44/0xc4) from [<c047a010>] (mvebu_pcie_probe+0x360/0x3a4)
[<c047a010>] (mvebu_pcie_probe+0x360/0x3a4) from [<c01a8c24>] (platform_drv_probe+0x18/0x1c)
[<c01a8c24>] (platform_drv_probe+0x18/0x1c) from [<c01a78e8>] (really_probe+0x60/0x1e0)
[<c01a78e8>] (really_probe+0x60/0x1e0) from [<c01a7bd8>] (__driver_attach+0x8c/0x90)
[<c01a7bd8>] (__driver_attach+0x8c/0x90) from [<c01a632c>] (bus_for_each_dev+0x50/0x7c)
[<c01a632c>] (bus_for_each_dev+0x50/0x7c) from [<c01a7228>] (bus_add_driver+0x168/0x22c)
[<c01a7228>] (bus_add_driver+0x168/0x22c) from [<c01a80d8>] (driver_register+0x78/0x144)
[<c01a80d8>] (driver_register+0x78/0x144) from [<c01a8e1c>] (platform_driver_probe+0x18/0xac)
[<c01a8e1c>] (platform_driver_probe+0x18/0xac) from [<c00086d4>] (do_one_initcall+0x34/0x174)
[<c00086d4>] (do_one_initcall+0x34/0x174) from [<c04688b0>] (do_basic_setup+0x90/0xc4)
[<c04688b0>] (do_basic_setup+0x90/0xc4) from [<c0468958>] (kernel_init_freeable+0x74/0x10c)
[<c0468958>] (kernel_init_freeable+0x74/0x10c) from [<c033f3b4>] (kernel_init+0x8/0xe4)
[<c033f3b4>] (kernel_init+0x8/0xe4) from [<c000dfd8>] (ret_from_fork+0x14/0x3c)
Code: e3a02000 ebf76e4e e2507000 0afffff7 (e5973010) 
---[ end trace 7097ba2281051df7 ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
===================================================================

And now the dmesg when the PCIe driver is initialized at the
subsys_initcall() level: no crash happens, and everything works fine.

===================================================================
Booting Linux on physical CPU 0x0
Linux version 3.8.0-rc5-00029-g80e55fd-dirty (thomas@skate) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #1314 SMP Thu Feb 7 16:57:39 CET 2013
CPU: ARMv7 Processor [561f5811] revision 1 (ARMv7), cr=10c53c7d
CPU: PIPT / VIPT nonaliasing data cache, PIPT instruction cache
Machine: Marvell Armada 370/XP (Device Tree), model: Globalscale Mirabox
bootconsole [earlycon0] enabled
Memory policy: ECC disabled, Data cache writeback
PERCPU: Embedded 7 pages/cpu @c128b000 s6464 r8192 d14016 u32768
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 130048
Kernel command line: console=ttyS0,115200 earlyprintk
PID hash table entries: 2048 (order: 1, 8192 bytes)
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
__ex_table already sorted, skipping sort
Memory: 512MB = 512MB total
Memory: 504860k/504860k available, 19428k reserved, 0K highmem
Virtual kernel memory layout:
    vector  : 0xffff0000 - 0xffff1000   (   4 kB)
    fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
    vmalloc : 0xe0800000 - 0xff000000   ( 488 MB)
    lowmem  : 0xc0000000 - 0xe0000000   ( 512 MB)
    pkmap   : 0xbfe00000 - 0xc0000000   (   2 MB)
    modules : 0xbf000000 - 0xbfe00000   (  14 MB)
      .text : 0xc0008000 - 0xc0467700   (4478 kB)
      .init : 0xc0468000 - 0xc0e30940   (10019 kB)
      .data : 0xc0e32000 - 0xc0e5f9c0   ( 183 kB)
       .bss : 0xc0e5f9c0 - 0xc0e84fdc   ( 150 kB)
Hierarchical RCU implementation.
	RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=1.
NR_IRQS:16 nr_irqs:16 16
Aurora cache controller enabled
l2x0: 4 ways, CACHE_ID 0x00000100, AUX_CTRL 0x1a086302, Cache size: 262144 B
sched_clock: 32 bits at 18MHz, resolution 53ns, wraps every 229064ms
Console: colour dummy device 80x30
Calibrating delay loop... 1196.85 BogoMIPS (lpj=5984256)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
CPU0: thread -1, cpu 0, socket -1, mpidr 0
Setting up static identity map for 0x351a28 - 0x351a80
Brought up 1 CPUs
SMP: Total of 1 processors activated (1196.85 BogoMIPS).
devtmpfs: initialized
pinctrl core: initialized pinctrl subsystem
NET: Registered protocol family 16
DMA: preallocated 1024 KiB pool for atomic coherent allocations
irq: Cannot allocate irq_descs @ IRQ27, assuming pre-allocated
irq: Cannot allocate irq_descs @ IRQ63, assuming pre-allocated
irq: Cannot allocate irq_descs @ IRQ96, assuming pre-allocated
Initializing Coherency fabric
bio: create slab <bio-0> at 0
mvebu-pcie pcie-controller.1: PCIe0.0: link up
mvebu-pcie pcie-controller.1: PCIe1.0: link up
mvebu-pcie pcie-controller.1: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0xc0000000-0xc000ffff]
pci_bus 0000:00: root bus resource [mem 0xc1000000-0xc8ffffff]
pci_bus 0000:00: root bus resource [bus 00-ff]
PCI: bus0: Fast back to back transfers disabled
pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
PCI: bus1: Fast back to back transfers disabled
PCI: bus2: Fast back to back transfers disabled
Getting IRQ slot 1, pin 1...
Cannot get irq... 135
Getting IRQ slot 2, pin 1...
Cannot get irq... 135
Getting IRQ slot 1, pin 1...
Getting IRQ slot 2, pin 1...
pci 0000:00:01.0: BAR 8: assigned [mem 0xc1000000-0xc10fffff]
pci 0000:00:02.0: BAR 8: assigned [mem 0xc1100000-0xc11fffff]
pci 0000:01:00.0: BAR 0: assigned [mem 0xc1000000-0xc1001fff 64bit]
pci 0000:00:01.0: PCI bridge to [bus 01]
pci 0000:00:01.0:   bridge window [mem 0xc1000000-0xc10fffff]
pci 0000:02:00.0: BAR 0: assigned [mem 0xc1100000-0xc1107fff 64bit]
pci 0000:00:02.0: PCI bridge to [bus 02]
pci 0000:00:02.0:   bridge window [mem 0xc1100000-0xc11fffff]
PCI: enabling device 0000:00:01.0 (0140 -> 0143)
PCI: enabling device 0000:00:02.0 (0140 -> 0143)
vgaarb: loaded
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
cfg80211: Calling CRDA to update world regulatory domain
Switching to clocksource armada_370_xp_clocksource
NET: Registered protocol family 2
TCP established hash table entries: 4096 (order: 3, 32768 bytes)
TCP bind hash table entries: 4096 (order: 3, 32768 bytes)
TCP: Hash tables configured (established 4096 bind 4096)
TCP: reno registered
UDP hash table entries: 256 (order: 1, 8192 bytes)
UDP-Lite hash table entries: 256 (order: 1, 8192 bytes)
msgmni has been set to 986
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 253)
io scheduler noop registered
io scheduler deadline registered
io scheduler cfq registered (default)
armada-370-pinctrl d0018000.pinctrl: registered pinctrl driver
mv_xor d0060800.xor: Marvell XOR driver
mv_xor d0060800.xor: Marvell XOR: ( xor cpy )
mv_xor d0060800.xor: Marvell XOR: ( xor fill cpy )
mv_xor d0060900.xor: Marvell XOR driver
mv_xor d0060900.xor: Marvell XOR: ( xor cpy )
mv_xor d0060900.xor: Marvell XOR: ( xor fill cpy )
Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
d0012000.serial: ttyS0 at MMIO 0xd0012000 (irq = 17) is a 8250
console [ttyS0] enabled, bootconsole disabled
console [ttyS0] enabled, bootconsole disabled
e1000e: Intel(R) PRO/1000 Network Driver - 2.1.4-k
e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
libphy: orion_mdio_bus: probed
mvneta d0070000.ethernet eth0: mac: fe:4b:b2:e2:64:ca
mvneta d0074000.ethernet eth1: mac: 56:dd:1a:9e:e0:aa
Intel(R) Wireless WiFi driver for Linux, in-tree:
Copyright(c) 2003-2012 Intel Corporation
iwlwifi 0000:01:00.0: pci_enable_msi failed(0Xffffffff)
iwlwifi 0000:01:00.0: loaded firmware version 8.83.5.1 build 33692
iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEBUG disabled
iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEBUGFS disabled
iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEVICE_TRACING disabled
iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_DEVICE_TESTMODE disabled
iwlwifi 0000:01:00.0: CONFIG_IWLWIFI_P2P enabled
iwlwifi 0000:01:00.0: Detected Intel(R) Ultimate N WiFi Link 5300 AGN, REV=0x24
iwlwifi 0000:01:00.0: L1 Disabled; Enabling L0S
xhci_hcd 0000:02:00.0: xHCI Host Controller
xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 1
xhci_hcd 0000:02:00.0: irq 99, io mem 0xc1100000
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
xhci_hcd 0000:02:00.0: xHCI Host Controller
xhci_hcd 0000:02:00.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 2 ports detected
Initializing USB Mass Storage driver...
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
mousedev: PS/2 mouse device common for all mice
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
TCP: cubic registered
VFP support v0.3: implementor 56 architecture 2 part 20 variant 9 rev 6
/home/thomas/projets/linux-2.6/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
Freeing init memory: 10016K
Starting logging: OK
Starting mdev...
usb 1-1: new high-speed USB device number 2 using xhci_hcd
scsi0 : usb-storage 1-1:1.0
Initializing random number generator... done.
Starting network...

Welcome to Buildroot
buildroot login:
===================================================================

Do not hesitate to ask if you need more details about this.

Thanks!

Thomas
Thomas Petazzoni Feb. 7, 2013, 4:19 p.m. UTC | #59
Dear Andrew Murray,

On Thu, 7 Feb 2013 15:51:17 +0000, Andrew Murray wrote:

> > First of all, I didn't reimplement the pcibios_get_phb_of_node(), but
> > instead, as Thierry Reding suggested, simply implemented the
> > hw_pci.scan() function as follows:
> 
> I've not had any time to test Thierry's solution to avoid implementing
> pcibios_get_phb_of_node - but it did seem to work for him and seem correct
> at the time.

At least there are device tree nodes associated with PCI devices, so it
looks good from this perspective.


> > To me, the "struct pci_dev" representing the real PCIe devices should
> > have a NULL "struct device_node" pointer, because those devices are not
> > represented in the DT. If this was the case, then the of_irq_map_pci()
> > would go one level up in the PCI hierarchy, find the "struct pci_dev"
> > that corresponds to the PCI-to-PCI bridge, which generates an laddr[]
> > having a bus number and devfn value matching the interrupt-map property.
> 
> Yes this is my current understanding.

Ok. But that's not what happens: the "struct pci_dev" representing the
real PCIe device *DOES* have a non-NULL struct device_node pointer. And
this is what makes the entire thing fail.

> I would suggest the issue isn't in the PCI/DT code. This is what I see

I believe it is, because as I said above, the struct pci_dev associated
with a real PCIe device should not have a struct device_node pointer,
because this PCIe device has been dynamically enumerated and is
therefore not part of the device tree.

> with my implementation (which uses an implementation of
> pcibios_get_phb_of_node) - I'd like to believe this is correct behaviour:
> 
> - of_irq_map_pci is called for an endpoint (5:0:0), its pdev->dev.of_node
>   is NULL. As I don't have a representation of this endpoint in my DT the
>   of_irq_map_pci code proceeds to walk the fabric.

I believe this should be the behavior. But this is not what happens: the
pdev->dev.of_node of an endpoint pci_dev is not NULL.

> - Starting with the pdev for 5:0:0, of_irq_map_pci sees it has a parent
>   (downstream switch bridge); in this case pdev(5:0:0)->bus->self->dev.of_node
>   is NULL, so swizzling occurs and we walk up to the parent's bus (bus 2)
> - This continues to bus 1 and then bus 0, where of_irq_map_pci realises that
>   it has no parent (it's the host bridge). Due to the implementation of
>   pcibios_get_phb_of_node an of_node is produced, which is then used to
>   construct an lspec (for bus number 0).
> 
> The of_irq_map_pci code stops walking as soon as it finds a function in the
> tree that has a device node. This suggests that if you represent a bridge
> you must also include an interrupt-map. The problem here is that you have
> represented a bridge but not included a map.

I understood that it walks up the PCI hierarchy, and that's fine. As
I've shown in my previous e-mail, the only problem is that this
pdev->dev.of_node should be NULL for the PCIe endpoint device. If it
were NULL, then everything would work correctly, as I was able to
confirm with the hack I did in of_irq_map_pci().

> I can think of three solutions:
> 
> 1. Something like this:
> 
>                  } else {
>                          /* We found a P2P bridge, check if it has a node */
>                          ppnode = pci_device_to_OF_node(ppdev);
>  +                       if (ppnode doesnt have an interrupt-map)//if (pdev->bus->number != 0)
>  +                               ppnode = NULL;
>                  }
> 
> 2. Remove the bridges from the DT? Or remove the map from pcie-controller and
>    add a map each to pcie@0,1 and pcie@1,1?
> 
> 3. Change the mask of your map so that it doesn't care about bus numbers. I
>    have a map that looks like this:
> 
>                 interrupt-map-mask = <0 0 0 7>;
>                 interrupt-map = <0 0 0 1 ... //anything coming in on INTA
>                                  0 0 0 2 ... //anything coming in on INTB
>                                  0 0 0 3 ... ...
>                                  0 0 0 4 ... ...

Unfortunately, I don't quite agree with any of your three solutions. I
still do believe the root problem is that pdev->dev.of_node should be
NULL for the PCIe endpoints, since those devices are not probed with
the Device Tree.

Best regards,

Thomas
Thomas Petazzoni Feb. 7, 2013, 4:40 p.m. UTC | #60
Dear Thomas Petazzoni,

On Thu, 7 Feb 2013 17:19:04 +0100, Thomas Petazzoni wrote:

> Unfortunately, I don't quite agree with any of your three solutions. I
> still do believe the root problem is that pdev->dev.of_node should be
> NULL for the PCIe endpoints, since those devices are not probed with
> the Device Tree.

Looking more at this, the pdev->dev.of_node is in fact NULL, but the
code uses the of_node of the parent PCI device.

So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0
are seen as corresponding to the PCI-to-PCI bridges.

Thanks,

Thomas
Andrew Murray Feb. 7, 2013, 4:53 p.m. UTC | #61
On Thu, Feb 07, 2013 at 04:40:40PM +0000, Thomas Petazzoni wrote:
> Dear Thomas Petazzoni,
> 
> On Thu, 7 Feb 2013 17:19:04 +0100, Thomas Petazzoni wrote:
> 
> > Unfortunately, I don't quite agree with any of your three solutions. I
> > still do believe the root problem is that pdev->dev.of_node should be
> > NULL for the PCIe endpoints, since those devices are not probed with
> > the Device Tree.
> 
> Looking more at this, the pdev->dev.of_node is in fact NULL, but the
> code uses the of_node of the parent PCI device.
> 
> So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0
> are seen as corresponding to the PCI-to-PCI bridges.

I would suggest changing the interrupt-map-mask to match any bus number. (Don't
forget that the secondary bus number of each of your emulated bridges will
vary depending on how many devices are detected underneath each root port,
assuming you don't try and partition bus numbers or use domains between ports).

Andrew Murray

> 
> Thanks,
> 
> Thomas
> -- 
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
> 

Thomas Petazzoni Feb. 7, 2013, 5:14 p.m. UTC | #62
Dear Andrew Murray,

On Thu, 7 Feb 2013 16:53:47 +0000, Andrew Murray wrote:

> > So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0
> > are seen as corresponding to the PCI-to-PCI bridges.
> 
> I would suggest changing the interrupt-map-mask to match any bus number. (Don't
> forget that the secondary bus number of each of your emulated bridges will
> vary depending on how many devices are detected underneath each root port,
> assuming you don't try and partition bus numbers or use domains between ports).

I don't think this would work. Currently, the interrupt-map associates
the interrupts with the PCI-to-PCI bridges, i.e. devices 00:01, 00:02,
00:03, 00:04, 00:05, etc.

The real PCIe devices themselves are at 01:00, 02:00, 03:00, 04:00,
05:00. Each of them sits on a different bus, at devfn = 0.

So if I ignore the bus number, how could the PCI code find what is the
matching interrupt?

Thanks,

Thomas
Andrew Murray Feb. 7, 2013, 5:29 p.m. UTC | #63
On Thu, Feb 07, 2013 at 05:14:18PM +0000, Thomas Petazzoni wrote:
> Dear Andrew Murray,
> 
> On Thu, 7 Feb 2013 16:53:47 +0000, Andrew Murray wrote:
> 
> > > So in fact the problem is indeed that the subnodes pcie0,0 and pcie1,0
> > > are seen as corresponding to the PCI-to-PCI bridges.
> > 
> > I would suggest changing the interrupt-map-mask to match any bus number. (Don't
> > forget that the secondary bus number of each of your emulated bridges will
> > vary depending on how many devices are detected underneath each root port,
> > assuming you don't try and partition bus numbers or use domains between ports).
> 
> I don't think this would work. Currently, the interrupt-map associates
> the interrupts with the PCI-to-PCI bridges, i.e. devices 00:01, 00:02,
> 00:03, 00:04, 00:05, etc.
> 
> The real PCIe devices themselves are at 01:00, 02:00, 03:00, 04:00,
> 05:00. Each of them sits on a different bus, at devfn = 0.
> 
> So if I ignore the bus number, how could the PCI code find what is the
> matching interrupt?

Apologies if I've missed information about your hardware in the other
discussion (I've tried to keep up) - does your hardware raise a single host
interrupt for each pin regardless of which bridge it comes in on - or do you
have separate A,B,C,D host interrupts for each bridge?

If you have only 4 interrupt sources for legacy interrupts then you shouldn't
need to care which bus/device/function they were generated on (of_irq_map_pci
takes care of this for you).

During enumeration an interrupt number should be assigned to each requesting
device which reflects the pin (after swizzling) which will arrive at the host
bridges. That interrupt number should be shared across all devices that
requested the same pin (after swizzling) - i.e. shared interrupts. So all you
need to do is map the A,B,C,D pins to the interrupts with which they arrive
at the CPU.

Andrew Murray
> 
> Thanks,
> 
> Thomas
> -- 
> Thomas Petazzoni, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
> 

Thomas Petazzoni Feb. 7, 2013, 5:37 p.m. UTC | #64
Dear Andrew Murray,

On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote:

> > So if I ignore the bus number, how could the PCI code find what is the
> > matching interrupt?
> 
> Apologies if I've missed information about your hardware in the other
> discussion (I've tried to keep up) - does your hardware raise a single host
> interrupt for each pin regardless of which bridge it comes in on - or do you
> have separate A,B,C,D host interrupts for each bridge?

There are separate A,B,C,D interrupts for each PCIe interface, and each
PCIe interface is represented by an emulated PCI-to-PCI bridge. See my
interrupt-map:

			interrupt-map = <0x0800 0 0 1 &mpic 58
				         0x1000 0 0 1 &mpic 59
					 0x1800 0 0 1 &mpic 60
					 0x2000 0 0 1 &mpic 61
					 0x2800 0 0 1 &mpic 62
				         0x3000 0 0 1 &mpic 63
					 0x3800 0 0 1 &mpic 64
					 0x4000 0 0 1 &mpic 65
					 0x4800 0 0 1 &mpic 99
					 0x5000 0 0 1 &mpic 103>;

Here I have 10 PCIe interfaces, and therefore 10 interrupts.

There is only one interrupt per PCIe interface, and for now, I don't
distinguish A,B,C,D (I will do it later; it requires reading a register
to know if the interrupt came from A, B, C or D, but that's a different
problem).

> If you have only 4 interrupt sources for legacy interrupts then you shouldn't
> need to care which bus/device/function they were generated on (of_irq_map_pci
> takes care of this for you).

No, I have an interrupt per PCIe interface, so I really need to take care
of the relation between the PCIe device and the PCIe interface it is
connected to.

Thomas
Bjorn Helgaas Feb. 7, 2013, 6:08 p.m. UTC | #65
On Thu, Feb 7, 2013 at 9:00 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Bjorn Helgaas,
>
> On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote:
>
>> Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y?
>> That should have more information about the enumeration process,
>> including what we think the XHCI BARs are and the apertures leading to
>> them.
>
> Sure, see below.
>
>> The PCI core assumes that we know the host bridge apertures up front,
>> and I'm not sure that is true on your platform, so maybe we'll need
>> some changes to accommodate that.
>
> In this hardware, we need to set up the address decoding windows. So
> there shouldn't be any access to a PCI device memory or I/O region
> until the addresses have been assigned in the PCI-to-PCI bridge.

I think this is the path where the crash happens (this is the same as
the backtrace you included below):

    mvebu_pcie_scan_bus
      pci_scan_root_bus
        pci_create_root_bus
        pci_scan_child_bus
        pci_bus_add_devices
          pci_bus_add_device
            pci_fixup_device(pci_fixup_final)
              quirk_usb_early_handoff           # pci_fixup_final
                quirk_usb_handoff_xhci

The problem is that we haven't assigned resources anywhere.  Normally
this is done by pci_bus_assign_resources() or
pci_assign_unassigned_bus_resources(), but I don't think there's
anything in the path above that does this.

This is not really a problem in your code; it's a generic PCI core
problem.  pci_scan_root_bus() does everything including creating the
root bus, scanning it, and adding the devices we find.  At the point
where we add a device (pci_bus_add_device()), it should be ready for a
driver to claim it -- all resource assignment should already be done.

I don't think it's completely trivial to fix this in the PCI core yet
(but we're moving in that direction) because we have some boot-time
ordering issues, e.g., x86 scans the root buses before we know about
the address space consumed by ACPI devices, so we can't just assign
the resources when we scan the bus.

I think the way you'll have to fix this in the meantime is to use
pci_create_root_bus() directly so you can do things in this sequence:

  pci_create_root_bus
  pci_scan_child_bus
  pci_bus_assign_resources
  pci_bus_add_devices

dino_probe() is an example of similar code that does this.
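
A bare-bones sketch of that sequence (hypothetical code - it reuses
mvebu_pcie_ops from the patch and omits error handling):

static void mvebu_pcie_scan_and_add(struct mvebu_pcie *pcie,
				    struct pci_sys_data *sys)
{
	struct pci_bus *bus;

	bus = pci_create_root_bus(&pcie->pdev->dev, 0, &mvebu_pcie_ops,
				  sys, &sys->resources);
	if (!bus)
		return;

	pci_scan_child_bus(bus);	/* enumerate devices and bridges */
	pci_bus_assign_resources(bus);	/* assign BARs and bridge windows */
	pci_bus_add_devices(bus);	/* only now run fixups/bind drivers */
}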

Bjorn
Jason Gunthorpe Feb. 7, 2013, 6:15 p.m. UTC | #66
On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote:
 
> I think the way you'll have to fix this in the meantime is to use
> pci_create_root_bus() directly so you can do things in this sequence:
> 
>   pci_create_root_bus
>   pci_scan_child_bus
>   pci_bus_assign_resources
>   pci_bus_add_devices
> 
> dino_probe() is an example of similar code that does this.

Erm, but isn't this problem going to strike any system that doesn't
have resources assigned by the firmware?

If so, a common 'pci_scan_bus_from_scratch' would at least highlight
the problem and centralize the workaround until it is fixed...

Many embedded systems will require this..

Regards,
Jason
Jason Gunthorpe Feb. 7, 2013, 6:21 p.m. UTC | #67
On Thu, Feb 07, 2013 at 06:37:43PM +0100, Thomas Petazzoni wrote:
> Dear Andrew Murray,
> 
> On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote:
> 
> > > So if I ignore the bus number, how could the PCI code find the
> > > matching interrupt?
> > 
> > Apologies if I've missed information about your hardware in the other
> > discussion (I've tried to keep up) - does your hardware raise a single host
> > interrupt for each pin regardless of which bridge it comes in on - or do you
> > have separate A,B,C,D host interrupts for each bridge?
> 
> There are separate A,B,C,D interrupts for each PCIe interface, and each
> PCIe interface is represented by an emulated PCI-to-PCI bridge. See my
> interrupt-map:
> 
> 			interrupt-map = <0x0800 0 0 1 &mpic 58
> 				         0x1000 0 0 1 &mpic 59
> 					 0x1800 0 0 1 &mpic 60
> 					 0x2000 0 0 1 &mpic 61
> 					 0x2800 0 0 1 &mpic 62
> 				         0x3000 0 0 1 &mpic 63
> 					 0x3800 0 0 1 &mpic 64
> 					 0x4000 0 0 1 &mpic 65
> 					 0x4800 0 0 1 &mpic 99
> 					 0x5000 0 0 1 &mpic 103>;
> 
> Here I have 10 PCIe interfaces, and therefore 10 interrupts.
>
> There is only one interrupt per PCIe interface, and for now, I don't
> distinguish A,B,C,D (I will do it later, it requires reading a register
> to know if the interrupt came from A, B, C or D, but that's a different
> problem).

Right, someday you can have all 40 interrupts :)

This interrupt-map stuff is ugly; it looks like it was designed as a
way for the firmware to communicate the per-device interrupt
assignment to the OS - using it to describe a HW setup is a bit more
tortured for sure..

Frankly, I think it should not be used in PCI-E drivers *at all*
because there is no need for an external interrupt routing
description. [1]

PCI-E *only* has inband interrupt delivery, so the host driver has
100% of the information it needs to convert an INTx signal received on
a link to a Linux interrupt number.

So in a PCI-E world translation should look like this:
 - Start at the source device INTx
 - Traverse up bridges performing INTx swizzling according to the
   PCI-E spec
 - When the host bridge is reached call into the host driver and pass
   - The INTx number
   - The bus/device/function of last bridge traversed
 - The host bridge driver returns a Linux interrupt number.

This *always* works for (compliant) PCI-E. An OF interrupt map is
never, ever, needed. Providing a common mechanism can remove this
complex OF stuff and ultimately help everyone :)
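
In kernel terms that walk could look something like the sketch below;
host_bridge_irq() is a hypothetical per-host-driver hook, not an
existing API:

static int pcie_intx_to_irq(struct pci_dev *dev, u8 pin)
{
	while (!pci_is_root_bus(dev->bus)) {
		pin = pci_swizzle_interrupt_pin(dev, pin);
		dev = dev->bus->self;	/* step up to the upstream bridge */
	}
	/* dev is now the root port; pin is the INTx seen by the host */
	return host_bridge_irq(dev, pin);
}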

However, if interrupt-map is used, I strongly suspect it should be
placed on each bridge node (as Andrew mentioned), and it should be
very simple:

link@0 {
	reg = <0x800 0 0  0 0>; // Bus 0, Dev 0x10, Fn 0
	interrupt-map-mask = <0x0 0 0 7>;
	interrupt-map = <0x0000 0 0 1 &mpic 58   // INTA
	                 0x0000 0 0 2 &mpic 58   // INTB
	                 0x0000 0 0 3 &mpic 58   // INTC
	                 0x0000 0 0 4 &mpic 58>; // INTD
}

Which hopefully says 'when you reach this bridge node, look only at
the INTx pin number and convert it'.

I don't think you can make interrupt-map work where you have placed
it..

Jason
1 - For PCI/PCI-X the physical INTx interrupt pins on any device,
    anywhere in the system could be physically tied to any interrupt
    ingress in the system. Even if they are behind bridges. It is a
    total free-for-all.
Bjorn Helgaas Feb. 7, 2013, 6:30 p.m. UTC | #68
On Thu, Feb 7, 2013 at 11:15 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote:
>
>> I think the way you'll have to fix this in the meantime is to use
>> pci_create_root_bus() directly so you can do things in this sequence:
>>
>>   pci_create_root_bus
>>   pci_scan_child_bus
>>   pci_bus_assign_resources
>>   pci_bus_add_devices
>>
>> dino_probe() is an example of similar code that does this.
>
> Erm, but isn't this problem going to strike any system that doesn't
> have resources assigned by the firmware?
>
> If so, a common 'pci_scan_bus_from_scratch' would at least highlight
> the problem and centralize the work around until it is fixed...
>
> Many embedded systems will require this..

Yep, agreed on all counts.
Andrew Murray Feb. 7, 2013, 6:30 p.m. UTC | #69
On 7 February 2013 17:37, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:
> Dear Andrew Murray,
>
> On Thu, 7 Feb 2013 17:29:34 +0000, Andrew Murray wrote:
>
>> > So if I ignore the bus number, how could the PCI code find the
>> > matching interrupt?
>>
>> Apologies if I've missed information about your hardware in the other
>> discussion (I've tried to keep up) - does your hardware raise a single host
>> interrupt for each pin regardless of which bridge it comes in on - or do you
>> have separate A,B,C,D host interrupts for each bridge?
>
> There are separate A,B,C,D interrupts for each PCIe interface, and each
> PCIe interface is represented by an emulated PCI-to-PCI bridge. See my
> interrupt-map:
>
>                         interrupt-map = <0x0800 0 0 1 &mpic 58
>                                          0x1000 0 0 1 &mpic 59
>                                          0x1800 0 0 1 &mpic 60
>                                          0x2000 0 0 1 &mpic 61
>                                          0x2800 0 0 1 &mpic 62
>                                          0x3000 0 0 1 &mpic 63
>                                          0x3800 0 0 1 &mpic 64
>                                          0x4000 0 0 1 &mpic 65
>                                          0x4800 0 0 1 &mpic 99
>                                          0x5000 0 0 1 &mpic 103>;
>
> Here I have 10 PCIe interfaces, and therefore 10 interrupts.
>
> There is only one interrupt per PCIe interface, and for now, I don't
> distinguish A,B,C,D (I will do it later, it requires reading a register
> to know if the interrupt came from A, B, C or D, but that's a different
> problem).
>
>> If you have only 4 interrupt sources for legacy interrupts then you shouldn't
>> need to care which bus/device/function they were generated on (of_pci_map_irq
>> takes care of this for you).
>
> No, I have interrupts per PCIe interface, so I really need to take care
> of the relation between the PCIe device and the PCIe interface it is
> connected to.

In that case, I think you can create a mask that only checks for the
device number and INT pin (i.e. ignore bus and function). Looking at
your mask - it already does this...

                         interrupt-map = <0x0800 0 0 1 &mpic 58
                                          0x1000 0 0 1 &mpic 59
                                          0x1800 0 0 1 &mpic 60
                                          0x2000 0 0 1 &mpic 61
                                          0x2800 0 0 1 &mpic 62
                                          0x3000 0 0 1 &mpic 63
                                          0x3800 0 0 1 &mpic 64
                                          0x4000 0 0 1 &mpic 65
                                          0x4800 0 0 1 &mpic 99
                                          0x5000 0 0 1 &mpic 103>;

I'm not sure if the device and pin part of the map is correct (I always
forget how this works) - but I know this would definitely work:

                         interrupt-map-mask = <0xf800 0 0 7>
                         interrupt-map = <0x0800 0 0 1 &mpic 58
                                          0x0800 0 0 2 &mpic 58
                                          0x0800 0 0 3 &mpic 58
                                          0x0800 0 0 4 &mpic 58
                                          0x1000 0 0 1 &mpic 59
                                          0x1000 0 0 2 &mpic 59
                                          0x1000 0 0 3 &mpic 59
                                          0x1000 0 0 4 &mpic 59
                                          ....

In any case, I've realized that my original suggestion of changing the
map won't quite do (apologies). This is because the OF code won't even
look at this map, as it stops at the emulated bridge below. In addition
to this type of mapping - you'll also need to investigate my solutions
1 and 2.

Andrew Murray

>
> Thomas
Thierry Reding Feb. 7, 2013, 6:43 p.m. UTC | #70
On Thu, Feb 07, 2013 at 11:08:45AM -0700, Bjorn Helgaas wrote:
> On Thu, Feb 7, 2013 at 9:00 AM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
> > Dear Bjorn Helgaas,
> >
> > On Thu, 7 Feb 2013 08:46:06 -0700, Bjorn Helgaas wrote:
> >
> >> Can you post the entire dmesg log, ideally with CONFIG_PCI_DEBUG=y?
> >> That should have more information about the enumeration process,
> >> including what we think the XHCI BARs are and the apertures leading to
> >> them.
> >
> > Sure, see below.
> >
> >> The PCI core assumes that we know the host bridge apertures up front,
> >> and I'm not sure that is true on your platform, so maybe we'll need
> >> some changes to accommodate that.
> >
> > In this hardware, we need to set up the address decoding windows. So
> > there shouldn't be any access to a PCI device memory or I/O region
> > until the addresses have been assigned in the PCI-to-PCI bridge.
> 
> I think this is the path where the crash happens (this is the same as
> the backtrace you included below):
> 
>     mvebu_pcie_scan_bus
>       pci_scan_root_bus
>         pci_create_root_bus
>         pci_scan_child_bus
>         pci_bus_add_devices
>           pci_bus_add_device
>             pci_fixup_device(pci_fixup_final)
>               quirk_usb_early_handoff           # pci_fixup_final
>                 quirk_usb_handoff_xhci
> 
> The problem is that we haven't assigned resources anywhere.  Normally
> this is done by pci_bus_assign_resources() or
> pci_assign_unassigned_bus_resources(), but I don't think there's
> anything in the path above that does this.
> 
> This is not really a problem in your code; it's a generic PCI core
> problem.  pci_scan_root_bus() does everything including creating the
> root bus, scanning it, and adding the devices we find.  At the point
> where we add a device (pci_bus_add_device()), it should be ready for a
> driver to claim it -- all resource assignment should already be done.
> 
> I don't think it's completely trivial to fix this in the PCI core yet
> (but we're moving in that direction) because we have some boot-time
> ordering issues, e.g., x86 scans the root buses before we know about
> the address space consumed by ACPI devices, so we can't just assign
> the resources when we scan the bus.
> 
> I think the way you'll have to fix this in the meantime is to use
> pci_create_root_bus() directly so you can do things in this sequence:
> 
>   pci_create_root_bus
>   pci_scan_child_bus
>   pci_bus_assign_resources
>   pci_bus_add_devices

The last two are already done by ARM's pci_common_init(). On Tegra I've
got by with a custom .scan_bus() implementation that does the first
two. Back when I used pci_scan_root_bus() we were seeing some issues
with resource conflicts and such, and there was a similar discussion at
that time.

Thomas, have you tried using the same .scan_bus() that I use on Tegra?
It should be easy to port it to Marvell; the only Tegra-specific bit is
how to get at the struct tegra_pcie.
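
Ported to this driver it would look roughly like the following
untested sketch (sys_to_pcie() and mvebu_pcie_ops come from the patch):

static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys)
{
	struct mvebu_pcie *pcie = sys_to_pcie(sys);
	struct pci_bus *bus;

	bus = pci_create_root_bus(&pcie->pdev->dev, sys->busnr,
				  &mvebu_pcie_ops, sys, &sys->resources);
	if (bus)
		pci_scan_child_bus(bus);

	/* pci_common_init() then assigns resources and adds the devices */
	return bus;
}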

Thierry
Arnd Bergmann Feb. 7, 2013, 11:25 p.m. UTC | #71
On Thursday 07 February 2013, Jason Gunthorpe wrote:
> On Thu, Feb 07, 2013 at 06:37:43PM +0100, Thomas Petazzoni wrote:

> Right, someday you can have all 40 interrupts :)
> 
> This interrupt-map stuff is ugly; it looks like it was designed as a
> way for the firmware to communicate the per-device interrupt
> assignment to the OS - using it to describe a HW setup is a bit more
> tortured for sure..

It is usually used for PCI (non-PCIe) systems, where you can have
multiple slots on the same bus and each slot has its own four
interrupt lines routed to an upstream interrupt controller, not
necessarily the same one for each slot.

> Frankly, I think it should not be used in PCI-E drivers *at all*
> because there is no need for an external interrupt routing
> description. [1]
> 
> PCI-E *only* has inband interrupt delivery, so the host driver has
> 100% of the information it needs to convert an INTx signal received on
> a link to a Linux interrupt number.
> 
> So in a PCI-E world translation should look like this:
>  - Start at the source device INTx
>  - Traverse up bridges performing INTx swizzling according to the
>    PCI-E spec
>  - When the host bridge is reached call into the host driver and pass
>    - The INTx number
>    - The bus/device/function of last bridge traversed
>  - The host bridge driver returns a Linux interrupt number.

It's the same as on PCI, except that there you stop when you reach
the slot that is physically connected to the interrupt controller.

> This *always* works for (compliant) PCI-E. An OF interrupt map is
> never, ever, needed. Providing a common mechanism can remove this
> complex OF stuff and ultimately help everyone :)
> 
> However, if interrupt-map is used, I strongly suspect it should be
> placed on each bridge node (as Andrew mentioned), and it should be
> very simple:
> 
> link@0 {
> reg = <0x800 0 0  0 0>; // Bus 0, Dev 0x10, Fn 0
> interrupt-map-mask = <0x0 0 0 7>;
> interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA
>                  0x0000 0 0 2 &mpic 58 // INTB
>                  0x0000 0 0 3 &mpic 58 // INTC
>                  0x0000 0 0 4 &mpic 58>; // INTD
> }

The interrupt-map property only makes sense for the host bridge,
not for bridges below it, which don't normally get represented
in the device tree.

> Jason
> 1 - For PCI/PCI-X the physical INTx interrupt pins on any device,
>     anywhere in the system could by physically tied to any interrupt
>     ingress in the system. Even if they are behind bridges. It is a
>     total free for all.

I would not be surprised to see systems that have a PCIe to
PCI bridge and slots behind that which are directly connected
to an upstream interrupt controller. In fact, I believe I have
worked on a system like that (IBM QS21 and QS22), and thankfully
the interrupt-map property did let us model this correctly.

	Arnd
Arnd Bergmann Feb. 7, 2013, 11:27 p.m. UTC | #72
On Thursday 07 February 2013, Andrew Murray wrote:
> I'm not sure if the device pin part of the map is correct (I always
> forget how this works) - but I know this would definately work:
> 
>                          interrupt-map-mask = <0xf800 0 0 7>
>                          interrupt-map = <0x0800 0 0 1 &mpic 58
>                                           0x0800 0 0 2 &mpic 58
>                                           0x0800 0 0 3 &mpic 58
>                                           0x0800 0 0 4 &mpic 58
>                                           0x1000 0 0 1 &mpic 59
>                                           0x1000 0 0 2 &mpic 59
>                                           0x1000 0 0 3 &mpic 59
>                                           0x1000 0 0 4 &mpic 59
>                                           ....

If all the pins are routed to the same interrupt, you can leave
the pin out of the map-mask and save a bunch of lines in the
interrupt-map, at least  by the spec. I have not looked at the
source code to see if that is how Linux implements the lookup,
but that can be fixed if necessary.

	Arnd
Jason Gunthorpe Feb. 8, 2013, 12:44 a.m. UTC | #73
On Thu, Feb 07, 2013 at 11:25:23PM +0000, Arnd Bergmann wrote:
> > link@0 {
> > reg = <0x800 0 0  0 0>; // Bus 0, Dev 0x10, Fn 0
> > > interrupt-map-mask = <0x0 0 0 7>;
> > interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA
> >                  0x0000 0 0 2 &mpic 58 // INTB
> >                  0x0000 0 0 3 &mpic 58 // INTC
> >                  0x0000 0 0 4 &mpic 58>; // INTD
> > }
> 
> The interrupt-map property only makes sense for the host bridge,
> not for bridges below it, which don't normally get represented
> in the device tree.

Linux scans up the PCI bus until it finds a PCI device with a matching
OF node. It then constructs an interrupt map 'laddr' (i.e. the
bus:dev.fn) for the child device of this OF node.

If you don't have any DT PCI nodes then this should always fold down
to doing a lookup with bus=0, and device representing the 'slot' in
legacy PCI.

However, as soon as you provide a node for a bridge in DT this halts
the 'fold down' and goes to the interrupt-map with a device on the
subordinate bus number of the bridge.

This makes *lots* of sense, if you have bridges providing bus slots
then you include the bridge in DT to stop the 'fold down' at that
known bridge, giving you a chance to see the interrupt wiring behind
the bridge.

This matches the design of PCI - if you know how interrupts are hooked
up then use that information, otherwise assume the INTx interrupts
swizzle and search upward. This is how add-in cards with PCI bridges
are supported.

This behavior seems complex, but sane to me. I wouldn't change it as
Andrew suggested.

Thomas's problem is the presence of the static DT node for the root
port bridge. Since the node is static you can't know what the
runtime-determined subordinate bus numbers will be, so there is no possible
way to write an interrupt-map at the host bridge.

Putting the map in the bridge's DT node seems elegant and correct to
me - the map is describing the actual hardware - the root port bridge
is actually terminating INTx from downstream devices and converting
them to CPU interrupts. (FWIW discrete HT to PCIe bridges do something
similar)

If you imagine the case you alluded to, a PCI-E root port, connected
to a PCI-E to PCI bridge, with 2 physical PCI bus slots. The
interrupts for the 2 slots are routed to the CPU directly:

link@0 {
 reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge

  // Match on INTx (not used since the pci-bridge doesn't create inband INTx)
  interrupt-map-mask = <0x0 0 0 7>;
  interrupt-map = <0x0000 0 0 1 &pic 0  // Inband INTA
                   0x0000 0 0 2 &pic 1  // Inband INTB
		   ..
 pci_bridge@0 {
    reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge
  
    // Match on the device/slot and INTx pin
    interrupt-map-mask = <0x7f 0 0 7>;
    interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA
                     0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA
                     ..
 }
}

To me, this seems to be a much more accurate description of how the
hardware is constructed than trying to cram all this information into
the host bridge's interrupt map. It shows clearly where inband INTA
messages arriving at the root port are directed, as well as where the
slot-by-slot out-of-band interrupt wires on the PCI bus are directed.

Jason
Arnd Bergmann Feb. 9, 2013, 10:23 p.m. UTC | #74
On Friday 08 February 2013, Jason Gunthorpe wrote:
> On Thu, Feb 07, 2013 at 11:25:23PM +0000, Arnd Bergmann wrote:
> > > link@0 {
> > > reg = <0x800 0 0  0 0>; // Bus 0, Dev 0x10, Fn 0
> > > interrupt-map-mask = <0x0 0 0 7>;
> > > interrupt-map = <0x0000 0 0 1 &mpic 58 // INTA
> > >                  0x0000 0 0 2 &mpic 58 // INTB
> > >                  0x0000 0 0 3 &mpic 58 // INTC
> > >                  0x0000 0 0 4 &mpic 58>; // INTD
> > > }
> > 
> > The interrupt-map property only makes sense for the host bridge,
> > not for bridges below it, which don't normally get represented
> > in the device tree.
> 
> Linux scans up the PCI bus until it finds a PCI device with a matching
> OF node. It then constructs an interrupt map 'laddr' (i.e. the
> bus:dev.fn) for the child device of this OF node.
> 
> If you don't have any DT PCI nodes then this should always fold down
> to doing a lookup with bus=0, and device representing the 'slot' in
> legacy PCI.
> 
> However, as soon as you provide a node for a bridge in DT this halts
> the 'fold down' and goes to the interrupt-map with a device on the
> subordinate bus number of the bridge.
> 
> This makes *lots* of sense, if you have bridges providing bus slots
> then you include the bridge in DT to stop the 'fold down' at that
> known bridge, giving you a chance to see the interrupt wiring behind
> the bridge.

I would argue that it matters not so much what Linux does but what
the standard says, but it seems they both agree with you in this
case: http://www.openfirmware.org/1275/practice/imap/imap0_9d.pdf
defines that "At any level in the interrupt tree, a mapping may
need to take place between the child interrupt domain and the
parent’s. This is represented by a new property called 'interrupt-map'".

> This matches the design of PCI - if you know how interrupts are hooked
> up then use that information, otherwise assume the INTx interrupts
> swizzle and search upward. This is how add-in cards with PCI bridges
> are supported.

Note that the implicit swizzling was not part of the original PCI
binding, which assumed that all devices were explicitly represented
in the device tree, and we don't normally do that any more because
PCI can be probed easily, and we cannot assume that all PCI BARs
have been correctly assigned by the firmware before the OS
boots. Having the interrupt-map at the PCI host controller
node is convenient because it lets us define unit interrupt
specifiers for devices that are not represented in the device
tree themselves.

I think the key question here is whether there is just one interrupt
domain across all bridges because the hardware requires the unit
address to be unique, or whether each PCIe port has its own
unit address space, and thereby its own interrupt domain that requires
its own interrupt-map.

If there is just one domain, we have the choice whether to have
one interrupt-map for the entire domain, or to have one
interrupt map per PCIe port for the devices under that port.
I would consider it more logical to have a single interrupt-map
for the interrupt domain, because that is essentially what
lets us describe the interrupt domain as a whole.

Of course, if each port has its own domain, we have to have
a separate interrupt map for each one.

> Thomas's problem is the presence of the static DT node for the root
> port bridge. Since the node is static you can't know what the runtime
> determined subordinate bus numbers will be, so there is no possible
> way to write an interrupt-map at the host bridge.

Right, that is a problem if there are additional bridges. I guess
we could represent all devices on bus 0 easily because their
address would be fixed,  but can't uniquely identify anything
below them.

> If you imagine the case you alluded to, a PCI-E root port, connected
> to a PCI-E to PCI bridge, with 2 physical PCI bus slots. The
> interrupts for the 2 slots are routed to the CPU directly:
> 
> link@0 {
>  reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge
> 
>   // Match on INTx (not used since the pci-bridge doesn't create inband INTx)
> >   interrupt-map-mask = <0x0 0 0 7>;
>   interrupt-map = <0x0000 0 0 1 &pic 0  // Inband INTA
>                    0x0000 0 0 2 &pic 1  // Inband INTB

What are these two interrupts in the example then?

>  pci_bridge@0 {
>     reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge

The device would be "pci@10", right?

>     // Match on the device/slot and INTx pin
>     interrupt-mask = <0x7f 0 0 7>;
>     interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA
>                      0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA
>                      ..
>  }
> }

You are accidentally matching on the register number, not the
device number here, right? The interrupt-map-mask should be
<0xf800 0 0 7> to match the device.

> To me, this seems to be a much more accurate description of how the
> hardware is constructed then trying to cram all this information into
> the host bridge's interrupt map. It shows clearly where inband INTA
> messages arriving at the root port are directed as well as where the
> slot by slot out-of-band interrupt wires on the PCI bus are directed.

Yes, I guess you're right.

	Arnd
Jason Gunthorpe Feb. 12, 2013, 7:26 p.m. UTC | #75
On Sat, Feb 09, 2013 at 10:23:11PM +0000, Arnd Bergmann wrote:

> > This makes *lots* of sense, if you have bridges providing bus slots
> > then you include the bridge in DT to stop the 'fold down' at that
> > known bridge, giving you a chance to see the interrupt wiring behind
> > the bridge.
>
> I would argue that it matters not so much what Linux does but what
> the standard says, but it seems they both agree with you in this
> case: http://www.openfirmware.org/1275/practice/imap/imap0_9d.pdf
> defines that "At any level in the interrupt tree, a mapping may
> need to take place between the child interrupt domain and the
> parent's. This is represented by a new property called 'interrupt-map'".

Right, the standard (and Linux) allows the property in any node.

> > This matches the design of PCI - if you know how interrupts are hooked
> > up then use that information, otherwise assume the INTx interrupts
> > swizzle and search upward. This is how add-in cards with PCI bridges
> > are supported.
> 
> Note that the implicit swizzling was not part of the original PCI
> binding, which assumed that all devices were explicitly represented
> in the device tree, and we don't normally do that any more because

Yes, if the tree includes all PCI devices then you don't need
interrupt-map, just place an 'interrupts' property directly on the end
nodes.

However the implicit swizzling and 'fold down' are pretty much
essential to support the PCI standards for hot plug behind bridges.

> PCI can be probed easily, and we cannot assume that all PCI BARs
> have been correctly assigned by the firmware before the OS
> boots. Having the interrupt-map at the PCI host controller
> node is convenient because it lets us define unit interrupt
> specifiers for devices that are not represented in the device
> tree themselves.

Right, interrupt-map is actually only needed when the DT is
incomplete, or to support hot-pluggable ports.

> I think the key question here is whether there is just one interrupt
> domain across all bridges because the hardware requires the unit
> address to be unique, or whether each PCIe port has its own
> unit address space, and thereby interrupt domain that requires
> its own interrupt-map.

IMHO, the interrupt domains should describe the underlying hardware,
and in many cases the HW is designed so that every PCI bus bridge has
its own downstream interrupt layout - and thus is an interrupt domain.

> > If you imagine the case you alluded to, a PCI-E root port, connected
> > to a PCI-E to PCI bridge, with 2 physical PCI bus slots. The
> > interrupts for the 2 slots are routed to the CPU directly:
> > 
> > link@0 {
> >  reg = </* Bus 0, Dev 0x10, Fn 0 */>; // Root Port bridge
> > 
> >   // Match on INTx (not used since the pci-bridge doesn't create inband INTx)
> > >   interrupt-map-mask = <0x0 0 0 7>;
> >   interrupt-map = <0x0000 0 0 1 &pic 0  // Inband INTA
> >                    0x0000 0 0 2 &pic 1  // Inband INTB
> 
> What are these two interrupts in the example then?

This shows that the HW block 'link@0' - which is a PCI Express root
port bridge - accepts inband INTx messages and converts them to CPU
interrupts pic 0/1/...

Since this is a general function, and fully self-contained, it can be
placed in the general SoC's dtsi.

However, the board has a hard-wired PCIe to PCI bridge with PCI slots,
and never generates inband INTx. We can then describe that chip via
the following stanza in the board-specific dts:

> >  pci_bridge@0 {
> >     reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge
> 
> The device would be "pci@10", right?

Probably best to use the hex version of the regs value /* Bus 1, Dev
0x10, Fn 0 */, but nothing inspects that, right?

> >     // Match on the device/slot and INTx pin
> >     interrupt-map-mask = <0x7f 0 0 7>;
> >     interrupt-map = <0x00xx 0 0 1 &pic 2 // Slot 0 physical INTA
> >                      0x00xx 0 0 1 &pic 3 // Slot 1 physical INTA
> >                      ..
> >  }
> > }
> 
> You are accidentally matching on the register number, not the
> device number here, right? The interrupt-map-mask should be
> <0xf800 0 0 7> to match the device.

Right, only match the device, ignore the bus.

There is also another variant: if the PCIe to PCI bridge provides its
own interrupt pins and converts those to inband PCIe INTx messages,
then the PCB can wire up the PCI bus slots to the bridge's INTx pins
according to some pattern and describe that pattern in DT:

pci_bridge@0 {
   reg = </* Bus 1, Dev 0x10, Fn 0 */>; // PCIe to PCI bridge
   interrupt-map-mask = <0xf800 0 0 7>;
   interrupt-map = <0x00xx 0 0 1 &pci_bridge0 0 0 0 1 // Slot 0 physical INTA to inband INTA
                    0x00xx 0 0 1 &pci_bridge0 0 0 0 2 // Slot 1 physical INTA to inband INTB
		    ...
}

(minus errors, haven't tried this one, but the standard says it should
be OK)

Which would be processed as:
 - pci_bridge@0 converts out-of-band interrupts into in-band
   interrupts according to its interrupt-map, and then sends those
   upstream.
 - link@0 converts in-band interrupts into CPU interrupts according
   to its interrupt map.

In my experience the above is a common case.

Boot firmware could fold all this down to a single interrupt map, and
hide the programming of the IOAPIC/etc from the OS, but the HW is
still undertaking these transformations..

Anyhow, it sounds like Thomas has had success using this approach, so
it works.

Cheers,
Jason
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
new file mode 100644
index 0000000..9313e92
--- /dev/null
+++ b/Documentation/devicetree/bindings/pci/armada-370-xp-pcie.txt
@@ -0,0 +1,175 @@ 
+* Marvell Armada 370/XP PCIe interfaces
+
+Mandatory properties:
+- compatible: must be "marvell,armada-370-xp-pcie"
+- status: either "disabled" or "okay"
+- #address-cells, set to <3>
+- #size-cells, set to <2>
+- #interrupt-cells, set to <1>
+- bus-range: PCI bus numbers covered
+- ranges: standard PCI-style address ranges, describing the PCIe
+  registers for each PCIe interface, and then ranges for the PCI
+  memory and I/O regions.
+- interrupt-map-mask and interrupt-map are standard PCI Device Tree
+  properties to describe the interrupts associated with each PCIe
+  interface.
+
+In addition, the Device Tree node must have sub-nodes describing each
+PCIe interface, having the following mandatory properties:
+- reg: the address and size of the PCIe registers (translated
+  addresses according to the ranges property of the parent)
+- clocks: the clock associated to this PCIe interface
+- marvell,pcie-port: the physical PCIe port number
+- status: either "disabled" or "okay"
+
+and the following optional properties:
+- marvell,pcie-lane: the physical PCIe lane number, for ports having
+  multiple lanes. If this property is not found, we assume that the
+  value is 0.
+
+Example:
+
+pcie-controller {
+	compatible = "marvell,armada-370-xp-pcie";
+	status = "disabled";
+
+	#address-cells = <3>;
+	#size-cells = <2>;
+
+	bus-range = <0x00 0xff>;
+
+	ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
+	          0x00004800 0 0xd0042000 0xd0042000 0 0x00002000   /* port 2.0 registers */
+	          0x00001000 0 0xd0044000 0xd0044000 0 0x00002000   /* port 0.1 registers */
+	          0x00001800 0 0xd0048000 0xd0048000 0 0x00002000   /* port 0.2 registers */
+	          0x00002000 0 0xd004C000 0xd004C000 0 0x00002000   /* port 0.3 registers */
+		  0x00002800 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
+	          0x00005000 0 0xd0082000 0xd0082000 0 0x00002000   /* port 3.0 registers */
+		  0x00003000 0 0xd0084000 0xd0084000 0 0x00002000   /* port 1.1 registers */
+		  0x00003800 0 0xd0088000 0xd0088000 0 0x00002000   /* port 1.2 registers */
+		  0x00004000 0 0xd008C000 0xd008C000 0 0x00002000   /* port 1.3 registers */
+		  0x81000000 0 0	  0xc0000000 0 0x00100000   /* downstream I/O */
+		  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */
+
+	#interrupt-cells = <1>;
+	interrupt-map-mask = <0xf800 0 0 1>;
+	interrupt-map = <0x0800 0 0 1 &mpic 58
+		         0x1000 0 0 1 &mpic 59
+			 0x1800 0 0 1 &mpic 60
+			 0x2000 0 0 1 &mpic 61
+			 0x2800 0 0 1 &mpic 62
+		         0x3000 0 0 1 &mpic 63
+			 0x3800 0 0 1 &mpic 64
+			 0x4000 0 0 1 &mpic 65
+			 0x4800 0 0 1 &mpic 99
+			 0x5000 0 0 1 &mpic 103>;
+
+	pcie@0,0 {
+		device_type = "pciex";
+		reg = <0x0800 0 0xd0040000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 5>;
+		status = "disabled";
+	};
+
+	pcie@0,1 {
+		device_type = "pciex";
+		reg = <0x1000 0 0xd0044000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 6>;
+		status = "disabled";
+	};
+
+	pcie@0,2 {
+		device_type = "pciex";
+		reg = <0x1800 0 0xd0048000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 7>;
+		status = "disabled";
+	};
+
+	pcie@0,3 {
+		device_type = "pciex";
+		reg = <0x2000 0 0xd004C000 0 0xC000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 8>;
+		status = "disabled";
+	};
+
+	pcie@1,0 {
+		device_type = "pciex";
+		reg = <0x2800 0 0xd0080000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 9>;
+		status = "disabled";
+	};
+
+	pcie@1,1 {
+		device_type = "pciex";
+		reg = <0x3000 0 0xd0084000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 10>;
+		status = "disabled";
+	};
+
+	pcie@1,2 {
+		device_type = "pciex";
+		reg = <0x3800 0 0xd0088000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 11>;
+		status = "disabled";
+	};
+
+	pcie@1,3 {
+		device_type = "pciex";
+		reg = <0x4000 0 0xd008C000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 12>;
+		status = "disabled";
+	};
+	pcie@2,0 {
+		device_type = "pciex";
+		reg = <0x4800 0 0xd0042000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <2>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 26>;
+		status = "disabled";
+	};
+
+	pcie@3,0 {
+		device_type = "pciex";
+		reg = <0x5000 0 0xd0082000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <3>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 27>;
+		status = "disabled";
+	};
+};
diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig
index cc3a1af..03e15e7 100644
--- a/drivers/pci/host/Kconfig
+++ b/drivers/pci/host/Kconfig
@@ -1,4 +1,10 @@ 
 menu "PCI host controller drivers"
 	depends on PCI
 
+config PCI_MVEBU
+	bool "Marvell EBU PCIe controller"
+	depends on ARCH_MVEBU
+	select PCI_SW_HOST_BRIDGE
+	select PCI_SW_PCI_PCI_BRIDGE
+
 endmenu
diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
new file mode 100644
index 0000000..34d6057
--- /dev/null
+++ b/drivers/pci/host/Makefile
@@ -0,0 +1,4 @@ 
+obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
+ccflags-$(CONFIG_PCI_MVEBU) += \
+	-I$(srctree)/arch/arm/plat-orion/include \
+	-I$(srctree)/arch/arm/mach-mvebu/include
diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
new file mode 100644
index 0000000..4db09e1
--- /dev/null
+++ b/drivers/pci/host/pci-mvebu.c
@@ -0,0 +1,500 @@ 
+/*
+ * PCIe driver for Marvell Armada 370 and Armada XP SoCs
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/clk.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/platform_device.h>
+#include <linux/of_address.h>
+#include <linux/of_pci.h>
+#include <linux/of_irq.h>
+#include <linux/of_platform.h>
+#include <plat/pcie.h>
+#include <mach/addr-map.h>
+
+/*
+ * These are the product IDs used for the emulated PCI host bridge and
+ * emulated PCI-to-PCI bridges. They are temporary until we get
+ * official IDs assigned.
+ */
+#define MARVELL_EMULATED_HOST_BRIDGE_ID    4141
+#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 4242
+
+struct mvebu_pcie_port;
+
+/* Structure representing all PCIe interfaces */
+struct mvebu_pcie {
+	struct pci_sw_host_bridge bridge;
+	struct platform_device *pdev;
+	struct mvebu_pcie_port *ports;
+	struct resource io;
+	struct resource mem;
+	struct resource busn;
+	int nports;
+};
+
+/* Structure representing one PCIe interface */
+struct mvebu_pcie_port {
+	void __iomem *base;
+	spinlock_t conf_lock;
+	int haslink;
+	u32 port;
+	u32 lane;
+	int devfn;
+	struct clk *clk;
+	struct pci_sw_pci_bridge bridge;
+	struct device_node *dn;
+};
+
+static inline struct mvebu_pcie *sys_to_pcie(struct pci_sys_data *sys)
+{
+	return sys->private_data;
+}
+
+/* PCI configuration space write function */
+static int mvebu_pcie_wr_conf(struct pci_bus *bus, u32 devfn,
+			      int where, int size, u32 val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface, where the Linux
+		 * virtual bus number is equal to the hardware PCIe
+		 * interface number + 1
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int porti, ret;
+
+		porti = bus->number - 1;
+		if (porti >= pcie->nports)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		port = &pcie->ports[porti];
+
+		if (!port->haslink)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		if (PCI_SLOT(devfn) != 0)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_wr_conf_bus(port->base, bus->number - 1,
+					     PCI_DEVFN(1, PCI_FUNC(devfn)),
+					     where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Accessing the emulated PCIe devices. In the first
+		 * slot, the emulated host bridge, and in the next
+		 * slots, the PCI-to-PCI bridges that correspond to
+		 * each PCIe hardware interface
+		 */
+		if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
+			return pci_sw_host_bridge_write(&pcie->bridge, where,
+							size, val);
+		else if (PCI_SLOT(devfn) >= 1 &&
+			 PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return pci_sw_pci_bridge_write(&port->bridge, where,
+						       size, val);
+		} else {
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/* PCI configuration space read function */
+static int mvebu_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where,
+			      int size, u32 *val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface, where the Linux
+		 * virtual bus number is equal to the hardware PCIe
+		 * interface number + 1
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int porti, ret;
+
+		porti = bus->number - 1;
+		if (porti >= pcie->nports) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		port = &pcie->ports[porti];
+
+		if (!port->haslink || PCI_SLOT(devfn) != 0) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_rd_conf_bus(port->base, bus->number - 1,
+					     PCI_DEVFN(1, PCI_FUNC(devfn)),
+					     where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Accessing the emulated PCIe devices. In the first
+		 * slot, the emulated host bridge, and in the next
+		 * slots, the PCI-to-PCI bridges that correspond to
+		 * each PCIe hardware interface
+		 */
+		if (PCI_SLOT(devfn) == 0 && PCI_FUNC(devfn) == 0)
+			return pci_sw_host_bridge_read(&pcie->bridge, where,
+						       size, val);
+		else if (PCI_SLOT(devfn) >= 1 &&
+			 PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return pci_sw_pci_bridge_read(&port->bridge, where,
+						      size, val);
+		} else {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+}
+
+static struct pci_ops mvebu_pcie_ops = {
+	.read = mvebu_pcie_rd_conf,
+	.write = mvebu_pcie_wr_conf,
+};
+
+static int __init mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(sys);
+	int i;
+
+	pci_add_resource_offset(&sys->resources, &pcie->io, sys->io_offset);
+	pci_add_resource_offset(&sys->resources, &pcie->mem, sys->mem_offset);
+	pci_add_resource(&sys->resources, &pcie->busn);
+
+	pci_ioremap_io(nr * SZ_64K, pcie->io.start);
+
+	for (i = 0; i < pcie->nports; i++) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+		orion_pcie_set_local_bus_nr(port->base, i);
+		orion_pcie_setup(port->base);
+	}
+
+	return 1;
+}
+
+static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(dev->bus->sysdata);
+	struct mvebu_pcie_port *port;
+	struct of_irq oirq;
+	u32 laddr[3];
+	int ret;
+	__be32 intspec;
+
+	/*
+	 * Ignore requests related to the emulated host bridge or the
+	 * emulated pci-to-pci bridges
+	 */
+	if (!dev->bus->number)
+		return -1;
+
+	port = &pcie->ports[dev->bus->number - 1];
+
+	/*
+	 * Build an laddr array that describes the PCI device in a DT
+	 * way
+	 */
+	laddr[0] = cpu_to_be32(port->devfn << 8);
+	laddr[1] = laddr[2] = 0;
+	intspec = cpu_to_be32(pin);
+
+	ret = of_irq_map_raw(port->dn, &intspec, 1, laddr, &oirq);
+	if (ret) {
+		dev_err(&pcie->pdev->dev,
+			"%s: of_irq_map_raw() failed, %d\n",
+			__func__, ret);
+		return ret;
+	}
+
+	return irq_create_of_mapping(oirq.controller, oirq.specifier,
+				     oirq.size);
+}
+
+/*
+ * For a given PCIe interface (represented by a mvebu_pcie_port
+ * structure), we read the PCI configuration space of the
+ * corresponding PCI-to-PCI bridge in order to find out which range of
+ * I/O addresses and memory addresses have been assigned to this PCIe
+ * interface. Using this information, we set up the appropriate
+ * address decoding windows so that the physical addresses are actually
+ * resolved to the right PCIe interface.
+ */
+static int mvebu_pcie_window_config_port(struct mvebu_pcie *pcie,
+					 struct mvebu_pcie_port *port)
+{
+	unsigned long iobase = 0;
+	int ret;
+
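+	/*
+	 * Bridge I/O base/limit decoding: bits [7:4] of the 8-bit
+	 * registers hold address bits [15:12], the 16-bit "upper"
+	 * registers hold address bits [31:16], and the low 12 bits
+	 * of the limit are implicitly 0xFFF (PCI-to-PCI bridge spec).
+	 */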
+	if (port->bridge.iolimit >= port->bridge.iobase) {
+		unsigned long iolimit = 0xFFF |
+			((port->bridge.iolimit & 0xF0) << 8) |
+			(port->bridge.iolimitupper << 16);
+		iobase = ((port->bridge.iobase & 0xF0) << 8) |
+			(port->bridge.iobaseupper << 16);
+		ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						      iobase, iolimit-iobase,
+						      IORESOURCE_IO);
+		if (ret) {
+			dev_err(&pcie->pdev->dev,
+				"%s: could not alloc PCIe %d:%d window for I/O [0x%lx; 0x%lx]\n",
+				__func__, port->port, port->lane,
+				iobase, iolimit);
+			goto out_io;
+		}
+	}
+
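+	/* Memory base/limit: register bits [15:4] hold address bits [31:20]. */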
+	if (port->bridge.memlimit >= port->bridge.membase) {
+		unsigned long membase =
+			((port->bridge.membase & 0xFFF0) << 16);
+		unsigned long memlimit =
+			((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF;
+		ret = armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						      membase, memlimit-membase,
+						      IORESOURCE_MEM);
+		if (ret) {
+			dev_err(&pcie->pdev->dev,
+				"%s: could not alloc PCIe %d:%d window for MEM [0x%lx; 0x%lx]\n",
+				__func__, port->port, port->lane,
+				membase, memlimit);
+			goto out_mem;
+		}
+	}
+
+	return 0;
+
+out_mem:
+	if (port->bridge.iolimit >= port->bridge.iobase)
+		armada_370_xp_free_pcie_window(iobase);
+out_io:
+	return ret;
+}
+
+/*
+ * Set up the address decoding windows for all PCIe interfaces.
+ */
+static int mvebu_pcie_window_config(struct mvebu_pcie *pcie)
+{
+	int i, ret;
+
+	for (i = 0; i < pcie->nports; i++) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+		if (!port->haslink)
+			continue;
+
+		ret = mvebu_pcie_window_config_port(pcie, port);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
+						 const struct resource *res,
+						 resource_size_t start,
+						 resource_size_t size,
+						 resource_size_t align)
+{
+	if (!(res->flags & IORESOURCE_IO))
+		return start;
+
+	/*
+	 * The I/O regions must be 64K aligned, because the
+	 * granularity of PCIe I/O address decoding windows is 64 K
+	 */
+	return round_up(start, SZ_64K);
+}
+
+static int mvebu_pcie_enable(struct mvebu_pcie *pcie)
+{
+	struct hw_pci hw;
+
+	memset(&hw, 0, sizeof(hw));
+
+	hw.nr_controllers = 1;
+	hw.private_data   = (void **)&pcie;
+	hw.setup          = mvebu_pcie_setup;
+	hw.map_irq        = mvebu_pcie_map_irq;
+	hw.align_resource = mvebu_pcie_align_resource;
+	hw.ops            = &mvebu_pcie_ops;
+
+	pci_common_init(&hw);
+
+	return mvebu_pcie_window_config(pcie);
+}
+
+static int __init mvebu_pcie_probe(struct platform_device *pdev)
+{
+	struct mvebu_pcie *pcie;
+	struct device_node *np = pdev->dev.of_node;
+	struct device_node *child;
+	const __be32 *range = NULL;
+	struct resource res;
+	int i, ret;
+
+	pcie = devm_kzalloc(&pdev->dev, sizeof(struct mvebu_pcie),
+			    GFP_KERNEL);
+	if (!pcie)
+		return -ENOMEM;
+
+	pcie->pdev = pdev;
+
+	pci_sw_host_bridge_init(&pcie->bridge);
+	pcie->bridge.vendor = PCI_VENDOR_ID_MARVELL;
+	pcie->bridge.device = MARVELL_EMULATED_HOST_BRIDGE_ID;
+
+	/* Get the I/O and memory ranges from DT */
+	while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
+		if (resource_type(&res) == IORESOURCE_IO) {
+			memcpy(&pcie->io, &res, sizeof(res));
+			pcie->io.name = "I/O";
+		}
+		if (resource_type(&res) == IORESOURCE_MEM) {
+			memcpy(&pcie->mem, &res, sizeof(res));
+			pcie->mem.name = "MEM";
+		}
+	}
+
+	/* Get the bus range */
+	ret = of_pci_parse_bus_range(np, &pcie->busn);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to parse bus-range property: %d\n",
+			ret);
+		return ret;
+	}
+
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		if (!of_device_is_available(child))
+			continue;
+		pcie->nports++;
+	}
+
+	pcie->ports = devm_kzalloc(&pdev->dev, pcie->nports *
+				   sizeof(struct mvebu_pcie_port),
+				   GFP_KERNEL);
+	if (!pcie->ports)
+		return -ENOMEM;
+
+	i = 0;
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+
+		if (!of_device_is_available(child))
+			continue;
+
+		if (of_property_read_u32(child, "marvell,pcie-port",
+					 &port->port)) {
+			dev_warn(&pdev->dev,
+				 "ignoring PCIe DT node, missing pcie-port property\n");
+			continue;
+		}
+
+		if (of_property_read_u32(child, "marvell,pcie-lane",
+					 &port->lane))
+			port->lane = 0;
+
+		port->devfn = of_pci_get_devfn(child);
+		if (port->devfn < 0)
+			continue;
+
+		port->base = of_iomap(child, 0);
+		if (!port->base) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot map registers\n",
+				port->port, port->lane);
+			continue;
+		}
+
+		if (orion_pcie_link_up(port->base)) {
+			port->haslink = 1;
+			dev_info(&pdev->dev, "PCIe%d.%d: link up\n",
+				 port->port, port->lane);
+		} else {
+			port->haslink = 0;
+			dev_info(&pdev->dev, "PCIe%d.%d: link down\n",
+				 port->port, port->lane);
+		}
+
+		port->clk = of_clk_get_by_name(child, NULL);
+	if (IS_ERR(port->clk)) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot get clock\n",
+			       port->port, port->lane);
+			iounmap(port->base);
+			port->haslink = 0;
+			continue;
+		}
+
+		port->dn = child;
+
+		clk_prepare_enable(port->clk);
+		spin_lock_init(&port->conf_lock);
+
+		pci_sw_pci_bridge_init(&port->bridge);
+		port->bridge.vendor = PCI_VENDOR_ID_MARVELL;
+		port->bridge.device = MARVELL_EMULATED_PCI_PCI_BRIDGE_ID;
+		port->bridge.primary_bus = 0;
+		port->bridge.secondary_bus = PCI_SLOT(port->devfn);
+		port->bridge.subordinate_bus = PCI_SLOT(port->devfn);
+
+		i++;
+	}
+
+	mvebu_pcie_enable(pcie);
+
+	return 0;
+}
+
+static const struct of_device_id mvebu_pcie_of_match_table[] = {
+	{ .compatible = "marvell,armada-370-xp-pcie", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table);
+
+static struct platform_driver mvebu_pcie_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "mvebu-pcie",
+		.of_match_table =
+		   of_match_ptr(mvebu_pcie_of_match_table),
+	},
+};
+
+static int mvebu_pcie_init(void)
+{
+	return platform_driver_probe(&mvebu_pcie_driver,
+				     mvebu_pcie_probe);
+}
+
+subsys_initcall(mvebu_pcie_init);
+
+MODULE_AUTHOR("Thomas Petazzoni <thomas.petazzoni@free-electrons.com>");
+MODULE_DESCRIPTION("Marvell EBU PCIe driver");
+MODULE_LICENSE("GPL");