Message ID | 20221130112221.66612-2-mika.westerberg@linux.intel.com
---|---
State | Superseded |
Delegated to: | Bjorn Helgaas |
Series | PCI: Distribute resources for root buses
On Wed, 30 Nov 2022 13:22:20 +0200
Mika Westerberg <mika.westerberg@linux.intel.com> wrote:

> A PCI bridge may reside on a bus with other devices as well. The
> resource distribution code does not take this into account properly and
> therefore it expands the bridge resource windows too much, not leaving
> space for the other devices (or functions a multifunction device) and
> this leads to an issue that Jonathan reported. He runs QEMU with the
> following topoology (QEMU parameters):
>
> -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2 \
> -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
> -device e1000,bus=root_port13,addr=0.1 \
> -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3 \
> -device e1000,bus=fun1
>
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
>
> pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
>
> Fix this by taking into account the possible multifunction devices when
> uptream port resources are distributed.
>
> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Trivial comment inline. Either way..

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/pci/setup-bus.c | 66 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..d456175ddc4f 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1830,10 +1830,68 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	 * bridges below.
>  	 */
>  	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> +		bridge = NULL;
> +
> +		/* Find the single bridge on this bus first */
> +		for_each_pci_bridge(dev, bus) {

We could cache this a few lines up where we calculate the
number of bridges. Perhaps not worth bothering though other
than it letting you get rid of the WARN_ON_ONCE.

> +			bridge = dev;
> +			break;
> +		}
> +
> +		if (WARN_ON_ONCE(!bridge))
> +			return;
> +		if (!bridge->subordinate)
> +			return;
> +
> +		/*
> +		 * Reduce the space available for distribution by the
> +		 * amount required by the other devices on the same bus
> +		 * as this bridge.
> +		 */
> +		list_for_each_entry(dev, &bus->devices, bus_list) {
> +			int i;
> +
> +			if (dev == bridge)
> +				continue;
> +
> +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +				const struct resource *dev_res = &dev->resource[i];
> +				resource_size_t dev_sz;
> +				struct resource *b_res;
> +
> +				if (dev_res->flags & IORESOURCE_IO) {
> +					b_res = &io;
> +				} else if (dev_res->flags & IORESOURCE_MEM) {
> +					if (dev_res->flags & IORESOURCE_PREFETCH)
> +						b_res = &mmio_pref;
> +					else
> +						b_res = &mmio;
> +				} else {
> +					continue;
> +				}
> +
> +				/* Size aligned to bridge window */
> +				align = pci_resource_alignment(bridge, b_res);
> +				dev_sz = ALIGN(resource_size(dev_res), align);
> +				if (!dev_sz)
> +					continue;
> +
> +				pci_dbg(dev, "resource %pR aligned to %#llx\n",
> +					dev_res, (unsigned long long)dev_sz);
> +
> +				if (dev_sz > resource_size(b_res))
> +					memset(b_res, 0, sizeof(*b_res));
> +				else
> +					b_res->end -= dev_sz;
> +
> +				pci_dbg(bridge, "updated available resources to %pR\n",
> +					b_res);
> +			}
> +		}
> +
> +		pci_bus_distribute_available_resources(bridge->subordinate,
> +						       add_list, io, mmio,
> +						       mmio_pref);
>  		return;
>  	}
>
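To make the window accounting in the hunk above concrete, here is a minimal user-space sketch that replays the arithmetic on the failing topology from the commit message. The 1 MiB granularity for the non-prefetchable memory window is an assumption (the conventional PCI-to-PCI bridge window alignment), and ALIGN_UP() reimplements the kernel's ALIGN() rounding purely for illustration:

    #include <stdio.h>
    #include <stdint.h>

    /* Round x up to the next multiple of a (a power of two), mirroring
     * the kernel's ALIGN() macro for illustration only. */
    #define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

    int main(void)
    {
        uint64_t win_start = 0x10200000, win_end = 0x103fffff;
        uint64_t bar_sz = 0x20000;    /* e1000 0000:02:00.1 BAR 0, 128 KiB */
        uint64_t align = 0x100000;    /* assumed 1 MiB window granularity */

        /* dev_sz = ALIGN(resource_size(dev_res), align) in the patch */
        uint64_t dev_sz = ALIGN_UP(bar_sz, align);

        /* b_res->end -= dev_sz in the patch */
        win_end -= dev_sz;

        printf("peer BAR consumes %#llx of the window\n",
               (unsigned long long)dev_sz);
        printf("left for the bridge: [mem %#llx-%#llx], %llu MiB\n",
               (unsigned long long)win_start, (unsigned long long)win_end,
               (unsigned long long)((win_end - win_start + 1) >> 20));
        return 0;
    }

With the reduction in place, the 2 MiB window [mem 0x10200000-0x103fffff] is trimmed to 1 MiB before being handed down to bus 03, leaving the other 1 MiB for the peer function 0000:02:00.1; without it, the whole window was passed down and the peer's BAR 0 assignment failed as in the log above.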
Hi Mika,

On Wed, Nov 30, 2022 at 01:22:20PM +0200, Mika Westerberg wrote:
> A PCI bridge may reside on a bus with other devices as well. The
> resource distribution code does not take this into account properly and
> therefore it expands the bridge resource windows too much, not leaving
> space for the other devices (or functions a multifunction device) and

functions *of* a

> this leads to an issue that Jonathan reported. He runs QEMU with the
> following topoology (QEMU parameters):

topology

> -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2 \
> -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
> -device e1000,bus=root_port13,addr=0.1 \
> -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3 \
> -device e1000,bus=fun1

If you use spaces instead of tabs above, the "\" will stay lined up
when git log indents.

> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
>
> pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
>
> Fix this by taking into account the possible multifunction devices when
> uptream port resources are distributed.

"upstream", although I think I would word this so it's less
PCIe-centric. IIUC, we just want to account for all the BARs on the
bus, whether they're in bridges, peers in a multi-function device, or
other devices.

> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
>  drivers/pci/setup-bus.c | 66 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..d456175ddc4f 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1830,10 +1830,68 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	 * bridges below.
>  	 */
>  	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> +		bridge = NULL;
> +
> +		/* Find the single bridge on this bus first */
> +		for_each_pci_bridge(dev, bus) {
> +			bridge = dev;
> +			break;
> +		}

If we just remember "bridge" in the loop before this hunk, could we
get rid of the loop here? E.g.,

  bridge = NULL;
  for_each_pci_bridge(dev, bus) {
    bridge = dev;
    if (dev->is_hotplug_bridge)
      hotplug_bridges++;
    else
      normal_bridges++;
  }

> +
> +		if (WARN_ON_ONCE(!bridge))
> +			return;

Then I think this would be superfluous.

> +		if (!bridge->subordinate)
> +			return;
> +
> +		/*
> +		 * Reduce the space available for distribution by the
> +		 * amount required by the other devices on the same bus
> +		 * as this bridge.
> +		 */
> +		list_for_each_entry(dev, &bus->devices, bus_list) {
> +			int i;
> +
> +			if (dev == bridge)
> +				continue;

Why do we skip "bridge"? Bridges are allowed to have two BARs
themselves, and it seems like they should be included here.

> +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +				const struct resource *dev_res = &dev->resource[i];
> +				resource_size_t dev_sz;
> +				struct resource *b_res;
> +
> +				if (dev_res->flags & IORESOURCE_IO) {
> +					b_res = &io;
> +				} else if (dev_res->flags & IORESOURCE_MEM) {
> +					if (dev_res->flags & IORESOURCE_PREFETCH)
> +						b_res = &mmio_pref;
> +					else
> +						b_res = &mmio;
> +				} else {
> +					continue;
> +				}
> +
> +				/* Size aligned to bridge window */
> +				align = pci_resource_alignment(bridge, b_res);
> +				dev_sz = ALIGN(resource_size(dev_res), align);
> +				if (!dev_sz)
> +					continue;
> +
> +				pci_dbg(dev, "resource %pR aligned to %#llx\n",
> +					dev_res, (unsigned long long)dev_sz);
> +
> +				if (dev_sz > resource_size(b_res))
> +					memset(b_res, 0, sizeof(*b_res));
> +				else
> +					b_res->end -= dev_sz;
> +
> +				pci_dbg(bridge, "updated available resources to %pR\n",
> +					b_res);
> +			}
> +		}

This only happens for buses with a single bridge. Shouldn't it happen
regardless of how many bridges there are?

This block feels like something that could be split out to a separate
function. It looks like it only needs "bus", "io", "mmio",
"mmio_pref", and maybe "bridge".

I don't understand the "bridge" part; it looks like that's basically
to use 4K alignment for I/O windows and 1M for memory windows?
Using "bridge" seems like a clunky way to figure that out.

Bjorn
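For reference, the block Bjorn asks to have split out could become a helper along the following lines. This is a sketch only: the function name is invented here, and the body simply restates the logic of the posted patch with the windows passed in by pointer:

    /* Sketch: shrink the available windows by what the devices on @bus
     * other than @bridge would consume, so that space is not handed
     * down to @bridge's subordinate bus. */
    static void reduce_available_resources(struct pci_bus *bus,
                                           struct pci_dev *bridge,
                                           struct resource *io,
                                           struct resource *mmio,
                                           struct resource *mmio_pref)
    {
        struct pci_dev *dev;

        list_for_each_entry(dev, &bus->devices, bus_list) {
            int i;

            if (dev == bridge)
                continue;

            for (i = 0; i < PCI_NUM_RESOURCES; i++) {
                const struct resource *dev_res = &dev->resource[i];
                resource_size_t align, dev_sz;
                struct resource *b_res;

                if (dev_res->flags & IORESOURCE_IO) {
                    b_res = io;
                } else if (dev_res->flags & IORESOURCE_MEM) {
                    if (dev_res->flags & IORESOURCE_PREFETCH)
                        b_res = mmio_pref;
                    else
                        b_res = mmio;
                } else {
                    continue;
                }

                /* Size rounded up to the bridge window granularity */
                align = pci_resource_alignment(bridge, b_res);
                dev_sz = ALIGN(resource_size(dev_res), align);
                if (!dev_sz)
                    continue;

                if (dev_sz > resource_size(b_res))
                    memset(b_res, 0, sizeof(*b_res));
                else
                    b_res->end -= dev_sz;
            }
        }
    }

The single-bridge branch in pci_bus_distribute_available_resources() would then shrink to a call to reduce_available_resources(bus, bridge, &io, &mmio, &mmio_pref) followed by the recursion into bridge->subordinate.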
On Fri, Dec 02, 2022 at 05:45:13PM +0000, Jonathan Cameron wrote:
> On Wed, 30 Nov 2022 13:22:20 +0200
> Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> >  	if (hotplug_bridges + normal_bridges == 1) {
> > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > -		if (dev->subordinate)
> > -			pci_bus_distribute_available_resources(dev->subordinate,
> > -				add_list, io, mmio, mmio_pref);
> > +		bridge = NULL;
> > +
> > +		/* Find the single bridge on this bus first */
>
> > +		for_each_pci_bridge(dev, bus) {
>
> We could cache this a few lines up where we calculate the
> number of bridges. Perhaps not worth bothering though other
> than it letting you get rid of the WARN_ON_ONCE.

Sorry for repeating this; I saw your response, but it didn't sink in
before I responded.

Bjorn
Hi,

On Fri, Dec 02, 2022 at 05:34:24PM -0600, Bjorn Helgaas wrote:
> Hi Mika,
>
> On Wed, Nov 30, 2022 at 01:22:20PM +0200, Mika Westerberg wrote:
> > A PCI bridge may reside on a bus with other devices as well. The
> > resource distribution code does not take this into account properly and
> > therefore it expands the bridge resource windows too much, not leaving
> > space for the other devices (or functions a multifunction device) and
>
> functions *of* a
>
> > this leads to an issue that Jonathan reported. He runs QEMU with the
> > following topoology (QEMU parameters):
>
> topology
>
> > -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2 \
> > -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
> > -device e1000,bus=root_port13,addr=0.1 \
> > -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3 \
> > -device e1000,bus=fun1
>
> If you use spaces instead of tabs above, the "\" will stay lined up
> when git log indents.

Sure.

> > The first e1000 NIC here is another function in the switch upstream
> > port. This leads to following errors:
> >
> > pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> > pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> > pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> > e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> >
> > Fix this by taking into account the possible multifunction devices when
> > uptream port resources are distributed.
>
> "upstream", although I think I would word this so it's less
> PCIe-centric. IIUC, we just want to account for all the BARs on the
> bus, whether they're in bridges, peers in a multi-function device, or
> other devices.

Okay.

> > Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> > Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> >  drivers/pci/setup-bus.c | 66 ++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 62 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index b4096598dbcb..d456175ddc4f 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -1830,10 +1830,68 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >  	 * bridges below.
> >  	 */
> >  	if (hotplug_bridges + normal_bridges == 1) {
> > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > -		if (dev->subordinate)
> > -			pci_bus_distribute_available_resources(dev->subordinate,
> > -				add_list, io, mmio, mmio_pref);
> > +		bridge = NULL;
> > +
> > +		/* Find the single bridge on this bus first */
> > +		for_each_pci_bridge(dev, bus) {
> > +			bridge = dev;
> > +			break;
> > +		}
>
> If we just remember "bridge" in the loop before this hunk, could we
> get rid of the loop here? E.g.,
>
>   bridge = NULL;
>   for_each_pci_bridge(dev, bus) {
>     bridge = dev;
>     if (dev->is_hotplug_bridge)
>       hotplug_bridges++;
>     else
>       normal_bridges++;
>   }

Yes, I think that would work too.

> > +
> > +		if (WARN_ON_ONCE(!bridge))
> > +			return;
>
> Then I think this would be superfluous.
>
> > +		if (!bridge->subordinate)
> > +			return;
> > +
> > +		/*
> > +		 * Reduce the space available for distribution by the
> > +		 * amount required by the other devices on the same bus
> > +		 * as this bridge.
> > +		 */
> > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> > +			int i;
> > +
> > +			if (dev == bridge)
> > +				continue;
>
> Why do we skip "bridge"? Bridges are allowed to have two BARs
> themselves, and it seems like they should be included here.

Good point but then we would need to skip the bridge window resources
below to avoid accounting them.

> > +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> > +				const struct resource *dev_res = &dev->resource[i];
> > +				resource_size_t dev_sz;
> > +				struct resource *b_res;
> > +
> > +				if (dev_res->flags & IORESOURCE_IO) {
> > +					b_res = &io;
> > +				} else if (dev_res->flags & IORESOURCE_MEM) {
> > +					if (dev_res->flags & IORESOURCE_PREFETCH)
> > +						b_res = &mmio_pref;
> > +					else
> > +						b_res = &mmio;
> > +				} else {
> > +					continue;
> > +				}
> > +
> > +				/* Size aligned to bridge window */
> > +				align = pci_resource_alignment(bridge, b_res);
> > +				dev_sz = ALIGN(resource_size(dev_res), align);
> > +				if (!dev_sz)
> > +					continue;
> > +
> > +				pci_dbg(dev, "resource %pR aligned to %#llx\n",
> > +					dev_res, (unsigned long long)dev_sz);
> > +
> > +				if (dev_sz > resource_size(b_res))
> > +					memset(b_res, 0, sizeof(*b_res));
> > +				else
> > +					b_res->end -= dev_sz;
> > +
> > +				pci_dbg(bridge, "updated available resources to %pR\n",
> > +					b_res);
> > +			}
> > +		}
>
> This only happens for buses with a single bridge. Shouldn't it happen
> regardless of how many bridges there are?

This branch specifically deals with the "upstream port" so it gives all
the spare resources to that upstream port. The whole resource
distribution is actually done to accommodate Thunderbolt/USB4
topologies which involve only PCIe devices, so we always have a PCIe
upstream port and downstream ports, some of which are able to perform
native PCIe hotplug. And for those ports we want to distribute the
available resources so that they can expand to further topologies.

I'm slightly concerned that forcing this to support the "generic" PCI
case makes this rather complicated. This is something that never appears
in regular PCI-based systems because we never distribute resources
for those in the first place (->is_hotplug_bridge needs to be set).

> This block feels like something that could be split out to a separate
> function. It looks like it only needs "bus", "io", "mmio",
> "mmio_pref", and maybe "bridge".

Makes sense.

> I don't understand the "bridge" part; it looks like that's basically
> to use 4K alignment for I/O windows and 1M for memory windows?
> Using "bridge" seems like a clunky way to figure that out.

Okay, but if not using "bridge", how exactly do you suggest doing the
calculation?
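The skip Mika mentions maps onto the layout of the pci_dev resource table, where a device's own BARs sit at the PCI_STD_RESOURCES indices and a bridge's forwarding windows start at PCI_BRIDGE_RESOURCES. A sketch (not the posted code) of a predicate that would let the loop charge the bridge's own BARs while still skipping its windows:

    /*
     * Sketch: should resource @i of @dev be charged against the windows
     * being distributed?  A bridge's own BARs live at the
     * PCI_STD_RESOURCES indices of its resource table; its forwarding
     * windows, from PCI_BRIDGE_RESOURCES on, are exactly the space
     * being distributed, so charging them would count that space twice.
     */
    static bool dev_resource_is_charged(struct pci_dev *dev,
                                        struct pci_dev *bridge, int i)
    {
        if (dev == bridge && i >= PCI_BRIDGE_RESOURCES)
            return false;
        return true;
    }

The patch's blanket "if (dev == bridge) continue;" would then become "if (!dev_resource_is_charged(dev, bridge, i)) continue;" inside the per-resource loop. Bjorn picks up the underlying question in the next message.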
On Mon, Dec 05, 2022 at 09:28:30AM +0200, Mika Westerberg wrote:
> On Fri, Dec 02, 2022 at 05:34:24PM -0600, Bjorn Helgaas wrote:
> > On Wed, Nov 30, 2022 at 01:22:20PM +0200, Mika Westerberg wrote:
> > > A PCI bridge may reside on a bus with other devices as well. The
> > > resource distribution code does not take this into account properly and
> > > therefore it expands the bridge resource windows too much, not leaving
> > > space for the other devices (or functions a multifunction device) and

> > > +		 * Reduce the space available for distribution by the
> > > +		 * amount required by the other devices on the same bus
> > > +		 * as this bridge.
> > > +		 */
> > > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> > > +			int i;
> > > +
> > > +			if (dev == bridge)
> > > +				continue;
> >
> > Why do we skip "bridge"? Bridges are allowed to have two BARs
> > themselves, and it seems like they should be included here.
>
> Good point but then we would need to skip the bridge window resources
> below to avoid accounting them.

Seems like we should handle bridge BARs. There are definitely bridges
(PCIe for sure, I dunno about conventional PCI) that implement them
and some drivers starting to appear that use them for performance
monitoring, etc.

> > This only happens for buses with a single bridge. Shouldn't it happen
> > regardless of how many bridges there are?
>
> This branch specifically deals with the "upstream port" so it gives all
> the spare resources to that upstream port. The whole resource
> distribution is actually done to accommodate Thunderbolt/USB4
> topologies which involve only PCIe devices, so we always have a PCIe
> upstream port and downstream ports, some of which are able to perform
> native PCIe hotplug. And for those ports we want to distribute the
> available resources so that they can expand to further topologies.
>
> I'm slightly concerned that forcing this to support the "generic" PCI
> case makes this rather complicated. This is something that never appears
> in regular PCI-based systems because we never distribute resources
> for those in the first place (->is_hotplug_bridge needs to be set).

This code is fairly complicated in any case :) I understand why this
is useful for Thunderbolt topologies, but it should be equally useful
for other hotplug topologies because at this level we're purely
talking about the address space needed by devices and how that space
is assigned and routed through bridges. Nothing unique to Thunderbolt
here.

I don't think we should make this PCIe-specific. ->is_hotplug_bridge
is set by a PCIe path (set_pcie_hotplug_bridge()), but also by
check_hotplug_bridge() in acpiphp, which could be any flavor of PCI,
and I don't think there's anything intrinsically PCIe-specific about
it.

> > I don't understand the "bridge" part; it looks like that's basically
> > to use 4K alignment for I/O windows and 1M for memory windows?
> > Using "bridge" seems like a clunky way to figure that out.
>
> Okay, but if not using "bridge", how exactly do you suggest doing the
> calculation?

I was thinking it would always be 4K or 1M, but I guess that's
actually not true. There are some Intel bridges that support 1K
alignment for I/O windows, and some powerpc hypervisor stuff that can
also influence the alignment. And it looks like we still need to
figure out which b_res to use, so we couldn't get rid of the IO/MEM
case analysis. So never mind, I guess ...

Bjorn
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index b4096598dbcb..d456175ddc4f 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1830,10 +1830,68 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 	 * bridges below.
 	 */
 	if (hotplug_bridges + normal_bridges == 1) {
-		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
-		if (dev->subordinate)
-			pci_bus_distribute_available_resources(dev->subordinate,
-				add_list, io, mmio, mmio_pref);
+		bridge = NULL;
+
+		/* Find the single bridge on this bus first */
+		for_each_pci_bridge(dev, bus) {
+			bridge = dev;
+			break;
+		}
+
+		if (WARN_ON_ONCE(!bridge))
+			return;
+		if (!bridge->subordinate)
+			return;
+
+		/*
+		 * Reduce the space available for distribution by the
+		 * amount required by the other devices on the same bus
+		 * as this bridge.
+		 */
+		list_for_each_entry(dev, &bus->devices, bus_list) {
+			int i;
+
+			if (dev == bridge)
+				continue;
+
+			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+				const struct resource *dev_res = &dev->resource[i];
+				resource_size_t dev_sz;
+				struct resource *b_res;
+
+				if (dev_res->flags & IORESOURCE_IO) {
+					b_res = &io;
+				} else if (dev_res->flags & IORESOURCE_MEM) {
+					if (dev_res->flags & IORESOURCE_PREFETCH)
+						b_res = &mmio_pref;
+					else
+						b_res = &mmio;
+				} else {
+					continue;
+				}
+
+				/* Size aligned to bridge window */
+				align = pci_resource_alignment(bridge, b_res);
+				dev_sz = ALIGN(resource_size(dev_res), align);
+				if (!dev_sz)
+					continue;
+
+				pci_dbg(dev, "resource %pR aligned to %#llx\n",
+					dev_res, (unsigned long long)dev_sz);
+
+				if (dev_sz > resource_size(b_res))
+					memset(b_res, 0, sizeof(*b_res));
+				else
+					b_res->end -= dev_sz;
+
+				pci_dbg(bridge, "updated available resources to %pR\n",
+					b_res);
+			}
+		}
+
+		pci_bus_distribute_available_resources(bridge->subordinate,
+						       add_list, io, mmio,
+						       mmio_pref);
 		return;
 	}
A PCI bridge may reside on a bus with other devices as well. The
resource distribution code does not take this into account properly and
therefore it expands the bridge resource windows too much, not leaving
space for the other devices (or functions a multifunction device) and
this leads to an issue that Jonathan reported. He runs QEMU with the
following topoology (QEMU parameters):

-device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2 \
-device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
-device e1000,bus=root_port13,addr=0.1 \
-device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3 \
-device e1000,bus=fun1

The first e1000 NIC here is another function in the switch upstream
port. This leads to following errors:

pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]

Fix this by taking into account the possible multifunction devices when
uptream port resources are distributed.

Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
 drivers/pci/setup-bus.c | 66 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 4 deletions(-)
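For anyone reproducing the report, the -device fragment above is not a complete QEMU invocation. A hypothetical full command line, assuming the q35 machine type (pcie-root-port requires a PCIe host bridge, which the default pc machine lacks) and with kernel and disk arguments omitted:

    qemu-system-x86_64 -M q35 \
        -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2 \
        -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on \
        -device e1000,bus=root_port13,addr=0.1 \
        -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3 \
        -device e1000,bus=fun1

Without the fix, the guest kernel's dmesg then shows the failed BAR assignment quoted in the log above.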