Message ID | 20140408180828.GC32490@obsidianresearch.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Dear Jason Gunthorpe, On Tue, 8 Apr 2014 12:08:28 -0600, Jason Gunthorpe wrote: > > BTW I can try your patch with the myricom NIC which started to work > > when rounding up, I'll quickly see if that fixes the issue as well, > > but I'm now a little bit confused. > > This patch won't fix anything, it just fixes the mbus driver to always > program the hardware with valid values, even if the upper layer > requests something invalid. Basically, we spent a few weeks tracking > this behavior down, the patch would ensure that time doesn't get spent > again. Agreed. > The goal now is to avoid the WARN_ON in the patch from firing. > > For a proper fix, something like this to create multiple aligned > windows is a simple option (untested, need the inverse on the > de-register, loop should probably live in mbus code): > > diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c > index 789cdb2..7312c82 100644 > --- a/drivers/pci/host/pci-mvebu.c > +++ b/drivers/pci/host/pci-mvebu.c > @@ -382,6 +382,9 @@ static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port) > > static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port) > { > + phys_addr_t base; > + unsigned int mapped_size; > + > /* Are the new membase/memlimit values invalid? */ > if (port->bridge.memlimit < port->bridge.membase || > !(port->bridge.command & PCI_COMMAND_MEMORY)) { > @@ -408,8 +411,16 @@ static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port) > (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) - > port->memwin_base; > > - mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr, > - port->memwin_base, port->memwin_size); > + base = port->memwin_base; > + mapped_size = 0; > + while (mapped_size < port->memwin_size) { > + unsigned int size = > + rounddown_pow_of_two(port->memwin_size - mapped_size); > + mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr, > + base, size); > + base += size; > + mapped_size += size; > + } > } The problem I have with this approach is the consumption of windows. We have a limited number of them, and this approach could easily consume quite a few windows for one single bridge BAR, depending on its size. Like a bridge BAR of 120 MB would consume four windows (one 64 MB, one 32 MB, one 16 MB and one 8 MB). Instead, I would really like to see the PCI core being told about this constraint, so that it can oversize the bridge BAR (to 128 MB in the above example of a 120 MB bridge BAR). Of course, it would be better if the PCI core would not blindly apply this logic: if there is a 129 MB bridge BAR, we'd prefer to have two windows (one 128 MB and one 1 MB), so as to not loose too much physical address space. Thomas
On Tue, Apr 08, 2014 at 08:15:47PM +0200, Thomas Petazzoni wrote: > The problem I have with this approach is the consumption of windows. We > have a limited number of them, and this approach could easily consume > quite a few windows for one single bridge BAR, depending on its size. > Like a bridge BAR of 120 MB would consume four windows (one 64 MB, one > 32 MB, one 16 MB and one 8 MB). I agree, but on the other hand, the three people hitting this problem have lots and lots of free windows. This won't work for someone using a large number of PEX ports, but it does work right now, and it is small enough to go into -stable.. > Instead, I would really like to see the PCI core being told about this > constraint, so that it can oversize the bridge BAR (to 128 MB in the > above example of a 120 MB bridge BAR). Of course, it would be better if > the PCI core would not blindly apply this logic: if there is a 129 MB > bridge BAR, we'd prefer to have two windows (one 128 MB and one 1 MB), > so as to not loose too much physical address space. Right, it would be great if the alignment function could alter the size as well - but as you say, once we get there you will still want to allocate more than 1 mbus window per bridge window, so the code will still be required .. Also, as you pointed out, the base alignment is also important to consider, so this needs fixing too: if (res->flags & IORESOURCE_IO) return round_up(start, max_t(resource_size_t, SZ_64K, size)); else if (res->flags & IORESOURCE_MEM) return round_up(start, max_t(resource_size_t, SZ_1M, size)); What does round_up do when size is not a power of 2? Not the right thing, I think. Size should be rounded first.. And the rough loop should consider the base alignment when constraining the size as well... Jason
On Tue, Apr 08, 2014 at 12:08:28PM -0600, Jason Gunthorpe wrote: > > So if I understand it right, for example when my first NIC registers > > e0000000-e02fffff, then we'll truncate it down to e0000000-e01fffff, > > won't that cause any issue, for example if the NIC needs to access > > data beyond that limit ? > > The mbus windows are assigned to the bridge, not to the NIC bars: > > 00:0a.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:7846] (rev 02) (prog-if 00 [Normal decode]) > Memory behind bridge: e0000000-e17fffff > > So the above is not OK, 17fffff is not a power of two. The patch > causes the mbus to WARN_ON and then round down so that the hardware is > at least in a defined configuration. > > > BTW I can try your patch with the myricom NIC which started to work > > when rounding up, I'll quickly see if that fixes the issue as well, > > but I'm now a little bit confused. > > This patch won't fix anything, it just fixes the mbus driver to always > program the hardware with valid values, even if the upper layer > requests something invalid. OK that's what I understood, but I mean why rounding down instead of up in order to correctly cover the window the device expects ? And we all found that rounding up fixed our respective devices (3 igb NICs, one myri10ge NIC, one SATA controller I believe). By rounding up you'd have had 1ffffff instead, which at least covers the window the device expects. It's just this specific point of the reason for rounding down versus up that concerns me. Thanks, Willy
On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote: > OK that's what I understood, but I mean why rounding down instead of up > in order to correctly cover the window the device expects ? And we all > found that rounding up fixed our respective devices (3 igb NICs, one > myri10ge NIC, one SATA controller I believe). > > By rounding up you'd have had 1ffffff instead, which at least covers the > window the device expects. It is also an error to configure the mbus to have overlaping windows and the code checks for overlaps with the base/size given. If we round up then it might create an overlap. You guys were OK with the round up in the PCI code only because your systems have a single used PEX so the rounding didn't create an overlaping situation.. The generic code should either round down, or compeltely bail, as Thomas suggested. Fundementally we shouldn't get to the WARN_ON, so what happens after is just an attempt to salvage something from the situation. Jason
This is just a guess but perhaps the best we could do would be to round up but also add logic to fail completely if adding a PCI device would cause a window to overlap? This would mean situations will be fixed where we can without design changes and situations that cannot be fixed will fail with clear errors? On 8 Apr 2014, at 20:21, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> wrote: > On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote: > >> OK that's what I understood, but I mean why rounding down instead of up >> in order to correctly cover the window the device expects ? And we all >> found that rounding up fixed our respective devices (3 igb NICs, one >> myri10ge NIC, one SATA controller I believe). >> >> By rounding up you'd have had 1ffffff instead, which at least covers the >> window the device expects. > > It is also an error to configure the mbus to have overlaping windows > and the code checks for overlaps with the base/size given. If we round > up then it might create an overlap. > > You guys were OK with the round up in the PCI code only because your > systems have a single used PEX so the rounding didn't create an > overlaping situation.. > > The generic code should either round down, or compeltely bail, as > Thomas suggested. > > Fundementally we shouldn't get to the WARN_ON, so what happens after > is just an attempt to salvage something from the situation. > > Jason
On Tue, 8 Apr 2014, Jason Gunthorpe wrote: > On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote: > >> OK that's what I understood, but I mean why rounding down instead of up >> in order to correctly cover the window the device expects ? And we all >> found that rounding up fixed our respective devices (3 igb NICs, one >> myri10ge NIC, one SATA controller I believe). >> >> By rounding up you'd have had 1ffffff instead, which at least covers the >> window the device expects. > > It is also an error to configure the mbus to have overlaping windows > and the code checks for overlaps with the base/size given. If we round > up then it might create an overlap. If I use only the rounding up part of my patch and not the 4M alignment part then this is the exact scenario I get: 00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:6710] (rev 01) (prog-if 00 [Normal decode]) Memory behind bridge: e0000000-e02fffff Prefetchable memory behind bridge: 00000000-000fffff 00:02.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:6710] (rev 01) (prog-if 00 [Normal decode]) Memory behind bridge: e0300000-e03fffff Prefetchable memory behind bridge: 00000000-000fffff The window for bridge 00:01.0 gets rounded up to cover e0000000-e03fffff and the attempt to add the window for bridge 00:02.0 fails. > You guys were OK with the round up in the PCI code only because your > systems have a single used PEX so the rounding didn't create an > overlaping situation.. In my case it worked because I changed the alignment to 4M to prevent the overlap. > The generic code should either round down, or compeltely bail, as > Thomas suggested. > > Fundementally we shouldn't get to the WARN_ON, so what happens after > is just an attempt to salvage something from the situation. Agreed. Cheers, Neil
Hi Jason, On Tue, Apr 08, 2014 at 01:21:41PM -0600, Jason Gunthorpe wrote: > On Tue, Apr 08, 2014 at 09:15:14PM +0200, Willy Tarreau wrote: > > > OK that's what I understood, but I mean why rounding down instead of up > > in order to correctly cover the window the device expects ? And we all > > found that rounding up fixed our respective devices (3 igb NICs, one > > myri10ge NIC, one SATA controller I believe). > > > > By rounding up you'd have had 1ffffff instead, which at least covers the > > window the device expects. > > It is also an error to configure the mbus to have overlaping windows > and the code checks for overlaps with the base/size given. If we round > up then it might create an overlap. OK, Thomas just explained this to me over the phone, it makes sense now. In fact, I think that a mix between your patch to create multiple windows and something Thomas wanted to do (ie allocate larger and inform the PCI core) would be a good solution by preferring to round up small windows (low waste) and create multiple windows where the waste is too large (eg: address space divided by the max number of windows). Thank you for your explanation. Willy
Dear Matthew Minter,
On Tue, 8 Apr 2014 21:17:47 +0100, Matthew Minter wrote:
> This is just a guess but perhaps the best we could do would be to round up but also add logic to fail completely if adding a PCI device would cause a window to overlap?
Rounding up does not work. Let me try to explain why.
The PCI core enumerates all the BARs into your device and decides that
it should be mapped from 0xe0000000 to 0xe0900000 (9 MB). The PCI core
writes those values into the BAR of the emulated bridge in the
pci-mvebu driver. The pci-mvebu driver captures this write, and figures
out that 9 MB is not good for a MBus window. So it rounds it up to 16
MB, from 0xe0000000 to 0xe1000000, and configures the MBus window. Your
device will work perfectly fine.
Now, the PCI core goes on and enumerate the next device. This device
needs 1 MB of memory space. From the PCI core point of view, only
0xe0000000 to 0xe0900000 is occupied, so for the next device, it may
well decide to use 0xe0900000 -> 0xe0a00000, write this to the BAR of
the emulated bridge, which will lead the emulated bridge to try to
create a window for this area... which isn't possible as it overlaps
the previous window.
Conclusion: rounding up the size of the BARs cannot be done without the
cooperation of the PCI core.
Thomas
diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c index 789cdb2..7312c82 100644 --- a/drivers/pci/host/pci-mvebu.c +++ b/drivers/pci/host/pci-mvebu.c @@ -382,6 +382,9 @@ static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port) static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port) { + phys_addr_t base; + unsigned int mapped_size; + /* Are the new membase/memlimit values invalid? */ if (port->bridge.memlimit < port->bridge.membase || !(port->bridge.command & PCI_COMMAND_MEMORY)) { @@ -408,8 +411,16 @@ static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port) (((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF) - port->memwin_base; - mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr, - port->memwin_base, port->memwin_size); + base = port->memwin_base; + mapped_size = 0; + while (mapped_size < port->memwin_size) { + unsigned int size = + rounddown_pow_of_two(port->memwin_size - mapped_size); + mvebu_mbus_add_window_by_id(port->mem_target, port->mem_attr, + base, size); + base += size; + mapped_size += size; + } }