Message ID | 20210304213902.83903-13-marcan@marcan.st (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | Apple M1 SoC platform bring-up | expand |
On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote: > This implements the 'nonposted-mmio' and 'posted-mmio' boolean > properties. Placing these properties in a bus marks all child devices as > requiring non-posted or posted MMIO mappings. If no such properties are > found, the default is posted MMIO. > > of_mmio_is_nonposted() performs the tree walking to determine if a given > device has requested non-posted MMIO. > > of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED > flag on resources that require non-posted MMIO. > > of_iomap() and of_io_request_and_map() then use this flag to pick the > correct ioremap() variant. > > This mechanism is currently restricted to Apple ARM platforms, as an > optimization. > > Signed-off-by: Hector Martin <marcan@marcan.st> This is fine with me atleast given that the nonposted IO is acceptable. Caching the quirk state in a static local is maybe a bit overoptimized but if the compiler can't see it but we can, why not. Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Yours, Linus Walleij
On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote: > > This implements the 'nonposted-mmio' and 'posted-mmio' boolean > properties. Placing these properties in a bus marks all child devices as > requiring non-posted or posted MMIO mappings. If no such properties are > found, the default is posted MMIO. > > of_mmio_is_nonposted() performs the tree walking to determine if a given > device has requested non-posted MMIO. > > of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED > flag on resources that require non-posted MMIO. > > of_iomap() and of_io_request_and_map() then use this flag to pick the > correct ioremap() variant. > > This mechanism is currently restricted to Apple ARM platforms, as an > optimization. ... > @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index) > if (of_address_to_resource(np, index, &res)) > return NULL; > > - return ioremap(res.start, resource_size(&res)); > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > + return ioremap_np(res.start, resource_size(&res)); > + else > + return ioremap(res.start, resource_size(&res)); This doesn't sound right. Why _np is so exceptional? Why don't we have other flavours (it also rings a bell to my previous comment that the flag in ioresource is not in the right place)? ... > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > + mem = ioremap_np(res.start, resource_size(&res)); > + else > + mem = ioremap(res.start, resource_size(&res)); > + Ditto. ... > + while (node) { > + if (!of_property_read_bool(node, "ranges")) { > + break; > + } else if (of_property_read_bool(node, "nonposted-mmio")) { > + of_node_put(node); > + return true; > + } else if (of_property_read_bool(node, "posted-mmio")) { > + break; > + } > + parent = of_get_parent(node); > + of_node_put(node); > + node = parent; > + } I believe above can be slightly optimized. Don't we have helpers to traverse to all parents?
On 06/03/2021 00.13, Andy Shevchenko wrote: >> @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index) >> if (of_address_to_resource(np, index, &res)) >> return NULL; >> >> - return ioremap(res.start, resource_size(&res)); >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) >> + return ioremap_np(res.start, resource_size(&res)); >> + else >> + return ioremap(res.start, resource_size(&res)); > > This doesn't sound right. Why _np is so exceptional? Why don't we have > other flavours (it also rings a bell to my previous comment that the > flag in ioresource is not in the right place)? This is different from other variants, because until now *drivers* have made the choice of what ioremap mode to use based on device requirements (which means ioremap() 99% of the time, and then framebuffers and other memory-ish things such use something else). Now we have a *SoC fabric* that is calling the shots on what ioremap mode we have to use - and *every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they break. So it seems a lot cleaner to make the choice for drivers here to upgrade ioremap() to ioremap_np() for SoCs that need it. If we don't do something like this here or in otherwise common code, we'd have to have an open-coded "if apple then ioremap_np, else ioremap" in every driver that runs on-die devices on these SoCs, even ones that are otherwise standard and need few or no Apple-specific quirks. We're still going to have to patch some drivers to use managed APIs that can properly hit this conditional (like I did for samsung_tty) in cases where they currently don't, but that's a lot cleaner than an open-coded conditional, I think (and often comes with other benefits anyway). Note that wholesale making ioremap() behave like ioremap_np() at the arch level as as SoC quirk is not an option - for extenal PCIe devices, we still need to use ioremap(). We tried this approach initially but it doesn't work. Hence we arrived at this solution which describes the required mode in the devicetree, at the bus level (which makes sense, since that represents the fabric), and then these wrappers can use that information, carried over via the bit in struct device, to pick the right ioremap mode. It doesn't really make sense to include the other variants here, because _np is strictly stronger than the default. Downgrading ioremap to any other variant would break most drivers, badly. However, upgrading to ioremap_np() is always correct (if possibly slower), on platforms where it is allowed by the bus. In fact, I bet that on many systems nGnRE already behaves like nGnRnE anyway. I don't know why Apple didn't just allow nGnRE mappings to work (behaving like nGnRnE) instead of making them explode, which is the whole reason we have to do this. >> + while (node) { >> + if (!of_property_read_bool(node, "ranges")) { >> + break; >> + } else if (of_property_read_bool(node, "nonposted-mmio")) { >> + of_node_put(node); >> + return true; >> + } else if (of_property_read_bool(node, "posted-mmio")) { >> + break; >> + } >> + parent = of_get_parent(node); >> + of_node_put(node); >> + node = parent; >> + } > > I believe above can be slightly optimized. Don't we have helpers to > traverse to all parents? Keep in mind the logic here is that it stops on the first instance of either property, and does not traverse non-translatable boundaries. Are there helpers that can implement this kind of complex logic? It's not a simple recursive property lookup.
On Fri, Mar 5, 2021 at 9:13 AM Andy Shevchenko <andy.shevchenko@gmail.com> wrote: > > On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote: > > > > This implements the 'nonposted-mmio' and 'posted-mmio' boolean > > properties. Placing these properties in a bus marks all child devices as > > requiring non-posted or posted MMIO mappings. If no such properties are > > found, the default is posted MMIO. > > > > of_mmio_is_nonposted() performs the tree walking to determine if a given > > device has requested non-posted MMIO. > > > > of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED > > flag on resources that require non-posted MMIO. > > > > of_iomap() and of_io_request_and_map() then use this flag to pick the > > correct ioremap() variant. > > > > This mechanism is currently restricted to Apple ARM platforms, as an > > optimization. > > ... > > > @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index) > > if (of_address_to_resource(np, index, &res)) > > return NULL; > > > > - return ioremap(res.start, resource_size(&res)); > > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > > + return ioremap_np(res.start, resource_size(&res)); > > + else > > + return ioremap(res.start, resource_size(&res)); > > This doesn't sound right. Why _np is so exceptional? Why don't we have > other flavours (it also rings a bell to my previous comment that the > flag in ioresource is not in the right place)? > > ... > > > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > > + mem = ioremap_np(res.start, resource_size(&res)); > > + else > > + mem = ioremap(res.start, resource_size(&res)); > > + > > Ditto. > > ... > > > + while (node) { > > + if (!of_property_read_bool(node, "ranges")) { > > + break; > > + } else if (of_property_read_bool(node, "nonposted-mmio")) { > > + of_node_put(node); > > + return true; > > + } else if (of_property_read_bool(node, "posted-mmio")) { > > + break; > > + } > > + parent = of_get_parent(node); > > + of_node_put(node); > > + node = parent; > > + } > > I believe above can be slightly optimized. Don't we have helpers to > traverse to all parents? We don't. I only found a handful of cases mostly in arch/powerpc. Given that and this series is big enough already, I don't think we need a helper as part of it. But patches welcome. Rob
On Fri, Mar 5, 2021 at 5:55 PM Hector Martin <marcan@marcan.st> wrote: > On 06/03/2021 00.13, Andy Shevchenko wrote: ... > >> - return ioremap(res.start, resource_size(&res)); > >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) > >> + return ioremap_np(res.start, resource_size(&res)); > >> + else > >> + return ioremap(res.start, resource_size(&res)); > > > > This doesn't sound right. Why _np is so exceptional? Why don't we have > > other flavours (it also rings a bell to my previous comment that the > > flag in ioresource is not in the right place)? > > This is different from other variants, because until now *drivers* have > made the choice of what ioremap mode to use based on device requirements > (which means ioremap() 99% of the time, and then framebuffers and other > memory-ish things such use something else). Now we have a *SoC fabric* > that is calling the shots on what ioremap mode we have to use - and > *every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they > break. So it seems a lot cleaner to make the choice for drivers here to > upgrade ioremap() to ioremap_np() for SoCs that need it. Yes, that is a good idea. Once we discussed x86 and _uc cases and actually on x86 it makes a lot of sense to have ioremap() == ioremap_uc(). Can't be this a similar case here? Arnd, what do you think of actually providing an ioremap() as some kind of "best for the architecture the code is running on"? Otherwise if the same driver happens to be needed on different architectures, oops, ifdeffery or simple conditionals over the code is really not the best way to solve it. > If we don't do something like this here or in otherwise common code, > we'd have to have an open-coded "if apple then ioremap_np, else ioremap" > in every driver that runs on-die devices on these SoCs, even ones that > are otherwise standard and need few or no Apple-specific quirks. Exactly! But what about architectures where _uc is that one? So, why does your patch only take part of _np case? (Hint we have x86 Device Tree based platforms) > We're still going to have to patch some drivers to use managed APIs that > can properly hit this conditional (like I did for samsung_tty) in cases > where they currently don't, but that's a lot cleaner than an open-coded > conditional, I think (and often comes with other benefits anyway). > > Note that wholesale making ioremap() behave like ioremap_np() at the > arch level as as SoC quirk is not an option - for extenal PCIe devices, > we still need to use ioremap(). We tried this approach initially but it > doesn't work. Hence we arrived at this solution which describes the > required mode in the devicetree, at the bus level (which makes sense, > since that represents the fabric), and then these wrappers can use that > information, carried over via the bit in struct device, to pick the > right ioremap mode. > > It doesn't really make sense to include the other variants here, because > _np is strictly stronger than the default. Downgrading ioremap to any > other variant would break most drivers, badly. However, upgrading to > ioremap_np() is always correct (if possibly slower), on platforms where > it is allowed by the bus. In fact, I bet that on many systems nGnRE > already behaves like nGnRnE anyway. I don't know why Apple didn't just > allow nGnRE mappings to work (behaving like nGnRnE) instead of making > them explode, which is the whole reason we have to do this. Yep, and why not to make ioremap() == ioremap_nc() on architecture that requires it? Can it be detected at run time? ... > >> + while (node) { > >> + if (!of_property_read_bool(node, "ranges")) { > >> + break; > >> + } else if (of_property_read_bool(node, "nonposted-mmio")) { > >> + of_node_put(node); > >> + return true; > >> + } else if (of_property_read_bool(node, "posted-mmio")) { > >> + break; > >> + } > >> + parent = of_get_parent(node); > >> + of_node_put(node); > >> + node = parent; > >> + } > > > > I believe above can be slightly optimized. Don't we have helpers to > > traverse to all parents? > > Keep in mind the logic here is that it stops on the first instance of > either property, and does not traverse non-translatable boundaries. Are > there helpers that can implement this kind of complex logic? It's not a > simple recursive property lookup. I am aware of what it does and I believe if we don't have such a helper yet we may introduce it and maybe even existing users of something similar can utilize it.
On Fri, Mar 5, 2021 at 5:08 PM Andy Shevchenko <andy.shevchenko@gmail.com> wrote: > > On Fri, Mar 5, 2021 at 5:55 PM Hector Martin <marcan@marcan.st> wrote: > > On 06/03/2021 00.13, Andy Shevchenko wrote: > > ... > > > >> - return ioremap(res.start, resource_size(&res)); > > >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) > > >> + return ioremap_np(res.start, resource_size(&res)); > > >> + else > > >> + return ioremap(res.start, resource_size(&res)); > > > > > > This doesn't sound right. Why _np is so exceptional? Why don't we have > > > other flavours (it also rings a bell to my previous comment that the > > > flag in ioresource is not in the right place)? > > > > This is different from other variants, because until now *drivers* have > > made the choice of what ioremap mode to use based on device requirements > > (which means ioremap() 99% of the time, and then framebuffers and other > > memory-ish things such use something else). Now we have a *SoC fabric* > > that is calling the shots on what ioremap mode we have to use - and > > *every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they > > break. So it seems a lot cleaner to make the choice for drivers here to > > upgrade ioremap() to ioremap_np() for SoCs that need it. > > Yes, that is a good idea. Once we discussed x86 and _uc cases and > actually on x86 it makes a lot of sense to have ioremap() == > ioremap_uc(). Can't be this a similar case here? The difference is that ioremap() should be ioremap_uc() on /all/ architectures, it's just that x86 and ia64 for a long time were the exception and defined ioremap() as 'do whatever the mtrr says'. > Arnd, what do you think of actually providing an ioremap() as some > kind of "best for the architecture the code is running on"? Linus has been pretty clear about wanting all the default functions to have similar behavior across architectures and be at least as strict as x86. In case of ioremap() that usually means that writes are posted, because they are that way on PCI buses on x86. There are definitely some advantages of making all writes non-posted by default on Arm because that would be a simpler model, but there are some important downsides: - non-posted writes can be much slower than posted ones, depending on the specific hardware - it would change the behavior of all Arm platforms, with no easy way to validate it - setting ioremap() on PCI buses non-posted only makes them only slower but not more reliable, because the non-posted flag on the bus is discarded by the PCI host bridge. > Otherwise if the same driver happens to be needed on different > architectures, oops, ifdeffery or simple conditionals over the code is > really not the best way to solve it. The behavior of devm_platform_ioremap_resource() is now to do the right thing automatically, and I think that is good enough. For all I can tell, we can use that in all drivers without conditional compilation. > > If we don't do something like this here or in otherwise common code, > > we'd have to have an open-coded "if apple then ioremap_np, else ioremap" > > in every driver that runs on-die devices on these SoCs, even ones that > > are otherwise standard and need few or no Apple-specific quirks. > > Exactly! But what about architectures where _uc is that one? So, why > does your patch only take part of _np case? > (Hint we have x86 Device Tree based platforms) Usually, the driver knows the requirements, and they are independent of the platform. If a driver wants _wc or _wt mappings, it will want that on all machines, and should be able to deal with platforms that implement that through a stricter mapping. The two drivers that need to override ioremap() to ioremap_uc() in order to override the mtrr already have that logic. You are right that these are a bit like the _np() case in that the device needs something stricter than the default mapping, but the difference is that for x86 ioremap_uc() this is needed to work around broken firmware, while for the apple ioremap_np() case we trust the firmware to tell us what to do. > Yep, and why not to make ioremap() == ioremap_nc() on architecture > that requires it? > Can it be detected at run time? I think doing this requires auditing a lot of legacy driver code, especially drivers/video/fbdev. Once all drivers that need write-combining mappings explicitly ask for them, the default ioremap can be changed over to act like ioremap_nc() on the remaining two architectures that don't do that yet. There has been a lot of work toward that goal, but the problem is knowing exactly which drivers need it. Arnd
On 06/03/2021 01.43, Arnd Bergmann wrote: > - setting ioremap() on PCI buses non-posted only makes them > only slower but not more reliable, because the non-posted flag > on the bus is discarded by the PCI host bridge. Note that this doesn't work here *anyway*. The fabric is picky in both directions: thou shalt use nGnRnE for on-SoC MMIO and nGnRE for PCIe windows, or else, SError. Since these devices can support *any* PCI device via Thunderbolt, making PCI drivers be the oddball ones needing special APIs would mean hundreds of changes needed - the vast majority of PCI drivers in the kernel use plain ioremap variants that don't have any flags to look at.
On Thu, Mar 4, 2021 at 3:40 PM Hector Martin <marcan@marcan.st> wrote: > > This implements the 'nonposted-mmio' and 'posted-mmio' boolean > properties. Placing these properties in a bus marks all child devices as > requiring non-posted or posted MMIO mappings. If no such properties are > found, the default is posted MMIO. I'm still a little hesitant to add these properties and having some default. I worry about a similar situation as 'dma-coherent' where the assumed default on non-coherent on Arm doesn't work for PowerPC which defaults coherent. More below on this. > of_mmio_is_nonposted() performs the tree walking to determine if a given > device has requested non-posted MMIO. > > of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED > flag on resources that require non-posted MMIO. > > of_iomap() and of_io_request_and_map() then use this flag to pick the > correct ioremap() variant. > > This mechanism is currently restricted to Apple ARM platforms, as an > optimization. > > Signed-off-by: Hector Martin <marcan@marcan.st> > --- > drivers/of/address.c | 72 ++++++++++++++++++++++++++++++++++++-- > include/linux/of_address.h | 1 + > 2 files changed, 71 insertions(+), 2 deletions(-) > > diff --git a/drivers/of/address.c b/drivers/of/address.c > index 73ddf2540f3f..6114dceb1ba6 100644 > --- a/drivers/of/address.c > +++ b/drivers/of/address.c > @@ -847,6 +847,9 @@ static int __of_address_to_resource(struct device_node *dev, > return -EINVAL; > memset(r, 0, sizeof(struct resource)); > > + if (of_mmio_is_nonposted(dev)) > + flags |= IORESOURCE_MEM_NONPOSTED; > + > r->start = taddr; > r->end = taddr + size - 1; > r->flags = flags; > @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index) > if (of_address_to_resource(np, index, &res)) > return NULL; > > - return ioremap(res.start, resource_size(&res)); > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > + return ioremap_np(res.start, resource_size(&res)); > + else > + return ioremap(res.start, resource_size(&res)); This and the devm variants all scream for a ioremap_extended() function. IOW, it would be better if the ioremap flavor was a parameter. Unless we could implement that just for arm64 first, that's a lot of refactoring... > } > EXPORT_SYMBOL(of_iomap); > > @@ -928,7 +934,11 @@ void __iomem *of_io_request_and_map(struct device_node *np, int index, > if (!request_mem_region(res.start, resource_size(&res), name)) > return IOMEM_ERR_PTR(-EBUSY); > > - mem = ioremap(res.start, resource_size(&res)); > + if (res.flags & IORESOURCE_MEM_NONPOSTED) > + mem = ioremap_np(res.start, resource_size(&res)); > + else > + mem = ioremap(res.start, resource_size(&res)); > + > if (!mem) { > release_mem_region(res.start, resource_size(&res)); > return IOMEM_ERR_PTR(-ENOMEM); > @@ -1094,3 +1104,61 @@ bool of_dma_is_coherent(struct device_node *np) > return false; > } > EXPORT_SYMBOL_GPL(of_dma_is_coherent); > + > +static bool of_nonposted_mmio_quirk(void) > +{ > + if (IS_ENABLED(CONFIG_ARCH_APPLE)) { > + /* To save cycles, we cache the result for global "Apple ARM" setting */ > + static int quirk_state = -1; > + > + /* Make quirk cached */ > + if (quirk_state < 0) > + quirk_state = of_machine_is_compatible("apple,arm-platform"); > + return !!quirk_state; > + } > + return false; > +} > + > +/** > + * of_mmio_is_nonposted - Check if device uses non-posted MMIO > + * @np: device node > + * > + * Returns true if the "nonposted-mmio" property was found for > + * the device's bus or a parent. "posted-mmio" has the opposite > + * effect, terminating recursion and overriding any > + * "nonposted-mmio" properties in parent buses. > + * > + * Recursion terminates if reach a non-translatable boundary > + * (a node without a 'ranges' property). > + * > + * This is currently only enabled on Apple ARM devices, as an > + * optimization. > + */ > +bool of_mmio_is_nonposted(struct device_node *np) > +{ > + struct device_node *node; > + struct device_node *parent; > + > + if (!of_nonposted_mmio_quirk()) > + return false; > + > + node = of_get_parent(np); > + > + while (node) { > + if (!of_property_read_bool(node, "ranges")) { > + break; > + } else if (of_property_read_bool(node, "nonposted-mmio")) { > + of_node_put(node); > + return true; > + } else if (of_property_read_bool(node, "posted-mmio")) { > + break; > + } > + parent = of_get_parent(node); > + of_node_put(node); > + node = parent; > + } > + > + of_node_put(node); > + return false; What's the code path using these functions on the M1 where we need to return 'posted'? It's just downstream PCI mappings (PCI memory space), right? Those would never hit these paths because they don't have a DT node or if they do the memory space is not part of it. So can't the check just be: bool of_mmio_is_nonposted(struct device_node *np) { return np && of_machine_is_compatible("apple,arm-platform"); } Note in theory we could use 'assigned-addresses' with PCI, but that's pretty much never the case with FDT. If we did, we could detect the device node is a PCI device in that case. Rob
On 06/03/2021 02.39, Rob Herring wrote: > I'm still a little hesitant to add these properties and having some > default. I worry about a similar situation as 'dma-coherent' where the > assumed default on non-coherent on Arm doesn't work for PowerPC which > defaults coherent. More below on this. The intent of the default here is that it matches what ioremap() does on other platforms already (where it does not make any claims of being posted, though it could be on some platforms). It could be per-platform what that means... but either way it should be what drivers get today without asking for anything special. >> - return ioremap(res.start, resource_size(&res)); >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) >> + return ioremap_np(res.start, resource_size(&res)); >> + else >> + return ioremap(res.start, resource_size(&res)); > > This and the devm variants all scream for a ioremap_extended() > function. IOW, it would be better if the ioremap flavor was a > parameter. Unless we could implement that just for arm64 first, that's > a lot of refactoring... I agree, but yeah... that's one big refactor to try to do now... > What's the code path using these functions on the M1 where we need to > return 'posted'? It's just downstream PCI mappings (PCI memory space), > right? Those would never hit these paths because they don't have a DT > node or if they do the memory space is not part of it. So can't the > check just be: > > bool of_mmio_is_nonposted(struct device_node *np) > { > return np && of_machine_is_compatible("apple,arm-platform"); > } Yes; the implementation was trying to be generic, but AIUI we don't need this on M1 because the PCI mappings don't go through this codepath, and nothing else needs posted mode. My first hack was something not too unlike this, then I was going to get rid of apple,arm-platform and just have this be a generic mechanism with the properties, but then we added the optimization to not do the lookups on other platforms, and now we're coming full circle... :-) If you prefer to handle it this way for now I can do it like this. I think we should still have the DT bindings and properties though (even if not used), as they do describe the hardware properly, and in the future we might want to use them instead of having a quirk.
On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote: > > On 06/03/2021 02.39, Rob Herring wrote: > >> - return ioremap(res.start, resource_size(&res)); > >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) > >> + return ioremap_np(res.start, resource_size(&res)); > >> + else > >> + return ioremap(res.start, resource_size(&res)); > > > > This and the devm variants all scream for a ioremap_extended() > > function. IOW, it would be better if the ioremap flavor was a > > parameter. Unless we could implement that just for arm64 first, that's > > a lot of refactoring... > > I agree, but yeah... that's one big refactor to try to do now... FWIW, there is ioremap_prot() that Christoph introduced in 2019 for a few architectures. I suppose it would be nice to lift that out architecture specific code and completely replace the unusual variants, leaving only ioremap(), ioremap_prot() and memremap() but dropping the _nc, _cached, _wc, _wt and _np versions in favor of an extensible set of flags. Then again, I would not make that a prerequisite for the merge of the M1 support. > > What's the code path using these functions on the M1 where we need to > > return 'posted'? It's just downstream PCI mappings (PCI memory space), > > right? Those would never hit these paths because they don't have a DT > > node or if they do the memory space is not part of it. So can't the > > check just be: > > > > bool of_mmio_is_nonposted(struct device_node *np) > > { > > return np && of_machine_is_compatible("apple,arm-platform"); > > } > > Yes; the implementation was trying to be generic, but AIUI we don't need > this on M1 because the PCI mappings don't go through this codepath, and > nothing else needs posted mode. My first hack was something not too > unlike this, then I was going to get rid of apple,arm-platform and just > have this be a generic mechanism with the properties, but then we added > the optimization to not do the lookups on other platforms, and now we're > coming full circle... :-) I never liked the idea of having a list of platforms that need a special hack, please let's not go back to that. Arnd
On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote: > > On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote: > > > > On 06/03/2021 02.39, Rob Herring wrote: > > >> - return ioremap(res.start, resource_size(&res)); > > >> + if (res.flags & IORESOURCE_MEM_NONPOSTED) > > >> + return ioremap_np(res.start, resource_size(&res)); > > >> + else > > >> + return ioremap(res.start, resource_size(&res)); > > > > > > This and the devm variants all scream for a ioremap_extended() > > > function. IOW, it would be better if the ioremap flavor was a > > > parameter. Unless we could implement that just for arm64 first, that's > > > a lot of refactoring... > > > > I agree, but yeah... that's one big refactor to try to do now... > > FWIW, there is ioremap_prot() that Christoph introduced in 2019 > for a few architectures. I suppose it would be nice to lift > that out architecture specific code and completely replace the > unusual variants, leaving only ioremap(), ioremap_prot() and > memremap() but dropping the _nc, _cached, _wc, _wt and _np > versions in favor of an extensible set of flags. > > Then again, I would not make that a prerequisite for the merge > of the M1 support. > > > > What's the code path using these functions on the M1 where we need to > > > return 'posted'? It's just downstream PCI mappings (PCI memory space), > > > right? Those would never hit these paths because they don't have a DT > > > node or if they do the memory space is not part of it. So can't the > > > check just be: > > > > > > bool of_mmio_is_nonposted(struct device_node *np) > > > { > > > return np && of_machine_is_compatible("apple,arm-platform"); > > > } > > > > Yes; the implementation was trying to be generic, but AIUI we don't need > > this on M1 because the PCI mappings don't go through this codepath, and > > nothing else needs posted mode. My first hack was something not too > > unlike this, then I was going to get rid of apple,arm-platform and just > > have this be a generic mechanism with the properties, but then we added > > the optimization to not do the lookups on other platforms, and now we're > > coming full circle... :-) > > I never liked the idea of having a list of platforms that need a > special hack, please let's not go back to that. I'm a fan of generic solutions as much as anyone, but not when there's a single user. Yes, there could be more, but we haven't seen any yet and Apple seems to have a knack for doing special things. I'm pretty sure posted vs. non-posted has been a possibility with AXI buses from the start, so it's not like this is a new thing we're going to see frequently on new platforms. A generic property we have to support forever because there's zero visibility if someone uses them. At least with something platform specific, we know if it's in use or can be removed. That's something I just checked recently with some of the PPC irq work-arounds (spoiler: yes, those 'old world Mac' are). I'm a bit less worried about this aspect given we can probably assume someone will still be using M1 Macs in 20+ years. The other situation I worry about here is another arch has implicitly defaulted to non-posted instead of posted. It could just be non-posted was what worked everywhere and Linux couldn't distinguish. Now someone sees we have this new posted vs. non-posted handling and can optimize some mappings on their platform and we have to have per arch defaults (like 'dma-coherent' now). Rob
On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: > On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote: > > On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote: > > > > > What's the code path using these functions on the M1 where we need to > > > > return 'posted'? It's just downstream PCI mappings (PCI memory space), > > > > right? Those would never hit these paths because they don't have a DT > > > > node or if they do the memory space is not part of it. So can't the > > > > check just be: > > > > > > > > bool of_mmio_is_nonposted(struct device_node *np) > > > > { > > > > return np && of_machine_is_compatible("apple,arm-platform"); > > > > } > > > > > > Yes; the implementation was trying to be generic, but AIUI we don't need > > > this on M1 because the PCI mappings don't go through this codepath, and > > > nothing else needs posted mode. My first hack was something not too > > > unlike this, then I was going to get rid of apple,arm-platform and just > > > have this be a generic mechanism with the properties, but then we added > > > the optimization to not do the lookups on other platforms, and now we're > > > coming full circle... :-) > > > > I never liked the idea of having a list of platforms that need a > > special hack, please let's not go back to that. > > I'm a fan of generic solutions as much as anyone, but not when there's > a single user. Yes, there could be more, but we haven't seen any yet > and Apple seems to have a knack for doing special things. I'm pretty > sure posted vs. non-posted has been a possibility with AXI buses from > the start, so it's not like this is a new thing we're going to see > frequently on new platforms. Ok, but if we make it a platform specific bit, I would prefer not to do the IORESOURCE_MEM_NONPOSTED flag either but instead keep the logic in the device drivers that call ioremap(). This is obviously more work for the drivers, but at least it keeps the common code free of the hack while also allowing drivers to use ioremap_np() intentionally on other platforms. > The other situation I worry about here is another arch has implicitly > defaulted to non-posted instead of posted. It could just be non-posted > was what worked everywhere and Linux couldn't distinguish. Now someone > sees we have this new posted vs. non-posted handling and can optimize > some mappings on their platform and we have to have per arch defaults > (like 'dma-coherent' now). I think one of the dark secrets of MMIO is that a lot of drivers get the posted behavior wrong by assuming that a writel() before a spin_unlock() is protected by that unlock. This may in fact work on many architectures but is broken on PCI and on local devices for ARM. Having a properly working (on non-PCI) ioremap_np() interface would be nice here, as it could be used to document when drivers rely on non-posted behavior, and cause the ioremap to fail when running on architectures that don't support nonposted maps. Arnd
On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: > > On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote: > > > On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote: > > > > > > > What's the code path using these functions on the M1 where we need to > > > > > return 'posted'? It's just downstream PCI mappings (PCI memory space), > > > > > right? Those would never hit these paths because they don't have a DT > > > > > node or if they do the memory space is not part of it. So can't the > > > > > check just be: > > > > > > > > > > bool of_mmio_is_nonposted(struct device_node *np) > > > > > { > > > > > return np && of_machine_is_compatible("apple,arm-platform"); > > > > > } > > > > > > > > Yes; the implementation was trying to be generic, but AIUI we don't need > > > > this on M1 because the PCI mappings don't go through this codepath, and > > > > nothing else needs posted mode. My first hack was something not too > > > > unlike this, then I was going to get rid of apple,arm-platform and just > > > > have this be a generic mechanism with the properties, but then we added > > > > the optimization to not do the lookups on other platforms, and now we're > > > > coming full circle... :-) > > > > > > I never liked the idea of having a list of platforms that need a > > > special hack, please let's not go back to that. > > > > I'm a fan of generic solutions as much as anyone, but not when there's > > a single user. Yes, there could be more, but we haven't seen any yet > > and Apple seems to have a knack for doing special things. I'm pretty > > sure posted vs. non-posted has been a possibility with AXI buses from > > the start, so it's not like this is a new thing we're going to see > > frequently on new platforms. > > Ok, but if we make it a platform specific bit, I would prefer not > to do the IORESOURCE_MEM_NONPOSTED flag either but > instead keep the logic in the device drivers that call ioremap(). That seems like an orthogonal decision to me. > This is obviously more work for the drivers, but at least it keeps > the common code free of the hack while also allowing drivers to > use ioremap_np() intentionally on other platforms. I don't agree. The problem is within the interconnect. The device and its driver are unaware of this. The other idea I had was doing a compatible other than 'simple-bus' for the bus node which could imply non-posted io and any other quirks in Apple's bus implementation. However, something different there means updates in lots of places (schemas, dtc checks, etc.) unless we kept 'simple-bus' as a fallback. Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd rather know if and when we need 'posted-mmio'. It does need to be added to the DT spec[1] and schema[2] though (GH PRs are fine for both). > > The other situation I worry about here is another arch has implicitly > > defaulted to non-posted instead of posted. It could just be non-posted > > was what worked everywhere and Linux couldn't distinguish. Now someone > > sees we have this new posted vs. non-posted handling and can optimize > > some mappings on their platform and we have to have per arch defaults > > (like 'dma-coherent' now). > > I think one of the dark secrets of MMIO is that a lot of drivers > get the posted behavior wrong by assuming that a writel() before > a spin_unlock() is protected by that unlock. This may in fact work > on many architectures but is broken on PCI and on local devices > for ARM. > > Having a properly working (on non-PCI) ioremap_np() interface > would be nice here, as it could be used to document when drivers > rely on non-posted behavior, and cause the ioremap to fail when > running on architectures that don't support nonposted maps. Good to know. Rob [1] https://github.com/devicetree-org/devicetree-specification [2] https://github.com/devicetree-org/dt-schema
On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote: > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > > On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: > > Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd > rather know if and when we need 'posted-mmio'. It does need to be added > to the DT spec[1] and schema[2] though (GH PRs are fine for both). I think the reason for having "posted-mmio" is that you cannot properly define the PCI host controller nodes on the M1 without that: Since nonposted-mmio applies to all child nodes, this would mean the PCI memory space gets declared as nonposted by the DT, but the hardware requires it to be mapped as posted. Arnd
On Mon, Mar 8, 2021 at 10:13 PM Rob Herring <robh@kernel.org> wrote: > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > > This is obviously more work for the drivers, but at least it keeps > > the common code free of the hack while also allowing drivers to > > use ioremap_np() intentionally on other platforms. > > I don't agree. The problem is within the interconnect. The device and > its driver are unaware of this. If it is possible that a driver needs to use posted access on one SoC and nonposted on another SoC then clearly the nature of the access need to be part of the memory access abstraction, obviously ioremap() one way or another. Having the driver conditionally use different ioremap_* functions depending on SoC seems awkward. We had different execution paths for OF and ACPI drivers and have been working hard to create fwnode to abstract this away for drivers used with both abstractions for example. If we can hide it from drivers from day 1 I think we can save maintenance costs in the long run. Given that the Apple silicon through it's heritage from Samsung S3C (the genealogy is unclear to me) already share drivers with this platform, this seems to already be the case so it's not a theoretical use case. The core argument here seems to be "will this become common practice or is it an Apple-ism?" That is a question to someone who is deep down there synthesizing SoCs. It appears the market for custom silicon laptops has just begun. There are people that can answer this question but I doubt that we have access to them or that they would tell us. What is an educated guess? It seems Arnds position is that it's an Apple-ism and I kind of trust him on this. At the same time I know that in emerging markets, what copycats are likely to do is say "give me exactly what Apple has, exactly that thing". Just my €0.01 Linus Walleij
On Tue, Mar 9, 2021 at 12:14 PM Linus Walleij <linus.walleij@linaro.org> wrote: > > On Mon, Mar 8, 2021 at 10:13 PM Rob Herring <robh@kernel.org> wrote: > > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > > > > This is obviously more work for the drivers, but at least it keeps > > > the common code free of the hack while also allowing drivers to > > > use ioremap_np() intentionally on other platforms. > > > > I don't agree. The problem is within the interconnect. The device and > > its driver are unaware of this. > > If it is possible that a driver needs to use posted access on one > SoC and nonposted on another SoC then clearly the nature > of the access need to be part of the memory access abstraction, > obviously ioremap() one way or another. There are two possible scenarios: - drivers that we already know are shared between apple and other vendors (s3c-serial, pasemi i2c) would need to use nonposted mmio on Apple but can use either one on other platforms. On non-ARM CPUs, the ioremap_np() function might fail when the hardware only supports posted writes. - A driver writer may want to choose between posted and nonposted mmio based on performance considerations: if writes are never serialized, posted writes should always be faster. However, if the driver uses a spinlock to serialize writes, then a nonposted write is likely faster than a posted write followed by a read that serializes the spin_unlock. In this case we want the driver to explicitly pick one over the other, and not have rely on bus specific magic. > Having the driver conditionally use different ioremap_* > functions depending on SoC seems awkward. We had different > execution paths for OF and ACPI drivers and have been working > hard to create fwnode to abstract this away for drivers used with > both abstractions for example. If we can hide it from drivers > from day 1 I think we can save maintenance costs in the long > run. > > Given that the Apple silicon through it's heritage from Samsung > S3C (the genealogy is unclear to me) already share drivers with > this platform, this seems to already be the case so it's not a > theoretical use case. As far as I can tell, there are only a handful of soc specific drivers that are actually shared with other platforms. Aside from serial and i2c, these are the ones that I can see being shared: - there is an on-chip nvme host controller that is not PCI. So far, nobody else does this, but it can clearly happen in the future - I think one of the USB controllers is a standard designware part, while the others are PCI devices. - The PCI host bridge may be close enough to the standard that we can use the generic driver for config space access. Arnd
On Tue, Mar 9, 2021 at 1:41 PM Arnd Bergmann <arnd@kernel.org> wrote: > - A driver writer may want to choose between posted and > nonposted mmio based on performance considerations: > if writes are never serialized, posted writes should always > be faster. However, if the driver uses a spinlock to serialize > writes, then a nonposted write is likely faster than a posted > write followed by a read that serializes the spin_unlock. > In this case we want the driver to explicitly pick one over > the other, and not have rely on bus specific magic. OK then I am all for having drivers explicitly choose access method. Openness to speed optimization is a well established Linux kernel design principle. Yours, Linus Walleij
On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote: > > On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote: > > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > > > On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: > > > > Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd > > rather know if and when we need 'posted-mmio'. It does need to be added > > to the DT spec[1] and schema[2] though (GH PRs are fine for both). > > I think the reason for having "posted-mmio" is that you cannot properly > define the PCI host controller nodes on the M1 without that: Since > nonposted-mmio applies to all child nodes, this would mean the PCI > memory space gets declared as nonposted by the DT, but the hardware > requires it to be mapped as posted. I don't think so. PCI devices wouldn't use any of the code paths in this patch. They would map their memory space with plain ioremap() which is posted. Rob
On 10/03/2021 00.48, Rob Herring wrote: > On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote: >> >> On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote: >>> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: >>>> On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: >>> >>> Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd >>> rather know if and when we need 'posted-mmio'. It does need to be added >>> to the DT spec[1] and schema[2] though (GH PRs are fine for both). >> >> I think the reason for having "posted-mmio" is that you cannot properly >> define the PCI host controller nodes on the M1 without that: Since >> nonposted-mmio applies to all child nodes, this would mean the PCI >> memory space gets declared as nonposted by the DT, but the hardware >> requires it to be mapped as posted. > > I don't think so. PCI devices wouldn't use any of the code paths in > this patch. They would map their memory space with plain ioremap() > which is posted. My main concern here is that this creates an inconsistency in the device tree representation that only works because PCI drivers happen not to use these code paths. Logically, having "nonposted-mmio" above the PCI controller would imply that it applies to that bus too. Sure, it doesn't matter for Linux since it is ignored, but this creates an implicit exception that PCI buses always use posted modes. Then if a device comes along that due to some twisted fabric logic needs nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will end up posted at the bus anyway)... how do we represent that? Declare that another "nonposted-mmio" on the PCIe bus means "no, really, use nonposted mmio for this"?
On Tue, Mar 9, 2021 at 1:24 PM Hector Martin <marcan@marcan.st> wrote: > > On 10/03/2021 00.48, Rob Herring wrote: > > On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote: > >> > >> On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote: > >>> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote: > >>>> On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote: > >>> > >>> Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd > >>> rather know if and when we need 'posted-mmio'. It does need to be added > >>> to the DT spec[1] and schema[2] though (GH PRs are fine for both). > >> > >> I think the reason for having "posted-mmio" is that you cannot properly > >> define the PCI host controller nodes on the M1 without that: Since > >> nonposted-mmio applies to all child nodes, this would mean the PCI > >> memory space gets declared as nonposted by the DT, but the hardware > >> requires it to be mapped as posted. > > > > I don't think so. PCI devices wouldn't use any of the code paths in > > this patch. They would map their memory space with plain ioremap() > > which is posted. > > My main concern here is that this creates an inconsistency in the device > tree representation that only works because PCI drivers happen not to > use these code paths. Logically, having "nonposted-mmio" above the PCI > controller would imply that it applies to that bus too. Sure, it doesn't > matter for Linux since it is ignored, but this creates an implicit > exception that PCI buses always use posted modes. We could be stricter that "nonposted-mmio" must be in the immediate parent. That's kind of in line with how addressing already works. Every level has to have 'ranges' to be an MMIO address, and the address cell size is set by the immediate parent. > Then if a device comes along that due to some twisted fabric logic needs > nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will > end up posted at the bus anyway)... how do we represent that? Declare > that another "nonposted-mmio" on the PCIe bus means "no, really, use > nonposted mmio for this"? If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". Rob
On 10/03/2021 07.06, Rob Herring wrote: >> My main concern here is that this creates an inconsistency in the device >> tree representation that only works because PCI drivers happen not to >> use these code paths. Logically, having "nonposted-mmio" above the PCI >> controller would imply that it applies to that bus too. Sure, it doesn't >> matter for Linux since it is ignored, but this creates an implicit >> exception that PCI buses always use posted modes. > > We could be stricter that "nonposted-mmio" must be in the immediate > parent. That's kind of in line with how addressing already works. > Every level has to have 'ranges' to be an MMIO address, and the > address cell size is set by the immediate parent. > >> Then if a device comes along that due to some twisted fabric logic needs >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will >> end up posted at the bus anyway)... how do we represent that? Declare >> that another "nonposted-mmio" on the PCIe bus means "no, really, use >> nonposted mmio for this"? > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". Works for me; then let's just make it non-recursive. Do you think we can get rid of the Apple-only optimization if we do this? It would mean only looking at the parent during address resolution, not recursing all the way to the top, so presumably the performance impact would be quite minimal.
On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote: > > On 10/03/2021 07.06, Rob Herring wrote: > >> My main concern here is that this creates an inconsistency in the device > >> tree representation that only works because PCI drivers happen not to > >> use these code paths. Logically, having "nonposted-mmio" above the PCI > >> controller would imply that it applies to that bus too. Sure, it doesn't > >> matter for Linux since it is ignored, but this creates an implicit > >> exception that PCI buses always use posted modes. > > > > We could be stricter that "nonposted-mmio" must be in the immediate > > parent. That's kind of in line with how addressing already works. > > Every level has to have 'ranges' to be an MMIO address, and the > > address cell size is set by the immediate parent. > > > >> Then if a device comes along that due to some twisted fabric logic needs > >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will > >> end up posted at the bus anyway)... how do we represent that? Declare > >> that another "nonposted-mmio" on the PCIe bus means "no, really, use > >> nonposted mmio for this"? > > > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". > > Works for me; then let's just make it non-recursive. > > Do you think we can get rid of the Apple-only optimization if we do > this? It would mean only looking at the parent during address > resolution, not recursing all the way to the top, so presumably the > performance impact would be quite minimal. Yeah, that should be fine. I'd keep an IS_ENABLED() config check though. Then I'll also know if anyone else needs this. Rob
On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > > On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote: > > > > On 10/03/2021 07.06, Rob Herring wrote: > > >> My main concern here is that this creates an inconsistency in the device > > >> tree representation that only works because PCI drivers happen not to > > >> use these code paths. Logically, having "nonposted-mmio" above the PCI > > >> controller would imply that it applies to that bus too. Sure, it doesn't > > >> matter for Linux since it is ignored, but this creates an implicit > > >> exception that PCI buses always use posted modes. > > > > > > We could be stricter that "nonposted-mmio" must be in the immediate > > > parent. That's kind of in line with how addressing already works. > > > Every level has to have 'ranges' to be an MMIO address, and the > > > address cell size is set by the immediate parent. > > > > > >> Then if a device comes along that due to some twisted fabric logic needs > > >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will > > >> end up posted at the bus anyway)... how do we represent that? Declare > > >> that another "nonposted-mmio" on the PCIe bus means "no, really, use > > >> nonposted mmio for this"? > > > > > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". > > > > Works for me; then let's just make it non-recursive. > > > > Do you think we can get rid of the Apple-only optimization if we do > > this? It would mean only looking at the parent during address > > resolution, not recursing all the way to the top, so presumably the > > performance impact would be quite minimal. Works for me. > Yeah, that should be fine. I'd keep an IS_ENABLED() config check > though. Then I'll also know if anyone else needs this. Ok, makes sense. Conceptually, I'd like to then see a check that verifies that the property is only set for nodes whose parent also has it set, since that is how AXI defines it: A bus can wait for the ack from its child node, or it can acknowledge the write to its parent early. However, this breaks down as soon as a bus does the early ack: all its children by definition use posted writes (as seen by the CPU), even if they wait for stores that come from other masters. Does this make sense to you? Arnd
On 11/03/2021 18.12, Arnd Bergmann wrote: > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: >> >> On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote: >>> >>> On 10/03/2021 07.06, Rob Herring wrote: >>>>> My main concern here is that this creates an inconsistency in the device >>>>> tree representation that only works because PCI drivers happen not to >>>>> use these code paths. Logically, having "nonposted-mmio" above the PCI >>>>> controller would imply that it applies to that bus too. Sure, it doesn't >>>>> matter for Linux since it is ignored, but this creates an implicit >>>>> exception that PCI buses always use posted modes. >>>> >>>> We could be stricter that "nonposted-mmio" must be in the immediate >>>> parent. That's kind of in line with how addressing already works. >>>> Every level has to have 'ranges' to be an MMIO address, and the >>>> address cell size is set by the immediate parent. >>>> >>>>> Then if a device comes along that due to some twisted fabric logic needs >>>>> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will >>>>> end up posted at the bus anyway)... how do we represent that? Declare >>>>> that another "nonposted-mmio" on the PCIe bus means "no, really, use >>>>> nonposted mmio for this"? >>>> >>>> If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". >>> >>> Works for me; then let's just make it non-recursive. >>> >>> Do you think we can get rid of the Apple-only optimization if we do >>> this? It would mean only looking at the parent during address >>> resolution, not recursing all the way to the top, so presumably the >>> performance impact would be quite minimal. > > Works for me. Incidentally, even though it would now be unused, I'd like to keep the apple,arm-platform compatible at this point; we've already been pretty close to a use case for it, and I don't want to have to fall back to a list of SoC compatibles if we ever need another quirk for all Apple ARM SoCs (or break backwards compat). It doesn't really hurt to have it in the binding and devicetrees, right? >> Yeah, that should be fine. I'd keep an IS_ENABLED() config check >> though. Then I'll also know if anyone else needs this. > > Ok, makes sense. > > Conceptually, I'd like to then see a check that verifies that the > property is only set for nodes whose parent also has it set, since > that is how AXI defines it: A bus can wait for the ack from its > child node, or it can acknowledge the write to its parent early. > However, this breaks down as soon as a bus does the early ack: > all its children by definition use posted writes (as seen by the > CPU), even if they wait for stores that come from other masters. > > Does this make sense to you? Makes sense. This shouldn't really be something the kernel concerns itself with at runtime, just something for the dts linting, right? I assume this isn't representable in json-schema, so it would presumably need some ad-hoc validation code.
On Thu, Mar 11, 2021 at 1:11 PM Hector Martin <marcan@marcan.st> wrote: > On 11/03/2021 18.12, Arnd Bergmann wrote: > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > >> On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote: > >>> Works for me; then let's just make it non-recursive. > >>> > >>> Do you think we can get rid of the Apple-only optimization if we do > >>> this? It would mean only looking at the parent during address > >>> resolution, not recursing all the way to the top, so presumably the > >>> performance impact would be quite minimal. > > > > Works for me. > > Incidentally, even though it would now be unused, I'd like to keep the > apple,arm-platform compatible at this point; we've already been pretty > close to a use case for it, and I don't want to have to fall back to a > list of SoC compatibles if we ever need another quirk for all Apple ARM > SoCs (or break backwards compat). It doesn't really hurt to have it in > the binding and devicetrees, right? Yes, keeping the compatible string is a good idea regardless. > >> Yeah, that should be fine. I'd keep an IS_ENABLED() config check > >> though. Then I'll also know if anyone else needs this. > > > > Ok, makes sense. > > > > Conceptually, I'd like to then see a check that verifies that the > > property is only set for nodes whose parent also has it set, since > > that is how AXI defines it: A bus can wait for the ack from its > > child node, or it can acknowledge the write to its parent early. > > However, this breaks down as soon as a bus does the early ack: > > all its children by definition use posted writes (as seen by the > > CPU), even if they wait for stores that come from other masters. > > > > Does this make sense to you? > > Makes sense. This shouldn't really be something the kernel concerns > itself with at runtime, just something for the dts linting, right? > > I assume this isn't representable in json-schema, so it would presumably > need some ad-hoc validation code. Agreed, having a check in either dtc or expressed in the json scheme is better than a runtime check. I assume Rob would know how to best add such a check. Arnd
On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote: > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > > > > On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote: > > > > > > On 10/03/2021 07.06, Rob Herring wrote: > > > >> My main concern here is that this creates an inconsistency in the device > > > >> tree representation that only works because PCI drivers happen not to > > > >> use these code paths. Logically, having "nonposted-mmio" above the PCI > > > >> controller would imply that it applies to that bus too. Sure, it doesn't > > > >> matter for Linux since it is ignored, but this creates an implicit > > > >> exception that PCI buses always use posted modes. > > > > > > > > We could be stricter that "nonposted-mmio" must be in the immediate > > > > parent. That's kind of in line with how addressing already works. > > > > Every level has to have 'ranges' to be an MMIO address, and the > > > > address cell size is set by the immediate parent. > > > > > > > >> Then if a device comes along that due to some twisted fabric logic needs > > > >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will > > > >> end up posted at the bus anyway)... how do we represent that? Declare > > > >> that another "nonposted-mmio" on the PCIe bus means "no, really, use > > > >> nonposted mmio for this"? > > > > > > > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio". > > > > > > Works for me; then let's just make it non-recursive. > > > > > > Do you think we can get rid of the Apple-only optimization if we do > > > this? It would mean only looking at the parent during address > > > resolution, not recursing all the way to the top, so presumably the > > > performance impact would be quite minimal. > > Works for me. > > > Yeah, that should be fine. I'd keep an IS_ENABLED() config check > > though. Then I'll also know if anyone else needs this. > > Ok, makes sense. > > Conceptually, I'd like to then see a check that verifies that the > property is only set for nodes whose parent also has it set, since > that is how AXI defines it: A bus can wait for the ack from its > child node, or it can acknowledge the write to its parent early. > However, this breaks down as soon as a bus does the early ack: > all its children by definition use posted writes (as seen by the > CPU), even if they wait for stores that come from other masters. > > Does this make sense to you? BTW, I don't think it's clear in this thread, but the current definition proposed for the spec[1] and schema is 'nonposted-mmio' is specific to 'simple-bus'. I like this restriction and we can expand where 'nonposted-mmio' is allowed later if needed. It's possible to express in json-schema, but I think it wouldn't be pretty. Json-schema is not great for expressing inter-property constraints and it gets worse if we're talking inter-node constraints. I'd like to define a way to do python snippets of code for something like this, but that's way down on the wish list. Rob [1] https://github.com/devicetree-org/devicetree-specification/pull/40
On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote: > On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote: > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > > Ok, makes sense. > > > > Conceptually, I'd like to then see a check that verifies that the > > property is only set for nodes whose parent also has it set, since > > that is how AXI defines it: A bus can wait for the ack from its > > child node, or it can acknowledge the write to its parent early. > > However, this breaks down as soon as a bus does the early ack: > > all its children by definition use posted writes (as seen by the > > CPU), even if they wait for stores that come from other masters. > > > > Does this make sense to you? > > BTW, I don't think it's clear in this thread, but the current > definition proposed for the spec[1] and schema is 'nonposted-mmio' is > specific to 'simple-bus'. I like this restriction and we can expand > where 'nonposted-mmio' is allowed later if needed. That sounds ok, as long as we can express everything for the mac at the moment. Do we need to explicitly add a description to allow the property in the root node in addition to simple-bus to be able to enforce the rule about parent buses also having it? Arnd
On Thu, Mar 11, 2021 at 9:48 AM Arnd Bergmann <arnd@kernel.org> wrote: > > On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote: > > On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote: > > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > > > Ok, makes sense. > > > > > > Conceptually, I'd like to then see a check that verifies that the > > > property is only set for nodes whose parent also has it set, since > > > that is how AXI defines it: A bus can wait for the ack from its > > > child node, or it can acknowledge the write to its parent early. > > > However, this breaks down as soon as a bus does the early ack: > > > all its children by definition use posted writes (as seen by the > > > CPU), even if they wait for stores that come from other masters. > > > > > > Does this make sense to you? > > > > BTW, I don't think it's clear in this thread, but the current > > definition proposed for the spec[1] and schema is 'nonposted-mmio' is > > specific to 'simple-bus'. I like this restriction and we can expand > > where 'nonposted-mmio' is allowed later if needed. > > That sounds ok, as long as we can express everything for the mac > at the moment. Do we need to explicitly add a description to allow > the property in the root node in addition to simple-bus to be able > to enforce the rule about parent buses also having it? IMO it should not be allowed in the root node. That's a failure to define a bus node. Also, would that mean your memory has to be non-posted!? Rob
On Thu, Mar 11, 2021 at 7:10 PM Rob Herring <robh@kernel.org> wrote: > > On Thu, Mar 11, 2021 at 9:48 AM Arnd Bergmann <arnd@kernel.org> wrote: > > > > On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote: > > > On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote: > > > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote: > > > > Ok, makes sense. > > > > > > > > Conceptually, I'd like to then see a check that verifies that the > > > > property is only set for nodes whose parent also has it set, since > > > > that is how AXI defines it: A bus can wait for the ack from its > > > > child node, or it can acknowledge the write to its parent early. > > > > However, this breaks down as soon as a bus does the early ack: > > > > all its children by definition use posted writes (as seen by the > > > > CPU), even if they wait for stores that come from other masters. > > > > > > > > Does this make sense to you? > > > > > > BTW, I don't think it's clear in this thread, but the current > > > definition proposed for the spec[1] and schema is 'nonposted-mmio' is > > > specific to 'simple-bus'. I like this restriction and we can expand > > > where 'nonposted-mmio' is allowed later if needed. > > > > That sounds ok, as long as we can express everything for the mac > > at the moment. Do we need to explicitly add a description to allow > > the property in the root node in addition to simple-bus to be able > > to enforce the rule about parent buses also having it? > > IMO it should not be allowed in the root node. That's a failure to > define a bus node. My interpretation would be that the root node defines the first bus connected to the CPU(s) themselves, which may already have posted writes. If writes on that bus are posted, then no child could be non-posted. I suppose it depends a bit on the mental model of what the nodes refer to. If you say that there cannot be a device with registers directly on the root node, but every bus defines its own space, then we don't need this, but I think a lot of machines would break if you try to enforce the rule that there cannot be devices on the root node. > Also, would that mean your memory has to be non-posted!? Good question. You could argue that this should not be because you don't want to use ioremap_np() flags but instead want this to be normal cacheable memory instead of device memory. On the other hand, you definitely don't want memory stores to be posted, as that would break coherency between the CPUs when a wmb() no longer has an effect. Arnd
diff --git a/drivers/of/address.c b/drivers/of/address.c index 73ddf2540f3f..6114dceb1ba6 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -847,6 +847,9 @@ static int __of_address_to_resource(struct device_node *dev, return -EINVAL; memset(r, 0, sizeof(struct resource)); + if (of_mmio_is_nonposted(dev)) + flags |= IORESOURCE_MEM_NONPOSTED; + r->start = taddr; r->end = taddr + size - 1; r->flags = flags; @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index) if (of_address_to_resource(np, index, &res)) return NULL; - return ioremap(res.start, resource_size(&res)); + if (res.flags & IORESOURCE_MEM_NONPOSTED) + return ioremap_np(res.start, resource_size(&res)); + else + return ioremap(res.start, resource_size(&res)); } EXPORT_SYMBOL(of_iomap); @@ -928,7 +934,11 @@ void __iomem *of_io_request_and_map(struct device_node *np, int index, if (!request_mem_region(res.start, resource_size(&res), name)) return IOMEM_ERR_PTR(-EBUSY); - mem = ioremap(res.start, resource_size(&res)); + if (res.flags & IORESOURCE_MEM_NONPOSTED) + mem = ioremap_np(res.start, resource_size(&res)); + else + mem = ioremap(res.start, resource_size(&res)); + if (!mem) { release_mem_region(res.start, resource_size(&res)); return IOMEM_ERR_PTR(-ENOMEM); @@ -1094,3 +1104,61 @@ bool of_dma_is_coherent(struct device_node *np) return false; } EXPORT_SYMBOL_GPL(of_dma_is_coherent); + +static bool of_nonposted_mmio_quirk(void) +{ + if (IS_ENABLED(CONFIG_ARCH_APPLE)) { + /* To save cycles, we cache the result for global "Apple ARM" setting */ + static int quirk_state = -1; + + /* Make quirk cached */ + if (quirk_state < 0) + quirk_state = of_machine_is_compatible("apple,arm-platform"); + return !!quirk_state; + } + return false; +} + +/** + * of_mmio_is_nonposted - Check if device uses non-posted MMIO + * @np: device node + * + * Returns true if the "nonposted-mmio" property was found for + * the device's bus or a parent. "posted-mmio" has the opposite + * effect, terminating recursion and overriding any + * "nonposted-mmio" properties in parent buses. + * + * Recursion terminates if reach a non-translatable boundary + * (a node without a 'ranges' property). + * + * This is currently only enabled on Apple ARM devices, as an + * optimization. + */ +bool of_mmio_is_nonposted(struct device_node *np) +{ + struct device_node *node; + struct device_node *parent; + + if (!of_nonposted_mmio_quirk()) + return false; + + node = of_get_parent(np); + + while (node) { + if (!of_property_read_bool(node, "ranges")) { + break; + } else if (of_property_read_bool(node, "nonposted-mmio")) { + of_node_put(node); + return true; + } else if (of_property_read_bool(node, "posted-mmio")) { + break; + } + parent = of_get_parent(node); + of_node_put(node); + node = parent; + } + + of_node_put(node); + return false; +} +EXPORT_SYMBOL_GPL(of_mmio_is_nonposted); diff --git a/include/linux/of_address.h b/include/linux/of_address.h index 88bc943405cd..88f6333fee6c 100644 --- a/include/linux/of_address.h +++ b/include/linux/of_address.h @@ -62,6 +62,7 @@ extern struct of_pci_range *of_pci_range_parser_one( struct of_pci_range_parser *parser, struct of_pci_range *range); extern bool of_dma_is_coherent(struct device_node *np); +extern bool of_mmio_is_nonposted(struct device_node *np); #else /* CONFIG_OF_ADDRESS */ static inline void __iomem *of_io_request_and_map(struct device_node *device, int index, const char *name)
This implements the 'nonposted-mmio' and 'posted-mmio' boolean properties. Placing these properties in a bus marks all child devices as requiring non-posted or posted MMIO mappings. If no such properties are found, the default is posted MMIO. of_mmio_is_nonposted() performs the tree walking to determine if a given device has requested non-posted MMIO. of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED flag on resources that require non-posted MMIO. of_iomap() and of_io_request_and_map() then use this flag to pick the correct ioremap() variant. This mechanism is currently restricted to Apple ARM platforms, as an optimization. Signed-off-by: Hector Martin <marcan@marcan.st> --- drivers/of/address.c | 72 ++++++++++++++++++++++++++++++++++++-- include/linux/of_address.h | 1 + 2 files changed, 71 insertions(+), 2 deletions(-)