diff mbox series

[RFT,v3,12/27] of/address: Add infrastructure to declare MMIO as non-posted

Message ID 20210304213902.83903-13-marcan@marcan.st (mailing list archive)
State Not Applicable
Headers show
Series Apple M1 SoC platform bring-up | expand

Commit Message

Hector Martin March 4, 2021, 9:38 p.m. UTC
This implements the 'nonposted-mmio' and 'posted-mmio' boolean
properties. Placing these properties in a bus marks all child devices as
requiring non-posted or posted MMIO mappings. If no such properties are
found, the default is posted MMIO.

of_mmio_is_nonposted() performs the tree walking to determine if a given
device has requested non-posted MMIO.

of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED
flag on resources that require non-posted MMIO.

of_iomap() and of_io_request_and_map() then use this flag to pick the
correct ioremap() variant.

This mechanism is currently restricted to Apple ARM platforms, as an
optimization.

Signed-off-by: Hector Martin <marcan@marcan.st>
---
 drivers/of/address.c       | 72 ++++++++++++++++++++++++++++++++++++--
 include/linux/of_address.h |  1 +
 2 files changed, 71 insertions(+), 2 deletions(-)

Comments

Linus Walleij March 5, 2021, 10:28 a.m. UTC | #1
On Thu, Mar 4, 2021 at 10:40 PM Hector Martin <marcan@marcan.st> wrote:

> This implements the 'nonposted-mmio' and 'posted-mmio' boolean
> properties. Placing these properties in a bus marks all child devices as
> requiring non-posted or posted MMIO mappings. If no such properties are
> found, the default is posted MMIO.
>
> of_mmio_is_nonposted() performs the tree walking to determine if a given
> device has requested non-posted MMIO.
>
> of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED
> flag on resources that require non-posted MMIO.
>
> of_iomap() and of_io_request_and_map() then use this flag to pick the
> correct ioremap() variant.
>
> This mechanism is currently restricted to Apple ARM platforms, as an
> optimization.
>
> Signed-off-by: Hector Martin <marcan@marcan.st>

This is fine with me atleast given that the nonposted IO is acceptable.
Caching the quirk state in a static local is maybe a bit overoptimized
but if the compiler can't see it but we can, why not.
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij
Andy Shevchenko March 5, 2021, 3:13 p.m. UTC | #2
On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote:
>
> This implements the 'nonposted-mmio' and 'posted-mmio' boolean
> properties. Placing these properties in a bus marks all child devices as
> requiring non-posted or posted MMIO mappings. If no such properties are
> found, the default is posted MMIO.
>
> of_mmio_is_nonposted() performs the tree walking to determine if a given
> device has requested non-posted MMIO.
>
> of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED
> flag on resources that require non-posted MMIO.
>
> of_iomap() and of_io_request_and_map() then use this flag to pick the
> correct ioremap() variant.
>
> This mechanism is currently restricted to Apple ARM platforms, as an
> optimization.

...

> @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index)
>         if (of_address_to_resource(np, index, &res))
>                 return NULL;
>
> -       return ioremap(res.start, resource_size(&res));
> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> +               return ioremap_np(res.start, resource_size(&res));
> +       else
> +               return ioremap(res.start, resource_size(&res));

This doesn't sound right. Why _np is so exceptional? Why don't we have
other flavours (it also rings a bell to my previous comment that the
flag in ioresource is not in the right place)?

...

> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> +               mem = ioremap_np(res.start, resource_size(&res));
> +       else
> +               mem = ioremap(res.start, resource_size(&res));
> +

Ditto.

...

> +       while (node) {
> +               if (!of_property_read_bool(node, "ranges")) {
> +                       break;
> +               } else if (of_property_read_bool(node, "nonposted-mmio")) {
> +                       of_node_put(node);
> +                       return true;
> +               } else if (of_property_read_bool(node, "posted-mmio")) {
> +                       break;
> +               }
> +               parent = of_get_parent(node);
> +               of_node_put(node);
> +               node = parent;
> +       }

I believe above can be slightly optimized. Don't we have helpers to
traverse to all parents?
Hector Martin March 5, 2021, 3:55 p.m. UTC | #3
On 06/03/2021 00.13, Andy Shevchenko wrote:
>> @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index)
>>          if (of_address_to_resource(np, index, &res))
>>                  return NULL;
>>
>> -       return ioremap(res.start, resource_size(&res));
>> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
>> +               return ioremap_np(res.start, resource_size(&res));
>> +       else
>> +               return ioremap(res.start, resource_size(&res));
> 
> This doesn't sound right. Why _np is so exceptional? Why don't we have
> other flavours (it also rings a bell to my previous comment that the
> flag in ioresource is not in the right place)?

This is different from other variants, because until now *drivers* have 
made the choice of what ioremap mode to use based on device requirements 
(which means ioremap() 99% of the time, and then framebuffers and other 
memory-ish things such use something else). Now we have a *SoC fabric* 
that is calling the shots on what ioremap mode we have to use - and 
*every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they 
break. So it seems a lot cleaner to make the choice for drivers here to 
upgrade ioremap() to ioremap_np() for SoCs that need it.

If we don't do something like this here or in otherwise common code, 
we'd have to have an open-coded "if apple then ioremap_np, else ioremap" 
in every driver that runs on-die devices on these SoCs, even ones that 
are otherwise standard and need few or no Apple-specific quirks.

We're still going to have to patch some drivers to use managed APIs that 
can properly hit this conditional (like I did for samsung_tty) in cases 
where they currently don't, but that's a lot cleaner than an open-coded 
conditional, I think (and often comes with other benefits anyway).

Note that wholesale making ioremap() behave like ioremap_np() at the 
arch level as as SoC quirk is not an option - for extenal PCIe devices, 
we still need to use ioremap(). We tried this approach initially but it 
doesn't work. Hence we arrived at this solution which describes the 
required mode in the devicetree, at the bus level (which makes sense, 
since that represents the fabric), and then these wrappers can use that 
information, carried over via the bit in struct device, to pick the 
right ioremap mode.

It doesn't really make sense to include the other variants here, because 
_np is strictly stronger than the default. Downgrading ioremap to any 
other variant would break most drivers, badly. However, upgrading to 
ioremap_np() is always correct (if possibly slower), on platforms where 
it is allowed by the bus. In fact, I bet that on many systems nGnRE 
already behaves like nGnRnE anyway. I don't know why Apple didn't just 
allow nGnRE mappings to work (behaving like nGnRnE) instead of making 
them explode, which is the whole reason we have to do this.

>> +       while (node) {
>> +               if (!of_property_read_bool(node, "ranges")) {
>> +                       break;
>> +               } else if (of_property_read_bool(node, "nonposted-mmio")) {
>> +                       of_node_put(node);
>> +                       return true;
>> +               } else if (of_property_read_bool(node, "posted-mmio")) {
>> +                       break;
>> +               }
>> +               parent = of_get_parent(node);
>> +               of_node_put(node);
>> +               node = parent;
>> +       }
> 
> I believe above can be slightly optimized. Don't we have helpers to
> traverse to all parents?

Keep in mind the logic here is that it stops on the first instance of 
either property, and does not traverse non-translatable boundaries. Are 
there helpers that can implement this kind of complex logic? It's not a 
simple recursive property lookup.
Rob Herring (Arm) March 5, 2021, 4:05 p.m. UTC | #4
On Fri, Mar 5, 2021 at 9:13 AM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
>
> On Thu, Mar 4, 2021 at 11:40 PM Hector Martin <marcan@marcan.st> wrote:
> >
> > This implements the 'nonposted-mmio' and 'posted-mmio' boolean
> > properties. Placing these properties in a bus marks all child devices as
> > requiring non-posted or posted MMIO mappings. If no such properties are
> > found, the default is posted MMIO.
> >
> > of_mmio_is_nonposted() performs the tree walking to determine if a given
> > device has requested non-posted MMIO.
> >
> > of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED
> > flag on resources that require non-posted MMIO.
> >
> > of_iomap() and of_io_request_and_map() then use this flag to pick the
> > correct ioremap() variant.
> >
> > This mechanism is currently restricted to Apple ARM platforms, as an
> > optimization.
>
> ...
>
> > @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index)
> >         if (of_address_to_resource(np, index, &res))
> >                 return NULL;
> >
> > -       return ioremap(res.start, resource_size(&res));
> > +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> > +               return ioremap_np(res.start, resource_size(&res));
> > +       else
> > +               return ioremap(res.start, resource_size(&res));
>
> This doesn't sound right. Why _np is so exceptional? Why don't we have
> other flavours (it also rings a bell to my previous comment that the
> flag in ioresource is not in the right place)?
>
> ...
>
> > +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> > +               mem = ioremap_np(res.start, resource_size(&res));
> > +       else
> > +               mem = ioremap(res.start, resource_size(&res));
> > +
>
> Ditto.
>
> ...
>
> > +       while (node) {
> > +               if (!of_property_read_bool(node, "ranges")) {
> > +                       break;
> > +               } else if (of_property_read_bool(node, "nonposted-mmio")) {
> > +                       of_node_put(node);
> > +                       return true;
> > +               } else if (of_property_read_bool(node, "posted-mmio")) {
> > +                       break;
> > +               }
> > +               parent = of_get_parent(node);
> > +               of_node_put(node);
> > +               node = parent;
> > +       }
>
> I believe above can be slightly optimized. Don't we have helpers to
> traverse to all parents?

We don't. I only found a handful of cases mostly in arch/powerpc.
Given that and this series is big enough already, I don't think we
need a helper as part of it. But patches welcome.

Rob
Andy Shevchenko March 5, 2021, 4:08 p.m. UTC | #5
On Fri, Mar 5, 2021 at 5:55 PM Hector Martin <marcan@marcan.st> wrote:
> On 06/03/2021 00.13, Andy Shevchenko wrote:

...

> >> -       return ioremap(res.start, resource_size(&res));
> >> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> >> +               return ioremap_np(res.start, resource_size(&res));
> >> +       else
> >> +               return ioremap(res.start, resource_size(&res));
> >
> > This doesn't sound right. Why _np is so exceptional? Why don't we have
> > other flavours (it also rings a bell to my previous comment that the
> > flag in ioresource is not in the right place)?
>
> This is different from other variants, because until now *drivers* have
> made the choice of what ioremap mode to use based on device requirements
> (which means ioremap() 99% of the time, and then framebuffers and other
> memory-ish things such use something else). Now we have a *SoC fabric*
> that is calling the shots on what ioremap mode we have to use - and
> *every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they
> break. So it seems a lot cleaner to make the choice for drivers here to
> upgrade ioremap() to ioremap_np() for SoCs that need it.

Yes, that is a good idea. Once we discussed x86 and _uc cases and
actually on x86 it makes a lot of sense to have ioremap() ==
ioremap_uc(). Can't be this a similar case here?
Arnd, what do you think of actually providing an ioremap() as some
kind of "best for the architecture the code is running on"?

Otherwise if the same driver happens to be needed on different
architectures, oops, ifdeffery or simple conditionals over the code is
really not the best way to solve it.

> If we don't do something like this here or in otherwise common code,
> we'd have to have an open-coded "if apple then ioremap_np, else ioremap"
> in every driver that runs on-die devices on these SoCs, even ones that
> are otherwise standard and need few or no Apple-specific quirks.

Exactly! But what about architectures where _uc is that one? So, why
does your patch only take part of _np case?
(Hint we have x86 Device Tree based platforms)

> We're still going to have to patch some drivers to use managed APIs that
> can properly hit this conditional (like I did for samsung_tty) in cases
> where they currently don't, but that's a lot cleaner than an open-coded
> conditional, I think (and often comes with other benefits anyway).
>
> Note that wholesale making ioremap() behave like ioremap_np() at the
> arch level as as SoC quirk is not an option - for extenal PCIe devices,
> we still need to use ioremap(). We tried this approach initially but it
> doesn't work. Hence we arrived at this solution which describes the
> required mode in the devicetree, at the bus level (which makes sense,
> since that represents the fabric), and then these wrappers can use that
> information, carried over via the bit in struct device, to pick the
> right ioremap mode.
>
> It doesn't really make sense to include the other variants here, because
> _np is strictly stronger than the default. Downgrading ioremap to any
> other variant would break most drivers, badly. However, upgrading to
> ioremap_np() is always correct (if possibly slower), on platforms where
> it is allowed by the bus. In fact, I bet that on many systems nGnRE
> already behaves like nGnRnE anyway. I don't know why Apple didn't just
> allow nGnRE mappings to work (behaving like nGnRnE) instead of making
> them explode, which is the whole reason we have to do this.

Yep, and why not to make ioremap() == ioremap_nc() on architecture
that requires it?
Can it be detected at run time?

...

> >> +       while (node) {
> >> +               if (!of_property_read_bool(node, "ranges")) {
> >> +                       break;
> >> +               } else if (of_property_read_bool(node, "nonposted-mmio")) {
> >> +                       of_node_put(node);
> >> +                       return true;
> >> +               } else if (of_property_read_bool(node, "posted-mmio")) {
> >> +                       break;
> >> +               }
> >> +               parent = of_get_parent(node);
> >> +               of_node_put(node);
> >> +               node = parent;
> >> +       }
> >
> > I believe above can be slightly optimized. Don't we have helpers to
> > traverse to all parents?
>
> Keep in mind the logic here is that it stops on the first instance of
> either property, and does not traverse non-translatable boundaries. Are
> there helpers that can implement this kind of complex logic? It's not a
> simple recursive property lookup.

I am aware of what it does and I believe if we don't have such a
helper yet we may introduce it and maybe even existing users of
something similar can utilize it.
Arnd Bergmann March 5, 2021, 4:43 p.m. UTC | #6
On Fri, Mar 5, 2021 at 5:08 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
>
> On Fri, Mar 5, 2021 at 5:55 PM Hector Martin <marcan@marcan.st> wrote:
> > On 06/03/2021 00.13, Andy Shevchenko wrote:
>
> ...
>
> > >> -       return ioremap(res.start, resource_size(&res));
> > >> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> > >> +               return ioremap_np(res.start, resource_size(&res));
> > >> +       else
> > >> +               return ioremap(res.start, resource_size(&res));
> > >
> > > This doesn't sound right. Why _np is so exceptional? Why don't we have
> > > other flavours (it also rings a bell to my previous comment that the
> > > flag in ioresource is not in the right place)?
> >
> > This is different from other variants, because until now *drivers* have
> > made the choice of what ioremap mode to use based on device requirements
> > (which means ioremap() 99% of the time, and then framebuffers and other
> > memory-ish things such use something else). Now we have a *SoC fabric*
> > that is calling the shots on what ioremap mode we have to use - and
> > *every* non-PCIe driver needs to use ioremap_np() on these SoCs, or they
> > break. So it seems a lot cleaner to make the choice for drivers here to
> > upgrade ioremap() to ioremap_np() for SoCs that need it.
>
> Yes, that is a good idea. Once we discussed x86 and _uc cases and
> actually on x86 it makes a lot of sense to have ioremap() ==
> ioremap_uc(). Can't be this a similar case here?

The difference is that ioremap() should be ioremap_uc() on /all/
architectures, it's just that x86 and ia64 for a long time were the
exception and defined ioremap() as 'do whatever the mtrr says'.

> Arnd, what do you think of actually providing an ioremap() as some
> kind of "best for the architecture the code is running on"?

Linus has been pretty clear about wanting all the default functions
to have similar behavior across architectures and be at least as
strict as x86. In case of ioremap() that usually means that writes
are posted, because they are that way on PCI buses on x86.

There are definitely some advantages of making all writes non-posted
by default on Arm because that would be a simpler model, but there
are some important downsides:

- non-posted writes can be much slower than posted ones, depending
  on the specific hardware

- it would change the behavior of all Arm platforms, with no easy
  way to validate it

- setting ioremap() on PCI buses non-posted only makes them
  only slower but not more reliable, because the non-posted flag
  on the bus is discarded by the PCI host bridge.

> Otherwise if the same driver happens to be needed on different
> architectures, oops, ifdeffery or simple conditionals over the code is
> really not the best way to solve it.

The behavior of devm_platform_ioremap_resource() is now to
do the right thing automatically, and I think that is good enough. For
all I can tell, we can use that in all drivers without conditional
compilation.

> > If we don't do something like this here or in otherwise common code,
> > we'd have to have an open-coded "if apple then ioremap_np, else ioremap"
> > in every driver that runs on-die devices on these SoCs, even ones that
> > are otherwise standard and need few or no Apple-specific quirks.
>
> Exactly! But what about architectures where _uc is that one? So, why
> does your patch only take part of _np case?
> (Hint we have x86 Device Tree based platforms)

Usually, the driver knows the requirements, and they are independent
of the platform. If a driver wants _wc or _wt mappings, it will want that
on all machines, and should be able to deal with platforms that implement
that through a stricter mapping.

The two drivers that need to override ioremap() to ioremap_uc()
in order to override the mtrr already have that logic. You are right
that these are a bit like the _np() case in that the device needs
something stricter than the default mapping, but the difference is
that for x86 ioremap_uc() this is needed to work around broken
firmware, while for the apple ioremap_np() case we trust the firmware
to tell us what to do.

> Yep, and why not to make ioremap() == ioremap_nc() on architecture
> that requires it?
> Can it be detected at run time?

I think doing this requires auditing a lot of legacy driver code,
especially drivers/video/fbdev. Once all drivers that need write-combining
mappings explicitly ask for them, the default ioremap can be changed
over to act like ioremap_nc() on the remaining two architectures that
don't do that yet.

There has been a lot of work toward that goal, but the problem
is knowing exactly which drivers need it.

       Arnd
Hector Martin March 5, 2021, 5:19 p.m. UTC | #7
On 06/03/2021 01.43, Arnd Bergmann wrote:
> - setting ioremap() on PCI buses non-posted only makes them
>    only slower but not more reliable, because the non-posted flag
>    on the bus is discarded by the PCI host bridge.

Note that this doesn't work here *anyway*. The fabric is picky in both 
directions: thou shalt use nGnRnE for on-SoC MMIO and nGnRE for PCIe 
windows, or else, SError.

Since these devices can support *any* PCI device via Thunderbolt, making 
PCI drivers be the oddball ones needing special APIs would mean hundreds 
of changes needed - the vast majority of PCI drivers in the kernel use 
plain ioremap variants that don't have any flags to look at.
Rob Herring (Arm) March 5, 2021, 5:39 p.m. UTC | #8
On Thu, Mar 4, 2021 at 3:40 PM Hector Martin <marcan@marcan.st> wrote:
>
> This implements the 'nonposted-mmio' and 'posted-mmio' boolean
> properties. Placing these properties in a bus marks all child devices as
> requiring non-posted or posted MMIO mappings. If no such properties are
> found, the default is posted MMIO.

I'm still a little hesitant to add these properties and having some
default. I worry about a similar situation as 'dma-coherent' where the
assumed default on non-coherent on Arm doesn't work for PowerPC which
defaults coherent. More below on this.

> of_mmio_is_nonposted() performs the tree walking to determine if a given
> device has requested non-posted MMIO.
>
> of_address_to_resource() uses this to set the IORESOURCE_MEM_NONPOSTED
> flag on resources that require non-posted MMIO.
>
> of_iomap() and of_io_request_and_map() then use this flag to pick the
> correct ioremap() variant.
>
> This mechanism is currently restricted to Apple ARM platforms, as an
> optimization.
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
>  drivers/of/address.c       | 72 ++++++++++++++++++++++++++++++++++++--
>  include/linux/of_address.h |  1 +
>  2 files changed, 71 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/of/address.c b/drivers/of/address.c
> index 73ddf2540f3f..6114dceb1ba6 100644
> --- a/drivers/of/address.c
> +++ b/drivers/of/address.c
> @@ -847,6 +847,9 @@ static int __of_address_to_resource(struct device_node *dev,
>                 return -EINVAL;
>         memset(r, 0, sizeof(struct resource));
>
> +       if (of_mmio_is_nonposted(dev))
> +               flags |= IORESOURCE_MEM_NONPOSTED;
> +
>         r->start = taddr;
>         r->end = taddr + size - 1;
>         r->flags = flags;
> @@ -896,7 +899,10 @@ void __iomem *of_iomap(struct device_node *np, int index)
>         if (of_address_to_resource(np, index, &res))
>                 return NULL;
>
> -       return ioremap(res.start, resource_size(&res));
> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> +               return ioremap_np(res.start, resource_size(&res));
> +       else
> +               return ioremap(res.start, resource_size(&res));

This and the devm variants all scream for a ioremap_extended()
function. IOW, it would be better if the ioremap flavor was a
parameter. Unless we could implement that just for arm64 first, that's
a lot of refactoring...

>  }
>  EXPORT_SYMBOL(of_iomap);
>
> @@ -928,7 +934,11 @@ void __iomem *of_io_request_and_map(struct device_node *np, int index,
>         if (!request_mem_region(res.start, resource_size(&res), name))
>                 return IOMEM_ERR_PTR(-EBUSY);
>
> -       mem = ioremap(res.start, resource_size(&res));
> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> +               mem = ioremap_np(res.start, resource_size(&res));
> +       else
> +               mem = ioremap(res.start, resource_size(&res));
> +
>         if (!mem) {
>                 release_mem_region(res.start, resource_size(&res));
>                 return IOMEM_ERR_PTR(-ENOMEM);
> @@ -1094,3 +1104,61 @@ bool of_dma_is_coherent(struct device_node *np)
>         return false;
>  }
>  EXPORT_SYMBOL_GPL(of_dma_is_coherent);
> +
> +static bool of_nonposted_mmio_quirk(void)
> +{
> +       if (IS_ENABLED(CONFIG_ARCH_APPLE)) {
> +               /* To save cycles, we cache the result for global "Apple ARM" setting */
> +               static int quirk_state = -1;
> +
> +               /* Make quirk cached */
> +               if (quirk_state < 0)
> +                       quirk_state = of_machine_is_compatible("apple,arm-platform");
> +               return !!quirk_state;
> +       }
> +       return false;
> +}
> +
> +/**
> + * of_mmio_is_nonposted - Check if device uses non-posted MMIO
> + * @np:        device node
> + *
> + * Returns true if the "nonposted-mmio" property was found for
> + * the device's bus or a parent. "posted-mmio" has the opposite
> + * effect, terminating recursion and overriding any
> + * "nonposted-mmio" properties in parent buses.
> + *
> + * Recursion terminates if reach a non-translatable boundary
> + * (a node without a 'ranges' property).
> + *
> + * This is currently only enabled on Apple ARM devices, as an
> + * optimization.
> + */
> +bool of_mmio_is_nonposted(struct device_node *np)
> +{
> +       struct device_node *node;
> +       struct device_node *parent;
> +
> +       if (!of_nonposted_mmio_quirk())
> +               return false;
> +
> +       node = of_get_parent(np);
> +
> +       while (node) {
> +               if (!of_property_read_bool(node, "ranges")) {
> +                       break;
> +               } else if (of_property_read_bool(node, "nonposted-mmio")) {
> +                       of_node_put(node);
> +                       return true;
> +               } else if (of_property_read_bool(node, "posted-mmio")) {
> +                       break;
> +               }
> +               parent = of_get_parent(node);
> +               of_node_put(node);
> +               node = parent;
> +       }
> +
> +       of_node_put(node);
> +       return false;

What's the code path using these functions on the M1 where we need to
return 'posted'? It's just downstream PCI mappings (PCI memory space),
right? Those would never hit these paths because they don't have a DT
node or if they do the memory space is not part of it. So can't the
check just be:

bool of_mmio_is_nonposted(struct device_node *np)
{
    return np && of_machine_is_compatible("apple,arm-platform");
}

Note in theory we could use 'assigned-addresses' with PCI, but that's
pretty much never the case with FDT. If we did, we could detect the
device node is a PCI device in that case.

Rob
Hector Martin March 5, 2021, 6:18 p.m. UTC | #9
On 06/03/2021 02.39, Rob Herring wrote:
> I'm still a little hesitant to add these properties and having some
> default. I worry about a similar situation as 'dma-coherent' where the
> assumed default on non-coherent on Arm doesn't work for PowerPC which
> defaults coherent. More below on this.

The intent of the default here is that it matches what ioremap() does on 
other platforms already (where it does not make any claims of being 
posted, though it could be on some platforms). It could be per-platform 
what that means... but either way it should be what drivers get today 
without asking for anything special.

>> -       return ioremap(res.start, resource_size(&res));
>> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
>> +               return ioremap_np(res.start, resource_size(&res));
>> +       else
>> +               return ioremap(res.start, resource_size(&res));
> 
> This and the devm variants all scream for a ioremap_extended()
> function. IOW, it would be better if the ioremap flavor was a
> parameter. Unless we could implement that just for arm64 first, that's
> a lot of refactoring...

I agree, but yeah... that's one big refactor to try to do now...

> What's the code path using these functions on the M1 where we need to
> return 'posted'? It's just downstream PCI mappings (PCI memory space),
> right? Those would never hit these paths because they don't have a DT
> node or if they do the memory space is not part of it. So can't the
> check just be:
> 
> bool of_mmio_is_nonposted(struct device_node *np)
> {
>      return np && of_machine_is_compatible("apple,arm-platform");
> }

Yes; the implementation was trying to be generic, but AIUI we don't need 
this on M1 because the PCI mappings don't go through this codepath, and 
nothing else needs posted mode. My first hack was something not too 
unlike this, then I was going to get rid of apple,arm-platform and just 
have this be a generic mechanism with the properties, but then we added 
the optimization to not do the lookups on other platforms, and now we're 
coming full circle... :-)

If you prefer to handle it this way for now I can do it like this. I 
think we should still have the DT bindings and properties though (even 
if not used), as they do describe the hardware properly, and in the 
future we might want to use them instead of having a quirk.
Arnd Bergmann March 5, 2021, 9:17 p.m. UTC | #10
On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote:
>
> On 06/03/2021 02.39, Rob Herring wrote:
> >> -       return ioremap(res.start, resource_size(&res));
> >> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> >> +               return ioremap_np(res.start, resource_size(&res));
> >> +       else
> >> +               return ioremap(res.start, resource_size(&res));
> >
> > This and the devm variants all scream for a ioremap_extended()
> > function. IOW, it would be better if the ioremap flavor was a
> > parameter. Unless we could implement that just for arm64 first, that's
> > a lot of refactoring...
>
> I agree, but yeah... that's one big refactor to try to do now...

FWIW, there is ioremap_prot() that Christoph introduced in 2019
for a few architectures.  I suppose it would be nice to lift
that out architecture specific code and completely replace the
unusual variants, leaving only ioremap(), ioremap_prot() and
memremap() but dropping the _nc, _cached, _wc, _wt and _np
versions in favor of an extensible set of flags.

Then again, I would not make that a prerequisite for the merge
of the M1 support.

> > What's the code path using these functions on the M1 where we need to
> > return 'posted'? It's just downstream PCI mappings (PCI memory space),
> > right? Those would never hit these paths because they don't have a DT
> > node or if they do the memory space is not part of it. So can't the
> > check just be:
> >
> > bool of_mmio_is_nonposted(struct device_node *np)
> > {
> >      return np && of_machine_is_compatible("apple,arm-platform");
> > }
>
> Yes; the implementation was trying to be generic, but AIUI we don't need
> this on M1 because the PCI mappings don't go through this codepath, and
> nothing else needs posted mode. My first hack was something not too
> unlike this, then I was going to get rid of apple,arm-platform and just
> have this be a generic mechanism with the properties, but then we added
> the optimization to not do the lookups on other platforms, and now we're
> coming full circle... :-)

I never liked the idea of having a list of platforms that need a
special hack, please let's not go back to that.

         Arnd
Rob Herring (Arm) March 8, 2021, 3:56 p.m. UTC | #11
On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote:
> >
> > On 06/03/2021 02.39, Rob Herring wrote:
> > >> -       return ioremap(res.start, resource_size(&res));
> > >> +       if (res.flags & IORESOURCE_MEM_NONPOSTED)
> > >> +               return ioremap_np(res.start, resource_size(&res));
> > >> +       else
> > >> +               return ioremap(res.start, resource_size(&res));
> > >
> > > This and the devm variants all scream for a ioremap_extended()
> > > function. IOW, it would be better if the ioremap flavor was a
> > > parameter. Unless we could implement that just for arm64 first, that's
> > > a lot of refactoring...
> >
> > I agree, but yeah... that's one big refactor to try to do now...
>
> FWIW, there is ioremap_prot() that Christoph introduced in 2019
> for a few architectures.  I suppose it would be nice to lift
> that out architecture specific code and completely replace the
> unusual variants, leaving only ioremap(), ioremap_prot() and
> memremap() but dropping the _nc, _cached, _wc, _wt and _np
> versions in favor of an extensible set of flags.
>
> Then again, I would not make that a prerequisite for the merge
> of the M1 support.
>
> > > What's the code path using these functions on the M1 where we need to
> > > return 'posted'? It's just downstream PCI mappings (PCI memory space),
> > > right? Those would never hit these paths because they don't have a DT
> > > node or if they do the memory space is not part of it. So can't the
> > > check just be:
> > >
> > > bool of_mmio_is_nonposted(struct device_node *np)
> > > {
> > >      return np && of_machine_is_compatible("apple,arm-platform");
> > > }
> >
> > Yes; the implementation was trying to be generic, but AIUI we don't need
> > this on M1 because the PCI mappings don't go through this codepath, and
> > nothing else needs posted mode. My first hack was something not too
> > unlike this, then I was going to get rid of apple,arm-platform and just
> > have this be a generic mechanism with the properties, but then we added
> > the optimization to not do the lookups on other platforms, and now we're
> > coming full circle... :-)
>
> I never liked the idea of having a list of platforms that need a
> special hack, please let's not go back to that.

I'm a fan of generic solutions as much as anyone, but not when there's
a single user. Yes, there could be more, but we haven't seen any yet
and Apple seems to have a knack for doing special things. I'm pretty
sure posted vs. non-posted has been a possibility with AXI buses from
the start, so it's not like this is a new thing we're going to see
frequently on new platforms.

A generic property we have to support forever because there's zero
visibility if someone uses them. At least with something platform
specific, we know if it's in use or can be removed. That's something I
just checked recently with some of the PPC irq work-arounds (spoiler:
yes, those 'old world Mac' are). I'm a bit less worried about this
aspect given we can probably assume someone will still be using M1
Macs in 20+ years.

The other situation I worry about here is another arch has implicitly
defaulted to non-posted instead of posted. It could just be non-posted
was what worked everywhere and Linux couldn't distinguish. Now someone
sees we have this new posted vs. non-posted handling and can optimize
some mappings on their platform and we have to have per arch defaults
(like 'dma-coherent' now).

Rob
Arnd Bergmann March 8, 2021, 8:29 p.m. UTC | #12
On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
> On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote:
> > On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote:
>
> > > > What's the code path using these functions on the M1 where we need to
> > > > return 'posted'? It's just downstream PCI mappings (PCI memory space),
> > > > right? Those would never hit these paths because they don't have a DT
> > > > node or if they do the memory space is not part of it. So can't the
> > > > check just be:
> > > >
> > > > bool of_mmio_is_nonposted(struct device_node *np)
> > > > {
> > > >      return np && of_machine_is_compatible("apple,arm-platform");
> > > > }
> > >
> > > Yes; the implementation was trying to be generic, but AIUI we don't need
> > > this on M1 because the PCI mappings don't go through this codepath, and
> > > nothing else needs posted mode. My first hack was something not too
> > > unlike this, then I was going to get rid of apple,arm-platform and just
> > > have this be a generic mechanism with the properties, but then we added
> > > the optimization to not do the lookups on other platforms, and now we're
> > > coming full circle... :-)
> >
> > I never liked the idea of having a list of platforms that need a
> > special hack, please let's not go back to that.
>
> I'm a fan of generic solutions as much as anyone, but not when there's
> a single user. Yes, there could be more, but we haven't seen any yet
> and Apple seems to have a knack for doing special things. I'm pretty
> sure posted vs. non-posted has been a possibility with AXI buses from
> the start, so it's not like this is a new thing we're going to see
> frequently on new platforms.

Ok, but if we make it a platform specific bit, I would prefer not
to do the IORESOURCE_MEM_NONPOSTED flag either but
instead keep the logic in the device drivers that call ioremap().

This is obviously more work for the drivers, but at least it keeps
the common code free of the hack while also allowing drivers to
use ioremap_np() intentionally on other platforms.

> The other situation I worry about here is another arch has implicitly
> defaulted to non-posted instead of posted. It could just be non-posted
> was what worked everywhere and Linux couldn't distinguish. Now someone
> sees we have this new posted vs. non-posted handling and can optimize
> some mappings on their platform and we have to have per arch defaults
> (like 'dma-coherent' now).

I think one of the dark secrets of MMIO is that a lot of drivers
get the posted behavior wrong by assuming that a writel() before
a spin_unlock() is protected by that unlock. This may in fact work
on many architectures but is broken on PCI and on local devices
for ARM.

Having a properly working (on non-PCI) ioremap_np() interface
would be nice here, as it could be used to document when drivers
rely on non-posted behavior, and cause the ioremap to fail when
running on architectures that don't support nonposted maps.

       Arnd
Rob Herring (Arm) March 8, 2021, 9:13 p.m. UTC | #13
On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
> > On Fri, Mar 5, 2021 at 2:17 PM Arnd Bergmann <arnd@kernel.org> wrote:
> > > On Fri, Mar 5, 2021 at 7:18 PM Hector Martin <marcan@marcan.st> wrote:
> >
> > > > > What's the code path using these functions on the M1 where we need to
> > > > > return 'posted'? It's just downstream PCI mappings (PCI memory space),
> > > > > right? Those would never hit these paths because they don't have a DT
> > > > > node or if they do the memory space is not part of it. So can't the
> > > > > check just be:
> > > > >
> > > > > bool of_mmio_is_nonposted(struct device_node *np)
> > > > > {
> > > > >      return np && of_machine_is_compatible("apple,arm-platform");
> > > > > }
> > > >
> > > > Yes; the implementation was trying to be generic, but AIUI we don't need
> > > > this on M1 because the PCI mappings don't go through this codepath, and
> > > > nothing else needs posted mode. My first hack was something not too
> > > > unlike this, then I was going to get rid of apple,arm-platform and just
> > > > have this be a generic mechanism with the properties, but then we added
> > > > the optimization to not do the lookups on other platforms, and now we're
> > > > coming full circle... :-)
> > >
> > > I never liked the idea of having a list of platforms that need a
> > > special hack, please let's not go back to that.
> >
> > I'm a fan of generic solutions as much as anyone, but not when there's
> > a single user. Yes, there could be more, but we haven't seen any yet
> > and Apple seems to have a knack for doing special things. I'm pretty
> > sure posted vs. non-posted has been a possibility with AXI buses from
> > the start, so it's not like this is a new thing we're going to see
> > frequently on new platforms.
> 
> Ok, but if we make it a platform specific bit, I would prefer not
> to do the IORESOURCE_MEM_NONPOSTED flag either but
> instead keep the logic in the device drivers that call ioremap().

That seems like an orthogonal decision to me.

> This is obviously more work for the drivers, but at least it keeps
> the common code free of the hack while also allowing drivers to
> use ioremap_np() intentionally on other platforms.

I don't agree. The problem is within the interconnect. The device and 
its driver are unaware of this.

The other idea I had was doing a compatible other than 'simple-bus' for 
the bus node which could imply non-posted io and any other quirks in 
Apple's bus implementation. However, something different there means 
updates in lots of places (schemas, dtc checks, etc.) unless we kept 
'simple-bus' as a fallback.

Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd 
rather know if and when we need 'posted-mmio'. It does need to be added 
to the DT spec[1] and schema[2] though (GH PRs are fine for both).

> > The other situation I worry about here is another arch has implicitly
> > defaulted to non-posted instead of posted. It could just be non-posted
> > was what worked everywhere and Linux couldn't distinguish. Now someone
> > sees we have this new posted vs. non-posted handling and can optimize
> > some mappings on their platform and we have to have per arch defaults
> > (like 'dma-coherent' now).
> 
> I think one of the dark secrets of MMIO is that a lot of drivers
> get the posted behavior wrong by assuming that a writel() before
> a spin_unlock() is protected by that unlock. This may in fact work
> on many architectures but is broken on PCI and on local devices
> for ARM.
> 
> Having a properly working (on non-PCI) ioremap_np() interface
> would be nice here, as it could be used to document when drivers
> rely on non-posted behavior, and cause the ioremap to fail when
> running on architectures that don't support nonposted maps.

Good to know.

Rob

[1] https://github.com/devicetree-org/devicetree-specification
[2] https://github.com/devicetree-org/dt-schema
Arnd Bergmann March 8, 2021, 9:56 p.m. UTC | #14
On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote:
> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
> > On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
>
> Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd
> rather know if and when we need 'posted-mmio'. It does need to be added
> to the DT spec[1] and schema[2] though (GH PRs are fine for both).

I think the reason for having "posted-mmio" is that you cannot properly
define the PCI host controller nodes on the M1 without that: Since
nonposted-mmio applies to all child nodes, this would mean the PCI
memory space gets declared as nonposted by the DT, but the hardware
requires it to be mapped as posted.

       Arnd
Linus Walleij March 9, 2021, 11:14 a.m. UTC | #15
On Mon, Mar 8, 2021 at 10:13 PM Rob Herring <robh@kernel.org> wrote:
> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:

> > This is obviously more work for the drivers, but at least it keeps
> > the common code free of the hack while also allowing drivers to
> > use ioremap_np() intentionally on other platforms.
>
> I don't agree. The problem is within the interconnect. The device and
> its driver are unaware of this.

If it is possible that a driver needs to use posted access on one
SoC and nonposted on another SoC then clearly the nature
of the access need to be part of the memory access abstraction,
obviously ioremap() one way or another.

Having the driver conditionally use different ioremap_*
functions depending on SoC seems awkward. We had different
execution paths for OF and ACPI drivers and have been working
hard to create fwnode to abstract this away for drivers used with
both abstractions for example. If we can hide it from drivers
from day 1 I think we can save maintenance costs in the long
run.

Given that the Apple silicon through it's heritage from Samsung
S3C (the genealogy is unclear to me) already share drivers with
this platform, this seems to already be the case so it's not a
theoretical use case.

The core argument here seems to be "will this become common
practice or is it an Apple-ism?"

That is a question to someone who is deep down there
synthesizing SoCs. It appears the market for custom silicon
laptops has just begun. There are people that can answer this
question but I doubt that we have access to them or that they
would tell us. What is an educated guess? It seems Arnds
position is that it's an Apple-ism and I kind of trust him on this.
At the same time I know that in emerging markets, what
copycats are likely to do is say "give me exactly what Apple
has, exactly that thing".

Just my €0.01
Linus Walleij
Arnd Bergmann March 9, 2021, 12:41 p.m. UTC | #16
On Tue, Mar 9, 2021 at 12:14 PM Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Mon, Mar 8, 2021 at 10:13 PM Rob Herring <robh@kernel.org> wrote:
> > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
>
> > > This is obviously more work for the drivers, but at least it keeps
> > > the common code free of the hack while also allowing drivers to
> > > use ioremap_np() intentionally on other platforms.
> >
> > I don't agree. The problem is within the interconnect. The device and
> > its driver are unaware of this.
>
> If it is possible that a driver needs to use posted access on one
> SoC and nonposted on another SoC then clearly the nature
> of the access need to be part of the memory access abstraction,
> obviously ioremap() one way or another.

There are two possible scenarios:

- drivers that we already know are shared between apple and
  other vendors (s3c-serial, pasemi i2c) would need to use
  nonposted mmio on Apple but can use either one on other
  platforms. On non-ARM CPUs, the ioremap_np() function
  might fail when the hardware only supports posted writes.

- A driver writer may want to choose between posted and
  nonposted mmio based on performance considerations:
  if writes are never serialized, posted writes should always
  be faster. However, if the driver uses a spinlock to serialize
  writes, then a nonposted write is likely faster than a posted
  write followed by a read that serializes the spin_unlock.
  In this case we want the driver to explicitly pick one over
  the other, and not have rely on bus specific magic.

> Having the driver conditionally use different ioremap_*
> functions depending on SoC seems awkward. We had different
> execution paths for OF and ACPI drivers and have been working
> hard to create fwnode to abstract this away for drivers used with
> both abstractions for example. If we can hide it from drivers
> from day 1 I think we can save maintenance costs in the long
> run.
>
> Given that the Apple silicon through it's heritage from Samsung
> S3C (the genealogy is unclear to me) already share drivers with
> this platform, this seems to already be the case so it's not a
> theoretical use case.

As far as I can tell, there are only a handful of soc specific drivers
that are actually shared with other platforms. Aside from serial
and i2c, these are the ones that I can see being shared:

- there is an on-chip nvme host controller that is not PCI. So far,
  nobody else does this, but it can clearly happen in the future

- I think one of the USB controllers is a standard designware
  part, while the others are PCI devices.

- The PCI host bridge may be close enough to the standard
   that we can use the generic driver for config space access.

        Arnd
Linus Walleij March 9, 2021, 3:40 p.m. UTC | #17
On Tue, Mar 9, 2021 at 1:41 PM Arnd Bergmann <arnd@kernel.org> wrote:

> - A driver writer may want to choose between posted and
>   nonposted mmio based on performance considerations:
>   if writes are never serialized, posted writes should always
>   be faster. However, if the driver uses a spinlock to serialize
>   writes, then a nonposted write is likely faster than a posted
>   write followed by a read that serializes the spin_unlock.
>   In this case we want the driver to explicitly pick one over
>   the other, and not have rely on bus specific magic.

OK then I am all for having drivers explicitly choose access
method. Openness to speed optimization is a well established
Linux kernel design principle.

Yours,
Linus Walleij
Rob Herring (Arm) March 9, 2021, 3:48 p.m. UTC | #18
On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote:
> > On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
> > > On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
> >
> > Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd
> > rather know if and when we need 'posted-mmio'. It does need to be added
> > to the DT spec[1] and schema[2] though (GH PRs are fine for both).
>
> I think the reason for having "posted-mmio" is that you cannot properly
> define the PCI host controller nodes on the M1 without that: Since
> nonposted-mmio applies to all child nodes, this would mean the PCI
> memory space gets declared as nonposted by the DT, but the hardware
> requires it to be mapped as posted.

I don't think so. PCI devices wouldn't use any of the code paths in
this patch. They would map their memory space with plain ioremap()
which is posted.

Rob
Hector Martin March 9, 2021, 8:23 p.m. UTC | #19
On 10/03/2021 00.48, Rob Herring wrote:
> On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote:
>>
>> On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote:
>>> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
>>>> On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
>>>
>>> Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd
>>> rather know if and when we need 'posted-mmio'. It does need to be added
>>> to the DT spec[1] and schema[2] though (GH PRs are fine for both).
>>
>> I think the reason for having "posted-mmio" is that you cannot properly
>> define the PCI host controller nodes on the M1 without that: Since
>> nonposted-mmio applies to all child nodes, this would mean the PCI
>> memory space gets declared as nonposted by the DT, but the hardware
>> requires it to be mapped as posted.
> 
> I don't think so. PCI devices wouldn't use any of the code paths in
> this patch. They would map their memory space with plain ioremap()
> which is posted.

My main concern here is that this creates an inconsistency in the device 
tree representation that only works because PCI drivers happen not to 
use these code paths. Logically, having "nonposted-mmio" above the PCI 
controller would imply that it applies to that bus too. Sure, it doesn't 
matter for Linux since it is ignored, but this creates an implicit 
exception that PCI buses always use posted modes.

Then if a device comes along that due to some twisted fabric logic needs 
nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will 
end up posted at the bus anyway)... how do we represent that? Declare 
that another "nonposted-mmio" on the PCIe bus means "no, really, use 
nonposted mmio for this"?
Rob Herring (Arm) March 9, 2021, 10:06 p.m. UTC | #20
On Tue, Mar 9, 2021 at 1:24 PM Hector Martin <marcan@marcan.st> wrote:
>
> On 10/03/2021 00.48, Rob Herring wrote:
> > On Mon, Mar 8, 2021 at 2:56 PM Arnd Bergmann <arnd@kernel.org> wrote:
> >>
> >> On Mon, Mar 8, 2021 at 10:14 PM Rob Herring <robh@kernel.org> wrote:
> >>> On Mon, Mar 08, 2021 at 09:29:54PM +0100, Arnd Bergmann wrote:
> >>>> On Mon, Mar 8, 2021 at 4:56 PM Rob Herring <robh@kernel.org> wrote:
> >>>
> >>> Let's just stick with 'nonposted-mmio', but drop 'posted-mmio'. I'd
> >>> rather know if and when we need 'posted-mmio'. It does need to be added
> >>> to the DT spec[1] and schema[2] though (GH PRs are fine for both).
> >>
> >> I think the reason for having "posted-mmio" is that you cannot properly
> >> define the PCI host controller nodes on the M1 without that: Since
> >> nonposted-mmio applies to all child nodes, this would mean the PCI
> >> memory space gets declared as nonposted by the DT, but the hardware
> >> requires it to be mapped as posted.
> >
> > I don't think so. PCI devices wouldn't use any of the code paths in
> > this patch. They would map their memory space with plain ioremap()
> > which is posted.
>
> My main concern here is that this creates an inconsistency in the device
> tree representation that only works because PCI drivers happen not to
> use these code paths. Logically, having "nonposted-mmio" above the PCI
> controller would imply that it applies to that bus too. Sure, it doesn't
> matter for Linux since it is ignored, but this creates an implicit
> exception that PCI buses always use posted modes.

We could be stricter that "nonposted-mmio" must be in the immediate
parent. That's kind of in line with how addressing already works.
Every level has to have 'ranges' to be an MMIO address, and the
address cell size is set by the immediate parent.

> Then if a device comes along that due to some twisted fabric logic needs
> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
> end up posted at the bus anyway)... how do we represent that? Declare
> that another "nonposted-mmio" on the PCIe bus means "no, really, use
> nonposted mmio for this"?

If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".

Rob
Hector Martin March 10, 2021, 8:26 a.m. UTC | #21
On 10/03/2021 07.06, Rob Herring wrote:
>> My main concern here is that this creates an inconsistency in the device
>> tree representation that only works because PCI drivers happen not to
>> use these code paths. Logically, having "nonposted-mmio" above the PCI
>> controller would imply that it applies to that bus too. Sure, it doesn't
>> matter for Linux since it is ignored, but this creates an implicit
>> exception that PCI buses always use posted modes.
> 
> We could be stricter that "nonposted-mmio" must be in the immediate
> parent. That's kind of in line with how addressing already works.
> Every level has to have 'ranges' to be an MMIO address, and the
> address cell size is set by the immediate parent.
> 
>> Then if a device comes along that due to some twisted fabric logic needs
>> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
>> end up posted at the bus anyway)... how do we represent that? Declare
>> that another "nonposted-mmio" on the PCIe bus means "no, really, use
>> nonposted mmio for this"?
> 
> If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".

Works for me; then let's just make it non-recursive.

Do you think we can get rid of the Apple-only optimization if we do 
this? It would mean only looking at the parent during address 
resolution, not recursing all the way to the top, so presumably the 
performance impact would be quite minimal.
Rob Herring (Arm) March 10, 2021, 5:01 p.m. UTC | #22
On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote:
>
> On 10/03/2021 07.06, Rob Herring wrote:
> >> My main concern here is that this creates an inconsistency in the device
> >> tree representation that only works because PCI drivers happen not to
> >> use these code paths. Logically, having "nonposted-mmio" above the PCI
> >> controller would imply that it applies to that bus too. Sure, it doesn't
> >> matter for Linux since it is ignored, but this creates an implicit
> >> exception that PCI buses always use posted modes.
> >
> > We could be stricter that "nonposted-mmio" must be in the immediate
> > parent. That's kind of in line with how addressing already works.
> > Every level has to have 'ranges' to be an MMIO address, and the
> > address cell size is set by the immediate parent.
> >
> >> Then if a device comes along that due to some twisted fabric logic needs
> >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
> >> end up posted at the bus anyway)... how do we represent that? Declare
> >> that another "nonposted-mmio" on the PCIe bus means "no, really, use
> >> nonposted mmio for this"?
> >
> > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".
>
> Works for me; then let's just make it non-recursive.
>
> Do you think we can get rid of the Apple-only optimization if we do
> this? It would mean only looking at the parent during address
> resolution, not recursing all the way to the top, so presumably the
> performance impact would be quite minimal.

Yeah, that should be fine. I'd keep an IS_ENABLED() config check
though. Then I'll also know if anyone else needs this.

Rob
Arnd Bergmann March 11, 2021, 9:12 a.m. UTC | #23
On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
>
> On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote:
> >
> > On 10/03/2021 07.06, Rob Herring wrote:
> > >> My main concern here is that this creates an inconsistency in the device
> > >> tree representation that only works because PCI drivers happen not to
> > >> use these code paths. Logically, having "nonposted-mmio" above the PCI
> > >> controller would imply that it applies to that bus too. Sure, it doesn't
> > >> matter for Linux since it is ignored, but this creates an implicit
> > >> exception that PCI buses always use posted modes.
> > >
> > > We could be stricter that "nonposted-mmio" must be in the immediate
> > > parent. That's kind of in line with how addressing already works.
> > > Every level has to have 'ranges' to be an MMIO address, and the
> > > address cell size is set by the immediate parent.
> > >
> > >> Then if a device comes along that due to some twisted fabric logic needs
> > >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
> > >> end up posted at the bus anyway)... how do we represent that? Declare
> > >> that another "nonposted-mmio" on the PCIe bus means "no, really, use
> > >> nonposted mmio for this"?
> > >
> > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".
> >
> > Works for me; then let's just make it non-recursive.
> >
> > Do you think we can get rid of the Apple-only optimization if we do
> > this? It would mean only looking at the parent during address
> > resolution, not recursing all the way to the top, so presumably the
> > performance impact would be quite minimal.

Works for me.

> Yeah, that should be fine. I'd keep an IS_ENABLED() config check
> though. Then I'll also know if anyone else needs this.

Ok, makes sense.

Conceptually, I'd like to then see a check that verifies that the
property is only set for nodes whose parent also has it set, since
that is how AXI defines it: A bus can wait for the ack from its
child node, or it can acknowledge the write to its parent early.
However, this breaks down as soon as a bus does the early ack:
all its children by definition use posted writes (as seen by the
CPU), even if they wait for stores that come from other masters.

Does this make sense to you?

       Arnd
Hector Martin March 11, 2021, 12:11 p.m. UTC | #24
On 11/03/2021 18.12, Arnd Bergmann wrote:
> On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
>>
>> On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote:
>>>
>>> On 10/03/2021 07.06, Rob Herring wrote:
>>>>> My main concern here is that this creates an inconsistency in the device
>>>>> tree representation that only works because PCI drivers happen not to
>>>>> use these code paths. Logically, having "nonposted-mmio" above the PCI
>>>>> controller would imply that it applies to that bus too. Sure, it doesn't
>>>>> matter for Linux since it is ignored, but this creates an implicit
>>>>> exception that PCI buses always use posted modes.
>>>>
>>>> We could be stricter that "nonposted-mmio" must be in the immediate
>>>> parent. That's kind of in line with how addressing already works.
>>>> Every level has to have 'ranges' to be an MMIO address, and the
>>>> address cell size is set by the immediate parent.
>>>>
>>>>> Then if a device comes along that due to some twisted fabric logic needs
>>>>> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
>>>>> end up posted at the bus anyway)... how do we represent that? Declare
>>>>> that another "nonposted-mmio" on the PCIe bus means "no, really, use
>>>>> nonposted mmio for this"?
>>>>
>>>> If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".
>>>
>>> Works for me; then let's just make it non-recursive.
>>>
>>> Do you think we can get rid of the Apple-only optimization if we do
>>> this? It would mean only looking at the parent during address
>>> resolution, not recursing all the way to the top, so presumably the
>>> performance impact would be quite minimal.
> 
> Works for me.

Incidentally, even though it would now be unused, I'd like to keep the 
apple,arm-platform compatible at this point; we've already been pretty 
close to a use case for it, and I don't want to have to fall back to a 
list of SoC compatibles if we ever need another quirk for all Apple ARM 
SoCs (or break backwards compat). It doesn't really hurt to have it in 
the binding and devicetrees, right?

>> Yeah, that should be fine. I'd keep an IS_ENABLED() config check
>> though. Then I'll also know if anyone else needs this.
> 
> Ok, makes sense.
> 
> Conceptually, I'd like to then see a check that verifies that the
> property is only set for nodes whose parent also has it set, since
> that is how AXI defines it: A bus can wait for the ack from its
> child node, or it can acknowledge the write to its parent early.
> However, this breaks down as soon as a bus does the early ack:
> all its children by definition use posted writes (as seen by the
> CPU), even if they wait for stores that come from other masters.
> 
> Does this make sense to you?

Makes sense. This shouldn't really be something the kernel concerns 
itself with at runtime, just something for the dts linting, right?

I assume this isn't representable in json-schema, so it would presumably 
need some ad-hoc validation code.
Arnd Bergmann March 11, 2021, 1:35 p.m. UTC | #25
On Thu, Mar 11, 2021 at 1:11 PM Hector Martin <marcan@marcan.st> wrote:
> On 11/03/2021 18.12, Arnd Bergmann wrote:
> > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
> >> On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote:
> >>> Works for me; then let's just make it non-recursive.
> >>>
> >>> Do you think we can get rid of the Apple-only optimization if we do
> >>> this? It would mean only looking at the parent during address
> >>> resolution, not recursing all the way to the top, so presumably the
> >>> performance impact would be quite minimal.
> >
> > Works for me.
>
> Incidentally, even though it would now be unused, I'd like to keep the
> apple,arm-platform compatible at this point; we've already been pretty
> close to a use case for it, and I don't want to have to fall back to a
> list of SoC compatibles if we ever need another quirk for all Apple ARM
> SoCs (or break backwards compat). It doesn't really hurt to have it in
> the binding and devicetrees, right?

Yes, keeping the compatible string is a good idea regardless.

> >> Yeah, that should be fine. I'd keep an IS_ENABLED() config check
> >> though. Then I'll also know if anyone else needs this.
> >
> > Ok, makes sense.
> >
> > Conceptually, I'd like to then see a check that verifies that the
> > property is only set for nodes whose parent also has it set, since
> > that is how AXI defines it: A bus can wait for the ack from its
> > child node, or it can acknowledge the write to its parent early.
> > However, this breaks down as soon as a bus does the early ack:
> > all its children by definition use posted writes (as seen by the
> > CPU), even if they wait for stores that come from other masters.
> >
> > Does this make sense to you?
>
> Makes sense. This shouldn't really be something the kernel concerns
> itself with at runtime, just something for the dts linting, right?
>
> I assume this isn't representable in json-schema, so it would presumably
> need some ad-hoc validation code.

Agreed, having a check in either dtc or expressed in the json scheme
is better than a runtime check. I assume Rob would know how to best
add such a check.

     Arnd
Rob Herring (Arm) March 11, 2021, 4:07 p.m. UTC | #26
On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
> >
> > On Wed, Mar 10, 2021 at 1:27 AM Hector Martin <marcan@marcan.st> wrote:
> > >
> > > On 10/03/2021 07.06, Rob Herring wrote:
> > > >> My main concern here is that this creates an inconsistency in the device
> > > >> tree representation that only works because PCI drivers happen not to
> > > >> use these code paths. Logically, having "nonposted-mmio" above the PCI
> > > >> controller would imply that it applies to that bus too. Sure, it doesn't
> > > >> matter for Linux since it is ignored, but this creates an implicit
> > > >> exception that PCI buses always use posted modes.
> > > >
> > > > We could be stricter that "nonposted-mmio" must be in the immediate
> > > > parent. That's kind of in line with how addressing already works.
> > > > Every level has to have 'ranges' to be an MMIO address, and the
> > > > address cell size is set by the immediate parent.
> > > >
> > > >> Then if a device comes along that due to some twisted fabric logic needs
> > > >> nonposted nGnRnE mappings for PCIe (even though the actual PCIe ops will
> > > >> end up posted at the bus anyway)... how do we represent that? Declare
> > > >> that another "nonposted-mmio" on the PCIe bus means "no, really, use
> > > >> nonposted mmio for this"?
> > > >
> > > > If we're strict, yes. The PCI host bridge would have to have "nonposted-mmio".
> > >
> > > Works for me; then let's just make it non-recursive.
> > >
> > > Do you think we can get rid of the Apple-only optimization if we do
> > > this? It would mean only looking at the parent during address
> > > resolution, not recursing all the way to the top, so presumably the
> > > performance impact would be quite minimal.
>
> Works for me.
>
> > Yeah, that should be fine. I'd keep an IS_ENABLED() config check
> > though. Then I'll also know if anyone else needs this.
>
> Ok, makes sense.
>
> Conceptually, I'd like to then see a check that verifies that the
> property is only set for nodes whose parent also has it set, since
> that is how AXI defines it: A bus can wait for the ack from its
> child node, or it can acknowledge the write to its parent early.
> However, this breaks down as soon as a bus does the early ack:
> all its children by definition use posted writes (as seen by the
> CPU), even if they wait for stores that come from other masters.
>
> Does this make sense to you?

BTW, I don't think it's clear in this thread, but the current
definition proposed for the spec[1] and schema is 'nonposted-mmio' is
specific to 'simple-bus'. I like this restriction and we can expand
where 'nonposted-mmio' is allowed later if needed.

It's possible to express in json-schema, but I think it wouldn't be
pretty. Json-schema is not great for expressing inter-property
constraints and it gets worse if we're talking inter-node constraints.
I'd like to define a way to do python snippets of code for something
like this, but that's way down on the wish list.

Rob

[1] https://github.com/devicetree-org/devicetree-specification/pull/40
Arnd Bergmann March 11, 2021, 4:48 p.m. UTC | #27
On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote:
> On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
> > Ok, makes sense.
> >
> > Conceptually, I'd like to then see a check that verifies that the
> > property is only set for nodes whose parent also has it set, since
> > that is how AXI defines it: A bus can wait for the ack from its
> > child node, or it can acknowledge the write to its parent early.
> > However, this breaks down as soon as a bus does the early ack:
> > all its children by definition use posted writes (as seen by the
> > CPU), even if they wait for stores that come from other masters.
> >
> > Does this make sense to you?
>
> BTW, I don't think it's clear in this thread, but the current
> definition proposed for the spec[1] and schema is 'nonposted-mmio' is
> specific to 'simple-bus'. I like this restriction and we can expand
> where 'nonposted-mmio' is allowed later if needed.

That sounds ok, as long as we can express everything for the mac
at the moment. Do we need to explicitly add a description to allow
the property in the root node in addition to simple-bus to be able
to enforce the rule about parent buses also having it?

       Arnd
Rob Herring (Arm) March 11, 2021, 6:10 p.m. UTC | #28
On Thu, Mar 11, 2021 at 9:48 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote:
> > On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
> > > Ok, makes sense.
> > >
> > > Conceptually, I'd like to then see a check that verifies that the
> > > property is only set for nodes whose parent also has it set, since
> > > that is how AXI defines it: A bus can wait for the ack from its
> > > child node, or it can acknowledge the write to its parent early.
> > > However, this breaks down as soon as a bus does the early ack:
> > > all its children by definition use posted writes (as seen by the
> > > CPU), even if they wait for stores that come from other masters.
> > >
> > > Does this make sense to you?
> >
> > BTW, I don't think it's clear in this thread, but the current
> > definition proposed for the spec[1] and schema is 'nonposted-mmio' is
> > specific to 'simple-bus'. I like this restriction and we can expand
> > where 'nonposted-mmio' is allowed later if needed.
>
> That sounds ok, as long as we can express everything for the mac
> at the moment. Do we need to explicitly add a description to allow
> the property in the root node in addition to simple-bus to be able
> to enforce the rule about parent buses also having it?

IMO it should not be allowed in the root node. That's a failure to
define a bus node. Also, would that mean your memory has to be
non-posted!?

Rob
Arnd Bergmann March 12, 2021, 10:20 a.m. UTC | #29
On Thu, Mar 11, 2021 at 7:10 PM Rob Herring <robh@kernel.org> wrote:
>
> On Thu, Mar 11, 2021 at 9:48 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >
> > On Thu, Mar 11, 2021 at 5:10 PM Rob Herring <robh@kernel.org> wrote:
> > > On Thu, Mar 11, 2021 at 2:12 AM Arnd Bergmann <arnd@kernel.org> wrote:
> > > > On Wed, Mar 10, 2021 at 6:01 PM Rob Herring <robh@kernel.org> wrote:
> > > > Ok, makes sense.
> > > >
> > > > Conceptually, I'd like to then see a check that verifies that the
> > > > property is only set for nodes whose parent also has it set, since
> > > > that is how AXI defines it: A bus can wait for the ack from its
> > > > child node, or it can acknowledge the write to its parent early.
> > > > However, this breaks down as soon as a bus does the early ack:
> > > > all its children by definition use posted writes (as seen by the
> > > > CPU), even if they wait for stores that come from other masters.
> > > >
> > > > Does this make sense to you?
> > >
> > > BTW, I don't think it's clear in this thread, but the current
> > > definition proposed for the spec[1] and schema is 'nonposted-mmio' is
> > > specific to 'simple-bus'. I like this restriction and we can expand
> > > where 'nonposted-mmio' is allowed later if needed.
> >
> > That sounds ok, as long as we can express everything for the mac
> > at the moment. Do we need to explicitly add a description to allow
> > the property in the root node in addition to simple-bus to be able
> > to enforce the rule about parent buses also having it?
>
> IMO it should not be allowed in the root node. That's a failure to
> define a bus node.

My interpretation would be that the root node defines the first bus
connected to the CPU(s) themselves, which may already have
posted writes. If writes on that bus are posted, then no child could
be non-posted.

I suppose it depends a bit on the mental model of what the nodes
refer to. If you say that there cannot be a device with registers
directly on the root node, but every bus defines its own space,
then we don't need this, but I think a lot of machines would break
if you try to enforce the rule that there cannot be devices on
the root node.

> Also, would that mean your memory has to be non-posted!?

Good question. You could argue that this should not be because you
don't want to use ioremap_np() flags but instead want this to be
normal cacheable memory instead of device memory.

On the other hand, you definitely don't want memory stores to be
posted, as that would break coherency between the CPUs when
a wmb() no longer has an effect.

       Arnd
diff mbox series

Patch

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 73ddf2540f3f..6114dceb1ba6 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -847,6 +847,9 @@  static int __of_address_to_resource(struct device_node *dev,
 		return -EINVAL;
 	memset(r, 0, sizeof(struct resource));
 
+	if (of_mmio_is_nonposted(dev))
+		flags |= IORESOURCE_MEM_NONPOSTED;
+
 	r->start = taddr;
 	r->end = taddr + size - 1;
 	r->flags = flags;
@@ -896,7 +899,10 @@  void __iomem *of_iomap(struct device_node *np, int index)
 	if (of_address_to_resource(np, index, &res))
 		return NULL;
 
-	return ioremap(res.start, resource_size(&res));
+	if (res.flags & IORESOURCE_MEM_NONPOSTED)
+		return ioremap_np(res.start, resource_size(&res));
+	else
+		return ioremap(res.start, resource_size(&res));
 }
 EXPORT_SYMBOL(of_iomap);
 
@@ -928,7 +934,11 @@  void __iomem *of_io_request_and_map(struct device_node *np, int index,
 	if (!request_mem_region(res.start, resource_size(&res), name))
 		return IOMEM_ERR_PTR(-EBUSY);
 
-	mem = ioremap(res.start, resource_size(&res));
+	if (res.flags & IORESOURCE_MEM_NONPOSTED)
+		mem = ioremap_np(res.start, resource_size(&res));
+	else
+		mem = ioremap(res.start, resource_size(&res));
+
 	if (!mem) {
 		release_mem_region(res.start, resource_size(&res));
 		return IOMEM_ERR_PTR(-ENOMEM);
@@ -1094,3 +1104,61 @@  bool of_dma_is_coherent(struct device_node *np)
 	return false;
 }
 EXPORT_SYMBOL_GPL(of_dma_is_coherent);
+
+static bool of_nonposted_mmio_quirk(void)
+{
+	if (IS_ENABLED(CONFIG_ARCH_APPLE)) {
+		/* To save cycles, we cache the result for global "Apple ARM" setting */
+		static int quirk_state = -1;
+
+		/* Make quirk cached */
+		if (quirk_state < 0)
+			quirk_state = of_machine_is_compatible("apple,arm-platform");
+		return !!quirk_state;
+	}
+	return false;
+}
+
+/**
+ * of_mmio_is_nonposted - Check if device uses non-posted MMIO
+ * @np:	device node
+ *
+ * Returns true if the "nonposted-mmio" property was found for
+ * the device's bus or a parent. "posted-mmio" has the opposite
+ * effect, terminating recursion and overriding any
+ * "nonposted-mmio" properties in parent buses.
+ *
+ * Recursion terminates if reach a non-translatable boundary
+ * (a node without a 'ranges' property).
+ *
+ * This is currently only enabled on Apple ARM devices, as an
+ * optimization.
+ */
+bool of_mmio_is_nonposted(struct device_node *np)
+{
+	struct device_node *node;
+	struct device_node *parent;
+
+	if (!of_nonposted_mmio_quirk())
+		return false;
+
+	node = of_get_parent(np);
+
+	while (node) {
+		if (!of_property_read_bool(node, "ranges")) {
+			break;
+		} else if (of_property_read_bool(node, "nonposted-mmio")) {
+			of_node_put(node);
+			return true;
+		} else if (of_property_read_bool(node, "posted-mmio")) {
+			break;
+		}
+		parent = of_get_parent(node);
+		of_node_put(node);
+		node = parent;
+	}
+
+	of_node_put(node);
+	return false;
+}
+EXPORT_SYMBOL_GPL(of_mmio_is_nonposted);
diff --git a/include/linux/of_address.h b/include/linux/of_address.h
index 88bc943405cd..88f6333fee6c 100644
--- a/include/linux/of_address.h
+++ b/include/linux/of_address.h
@@ -62,6 +62,7 @@  extern struct of_pci_range *of_pci_range_parser_one(
 					struct of_pci_range_parser *parser,
 					struct of_pci_range *range);
 extern bool of_dma_is_coherent(struct device_node *np);
+extern bool of_mmio_is_nonposted(struct device_node *np);
 #else /* CONFIG_OF_ADDRESS */
 static inline void __iomem *of_io_request_and_map(struct device_node *device,
 						  int index, const char *name)