[RFC] drm/ttm: force cached mappings for system RAM on ARM

Message ID	20190110072841.3283-1-ard.biesheuvel@linaro.org (mailing list archive)
State	RFC
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Ard Biesheuvel <ard.biesheuvel@linaro.org> To: linux-kernel@vger.kernel.org Subject: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM Date: Thu, 10 Jan 2019 08:28:41 +0100 Message-Id: <20190110072841.3283-1-ard.biesheuvel@linaro.org> Precedence: list Cc: Carsten.Haitzler@arm.com, Ard Biesheuvel <ard.biesheuvel@linaro.org>, David Airlie <airlied@linux.ie>, will.deacon@arm.com, dri-devel@lists.freedesktop.org, Huang Rui <ray.huang@amd.com>, Junwei Zhang <Jerry.Zhang@amd.com>, Christian Koenig <christian.koenig@amd.com>, linux-arm-kernel@lists.infradead.org, Bernhard.Rosenkranzer@linaro.org MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	[RFC] drm/ttm: force cached mappings for system RAM on ARM \| expand [RFC] drm/ttm: force cached mappings for system RAM on ARM

Ard Biesheuvel Jan. 10, 2019, 7:28 a.m. UTC

ARM systems do not permit the use of anything other than cached
mappings for system memory, since that memory may be mapped in the
linear region as well, and the architecture does not permit aliases
with mismatched attributes.

So short-circuit the evaluation in ttm_io_prot() if the flags include
TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
attributes immediately.

This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
Without this change, amdgpu does not start at all, and radeon only
produces corrupt display output.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Junwei Zhang <Jerry.Zhang@amd.com>
Cc: David Airlie <airlied@linux.ie>
Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
 1 file changed, 5 insertions(+)

Zhang, Jerry(Junwei) Jan. 10, 2019, 8:36 a.m. UTC | #1

On 1/10/19 3:28 PM, Ard Biesheuvel wrote:
> ARM systems do not permit the use of anything other than cached
> mappings for system memory, since that memory may be mapped in the
> linear region as well, and the architecture does not permit aliases
> with mismatched attributes.
>
> So short-circuit the evaluation in ttm_io_prot() if the flags include
> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> attributes immediately.

It sounds a case for ARM system memory access from CPU only?

If that always applies to ARM memory, suppose we should do that for 
TTM_PL_TT as well.
While TTM_PL_TT | TTM_PL_FLAG_WC is likely to work as below mention.

Regards,
Jerry
> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> Without this change, amdgpu does not start at all, and radeon only
> produces corrupt display output.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 046a6dda690a..0c1eef5f7ae3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
>   	if (caching_flags & TTM_PL_FLAG_CACHED)
>   		return tmp;
>   
> +#if defined(__arm__) || defined(__aarch64__)
> +	/* ARM only permits cached mappings of system memory */
> +	if (caching_flags & TTM_PL_SYSTEM)
> +		return tmp;
> +#endif
>   #if defined(__i386__) || defined(__x86_64__)
>   	if (caching_flags & TTM_PL_FLAG_WC)
>   		tmp = pgprot_writecombine(tmp);

Christian König Jan. 10, 2019, 8:36 a.m. UTC | #2

Hi Ard,

thanks a lot for this! At least somebody who can explain why this 
doesn't work as expected.

The problem is that the hardware actually needs a few pages as uncached 
in a couple of cases to work correctly. So we could still run into 
issues with that solution.

For now we have blocked userspace/kernel from extensively using write 
combine mappings by adjusting drm_arch_can_wc_memory(), but that is 
probably degrading performance quite a bit.

What can be done to improve the solution? At least on X86 we solve this 
by marking the write combined pages in the linear mapping as uncached as 
well. Would this be doable on ARM as well?

Thanks,
Christian.

Am 10.01.19 um 08:28 schrieb Ard Biesheuvel:
> ARM systems do not permit the use of anything other than cached
> mappings for system memory, since that memory may be mapped in the
> linear region as well, and the architecture does not permit aliases
> with mismatched attributes.
>
> So short-circuit the evaluation in ttm_io_prot() if the flags include
> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> attributes immediately.
>
> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> Without this change, amdgpu does not start at all, and radeon only
> produces corrupt display output.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 046a6dda690a..0c1eef5f7ae3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
>   	if (caching_flags & TTM_PL_FLAG_CACHED)
>   		return tmp;
>   
> +#if defined(__arm__) || defined(__aarch64__)
> +	/* ARM only permits cached mappings of system memory */
> +	if (caching_flags & TTM_PL_SYSTEM)
> +		return tmp;
> +#endif
>   #if defined(__i386__) || defined(__x86_64__)
>   	if (caching_flags & TTM_PL_FLAG_WC)
>   		tmp = pgprot_writecombine(tmp);

Michel Dänzer Jan. 10, 2019, 9:34 a.m. UTC | #3

On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
> ARM systems do not permit the use of anything other than cached
> mappings for system memory, since that memory may be mapped in the
> linear region as well, and the architecture does not permit aliases
> with mismatched attributes.
> 
> So short-circuit the evaluation in ttm_io_prot() if the flags include
> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> attributes immediately.
> 
> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> Without this change, amdgpu does not start at all, and radeon only
> produces corrupt display output.
> 
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> index 046a6dda690a..0c1eef5f7ae3 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
>  	if (caching_flags & TTM_PL_FLAG_CACHED)
>  		return tmp;
>  
> +#if defined(__arm__) || defined(__aarch64__)
> +	/* ARM only permits cached mappings of system memory */
> +	if (caching_flags & TTM_PL_SYSTEM)
> +		return tmp;
> +#endif
>  #if defined(__i386__) || defined(__x86_64__)
>  	if (caching_flags & TTM_PL_FLAG_WC)
>  		tmp = pgprot_writecombine(tmp);
> 

Apart from Christian's concerns, I think this is the wrong place for
this, because other TTM / driver code will still consider the memory
uncacheable. E.g. the amdgpu driver will program the GPU to treat the
memory as uncacheable, so it won't participate in cache coherency
protocol for it, which is unlikely to work as expected in general if the
CPU treats the memory as cacheable.

Ard Biesheuvel Jan. 14, 2019, 10:53 a.m. UTC | #4

On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
> > ARM systems do not permit the use of anything other than cached
> > mappings for system memory, since that memory may be mapped in the
> > linear region as well, and the architecture does not permit aliases
> > with mismatched attributes.
> >
> > So short-circuit the evaluation in ttm_io_prot() if the flags include
> > TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> > attributes immediately.
> >
> > This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> > Without this change, amdgpu does not start at all, and radeon only
> > produces corrupt display output.
> >
> > Cc: Christian Koenig <christian.koenig@amd.com>
> > Cc: Huang Rui <ray.huang@amd.com>
> > Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> > Cc: David Airlie <airlied@linux.ie>
> > Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > ---
> >  drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > index 046a6dda690a..0c1eef5f7ae3 100644
> > --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> > +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> > @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
> >       if (caching_flags & TTM_PL_FLAG_CACHED)
> >               return tmp;
> >
> > +#if defined(__arm__) || defined(__aarch64__)
> > +     /* ARM only permits cached mappings of system memory */
> > +     if (caching_flags & TTM_PL_SYSTEM)
> > +             return tmp;
> > +#endif
> >  #if defined(__i386__) || defined(__x86_64__)
> >       if (caching_flags & TTM_PL_FLAG_WC)
> >               tmp = pgprot_writecombine(tmp);
> >
>
> Apart from Christian's concerns, I think this is the wrong place for
> this, because other TTM / driver code will still consider the memory
> uncacheable. E.g. the amdgpu driver will program the GPU to treat the
> memory as uncacheable, so it won't participate in cache coherency
> protocol for it, which is unlikely to work as expected in general if the
> CPU treats the memory as cacheable.
>

Will and I have spent some time digging into this, so allow me to
share some preliminary findings while we carry on and try to fix this
properly.

- The patch above is flawed, i.e., it doesn't do what it intends to
since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies
for that.
- The existence of a linear region mapping with mismatched attributes
is likely not the culprit here. (We do something similar with
non-cache coherent DMA in other places).
- The reason remapping the CPU side as cacheable does work (which I
did test) is because the GPU's uncacheable accesses (which I assume
are made using the NoSnoop PCIe transaction attribute) are actually
emitted as cacheable in some cases.
  . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
must use cacheable accesses from the CPU side or things are broken.
This might be a h/w flaw, though.
  . On systems with stage 1+2 SMMUs, the driver uses stage 1
translations which always override the memory attributes to cacheable
for DMA coherent devices. This is what is affecting the Cavium
ThunderX2 (although it appears the attributes emitted by the RC may be
incorrect as well.)

The latter issue is a shortcoming in the SMMU driver that we have to
fix, i.e., it should take care not to modify the incoming attributes
of DMA coherent PCIe devices for NoSnoop to be able to work.

So in summary, the mismatch appears to be between the CPU accessing
the vmap region with non-cacheable attributes and the GPU accessing
the same memory with cacheable attributes, resulting in a loss of
coherency and lots of visible corruption.

To be able to debug this further, could you elaborate a bit on
- How does the hardware emit those uncached/wc inbound accesses? Do
they rely on NoSnoop?
- Christian pointed out that some accesses must be uncached even when
not using WC. What kind of accesses are those? And do they access
system RAM?

Christian König Jan. 14, 2019, 11:38 a.m. UTC | #5

Am 14.01.19 um 11:53 schrieb Ard Biesheuvel:
> On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote:
>> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
>>> ARM systems do not permit the use of anything other than cached
>>> mappings for system memory, since that memory may be mapped in the
>>> linear region as well, and the architecture does not permit aliases
>>> with mismatched attributes.
>>>
>>> So short-circuit the evaluation in ttm_io_prot() if the flags include
>>> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
>>> attributes immediately.
>>>
>>> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
>>> Without this change, amdgpu does not start at all, and radeon only
>>> produces corrupt display output.
>>>
>>> Cc: Christian Koenig <christian.koenig@amd.com>
>>> Cc: Huang Rui <ray.huang@amd.com>
>>> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
>>> Cc: David Airlie <airlied@linux.ie>
>>> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
>>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>>> ---
>>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
>>>   1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> index 046a6dda690a..0c1eef5f7ae3 100644
>>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
>>> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
>>>        if (caching_flags & TTM_PL_FLAG_CACHED)
>>>                return tmp;
>>>
>>> +#if defined(__arm__) || defined(__aarch64__)
>>> +     /* ARM only permits cached mappings of system memory */
>>> +     if (caching_flags & TTM_PL_SYSTEM)
>>> +             return tmp;
>>> +#endif
>>>   #if defined(__i386__) || defined(__x86_64__)
>>>        if (caching_flags & TTM_PL_FLAG_WC)
>>>                tmp = pgprot_writecombine(tmp);
>>>
>> Apart from Christian's concerns, I think this is the wrong place for
>> this, because other TTM / driver code will still consider the memory
>> uncacheable. E.g. the amdgpu driver will program the GPU to treat the
>> memory as uncacheable, so it won't participate in cache coherency
>> protocol for it, which is unlikely to work as expected in general if the
>> CPU treats the memory as cacheable.
>>
> Will and I have spent some time digging into this, so allow me to
> share some preliminary findings while we carry on and try to fix this
> properly.
>
> - The patch above is flawed, i.e., it doesn't do what it intends to
> since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies
> for that.
> - The existence of a linear region mapping with mismatched attributes
> is likely not the culprit here. (We do something similar with
> non-cache coherent DMA in other places).

This is still rather problematic.

The issue is that we often don't create a vmap for a page, but rather 
access the page directly using the linear mapping.

So we would use the wrong access type here.

> - The reason remapping the CPU side as cacheable does work (which I
> did test) is because the GPU's uncacheable accesses (which I assume
> are made using the NoSnoop PCIe transaction attribute) are actually
> emitted as cacheable in some cases.
>    . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
> must use cacheable accesses from the CPU side or things are broken.
> This might be a h/w flaw, though.
>    . On systems with stage 1+2 SMMUs, the driver uses stage 1
> translations which always override the memory attributes to cacheable
> for DMA coherent devices. This is what is affecting the Cavium
> ThunderX2 (although it appears the attributes emitted by the RC may be
> incorrect as well.)
>
> The latter issue is a shortcoming in the SMMU driver that we have to
> fix, i.e., it should take care not to modify the incoming attributes
> of DMA coherent PCIe devices for NoSnoop to be able to work.
>
> So in summary, the mismatch appears to be between the CPU accessing
> the vmap region with non-cacheable attributes and the GPU accessing
> the same memory with cacheable attributes, resulting in a loss of
> coherency and lots of visible corruption.

Actually it is the other way around. The CPU thinks some data is in the 
cache and the GPU only updates the system memory version because the 
snoop flag is not set.

> To be able to debug this further, could you elaborate a bit on
> - How does the hardware emit those uncached/wc inbound accesses? Do
> they rely on NoSnoop?

The GPU has a separate page walker in the MC and the page tables there 
have a bits saying if the access should go to the PCIe bus and if yes if 
the snoop bit should be set.

> - Christian pointed out that some accesses must be uncached even when
> not using WC. What kind of accesses are those? And do they access
> system RAM?

On some hardware generations we have a buggy engine which fails to 
forward the snoop bit and because of this the system memory page used by 
this engine must be uncached. But this only applies if you use ROCm in a 
specific configuration.

Regards,
Christian.

Ard Biesheuvel Jan. 14, 2019, 5:32 p.m. UTC | #6

On Mon, 14 Jan 2019 at 12:38, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 14.01.19 um 11:53 schrieb Ard Biesheuvel:
> > On Thu, 10 Jan 2019 at 10:34, Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2019-01-10 8:28 a.m., Ard Biesheuvel wrote:
> >>> ARM systems do not permit the use of anything other than cached
> >>> mappings for system memory, since that memory may be mapped in the
> >>> linear region as well, and the architecture does not permit aliases
> >>> with mismatched attributes.
> >>>
> >>> So short-circuit the evaluation in ttm_io_prot() if the flags include
> >>> TTM_PL_SYSTEM when running on ARM or arm64, and just return cached
> >>> attributes immediately.
> >>>
> >>> This fixes the radeon and amdgpu [TBC] drivers when running on arm64.
> >>> Without this change, amdgpu does not start at all, and radeon only
> >>> produces corrupt display output.
> >>>
> >>> Cc: Christian Koenig <christian.koenig@amd.com>
> >>> Cc: Huang Rui <ray.huang@amd.com>
> >>> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> >>> Cc: David Airlie <airlied@linux.ie>
> >>> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> >>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >>> ---
> >>>   drivers/gpu/drm/ttm/ttm_bo_util.c | 5 +++++
> >>>   1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
> >>> index 046a6dda690a..0c1eef5f7ae3 100644
> >>> --- a/drivers/gpu/drm/ttm/ttm_bo_util.c
> >>> +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
> >>> @@ -530,6 +530,11 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
> >>>        if (caching_flags & TTM_PL_FLAG_CACHED)
> >>>                return tmp;
> >>>
> >>> +#if defined(__arm__) || defined(__aarch64__)
> >>> +     /* ARM only permits cached mappings of system memory */
> >>> +     if (caching_flags & TTM_PL_SYSTEM)
> >>> +             return tmp;
> >>> +#endif
> >>>   #if defined(__i386__) || defined(__x86_64__)
> >>>        if (caching_flags & TTM_PL_FLAG_WC)
> >>>                tmp = pgprot_writecombine(tmp);
> >>>
> >> Apart from Christian's concerns, I think this is the wrong place for
> >> this, because other TTM / driver code will still consider the memory
> >> uncacheable. E.g. the amdgpu driver will program the GPU to treat the
> >> memory as uncacheable, so it won't participate in cache coherency
> >> protocol for it, which is unlikely to work as expected in general if the
> >> CPU treats the memory as cacheable.
> >>
> > Will and I have spent some time digging into this, so allow me to
> > share some preliminary findings while we carry on and try to fix this
> > properly.
> >
> > - The patch above is flawed, i.e., it doesn't do what it intends to
> > since it uses TTM_PL_SYSTEM instead of TTM_PL_FLAG_SYSTEM. Apologies
> > for that.
> > - The existence of a linear region mapping with mismatched attributes
> > is likely not the culprit here. (We do something similar with
> > non-cache coherent DMA in other places).
>
> This is still rather problematic.
>
> The issue is that we often don't create a vmap for a page, but rather
> access the page directly using the linear mapping.
>
> So we would use the wrong access type here.
>

Yes. But how are these accesses done? Are they wrapped in a kmap()?

> > - The reason remapping the CPU side as cacheable does work (which I
> > did test) is because the GPU's uncacheable accesses (which I assume
> > are made using the NoSnoop PCIe transaction attribute) are actually
> > emitted as cacheable in some cases.
> >    . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
> > must use cacheable accesses from the CPU side or things are broken.
> > This might be a h/w flaw, though.
> >    . On systems with stage 1+2 SMMUs, the driver uses stage 1
> > translations which always override the memory attributes to cacheable
> > for DMA coherent devices. This is what is affecting the Cavium
> > ThunderX2 (although it appears the attributes emitted by the RC may be
> > incorrect as well.)
> >
> > The latter issue is a shortcoming in the SMMU driver that we have to
> > fix, i.e., it should take care not to modify the incoming attributes
> > of DMA coherent PCIe devices for NoSnoop to be able to work.
> >
> > So in summary, the mismatch appears to be between the CPU accessing
> > the vmap region with non-cacheable attributes and the GPU accessing
> > the same memory with cacheable attributes, resulting in a loss of
> > coherency and lots of visible corruption.
>
> Actually it is the other way around. The CPU thinks some data is in the
> cache and the GPU only updates the system memory version because the
> snoop flag is not set.
>

That doesn't seem to be what is happening. As far as we can tell from
our experiments, all inbound transactions are always cacheable, and so
the only way to make things work is to ensure that the CPU uses the
same attributes.

> > To be able to debug this further, could you elaborate a bit on
> > - How does the hardware emit those uncached/wc inbound accesses? Do
> > they rely on NoSnoop?
>
> The GPU has a separate page walker in the MC and the page tables there
> have a bits saying if the access should go to the PCIe bus and if yes if
> the snoop bit should be set.
>
> > - Christian pointed out that some accesses must be uncached even when
> > not using WC. What kind of accesses are those? And do they access
> > system RAM?
>
> On some hardware generations we have a buggy engine which fails to
> forward the snoop bit and because of this the system memory page used by
> this engine must be uncached. But this only applies if you use ROCm in a
> specific configuration.
>

OK. I don't know what that means tbh. Does this apply to both radeon and amdgpu?

Will Deacon Jan. 14, 2019, 7:13 p.m. UTC | #7

On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote:
> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:
>             - The reason remapping the CPU side as cacheable does work (which I
>             did test) is because the GPU's uncacheable accesses (which I assume
>             are made using the NoSnoop PCIe transaction attribute) are actually
>             emitted as cacheable in some cases.
>                . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
>             must use cacheable accesses from the CPU side or things are broken.
>             This might be a h/w flaw, though.
>                . On systems with stage 1+2 SMMUs, the driver uses stage 1
>             translations which always override the memory attributes to cacheable
>             for DMA coherent devices. This is what is affecting the Cavium
>             ThunderX2 (although it appears the attributes emitted by the RC may be
>             incorrect as well.)
> 
>             The latter issue is a shortcoming in the SMMU driver that we have to
>             fix, i.e., it should take care not to modify the incoming attributes
>             of DMA coherent PCIe devices for NoSnoop to be able to work.
> 
>             So in summary, the mismatch appears to be between the CPU accessing
>             the vmap region with non-cacheable attributes and the GPU accessing
>             the same memory with cacheable attributes, resulting in a loss of
>             coherency and lots of visible corruption.
> 
>         Actually it is the other way around. The CPU thinks some data is in the
>         cache and the GPU only updates the system memory version because the
>         snoop flag is not set.
> 
> 
>     That doesn't seem to be what is happening. As far as we can tell from
>     our experiments, all inbound transactions are always cacheable, and so
>     the only way to make things work is to ensure that the CPU uses the
>     same attributes.
> 
> 
> Ok that doesn't make any sense. If inbound transactions are cacheable or not is
> irrelevant when the CPU always uses uncached accesses.
> 
> See on the PCIe side you have the snoop bit in the read/write transactions
> which tells the root hub if the device wants to snoop caches or not.
> 
> When the CPU accesses some memory as cached then devices need to snoop the
> cache for coherent accesses.
> 
> When the CPU accesses some memory as uncached then devices can disable snooping
> to improve performance, but when they don't do this it is mandated by the spec
> that this still works.

Which spec? The Arm architecture (and others including Power afaiu) doesn't
guarantee coherency when memory is accessed using mismatched cacheability
attributes.

Will

Christian König Jan. 14, 2019, 7:21 p.m. UTC | #8

Am 14.01.19 um 20:13 schrieb Will Deacon:
> On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote:
>> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:
>>              - The reason remapping the CPU side as cacheable does work (which I
>>              did test) is because the GPU's uncacheable accesses (which I assume
>>              are made using the NoSnoop PCIe transaction attribute) are actually
>>              emitted as cacheable in some cases.
>>                 . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
>>              must use cacheable accesses from the CPU side or things are broken.
>>              This might be a h/w flaw, though.
>>                 . On systems with stage 1+2 SMMUs, the driver uses stage 1
>>              translations which always override the memory attributes to cacheable
>>              for DMA coherent devices. This is what is affecting the Cavium
>>              ThunderX2 (although it appears the attributes emitted by the RC may be
>>              incorrect as well.)
>>
>>              The latter issue is a shortcoming in the SMMU driver that we have to
>>              fix, i.e., it should take care not to modify the incoming attributes
>>              of DMA coherent PCIe devices for NoSnoop to be able to work.
>>
>>              So in summary, the mismatch appears to be between the CPU accessing
>>              the vmap region with non-cacheable attributes and the GPU accessing
>>              the same memory with cacheable attributes, resulting in a loss of
>>              coherency and lots of visible corruption.
>>
>>          Actually it is the other way around. The CPU thinks some data is in the
>>          cache and the GPU only updates the system memory version because the
>>          snoop flag is not set.
>>
>>
>>      That doesn't seem to be what is happening. As far as we can tell from
>>      our experiments, all inbound transactions are always cacheable, and so
>>      the only way to make things work is to ensure that the CPU uses the
>>      same attributes.
>>
>>
>> Ok that doesn't make any sense. If inbound transactions are cacheable or not is
>> irrelevant when the CPU always uses uncached accesses.
>>
>> See on the PCIe side you have the snoop bit in the read/write transactions
>> which tells the root hub if the device wants to snoop caches or not.
>>
>> When the CPU accesses some memory as cached then devices need to snoop the
>> cache for coherent accesses.
>>
>> When the CPU accesses some memory as uncached then devices can disable snooping
>> to improve performance, but when they don't do this it is mandated by the spec
>> that this still works.
> Which spec?

The PCIe spec. The snoop bit (or rather the NoSnoop) in the transaction 
is perfectly optional IIRC.

> The Arm architecture (and others including Power afaiu) doesn't
> guarantee coherency when memory is accessed using mismatched cacheability
> attributes.

Well what exactly goes wrong on ARM?

As far as I know Power doesn't really supports un-cached memory at all, 
except for a very very old and odd configuration with AGP.

I mean in theory I agree that devices should use matching cacheability 
attributes, but in practice I know of quite a bunch of devices/engines 
which fails to do this correctly.

Regards,
Christian.

>
> Will

Will Deacon Jan. 14, 2019, 7:35 p.m. UTC | #9

[+ BenH and MPE]

On Mon, Jan 14, 2019 at 07:21:08PM +0000, Koenig, Christian wrote:
> Am 14.01.19 um 20:13 schrieb Will Deacon:
> > On Mon, Jan 14, 2019 at 07:07:54PM +0000, Koenig, Christian wrote:
> >> Am 14.01.19 um 18:32 schrieb Ard Biesheuvel:
> >>              - The reason remapping the CPU side as cacheable does work (which I
> >>              did test) is because the GPU's uncacheable accesses (which I assume
> >>              are made using the NoSnoop PCIe transaction attribute) are actually
> >>              emitted as cacheable in some cases.
> >>                 . On my AMD Seattle, with or without SMMU (which is stage 2 only), I
> >>              must use cacheable accesses from the CPU side or things are broken.
> >>              This might be a h/w flaw, though.
> >>                 . On systems with stage 1+2 SMMUs, the driver uses stage 1
> >>              translations which always override the memory attributes to cacheable
> >>              for DMA coherent devices. This is what is affecting the Cavium
> >>              ThunderX2 (although it appears the attributes emitted by the RC may be
> >>              incorrect as well.)
> >>
> >>              The latter issue is a shortcoming in the SMMU driver that we have to
> >>              fix, i.e., it should take care not to modify the incoming attributes
> >>              of DMA coherent PCIe devices for NoSnoop to be able to work.
> >>
> >>              So in summary, the mismatch appears to be between the CPU accessing
> >>              the vmap region with non-cacheable attributes and the GPU accessing
> >>              the same memory with cacheable attributes, resulting in a loss of
> >>              coherency and lots of visible corruption.
> >>
> >>          Actually it is the other way around. The CPU thinks some data is in the
> >>          cache and the GPU only updates the system memory version because the
> >>          snoop flag is not set.
> >>
> >>
> >>      That doesn't seem to be what is happening. As far as we can tell from
> >>      our experiments, all inbound transactions are always cacheable, and so
> >>      the only way to make things work is to ensure that the CPU uses the
> >>      same attributes.
> >>
> >>
> >> Ok that doesn't make any sense. If inbound transactions are cacheable or not is
> >> irrelevant when the CPU always uses uncached accesses.
> >>
> >> See on the PCIe side you have the snoop bit in the read/write transactions
> >> which tells the root hub if the device wants to snoop caches or not.
> >>
> >> When the CPU accesses some memory as cached then devices need to snoop the
> >> cache for coherent accesses.
> >>
> >> When the CPU accesses some memory as uncached then devices can disable snooping
> >> to improve performance, but when they don't do this it is mandated by the spec
> >> that this still works.
> > Which spec?
> 
> The PCIe spec. The snoop bit (or rather the NoSnoop) in the transaction 
> is perfectly optional IIRC.

Thanks for the clarification. I suspect the devil is in the details, so I'll
try to dig up the spec.

> > The Arm architecture (and others including Power afaiu) doesn't
> > guarantee coherency when memory is accessed using mismatched cacheability
> > attributes.
> 
> Well what exactly goes wrong on ARM?

Coherency (and any ordering guarantees) can be lost, so the device may see a
stale copy of the memory it is accessing. The architecture requires cache
maintenance to restore coherency between the mismatched aliases.

> As far as I know Power doesn't really supports un-cached memory at all, 
> except for a very very old and odd configuration with AGP.

Hopefully Michael/Ben can elaborate here, but I was under the (possibly
mistaken) impression that mismatched attributes could cause a machine-check
on Power.

> I mean in theory I agree that devices should use matching cacheability 
> attributes, but in practice I know of quite a bunch of devices/engines 
> which fails to do this correctly.

Given that the experiences of Ard and I so far has been that the system
ends up making everything cacheable after the RC, perhaps that's an attempt
by system designers to correct for these devices. Unfortunately, it doesn't
help if the CPU carefully goes ahead and establishes a non-cacheable mapping
for itself!

Will

Michael Ellerman Jan. 15, 2019, 11:31 a.m. UTC | #10

Hi Will,

Will Deacon <will.deacon@arm.com> writes:
> [+ BenH and MPE]
>
> On Mon, Jan 14, 2019 at 07:21:08PM +0000, Koenig, Christian wrote:
>> Am 14.01.19 um 20:13 schrieb Will Deacon:
...
>
>> > The Arm architecture (and others including Power afaiu) doesn't
>> > guarantee coherency when memory is accessed using mismatched cacheability
>> > attributes.
...
>
>> As far as I know Power doesn't really supports un-cached memory at all, 
>> except for a very very old and odd configuration with AGP.
>
> Hopefully Michael/Ben can elaborate here, but I was under the (possibly
> mistaken) impression that mismatched attributes could cause a machine-check
> on Power.

That's what I've always been told, but I can't actually find where it's
documented, I'll keep searching.

But you're right that mixing cached / uncached is not really supported,
and probably results in a machine check or worse.

cheers

Benjamin Herrenschmidt Jan. 16, 2019, 12:33 a.m. UTC | #11

On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote:
> > > As far as I know Power doesn't really supports un-cached memory at all, 
> > > except for a very very old and odd configuration with AGP.
> > 
> > Hopefully Michael/Ben can elaborate here, but I was under the (possibly
> > mistaken) impression that mismatched attributes could cause a machine-check
> > on Power.
> 
> That's what I've always been told, but I can't actually find where it's
> documented, I'll keep searching.
> 
> But you're right that mixing cached / uncached is not really supported,
> and probably results in a machine check or worse.

 .. or worse :) It could checkstop.

It's also my understanding that on ARM v7 and above, it's technically
forbidden to map the same physical page with both cached and non-cached 
mappings, since the cached one could prefetch (or speculatively load),
thus creating collisions and inconsistencies. Am I wrong here ?

The old hack of using non-cached mapping to avoid snoop cost in AGP and
others is just that ... an ugly and horrible hacks that should have
never eventuated, when the search for performance pushes HW people into
utter insanity :)

Cheers,
Ben.

Christian König Jan. 16, 2019, 7:35 a.m. UTC | #12

Am 16.01.19 um 01:33 schrieb Benjamin Herrenschmidt:
> On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote:
>>>> As far as I know Power doesn't really supports un-cached memory at all,
>>>> except for a very very old and odd configuration with AGP.
>>> Hopefully Michael/Ben can elaborate here, but I was under the (possibly
>>> mistaken) impression that mismatched attributes could cause a machine-check
>>> on Power.
>> That's what I've always been told, but I can't actually find where it's
>> documented, I'll keep searching.
>>
>> But you're right that mixing cached / uncached is not really supported,
>> and probably results in a machine check or worse.
>   .. or worse :) It could checkstop.

Not sure if that would be so bad, it would at least give us a clear 
indicator that something is wrong instead of silently corrupting data.

> It's also my understanding that on ARM v7 and above, it's technically
> forbidden to map the same physical page with both cached and non-cached
> mappings, since the cached one could prefetch (or speculatively load),
> thus creating collisions and inconsistencies. Am I wrong here ?

No, but you answer the wrong question.

See we don't want to have different mappings of cached and non-cached on 
the CPU, but rather want to know if a snooped DMA from the PCIe counts 
as cached access as well.

As far as I know on x86 it doesn't, so when you have an un-cached page 
you can still access it with a snooping DMA read/write operation and 
don't cause trouble.

> The old hack of using non-cached mapping to avoid snoop cost in AGP and
> others is just that ... an ugly and horrible hacks that should have
> never eventuated, when the search for performance pushes HW people into
> utter insanity :)

Well I agree that un-cached system memory makes things much more 
complicated for a questionable gain.

But fact is we now have to deal with the mess, so no point in 
complaining about it to much :)

Cheers,
Christian.

>
> Cheers,
> Ben.
>
>
>

Ard Biesheuvel Jan. 16, 2019, 7:47 a.m. UTC | #13

On Wed, 16 Jan 2019 at 08:36, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 16.01.19 um 01:33 schrieb Benjamin Herrenschmidt:
> > On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote:
> >>>> As far as I know Power doesn't really supports un-cached memory at all,
> >>>> except for a very very old and odd configuration with AGP.
> >>> Hopefully Michael/Ben can elaborate here, but I was under the (possibly
> >>> mistaken) impression that mismatched attributes could cause a machine-check
> >>> on Power.
> >> That's what I've always been told, but I can't actually find where it's
> >> documented, I'll keep searching.
> >>
> >> But you're right that mixing cached / uncached is not really supported,
> >> and probably results in a machine check or worse.
> >   .. or worse :) It could checkstop.
>
> Not sure if that would be so bad, it would at least give us a clear
> indicator that something is wrong instead of silently corrupting data.
>
> > It's also my understanding that on ARM v7 and above, it's technically
> > forbidden to map the same physical page with both cached and non-cached
> > mappings, since the cached one could prefetch (or speculatively load),
> > thus creating collisions and inconsistencies. Am I wrong here ?
>
> No, but you answer the wrong question.
>
> See we don't want to have different mappings of cached and non-cached on
> the CPU, but rather want to know if a snooped DMA from the PCIe counts
> as cached access as well.
>
> As far as I know on x86 it doesn't, so when you have an un-cached page
> you can still access it with a snooping DMA read/write operation and
> don't cause trouble.
>

I think it is the other way around. The question is, on an otherwise
cache coherent device, whether the NoSnoop attribute set by the GPU
propagates all the way to the bus so that it bypasses the caches.

On x86, we can tolerate if this is not the case, since uncached memory
accesses by the CPU snoop the caches as well.

On other architectures, uncached accesses go straight to main memory,
so if the device wrote anything to the caches we won't see it.

So to use this optimization, you have to either be 100% sure that
NoSnoop is implemented correctly, or have a x86 CPU.

> > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > others is just that ... an ugly and horrible hacks that should have
> > never eventuated, when the search for performance pushes HW people into
> > utter insanity :)
>
> Well I agree that un-cached system memory makes things much more
> complicated for a questionable gain.
>
> But fact is we now have to deal with the mess, so no point in
> complaining about it to much :)
>

Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y

Benjamin Herrenschmidt Jan. 17, 2019, 5:59 a.m. UTC | #14

On Wed, 2019-01-16 at 07:35 +0000, Koenig, Christian wrote:
> No, but you answer the wrong question.
> 
> See we don't want to have different mappings of cached and non-cached on 
> the CPU, but rather want to know if a snooped DMA from the PCIe counts 
> as cached access as well.
> 
> As far as I know on x86 it doesn't, so when you have an un-cached page 
> you can still access it with a snooping DMA read/write operation and 
> don't cause trouble.

Hrm... well, if you map it uncached on the CPU on powerpc, a snoop DMA
will work fine too, it won't hit any cache. The only problem I'm aware
of is a core (or CAPI device) emiting non-cached load/stores colliding
with a cache snooper.

> > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > others is just that ... an ugly and horrible hacks that should have
> > never eventuated, when the search for performance pushes HW people into
> > utter insanity :)
> 
> Well I agree that un-cached system memory makes things much more 
> complicated for a questionable gain.
> 
> But fact is we now have to deal with the mess, so no point in 
> complaining about it to much :)

I wish we could just sent the HW designers home and tell them we won't
support that crap... oh well.

Ben.

> Cheers,
> Christian.
> 
> > Cheers,
> > Ben.
> > 
> >

Benjamin Herrenschmidt Jan. 17, 2019, 6:07 a.m. UTC | #15

On Wed, 2019-01-16 at 08:47 +0100, Ard Biesheuvel wrote:
> > As far as I know on x86 it doesn't, so when you have an un-cached page
> > you can still access it with a snooping DMA read/write operation and
> > don't cause trouble.
> > 
> 
> I think it is the other way around. The question is, on an otherwise
> cache coherent device, whether the NoSnoop attribute set by the GPU
> propagates all the way to the bus so that it bypasses the caches.

On powerpc it's ignored, all DMA accesses will be snooped. But that's
fine regardless of whether the memory was mapped cachable or not, the
snooper will simply not find anything if not. I *think* we only do
cache inject if the line already exists in one of the caches.

> On x86, we can tolerate if this is not the case, since uncached memory
> accesses by the CPU snoop the caches as well.
> 
> On other architectures, uncached accesses go straight to main memory,
> so if the device wrote anything to the caches we won't see it.

Well, on all powerpc implementations that I am aware of at least (dunno
about ARM), they do, but we don't have a problem because I don't think
the devices can/will write to the caches directly unless a
corresponding line already exists (but I might be wrong, we need to
double check all implementations which is tricky).

I am not aware of any powerpc chip implementing NoSnoop.

> So to use this optimization, you have to either be 100% sure that
> NoSnoop is implemented correctly, or have a x86 CPU.
> 
> > > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > > others is just that ... an ugly and horrible hacks that should have
> > > never eventuated, when the search for performance pushes HW people into
> > > utter insanity :)
> > 
> > Well I agree that un-cached system memory makes things much more
> > complicated for a questionable gain.
> > 
> > But fact is we now have to deal with the mess, so no point in
> > complaining about it to much :)
> > 
> 
> Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y

The question is whether DMA from a device can instanciate cache lines
in your system. This a system specific rather than architecture
specific question I suspect...

Cheers,
Ben.

Ard Biesheuvel Jan. 17, 2019, 8:02 a.m. UTC | #16

On Thu, 17 Jan 2019 at 07:07, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> On Wed, 2019-01-16 at 08:47 +0100, Ard Biesheuvel wrote:
> > > As far as I know on x86 it doesn't, so when you have an un-cached page
> > > you can still access it with a snooping DMA read/write operation and
> > > don't cause trouble.
> > >
> >
> > I think it is the other way around. The question is, on an otherwise
> > cache coherent device, whether the NoSnoop attribute set by the GPU
> > propagates all the way to the bus so that it bypasses the caches.
>
> On powerpc it's ignored, all DMA accesses will be snooped. But that's
> fine regardless of whether the memory was mapped cachable or not, the
> snooper will simply not find anything if not. I *think* we only do
> cache inject if the line already exists in one of the caches.
>

Others should correct me if I am wrong, but arm64 SoCs often have L3
system caches, and I would expect inbound transactions with writeback
write-allocate (WBWA) attributes to allocate there.

> > On x86, we can tolerate if this is not the case, since uncached memory
> > accesses by the CPU snoop the caches as well.
> >
> > On other architectures, uncached accesses go straight to main memory,
> > so if the device wrote anything to the caches we won't see it.
>
> Well, on all powerpc implementations that I am aware of at least (dunno
> about ARM), they do, but we don't have a problem because I don't think
> the devices can/will write to the caches directly unless a
> corresponding line already exists (but I might be wrong, we need to
> double check all implementations which is tricky).
>
> I am not aware of any powerpc chip implementing NoSnoop.
>

Do you have any history on why this optimization is disabled for power
unless CONFIG_NOT_CACHE_COHERENT is set?

That also begs the question how any of this is supposed to work with
non-cache coherent DMA. The code appears to always assume cache
coherent, and provide non-cache coherent as an optimization if
dma_arch_can_wc_memory() returns true. So I wonder if that helper
should take a struct device pointer instead, and return true for
non-cache coherent devices.

> > So to use this optimization, you have to either be 100% sure that
> > NoSnoop is implemented correctly, or have a x86 CPU.
> >
> > > > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > > > others is just that ... an ugly and horrible hacks that should have
> > > > never eventuated, when the search for performance pushes HW people into
> > > > utter insanity :)
> > >
> > > Well I agree that un-cached system memory makes things much more
> > > complicated for a questionable gain.
> > >
> > > But fact is we now have to deal with the mess, so no point in
> > > complaining about it to much :)
> > >
> >
> > Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y
>
> The question is whether DMA from a device can instanciate cache lines
> in your system. This a system specific rather than architecture
> specific question I suspect...
>

The ARM architecture permits it, afaict, and write-allocate is a hint
so the implementation is free to ignore it, whether it is set or
cleared.

[RFC] drm/ttm: force cached mappings for system RAM on ARM

Commit Message

Comments

Patch