diff mbox series

[RFC] drm: disable WC optimization for cache coherent devices on non-x86

Message ID 20190121100617.2311-1-ard.biesheuvel@linaro.org (mailing list archive)
State RFC
Headers show
Series [RFC] drm: disable WC optimization for cache coherent devices on non-x86 | expand

Commit Message

Ard Biesheuvel Jan. 21, 2019, 10:06 a.m. UTC
Currently, the DRM code assumes that PCI devices are always cache
coherent for DMA, and that this can be selectively overridden for
some buffers using non-cached mappings on the CPU side and PCIe
NoSnoop transactions on the bus side.

Whether the NoSnoop part is implemented correctly is highly platform
specific. Whether it /matters/ if NoSnoop is implemented correctly or
not is architecture specific: on x86, such transactions are coherent
with the CPU whether the NoSnoop attribute is honored or not. On other
architectures, it depends on whether such transactions may allocate in
caches that are non-coherent with the CPU's uncached mappings.

Bottom line is that we should not rely on this optimization to work
correctly for cache coherent devices in the general case. On the
other hand, disabling this optimization for non-coherent devices
is likely to cause breakage as well, since the driver will assume
cache coherent PCIe if this optimization is turned off.

So rename drm_arch_can_wc_memory() to drm_device_can_wc_memory(), and
pass the drm_device pointer into it so we can base the return value
on whether the device is cache coherent or not if not running on
X86.

Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Zhou <David1.Zhou@amd.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Junwei Zhang <Jerry.Zhang@amd.com>
Cc: Michel Daenzer <michel.daenzer@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <maxime.ripard@bootlin.com>
Cc: Sean Paul <sean@poorly.run>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Will Deacon <will.deacon@arm.com>
Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
This is a followup to '[RFC PATCH] drm/ttm: force cached mappings for system
RAM on ARM'

https://lore.kernel.org/linux-arm-kernel/20190110072841.3283-1-ard.biesheuvel@linaro.org/

Without t
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  2 +-
 drivers/gpu/drm/radeon/radeon_object.c     |  2 +-
 include/drm/drm_cache.h                    | 19 +++++++++++--------
 3 files changed, 13 insertions(+), 10 deletions(-)

Comments

Christian König Jan. 21, 2019, 10:11 a.m. UTC | #1
Am 21.01.19 um 11:06 schrieb Ard Biesheuvel:
> Currently, the DRM code assumes that PCI devices are always cache
> coherent for DMA, and that this can be selectively overridden for
> some buffers using non-cached mappings on the CPU side and PCIe
> NoSnoop transactions on the bus side.
>
> Whether the NoSnoop part is implemented correctly is highly platform
> specific. Whether it /matters/ if NoSnoop is implemented correctly or
> not is architecture specific: on x86, such transactions are coherent
> with the CPU whether the NoSnoop attribute is honored or not. On other
> architectures, it depends on whether such transactions may allocate in
> caches that are non-coherent with the CPU's uncached mappings.
>
> Bottom line is that we should not rely on this optimization to work
> correctly for cache coherent devices in the general case. On the
> other hand, disabling this optimization for non-coherent devices
> is likely to cause breakage as well, since the driver will assume
> cache coherent PCIe if this optimization is turned off.
>
> So rename drm_arch_can_wc_memory() to drm_device_can_wc_memory(), and
> pass the drm_device pointer into it so we can base the return value
> on whether the device is cache coherent or not if not running on
> X86.
>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: David Zhou <David1.Zhou@amd.com>
> Cc: Huang Rui <ray.huang@amd.com>
> Cc: Junwei Zhang <Jerry.Zhang@amd.com>
> Cc: Michel Daenzer <michel.daenzer@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
> Cc: Maxime Ripard <maxime.ripard@bootlin.com>
> Cc: Sean Paul <sean@poorly.run>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Will Deacon <will.deacon@arm.com>
> Reported-by: Carsten Haitzler <Carsten.Haitzler@arm.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Looks sane to me, but I can't fully judge if the check is actually 
correct or not.

So Acked-by: Christian König <christian.koenig@amd.com>

> ---
> This is a followup to '[RFC PATCH] drm/ttm: force cached mappings for system
> RAM on ARM'
>
> https://lore.kernel.org/linux-arm-kernel/20190110072841.3283-1-ard.biesheuvel@linaro.org/
>
> Without t
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c |  2 +-
>   drivers/gpu/drm/radeon/radeon_object.c     |  2 +-
>   include/drm/drm_cache.h                    | 19 +++++++++++--------
>   3 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 728e15e5d68a..777fa251838f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -480,7 +480,7 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>   	/* For architectures that don't support WC memory,
>   	 * mask out the WC flag from the BO
>   	 */
> -	if (!drm_arch_can_wc_memory())
> +	if (!drm_device_can_wc_memory(adev->ddev))
>   		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
>   #endif
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
> index 833e909706a9..610889bf6ab5 100644
> --- a/drivers/gpu/drm/radeon/radeon_object.c
> +++ b/drivers/gpu/drm/radeon/radeon_object.c
> @@ -249,7 +249,7 @@ int radeon_bo_create(struct radeon_device *rdev,
>   	/* For architectures that don't support WC memory,
>   	 * mask out the WC flag from the BO
>   	 */
> -	if (!drm_arch_can_wc_memory())
> +	if (!drm_device_can_wc_memory(rdev->ddev))
>   		bo->flags &= ~RADEON_GEM_GTT_WC;
>   #endif
>   
> diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
> index bfe1639df02d..ced63b1207a3 100644
> --- a/include/drm/drm_cache.h
> +++ b/include/drm/drm_cache.h
> @@ -33,6 +33,8 @@
>   #ifndef _DRM_CACHE_H_
>   #define _DRM_CACHE_H_
>   
> +#include <drm/drm_device.h>
> +#include <linux/dma-noncoherent.h>
>   #include <linux/scatterlist.h>
>   
>   void drm_clflush_pages(struct page *pages[], unsigned long num_pages);
> @@ -41,15 +43,16 @@ void drm_clflush_virt_range(void *addr, unsigned long length);
>   u64 drm_get_max_iomem(void);
>   
>   
> -static inline bool drm_arch_can_wc_memory(void)
> +static inline bool drm_device_can_wc_memory(struct drm_device *ddev)
>   {
> -#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> -	return false;
> -#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
> -	return false;
> -#else
> -	return true;
> -#endif
> +	if (IS_ENABLED(CONFIG_PPC))
> +		return IS_ENABLED(CONFIG_NOT_COHERENT_CACHE);
> +	else if (IS_ENABLED(CONFIG_MIPS))
> +		return !IS_ENABLED(CONFIG_CPU_LOONGSON3);
> +	else if (IS_ENABLED(CONFIG_X86))
> +		return true;
> +
> +	return !dev_is_dma_coherent(ddev->dev);
>   }
>   
>   #endif
Christoph Hellwig Jan. 21, 2019, 3:07 p.m. UTC | #2
> +#include <linux/dma-noncoherent.h>

This header is not for usage in device drivers, but purely for
dma-mapping implementations!

> +static inline bool drm_device_can_wc_memory(struct drm_device *ddev)
>  {
> +	if (IS_ENABLED(CONFIG_PPC))
> +		return IS_ENABLED(CONFIG_NOT_COHERENT_CACHE);
> +	else if (IS_ENABLED(CONFIG_MIPS))
> +		return !IS_ENABLED(CONFIG_CPU_LOONGSON3);
> +	else if (IS_ENABLED(CONFIG_X86))
> +		return true;
> +
> +	return !dev_is_dma_coherent(ddev->dev);

And even if something like this was valid to do, it would have to be
a core function with an arch hook, and not hidden in a random driver.
Ard Biesheuvel Jan. 21, 2019, 3:33 p.m. UTC | #3
On Mon, 21 Jan 2019 at 16:07, Christoph Hellwig <hch@infradead.org> wrote:
>
> > +#include <linux/dma-noncoherent.h>
>
> This header is not for usage in device drivers, but purely for
> dma-mapping implementations!
>

Is that documented anywhere?

> > +static inline bool drm_device_can_wc_memory(struct drm_device *ddev)
> >  {
> > +     if (IS_ENABLED(CONFIG_PPC))
> > +             return IS_ENABLED(CONFIG_NOT_COHERENT_CACHE);
> > +     else if (IS_ENABLED(CONFIG_MIPS))
> > +             return !IS_ENABLED(CONFIG_CPU_LOONGSON3);
> > +     else if (IS_ENABLED(CONFIG_X86))
> > +             return true;
> > +
> > +     return !dev_is_dma_coherent(ddev->dev);
>
> And even if something like this was valid to do, it would have to be
> a core function with an arch hook, and not hidden in a random driver.

Why would it not be valid to do? Is it wrong for a driver to be aware
of whether a device is cache coherent or not?

And in case it isn't, do you have an alternative suggestion on how to
fix this mess?
Christoph Hellwig Jan. 21, 2019, 3:59 p.m. UTC | #4
On Mon, Jan 21, 2019 at 04:33:27PM +0100, Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 16:07, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > +#include <linux/dma-noncoherent.h>
> >
> > This header is not for usage in device drivers, but purely for
> > dma-mapping implementations!
> >
> 
> Is that documented anywhere?

I'll add big fat comments.  But the fact that nothing is exported
there should be a fairly big hint.

> > And even if something like this was valid to do, it would have to be
> > a core function with an arch hook, and not hidden in a random driver.
> 
> Why would it not be valid to do? Is it wrong for a driver to be aware
> of whether a device is cache coherent or not?
> 
> And in case it isn't, do you have an alternative suggestion on how to
> fix this mess?

For the write combine mappings we need a proper core API how instances
can advertise the support.  One thing I want to do fairly soon is
error checking of the attrs argument to dma_alloc_attrs - so if you
pass in something unsupported it will give you back an error.

It seems that isn't quite enough for the drm use case, so we might
also need a way to query these features, but that really has to go
through the usual dma layer abstraction as well and not hacked together
in a driver based on an eduacted guestimate.
Ard Biesheuvel Jan. 21, 2019, 4:14 p.m. UTC | #5
On Mon, 21 Jan 2019 at 16:59, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jan 21, 2019 at 04:33:27PM +0100, Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 16:07, Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > +#include <linux/dma-noncoherent.h>
> > >
> > > This header is not for usage in device drivers, but purely for
> > > dma-mapping implementations!
> > >
> >
> > Is that documented anywhere?
>
> I'll add big fat comments.  But the fact that nothing is exported
> there should be a fairly big hint.
>

I don't follow. How do other header files 'export' things in a way
that this header doesn't?

> > > And even if something like this was valid to do, it would have to be
> > > a core function with an arch hook, and not hidden in a random driver.
> >
> > Why would it not be valid to do? Is it wrong for a driver to be aware
> > of whether a device is cache coherent or not?
> >
> > And in case it isn't, do you have an alternative suggestion on how to
> > fix this mess?
>
> For the write combine mappings we need a proper core API how instances
> can advertise the support.  One thing I want to do fairly soon is
> error checking of the attrs argument to dma_alloc_attrs - so if you
> pass in something unsupported it will give you back an error.
>

That sounds useful.

> It seems that isn't quite enough for the drm use case, so we might
> also need a way to query these features, but that really has to go
> through the usual dma layer abstraction as well and not hacked together
> in a driver based on an eduacted guestimate.

As far as I can tell, these drivers allocate DMA'able memory [in
ttm_tt_populate()] and subsequently create their own CPU mappings for
it, assuming that
a) the default is cache coherent, so vmap()ing those pages with
cacheable attributes works, and
b) telling the GPU to use NoSnoop attributes makes the accesses it
performs coherent with non-cacheable CPU mappings of those physical
pages

Since the latter is not true for many arm64 systems, I need this patch
to get a working system.

So how do you propose we proceed to get this fixed? Does it have to
wait for this proper core API plus followup changes for DRM?
Christoph Hellwig Jan. 21, 2019, 4:22 p.m. UTC | #6
On Mon, Jan 21, 2019 at 05:14:37PM +0100, Ard Biesheuvel wrote:
> > I'll add big fat comments.  But the fact that nothing is exported
> > there should be a fairly big hint.
> >
> 
> I don't follow. How do other header files 'export' things in a way
> that this header doesn't?

Well, I'll add comments to make it more obvious..

> As far as I can tell, these drivers allocate DMA'able memory [in
> ttm_tt_populate()] and subsequently create their own CPU mappings for
> it, assuming that
> a) the default is cache coherent, so vmap()ing those pages with
> cacheable attributes works, and

Yikes.  vmaping with different attributes is generally prone to
create problems on a lot of architectures.

> b) telling the GPU to use NoSnoop attributes makes the accesses it
> performs coherent with non-cacheable CPU mappings of those physical
> pages
> 
> Since the latter is not true for many arm64 systems, I need this patch
> to get a working system.

Do we know that this actually works anywhere but x86?

In general I would call these above sequence rather bogus and would
prefer we could get rid of such antipatterns in the kernel and just use
dma_alloc_attrs with DMA_ATTR_WRITECOMBINE if we want writecombine
semantics.

Until that happens we should just change the driver ifdefs to default
the hacks to off and only enable them on setups where we 100%
positively know that they actually work.  And document that fact
in big fat comments.
Ard Biesheuvel Jan. 21, 2019, 4:30 p.m. UTC | #7
On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jan 21, 2019 at 05:14:37PM +0100, Ard Biesheuvel wrote:
> > > I'll add big fat comments.  But the fact that nothing is exported
> > > there should be a fairly big hint.
> > >
> >
> > I don't follow. How do other header files 'export' things in a way
> > that this header doesn't?
>
> Well, I'll add comments to make it more obvious..
>
> > As far as I can tell, these drivers allocate DMA'able memory [in
> > ttm_tt_populate()] and subsequently create their own CPU mappings for
> > it, assuming that
> > a) the default is cache coherent, so vmap()ing those pages with
> > cacheable attributes works, and
>
> Yikes.  vmaping with different attributes is generally prone to
> create problems on a lot of architectures.
>

Indeed. But if your starting point is the assumption that DMA is
always cache coherent, those vmap() attributes are never different.

> > b) telling the GPU to use NoSnoop attributes makes the accesses it
> > performs coherent with non-cacheable CPU mappings of those physical
> > pages
> >
> > Since the latter is not true for many arm64 systems, I need this patch
> > to get a working system.
>
> Do we know that this actually works anywhere but x86?
>

In theory, it could work on arm64 systems with stage2-only SMMUs and
correctly configured PCIe RCs that set the right AMBA attributes for
inbound transactions with the NoSnoop attributes set.

Unfortunately, it seems that the current SMMU ARM code will clobber
those AMBA attributes when it uses stage1 mappings, since it forces
the memory attributes to WBWA for cache coherent devices.

So, as I pointed out in the commit log, the main difference between
x86 and other arches it that it can easily tolerate when NoSnoop is
non-functional.

> In general I would call these above sequence rather bogus and would
> prefer we could get rid of such antipatterns in the kernel and just use
> dma_alloc_attrs with DMA_ATTR_WRITECOMBINE if we want writecombine
> semantics.
>

Agreed.

> Until that happens we should just change the driver ifdefs to default
> the hacks to off and only enable them on setups where we 100%
> positively know that they actually work.  And document that fact
> in big fat comments.

Well, as I mentioned in my commit log as well, if we default to off
unless CONFIG_X86, we may break working setups on MIPS and Power where
the device is in fact non-cache coherent, and relies on this
'optimization' to get things working. The same could be true for
non-coherent ARM systems, hence my approach to disable this hack for
cache coherent devices on non-X86 only.
Christoph Hellwig Jan. 21, 2019, 4:35 p.m. UTC | #8
On Mon, Jan 21, 2019 at 05:30:00PM +0100, Ard Biesheuvel wrote:
> > Until that happens we should just change the driver ifdefs to default
> > the hacks to off and only enable them on setups where we 100%
> > positively know that they actually work.  And document that fact
> > in big fat comments.
> 
> Well, as I mentioned in my commit log as well, if we default to off
> unless CONFIG_X86, we may break working setups on MIPS and Power where
> the device is in fact non-cache coherent, and relies on this
> 'optimization' to get things working. The same could be true for
> non-coherent ARM systems, hence my approach to disable this hack for
> cache coherent devices on non-X86 only.

Do we break existing setups or just reduce performance due to the lack
of WC mappings?  I thought it was the latter.  The point is that
even your check won't do what you actually said.  At lot of non-loongson
mips platforms are not cache coherent.  As are a lot of arm setups
and all sparc64 ones for example.  And chances that someone will
hacks this file out in a random subsystem when adding news ports also
is rather slim, so we'll remaing broken by default.

That is why I want at very least:  a whitelist instead of a blacklist
and some very big comments explaining what is going on here.  And in
the mid to long term even drm really needs to learn to use the
proper APIs :(
Ard Biesheuvel Jan. 21, 2019, 4:50 p.m. UTC | #9
On Mon, 21 Jan 2019 at 17:35, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jan 21, 2019 at 05:30:00PM +0100, Ard Biesheuvel wrote:
> > > Until that happens we should just change the driver ifdefs to default
> > > the hacks to off and only enable them on setups where we 100%
> > > positively know that they actually work.  And document that fact
> > > in big fat comments.
> >
> > Well, as I mentioned in my commit log as well, if we default to off
> > unless CONFIG_X86, we may break working setups on MIPS and Power where
> > the device is in fact non-cache coherent, and relies on this
> > 'optimization' to get things working. The same could be true for
> > non-coherent ARM systems, hence my approach to disable this hack for
> > cache coherent devices on non-X86 only.
>
> Do we break existing setups or just reduce performance due to the lack
> of WC mappings?  I thought it was the latter.

Combining this check

#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
    return false;
...
#else
    return true;
#endif

with the fact that the driver assumes that DMA is cache coherent, I'd
be surprised if this hardware works without the hack.

The optimization targets cache coherent systems where pushing data
through the cache into the GPU is wasteful, given that it evicts other
data and relies on snooping by the device. In this case, disabling the
optimization reduces performance.

My point is that it is highly likely that non-cache coherent systems
may accidentally be in a working state only due to this optimization,
and not because the driver deals with non-cache coherent DMA
correctly.

> The point is that
> even your check won't do what you actually said.

Well, it enables the optimization only for non-cache coherent devices.
How is that different from what I said?

>  At lot of non-loongson
> mips platforms are not cache coherent.

You are assuming that whoever added that check cared about other MIPS platforms

> As are a lot of arm setups
> and all sparc64 ones for example.  And chances that someone will
> hacks this file out in a random subsystem when adding news ports also
> is rather slim, so we'll remaing broken by default.
>
> That is why I want at very least:  a whitelist instead of a blacklist
> and some very big comments explaining what is going on here.

ideally, we'd turn off this optimization entirely unless running on
x86. That would be my preference.

However, given the above re non-cache coherent systems, i am cautious.

>  And in
> the mid to long term even drm really needs to learn to use the
> proper APIs :(

+1
Michel Dänzer Jan. 21, 2019, 5:55 p.m. UTC | #10
On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> 
>> Until that happens we should just change the driver ifdefs to default
>> the hacks to off and only enable them on setups where we 100%
>> positively know that they actually work.  And document that fact
>> in big fat comments.
> 
> Well, as I mentioned in my commit log as well, if we default to off
> unless CONFIG_X86, we may break working setups on MIPS and Power where
> the device is in fact non-cache coherent, and relies on this
> 'optimization' to get things working.

FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
correct basic operation (the scenario Christian brought up is a very
specialized use-case), so that shouldn't be an issue.
Ard Biesheuvel Jan. 21, 2019, 5:59 p.m. UTC | #11
On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> Until that happens we should just change the driver ifdefs to default
> >> the hacks to off and only enable them on setups where we 100%
> >> positively know that they actually work.  And document that fact
> >> in big fat comments.
> >
> > Well, as I mentioned in my commit log as well, if we default to off
> > unless CONFIG_X86, we may break working setups on MIPS and Power where
> > the device is in fact non-cache coherent, and relies on this
> > 'optimization' to get things working.
>
> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> correct basic operation (the scenario Christian brought up is a very
> specialized use-case), so that shouldn't be an issue.
>

The point is that this is only true for x86.

On other architectures, the use of non-cached mappings on the CPU side
means that you /do/ rely on non-snooped transfers, since if those
transfers turn out not to snoop inadvertently, the accesses are
incoherent with the CPU's view of memory.
Michel Dänzer Jan. 21, 2019, 6:04 p.m. UTC | #12
On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
>>
>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
>>>
>>>> Until that happens we should just change the driver ifdefs to default
>>>> the hacks to off and only enable them on setups where we 100%
>>>> positively know that they actually work.  And document that fact
>>>> in big fat comments.
>>>
>>> Well, as I mentioned in my commit log as well, if we default to off
>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>> the device is in fact non-cache coherent, and relies on this
>>> 'optimization' to get things working.
>>
>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>> correct basic operation (the scenario Christian brought up is a very
>> specialized use-case), so that shouldn't be an issue.
>>
> 
> The point is that this is only true for x86.
> 
> On other architectures, the use of non-cached mappings on the CPU side
> means that you /do/ rely on non-snooped transfers, since if those
> transfers turn out not to snoop inadvertently, the accesses are
> incoherent with the CPU's view of memory.

The driver generally only uses non-cached mappings if
drm_arch/device_can_wc_memory returns true.
Ard Biesheuvel Jan. 21, 2019, 6:20 p.m. UTC | #13
On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> >>
> >> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> >>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> >>>
> >>>> Until that happens we should just change the driver ifdefs to default
> >>>> the hacks to off and only enable them on setups where we 100%
> >>>> positively know that they actually work.  And document that fact
> >>>> in big fat comments.
> >>>
> >>> Well, as I mentioned in my commit log as well, if we default to off
> >>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> >>> the device is in fact non-cache coherent, and relies on this
> >>> 'optimization' to get things working.
> >>
> >> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> >> correct basic operation (the scenario Christian brought up is a very
> >> specialized use-case), so that shouldn't be an issue.
> >>
> >
> > The point is that this is only true for x86.
> >
> > On other architectures, the use of non-cached mappings on the CPU side
> > means that you /do/ rely on non-snooped transfers, since if those
> > transfers turn out not to snoop inadvertently, the accesses are
> > incoherent with the CPU's view of memory.
>
> The driver generally only uses non-cached mappings if
> drm_arch/device_can_wc_memory returns true.
>

Indeed. And so we should take care to only return 'true' from that
function if it is guaranteed that non-cached CPU mappings are coherent
with the mappings used by the GPU, either because that is always the
case (like on x86), or because we know that the platform in question
implements NoSnoop correctly throughout the interconnect.

What seems to be complicating matters is that in some cases, the
device is non-cache coherent to begin with, so regardless of whether
the NoSnoop attribute is used or not, those accesses will not snoop in
the caches and be coherent with the non-cached mappings used by the
CPU. So if we restrict this optimization [on non-X86] to platforms
that are known to implement NoSnoop correctly, we may break platforms
that are implicitly NoSnoop all the time.
Michel Dänzer Jan. 21, 2019, 6:23 p.m. UTC | #14
On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
>>
>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
>>>>
>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
>>>>>
>>>>>> Until that happens we should just change the driver ifdefs to default
>>>>>> the hacks to off and only enable them on setups where we 100%
>>>>>> positively know that they actually work.  And document that fact
>>>>>> in big fat comments.
>>>>>
>>>>> Well, as I mentioned in my commit log as well, if we default to off
>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>>>> the device is in fact non-cache coherent, and relies on this
>>>>> 'optimization' to get things working.
>>>>
>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>>>> correct basic operation (the scenario Christian brought up is a very
>>>> specialized use-case), so that shouldn't be an issue.
>>>>
>>>
>>> The point is that this is only true for x86.
>>>
>>> On other architectures, the use of non-cached mappings on the CPU side
>>> means that you /do/ rely on non-snooped transfers, since if those
>>> transfers turn out not to snoop inadvertently, the accesses are
>>> incoherent with the CPU's view of memory.
>>
>> The driver generally only uses non-cached mappings if
>> drm_arch/device_can_wc_memory returns true.
>>
> 
> Indeed. And so we should take care to only return 'true' from that
> function if it is guaranteed that non-cached CPU mappings are coherent
> with the mappings used by the GPU, either because that is always the
> case (like on x86), or because we know that the platform in question
> implements NoSnoop correctly throughout the interconnect.
> 
> What seems to be complicating matters is that in some cases, the
> device is non-cache coherent to begin with, so regardless of whether
> the NoSnoop attribute is used or not, those accesses will not snoop in
> the caches and be coherent with the non-cached mappings used by the
> CPU. So if we restrict this optimization [on non-X86] to platforms
> that are known to implement NoSnoop correctly, we may break platforms
> that are implicitly NoSnoop all the time.

Since the driver generally doesn't rely on non-snooped accesses for
correctness, that couldn't "break" anything that hasn't always been broken.
Ard Biesheuvel Jan. 21, 2019, 6:28 p.m. UTC | #15
On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
> >>
> >> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> >>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> >>>>
> >>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> >>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> >>>>>
> >>>>>> Until that happens we should just change the driver ifdefs to default
> >>>>>> the hacks to off and only enable them on setups where we 100%
> >>>>>> positively know that they actually work.  And document that fact
> >>>>>> in big fat comments.
> >>>>>
> >>>>> Well, as I mentioned in my commit log as well, if we default to off
> >>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> >>>>> the device is in fact non-cache coherent, and relies on this
> >>>>> 'optimization' to get things working.
> >>>>
> >>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> >>>> correct basic operation (the scenario Christian brought up is a very
> >>>> specialized use-case), so that shouldn't be an issue.
> >>>>
> >>>
> >>> The point is that this is only true for x86.
> >>>
> >>> On other architectures, the use of non-cached mappings on the CPU side
> >>> means that you /do/ rely on non-snooped transfers, since if those
> >>> transfers turn out not to snoop inadvertently, the accesses are
> >>> incoherent with the CPU's view of memory.
> >>
> >> The driver generally only uses non-cached mappings if
> >> drm_arch/device_can_wc_memory returns true.
> >>
> >
> > Indeed. And so we should take care to only return 'true' from that
> > function if it is guaranteed that non-cached CPU mappings are coherent
> > with the mappings used by the GPU, either because that is always the
> > case (like on x86), or because we know that the platform in question
> > implements NoSnoop correctly throughout the interconnect.
> >
> > What seems to be complicating matters is that in some cases, the
> > device is non-cache coherent to begin with, so regardless of whether
> > the NoSnoop attribute is used or not, those accesses will not snoop in
> > the caches and be coherent with the non-cached mappings used by the
> > CPU. So if we restrict this optimization [on non-X86] to platforms
> > that are known to implement NoSnoop correctly, we may break platforms
> > that are implicitly NoSnoop all the time.
>
> Since the driver generally doesn't rely on non-snooped accesses for
> correctness, that couldn't "break" anything that hasn't always been broken.
>

Again, that is only true on x86.

On other architectures, DMA writes from the device may allocate in the
caches, and be invisible to the CPU when it uses non-cached mappings.
Michel Dänzer Jan. 21, 2019, 7:04 p.m. UTC | #16
On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
>> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
>>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
>>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
>>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
>>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
>>>>>>>
>>>>>>>> Until that happens we should just change the driver ifdefs to default
>>>>>>>> the hacks to off and only enable them on setups where we 100%
>>>>>>>> positively know that they actually work.  And document that fact
>>>>>>>> in big fat comments.
>>>>>>>
>>>>>>> Well, as I mentioned in my commit log as well, if we default to off
>>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
>>>>>>> the device is in fact non-cache coherent, and relies on this
>>>>>>> 'optimization' to get things working.
>>>>>>
>>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
>>>>>> correct basic operation (the scenario Christian brought up is a very
>>>>>> specialized use-case), so that shouldn't be an issue.
>>>>>
>>>>> The point is that this is only true for x86.
>>>>>
>>>>> On other architectures, the use of non-cached mappings on the CPU side
>>>>> means that you /do/ rely on non-snooped transfers, since if those
>>>>> transfers turn out not to snoop inadvertently, the accesses are
>>>>> incoherent with the CPU's view of memory.
>>>>
>>>> The driver generally only uses non-cached mappings if
>>>> drm_arch/device_can_wc_memory returns true.
>>>
>>> Indeed. And so we should take care to only return 'true' from that
>>> function if it is guaranteed that non-cached CPU mappings are coherent
>>> with the mappings used by the GPU, either because that is always the
>>> case (like on x86), or because we know that the platform in question
>>> implements NoSnoop correctly throughout the interconnect.
>>>
>>> What seems to be complicating matters is that in some cases, the
>>> device is non-cache coherent to begin with, so regardless of whether
>>> the NoSnoop attribute is used or not, those accesses will not snoop in
>>> the caches and be coherent with the non-cached mappings used by the
>>> CPU. So if we restrict this optimization [on non-X86] to platforms
>>> that are known to implement NoSnoop correctly, we may break platforms
>>> that are implicitly NoSnoop all the time.
>>
>> Since the driver generally doesn't rely on non-snooped accesses for
>> correctness, that couldn't "break" anything that hasn't always been broken.
> 
> Again, that is only true on x86.
> 
> On other architectures, DMA writes from the device may allocate in the
> caches, and be invisible to the CPU when it uses non-cached mappings.

Let me try one last time:

If drm_arch_can_wc_memory returns false, the driver falls back to the
normal mode of operation, using a cacheable CPU mapping and snooped GPU
transfers, even if userspace asks (as a performance optimization) for a
write-combined CPU mapping and non-snooped GPU transfers via
AMDGPU_GEM_CREATE_CPU_GTT_USWC. This normal mode of operation is also
used for the ring buffers at the heart of the driver's operation. If
there is a platform where this normal mode of operation doesn't work,
the driver could never have worked reliably on that platform, since
before AMDGPU_GEM_CREATE_CPU_GTT_USWC or drm_arch_can_wc_memory even
existed.
Ard Biesheuvel Jan. 21, 2019, 7:18 p.m. UTC | #17
On Mon, 21 Jan 2019 at 20:04, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> >>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
> >>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> >>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> >>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> >>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> >>>>>>>
> >>>>>>>> Until that happens we should just change the driver ifdefs to default
> >>>>>>>> the hacks to off and only enable them on setups where we 100%
> >>>>>>>> positively know that they actually work.  And document that fact
> >>>>>>>> in big fat comments.
> >>>>>>>
> >>>>>>> Well, as I mentioned in my commit log as well, if we default to off
> >>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> >>>>>>> the device is in fact non-cache coherent, and relies on this
> >>>>>>> 'optimization' to get things working.
> >>>>>>
> >>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> >>>>>> correct basic operation (the scenario Christian brought up is a very
> >>>>>> specialized use-case), so that shouldn't be an issue.
> >>>>>
> >>>>> The point is that this is only true for x86.
> >>>>>
> >>>>> On other architectures, the use of non-cached mappings on the CPU side
> >>>>> means that you /do/ rely on non-snooped transfers, since if those
> >>>>> transfers turn out not to snoop inadvertently, the accesses are
> >>>>> incoherent with the CPU's view of memory.
> >>>>
> >>>> The driver generally only uses non-cached mappings if
> >>>> drm_arch/device_can_wc_memory returns true.
> >>>
> >>> Indeed. And so we should take care to only return 'true' from that
> >>> function if it is guaranteed that non-cached CPU mappings are coherent
> >>> with the mappings used by the GPU, either because that is always the
> >>> case (like on x86), or because we know that the platform in question
> >>> implements NoSnoop correctly throughout the interconnect.
> >>>
> >>> What seems to be complicating matters is that in some cases, the
> >>> device is non-cache coherent to begin with, so regardless of whether
> >>> the NoSnoop attribute is used or not, those accesses will not snoop in
> >>> the caches and be coherent with the non-cached mappings used by the
> >>> CPU. So if we restrict this optimization [on non-X86] to platforms
> >>> that are known to implement NoSnoop correctly, we may break platforms
> >>> that are implicitly NoSnoop all the time.
> >>
> >> Since the driver generally doesn't rely on non-snooped accesses for
> >> correctness, that couldn't "break" anything that hasn't always been broken.
> >
> > Again, that is only true on x86.
> >
> > On other architectures, DMA writes from the device may allocate in the
> > caches, and be invisible to the CPU when it uses non-cached mappings.
>
> Let me try one last time:
>

I could say the same :-)

> If drm_arch_can_wc_memory returns false, the driver falls back to the
> normal mode of operation, using a cacheable CPU mapping and snooped GPU
> transfers, even if userspace asks (as a performance optimization) for a
> write-combined CPU mapping and non-snooped GPU transfers via
> AMDGPU_GEM_CREATE_CPU_GTT_USWC.

I am not talking about the case where drm_arch_can_wc_memory() returns false.

I am talking about the case where it returns true, which is currently
the default for all architectures, except Power and MIPS in some
cases.

This mode of operation breaks my cache coherent arm64 system. (AMD Seattle)
With this patch applied, everything works fine.

> This normal mode of operation is also
> used for the ring buffers at the heart of the driver's operation.

But is it really the same mode of operation? Does it also vmap() the
pages? Or does it use the DMA API to allocate the ring buffers?
Because in the latter case, this will give you non-cached CPU mappings
as well if the device is non-cache coherent.

> If
> there is a platform where this normal mode of operation doesn't work,
> the driver could never have worked reliably on that platform, since
> before AMDGPU_GEM_CREATE_CPU_GTT_USWC or drm_arch_can_wc_memory even
> existed.
>

As I said, I am talking about the case where drm_arch_can_wc_memory()
returns true on a cache coherent system. This relies on NoSnoop being
implemented correctly in the platform, or a CPU architecture that
snoops the caches when doing uncached memory accesses (such as x86)
Ard Biesheuvel Jan. 22, 2019, 8:38 a.m. UTC | #18
On Mon, 21 Jan 2019 at 20:04, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
> >> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> >>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
> >>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> >>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> >>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> >>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> >>>>>>>
> >>>>>>>> Until that happens we should just change the driver ifdefs to default
> >>>>>>>> the hacks to off and only enable them on setups where we 100%
> >>>>>>>> positively know that they actually work.  And document that fact
> >>>>>>>> in big fat comments.
> >>>>>>>
> >>>>>>> Well, as I mentioned in my commit log as well, if we default to off
> >>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> >>>>>>> the device is in fact non-cache coherent, and relies on this
> >>>>>>> 'optimization' to get things working.
> >>>>>>
> >>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> >>>>>> correct basic operation (the scenario Christian brought up is a very
> >>>>>> specialized use-case), so that shouldn't be an issue.
> >>>>>
> >>>>> The point is that this is only true for x86.
> >>>>>
> >>>>> On other architectures, the use of non-cached mappings on the CPU side
> >>>>> means that you /do/ rely on non-snooped transfers, since if those
> >>>>> transfers turn out not to snoop inadvertently, the accesses are
> >>>>> incoherent with the CPU's view of memory.
> >>>>
> >>>> The driver generally only uses non-cached mappings if
> >>>> drm_arch/device_can_wc_memory returns true.
> >>>
> >>> Indeed. And so we should take care to only return 'true' from that
> >>> function if it is guaranteed that non-cached CPU mappings are coherent
> >>> with the mappings used by the GPU, either because that is always the
> >>> case (like on x86), or because we know that the platform in question
> >>> implements NoSnoop correctly throughout the interconnect.
> >>>
> >>> What seems to be complicating matters is that in some cases, the
> >>> device is non-cache coherent to begin with, so regardless of whether
> >>> the NoSnoop attribute is used or not, those accesses will not snoop in
> >>> the caches and be coherent with the non-cached mappings used by the
> >>> CPU. So if we restrict this optimization [on non-X86] to platforms
> >>> that are known to implement NoSnoop correctly, we may break platforms
> >>> that are implicitly NoSnoop all the time.
> >>
> >> Since the driver generally doesn't rely on non-snooped accesses for
> >> correctness, that couldn't "break" anything that hasn't always been broken.
> >
> > Again, that is only true on x86.
> >
> > On other architectures, DMA writes from the device may allocate in the
> > caches, and be invisible to the CPU when it uses non-cached mappings.
>
> Let me try one last time:
>
> If drm_arch_can_wc_memory returns false, the driver falls back to the
> normal mode of operation, using a cacheable CPU mapping and snooped GPU
> transfers, even if userspace asks (as a performance optimization) for a
> write-combined CPU mapping and non-snooped GPU transfers via
> AMDGPU_GEM_CREATE_CPU_GTT_USWC.

Another question: when userspace requests for such a mapping to be
created, does this involve pages that are mapped cacheable into the
userland process?
Alex Deucher Jan. 22, 2019, 8:56 p.m. UTC | #19
On Tue, Jan 22, 2019 at 4:19 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Mon, 21 Jan 2019 at 20:04, Michel Dänzer <michel@daenzer.net> wrote:
> >
> > On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> > > On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
> > >> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> > >>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
> > >>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> > >>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> > >>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> > >>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> > >>>>>>>
> > >>>>>>>> Until that happens we should just change the driver ifdefs to default
> > >>>>>>>> the hacks to off and only enable them on setups where we 100%
> > >>>>>>>> positively know that they actually work.  And document that fact
> > >>>>>>>> in big fat comments.
> > >>>>>>>
> > >>>>>>> Well, as I mentioned in my commit log as well, if we default to off
> > >>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> > >>>>>>> the device is in fact non-cache coherent, and relies on this
> > >>>>>>> 'optimization' to get things working.
> > >>>>>>
> > >>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> > >>>>>> correct basic operation (the scenario Christian brought up is a very
> > >>>>>> specialized use-case), so that shouldn't be an issue.
> > >>>>>
> > >>>>> The point is that this is only true for x86.
> > >>>>>
> > >>>>> On other architectures, the use of non-cached mappings on the CPU side
> > >>>>> means that you /do/ rely on non-snooped transfers, since if those
> > >>>>> transfers turn out not to snoop inadvertently, the accesses are
> > >>>>> incoherent with the CPU's view of memory.
> > >>>>
> > >>>> The driver generally only uses non-cached mappings if
> > >>>> drm_arch/device_can_wc_memory returns true.
> > >>>
> > >>> Indeed. And so we should take care to only return 'true' from that
> > >>> function if it is guaranteed that non-cached CPU mappings are coherent
> > >>> with the mappings used by the GPU, either because that is always the
> > >>> case (like on x86), or because we know that the platform in question
> > >>> implements NoSnoop correctly throughout the interconnect.
> > >>>
> > >>> What seems to be complicating matters is that in some cases, the
> > >>> device is non-cache coherent to begin with, so regardless of whether
> > >>> the NoSnoop attribute is used or not, those accesses will not snoop in
> > >>> the caches and be coherent with the non-cached mappings used by the
> > >>> CPU. So if we restrict this optimization [on non-X86] to platforms
> > >>> that are known to implement NoSnoop correctly, we may break platforms
> > >>> that are implicitly NoSnoop all the time.
> > >>
> > >> Since the driver generally doesn't rely on non-snooped accesses for
> > >> correctness, that couldn't "break" anything that hasn't always been broken.
> > >
> > > Again, that is only true on x86.
> > >
> > > On other architectures, DMA writes from the device may allocate in the
> > > caches, and be invisible to the CPU when it uses non-cached mappings.
> >
> > Let me try one last time:
> >
> > If drm_arch_can_wc_memory returns false, the driver falls back to the
> > normal mode of operation, using a cacheable CPU mapping and snooped GPU
> > transfers, even if userspace asks (as a performance optimization) for a
> > write-combined CPU mapping and non-snooped GPU transfers via
> > AMDGPU_GEM_CREATE_CPU_GTT_USWC.
>
> Another question: when userspace requests for such a mapping to be
> created, does this involve pages that are mapped cacheable into the
> userland process?

AMDGPU_GEM_CREATE_CPU_GTT_USWC means the buffer should be uncached and
write combined from the CPU's perspective (hence the 'CPU' in the flag
name).  On the GPU side if that flag is set, we do an non-snooped GPU
mapping for better performance if the buffer ends up getting mapped
into the GPU's address space for GPU access.

Alex
Ard Biesheuvel Jan. 22, 2019, 9:07 p.m. UTC | #20
On Tue, 22 Jan 2019 at 21:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Tue, Jan 22, 2019 at 4:19 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Mon, 21 Jan 2019 at 20:04, Michel Dänzer <michel@daenzer.net> wrote:
> > >
> > > On 2019-01-21 7:28 p.m., Ard Biesheuvel wrote:
> > > > On Mon, 21 Jan 2019 at 19:24, Michel Dänzer <michel@daenzer.net> wrote:
> > > >> On 2019-01-21 7:20 p.m., Ard Biesheuvel wrote:
> > > >>> On Mon, 21 Jan 2019 at 19:04, Michel Dänzer <michel@daenzer.net> wrote:
> > > >>>> On 2019-01-21 6:59 p.m., Ard Biesheuvel wrote:
> > > >>>>> On Mon, 21 Jan 2019 at 18:55, Michel Dänzer <michel@daenzer.net> wrote:
> > > >>>>>> On 2019-01-21 5:30 p.m., Ard Biesheuvel wrote:
> > > >>>>>>> On Mon, 21 Jan 2019 at 17:22, Christoph Hellwig <hch@infradead.org> wrote:
> > > >>>>>>>
> > > >>>>>>>> Until that happens we should just change the driver ifdefs to default
> > > >>>>>>>> the hacks to off and only enable them on setups where we 100%
> > > >>>>>>>> positively know that they actually work.  And document that fact
> > > >>>>>>>> in big fat comments.
> > > >>>>>>>
> > > >>>>>>> Well, as I mentioned in my commit log as well, if we default to off
> > > >>>>>>> unless CONFIG_X86, we may break working setups on MIPS and Power where
> > > >>>>>>> the device is in fact non-cache coherent, and relies on this
> > > >>>>>>> 'optimization' to get things working.
> > > >>>>>>
> > > >>>>>> FWIW, the amdgpu driver doesn't rely on non-snooped transfers for
> > > >>>>>> correct basic operation (the scenario Christian brought up is a very
> > > >>>>>> specialized use-case), so that shouldn't be an issue.
> > > >>>>>
> > > >>>>> The point is that this is only true for x86.
> > > >>>>>
> > > >>>>> On other architectures, the use of non-cached mappings on the CPU side
> > > >>>>> means that you /do/ rely on non-snooped transfers, since if those
> > > >>>>> transfers turn out not to snoop inadvertently, the accesses are
> > > >>>>> incoherent with the CPU's view of memory.
> > > >>>>
> > > >>>> The driver generally only uses non-cached mappings if
> > > >>>> drm_arch/device_can_wc_memory returns true.
> > > >>>
> > > >>> Indeed. And so we should take care to only return 'true' from that
> > > >>> function if it is guaranteed that non-cached CPU mappings are coherent
> > > >>> with the mappings used by the GPU, either because that is always the
> > > >>> case (like on x86), or because we know that the platform in question
> > > >>> implements NoSnoop correctly throughout the interconnect.
> > > >>>
> > > >>> What seems to be complicating matters is that in some cases, the
> > > >>> device is non-cache coherent to begin with, so regardless of whether
> > > >>> the NoSnoop attribute is used or not, those accesses will not snoop in
> > > >>> the caches and be coherent with the non-cached mappings used by the
> > > >>> CPU. So if we restrict this optimization [on non-X86] to platforms
> > > >>> that are known to implement NoSnoop correctly, we may break platforms
> > > >>> that are implicitly NoSnoop all the time.
> > > >>
> > > >> Since the driver generally doesn't rely on non-snooped accesses for
> > > >> correctness, that couldn't "break" anything that hasn't always been broken.
> > > >
> > > > Again, that is only true on x86.
> > > >
> > > > On other architectures, DMA writes from the device may allocate in the
> > > > caches, and be invisible to the CPU when it uses non-cached mappings.
> > >
> > > Let me try one last time:
> > >
> > > If drm_arch_can_wc_memory returns false, the driver falls back to the
> > > normal mode of operation, using a cacheable CPU mapping and snooped GPU
> > > transfers, even if userspace asks (as a performance optimization) for a
> > > write-combined CPU mapping and non-snooped GPU transfers via
> > > AMDGPU_GEM_CREATE_CPU_GTT_USWC.
> >
> > Another question: when userspace requests for such a mapping to be
> > created, does this involve pages that are mapped cacheable into the
> > userland process?
>
> AMDGPU_GEM_CREATE_CPU_GTT_USWC means the buffer should be uncached and
> write combined from the CPU's perspective (hence the 'CPU' in the flag
> name).  On the GPU side if that flag is set, we do an non-snooped GPU
> mapping for better performance if the buffer ends up getting mapped
> into the GPU's address space for GPU access.
>

Yes, so much was clear. And the reason this breaks on some arm64
systems is because
a) non-snooped PCIe TLP attributes may be ignored, and
b) non-x86 CPUs do not snoop the caches when accessing uncached mappings.

I don't think there is actually any disagreement on this part. And I
think my patch is reasonable, only Christoph is objecting to it on the
grounds that drivers should not go around the DMA API and create
vmap()s of DMA pages with self chosen attributes.

What I am trying to figure out is why the coherency problem exists:
- it could be that the device is reading using cached mappings and
sees stale clean cachelines that shadow data written by the CPU using
uncached mappings
- or the device is writing with cached mappings, resulting in data to
be allocated there, but the CPU does not see it because it reads using
uncached mappings.

In the former case, I would expect cache invalidation right after the
vmap() to be sufficient, but in the latter case, it is a bit more
difficult probably.

So my question was whether there are cacheable aliases for those
uncached vmap()s being used?
Christoph Hellwig Jan. 23, 2019, 7:15 a.m. UTC | #21
On Tue, Jan 22, 2019 at 10:07:07PM +0100, Ard Biesheuvel wrote:
> Yes, so much was clear. And the reason this breaks on some arm64
> systems is because
> a) non-snooped PCIe TLP attributes may be ignored, and
> b) non-x86 CPUs do not snoop the caches when accessing uncached mappings.
> 
> I don't think there is actually any disagreement on this part. And I
> think my patch is reasonable, only Christoph is objecting to it on the
> grounds that drivers should not go around the DMA API and create
> vmap()s of DMA pages with self chosen attributes.

I object to it on various grounds.  While the above is correct it really
is a mid to long-term fix.

But even in the short term your patch just maintains a random list of
idefs in a random driver, pokes into the dma-mapping internals and lacks
any comments in the code explaining on what is going on, leading to
futher cargo culting.  So it very clearly is not acceptable.
Ard Biesheuvel Jan. 23, 2019, 9:08 a.m. UTC | #22
On Wed, 23 Jan 2019 at 08:15, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Jan 22, 2019 at 10:07:07PM +0100, Ard Biesheuvel wrote:
> > Yes, so much was clear. And the reason this breaks on some arm64
> > systems is because
> > a) non-snooped PCIe TLP attributes may be ignored, and
> > b) non-x86 CPUs do not snoop the caches when accessing uncached mappings.
> >
> > I don't think there is actually any disagreement on this part. And I
> > think my patch is reasonable, only Christoph is objecting to it on the
> > grounds that drivers should not go around the DMA API and create
> > vmap()s of DMA pages with self chosen attributes.
>
> I object to it on various grounds.  While the above is correct it really
> is a mid to long-term fix.
>
> But even in the short term your patch just maintains a random list of
> idefs in a random driver, pokes into the dma-mapping internals and lacks
> any comments in the code explaining on what is going on, leading to
> futher cargo culting.  So it very clearly is not acceptable.

Fair enough. It would be helpful, though, if you could give more
guidance on what you do find acceptable. You haven't answered my
question on whether you think drivers should be allowed to reason at
all about the device's cache coherence.

In any case, I think we (you and I) could easily agree on the correct fix being:
- introduce DMA_ATTR_NOSNOOP
- implement it as a no-op for non-cache coherent devices (since
nosnoop is implied)
- permit non-x86 architectures to override/ignore it if not supported
- rip out all the explicit vmap()/kmap()s of DMA pages from the DRM
drivers, and instead, set the DMA_ATTR_NOSNOOP attribute where
desired, and always use the mappings provided by the DMA API.

The problem is that we will need the DRM/radeon/amdgpu maintainers on
board with this, and until this happens, I am stuck in the middle with
a broken arm64 system.

So, given the above, what /would/ you find acceptable?
Christoph Hellwig Jan. 23, 2019, 4:44 p.m. UTC | #23
I think we just want a driver-local check for those combinations
where we know this hack actually works, which really just seems
to be x86-64 with PAT. Something like the patch below, but maybe with
even more strong warnings to not do something like this elsewhere:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 728e15e5d68a..5fe657f20232 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -456,33 +456,16 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 
 	bo->flags = bp->flags;
 
-#ifdef CONFIG_X86_32
-	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
-	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
-	 */
-	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
-#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
-	/* Don't try to enable write-combining when it can't work, or things
-	 * may be slow
-	 * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
-	 */
-
-#ifndef CONFIG_COMPILE_TEST
-#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
-	 thanks to write-combining
-#endif
-
-	if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
-		DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
-			      "better performance thanks to write-combining\n");
-	bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
-#else
-	/* For architectures that don't support WC memory,
-	 * mask out the WC flag from the BO
+	/*
+	 * Don't try to enable write-combined CPU mappings unless we 100%
+	 * positively know it works, otherwise there may be dragons.
+	 *
+	 * See:
+	 *  - https://bugs.freedesktop.org/show_bug.cgi?id=88758
+	 *  - https://bugs.freedesktop.org/show_bug.cgi?id=84627
 	 */
-	if (!drm_arch_can_wc_memory())
+	if (!(IS_ENABLED(CONFIG_X86_64) && IS_ENABLED(CONFIG_X86_PAT)))
 		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
-#endif
 
 	bo->tbo.bdev = &adev->mman.bdev;
 	amdgpu_bo_placement_from_domain(bo, bp->domain);
diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 833e909706a9..c1fb5ad4ab9a 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -226,32 +226,17 @@ int radeon_bo_create(struct radeon_device *rdev,
 	if (rdev->family >= CHIP_RV610 && rdev->family <= CHIP_RV635)
 		bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
 
-#ifdef CONFIG_X86_32
-	/* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
-	 * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
-	 */
-	bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
-#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
-	/* Don't try to enable write-combining when it can't work, or things
-	 * may be slow
-	 * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
-	 */
-#ifndef CONFIG_COMPILE_TEST
-#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
-	 thanks to write-combining
-#endif
 
-	if (bo->flags & RADEON_GEM_GTT_WC)
-		DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
-			      "better performance thanks to write-combining\n");
-	bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
-#else
-	/* For architectures that don't support WC memory,
-	 * mask out the WC flag from the BO
+	/*
+	 * Don't try to enable write-combined CPU mappings unless we 100%
+	 * positively know it works, otherwise there may be dragons.
+	 *
+	 * See:
+	 *  - https://bugs.freedesktop.org/show_bug.cgi?id=88758
+	 *  - https://bugs.freedesktop.org/show_bug.cgi?id=84627
 	 */
-	if (!drm_arch_can_wc_memory())
-		bo->flags &= ~RADEON_GEM_GTT_WC;
-#endif
+	if (!(IS_ENABLED(CONFIG_X86_64) && IS_ENABLED(CONFIG_X86_PAT)))
+		bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
 
 	radeon_ttm_placement_from_domain(bo, domain);
 	/* Kernel allocation are uninterruptible */
diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
index bfe1639df02d..6c3960f4c477 100644
--- a/include/drm/drm_cache.h
+++ b/include/drm/drm_cache.h
@@ -40,16 +40,4 @@ void drm_clflush_sg(struct sg_table *st);
 void drm_clflush_virt_range(void *addr, unsigned long length);
 u64 drm_get_max_iomem(void);
 
-
-static inline bool drm_arch_can_wc_memory(void)
-{
-#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
-	return false;
-#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
-	return false;
-#else
-	return true;
-#endif
-}
-
 #endif
Ard Biesheuvel Jan. 23, 2019, 4:52 p.m. UTC | #24
On Wed, 23 Jan 2019 at 17:44, Christoph Hellwig <hch@infradead.org> wrote:
>
> I think we just want a driver-local check for those combinations
> where we know this hack actually works, which really just seems
> to be x86-64 with PAT. Something like the patch below, but maybe with
> even more strong warnings to not do something like this elsewhere:
>

I agree that your patch seems like the right way to ensure that the WC
optimization hack is only used where we know it works.

But my concern is that it seems likely that non-cache coherent
implementations are relying on this hack as well. There must be a
reason that this hack is only disabled for PowerPC platforms if they
are cache coherent, for instance, and I suspect that that reason is
that the hack is the only thing ensuring that the CPU mapping
attributes match the device ones used for these buffers (the vmap()ed
ones), whereas the rings and other consistent data structures are
using the DMA API as intended, and thus getting uncached attributes in
the correct way.

Given that the same concern applies to ARM (and arm64 to some extent),
I'd be cautious to disable this optimization entirely, and this is why
I ended up disabling it only for cache coherent devices.




> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 728e15e5d68a..5fe657f20232 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -456,33 +456,16 @@ static int amdgpu_bo_do_create(struct amdgpu_device *adev,
>
>         bo->flags = bp->flags;
>
> -#ifdef CONFIG_X86_32
> -       /* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
> -        */
> -       bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
> -#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
> -       /* Don't try to enable write-combining when it can't work, or things
> -        * may be slow
> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
> -        */
> -
> -#ifndef CONFIG_COMPILE_TEST
> -#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
> -        thanks to write-combining
> -#endif
> -
> -       if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
> -               DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
> -                             "better performance thanks to write-combining\n");
> -       bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
> -#else
> -       /* For architectures that don't support WC memory,
> -        * mask out the WC flag from the BO
> +       /*
> +        * Don't try to enable write-combined CPU mappings unless we 100%
> +        * positively know it works, otherwise there may be dragons.
> +        *
> +        * See:
> +        *  - https://bugs.freedesktop.org/show_bug.cgi?id=88758
> +        *  - https://bugs.freedesktop.org/show_bug.cgi?id=84627
>          */
> -       if (!drm_arch_can_wc_memory())
> +       if (!(IS_ENABLED(CONFIG_X86_64) && IS_ENABLED(CONFIG_X86_PAT)))
>                 bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
> -#endif
>
>         bo->tbo.bdev = &adev->mman.bdev;
>         amdgpu_bo_placement_from_domain(bo, bp->domain);
> diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
> index 833e909706a9..c1fb5ad4ab9a 100644
> --- a/drivers/gpu/drm/radeon/radeon_object.c
> +++ b/drivers/gpu/drm/radeon/radeon_object.c
> @@ -226,32 +226,17 @@ int radeon_bo_create(struct radeon_device *rdev,
>         if (rdev->family >= CHIP_RV610 && rdev->family <= CHIP_RV635)
>                 bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
>
> -#ifdef CONFIG_X86_32
> -       /* XXX: Write-combined CPU mappings of GTT seem broken on 32-bit
> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=84627
> -        */
> -       bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
> -#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
> -       /* Don't try to enable write-combining when it can't work, or things
> -        * may be slow
> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
> -        */
> -#ifndef CONFIG_COMPILE_TEST
> -#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
> -        thanks to write-combining
> -#endif
>
> -       if (bo->flags & RADEON_GEM_GTT_WC)
> -               DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
> -                             "better performance thanks to write-combining\n");
> -       bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
> -#else
> -       /* For architectures that don't support WC memory,
> -        * mask out the WC flag from the BO
> +       /*
> +        * Don't try to enable write-combined CPU mappings unless we 100%
> +        * positively know it works, otherwise there may be dragons.
> +        *
> +        * See:
> +        *  - https://bugs.freedesktop.org/show_bug.cgi?id=88758
> +        *  - https://bugs.freedesktop.org/show_bug.cgi?id=84627
>          */
> -       if (!drm_arch_can_wc_memory())
> -               bo->flags &= ~RADEON_GEM_GTT_WC;
> -#endif
> +       if (!(IS_ENABLED(CONFIG_X86_64) && IS_ENABLED(CONFIG_X86_PAT)))
> +               bo->flags &= ~(RADEON_GEM_GTT_WC | RADEON_GEM_GTT_UC);
>
>         radeon_ttm_placement_from_domain(bo, domain);
>         /* Kernel allocation are uninterruptible */
> diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
> index bfe1639df02d..6c3960f4c477 100644
> --- a/include/drm/drm_cache.h
> +++ b/include/drm/drm_cache.h
> @@ -40,16 +40,4 @@ void drm_clflush_sg(struct sg_table *st);
>  void drm_clflush_virt_range(void *addr, unsigned long length);
>  u64 drm_get_max_iomem(void);
>
> -
> -static inline bool drm_arch_can_wc_memory(void)
> -{
> -#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> -       return false;
> -#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
> -       return false;
> -#else
> -       return true;
> -#endif
> -}
> -
>  #endif
>
Christoph Hellwig Jan. 24, 2019, 9:13 a.m. UTC | #25
On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
> But my concern is that it seems likely that non-cache coherent
> implementations are relying on this hack as well. There must be a
> reason that this hack is only disabled for PowerPC platforms if they
> are cache coherent, for instance, and I suspect that that reason is
> that the hack is the only thing ensuring that the CPU mapping
> attributes match the device ones used for these buffers (the vmap()ed
> ones), whereas the rings and other consistent data structures are
> using the DMA API as intended, and thus getting uncached attributes in
> the correct way.

Dave, who added that commit is on Cc together with just about everyone
involved in the review chain.  Based on the previous explanation
that idea what we might want an uncached mapping for some non-coherent
architectures for this to work at all makes sense, but then again
the way to create those mappings is entirely architecture specific,
and also need a cache flushing before creating the mapping to work
properly.  So my working theory is that this code never properly
worked on architectures without DMA coherent for PCIe at all, but
I'd love to be corrected by concrete examples including an explanation
of how it actually ends up working.
Christian König Jan. 24, 2019, 9:25 a.m. UTC | #26
Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
> On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
>> But my concern is that it seems likely that non-cache coherent
>> implementations are relying on this hack as well. There must be a
>> reason that this hack is only disabled for PowerPC platforms if they
>> are cache coherent, for instance, and I suspect that that reason is
>> that the hack is the only thing ensuring that the CPU mapping
>> attributes match the device ones used for these buffers (the vmap()ed
>> ones), whereas the rings and other consistent data structures are
>> using the DMA API as intended, and thus getting uncached attributes in
>> the correct way.
> Dave, who added that commit is on Cc together with just about everyone
> involved in the review chain.  Based on the previous explanation
> that idea what we might want an uncached mapping for some non-coherent
> architectures for this to work at all makes sense, but then again
> the way to create those mappings is entirely architecture specific,
> and also need a cache flushing before creating the mapping to work
> properly.  So my working theory is that this code never properly
> worked on architectures without DMA coherent for PCIe at all, but
> I'd love to be corrected by concrete examples including an explanation
> of how it actually ends up working.

Cache coherency is mandatory for modern GPU operation.

Otherwise you can't implement a bunch of the requirements of the 
userspace APIs.

In other words the applications doesn't inform the driver that the GPU 
or the CPU is accessing data, it just does it and assumes that it works.

Regards,
Christian.
Ard Biesheuvel Jan. 24, 2019, 9:28 a.m. UTC | #27
On Thu, 24 Jan 2019 at 10:25, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
> > On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
> >> But my concern is that it seems likely that non-cache coherent
> >> implementations are relying on this hack as well. There must be a
> >> reason that this hack is only disabled for PowerPC platforms if they
> >> are cache coherent, for instance, and I suspect that that reason is
> >> that the hack is the only thing ensuring that the CPU mapping
> >> attributes match the device ones used for these buffers (the vmap()ed
> >> ones), whereas the rings and other consistent data structures are
> >> using the DMA API as intended, and thus getting uncached attributes in
> >> the correct way.
> > Dave, who added that commit is on Cc together with just about everyone
> > involved in the review chain.  Based on the previous explanation
> > that idea what we might want an uncached mapping for some non-coherent
> > architectures for this to work at all makes sense, but then again
> > the way to create those mappings is entirely architecture specific,
> > and also need a cache flushing before creating the mapping to work
> > properly.  So my working theory is that this code never properly
> > worked on architectures without DMA coherent for PCIe at all, but
> > I'd love to be corrected by concrete examples including an explanation
> > of how it actually ends up working.
>
> Cache coherency is mandatory for modern GPU operation.
>
> Otherwise you can't implement a bunch of the requirements of the
> userspace APIs.
>
> In other words the applications doesn't inform the driver that the GPU
> or the CPU is accessing data, it just does it and assumes that it works.
>

Wonderful!

In that case, do you have any objections to the patch proposed by
Christoph above?
Michel Dänzer Jan. 24, 2019, 9:31 a.m. UTC | #28
On 2019-01-23 5:52 p.m., Ard Biesheuvel wrote:
> On Wed, 23 Jan 2019 at 17:44, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> I think we just want a driver-local check for those combinations
>> where we know this hack actually works, which really just seems
>> to be x86-64 with PAT. Something like the patch below, but maybe with
>> even more strong warnings to not do something like this elsewhere:
> 
> I agree that your patch seems like the right way to ensure that the WC
> optimization hack is only used where we know it works.
> 
> But my concern is that it seems likely that non-cache coherent
> implementations are relying on this hack as well.

I've been trying to tell you they can't rely on that, because the amdgpu
driver doesn't use this functionality for fundamentals such as ring
buffers used for feeding the hardware with commands. Instead, for those
it relies on snooped PCIe transfers being coherent with the CPU caches.


>> -#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
>> -       /* Don't try to enable write-combining when it can't work, or things
>> -        * may be slow
>> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
>> -        */
>> -
>> -#ifndef CONFIG_COMPILE_TEST
>> -#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
>> -        thanks to write-combining
>> -#endif
>> -
>> -       if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
>> -               DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
>> -                             "better performance thanks to write-combining\n");

FWIW, please don't drop these compile and build time warnings where we
continue to take advantage of PAT.
Ard Biesheuvel Jan. 24, 2019, 9:37 a.m. UTC | #29
On Thu, 24 Jan 2019 at 10:31, Michel Dänzer <michel@daenzer.net> wrote:
>
> On 2019-01-23 5:52 p.m., Ard Biesheuvel wrote:
> > On Wed, 23 Jan 2019 at 17:44, Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> I think we just want a driver-local check for those combinations
> >> where we know this hack actually works, which really just seems
> >> to be x86-64 with PAT. Something like the patch below, but maybe with
> >> even more strong warnings to not do something like this elsewhere:
> >
> > I agree that your patch seems like the right way to ensure that the WC
> > optimization hack is only used where we know it works.
> >
> > But my concern is that it seems likely that non-cache coherent
> > implementations are relying on this hack as well.
>
> I've been trying to tell you they can't rely on that, because the amdgpu
> driver doesn't use this functionality for fundamentals such as ring
> buffers used for feeding the hardware with commands. Instead, for those
> it relies on snooped PCIe transfers being coherent with the CPU caches.
>

I understand it does not use this functionality for the ring. Instead,
it uses the DMA API, no?

On non-cache coherent systems, that DMA API will allocate memory and
map it uncached for the CPU so that it is coherent with the non-cache
coherent device.

In any case, if non-cache coherent systems are unlikely to work, and
unsupported in case they do, I am fine with disabling this
optimization unconditionally for non-X86 architectures.

>
> >> -#elif defined(CONFIG_X86) && !defined(CONFIG_X86_PAT)
> >> -       /* Don't try to enable write-combining when it can't work, or things
> >> -        * may be slow
> >> -        * See https://bugs.freedesktop.org/show_bug.cgi?id=88758
> >> -        */
> >> -
> >> -#ifndef CONFIG_COMPILE_TEST
> >> -#warning Please enable CONFIG_MTRR and CONFIG_X86_PAT for better performance \
> >> -        thanks to write-combining
> >> -#endif
> >> -
> >> -       if (bo->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
> >> -               DRM_INFO_ONCE("Please enable CONFIG_MTRR and CONFIG_X86_PAT for "
> >> -                             "better performance thanks to write-combining\n");
>
> FWIW, please don't drop these compile and build time warnings where we
> continue to take advantage of PAT.
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
Christian König Jan. 24, 2019, 9:45 a.m. UTC | #30
Am 24.01.19 um 10:28 schrieb Ard Biesheuvel:
> On Thu, 24 Jan 2019 at 10:25, Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
>>> On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
>>>> But my concern is that it seems likely that non-cache coherent
>>>> implementations are relying on this hack as well. There must be a
>>>> reason that this hack is only disabled for PowerPC platforms if they
>>>> are cache coherent, for instance, and I suspect that that reason is
>>>> that the hack is the only thing ensuring that the CPU mapping
>>>> attributes match the device ones used for these buffers (the vmap()ed
>>>> ones), whereas the rings and other consistent data structures are
>>>> using the DMA API as intended, and thus getting uncached attributes in
>>>> the correct way.
>>> Dave, who added that commit is on Cc together with just about everyone
>>> involved in the review chain.  Based on the previous explanation
>>> that idea what we might want an uncached mapping for some non-coherent
>>> architectures for this to work at all makes sense, but then again
>>> the way to create those mappings is entirely architecture specific,
>>> and also need a cache flushing before creating the mapping to work
>>> properly.  So my working theory is that this code never properly
>>> worked on architectures without DMA coherent for PCIe at all, but
>>> I'd love to be corrected by concrete examples including an explanation
>>> of how it actually ends up working.
>> Cache coherency is mandatory for modern GPU operation.
>>
>> Otherwise you can't implement a bunch of the requirements of the
>> userspace APIs.
>>
>> In other words the applications doesn't inform the driver that the GPU
>> or the CPU is accessing data, it just does it and assumes that it works.
>>
> Wonderful!
>
> In that case, do you have any objections to the patch proposed by
> Christoph above?

Yeah, the patch of Christoph actually goes way to far cause we have 
reports that this works on a bunch of other architectures.

E.g. X86 64bit, PowerPC (under some conditions) and some MIPS.

The only problematic here actually seems to be ARM, so you should 
probably just add an "#ifdef .._ARM return false;".

Regards,
Christian.
Ard Biesheuvel Jan. 24, 2019, 9:59 a.m. UTC | #31
On Thu, 24 Jan 2019 at 10:45, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 24.01.19 um 10:28 schrieb Ard Biesheuvel:
> > On Thu, 24 Jan 2019 at 10:25, Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 24.01.19 um 10:13 schrieb Christoph Hellwig:
> >>> On Wed, Jan 23, 2019 at 05:52:50PM +0100, Ard Biesheuvel wrote:
> >>>> But my concern is that it seems likely that non-cache coherent
> >>>> implementations are relying on this hack as well. There must be a
> >>>> reason that this hack is only disabled for PowerPC platforms if they
> >>>> are cache coherent, for instance, and I suspect that that reason is
> >>>> that the hack is the only thing ensuring that the CPU mapping
> >>>> attributes match the device ones used for these buffers (the vmap()ed
> >>>> ones), whereas the rings and other consistent data structures are
> >>>> using the DMA API as intended, and thus getting uncached attributes in
> >>>> the correct way.
> >>> Dave, who added that commit is on Cc together with just about everyone
> >>> involved in the review chain.  Based on the previous explanation
> >>> that idea what we might want an uncached mapping for some non-coherent
> >>> architectures for this to work at all makes sense, but then again
> >>> the way to create those mappings is entirely architecture specific,
> >>> and also need a cache flushing before creating the mapping to work
> >>> properly.  So my working theory is that this code never properly
> >>> worked on architectures without DMA coherent for PCIe at all, but
> >>> I'd love to be corrected by concrete examples including an explanation
> >>> of how it actually ends up working.
> >> Cache coherency is mandatory for modern GPU operation.
> >>
> >> Otherwise you can't implement a bunch of the requirements of the
> >> userspace APIs.
> >>
> >> In other words the applications doesn't inform the driver that the GPU
> >> or the CPU is accessing data, it just does it and assumes that it works.
> >>
> > Wonderful!
> >
> > In that case, do you have any objections to the patch proposed by
> > Christoph above?
>
> Yeah, the patch of Christoph actually goes way to far cause we have
> reports that this works on a bunch of other architectures.
>
> E.g. X86 64bit, PowerPC (under some conditions) and some MIPS.
>

This is *exactly* my point the whole time.

The current code has

static inline bool drm_arch_can_wc_memory(void)
{
#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
   return false;

which means the optimization is disabled *unless the system is
non-cache coherent*

So if you have reports that the optimization works on some PowerPC, it
must be non-cache coherent PowerPC, because that is the only place
where it is enabled in the first place.

> The only problematic here actually seems to be ARM, so you should
> probably just add an "#ifdef .._ARM return false;".
>

ARM/arm64 does not have a Kconfig symbol like
CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
there are non-coherent ARM systems that are currently working in the
same way as those non-coherent PowerPC systems, we will break them by
doing this.
Christian König Jan. 24, 2019, 11:23 a.m. UTC | #32
Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> [SNIP]
> This is *exactly* my point the whole time.
>
> The current code has
>
> static inline bool drm_arch_can_wc_memory(void)
> {
> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>     return false;
>
> which means the optimization is disabled *unless the system is
> non-cache coherent*
>
> So if you have reports that the optimization works on some PowerPC, it
> must be non-cache coherent PowerPC, because that is the only place
> where it is enabled in the first place.
>
>> The only problematic here actually seems to be ARM, so you should
>> probably just add an "#ifdef .._ARM return false;".
>>
> ARM/arm64 does not have a Kconfig symbol like
> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> there are non-coherent ARM systems that are currently working in the
> same way as those non-coherent PowerPC systems, we will break them by
> doing this.

Summing the things I've read so far for ARM up I actually think it 
depends on a runtime configuration and not on compile time one.

So the whole idea of providing the device to the drm_*_can_wc_memory() 
function isn't so far fetched.

But for now I do prefer working and slightly slower system over broken 
one, so I think we should just disable this on ARM for now.

Christian.
Ard Biesheuvel Jan. 24, 2019, 11:26 a.m. UTC | #33
On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> > [SNIP]
> > This is *exactly* my point the whole time.
> >
> > The current code has
> >
> > static inline bool drm_arch_can_wc_memory(void)
> > {
> > #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> >     return false;
> >
> > which means the optimization is disabled *unless the system is
> > non-cache coherent*
> >
> > So if you have reports that the optimization works on some PowerPC, it
> > must be non-cache coherent PowerPC, because that is the only place
> > where it is enabled in the first place.
> >
> >> The only problematic here actually seems to be ARM, so you should
> >> probably just add an "#ifdef .._ARM return false;".
> >>
> > ARM/arm64 does not have a Kconfig symbol like
> > CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> > there are non-coherent ARM systems that are currently working in the
> > same way as those non-coherent PowerPC systems, we will break them by
> > doing this.
>
> Summing the things I've read so far for ARM up I actually think it
> depends on a runtime configuration and not on compile time one.
>
> So the whole idea of providing the device to the drm_*_can_wc_memory()
> function isn't so far fetched.
>

Thank you.

> But for now I do prefer working and slightly slower system over broken
> one, so I think we should just disable this on ARM for now.
>

Again, this is not about non-cache coherent being slower without the
optimization, it is about non-cache coherent likely not working *at
all* unless the optimization is enabled.

Otherwise, the driver will vmap() DMA pages with cacheable attributes,
while the non-cache coherent device uses uncached attributes, breaking
coherency.
Christian König Jan. 24, 2019, 11:37 a.m. UTC | #34
Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
> On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
>>> [SNIP]
>>> This is *exactly* my point the whole time.
>>>
>>> The current code has
>>>
>>> static inline bool drm_arch_can_wc_memory(void)
>>> {
>>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>>>      return false;
>>>
>>> which means the optimization is disabled *unless the system is
>>> non-cache coherent*
>>>
>>> So if you have reports that the optimization works on some PowerPC, it
>>> must be non-cache coherent PowerPC, because that is the only place
>>> where it is enabled in the first place.
>>>
>>>> The only problematic here actually seems to be ARM, so you should
>>>> probably just add an "#ifdef .._ARM return false;".
>>>>
>>> ARM/arm64 does not have a Kconfig symbol like
>>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
>>> there are non-coherent ARM systems that are currently working in the
>>> same way as those non-coherent PowerPC systems, we will break them by
>>> doing this.
>> Summing the things I've read so far for ARM up I actually think it
>> depends on a runtime configuration and not on compile time one.
>>
>> So the whole idea of providing the device to the drm_*_can_wc_memory()
>> function isn't so far fetched.
>>
> Thank you.
>
>> But for now I do prefer working and slightly slower system over broken
>> one, so I think we should just disable this on ARM for now.
>>
> Again, this is not about non-cache coherent being slower without the
> optimization, it is about non-cache coherent likely not working *at
> all* unless the optimization is enabled.

As Michel tried to explain this CAN'T happen. The optimization is a 
purely optional request from userspace.

> Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> while the non-cache coherent device uses uncached attributes, breaking
> coherency.

Again this is mandated by the userspace APIs anyway. E.g. we can't 
vmap() pages in any other way or our userspace APIs would break.

Regards,
Christian.
Ard Biesheuvel Jan. 24, 2019, 11:45 a.m. UTC | #35
On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
<Christian.Koenig@amd.com> wrote:
>
> Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
> > On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> >> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> >>> [SNIP]
> >>> This is *exactly* my point the whole time.
> >>>
> >>> The current code has
> >>>
> >>> static inline bool drm_arch_can_wc_memory(void)
> >>> {
> >>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> >>>      return false;
> >>>
> >>> which means the optimization is disabled *unless the system is
> >>> non-cache coherent*
> >>>
> >>> So if you have reports that the optimization works on some PowerPC, it
> >>> must be non-cache coherent PowerPC, because that is the only place
> >>> where it is enabled in the first place.
> >>>
> >>>> The only problematic here actually seems to be ARM, so you should
> >>>> probably just add an "#ifdef .._ARM return false;".
> >>>>
> >>> ARM/arm64 does not have a Kconfig symbol like
> >>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> >>> there are non-coherent ARM systems that are currently working in the
> >>> same way as those non-coherent PowerPC systems, we will break them by
> >>> doing this.
> >> Summing the things I've read so far for ARM up I actually think it
> >> depends on a runtime configuration and not on compile time one.
> >>
> >> So the whole idea of providing the device to the drm_*_can_wc_memory()
> >> function isn't so far fetched.
> >>
> > Thank you.
> >
> >> But for now I do prefer working and slightly slower system over broken
> >> one, so I think we should just disable this on ARM for now.
> >>
> > Again, this is not about non-cache coherent being slower without the
> > optimization, it is about non-cache coherent likely not working *at
> > all* unless the optimization is enabled.
>
> As Michel tried to explain this CAN'T happen. The optimization is a
> purely optional request from userspace.
>

Right.

So in that case, we can assume that the following test

static inline bool drm_arch_can_wc_memory(void)
{
#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
    return false;

is bogus, and it was just unnecessary caution on the part of the
author to disregard non-cache coherent devices.
Unfortunately, those commits have no log messages whatsoever, so it is
difficult to infer the intent retroactively.

> > Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> > while the non-cache coherent device uses uncached attributes, breaking
> > coherency.
>
> Again this is mandated by the userspace APIs anyway. E.g. we can't
> vmap() pages in any other way or our userspace APIs would break.
>

OK,

So let's just disable this for all ARM and arm64 then, given that
non-cache coherent is not supported in any case
Alex Deucher Jan. 24, 2019, 1:54 p.m. UTC | #36
On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
> >
> > Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
> > > On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
> > > <Christian.Koenig@amd.com> wrote:
> > >> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> > >>> [SNIP]
> > >>> This is *exactly* my point the whole time.
> > >>>
> > >>> The current code has
> > >>>
> > >>> static inline bool drm_arch_can_wc_memory(void)
> > >>> {
> > >>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> > >>>      return false;
> > >>>
> > >>> which means the optimization is disabled *unless the system is
> > >>> non-cache coherent*
> > >>>
> > >>> So if you have reports that the optimization works on some PowerPC, it
> > >>> must be non-cache coherent PowerPC, because that is the only place
> > >>> where it is enabled in the first place.
> > >>>
> > >>>> The only problematic here actually seems to be ARM, so you should
> > >>>> probably just add an "#ifdef .._ARM return false;".
> > >>>>
> > >>> ARM/arm64 does not have a Kconfig symbol like
> > >>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> > >>> there are non-coherent ARM systems that are currently working in the
> > >>> same way as those non-coherent PowerPC systems, we will break them by
> > >>> doing this.
> > >> Summing the things I've read so far for ARM up I actually think it
> > >> depends on a runtime configuration and not on compile time one.
> > >>
> > >> So the whole idea of providing the device to the drm_*_can_wc_memory()
> > >> function isn't so far fetched.
> > >>
> > > Thank you.
> > >
> > >> But for now I do prefer working and slightly slower system over broken
> > >> one, so I think we should just disable this on ARM for now.
> > >>
> > > Again, this is not about non-cache coherent being slower without the
> > > optimization, it is about non-cache coherent likely not working *at
> > > all* unless the optimization is enabled.
> >
> > As Michel tried to explain this CAN'T happen. The optimization is a
> > purely optional request from userspace.
> >
>
> Right.
>
> So in that case, we can assume that the following test
>
> static inline bool drm_arch_can_wc_memory(void)
> {
> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>     return false;
>
> is bogus, and it was just unnecessary caution on the part of the
> author to disregard non-cache coherent devices.
> Unfortunately, those commits have no log messages whatsoever, so it is
> difficult to infer the intent retroactively.
>
> > > Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> > > while the non-cache coherent device uses uncached attributes, breaking
> > > coherency.
> >
> > Again this is mandated by the userspace APIs anyway. E.g. we can't
> > vmap() pages in any other way or our userspace APIs would break.
> >
>
> OK,
>
> So let's just disable this for all ARM and arm64 then, given that
> non-cache coherent is not supported in any case

So I think we are back to this patch:
https://patchwork.kernel.org/patch/10739023/

Alex
Ard Biesheuvel Jan. 24, 2019, 1:57 p.m. UTC | #37
On Thu, 24 Jan 2019 at 14:54, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
> > <Christian.Koenig@amd.com> wrote:
> > >
> > > Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
> > > > On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
> > > > <Christian.Koenig@amd.com> wrote:
> > > >> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> > > >>> [SNIP]
> > > >>> This is *exactly* my point the whole time.
> > > >>>
> > > >>> The current code has
> > > >>>
> > > >>> static inline bool drm_arch_can_wc_memory(void)
> > > >>> {
> > > >>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> > > >>>      return false;
> > > >>>
> > > >>> which means the optimization is disabled *unless the system is
> > > >>> non-cache coherent*
> > > >>>
> > > >>> So if you have reports that the optimization works on some PowerPC, it
> > > >>> must be non-cache coherent PowerPC, because that is the only place
> > > >>> where it is enabled in the first place.
> > > >>>
> > > >>>> The only problematic here actually seems to be ARM, so you should
> > > >>>> probably just add an "#ifdef .._ARM return false;".
> > > >>>>
> > > >>> ARM/arm64 does not have a Kconfig symbol like
> > > >>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> > > >>> there are non-coherent ARM systems that are currently working in the
> > > >>> same way as those non-coherent PowerPC systems, we will break them by
> > > >>> doing this.
> > > >> Summing the things I've read so far for ARM up I actually think it
> > > >> depends on a runtime configuration and not on compile time one.
> > > >>
> > > >> So the whole idea of providing the device to the drm_*_can_wc_memory()
> > > >> function isn't so far fetched.
> > > >>
> > > > Thank you.
> > > >
> > > >> But for now I do prefer working and slightly slower system over broken
> > > >> one, so I think we should just disable this on ARM for now.
> > > >>
> > > > Again, this is not about non-cache coherent being slower without the
> > > > optimization, it is about non-cache coherent likely not working *at
> > > > all* unless the optimization is enabled.
> > >
> > > As Michel tried to explain this CAN'T happen. The optimization is a
> > > purely optional request from userspace.
> > >
> >
> > Right.
> >
> > So in that case, we can assume that the following test
> >
> > static inline bool drm_arch_can_wc_memory(void)
> > {
> > #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> >     return false;
> >
> > is bogus, and it was just unnecessary caution on the part of the
> > author to disregard non-cache coherent devices.
> > Unfortunately, those commits have no log messages whatsoever, so it is
> > difficult to infer the intent retroactively.
> >
> > > > Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> > > > while the non-cache coherent device uses uncached attributes, breaking
> > > > coherency.
> > >
> > > Again this is mandated by the userspace APIs anyway. E.g. we can't
> > > vmap() pages in any other way or our userspace APIs would break.
> > >
> >
> > OK,
> >
> > So let's just disable this for all ARM and arm64 then, given that
> > non-cache coherent is not supported in any case
>
> So I think we are back to this patch:
> https://patchwork.kernel.org/patch/10739023/
>

Apart from the fact that the issue has nothing to do with write-combining, yes.
Alex Deucher Jan. 24, 2019, 2 p.m. UTC | #38
On Thu, Jan 24, 2019 at 8:57 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 24 Jan 2019 at 14:54, Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Thu, Jan 24, 2019 at 6:45 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
> > > <Christian.Koenig@amd.com> wrote:
> > > >
> > > > Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
> > > > > On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
> > > > > <Christian.Koenig@amd.com> wrote:
> > > > >> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
> > > > >>> [SNIP]
> > > > >>> This is *exactly* my point the whole time.
> > > > >>>
> > > > >>> The current code has
> > > > >>>
> > > > >>> static inline bool drm_arch_can_wc_memory(void)
> > > > >>> {
> > > > >>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> > > > >>>      return false;
> > > > >>>
> > > > >>> which means the optimization is disabled *unless the system is
> > > > >>> non-cache coherent*
> > > > >>>
> > > > >>> So if you have reports that the optimization works on some PowerPC, it
> > > > >>> must be non-cache coherent PowerPC, because that is the only place
> > > > >>> where it is enabled in the first place.
> > > > >>>
> > > > >>>> The only problematic here actually seems to be ARM, so you should
> > > > >>>> probably just add an "#ifdef .._ARM return false;".
> > > > >>>>
> > > > >>> ARM/arm64 does not have a Kconfig symbol like
> > > > >>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
> > > > >>> there are non-coherent ARM systems that are currently working in the
> > > > >>> same way as those non-coherent PowerPC systems, we will break them by
> > > > >>> doing this.
> > > > >> Summing the things I've read so far for ARM up I actually think it
> > > > >> depends on a runtime configuration and not on compile time one.
> > > > >>
> > > > >> So the whole idea of providing the device to the drm_*_can_wc_memory()
> > > > >> function isn't so far fetched.
> > > > >>
> > > > > Thank you.
> > > > >
> > > > >> But for now I do prefer working and slightly slower system over broken
> > > > >> one, so I think we should just disable this on ARM for now.
> > > > >>
> > > > > Again, this is not about non-cache coherent being slower without the
> > > > > optimization, it is about non-cache coherent likely not working *at
> > > > > all* unless the optimization is enabled.
> > > >
> > > > As Michel tried to explain this CAN'T happen. The optimization is a
> > > > purely optional request from userspace.
> > > >
> > >
> > > Right.
> > >
> > > So in that case, we can assume that the following test
> > >
> > > static inline bool drm_arch_can_wc_memory(void)
> > > {
> > > #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
> > >     return false;
> > >
> > > is bogus, and it was just unnecessary caution on the part of the
> > > author to disregard non-cache coherent devices.
> > > Unfortunately, those commits have no log messages whatsoever, so it is
> > > difficult to infer the intent retroactively.
> > >
> > > > > Otherwise, the driver will vmap() DMA pages with cacheable attributes,
> > > > > while the non-cache coherent device uses uncached attributes, breaking
> > > > > coherency.
> > > >
> > > > Again this is mandated by the userspace APIs anyway. E.g. we can't
> > > > vmap() pages in any other way or our userspace APIs would break.
> > > >
> > >
> > > OK,
> > >
> > > So let's just disable this for all ARM and arm64 then, given that
> > > non-cache coherent is not supported in any case
> >
> > So I think we are back to this patch:
> > https://patchwork.kernel.org/patch/10739023/
> >
>
> Apart from the fact that the issue has nothing to do with write-combining, yes.

Your patch has a better description.  Let's go with that.

Alex
Michel Dänzer Jan. 24, 2019, 4:04 p.m. UTC | #39
On 2019-01-24 12:45 p.m., Ard Biesheuvel wrote:
> On Thu, 24 Jan 2019 at 12:37, Koenig, Christian
> <Christian.Koenig@amd.com> wrote:
>> Am 24.01.19 um 12:26 schrieb Ard Biesheuvel:
>>> On Thu, 24 Jan 2019 at 12:23, Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Am 24.01.19 um 10:59 schrieb Ard Biesheuvel:
>>>>> [SNIP]
>>>>> This is *exactly* my point the whole time.
>>>>>
>>>>> The current code has
>>>>>
>>>>> static inline bool drm_arch_can_wc_memory(void)
>>>>> {
>>>>> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>>>>>      return false;
>>>>>
>>>>> which means the optimization is disabled *unless the system is
>>>>> non-cache coherent*
>>>>>
>>>>> So if you have reports that the optimization works on some PowerPC, it
>>>>> must be non-cache coherent PowerPC, because that is the only place
>>>>> where it is enabled in the first place.
>>>>>
>>>>>> The only problematic here actually seems to be ARM, so you should
>>>>>> probably just add an "#ifdef .._ARM return false;".
>>>>>>
>>>>> ARM/arm64 does not have a Kconfig symbol like
>>>>> CONFIG_NOT_COHERENT_CACHE, so we can only disable it everywhere. If
>>>>> there are non-coherent ARM systems that are currently working in the
>>>>> same way as those non-coherent PowerPC systems, we will break them by
>>>>> doing this.
>>>> Summing the things I've read so far for ARM up I actually think it
>>>> depends on a runtime configuration and not on compile time one.
>>>>
>>>> So the whole idea of providing the device to the drm_*_can_wc_memory()
>>>> function isn't so far fetched.
>>>>
>>> Thank you.
>>>
>>>> But for now I do prefer working and slightly slower system over broken
>>>> one, so I think we should just disable this on ARM for now.
>>>>
>>> Again, this is not about non-cache coherent being slower without the
>>> optimization, it is about non-cache coherent likely not working *at
>>> all* unless the optimization is enabled.
>>
>> As Michel tried to explain this CAN'T happen. The optimization is a
>> purely optional request from userspace.
>>
> 
> Right.
> 
> So in that case, we can assume that the following test
> 
> static inline bool drm_arch_can_wc_memory(void)
> {
> #if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
>     return false;
> 
> is bogus, and it was just unnecessary caution on the part of the
> author to disregard non-cache coherent devices.

This is driver-independent core code, meaning "non-snooped PCIe
transfers don't work on cache coherent PPC". It doesn't imply anything
about whether or not amdgpu or any other driver works on
non-cache-coherent PPC in general.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 728e15e5d68a..777fa251838f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -480,7 +480,7 @@  static int amdgpu_bo_do_create(struct amdgpu_device *adev,
 	/* For architectures that don't support WC memory,
 	 * mask out the WC flag from the BO
 	 */
-	if (!drm_arch_can_wc_memory())
+	if (!drm_device_can_wc_memory(adev->ddev))
 		bo->flags &= ~AMDGPU_GEM_CREATE_CPU_GTT_USWC;
 #endif
 
diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 833e909706a9..610889bf6ab5 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -249,7 +249,7 @@  int radeon_bo_create(struct radeon_device *rdev,
 	/* For architectures that don't support WC memory,
 	 * mask out the WC flag from the BO
 	 */
-	if (!drm_arch_can_wc_memory())
+	if (!drm_device_can_wc_memory(rdev->ddev))
 		bo->flags &= ~RADEON_GEM_GTT_WC;
 #endif
 
diff --git a/include/drm/drm_cache.h b/include/drm/drm_cache.h
index bfe1639df02d..ced63b1207a3 100644
--- a/include/drm/drm_cache.h
+++ b/include/drm/drm_cache.h
@@ -33,6 +33,8 @@ 
 #ifndef _DRM_CACHE_H_
 #define _DRM_CACHE_H_
 
+#include <drm/drm_device.h>
+#include <linux/dma-noncoherent.h>
 #include <linux/scatterlist.h>
 
 void drm_clflush_pages(struct page *pages[], unsigned long num_pages);
@@ -41,15 +43,16 @@  void drm_clflush_virt_range(void *addr, unsigned long length);
 u64 drm_get_max_iomem(void);
 
 
-static inline bool drm_arch_can_wc_memory(void)
+static inline bool drm_device_can_wc_memory(struct drm_device *ddev)
 {
-#if defined(CONFIG_PPC) && !defined(CONFIG_NOT_COHERENT_CACHE)
-	return false;
-#elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
-	return false;
-#else
-	return true;
-#endif
+	if (IS_ENABLED(CONFIG_PPC))
+		return IS_ENABLED(CONFIG_NOT_COHERENT_CACHE);
+	else if (IS_ENABLED(CONFIG_MIPS))
+		return !IS_ENABLED(CONFIG_CPU_LOONGSON3);
+	else if (IS_ENABLED(CONFIG_X86))
+		return true;
+
+	return !dev_is_dma_coherent(ddev->dev);
 }
 
 #endif