Clarifying dma_wmb behavior in the presence of non-coherent masters and outer caches

Message ID: 20180706122610.GA6993@arm.com
State: New, archived

Commit Message

Will Deacon July 6, 2018, 12:26 p.m. UTC
Hi Lucas,

On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > that the barriers aren't fully propagated to the PL310 write-caches and
> > master ports. I'm unsure if this is just an artifact of the mentioned
> > MMIO access, i.e. the handling of normal vs. device memory transactions
> > in the PL310.
> 
> I'll check with the hardware folks about the PL310, but I don't see how
> you could propagate barriers over AXI3 anyway because I don't think this
> stuff existed back then.

Damn. As I feared, you can't propagate the barrier to the PL310, so the
A9 is unable to enforce ordering beyond the responses it gets back. The
PL310 will then re-order normal memory accesses (including non-cacheable)
in its store buffer, so we do actually need a bloody CACHE SYNC here.

Patch below. Does it help?

Will

--->8

From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Wed, 4 Jul 2018 10:56:22 +0100
Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()

Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
transactions or shareability domains and was therefore unable to enforce
ordering between memory accesses either side of a barrier instruction,
requiring the CPU to handle all of the ordering itself.

Unfortunately, IP such as the venerable PL310 will re-order stores to
Normal memory inside its store-buffer, meaning that accesses to a coherent
DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
a CACHE_SYNC command in between the accesses. Whilst this is already
the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
to sequences such as:

	STR	Rdata, [Rbuf, #DATA_OFFSET]
	DMB	OSHST
	STR	Rvalid, [Rbuf, #VALID_OFFSET]

being reordered such that the DMA device sees the write of valid but not
the write of data.

This patch ensures that the SoC fabric and the outer-cache are kicked
by dma_wmb() if they are able to buffer transactions outside of the CPU
but before the Point of Coherency.

Reported-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/barrier.h    | 14 ++++++++++----
 arch/arm/include/asm/outercache.h |  6 ++++++
 arch/arm/mm/flush.c               | 24 +++++++++++++++++++++++-
 3 files changed, 39 insertions(+), 5 deletions(-)
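
For illustration, the driver-side pattern the commit message is worried
about looks roughly like the following sketch. The descriptor layout and
names here are hypothetical; only dma_wmb() is the real kernel interface:

	#include <linux/types.h>
	#include <asm/barrier.h>

	/* Hypothetical descriptor shared with a non-coherent DMA engine,
	 * placed in a buffer obtained from dma_alloc_coherent(). */
	struct demo_desc {
		u32 data;	/* payload word, at DATA_OFFSET */
		u32 valid;	/* "descriptor ready" flag, at VALID_OFFSET */
	};

	static void demo_post(struct demo_desc *desc, u32 data)
	{
		desc->data = data;	/* STR Rdata, [Rbuf, #DATA_OFFSET] */

		/* Order the data store before the valid store as seen by
		 * the DMA engine. Without this patch, this is a bare
		 * DMB OSHST, which the PL310 never observes. */
		dma_wmb();

		desc->valid = 1;	/* STR Rvalid, [Rbuf, #VALID_OFFSET] */
	}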

Comments

Oleksij Rempel July 9, 2018, 6:20 a.m. UTC | #1
Hi Will,

On 06.07.2018 14:26, Will Deacon wrote:
> Hi Lucas,
> 
> On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
>> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
>>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
>>> that the barriers aren't fully propagated to the PL310 write-caches and
>>> master ports. I'm unsure if this is just an artifact of the mentioned
> >>> MMIO access, i.e. the handling of normal vs. device memory transactions
> >>> in the PL310.
>>
>> I'll check with the hardware folks about the PL310, but I don't see how
>> you could propagate barriers over AXI3 anyway because I don't think this
>> stuff existed back then.
> 
> Damn. As I feared, you can't propagate the barrier to the PL310, so the
> A9 is unable to enforce ordering beyond the responses it gets back. The
> PL310 will then re-order normal memory accesses (including non-cacheable)
> in its store buffer, so we do actually need a bloody CACHE SYNC here.
> 
> Patch below. Does it help?

Sadly, no.
Is it possible that I'm running into this issue:
764369: Data or Unified Cache Line Maintenance by MVA Fails on
Inner-Shareable Memory
"...
Note that even when all three components of the workaround are in place,
this erratum might still occur. However, this occurrence would require
some extremely rare and complex timing conditions, so that the
probability of reaching the point of failure is extremely low. This low
probability, along with the fact that this erratum requires an uncommon
software scenario, explains why this workaround is likely to be a
reliable practical solution for most systems.
To ARM's knowledge, no failure has been observed in any system when all
three components of this workaround have been implemented.
For critical systems that cannot cope with the extremely low failure
risks associated with the above workaround, a second workaround is
possible which involves changing the mapping of the data being accessed
so that it is in a non-cacheable area. This ensures that the written
data remains uncached, which means it is always visible to non-coherent
agents in the system, or to the instruction side in the case of
self-modifying code, without any need for cache maintenance operation."
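
The second workaround described above (remapping the data so that it is
non-cacheable) is essentially what dma_alloc_coherent() already provides
on a non-coherent ARM system. A minimal sketch, with the wrapper name
invented for illustration:

	#include <linux/dma-mapping.h>

	/* Allocate a buffer the CPU maps Normal non-cacheable on
	 * non-coherent ARM, so CPU stores are visible to DMA agents
	 * without explicit cache maintenance. Illustrative wrapper. */
	static void *demo_alloc_uncached(struct device *dev, size_t size,
					 dma_addr_t *dma_handle)
	{
		return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
	}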


> [quoted patch trimmed; see the Patch section at the end]
Lucas Stach July 9, 2018, 9:45 a.m. UTC | #2
Hi Will,

Am Freitag, den 06.07.2018, 13:26 +0100 schrieb Will Deacon:
> Hi Lucas,
> 
> On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> > On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > > Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > > that the barriers aren't fully propagated to the PL310 write-caches and
> > > master ports. I'm unsure if this is just an artifact of the mentioned
> > > MMIO access, i.e. the handling of normal vs. device memory transactions
> > > in the PL310.
> > 
> > I'll check with the hardware folks about the PL310, but I don't see how
> > you could propagate barriers over AXI3 anyway because I don't think this
> > stuff existed back then.
> 
> Damn. As I feared, you can't propagate the barrier to the PL310, so the
> A9 is unable to enforce ordering beyond the responses it gets back. The
> PL310 will then re-order normal memory accesses (including non-cacheable)
> in its store buffer, so we do actually need a bloody CACHE SYNC here.
> 
> Patch below. Does it help?

It seems the issues we see on the SocFPGA aren't caused by insufficient
barriers, but we are still in the process of tracking down the root
cause.

Aside from this, the patch below seems to match my understanding of the
outer cache issue. I'll give this a spin on i.MX6, as this would
potentially allow us to switch to the dma_* barrier variants in a
number of drivers, reducing overhead on systems without an outer cache.

Thanks,
Lucas

> [quoted patch trimmed; see the Patch section at the end]
Will Deacon Sept. 13, 2018, 1:17 p.m. UTC | #3
On Mon, Jul 09, 2018 at 08:20:22AM +0200, Oleksij Rempel wrote:
> On 06.07.2018 14:26, Will Deacon wrote:
> > On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> >> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> >>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> >>> that the barriers aren't fully propagated to the PL310 write-caches and
> >>> master ports. I'm unsure if this is just an artifact of the mentioned
> >>> MMIO access, i.e. the handling of normal vs. device memory transactions
> >>> in the PL310.
> >>
> >> I'll check with the hardware folks about the PL310, but I don't see how
> >> you could propagate barriers over AXI3 anyway because I don't think this
> >> stuff existed back then.
> > 
> > Damn. As I feared, you can't propagate the barrier to the PL310, so the
> > A9 is unable to enforce ordering beyond the responses it gets back. The
> > PL310 will then re-order normal memory accesses (including non-cacheable)
> > in its store buffer, so we do actually need a bloody CACHE SYNC here.
> > 
> > Patch below. Does it help?
> 
> > Sadly, no.

I just noticed that I'm still carrying this patch in my tree. Are you saying
it doesn't help on your system? It all went quiet, so I assume you fixed
this issue one way or the other?

Will

> > [quoted patch trimmed; see the Patch section at the end]
Oleksij Rempel Sept. 13, 2018, 2:09 p.m. UTC | #4
On Thu, Sep 13, 2018 at 02:17:12PM +0100, Will Deacon wrote:
> On Mon, Jul 09, 2018 at 08:20:22AM +0200, Oleksij Rempel wrote:
> > On 06.07.2018 14:26, Will Deacon wrote:
> > > On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> > >> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > >>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > >>> that the barriers aren't fully propagated to the PL310 write-caches and
> > >>> master ports. I'm unsure if this is just an artifact of the mentioned
> > >>> MMIO access, i.e. the handling of normal vs. device memory transactions
> > >>> in the PL310.
> > >>
> > >> I'll check with the hardware folks about the PL310, but I don't see how
> > >> you could propagate barriers over AXI3 anyway because I don't think this
> > >> stuff existed back then.
> > > 
> > > Damn. As I feared, you can't propagate the barrier to the PL310, so the
> > > A9 is unable to enforce ordering beyond the responses it gets back. The
> > > PL310 will then re-order normal memory accesses (including non-cacheable)
> > > in its store buffer, so we do actually need a bloody CACHE SYNC here.
> > > 
> > > Patch below. Does it help?
> > 
> > Sadly, no.
> 
> I just noticed that I'm still carrying this patch in my tree. Are you saying
> it doesn't help on your system? It all went quiet, so I assume you fixed
> this issue one way or the other?

Hi, I was not able to fix this problem. Currently we are using a 2G
memory split. With this configuration we are not able to reproduce it,
at least not with the described test.

> [quoted patch trimmed; see the Patch section at the end]

Patch

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 69772e742a0a..e31833f72b69 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,17 +53,23 @@ 
 #ifdef CONFIG_ARM_HEAVY_MB
 extern void (*soc_mb)(void);
 extern void arm_heavy_mb(void);
-#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
+extern void arm_heavy_wmb(void);
+extern void arm_heavy_dma_wmb(void);
+#define __arm_heavy_mb()	arm_heavy_mb()
+#define __arm_heavy_wmb()	arm_heavy_wmb()
+#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
 #else
-#define __arm_heavy_mb(x...) dsb(x)
+#define __arm_heavy_mb()	dsb()
+#define __arm_heavy_wmb()	dsb(st)
+#define __arm_heavy_dma_wmb()	dmb(oshst)
 #endif
 
 #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
 #define mb()		__arm_heavy_mb()
 #define rmb()		dsb()
-#define wmb()		__arm_heavy_mb(st)
+#define wmb()		__arm_heavy_wmb()
 #define dma_rmb()	dmb(osh)
-#define dma_wmb()	dmb(oshst)
+#define dma_wmb()	__arm_heavy_dma_wmb()
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
index c2bf24f40177..e744efe6e2a7 100644
--- a/arch/arm/include/asm/outercache.h
+++ b/arch/arm/include/asm/outercache.h
@@ -43,6 +43,12 @@  struct outer_cache_fns {
 
 extern struct outer_cache_fns outer_cache;
 
+#ifdef CONFIG_OUTER_CACHE_SYNC
+static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
+#else
+static inline bool outer_cache_has_sync(void) { return 0; }
+#endif
+
 #ifdef CONFIG_OUTER_CACHE
 /**
  * outer_inv_range - invalidate range of outer cache lines
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 58469623b015..1252689362d8 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -24,7 +24,7 @@ 
 #ifdef CONFIG_ARM_HEAVY_MB
 void (*soc_mb)(void);
 
-void arm_heavy_mb(void)
+static void __arm_heavy_sync(void)
 {
 #ifdef CONFIG_OUTER_CACHE_SYNC
 	if (outer_cache.sync)
@@ -33,7 +33,29 @@  void arm_heavy_mb(void)
 	if (soc_mb)
 		soc_mb();
 }
+
+void arm_heavy_mb(void)
+{
+	dsb();
+	__arm_heavy_sync();
+}
 EXPORT_SYMBOL(arm_heavy_mb);
+
+void arm_heavy_wmb(void)
+{
+	dsb(st);
+	__arm_heavy_sync();
+}
+EXPORT_SYMBOL(arm_heavy_wmb);
+
+void arm_heavy_dma_wmb(void)
+{
+	if (outer_cache_has_sync() || soc_mb)
+		arm_heavy_wmb();
+	else
+		dmb(oshst);
+}
+EXPORT_SYMBOL(arm_heavy_dma_wmb);
 #endif
 
 #ifdef CONFIG_CPU_CACHE_VIPT
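
As a practical note on the soc_mb hook used above: it is a pointer that
platform code sets when the SoC fabric needs an explicit drain. A
hypothetical registration sketch follows (the drain function and init
site are invented for illustration; only the soc_mb pointer is real):

	#include <linux/init.h>
	#include <asm/barrier.h>

	/* Hypothetical platform hook-up; requires CONFIG_ARM_HEAVY_MB. */
	static void demo_soc_fabric_drain(void)
	{
		/* e.g. poke a fabric register that flushes posted writes */
	}

	static int __init demo_soc_barrier_init(void)
	{
		soc_mb = demo_soc_fabric_drain;
		return 0;
	}
	early_initcall(demo_soc_barrier_init);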