diff mbox series

[7/8] membarrier: Remove arm (32) support for SYNC_CORE

Message ID 2142129092ff9aa00e600c42a26c4015b7f5ceec.1623813516.git.luto@kernel.org (mailing list archive)
State New, archived
Headers show
Series membarrier cleanups | expand

Commit Message

Andy Lutomirski June 16, 2021, 3:21 a.m. UTC
On arm32, the only way to safely flush icache from usermode is to call
cacheflush(2).  This also handles any required pipeline flushes, so
membarrier's SYNC_CORE feature is useless on arm.  Remove it.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/arm/Kconfig | 1 -
 1 file changed, 1 deletion(-)

Comments

Russell King (Oracle) June 16, 2021, 9:28 a.m. UTC | #1
On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

Yay! About time too.

Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Peter Zijlstra June 16, 2021, 10:16 a.m. UTC | #2
On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

So SYNC_CORE is there to help an architecture that needs to do something
per CPU. If I$ invalidation is broadcast and I$ invalidation also
triggers the flush of any uarch caches derived from it (if there are
any).

Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
which, if I read things right, end up in arch/arm/mm/*.S, but that
doesn't consider cache_ops_need_broadcast().

Will suggests that perhaps ARM 11MPCore might need this due to their I$
flush maybe not being broadcast
Peter Zijlstra June 16, 2021, 10:20 a.m. UTC | #3
On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > On arm32, the only way to safely flush icache from usermode is to call
> > cacheflush(2).  This also handles any required pipeline flushes, so
> > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> 
> So SYNC_CORE is there to help an architecture that needs to do something
> per CPU. If I$ invalidation is broadcast and I$ invalidation also
> triggers the flush of any uarch caches derived from it (if there are
> any).

Incomplete sentence there: + then we don't need SYNC_CORE.

> Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> which, if I read things right, end up in arch/arm/mm/*.S, but that
> doesn't consider cache_ops_need_broadcast().
> 
> Will suggests that perhaps ARM 11MPCore might need this due to their I$
> flush maybe not being broadcast
Russell King (Oracle) June 16, 2021, 10:34 a.m. UTC | #4
On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > On arm32, the only way to safely flush icache from usermode is to call
> > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > 
> > So SYNC_CORE is there to help an architecture that needs to do something
> > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > triggers the flush of any uarch caches derived from it (if there are
> > any).
> 
> Incomplete sentence there: + then we don't need SYNC_CORE.
> 
> > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > doesn't consider cache_ops_need_broadcast().
> > 
> > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > flush maybe not being broadcast

If it leaves other cores with incoherent I cache, then that's already
a problem for SMP cores, since there could be no guarantee that the
modifications made by one core will be visible to some other core that
ends up running that code - and there is little option for userspace to
work around that except by pinning the thread making the modifications
and subsequently executing the code to a core.

The same is also true of flush_icache_range() - which is used when
loading a kernel module. In the case Will is referring to, these alias
to the same code.
Peter Zijlstra June 16, 2021, 11:10 a.m. UTC | #5
On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > 
> > > So SYNC_CORE is there to help an architecture that needs to do something
> > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > triggers the flush of any uarch caches derived from it (if there are
> > > any).
> > 
> > Incomplete sentence there: + then we don't need SYNC_CORE.
> > 
> > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > doesn't consider cache_ops_need_broadcast().
> > > 
> > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > flush maybe not being broadcast
> 
> If it leaves other cores with incoherent I cache, then that's already
> a problem for SMP cores, since there could be no guarantee that the
> modifications made by one core will be visible to some other core that
> ends up running that code - and there is little option for userspace to
> work around that except by pinning the thread making the modifications
> and subsequently executing the code to a core.

That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
system wide IPI.

> The same is also true of flush_icache_range() - which is used when
> loading a kernel module. In the case Will is referring to, these alias
> to the same code.

Yes, cache_ops_need_broadcast() seems to be missing in more places.
Russell King (Oracle) June 16, 2021, 1:22 p.m. UTC | #6
On Wed, Jun 16, 2021 at 01:10:58PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> > On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > 
> > > > So SYNC_CORE is there to help an architecture that needs to do something
> > > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > > triggers the flush of any uarch caches derived from it (if there are
> > > > any).
> > > 
> > > Incomplete sentence there: + then we don't need SYNC_CORE.
> > > 
> > > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > > doesn't consider cache_ops_need_broadcast().
> > > > 
> > > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > > flush maybe not being broadcast
> > 
> > If it leaves other cores with incoherent I cache, then that's already
> > a problem for SMP cores, since there could be no guarantee that the
> > modifications made by one core will be visible to some other core that
> > ends up running that code - and there is little option for userspace to
> > work around that except by pinning the thread making the modifications
> > and subsequently executing the code to a core.
> 
> That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
> system wide IPI.

If it's a problem, then it needs fixing. sys_cacheflush() is used to
implement GCC's __builtin___clear_cache(). I'm not sure who added this
to gcc.

> > The same is also true of flush_icache_range() - which is used when
> > loading a kernel module. In the case Will is referring to, these alias
> > to the same code.
> 
> Yes, cache_ops_need_broadcast() seems to be missing in more places.

Likely only in places where we care about I/D coherency - as the data
cache is required to be PIPT on these SMP platforms.
Catalin Marinas June 16, 2021, 3:04 p.m. UTC | #7
On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> On Wed, Jun 16, 2021 at 01:10:58PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 16, 2021 at 11:34:46AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 16, 2021 at 12:20:06PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Jun 16, 2021 at 12:16:27PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > > 
> > > > > So SYNC_CORE is there to help an architecture that needs to do something
> > > > > per CPU. If I$ invalidation is broadcast and I$ invalidation also
> > > > > triggers the flush of any uarch caches derived from it (if there are
> > > > > any).
> > > > 
> > > > Incomplete sentence there: + then we don't need SYNC_CORE.
> > > > 
> > > > > Now arm_syscall() NR(cacheflush) seems to do flush_icache_user_range(),
> > > > > which, if I read things right, end up in arch/arm/mm/*.S, but that
> > > > > doesn't consider cache_ops_need_broadcast().
> > > > > 
> > > > > Will suggests that perhaps ARM 11MPCore might need this due to their I$
> > > > > flush maybe not being broadcast
> > > 
> > > If it leaves other cores with incoherent I cache, then that's already
> > > a problem for SMP cores, since there could be no guarantee that the
> > > modifications made by one core will be visible to some other core that
> > > ends up running that code - and there is little option for userspace to
> > > work around that except by pinning the thread making the modifications
> > > and subsequently executing the code to a core.
> > 
> > That's where SYNC_CORE can help. Or you make sys_cacheflush() do a
> > system wide IPI.
> 
> If it's a problem, then it needs fixing. sys_cacheflush() is used to
> implement GCC's __builtin___clear_cache(). I'm not sure who added this
> to gcc.

I'm surprised that it works. I guess it's just luck that the thread
doing the code writing doesn't migrate before the sys_cacheflush() call.

> > > The same is also true of flush_icache_range() - which is used when
> > > loading a kernel module. In the case Will is referring to, these alias
> > > to the same code.
> > 
> > Yes, cache_ops_need_broadcast() seems to be missing in more places.
> 
> Likely only in places where we care about I/D coherency - as the data
> cache is required to be PIPT on these SMP platforms.

We had similar issue with the cache maintenance for DMA. The hack we
employed (in cache.S) is relying on the MESI protocol internals and
forcing a read/write for ownership before the D-cache maintenance.
Luckily ARM11MPCore doesn't do speculative data loads to trigger some
migration back.

The simpler fix for flush_icache_range() is to disable preemption, read
a word in a cacheline to force any dirty lines on another CPU to be
evicted and then issue the D-cache maintenance (for those cache lines
which are still dirty on the current CPU).

It's a hack that only works on ARM11MPCore. Newer MP cores are saner.
Russell King (Oracle) June 16, 2021, 3:23 p.m. UTC | #8
On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> > If it's a problem, then it needs fixing. sys_cacheflush() is used to
> > implement GCC's __builtin___clear_cache(). I'm not sure who added this
> > to gcc.
> 
> I'm surprised that it works. I guess it's just luck that the thread
> doing the code writing doesn't migrate before the sys_cacheflush() call.

Maybe the platforms that use ARM MPCore avoid the issue somehow (maybe
by not using self-modifying code?)

> > Likely only in places where we care about I/D coherency - as the data
> > cache is required to be PIPT on these SMP platforms.
> 
> We had similar issue with the cache maintenance for DMA. The hack we
> employed (in cache.S) is relying on the MESI protocol internals and
> forcing a read/write for ownership before the D-cache maintenance.
> Luckily ARM11MPCore doesn't do speculative data loads to trigger some
> migration back.

That's very similar to the hack that was originally implemented for
MPCore DMA - see the DMA_CACHE_RWFO configuration option.

An interesting point here is that cache_ops_need_broadcast() reads
MMFR3 bits 12..15, which in the MPCore TRM has nothing to with cache
operation broadcasting - but luckily is documented as containing zero.
So, cache_ops_need_broadcast() returns correctly (true) here.

> The simpler fix for flush_icache_range() is to disable preemption, read
> a word in a cacheline to force any dirty lines on another CPU to be
> evicted and then issue the D-cache maintenance (for those cache lines
> which are still dirty on the current CPU).

Is just reading sufficient? If so, why do we do a read-then-write in
the MPCore DMA cache ops? Don't we need the write to force exclusive
ownership? If we don't have exclusive ownership of the dirty line,
how can we be sure to write it out of the caches?
Catalin Marinas June 16, 2021, 3:45 p.m. UTC | #9
On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > On Wed, Jun 16, 2021 at 02:22:27PM +0100, Russell King wrote:
> > > If it's a problem, then it needs fixing. sys_cacheflush() is used to
> > > implement GCC's __builtin___clear_cache(). I'm not sure who added this
> > > to gcc.
> > 
> > I'm surprised that it works. I guess it's just luck that the thread
> > doing the code writing doesn't migrate before the sys_cacheflush() call.
> 
> Maybe the platforms that use ARM MPCore avoid the issue somehow (maybe
> by not using self-modifying code?)

Not sure how widely it is/was used with JITs. In general, I think the
systems at the time were quite tolerant to missing I-cache maintenance
(maybe small caches?). We ran Linux for a while without 826cbdaff297
("[ARM] 5092/1: Fix the I-cache invalidation on ARMv6 and later CPUs").

> > > Likely only in places where we care about I/D coherency - as the data
> > > cache is required to be PIPT on these SMP platforms.
> > 
> > We had similar issue with the cache maintenance for DMA. The hack we
> > employed (in cache.S) is relying on the MESI protocol internals and
> > forcing a read/write for ownership before the D-cache maintenance.
> > Luckily ARM11MPCore doesn't do speculative data loads to trigger some
> > migration back.
> 
> That's very similar to the hack that was originally implemented for
> MPCore DMA - see the DMA_CACHE_RWFO configuration option.

Well, yes, that's what I wrote above ;) (I added the hack and config
option IIRC).

> An interesting point here is that cache_ops_need_broadcast() reads
> MMFR3 bits 12..15, which in the MPCore TRM has nothing to with cache
> operation broadcasting - but luckily is documented as containing zero.
> So, cache_ops_need_broadcast() returns correctly (true) here.

That's typical with any new feature. The 12..15 field was added in ARMv7
stating that cache maintenance is broadcast in hardware. Prior to this,
the field was read-as-zero. So it's not luck but we could have avoided
negating the meaning here, i.e. call it cache_ops_are_broadcast().

> > The simpler fix for flush_icache_range() is to disable preemption, read
> > a word in a cacheline to force any dirty lines on another CPU to be
> > evicted and then issue the D-cache maintenance (for those cache lines
> > which are still dirty on the current CPU).
> 
> Is just reading sufficient? If so, why do we do a read-then-write in
> the MPCore DMA cache ops? Don't we need the write to force exclusive
> ownership? If we don't have exclusive ownership of the dirty line,
> how can we be sure to write it out of the caches?

For cleaning (which is the case for I/D coherency), we only need reading
since we are fine with clean lines being left in the D-cache on other
CPUs. For invalidation, we indeed need to force the exclusive ownership,
hence the write.
Catalin Marinas June 16, 2021, 4 p.m. UTC | #10
On Wed, Jun 16, 2021 at 04:45:29PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> > On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > > The simpler fix for flush_icache_range() is to disable preemption, read
> > > a word in a cacheline to force any dirty lines on another CPU to be
> > > evicted and then issue the D-cache maintenance (for those cache lines
> > > which are still dirty on the current CPU).
> > 
> > Is just reading sufficient? If so, why do we do a read-then-write in
> > the MPCore DMA cache ops? Don't we need the write to force exclusive
> > ownership? If we don't have exclusive ownership of the dirty line,
> > how can we be sure to write it out of the caches?
> 
> For cleaning (which is the case for I/D coherency), we only need reading
> since we are fine with clean lines being left in the D-cache on other
> CPUs. For invalidation, we indeed need to force the exclusive ownership,
> hence the write.

Ah, I'm not sure the I-cache is broadcast in hardware on ARM11MPCore
either. So fixing the D side won't be sufficient.
Russell King (Oracle) June 16, 2021, 4:27 p.m. UTC | #11
On Wed, Jun 16, 2021 at 05:00:51PM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 04:45:29PM +0100, Catalin Marinas wrote:
> > On Wed, Jun 16, 2021 at 04:23:26PM +0100, Russell King wrote:
> > > On Wed, Jun 16, 2021 at 04:04:56PM +0100, Catalin Marinas wrote:
> > > > The simpler fix for flush_icache_range() is to disable preemption, read
> > > > a word in a cacheline to force any dirty lines on another CPU to be
> > > > evicted and then issue the D-cache maintenance (for those cache lines
> > > > which are still dirty on the current CPU).
> > > 
> > > Is just reading sufficient? If so, why do we do a read-then-write in
> > > the MPCore DMA cache ops? Don't we need the write to force exclusive
> > > ownership? If we don't have exclusive ownership of the dirty line,
> > > how can we be sure to write it out of the caches?
> > 
> > For cleaning (which is the case for I/D coherency), we only need reading
> > since we are fine with clean lines being left in the D-cache on other
> > CPUs. For invalidation, we indeed need to force the exclusive ownership,
> > hence the write.
> 
> Ah, I'm not sure the I-cache is broadcast in hardware on ARM11MPCore
> either. So fixing the D side won't be sufficient.

The other question is... do we bother to fix this.

Arnd tells me that the current remaining ARM11MPCore users are:
- CNS3xxx (where there is some martinal interest in the Gateworks
  Laguna platform)
- Similar for OXNAS
- There used to be the Realview MPCore tile - I haven't turned that on
  in ages, and it may be that the 3V cell that backs up the encryption
  keys is dead so it may not even boot.
- Not sure about the story with QEMU - Arnd doesn't think there would
  be a problem there as it may not model caches.

So it seems to come down to a question about CNS3xxx and OXNAS. If
these aren't being used, maybe we can drop ARM11MPCore support and
the associated platforms?

Linus, Krzysztof, Neil, any input?

Thanks.
Krzysztof Hałasa June 17, 2021, 8:55 a.m. UTC | #12
"Russell King (Oracle)" <linux@armlinux.org.uk> writes:

> So it seems to come down to a question about CNS3xxx and OXNAS. If
> these aren't being used, maybe we can drop ARM11MPCore support and
> the associated platforms?

Well, it appears we haven't updated software on our Gateworks Lagunas
(CNS3xxx dual core) for 4 years. This is old stuff, pre-DTB and all. We
have replacement setups (i.MX6 + mPCIe to mPCI bridge) which we don't
use either (due to lack of interest in mPCI - the old parallel, not the
express).

I don't have a problem with the CNS3xxx being dropped. In fact, we don't
use anything (ARM) older than v7 here.

Chris.
Mark Rutland June 17, 2021, 10:40 a.m. UTC | #13
On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

Unfortunately, it's a bit more complicated than that, and these days
SYNC_CORE is equally necessary on arm as on arm64. This is something
that changed in the architecture over time, but since ARMv7 we generally
need both the cache maintenance *and* a context synchronization event
(the latter must occur on the CPU which will execute the instructions).

If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
A3.5.4 "Concurrent modification and execution of instructions" covers
this. That manual can be found at:

	https://developer.arm.com/documentation/ddi0406/latest/

Likewise for ARMv8-A; the latest manual (ARM DDI 0487G.a) covers this in
sections B2.2.5 and E2.3.5. That manual can be found at:

	https://developer.arm.com/documentation/ddi0487/ga

I am not sure about exactly what's required 11MPcore, since that's
somewhat a special case as the only SMP design prior to ARMv7-A
mandating broadcast maintenance.

For intuition's sake, one reason for this is that once a CPU has fetched
an instruction from an instruction cache into its pipeline and that
instruction is "in-flight", changes to that instruction cache are not
guaranteed to affect the "in-flight" copy (which e.g. could be
decomposed into micro-ops and so on). While these parts of a CPU aren't
necessarily designed as caches, they effectively transiently cache a
stale copy of the instruction while it is being executed.

This is more pronounced on newer designs with more complex execution
pipelines (e.g. with bigger windows for out-of-order execution and
speculation), and generally it's unlikely for this to be noticed on
smaller/simpler designs.

As above, modifying instructions requires two things:

1) Making sure that *subsequent* instruction fetches will see the new
   instructions. This is what cacheflush(2) does, and this is similar to
   what SW does on arm64 with DC CVAU + IC IVAU instructions and
   associated memory barriers.

2) Making sure that a CPU fetches the instructions *after* the cache
   maintenance is complete. There are a few ways to do this:

   * A context synchronization event (e.g. an ISB or exception return)
     on the CPU that will execute the instructions. This is what
     membarrier(SYNC_CORE) does.

   * In ARMv8-A there are some restrictions on the order in which
     modified instructions are guaranteed to be observed (e.g. if you
     publish a function, then subsequently install a branch to that new
     function), where an ISB may not be necessary. In the latest ARMv8-A
     manual as linked above, those are described in sections:

     - B2.3.8 "Ordering of instruction fetches" (for 64-bit)
     - E2.3.8 "Ordering of instruction fetches" (for 32-bit)

   * Where we can guarantee that a CPU cannot possibly have an
     instruction in-flight (e.g. due to a lack of a mapping to fetch
     instructions from), nothing is necessary. This is what we rely on
     when faulting in code pages. In these cases, the CPU is liable to
     take fault on the missing translation anyway.

Thanks,
Mark.

> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/arm/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 24804f11302d..89a885fba724 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -10,7 +10,6 @@ config ARM
>  	select ARCH_HAS_FORTIFY_SOURCE
>  	select ARCH_HAS_KEEPINITRD
>  	select ARCH_HAS_KCOV
> -	select ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
>  	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
>  	select ARCH_HAS_PHYS_TO_DMA
> -- 
> 2.31.1
>
Russell King (Oracle) June 17, 2021, 11:23 a.m. UTC | #14
On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > On arm32, the only way to safely flush icache from usermode is to call
> > cacheflush(2).  This also handles any required pipeline flushes, so
> > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> 
> Unfortunately, it's a bit more complicated than that, and these days
> SYNC_CORE is equally necessary on arm as on arm64. This is something
> that changed in the architecture over time, but since ARMv7 we generally
> need both the cache maintenance *and* a context synchronization event
> (the latter must occur on the CPU which will execute the instructions).
> 
> If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> A3.5.4 "Concurrent modification and execution of instructions" covers
> this. That manual can be found at:
> 
> 	https://developer.arm.com/documentation/ddi0406/latest/

Looking at that, sys_cacheflush() meets this. The manual details a
series of cache maintenance calls in "step 1" that the modifying thread
must issue - this is exactly what sys_cacheflush() does. The same is
true for ARMv6, except the "ISB" terminology is replaced by a
"PrefetchFlush" terminology. (I checked DDI0100I).

"step 2" requires an ISB on the "other CPU" prior to executing that
code. As I understand it, in ARMv7, userspace can issue an ISB itself.

For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
for this that isn't availble to userspace. This is where we come to
the situation about ARM 11MPCore, and whether we continue to support
it or not.

So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
as userspace has everything that's required. ARMv6K is a different
matter as we've already identified for several reasons.
Mark Rutland June 17, 2021, 11:33 a.m. UTC | #15
On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > On arm32, the only way to safely flush icache from usermode is to call
> > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > 
> > Unfortunately, it's a bit more complicated than that, and these days
> > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > that changed in the architecture over time, but since ARMv7 we generally
> > need both the cache maintenance *and* a context synchronization event
> > (the latter must occur on the CPU which will execute the instructions).
> > 
> > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > A3.5.4 "Concurrent modification and execution of instructions" covers
> > this. That manual can be found at:
> > 
> > 	https://developer.arm.com/documentation/ddi0406/latest/
> 
> Looking at that, sys_cacheflush() meets this. The manual details a
> series of cache maintenance calls in "step 1" that the modifying thread
> must issue - this is exactly what sys_cacheflush() does. The same is
> true for ARMv6, except the "ISB" terminology is replaced by a
> "PrefetchFlush" terminology. (I checked DDI0100I).
> 
> "step 2" requires an ISB on the "other CPU" prior to executing that
> code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> 
> For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> for this that isn't availble to userspace. This is where we come to
> the situation about ARM 11MPCore, and whether we continue to support
> it or not.
> 
> So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> as userspace has everything that's required. ARMv6K is a different
> matter as we've already identified for several reasons.

Sure, and I agree we should not change cacheflush().

The point of membarrier(SYNC_CORE) is that you can move the cost of that
ISB out of the fast-path in the executing thread(s) and into the
slow-path on the thread which generated the code.

So e.g. rather than an executing thread always having to do:

	LDR	<reg>, [<funcptr>]
	ISB	// in case funcptr was just updated
	BLR	<reg>

... you have the thread generating the code use membarrier(SYNC_CORE)
prior to plublishing the funcptr, and the fast-path on all the executing
threads can be:

	LDR	<reg> [<funcptr>]
	BLR	<reg>

... and thus I think we still want membarrier(SYNC_CORE) so that people
can do this, even if there are other means to achieve the same
functionality.

Thanks,
Mark.

> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Andy Lutomirski June 17, 2021, 1:41 p.m. UTC | #16
On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > 
> > > Unfortunately, it's a bit more complicated than that, and these days
> > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > that changed in the architecture over time, but since ARMv7 we generally
> > > need both the cache maintenance *and* a context synchronization event
> > > (the latter must occur on the CPU which will execute the instructions).
> > > 
> > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > this. That manual can be found at:
> > > 
> > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > 
> > Looking at that, sys_cacheflush() meets this. The manual details a
> > series of cache maintenance calls in "step 1" that the modifying thread
> > must issue - this is exactly what sys_cacheflush() does. The same is
> > true for ARMv6, except the "ISB" terminology is replaced by a
> > "PrefetchFlush" terminology. (I checked DDI0100I).
> > 
> > "step 2" requires an ISB on the "other CPU" prior to executing that
> > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > 
> > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > for this that isn't availble to userspace. This is where we come to
> > the situation about ARM 11MPCore, and whether we continue to support
> > it or not.
> > 
> > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > as userspace has everything that's required. ARMv6K is a different
> > matter as we've already identified for several reasons.
> 
> Sure, and I agree we should not change cacheflush().
> 
> The point of membarrier(SYNC_CORE) is that you can move the cost of that
> ISB out of the fast-path in the executing thread(s) and into the
> slow-path on the thread which generated the code.
> 
> So e.g. rather than an executing thread always having to do:
> 
> 	LDR	<reg>, [<funcptr>]
> 	ISB	// in case funcptr was just updated
> 	BLR	<reg>
> 
> ... you have the thread generating the code use membarrier(SYNC_CORE)
> prior to plublishing the funcptr, and the fast-path on all the executing
> threads can be:
> 
> 	LDR	<reg> [<funcptr>]
> 	BLR	<reg>
> 
> ... and thus I think we still want membarrier(SYNC_CORE) so that people
> can do this, even if there are other means to achieve the same
> functionality.

I had the impression that sys_cacheflush() did that.  Am I wrong?

In any event, I’m even more convinced that no new SYNC_CORE arches should be added. We need a new API that just does the right thing.
Mark Rutland June 17, 2021, 1:51 p.m. UTC | #17
On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> 
> 
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > 
> > > > Unfortunately, it's a bit more complicated than that, and these days
> > > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > > that changed in the architecture over time, but since ARMv7 we generally
> > > > need both the cache maintenance *and* a context synchronization event
> > > > (the latter must occur on the CPU which will execute the instructions).
> > > > 
> > > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > > this. That manual can be found at:
> > > > 
> > > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > > 
> > > Looking at that, sys_cacheflush() meets this. The manual details a
> > > series of cache maintenance calls in "step 1" that the modifying thread
> > > must issue - this is exactly what sys_cacheflush() does. The same is
> > > true for ARMv6, except the "ISB" terminology is replaced by a
> > > "PrefetchFlush" terminology. (I checked DDI0100I).
> > > 
> > > "step 2" requires an ISB on the "other CPU" prior to executing that
> > > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > > 
> > > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > > for this that isn't availble to userspace. This is where we come to
> > > the situation about ARM 11MPCore, and whether we continue to support
> > > it or not.
> > > 
> > > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > > as userspace has everything that's required. ARMv6K is a different
> > > matter as we've already identified for several reasons.
> > 
> > Sure, and I agree we should not change cacheflush().
> > 
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> > 
> > So e.g. rather than an executing thread always having to do:
> > 
> > 	LDR	<reg>, [<funcptr>]
> > 	ISB	// in case funcptr was just updated
> > 	BLR	<reg>
> > 
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to plublishing the funcptr, and the fast-path on all the executing
> > threads can be:
> > 
> > 	LDR	<reg> [<funcptr>]
> > 	BLR	<reg>
> > 
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> 
> I had the impression that sys_cacheflush() did that.  Am I wrong?

Currently sys_cacheflush() doesn't do this, and IIUC it has never done
remote context synchronization even for architectures that need that
(e.g. x86 requiring a serializing instruction).

> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing. 

My intuition is the other way around, and that this is a gnereally
useful thing for architectures that require context synchronization.

It's not clear to me what "the right thing" would mean specifically, and
on architectures with userspace cache maintenance JITs can usually do
the most optimal maintenance, and only need help for the context
synchronization.

Thanks,
Mark.
Andy Lutomirski June 17, 2021, 2 p.m. UTC | #18
On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:

> > In any event, I’m even more convinced that no new SYNC_CORE arches
> > should be added. We need a new API that just does the right thing. 
> 
> My intuition is the other way around, and that this is a gnereally
> useful thing for architectures that require context synchronization.

Except that you can't use it in a generic way.  You have to know the specific rules for your arch.

> 
> It's not clear to me what "the right thing" would mean specifically, and
> on architectures with userspace cache maintenance JITs can usually do
> the most optimal maintenance, and only need help for the context
> synchronization.
> 

This I simply don't believe -- I doubt that any sane architecture really works like this.  I wrote an email about it to Intel that apparently generated internal discussion but no results.  Consider:

mmap(some shared library, some previously unmapped address);

this does no heavyweight synchronization, at least on x86.  There is no "serializing" instruction in the fast path, and it *works* despite anything the SDM may or may not say.

We can and, IMO, should develop a sane way for user programs to install instructions into VMAs, for security-conscious software to verify them (by splitting the read and write sides?), and for their consumers to execute them, without knowing any arch details.  And I think this can be done with no IPIs except for possible TLB flushing when needed, at least on most architectures.  It would require a nontrivial amount of design work, and it would not resemble sys_cacheflush() or SYNC_CORE.

--Andy
Peter Zijlstra June 17, 2021, 2:05 p.m. UTC | #19
On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:

> > Sure, and I agree we should not change cacheflush().
> > 
> > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > ISB out of the fast-path in the executing thread(s) and into the
> > slow-path on the thread which generated the code.
> > 
> > So e.g. rather than an executing thread always having to do:
> > 
> > 	LDR	<reg>, [<funcptr>]
> > 	ISB	// in case funcptr was just updated
> > 	BLR	<reg>
> > 
> > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > prior to plublishing the funcptr, and the fast-path on all the executing
> > threads can be:
> > 
> > 	LDR	<reg> [<funcptr>]
> > 	BLR	<reg>
> > 
> > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > can do this, even if there are other means to achieve the same
> > functionality.
> 
> I had the impression that sys_cacheflush() did that.  Am I wrong?

Yes, sys_cacheflush() only does what it says on the tin (and only
correctly for hardware broadcast -- everything except 11mpcore).

It only invalidates the caches, but not the per CPU derived state like
prefetch buffers and micro-op buffers, and certainly not instructions
already in flight.

So anything OoO needs at the very least a complete pipeline stall
injected, but probably something stronger to make it flush the buffers.

> In any event, I’m even more convinced that no new SYNC_CORE arches
> should be added. We need a new API that just does the right thing. 

I really don't understand why you hate the thing so much; SYNC_CORE is a
means of injecting whatever instruction is required to flush all uarch
state related to instructions on all theads (not all CPUs) of a process
as efficient as possible.

The alternative is sending signals to all threads (including the
non-running ones) which is known to scale very poorly indeed, or, as
Mark suggests above, have very expensive instructions unconditinoally in
the instruction stream, which is also undesired.
Mathieu Desnoyers June 17, 2021, 2:16 p.m. UTC | #20
On 17-Jun-2021 02:51:33 PM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> > 
> > 
> > On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > > On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > > > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > > cacheflush(2).  This also handles any required pipeline flushes, so
> > > > > > membarrier's SYNC_CORE feature is useless on arm.  Remove it.
> > > > > 
> > > > > Unfortunately, it's a bit more complicated than that, and these days
> > > > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > > > that changed in the architecture over time, but since ARMv7 we generally
> > > > > need both the cache maintenance *and* a context synchronization event
> > > > > (the latter must occur on the CPU which will execute the instructions).
> > > > > 
> > > > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > > > this. That manual can be found at:
> > > > > 
> > > > > 	https://developer.arm.com/documentation/ddi0406/latest/
> > > > 
> > > > Looking at that, sys_cacheflush() meets this. The manual details a
> > > > series of cache maintenance calls in "step 1" that the modifying thread
> > > > must issue - this is exactly what sys_cacheflush() does. The same is
> > > > true for ARMv6, except the "ISB" terminology is replaced by a
> > > > "PrefetchFlush" terminology. (I checked DDI0100I).
> > > > 
> > > > "step 2" requires an ISB on the "other CPU" prior to executing that
> > > > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > > > 
> > > > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > > > for this that isn't availble to userspace. This is where we come to
> > > > the situation about ARM 11MPCore, and whether we continue to support
> > > > it or not.
> > > > 
> > > > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > > > as userspace has everything that's required. ARMv6K is a different
> > > > matter as we've already identified for several reasons.
> > > 
> > > Sure, and I agree we should not change cacheflush().
> > > 
> > > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > > ISB out of the fast-path in the executing thread(s) and into the
> > > slow-path on the thread which generated the code.
> > > 
> > > So e.g. rather than an executing thread always having to do:
> > > 
> > > 	LDR	<reg>, [<funcptr>]
> > > 	ISB	// in case funcptr was just updated
> > > 	BLR	<reg>
> > > 
> > > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > > prior to plublishing the funcptr, and the fast-path on all the executing
> > > threads can be:
> > > 
> > > 	LDR	<reg> [<funcptr>]
> > > 	BLR	<reg>
> > > 
> > > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > > can do this, even if there are other means to achieve the same
> > > functionality.
> > 
> > I had the impression that sys_cacheflush() did that.  Am I wrong?
> 
> Currently sys_cacheflush() doesn't do this, and IIUC it has never done
> remote context synchronization even for architectures that need that
> (e.g. x86 requiring a serializing instruction).
> 
> > In any event, I’m even more convinced that no new SYNC_CORE arches
> > should be added. We need a new API that just does the right thing. 
> 
> My intuition is the other way around, and that this is a gnereally
> useful thing for architectures that require context synchronization.
> 
> It's not clear to me what "the right thing" would mean specifically, and
> on architectures with userspace cache maintenance JITs can usually do
> the most optimal maintenance, and only need help for the context
> synchronization.

If I can attempt to summarize the current situation for ARMv7:

- In addition to the cache flushing on the core doing the code update,
  the architecture requires every core to perform a context synchronizing
  instruction before executing the updated code.

- sys_cacheflush() don't do this core sync on every core. It also takes a
  single address range as parameter.

- ARM, ARM64, powerpc, powerpc64, x86, x86-64 all currently handle the
  context synchronization requirement for updating user-space code on
  SMP with sys_membarrier SYNC_CORE. It's not, however, meant to replace
  explicit cache flushing operations if those are needed.

So removing membarrier SYNC_CORE from ARM would be a step backward here.
On ARMv7, the SYNC_CORE is needed _in addition_ to sys_cacheflush.

Adding a sync-core operation at the end of sys_cacheflush would be
inefficient for common GC use-cases where a rather large set of address
ranges are invalidated in one go: for this, we either want the GC to:

- Invoke sys_cacheflush for each targeted range, and then issue a single
  sys_membarrier SYNC_CORE, or

- Implement a new "sys_cacheflush_iov" which takes an iovec input. There
  I see that it could indeed invalidate all relevant cache lines *and*
  issue the SYNC_CORE at the end.

But shoehorning the SYNC_CORE in the pre-existing sys_cacheflush after
the fact seems like a bad idea.

Thanks,

Mathieu
Mark Rutland June 17, 2021, 2:20 p.m. UTC | #21
On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> 
> 
> On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> > On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> 
> > > In any event, I’m even more convinced that no new SYNC_CORE arches
> > > should be added. We need a new API that just does the right thing. 
> > 
> > My intuition is the other way around, and that this is a gnereally
> > useful thing for architectures that require context synchronization.
> 
> Except that you can't use it in a generic way.  You have to know the
> specific rules for your arch.

That's generally true for modifying instruction streams though? The man
page for cacheflush(2) calls out that it is not portable.

I think what's necessary here is some mandatory per-arch documentation?

> > It's not clear to me what "the right thing" would mean specifically, and
> > on architectures with userspace cache maintenance JITs can usually do
> > the most optimal maintenance, and only need help for the context
> > synchronization.
> > 
> 
> This I simply don't believe -- I doubt that any sane architecture
> really works like this.  I wrote an email about it to Intel that
> apparently generated internal discussion but no results.  Consider:
> 
> mmap(some shared library, some previously unmapped address);
> 
> this does no heavyweight synchronization, at least on x86.  There is
> no "serializing" instruction in the fast path, and it *works* despite
> anything the SDM may or may not say.

Sure, and I called this case out specifically when I said:

|   * Where we can guarantee that a CPU cannot possibly have an
|     instruction in-flight (e.g. due to a lack of a mapping to fetch
|     instructions from), nothing is necessary. This is what we rely on
|     when faulting in code pages. In these cases, the CPU is liable to
|     take fault on the missing translation anyway.

.. what really matters is whether the CPU had the oppoprtunity to fetch
something stale; the context synchronization is necessary to discard
that.

Bear in mind that in many cases where this could occur in theory, we
don't hit in practice because CPUs don't happen to predict/speculate as
aggressively as they are permitted to. On arm/arm64 it's obvious that
this is a problem because the documentation clearly defines the
boundaries of what a CPU is permitted to do, whereas on other
architectures docuentation is not necessarily as clear whether this is
permited or whether the architecture mandates additional guarantees.

> We can and, IMO, should develop a sane way for user programs to
> install instructions into VMAs, for security-conscious software to
> verify them (by splitting the read and write sides?), and for their
> consumers to execute them, without knowing any arch details.  And I
> think this can be done with no IPIs except for possible TLB flushing
> when needed, at least on most architectures.  It would require a
> nontrivial amount of design work, and it would not resemble
> sys_cacheflush() or SYNC_CORE.

I'm not opposed to adding new interfaces for stuff like that, but I
don't think that membarrier(SYNC_CORE) or cacheflush(2) are necessarily
wrong as-is.

Thanks,
Mark.
Peter Zijlstra June 17, 2021, 3:01 p.m. UTC | #22
On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:

> > It's not clear to me what "the right thing" would mean specifically, and
> > on architectures with userspace cache maintenance JITs can usually do
> > the most optimal maintenance, and only need help for the context
> > synchronization.
> > 
> 
> This I simply don't believe -- I doubt that any sane architecture
> really works like this.  I wrote an email about it to Intel that
> apparently generated internal discussion but no results.  Consider:
> 
> mmap(some shared library, some previously unmapped address);
> 
> this does no heavyweight synchronization, at least on x86.  There is
> no "serializing" instruction in the fast path, and it *works* despite
> anything the SDM may or may not say.

I'm confused; why do you think that is relevant?

The only way to get into a memory address space is CR3 write, which is
serializing and will flush everything. Since there wasn't anything
mapped, nothing could be 'cached' from that location.

So that has to work...

> We can and, IMO, should develop a sane way for user programs to
> install instructions into VMAs, for security-conscious software to
> verify them (by splitting the read and write sides?), and for their
> consumers to execute them, without knowing any arch details.  And I
> think this can be done with no IPIs except for possible TLB flushing
> when needed, at least on most architectures.  It would require a
> nontrivial amount of design work, and it would not resemble
> sys_cacheflush() or SYNC_CORE.

The interesting use-case is where we modify code that is under active
execution in a multi-threaded process; where CPU0 runs code and doesn't
make any syscalls at all, while CPU1 modifies code that is visible to
CPU0.

In that case CPU0 can have various internal state still reflecting the
old instructions that no longer exist in memory -- presumably.

We also need to inject at least a full memory barrier and pipeline
flush, to create a proper before and after modified. To reason about
when the *other* threads will be able to observe the new code.

Now, the SDM documents that prefetch and trace buffers are not flushed
on i$ invalidate (actual implementations might of course differ) and
doing this requires the SERIALIZE instruction or one of the many
instructions that implies this, one of which is IRET.

For the cross-modifying case, I really don't see how you can not send an
IPI and expect behavour one can reason about, irrespective of any
non-coherent behaviour.

Now, the SDM documents non-coherent behaviour and requires SERIALIZE,
while at the same time any IPI already implies IRET which implies
SERIALIZE -- except some Luto guy was having plans to optimize the IRET
paths so we couldn't rely on that.
Peter Zijlstra June 17, 2021, 3:13 p.m. UTC | #23
On Thu, Jun 17, 2021 at 05:01:53PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 07:00:26AM -0700, Andy Lutomirski wrote:
> > On Thu, Jun 17, 2021, at 6:51 AM, Mark Rutland wrote:
> 
> > > It's not clear to me what "the right thing" would mean specifically, and
> > > on architectures with userspace cache maintenance JITs can usually do
> > > the most optimal maintenance, and only need help for the context
> > > synchronization.
> > > 
> > 
> > This I simply don't believe -- I doubt that any sane architecture
> > really works like this.  I wrote an email about it to Intel that
> > apparently generated internal discussion but no results.  Consider:
> > 
> > mmap(some shared library, some previously unmapped address);
> > 
> > this does no heavyweight synchronization, at least on x86.  There is
> > no "serializing" instruction in the fast path, and it *works* despite
> > anything the SDM may or may not say.
> 
> I'm confused; why do you think that is relevant?
> 
> The only way to get into a memory address space is CR3 write, which is
> serializing and will flush everything. Since there wasn't anything
> mapped, nothing could be 'cached' from that location.
> 
> So that has to work...

Ooh, you mean mmap where there was something mmap'ed before. Not virgin
space so to say.

But in that case, the unmap() would've caused a TLB invalidate, which on
x86 is IPIs, which is IRET.

Other architectures include I/D cache flushes in their TLB
invalidations -- but as elsewhere in the thread, that might not be
suffient on its own.

But yes, I think TLBI has to imply flushing micro-arch instruction
related buffers for any of that to work.
Andy Lutomirski June 18, 2021, 12:07 a.m. UTC | #24
On 6/15/21 8:21 PM, Andy Lutomirski wrote:
> On arm32, the only way to safely flush icache from usermode is to call
> cacheflush(2).  This also handles any required pipeline flushes, so
> membarrier's SYNC_CORE feature is useless on arm.  Remove it.

After all the discussion, I'm dropping this patch.
Linus Walleij June 18, 2021, 12:54 p.m. UTC | #25
On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:

> Arnd tells me that the current remaining ARM11MPCore users are:
> - CNS3xxx (where there is some martinal interest in the Gateworks
>   Laguna platform)
> - Similar for OXNAS
> - There used to be the Realview MPCore tile - I haven't turned that on
>   in ages, and it may be that the 3V cell that backs up the encryption
>   keys is dead so it may not even boot.

I have this machine with 4 x ARM11 MPCore, it works like a charm.
I use it to test exactly this kind of stuff, I know if a kernel works
on ARM11MPCore it works on anything because of how fragile
it is.

> So it seems to come down to a question about CNS3xxx and OXNAS. If
> these aren't being used, maybe we can drop ARM11MPCore support and
> the associated platforms?
>
> Linus, Krzysztof, Neil, any input?

I don't especially need to keep the ARM11MPCore machine alive,
it is just a testchip after all. The Oxnas is another story, that has wide
deployment and was contributed recently (2016) and has excellent
support in OpenWrt so I wouldn't really want
to axe that.

Yours,
Linus Walleij
Russell King (Oracle) June 18, 2021, 1:19 p.m. UTC | #26
On Fri, Jun 18, 2021 at 02:54:05PM +0200, Linus Walleij wrote:
> On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> 
> > Arnd tells me that the current remaining ARM11MPCore users are:
> > - CNS3xxx (where there is some martinal interest in the Gateworks
> >   Laguna platform)
> > - Similar for OXNAS
> > - There used to be the Realview MPCore tile - I haven't turned that on
> >   in ages, and it may be that the 3V cell that backs up the encryption
> >   keys is dead so it may not even boot.
> 
> I have this machine with 4 x ARM11 MPCore, it works like a charm.
> I use it to test exactly this kind of stuff, I know if a kernel works
> on ARM11MPCore it works on anything because of how fragile
> it is.
> 
> > So it seems to come down to a question about CNS3xxx and OXNAS. If
> > these aren't being used, maybe we can drop ARM11MPCore support and
> > the associated platforms?
> >
> > Linus, Krzysztof, Neil, any input?
> 
> I don't especially need to keep the ARM11MPCore machine alive,
> it is just a testchip after all. The Oxnas is another story, that has wide
> deployment and was contributed recently (2016) and has excellent
> support in OpenWrt so I wouldn't really want
> to axe that.

So I suppose the next question is... are these issues (with userland
self-modifying code and kernel module loading) entirely theoretical
or can they be produced on real hardware?

If they can't be produced on real hardware, and we attempt to fix them
how do we know that the fix has worked...
Arnd Bergmann June 18, 2021, 1:36 p.m. UTC | #27
On Fri, Jun 18, 2021 at 2:54 PM Linus Walleij <linus.walleij@linaro.org> wrote:
> On Wed, Jun 16, 2021 at 6:27 PM Russell King (Oracle) <linux@armlinux.org.uk> wrote:

> > So it seems to come down to a question about CNS3xxx and OXNAS. If
> > these aren't being used, maybe we can drop ARM11MPCore support and
> > the associated platforms?
> >
> > Linus, Krzysztof, Neil, any input?
>
> I don't especially need to keep the ARM11MPCore machine alive,
> it is just a testchip after all. The Oxnas is another story, that has wide
> deployment and was contributed recently (2016) and has excellent
> support in OpenWrt so I wouldn't really want to axe that.

Agreed, as long as oxnas and/or cns3xxx are around, we should just keep
the realview 11mpcore support, but if both of the commercial platforms
are gone, then the realview can be retired as far as I'm concerned.

Regarding oxnas, I see that OpenWRT has a number of essential
device drivers (sata, pcie, usb and reset) that look like they could just
be merged upstream, but that effort appears to have stalled: no
device support was added to the dts files since the original 2016
merge. While the support in OpenWRT may be excellent, the platform
support in the mainline kernel is limited to ethernet, nand, uart
and gpio.

       Arnd
diff mbox series

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 24804f11302d..89a885fba724 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -10,7 +10,6 @@  config ARM
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_KEEPINITRD
 	select ARCH_HAS_KCOV
-	select ARCH_HAS_MEMBARRIER_SYNC_CORE
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
 	select ARCH_HAS_PHYS_TO_DMA