Message ID | bf59ecb5487171a852bcc8cdd553ec797aedc485.1609093476.git.luto@kernel.org (mailing list archive)
---|---
State | New, archived
Series | [RFC,please,help] membarrier: Rewrite sync_core_before_usermode()
On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > and I'm suspicious that it has never been very well tested. My apologies > > for not reviewing this more carefully in the first place. > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > sync core feature as of now: Sigh, I missed arm (32). Russell or ARM folks, what's the right incantation to make the CPU notice instruction changes initiated by other cores on 32-bit ARM? > > > # Architecture requirements > # > # * arm/arm64/powerpc > # > # Rely on implicit context synchronization as a result of exception return > # when returning from IPI handler, and when returning to user-space. > # > # * x86 > # > # x86-32 uses IRET as return from interrupt, which takes care of the IPI. > # However, it uses both IRET and SYSEXIT to go back to user-space. The IRET > # instruction is core serializing, but not SYSEXIT. > # > # x86-64 uses IRET as return from interrupt, which takes care of the IPI. > # However, it can return to user-space through either SYSRETL (compat code), > # SYSRETQ, or IRET. > # > # Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely > # instead on write_cr3() performed by switch_mm() to provide core serialization > # after changing the current mm, and deal with the special case of kthread -> > # uthread (temporarily keeping current mm into active_mm) by issuing a > # sync_core_before_usermode() in that specific case. > I need to update that document as part of my series. > This is based on direct feedback from the architecture maintainers. > > You seem to have noticed odd cases on arm64 where this guarantee does not > match reality. Where exactly can we find this in the code, and which part > of the architecture manual can you point us to which supports your concern ? > > Based on the notes I have, use of `eret` on aarch64 guarantees a context synchronizing > instruction when returning to user-space. Based on my reading of the manual, ERET on ARM doesn't synchronize anything at all. I can't find any evidence that it synchronizes data or instructions, and I've seen reports that the CPU will happily speculate right past it. --Andy
On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > and I'm suspicious that it has never been very well tested. My apologies > > > for not reviewing this more carefully in the first place. > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > sync core feature as of now: > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > incantation to make the CPU notice instruction changes initiated by > other cores on 32-bit ARM? You need to call flush_icache_range(), since the changes need to be flushed from the data cache to the point of unification (of the Harvard I and D), and the instruction cache needs to be invalidated so it can then see those updated instructions. This will also take care of the necessary barriers that the CPU requires for you. ... as documented in Documentation/core-api/cachetlb.rst and so should be available on every kernel supported CPU.
On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > > and I'm suspicious that it has never been very well tested. My apologies > > > > for not reviewing this more carefully in the first place. > > > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > > sync core feature as of now: > > > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > > incantation to make the CPU notice instruction changes initiated by > > other cores on 32-bit ARM? > > You need to call flush_icache_range(), since the changes need to be > flushed from the data cache to the point of unification (of the Harvard > I and D), and the instruction cache needs to be invalidated so it can > then see those updated instructions. This will also take care of the > necessary barriers that the CPU requires for you. With what parameters? From looking at the header, this is for the case in which the kernel writes some memory and then intends to execute it. That's not what membarrier() does at all. membarrier() works like this: User thread 1: write to RWX memory *or* write to an RW alias of an X region. membarrier(...); somehow tell thread 2 that we're ready (with a store release, perhaps, or even just a relaxed store.) User thread 2: wait for the indication from thread 1. barrier(); jump to the code. membarrier() is, for better or for worse, not given a range of addresses. On x86, the documentation is a bit weak, but a strict reading indicates that thread 2 must execute a serializing instruction at some point after thread 1 writes the code and before thread 2 runs it. membarrier() does this by sending an IPI and ensuring that a "serializing" instruction (thanks for great terminology, Intel) is executed. Note that flush_icache_range() is a no-op on x86, and I've asked Intel's architects to please clarify their precise rules. No response yet. On arm64, flush_icache_range() seems to send an IPI, and that's not what I want. membarrier() already does an IPI. I suppose one option is to revert support for this membarrier() operation on arm, but it would be nice to just implement it instead. --Andy
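For concreteness, the two-thread pattern described above can be sketched in userspace roughly as follows. This is an editorial illustration, not code from the thread: membarrier(2) is called through syscall(2) because glibc provides no wrapper, the `code_ready`/`code_ptr` names are made up, error handling is omitted, and the architecture-specific cache maintenance discussed later in the thread is deliberately left out.

```c
#include <linux/membarrier.h>   /* MEMBARRIER_CMD_* constants */
#include <sys/syscall.h>        /* __NR_membarrier */
#include <unistd.h>             /* syscall() */
#include <stdatomic.h>

static _Atomic int code_ready;          /* hypothetical hand-off flag */
static void (*code_ptr)(void);          /* points into RWX memory or an X alias */

static long membarrier(int cmd, unsigned int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

/* Thread 1: writes the code, then publishes it. A real program must have
 * called membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0)
 * once beforehand to register its intent. */
void thread1_publish(void)
{
        /* ... write instructions through code_ptr or an RW alias of it ... */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        atomic_store_explicit(&code_ready, 1, memory_order_release);
}

/* Thread 2: waits for the indication from thread 1, then jumps to the code. */
void thread2_run(void)
{
        while (atomic_load_explicit(&code_ready, memory_order_acquire) != 1)
                ;
        code_ptr();
}
```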
On Mon, Dec 28, 2020 at 09:14:23AM -0800, Andy Lutomirski wrote: > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > > > and I'm suspicious that it has never been very well tested. My apologies > > > > > for not reviewing this more carefully in the first place. > > > > > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > > > sync core feature as of now: > > > > > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > > > incantation to make the CPU notice instruction changes initiated by > > > other cores on 32-bit ARM? > > > > You need to call flush_icache_range(), since the changes need to be > > flushed from the data cache to the point of unification (of the Harvard > > I and D), and the instruction cache needs to be invalidated so it can > > then see those updated instructions. This will also take care of the > > necessary barriers that the CPU requires for you. > > With what parameters? From looking at the header, this is for the > case in which the kernel writes some memory and then intends to > execute it. That's not what membarrier() does at all. membarrier() > works like this: You didn't specify that you weren't looking at kernel memory. If you're talking about userspace, then the interface you require is flush_icache_user_range(), which does the same as flush_icache_range() but takes userspace addresses. Note that this requires that the memory is currently mapped at those userspace addresses. If that doesn't fit your needs, there isn't an interface to do what you require, and it basically means creating something brand new on every architecture. What you are asking for is not "just a matter of a few instructions". I have stated the required steps to achieve what you require above; that is the minimum when you have non-snooping harvard caches, which the majority of 32-bit ARMs have. > User thread 1: > > write to RWX memory *or* write to an RW alias of an X region. > membarrier(...); > somehow tell thread 2 that we're ready (with a store release, perhaps, > or even just a relaxed store.) > > User thread 2: > > wait for the indication from thread 1. > barrier(); > jump to the code. > > membarrier() is, for better or for worse, not given a range of addresses. Then, I'm sorry, it can't work on 32-bit ARM.
On Mon, Dec 28, 2020 at 9:23 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Mon, Dec 28, 2020 at 09:14:23AM -0800, Andy Lutomirski wrote: > > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > > > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > > > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > > > > > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > > > > and I'm suspicious that it has never been very well tested. My apologies > > > > > > for not reviewing this more carefully in the first place. > > > > > > > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > > > > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > > > > sync core feature as of now: > > > > > > > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > > > > incantation to make the CPU notice instruction changes initiated by > > > > other cores on 32-bit ARM? > > > > > > You need to call flush_icache_range(), since the changes need to be > > > flushed from the data cache to the point of unification (of the Harvard > > > I and D), and the instruction cache needs to be invalidated so it can > > > then see those updated instructions. This will also take care of the > > > necessary barriers that the CPU requires for you. > > > > With what parameters? From looking at the header, this is for the > > case in which the kernel writes some memory and then intends to > > execute it. That's not what membarrier() does at all. membarrier() > > works like this: > > You didn't specify that you weren't looking at kernel memory. > > If you're talking about userspace, then the interface you require > is flush_icache_user_range(), which does the same as > flush_icache_range() but takes userspace addresses. Note that this > requires that the memory is currently mapped at those userspace > addresses. > > If that doesn't fit your needs, there isn't an interface to do what > you require, and it basically means creating something brand new > on every architecture. > > What you are asking for is not "just a matter of a few instructions". > I have stated the required steps to achieve what you require above; > that is the minimum when you have non-snooping harvard caches, which > the majority of 32-bit ARMs have. > > > User thread 1: > > > > write to RWX memory *or* write to an RW alias of an X region. > > membarrier(...); > > somehow tell thread 2 that we're ready (with a store release, perhaps, > > or even just a relaxed store.) > > > > User thread 2: > > > > wait for the indication from thread 1. > > barrier(); > > jump to the code. > > > > membarrier() is, for better or for worse, not given a range of addresses. > > Then, I'm sorry, it can't work on 32-bit ARM. Is there a way to flush the *entire* user icache? If so, and if it has reasonable performance, then it could probably be used here. Otherwise I'll just send a revert for this whole mechanism on 32-bit ARM. --Andy
On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote: > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > > > and I'm suspicious that it has never been very well tested. My apologies > > > > > for not reviewing this more carefully in the first place. > > > > > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > > > sync core feature as of now: > > > > > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > > > incantation to make the CPU notice instruction changes initiated by > > > other cores on 32-bit ARM? > > > > You need to call flush_icache_range(), since the changes need to be > > flushed from the data cache to the point of unification (of the Harvard > > I and D), and the instruction cache needs to be invalidated so it can > > then see those updated instructions. This will also take care of the > > necessary barriers that the CPU requires for you. > > With what parameters? From looking at the header, this is for the > case in which the kernel writes some memory and then intends to > execute it. That's not what membarrier() does at all. membarrier() > works like this: > > User thread 1: > > write to RWX memory *or* write to an RW alias of an X region. > membarrier(...); > somehow tell thread 2 that we're ready (with a store release, perhaps, > or even just a relaxed store.) > > User thread 2: > > wait for the indication from thread 1. > barrier(); > jump to the code. > > membarrier() is, for better or for worse, not given a range of addresses. > > On x86, the documentation is a bit weak, but a strict reading > indicates that thread 2 must execute a serializing instruction at some > point after thread 1 writes the code and before thread 2 runs it. > membarrier() does this by sending an IPI and ensuring that a > "serializing" instruction (thanks for great terminology, Intel) is > executed. Note that flush_icache_range() is a no-op on x86, and I've > asked Intel's architects to please clarify their precise rules. No > response yet. > > On arm64, flush_icache_range() seems to send an IPI, and that's not > what I want. membarrier() already does an IPI. After chatting with rmk about this (but without claiming that any of this is his opinion), based on the manpage, I think membarrier() currently doesn't really claim to be synchronizing caches? It just serializes cores. So arguably if userspace wants to use membarrier() to synchronize code changes, userspace should first do the code change, then flush icache as appropriate for the architecture, and then do the membarrier() to ensure that the old code is unused? For 32-bit arm, rmk pointed out that that would be the cacheflush() syscall. That might cause you to end up with two IPIs instead of one in total, but we probably don't care _that_ much about extra IPIs on 32-bit arm? For arm64, I believe userspace can flush icache across the entire system with some instructions from userspace - "DC CVAU" followed by "DSB ISH", or something like that, I think? 
(See e.g. compat_arm_syscall(), the arm64 compat code that implements the 32-bit arm cacheflush() syscall.)
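In code, the ordering Jann suggests would look something like the sketch below: do the architecture-specific icache maintenance for the modified range first, then issue the membarrier(). Here `__builtin___clear_cache()` is used as a portable stand-in for that per-architecture step (on 32-bit ARM it ends up in the cacheflush() syscall, on arm64 it expands to the DC/IC maintenance sequence, and on x86 it is effectively a no-op); the function name is made up and the process is assumed to have already registered for the sync-core command.

```c
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Editorial sketch: dst must be mapped writable and executable, or the write
 * could equivalently go through an RW alias of the X mapping. */
void publish_code(void *dst, const void *insns, size_t len)
{
        memcpy(dst, insns, len);                                  /* 1. write the new code      */
        __builtin___clear_cache((char *)dst, (char *)dst + len); /* 2. flush D, invalidate I   */
        syscall(__NR_membarrier,                                  /* 3. ensure no sibling thread */
                MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);   /*    keeps running old code   */
}
```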
On Mon, Dec 28, 2020 at 10:30 AM Jann Horn <jannh@google.com> wrote: > > On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > > > > > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote: > > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers > > > > <mathieu.desnoyers@efficios.com> wrote: > > > > > > > > > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski luto@kernel.org wrote: > > > > > > > > > > > > > > > > > > > > > I admit that I'm rather surprised that the code worked at all on arm64, > > > > > > and I'm suspicious that it has never been very well tested. My apologies > > > > > > for not reviewing this more carefully in the first place. > > > > > > > > > > Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt > > > > > > > > > > It clearly states that only arm, arm64, powerpc and x86 support the membarrier > > > > > sync core feature as of now: > > > > > > > > Sigh, I missed arm (32). Russell or ARM folks, what's the right > > > > incantation to make the CPU notice instruction changes initiated by > > > > other cores on 32-bit ARM? > > > > > > You need to call flush_icache_range(), since the changes need to be > > > flushed from the data cache to the point of unification (of the Harvard > > > I and D), and the instruction cache needs to be invalidated so it can > > > then see those updated instructions. This will also take care of the > > > necessary barriers that the CPU requires for you. > > > > With what parameters? From looking at the header, this is for the > > case in which the kernel writes some memory and then intends to > > execute it. That's not what membarrier() does at all. membarrier() > > works like this: > > > > User thread 1: > > > > write to RWX memory *or* write to an RW alias of an X region. > > membarrier(...); > > somehow tell thread 2 that we're ready (with a store release, perhaps, > > or even just a relaxed store.) > > > > User thread 2: > > > > wait for the indication from thread 1. > > barrier(); > > jump to the code. > > > > membarrier() is, for better or for worse, not given a range of addresses. > > > > On x86, the documentation is a bit weak, but a strict reading > > indicates that thread 2 must execute a serializing instruction at some > > point after thread 1 writes the code and before thread 2 runs it. > > membarrier() does this by sending an IPI and ensuring that a > > "serializing" instruction (thanks for great terminology, Intel) is > > executed. Note that flush_icache_range() is a no-op on x86, and I've > > asked Intel's architects to please clarify their precise rules. No > > response yet. > > > > On arm64, flush_icache_range() seems to send an IPI, and that's not > > what I want. membarrier() already does an IPI. > > After chatting with rmk about this (but without claiming that any of > this is his opinion), based on the manpage, I think membarrier() > currently doesn't really claim to be synchronizing caches? It just > serializes cores. So arguably if userspace wants to use membarrier() > to synchronize code changes, userspace should first do the code > change, then flush icache as appropriate for the architecture, and > then do the membarrier() to ensure that the old code is unused? I haven't the faintest clue what "serializes cores" means. It seems to be a bit of a mishmash of x86 SDM terminology and Linux x86 "sync_core" terminology. 
The latter means very little to me, even as an x86 person. I'm moderately confident that the *intent* is that a multithreaded program can write some JIT code to memory, do this membarrier() operation, and then execute the code, and it will work. Maybe it's even intended to work cross-architecture without any additional help from the program. But maybe the existing API is simply incorrect for this. > > For 32-bit arm, rmk pointed out that that would be the cacheflush() > syscall. That might cause you to end up with two IPIs instead of one > in total, but we probably don't care _that_ much about extra IPIs on > 32-bit arm? > > For arm64, I believe userspace can flush icache across the entire > system with some instructions from userspace - "DC CVAU" followed by > "DSB ISH", or something like that, I think? (See e.g. > compat_arm_syscall(), the arm64 compat code that implements the 32-bit > arm cacheflush() syscall.) I have no idea what DC anything does. Based on my very cursory reading of the manual, ISB is the right approach, but I don't pretend I understand all the details.
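For reference, the arm64 user-space maintenance sequence being alluded to here (and cited later in the thread from the ARM ARM) looks roughly like the sketch below: clean the D-cache to the point of unification, invalidate the I-cache, with barriers in between. This is an editorial illustration only; real code reads the cache line sizes from CTR_EL0 rather than hard-coding 64 bytes, and compilers typically emit an equivalent sequence for `__builtin___clear_cache()` on arm64.

```c
/* Editorial arm64 sketch: make stores to [start, end) visible to instruction
 * fetch. Assumes a 64-byte cache line purely for illustration. */
static inline void arm64_sync_icache(char *start, char *end)
{
        for (char *p = start; p < end; p += 64)
                asm volatile("dc cvau, %0" :: "r"(p) : "memory"); /* clean D-cache to PoU */
        asm volatile("dsb ish" ::: "memory");  /* complete the cleans before invalidating */
        for (char *p = start; p < end; p += 64)
                asm volatile("ic ivau, %0" :: "r"(p) : "memory"); /* invalidate I-cache to PoU */
        asm volatile("dsb ish" ::: "memory");  /* complete the I-cache invalidation */
        asm volatile("isb" ::: "memory");      /* resynchronize this CPU's instruction fetch */
}
```

Even with this sequence, a CPU that later executes the modified code still needs its own context synchronization event (an ISB, an exception return, and so on) before it is guaranteed to fetch the new instructions; that per-CPU step is the part membarrier(SYNC_CORE) is meant to supply.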
On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > After chatting with rmk about this (but without claiming that any of > this is his opinion), based on the manpage, I think membarrier() > currently doesn't really claim to be synchronizing caches? It just > serializes cores. So arguably if userspace wants to use membarrier() > to synchronize code changes, userspace should first do the code > change, then flush icache as appropriate for the architecture, and > then do the membarrier() to ensure that the old code is unused? > > For 32-bit arm, rmk pointed out that that would be the cacheflush() > syscall. That might cause you to end up with two IPIs instead of one > in total, but we probably don't care _that_ much about extra IPIs on > 32-bit arm? > > For arm64, I believe userspace can flush icache across the entire > system with some instructions from userspace - "DC CVAU" followed by > "DSB ISH", or something like that, I think? (See e.g. > compat_arm_syscall(), the arm64 compat code that implements the 32-bit > arm cacheflush() syscall.) Note that the ARM cacheflush syscall calls flush_icache_user_range() over the range of addresses that userspace has passed - it's intention since day one is to support cases where userspace wants to change executable code. It will issue the appropriate write-backs to the data cache (DCCMVAU), the invalidates to the instruction cache (ICIMVAU), invalidate the branch target buffer (BPIALLIS or BPIALL as appropriate), and issue the appropriate barriers (DSB ISHST, ISB). Note that neither flush_icache_user_range() nor flush_icache_range() result in IPIs; cache operations are broadcast across all CPUs (which is one of the minimums we require for SMP systems.) Now, that all said, I think the question that has to be asked is... What is the basic purpose of membarrier? Is the purpose of it to provide memory barriers, or is it to provide memory coherence? If it's the former and not the latter, then cache flushes are out of scope, and expecting memory written to be visible to the instruction stream is totally out of scope of the membarrier interface, whether or not the writes happen on the same or a different CPU to the one executing the rewritten code. The documentation in the kernel does not seem to describe what it's supposed to be doing - the only thing I could find is this: Documentation/features/sched/membarrier-sync-core/arch-support.txt which describes it as "arch supports core serializing membarrier" whatever that means. Seems to be the standard and usual case of utterly poor to non-existent documentation within the kernel tree, or even a pointer to where any useful documentation can be found. Reading the membarrier(2) man page, I find nothing in there that talks about any kind of cache coherency for self-modifying code - it only seems to be about _barriers_ and nothing more, and barriers alone do precisely nothing to save you from non-coherent Harvard caches. So, either Andy has a misunderstanding, or the man page is wrong, or my rudimentary understanding of what membarrier is supposed to be doing is wrong...
On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin <linux@armlinux.org.uk> wrote: > > On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > > After chatting with rmk about this (but without claiming that any of > > this is his opinion), based on the manpage, I think membarrier() > > currently doesn't really claim to be synchronizing caches? It just > > serializes cores. So arguably if userspace wants to use membarrier() > > to synchronize code changes, userspace should first do the code > > change, then flush icache as appropriate for the architecture, and > > then do the membarrier() to ensure that the old code is unused? > > > > For 32-bit arm, rmk pointed out that that would be the cacheflush() > > syscall. That might cause you to end up with two IPIs instead of one > > in total, but we probably don't care _that_ much about extra IPIs on > > 32-bit arm? > > > > For arm64, I believe userspace can flush icache across the entire > > system with some instructions from userspace - "DC CVAU" followed by > > "DSB ISH", or something like that, I think? (See e.g. > > compat_arm_syscall(), the arm64 compat code that implements the 32-bit > > arm cacheflush() syscall.) > > Note that the ARM cacheflush syscall calls flush_icache_user_range() > over the range of addresses that userspace has passed - it's intention > since day one is to support cases where userspace wants to change > executable code. > > It will issue the appropriate write-backs to the data cache (DCCMVAU), > the invalidates to the instruction cache (ICIMVAU), invalidate the > branch target buffer (BPIALLIS or BPIALL as appropriate), and issue > the appropriate barriers (DSB ISHST, ISB). > > Note that neither flush_icache_user_range() nor flush_icache_range() > result in IPIs; cache operations are broadcast across all CPUs (which > is one of the minimums we require for SMP systems.) > > Now, that all said, I think the question that has to be asked is... > > What is the basic purpose of membarrier? > > Is the purpose of it to provide memory barriers, or is it to provide > memory coherence? > > If it's the former and not the latter, then cache flushes are out of > scope, and expecting memory written to be visible to the instruction > stream is totally out of scope of the membarrier interface, whether > or not the writes happen on the same or a different CPU to the one > executing the rewritten code. > > The documentation in the kernel does not seem to describe what it's > supposed to be doing - the only thing I could find is this: > Documentation/features/sched/membarrier-sync-core/arch-support.txt > which describes it as "arch supports core serializing membarrier" > whatever that means. > > Seems to be the standard and usual case of utterly poor to non-existent > documentation within the kernel tree, or even a pointer to where any > useful documentation can be found. > > Reading the membarrier(2) man page, I find nothing in there that talks > about any kind of cache coherency for self-modifying code - it only > seems to be about _barriers_ and nothing more, and barriers alone do > precisely nothing to save you from non-coherent Harvard caches. > > So, either Andy has a misunderstanding, or the man page is wrong, or > my rudimentary understanding of what membarrier is supposed to be > doing is wrong... Look at the latest man page: https://man7.org/linux/man-pages/man2/membarrier.2.html for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be all that enlightening. --Andy
On Mon, Dec 28, 2020 at 11:44:33AM -0800, Andy Lutomirski wrote: > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > <linux@armlinux.org.uk> wrote: > > > > On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > > > After chatting with rmk about this (but without claiming that any of > > > this is his opinion), based on the manpage, I think membarrier() > > > currently doesn't really claim to be synchronizing caches? It just > > > serializes cores. So arguably if userspace wants to use membarrier() > > > to synchronize code changes, userspace should first do the code > > > change, then flush icache as appropriate for the architecture, and > > > then do the membarrier() to ensure that the old code is unused? > > > > > > For 32-bit arm, rmk pointed out that that would be the cacheflush() > > > syscall. That might cause you to end up with two IPIs instead of one > > > in total, but we probably don't care _that_ much about extra IPIs on > > > 32-bit arm? > > > > > > For arm64, I believe userspace can flush icache across the entire > > > system with some instructions from userspace - "DC CVAU" followed by > > > "DSB ISH", or something like that, I think? (See e.g. > > > compat_arm_syscall(), the arm64 compat code that implements the 32-bit > > > arm cacheflush() syscall.) > > > > Note that the ARM cacheflush syscall calls flush_icache_user_range() > > over the range of addresses that userspace has passed - it's intention > > since day one is to support cases where userspace wants to change > > executable code. > > > > It will issue the appropriate write-backs to the data cache (DCCMVAU), > > the invalidates to the instruction cache (ICIMVAU), invalidate the > > branch target buffer (BPIALLIS or BPIALL as appropriate), and issue > > the appropriate barriers (DSB ISHST, ISB). > > > > Note that neither flush_icache_user_range() nor flush_icache_range() > > result in IPIs; cache operations are broadcast across all CPUs (which > > is one of the minimums we require for SMP systems.) > > > > Now, that all said, I think the question that has to be asked is... > > > > What is the basic purpose of membarrier? > > > > Is the purpose of it to provide memory barriers, or is it to provide > > memory coherence? > > > > If it's the former and not the latter, then cache flushes are out of > > scope, and expecting memory written to be visible to the instruction > > stream is totally out of scope of the membarrier interface, whether > > or not the writes happen on the same or a different CPU to the one > > executing the rewritten code. > > > > The documentation in the kernel does not seem to describe what it's > > supposed to be doing - the only thing I could find is this: > > Documentation/features/sched/membarrier-sync-core/arch-support.txt > > which describes it as "arch supports core serializing membarrier" > > whatever that means. > > > > Seems to be the standard and usual case of utterly poor to non-existent > > documentation within the kernel tree, or even a pointer to where any > > useful documentation can be found. > > > > Reading the membarrier(2) man page, I find nothing in there that talks > > about any kind of cache coherency for self-modifying code - it only > > seems to be about _barriers_ and nothing more, and barriers alone do > > precisely nothing to save you from non-coherent Harvard caches. > > > > So, either Andy has a misunderstanding, or the man page is wrong, or > > my rudimentary understanding of what membarrier is supposed to be > > doing is wrong... 
> > Look at the latest man page: > > https://man7.org/linux/man-pages/man2/membarrier.2.html > > for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be > all that enlightening. MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE (since Linux 4.16) In addition to providing the memory ordering guarantees described in MEMBARRIER_CMD_PRIVATE_EXPEDITED, upon return from system call the calling thread has a guarantee that all its running thread siblings have executed a core serializing instruction. This guarantee is provided only for threads in the same process as the calling thread. The "expedited" commands complete faster than the non-expedited ones, they never block, but have the downside of causing extra overhead. A process must register its intent to use the private expedited sync core command prior to using it. This just says that the siblings have executed a serialising instruction, in other words a barrier. It makes no claims concerning cache coherency - and without some form of cache maintenance, there can be no expectation that the I and D streams will be coherent with each other. This description is also weird in another respect. "guarantee that all its running thread siblings have executed a core serializing instruction" ... "The expedited commands ... never block". So, the core executing this call is not allowed to block, but the other part indicates that the other CPUs _have_ executed a serialising instruction before this call returns... one wonders how that happens without blocking. Maybe the CPU spins waiting for completion instead?
On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: > > > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > > <linux@armlinux.org.uk> wrote: > >> > >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > >> > After chatting with rmk about this (but without claiming that any of > >> > this is his opinion), based on the manpage, I think membarrier() > >> > currently doesn't really claim to be synchronizing caches? It just > >> > serializes cores. So arguably if userspace wants to use membarrier() > >> > to synchronize code changes, userspace should first do the code > >> > change, then flush icache as appropriate for the architecture, and > >> > then do the membarrier() to ensure that the old code is unused? > > ^ exactly, yes. > > >> > > >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() > >> > syscall. That might cause you to end up with two IPIs instead of one > >> > in total, but we probably don't care _that_ much about extra IPIs on > >> > 32-bit arm? > > This was the original thinking, yes. The cacheflush IPI will flush specific > regions of code, and the membarrier IPI issues context synchronizing > instructions. > > Architectures with coherent i/d caches don't need the cacheflush step. There are different levels of coherency -- VIVT architectures may have differing requirements compared to PIPT, etc. In any case, I feel like the approach taken by the documentation is fundamentally confusing. Architectures don't all speak the same language How about something like: The SYNC_CORE operation causes all threads in the caller's address space (including the caller) to execute an architecture-defined barrier operation. membarrier() will ensure that this barrier is executed at a time such that all data writes done by the calling thread before membarrier() are made visible by the barrier. Additional architecture-dependent cache management operations may be required to use this for JIT code. x86: SYNC_CORE executes a barrier that will cause subsequent instruction fetches to observe prior writes. Currently this will be a "serializing" instruction, but, if future improved CPU documentation becomes available and relaxes this requirement, the barrier may change. The kernel guarantees that writing new or modified instructions to normal memory (and issuing SFENCE if the writes were non-temporal) then doing a membarrier SYNC_CORE operation is sufficient to cause all threads in the caller's address space to execute the new or modified instructions. This is true regardless of whether or not those instructions are written at the same virtual address from which they are subsequently executed. No additional cache management is required on x86. arm: Something about the cache management syscalls. arm64: Ditto powerpc: I have no idea.
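To make the non-temporal caveat in the proposed x86 wording concrete: if the code bytes are written with non-temporal stores, an SFENCE belongs before the membarrier() call, while ordinary stores need no extra step on x86. The helper below is an editorial sketch using the SSE2 intrinsics, not code from the patch, and it assumes the process has already registered for the sync-core command.

```c
#include <immintrin.h>          /* _mm_stream_si64(), _mm_sfence() */
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stddef.h>

/* Copy n 64-bit chunks of code with non-temporal stores, then publish. */
void publish_code_nontemporal(long long *dst, const long long *insns, size_t n)
{
        for (size_t i = 0; i < n; i++)
                _mm_stream_si64(&dst[i], insns[i]);  /* non-temporal (write-combining) stores */
        _mm_sfence();                                /* drain the WC buffers before publishing */
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}
```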
Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am: > The old sync_core_before_usermode() comments said that a non-icache-syncing > return-to-usermode instruction is x86-specific and that all other > architectures automatically notice cross-modified code on return to > userspace. Based on my general understanding of how CPUs work and based on > my atttempt to read the ARM manual, this is not true at all. In fact, x86 > seems to be a bit of an anomaly in the other direction: x86's IRET is > unusually heavyweight for a return-to-usermode instruction. "sync_core_before_usermode" as I've said says nothing to arch, or to the scheduler, or to membarrier. It's badly named to start with so if renaming it it should be something else. exit_lazy_tlb() at least says something quite precise to scheudler and arch code that implements the membarrier. But I don't mind the idea of just making it x86 specific if as you say the arch code can detect lazy mm switches more precisely than generic and you want to do that. > So let's drop any pretense that we can have a generic way implementation > behind membarrier's SYNC_CORE flush and require all architectures that opt > in to supply their own. This means x86, arm64, and powerpc for now. Let's > also rename the function from sync_core_before_usermode() to > membarrier_sync_core_before_usermode() because the precise flushing details > may very well be specific to membarrier, and even the concept of > "sync_core" in the kernel is mostly an x86-ism. The concept of "sync_core" (x86: serializing instruction, powerpc: context synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted to add a serializing instruction to generic code so it grew this nasty API, but the concept applies broadly. > > I admit that I'm rather surprised that the code worked at all on arm64, > and I'm suspicious that it has never been very well tested. My apologies > for not reviewing this more carefully in the first place. > > Cc: Michael Ellerman <mpe@ellerman.id.au> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Cc: Paul Mackerras <paulus@samba.org> > Cc: linuxppc-dev@lists.ozlabs.org > Cc: Nicholas Piggin <npiggin@gmail.com> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Will Deacon <will@kernel.org> > Cc: linux-arm-kernel@lists.infradead.org > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: x86@kernel.org > Cc: stable@vger.kernel.org > Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") > Signed-off-by: Andy Lutomirski <luto@kernel.org> > --- > > Hi arm64 and powerpc people- > > This is part of a series here: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes > > Before I send out the whole series, I'm hoping that some arm64 and powerpc > people can help me verify that I did this patch right. Once I get > some feedback on this patch, I'll send out the whole pile. And once > *that's* done, I'll start giving the mm lazy stuff some serious thought. > > The x86 part is already fixed in Linus' tree. 
> > Thanks, > Andy > > arch/arm64/include/asm/sync_core.h | 21 +++++++++++++++++++++ > arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++ > arch/x86/Kconfig | 1 - > arch/x86/include/asm/sync_core.h | 7 +++---- > include/linux/sched/mm.h | 1 - > include/linux/sync_core.h | 21 --------------------- > init/Kconfig | 3 --- > kernel/sched/membarrier.c | 15 +++++++++++---- > 8 files changed, 55 insertions(+), 34 deletions(-) > create mode 100644 arch/arm64/include/asm/sync_core.h > create mode 100644 arch/powerpc/include/asm/sync_core.h > delete mode 100644 include/linux/sync_core.h > > diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h > new file mode 100644 > index 000000000000..5be4531caabd > --- /dev/null > +++ b/arch/arm64/include/asm/sync_core.h > @@ -0,0 +1,21 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_ARM64_SYNC_CORE_H > +#define _ASM_ARM64_SYNC_CORE_H > + > +#include <asm/barrier.h> > + > +/* > + * Ensure that the CPU notices any instruction changes before the next time > + * it returns to usermode. > + */ > +static inline void membarrier_sync_core_before_usermode(void) > +{ > + /* > + * XXX: is this enough or do we need a DMB first to make sure that > + * writes from other CPUs become visible to this CPU? We have an > + * smp_mb() already, but that's not quite the same thing. > + */ > + isb(); > +} > + > +#endif /* _ASM_ARM64_SYNC_CORE_H */ > diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h > new file mode 100644 > index 000000000000..71dfbe7794e5 > --- /dev/null > +++ b/arch/powerpc/include/asm/sync_core.h > @@ -0,0 +1,20 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_POWERPC_SYNC_CORE_H > +#define _ASM_POWERPC_SYNC_CORE_H > + > +#include <asm/barrier.h> > + > +/* > + * Ensure that the CPU notices any instruction changes before the next time > + * it returns to usermode. > + */ > +static inline void membarrier_sync_core_before_usermode(void) > +{ > + /* > + * XXX: I know basically nothing about powerpc cache management. > + * Is this correct? > + */ > + isync(); This is not about memory ordering or cache management, it's about pipeline management. Powerpc's return to user mode serializes the CPU (aka the hardware thread, _not_ the core; another wrongness of the name, but AFAIKS the HW thread is what is required for membarrier). So this is wrong, powerpc needs nothing here. Thanks, Nick
On Mon, Dec 28, 2020 at 1:09 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ----- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski luto@kernel.org wrote: > > [...] > > >> You seem to have noticed odd cases on arm64 where this guarantee does not > >> match reality. Where exactly can we find this in the code, and which part > >> of the architecture manual can you point us to which supports your concern ? > >> > >> Based on the notes I have, use of `eret` on aarch64 guarantees a context > >> synchronizing > >> instruction when returning to user-space. > > > > Based on my reading of the manual, ERET on ARM doesn't synchronize > > anything at all. I can't find any evidence that it synchronizes data > > or instructions, and I've seen reports that the CPU will happily > > speculate right past it. > > Reading [1] there appears to be 3 kind of context synchronization events: > > - Taking an exception, > - Returning from an exception, > - ISB. My reading of [1] is that all three of these are "context synchronization event[s]", but that only ISB flushes the pipeline, etc. The little description of context synchronization seems to suggest that it only implies that certain register changes become effective. > > This other source [2] adds (search for Context synchronization operation): > > - Exit from Debug state > - Executing a DCPS instruction > - Executing a DRPS instruction > > "ERET" falls into the second kind of events, and AFAIU should be context > synchronizing. That was confirmed to me by Will Deacon when membarrier > sync-core was implemented for aarch64. If the architecture reference manuals > are wrong, is there an errata ? > > As for the algorithm to use on ARMv8 to update instructions, see [2] > B2.3.4 Implication of caches for the application programmer > "Synchronization and coherency issues between data and instruction accesses" This specifically discusses ISB. Let's wait for an actual ARM64 expert to chime in, though.
On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am: > > The old sync_core_before_usermode() comments said that a non-icache-syncing > > return-to-usermode instruction is x86-specific and that all other > > architectures automatically notice cross-modified code on return to > > userspace. Based on my general understanding of how CPUs work and based on > > my atttempt to read the ARM manual, this is not true at all. In fact, x86 > > seems to be a bit of an anomaly in the other direction: x86's IRET is > > unusually heavyweight for a return-to-usermode instruction. > > "sync_core_before_usermode" as I've said says nothing to arch, or to the > scheduler, or to membarrier. Agreed. My patch tries to fix this. I agree that the name is bad and could be improved further. We should define what membarrier(...SYNC_CORE) actually does and have arch hooks to make it happen. > > So let's drop any pretense that we can have a generic way implementation > > behind membarrier's SYNC_CORE flush and require all architectures that opt > > in to supply their own. This means x86, arm64, and powerpc for now. Let's > > also rename the function from sync_core_before_usermode() to > > membarrier_sync_core_before_usermode() because the precise flushing details > > may very well be specific to membarrier, and even the concept of > > "sync_core" in the kernel is mostly an x86-ism. > > The concept of "sync_core" (x86: serializing instruction, powerpc: context > synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted > to add a serializing instruction to generic code so it grew this nasty API, > but the concept applies broadly. I mean that the mapping from the name "sync_core" to its semantics is x86 only. The string "sync_core" appears in the kernel only in arch/x86, membarrier code, membarrier docs, and a single SGI driver that is x86-only. Sure, the idea of serializing things is fairly generic, but exactly what operations serialize what, when things need serialization, etc is quite architecture specific. Heck, on 486 you serialize the instruction stream with JMP. > > +static inline void membarrier_sync_core_before_usermode(void) > > +{ > > + /* > > + * XXX: I know basically nothing about powerpc cache management. > > + * Is this correct? > > + */ > > + isync(); > > This is not about memory ordering or cache management, it's about > pipeline management. Powerpc's return to user mode serializes the > CPU (aka the hardware thread, _not_ the core; another wrongness of > the name, but AFAIKS the HW thread is what is required for > membarrier). So this is wrong, powerpc needs nothing here. Fair enough. I'm happy to defer to you on the powerpc details. In any case, this just illustrates that we need feedback from a person who knows more about ARM64 than I do.
Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: >> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin >> > <linux@armlinux.org.uk> wrote: >> >> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: >> >> > After chatting with rmk about this (but without claiming that any of >> >> > this is his opinion), based on the manpage, I think membarrier() >> >> > currently doesn't really claim to be synchronizing caches? It just >> >> > serializes cores. So arguably if userspace wants to use membarrier() >> >> > to synchronize code changes, userspace should first do the code >> >> > change, then flush icache as appropriate for the architecture, and >> >> > then do the membarrier() to ensure that the old code is unused? >> >> ^ exactly, yes. >> >> >> > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() >> >> > syscall. That might cause you to end up with two IPIs instead of one >> >> > in total, but we probably don't care _that_ much about extra IPIs on >> >> > 32-bit arm? >> >> This was the original thinking, yes. The cacheflush IPI will flush specific >> regions of code, and the membarrier IPI issues context synchronizing >> instructions. APIs should be written in terms of the service they provide to userspace, and in highest level terms as possible, rather than directing hardware to do some low level operation. Unfortunately we're stuck with this for now. We could deprecate it and replace it though. If userspace wants to modify code and ensure that after the system call returns then no other thread will be executing the previous code, then there should be an API for that. It could actually combine the two IPIs into one for architectures that require both too. >> >> Architectures with coherent i/d caches don't need the cacheflush step. > > There are different levels of coherency -- VIVT architectures may have > differing requirements compared to PIPT, etc. > > In any case, I feel like the approach taken by the documentation is > fundamentally confusing. Architectures don't all speak the same > language How about something like: > > The SYNC_CORE operation causes all threads in the caller's address > space (including the caller) to execute an architecture-defined > barrier operation. membarrier() will ensure that this barrier is > executed at a time such that all data writes done by the calling > thread before membarrier() are made visible by the barrier. > Additional architecture-dependent cache management operations may be > required to use this for JIT code. As said this isn't what SYNC_CORE does, and it's not what powerpc context synchronizing instructions do either, it will very much re-order visibility of stores around such an instruction. A thread completes store instructions into a store queue, which is as far as a context synchronizing event goes. Visibility comes at some indeterminite time later. Also note that the regular membarrier syscall which does order memory also does not make writes visible, it just ensures an ordering. > > x86: SYNC_CORE executes a barrier that will cause subsequent > instruction fetches to observe prior writes. Currently this will be a I would be surprised if x86's serializing instructions were different than powerpc. rdtsc ordering or flushing stores to cache would be surprising. 
Thanks, Nick > "serializing" instruction, but, if future improved CPU documentation > becomes available and relaxes this requirement, the barrier may > change. The kernel guarantees that writing new or modified > instructions to normal memory (and issuing SFENCE if the writes were > non-temporal) then doing a membarrier SYNC_CORE operation is > sufficient to cause all threads in the caller's address space to > execute the new or modified instructions. This is true regardless of > whether or not those instructions are written at the same virtual > address from which they are subsequently executed. No additional > cache management is required on x86. > > arm: Something about the cache management syscalls. > > arm64: Ditto > > powerpc: I have no idea. >
On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: > > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: > >> > >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin > >> > <linux@armlinux.org.uk> wrote: > >> >> > >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: > >> >> > After chatting with rmk about this (but without claiming that any of > >> >> > this is his opinion), based on the manpage, I think membarrier() > >> >> > currently doesn't really claim to be synchronizing caches? It just > >> >> > serializes cores. So arguably if userspace wants to use membarrier() > >> >> > to synchronize code changes, userspace should first do the code > >> >> > change, then flush icache as appropriate for the architecture, and > >> >> > then do the membarrier() to ensure that the old code is unused? > >> > >> ^ exactly, yes. > >> > >> >> > > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() > >> >> > syscall. That might cause you to end up with two IPIs instead of one > >> >> > in total, but we probably don't care _that_ much about extra IPIs on > >> >> > 32-bit arm? > >> > >> This was the original thinking, yes. The cacheflush IPI will flush specific > >> regions of code, and the membarrier IPI issues context synchronizing > >> instructions. > > APIs should be written in terms of the service they provide to > userspace, and in highest level terms as possible, rather than directing > hardware to do some low level operation. Unfortunately we're stuck with > this for now. We could deprecate it and replace it though. > > If userspace wants to modify code and ensure that after the system call > returns then no other thread will be executing the previous code, then > there should be an API for that. It could actually combine the two IPIs > into one for architectures that require both too. I agree. The membarrier API for SYNC_CORE is pretty nasty. I would much prefer a real API for JITs to use. > > >> > >> Architectures with coherent i/d caches don't need the cacheflush step. > > > > There are different levels of coherency -- VIVT architectures may have > > differing requirements compared to PIPT, etc. > > > > In any case, I feel like the approach taken by the documentation is > > fundamentally confusing. Architectures don't all speak the same > > language How about something like: > > > > The SYNC_CORE operation causes all threads in the caller's address > > space (including the caller) to execute an architecture-defined > > barrier operation. membarrier() will ensure that this barrier is > > executed at a time such that all data writes done by the calling > > thread before membarrier() are made visible by the barrier. > > Additional architecture-dependent cache management operations may be > > required to use this for JIT code. > > As said this isn't what SYNC_CORE does, and it's not what powerpc > context synchronizing instructions do either, it will very much > re-order visibility of stores around such an instruction. Perhaps the docs should be entirely arch-specific. It may well be impossible to state what it does in an arch-neutral way. > > A thread completes store instructions into a store queue, which is > as far as a context synchronizing event goes. Visibility comes at > some indeterminite time later. 
As currently implemented, it has the same visibility semantics as regular membarrier, too. So if I do: a = 1; membarrier(SYNC_CORE); b = 1; and another thread does: while (READ_ONCE(b) != 1) ; barrier(); assert(a == 1); then the assertion will pass. Similarly, one can do this, I hope: memcpy(codeptr, [some new instructions], len); arch_dependent_cache_flush(codeptr, len); membarrier(SYNC_CORE); ready = 1; and another thread does: while (READ_ONCE(ready) != 1) ; barrier(); (*codeptr)(); arch_dependent_cache_flush is a nop on x86. On arm and arm64, it appears to be a syscall, although maybe arm64 can do it from userspace. I still don't know what it is on powerpc. Even using the term "cache" here is misleading. x86 chips have all kinds of barely-documented instruction caches, and they have varying degrees of coherency. The architecture actually promises that, if you do a certain incantation, then you get the desired result. membarrier() allows a user to do this incantation. But trying to replicate the incantation verbatim on an architecture like ARM is insufficient, and trying to flush the things that are documented as being caches on x86 is expensive and a complete waste of time on x86. When you write some JIT code, you do *not* want to flush it all the way to main memory, especially on CPUs don't have the ability to write back invalidating. (That's most CPUs.) Even on x86, I suspect that the various decoded insn caches are rather more coherent than documented, and I have questions in to Intel about this. No answers yet. So perhaps the right approach is to say that membarrier() helps you perform the architecture-specific sequence of steps needed to safely modify code. On x86, you use it like this. On arm64, you do this other thing. On powerpc, you do something else. > > I would be surprised if x86's serializing instructions were different > than powerpc. rdtsc ordering or flushing stores to cache would be > surprising. > At the very least, x86 has several levels of what ARM might call "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of context synchronization in that the changes they cause to the MMU take effect immediately, but they are not documented as synchronizing the instruction stream. "Serializing" instructions do all kinds of things, not all of which may be architecturally visible at all. MFENCE and LFENCE do various complicated things, and LFENCE has magic retroactive capabilities on old CPUs that were not documented when those CPUs were released. SFENCE does a different form of synchronization entirely. LOCK does something else. (The relationship between LOCK and MFENCE is confusing at best.) RDTSC doesn't serialize anything at all, but RDTSCP does provide a form of serialization that's kind of ilke LFENCE. It's a mess. Even the manuals are inconsistent about what "serialize" means. JMP has its own magic on x86, but only on very very old CPUs. The specific instruction that flushes everything into the coherency domain is SFENCE, and SFENCE is not, for normal purposes, needed for self- or cross-modifying code.
Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am: > On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: >> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers >> > <mathieu.desnoyers@efficios.com> wrote: >> >> >> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@kernel.org wrote: >> >> >> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin >> >> > <linux@armlinux.org.uk> wrote: >> >> >> >> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: >> >> >> > After chatting with rmk about this (but without claiming that any of >> >> >> > this is his opinion), based on the manpage, I think membarrier() >> >> >> > currently doesn't really claim to be synchronizing caches? It just >> >> >> > serializes cores. So arguably if userspace wants to use membarrier() >> >> >> > to synchronize code changes, userspace should first do the code >> >> >> > change, then flush icache as appropriate for the architecture, and >> >> >> > then do the membarrier() to ensure that the old code is unused? >> >> >> >> ^ exactly, yes. >> >> >> >> >> > >> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() >> >> >> > syscall. That might cause you to end up with two IPIs instead of one >> >> >> > in total, but we probably don't care _that_ much about extra IPIs on >> >> >> > 32-bit arm? >> >> >> >> This was the original thinking, yes. The cacheflush IPI will flush specific >> >> regions of code, and the membarrier IPI issues context synchronizing >> >> instructions. >> >> APIs should be written in terms of the service they provide to >> userspace, and in highest level terms as possible, rather than directing >> hardware to do some low level operation. Unfortunately we're stuck with >> this for now. We could deprecate it and replace it though. >> >> If userspace wants to modify code and ensure that after the system call >> returns then no other thread will be executing the previous code, then >> there should be an API for that. It could actually combine the two IPIs >> into one for architectures that require both too. > > I agree. The membarrier API for SYNC_CORE is pretty nasty. I would > much prefer a real API for JITs to use. > >> >> >> >> >> Architectures with coherent i/d caches don't need the cacheflush step. >> > >> > There are different levels of coherency -- VIVT architectures may have >> > differing requirements compared to PIPT, etc. >> > >> > In any case, I feel like the approach taken by the documentation is >> > fundamentally confusing. Architectures don't all speak the same >> > language How about something like: >> > >> > The SYNC_CORE operation causes all threads in the caller's address >> > space (including the caller) to execute an architecture-defined >> > barrier operation. membarrier() will ensure that this barrier is >> > executed at a time such that all data writes done by the calling >> > thread before membarrier() are made visible by the barrier. >> > Additional architecture-dependent cache management operations may be >> > required to use this for JIT code. >> >> As said this isn't what SYNC_CORE does, and it's not what powerpc >> context synchronizing instructions do either, it will very much >> re-order visibility of stores around such an instruction. > > Perhaps the docs should be entirely arch-specific. It may well be > impossible to state what it does in an arch-neutral way. 
I think what I wrote above -- after the call returns, none of the threads in the process will be executing instructions overwritten previously by the caller (provided their i-caches are made coherent with the caller's modifications). >> A thread completes store instructions into a store queue, which is >> as far as a context synchronizing event goes. Visibility comes at >> some indeterminite time later. > > As currently implemented, it has the same visibility semantics as > regular membarrier, too. Ah I actually missed that SYNC_CORE is in _addition_ to the memory barriers, and that's documented API too, not just implementation sorry. > So if I do: > > a = 1; > membarrier(SYNC_CORE); > b = 1; > > and another thread does: > > while (READ_ONCE(b) != 1) > ; > barrier(); > assert(a == 1); Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither that barrier or the added SYNC_CORE behaviour imply visibility though. > > then the assertion will pass. Similarly, one can do this, I hope: > > memcpy(codeptr, [some new instructions], len); > arch_dependent_cache_flush(codeptr, len); > membarrier(SYNC_CORE); > ready = 1; > > and another thread does: > > while (READ_ONCE(ready) != 1) > ; > barrier(); > (*codeptr)(); > > arch_dependent_cache_flush is a nop on x86. On arm and arm64, it > appears to be a syscall, although maybe arm64 can do it from > userspace. I still don't know what it is on powerpc. > > Even using the term "cache" here is misleading. x86 chips have all > kinds of barely-documented instruction caches, and they have varying > degrees of coherency. The architecture actually promises that, if you > do a certain incantation, then you get the desired result. > membarrier() allows a user to do this incantation. But trying to > replicate the incantation verbatim on an architecture like ARM is > insufficient, and trying to flush the things that are documented as > being caches on x86 is expensive and a complete waste of time on x86. > When you write some JIT code, you do *not* want to flush it all the > way to main memory, especially on CPUs don't have the ability to write > back invalidating. (That's most CPUs.) > > Even on x86, I suspect that the various decoded insn caches are rather > more coherent than documented, and I have questions in to Intel about > this. No answers yet. > > So perhaps the right approach is to say that membarrier() helps you > perform the architecture-specific sequence of steps needed to safely > modify code. On x86, you use it like this. On arm64, you do this > other thing. On powerpc, you do something else. I think it should certainly be documented in terms of what guarantees it provides to application, _not_ the kinds of instructions it may or may not induce the core to execute. And if existing API can't be re-documented sanely, then deprecatd and new ones added that DTRT. Possibly under a new system call, if arch's like ARM want a range flush and we don't want to expand the multiplexing behaviour of membarrier even more (sigh). > >> >> I would be surprised if x86's serializing instructions were different >> than powerpc. rdtsc ordering or flushing stores to cache would be >> surprising. >> > > At the very least, x86 has several levels of what ARM might call > "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of > context synchronization in that the changes they cause to the MMU take > effect immediately, but they are not documented as synchronizing the > instruction stream. 
"Serializing" instructions do all kinds of > things, not all of which may be architecturally visible at all. > MFENCE and LFENCE do various complicated things, and LFENCE has magic > retroactive capabilities on old CPUs that were not documented when > those CPUs were released. SFENCE does a different form of > synchronization entirely. LOCK does something else. (The > relationship between LOCK and MFENCE is confusing at best.) RDTSC > doesn't serialize anything at all, but RDTSCP does provide a form of > serialization that's kind of ilke LFENCE. It's a mess. Even the > manuals are inconsistent about what "serialize" means. JMP has its > own magic on x86, but only on very very old CPUs. > > The specific instruction that flushes everything into the coherency > domain is SFENCE, and SFENCE is not, for normal purposes, needed for > self- or cross-modifying code. > Good reason to avoid such language in the system call interface! Thanks, Nick
Excerpts from Andy Lutomirski's message of December 29, 2020 10:36 am: > On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am: >> > The old sync_core_before_usermode() comments said that a non-icache-syncing >> > return-to-usermode instruction is x86-specific and that all other >> > architectures automatically notice cross-modified code on return to >> > userspace. Based on my general understanding of how CPUs work and based on >> > my atttempt to read the ARM manual, this is not true at all. In fact, x86 >> > seems to be a bit of an anomaly in the other direction: x86's IRET is >> > unusually heavyweight for a return-to-usermode instruction. >> >> "sync_core_before_usermode" as I've said says nothing to arch, or to the >> scheduler, or to membarrier. > > Agreed. My patch tries to fix this. I agree that the name is bad and > could be improved further. We should define what > membarrier(...SYNC_CORE) actually does and have arch hooks to make it > happen. > >> > So let's drop any pretense that we can have a generic way implementation >> > behind membarrier's SYNC_CORE flush and require all architectures that opt >> > in to supply their own. This means x86, arm64, and powerpc for now. Let's >> > also rename the function from sync_core_before_usermode() to >> > membarrier_sync_core_before_usermode() because the precise flushing details >> > may very well be specific to membarrier, and even the concept of >> > "sync_core" in the kernel is mostly an x86-ism. >> >> The concept of "sync_core" (x86: serializing instruction, powerpc: context >> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted >> to add a serializing instruction to generic code so it grew this nasty API, >> but the concept applies broadly. > > I mean that the mapping from the name "sync_core" to its semantics is > x86 only. The string "sync_core" appears in the kernel only in > arch/x86, membarrier code, membarrier docs, and a single SGI driver > that is x86-only. Sure, the idea of serializing things is fairly > generic, but exactly what operations serialize what, when things need > serialization, etc is quite architecture specific. Okay, well yes it's x86 only in name, I was more talking about the concept. > Heck, on 486 you serialize the instruction stream with JMP. x86-specific aside, I did think the semantics of a "serializing instruction" was reasonably well architected in x86. Sure it could do other things as well, but if you executed a serializing instruction, then you had a decent set of guarantees (e.g., what you might want for code modification). > >> > +static inline void membarrier_sync_core_before_usermode(void) >> > +{ >> > + /* >> > + * XXX: I know basically nothing about powerpc cache management. >> > + * Is this correct? >> > + */ >> > + isync(); >> >> This is not about memory ordering or cache management, it's about >> pipeline management. Powerpc's return to user mode serializes the >> CPU (aka the hardware thread, _not_ the core; another wrongness of >> the name, but AFAIKS the HW thread is what is required for >> membarrier). So this is wrong, powerpc needs nothing here. > > Fair enough. I'm happy to defer to you on the powerpc details. In > any case, this just illustrates that we need feedback from a person > who knows more about ARM64 than I do. > Thanks, Nick
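As an aside, the classic architecturally guaranteed way to execute a "serializing instruction" from C on x86 is CPUID, which the SDM documents as serializing; the kernel's sync_core() did essentially this for years before switching to an IRET-to-self sequence. A minimal sketch:

/*
 * Force this CPU to finish everything before this point and to refetch
 * instructions after it.  CPUID is documented as serializing; real code
 * would use the kernel's sync_core()/serialize() helpers instead.
 */
static inline void serialize_this_cpu(void)
{
	unsigned int eax = 0, ebx, ecx = 0, edx;

	asm volatile("cpuid"
		     : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
		     :
		     : "memory");
}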
On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote: > I think it should certainly be documented in terms of what guarantees > it provides to application, _not_ the kinds of instructions it may or > may not induce the core to execute. And if existing API can't be > re-documented sanely, then deprecatd and new ones added that DTRT. > Possibly under a new system call, if arch's like ARM want a range > flush and we don't want to expand the multiplexing behaviour of > membarrier even more (sigh). The 32-bit ARM sys_cacheflush() is there only to support self-modifying code, and takes whatever actions are necessary to support that. Exactly what actions it takes are cache implementation specific, and should be of no concern to the caller, but the underlying thing is... it's to support self-modifying code. Sadly, because it's existed for 20+ years, and it has historically been sufficient for other purposes too, it has seen quite a bit of abuse despite its design purpose not changing - it's been used by graphics drivers for example. They quickly learnt the error of their ways with ARMv6+, since it does not do sufficient for their purposes given the cache architectures found there. Let's not go around redesigning this after twenty odd years, requiring a hell of a lot of pain to users. This interface is called by code generated by GCC, so to change it you're looking at patching GCC as well as the kernel, and you basically will make new programs incompatible with older kernels - very bad news for users.
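For reference, the 32-bit ARM call Russell describes is reached through ARM's private cacheflush trap, and it is also what GCC's __clear_cache() ends up invoking there. A rough sketch; the syscall number shown is an assumption based on the ARM EABI headers (__ARM_NR_BASE + 2) and is spelled out only for illustration:

#include <unistd.h>
#include <sys/syscall.h>

#ifndef __ARM_NR_cacheflush		/* normally comes from <asm/unistd.h> */
#define __ARM_NR_cacheflush	0x0f0002
#endif

/* Tell the kernel that [start, end) now holds code we are about to run. */
static void arm_cacheflush(void *start, void *end)
{
	syscall(__ARM_NR_cacheflush, start, end, 0 /* flags: must be zero */);
}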
Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm: > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote: >> I think it should certainly be documented in terms of what guarantees >> it provides to application, _not_ the kinds of instructions it may or >> may not induce the core to execute. And if existing API can't be >> re-documented sanely, then deprecatd and new ones added that DTRT. >> Possibly under a new system call, if arch's like ARM want a range >> flush and we don't want to expand the multiplexing behaviour of >> membarrier even more (sigh). > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying > code, and takes whatever actions are necessary to support that. > Exactly what actions it takes are cache implementation specific, and > should be of no concern to the caller, but the underlying thing is... > it's to support self-modifying code. Caveat cacheflush() should not be used in programs intended to be portable. On Linux, this call first appeared on the MIPS architecture, but nowa‐ days, Linux provides a cacheflush() system call on some other architec‐ tures, but with different arguments. What a disaster. Another badly designed interface, although it didn't originate in Linux it sounds like we weren't to be outdone so we messed it up even worse. flushing caches is neither necessary nor sufficient for code modification on many processors. Maybe some old MIPS specific private thing was fine, but certainly before it grew to other architectures, somebody should have thought for more than 2 minutes about it. Sigh. > > Sadly, because it's existed for 20+ years, and it has historically been > sufficient for other purposes too, it has seen quite a bit of abuse > despite its design purpose not changing - it's been used by graphics > drivers for example. They quickly learnt the error of their ways with > ARMv6+, since it does not do sufficient for their purposes given the > cache architectures found there. > > Let's not go around redesigning this after twenty odd years, requiring > a hell of a lot of pain to users. This interface is called by code > generated by GCC, so to change it you're looking at patching GCC as > well as the kernel, and you basically will make new programs > incompatible with older kernels - very bad news for users. For something to be redesigned it had to have been designed in the first place, so there is no danger of that don't worry... But no I never suggested making incompatible changes to any existing system call, I said "re-documented". And yes I said deprecated but in Linux that really means kept indefinitely. If ARM, MIPS, 68k etc programs and toolchains keep using what they are using it'll keep working no problem. The point is we're growing new interfaces, and making the same mistakes. It's not portable (ARCH_HAS_MEMBARRIER_SYNC_CORE), it's also specified in terms of low level processor operations rather than higher level intent, and also is not sufficient for self-modifying code (without additional cache flush on some processors). The application wants a call that says something like "memory modified before the call will be visible as instructions (including illegal instructions) by all threads in the program after the system call returns, and no threads will be subject to any effects of executing the previous contents of that memory. So I think the basics are simple (although should confirm with some JIT and debugger etc developers, and not just Android mind you). 
There are some complications in the details: address ranges, virtual/physical, thread-local vs process vs different process or system-wide, memory ordering and propagation of the i and d sides, etc. But that can be worked through, erring on the side of sanity rather than pointless micro-optimisations.

Thanks,
Nick
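Sketching the intent-level interface Nick describes makes the contrast with the current instruction-level wording clearer. Everything below is hypothetical: no such system call exists, and the name, arguments, and flags are invented purely for illustration.

/*
 * HYPOTHETICAL interface, for illustration only.  One call that owns the
 * whole "I changed this code" problem: the kernel picks whatever mix of
 * cache maintenance, IPIs, and context-synchronizing returns the
 * architecture actually needs.
 */
#define SYNC_CODE_SCOPE_PROCESS	0x1	/* all threads of the caller */
#define SYNC_CODE_SCOPE_GLOBAL	0x2	/* every task that can map it */

int sync_code(void *addr, size_t len, unsigned int scope);

/*
 * Intended use after a JIT rewrite:
 *
 *	memcpy(codeptr, new_insns, len);
 *	sync_code(codeptr, len, SYNC_CODE_SCOPE_PROCESS);
 *	... every thread in the process now executes the new code.
 */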
On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote: > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm: > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote: > >> I think it should certainly be documented in terms of what guarantees > >> it provides to application, _not_ the kinds of instructions it may or > >> may not induce the core to execute. And if existing API can't be > >> re-documented sanely, then deprecatd and new ones added that DTRT. > >> Possibly under a new system call, if arch's like ARM want a range > >> flush and we don't want to expand the multiplexing behaviour of > >> membarrier even more (sigh). > > > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying > > code, and takes whatever actions are necessary to support that. > > Exactly what actions it takes are cache implementation specific, and > > should be of no concern to the caller, but the underlying thing is... > > it's to support self-modifying code. > > Caveat > cacheflush() should not be used in programs intended to be portable. > On Linux, this call first appeared on the MIPS architecture, but nowa‐ > days, Linux provides a cacheflush() system call on some other architec‐ > tures, but with different arguments. > > What a disaster. Another badly designed interface, although it didn't > originate in Linux it sounds like we weren't to be outdone so > we messed it up even worse. > > flushing caches is neither necessary nor sufficient for code modification > on many processors. Maybe some old MIPS specific private thing was fine, > but certainly before it grew to other architectures, somebody should > have thought for more than 2 minutes about it. Sigh. WARNING: You are bordering on being objectionable and offensive with that comment. The ARM interface was designed by me back in the very early days of Linux, probably while you were still in dypers, based on what was known at the time. Back in the early 2000s, ideas such as relaxed memory ordering were not known. All there was was one level of harvard cache. So, juts shut up with your insults.
On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote: > On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote: > > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm: > > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote: > > >> I think it should certainly be documented in terms of what guarantees > > >> it provides to application, _not_ the kinds of instructions it may or > > >> may not induce the core to execute. And if existing API can't be > > >> re-documented sanely, then deprecatd and new ones added that DTRT. > > >> Possibly under a new system call, if arch's like ARM want a range > > >> flush and we don't want to expand the multiplexing behaviour of > > >> membarrier even more (sigh). > > > > > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying > > > code, and takes whatever actions are necessary to support that. > > > Exactly what actions it takes are cache implementation specific, and > > > should be of no concern to the caller, but the underlying thing is... > > > it's to support self-modifying code. > > > > Caveat > > cacheflush() should not be used in programs intended to be portable. > > On Linux, this call first appeared on the MIPS architecture, but nowa‐ > > days, Linux provides a cacheflush() system call on some other architec‐ > > tures, but with different arguments. > > > > What a disaster. Another badly designed interface, although it didn't > > originate in Linux it sounds like we weren't to be outdone so > > we messed it up even worse. > > > > flushing caches is neither necessary nor sufficient for code modification > > on many processors. Maybe some old MIPS specific private thing was fine, > > but certainly before it grew to other architectures, somebody should > > have thought for more than 2 minutes about it. Sigh. > > WARNING: You are bordering on being objectionable and offensive with > that comment. > > The ARM interface was designed by me back in the very early days of > Linux, probably while you were still in dypers, based on what was > known at the time. Back in the early 2000s, ideas such as relaxed > memory ordering were not known. All there was was one level of > harvard cache. Sorry, I got that slightly wrong. Its origins on ARM were from 12 Dec 1998: http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1 by Philip Blundell, and first appeared in the ARM pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has a slightly different signature: the third argument on ARM is a flags argument, whereas the MIPS code, it is some undocumented "cache" parameter. Whether the ARM version pre or post dates the MIPS code, I couldn't say. Whether it was ultimately taken from the MIPS implementation, again, I couldn't say. However, please stop insulting your fellow developers ability to think.
Excerpts from Russell King - ARM Linux admin's message of December 30, 2020 8:58 pm: > On Wed, Dec 30, 2020 at 10:00:28AM +0000, Russell King - ARM Linux admin wrote: >> On Wed, Dec 30, 2020 at 12:33:02PM +1000, Nicholas Piggin wrote: >> > Excerpts from Russell King - ARM Linux admin's message of December 29, 2020 8:44 pm: >> > > On Tue, Dec 29, 2020 at 01:09:12PM +1000, Nicholas Piggin wrote: >> > >> I think it should certainly be documented in terms of what guarantees >> > >> it provides to application, _not_ the kinds of instructions it may or >> > >> may not induce the core to execute. And if existing API can't be >> > >> re-documented sanely, then deprecatd and new ones added that DTRT. >> > >> Possibly under a new system call, if arch's like ARM want a range >> > >> flush and we don't want to expand the multiplexing behaviour of >> > >> membarrier even more (sigh). >> > > >> > > The 32-bit ARM sys_cacheflush() is there only to support self-modifying >> > > code, and takes whatever actions are necessary to support that. >> > > Exactly what actions it takes are cache implementation specific, and >> > > should be of no concern to the caller, but the underlying thing is... >> > > it's to support self-modifying code. >> > >> > Caveat >> > cacheflush() should not be used in programs intended to be portable. >> > On Linux, this call first appeared on the MIPS architecture, but nowa‐ >> > days, Linux provides a cacheflush() system call on some other architec‐ >> > tures, but with different arguments. >> > >> > What a disaster. Another badly designed interface, although it didn't >> > originate in Linux it sounds like we weren't to be outdone so >> > we messed it up even worse. >> > >> > flushing caches is neither necessary nor sufficient for code modification >> > on many processors. Maybe some old MIPS specific private thing was fine, >> > but certainly before it grew to other architectures, somebody should >> > have thought for more than 2 minutes about it. Sigh. >> >> WARNING: You are bordering on being objectionable and offensive with >> that comment. >> >> The ARM interface was designed by me back in the very early days of >> Linux, probably while you were still in dypers, based on what was >> known at the time. Back in the early 2000s, ideas such as relaxed >> memory ordering were not known. All there was was one level of >> harvard cache. I wasn't talking about memory ordering at all, and I assumed it came earlier and was brought to Linux for portability reasons - CONFORMING TO Historically, this system call was available on all MIPS UNIX variants including RISC/os, IRIX, Ultrix, NetBSD, OpenBSD, and FreeBSD (and also on some non-UNIX MIPS operating systems), so that the existence of this call in MIPS operating systems is a de-facto standard. I don't think the call was totally unreasonable for incoherent virtual caches or incoherent i/d caches etc. Although early unix system call interface demonstrates that people understood how to define good interfaces that dealt with intent at an abstract level rather than implementation -- munmap doesn't specify anywhere that a TLB flush instruction must be executed, for example. So "cacheflush" was obviously never a well designed interface but rather the typical hardware-centric hack to get their stuff working (which was fine for its purpose I'm sure). > > Sorry, I got that slightly wrong. 
Its origins on ARM were from 12 Dec > 1998: > > http://www.armlinux.org.uk/developer/patches/viewpatch.php?id=88/1 > > by Philip Blundell, and first appeared in the ARM > pre-patch-2.1.131-19981214-1.gz. It was subsequently documented in the > kernel sources by me in July 2001 in ARM patch-2.4.6-rmk2.gz. It has > a slightly different signature: the third argument on ARM is a flags > argument, whereas the MIPS code, it is some undocumented "cache" > parameter. > > Whether the ARM version pre or post dates the MIPS code, I couldn't say. > Whether it was ultimately taken from the MIPS implementation, again, I > couldn't say. I can, it was in MIPS in late 1.3 kernels at least. I would guess it came from IRIX. > However, please stop insulting your fellow developers ability to think. Sorry, I was being melodramatic. Everyone makes mistakes or decisions which with hindsight could have been better or were under some constraint that isn't apparent. I shouldn't have suggested it indicated thoughtlessness. Thanks, Nick
From: Andy Lutomirski > Sent: 29 December 2020 00:36 ... > I mean that the mapping from the name "sync_core" to its semantics is > x86 only. The string "sync_core" appears in the kernel only in > arch/x86, membarrier code, membarrier docs, and a single SGI driver > that is x86-only. Sure, the idea of serializing things is fairly > generic, but exactly what operations serialize what, when things need > serialization, etc is quite architecture specific. > > Heck, on 486 you serialize the instruction stream with JMP. Did the 486 even have a memory cache? Never mind separate I&D caches. Without branch prediction or an I$ a jmp is enough. No idea how the dual 486 box we had actually behaved. For non-SMP the x86 cpus tend to still be compatible with the original 8086 - so are pretty much fully coherent. ISTR the memory writes will invalidate I$ lines. But there was some hardware compatibility that meant a load of Pentium-75 systems were 'scavenged' from development for a customer - we got faster P-266 boxes as replacements. OTOH we never did work out how to do the required 'barrier' when switching a Via C3 to and from 16-bit mode. Sometimes it worked, other times the cpu went AWOL. Best guess was that it sometimes executed pre-decoded instructions for the wrong mode when returning from the function call that flipped modes. Then there is the P-Pro era Intel doc that says that IOR/IOW aren't sequenced wrt memory accesses. Fortunately all x86 processors have sequenced them. Which is what the current docs say. David
Hi Andy, Sorry for the slow reply, I was socially distanced from my keyboard. On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote: > On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > > +static inline void membarrier_sync_core_before_usermode(void) > > > +{ > > > + /* > > > + * XXX: I know basically nothing about powerpc cache management. > > > + * Is this correct? > > > + */ > > > + isync(); > > > > This is not about memory ordering or cache management, it's about > > pipeline management. Powerpc's return to user mode serializes the > > CPU (aka the hardware thread, _not_ the core; another wrongness of > > the name, but AFAIKS the HW thread is what is required for > > membarrier). So this is wrong, powerpc needs nothing here. > > Fair enough. I'm happy to defer to you on the powerpc details. In > any case, this just illustrates that we need feedback from a person > who knows more about ARM64 than I do. I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking: 1. SYNC_CORE does _not_ perform any cache management; that is the responsibility of userspace, either by executing the relevant maintenance instructions (arm64) or a system call (arm32). Crucially, the hardware will ensure that this cache maintenance is broadcast to all other CPUs. 2. Even with all the cache maintenance in the world, a CPU could have speculatively fetched stale instructions into its "pipeline" ahead of time, and these are _not_ flushed by the broadcast maintenance instructions in (1). SYNC_CORE provides a means for userspace to discard these stale instructions. 3. The context synchronization event on exception entry/exit is sufficient here. The Arm ARM isn't very good at describing what it does, because it's in denial about the existence of a pipeline, but it does have snippets such as: (s/PE/CPU/) | For all types of memory: | The PE might have fetched the instructions from memory at any time | since the last Context synchronization event on that PE. Interestingly, the architecture recently added a control bit to remove this synchronisation from exception return, so if we set that then we'd have a problem with SYNC_CORE and adding an ISB would be necessary (and we could probable then make kernel->kernel returns cheaper, but I suspect we're relying on this implicit synchronisation in other places too). Are you seeing a problem in practice, or did this come up while trying to decipher the semantics of SYNC_CORE? Will
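To make points (1) and (2) concrete, the userspace arm64 maintenance for one freshly written cache line is roughly the sequence below (approximately what the compiler's __clear_cache() expands to; real code reads CTR_EL0 for the line sizes and loops over the whole range). Note that the trailing ISB only resynchronizes the CPU that executes it, which is exactly the gap membarrier(SYNC_CORE) is meant to close for the other threads.

/* Rough per-line sketch of arm64 I/D maintenance from userspace. */
static inline void arm64_sync_code_line(void *addr)
{
	asm volatile(
		"dc cvau, %0\n\t"	/* clean D-cache by VA to point of unification */
		"dsb ish\n\t"		/* ...and wait for it to reach the other CPUs  */
		"ic ivau, %0\n\t"	/* invalidate I-cache by VA to PoU (broadcast)  */
		"dsb ish\n\t"
		"isb"			/* resync *this* CPU's instruction fetch only  */
		: : "r" (addr) : "memory");
}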
> On Jan 5, 2021, at 5:26 AM, Will Deacon <will@kernel.org> wrote: > > Hi Andy, > > Sorry for the slow reply, I was socially distanced from my keyboard. > >> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote: >> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>> +static inline void membarrier_sync_core_before_usermode(void) >>>> +{ >>>> + /* >>>> + * XXX: I know basically nothing about powerpc cache management. >>>> + * Is this correct? >>>> + */ >>>> + isync(); >>> >>> This is not about memory ordering or cache management, it's about >>> pipeline management. Powerpc's return to user mode serializes the >>> CPU (aka the hardware thread, _not_ the core; another wrongness of >>> the name, but AFAIKS the HW thread is what is required for >>> membarrier). So this is wrong, powerpc needs nothing here. >> >> Fair enough. I'm happy to defer to you on the powerpc details. In >> any case, this just illustrates that we need feedback from a person >> who knows more about ARM64 than I do. > > I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking: > > 1. SYNC_CORE does _not_ perform any cache management; that is the > responsibility of userspace, either by executing the relevant > maintenance instructions (arm64) or a system call (arm32). Crucially, > the hardware will ensure that this cache maintenance is broadcast > to all other CPUs. Is this guaranteed regardless of any aliases? That is, if I flush from one CPU at one VA and then execute the same physical address from another CPU at a different VA, does this still work? > > 2. Even with all the cache maintenance in the world, a CPU could have > speculatively fetched stale instructions into its "pipeline" ahead of > time, and these are _not_ flushed by the broadcast maintenance instructions > in (1). SYNC_CORE provides a means for userspace to discard these stale > instructions. > > 3. The context synchronization event on exception entry/exit is > sufficient here. The Arm ARM isn't very good at describing what it > does, because it's in denial about the existence of a pipeline, but > it does have snippets such as: > > (s/PE/CPU/) > | For all types of memory: > | The PE might have fetched the instructions from memory at any time > | since the last Context synchronization event on that PE. > > Interestingly, the architecture recently added a control bit to remove > this synchronisation from exception return, so if we set that then we'd > have a problem with SYNC_CORE and adding an ISB would be necessary (and > we could probable then make kernel->kernel returns cheaper, but I > suspect we're relying on this implicit synchronisation in other places > too). > Is ISB just a context synchronization event or does it do more? On x86, it’s very hard to tell that MFENCE does any more than LOCK, but it’s much slower. And we have LFENCE, which, as documented, doesn’t appear to have any semantics at all. (Or at least it didn’t before Spectre.) > Are you seeing a problem in practice, or did this come up while trying to > decipher the semantics of SYNC_CORE? It came up while trying to understand the code and work through various bugs in it. The code was written using something approximating x86 terminology, but it was definitely wrong on x86 (at least if you believe the SDM, and I haven’t convinced any architects to say otherwise). Thanks! > > Will
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> > Interestingly, the architecture recently added a control bit to remove this synchronisation from exception return, so if we set that then we'd have a problem with SYNC_CORE and adding an ISB would be necessary (and we could probable then make kernel->kernel returns cheaper, but I suspect we're relying on this implicit synchronisation in other places too).
> >
>
> Is ISB just a context synchronization event or does it do more?

IIRC it's just the instruction sync (like power ISYNC).

> On x86, it’s very hard to tell that MFENCE does any more than LOCK, but it’s much slower. And we have LFENCE, which, as documented, doesn’t appear to have any semantics at all. (Or at least it didn’t before Spectre.)

AFAIU MFENCE is a completion barrier, while the LOCK prefix is not. A bit like ARM's DSB vs DMB. It is for this reason that mb() is still MFENCE, while our smp_mb() is a LOCK-prefixed NO-OP.

And yes, LFENCE used to be poorly defined and it was sometimes understood to be a completion barrier relative to prior LOADs, while it is now a completion barrier for any prior instruction, and really should be renamed to IFENCE.
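For readers following along, this is approximately how the x86 side of the kernel spells that choice today (paraphrased from arch/x86/include/asm/barrier.h; the exact stack-slot operand is illustrative):

/* mb() must be a full completion barrier, hence MFENCE. */
#define mb()		asm volatile("mfence" ::: "memory")

/*
 * smp_mb() only needs ordering, not completion, and a LOCK-prefixed
 * read-modify-write of a harmless stack slot is cheaper than MFENCE
 * on most CPUs.
 */
#define __smp_mb()	asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")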
On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote: > > On Jan 5, 2021, at 5:26 AM, Will Deacon <will@kernel.org> wrote: > > Sorry for the slow reply, I was socially distanced from my keyboard. > > > >> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote: > >> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@gmail.com> wrote: > >>>> +static inline void membarrier_sync_core_before_usermode(void) > >>>> +{ > >>>> + /* > >>>> + * XXX: I know basically nothing about powerpc cache management. > >>>> + * Is this correct? > >>>> + */ > >>>> + isync(); > >>> > >>> This is not about memory ordering or cache management, it's about > >>> pipeline management. Powerpc's return to user mode serializes the > >>> CPU (aka the hardware thread, _not_ the core; another wrongness of > >>> the name, but AFAIKS the HW thread is what is required for > >>> membarrier). So this is wrong, powerpc needs nothing here. > >> > >> Fair enough. I'm happy to defer to you on the powerpc details. In > >> any case, this just illustrates that we need feedback from a person > >> who knows more about ARM64 than I do. > > > > I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking: > > > > 1. SYNC_CORE does _not_ perform any cache management; that is the > > responsibility of userspace, either by executing the relevant > > maintenance instructions (arm64) or a system call (arm32). Crucially, > > the hardware will ensure that this cache maintenance is broadcast > > to all other CPUs. > > Is this guaranteed regardless of any aliases? That is, if I flush from > one CPU at one VA and then execute the same physical address from another > CPU at a different VA, does this still work? The data side will be fine, but the instruction side can have virtual aliases. We handle this in flush_ptrace_access() by blowing away the whole I-cache if we're not physically-indexed, but userspace would be in trouble if it wanted to handle this situation alone. > > 2. Even with all the cache maintenance in the world, a CPU could have > > speculatively fetched stale instructions into its "pipeline" ahead of > > time, and these are _not_ flushed by the broadcast maintenance instructions > > in (1). SYNC_CORE provides a means for userspace to discard these stale > > instructions. > > > > 3. The context synchronization event on exception entry/exit is > > sufficient here. The Arm ARM isn't very good at describing what it > > does, because it's in denial about the existence of a pipeline, but > > it does have snippets such as: > > > > (s/PE/CPU/) > > | For all types of memory: > > | The PE might have fetched the instructions from memory at any time > > | since the last Context synchronization event on that PE. > > > > Interestingly, the architecture recently added a control bit to remove > > this synchronisation from exception return, so if we set that then we'd > > have a problem with SYNC_CORE and adding an ISB would be necessary (and > > we could probable then make kernel->kernel returns cheaper, but I > > suspect we're relying on this implicit synchronisation in other places > > too). > > > > Is ISB just a context synchronization event or does it do more? That's a good question. Barrier instructions on ARM do tend to get overloaded with extra behaviours over time, so it could certainly end up doing the context synchronization event + extra stuff in future. 
Right now, the only thing that springs to mind is the spectre-v1 heavy mitigation barrier of 'DSB; ISB' which, for example, probably doesn't work for 'DSB; ERET' because the ERET can be treated like a conditional (!) branch. > On x86, it’s very hard to tell that MFENCE does any more than LOCK, but > it’s much slower. And we have LFENCE, which, as documented, doesn’t > appear to have any semantics at all. (Or at least it didn’t before > Spectre.) I tend to think of ISB as a front-end barrier relating to instruction fetch whereas DMB, acquire/release and DSB are all back-end barriers relating to memory accesses. You _can_ use ISB in conjunction with control dependencies to order a pair of loads (like you can with ISYNC on Power), but it's a really expensive way to do it. > > Are you seeing a problem in practice, or did this come up while trying to > > decipher the semantics of SYNC_CORE? > > It came up while trying to understand the code and work through various > bugs in it. The code was written using something approximating x86 > terminology, but it was definitely wrong on x86 (at least if you believe > the SDM, and I haven’t convinced any architects to say otherwise). Ok, thanks. Will
diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index 000000000000..5be4531caabd
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	/*
+	 * XXX: is this enough or do we need a DMB first to make sure that
+	 * writes from other CPUs become visible to this CPU? We have an
+	 * smp_mb() already, but that's not quite the same thing.
+	 */
+	isb();
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index 000000000000..71dfbe7794e5
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include <asm/barrier.h>
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+	/*
+	 * XXX: I know basically nothing about powerpc cache management.
+	 * Is this correct?
+	 */
+	isync();
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5137cc5b7b4..895f70fd4a61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,7 +81,6 @@ config X86
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
-	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_core(void)
 }
 
 /*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
  */
-static inline void sync_core_before_usermode(void)
+static inline void membarrier_sync_core_before_usermode(void)
 {
 	/* With PTI, we unconditionally serialize before running user code. */
 	if (static_cpu_has(X86_FEATURE_PTI))
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 48640db6ca86..81ba47910a73 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/mm_types.h>
 #include <linux/gfp.h>
-#include <linux/sync_core.h>
 
 /*
  * Routines for handling mm_structs
diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h
deleted file mode 100644
index 013da4b8b327..000000000000
--- a/include/linux/sync_core.h
+++ /dev/null
@@ -1,21 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _LINUX_SYNC_CORE_H
-#define _LINUX_SYNC_CORE_H
-
-#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-#include <asm/sync_core.h>
-#else
-/*
- * This is a dummy sync_core_before_usermode() implementation that can be used
- * on all architectures which return to user-space through core serializing
- * instructions.
- * If your architecture returns to user-space through non-core-serializing
- * instructions, you need to write your own functions.
- */
-static inline void sync_core_before_usermode(void)
-{
-}
-#endif
-
-#endif /* _LINUX_SYNC_CORE_H */
-
diff --git a/init/Kconfig b/init/Kconfig
index c9446911cf41..eb9772078cd4 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2334,9 +2334,6 @@ source "kernel/Kconfig.locks"
 config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	bool
 
-config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
-	bool
-
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index b3a82d7635da..db4945e1ec94 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -5,6 +5,9 @@
  * membarrier system call
  */
 #include "sched.h"
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
+#include <asm/sync_core.h>
+#endif
 
 /*
  * The basic principle behind the regular memory barrier mode of membarrier()
@@ -221,6 +224,7 @@ static void ipi_mb(void *info)
 	smp_mb();	/* IPIs should be serializing but paranoid. */
 }
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 static void ipi_sync_core(void *info)
 {
 	/*
@@ -230,13 +234,14 @@
 	 * the big comment at the top of this file.
 	 *
 	 * A sync_core() would provide this guarantee, but
-	 * sync_core_before_usermode() might end up being deferred until
-	 * after membarrier()'s smp_mb().
+	 * membarrier_sync_core_before_usermode() might end up being deferred
+	 * until after membarrier()'s smp_mb().
 	 */
 	smp_mb();	/* IPIs should be serializing but paranoid. */
-	sync_core_before_usermode();
+	membarrier_sync_core_before_usermode();
 }
+#endif
 
 static void ipi_rseq(void *info)
 {
@@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 	smp_call_func_t ipi_func = ipi_mb;
 
 	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
-		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
+#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
 			return -EINVAL;
+#else
 		if (!(atomic_read(&mm->membarrier_state) &
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
 			return -EPERM;
 		ipi_func = ipi_sync_core;
+#endif
 	} else if (flags == MEMBARRIER_FLAG_RSEQ) {
 		if (!IS_ENABLED(CONFIG_RSEQ))
 			return -EINVAL;
The old sync_core_before_usermode() comments said that a non-icache-syncing return-to-usermode instruction is x86-specific and that all other architectures automatically notice cross-modified code on return to userspace. Based on my general understanding of how CPUs work and based on my attempt to read the ARM manual, this is not true at all. In fact, x86 seems to be a bit of an anomaly in the other direction: x86's IRET is unusually heavyweight for a return-to-usermode instruction.

So let's drop any pretense that we can have a generic implementation behind membarrier's SYNC_CORE flush and require all architectures that opt in to supply their own. This means x86, arm64, and powerpc for now. Let's also rename the function from sync_core_before_usermode() to membarrier_sync_core_before_usermode() because the precise flushing details may very well be specific to membarrier, and even the concept of "sync_core" in the kernel is mostly an x86-ism.

I admit that I'm rather surprised that the code worked at all on arm64, and I'm suspicious that it has never been very well tested. My apologies for not reviewing this more carefully in the first place.

Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

Hi arm64 and powerpc people-

This is part of a series here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes

Before I send out the whole series, I'm hoping that some arm64 and powerpc people can help me verify that I did this patch right. Once I get some feedback on this patch, I'll send out the whole pile. And once *that's* done, I'll start giving the mm lazy stuff some serious thought.

The x86 part is already fixed in Linus' tree.

Thanks,
Andy

 arch/arm64/include/asm/sync_core.h   | 21 +++++++++++++++++++++
 arch/powerpc/include/asm/sync_core.h | 20 ++++++++++++++++++++
 arch/x86/Kconfig                     |  1 -
 arch/x86/include/asm/sync_core.h     |  7 +++----
 include/linux/sched/mm.h             |  1 -
 include/linux/sync_core.h            | 21 ---------------------
 init/Kconfig                         |  3 ---
 kernel/sched/membarrier.c            | 15 +++++++++++----
 8 files changed, 55 insertions(+), 34 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h