Message ID | 20200715170844.30064-30-catalin.marinas@arm.com (mailing list archive)
State | New, archived
Series | arm64: Memory Tagging Extension user-space support
The 07/15/2020 18:08, Catalin Marinas wrote:
> From: Vincenzo Frascino <vincenzo.frascino@arm.com>
>
> Memory Tagging Extension (part of the ARMv8.5 Extensions) provides a mechanism to detect the sources of memory related errors which may be vulnerable to exploitation, including bounds violations, use-after-free, use-after-return, use-out-of-scope and use before initialization errors.
>
> Add Memory Tagging Extension documentation for the arm64 linux kernel support.
>
> Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
> Co-developed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> Acked-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
>
> Notes:
>     v7:
>     - Add information on ptrace() regset access (NT_ARM_TAGGED_ADDR_CTRL).
>
>     v4:
>     - Document behaviour of madvise(MADV_DONTNEED/MADV_FREE).
>     - Document the initial process state on fork/execve.
>     - Clarify when the kernel uaccess checks the tags.
>     - Minor updates to the example code.
>     - A few other minor clean-ups following review.
>
>     v3:
>     - Modify the uaccess checking conditions: only when the sync mode is selected by the user. In async mode, the kernel uaccesses are not checked.
>     - Clarify that an include mask of 0 (exclude mask 0xffff) results in always generating tag 0.
>     - Document the ptrace() interface.
>
>     v2:
>     - Documented the uaccess kernel tag checking mode.
>     - Removed the BTI definitions from cpu-feature-registers.rst.
>     - Removed the paragraph stating that MTE depends on the tagged address ABI (while the Kconfig entry does, there is no requirement for the user to enable both).
>     - Changed the GCR_EL1.Exclude handling description following the change in the prctl() interface (include vs exclude mask).
>     - Updated the example code.
>
> Documentation/arm64/cpu-feature-registers.rst | 2 +
> Documentation/arm64/elf_hwcaps.rst | 4 +
> Documentation/arm64/index.rst | 1 +
> .../arm64/memory-tagging-extension.rst | 305 ++++++++++++++++++
> 4 files changed, 312 insertions(+)
> create mode 100644 Documentation/arm64/memory-tagging-extension.rst
>
> diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
...
> +Tag Check Faults
> +----------------
> +
> +When ``PROT_MTE`` is enabled on an address range and a mismatch between the logical and allocation tags occurs on access, there are three configurable behaviours:
> +
> +- *Ignore* - This is the default mode. The CPU (and kernel) ignores the tag check fault.
> +
> +- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The memory access is not performed. If ``SIGSEGV`` is ignored or blocked by the offending thread, the containing process is terminated with a ``coredump``.
> +
> +- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending thread, asynchronously following one or multiple tag check faults, with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting address is unknown).
> +
> +The user can select the above modes, per thread, using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where ``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK`` bit-field:
> +
> +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
> +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode
> +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> +
> +The current tag check fault mode can be read using the ``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
we discussed the need for a per-process prctl off list; i will try to summarize the requirement here:

- it cannot be guaranteed in general that a library initializer or the first call into a library happens while the process is still single threaded.

- user code currently has no way to call prctl in all threads of a process, and even within the c runtime doing so is problematic (it has to signal all threads, which requires a reserved signal and dealing with exiting threads and signal masks; such a mechanism can break qemu-user and various other userspace tooling).

- we don't yet have a defined contract in userspace about how user code may enable mte (i.e. use the prctl call), but it seems there will be use cases for it: LD_PRELOADing malloc for heap tagging is one such case, but any library or custom allocator that wants to use mte will have this issue: when it enables mte it wants to enable it for all threads in the process (or at least all threads managed by the c runtime).

- even if user code is not allowed to call the prctl directly, i.e. the prctl settings are owned by the libc, there will be cases when the settings have to be changed in a multithreaded process (e.g. dlopening a library that requires a particular mte state).

a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC that means the prctl applies to all threads in the process, not just the current one. however, the exact semantics are not obvious if there are inconsistent settings in different threads or if user code tries to use the prctl concurrently: first checking and then setting the mte state via separate prctl calls is racy. but if the userspace contract for enabling mte limits who can call the prctl and when, then i think the simple sync flag approach works.

(the sync flag should apply to all prctl settings: tagged addr syscall abi, mte check fault mode, irg tag excludes. ideally it would also work for getting the process-wide state, and it would fail in case of inconsistent settings.)
we may need to document some memory ordering details when memory accesses in other threads are affected, but i think that can be something simple that leaves it unspecified what happens with memory accesses that are not synchronized with the prctl call.
On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> The 07/15/2020 18:08, Catalin Marinas wrote:
> > [...]
> we discussed the need for per process prctl off list, i will try to summarize the requirement here:
>
> - it cannot be guaranteed in general that a library initializer or first call into a library happens when the process is still single threaded.
>
> - user code currently has no way to call prctl in all threads of a process and even within the c runtime doing so is problematic (it has to signal all threads, which requires a reserved signal and dealing with exiting threads and signal masks, such mechanism can break qemu user and various other userspace tooling).

When working on the SVE support, I came to the conclusion that this kind of thing would normally either be done by the runtime itself, or in close cooperation with the runtime. However, for SVE it never makes sense for one thread to asynchronously change the vector length of another thread -- that's different from the MTE situation.

> - we don't yet have defined contract in userspace about how user code may enable mte (i.e. use the prctl call), but it seems that there will be use cases for it: LD_PRELOADing malloc for heap tagging is one such case, but any library or custom allocator that wants to use mte will have this issue: when it enables mte it wants to enable it for all threads in the process. (or at least all threads managed by the c runtime).

What are the situations where we anticipate a need to twiddle MTE in multiple threads simultaneously, other than during process startup?

> - even if user code is not allowed to call the prctl directly, i.e. the prctl settings are owned by the libc, there will be cases when the settings have to be changed in a multithreaded process (e.g. dlopening a library that requires a particular mte state).

Could be avoided by refusing to dlopen a library that is incompatible with the current process.
dlopen()ing a library that doesn't support tagged addresses, in a process that does use tagged addresses, seems undesirable even if tag checking is currently turned off.

> a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC that means the prctl is for all threads in the process not just for the current one. however the exact semantics is not obvious if there are inconsistent settings in different threads or user code tries to use the prctl concurrently: first checking then setting the mte state via separate prctl calls is racy. but if the userspace contract for enabling mte limits who and when can call the prctl then i think the simple sync flag approach works.
>
> (the sync flag should apply to all prctl settings: tagged addr syscall abi, mte check fault mode, irg tag excludes. ideally it would work for getting the process wide state and it would fail in case of inconsistent settings.)

If going down this route, perhaps we could have sets of settings: for each setting we would have a process-wide value and a per-thread value, with defined rules about how they combine.

Since MTE is a debugging feature, we might be able to be less aggressive about synchronisation than in the SECCOMP case.

> we may need to document some memory ordering details when memory accesses in other threads are affected, but i think that can be something simple that leaves it unspecified what happens with memory accesses that are not synchronized with the prctl call.

Hmmm...

Cheers
---Dave
The 07/28/2020 12:08, Dave Martin wrote:
> On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > The 07/15/2020 18:08, Catalin Marinas wrote:
> > > +The user can select the above modes, per thread, using the ``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where ``flags`` contain one of the following values in the ``PR_MTE_TCF_MASK`` bit-field:
> > > +
> > > +- ``PR_MTE_TCF_NONE`` - *Ignore* tag check faults
> > > +- ``PR_MTE_TCF_SYNC`` - *Synchronous* tag check fault mode
> > > +- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
> > > +
> > > +The current tag check fault mode can be read using the ``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
> >
> > we discussed the need for per process prctl off list, i will try to summarize the requirement here:
> >
> > - it cannot be guaranteed in general that a library initializer or first call into a library happens when the process is still single threaded.
> >
> > - user code currently has no way to call prctl in all threads of a process and even within the c runtime doing so is problematic (it has to signal all threads, which requires a reserved signal and dealing with exiting threads and signal masks, such mechanism can break qemu user and various other userspace tooling).
>
> When working on the SVE support, I came to the conclusion that this kind of thing would normally either be done by the runtime itself, or in close cooperation with the runtime. However, for SVE it never makes sense for one thread to asynchronously change the vector length of another thread -- that's different from the MTE situation.

currently there is a libc mechanism to do some operations in all threads (e.g. for set*id), but this is fragile and not something that can be exposed to user code.
(on the kernel side it should be much simpler to do)

> > - we don't yet have defined contract in userspace about how user code may enable mte (i.e. use the prctl call), but it seems that there will be use cases for it: LD_PRELOADing malloc for heap tagging is one such case, but any library or custom allocator that wants to use mte will have this issue: when it enables mte it wants to enable it for all threads in the process. (or at least all threads managed by the c runtime).
>
> What are the situations where we anticipate a need to twiddle MTE in multiple threads simultaneously, other than during process startup?
>
> > - even if user code is not allowed to call the prctl directly, i.e. the prctl settings are owned by the libc, there will be cases when the settings have to be changed in a multithreaded process (e.g. dlopening a library that requires a particular mte state).
>
> Could be avoided by refusing to dlopen a library that is incompatible with the current process.
>
> dlopen()ing a library that doesn't support tagged addresses, in a process that does use tagged addresses, seems undesirable even if tag checking is currently turned off.

yes, but it can go the other way too: at startup the libc does not enable tag checks for performance reasons, but at dlopen time a library is detected to use mte (e.g. stack tagging or a custom allocator). then the libc or the dlopened library has to ensure that checks are enabled in all threads.

(in the case of stack tagging the libc has to mark existing stacks with PROT_MTE too; there is a mechanism for this in glibc to deal with dlopened libraries that require an executable stack, and it only rejects the dlopen if this cannot be performed.)
another usecase is that the libc is mte-safe (it accepts tagged pointers and memory in its interfaces) but does not enable mte itself (this will be the case with glibc 2.32), and user libraries have to enable mte to use it (a custom allocator or malloc interposition are examples). and i think this is necessary if userspace wants to turn async tag checks into sync tag checks at runtime when a failure is detected.

> > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC that means the prctl is for all threads in the process not just for the current one. however the exact semantics is not obvious if there are inconsistent settings in different threads or user code tries to use the prctl concurrently: first checking then setting the mte state via separate prctl calls is racy. but if the userspace contract for enabling mte limits who and when can call the prctl then i think the simple sync flag approach works.
> >
> > (the sync flag should apply to all prctl settings: tagged addr syscall abi, mte check fault mode, irg tag excludes. ideally it would work for getting the process wide state and it would fail in case of inconsistent settings.)
>
> If going down this route, perhaps we could have sets of settings: so for each setting we have a process-wide value and a per-thread value, with defined rules about how they combine.
>
> Since MTE is a debugging feature, we might be able to be less aggressive about synchronisation than in the SECCOMP case.

separate process-wide and per-thread values work for me, and i expect most uses will be process-wide settings.

i don't think mte is less of a security feature than seccomp.

if linux does not want to add a per-process setting then only libc will be able to opt in to mte, and only very early in the startup process (before executing any user code that may start threads). this is not out of the question, but i think it limits the usage and deployment options.
> > we may need to document some memory ordering details when memory accesses in other threads are affected, but i think that can be something simple that leaves it unspecified what happens with memory accesses that are not synchronized with the prctl call.
>
> Hmmm...

e.g. it may be enough if the spec only works when there is no PROT_MTE memory mapped yet and no tagged addresses are present in the multi-threaded process at the time of the prctl call.
On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 12:08, Dave Martin wrote:
> > On Mon, Jul 27, 2020 at 05:36:35PM +0100, Szabolcs Nagy wrote:
> > > a solution is to introduce a flag like SECCOMP_FILTER_FLAG_TSYNC that means the prctl is for all threads in the process not just for the current one. however the exact semantics is not obvious if there are inconsistent settings in different threads or user code tries to use the prctl concurrently: first checking then setting the mte state via separate prctl calls is racy. but if the userspace contract for enabling mte limits who and when can call the prctl then i think the simple sync flag approach works.
> > >
> > > (the sync flag should apply to all prctl settings: tagged addr syscall abi, mte check fault mode, irg tag excludes. ideally it would work for getting the process wide state and it would fail in case of inconsistent settings.)
> >
> > If going down this route, perhaps we could have sets of settings: so for each setting we have a process-wide value and a per-thread value, with defined rules about how they combine.
> >
> > Since MTE is a debugging feature, we might be able to be less aggressive about synchronisation than in the SECCOMP case.
>
> separate process-wide and per-thread values work for me and i expect most uses will be process-wide settings.

The problem with the thread synchronisation, unlike SECCOMP, is that we need to update the SCTLR_EL1.TCF0 field across all the CPUs that may run threads of the current process. I haven't convinced myself that this is race-free without heavy locking. If we go for some heavy mechanism like stop_machine(), that opens the kernel to DoS attacks from userspace. I am still investigating whether something like membarrier() would be sufficient.

SECCOMP gets away with this as it only needs to set some variable without IPI'ing the other CPUs.
> i don't think mte is less of a security feature than seccomp.

Well, MTE is probabilistic, SECCOMP seems to be more precise ;).

> if linux does not want to add a per process setting then only libc will be able to opt-in to mte and only at very early in the startup process (before executing any user code that may start threads). this is not out of question, but i think it limits the usage and deployment options.

There is also the risk that we try to be too flexible at this stage without a real use-case.
The 07/28/2020 20:59, Catalin Marinas wrote:
> On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > if linux does not want to add a per process setting then only libc will be able to opt-in to mte and only at very early in the startup process (before executing any user code that may start threads). this is not out of question, but i think it limits the usage and deployment options.
>
> There is also the risk that we try to be too flexible at this stage without a real use-case.

i don't know how mte will be turned on in libc.

if we can always turn sync tag checks on early whenever mte is available then i think there is no issue. but if we have to make the decision later for compatibility or performance reasons then a per-thread setting is problematic.

use of the prctl outside of libc is very limited if it's per thread only:

- application code may use it in an (elf specific) pre-initialization function, but that's a bit obscure (not exposed in c), and it is reasonable for an application to enable mte checks after it has registered a signal handler for mte faults (and at that point it may be multi-threaded).

- library code normally initializes per-thread state on the first call into the library from a given thread, but with mte, as soon as memory / pointers are tagged in one thread, all threads are affected: not performing checks in other threads is less secure (may be ok) and it means an incompatible syscall abi (not ok). so at least PR_TAGGED_ADDR_ENABLE should have a process-wide setting for this usage.

but i guess it is fine to design the mechanism for these in a later linux version; until then such usage will be unreliable (it will depend on how early threads are created).
Hi Szabolcs,

On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> The 07/28/2020 20:59, Catalin Marinas wrote:
> > On Tue, Jul 28, 2020 at 03:53:51PM +0100, Szabolcs Nagy wrote:
> > > if linux does not want to add a per process setting then only libc will be able to opt-in to mte and only at very early in the startup process (before executing any user code that may start threads). this is not out of question, but i think it limits the usage and deployment options.
> >
> > There is also the risk that we try to be too flexible at this stage without a real use-case.
>
> i don't know how mte will be turned on in libc.
>
> if we can always turn sync tag checks on early whenever mte is available then i think there is no issue.
>
> but if we have to make the decision later for compatibility or performance reasons then per thread setting is problematic.

At least for libc, I'm not sure how you could even turn MTE on at run-time. The heap allocations would have to be mapped with PROT_MTE as we can't easily change them (well, you could mprotect(), assuming the user doesn't use tagged pointers on them).

There is a case to switch tag checking from asynchronous to synchronous at run-time based on a signal, but that's rather specific to Android where zygote controls the signal handler. I don't think you can do this with libc. Even on Android, since the async fault signal is delivered per thread, it probably does this lazily (alternatively, it could issue a SIGUSRx to the other threads for synchronisation).

> use of the prctl outside of libc is very limited if it's per thread only:

In the non-Android context, I think the prctl() for MTE control should be restricted to the libc. You can control the mode prior to the process being started using environment variables. I really don't see how the libc could handle the changing of the MTE behaviour at run-time without itself handling signals.
> - application code may use it in an (elf specific) pre-initialization function, but that's a bit obscure (not exposed in c), and it is reasonable for an application to enable mte checks after it has registered a signal handler for mte faults (and at that point it may be multi-threaded).

Since the app can install signal handlers, it can also deal with notifying other threads with a SIGUSRx, assuming that it decided this after multiple threads were created. If it does this while single-threaded, subsequent threads would inherit the first one's settings.

The only use-case I see for doing this in the kernel is if the code requiring an MTE behaviour change cannot install signal handlers. More on this below.

> - library code normally initializes per-thread state on the first call into the library from a given thread, but with mte, as soon as memory / pointers are tagged in one thread, all threads are affected: not performing checks in other threads is less secure (may be ok) and it means an incompatible syscall abi (not ok). so at least PR_TAGGED_ADDR_ENABLE should have a process-wide setting for this usage.

My assumption with MTE is that the libc will initialise it when the library is loaded (something __attribute__((constructor))) and it's still in single-threaded mode. Does it wait until the first malloc() call? Also, is there such a thing as a per-thread initialiser for a dynamic library (not sure it can be implemented in practice though)?

The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs to other CPUs to change the hardware state. However, it can still race with thread creation or a prctl() on another thread; not sure what we can define here, especially as it depends on the kernel internals: e.g. thread creation copies some data structures of the calling thread, but at the same time another thread wants to change such structures for all threads of that process. The ordering of events here looks pretty fragile.
Maybe another global status (per process) which takes priority over the per-thread one would be easier. But such priority is not temporal (i.e. whoever called prctl() last) but pretty strict: once a global control was requested, it will remain global no matter what subsequent threads request (or we can do it the other way around).

> but i guess it is fine to design the mechanism for these in a later linux version, until then such usage will be unreliable (will depend on how early threads are created).

Until we have a real use-case, I'd not complicate matters further. For example, I'm still not sure how realistic it is for an application to load a new heap allocator after some threads were created. Even the glibc support, I don't think, needs this.

Could an LD_PRELOADed library be initialised after threads were created (I guess it could if another preloaded library created threads)? Even if it does, do we have an example, or is it rather theoretical?

If this becomes an essential use-case, we can look at adding a new flag for prctl() which would set the option globally, with the caveats mentioned above. It doesn't need to be in the initial ABI (and PR_TAGGED_ADDR_ENABLE is already upstream).

Thanks.
The 08/07/2020 16:19, Catalin Marinas wrote:
> On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote:
> > if we can always turn sync tag checks on early whenever mte is available then i think there is no issue.
> >
> > but if we have to make the decision later for compatibility or performance reasons then per thread setting is problematic.
>
> At least for libc, I'm not sure how you could even turn MTE on at run-time. The heap allocations would have to be mapped with PROT_MTE as we can't easily change them (well, you could mprotect(), assuming the user doesn't use tagged pointers on them).

e.g. dlopen of a library with stack tagging (libc can mark stacks with PROT_MTE at that time), or just turning on sync tag checks later when using heap tagging.

> There is a case to switch tag checking from asynchronous to synchronous at run-time based on a signal but that's rather specific to Android where zygote controls the signal handler. I don't think you can do this with libc. Even on Android, since the async fault signal is delivered per thread, it probably does this lazily (alternatively, it could issue a SIGUSRx to the other threads for synchronisation).

i think what the zygote is doing is a valid use-case, but in a normal linux setup the application owns the signal handlers, so the tag check switch has to be done by the application. the libc can expose some api for it, so in principle it's enough if the libc can do the runtime switch, but we don't plan to add new libc apis for mte.

> > use of the prctl outside of libc is very limited if it's per thread only:
>
> In the non-Android context, I think the prctl() for MTE control should be restricted to the libc. You can control the mode prior to the process being started using environment variables. I really don't see how the libc could handle the changing of the MTE behaviour at run-time without itself handling signals.
> > - application code may use it in a (elf specific) pre-initialization function, but that's a bit obscure (not exposed in c) and it is reasonable for an application to enable mte checks after it registered a signal handler for mte faults. (and at that point it may be multi-threaded).
>
> Since the app can install signal handlers, it can also deal with notifying other threads with a SIGUSRx, assuming that it decided this after multiple threads were created. If it does this while single-threaded, subsequent threads would inherit the first one.

the application does not know what libraries create what threads in the background, and i don't think there is a way to send signals to each thread (e.g. /proc/self/task cannot be read atomically with respect to thread creation/exit). the libc controls thread creation and exit, so it can have a list of threads it can notify, but an application cannot do this. (libc could provide an api so applications can do some per-thread operation, but a libc would not do this happily: currently there are locks around thread creation and exit that are only needed for this "signal all threads" mechanism, which makes it hard to expose to users.)

one way applications sometimes work around this is to re-exec themselves. but that's a big hammer and not entirely reliable (e.g. the exe may not be available on the filesystem any more, or the command-line args may need to be preserved but are clobbered, or some complex application state needs to be recreated, etc.)

> The only use-case I see for doing this in the kernel is if the code requiring an MTE behaviour change cannot install signal handlers. More on this below.
> > > - library code normally initializes per thread state on the first call > > into the library from a given thread, but with mte, as soon as > > memory / pointers are tagged in one thread, all threads are > > affected: not performing checks in other threads is less secure (may > > be ok) and it means incompatible syscall abi (not ok). so at least > > PR_TAGGED_ADDR_ENABLE should have process wide setting for this > > usage. > > My assumption with MTE is that the libc will initialise it when the > library is loaded (something __attribute__((constructor))) and it's > still in single-threaded mode. Does it wait until the first malloc() > call? Also, is there such thing as a per-thread initialiser for a > dynamic library (not sure it can be implemented in practice though)? there is no per thread initializer in an elf module. (tls state is usually initialized lazily in threads when necessary.) malloc calls can happen before the ctors of an LD_PRELOAD library and threads can be created before both. glibc runs ldpreload ctors after other library ctors. custom allocator can be of course dlopened. (i'd expect several language runtimes to have their own allocator and support dlopening the runtime) > > The PR_TAGGED_ADDR_ENABLE synchronisation at least doesn't require IPIs > to other CPUs to change the hardware state. However, it can still race > with thread creation or a prctl() on another thread, not sure what we > can define here, especially as it depends on the kernel internals: e.g. > thread creation copies some data structures of the calling thread but at > the same time another thread wants to change such structures for all > threads of that process. The ordering of events here looks pretty > fragile. > > Maybe with another global status (per process) which takes priority over > the per thread one would be easier. But such priority is not temporal > (i.e. 
whoever called prctl() last) but pretty strict: once a global > control was requested, it will remain global no matter what subsequent > threads request (or we can do it the other way around). i see. > > but i guess it is fine to design the mechanism for these in a later > > linux version, until then such usage will be unreliable (will depend > > on how early threads are created). > > Until we have a real use-case, I'd not complicate the matters further. > For example, I'm still not sure how realistic it is for an application > to load a new heap allocator after some threads were created. Even the > glibc support, I don't think it needs this. > > Could an LD_PRELOADED library be initialised after threads were created > (I guess it could if another preloaded library created threads)? Even if > it does, do we have an example or it's rather theoretical. i believe this happens e.g. in applications built with tsan. (the thread sanitizer creates a background thread early which i think does not call malloc itself but may want to access malloced memory, but i don't have a setup with tsan support to test this) > > If this becomes an essential use-case, we can look at adding a new flag > for prctl() which would set the option globally, with the caveats > mentioned above. It doesn't need to be in the initial ABI (and the > PR_TAGGED_ADDR_ENABLE is already upstream). > > Thanks. > > -- > Catalin
On Mon, Aug 10, 2020 at 03:13:09PM +0100, Szabolcs Nagy wrote: > The 08/07/2020 16:19, Catalin Marinas wrote: > > On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote: > > > if we can always turn sync tag checks on early whenever mte is > > > available then i think there is no issue. > > > > > > but if we have to make the decision later for compatibility or > > > performance reasons then per thread setting is problematic. > > > > At least for libc, I'm not sure how you could even turn MTE on at > > run-time. The heap allocations would have to be mapped with PROT_MTE as > > we can't easily change them (well, you could mprotect(), assuming the > > user doesn't use tagged pointers on them). > > e.g. dlopen of library with stack tagging. (libc can mark stacks with > PROT_MTE at that time) If we allow such mixed object support with stack tagging enabled at dlopen, PROT_MTE would need to be turned on for each thread stack. This wouldn't require synchronisation, only knowing where the thread stacks are, but you'd need to make sure threads don't call into the new library until the stacks have been mprotect'ed. Doing this midway through a function execution may corrupt the tags. So I'm not sure how safe any of this is without explicit user synchronisation (i.e. don't call into the library until all threads have been updated). Even changing options like GCR_EL1.Excl across multiple threads may have unwanted effects. See this comment from Peter, the difference being that instead of an explicit prctl() call on the current stack, another thread would do it: https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/ > or just turn on sync tag checks later when using heap tagging. I wonder whether setting the synchronous tag check mode by default would improve this aspect. This would not have any effect until PROT_MTE is used. 
If software wants some better performance they can explicitly opt in to asynchronous mode or disable tag checking after some SIGSEGV + reporting (this shouldn't exclude the environment variables you currently use for controlling the tag check mode). Also, if there are saner defaults for the user GCR_EL1.Excl (currently all masked), we should decide them now. If stack tagging will come with some ELF information, we could make the default tag checking and GCR_EL1.Excl choices based on that, otherwise maybe we should revisit the default configuration the kernel sets for the user in the absence of any other information. > > There is a case to switch tag checking from asynchronous to synchronous > > at run-time based on a signal but that's rather specific to Android > > where zygote controls the signal handler. I don't think you can do this > > with libc. Even on Android, since the async fault signal is delivered > > per thread, it probably does this lazily (alternatively, it could issue > > a SIGUSRx to the other threads for synchronisation). > > i think what that zygote is doing is a valid use-case but > in a normal linux setup the application owns the signal > handlers so the tag check switch has to be done by the > application. the libc can expose some api for it, so in > principle it's enough if the libc can do the runtime > switch, but we dont plan to add new libc apis for mte. Due to the synchronisation aspect especially regarding the stack tagging, I'm not sure the kernel alone can safely do this. Changing the tagged address syscall ABI across multiple threads should be safer (well, at least the relaxing part). But if we don't solve the other aspects I mentioned above, I don't think there is much point in only doing it for this. 
> > > - library code normally initializes per thread state on the first call > > > into the library from a given thread, but with mte, as soon as > > > memory / pointers are tagged in one thread, all threads are > > > affected: not performing checks in other threads is less secure (may > > > be ok) and it means incompatible syscall abi (not ok). so at least > > > PR_TAGGED_ADDR_ENABLE should have process wide setting for this > > > usage. > > > > My assumption with MTE is that the libc will initialise it when the > > library is loaded (something __attribute__((constructor))) and it's > > still in single-threaded mode. Does it wait until the first malloc() > > call? Also, is there such thing as a per-thread initialiser for a > > dynamic library (not sure it can be implemented in practice though)? > > there is no per thread initializer in an elf module. > (tls state is usually initialized lazily in threads > when necessary.) > > malloc calls can happen before the ctors of an LD_PRELOAD > library and threads can be created before both. > glibc runs ldpreload ctors after other library ctors. In the presence of stack tagging, I think any subsequent MTE config change across all threads is unsafe, irrespective of whether it's done by the kernel or user via SIGUSRx. I think the best we can do here is start with more appropriate defaults or enable them based on an ELF note before the application is started. The dynamic loader would not have to do anything extra here. If we ignore stack tagging, the global configuration change may be achievable. I think for the MTE bits, this could be done lazily by the libc (e.g. on malloc()/free() call). The tag checking won't happen before such calls unless we change the kernel defaults. There is still the tagged address ABI enabling, could this be done lazily on syscall by the libc? If not, the kernel could synchronise (force) this on syscall entry from each thread based on some global prctl() bit.
The 08/11/2020 18:20, Catalin Marinas wrote: > On Mon, Aug 10, 2020 at 03:13:09PM +0100, Szabolcs Nagy wrote: > > The 08/07/2020 16:19, Catalin Marinas wrote: > > > On Mon, Aug 03, 2020 at 01:43:10PM +0100, Szabolcs Nagy wrote: > > > > if we can always turn sync tag checks on early whenever mte is > > > > available then i think there is no issue. > > > > > > > > but if we have to make the decision later for compatibility or > > > > performance reasons then per thread setting is problematic. > > > > > > At least for libc, I'm not sure how you could even turn MTE on at > > > run-time. The heap allocations would have to be mapped with PROT_MTE as > > > we can't easily change them (well, you could mprotect(), assuming the > > > user doesn't use tagged pointers on them). > > > > e.g. dlopen of library with stack tagging. (libc can mark stacks with > > PROT_MTE at that time) > > If we allow such mixed object support with stack tagging enabled at > dlopen, PROT_MTE would need to be turned on for each thread stack. This > wouldn't require synchronisation, only knowing where the thread stacks > are, but you'd need to make sure threads don't call into the new library > until the stacks have been mprotect'ed. Doing this midway through a > function execution may corrupt the tags. > > So I'm not sure how safe any of this is without explicit user > synchronisation (i.e. don't call into the library until all threads have > been updated). Even changing options like GCR_EL1.Excl across multiple > threads may have unwanted effects. 
See this comment from Peter, the > difference being that instead of an explicit prctl() call on the current > stack, another thread would do it: > > https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/ there is no midway problem: the libc (ld.so) would do the PROT_MTE at dlopen time based on some elf marking (which can be handled before relocation processing, so before library code can run, the midway problem happens when a library, e.g libc, wants to turn on stack tagging on itself). the libc already does this when a library is loaded that requires executable stack (it marks stacks as PROT_EXEC at dlopen time or fails the dlopen if that is not possible, this does not require running code in other threads, only synchronization with thread creation and exit. but changing the check mode for mte needs per thread code execution.). i'm not entirely sure if this is a good idea, but i expect stack tagging not to be used in the libc (because libc needs to run on all hw and we don't yet have a backward compatible stack tagging solution), so stack tagging should work when only some elf modules in a process are built with it, which implies that enabling it at dlopen time should work otherwise it will not be very useful. > > or just turn on sync tag checks later when using heap tagging. > > I wonder whether setting the synchronous tag check mode by default would > improve this aspect. This would not have any effect until PROT_MTE is > used. If software wants some better performance they can explicitly opt > in to asynchronous mode or disable tag checking after some SIGSEGV + > reporting (this shouldn't exclude the environment variables you > currently use for controlling the tag check mode). > > Also, if there are saner defaults for the user GCR_EL1.Excl (currently > all masked), we should decide them now. 
> > If stack tagging will come with some ELF information, we could make the > default tag checking and GCR_EL1.Excl choices based on that, otherwise > maybe we should revisit the default configuration the kernel sets for > the user in the absence of any other information. do tag checks have overhead if PROT_MTE is not used? i'd expect some checks are still done at memory access. (and the tagged address syscall abi has to be in use.) turning sync tag checks on early would enable the most of the interesting usecases (only PROT_MTE has to be handled at runtime not the prctls. however i don't yet know how userspace will deal with compat issues, i.e. it may not be valid to unconditionally turn tag checks on early). > > > > - library code normally initializes per thread state on the first call > > > > into the library from a given thread, but with mte, as soon as > > > > memory / pointers are tagged in one thread, all threads are > > > > affected: not performing checks in other threads is less secure (may > > > > be ok) and it means incompatible syscall abi (not ok). so at least > > > > PR_TAGGED_ADDR_ENABLE should have process wide setting for this > > > > usage. > > > > > > My assumption with MTE is that the libc will initialise it when the > > > library is loaded (something __attribute__((constructor))) and it's > > > still in single-threaded mode. Does it wait until the first malloc() > > > call? Also, is there such thing as a per-thread initialiser for a > > > dynamic library (not sure it can be implemented in practice though)? > > > > there is no per thread initializer in an elf module. > > (tls state is usually initialized lazily in threads > > when necessary.) > > > > malloc calls can happen before the ctors of an LD_PRELOAD > > library and threads can be created before both. > > glibc runs ldpreload ctors after other library ctors. 
> > In the presence of stack tagging, I think any subsequent MTE config > change across all threads is unsafe, irrespective of whether it's done > by the kernel or user via SIGUSRx. I think the best we can do here is > start with more appropriate defaults or enable them based on an ELF note > before the application is started. The dynamic loader would not have to > do anything extra here. > > If we ignore stack tagging, the global configuration change may be > achievable. I think for the MTE bits, this could be done lazily by the > libc (e.g. on malloc()/free() call). The tag checking won't happen > before such calls unless we change the kernel defaults. There is still > the tagged address ABI enabling, could this be done lazily on syscall by > the libc? If not, the kernel could synchronise (force) this on syscall > entry from each thread based on some global prctl() bit. i think the interesting use-cases are all about changing mte settings before mte is in use in any way but after there are multiple threads. (the async -> sync mode change on tag faults is i think less interesting to the gnu linux world.) i guess lazy syscall abi switch works, but it is ugly: raw syscall usage will be problematic and doing checks before calling into the vdso might have unwanted overhead. based on the discussion it seems we should design the userspace abis so that per process prctl is not required and then see how far we get.
On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote: > On 08/11/2020 18:20, Catalin Marinas wrote: > > If we allow such mixed object support with stack tagging enabled at > > dlopen, PROT_MTE would need to be turned on for each thread stack. This > > wouldn't require synchronisation, only knowing where the thread stacks > > are, but you'd need to make sure threads don't call into the new library > > until the stacks have been mprotect'ed. Doing this midway through a > > function execution may corrupt the tags. > > > > So I'm not sure how safe any of this is without explicit user > > synchronisation (i.e. don't call into the library until all threads have > > been updated). Even changing options like GCR_EL1.Excl across multiple > > threads may have unwanted effects. See this comment from Peter, the > > difference being that instead of an explicit prctl() call on the current > > stack, another thread would do it: > > > > https://lore.kernel.org/linux-arch/CAMn1gO5rhOG1W+nVe103v=smvARcFFp_Ct9XqH2Ca4BUMfpDdg@mail.gmail.com/ > > there is no midway problem: the libc (ld.so) would do the PROT_MTE at > dlopen time based on some elf marking (which can be handled before > relocation processing, so before library code can run, the midway > problem happens when a library, e.g libc, wants to turn on stack > tagging on itself). OK, that makes sense, you can't call into the new object until the relocations have been resolved. > the libc already does this when a library is loaded that requires > executable stack (it marks stacks as PROT_EXEC at dlopen time or fails > the dlopen if that is not possible, this does not require running code > in other threads, only synchronization with thread creation and exit. > but changing the check mode for mte needs per thread code execution.). 
> > i'm not entirely sure if this is a good idea, but i expect stack > tagging not to be used in the libc (because libc needs to run on all > hw and we don't yet have a backward compatible stack tagging > solution), In theory, you could have two libc deployed in your distro and ldd gets smarter to pick the right one. I still hope we'd find a compromise with stack tagging and single binary. > so stack tagging should work when only some elf modules in a process > are built with it, which implies that enabling it at dlopen time > should work otherwise it will not be very useful. There is still the small risk of an old object using tagged pointers to the stack. Since the stack would be shared between such objects, turning PROT_MTE on would cause issues. Hopefully such problems are minor and not really a concern for the kernel. > do tag checks have overhead if PROT_MTE is not used? i'd expect some > checks are still done at memory access. (and the tagged address > syscall abi has to be in use.) My understanding from talking to hardware engineers is that there won't be an overhead if PROT_MTE is not used, no tags being fetched or checked. But I can't guarantee until we get real silicon. > turning sync tag checks on early would enable the most of the > interesting usecases (only PROT_MTE has to be handled at runtime not > the prctls. however i don't yet know how userspace will deal with > compat issues, i.e. it may not be valid to unconditionally turn tag > checks on early). If we change the defaults so that no prctl() is required for the standard use-case, it would solve most of the common deployment issues: 1. Tagged address ABI default on when HWCAP2_MTE is present 2. Synchronous TCF by default 3. GCR_EL1.Excl allows all tags except 0 by default Any other configuration diverging from the above is considered specialist deployment and will have to issue the prctl() on a per-thread basis. 
Compat issues in user-space will be dealt with via environment variables but pretty much on/off rather than fine-grained tag checking mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0. > > In the presence of stack tagging, I think any subsequent MTE config > > change across all threads is unsafe, irrespective of whether it's done > > by the kernel or user via SIGUSRx. I think the best we can do here is > > start with more appropriate defaults or enable them based on an ELF note > > before the application is started. The dynamic loader would not have to > > do anything extra here. > > > > If we ignore stack tagging, the global configuration change may be > > achievable. I think for the MTE bits, this could be done lazily by the > > libc (e.g. on malloc()/free() call). The tag checking won't happen > > before such calls unless we change the kernel defaults. There is still > > the tagged address ABI enabling, could this be done lazily on syscall by > > the libc? If not, the kernel could synchronise (force) this on syscall > > entry from each thread based on some global prctl() bit. > > i think the interesting use-cases are all about changing mte settings > before mte is in use in any way but after there are multiple threads. > (the async -> sync mode change on tag faults is i think less > interesting to the gnu linux world.) So let's consider async/sync/no-check specialist uses and glibc would not have to handle them. I don't think async mode is useful on its own unless you have a way to turn on sync mode at run-time for more precise error identification (well, hoping that it will happen again). > i guess lazy syscall abi switch works, but it is ugly: raw syscall > usage will be problematic and doing checks before calling into the > vdso might have unwanted overhead. 
This lazy ABI switch could be handled by the kernel, though I wonder whether we should just relax it permanently when HWCAP2_MTE is present.
adding libc-alpha on cc, they might have additional input about compat issue handling: the discussion is about enabling tag checks, which cannot be done reasonably at runtime when a process is already multi-threaded so it has to be decided early either in the libc or in the kernel. such early decision might have backward compat issues (we don't yet know if any allocator wants to use tagging opportunistically later for debugging or if there will be code loaded later that is incompatible with tag checks). The 08/19/2020 10:54, Catalin Marinas wrote: > On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote: > > On 08/11/2020 18:20, Catalin Marinas wrote: > > > If we allow such mixed object support with stack tagging enabled at > > > dlopen, PROT_MTE would need to be turned on for each thread stack. This > > > wouldn't require synchronisation, only knowing where the thread stacks > > > are, but you'd need to make sure threads don't call into the new library > > > until the stacks have been mprotect'ed. Doing this midway through a > > > function execution may corrupt the tags. ... > > there is no midway problem: the libc (ld.so) would do the PROT_MTE at > > dlopen time based on some elf marking (which can be handled before > > relocation processing, so before library code can run, the midway > > problem happens when a library, e.g libc, wants to turn on stack > > tagging on itself). > > OK, that makes sense, you can't call into the new object until the > relocations have been resolved. > > > the libc already does this when a library is loaded that requires > > executable stack (it marks stacks as PROT_EXEC at dlopen time or fails > > the dlopen if that is not possible, this does not require running code > > in other threads, only synchronization with thread creation and exit. > > but changing the check mode for mte needs per thread code execution.).
> > > > i'm not entirely sure if this is a good idea, but i expect stack > > tagging not to be used in the libc (because libc needs to run on all > > hw and we don't yet have a backward compatible stack tagging > > solution), > > In theory, you could have two libc deployed in your distro and ldd gets > smarter to pick the right one. I still hope we'd find a compromise with > stack tagging and single binary. distros don't like the two libc solution, i think we will look at backward compat stack tagging support, but we might end up not having any stack tagging in libc, only in user binaries (used for debugging). > > so stack tagging should work when only some elf modules in a process > > are built with it, which implies that enabling it at dlopen time > > should work otherwise it will not be very useful. > > There is still the small risk of an old object using tagged pointers to > the stack. Since the stack would be shared between such objects, turning > PROT_MTE on would cause issues. Hopefully such problems are minor and > not really a concern for the kernel. > > > do tag checks have overhead if PROT_MTE is not used? i'd expect some > > checks are still done at memory access. (and the tagged address > > syscall abi has to be in use.) > > My understanding from talking to hardware engineers is that there won't > be an overhead if PROT_MTE is not used, no tags being fetched or > checked. But I can't guarantee until we get real silicon. > > > turning sync tag checks on early would enable the most of the > > interesting usecases (only PROT_MTE has to be handled at runtime not > > the prctls. however i don't yet know how userspace will deal with > > compat issues, i.e. it may not be valid to unconditionally turn tag > > checks on early). > > If we change the defaults so that no prctl() is required for the > standard use-case, it would solve most of the common deployment issues: > > 1. Tagged address ABI default on when HWCAP2_MTE is present > 2. 
Synchronous TCF by default > 3. GCR_EL1.Excl allows all tags except 0 by default > > Any other configuration diverging from the above is considered > specialist deployment and will have to issue the prctl() on a per-thread > basis. > > Compat issues in user-space will be dealt with via environment > variables but pretty much on/off rather than fine-grained tag checking > mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is > using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0. enabling mte checks by default would be nice and simple (a libc can support tagging allocators without any change assuming its code is mte safe which is true e.g. for the latest glibc release and for musl libc). the compat issue with this is existing code using pointer top bits which i assume faults when dereferenced with the mte checks enabled. (although this should be very rare since top byte ignore on deref is aarch64 specific.) i see two options: - don't care about top bit compat issues: change the default in the kernel as you described (so checks are enabled and users only need PROT_MTE mapping if they want to use tagging). - care about top bit issues: leave the kernel abi as in the patch set and do the mte setup early in the libc. add elf markings to new binaries that they are mte compatible and libc can use that marking for the mte setup. dlopening incompatible libraries will fail. the issue with this is that we have no idea how to add the marking and the marking prevents mte use with existing binaries (and e.g. ldpreload malloc would require an updated libc). for me it's hard to figure out which is the right direction for mte. > > > In the presence of stack tagging, I think any subsequent MTE config > > > change across all threads is unsafe, irrespective of whether it's done > > > by the kernel or user via SIGUSRx.
I think the best we can do here is > > > start with more appropriate defaults or enable them based on an ELF note > > > before the application is started. The dynamic loader would not have to > > > do anything extra here. > > > > > > If we ignore stack tagging, the global configuration change may be > > > achievable. I think for the MTE bits, this could be done lazily by the > > > libc (e.g. on malloc()/free() call). The tag checking won't happen > > > before such calls unless we change the kernel defaults. There is still > > > the tagged address ABI enabling, could this be done lazily on syscall by > > > the libc? If not, the kernel could synchronise (force) this on syscall > > > entry from each thread based on some global prctl() bit. > > > > i think the interesting use-cases are all about changing mte settings > > before mte is in use in any way but after there are multiple threads. > > (the async -> sync mode change on tag faults is i think less > > interesting to the gnu linux world.) > > So let's consider async/sync/no-check specialist uses and glibc would > not have to handle them. I don't think async mode is useful on its own > unless you have a way to turn on sync mode at run-time for more precise > error identification (well, hoping that it will happen again). > > > i guess lazy syscall abi switch works, but it is ugly: raw syscall > > usage will be problematic and doing checks before calling into the > > vdso might have unwanted overhead. > > This lazy ABI switch could be handled by the kernel, though I wonder > whether we should just relax it permanently when HWCAP2_MTE is present. yeah i don't immediately see a problem with that. but ideally there would be an escape hatch (a way to opt out from the change). thanks
On 8/20/20 9:43 AM, Szabolcs Nagy wrote: > the compat issue with this is existing code > using pointer top bits which i assume faults > when dereferenced with the mte checks enabled. > (although this should be very rare since > top byte ignore on deref is aarch64 specific.) Does anyone know of significant aarch64-specific application code that depends on top byte ignore? I would think it's so rare (nonexistent?) as to not be worth worrying about. Even in the bad old days when Emacs used pointer top bits for typechecking, it carefully removed those bits before dereferencing. Any other reasonably-portable application would have to do the same of course. This whole thing reminds me of the ancient IBM S/360 mainframes that were documented to ignore the top 8 bits of 32-bit addresses merely because a single model (the IBM 360/30, circa 1965) was so underpowered that it couldn't quickly check that the top bits were zero. This has caused countless software hassles over the years. Even today, the IBM z-Series hardware and software still supports 24-bit addressing mode because of that early-1960s design mistake. See: Mashey JR. The Long Road to 64 Bits. ACM Queue. 2006-10-10. https://queue.acm.org/detail.cfm?id=1165766
On Thu, Aug 20, 2020 at 05:43:15PM +0100, Szabolcs Nagy wrote: > The 08/19/2020 10:54, Catalin Marinas wrote: > > On Wed, Aug 12, 2020 at 01:45:21PM +0100, Szabolcs Nagy wrote: > > > On 08/11/2020 18:20, Catalin Marinas wrote: > > > turning sync tag checks on early would enable the most of the > > > interesting usecases (only PROT_MTE has to be handled at runtime not > > > the prctls. however i don't yet know how userspace will deal with > > > compat issues, i.e. it may not be valid to unconditionally turn tag > > > checks on early). > > > > If we change the defaults so that no prctl() is required for the > > standard use-case, it would solve most of the common deployment issues: > > > > 1. Tagged address ABI default on when HWCAP2_MTE is present > > 2. Synchronous TCF by default > > 3. GCR_EL1.Excl allows all tags except 0 by default > > > > Any other configuration diverging from the above is considered > > specialist deployment and will have to issue the prctl() on a per-thread > > basis. > > > > Compat issues in user-space will be dealt with via environment > > variables but pretty much on/off rather than fine-grained tag checking > > mode. So for glibc, you'd have only _MTAG=0 or 1 and the only effect is > > using PROT_MTE + tagged pointers or no-PROT_MTE + tag 0. > > enabling mte checks by default would be nice and simple (a libc can > support tagging allocators without any change assuming its code is mte > safe which is true e.g. for the latest glibc release and for musl > libc). While talking to the Android folk, it occurred to me that the default tag checking mode doesn't even need to be decided by the kernel. The dynamic loader can set the desired tag check mode and the tagged address ABI based on environment variables (_MTAG_ENABLE=x) and do a prctl() before any threads have been created. Subsequent malloc() calls or dlopen() can mmap/mprotect different memory regions to PROT_MTE and all threads will be affected equally. 
The only configuration a heap allocator may want to change is the tag exclude mask (GCR_EL1.Excl), but even this can, by convention, be configured by the dynamic loader.

> the compat issue with this is existing code using pointer top bits
> which i assume faults when dereferenced with the mte checks enabled.
> (although this should be very rare since top byte ignore on deref is
> aarch64 specific.)

They'd fault only if they dereference PROT_MTE memory and the tag check mode is async or sync.

> i see two options:
>
> - don't care about top bit compat issues:
>   change the default in the kernel as you described (so checks are
>   enabled and users only need PROT_MTE mapping if they want to use
>   taggging).

As I said above (and as suggested by the Google folk), this default choice can be left to the dynamic loader before any threads are started.

> - care about top bit issues:
>   leave the kernel abi as in the patch set and do the mte setup early
>   in the libc. add elf markings to new binaries that they are mte
>   compatible and libc can use that marking for the mte setup.
>   dlopening incompatible libraries will fail. the issue with this is
>   that we have no idea how to add the marking and the marking prevents
>   mte use with existing binaries (and eg. ldpreload malloc would
>   require an updated libc).

Maybe a third option (which leaves the kernel ABI as is): if the ELF markings only control the PROT_MTE regions (stack or heap), we can configure the tag checking mode and tagged address ABI early through environment variables (_MTAG_ENABLE). If you have a problematic binary, just set _MTAG_ENABLE=0 and a dlopen(), even if loading an MTE-capable object, would not map the stack with PROT_MTE. Heap allocators could also ignore _MTAG_ENABLE since PROT_MTE doesn't have an effect if no tag checking is in place. This way we can probably mix objects as long as we have such a control.
So, in summary, I think we can get away with only issuing the prctl() in the dynamic loader before any threads start and using PROT_MTE later at run-time, multi-threaded, as needed by malloc(), dlopen etc.
On Thu, Aug 20, 2020 at 10:27:43AM -0700, Paul Eggert wrote:
> On 8/20/20 9:43 AM, Szabolcs Nagy wrote:
> > the compat issue with this is existing code
> > using pointer top bits which i assume faults
> > when dereferenced with the mte checks enabled.
> > (although this should be very rare since
> > top byte ignore on deref is aarch64 specific.)
>
> Does anyone know of significant aarch64-specific application code that
> depends on top byte ignore? I would think it's so rare (nonexistent?) as to
> not be worth worrying about.

Apart from the LLVM hwasan feature, I'm not aware of code relying on the top byte ignore. There were discussions in the past to use it with some JITs but I'm not sure they ever materialised. I think the Mozilla JS engine uses (used?) additional bits on top of a pointer but they are masked out before the access.

> Even in the bad old days when Emacs used pointer top bits for typechecking,
> it carefully removed those bits before dereferencing. Any other
> reasonably-portable application would have to do the same of course.

I agree.
diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
index 314fa5bc2655..27d8559d565b 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -174,6 +174,8 @@ infrastructure:
      +------------------------------+---------+---------+
      | Name                         |  bits   | visible |
      +------------------------------+---------+---------+
+     | MTE                          | [11-8]  |    y    |
+     +------------------------------+---------+---------+
      | SSBS                         | [7-4]   |    y    |
      +------------------------------+---------+---------+
      | BT                           | [3-0]   |    y    |
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index 84a9fd2d41b4..bbd9cf54db6c 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -240,6 +240,10 @@ HWCAP2_BTI
 
     Functionality implied by ID_AA64PFR0_EL1.BT == 0b0001.
 
+HWCAP2_MTE
+
+    Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
+    by Documentation/arm64/memory-tagging-extension.rst.
+
 4. Unused AT_HWCAP bits
 -----------------------
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 09cbb4ed2237..4cd0e696f064 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -14,6 +14,7 @@ ARM64 Architecture
    hugetlbpage
    legacy_instructions
    memory
+   memory-tagging-extension
    pointer-authentication
    silicon-errata
    sve
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
new file mode 100644
index 000000000000..e3709b536b89
--- /dev/null
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -0,0 +1,305 @@
+===============================================
+Memory Tagging Extension (MTE) in AArch64 Linux
+===============================================
+
+Authors: Vincenzo Frascino <vincenzo.frascino@arm.com>
+         Catalin Marinas <catalin.marinas@arm.com>
+
+Date: 2020-02-25
+
+This document describes the provision of the Memory Tagging Extension
+functionality in AArch64 Linux.
+
+Introduction
+============
+
+ARMv8.5 based processors introduce the Memory Tagging Extension (MTE)
+feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI
+(Top Byte Ignore) feature and allows software to access a 4-bit
+allocation tag for each 16-byte granule in the physical address space.
+Such a memory range must be mapped with the Normal-Tagged memory
+attribute. A logical tag is derived from bits 59-56 of the virtual
+address used for the memory access. A CPU with MTE enabled will compare
+the logical tag against the allocation tag and potentially raise an
+exception on mismatch, subject to system register configuration.
+
+Userspace Support
+=================
+
+When ``CONFIG_ARM64_MTE`` is selected and the Memory Tagging Extension
+is supported by the hardware, the kernel advertises the feature to
+userspace via ``HWCAP2_MTE``.
+
+PROT_MTE
+--------
+
+To access the allocation tags, a user process must enable the Tagged
+memory attribute on an address range using a new ``prot`` flag for
+``mmap()`` and ``mprotect()``:
+
+``PROT_MTE`` - Pages allow access to the MTE allocation tags.
+
+The allocation tag is set to 0 when such pages are first mapped in the
+user address space and preserved on copy-on-write. ``MAP_SHARED`` is
+supported and the allocation tags can be shared between processes.
+
+**Note**: ``PROT_MTE`` is only supported on ``MAP_ANONYMOUS`` and
+RAM-based file mappings (``tmpfs``, ``memfd``). Passing it to other
+types of mapping will result in ``-EINVAL`` returned by these system
+calls.
+
+**Note**: The ``PROT_MTE`` flag (and corresponding memory type) cannot
+be cleared by ``mprotect()``.
+
+**Note**: ``madvise()`` memory ranges with ``MADV_DONTNEED`` and
+``MADV_FREE`` may have the allocation tags cleared (set to 0) at any
+point after the system call.
+
+Tag Check Faults
+----------------
+
+When ``PROT_MTE`` is enabled on an address range and a mismatch between
+the logical and allocation tags occurs on access, there are three
+configurable behaviours:
+
+- *Ignore* - This is the default mode. The CPU (and kernel) ignores the
+  tag check fault.
+
+- *Synchronous* - The kernel raises a ``SIGSEGV`` synchronously, with
+  ``.si_code = SEGV_MTESERR`` and ``.si_addr = <fault-address>``. The
+  memory access is not performed. If ``SIGSEGV`` is ignored or blocked
+  by the offending thread, the containing process is terminated with a
+  ``coredump``.
+
+- *Asynchronous* - The kernel raises a ``SIGSEGV``, in the offending
+  thread, asynchronously following one or multiple tag check faults,
+  with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
+  address is unknown).
+
+The user can select the above modes, per thread, using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contains one of the following values in the
+``PR_MTE_TCF_MASK`` bit-field:
+
+- ``PR_MTE_TCF_NONE``  - *Ignore* tag check faults
+- ``PR_MTE_TCF_SYNC``  - *Synchronous* tag check fault mode
+- ``PR_MTE_TCF_ASYNC`` - *Asynchronous* tag check fault mode
+
+The current tag check fault mode can be read using the
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
+
+Tag checking can also be disabled for a user thread by setting the
+``PSTATE.TCO`` bit with ``MSR TCO, #1``.
+
+**Note**: Signal handlers are always invoked with ``PSTATE.TCO = 0``,
+irrespective of the interrupted context. ``PSTATE.TCO`` is restored on
+``sigreturn()``.
+
+**Note**: There are no *match-all* logical tags available for user
+applications.
+
+**Note**: Kernel accesses to the user address space (e.g. ``read()``
+system call) are not checked if the user thread tag checking mode is
+``PR_MTE_TCF_NONE`` or ``PR_MTE_TCF_ASYNC``. If the tag checking mode is
+``PR_MTE_TCF_SYNC``, the kernel makes a best effort to check its user
+address accesses, however it cannot always guarantee it.
+
+Excluding Tags in the ``IRG``, ``ADDG`` and ``SUBG`` instructions
+-----------------------------------------------------------------
+
+The architecture allows certain tags to be excluded from random
+generation via the ``GCR_EL1.Exclude`` register bit-field. By default,
+Linux excludes all tags other than 0. A user thread can enable specific
+tags in the randomly generated set using the
+``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where
+``flags`` contains the tags bitmap in the ``PR_MTE_TAG_MASK``
+bit-field.
+
+**Note**: The hardware uses an exclude mask but the ``prctl()``
+interface provides an include mask. An include mask of ``0`` (exclusion
+mask ``0xffff``) results in the CPU always generating tag ``0``.
+
+Initial process state
+---------------------
+
+On ``execve()``, the new process has the following configuration:
+
+- ``PR_TAGGED_ADDR_ENABLE`` set to 0 (disabled)
+- Tag checking mode set to ``PR_MTE_TCF_NONE``
+- ``PR_MTE_TAG_MASK`` set to 0 (all tags excluded)
+- ``PSTATE.TCO`` set to 0
+- ``PROT_MTE`` not set on any of the initial memory maps
+
+On ``fork()``, the new process inherits the parent's configuration and
+memory map attributes with the exception of the ``madvise()`` ranges
+with ``MADV_WIPEONFORK`` which will have the data and tags cleared (set
+to 0).
+
+The ``ptrace()`` interface
+--------------------------
+
+``PTRACE_PEEKMTETAGS`` and ``PTRACE_POKEMTETAGS`` allow a tracer to read
+the tags from or set the tags to a tracee's address space. The
+``ptrace()`` system call is invoked as ``ptrace(request, pid, addr,
+data)`` where:
+
+- ``request`` - one of ``PTRACE_PEEKMTETAGS`` or ``PTRACE_POKEMTETAGS``.
+- ``pid`` - the tracee's PID.
+- ``addr`` - address in the tracee's address space.
+- ``data`` - pointer to a ``struct iovec`` where ``iov_base`` points to
+  a buffer of ``iov_len`` length in the tracer's address space.
+
+The tags in the tracer's ``iov_base`` buffer are represented as one
+4-bit tag per byte and correspond to a 16-byte MTE tag granule in the
+tracee's address space.
+
+**Note**: If ``addr`` is not aligned to a 16-byte granule, the kernel
+will use the corresponding aligned address.
+
+``ptrace()`` return value:
+
+- 0 - tags were copied, the tracer's ``iov_len`` was updated to the
+  number of tags transferred. This may be smaller than the requested
+  ``iov_len`` if the requested address range in the tracee's or the
+  tracer's space cannot be accessed or does not have valid tags.
+- ``-EPERM`` - the specified process cannot be traced.
+- ``-EIO`` - the tracee's address range cannot be accessed (e.g. invalid
+  address) and no tags copied. ``iov_len`` not updated.
+- ``-EFAULT`` - fault on accessing the tracer's memory (``struct iovec``
+  or ``iov_base`` buffer) and no tags copied. ``iov_len`` not updated.
+- ``-EOPNOTSUPP`` - the tracee's address does not have valid tags (never
+  mapped with the ``PROT_MTE`` flag). ``iov_len`` not updated.
+
+**Note**: There are no transient errors for the requests above, so user
+programs should not retry in case of a non-zero system call return.
+
+``PTRACE_GETREGSET`` and ``PTRACE_SETREGSET`` with ``addr ==
+NT_ARM_TAGGED_ADDR_CTRL`` allow ``ptrace()`` access to the tagged
+address ABI control and MTE configuration of a process as per the
+``prctl()`` options described in
+Documentation/arm64/tagged-address-abi.rst and above. The corresponding
+``regset`` is 1 element of 8 bytes (``sizeof(long)``).
+
+Example of correct usage
+========================
+
+*MTE Example code*
+
+.. code-block:: c
+
+  /*
+   * To be compiled with -march=armv8.5-a+memtag
+   */
+  #include <errno.h>
+  #include <stdint.h>
+  #include <stdio.h>
+  #include <stdlib.h>
+  #include <unistd.h>
+  #include <sys/auxv.h>
+  #include <sys/mman.h>
+  #include <sys/prctl.h>
+
+  /*
+   * From arch/arm64/include/uapi/asm/hwcap.h
+   */
+  #define HWCAP2_MTE              (1 << 18)
+
+  /*
+   * From arch/arm64/include/uapi/asm/mman.h
+   */
+  #define PROT_MTE                0x20
+
+  /*
+   * From include/uapi/linux/prctl.h
+   */
+  #define PR_SET_TAGGED_ADDR_CTRL 55
+  #define PR_GET_TAGGED_ADDR_CTRL 56
+  # define PR_TAGGED_ADDR_ENABLE  (1UL << 0)
+  # define PR_MTE_TCF_SHIFT       1
+  # define PR_MTE_TCF_NONE        (0UL << PR_MTE_TCF_SHIFT)
+  # define PR_MTE_TCF_SYNC        (1UL << PR_MTE_TCF_SHIFT)
+  # define PR_MTE_TCF_ASYNC       (2UL << PR_MTE_TCF_SHIFT)
+  # define PR_MTE_TCF_MASK        (3UL << PR_MTE_TCF_SHIFT)
+  # define PR_MTE_TAG_SHIFT       3
+  # define PR_MTE_TAG_MASK        (0xffffUL << PR_MTE_TAG_SHIFT)
+
+  /*
+   * Insert a random logical tag into the given pointer.
+   */
+  #define insert_random_tag(ptr) ({                       \
+          uint64_t __val;                                 \
+          asm("irg %0, %1" : "=r" (__val) : "r" (ptr));   \
+          __val;                                          \
+  })
+
+  /*
+   * Set the allocation tag on the destination address.
+   */
+  #define set_tag(tagged_addr) do {                                      \
+          asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
+  } while (0)
+
+  int main()
+  {
+          unsigned char *a;
+          unsigned long page_sz = sysconf(_SC_PAGESIZE);
+          unsigned long hwcap2 = getauxval(AT_HWCAP2);
+
+          /* check if MTE is present */
+          if (!(hwcap2 & HWCAP2_MTE))
+                  return EXIT_FAILURE;
+
+          /*
+           * Enable the tagged address ABI, synchronous MTE tag check
+           * faults and allow all non-zero tags in the randomly
+           * generated set.
+           */
+          if (prctl(PR_SET_TAGGED_ADDR_CTRL,
+                    PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC |
+                    (0xfffe << PR_MTE_TAG_SHIFT),
+                    0, 0, 0)) {
+                  perror("prctl() failed");
+                  return EXIT_FAILURE;
+          }
+
+          a = mmap(0, page_sz, PROT_READ | PROT_WRITE,
+                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+          if (a == MAP_FAILED) {
+                  perror("mmap() failed");
+                  return EXIT_FAILURE;
+          }
+
+          /*
+           * Enable MTE on the above anonymous mmap. The flag could be
+           * passed directly to mmap(), skipping this step.
+           */
+          if (mprotect(a, page_sz, PROT_READ | PROT_WRITE | PROT_MTE)) {
+                  perror("mprotect() failed");
+                  return EXIT_FAILURE;
+          }
+
+          /* access with the default tag (0) */
+          a[0] = 1;
+          a[1] = 2;
+
+          printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+          /* set the logical and allocation tags */
+          a = (unsigned char *)insert_random_tag(a);
+          set_tag(a);
+
+          printf("%p\n", a);
+
+          /* non-zero tag access */
+          a[0] = 3;
+          printf("a[0] = %hhu a[1] = %hhu\n", a[0], a[1]);
+
+          /*
+           * If MTE is enabled correctly the next instruction will
+           * generate an exception.
+           */
+          printf("Expecting SIGSEGV...\n");
+          a[16] = 0xdd;
+
+          /* this should not be printed in the PR_MTE_TCF_SYNC mode */
+          printf("...haven't got one\n");
+
+          return EXIT_FAILURE;
+  }