arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)

Message ID	20200801011152.39838-1-pcc@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=15xB=BL=lists.infradead.org=linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B9DBE20836 Date: Fri, 31 Jul 2020 18:11:52 -0700 Message-Id: <20200801011152.39838-1-pcc@google.com> Mime-Version: 1.0 Subject: [PATCH] arm64: add prctl(PR_PAC_SET_ENABLED_KEYS) From: Peter Collingbourne <pcc@google.com> To: Catalin Marinas <catalin.marinas@arm.com>, Evgenii Stepanov <eugenis@google.com>, Kostya Serebryany <kcc@google.com>, Vincenzo Frascino <vincenzo.frascino@arm.com>, Dave Martin <Dave.Martin@arm.com> summary: Content analysis details: (-7.7 points) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at https://www.dnswl.org/, no trust [2607:f8b0:4864:20:0:0:0:b4a listed in] [list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record 0.0 SPF_HELO_NONE SPF: HELO does not publish an SPF Record -7.5 USER_IN_DEF_DKIM_WL From: address is in the default DKIM white-list -0.1 DKIM_VALID_EF Message has a valid DKIM or DK signature from envelope-from domain -0.1 DKIM_VALID_AU Message has a valid DKIM or DK signature from author's domain 0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid -0.1 DKIM_VALID Message has at least one valid DKIM or DK signature -0.0 DKIMWL_WL_MED DKIMwl.org - Medium sender Precedence: list Cc: Andrey Konovalov <andreyknvl@google.com>, Will Deacon <will@kernel.org>, Kevin Brodsky <kevin.brodsky@arm.com>, Peter Collingbourne <pcc@google.com>, Linux ARM <linux-arm-kernel@lists.infradead.org> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	arm64: add prctl(PR_PAC_SET_ENABLED_KEYS) \| expand arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)

Peter Collingbourne Aug. 1, 2020, 1:11 a.m. UTC

This prctl allows the user program to control which PAC keys are enabled
in a particular task. The main reason why this is useful is to enable a
userspace ABI that uses PAC to sign and authenticate function pointers
and other pointers exposed outside of the function, while still allowing
binaries conforming to the ABI to interoperate with legacy binaries that
do not sign or authenticate pointers.

The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
---
 .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
 arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
 arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
 arch/arm64/include/asm/processor.h            |  5 +++
 arch/arm64/kernel/asm-offsets.c               |  1 +
 arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
 include/uapi/linux/prctl.h                    |  3 ++
 kernel/sys.c                                  |  8 +++++
 8 files changed, 105 insertions(+), 2 deletions(-)

Peter Collingbourne Aug. 1, 2020, 1:13 a.m. UTC | #1

On Fri, Jul 31, 2020 at 6:11 PM Peter Collingbourne <pcc@google.com> wrote:
>
> This prctl allows the user program to control which PAC keys are enabled
> in a particular task. The main reason why this is useful is to enable a
> userspace ABI that uses PAC to sign and authenticate function pointers
> and other pointers exposed outside of the function, while still allowing
> binaries conforming to the ABI to interoperate with legacy binaries that
> do not sign or authenticate pointers.
>
> The idea is that a dynamic loader or early startup code would issue
> this prctl very early after establishing that a process may load legacy
> binaries, but before executing any PAC instructions.

Signed-off-by: Peter Collingbourne <pcc@google.com>

Sorry, will fix in v2.

Peter

> ---
>  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
>  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
>  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
>  arch/arm64/include/asm/processor.h            |  5 +++
>  arch/arm64/kernel/asm-offsets.c               |  1 +
>  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
>  include/uapi/linux/prctl.h                    |  3 ++
>  kernel/sys.c                                  |  8 +++++
>  8 files changed, 105 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> index 30b2ab06526b..1f7e064deeb3 100644
> --- a/Documentation/arm64/pointer-authentication.rst
> +++ b/Documentation/arm64/pointer-authentication.rst
> @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
>  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
>  register. Any attempt to use the Pointer Authentication instructions will
>  result in an UNDEFINED exception being injected into the guest.
> +
> +
> +Enabling and disabling keys
> +---------------------------
> +
> +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> +PAC keys are enabled in a particular task. It takes two arguments, the
> +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> +and the second being a bitmask of the same bits specifying whether the key
> +should be enabled or disabled. For example::
> +
> +  prctl(PR_PAC_SET_ENABLED_KEYS,
> +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> +        PR_PAC_APIBKEY, 0, 0);
> +
> +disables all keys except the IB key.
> +
> +The main reason why this is useful is to enable a userspace ABI that uses PAC
> +instructions to sign and authenticate function pointers and other pointers
> +exposed outside of the function, while still allowing binaries conforming to
> +the ABI to interoperate with legacy binaries that do not sign or authenticate
> +pointers.
> +
> +The idea is that a dynamic loader or early startup code would issue this
> +prctl very early after establishing that a process may load legacy binaries,
> +but before executing any PAC instructions.
> diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> index 52dead2a8640..d121fa5fed5f 100644
> --- a/arch/arm64/include/asm/asm_pointer_auth.h
> +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> @@ -31,6 +31,14 @@ alternative_else_nop_endif
>         ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
>         msr_s   SYS_APDBKEYLO_EL1, \tmp2
>         msr_s   SYS_APDBKEYHI_EL1, \tmp3
> +
> +       ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> +       cbz     \tmp2, .Laddr_auth_skip_\@
> +
> +       mrs_s   \tmp3, SYS_SCTLR_EL1
> +       bic     \tmp3, \tmp3, \tmp2
> +       msr_s   SYS_SCTLR_EL1, \tmp3
> +
>  .Laddr_auth_skip_\@:
>  alternative_if ARM64_HAS_GENERIC_AUTH
>         ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> @@ -45,6 +53,17 @@ alternative_else_nop_endif
>         ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
>         msr_s   SYS_APIAKEYLO_EL1, \tmp2
>         msr_s   SYS_APIAKEYHI_EL1, \tmp3
> +
> +       ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> +       cbz     \tmp2, .Lset_sctlr_skip_\@
> +
> +       mrs_s   \tmp1, SYS_SCTLR_EL1
> +       mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
> +       movk    \tmp2, #SCTLR_ELx_ENDB
> +       orr     \tmp1, \tmp1, \tmp2
> +       msr_s   SYS_SCTLR_EL1, \tmp1
> +
> +.Lset_sctlr_skip_\@:
>         .endm
>
>         .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> index c6b4f0603024..d4c375454a36 100644
> --- a/arch/arm64/include/asm/pointer_auth.h
> +++ b/arch/arm64/include/asm/pointer_auth.h
> @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
>  }
>
>  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> +                                         unsigned long keys,
> +                                         unsigned long enabled);
>
>  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
>  {
>         return ptrauth_clear_pac(ptr);
>  }
>
> -#define ptrauth_thread_init_user(tsk)                                  \
> -       ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> +#define ptrauth_thread_init_user(tsk) do {                             \
> +               ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> +               (tsk)->thread.sctlr_enxx_mask = 0;                      \
> +       } while (0)
>  #define ptrauth_thread_init_kernel(tsk)                                        \
>         ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
>  #define ptrauth_thread_switch_kernel(tsk)                              \
> @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
>
>  #else /* CONFIG_ARM64_PTR_AUTH */
>  #define ptrauth_prctl_reset_keys(tsk, arg)     (-EINVAL)
> +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)     (-EINVAL)
>  #define ptrauth_strip_insn_pac(lr)     (lr)
>  #define ptrauth_thread_init_user(tsk)
>  #define ptrauth_thread_init_kernel(tsk)
> diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> index 240fe5e5b720..6974d227b01f 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -150,6 +150,7 @@ struct thread_struct {
>  #ifdef CONFIG_ARM64_PTR_AUTH
>         struct ptrauth_keys_user        keys_user;
>         struct ptrauth_keys_kernel      keys_kernel;
> +       u64                             sctlr_enxx_mask;
>  #endif
>  };
>
> @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
>  /* PR_PAC_RESET_KEYS prctl */
>  #define PAC_RESET_KEYS(tsk, arg)       ptrauth_prctl_reset_keys(tsk, arg)
>
> +/* PR_PAC_SET_ENABLED_KEYS prctl */
> +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                               \
> +       ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> +
>  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
>  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
>  long set_tagged_addr_ctrl(unsigned long arg);
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 0577e2142284..dac80e16fe35 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -47,6 +47,7 @@ int main(void)
>  #ifdef CONFIG_ARM64_PTR_AUTH
>    DEFINE(THREAD_KEYS_USER,     offsetof(struct task_struct, thread.keys_user));
>    DEFINE(THREAD_KEYS_KERNEL,   offsetof(struct task_struct, thread.keys_kernel));
> +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
>  #endif
>    BLANK();
>    DEFINE(S_X0,                 offsetof(struct pt_regs, regs[0]));
> diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> index 1e77736a4f66..8c385b7f324a 100644
> --- a/arch/arm64/kernel/pointer_auth.c
> +++ b/arch/arm64/kernel/pointer_auth.c
> @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
>
>         return 0;
>  }
> +
> +static u64 arg_to_enxx_mask(unsigned long arg)
> +{
> +       u64 sctlr_enxx_mask = 0;
> +       if (arg & PR_PAC_APIAKEY)
> +               sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> +       if (arg & PR_PAC_APIBKEY)
> +               sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> +       if (arg & PR_PAC_APDAKEY)
> +               sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> +       if (arg & PR_PAC_APDBKEY)
> +               sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> +       return sctlr_enxx_mask;
> +}
> +
> +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> +                                  unsigned long enabled)
> +{
> +       u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> +       unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> +                                     PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> +
> +       if (!system_supports_address_auth())
> +               return -EINVAL;
> +
> +       if ((keys & ~addr_key_mask) || (enabled & ~keys))
> +               return -EINVAL;
> +
> +       sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> +       sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> +
> +       tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> +       return 0;
> +}
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..18e1ae4a37a2 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER              57
>  #define PR_GET_IO_FLUSHER              58
>
> +/* Set enabled arm64 pointer authentication keys */
> +#define PR_PAC_SET_ENABLED_KEYS                59
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..623df216183b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -119,6 +119,9 @@
>  #ifndef PAC_RESET_KEYS
>  # define PAC_RESET_KEYS(a, b)  (-EINVAL)
>  #endif
> +#ifndef PAC_SET_ENABLED_KEYS
> +# define PAC_SET_ENABLED_KEYS(a, b, c) (-EINVAL)
> +#endif
>  #ifndef SET_TAGGED_ADDR_CTRL
>  # define SET_TAGGED_ADDR_CTRL(a)       (-EINVAL)
>  #endif
> @@ -2494,6 +2497,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>                         return -EINVAL;
>                 error = PAC_RESET_KEYS(me, arg2);
>                 break;
> +       case PR_PAC_SET_ENABLED_KEYS:
> +               if (arg4 || arg5)
> +                       return -EINVAL;
> +               error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
> +               break;
>         case PR_SET_TAGGED_ADDR_CTRL:
>                 if (arg3 || arg4 || arg5)
>                         return -EINVAL;
> --
> 2.28.0.163.g6104cc2f0b6-goog
>

Dave Martin Aug. 19, 2020, 10:18 a.m. UTC | #2

On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> This prctl allows the user program to control which PAC keys are enabled
> in a particular task. The main reason why this is useful is to enable a
> userspace ABI that uses PAC to sign and authenticate function pointers
> and other pointers exposed outside of the function, while still allowing
> binaries conforming to the ABI to interoperate with legacy binaries that
> do not sign or authenticate pointers.
> 
> The idea is that a dynamic loader or early startup code would issue
> this prctl very early after establishing that a process may load legacy
> binaries, but before executing any PAC instructions.

Apologies for the slow response on this, I'd had it on my list for a
while...

> ---
>  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
>  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
>  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
>  arch/arm64/include/asm/processor.h            |  5 +++
>  arch/arm64/kernel/asm-offsets.c               |  1 +
>  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
>  include/uapi/linux/prctl.h                    |  3 ++
>  kernel/sys.c                                  |  8 +++++
>  8 files changed, 105 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> index 30b2ab06526b..1f7e064deeb3 100644
> --- a/Documentation/arm64/pointer-authentication.rst
> +++ b/Documentation/arm64/pointer-authentication.rst
> @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
>  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
>  register. Any attempt to use the Pointer Authentication instructions will
>  result in an UNDEFINED exception being injected into the guest.
> +
> +
> +Enabling and disabling keys
> +---------------------------
> +
> +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> +PAC keys are enabled in a particular task. It takes two arguments, the
> +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> +and the second being a bitmask of the same bits specifying whether the key
> +should be enabled or disabled. For example::
> +
> +  prctl(PR_PAC_SET_ENABLED_KEYS,
> +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> +        PR_PAC_APIBKEY, 0, 0);
> +
> +disables all keys except the IB key.
> +
> +The main reason why this is useful is to enable a userspace ABI that uses PAC
> +instructions to sign and authenticate function pointers and other pointers
> +exposed outside of the function, while still allowing binaries conforming to
> +the ABI to interoperate with legacy binaries that do not sign or authenticate
> +pointers.

What actually breaks without this?

Since the keys are all enabled by default, the only purpose of this
prctl seems to be to disable keys.  I'm not sure what this is giving us.

> +
> +The idea is that a dynamic loader or early startup code would issue this
> +prctl very early after establishing that a process may load legacy binaries,
> +but before executing any PAC instructions.
> diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> index 52dead2a8640..d121fa5fed5f 100644
> --- a/arch/arm64/include/asm/asm_pointer_auth.h
> +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> @@ -31,6 +31,14 @@ alternative_else_nop_endif
>  	ldp	\tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
>  	msr_s	SYS_APDBKEYLO_EL1, \tmp2
>  	msr_s	SYS_APDBKEYHI_EL1, \tmp3
> +
> +	ldr	\tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> +	cbz	\tmp2, .Laddr_auth_skip_\@

I wonder whether it would make sense to simple store the thread's base
SCTLR value (containing the ENxx bits), rather than storing the ENxx
bits separately.  There may be reasons outside this snippet why this
isn't such a good idea though -- I haven't gone digging so far.

> +
> +	mrs_s	\tmp3, SYS_SCTLR_EL1
> +	bic	\tmp3, \tmp3, \tmp2
> +	msr_s	SYS_SCTLR_EL1, \tmp3
> +
>  .Laddr_auth_skip_\@:
>  alternative_if ARM64_HAS_GENERIC_AUTH
>  	ldp	\tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> @@ -45,6 +53,17 @@ alternative_else_nop_endif
>  	ldp	\tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
>  	msr_s	SYS_APIAKEYLO_EL1, \tmp2
>  	msr_s	SYS_APIAKEYHI_EL1, \tmp3
> +
> +	ldr	\tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> +	cbz	\tmp2, .Lset_sctlr_skip_\@
> +
> +	mrs_s	\tmp1, SYS_SCTLR_EL1
> +	mov	\tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)

(Nit: harmless but unnecessary ().  # is not an operator as such, just
random syntax.  Whatever follows is greedily parsed as an immediate
expression.)

> +	movk	\tmp2, #SCTLR_ELx_ENDB

Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
ENxx bits unconditionally?  I may be missing something here.

> +	orr	\tmp1, \tmp1, \tmp2
> +	msr_s	SYS_SCTLR_EL1, \tmp1
> +
> +.Lset_sctlr_skip_\@:
>  	.endm
>  
>  	.macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> index c6b4f0603024..d4c375454a36 100644
> --- a/arch/arm64/include/asm/pointer_auth.h
> +++ b/arch/arm64/include/asm/pointer_auth.h
> @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
>  }
>  
>  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> +					  unsigned long keys,
> +					  unsigned long enabled);
>  
>  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
>  {
>  	return ptrauth_clear_pac(ptr);
>  }
>  
> -#define ptrauth_thread_init_user(tsk)					\
> -	ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> +#define ptrauth_thread_init_user(tsk) do {				\
> +		ptrauth_keys_init_user(&(tsk)->thread.keys_user);	\
> +		(tsk)->thread.sctlr_enxx_mask = 0;			\
> +	} while (0)
>  #define ptrauth_thread_init_kernel(tsk)					\
>  	ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
>  #define ptrauth_thread_switch_kernel(tsk)				\
> @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
>  
>  #else /* CONFIG_ARM64_PTR_AUTH */
>  #define ptrauth_prctl_reset_keys(tsk, arg)	(-EINVAL)
> +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)	(-EINVAL)
>  #define ptrauth_strip_insn_pac(lr)	(lr)
>  #define ptrauth_thread_init_user(tsk)
>  #define ptrauth_thread_init_kernel(tsk)
> diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> index 240fe5e5b720..6974d227b01f 100644
> --- a/arch/arm64/include/asm/processor.h
> +++ b/arch/arm64/include/asm/processor.h
> @@ -150,6 +150,7 @@ struct thread_struct {
>  #ifdef CONFIG_ARM64_PTR_AUTH
>  	struct ptrauth_keys_user	keys_user;
>  	struct ptrauth_keys_kernel	keys_kernel;
> +	u64				sctlr_enxx_mask;
>  #endif
>  };
>  
> @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
>  /* PR_PAC_RESET_KEYS prctl */
>  #define PAC_RESET_KEYS(tsk, arg)	ptrauth_prctl_reset_keys(tsk, arg)
>  
> +/* PR_PAC_SET_ENABLED_KEYS prctl */
> +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)				\
> +	ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> +
>  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
>  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
>  long set_tagged_addr_ctrl(unsigned long arg);
> diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> index 0577e2142284..dac80e16fe35 100644
> --- a/arch/arm64/kernel/asm-offsets.c
> +++ b/arch/arm64/kernel/asm-offsets.c
> @@ -47,6 +47,7 @@ int main(void)
>  #ifdef CONFIG_ARM64_PTR_AUTH
>    DEFINE(THREAD_KEYS_USER,	offsetof(struct task_struct, thread.keys_user));
>    DEFINE(THREAD_KEYS_KERNEL,	offsetof(struct task_struct, thread.keys_kernel));
> +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
>  #endif
>    BLANK();
>    DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
> diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> index 1e77736a4f66..8c385b7f324a 100644
> --- a/arch/arm64/kernel/pointer_auth.c
> +++ b/arch/arm64/kernel/pointer_auth.c
> @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
>  
>  	return 0;
>  }
> +
> +static u64 arg_to_enxx_mask(unsigned long arg)
> +{
> +	u64 sctlr_enxx_mask = 0;
> +	if (arg & PR_PAC_APIAKEY)
> +		sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> +	if (arg & PR_PAC_APIBKEY)
> +		sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> +	if (arg & PR_PAC_APDAKEY)
> +		sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> +	if (arg & PR_PAC_APDBKEY)
> +		sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> +	return sctlr_enxx_mask;
> +}
> +
> +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> +				   unsigned long enabled)
> +{
> +	u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> +	unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> +				      PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> +
> +	if (!system_supports_address_auth())
> +		return -EINVAL;
> +
> +	if ((keys & ~addr_key_mask) || (enabled & ~keys))
> +		return -EINVAL;

Should we take the types of authentication supported?

I don't recall whether we expose ptrauth to userspace if only
instruction authentication or only data authentication is supported.  If
so, should we reject attempts to configure unsupported keys here?

We should probably try to do a consistent thing both here and in
PR_PAC_RESET_KEYS if so.

> +
> +	sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> +	sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> +
> +	tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> +	return 0;

Do we need a way to query the enabled keys?

We could either have a _GET_ prctl (the common approach), or have this
prctl return the mask of enabled keys (avoids the extra prctl, but
weirder).

As above, we might 

Things like CRIU may need a GET in order to save/restore this setting
properly.

> +}
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..18e1ae4a37a2 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
>  #define PR_SET_IO_FLUSHER		57
>  #define PR_GET_IO_FLUSHER		58
>  
> +/* Set enabled arm64 pointer authentication keys */
> +#define PR_PAC_SET_ENABLED_KEYS		59
> +
>  #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..623df216183b 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -119,6 +119,9 @@
>  #ifndef PAC_RESET_KEYS
>  # define PAC_RESET_KEYS(a, b)	(-EINVAL)
>  #endif
> +#ifndef PAC_SET_ENABLED_KEYS
> +# define PAC_SET_ENABLED_KEYS(a, b, c)	(-EINVAL)
> +#endif
>  #ifndef SET_TAGGED_ADDR_CTRL
>  # define SET_TAGGED_ADDR_CTRL(a)	(-EINVAL)
>  #endif
> @@ -2494,6 +2497,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  			return -EINVAL;
>  		error = PAC_RESET_KEYS(me, arg2);
>  		break;
> +	case PR_PAC_SET_ENABLED_KEYS:
> +		if (arg4 || arg5)
> +			return -EINVAL;
> +		error = PAC_SET_ENABLED_KEYS(me, arg2, arg3);
> +		break;
>  	case PR_SET_TAGGED_ADDR_CTRL:
>  		if (arg3 || arg4 || arg5)
>  			return -EINVAL;

[...]

Cheers
---Dave

Peter Collingbourne Aug. 19, 2020, 9:25 p.m. UTC | #3

On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > This prctl allows the user program to control which PAC keys are enabled
> > in a particular task. The main reason why this is useful is to enable a
> > userspace ABI that uses PAC to sign and authenticate function pointers
> > and other pointers exposed outside of the function, while still allowing
> > binaries conforming to the ABI to interoperate with legacy binaries that
> > do not sign or authenticate pointers.
> >
> > The idea is that a dynamic loader or early startup code would issue
> > this prctl very early after establishing that a process may load legacy
> > binaries, but before executing any PAC instructions.
>
> Apologies for the slow response on this, I'd had it on my list for a
> while...
>
> > ---
> >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> >  arch/arm64/include/asm/processor.h            |  5 +++
> >  arch/arm64/kernel/asm-offsets.c               |  1 +
> >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> >  include/uapi/linux/prctl.h                    |  3 ++
> >  kernel/sys.c                                  |  8 +++++
> >  8 files changed, 105 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > index 30b2ab06526b..1f7e064deeb3 100644
> > --- a/Documentation/arm64/pointer-authentication.rst
> > +++ b/Documentation/arm64/pointer-authentication.rst
> > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> >  register. Any attempt to use the Pointer Authentication instructions will
> >  result in an UNDEFINED exception being injected into the guest.
> > +
> > +
> > +Enabling and disabling keys
> > +---------------------------
> > +
> > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > +PAC keys are enabled in a particular task. It takes two arguments, the
> > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > +and the second being a bitmask of the same bits specifying whether the key
> > +should be enabled or disabled. For example::
> > +
> > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > +        PR_PAC_APIBKEY, 0, 0);
> > +
> > +disables all keys except the IB key.
> > +
> > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > +instructions to sign and authenticate function pointers and other pointers
> > +exposed outside of the function, while still allowing binaries conforming to
> > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > +pointers.
>
> What actually breaks without this?
>
> Since the keys are all enabled by default, the only purpose of this
> prctl seems to be to disable keys.  I'm not sure what this is giving us.

Yes, the purpose is to disable keys. Let's consider the function
pointer signing userspace ABI use case. An example is Apple's arm64e
ABI, and I have a prototype branch of LLVM [0] that implements a
similar ABI in Linux userspace, based on Apple's implementation of
their ABI.

Here's an example of a function that returns a function pointer, and a
function that calls a function pointer of the same type:

static void f(void) {}

void *return_fp(void) {
  return f;
}

void call_fp(void (*p)(void)) {
  p();
}

If I compile this with my prototype compiler I get:

$ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
-march=armv8.3a
[...]
return_fp:                              // @return_fp
// %bb.0:                               // %entry
        adrp    x16, f
        add     x16, x16, :lo12:f
        mov     x17, #16277
        pacia   x16, x17
        mov     x0, x16
        ret
[...]
call_fp:                                // @call_fp
// %bb.0:                               // %entry
        mov     w1, #16277
        braa    x0, x1
[...]

In this code snippet the function pointer is signed with the IA key
and discriminator 16277 before being returned. When the function is
called, the pointer is first authenticated with the same key and
discriminator.

Now imagine that this code lives in a shared library used both by
programs that use the function pointer signing ABI and by legacy
binaries (i.e. programs that use the existing ABI), and we have a
legacy binary that calls return_fp. If the legacy binary then calls
the function pointer returned by return_fp, that code will not
authenticate the pointer before calling it, it will just use a br or
blr instruction to call it directly, which will lead to a crash if the
signature bits are set in the function pointer. In order to prevent
the crash, we need a way to cause the pacia instruction in return_fp
to become a no-op when running inside the process hosting the legacy
binary, so that the signature bits will remain clear and the br or blr
instruction in the legacy binary will successfully call the function
f. That can be done by disabling the IA key, which is exactly what
this prctl() lets us do.

>
> > +
> > +The idea is that a dynamic loader or early startup code would issue this
> > +prctl very early after establishing that a process may load legacy binaries,
> > +but before executing any PAC instructions.
> > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > index 52dead2a8640..d121fa5fed5f 100644
> > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > +
> > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > +     cbz     \tmp2, .Laddr_auth_skip_\@
>
> I wonder whether it would make sense to simple store the thread's base
> SCTLR value (containing the ENxx bits), rather than storing the ENxx
> bits separately.  There may be reasons outside this snippet why this
> isn't such a good idea though -- I haven't gone digging so far.

As far as I know (as I learned [1] from the MTE patch series), it can
be expensive to access SCTLR_EL1, therefore I arrange to only access
SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
keys. Detecting the "process has disabled keys" case is quite simple
if we only store the disabled keys mask here, not so much if we store
the full value of SCTLR_EL1.

> > +
> > +     mrs_s   \tmp3, SYS_SCTLR_EL1
> > +     bic     \tmp3, \tmp3, \tmp2
> > +     msr_s   SYS_SCTLR_EL1, \tmp3
> > +
> >  .Laddr_auth_skip_\@:
> >  alternative_if ARM64_HAS_GENERIC_AUTH
> >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> > @@ -45,6 +53,17 @@ alternative_else_nop_endif
> >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
> >       msr_s   SYS_APIAKEYLO_EL1, \tmp2
> >       msr_s   SYS_APIAKEYHI_EL1, \tmp3
> > +
> > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > +     cbz     \tmp2, .Lset_sctlr_skip_\@
> > +
> > +     mrs_s   \tmp1, SYS_SCTLR_EL1
> > +     mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
>
> (Nit: harmless but unnecessary ().  # is not an operator as such, just
> random syntax.  Whatever follows is greedily parsed as an immediate
> expression.)

Okay. While looking around in the kernel I noticed that there is a
mov_q macro that can be used to avoid manually splitting the constant
into 16-bit chunks, and apparently it doesn't require a #. I'll use it
in v2.

> > +     movk    \tmp2, #SCTLR_ELx_ENDB
>
> Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
> ENxx bits unconditionally?  I may be missing something here.

This code is to support the case where we are returning to the kernel
from a userspace task with keys disabled. The kernel needs at least
the IA key enabled in order for its own use of reverse-edge PAC to
work correctly. When returning from a userspace task with no keys
disabled, the keys enabled bits already have the correct values, so
there is nothing to be done (and as mentioned above, I avoid touching
SCTLR_EL1 unless necessary because it is apparently expensive to do
so). But in a process with keys disabled, we will need to re-enable at
least IA.

We may be able to get away with just enabling IA here, but that would
break the invariant that all keys are enabled inside the kernel, which
is relied on by the code that decides whether to access SCTLR_EL1 in
order to disable keys when entering a userspace task.

> > +     orr     \tmp1, \tmp1, \tmp2
> > +     msr_s   SYS_SCTLR_EL1, \tmp1
> > +
> > +.Lset_sctlr_skip_\@:
> >       .endm
> >
> >       .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> > diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> > index c6b4f0603024..d4c375454a36 100644
> > --- a/arch/arm64/include/asm/pointer_auth.h
> > +++ b/arch/arm64/include/asm/pointer_auth.h
> > @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
> >  }
> >
> >  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> > +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> > +                                       unsigned long keys,
> > +                                       unsigned long enabled);
> >
> >  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> >  {
> >       return ptrauth_clear_pac(ptr);
> >  }
> >
> > -#define ptrauth_thread_init_user(tsk)                                        \
> > -     ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> > +#define ptrauth_thread_init_user(tsk) do {                           \
> > +             ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> > +             (tsk)->thread.sctlr_enxx_mask = 0;                      \
> > +     } while (0)
> >  #define ptrauth_thread_init_kernel(tsk)                                      \
> >       ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
> >  #define ptrauth_thread_switch_kernel(tsk)                            \
> > @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> >
> >  #else /* CONFIG_ARM64_PTR_AUTH */
> >  #define ptrauth_prctl_reset_keys(tsk, arg)   (-EINVAL)
> > +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)   (-EINVAL)
> >  #define ptrauth_strip_insn_pac(lr)   (lr)
> >  #define ptrauth_thread_init_user(tsk)
> >  #define ptrauth_thread_init_kernel(tsk)
> > diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> > index 240fe5e5b720..6974d227b01f 100644
> > --- a/arch/arm64/include/asm/processor.h
> > +++ b/arch/arm64/include/asm/processor.h
> > @@ -150,6 +150,7 @@ struct thread_struct {
> >  #ifdef CONFIG_ARM64_PTR_AUTH
> >       struct ptrauth_keys_user        keys_user;
> >       struct ptrauth_keys_kernel      keys_kernel;
> > +     u64                             sctlr_enxx_mask;
> >  #endif
> >  };
> >
> > @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
> >  /* PR_PAC_RESET_KEYS prctl */
> >  #define PAC_RESET_KEYS(tsk, arg)     ptrauth_prctl_reset_keys(tsk, arg)
> >
> > +/* PR_PAC_SET_ENABLED_KEYS prctl */
> > +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                             \
> > +     ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> > +
> >  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> >  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
> >  long set_tagged_addr_ctrl(unsigned long arg);
> > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > index 0577e2142284..dac80e16fe35 100644
> > --- a/arch/arm64/kernel/asm-offsets.c
> > +++ b/arch/arm64/kernel/asm-offsets.c
> > @@ -47,6 +47,7 @@ int main(void)
> >  #ifdef CONFIG_ARM64_PTR_AUTH
> >    DEFINE(THREAD_KEYS_USER,   offsetof(struct task_struct, thread.keys_user));
> >    DEFINE(THREAD_KEYS_KERNEL, offsetof(struct task_struct, thread.keys_kernel));
> > +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
> >  #endif
> >    BLANK();
> >    DEFINE(S_X0,                       offsetof(struct pt_regs, regs[0]));
> > diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> > index 1e77736a4f66..8c385b7f324a 100644
> > --- a/arch/arm64/kernel/pointer_auth.c
> > +++ b/arch/arm64/kernel/pointer_auth.c
> > @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
> >
> >       return 0;
> >  }
> > +
> > +static u64 arg_to_enxx_mask(unsigned long arg)
> > +{
> > +     u64 sctlr_enxx_mask = 0;
> > +     if (arg & PR_PAC_APIAKEY)
> > +             sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> > +     if (arg & PR_PAC_APIBKEY)
> > +             sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> > +     if (arg & PR_PAC_APDAKEY)
> > +             sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> > +     if (arg & PR_PAC_APDBKEY)
> > +             sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> > +     return sctlr_enxx_mask;
> > +}
> > +
> > +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> > +                                unsigned long enabled)
> > +{
> > +     u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> > +     unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> > +                                   PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> > +
> > +     if (!system_supports_address_auth())
> > +             return -EINVAL;
> > +
> > +     if ((keys & ~addr_key_mask) || (enabled & ~keys))
> > +             return -EINVAL;
>
> Should we take the types of authentication supported?
>
> I don't recall whether we expose ptrauth to userspace if only
> instruction authentication or only data authentication is supported.  If
> so, should we reject attempts to configure unsupported keys here?
>
> We should probably try to do a consistent thing both here and in
> PR_PAC_RESET_KEYS if so.

As far as I know, there is nothing in the architecture that would
allow it to only advertise support for I keys or only advertise
support for D keys. The fields AA64ISAR1_EL1.AP[AI] apply to all four
keys: DA, DB, IA and IB. Maybe you are thinking of the GA key versus
the other keys (which is advertised separately via
AA64ISAR1_EL1.GP[AI])? The architecture appears to provide no way to
disable the GA key, so I did not include support for it here.

> > +
> > +     sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> > +     sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> > +
> > +     tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> > +     return 0;
>
> Do we need a way to query the enabled keys?
>
> We could either have a _GET_ prctl (the common approach), or have this
> prctl return the mask of enabled keys (avoids the extra prctl, but
> weirder).
>
> As above, we might
>
> Things like CRIU may need a GET in order to save/restore this setting
> properly.

Maybe it makes sense for there to be a GET prctl() to support CRIU.
But it would need to be defined carefully because CRIU would
presumably need to know what value to pass as the "keys" argument when
it calls SET to restore the state. It can't just hardcode a value
because that may harm extensibility if new keys are added.

If we require the kernel to start processes with all keys enabled
(including any keys that we may introduce in the future), then it
seems like if CRIU knows which keys were disabled, then it can restore
the state by issuing a syscall like this:

prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)

which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
saving the state to prepare the disabled_keys_mask argument passed
here. Then for consistency we can make the SET prctl() be
PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:

prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)

Does that make sense?

Peter

[0] https://github.com/pcc/llvm-project/tree/apple-pac
[1] https://github.com/pcc/linux/commit/b9fff3622f84eb4d939afcd77ea3c36f4e7f9bb0#diff-418b65fb9fc512fbbfae1a66a37bdaf0R92

Dave Martin Aug. 24, 2020, 2:49 p.m. UTC | #4

On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > This prctl allows the user program to control which PAC keys are enabled
> > > in a particular task. The main reason why this is useful is to enable a
> > > userspace ABI that uses PAC to sign and authenticate function pointers
> > > and other pointers exposed outside of the function, while still allowing
> > > binaries conforming to the ABI to interoperate with legacy binaries that
> > > do not sign or authenticate pointers.
> > >
> > > The idea is that a dynamic loader or early startup code would issue
> > > this prctl very early after establishing that a process may load legacy
> > > binaries, but before executing any PAC instructions.
> >
> > Apologies for the slow response on this, I'd had it on my list for a
> > while...
> >
> > > ---
> > >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> > >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> > >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> > >  arch/arm64/include/asm/processor.h            |  5 +++
> > >  arch/arm64/kernel/asm-offsets.c               |  1 +
> > >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> > >  include/uapi/linux/prctl.h                    |  3 ++
> > >  kernel/sys.c                                  |  8 +++++
> > >  8 files changed, 105 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > > index 30b2ab06526b..1f7e064deeb3 100644
> > > --- a/Documentation/arm64/pointer-authentication.rst
> > > +++ b/Documentation/arm64/pointer-authentication.rst
> > > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> > >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> > >  register. Any attempt to use the Pointer Authentication instructions will
> > >  result in an UNDEFINED exception being injected into the guest.
> > > +
> > > +
> > > +Enabling and disabling keys
> > > +---------------------------
> > > +
> > > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > > +PAC keys are enabled in a particular task. It takes two arguments, the
> > > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > > +and the second being a bitmask of the same bits specifying whether the key
> > > +should be enabled or disabled. For example::
> > > +
> > > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > > +        PR_PAC_APIBKEY, 0, 0);
> > > +
> > > +disables all keys except the IB key.
> > > +
> > > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > > +instructions to sign and authenticate function pointers and other pointers
> > > +exposed outside of the function, while still allowing binaries conforming to
> > > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > > +pointers.
> >
> > What actually breaks without this?
> >
> > Since the keys are all enabled by default, the only purpose of this
> > prctl seems to be to disable keys.  I'm not sure what this is giving us.
> 
> Yes, the purpose is to disable keys. Let's consider the function
> pointer signing userspace ABI use case. An example is Apple's arm64e
> ABI, and I have a prototype branch of LLVM [0] that implements a
> similar ABI in Linux userspace, based on Apple's implementation of
> their ABI.
> 
> Here's an example of a function that returns a function pointer, and a
> function that calls a function pointer of the same type:
> 
> static void f(void) {}
> 
> void *return_fp(void) {
>   return f;
> }
> 
> void call_fp(void (*p)(void)) {
>   p();
> }
> 
> If I compile this with my prototype compiler I get:
> 
> $ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
> -march=armv8.3a
> [...]
> return_fp:                              // @return_fp
> // %bb.0:                               // %entry
>         adrp    x16, f
>         add     x16, x16, :lo12:f
>         mov     x17, #16277
>         pacia   x16, x17
>         mov     x0, x16
>         ret
> [...]
> call_fp:                                // @call_fp
> // %bb.0:                               // %entry
>         mov     w1, #16277
>         braa    x0, x1
> [...]
> 
> In this code snippet the function pointer is signed with the IA key
> and discriminator 16277 before being returned. When the function is
> called, the pointer is first authenticated with the same key and
> discriminator.
> 
> Now imagine that this code lives in a shared library used both by
> programs that use the function pointer signing ABI and by legacy
> binaries (i.e. programs that use the existing ABI), and we have a
> legacy binary that calls return_fp. If the legacy binary then calls
> the function pointer returned by return_fp, that code will not
> authenticate the pointer before calling it, it will just use a br or
> blr instruction to call it directly, which will lead to a crash if the
> signature bits are set in the function pointer. In order to prevent
> the crash, we need a way to cause the pacia instruction in return_fp
> to become a no-op when running inside the process hosting the legacy
> binary, so that the signature bits will remain clear and the br or blr
> instruction in the legacy binary will successfully call the function
> f. That can be done by disabling the IA key, which is exactly what
> this prctl() lets us do.
> 
> >
> > > +
> > > +The idea is that a dynamic loader or early startup code would issue this
> > > +prctl very early after establishing that a process may load legacy binaries,
> > > +but before executing any PAC instructions.
> > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > index 52dead2a8640..d121fa5fed5f 100644
> > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > +
> > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> >
> > I wonder whether it would make sense to simple store the thread's base
> > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > bits separately.  There may be reasons outside this snippet why this
> > isn't such a good idea though -- I haven't gone digging so far.
> 
> As far as I know (as I learned [1] from the MTE patch series), it can
> be expensive to access SCTLR_EL1, therefore I arrange to only access
> SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> keys. Detecting the "process has disabled keys" case is quite simple
> if we only store the disabled keys mask here, not so much if we store
> the full value of SCTLR_EL1.

My thought was that we would still only write SCTLR_EL1 if needed, but
we would do the write-if-needed across the whole register in one go.
This would be easier to extend if we have to twiddle additional
SCTLR_EL1 bits in the future.  If the key enable bits are the only
affected bits for now then we could of course defer this factoring until
later.  (I vaguely remember a similar discussion in the past, but
possibly it was about the pauth keys anyway, rather than something
else.)

In a case like this, we'll still get overheads if there are a mixture of
tasks contending for the CPU, that have different key enable settings.
But I can't see much that we can do about that.  If userspace is mostly
built with the same options (true for the Apple case I guess) then I
guess we shouldn't need SCTLR_EL1 rewrites very often just for this.  In
other environments it may be messier.

> 
> > > +
> > > +     mrs_s   \tmp3, SYS_SCTLR_EL1
> > > +     bic     \tmp3, \tmp3, \tmp2
> > > +     msr_s   SYS_SCTLR_EL1, \tmp3
> > > +
> > >  .Laddr_auth_skip_\@:
> > >  alternative_if ARM64_HAS_GENERIC_AUTH
> > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> > > @@ -45,6 +53,17 @@ alternative_else_nop_endif
> > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
> > >       msr_s   SYS_APIAKEYLO_EL1, \tmp2
> > >       msr_s   SYS_APIAKEYHI_EL1, \tmp3
> > > +
> > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > +     cbz     \tmp2, .Lset_sctlr_skip_\@
> > > +
> > > +     mrs_s   \tmp1, SYS_SCTLR_EL1
> > > +     mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
> >
> > (Nit: harmless but unnecessary ().  # is not an operator as such, just
> > random syntax.  Whatever follows is greedily parsed as an immediate
> > expression.)
> 
> Okay. While looking around in the kernel I noticed that there is a
> mov_q macro that can be used to avoid manually splitting the constant
> into 16-bit chunks, and apparently it doesn't require a #. I'll use it
> in v2.
> 
> > > +     movk    \tmp2, #SCTLR_ELx_ENDB
> >
> > Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
> > ENxx bits unconditionally?  I may be missing something here.
> 
> This code is to support the case where we are returning to the kernel
> from a userspace task with keys disabled. The kernel needs at least
> the IA key enabled in order for its own use of reverse-edge PAC to
> work correctly. When returning from a userspace task with no keys
> disabled, the keys enabled bits already have the correct values, so
> there is nothing to be done (and as mentioned above, I avoid touching
> SCTLR_EL1 unless necessary because it is apparently expensive to do
> so). But in a process with keys disabled, we will need to re-enable at
> least IA.
> 
> We may be able to get away with just enabling IA here, but that would
> break the invariant that all keys are enabled inside the kernel, which
> is relied on by the code that decides whether to access SCTLR_EL1 in
> order to disable keys when entering a userspace task.

OK, I think I just confused myself here: we are not setting the key
enables for userspace, but for the kernel, and we only need to do that
if the user task had some keys disabled in the first place.

> 
> > > +     orr     \tmp1, \tmp1, \tmp2
> > > +     msr_s   SYS_SCTLR_EL1, \tmp1
> > > +
> > > +.Lset_sctlr_skip_\@:
> > >       .endm
> > >
> > >       .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> > > diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> > > index c6b4f0603024..d4c375454a36 100644
> > > --- a/arch/arm64/include/asm/pointer_auth.h
> > > +++ b/arch/arm64/include/asm/pointer_auth.h
> > > @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
> > >  }
> > >
> > >  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> > > +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> > > +                                       unsigned long keys,
> > > +                                       unsigned long enabled);
> > >
> > >  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > >  {
> > >       return ptrauth_clear_pac(ptr);
> > >  }
> > >
> > > -#define ptrauth_thread_init_user(tsk)                                        \
> > > -     ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> > > +#define ptrauth_thread_init_user(tsk) do {                           \
> > > +             ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> > > +             (tsk)->thread.sctlr_enxx_mask = 0;                      \
> > > +     } while (0)
> > >  #define ptrauth_thread_init_kernel(tsk)                                      \
> > >       ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
> > >  #define ptrauth_thread_switch_kernel(tsk)                            \
> > > @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > >
> > >  #else /* CONFIG_ARM64_PTR_AUTH */
> > >  #define ptrauth_prctl_reset_keys(tsk, arg)   (-EINVAL)
> > > +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)   (-EINVAL)
> > >  #define ptrauth_strip_insn_pac(lr)   (lr)
> > >  #define ptrauth_thread_init_user(tsk)
> > >  #define ptrauth_thread_init_kernel(tsk)
> > > diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> > > index 240fe5e5b720..6974d227b01f 100644
> > > --- a/arch/arm64/include/asm/processor.h
> > > +++ b/arch/arm64/include/asm/processor.h
> > > @@ -150,6 +150,7 @@ struct thread_struct {
> > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > >       struct ptrauth_keys_user        keys_user;
> > >       struct ptrauth_keys_kernel      keys_kernel;
> > > +     u64                             sctlr_enxx_mask;
> > >  #endif
> > >  };
> > >
> > > @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
> > >  /* PR_PAC_RESET_KEYS prctl */
> > >  #define PAC_RESET_KEYS(tsk, arg)     ptrauth_prctl_reset_keys(tsk, arg)
> > >
> > > +/* PR_PAC_SET_ENABLED_KEYS prctl */
> > > +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                             \
> > > +     ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> > > +
> > >  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> > >  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
> > >  long set_tagged_addr_ctrl(unsigned long arg);
> > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > index 0577e2142284..dac80e16fe35 100644
> > > --- a/arch/arm64/kernel/asm-offsets.c
> > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > @@ -47,6 +47,7 @@ int main(void)
> > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > >    DEFINE(THREAD_KEYS_USER,   offsetof(struct task_struct, thread.keys_user));
> > >    DEFINE(THREAD_KEYS_KERNEL, offsetof(struct task_struct, thread.keys_kernel));
> > > +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
> > >  #endif
> > >    BLANK();
> > >    DEFINE(S_X0,                       offsetof(struct pt_regs, regs[0]));
> > > diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> > > index 1e77736a4f66..8c385b7f324a 100644
> > > --- a/arch/arm64/kernel/pointer_auth.c
> > > +++ b/arch/arm64/kernel/pointer_auth.c
> > > @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
> > >
> > >       return 0;
> > >  }
> > > +
> > > +static u64 arg_to_enxx_mask(unsigned long arg)
> > > +{
> > > +     u64 sctlr_enxx_mask = 0;
> > > +     if (arg & PR_PAC_APIAKEY)
> > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> > > +     if (arg & PR_PAC_APIBKEY)
> > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> > > +     if (arg & PR_PAC_APDAKEY)
> > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> > > +     if (arg & PR_PAC_APDBKEY)
> > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> > > +     return sctlr_enxx_mask;
> > > +}
> > > +
> > > +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> > > +                                unsigned long enabled)
> > > +{
> > > +     u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> > > +     unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> > > +                                   PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> > > +
> > > +     if (!system_supports_address_auth())
> > > +             return -EINVAL;
> > > +
> > > +     if ((keys & ~addr_key_mask) || (enabled & ~keys))
> > > +             return -EINVAL;
> >
> > Should we take the types of authentication supported?
> >
> > I don't recall whether we expose ptrauth to userspace if only
> > instruction authentication or only data authentication is supported.  If
> > so, should we reject attempts to configure unsupported keys here?
> >
> > We should probably try to do a consistent thing both here and in
> > PR_PAC_RESET_KEYS if so.
> 
> As far as I know, there is nothing in the architecture that would
> allow it to only advertise support for I keys or only advertise
> support for D keys. The fields AA64ISAR1_EL1.AP[AI] apply to all four
> keys: DA, DB, IA and IB. Maybe you are thinking of the GA key versus
> the other keys (which is advertised separately via
> AA64ISAR1_EL1.GP[AI])? The architecture appears to provide no way to
> disable the GA key, so I did not include support for it here.

I think I'm confusing myself here.  Yes, support for generic auth is
the (supposedly) architecturally orthogonal to address auth, but data
and instruction address auth are either both supported, or both not
supported -- so your code looks correct.

> 
> > > +
> > > +     sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> > > +     sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> > > +
> > > +     tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> > > +     return 0;
> >
> > Do we need a way to query the enabled keys?
> >
> > We could either have a _GET_ prctl (the common approach), or have this
> > prctl return the mask of enabled keys (avoids the extra prctl, but
> > weirder).
> >
> > As above, we might
> >
> > Things like CRIU may need a GET in order to save/restore this setting
> > properly.
> 
> Maybe it makes sense for there to be a GET prctl() to support CRIU.
> But it would need to be defined carefully because CRIU would
> presumably need to know what value to pass as the "keys" argument when
> it calls SET to restore the state. It can't just hardcode a value
> because that may harm extensibility if new keys are added.
> 
> If we require the kernel to start processes with all keys enabled
> (including any keys that we may introduce in the future), then it
> seems like if CRIU knows which keys were disabled, then it can restore
> the state by issuing a syscall like this:
> 
> prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> 
> which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> saving the state to prepare the disabled_keys_mask argument passed
> here. Then for consistency we can make the SET prctl() be
> PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> 
> prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> 
> Does that make sense?

Possibly, though it's nicer if the GET returns the same value suppiled
to the SET, rather than the complement of it.

If SET ignores set bits in arg3 that don't correspond to set bits in the
mask arg2, then

	u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);

	prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);

should work.


There's a final option, which is to expose this config through ptrace
instead for save/restore purposes.  From previous discussions with the
CRIU folks, I recall that they are trying to move away from doing setup
from within the new process where possible.  

There's no reason not to have both though -- there's precedent for that,
such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
in a similar direction too IIUC.

Having a GET remains useful for in-process debugging and diagnostics,
and it's extremely straightforward to add in the kernel.  So from my
side I'd vote to have it anyway...

Cheers
---Dave

Dave Martin Aug. 24, 2020, 2:55 p.m. UTC | #5

On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > This prctl allows the user program to control which PAC keys are enabled
> > > in a particular task. The main reason why this is useful is to enable a
> > > userspace ABI that uses PAC to sign and authenticate function pointers
> > > and other pointers exposed outside of the function, while still allowing
> > > binaries conforming to the ABI to interoperate with legacy binaries that
> > > do not sign or authenticate pointers.
> > >
> > > The idea is that a dynamic loader or early startup code would issue
> > > this prctl very early after establishing that a process may load legacy
> > > binaries, but before executing any PAC instructions.
> >
> > Apologies for the slow response on this, I'd had it on my list for a
> > while...

A couple more comments I accidentally left out of my last reply, below.

> >
> > > ---
> > >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> > >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> > >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> > >  arch/arm64/include/asm/processor.h            |  5 +++
> > >  arch/arm64/kernel/asm-offsets.c               |  1 +
> > >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> > >  include/uapi/linux/prctl.h                    |  3 ++
> > >  kernel/sys.c                                  |  8 +++++
> > >  8 files changed, 105 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > > index 30b2ab06526b..1f7e064deeb3 100644
> > > --- a/Documentation/arm64/pointer-authentication.rst
> > > +++ b/Documentation/arm64/pointer-authentication.rst
> > > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> > >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> > >  register. Any attempt to use the Pointer Authentication instructions will
> > >  result in an UNDEFINED exception being injected into the guest.
> > > +
> > > +
> > > +Enabling and disabling keys
> > > +---------------------------
> > > +
> > > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > > +PAC keys are enabled in a particular task. It takes two arguments, the
> > > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > > +and the second being a bitmask of the same bits specifying whether the key
> > > +should be enabled or disabled. For example::
> > > +
> > > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > > +        PR_PAC_APIBKEY, 0, 0);
> > > +
> > > +disables all keys except the IB key.
> > > +
> > > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > > +instructions to sign and authenticate function pointers and other pointers
> > > +exposed outside of the function, while still allowing binaries conforming to
> > > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > > +pointers.
> >
> > What actually breaks without this?
> >
> > Since the keys are all enabled by default, the only purpose of this
> > prctl seems to be to disable keys.  I'm not sure what this is giving us.
> 
> Yes, the purpose is to disable keys. Let's consider the function
> pointer signing userspace ABI use case. An example is Apple's arm64e
> ABI, and I have a prototype branch of LLVM [0] that implements a
> similar ABI in Linux userspace, based on Apple's implementation of
> their ABI.
> 
> Here's an example of a function that returns a function pointer, and a
> function that calls a function pointer of the same type:
> 
> static void f(void) {}
> 
> void *return_fp(void) {
>   return f;
> }
> 
> void call_fp(void (*p)(void)) {
>   p();
> }
> 
> If I compile this with my prototype compiler I get:
> 
> $ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
> -march=armv8.3a
> [...]
> return_fp:                              // @return_fp
> // %bb.0:                               // %entry
>         adrp    x16, f
>         add     x16, x16, :lo12:f
>         mov     x17, #16277
>         pacia   x16, x17
>         mov     x0, x16
>         ret
> [...]
> call_fp:                                // @call_fp
> // %bb.0:                               // %entry
>         mov     w1, #16277
>         braa    x0, x1
> [...]
> 
> In this code snippet the function pointer is signed with the IA key
> and discriminator 16277 before being returned. When the function is
> called, the pointer is first authenticated with the same key and
> discriminator.
> 
> Now imagine that this code lives in a shared library used both by
> programs that use the function pointer signing ABI and by legacy
> binaries (i.e. programs that use the existing ABI), and we have a
> legacy binary that calls return_fp. If the legacy binary then calls
> the function pointer returned by return_fp, that code will not
> authenticate the pointer before calling it, it will just use a br or
> blr instruction to call it directly, which will lead to a crash if the
> signature bits are set in the function pointer. In order to prevent
> the crash, we need a way to cause the pacia instruction in return_fp
> to become a no-op when running inside the process hosting the legacy
> binary, so that the signature bits will remain clear and the br or blr
> instruction in the legacy binary will successfully call the function
> f. That can be done by disabling the IA key, which is exactly what
> this prctl() lets us do.

Ack, I think there has been past discussion on this, but it's been
waiting for a usecase since it does impose a bit of extra overhead.

(I'd somehow assumes that you were actually wanting to get SIGILLs,
rather then compatibly executing some ptrauth insns as NOPs -- the
latter makes much more sense.)


Is this going to land for real, or is it just something people have
been experimenting with?

(I can guess the answer from the fact that you've proposed this patch
in the first place, but I have to ask...)

[...]

Cheers
---Dave

Peter Collingbourne Aug. 25, 2020, 12:23 a.m. UTC | #6

On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > > This prctl allows the user program to control which PAC keys are enabled
> > > > in a particular task. The main reason why this is useful is to enable a
> > > > userspace ABI that uses PAC to sign and authenticate function pointers
> > > > and other pointers exposed outside of the function, while still allowing
> > > > binaries conforming to the ABI to interoperate with legacy binaries that
> > > > do not sign or authenticate pointers.
> > > >
> > > > The idea is that a dynamic loader or early startup code would issue
> > > > this prctl very early after establishing that a process may load legacy
> > > > binaries, but before executing any PAC instructions.
> > >
> > > Apologies for the slow response on this, I'd had it on my list for a
> > > while...
> > >
> > > > ---
> > > >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> > > >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> > > >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> > > >  arch/arm64/include/asm/processor.h            |  5 +++
> > > >  arch/arm64/kernel/asm-offsets.c               |  1 +
> > > >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> > > >  include/uapi/linux/prctl.h                    |  3 ++
> > > >  kernel/sys.c                                  |  8 +++++
> > > >  8 files changed, 105 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > > > index 30b2ab06526b..1f7e064deeb3 100644
> > > > --- a/Documentation/arm64/pointer-authentication.rst
> > > > +++ b/Documentation/arm64/pointer-authentication.rst
> > > > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> > > >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> > > >  register. Any attempt to use the Pointer Authentication instructions will
> > > >  result in an UNDEFINED exception being injected into the guest.
> > > > +
> > > > +
> > > > +Enabling and disabling keys
> > > > +---------------------------
> > > > +
> > > > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > > > +PAC keys are enabled in a particular task. It takes two arguments, the
> > > > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > > > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > > > +and the second being a bitmask of the same bits specifying whether the key
> > > > +should be enabled or disabled. For example::
> > > > +
> > > > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > > > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > > > +        PR_PAC_APIBKEY, 0, 0);
> > > > +
> > > > +disables all keys except the IB key.
> > > > +
> > > > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > > > +instructions to sign and authenticate function pointers and other pointers
> > > > +exposed outside of the function, while still allowing binaries conforming to
> > > > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > > > +pointers.
> > >
> > > What actually breaks without this?
> > >
> > > Since the keys are all enabled by default, the only purpose of this
> > > prctl seems to be to disable keys.  I'm not sure what this is giving us.
> >
> > Yes, the purpose is to disable keys. Let's consider the function
> > pointer signing userspace ABI use case. An example is Apple's arm64e
> > ABI, and I have a prototype branch of LLVM [0] that implements a
> > similar ABI in Linux userspace, based on Apple's implementation of
> > their ABI.
> >
> > Here's an example of a function that returns a function pointer, and a
> > function that calls a function pointer of the same type:
> >
> > static void f(void) {}
> >
> > void *return_fp(void) {
> >   return f;
> > }
> >
> > void call_fp(void (*p)(void)) {
> >   p();
> > }
> >
> > If I compile this with my prototype compiler I get:
> >
> > $ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
> > -march=armv8.3a
> > [...]
> > return_fp:                              // @return_fp
> > // %bb.0:                               // %entry
> >         adrp    x16, f
> >         add     x16, x16, :lo12:f
> >         mov     x17, #16277
> >         pacia   x16, x17
> >         mov     x0, x16
> >         ret
> > [...]
> > call_fp:                                // @call_fp
> > // %bb.0:                               // %entry
> >         mov     w1, #16277
> >         braa    x0, x1
> > [...]
> >
> > In this code snippet the function pointer is signed with the IA key
> > and discriminator 16277 before being returned. When the function is
> > called, the pointer is first authenticated with the same key and
> > discriminator.
> >
> > Now imagine that this code lives in a shared library used both by
> > programs that use the function pointer signing ABI and by legacy
> > binaries (i.e. programs that use the existing ABI), and we have a
> > legacy binary that calls return_fp. If the legacy binary then calls
> > the function pointer returned by return_fp, that code will not
> > authenticate the pointer before calling it, it will just use a br or
> > blr instruction to call it directly, which will lead to a crash if the
> > signature bits are set in the function pointer. In order to prevent
> > the crash, we need a way to cause the pacia instruction in return_fp
> > to become a no-op when running inside the process hosting the legacy
> > binary, so that the signature bits will remain clear and the br or blr
> > instruction in the legacy binary will successfully call the function
> > f. That can be done by disabling the IA key, which is exactly what
> > this prctl() lets us do.
>
> Ack, I think there has been past discussion on this, but it's been
> waiting for a usecase since it does impose a bit of extra overhead.
>
> (I'd somehow assumes that you were actually wanting to get SIGILLs,
> rather then compatibly executing some ptrauth insns as NOPs -- the
> latter makes much more sense.)
>
>
> Is this going to land for real, or is it just something people have
> been experimenting with?
>
> (I can guess the answer from the fact that you've proposed this patch
> in the first place, but I have to ask...)

It's something that we're considering for a future revision of the
Android ABI, and something that I'm personally interested in having,
but no final decision has been made.

One problem is that by the time we're ready to make a decision, it'll
probably be far too late to make a start on getting the necessary
kernel support added, hence the desire to get started on this early.

> >
> > >
> > > > +
> > > > +The idea is that a dynamic loader or early startup code would issue this
> > > > +prctl very early after establishing that a process may load legacy binaries,
> > > > +but before executing any PAC instructions.
> > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > +
> > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > >
> > > I wonder whether it would make sense to simple store the thread's base
> > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > bits separately.  There may be reasons outside this snippet why this
> > > isn't such a good idea though -- I haven't gone digging so far.
> >
> > As far as I know (as I learned [1] from the MTE patch series), it can
> > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > keys. Detecting the "process has disabled keys" case is quite simple
> > if we only store the disabled keys mask here, not so much if we store
> > the full value of SCTLR_EL1.
>
> My thought was that we would still only write SCTLR_EL1 if needed, but
> we would do the write-if-needed across the whole register in one go.
> This would be easier to extend if we have to twiddle additional
> SCTLR_EL1 bits in the future.  If the key enable bits are the only
> affected bits for now then we could of course defer this factoring until
> later.  (I vaguely remember a similar discussion in the past, but
> possibly it was about the pauth keys anyway, rather than something
> else.)

We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
potentially combine the two, so that both
prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
up affecting a common SCTLR_EL1 field in the task_struct, and we do:

On kernel entry:
- read SCTLR_EL1 from task_struct (or from the system register if we
decide that's cheap enough)
- if EnIA clear, set it, and write the value to the system register.

On kernel exit:
- read system register SCTLR_EL1
- read SCTLR_EL1 from task_struct
- if they are different, write task's SCTLR_EL1 to the system register.

But this would require at least an unconditional read of SCTLR_EL1 per
kernel exit. The comment that I referenced says that "accesses" to the
register are expensive. I was operating under the assumption that both
reads and writes are expensive, but if it's actually only writes that
are expensive, we may be able to get away with the above.

I could try to measure the cost of an unconditional read from
SCTLR_EL1 on kernel exit on the hardware that I have access to, but I
don't know if that's going to be representative of all hardware,
especially the future hardware that will support PAC and MTE.

> In a case like this, we'll still get overheads if there are a mixture of
> tasks contending for the CPU, that have different key enable settings.
> But I can't see much that we can do about that.  If userspace is mostly
> built with the same options (true for the Apple case I guess) then I
> guess we shouldn't need SCTLR_EL1 rewrites very often just for this.  In
> other environments it may be messier.

Yes, I don't think we can avoid the write when switching between tasks
with different key enabled settings. If userspace arranges to use IA
for return addresses (matching the kernel) and IB for function
pointers, there would be no need to disable IA for compatibility with
legacy binaries, and so the kernel would not require a write on entry,
only on exit when transitioning between different settings. The effect
is that disabling IA would be more expensive than disabling the other
keys.

> >
> > > > +
> > > > +     mrs_s   \tmp3, SYS_SCTLR_EL1
> > > > +     bic     \tmp3, \tmp3, \tmp2
> > > > +     msr_s   SYS_SCTLR_EL1, \tmp3
> > > > +
> > > >  .Laddr_auth_skip_\@:
> > > >  alternative_if ARM64_HAS_GENERIC_AUTH
> > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> > > > @@ -45,6 +53,17 @@ alternative_else_nop_endif
> > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
> > > >       msr_s   SYS_APIAKEYLO_EL1, \tmp2
> > > >       msr_s   SYS_APIAKEYHI_EL1, \tmp3
> > > > +
> > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > +     cbz     \tmp2, .Lset_sctlr_skip_\@
> > > > +
> > > > +     mrs_s   \tmp1, SYS_SCTLR_EL1
> > > > +     mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
> > >
> > > (Nit: harmless but unnecessary ().  # is not an operator as such, just
> > > random syntax.  Whatever follows is greedily parsed as an immediate
> > > expression.)
> >
> > Okay. While looking around in the kernel I noticed that there is a
> > mov_q macro that can be used to avoid manually splitting the constant
> > into 16-bit chunks, and apparently it doesn't require a #. I'll use it
> > in v2.
> >
> > > > +     movk    \tmp2, #SCTLR_ELx_ENDB
> > >
> > > Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
> > > ENxx bits unconditionally?  I may be missing something here.
> >
> > This code is to support the case where we are returning to the kernel
> > from a userspace task with keys disabled. The kernel needs at least
> > the IA key enabled in order for its own use of reverse-edge PAC to
> > work correctly. When returning from a userspace task with no keys
> > disabled, the keys enabled bits already have the correct values, so
> > there is nothing to be done (and as mentioned above, I avoid touching
> > SCTLR_EL1 unless necessary because it is apparently expensive to do
> > so). But in a process with keys disabled, we will need to re-enable at
> > least IA.
> >
> > We may be able to get away with just enabling IA here, but that would
> > break the invariant that all keys are enabled inside the kernel, which
> > is relied on by the code that decides whether to access SCTLR_EL1 in
> > order to disable keys when entering a userspace task.
>
> OK, I think I just confused myself here: we are not setting the key
> enables for userspace, but for the kernel, and we only need to do that
> if the user task had some keys disabled in the first place.
>
> >
> > > > +     orr     \tmp1, \tmp1, \tmp2
> > > > +     msr_s   SYS_SCTLR_EL1, \tmp1
> > > > +
> > > > +.Lset_sctlr_skip_\@:
> > > >       .endm
> > > >
> > > >       .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> > > > diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> > > > index c6b4f0603024..d4c375454a36 100644
> > > > --- a/arch/arm64/include/asm/pointer_auth.h
> > > > +++ b/arch/arm64/include/asm/pointer_auth.h
> > > > @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
> > > >  }
> > > >
> > > >  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> > > > +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> > > > +                                       unsigned long keys,
> > > > +                                       unsigned long enabled);
> > > >
> > > >  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > >  {
> > > >       return ptrauth_clear_pac(ptr);
> > > >  }
> > > >
> > > > -#define ptrauth_thread_init_user(tsk)                                        \
> > > > -     ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> > > > +#define ptrauth_thread_init_user(tsk) do {                           \
> > > > +             ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> > > > +             (tsk)->thread.sctlr_enxx_mask = 0;                      \
> > > > +     } while (0)
> > > >  #define ptrauth_thread_init_kernel(tsk)                                      \
> > > >       ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
> > > >  #define ptrauth_thread_switch_kernel(tsk)                            \
> > > > @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > >
> > > >  #else /* CONFIG_ARM64_PTR_AUTH */
> > > >  #define ptrauth_prctl_reset_keys(tsk, arg)   (-EINVAL)
> > > > +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)   (-EINVAL)
> > > >  #define ptrauth_strip_insn_pac(lr)   (lr)
> > > >  #define ptrauth_thread_init_user(tsk)
> > > >  #define ptrauth_thread_init_kernel(tsk)
> > > > diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> > > > index 240fe5e5b720..6974d227b01f 100644
> > > > --- a/arch/arm64/include/asm/processor.h
> > > > +++ b/arch/arm64/include/asm/processor.h
> > > > @@ -150,6 +150,7 @@ struct thread_struct {
> > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > >       struct ptrauth_keys_user        keys_user;
> > > >       struct ptrauth_keys_kernel      keys_kernel;
> > > > +     u64                             sctlr_enxx_mask;
> > > >  #endif
> > > >  };
> > > >
> > > > @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
> > > >  /* PR_PAC_RESET_KEYS prctl */
> > > >  #define PAC_RESET_KEYS(tsk, arg)     ptrauth_prctl_reset_keys(tsk, arg)
> > > >
> > > > +/* PR_PAC_SET_ENABLED_KEYS prctl */
> > > > +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                             \
> > > > +     ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> > > > +
> > > >  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> > > >  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
> > > >  long set_tagged_addr_ctrl(unsigned long arg);
> > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > index 0577e2142284..dac80e16fe35 100644
> > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > @@ -47,6 +47,7 @@ int main(void)
> > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > >    DEFINE(THREAD_KEYS_USER,   offsetof(struct task_struct, thread.keys_user));
> > > >    DEFINE(THREAD_KEYS_KERNEL, offsetof(struct task_struct, thread.keys_kernel));
> > > > +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
> > > >  #endif
> > > >    BLANK();
> > > >    DEFINE(S_X0,                       offsetof(struct pt_regs, regs[0]));
> > > > diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> > > > index 1e77736a4f66..8c385b7f324a 100644
> > > > --- a/arch/arm64/kernel/pointer_auth.c
> > > > +++ b/arch/arm64/kernel/pointer_auth.c
> > > > @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
> > > >
> > > >       return 0;
> > > >  }
> > > > +
> > > > +static u64 arg_to_enxx_mask(unsigned long arg)
> > > > +{
> > > > +     u64 sctlr_enxx_mask = 0;
> > > > +     if (arg & PR_PAC_APIAKEY)
> > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> > > > +     if (arg & PR_PAC_APIBKEY)
> > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> > > > +     if (arg & PR_PAC_APDAKEY)
> > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> > > > +     if (arg & PR_PAC_APDBKEY)
> > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> > > > +     return sctlr_enxx_mask;
> > > > +}
> > > > +
> > > > +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> > > > +                                unsigned long enabled)
> > > > +{
> > > > +     u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> > > > +     unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> > > > +                                   PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> > > > +
> > > > +     if (!system_supports_address_auth())
> > > > +             return -EINVAL;
> > > > +
> > > > +     if ((keys & ~addr_key_mask) || (enabled & ~keys))
> > > > +             return -EINVAL;
> > >
> > > Should we take the types of authentication supported?
> > >
> > > I don't recall whether we expose ptrauth to userspace if only
> > > instruction authentication or only data authentication is supported.  If
> > > so, should we reject attempts to configure unsupported keys here?
> > >
> > > We should probably try to do a consistent thing both here and in
> > > PR_PAC_RESET_KEYS if so.
> >
> > As far as I know, there is nothing in the architecture that would
> > allow it to only advertise support for I keys or only advertise
> > support for D keys. The fields AA64ISAR1_EL1.AP[AI] apply to all four
> > keys: DA, DB, IA and IB. Maybe you are thinking of the GA key versus
> > the other keys (which is advertised separately via
> > AA64ISAR1_EL1.GP[AI])? The architecture appears to provide no way to
> > disable the GA key, so I did not include support for it here.
>
> I think I'm confusing myself here.  Yes, support for generic auth is
> the (supposedly) architecturally orthogonal to address auth, but data
> and instruction address auth are either both supported, or both not
> supported -- so your code looks correct.
>
> >
> > > > +
> > > > +     sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> > > > +     sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> > > > +
> > > > +     tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> > > > +     return 0;
> > >
> > > Do we need a way to query the enabled keys?
> > >
> > > We could either have a _GET_ prctl (the common approach), or have this
> > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > weirder).
> > >
> > > As above, we might
> > >
> > > Things like CRIU may need a GET in order to save/restore this setting
> > > properly.
> >
> > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > But it would need to be defined carefully because CRIU would
> > presumably need to know what value to pass as the "keys" argument when
> > it calls SET to restore the state. It can't just hardcode a value
> > because that may harm extensibility if new keys are added.
> >
> > If we require the kernel to start processes with all keys enabled
> > (including any keys that we may introduce in the future), then it
> > seems like if CRIU knows which keys were disabled, then it can restore
> > the state by issuing a syscall like this:
> >
> > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> >
> > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > saving the state to prepare the disabled_keys_mask argument passed
> > here. Then for consistency we can make the SET prctl() be
> > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> >
> > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> >
> > Does that make sense?
>
> Possibly, though it's nicer if the GET returns the same value suppiled
> to the SET, rather than the complement of it.

Maybe you misread my proposal? The part about the complement is
addressed by the sentence beginning "Then for consistency..." So we
would end up with PR_PAC_SET_DISABLED_KEYS and
PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
*disabled* keys.

> If SET ignores set bits in arg3 that don't correspond to set bits in the
> mask arg2, then
>
>         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
>
>         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
>
> should work.

I'd prefer not to allow the SET prctl to accept all-ones as the first
argument, as this could lead to a scenario where sometimes people pass
all-ones as the first argument and sometimes they don't, which could
make it hard to add new keys in the future (since the addition of a
new key could cause existing callers to affect the new key
arbitrarily). Instead, the first argument should be required to
specify only known keys, in order to enforce that the caller is
explicit about which keys they intend to affect.

> There's a final option, which is to expose this config through ptrace
> instead for save/restore purposes.  From previous discussions with the
> CRIU folks, I recall that they are trying to move away from doing setup
> from within the new process where possible.
>
> There's no reason not to have both though -- there's precedent for that,
> such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> in a similar direction too IIUC.
>
> Having a GET remains useful for in-process debugging and diagnostics,
> and it's extremely straightforward to add in the kernel.  So from my
> side I'd vote to have it anyway...

Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
well, and I suppose that I could add a
NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
userspace applications (and it doesn't look like ptrace APIs such as
NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
need to be made aware of new keys anyway), we might as well just have
all the APIs deal with the set of enabled keys.

And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
GET prctl, since it's just more uapi surface that needs to be
maintained forever. For in-process use cases it seems redundant with
e.g. issuing a pacia instruction and checking whether it is a no-op,
and from a consistency viewpoint I suppose it could be argued that
it's like PR_PAC_RESET_KEYS not having a prctl counterpart to get the
keys.

Peter

Dave Martin Aug. 26, 2020, 4:37 p.m. UTC | #7

On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > >
> > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > > > This prctl allows the user program to control which PAC keys are enabled
> > > > > in a particular task. The main reason why this is useful is to enable a
> > > > > userspace ABI that uses PAC to sign and authenticate function pointers
> > > > > and other pointers exposed outside of the function, while still allowing
> > > > > binaries conforming to the ABI to interoperate with legacy binaries that
> > > > > do not sign or authenticate pointers.
> > > > >
> > > > > The idea is that a dynamic loader or early startup code would issue
> > > > > this prctl very early after establishing that a process may load legacy
> > > > > binaries, but before executing any PAC instructions.
> > > >
> > > > Apologies for the slow response on this, I'd had it on my list for a
> > > > while...
> > > >
> > > > > ---
> > > > >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> > > > >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> > > > >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> > > > >  arch/arm64/include/asm/processor.h            |  5 +++
> > > > >  arch/arm64/kernel/asm-offsets.c               |  1 +
> > > > >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> > > > >  include/uapi/linux/prctl.h                    |  3 ++
> > > > >  kernel/sys.c                                  |  8 +++++
> > > > >  8 files changed, 105 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > > > > index 30b2ab06526b..1f7e064deeb3 100644
> > > > > --- a/Documentation/arm64/pointer-authentication.rst
> > > > > +++ b/Documentation/arm64/pointer-authentication.rst
> > > > > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> > > > >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> > > > >  register. Any attempt to use the Pointer Authentication instructions will
> > > > >  result in an UNDEFINED exception being injected into the guest.
> > > > > +
> > > > > +
> > > > > +Enabling and disabling keys
> > > > > +---------------------------
> > > > > +
> > > > > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > > > > +PAC keys are enabled in a particular task. It takes two arguments, the
> > > > > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > > > > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > > > > +and the second being a bitmask of the same bits specifying whether the key
> > > > > +should be enabled or disabled. For example::
> > > > > +
> > > > > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > > > > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > > > > +        PR_PAC_APIBKEY, 0, 0);
> > > > > +
> > > > > +disables all keys except the IB key.
> > > > > +
> > > > > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > > > > +instructions to sign and authenticate function pointers and other pointers
> > > > > +exposed outside of the function, while still allowing binaries conforming to
> > > > > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > > > > +pointers.
> > > >
> > > > What actually breaks without this?
> > > >
> > > > Since the keys are all enabled by default, the only purpose of this
> > > > prctl seems to be to disable keys.  I'm not sure what this is giving us.
> > >
> > > Yes, the purpose is to disable keys. Let's consider the function
> > > pointer signing userspace ABI use case. An example is Apple's arm64e
> > > ABI, and I have a prototype branch of LLVM [0] that implements a
> > > similar ABI in Linux userspace, based on Apple's implementation of
> > > their ABI.
> > >
> > > Here's an example of a function that returns a function pointer, and a
> > > function that calls a function pointer of the same type:
> > >
> > > static void f(void) {}
> > >
> > > void *return_fp(void) {
> > >   return f;
> > > }
> > >
> > > void call_fp(void (*p)(void)) {
> > >   p();
> > > }
> > >
> > > If I compile this with my prototype compiler I get:
> > >
> > > $ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
> > > -march=armv8.3a
> > > [...]
> > > return_fp:                              // @return_fp
> > > // %bb.0:                               // %entry
> > >         adrp    x16, f
> > >         add     x16, x16, :lo12:f
> > >         mov     x17, #16277
> > >         pacia   x16, x17
> > >         mov     x0, x16
> > >         ret
> > > [...]
> > > call_fp:                                // @call_fp
> > > // %bb.0:                               // %entry
> > >         mov     w1, #16277
> > >         braa    x0, x1
> > > [...]
> > >
> > > In this code snippet the function pointer is signed with the IA key
> > > and discriminator 16277 before being returned. When the function is
> > > called, the pointer is first authenticated with the same key and
> > > discriminator.
> > >
> > > Now imagine that this code lives in a shared library used both by
> > > programs that use the function pointer signing ABI and by legacy
> > > binaries (i.e. programs that use the existing ABI), and we have a
> > > legacy binary that calls return_fp. If the legacy binary then calls
> > > the function pointer returned by return_fp, that code will not
> > > authenticate the pointer before calling it, it will just use a br or
> > > blr instruction to call it directly, which will lead to a crash if the
> > > signature bits are set in the function pointer. In order to prevent
> > > the crash, we need a way to cause the pacia instruction in return_fp
> > > to become a no-op when running inside the process hosting the legacy
> > > binary, so that the signature bits will remain clear and the br or blr
> > > instruction in the legacy binary will successfully call the function
> > > f. That can be done by disabling the IA key, which is exactly what
> > > this prctl() lets us do.
> >
> > Ack, I think there has been past discussion on this, but it's been
> > waiting for a usecase since it does impose a bit of extra overhead.
> >
> > (I'd somehow assumes that you were actually wanting to get SIGILLs,
> > rather then compatibly executing some ptrauth insns as NOPs -- the
> > latter makes much more sense.)
> >
> >
> > Is this going to land for real, or is it just something people have
> > been experimenting with?
> >
> > (I can guess the answer from the fact that you've proposed this patch
> > in the first place, but I have to ask...)
> 
> It's something that we're considering for a future revision of the
> Android ABI, and something that I'm personally interested in having,
> but no final decision has been made.
> 
> One problem is that by the time we're ready to make a decision, it'll
> probably be far too late to make a start on getting the necessary
> kernel support added, hence the desire to get started on this early.

Fair enough.  It's far from being pure fantasy at least, and the use
case seems sufficiently real.

> 
> > >
> > > >
> > > > > +
> > > > > +The idea is that a dynamic loader or early startup code would issue this
> > > > > +prctl very early after establishing that a process may load legacy binaries,
> > > > > +but before executing any PAC instructions.
> > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > +
> > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > >
> > > > I wonder whether it would make sense to simple store the thread's base
> > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > bits separately.  There may be reasons outside this snippet why this
> > > > isn't such a good idea though -- I haven't gone digging so far.
> > >
> > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > keys. Detecting the "process has disabled keys" case is quite simple
> > > if we only store the disabled keys mask here, not so much if we store
> > > the full value of SCTLR_EL1.
> >
> > My thought was that we would still only write SCTLR_EL1 if needed, but
> > we would do the write-if-needed across the whole register in one go.
> > This would be easier to extend if we have to twiddle additional
> > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > affected bits for now then we could of course defer this factoring until
> > later.  (I vaguely remember a similar discussion in the past, but
> > possibly it was about the pauth keys anyway, rather than something
> > else.)
> 
> We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> potentially combine the two, so that both
> prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> 
> On kernel entry:
> - read SCTLR_EL1 from task_struct (or from the system register if we
> decide that's cheap enough)
> - if EnIA clear, set it, and write the value to the system register.
> 
> On kernel exit:
> - read system register SCTLR_EL1
> - read SCTLR_EL1 from task_struct
> - if they are different, write task's SCTLR_EL1 to the system register.
> 
> But this would require at least an unconditional read of SCTLR_EL1 per
> kernel exit. The comment that I referenced says that "accesses" to the
> register are expensive. I was operating under the assumption that both
> reads and writes are expensive, but if it's actually only writes that
> are expensive, we may be able to get away with the above.

The kernel is in full control over what's in SCTLR_EL1, so perhaps we
could cache what we wrote percpu, if the overhead of reading it is
problematic.

> I could try to measure the cost of an unconditional read from
> SCTLR_EL1 on kernel exit on the hardware that I have access to, but I
> don't know if that's going to be representative of all hardware,
> especially the future hardware that will support PAC and MTE.
> 
> > In a case like this, we'll still get overheads if there are a mixture of
> > tasks contending for the CPU, that have different key enable settings.
> > But I can't see much that we can do about that.  If userspace is mostly
> > built with the same options (true for the Apple case I guess) then I
> > guess we shouldn't need SCTLR_EL1 rewrites very often just for this.  In
> > other environments it may be messier.
> 
> Yes, I don't think we can avoid the write when switching between tasks
> with different key enabled settings. If userspace arranges to use IA
> for return addresses (matching the kernel) and IB for function
> pointers, there would be no need to disable IA for compatibility with
> legacy binaries, and so the kernel would not require a write on entry,
> only on exit when transitioning between different settings. The effect
> is that disabling IA would be more expensive than disabling the other
> keys.
> 
> > >
> > > > > +
> > > > > +     mrs_s   \tmp3, SYS_SCTLR_EL1
> > > > > +     bic     \tmp3, \tmp3, \tmp2
> > > > > +     msr_s   SYS_SCTLR_EL1, \tmp3
> > > > > +
> > > > >  .Laddr_auth_skip_\@:
> > > > >  alternative_if ARM64_HAS_GENERIC_AUTH
> > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> > > > > @@ -45,6 +53,17 @@ alternative_else_nop_endif
> > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
> > > > >       msr_s   SYS_APIAKEYLO_EL1, \tmp2
> > > > >       msr_s   SYS_APIAKEYHI_EL1, \tmp3
> > > > > +
> > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > +     cbz     \tmp2, .Lset_sctlr_skip_\@
> > > > > +
> > > > > +     mrs_s   \tmp1, SYS_SCTLR_EL1
> > > > > +     mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
> > > >
> > > > (Nit: harmless but unnecessary ().  # is not an operator as such, just
> > > > random syntax.  Whatever follows is greedily parsed as an immediate
> > > > expression.)
> > >
> > > Okay. While looking around in the kernel I noticed that there is a
> > > mov_q macro that can be used to avoid manually splitting the constant
> > > into 16-bit chunks, and apparently it doesn't require a #. I'll use it
> > > in v2.
> > >
> > > > > +     movk    \tmp2, #SCTLR_ELx_ENDB
> > > >
> > > > Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
> > > > ENxx bits unconditionally?  I may be missing something here.
> > >
> > > This code is to support the case where we are returning to the kernel
> > > from a userspace task with keys disabled. The kernel needs at least
> > > the IA key enabled in order for its own use of reverse-edge PAC to
> > > work correctly. When returning from a userspace task with no keys
> > > disabled, the keys enabled bits already have the correct values, so
> > > there is nothing to be done (and as mentioned above, I avoid touching
> > > SCTLR_EL1 unless necessary because it is apparently expensive to do
> > > so). But in a process with keys disabled, we will need to re-enable at
> > > least IA.
> > >
> > > We may be able to get away with just enabling IA here, but that would
> > > break the invariant that all keys are enabled inside the kernel, which
> > > is relied on by the code that decides whether to access SCTLR_EL1 in
> > > order to disable keys when entering a userspace task.
> >
> > OK, I think I just confused myself here: we are not setting the key
> > enables for userspace, but for the kernel, and we only need to do that
> > if the user task had some keys disabled in the first place.
> >
> > >
> > > > > +     orr     \tmp1, \tmp1, \tmp2
> > > > > +     msr_s   SYS_SCTLR_EL1, \tmp1
> > > > > +
> > > > > +.Lset_sctlr_skip_\@:
> > > > >       .endm
> > > > >
> > > > >       .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> > > > > diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> > > > > index c6b4f0603024..d4c375454a36 100644
> > > > > --- a/arch/arm64/include/asm/pointer_auth.h
> > > > > +++ b/arch/arm64/include/asm/pointer_auth.h
> > > > > @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
> > > > >  }
> > > > >
> > > > >  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> > > > > +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> > > > > +                                       unsigned long keys,
> > > > > +                                       unsigned long enabled);
> > > > >
> > > > >  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > > >  {
> > > > >       return ptrauth_clear_pac(ptr);
> > > > >  }
> > > > >
> > > > > -#define ptrauth_thread_init_user(tsk)                                        \
> > > > > -     ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> > > > > +#define ptrauth_thread_init_user(tsk) do {                           \
> > > > > +             ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> > > > > +             (tsk)->thread.sctlr_enxx_mask = 0;                      \
> > > > > +     } while (0)
> > > > >  #define ptrauth_thread_init_kernel(tsk)                                      \
> > > > >       ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
> > > > >  #define ptrauth_thread_switch_kernel(tsk)                            \
> > > > > @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > > >
> > > > >  #else /* CONFIG_ARM64_PTR_AUTH */
> > > > >  #define ptrauth_prctl_reset_keys(tsk, arg)   (-EINVAL)
> > > > > +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)   (-EINVAL)
> > > > >  #define ptrauth_strip_insn_pac(lr)   (lr)
> > > > >  #define ptrauth_thread_init_user(tsk)
> > > > >  #define ptrauth_thread_init_kernel(tsk)
> > > > > diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> > > > > index 240fe5e5b720..6974d227b01f 100644
> > > > > --- a/arch/arm64/include/asm/processor.h
> > > > > +++ b/arch/arm64/include/asm/processor.h
> > > > > @@ -150,6 +150,7 @@ struct thread_struct {
> > > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > > >       struct ptrauth_keys_user        keys_user;
> > > > >       struct ptrauth_keys_kernel      keys_kernel;
> > > > > +     u64                             sctlr_enxx_mask;
> > > > >  #endif
> > > > >  };
> > > > >
> > > > > @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
> > > > >  /* PR_PAC_RESET_KEYS prctl */
> > > > >  #define PAC_RESET_KEYS(tsk, arg)     ptrauth_prctl_reset_keys(tsk, arg)
> > > > >
> > > > > +/* PR_PAC_SET_ENABLED_KEYS prctl */
> > > > > +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                             \
> > > > > +     ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> > > > > +
> > > > >  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> > > > >  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
> > > > >  long set_tagged_addr_ctrl(unsigned long arg);
> > > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > > index 0577e2142284..dac80e16fe35 100644
> > > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > > @@ -47,6 +47,7 @@ int main(void)
> > > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > > >    DEFINE(THREAD_KEYS_USER,   offsetof(struct task_struct, thread.keys_user));
> > > > >    DEFINE(THREAD_KEYS_KERNEL, offsetof(struct task_struct, thread.keys_kernel));
> > > > > +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
> > > > >  #endif
> > > > >    BLANK();
> > > > >    DEFINE(S_X0,                       offsetof(struct pt_regs, regs[0]));
> > > > > diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> > > > > index 1e77736a4f66..8c385b7f324a 100644
> > > > > --- a/arch/arm64/kernel/pointer_auth.c
> > > > > +++ b/arch/arm64/kernel/pointer_auth.c
> > > > > @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
> > > > >
> > > > >       return 0;
> > > > >  }
> > > > > +
> > > > > +static u64 arg_to_enxx_mask(unsigned long arg)
> > > > > +{
> > > > > +     u64 sctlr_enxx_mask = 0;
> > > > > +     if (arg & PR_PAC_APIAKEY)
> > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> > > > > +     if (arg & PR_PAC_APIBKEY)
> > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> > > > > +     if (arg & PR_PAC_APDAKEY)
> > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> > > > > +     if (arg & PR_PAC_APDBKEY)
> > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> > > > > +     return sctlr_enxx_mask;
> > > > > +}
> > > > > +
> > > > > +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> > > > > +                                unsigned long enabled)
> > > > > +{
> > > > > +     u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> > > > > +     unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> > > > > +                                   PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> > > > > +
> > > > > +     if (!system_supports_address_auth())
> > > > > +             return -EINVAL;
> > > > > +
> > > > > +     if ((keys & ~addr_key_mask) || (enabled & ~keys))
> > > > > +             return -EINVAL;
> > > >
> > > > Should we take the types of authentication supported?
> > > >
> > > > I don't recall whether we expose ptrauth to userspace if only
> > > > instruction authentication or only data authentication is supported.  If
> > > > so, should we reject attempts to configure unsupported keys here?
> > > >
> > > > We should probably try to do a consistent thing both here and in
> > > > PR_PAC_RESET_KEYS if so.
> > >
> > > As far as I know, there is nothing in the architecture that would
> > > allow it to only advertise support for I keys or only advertise
> > > support for D keys. The fields AA64ISAR1_EL1.AP[AI] apply to all four
> > > keys: DA, DB, IA and IB. Maybe you are thinking of the GA key versus
> > > the other keys (which is advertised separately via
> > > AA64ISAR1_EL1.GP[AI])? The architecture appears to provide no way to
> > > disable the GA key, so I did not include support for it here.
> >
> > I think I'm confusing myself here.  Yes, support for generic auth is
> > the (supposedly) architecturally orthogonal to address auth, but data
> > and instruction address auth are either both supported, or both not
> > supported -- so your code looks correct.
> >
> > >
> > > > > +
> > > > > +     sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> > > > > +     sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> > > > > +
> > > > > +     tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> > > > > +     return 0;
> > > >
> > > > Do we need a way to query the enabled keys?
> > > >
> > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > weirder).
> > > >
> > > > As above, we might
> > > >
> > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > properly.
> > >
> > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > But it would need to be defined carefully because CRIU would
> > > presumably need to know what value to pass as the "keys" argument when
> > > it calls SET to restore the state. It can't just hardcode a value
> > > because that may harm extensibility if new keys are added.
> > >
> > > If we require the kernel to start processes with all keys enabled
> > > (including any keys that we may introduce in the future), then it
> > > seems like if CRIU knows which keys were disabled, then it can restore
> > > the state by issuing a syscall like this:
> > >
> > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > >
> > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > saving the state to prepare the disabled_keys_mask argument passed
> > > here. Then for consistency we can make the SET prctl() be
> > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > >
> > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > >
> > > Does that make sense?
> >
> > Possibly, though it's nicer if the GET returns the same value suppiled
> > to the SET, rather than the complement of it.
> 
> Maybe you misread my proposal? The part about the complement is
> addressed by the sentence beginning "Then for consistency..." So we
> would end up with PR_PAC_SET_DISABLED_KEYS and
> PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> *disabled* keys.

Oh, I see.  Yes, while I prefer the two are consistent with each other,
we can flip the sense of both here if desired.

> 
> > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > mask arg2, then
> >
> >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> >
> >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> >
> > should work.
> 
> I'd prefer not to allow the SET prctl to accept all-ones as the first
> argument, as this could lead to a scenario where sometimes people pass
> all-ones as the first argument and sometimes they don't, which could
> make it hard to add new keys in the future (since the addition of a
> new key could cause existing callers to affect the new key
> arbitrarily). Instead, the first argument should be required to
> specify only known keys, in order to enforce that the caller is
> explicit about which keys they intend to affect.

That sounds fair.

> > There's a final option, which is to expose this config through ptrace
> > instead for save/restore purposes.  From previous discussions with the
> > CRIU folks, I recall that they are trying to move away from doing setup
> > from within the new process where possible.
> >
> > There's no reason not to have both though -- there's precedent for that,
> > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > in a similar direction too IIUC.
> >
> > Having a GET remains useful for in-process debugging and diagnostics,
> > and it's extremely straightforward to add in the kernel.  So from my
> > side I'd vote to have it anyway...
> 
> Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> well, and I suppose that I could add a
> NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> userspace applications (and it doesn't look like ptrace APIs such as
> NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> need to be made aware of new keys anyway), we might as well just have
> all the APIs deal with the set of enabled keys.
> 
> And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> GET prctl, since it's just more uapi surface that needs to be
> maintained forever. For in-process use cases it seems redundant with
> e.g. issuing a pacia instruction and checking whether it is a no-op,

Probing instructions to see whether they generate signals is a cardinal
sin and we make huge efforts to ensure that userspace doesn't need to do
that.  As for the no-op check, I suppose that works, but it's nasty if
the generic "check all the required featues" boilerplate on program
entry relies on intrinsics or can't be written in C at all.  This is
just a way of discouraging people from checking at all IMHO.


The implementation of that prctl would most likely be a one-line wrapper
that calls the same internal function used to populate the regset for
ptrace, so I think you might be worrying a bit too much about creating
ABI here.  This is more providing a second interface for same thing,
because no single interface is usable in all situations.

> and from a consistency viewpoint I suppose it could be argued that
> it's like PR_PAC_RESET_KEYS not having a prctl counterpart to get the
> keys.

That's somewhat different.  By definition a process isn't supposed to
be able to read its own keys, so this is a write-only operation.

[...]

Cheers
---Dave

Peter Collingbourne Aug. 27, 2020, 2:55 a.m. UTC | #8

On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > >
> > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > > > > This prctl allows the user program to control which PAC keys are enabled
> > > > > > in a particular task. The main reason why this is useful is to enable a
> > > > > > userspace ABI that uses PAC to sign and authenticate function pointers
> > > > > > and other pointers exposed outside of the function, while still allowing
> > > > > > binaries conforming to the ABI to interoperate with legacy binaries that
> > > > > > do not sign or authenticate pointers.
> > > > > >
> > > > > > The idea is that a dynamic loader or early startup code would issue
> > > > > > this prctl very early after establishing that a process may load legacy
> > > > > > binaries, but before executing any PAC instructions.
> > > > >
> > > > > Apologies for the slow response on this, I'd had it on my list for a
> > > > > while...
> > > > >
> > > > > > ---
> > > > > >  .../arm64/pointer-authentication.rst          | 27 +++++++++++++++
> > > > > >  arch/arm64/include/asm/asm_pointer_auth.h     | 19 +++++++++++
> > > > > >  arch/arm64/include/asm/pointer_auth.h         | 10 ++++--
> > > > > >  arch/arm64/include/asm/processor.h            |  5 +++
> > > > > >  arch/arm64/kernel/asm-offsets.c               |  1 +
> > > > > >  arch/arm64/kernel/pointer_auth.c              | 34 +++++++++++++++++++
> > > > > >  include/uapi/linux/prctl.h                    |  3 ++
> > > > > >  kernel/sys.c                                  |  8 +++++
> > > > > >  8 files changed, 105 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
> > > > > > index 30b2ab06526b..1f7e064deeb3 100644
> > > > > > --- a/Documentation/arm64/pointer-authentication.rst
> > > > > > +++ b/Documentation/arm64/pointer-authentication.rst
> > > > > > @@ -107,3 +107,30 @@ filter out the Pointer Authentication system key registers from
> > > > > >  KVM_GET/SET_REG_* ioctls and mask those features from cpufeature ID
> > > > > >  register. Any attempt to use the Pointer Authentication instructions will
> > > > > >  result in an UNDEFINED exception being injected into the guest.
> > > > > > +
> > > > > > +
> > > > > > +Enabling and disabling keys
> > > > > > +---------------------------
> > > > > > +
> > > > > > +The prctl PR_PAC_SET_ENABLED_KEYS allows the user program to control which
> > > > > > +PAC keys are enabled in a particular task. It takes two arguments, the
> > > > > > +first being a bitmask of PR_PAC_APIAKEY, PR_PAC_APIBKEY, PR_PAC_APDAKEY
> > > > > > +and PR_PAC_APDBKEY specifying which keys shall be affected by this prctl,
> > > > > > +and the second being a bitmask of the same bits specifying whether the key
> > > > > > +should be enabled or disabled. For example::
> > > > > > +
> > > > > > +  prctl(PR_PAC_SET_ENABLED_KEYS,
> > > > > > +        PR_PAC_APIAKEY | PR_PAC_APIBKEY | PR_PAC_APDAKEY | PR_PAC_APDBKEY,
> > > > > > +        PR_PAC_APIBKEY, 0, 0);
> > > > > > +
> > > > > > +disables all keys except the IB key.
> > > > > > +
> > > > > > +The main reason why this is useful is to enable a userspace ABI that uses PAC
> > > > > > +instructions to sign and authenticate function pointers and other pointers
> > > > > > +exposed outside of the function, while still allowing binaries conforming to
> > > > > > +the ABI to interoperate with legacy binaries that do not sign or authenticate
> > > > > > +pointers.
> > > > >
> > > > > What actually breaks without this?
> > > > >
> > > > > Since the keys are all enabled by default, the only purpose of this
> > > > > prctl seems to be to disable keys.  I'm not sure what this is giving us.
> > > >
> > > > Yes, the purpose is to disable keys. Let's consider the function
> > > > pointer signing userspace ABI use case. An example is Apple's arm64e
> > > > ABI, and I have a prototype branch of LLVM [0] that implements a
> > > > similar ABI in Linux userspace, based on Apple's implementation of
> > > > their ABI.
> > > >
> > > > Here's an example of a function that returns a function pointer, and a
> > > > function that calls a function pointer of the same type:
> > > >
> > > > static void f(void) {}
> > > >
> > > > void *return_fp(void) {
> > > >   return f;
> > > > }
> > > >
> > > > void call_fp(void (*p)(void)) {
> > > >   p();
> > > > }
> > > >
> > > > If I compile this with my prototype compiler I get:
> > > >
> > > > $ clang --target=aarch64-android   -fptrauth-calls  fptr.c -S -o - -O3
> > > > -march=armv8.3a
> > > > [...]
> > > > return_fp:                              // @return_fp
> > > > // %bb.0:                               // %entry
> > > >         adrp    x16, f
> > > >         add     x16, x16, :lo12:f
> > > >         mov     x17, #16277
> > > >         pacia   x16, x17
> > > >         mov     x0, x16
> > > >         ret
> > > > [...]
> > > > call_fp:                                // @call_fp
> > > > // %bb.0:                               // %entry
> > > >         mov     w1, #16277
> > > >         braa    x0, x1
> > > > [...]
> > > >
> > > > In this code snippet the function pointer is signed with the IA key
> > > > and discriminator 16277 before being returned. When the function is
> > > > called, the pointer is first authenticated with the same key and
> > > > discriminator.
> > > >
> > > > Now imagine that this code lives in a shared library used both by
> > > > programs that use the function pointer signing ABI and by legacy
> > > > binaries (i.e. programs that use the existing ABI), and we have a
> > > > legacy binary that calls return_fp. If the legacy binary then calls
> > > > the function pointer returned by return_fp, that code will not
> > > > authenticate the pointer before calling it, it will just use a br or
> > > > blr instruction to call it directly, which will lead to a crash if the
> > > > signature bits are set in the function pointer. In order to prevent
> > > > the crash, we need a way to cause the pacia instruction in return_fp
> > > > to become a no-op when running inside the process hosting the legacy
> > > > binary, so that the signature bits will remain clear and the br or blr
> > > > instruction in the legacy binary will successfully call the function
> > > > f. That can be done by disabling the IA key, which is exactly what
> > > > this prctl() lets us do.
> > >
> > > Ack, I think there has been past discussion on this, but it's been
> > > waiting for a usecase since it does impose a bit of extra overhead.
> > >
> > > (I'd somehow assumes that you were actually wanting to get SIGILLs,
> > > rather then compatibly executing some ptrauth insns as NOPs -- the
> > > latter makes much more sense.)
> > >
> > >
> > > Is this going to land for real, or is it just something people have
> > > been experimenting with?
> > >
> > > (I can guess the answer from the fact that you've proposed this patch
> > > in the first place, but I have to ask...)
> >
> > It's something that we're considering for a future revision of the
> > Android ABI, and something that I'm personally interested in having,
> > but no final decision has been made.
> >
> > One problem is that by the time we're ready to make a decision, it'll
> > probably be far too late to make a start on getting the necessary
> > kernel support added, hence the desire to get started on this early.
>
> Fair enough.  It's far from being pure fantasy at least, and the use
> case seems sufficiently real.
>
> >
> > > >
> > > > >
> > > > > > +
> > > > > > +The idea is that a dynamic loader or early startup code would issue this
> > > > > > +prctl very early after establishing that a process may load legacy binaries,
> > > > > > +but before executing any PAC instructions.
> > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > +
> > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > >
> > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > >
> > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > if we only store the disabled keys mask here, not so much if we store
> > > > the full value of SCTLR_EL1.
> > >
> > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > we would do the write-if-needed across the whole register in one go.
> > > This would be easier to extend if we have to twiddle additional
> > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > affected bits for now then we could of course defer this factoring until
> > > later.  (I vaguely remember a similar discussion in the past, but
> > > possibly it was about the pauth keys anyway, rather than something
> > > else.)
> >
> > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > potentially combine the two, so that both
> > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> >
> > On kernel entry:
> > - read SCTLR_EL1 from task_struct (or from the system register if we
> > decide that's cheap enough)
> > - if EnIA clear, set it, and write the value to the system register.
> >
> > On kernel exit:
> > - read system register SCTLR_EL1
> > - read SCTLR_EL1 from task_struct
> > - if they are different, write task's SCTLR_EL1 to the system register.
> >
> > But this would require at least an unconditional read of SCTLR_EL1 per
> > kernel exit. The comment that I referenced says that "accesses" to the
> > register are expensive. I was operating under the assumption that both
> > reads and writes are expensive, but if it's actually only writes that
> > are expensive, we may be able to get away with the above.
>
> The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> could cache what we wrote percpu, if the overhead of reading it is
> problematic.

Maybe, but that sounds like the kind of complexity that I'd like to avoid.

I went ahead and did some performance measurements using the following
program, which issues 10M invalid syscalls and exits. This should give
us something as close as possible to a measurement of a kernel entry
and exit on its own.

.globl _start
_start:
movz x1, :abs_g1:10000000
movk x1, :abs_g0_nc:10000000

.Lloop:
mov x8, #65536 // invalid
svc #0
sub x1, x1, #1
cbnz x1, .Lloop

mov x0, #0
mov x8, #0x5d // exit
svc #0

On a Pixel 4 (which according to Wikipedia has a CPU based on
Cortex-A76/A55) I measured the median of 1000 runs execution time at
0.554511s, implying an entry/exit cost of 55.5ns. If I make this
modification to the kernel (Pixel 4 uses kernel version 4.14 so this
won't apply cleanly to modern kernels; this is in the same place as
the ptrauth_keys_install_user macro invocation in a modern kernel):

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 303447c765f7..b7c72d767f3e 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -340,6 +340,7 @@ alternative_else_nop_endif
 #endif
 3:
        apply_ssbd 0, 5f, x0, x1
+       mrs x0, sctlr_el1
 5:
        .endif

The median execution time goes to 0.565604s, implying a cost of 1.1ns
to read the system register. I also measured the cost of reading from
a percpu variable at kernel exit like so:

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 303447c765f7..c5a89785f650 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -340,6 +340,7 @@ alternative_else_nop_endif
 #endif
 3:
        apply_ssbd 0, 5f, x0, x1
+       ldr_this_cpu x0, sctlr_el1_cache, x1
 5:
        .endif

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 0b6ca5479f1f..ac9c9a6179d4 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -52,6 +52,8 @@
 #include <asm/system_misc.h>
 #include <asm/sysreg.h>

+DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
+
 static const char *handler[]= {
        "Synchronous Abort",
        "IRQ",

With that the median execution time goes to 0.600186s, implying a cost
of 4.5ns. So at least on existing hardware, it looks like reading
SCTLR_EL1 directly is probably our cheapest option if we're going to
read *something*. Therefore I'll look at implementing this with
unconditional reads from SCTLR_EL1 in v2.

> > I could try to measure the cost of an unconditional read from
> > SCTLR_EL1 on kernel exit on the hardware that I have access to, but I
> > don't know if that's going to be representative of all hardware,
> > especially the future hardware that will support PAC and MTE.
> >
> > > In a case like this, we'll still get overheads if there are a mixture of
> > > tasks contending for the CPU, that have different key enable settings.
> > > But I can't see much that we can do about that.  If userspace is mostly
> > > built with the same options (true for the Apple case I guess) then I
> > > guess we shouldn't need SCTLR_EL1 rewrites very often just for this.  In
> > > other environments it may be messier.
> >
> > Yes, I don't think we can avoid the write when switching between tasks
> > with different key enabled settings. If userspace arranges to use IA
> > for return addresses (matching the kernel) and IB for function
> > pointers, there would be no need to disable IA for compatibility with
> > legacy binaries, and so the kernel would not require a write on entry,
> > only on exit when transitioning between different settings. The effect
> > is that disabling IA would be more expensive than disabling the other
> > keys.
> >
> > > >
> > > > > > +
> > > > > > +     mrs_s   \tmp3, SYS_SCTLR_EL1
> > > > > > +     bic     \tmp3, \tmp3, \tmp2
> > > > > > +     msr_s   SYS_SCTLR_EL1, \tmp3
> > > > > > +
> > > > > >  .Laddr_auth_skip_\@:
> > > > > >  alternative_if ARM64_HAS_GENERIC_AUTH
> > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APGA]
> > > > > > @@ -45,6 +53,17 @@ alternative_else_nop_endif
> > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_KERNEL_KEY_APIA]
> > > > > >       msr_s   SYS_APIAKEYLO_EL1, \tmp2
> > > > > >       msr_s   SYS_APIAKEYHI_EL1, \tmp3
> > > > > > +
> > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > +     cbz     \tmp2, .Lset_sctlr_skip_\@
> > > > > > +
> > > > > > +     mrs_s   \tmp1, SYS_SCTLR_EL1
> > > > > > +     mov     \tmp2, #(SCTLR_ELx_ENIA | SCTLR_ELx_ENIB | SCTLR_ELx_ENDA)
> > > > >
> > > > > (Nit: harmless but unnecessary ().  # is not an operator as such, just
> > > > > random syntax.  Whatever follows is greedily parsed as an immediate
> > > > > expression.)
> > > >
> > > > Okay. While looking around in the kernel I noticed that there is a
> > > > mov_q macro that can be used to avoid manually splitting the constant
> > > > into 16-bit chunks, and apparently it doesn't require a #. I'll use it
> > > > in v2.
> > > >
> > > > > > +     movk    \tmp2, #SCTLR_ELx_ENDB
> > > > >
> > > > > Why do we check THREAD_SCTLR_ENXX_MASK, and then proceed to set all the
> > > > > ENxx bits unconditionally?  I may be missing something here.
> > > >
> > > > This code is to support the case where we are returning to the kernel
> > > > from a userspace task with keys disabled. The kernel needs at least
> > > > the IA key enabled in order for its own use of reverse-edge PAC to
> > > > work correctly. When returning from a userspace task with no keys
> > > > disabled, the keys enabled bits already have the correct values, so
> > > > there is nothing to be done (and as mentioned above, I avoid touching
> > > > SCTLR_EL1 unless necessary because it is apparently expensive to do
> > > > so). But in a process with keys disabled, we will need to re-enable at
> > > > least IA.
> > > >
> > > > We may be able to get away with just enabling IA here, but that would
> > > > break the invariant that all keys are enabled inside the kernel, which
> > > > is relied on by the code that decides whether to access SCTLR_EL1 in
> > > > order to disable keys when entering a userspace task.
> > >
> > > OK, I think I just confused myself here: we are not setting the key
> > > enables for userspace, but for the kernel, and we only need to do that
> > > if the user task had some keys disabled in the first place.
> > >
> > > >
> > > > > > +     orr     \tmp1, \tmp1, \tmp2
> > > > > > +     msr_s   SYS_SCTLR_EL1, \tmp1
> > > > > > +
> > > > > > +.Lset_sctlr_skip_\@:
> > > > > >       .endm
> > > > > >
> > > > > >       .macro ptrauth_keys_install_kernel_nosync tsk, tmp1, tmp2, tmp3
> > > > > > diff --git a/arch/arm64/include/asm/pointer_auth.h b/arch/arm64/include/asm/pointer_auth.h
> > > > > > index c6b4f0603024..d4c375454a36 100644
> > > > > > --- a/arch/arm64/include/asm/pointer_auth.h
> > > > > > +++ b/arch/arm64/include/asm/pointer_auth.h
> > > > > > @@ -70,14 +70,19 @@ static __always_inline void ptrauth_keys_switch_kernel(struct ptrauth_keys_kerne
> > > > > >  }
> > > > > >
> > > > > >  extern int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg);
> > > > > > +extern int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk,
> > > > > > +                                       unsigned long keys,
> > > > > > +                                       unsigned long enabled);
> > > > > >
> > > > > >  static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > > > >  {
> > > > > >       return ptrauth_clear_pac(ptr);
> > > > > >  }
> > > > > >
> > > > > > -#define ptrauth_thread_init_user(tsk)                                        \
> > > > > > -     ptrauth_keys_init_user(&(tsk)->thread.keys_user)
> > > > > > +#define ptrauth_thread_init_user(tsk) do {                           \
> > > > > > +             ptrauth_keys_init_user(&(tsk)->thread.keys_user);       \
> > > > > > +             (tsk)->thread.sctlr_enxx_mask = 0;                      \
> > > > > > +     } while (0)
> > > > > >  #define ptrauth_thread_init_kernel(tsk)                                      \
> > > > > >       ptrauth_keys_init_kernel(&(tsk)->thread.keys_kernel)
> > > > > >  #define ptrauth_thread_switch_kernel(tsk)                            \
> > > > > > @@ -85,6 +90,7 @@ static inline unsigned long ptrauth_strip_insn_pac(unsigned long ptr)
> > > > > >
> > > > > >  #else /* CONFIG_ARM64_PTR_AUTH */
> > > > > >  #define ptrauth_prctl_reset_keys(tsk, arg)   (-EINVAL)
> > > > > > +#define ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)   (-EINVAL)
> > > > > >  #define ptrauth_strip_insn_pac(lr)   (lr)
> > > > > >  #define ptrauth_thread_init_user(tsk)
> > > > > >  #define ptrauth_thread_init_kernel(tsk)
> > > > > > diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
> > > > > > index 240fe5e5b720..6974d227b01f 100644
> > > > > > --- a/arch/arm64/include/asm/processor.h
> > > > > > +++ b/arch/arm64/include/asm/processor.h
> > > > > > @@ -150,6 +150,7 @@ struct thread_struct {
> > > > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > > > >       struct ptrauth_keys_user        keys_user;
> > > > > >       struct ptrauth_keys_kernel      keys_kernel;
> > > > > > +     u64                             sctlr_enxx_mask;
> > > > > >  #endif
> > > > > >  };
> > > > > >
> > > > > > @@ -313,6 +314,10 @@ extern void __init minsigstksz_setup(void);
> > > > > >  /* PR_PAC_RESET_KEYS prctl */
> > > > > >  #define PAC_RESET_KEYS(tsk, arg)     ptrauth_prctl_reset_keys(tsk, arg)
> > > > > >
> > > > > > +/* PR_PAC_SET_ENABLED_KEYS prctl */
> > > > > > +#define PAC_SET_ENABLED_KEYS(tsk, keys, enabled)                             \
> > > > > > +     ptrauth_prctl_set_enabled_keys(tsk, keys, enabled)
> > > > > > +
> > > > > >  #ifdef CONFIG_ARM64_TAGGED_ADDR_ABI
> > > > > >  /* PR_{SET,GET}_TAGGED_ADDR_CTRL prctl */
> > > > > >  long set_tagged_addr_ctrl(unsigned long arg);
> > > > > > diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
> > > > > > index 0577e2142284..dac80e16fe35 100644
> > > > > > --- a/arch/arm64/kernel/asm-offsets.c
> > > > > > +++ b/arch/arm64/kernel/asm-offsets.c
> > > > > > @@ -47,6 +47,7 @@ int main(void)
> > > > > >  #ifdef CONFIG_ARM64_PTR_AUTH
> > > > > >    DEFINE(THREAD_KEYS_USER,   offsetof(struct task_struct, thread.keys_user));
> > > > > >    DEFINE(THREAD_KEYS_KERNEL, offsetof(struct task_struct, thread.keys_kernel));
> > > > > > +  DEFINE(THREAD_SCTLR_ENXX_MASK,offsetof(struct task_struct, thread.sctlr_enxx_mask));
> > > > > >  #endif
> > > > > >    BLANK();
> > > > > >    DEFINE(S_X0,                       offsetof(struct pt_regs, regs[0]));
> > > > > > diff --git a/arch/arm64/kernel/pointer_auth.c b/arch/arm64/kernel/pointer_auth.c
> > > > > > index 1e77736a4f66..8c385b7f324a 100644
> > > > > > --- a/arch/arm64/kernel/pointer_auth.c
> > > > > > +++ b/arch/arm64/kernel/pointer_auth.c
> > > > > > @@ -42,3 +42,37 @@ int ptrauth_prctl_reset_keys(struct task_struct *tsk, unsigned long arg)
> > > > > >
> > > > > >       return 0;
> > > > > >  }
> > > > > > +
> > > > > > +static u64 arg_to_enxx_mask(unsigned long arg)
> > > > > > +{
> > > > > > +     u64 sctlr_enxx_mask = 0;
> > > > > > +     if (arg & PR_PAC_APIAKEY)
> > > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIA;
> > > > > > +     if (arg & PR_PAC_APIBKEY)
> > > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENIB;
> > > > > > +     if (arg & PR_PAC_APDAKEY)
> > > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDA;
> > > > > > +     if (arg & PR_PAC_APDBKEY)
> > > > > > +             sctlr_enxx_mask |= SCTLR_ELx_ENDB;
> > > > > > +     return sctlr_enxx_mask;
> > > > > > +}
> > > > > > +
> > > > > > +int ptrauth_prctl_set_enabled_keys(struct task_struct *tsk, unsigned long keys,
> > > > > > +                                unsigned long enabled)
> > > > > > +{
> > > > > > +     u64 sctlr_enxx_mask = tsk->thread.sctlr_enxx_mask;
> > > > > > +     unsigned long addr_key_mask = PR_PAC_APIAKEY | PR_PAC_APIBKEY |
> > > > > > +                                   PR_PAC_APDAKEY | PR_PAC_APDBKEY;
> > > > > > +
> > > > > > +     if (!system_supports_address_auth())
> > > > > > +             return -EINVAL;
> > > > > > +
> > > > > > +     if ((keys & ~addr_key_mask) || (enabled & ~keys))
> > > > > > +             return -EINVAL;
> > > > >
> > > > > Should we take the types of authentication supported?
> > > > >
> > > > > I don't recall whether we expose ptrauth to userspace if only
> > > > > instruction authentication or only data authentication is supported.  If
> > > > > so, should we reject attempts to configure unsupported keys here?
> > > > >
> > > > > We should probably try to do a consistent thing both here and in
> > > > > PR_PAC_RESET_KEYS if so.
> > > >
> > > > As far as I know, there is nothing in the architecture that would
> > > > allow it to only advertise support for I keys or only advertise
> > > > support for D keys. The fields AA64ISAR1_EL1.AP[AI] apply to all four
> > > > keys: DA, DB, IA and IB. Maybe you are thinking of the GA key versus
> > > > the other keys (which is advertised separately via
> > > > AA64ISAR1_EL1.GP[AI])? The architecture appears to provide no way to
> > > > disable the GA key, so I did not include support for it here.
> > >
> > > I think I'm confusing myself here.  Yes, support for generic auth is
> > > the (supposedly) architecturally orthogonal to address auth, but data
> > > and instruction address auth are either both supported, or both not
> > > supported -- so your code looks correct.
> > >
> > > >
> > > > > > +
> > > > > > +     sctlr_enxx_mask |= arg_to_enxx_mask(keys);
> > > > > > +     sctlr_enxx_mask &= ~arg_to_enxx_mask(enabled);
> > > > > > +
> > > > > > +     tsk->thread.sctlr_enxx_mask = sctlr_enxx_mask;
> > > > > > +     return 0;
> > > > >
> > > > > Do we need a way to query the enabled keys?
> > > > >
> > > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > > weirder).
> > > > >
> > > > > As above, we might
> > > > >
> > > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > > properly.
> > > >
> > > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > > But it would need to be defined carefully because CRIU would
> > > > presumably need to know what value to pass as the "keys" argument when
> > > > it calls SET to restore the state. It can't just hardcode a value
> > > > because that may harm extensibility if new keys are added.
> > > >
> > > > If we require the kernel to start processes with all keys enabled
> > > > (including any keys that we may introduce in the future), then it
> > > > seems like if CRIU knows which keys were disabled, then it can restore
> > > > the state by issuing a syscall like this:
> > > >
> > > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > > >
> > > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > > saving the state to prepare the disabled_keys_mask argument passed
> > > > here. Then for consistency we can make the SET prctl() be
> > > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > > >
> > > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > > >
> > > > Does that make sense?
> > >
> > > Possibly, though it's nicer if the GET returns the same value suppiled
> > > to the SET, rather than the complement of it.
> >
> > Maybe you misread my proposal? The part about the complement is
> > addressed by the sentence beginning "Then for consistency..." So we
> > would end up with PR_PAC_SET_DISABLED_KEYS and
> > PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> > *disabled* keys.
>
> Oh, I see.  Yes, while I prefer the two are consistent with each other,
> we can flip the sense of both here if desired.
>
> >
> > > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > > mask arg2, then
> > >
> > >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> > >
> > >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> > >
> > > should work.
> >
> > I'd prefer not to allow the SET prctl to accept all-ones as the first
> > argument, as this could lead to a scenario where sometimes people pass
> > all-ones as the first argument and sometimes they don't, which could
> > make it hard to add new keys in the future (since the addition of a
> > new key could cause existing callers to affect the new key
> > arbitrarily). Instead, the first argument should be required to
> > specify only known keys, in order to enforce that the caller is
> > explicit about which keys they intend to affect.
>
> That sounds fair.
>
> > > There's a final option, which is to expose this config through ptrace
> > > instead for save/restore purposes.  From previous discussions with the
> > > CRIU folks, I recall that they are trying to move away from doing setup
> > > from within the new process where possible.
> > >
> > > There's no reason not to have both though -- there's precedent for that,
> > > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > > in a similar direction too IIUC.
> > >
> > > Having a GET remains useful for in-process debugging and diagnostics,
> > > and it's extremely straightforward to add in the kernel.  So from my
> > > side I'd vote to have it anyway...
> >
> > Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> > well, and I suppose that I could add a
> > NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> > userspace applications (and it doesn't look like ptrace APIs such as
> > NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> > need to be made aware of new keys anyway), we might as well just have
> > all the APIs deal with the set of enabled keys.
> >
> > And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> > GET prctl, since it's just more uapi surface that needs to be
> > maintained forever. For in-process use cases it seems redundant with
> > e.g. issuing a pacia instruction and checking whether it is a no-op,
>
> Probing instructions to see whether they generate signals is a cardinal
> sin and we make huge efforts to ensure that userspace doesn't need to do
> that.  As for the no-op check, I suppose that works, but it's nasty if
> the generic "check all the required featues" boilerplate on program
> entry relies on intrinsics or can't be written in C at all.  This is
> just a way of discouraging people from checking at all IMHO.
>
>
> The implementation of that prctl would most likely be a one-line wrapper
> that calls the same internal function used to populate the regset for
> ptrace, so I think you might be worrying a bit too much about creating
> ABI here.  This is more providing a second interface for same thing,
> because no single interface is usable in all situations.

Okay. Since I don't feel too strongly about it, I'll add the GET prctl in v2.

Peter

Dave Martin Sept. 1, 2020, 4:02 p.m. UTC | #9

On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > >
> > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:

[...]

> > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > +
> > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > >
> > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > >
> > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > the full value of SCTLR_EL1.
> > > >
> > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > we would do the write-if-needed across the whole register in one go.
> > > > This would be easier to extend if we have to twiddle additional
> > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > affected bits for now then we could of course defer this factoring until
> > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > possibly it was about the pauth keys anyway, rather than something
> > > > else.)
> > >
> > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > potentially combine the two, so that both
> > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > >
> > > On kernel entry:
> > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > decide that's cheap enough)
> > > - if EnIA clear, set it, and write the value to the system register.
> > >
> > > On kernel exit:
> > > - read system register SCTLR_EL1
> > > - read SCTLR_EL1 from task_struct
> > > - if they are different, write task's SCTLR_EL1 to the system register.
> > >
> > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > kernel exit. The comment that I referenced says that "accesses" to the
> > > register are expensive. I was operating under the assumption that both
> > > reads and writes are expensive, but if it's actually only writes that
> > > are expensive, we may be able to get away with the above.
> >
> > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > could cache what we wrote percpu, if the overhead of reading it is
> > problematic.
> 
> Maybe, but that sounds like the kind of complexity that I'd like to avoid.

Ack, this is more something to explore if the other approaches turn out
to be more costly.

> I went ahead and did some performance measurements using the following
> program, which issues 10M invalid syscalls and exits. This should give
> us something as close as possible to a measurement of a kernel entry
> and exit on its own.
> 
> .globl _start
> _start:
> movz x1, :abs_g1:10000000
> movk x1, :abs_g0_nc:10000000
> 
> .Lloop:
> mov x8, #65536 // invalid
> svc #0
> sub x1, x1, #1
> cbnz x1, .Lloop
> 
> mov x0, #0
> mov x8, #0x5d // exit
> svc #0
> 
> On a Pixel 4 (which according to Wikipedia has a CPU based on
> Cortex-A76/A55) I measured the median of 1000 runs execution time at
> 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> won't apply cleanly to modern kernels; this is in the same place as
> the ptrauth_keys_install_user macro invocation in a modern kernel):
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 303447c765f7..b7c72d767f3e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -340,6 +340,7 @@ alternative_else_nop_endif
>  #endif
>  3:
>         apply_ssbd 0, 5f, x0, x1
> +       mrs x0, sctlr_el1
>  5:
>         .endif
> 
> The median execution time goes to 0.565604s, implying a cost of 1.1ns
> to read the system register. I also measured the cost of reading from
> a percpu variable at kernel exit like so:
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 303447c765f7..c5a89785f650 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -340,6 +340,7 @@ alternative_else_nop_endif
>  #endif
>  3:
>         apply_ssbd 0, 5f, x0, x1
> +       ldr_this_cpu x0, sctlr_el1_cache, x1
>  5:
>         .endif
> 
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 0b6ca5479f1f..ac9c9a6179d4 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -52,6 +52,8 @@
>  #include <asm/system_misc.h>
>  #include <asm/sysreg.h>
> 
> +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> +
>  static const char *handler[]= {
>         "Synchronous Abort",
>         "IRQ",
> 
> With that the median execution time goes to 0.600186s, implying a cost
> of 4.5ns. So at least on existing hardware, it looks like reading
> SCTLR_EL1 directly is probably our cheapest option if we're going to
> read *something*. Therefore I'll look at implementing this with
> unconditional reads from SCTLR_EL1 in v2.

Thanks for the investigation!

I'm not too surprised by this outcome.  There is a possibility that
reading SCTLR_EL1 sucks badly on some implementations, but on others it
may be little more than a register rename.  We should probably wait for
evidence before panicking about it.

So long as this SCTLR_EL1 switching is completely disabled via
alternatives on hardware to which it isn't applicable, I expect that
should be a reasonable compromise.

[...]

> > > > > > Do we need a way to query the enabled keys?
> > > > > >
> > > > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > > > weirder).
> > > > > >
> > > > > > As above, we might
> > > > > >
> > > > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > > > properly.
> > > > >
> > > > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > > > But it would need to be defined carefully because CRIU would
> > > > > presumably need to know what value to pass as the "keys" argument when
> > > > > it calls SET to restore the state. It can't just hardcode a value
> > > > > because that may harm extensibility if new keys are added.
> > > > >
> > > > > If we require the kernel to start processes with all keys enabled
> > > > > (including any keys that we may introduce in the future), then it
> > > > > seems like if CRIU knows which keys were disabled, then it can restore
> > > > > the state by issuing a syscall like this:
> > > > >
> > > > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > > > >
> > > > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > > > saving the state to prepare the disabled_keys_mask argument passed
> > > > > here. Then for consistency we can make the SET prctl() be
> > > > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > > > >
> > > > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > > > >
> > > > > Does that make sense?
> > > >
> > > > Possibly, though it's nicer if the GET returns the same value suppiled
> > > > to the SET, rather than the complement of it.
> > >
> > > Maybe you misread my proposal? The part about the complement is
> > > addressed by the sentence beginning "Then for consistency..." So we
> > > would end up with PR_PAC_SET_DISABLED_KEYS and
> > > PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> > > *disabled* keys.
> >
> > Oh, I see.  Yes, while I prefer the two are consistent with each other,
> > we can flip the sense of both here if desired.
> >
> > >
> > > > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > > > mask arg2, then
> > > >
> > > >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> > > >
> > > >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> > > >
> > > > should work.
> > >
> > > I'd prefer not to allow the SET prctl to accept all-ones as the first
> > > argument, as this could lead to a scenario where sometimes people pass
> > > all-ones as the first argument and sometimes they don't, which could
> > > make it hard to add new keys in the future (since the addition of a
> > > new key could cause existing callers to affect the new key
> > > arbitrarily). Instead, the first argument should be required to
> > > specify only known keys, in order to enforce that the caller is
> > > explicit about which keys they intend to affect.
> >
> > That sounds fair.
> >
> > > > There's a final option, which is to expose this config through ptrace
> > > > instead for save/restore purposes.  From previous discussions with the
> > > > CRIU folks, I recall that they are trying to move away from doing setup
> > > > from within the new process where possible.
> > > >
> > > > There's no reason not to have both though -- there's precedent for that,
> > > > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > > > in a similar direction too IIUC.
> > > >
> > > > Having a GET remains useful for in-process debugging and diagnostics,
> > > > and it's extremely straightforward to add in the kernel.  So from my
> > > > side I'd vote to have it anyway...
> > >
> > > Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> > > well, and I suppose that I could add a
> > > NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> > > userspace applications (and it doesn't look like ptrace APIs such as
> > > NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> > > need to be made aware of new keys anyway), we might as well just have
> > > all the APIs deal with the set of enabled keys.
> > >
> > > And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> > > GET prctl, since it's just more uapi surface that needs to be
> > > maintained forever. For in-process use cases it seems redundant with
> > > e.g. issuing a pacia instruction and checking whether it is a no-op,
> >
> > Probing instructions to see whether they generate signals is a cardinal
> > sin and we make huge efforts to ensure that userspace doesn't need to do
> > that.  As for the no-op check, I suppose that works, but it's nasty if
> > the generic "check all the required featues" boilerplate on program
> > entry relies on intrinsics or can't be written in C at all.  This is
> > just a way of discouraging people from checking at all IMHO.
> >
> >
> > The implementation of that prctl would most likely be a one-line wrapper
> > that calls the same internal function used to populate the regset for
> > ptrace, so I think you might be worrying a bit too much about creating
> > ABI here.  This is more providing a second interface for same thing,
> > because no single interface is usable in all situations.
> 
> Okay. Since I don't feel too strongly about it, I'll add the GET prctl in v2.

Thanks

---Dave

Peter Collingbourne Sept. 3, 2020, 4:31 a.m. UTC | #10

On Tue, Sep 1, 2020 at 9:02 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > >
> > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
>
> [...]
>
> > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > +
> > > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > > >
> > > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > >
> > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > the full value of SCTLR_EL1.
> > > > >
> > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > we would do the write-if-needed across the whole register in one go.
> > > > > This would be easier to extend if we have to twiddle additional
> > > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > > affected bits for now then we could of course defer this factoring until
> > > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > else.)
> > > >
> > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > potentially combine the two, so that both
> > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > >
> > > > On kernel entry:
> > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > decide that's cheap enough)
> > > > - if EnIA clear, set it, and write the value to the system register.
> > > >
> > > > On kernel exit:
> > > > - read system register SCTLR_EL1
> > > > - read SCTLR_EL1 from task_struct
> > > > - if they are different, write task's SCTLR_EL1 to the system register.
> > > >
> > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > register are expensive. I was operating under the assumption that both
> > > > reads and writes are expensive, but if it's actually only writes that
> > > > are expensive, we may be able to get away with the above.
> > >
> > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > could cache what we wrote percpu, if the overhead of reading it is
> > > problematic.
> >
> > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
>
> Ack, this is more something to explore if the other approaches turn out
> to be more costly.
>
> > I went ahead and did some performance measurements using the following
> > program, which issues 10M invalid syscalls and exits. This should give
> > us something as close as possible to a measurement of a kernel entry
> > and exit on its own.
> >
> > .globl _start
> > _start:
> > movz x1, :abs_g1:10000000
> > movk x1, :abs_g0_nc:10000000
> >
> > .Lloop:
> > mov x8, #65536 // invalid
> > svc #0
> > sub x1, x1, #1
> > cbnz x1, .Lloop
> >
> > mov x0, #0
> > mov x8, #0x5d // exit
> > svc #0
> >
> > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > Cortex-A76/A55) I measured the median of 1000 runs execution time at
> > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > won't apply cleanly to modern kernels; this is in the same place as
> > the ptrauth_keys_install_user macro invocation in a modern kernel):
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 303447c765f7..b7c72d767f3e 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> >  #endif
> >  3:
> >         apply_ssbd 0, 5f, x0, x1
> > +       mrs x0, sctlr_el1
> >  5:
> >         .endif
> >
> > The median execution time goes to 0.565604s, implying a cost of 1.1ns
> > to read the system register. I also measured the cost of reading from
> > a percpu variable at kernel exit like so:
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 303447c765f7..c5a89785f650 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> >  #endif
> >  3:
> >         apply_ssbd 0, 5f, x0, x1
> > +       ldr_this_cpu x0, sctlr_el1_cache, x1
> >  5:
> >         .endif
> >
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -52,6 +52,8 @@
> >  #include <asm/system_misc.h>
> >  #include <asm/sysreg.h>
> >
> > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > +
> >  static const char *handler[]= {
> >         "Synchronous Abort",
> >         "IRQ",
> >
> > With that the median execution time goes to 0.600186s, implying a cost
> > of 4.5ns. So at least on existing hardware, it looks like reading
> > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > read *something*. Therefore I'll look at implementing this with
> > unconditional reads from SCTLR_EL1 in v2.
>
> Thanks for the investigation!
>
> I'm not too surprised by this outcome.  There is a possibility that
> reading SCTLR_EL1 sucks badly on some implementations, but on others it
> may be little more than a register rename.  We should probably wait for
> evidence before panicking about it.
>
> So long as this SCTLR_EL1 switching is completely disabled via
> alternatives on hardware to which it isn't applicable, I expect that
> should be a reasonable compromise.

One disadvantage of disabling SCTLR_EL1 switching on inapplicable
hardware is that it complicates the mechanism for enabling and
disabling deprecated instruction support (see
arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
CPU. This is incompatible with an entry/exit that assumes full control
over SCTLR_EL1.

I would propose the following instruction sequences in entry.S, which
are justified by further experiments run on a Pixel 4 showing that:
1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
whereas conditional writes cost 4.5ns
2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
task_struct
3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.

diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 55af8b504b65..9e0acd5e7527 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -186,6 +186,12 @@ alternative_cb_end

        ptrauth_keys_install_kernel tsk, x20, x22, x23

+alternative_if ARM64_HAS_ADDRESS_AUTH
+       mrs     x22, sctlr_el1
+       orr     x22, x22, SCTLR_ELx_ENIA
+       msr     sctlr_el1, x22
+alternative_else_nop_endif
+
        scs_load tsk, x20
        .else
        add     x21, sp, #S_FRAME_SIZE
@@ -297,6 +303,9 @@ alternative_else_nop_endif
        /* No kernel C function calls after this as user keys are set. */
        ptrauth_keys_install_user tsk, x0, x1, x2

+       ldr     x0, [tsk, THREAD_SCTLR]
+       msr     sctlr_el1, x0
+
        apply_ssbd 0, x0, x1
        .endif


With these instruction sequences I would propose to reimplement
toggling deprecated instruction support as follows:
1) Stop the machine using stop_machine.
2) While the machine is stopped, adjust the deprecated instruction
support feature bits in each task_struct.

This will make toggling deprecated instruction support more expensive,
but that seems like the right tradeoff since I imagine that these
features are rarely toggled, and making it more efficient to toggle
them would mean more instructions in the entry/exit hot path.

This seems like it will work well if SCTLR_EL1 is set on kernel exit.
The deprecated feature bits are CP15BEN and SED, and each of them
affects features that are only implemented at EL0, so it seems like we
can wait until kernel exit to update SCTLR_EL1. But if we don't set
SCTLR_EL1 on exit on certain hardware, we will need to have both
mechanisms, use the existing mechanism on hardware that doesn't set
SCTLR_EL1 and the new mechanism on hardware that does set SCTLR_EL1.
Which I wouldn't mind too much, but I'm not sure that it's worth it to
save 0.7ns (I'm not claiming that my numbers are universal, just that
we should probably favor simpler approaches if we don't have evidence
that they are significantly more expensive than more complex ones).

Peter

> [...]
>
> > > > > > > Do we need a way to query the enabled keys?
> > > > > > >
> > > > > > > We could either have a _GET_ prctl (the common approach), or have this
> > > > > > > prctl return the mask of enabled keys (avoids the extra prctl, but
> > > > > > > weirder).
> > > > > > >
> > > > > > > As above, we might
> > > > > > >
> > > > > > > Things like CRIU may need a GET in order to save/restore this setting
> > > > > > > properly.
> > > > > >
> > > > > > Maybe it makes sense for there to be a GET prctl() to support CRIU.
> > > > > > But it would need to be defined carefully because CRIU would
> > > > > > presumably need to know what value to pass as the "keys" argument when
> > > > > > it calls SET to restore the state. It can't just hardcode a value
> > > > > > because that may harm extensibility if new keys are added.
> > > > > >
> > > > > > If we require the kernel to start processes with all keys enabled
> > > > > > (including any keys that we may introduce in the future), then it
> > > > > > seems like if CRIU knows which keys were disabled, then it can restore
> > > > > > the state by issuing a syscall like this:
> > > > > >
> > > > > > prctl(PR_PAC_SET_ENABLED_KEYS, disabled_keys_mask, 0)
> > > > > >
> > > > > > which would imply that instead of providing PR_PAC_GET_ENABLED_KEYS we
> > > > > > instead provide PR_PAC_GET_DISABLED_KEYS, which CRIU would call when
> > > > > > saving the state to prepare the disabled_keys_mask argument passed
> > > > > > here. Then for consistency we can make the SET prctl() be
> > > > > > PR_PAC_SET_DISABLED_KEYS, and CRIU can then do:
> > > > > >
> > > > > > prctl(PR_PAC_SET_DISABLED_KEYS, disabled_keys_mask, disabled_keys_mask)
> > > > > >
> > > > > > Does that make sense?
> > > > >
> > > > > Possibly, though it's nicer if the GET returns the same value suppiled
> > > > > to the SET, rather than the complement of it.
> > > >
> > > > Maybe you misread my proposal? The part about the complement is
> > > > addressed by the sentence beginning "Then for consistency..." So we
> > > > would end up with PR_PAC_SET_DISABLED_KEYS and
> > > > PR_PAC_GET_DISABLED_KEYS, and they would both take/return a mask of
> > > > *disabled* keys.
> > >
> > > Oh, I see.  Yes, while I prefer the two are consistent with each other,
> > > we can flip the sense of both here if desired.
> > >
> > > >
> > > > > If SET ignores set bits in arg3 that don't correspond to set bits in the
> > > > > mask arg2, then
> > > > >
> > > > >         u64 current_keys = prctl(PR_PAC_GET_ENABLED_KEYS);
> > > > >
> > > > >         prctl(PR_PAC_SET_ENABLED_KEYS, ~0UL, current_keys);
> > > > >
> > > > > should work.
> > > >
> > > > I'd prefer not to allow the SET prctl to accept all-ones as the first
> > > > argument, as this could lead to a scenario where sometimes people pass
> > > > all-ones as the first argument and sometimes they don't, which could
> > > > make it hard to add new keys in the future (since the addition of a
> > > > new key could cause existing callers to affect the new key
> > > > arbitrarily). Instead, the first argument should be required to
> > > > specify only known keys, in order to enforce that the caller is
> > > > explicit about which keys they intend to affect.
> > >
> > > That sounds fair.
> > >
> > > > > There's a final option, which is to expose this config through ptrace
> > > > > instead for save/restore purposes.  From previous discussions with the
> > > > > CRIU folks, I recall that they are trying to move away from doing setup
> > > > > from within the new process where possible.
> > > > >
> > > > > There's no reason not to have both though -- there's precedent for that,
> > > > > such as for PR_SVE_{GET,SET}_VL and the NT_ARM_SVE regset.  MTE may move
> > > > > in a similar direction too IIUC.
> > > > >
> > > > > Having a GET remains useful for in-process debugging and diagnostics,
> > > > > and it's extremely straightforward to add in the kernel.  So from my
> > > > > side I'd vote to have it anyway...
> > > >
> > > > Okay. I see that the MTE series is adding NT_ARM_TAGGED_ADDR_CTRL as
> > > > well, and I suppose that I could add a
> > > > NT_ARM_PAC_(whatever)ABLED_KEYS. Since this wouldn't be used in normal
> > > > userspace applications (and it doesn't look like ptrace APIs such as
> > > > NT_ARM_PACA_KEYS will accommodate new keys being added so CRIU would
> > > > need to be made aware of new keys anyway), we might as well just have
> > > > all the APIs deal with the set of enabled keys.
> > > >
> > > > And if the ptrace is enough for CRIU, I'd mildly prefer not to add a
> > > > GET prctl, since it's just more uapi surface that needs to be
> > > > maintained forever. For in-process use cases it seems redundant with
> > > > e.g. issuing a pacia instruction and checking whether it is a no-op,
> > >
> > > Probing instructions to see whether they generate signals is a cardinal
> > > sin and we make huge efforts to ensure that userspace doesn't need to do
> > > that.  As for the no-op check, I suppose that works, but it's nasty if
> > > the generic "check all the required featues" boilerplate on program
> > > entry relies on intrinsics or can't be written in C at all.  This is
> > > just a way of discouraging people from checking at all IMHO.
> > >
> > >
> > > The implementation of that prctl would most likely be a one-line wrapper
> > > that calls the same internal function used to populate the regset for
> > > ptrace, so I think you might be worrying a bit too much about creating
> > > ABI here.  This is more providing a second interface for same thing,
> > > because no single interface is usable in all situations.
> >
> > Okay. Since I don't feel too strongly about it, I'll add the GET prctl in v2.
>
> Thanks
>
> ---Dave

Dave Martin Sept. 7, 2020, 11:03 a.m. UTC | #11

On Wed, Sep 02, 2020 at 09:31:55PM -0700, Peter Collingbourne wrote:
> On Tue, Sep 1, 2020 at 9:02 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > >
> > > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > >
> > > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> >
> > [...]
> >
> > > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > > +
> > > > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > > > >
> > > > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > > >
> > > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > > the full value of SCTLR_EL1.
> > > > > >
> > > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > > we would do the write-if-needed across the whole register in one go.
> > > > > > This would be easier to extend if we have to twiddle additional
> > > > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > > > affected bits for now then we could of course defer this factoring until
> > > > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > > else.)
> > > > >
> > > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > > potentially combine the two, so that both
> > > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > > >
> > > > > On kernel entry:
> > > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > > decide that's cheap enough)
> > > > > - if EnIA clear, set it, and write the value to the system register.
> > > > >
> > > > > On kernel exit:
> > > > > - read system register SCTLR_EL1
> > > > > - read SCTLR_EL1 from task_struct
> > > > > - if they are different, write task's SCTLR_EL1 to the system register.
> > > > >
> > > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > > register are expensive. I was operating under the assumption that both
> > > > > reads and writes are expensive, but if it's actually only writes that
> > > > > are expensive, we may be able to get away with the above.
> > > >
> > > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > > could cache what we wrote percpu, if the overhead of reading it is
> > > > problematic.
> > >
> > > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
> >
> > Ack, this is more something to explore if the other approaches turn out
> > to be more costly.
> >
> > > I went ahead and did some performance measurements using the following
> > > program, which issues 10M invalid syscalls and exits. This should give
> > > us something as close as possible to a measurement of a kernel entry
> > > and exit on its own.
> > >
> > > .globl _start
> > > _start:
> > > movz x1, :abs_g1:10000000
> > > movk x1, :abs_g0_nc:10000000
> > >
> > > .Lloop:
> > > mov x8, #65536 // invalid
> > > svc #0
> > > sub x1, x1, #1
> > > cbnz x1, .Lloop
> > >
> > > mov x0, #0
> > > mov x8, #0x5d // exit
> > > svc #0
> > >
> > > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > > Cortex-A76/A55) I measured the median of 1000 runs execution time at
> > > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > > won't apply cleanly to modern kernels; this is in the same place as
> > > the ptrauth_keys_install_user macro invocation in a modern kernel):
> > >
> > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > index 303447c765f7..b7c72d767f3e 100644
> > > --- a/arch/arm64/kernel/entry.S
> > > +++ b/arch/arm64/kernel/entry.S
> > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > >  #endif
> > >  3:
> > >         apply_ssbd 0, 5f, x0, x1
> > > +       mrs x0, sctlr_el1
> > >  5:
> > >         .endif
> > >
> > > The median execution time goes to 0.565604s, implying a cost of 1.1ns
> > > to read the system register. I also measured the cost of reading from
> > > a percpu variable at kernel exit like so:
> > >
> > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > index 303447c765f7..c5a89785f650 100644
> > > --- a/arch/arm64/kernel/entry.S
> > > +++ b/arch/arm64/kernel/entry.S
> > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > >  #endif
> > >  3:
> > >         apply_ssbd 0, 5f, x0, x1
> > > +       ldr_this_cpu x0, sctlr_el1_cache, x1
> > >  5:
> > >         .endif
> > >
> > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > > --- a/arch/arm64/kernel/traps.c
> > > +++ b/arch/arm64/kernel/traps.c
> > > @@ -52,6 +52,8 @@
> > >  #include <asm/system_misc.h>
> > >  #include <asm/sysreg.h>
> > >
> > > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > > +
> > >  static const char *handler[]= {
> > >         "Synchronous Abort",
> > >         "IRQ",
> > >
> > > With that the median execution time goes to 0.600186s, implying a cost
> > > of 4.5ns. So at least on existing hardware, it looks like reading
> > > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > > read *something*. Therefore I'll look at implementing this with
> > > unconditional reads from SCTLR_EL1 in v2.
> >
> > Thanks for the investigation!
> >
> > I'm not too surprised by this outcome.  There is a possibility that
> > reading SCTLR_EL1 sucks badly on some implementations, but on others it
> > may be little more than a register rename.  We should probably wait for
> > evidence before panicking about it.
> >
> > So long as this SCTLR_EL1 switching is completely disabled via
> > alternatives on hardware to which it isn't applicable, I expect that
> > should be a reasonable compromise.
> 
> One disadvantage of disabling SCTLR_EL1 switching on inapplicable
> hardware is that it complicates the mechanism for enabling and
> disabling deprecated instruction support (see
> arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
> by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
> CPU. This is incompatible with an entry/exit that assumes full control
> over SCTLR_EL1.
> 
> I would propose the following instruction sequences in entry.S, which
> are justified by further experiments run on a Pixel 4 showing that:
> 1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
> whereas conditional writes cost 4.5ns
> 2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
> task_struct
> 3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.
> 
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 55af8b504b65..9e0acd5e7527 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -186,6 +186,12 @@ alternative_cb_end
> 
>         ptrauth_keys_install_kernel tsk, x20, x22, x23
> 
> +alternative_if ARM64_HAS_ADDRESS_AUTH
> +       mrs     x22, sctlr_el1
> +       orr     x22, x22, SCTLR_ELx_ENIA
> +       msr     sctlr_el1, x22
> +alternative_else_nop_endif
> +
>         scs_load tsk, x20
>         .else
>         add     x21, sp, #S_FRAME_SIZE
> @@ -297,6 +303,9 @@ alternative_else_nop_endif
>         /* No kernel C function calls after this as user keys are set. */
>         ptrauth_keys_install_user tsk, x0, x1, x2
> 
> +       ldr     x0, [tsk, THREAD_SCTLR]
> +       msr     sctlr_el1, x0
> +
>         apply_ssbd 0, x0, x1
>         .endif
> 
> 
> With these instruction sequences I would propose to reimplement
> toggling deprecated instruction support as follows:
> 1) Stop the machine using stop_machine.
> 2) While the machine is stopped, adjust the deprecated instruction
> support feature bits in each task_struct.
> 
> This will make toggling deprecated instruction support more expensive,
> but that seems like the right tradeoff since I imagine that these
> features are rarely toggled, and making it more efficient to toggle
> them would mean more instructions in the entry/exit hot path.
> 
> This seems like it will work well if SCTLR_EL1 is set on kernel exit.
> The deprecated feature bits are CP15BEN and SED, and each of them
> affects features that are only implemented at EL0, so it seems like we
> can wait until kernel exit to update SCTLR_EL1. But if we don't set
> SCTLR_EL1 on exit on certain hardware, we will need to have both
> mechanisms, use the existing mechanism on hardware that doesn't set
> SCTLR_EL1 and the new mechanism on hardware that does set SCTLR_EL1.
> Which I wouldn't mind too much, but I'm not sure that it's worth it to
> save 0.7ns (I'm not claiming that my numbers are universal, just that
> we should probably favor simpler approaches if we don't have evidence
> that they are significantly more expensive than more complex ones).
> 
> Peter

This approach does look like close to the simplest option for the entry/
exit path -- I'll let others comment on whether the actual overhead
numbers sound acceptable.

I think there are more places where the kernel modifies SCTLR than just
for the deprecated instruction emulation though.  Do you think your
approach will work with all of them, and can they all tolerate a
stop_machine()?

[...]

Cheers
---Dave

Peter Collingbourne Oct. 13, 2020, 4:03 a.m. UTC | #12

On Mon, Sep 7, 2020 at 4:03 AM Dave Martin <Dave.Martin@arm.com> wrote:
>
> On Wed, Sep 02, 2020 at 09:31:55PM -0700, Peter Collingbourne wrote:
> > On Tue, Sep 1, 2020 at 9:02 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > >
> > > On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > > > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > >
> > > > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > >
> > > > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > >
> > > [...]
> > >
> > > > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > > > +
> > > > > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > > > > >
> > > > > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > > > >
> > > > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > > > the full value of SCTLR_EL1.
> > > > > > >
> > > > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > > > we would do the write-if-needed across the whole register in one go.
> > > > > > > This would be easier to extend if we have to twiddle additional
> > > > > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > > > > affected bits for now then we could of course defer this factoring until
> > > > > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > > > else.)
> > > > > >
> > > > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > > > potentially combine the two, so that both
> > > > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > > > >
> > > > > > On kernel entry:
> > > > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > > > decide that's cheap enough)
> > > > > > - if EnIA clear, set it, and write the value to the system register.
> > > > > >
> > > > > > On kernel exit:
> > > > > > - read system register SCTLR_EL1
> > > > > > - read SCTLR_EL1 from task_struct
> > > > > > - if they are different, write task's SCTLR_EL1 to the system register.
> > > > > >
> > > > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > > > register are expensive. I was operating under the assumption that both
> > > > > > reads and writes are expensive, but if it's actually only writes that
> > > > > > are expensive, we may be able to get away with the above.
> > > > >
> > > > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > > > could cache what we wrote percpu, if the overhead of reading it is
> > > > > problematic.
> > > >
> > > > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
> > >
> > > Ack, this is more something to explore if the other approaches turn out
> > > to be more costly.
> > >
> > > > I went ahead and did some performance measurements using the following
> > > > program, which issues 10M invalid syscalls and exits. This should give
> > > > us something as close as possible to a measurement of a kernel entry
> > > > and exit on its own.
> > > >
> > > > .globl _start
> > > > _start:
> > > > movz x1, :abs_g1:10000000
> > > > movk x1, :abs_g0_nc:10000000
> > > >
> > > > .Lloop:
> > > > mov x8, #65536 // invalid
> > > > svc #0
> > > > sub x1, x1, #1
> > > > cbnz x1, .Lloop
> > > >
> > > > mov x0, #0
> > > > mov x8, #0x5d // exit
> > > > svc #0
> > > >
> > > > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > > > Cortex-A76/A55) I measured the median of 1000 runs execution time at
> > > > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > > > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > > > won't apply cleanly to modern kernels; this is in the same place as
> > > > the ptrauth_keys_install_user macro invocation in a modern kernel):
> > > >
> > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > index 303447c765f7..b7c72d767f3e 100644
> > > > --- a/arch/arm64/kernel/entry.S
> > > > +++ b/arch/arm64/kernel/entry.S
> > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > >  #endif
> > > >  3:
> > > >         apply_ssbd 0, 5f, x0, x1
> > > > +       mrs x0, sctlr_el1
> > > >  5:
> > > >         .endif
> > > >
> > > > The median execution time goes to 0.565604s, implying a cost of 1.1ns
> > > > to read the system register. I also measured the cost of reading from
> > > > a percpu variable at kernel exit like so:
> > > >
> > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > index 303447c765f7..c5a89785f650 100644
> > > > --- a/arch/arm64/kernel/entry.S
> > > > +++ b/arch/arm64/kernel/entry.S
> > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > >  #endif
> > > >  3:
> > > >         apply_ssbd 0, 5f, x0, x1
> > > > +       ldr_this_cpu x0, sctlr_el1_cache, x1
> > > >  5:
> > > >         .endif
> > > >
> > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > > > --- a/arch/arm64/kernel/traps.c
> > > > +++ b/arch/arm64/kernel/traps.c
> > > > @@ -52,6 +52,8 @@
> > > >  #include <asm/system_misc.h>
> > > >  #include <asm/sysreg.h>
> > > >
> > > > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > > > +
> > > >  static const char *handler[]= {
> > > >         "Synchronous Abort",
> > > >         "IRQ",
> > > >
> > > > With that the median execution time goes to 0.600186s, implying a cost
> > > > of 4.5ns. So at least on existing hardware, it looks like reading
> > > > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > > > read *something*. Therefore I'll look at implementing this with
> > > > unconditional reads from SCTLR_EL1 in v2.
> > >
> > > Thanks for the investigation!
> > >
> > > I'm not too surprised by this outcome.  There is a possibility that
> > > reading SCTLR_EL1 sucks badly on some implementations, but on others it
> > > may be little more than a register rename.  We should probably wait for
> > > evidence before panicking about it.
> > >
> > > So long as this SCTLR_EL1 switching is completely disabled via
> > > alternatives on hardware to which it isn't applicable, I expect that
> > > should be a reasonable compromise.
> >
> > One disadvantage of disabling SCTLR_EL1 switching on inapplicable
> > hardware is that it complicates the mechanism for enabling and
> > disabling deprecated instruction support (see
> > arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
> > by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
> > CPU. This is incompatible with an entry/exit that assumes full control
> > over SCTLR_EL1.
> >
> > I would propose the following instruction sequences in entry.S, which
> > are justified by further experiments run on a Pixel 4 showing that:
> > 1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
> > whereas conditional writes cost 4.5ns
> > 2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
> > task_struct
> > 3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.
> >
> > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > index 55af8b504b65..9e0acd5e7527 100644
> > --- a/arch/arm64/kernel/entry.S
> > +++ b/arch/arm64/kernel/entry.S
> > @@ -186,6 +186,12 @@ alternative_cb_end
> >
> >         ptrauth_keys_install_kernel tsk, x20, x22, x23
> >
> > +alternative_if ARM64_HAS_ADDRESS_AUTH
> > +       mrs     x22, sctlr_el1
> > +       orr     x22, x22, SCTLR_ELx_ENIA
> > +       msr     sctlr_el1, x22
> > +alternative_else_nop_endif
> > +
> >         scs_load tsk, x20
> >         .else
> >         add     x21, sp, #S_FRAME_SIZE
> > @@ -297,6 +303,9 @@ alternative_else_nop_endif
> >         /* No kernel C function calls after this as user keys are set. */
> >         ptrauth_keys_install_user tsk, x0, x1, x2
> >
> > +       ldr     x0, [tsk, THREAD_SCTLR]
> > +       msr     sctlr_el1, x0
> > +
> >         apply_ssbd 0, x0, x1
> >         .endif
> >
> >
> > With these instruction sequences I would propose to reimplement
> > toggling deprecated instruction support as follows:
> > 1) Stop the machine using stop_machine.
> > 2) While the machine is stopped, adjust the deprecated instruction
> > support feature bits in each task_struct.
> >
> > This will make toggling deprecated instruction support more expensive,
> > but that seems like the right tradeoff since I imagine that these
> > features are rarely toggled, and making it more efficient to toggle
> > them would mean more instructions in the entry/exit hot path.
> >
> > This seems like it will work well if SCTLR_EL1 is set on kernel exit.
> > The deprecated feature bits are CP15BEN and SED, and each of them
> > affects features that are only implemented at EL0, so it seems like we
> > can wait until kernel exit to update SCTLR_EL1. But if we don't set
> > SCTLR_EL1 on exit on certain hardware, we will need to have both
> > mechanisms, use the existing mechanism on hardware that doesn't set
> > SCTLR_EL1 and the new mechanism on hardware that does set SCTLR_EL1.
> > Which I wouldn't mind too much, but I'm not sure that it's worth it to
> > save 0.7ns (I'm not claiming that my numbers are universal, just that
> > we should probably favor simpler approaches if we don't have evidence
> > that they are significantly more expensive than more complex ones).
> >
> > Peter
>
> This approach does look like close to the simplest option for the entry/
> exit path -- I'll let others comment on whether the actual overhead
> numbers sound acceptable.
>
> I think there are more places where the kernel modifies SCTLR than just
> for the deprecated instruction emulation though.  Do you think your
> approach will work with all of them, and can they all tolerate a
> stop_machine()?

You're right, there are other places. Looking at the places written in
assembly, I think all of them do not expect the value to be retained.
But there are more places where we were setting this register in C
that I hadn't checked. For example, in cpufeature.c we have:

static void cpu_emulate_effective_ctr(const struct
arm64_cpu_capabilities *__unused)
{
        /*
         * If the CPU exposes raw CTR_EL0.IDC = 0, while effectively
         * CTR_EL0.IDC = 1 (from CLIDR values), we need to trap accesses
         * to the CTR_EL0 on this CPU and emulate it with the real/safe
         * value.
         */
        if (!(read_cpuid_cachetype() & BIT(CTR_IDC_SHIFT)))
                sysreg_clear_set(sctlr_el1, SCTLR_EL1_UCT, 0);
}

So we may want heterogeneous values of SCTLR_EL1 on different cores,
unless for example for this routine we reimplement to trap accesses to
CTR_EL0 homogeneously.

Furthermore it is unclear that it would be valid to write to SCTLR_EL1
directly with the result of reading from a thread_struct field. My
belief that this behavior may be incorrect comes from the documented
semantics of reserved fields in SCTLR_EL1. These bits are defined as
RES0 which means that software must use an SBZP approach to write to
them, and an approach that reads a value from thread_struct and then
writes exactly that value to SCTLR_EL1 would not necessarily preserve
RES0 bits that are different between cores.

This led me to go back to considering approaches where we only update
EnIA on entry and exit and leave the updating of the other bits to
task switch at the same time as TCF0 in Catalin's MTE series.

Such approaches seem like they will perform better as well, if
extrapolating from existing hardware is anything to go by. It seems
that my previous measurements on Pixel 4 may have been inaccurate
because /sys/devices/system/cpu/*/cpufreq/scaling_governor was set to
its default value so scaling may have interfered with the results. I
ran a new set of measurements with scaling_governor set to powersave
which gave me much more stable results. With the newly stable results
I decided to take the median of 100 runs rather than 1000 to
compensate for the slower run time caused by the powersave governor.
These results showed that conditional MSR is faster than unconditional
MSR, and that LDR from thread_struct to obtain the state of the
incoming task is slightly faster than MRS. The full results with
references to patches are below:

Baseline: 316.7ns
Conditional MSR based on LDR [1]: 319.6ns
Unconditional MSR [2]: 353.4ns
Conditional MSR based on MRS [3]: 323.8ns

I also ran the same experiment on a DragonBoard 845c. Since this board
can run the mainline kernel it gives us results that we may expect to
be slightly more valid when making changes to the mainline kernel than
the 4.14 kernel used by Pixel 4. On this board I saw two "clusters" of
results so I took the median of the faster cluster which was more
stable (I don't think this was due to big.LITTLE because I saw very
different numbers if I pinned the process to a LITTLE core):

Baseline: 450.5ns
Conditional MSR based on LDR: 457.8ns
Unconditional MSR: 487.0ns
Conditional MSR based on MRS: 461.5ns

So we can see very similar perf impacts to Pixel 4. Although this
board is Cortex-A75/A55 which is the prior uarch generation to Pixel
4's Cortex-A76/A55, it also tells us that on big Cortex uarches
SCTLR_EL1 performance is somewhat stable over time, at least to the
extent that we can extrapolate that from two data points. Of course,
we may tune the instruction sequences again once hardware with PAC
support is publicly available.

Thus I think that to begin with we should use instruction sequences
similar to those in [1], protected via alternatives so we wouldn't be
impacting the performance on existing hardware. I will develop my
patch on top of the MTE series since they will interfere with each
other and it seems plausible that MTE will land first.

Peter

[1] https://android-review.googlesource.com/c/kernel/msm/+/1457681
[2] https://android-review.googlesource.com/c/kernel/msm/+/1456578
[3] https://android-review.googlesource.com/c/kernel/msm/+/1456579



Peter

Dave Martin Oct. 13, 2020, 10:14 a.m. UTC | #13

On Mon, Oct 12, 2020 at 09:03:49PM -0700, Peter Collingbourne wrote:
> On Mon, Sep 7, 2020 at 4:03 AM Dave Martin <Dave.Martin@arm.com> wrote:
> >
> > On Wed, Sep 02, 2020 at 09:31:55PM -0700, Peter Collingbourne wrote:
> > > On Tue, Sep 1, 2020 at 9:02 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > >
> > > > On Wed, Aug 26, 2020 at 07:55:51PM -0700, Peter Collingbourne wrote:
> > > > > On Wed, Aug 26, 2020 at 9:37 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 24, 2020 at 05:23:27PM -0700, Peter Collingbourne wrote:
> > > > > > > On Mon, Aug 24, 2020 at 7:49 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Aug 19, 2020 at 02:25:45PM -0700, Peter Collingbourne wrote:
> > > > > > > > > On Wed, Aug 19, 2020 at 3:18 AM Dave Martin <Dave.Martin@arm.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, Jul 31, 2020 at 06:11:52PM -0700, Peter Collingbourne wrote:
> > > >
> > > > [...]
> > > >
> > > > > > > > > > > diff --git a/arch/arm64/include/asm/asm_pointer_auth.h b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > > index 52dead2a8640..d121fa5fed5f 100644
> > > > > > > > > > > --- a/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > > +++ b/arch/arm64/include/asm/asm_pointer_auth.h
> > > > > > > > > > > @@ -31,6 +31,14 @@ alternative_else_nop_endif
> > > > > > > > > > >       ldp     \tmp2, \tmp3, [\tmp1, #PTRAUTH_USER_KEY_APDB]
> > > > > > > > > > >       msr_s   SYS_APDBKEYLO_EL1, \tmp2
> > > > > > > > > > >       msr_s   SYS_APDBKEYHI_EL1, \tmp3
> > > > > > > > > > > +
> > > > > > > > > > > +     ldr     \tmp2, [\tsk, #THREAD_SCTLR_ENXX_MASK]
> > > > > > > > > > > +     cbz     \tmp2, .Laddr_auth_skip_\@
> > > > > > > > > >
> > > > > > > > > > I wonder whether it would make sense to simple store the thread's base
> > > > > > > > > > SCTLR value (containing the ENxx bits), rather than storing the ENxx
> > > > > > > > > > bits separately.  There may be reasons outside this snippet why this
> > > > > > > > > > isn't such a good idea though -- I haven't gone digging so far.
> > > > > > > > >
> > > > > > > > > As far as I know (as I learned [1] from the MTE patch series), it can
> > > > > > > > > be expensive to access SCTLR_EL1, therefore I arrange to only access
> > > > > > > > > SCTLR_EL1 in the hopefully-uncommon case where a process has disabled
> > > > > > > > > keys. Detecting the "process has disabled keys" case is quite simple
> > > > > > > > > if we only store the disabled keys mask here, not so much if we store
> > > > > > > > > the full value of SCTLR_EL1.
> > > > > > > >
> > > > > > > > My thought was that we would still only write SCTLR_EL1 if needed, but
> > > > > > > > we would do the write-if-needed across the whole register in one go.
> > > > > > > > This would be easier to extend if we have to twiddle additional
> > > > > > > > SCTLR_EL1 bits in the future.  If the key enable bits are the only
> > > > > > > > affected bits for now then we could of course defer this factoring until
> > > > > > > > later.  (I vaguely remember a similar discussion in the past, but
> > > > > > > > possibly it was about the pauth keys anyway, rather than something
> > > > > > > > else.)
> > > > > > >
> > > > > > > We will also need a per-task SCTLR_EL1.TCF0 setting for MTE. We could
> > > > > > > potentially combine the two, so that both
> > > > > > > prctl(PR_PAC_SET_ENABLED_KEYS) and prctl(PR_SET_TAGGED_ADDR_CTRL) end
> > > > > > > up affecting a common SCTLR_EL1 field in the task_struct, and we do:
> > > > > > >
> > > > > > > On kernel entry:
> > > > > > > - read SCTLR_EL1 from task_struct (or from the system register if we
> > > > > > > decide that's cheap enough)
> > > > > > > - if EnIA clear, set it, and write the value to the system register.
> > > > > > >
> > > > > > > On kernel exit:
> > > > > > > - read system register SCTLR_EL1
> > > > > > > - read SCTLR_EL1 from task_struct
> > > > > > > - if they are different, write task's SCTLR_EL1 to the system register.
> > > > > > >
> > > > > > > But this would require at least an unconditional read of SCTLR_EL1 per
> > > > > > > kernel exit. The comment that I referenced says that "accesses" to the
> > > > > > > register are expensive. I was operating under the assumption that both
> > > > > > > reads and writes are expensive, but if it's actually only writes that
> > > > > > > are expensive, we may be able to get away with the above.
> > > > > >
> > > > > > The kernel is in full control over what's in SCTLR_EL1, so perhaps we
> > > > > > could cache what we wrote percpu, if the overhead of reading it is
> > > > > > problematic.
> > > > >
> > > > > Maybe, but that sounds like the kind of complexity that I'd like to avoid.
> > > >
> > > > Ack, this is more something to explore if the other approaches turn out
> > > > to be more costly.
> > > >
> > > > > I went ahead and did some performance measurements using the following
> > > > > program, which issues 10M invalid syscalls and exits. This should give
> > > > > us something as close as possible to a measurement of a kernel entry
> > > > > and exit on its own.
> > > > >
> > > > > .globl _start
> > > > > _start:
> > > > > movz x1, :abs_g1:10000000
> > > > > movk x1, :abs_g0_nc:10000000
> > > > >
> > > > > .Lloop:
> > > > > mov x8, #65536 // invalid
> > > > > svc #0
> > > > > sub x1, x1, #1
> > > > > cbnz x1, .Lloop
> > > > >
> > > > > mov x0, #0
> > > > > mov x8, #0x5d // exit
> > > > > svc #0
> > > > >
> > > > > On a Pixel 4 (which according to Wikipedia has a CPU based on
> > > > > Cortex-A76/A55) I measured the median of 1000 runs execution time at
> > > > > 0.554511s, implying an entry/exit cost of 55.5ns. If I make this
> > > > > modification to the kernel (Pixel 4 uses kernel version 4.14 so this
> > > > > won't apply cleanly to modern kernels; this is in the same place as
> > > > > the ptrauth_keys_install_user macro invocation in a modern kernel):
> > > > >
> > > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > > index 303447c765f7..b7c72d767f3e 100644
> > > > > --- a/arch/arm64/kernel/entry.S
> > > > > +++ b/arch/arm64/kernel/entry.S
> > > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > > >  #endif
> > > > >  3:
> > > > >         apply_ssbd 0, 5f, x0, x1
> > > > > +       mrs x0, sctlr_el1
> > > > >  5:
> > > > >         .endif
> > > > >
> > > > > The median execution time goes to 0.565604s, implying a cost of 1.1ns
> > > > > to read the system register. I also measured the cost of reading from
> > > > > a percpu variable at kernel exit like so:
> > > > >
> > > > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > > > index 303447c765f7..c5a89785f650 100644
> > > > > --- a/arch/arm64/kernel/entry.S
> > > > > +++ b/arch/arm64/kernel/entry.S
> > > > > @@ -340,6 +340,7 @@ alternative_else_nop_endif
> > > > >  #endif
> > > > >  3:
> > > > >         apply_ssbd 0, 5f, x0, x1
> > > > > +       ldr_this_cpu x0, sctlr_el1_cache, x1
> > > > >  5:
> > > > >         .endif
> > > > >
> > > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > > index 0b6ca5479f1f..ac9c9a6179d4 100644
> > > > > --- a/arch/arm64/kernel/traps.c
> > > > > +++ b/arch/arm64/kernel/traps.c
> > > > > @@ -52,6 +52,8 @@
> > > > >  #include <asm/system_misc.h>
> > > > >  #include <asm/sysreg.h>
> > > > >
> > > > > +DEFINE_PER_CPU(unsigned long, sctlr_el1_cache);
> > > > > +
> > > > >  static const char *handler[]= {
> > > > >         "Synchronous Abort",
> > > > >         "IRQ",
> > > > >
> > > > > With that the median execution time goes to 0.600186s, implying a cost
> > > > > of 4.5ns. So at least on existing hardware, it looks like reading
> > > > > SCTLR_EL1 directly is probably our cheapest option if we're going to
> > > > > read *something*. Therefore I'll look at implementing this with
> > > > > unconditional reads from SCTLR_EL1 in v2.
> > > >
> > > > Thanks for the investigation!
> > > >
> > > > I'm not too surprised by this outcome.  There is a possibility that
> > > > reading SCTLR_EL1 sucks badly on some implementations, but on others it
> > > > may be little more than a register rename.  We should probably wait for
> > > > evidence before panicking about it.
> > > >
> > > > So long as this SCTLR_EL1 switching is completely disabled via
> > > > alternatives on hardware to which it isn't applicable, I expect that
> > > > should be a reasonable compromise.
> > >
> > > One disadvantage of disabling SCTLR_EL1 switching on inapplicable
> > > hardware is that it complicates the mechanism for enabling and
> > > disabling deprecated instruction support (see
> > > arch/arm64/kernel/armv8_deprecated.c), which is currently implemented
> > > by using on_each_cpu to set or clear certain bits in SCTLR_EL1 on each
> > > CPU. This is incompatible with an entry/exit that assumes full control
> > > over SCTLR_EL1.
> > >
> > > I would propose the following instruction sequences in entry.S, which
> > > are justified by further experiments run on a Pixel 4 showing that:
> > > 1) an unconditional SCTLR_EL1 write on kernel entry/exit costs 3.0ns,
> > > whereas conditional writes cost 4.5ns
> > > 2) an SCTLR_EL1 read on kernel entry is 0.3ns faster than a load from
> > > task_struct
> > > 3) an unconditional SCTLR_EL1 write on kernel exit costs 0.7ns.
> > >
> > > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> > > index 55af8b504b65..9e0acd5e7527 100644
> > > --- a/arch/arm64/kernel/entry.S
> > > +++ b/arch/arm64/kernel/entry.S
> > > @@ -186,6 +186,12 @@ alternative_cb_end
> > >
> > >         ptrauth_keys_install_kernel tsk, x20, x22, x23
> > >
> > > +alternative_if ARM64_HAS_ADDRESS_AUTH
> > > +       mrs     x22, sctlr_el1
> > > +       orr     x22, x22, SCTLR_ELx_ENIA
> > > +       msr     sctlr_el1, x22
> > > +alternative_else_nop_endif
> > > +
> > >         scs_load tsk, x20
> > >         .else
> > >         add     x21, sp, #S_FRAME_SIZE
> > > @@ -297,6 +303,9 @@ alternative_else_nop_endif
> > >         /* No kernel C function calls after this as user keys are set. */
> > >         ptrauth_keys_install_user tsk, x0, x1, x2
> > >
> > > +       ldr     x0, [tsk, THREAD_SCTLR]
> > > +       msr     sctlr_el1, x0
> > > +
> > >         apply_ssbd 0, x0, x1
> > >         .endif
> > >
> > >
> > > With these instruction sequences I would propose to reimplement
> > > toggling deprecated instruction support as follows:
> > > 1) Stop the machine using stop_machine.
> > > 2) While the machine is stopped, adjust the deprecated instruction
> > > support feature bits in each task_struct.
> > >
> > > This will make toggling deprecated instruction support more expensive,
> > > but that seems like the right tradeoff since I imagine that these
> > > features are rarely toggled, and making it more efficient to toggle
> > > them would mean more instructions in the entry/exit hot path.
> > >
> > > This seems like it will work well if SCTLR_EL1 is set on kernel exit.
> > > The deprecated feature bits are CP15BEN and SED, and each of them
> > > affects features that are only implemented at EL0, so it seems like we
> > > can wait until kernel exit to update SCTLR_EL1. But if we don't set
> > > SCTLR_EL1 on exit on certain hardware, we will need to have both
> > > mechanisms, use the existing mechanism on hardware that doesn't set
> > > SCTLR_EL1 and the new mechanism on hardware that does set SCTLR_EL1.
> > > Which I wouldn't mind too much, but I'm not sure that it's worth it to
> > > save 0.7ns (I'm not claiming that my numbers are universal, just that
> > > we should probably favor simpler approaches if we don't have evidence
> > > that they are significantly more expensive than more complex ones).
> > >
> > > Peter
> >
> > This approach does look like close to the simplest option for the entry/
> > exit path -- I'll let others comment on whether the actual overhead
> > numbers sound acceptable.
> >
> > I think there are more places where the kernel modifies SCTLR than just
> > for the deprecated instruction emulation though.  Do you think your
> > approach will work with all of them, and can they all tolerate a
> > stop_machine()?
> 
> You're right, there are other places. Looking at the places written in
> assembly, I think all of them do not expect the value to be retained.
> But there are more places where we were setting this register in C
> that I hadn't checked. For example, in cpufeature.c we have:
> 
> static void cpu_emulate_effective_ctr(const struct
> arm64_cpu_capabilities *__unused)
> {
>         /*
>          * If the CPU exposes raw CTR_EL0.IDC = 0, while effectively
>          * CTR_EL0.IDC = 1 (from CLIDR values), we need to trap accesses
>          * to the CTR_EL0 on this CPU and emulate it with the real/safe
>          * value.
>          */
>         if (!(read_cpuid_cachetype() & BIT(CTR_IDC_SHIFT)))
>                 sysreg_clear_set(sctlr_el1, SCTLR_EL1_UCT, 0);
> }
> 
> So we may want heterogeneous values of SCTLR_EL1 on different cores,
> unless for example for this routine we reimplement to trap accesses to
> CTR_EL0 homogeneously.

I think that this CTR bit is supposed to read the same for all cores at
EL0, so we could set the UCT bit on all cores if any need it.  There
would be a performance penalty for access to CTR_EL0, but userspace must
already assume that register is slow to access (precisely because it is
trappable).

So perhaps this setting can happily be global.


It it possible that per-cpu settings might be neeed to deal with things
like errata workarounds, but I don't know whether there are any such
workarounds yet.


> Furthermore it is unclear that it would be valid to write to SCTLR_EL1
> directly with the result of reading from a thread_struct field. My
> belief that this behavior may be incorrect comes from the documented
> semantics of reserved fields in SCTLR_EL1. These bits are defined as
> RES0 which means that software must use an SBZP approach to write to
> them, and an approach that reads a value from thread_struct and then
> writes exactly that value to SCTLR_EL1 would not necessarily preserve
> RES0 bits that are different between cores.

Handling of these bits depends on what the software is trying to do.

To flip X when some RES0 bit Y has an unknown value _and_ you don't want
to affect any behaviour unrelated to X, then you should indeed do a
read-modify-write.

If you previously wrote 0 to Y, then you know a) Y is already 0, or
b) it may safely be written with 0  (since a self-modifying bit that
can't be safely zeroed could not be allocated to a RES0 bit in the first
place.)

Also, if you want to guarantee to disable any wacky behaviour some
future architecture version may assign to Y, then you _must_ write Y
with 0.


In general, Linux sanitises RES0 bits during startup by forcing them all
to 0, because we don't know what junk may have been left behind by the
boot stack (though booting.rst should perhaps have some more explicit
rules).  In practice we know that those bits are all already 0 once
Linux has survived through the arch setup code.

So, since Linux detemines the values of all bits, it can cache those
somewhere and simply copy the cache back into the sysreg whenever
appropriate, without needing to read the sysreg again.

As you observe though, this doesn't necessarily mean that this is the
lowest-cost approach.


> This led me to go back to considering approaches where we only update
> EnIA on entry and exit and leave the updating of the other bits to
> task switch at the same time as TCF0 in Catalin's MTE series.
> 
> Such approaches seem like they will perform better as well, if
> extrapolating from existing hardware is anything to go by. It seems
> that my previous measurements on Pixel 4 may have been inaccurate
> because /sys/devices/system/cpu/*/cpufreq/scaling_governor was set to
> its default value so scaling may have interfered with the results. I
> ran a new set of measurements with scaling_governor set to powersave
> which gave me much more stable results. With the newly stable results
> I decided to take the median of 100 runs rather than 1000 to
> compensate for the slower run time caused by the powersave governor.
> These results showed that conditional MSR is faster than unconditional
> MSR, and that LDR from thread_struct to obtain the state of the
> incoming task is slightly faster than MRS. The full results with
> references to patches are below:
> 
> Baseline: 316.7ns
> Conditional MSR based on LDR [1]: 319.6ns
> Unconditional MSR [2]: 353.4ns
> Conditional MSR based on MRS [3]: 323.8ns
> 
> I also ran the same experiment on a DragonBoard 845c. Since this board
> can run the mainline kernel it gives us results that we may expect to
> be slightly more valid when making changes to the mainline kernel than
> the 4.14 kernel used by Pixel 4. On this board I saw two "clusters" of
> results so I took the median of the faster cluster which was more
> stable (I don't think this was due to big.LITTLE because I saw very
> different numbers if I pinned the process to a LITTLE core):
> 
> Baseline: 450.5ns
> Conditional MSR based on LDR: 457.8ns
> Unconditional MSR: 487.0ns
> Conditional MSR based on MRS: 461.5ns
> 
> So we can see very similar perf impacts to Pixel 4. Although this
> board is Cortex-A75/A55 which is the prior uarch generation to Pixel
> 4's Cortex-A76/A55, it also tells us that on big Cortex uarches
> SCTLR_EL1 performance is somewhat stable over time, at least to the
> extent that we can extrapolate that from two data points. Of course,
> we may tune the instruction sequences again once hardware with PAC
> support is publicly available.
> 
> Thus I think that to begin with we should use instruction sequences
> similar to those in [1], protected via alternatives so we wouldn't be
> impacting the performance on existing hardware. I will develop my
> patch on top of the MTE series since they will interfere with each
> other and it seems plausible that MTE will land first.
> 
> Peter

Seems reasonable.  This may emit redundant reads to SCTLR_EL1, but it is
minimally invasive and according to your measurements we are unlikely to
do a lot better on performance.  My guess is that the unconditional-
write approach generates more writes on average, increasing the overall
cost; whereas in [1] the writes saved cover the cost of the
(unconditional) reads introduced.

If we needed to fix up multiple bits at the same time, we could replace
the tbz with a eor+tst+b.eq sequence without significant extra cost,
assuming we have a spare scratch register we can use.


There might still be some microarchitectures on which the unconditional
write approach is faster: but I'm just hypothesising.  Best not to worry
too much about that unless someone comes up with actual evidence for it.

Cheers
---Dave


> [1] https://android-review.googlesource.com/c/kernel/msm/+/1457681
> [2] https://android-review.googlesource.com/c/kernel/msm/+/1456578
> [3] https://android-review.googlesource.com/c/kernel/msm/+/1456579

[...]

arm64: add prctl(PR_PAC_SET_ENABLED_KEYS)

Commit Message

Comments

Patch