[V7,08/18] x86/entry: Preserve PKRS MSR across exceptions

Message ID 20210804043231.2655537-9-ira.weiny@intel.com (mailing list archive)
State New, archived
Series PKS/PMEM: Add Stray Write Protection

Commit Message

Ira Weiny Aug. 4, 2021, 4:32 a.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

The PKRS MSR is not managed by XSAVE.  It is preserved across a context
switch, but that alone leaves exception handling code open to memory
accesses during exceptions.

Two possible places for preserving this state were considered:
irqentry_state_t or pt_regs.[1]  Using pt_regs was much more complicated and
potentially fraught with unintended consequences.[2]  However, Andy
came up with a way to hide additional values on the stack which can be
accessed as "extended_pt_regs".[3]  This method allows for the following: any
place which has struct pt_regs can get access to the extra information; no
extra information is added to irq_state; and pt_regs is left intact for
compatibility with outside tools like BPF.

To simplify, the assembly code only adds space on the stack.  The
setting or use of any needed values is left to the C code.  While some
entry points may not use this space, it is still added wherever pt_regs
is passed to the C code, for consistency.

Each nested exception gets another copy of this extended space allowing
for any number of levels of exception handling.

In the assembly, a macro is defined to allow a central place to add
space for other uses should the need arise.

Finally export pkrs_{save|restore}_irq to the common code to allow
it to preserve the current task's PKRS in the new extended pt_regs if
enabled.
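
For orientation, here is the core of the mechanism condensed from the diff
below (feature checks and the BUILD_BUG_ON are omitted):

/* Condensed from this patch: the aux space sits just below pt_regs on the
 * stack and is reached from C via container_of().
 */
struct extended_pt_regs {
	u32 thread_pkrs;	/* interrupted task's PKRS */
	u32 pad;		/* keep the stack 8 byte aligned */
	struct pt_regs pt_regs;
};

static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
{
	return container_of(regs, struct extended_pt_regs, pt_regs);
}

/* entry: stash the task's PKRS and switch to the init value */
void pkrs_save_irq(struct pt_regs *regs)
{
	extended_pt_regs(regs)->thread_pkrs = current->thread.saved_pkrs;
	write_pkrs(pkrs_init_value);
}

/* exit: put the stashed value back */
void pkrs_restore_irq(struct pt_regs *regs)
{
	u32 pkrs = extended_pt_regs(regs)->thread_pkrs;

	write_pkrs(pkrs);
	current->thread.saved_pkrs = pkrs;
}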

Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
aided in the development of the patch.

[1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
[2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
[3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes for V7:
	Rebased to 5.14 entry code
	declare write_pkrs() in pks.h
	s/INIT_PKRS_VALUE/pkrs_init_value
	Remove unnecessary INIT_PKRS_VALUE def
	s/pkrs_save_set_irq/pkrs_save_irq/
		The initial value for exceptions is best managed
		completely within the pkey code.
---
 arch/x86/entry/calling.h               | 26 +++++++++++++
 arch/x86/entry/common.c                | 54 ++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S              | 22 ++++++-----
 arch/x86/entry/entry_64_compat.S       |  6 +--
 arch/x86/include/asm/pks.h             | 18 +++++++++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/kernel/head_64.S              |  7 ++--
 arch/x86/mm/fault.c                    |  3 ++
 include/linux/pkeys.h                  | 11 +++++-
 kernel/entry/common.c                  | 14 ++++++-
 10 files changed, 143 insertions(+), 20 deletions(-)

Comments

Ira Weiny Nov. 13, 2021, 12:50 a.m. UTC | #1
On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
> switch but this support leaves exception handling code open to memory
> accesses during exceptions.
> 
> 2 possible places for preserving this state were considered,
> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
> was potentially fraught with unintended consequences.[2]  However, Andy
> came up with a way to hide additional values on the stack which could be
> accessed as "extended_pt_regs".[3]

Andy,

I'm preparing to send V8 of this PKS work.  But I have not seen any feedback
since I originally implemented this in V4[1].

Does this meet your expectations?  Are there any issues you can see with this
code?

I would appreciate your feedback.

Thanks,
Ira

[1] https://lore.kernel.org/lkml/20210322053020.2287058-9-ira.weiny@intel.com/

> This method allows for; any place
> which has struct pt_regs can get access to the extra information; no
> extra information is added to irq_state; and pt_regs is left intact for
> compatibility with outside tools like BPF.
> 
> To simplify, the assembly code only adds space on the stack.  The
> setting or use of any needed values are left to the C code.  While some
> entry points may not use this space it is still added where ever pt_regs
> is passed to the C code for consistency.
> 
> Each nested exception gets another copy of this extended space allowing
> for any number of levels of exception handling.
> 
> In the assembly, a macro is defined to allow a central place to add
> space for other uses should the need arise.
> 
> Finally export pkrs_{save|restore}_irq to the common code to allow
> it to preserve the current task's PKRS in the new extended pt_regs if
> enabled.
> 
> Peter, Thomas, Andy, Dave, and Dan all suggested parts of the patch or
> aided in the development of the patch..
> 
> [1] https://lore.kernel.org/lkml/CALCETrVe1i5JdyzD_BcctxQJn+ZE3T38EFPgjxN1F577M36g+w@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/874kpxx4jf.fsf@nanos.tec.linutronix.de/#t
> [3] https://lore.kernel.org/lkml/CALCETrUHwZPic89oExMMe-WyDY8-O3W68NcZvse3=PGW+iW5=w@mail.gmail.com/
> 
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes for V7:
> 	Rebased to 5.14 entry code
> 	declare write_pkrs() in pks.h
> 	s/INIT_PKRS_VALUE/pkrs_init_value
> 	Remove unnecessary INIT_PKRS_VALUE def
> 	s/pkrs_save_set_irq/pkrs_save_irq/
> 		The inital value for exceptions is best managed
> 		completely within the pkey code.
> ---
>  arch/x86/entry/calling.h               | 26 +++++++++++++
>  arch/x86/entry/common.c                | 54 ++++++++++++++++++++++++++
>  arch/x86/entry/entry_64.S              | 22 ++++++-----
>  arch/x86/entry/entry_64_compat.S       |  6 +--
>  arch/x86/include/asm/pks.h             | 18 +++++++++
>  arch/x86/include/asm/processor-flags.h |  2 +
>  arch/x86/kernel/head_64.S              |  7 ++--
>  arch/x86/mm/fault.c                    |  3 ++
>  include/linux/pkeys.h                  | 11 +++++-
>  kernel/entry/common.c                  | 14 ++++++-
>  10 files changed, 143 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index a4c061fb7c6e..a2f94677c3fd 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -63,6 +63,32 @@ For 32-bit we have the following conventions - kernel is built with
>   * for assembly code:
>   */
>  
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:		C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +	.if \annotate_retpoline_safe == 1
> +		ANNOTATE_RETPOLINE_SAFE
> +	.endif
> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* remove space for extended_pt_regs */
> +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +.endm
> +
> +.macro call_ext_ptregs cfunc
> +	__call_ext_ptregs \cfunc, annotate_retpoline_safe=0
> +.endm
> +
>  .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
>  	.if \save_ret
>  	pushq	%rsi		/* pt_regs->si */
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 6c2826417b33..a0d1d5519dba 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -19,6 +19,7 @@
>  #include <linux/nospec.h>
>  #include <linux/syscalls.h>
>  #include <linux/uaccess.h>
> +#include <linux/pkeys.h>
>  
>  #ifdef CONFIG_XEN_PV
>  #include <xen/xen-ops.h>
> @@ -34,6 +35,7 @@
>  #include <asm/io_bitmap.h>
>  #include <asm/syscall.h>
>  #include <asm/irq_stack.h>
> +#include <asm/pks.h>
>  
>  #ifdef CONFIG_X86_64
>  
> @@ -252,6 +254,56 @@ SYSCALL_DEFINE0(ni_syscall)
>  	return -ENOSYS;
>  }
>  
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +
> +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code)
> +{
> +	struct extended_pt_regs *ept_regs = extended_pt_regs(regs);
> +
> +	if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK))
> +		pr_alert("PKRS: 0x%x\n", ept_regs->thread_pkrs);
> +}
> +
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)
> +{
> +	struct extended_pt_regs *ept_regs;
> +
> +	BUILD_BUG_ON(sizeof(struct extended_pt_regs)
> +			!= EXTENDED_PT_REGS_SIZE
> +				+ sizeof(struct pt_regs));
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	ept_regs = extended_pt_regs(regs);
> +	ept_regs->thread_pkrs = current->thread.saved_pkrs;
> +	write_pkrs(pkrs_init_value);
> +}
> +
> +void pkrs_restore_irq(struct pt_regs *regs)
> +{
> +	struct extended_pt_regs *ept_regs;
> +
> +	if (!cpu_feature_enabled(X86_FEATURE_PKS))
> +		return;
> +
> +	ept_regs = extended_pt_regs(regs);
> +	write_pkrs(ept_regs->thread_pkrs);
> +	current->thread.saved_pkrs = ept_regs->thread_pkrs;
> +}
> +
> +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
> +
>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> +		/* Normally called by irqentry_exit, restore pkrs here */
> +		pkrs_restore_irq(regs);
>  		irqentry_exit_cond_resched();
>  		instrumentation_end();
>  		restore_inhcall(inhcall);
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index e38a4cf795d9..1c390975a3de 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -332,7 +332,7 @@ SYM_CODE_END(ret_from_fork)
>  		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
>  	.endif
>  
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	error_return
>  .endm
> @@ -435,7 +435,7 @@ SYM_CODE_START(\asmsym)
>  
>  	movq	%rsp, %rdi		/* pt_regs pointer */
>  
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	paranoid_exit
>  
> @@ -496,7 +496,7 @@ SYM_CODE_START(\asmsym)
>  	 * stack.
>  	 */
>  	movq	%rsp, %rdi		/* pt_regs pointer */
> -	call	vc_switch_off_ist
> +	call_ext_ptregs vc_switch_off_ist
>  	movq	%rax, %rsp		/* Switch to new stack */
>  
>  	UNWIND_HINT_REGS
> @@ -507,7 +507,7 @@ SYM_CODE_START(\asmsym)
>  
>  	movq	%rsp, %rdi		/* pt_regs pointer */
>  
> -	call	kernel_\cfunc
> +	call_ext_ptregs kernel_\cfunc
>  
>  	/*
>  	 * No need to switch back to the IST stack. The current stack is either
> @@ -542,7 +542,7 @@ SYM_CODE_START(\asmsym)
>  	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
>  	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
>  	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
> -	call	\cfunc
> +	call_ext_ptregs \cfunc
>  
>  	jmp	paranoid_exit
>  
> @@ -781,7 +781,7 @@ SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
>  	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
>  	UNWIND_HINT_REGS
>  
> -	call	xen_pv_evtchn_do_upcall
> +	call_ext_ptregs xen_pv_evtchn_do_upcall
>  
>  	jmp	error_return
>  SYM_CODE_END(exc_xen_hypervisor_callback)
> @@ -987,7 +987,7 @@ SYM_CODE_START_LOCAL(error_entry)
>  	/* Put us onto the real thread stack. */
>  	popq	%r12				/* save return addr in %12 */
>  	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
> -	call	sync_regs
> +	call_ext_ptregs sync_regs
>  	movq	%rax, %rsp			/* switch stack */
>  	ENCODE_FRAME_POINTER
>  	pushq	%r12
> @@ -1042,7 +1042,7 @@ SYM_CODE_START_LOCAL(error_entry)
>  	 * as if we faulted immediately after IRET.
>  	 */
>  	mov	%rsp, %rdi
> -	call	fixup_bad_iret
> +	call_ext_ptregs fixup_bad_iret
>  	mov	%rax, %rsp
>  	jmp	.Lerror_entry_from_usermode_after_swapgs
>  SYM_CODE_END(error_entry)
> @@ -1148,7 +1148,7 @@ SYM_CODE_START(asm_exc_nmi)
>  
>  	movq	%rsp, %rdi
>  	movq	$-1, %rsi
> -	call	exc_nmi
> +	call_ext_ptregs exc_nmi
>  
>  	/*
>  	 * Return back to user mode.  We must *not* do the normal exit
> @@ -1184,6 +1184,8 @@ SYM_CODE_START(asm_exc_nmi)
>  	 * +---------------------------------------------------------+
>  	 * | pt_regs                                                 |
>  	 * +---------------------------------------------------------+
> +	 * | (Optionally) extended_pt_regs                           |
> +	 * +---------------------------------------------------------+
>  	 *
>  	 * The "original" frame is used by hardware.  Before re-enabling
>  	 * NMIs, we need to be done with it, and we need to leave enough
> @@ -1360,7 +1362,7 @@ end_repeat_nmi:
>  
>  	movq	%rsp, %rdi
>  	movq	$-1, %rsi
> -	call	exc_nmi
> +	call_ext_ptregs exc_nmi
>  
>  	/* Always restore stashed CR3 value (see paranoid_entry) */
>  	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
> diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
> index 0051cf5c792d..53254d29d5c7 100644
> --- a/arch/x86/entry/entry_64_compat.S
> +++ b/arch/x86/entry/entry_64_compat.S
> @@ -136,7 +136,7 @@ SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
>  .Lsysenter_flags_fixed:
>  
>  	movq	%rsp, %rdi
> -	call	do_SYSENTER_32
> +	call_ext_ptregs do_SYSENTER_32
>  	/* XEN PV guests always use IRET path */
>  	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
>  		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
> @@ -253,7 +253,7 @@ SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
>  	UNWIND_HINT_REGS
>  
>  	movq	%rsp, %rdi
> -	call	do_fast_syscall_32
> +	call_ext_ptregs do_fast_syscall_32
>  	/* XEN PV guests always use IRET path */
>  	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
>  		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
> @@ -410,6 +410,6 @@ SYM_CODE_START(entry_INT80_compat)
>  	cld
>  
>  	movq	%rsp, %rdi
> -	call	do_int80_syscall_32
> +	call_ext_ptregs do_int80_syscall_32
>  	jmp	swapgs_restore_regs_and_return_to_usermode
>  SYM_CODE_END(entry_INT80_compat)
> diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
> index e7727086cec2..76960ec71b4b 100644
> --- a/arch/x86/include/asm/pks.h
> +++ b/arch/x86/include/asm/pks.h
> @@ -4,15 +4,33 @@
>  
>  #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>  
> +struct extended_pt_regs {
> +	u32 thread_pkrs;
> +	/* Keep stack 8 byte aligned */
> +	u32 pad;
> +	struct pt_regs pt_regs;
> +};
> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);
> +
> +static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
> +{
> +	return container_of(regs, struct extended_pt_regs, pt_regs);
> +}
> +
> +void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
>  
>  #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
>  static inline void setup_pks(void) { }
>  static inline void pkrs_write_current(void) { }
>  static inline void pks_init_task(struct task_struct *task) { }
> +static inline void write_pkrs(u32 new_pkrs) { }
> +static inline void show_extended_regs_oops(struct pt_regs *regs,
> +					   unsigned long error_code) { }
>  
>  #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
> diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
> index 02c2cbda4a74..4a41fc4cf028 100644
> --- a/arch/x86/include/asm/processor-flags.h
> +++ b/arch/x86/include/asm/processor-flags.h
> @@ -53,4 +53,6 @@
>  # define X86_CR3_PTI_PCID_USER_BIT	11
>  #endif
>  
> +#define EXTENDED_PT_REGS_SIZE 8
> +
>  #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index d8b3ebd2bb85..90e76178b6b4 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -319,8 +319,7 @@ SYM_CODE_START_NOALIGN(vc_boot_ghcb)
>  	movq    %rsp, %rdi
>  	movq	ORIG_RAX(%rsp), %rsi
>  	movq	initial_vc_handler(%rip), %rax
> -	ANNOTATE_RETPOLINE_SAFE
> -	call	*%rax
> +	__call_ext_ptregs *%rax, annotate_retpoline_safe=1
>  
>  	/* Unwind pt_regs */
>  	POP_REGS
> @@ -397,7 +396,7 @@ SYM_CODE_START_LOCAL(early_idt_handler_common)
>  	UNWIND_HINT_REGS
>  
>  	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
> -	call do_early_exception
> +	call_ext_ptregs do_early_exception
>  
>  	decl early_recursion_flag(%rip)
>  	jmp restore_regs_and_return_to_kernel
> @@ -421,7 +420,7 @@ SYM_CODE_START_NOALIGN(vc_no_ghcb)
>  	/* Call C handler */
>  	movq    %rsp, %rdi
>  	movq	ORIG_RAX(%rsp), %rsi
> -	call    do_vc_no_ghcb
> +	call_ext_ptregs do_vc_no_ghcb
>  
>  	/* Unwind pt_regs */
>  	POP_REGS
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index e133c0ed72a0..a4ce7cef0260 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -32,6 +32,7 @@
>  #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
>  #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
>  #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
> +#include <asm/pks.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <asm/trace/exceptions.h>
> @@ -547,6 +548,8 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  		 (error_code & X86_PF_PK)    ? "protection keys violation" :
>  					       "permissions violation");
>  
> +	show_extended_regs_oops(regs, error_code);
> +
>  	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
>  		struct desc_ptr idt, gdt;
>  		u16 ldtr, tr;
> diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
> index 580238388f0c..76eb19a37942 100644
> --- a/include/linux/pkeys.h
> +++ b/include/linux/pkeys.h
> @@ -52,6 +52,15 @@ enum pks_pkey_consumers {
>  	PKS_KEY_NR_CONSUMERS
>  };
>  extern u32 pkrs_init_value;
> -#endif
> +
> +void pkrs_save_irq(struct pt_regs *regs);
> +void pkrs_restore_irq(struct pt_regs *regs);
> +
> +#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
> +
> +static inline void pkrs_save_irq(struct pt_regs *regs) { }
> +static inline void pkrs_restore_irq(struct pt_regs *regs) { }
> +
> +#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
>  
>  #endif /* _LINUX_PKEYS_H */
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..aa0b1e8dd742 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/pkeys.h>
>  
>  #include "common.h"
>  
> @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		instrumentation_end();
>  
>  		ret.exit_rcu = true;
> -		return ret;
> +		goto done;
>  	}
>  
>  	/*
> @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	instrumentation_end();
>  
> +done:
> +	pkrs_save_irq(regs);
>  	return ret;
>  }
>  
> @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	pkrs_restore_irq(regs);
> +
> +	if (!regs_irqs_disabled(regs)) {
>  		/*
>  		 * If RCU was not watching on entry this needs to be done
>  		 * carefully and needs the same ordering of lockdep/tracing
> @@ -458,11 +466,13 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  	ftrace_nmi_enter();
>  	instrumentation_end();
>  
> +	pkrs_save_irq(regs);
>  	return irq_state;
>  }
>  
>  void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  {
> +	pkrs_restore_irq(regs);
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
>  	if (irq_state.lockdep) {
> -- 
> 2.28.0.rc0.12.gb6a658bd00c9
>
Thomas Gleixner Nov. 25, 2021, 11:19 a.m. UTC | #2
On Fri, Nov 12 2021 at 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny <ira.weiny@intel.com>
>> 
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>> 
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
>
> Andy,
>
> I'm preparing to send V8 of this PKS work.  But I have not seen any feed back
> since I originally implemented this in V4[1].
>
> Does this meets your expectations?  Are there any issues you can see with this
> code?
>
> I would appreciate your feedback.

Not Andy here, but I'll respond to the patch in a minute.

Thanks,

        tglx
Thomas Gleixner Nov. 25, 2021, 2:12 p.m. UTC | #3
Ira,

On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> +/*
> + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> + * @cfunc:		C function to be called
> + *
> + * This will ensure that extended_ptregs is added and removed as needed during
> + * a call into C code.
> + */
> +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif
> +	.if \annotate_retpoline_safe == 1
> +		ANNOTATE_RETPOLINE_SAFE
> +	.endif

This annotation is new, and it is nowhere mentioned why it is part of this
patch.

Can you please do _ONE_ functional change per patch and not an
unreviewable pile of changes in one go? Same applies for the ASM and the
C code changes. The ASM change has to go first and then the C code can
build upon it.

> +	call	\cfunc
> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* remove space for extended_pt_regs */
> +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> +#endif

I really have to ask the question whether this #ifdeffery has any value
at all. 8 bytes extra stack usage is not going to be the end of the
world and distro kernels will enable that config anyway.

If we really want to save the space then certainly not by sprinkling
CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
extra sized ptregs in the pkrs header.

You are changing generic architecture code so you better think about
making such a change generic and extensible. Can folks please start
thinking beyond the brim of their teacup and not pretend that the
feature they are working on is the unicorn which requires unique magic
brandnamed after the unicorn of the day.

If the next feature comes around which needs to save something in that
extended area then we are going to change the world again, right?
Certainly not.

This wants to go into asm/ptrace.h:

struct pt_regs_aux {
	u32	pkrs;
};

struct pt_regs_extended {
	struct pt_regs_aux	aux;
        struct pt_regs		regs __attribute__((aligned(8)));
};

and then have in asm-offset:

   DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));

which does the right thing whatever the size of pt_regs_aux is. So for
the above it will have:

 #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

Even if you do

struct pt_regs_aux {
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
	u32	pkrs;
#endif        
};

and the config switch is disabled, it's still correct:

 #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */

See? No magic hardcoded constant, no build time error checking for that
constant. Nothing, it just works.

That's one part, but let me come back to this:

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +	/* add space for extended_pt_regs */
> +	subq    $EXTENDED_PT_REGS_SIZE, %rsp

What guarantees that RSP points to pt_regs at this point?  Nothing at
all. It's just pure luck and a question of time until this explodes in
hard to diagnose ways.

Because between

        movq	%rsp, %rdi
and
        call    ....

can legitimately be other code which causes the stack pointer to
change. It's not the case today, but nothing prevents this in the
future.

The correct thing to do is:

        movq	%rsp, %rdi
        RSP_MAKE_PT_REGS_AUX_SPACE
        call	...
        RSP_REMOVE_PT_REGS_AUX_SPACE

The few extra macro lines in the actual code are way better as they make
it completely obvious what's going on and any misuse can be spotted
easily.

> +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> +/*
> + * PKRS is a per-logical-processor MSR which overlays additional protection for
> + * pages which have been mapped with a protection key.
> + *
> + * Context switches save the MSR in the task struct thus taking that value to
> + * other processors if necessary.
> + *
> + * To protect against exceptions having access to this memory save the current
> + * thread value and set the PKRS value to be used during the exception.
> + */
> +void pkrs_save_irq(struct pt_regs *regs)

That's a misnomer, as this is invoked for _any_ exception, not just
interrupts.

>  #ifdef CONFIG_XEN_PV
>  #ifndef CONFIG_PREEMPTION
>  /*
> @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>  	inhcall = get_and_clear_inhcall();
>  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> +		/* Normally called by irqentry_exit, restore pkrs here */
> +		pkrs_restore_irq(regs);
> 		irqentry_exit_cond_resched();

Sigh. Consistency is overrated....

> +
>  void setup_pks(void);
>  void pkrs_write_current(void);
>  void pks_init_task(struct task_struct *task);
> +void write_pkrs(u32 new_pkrs);

So we have pkrs_write_current() and write_pkrs() now. Can you please
stick to a common prefix, i.e. pkrs_ ?

> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bf16395b9e13..aa0b1e8dd742 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,7 @@
>  #include <linux/livepatch.h>
>  #include <linux/audit.h>
>  #include <linux/tick.h>
> +#include <linux/pkeys.h>
>  
>  #include "common.h"
>  
> @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  		instrumentation_end();
>  
>  		ret.exit_rcu = true;
> -		return ret;
> +		goto done;
>  	}
>  
>  	/*
> @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  	trace_hardirqs_off_finish();
>  	instrumentation_end();
>  
> +done:
> +	pkrs_save_irq(regs);

This still calls out into instrumentable code. I explained to you before
why this is wrong. Also objtool emits warnings to that effect if you do a
proper verified build.

>  	return ret;
>  }
>  
> @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>  	/* Check whether this returns to user mode */
>  	if (user_mode(regs)) {
>  		irqentry_exit_to_user_mode(regs);
> -	} else if (!regs_irqs_disabled(regs)) {
> +		return;
> +	}
> +
> +	pkrs_restore_irq(regs);

At least you are now putting it consistently at the wrong place
vs. noinstr.

Though, if you look at the xen_pv_evtchn_do_upcall() part where you
added this extra invocation, you might figure out that adding
pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
the 'else' path in irqentry_exit() makes it magically consistent for
both use cases.

Thanks,

        tglx
Andy Lutomirski Dec. 3, 2021, 1:13 a.m. UTC | #4
On 11/12/21 16:50, Ira Weiny wrote:
> On Tue, Aug 03, 2021 at 09:32:21PM -0700, 'Ira Weiny' wrote:
>> From: Ira Weiny <ira.weiny@intel.com>
>>
>> The PKRS MSR is not managed by XSAVE.  It is preserved through a context
>> switch but this support leaves exception handling code open to memory
>> accesses during exceptions.
>>
>> 2 possible places for preserving this state were considered,
>> irqentry_state_t or pt_regs.[1]  pt_regs was much more complicated and
>> was potentially fraught with unintended consequences.[2]  However, Andy
>> came up with a way to hide additional values on the stack which could be
>> accessed as "extended_pt_regs".[3]
> 
> Andy,
> 
> I'm preparing to send V8 of this PKS work.  But I have not seen any feed back
> since I originally implemented this in V4[1].
> 
> Does this meets your expectations?  Are there any issues you can see with this
> code?

I think I'm generally okay with the approach to allocating space.  All 
of Thomas' comments still apply, though.  (Sorry, I'm horribly behind.)
Ira Weiny Dec. 7, 2021, 1:54 a.m. UTC | #5
Thomas,

Thanks for the review.  Sorry for being so late to respond; I was sick all last
week, so it took me longer to figure out some of this stuff.

On Thu, Nov 25, 2021 at 03:12:47PM +0100, Thomas Gleixner wrote:
> Ira,
> 
> On Tue, Aug 03 2021 at 21:32, ira weiny wrote:
> > +/*
> > + * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
> > + * @cfunc:		C function to be called
> > + *
> > + * This will ensure that extended_ptregs is added and removed as needed during
> > + * a call into C code.
> > + */
> > +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* add space for extended_pt_regs */
> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> > +#endif
> > +	.if \annotate_retpoline_safe == 1
> > +		ANNOTATE_RETPOLINE_SAFE
> > +	.endif
> 
> This annotation is new and nowhere mentioned why it is part of this
> patch.

I don't understand.  ANNOTATE_RETPOLINE_SAFE has been around since:

9e0e3c5130e9 x86/speculation, objtool: Annotate indirect calls/jumps for objtool

> 
> Can you please do _ONE_ functional change per patch and not a
> unreviewable pile of changes in one go? Same applies for the ASM and the
> C code changes. The ASM change has to go first and then the C code can
> build upon it.

I'm sorry for having the ASM and C code together, but this all seemed like one
change to me.

I can split it if you prefer.  How about a patch with just the x86 extended
pt_regs stuff, which would leave a zero size for the extended area, followed by
a patch with the PKS bits?

> 
> > +	call	\cfunc
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* remove space for extended_pt_regs */
> > +	addq    $EXTENDED_PT_REGS_SIZE, %rsp
> > +#endif
> 
> I really have to ask the question whether this #ifdeffery has any value
> at all. 8 bytes extra stack usage is not going to be the end of the
> world and distro kernels will enable that config anyway.

My goal with this has always been zero overhead when turned off, so this seemed
like a logical addition.  Furthermore, ARCH_ENABLE_SUPERVISOR_PKEYS is
predicated on ARCH_HAS_SUPERVISOR_PKEYS, which is only available on x86_64.
This removes the space for x86 when not needed.

All the config stuff was introduced in patch 04/18.[0]

> 
> If we really want to save the space then certainly not by sprinkling
> CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
> extra sized ptregs in the pkrs header.
> 
> You are changing generic architecture code so you better think about
> making such a change generic and extensible.

I agree.  And I tried to do so.  The generic entry code is modified only by the
addition of pkrs_[save|restore]_irq().  These are only defined if the arch
defines ARCH_HAS_SUPERVISOR_PKEYS and furthermore, if something triggers
enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

ARCH_HAS_SUPERVISOR_PKEYS is restricted to x86_64 at the moment.  All other
arches, including x86, should not see any changes in the generic code.

I thought we had agreed, when these changes were discussed, that it was ok for
me to restrict the addition of the extended pt_regs to what was required for
PKS, because at the time I was concerned about my lack of knowledge of all the
other architectures.[1]

>
> Can folks please start
> thinking beyond the brim of their teacup and not pretend that the
> feature they are working on is the unicorn which requires unique magic
> brandnamed after the unicorn of the day.
> 
> If the next feature comes around which needs to save something in that
> extended area then we are going to change the world again, right?

I'm not sure what you mean by 'change the world'.  I would anticipate the entry
code being modified with something similar to pks_[save|restore]_irq(), letting
the arch deal with the specifics.

Also, in [1] I thought Peter and Andy agreed that placing additional generic
state in the extended pt_regs was not needed and does not buy us anything.  I
specifically asked in [2] if that was something we wanted to do.

> Certainly not.
> 
> This wants to go into asm/ptrace.h:
> 
> struct pt_regs_aux {
> 	u32	pkrs;
> };
> 
> struct pt_regs_extended {
> 	struct pt_regs_aux	aux;
>         struct pt_regs		regs __attribute__((aligned(8)));
> };

OK, the aligned attribute does what I was doing, much more gracefully.  This is
a good idea, yes, thank you.

> 
> and then have in asm-offset:
> 
>    DEFINE(PT_REGS_AUX_SIZE, sizeof(struct pt_regs_extended) - sizeof(struct pt_regs));
> 
> which does the right thing whatever the size of pt_regs_aux is. So for
> the above it will have:
> 
>  #define PT_REGS_AUX_SIZE 8 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */
> 
> Even, if you do
> 
> struct pt_regs_aux {
> #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> 	u32	pkrs;
> #endif        
> };
> 
> and the config switch is disabled. It's still correct:
> 
>  #define PT_REGS_AUX_SIZE 0 /* sizeof(struct pt_regs_extended) - sizeof(struct pt_regs) */
> 
> See? No magic hardcoded constant, no build time error checking for that
> constant. Nothing, it just works.

Yes agreed definitely an improvement.

> 
> That's one part, but let me come back to this:
> 
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +	/* add space for extended_pt_regs */
> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
> 
> What guarantees that RSP points to pt_regs at this point?  Nothing at
> all. It's just pure luck and a question of time until this explodes in
> hard to diagnose ways.

It took me a bit to wrap my head around what I think you mean.  My initial
response was that rsp should be the stack pointer for __call_ext_ptregs() just
like it was for call.  But I think I see that it is better to open code this,
since others may want to play the same trick without using this code, and
therefore we may not be getting the extended pt_regs structure on the stack
like we think.  For example, if someone did...

	movq	%rsp, %rdi
	RSP_ADD_OTHER_STACK_STUFF
	__call_ext_ptregs	...
	RSP_REMOVE_OTHER_STACK_STUFF

... it would be broken.

My assumption was that would be illegal after this patch.  But indeed there is
no way to easily see that in the future.

> 
> Because between
> 
>         movq	%rsp, %rdi
> and
>         call    ....
> 
> can legitimately be other code which causes the stack pointer to
> change. It's not the case today, but nothing prevents this in the
> future.
> 
> The correct thing to do is:
> 
>         movq	%rsp, %rdi
>         RSP_MAKE_PT_REGS_AUX_SPACE
>         call	...
>         RSP_REMOVE_PT_REGS_AUX_SPACE
> 
> The few extra macro lines in the actual code are way better as they make
> it completely obvious what's going on and any misuse can be spotted
> easily.

Sure.  FWIW, this is what I had originally, but I thought it would be cleaner
to wrap the 'call'.  I will convert it back.  This also removes the
annotate_retpoline_safe stuff above.

> 
> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
> > +/*
> > + * PKRS is a per-logical-processor MSR which overlays additional protection for
> > + * pages which have been mapped with a protection key.
> > + *
> > + * Context switches save the MSR in the task struct thus taking that value to
> > + * other processors if necessary.
> > + *
> > + * To protect against exceptions having access to this memory save the current
> > + * thread value and set the PKRS value to be used during the exception.
> > + */
> > +void pkrs_save_irq(struct pt_regs *regs)
> 
> That's a misnomer as this is invoked for _any_ exception not just
> interrupts.

I'm confused by the naming in kernel/entry/common.c then.  I'm more than
willing to change the name, but I only see irq* for almost everything in that
file, and I was trying to follow that convention.

> 
> >  #ifdef CONFIG_XEN_PV
> >  #ifndef CONFIG_PREEMPTION
> >  /*
> > @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
> >  
> >  	inhcall = get_and_clear_inhcall();
> >  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> > +		/* Normally called by irqentry_exit, restore pkrs here */
> > +		pkrs_restore_irq(regs);
> > 		irqentry_exit_cond_resched();
> 
> Sigh. Consistency is overrated....

I'm not that familiar with the Xen code, so perhaps I missed something?

> 
> > +
> >  void setup_pks(void);
> >  void pkrs_write_current(void);
> >  void pks_init_task(struct task_struct *task);
> > +void write_pkrs(u32 new_pkrs);
> 
> So we have pkrs_write_current() and write_pkrs() now. Can you please
> stick to a common prefix, i.e. pkrs_ ?

Sorry, yes.

> 
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index bf16395b9e13..aa0b1e8dd742 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/livepatch.h>
> >  #include <linux/audit.h>
> >  #include <linux/tick.h>
> > +#include <linux/pkeys.h>
> >  
> >  #include "common.h"
> >  
> > @@ -364,7 +365,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> >  		instrumentation_end();
> >  
> >  		ret.exit_rcu = true;
> > -		return ret;
> > +		goto done;
> >  	}
> >  
> >  	/*
> > @@ -379,6 +380,8 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
> >  	trace_hardirqs_off_finish();
> >  	instrumentation_end();
> >  
> > +done:
> > +	pkrs_save_irq(regs);
> 
> This still calls out into instrumentable code. I explained you before
> why this is wrong. Also objtool emits warnings to that effect if you do a
> proper verified build.

I was not sure what a 'proper verified build' was, and objtool was not throwing
any warnings for me even when I ran it directly.

10:49:27 > ./tools/objtool/objtool check -n vmlinux.o
vmlinux.o: warning: objtool: ftrace_caller()+0x94: call without frame pointer save/setup
vmlinux.o: warning: objtool: ftrace_regs_caller()+0xde: call without frame pointer save/setup
vmlinux.o: warning: objtool: return_to_handler()+0x10: call without frame pointer save/setup
vmlinux.o: warning: objtool: copy_mc_fragile() falls through to next function copy_mc_fragile_handle_tail()
vmlinux.o: warning: objtool: copy_user_enhanced_fast_string() falls through to next function copy_user_generic_unrolled()
vmlinux.o: warning: objtool: __memset() falls through to next function memset_erms()
vmlinux.o: warning: objtool: __memcpy() falls through to next function memcpy_erms()
vmlinux.o: warning: objtool: file already has .static_call_sites section, skipping


After asking around and digging quite a bit I found CONFIG_DEBUG_ENTRY which
enabled the check and the error.  [But only during a build and not with the
above command???  Shouldn't the above command work too?]

What other config options should we be running with to verify the build?

Regardless, reading more about noinstr and looking at the code more carefully I
realize I _completely_ misunderstood what you meant before in [3].  I should
have asked for clarification.

Yes this was originally marked noinstr because it was called from a noinstr
function.  I see now, or at least I think I see, that you were taking exception
to my blindly marking pkrs_save_irq() noinstr without a good reason.

When you said 'there is absolutely no reason to have this marked noinstr', I
thought that meant we could simply remove it from noinstr.  But what I think
you meant is that there is no reason for it to _be_ noinstr, _and_ I should
also make it be called from the instrumentable sections of the irqentry_*() calls.

So something like this patch on top of this series?  [With an equivalent change
for pkrs_restore_irq().]

11:03:18 > git di
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index c7356733632e..1c0a70a17e93 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -360,10 +360,11 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
                rcu_irq_enter();
                instrumentation_begin();
                trace_hardirqs_off_finish();
+               pkrs_save_irq(regs);
                instrumentation_end();
 
                ret.exit_rcu = true;
-               goto done;
+               return ret;
        }
 
        /*
@@ -376,10 +377,9 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
        instrumentation_begin();
        rcu_irq_enter_check_tick();
        trace_hardirqs_off_finish();
+       pkrs_save_irq(regs);
        instrumentation_end();
 
-done:
-       pkrs_save_irq(regs);
        return ret;
 }
 
@@ -462,9 +462,9 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
        instrumentation_begin();
        trace_hardirqs_off_finish();
        ftrace_nmi_enter();
+       pkrs_save_irq(regs);
        instrumentation_end();
 
-       pkrs_save_irq(regs);
        return irq_state;
 }


> 
> >  	return ret;
> >  }
> >  
> > @@ -404,7 +407,12 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
> >  	/* Check whether this returns to user mode */
> >  	if (user_mode(regs)) {
> >  		irqentry_exit_to_user_mode(regs);
> > -	} else if (!regs_irqs_disabled(regs)) {
> > +		return;
> > +	}
> > +
> > +	pkrs_restore_irq(regs);
> 
> At least you are now putting it consistently at the wrong place
> vs. noinstr.

Indeed.  Sorry about not understanding noinstr fully.

> 
> Though, if you look at the xen_pv_evtchn_do_upcall() part where you
> added this extra invocation you might figure out that adding
> pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitely to
> the 'else' path in irqentry_exit() makes it magically consistent for
> both use cases.
> 

Thank you, yes, good catch.  However, I think I need at least one more call in
the !regs_irqs_disabled() && state.exit_rcu case, right?

11:29:48 > git di
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 717091910ebc..667676ebc129 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -363,8 +363,6 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
        inhcall = get_and_clear_inhcall();
        if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
-               /* Normally called by irqentry_exit, restore pkrs here */
-               pkrs_restore_irq(regs);
                irqentry_exit_cond_resched();
                instrumentation_end();
                restore_inhcall(inhcall);
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 1c0a70a17e93..60ae3d4c9cc0 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -385,6 +385,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 
 void irqentry_exit_cond_resched(void)
 {
+       pkrs_restore_irq(regs);
        if (!preempt_count()) {
                /* Sanity check RCU and thread stack */
                rcu_irq_exit_check_preempt();
@@ -408,8 +409,6 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                return;
        }
 
-       pkrs_restore_irq(regs);
-
        if (!regs_irqs_disabled(regs)) {
                /*
                 * If RCU was not watching on entry this needs to be done
@@ -421,6 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                        /* Tell the tracer that IRET will enable interrupts */
                        trace_hardirqs_on_prepare();
                        lockdep_hardirqs_on_prepare(CALLER_ADDR0);
+                       pkrs_restore_irq(regs);
                        instrumentation_end();
                        rcu_irq_exit();
                        lockdep_hardirqs_on(CALLER_ADDR0);
@@ -439,6 +439,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
                trace_hardirqs_on();
                instrumentation_end();
        } else {
+               instrumentation_begin();
+               pkrs_restore_irq(regs);
+               instrumentation_end();
+
                /*
                 * IRQ flags state is correct already. Just tell RCU if it
                 * was not watching on entry.
@@ -470,8 +474,8 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
-       pkrs_restore_irq(regs);
        instrumentation_begin();
+       pkrs_restore_irq(regs);
        ftrace_nmi_exit();
        if (irq_state.lockdep) {
                trace_hardirqs_on_prepare();


Thank you again for the review,
Ira


[0] https://lore.kernel.org/lkml/20210804043231.2655537-5-ira.weiny@intel.com/
[1] https://lore.kernel.org/lkml/20201217131924.GW3040@hirez.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20201216013202.GY1563847@iweiny-DESK2.sc.intel.com/
[3] https://lore.kernel.org/lkml/87y2hwqwng.fsf@nanos.tec.linutronix.de/

> Thanks,
> 
>         tglx
Ira Weiny Dec. 7, 2021, 4:45 a.m. UTC | #6
On Mon, Dec 06, 2021 at 05:54:23PM -0800, 'Ira Weiny' wrote:

[snip]

> > 
> > Though, if you look at the xen_pv_evtchn_do_upcall() part where you
> > added this extra invocation you might figure out that adding
> > pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitely to
> > the 'else' path in irqentry_exit() makes it magically consistent for
> > both use cases.
> > 
> 
> Thank you, yes good catch.  However, I think I need at least 1 more call in the
> !regs_irqs_disabled() && state.exit_rcu case right?
> 
> 11:29:48 > git di
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 717091910ebc..667676ebc129 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -363,8 +363,6 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>  
>         inhcall = get_and_clear_inhcall();
>         if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
> -               /* Normally called by irqentry_exit, restore pkrs here */
> -               pkrs_restore_irq(regs);
>                 irqentry_exit_cond_resched();
>                 instrumentation_end();
>                 restore_inhcall(inhcall);
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 1c0a70a17e93..60ae3d4c9cc0 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -385,6 +385,7 @@ noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
>  
>  void irqentry_exit_cond_resched(void)

Oops...  Of course regs will need to be passed in here now...

Ira

>  {
> +       pkrs_restore_irq(regs);
>         if (!preempt_count()) {
>                 /* Sanity check RCU and thread stack */
>                 rcu_irq_exit_check_preempt();
> @@ -408,8 +409,6 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                 return;
>         }
>  
> -       pkrs_restore_irq(regs);
> -
>         if (!regs_irqs_disabled(regs)) {
>                 /*
>                  * If RCU was not watching on entry this needs to be done
> @@ -421,6 +420,7 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                         /* Tell the tracer that IRET will enable interrupts */
>                         trace_hardirqs_on_prepare();
>                         lockdep_hardirqs_on_prepare(CALLER_ADDR0);
> +                       pkrs_restore_irq(regs);
>                         instrumentation_end();
>                         rcu_irq_exit();
>                         lockdep_hardirqs_on(CALLER_ADDR0);
> @@ -439,6 +439,10 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
>                 trace_hardirqs_on();
>                 instrumentation_end();
>         } else {
> +               instrumentation_begin();
> +               pkrs_restore_irq(regs);
> +               instrumentation_end();
> +
>                 /*
>                  * IRQ flags state is correct already. Just tell RCU if it
>                  * was not watching on entry.
> @@ -470,8 +474,8 @@ irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
>  
>  void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  {
> -       pkrs_restore_irq(regs);
>         instrumentation_begin();
> +       pkrs_restore_irq(regs);
>         ftrace_nmi_exit();
>         if (irq_state.lockdep) {
>                 trace_hardirqs_on_prepare();
> 
> 
> Thank you again for the review,
> Ira
> 
> 
> [0] https://lore.kernel.org/lkml/20210804043231.2655537-5-ira.weiny@intel.com/
> [1] https://lore.kernel.org/lkml/20201217131924.GW3040@hirez.programming.kicks-ass.net/
> [2] https://lore.kernel.org/lkml/20201216013202.GY1563847@iweiny-DESK2.sc.intel.com/
> [3] https://lore.kernel.org/lkml/87y2hwqwng.fsf@nanos.tec.linutronix.de/
> 
> > Thanks,
> > 
> >         tglx
Thomas Gleixner Dec. 8, 2021, 12:21 a.m. UTC | #7
Ira,

On Mon, Dec 06 2021 at 17:54, Ira Weiny wrote:
> On Thu, Nov 25, 2021 at 03:12:47PM +0100, Thomas Gleixner wrote:
>> > +.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +	/* add space for extended_pt_regs */
>> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
>> > +#endif
>> > +	.if \annotate_retpoline_safe == 1
>> > +		ANNOTATE_RETPOLINE_SAFE
>> > +	.endif
>> 
>> This annotation is new and nowhere mentioned why it is part of this
>> patch.
>
> I don't understand.  ANNOTATE_RETPOLINE_SAFE has been around since:
>
> 9e0e3c5130e9 x86/speculation, objtool: Annotate indirect calls/jumps
> for objtool

Sorry, I misread that macro maze. It's conditional obviously.

> I can split it if you prefer.  How about a patch with just the x86 extended
> pt_regs stuff but that would leave a zero size for the extended stuff?  Then
> followed by the pks bits?

Whatever makes sense and does one thing per patch.

>> I really have to ask the question whether this #ifdeffery has any value
>> at all. 8 bytes extra stack usage is not going to be the end of the
>> world and distro kernels will enable that config anyway.
>
> My goal with this has always been 0 overhead if turned off.  So this seemed
> like a logical addition.  Furthermore, ARCH_ENABLE_SUPERVISOR_PKEYS is
> predicated on ARCH_HAS_SUPERVISOR_PKEYS which is only available with x86_64.
> This removes the space for x86 when not needed.

The question is not about 64 vs. 32bit. The question is whether the
conditional makes sense for 64bit in the first place. Whether this
matters for 32bit has to be determined. It makes some sense, but less
#ifdeffery and less obfuscation makes sense too.

>> If we really want to save the space then certainly not by sprinkling
>> CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS all over the place and hiding the
>> extra sized ptregs in the pkrs header.
>> 
>> You are changing generic architecture code so you better think about
>> making such a change generic and extensible.
>
> I agree.  And I tried to do so.  The generic entry code is modified only by the
> addition of pkrs_[save|restore]_irq().  These are only defined if the arch
> defines ARCH_HAS_SUPERVISOR_PKEYS and furthermore, if something triggers
> enabling ARCH_ENABLE_SUPERVISOR_PKEYS.

I'm talking about generic _architecture_ code, i.e. the code in
arch/x86/ which affects all vendors and systems.

> ARCH_HAS_SUPERVISOR_PKEYS is restricted to x86_64 at the moment.  All other
> arch's including x86 should not see any changes in the generic code.

That was not the question and I'm well aware of that.

>> If the next feature comes around which needs to save something in that
>> extended area then we are going to change the world again, right?
>
> I'm not sure what you mean by 'change the world'.  I would anticipate the entry
> code to be modified with something similar to pks_[save|restore]_irq() and let
> the arch deal with the specifics.

If on x86 the next x86-specific feature comes around which needs extra
reg space, then someone has to change the world in arch/x86 again and replace
all the ARCH_ENABLE_SUPERVISOR_PKEYS #ifdefs with something else, right?

Instead of just adding a new field to pt_regs_aux and being done with it.
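
As a purely hypothetical illustration (the second config symbol and its field
are invented here), a later feature would only have to add its field; the
PT_REGS_AUX_SIZE define in asm-offsets adjusts automatically:

struct pt_regs_aux {
#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
	u32	pkrs;
#endif
#ifdef CONFIG_X86_SOME_FUTURE_FEATURE	/* made up for illustration */
	u64	future_state;
#endif
};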

> Also in [1] I thought Peter and Andy agreed that placing additional generic
> state in the extended pt_regs was not needed and does not buy us anything.  I
> specifically asked if that was something we wanted to do in.[2]

This was about a generic representation which affects the common entry
code in kernel/entry/... Can you spot the difference?

What I suggested is _solely_ x86 specific and does not trickle into
anything outside of arch/x86.

>> See? No magic hardcoded constant, no build time error checking for that
>> constant. Nothing, it just works.
>
> Yes agreed definitely an improvement.

It's the right thing to do.

>> That's one part, but let me come back to this:
>> 
>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +	/* add space for extended_pt_regs */
>> > +	subq    $EXTENDED_PT_REGS_SIZE, %rsp
>> 
>> What guarantees that RSP points to pt_regs at this point?  Nothing at
>> all. It's just pure luck and a question of time until this explodes in
>> hard to diagnose ways.
>
> It took me a bit to wrap my head around what I think you mean.  My initial
> response was that rsp should be the stack pointer for __call_ext_ptregs() just
> like it was for call.  But I think I see that it is better to open code this
> since others may want to play the same trick without using this code and
> therefore we may not be getting the extended pt_regs structure on the stack
> like we think.  For example if someone did...
>
> 	movq	%rsp, %rdi
> 	RSP_ADD_OTHER_STACK_STUFF
> 	__call_ext_ptregs	...
> 	RSP_REMOVE_OTHER_STACK_STUFF
>
> ... it would be broken.
>
> My assumption was that would be illegal after this patch.  But indeed there is
> no way to easily see that in the future.

There is no law which forbids putting code there. Aside from that, software
development is strictly not based on assumptions by definition.

>> Because between
>> 
>>         movq	%rsp, %rdi
>> and
>>         call    ....
>> 
>> can legitimately be other code which causes the stack pointer to
>> change. It's not the case today, but nothing prevents this in the
>> future.
>> 
>> The correct thing to do is:
>> 
>>         movq	%rsp, %rdi
>>         RSP_MAKE_PT_REGS_AUX_SPACE
>>         call	...
>>         RSP_REMOVE_PT_REGS_AUX_SPACE
>> 
>> The few extra macro lines in the actual code are way better as they make
>> it completely obvious what's going on and any misuse can be spotted
>> easily.
>
> Sure FWIW this is what I had originally but thought it would be cleaner to wrap
> the 'call'.  I will convert it back.  Also this removes the
> annotate_retpoline_safe stuff above.

It makes the whole ifdeffery more palatable. Hiding everything in
hideous macros is not an improvement at all.

>> > +#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
>> > +/*
>> > + * PKRS is a per-logical-processor MSR which overlays additional protection for
>> > + * pages which have been mapped with a protection key.
>> > + *
>> > + * Context switches save the MSR in the task struct thus taking that value to
>> > + * other processors if necessary.
>> > + *
>> > + * To protect against exceptions having access to this memory save the current
>> > + * thread value and set the PKRS value to be used during the exception.
>> > + */
>> > +void pkrs_save_irq(struct pt_regs *regs)
>> 
>> That's a misnomer as this is invoked for _any_ exception, not just
>> interrupts.
>
> I'm confused by the naming in kernel/entry/common.c then.  I'm more than
> willing to change the name.  But I only see irq* for almost everything in that
> file.  And I was trying to follow that convention.

Do you see anything named irq* in the NMI parts?

>> >  #ifdef CONFIG_XEN_PV
>> >  #ifndef CONFIG_PREEMPTION
>> >  /*
>> > @@ -309,6 +361,8 @@ __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
>> >  
>> >  	inhcall = get_and_clear_inhcall();
>> >  	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
>> > +		/* Normally called by irqentry_exit, restore pkrs here */
>> > +		pkrs_restore_irq(regs);
>> > 		irqentry_exit_cond_resched();
>> 
>> Sigh. Consistency is overrated....
>
> I'm not that familiar with the xen code so perhaps I missed something?

Yes, taste. And that has nothing to do with being familiar with XEN code.

>> > +		/* Normally called by irqentry_exit, restore pkrs here */
>> > +		pkrs_restore_irq(regs);
>> > 		irqentry_exit_cond_resched();

Your comment says: Normally called by irqentry_exit

Writing this comment alone should have made you look into the invoked
function below:

 		irqentry_exit_cond_resched();

And because the generic entry code is pretty consistent about naming
conventions, it's not a surprise that you can find an invocation of
irqentry_exit_cond_resched() in irqentry_exit(), right?

So instead of writing 'Normally', which is completely confusing, you
could have done a proper analysis and figured out what I suggested:

>> Though, if you look at the xen_pv_evtchn_do_upcall() part where you
>> added this extra invocation you might figure out that adding
>> pkrs_restore_irq() to irqentry_exit_cond_resched() and explicitly to
>> the 'else' path in irqentry_exit() makes it magically consistent for
>> both use cases.

But because your preference seems to be to have random naming conventions,
i.e. choosing the prefix of the day, I'm not that surprised.

> Thank you, yes, good catch.  However, I think I need at least 1 more
> call in the !regs_irqs_disabled() && state.exit_rcu case, right?

I take this as a rhetorical question.
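
(Schematically, and assuming irqentry_exit_cond_resched() can be handed
the regs pointer, which it currently is not, the consolidation would
look roughly like the sketch below; this is not code from the series:)

In kernel/entry/common.c:

-void irqentry_exit_cond_resched(void)
+void irqentry_exit_cond_resched(struct pt_regs *regs)
 {
+	/* covers irqentry_exit() and the direct XEN PV upcall path */
+	pkrs_restore_irq(regs);
 	if (!preempt_count()) {

and in the 'else' path of irqentry_exit():

 	} else {
+		/* IRQs stayed disabled on entry: restore PKRS here */
+		pkrs_restore_irq(regs);
 		if (state.exit_rcu)
 			rcu_irq_exit();
 	}

The early return taken when !regs_irqs_disabled() with state.exit_rcu
set would need the same treatment, which is the extra call Ira asks
about above.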

>> > +done:
>> > +	pkrs_save_irq(regs);
>> 
>> This still calls out into instrumentable code. I explained to you before
>> why this is wrong. Also objtool emits warnings to that effect if you do a
>> proper verified build.
>
> I was not sure what a 'proper verified build' was, and objtool was not
> throwing any warnings for me even when I ran it directly.
...
> After asking around and digging quite a bit, I found CONFIG_DEBUG_ENTRY,
> which enabled the check and the error.

May I ask the obvious question why you did not ask around when I told
you the same thing several months ago?

Seriously, you are changing code which has very obviously placed
'noinstr' annotations on functions and a boatload of very obvious
instrumentation_begin()/end() pairs in it, along with a gazillion
comments, and you just go and add your stuff as you see fit?

Even _after_ I told you that this is wrong?

The word "engineering" has a meaning. Engineering is based on math and
science. Both mandate structured and diligent problem analysis.

Can you pretty please explain to me how ignoring these annotations in the
first place and then ignoring the related review comments is related to
that?

> [But only during a build and not with the above command???  Shouldn't
> the above command work too?]

Did you even try to figure out what CONFIG_DEBUG_ENTRY does?

# git grep -l CONFIG_DEBUG_ENTRY

and looking at the resulting output which has one very obvious named
file in it:

     include/linux/instrumentation.h

might have told you the answer. Also

# git log ...
# git blame ...
# your_favourite_browser https://duckduckgo.com/?q=objtool+noinstr
# your_favourite_browser https://duckduckgo.com/?q=objtool+CONFIG_DEBUG_ENTRY

aside from asking colleagues or replying to my previous review comment
with a sensible question, would have solved that, right?

Asking does not make you look stupid. Not asking and making uninformed
assumptions will make you look stupid for sure.

But asking just to avoid homework is not in the book either. The
community does not have the capacity to deal with that.

> Regardless, reading more about noinstr and looking at the code more
> carefully, I realize I _completely_ misunderstood what you meant before
> in [3].  I should have asked for clarification.

Bingo!

> Yes this was originally marked noinstr because it was called from a noinstr
> function.  I see now, or at least I think I see, that you were taking exception
> to my blindly marking pkrs_save_irq() noinstr without a good reason.
>
> When you said 'there is absolutely no reason to have this marked noinstr.'  I
> thought that meant we could simply remove it from noinstr.  But what I think
> you meant is that there is no reason to have it _be_ noinstr _and_ I should
> also call it from the instrumentable sections of the irqentry_*() calls.
>
> So something like this patch on top of this series?  [With an equivalent change
> for pkrs_restore_irq().]
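
(For reference, the direction described above would roughly amount to
the two fragments below in irqentry_enter(); this is a sketch against
the hunks at the end of this patch, not the follow-up patch itself,
which is not shown here:)

In the early-return (exit_rcu) leg:

 		instrumentation_begin();
 		trace_hardirqs_off_finish();
+		pkrs_save_irq(regs);	/* instrumentation is enabled here */
 		instrumentation_end();

 		ret.exit_rcu = true;
-		goto done;
+		return ret;

and in the normal tail:

 	instrumentation_begin();
 	trace_hardirqs_off_finish();
+	pkrs_save_irq(regs);		/* instrumentation is enabled here */
 	instrumentation_end();

-done:
-	pkrs_save_irq(regs);
 	return ret;

The exit side would mirror this for pkrs_restore_irq().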

No comment. Why?

   1) This is not a basic engineering course

   2) Because I refuse to comment on hastily cobbled together crap which still
      gets it wrong. Hint:

      I did not even look at the result of this patch applied.  I just
      did a mental inventory based on the patch hunks you provided.
      They simply do not add up.

      Don't dare to ask me why.

Thanks,

        tglx

Patch

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index a4c061fb7c6e..a2f94677c3fd 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -63,6 +63,32 @@  For 32-bit we have the following conventions - kernel is built with
  * for assembly code:
  */
 
+/*
+ * __call_ext_ptregs - Helper macro to call into C with extended pt_regs
+ * @cfunc:		C function to be called
+ *
+ * This will ensure that extended_ptregs is added and removed as needed during
+ * a call into C code.
+ */
+.macro __call_ext_ptregs cfunc annotate_retpoline_safe:req
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* add space for extended_pt_regs */
+	subq    $EXTENDED_PT_REGS_SIZE, %rsp
+#endif
+	.if \annotate_retpoline_safe == 1
+		ANNOTATE_RETPOLINE_SAFE
+	.endif
+	call	\cfunc
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+	/* remove space for extended_pt_regs */
+	addq    $EXTENDED_PT_REGS_SIZE, %rsp
+#endif
+.endm
+
+.macro call_ext_ptregs cfunc
+	__call_ext_ptregs \cfunc, annotate_retpoline_safe=0
+.endm
+
 .macro PUSH_REGS rdx=%rdx rax=%rax save_ret=0
 	.if \save_ret
 	pushq	%rsi		/* pt_regs->si */
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..a0d1d5519dba 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@ 
 #include <linux/nospec.h>
 #include <linux/syscalls.h>
 #include <linux/uaccess.h>
+#include <linux/pkeys.h>
 
 #ifdef CONFIG_XEN_PV
 #include <xen/xen-ops.h>
@@ -34,6 +35,7 @@ 
 #include <asm/io_bitmap.h>
 #include <asm/syscall.h>
 #include <asm/irq_stack.h>
+#include <asm/pks.h>
 
 #ifdef CONFIG_X86_64
 
@@ -252,6 +254,56 @@  SYSCALL_DEFINE0(ni_syscall)
 	return -ENOSYS;
 }
 
+#ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
+
+void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code)
+{
+	struct extended_pt_regs *ept_regs = extended_pt_regs(regs);
+
+	if (cpu_feature_enabled(X86_FEATURE_PKS) && (error_code & X86_PF_PK))
+		pr_alert("PKRS: 0x%x\n", ept_regs->thread_pkrs);
+}
+
+/*
+ * PKRS is a per-logical-processor MSR which overlays additional protection for
+ * pages which have been mapped with a protection key.
+ *
+ * Context switches save the MSR in the task struct thus taking that value to
+ * other processors if necessary.
+ *
+ * To protect against exceptions having access to this memory save the current
+ * thread value and set the PKRS value to be used during the exception.
+ */
+void pkrs_save_irq(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs;
+
+	BUILD_BUG_ON(sizeof(struct extended_pt_regs)
+			!= EXTENDED_PT_REGS_SIZE
+				+ sizeof(struct pt_regs));
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	ept_regs = extended_pt_regs(regs);
+	ept_regs->thread_pkrs = current->thread.saved_pkrs;
+	write_pkrs(pkrs_init_value);
+}
+
+void pkrs_restore_irq(struct pt_regs *regs)
+{
+	struct extended_pt_regs *ept_regs;
+
+	if (!cpu_feature_enabled(X86_FEATURE_PKS))
+		return;
+
+	ept_regs = extended_pt_regs(regs);
+	write_pkrs(ept_regs->thread_pkrs);
+	current->thread.saved_pkrs = ept_regs->thread_pkrs;
+}
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
 #ifdef CONFIG_XEN_PV
 #ifndef CONFIG_PREEMPTION
 /*
@@ -309,6 +361,8 @@  __visible noinstr void xen_pv_evtchn_do_upcall(struct pt_regs *regs)
 
 	inhcall = get_and_clear_inhcall();
 	if (inhcall && !WARN_ON_ONCE(state.exit_rcu)) {
+		/* Normally called by irqentry_exit, restore pkrs here */
+		pkrs_restore_irq(regs);
 		irqentry_exit_cond_resched();
 		instrumentation_end();
 		restore_inhcall(inhcall);
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index e38a4cf795d9..1c390975a3de 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -332,7 +332,7 @@  SYM_CODE_END(ret_from_fork)
 		movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
 	.endif
 
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	error_return
 .endm
@@ -435,7 +435,7 @@  SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	paranoid_exit
 
@@ -496,7 +496,7 @@  SYM_CODE_START(\asmsym)
 	 * stack.
 	 */
 	movq	%rsp, %rdi		/* pt_regs pointer */
-	call	vc_switch_off_ist
+	call_ext_ptregs vc_switch_off_ist
 	movq	%rax, %rsp		/* Switch to new stack */
 
 	UNWIND_HINT_REGS
@@ -507,7 +507,7 @@  SYM_CODE_START(\asmsym)
 
 	movq	%rsp, %rdi		/* pt_regs pointer */
 
-	call	kernel_\cfunc
+	call_ext_ptregs kernel_\cfunc
 
 	/*
 	 * No need to switch back to the IST stack. The current stack is either
@@ -542,7 +542,7 @@  SYM_CODE_START(\asmsym)
 	movq	%rsp, %rdi		/* pt_regs pointer into first argument */
 	movq	ORIG_RAX(%rsp), %rsi	/* get error code into 2nd argument*/
 	movq	$-1, ORIG_RAX(%rsp)	/* no syscall to restart */
-	call	\cfunc
+	call_ext_ptregs \cfunc
 
 	jmp	paranoid_exit
 
@@ -781,7 +781,7 @@  SYM_CODE_START_LOCAL(exc_xen_hypervisor_callback)
 	movq	%rdi, %rsp			/* we don't return, adjust the stack frame */
 	UNWIND_HINT_REGS
 
-	call	xen_pv_evtchn_do_upcall
+	call_ext_ptregs xen_pv_evtchn_do_upcall
 
 	jmp	error_return
 SYM_CODE_END(exc_xen_hypervisor_callback)
@@ -987,7 +987,7 @@  SYM_CODE_START_LOCAL(error_entry)
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
 	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
-	call	sync_regs
+	call_ext_ptregs sync_regs
 	movq	%rax, %rsp			/* switch stack */
 	ENCODE_FRAME_POINTER
 	pushq	%r12
@@ -1042,7 +1042,7 @@  SYM_CODE_START_LOCAL(error_entry)
 	 * as if we faulted immediately after IRET.
 	 */
 	mov	%rsp, %rdi
-	call	fixup_bad_iret
+	call_ext_ptregs fixup_bad_iret
 	mov	%rax, %rsp
 	jmp	.Lerror_entry_from_usermode_after_swapgs
 SYM_CODE_END(error_entry)
@@ -1148,7 +1148,7 @@  SYM_CODE_START(asm_exc_nmi)
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	exc_nmi
+	call_ext_ptregs exc_nmi
 
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
@@ -1184,6 +1184,8 @@  SYM_CODE_START(asm_exc_nmi)
 	 * +---------------------------------------------------------+
 	 * | pt_regs                                                 |
 	 * +---------------------------------------------------------+
+	 * | (Optionally) extended_pt_regs                           |
+	 * +---------------------------------------------------------+
 	 *
 	 * The "original" frame is used by hardware.  Before re-enabling
 	 * NMIs, we need to be done with it, and we need to leave enough
@@ -1360,7 +1362,7 @@  end_repeat_nmi:
 
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
-	call	exc_nmi
+	call_ext_ptregs exc_nmi
 
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 0051cf5c792d..53254d29d5c7 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -136,7 +136,7 @@  SYM_INNER_LABEL(entry_SYSENTER_compat_after_hwframe, SYM_L_GLOBAL)
 .Lsysenter_flags_fixed:
 
 	movq	%rsp, %rdi
-	call	do_SYSENTER_32
+	call_ext_ptregs do_SYSENTER_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -253,7 +253,7 @@  SYM_INNER_LABEL(entry_SYSCALL_compat_after_hwframe, SYM_L_GLOBAL)
 	UNWIND_HINT_REGS
 
 	movq	%rsp, %rdi
-	call	do_fast_syscall_32
+	call_ext_ptregs do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
 	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
 		    "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
@@ -410,6 +410,6 @@  SYM_CODE_START(entry_INT80_compat)
 	cld
 
 	movq	%rsp, %rdi
-	call	do_int80_syscall_32
+	call_ext_ptregs do_int80_syscall_32
 	jmp	swapgs_restore_regs_and_return_to_usermode
 SYM_CODE_END(entry_INT80_compat)
diff --git a/arch/x86/include/asm/pks.h b/arch/x86/include/asm/pks.h
index e7727086cec2..76960ec71b4b 100644
--- a/arch/x86/include/asm/pks.h
+++ b/arch/x86/include/asm/pks.h
@@ -4,15 +4,33 @@ 
 
 #ifdef CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS
 
+struct extended_pt_regs {
+	u32 thread_pkrs;
+	/* Keep stack 8 byte aligned */
+	u32 pad;
+	struct pt_regs pt_regs;
+};
+
 void setup_pks(void);
 void pkrs_write_current(void);
 void pks_init_task(struct task_struct *task);
+void write_pkrs(u32 new_pkrs);
+
+static inline struct extended_pt_regs *extended_pt_regs(struct pt_regs *regs)
+{
+	return container_of(regs, struct extended_pt_regs, pt_regs);
+}
+
+void show_extended_regs_oops(struct pt_regs *regs, unsigned long error_code);
 
 #else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 static inline void setup_pks(void) { }
 static inline void pkrs_write_current(void) { }
 static inline void pks_init_task(struct task_struct *task) { }
+static inline void write_pkrs(u32 new_pkrs) { }
+static inline void show_extended_regs_oops(struct pt_regs *regs,
+					   unsigned long error_code) { }
 
 #endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h
index 02c2cbda4a74..4a41fc4cf028 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -53,4 +53,6 @@ 
 # define X86_CR3_PTI_PCID_USER_BIT	11
 #endif
 
+#define EXTENDED_PT_REGS_SIZE 8
+
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d8b3ebd2bb85..90e76178b6b4 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -319,8 +319,7 @@  SYM_CODE_START_NOALIGN(vc_boot_ghcb)
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
 	movq	initial_vc_handler(%rip), %rax
-	ANNOTATE_RETPOLINE_SAFE
-	call	*%rax
+	__call_ext_ptregs *%rax, annotate_retpoline_safe=1
 
 	/* Unwind pt_regs */
 	POP_REGS
@@ -397,7 +396,7 @@  SYM_CODE_START_LOCAL(early_idt_handler_common)
 	UNWIND_HINT_REGS
 
 	movq %rsp,%rdi		/* RDI = pt_regs; RSI is already trapnr */
-	call do_early_exception
+	call_ext_ptregs do_early_exception
 
 	decl early_recursion_flag(%rip)
 	jmp restore_regs_and_return_to_kernel
@@ -421,7 +420,7 @@  SYM_CODE_START_NOALIGN(vc_no_ghcb)
 	/* Call C handler */
 	movq    %rsp, %rdi
 	movq	ORIG_RAX(%rsp), %rsi
-	call    do_vc_no_ghcb
+	call_ext_ptregs do_vc_no_ghcb
 
 	/* Unwind pt_regs */
 	POP_REGS
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e133c0ed72a0..a4ce7cef0260 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -32,6 +32,7 @@ 
 #include <asm/pgtable_areas.h>		/* VMALLOC_START, ...		*/
 #include <asm/kvm_para.h>		/* kvm_handle_async_pf		*/
 #include <asm/vdso.h>			/* fixup_vdso_exception()	*/
+#include <asm/pks.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/exceptions.h>
@@ -547,6 +548,8 @@  show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 		 (error_code & X86_PF_PK)    ? "protection keys violation" :
 					       "permissions violation");
 
+	show_extended_regs_oops(regs, error_code);
+
 	if (!(error_code & X86_PF_USER) && user_mode(regs)) {
 		struct desc_ptr idt, gdt;
 		u16 ldtr, tr;
diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h
index 580238388f0c..76eb19a37942 100644
--- a/include/linux/pkeys.h
+++ b/include/linux/pkeys.h
@@ -52,6 +52,15 @@  enum pks_pkey_consumers {
 	PKS_KEY_NR_CONSUMERS
 };
 extern u32 pkrs_init_value;
-#endif
+
+void pkrs_save_irq(struct pt_regs *regs);
+void pkrs_restore_irq(struct pt_regs *regs);
+
+#else /* !CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
+
+static inline void pkrs_save_irq(struct pt_regs *regs) { }
+static inline void pkrs_restore_irq(struct pt_regs *regs) { }
+
+#endif /* CONFIG_ARCH_ENABLE_SUPERVISOR_PKEYS */
 
 #endif /* _LINUX_PKEYS_H */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bf16395b9e13..aa0b1e8dd742 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,7 @@ 
 #include <linux/livepatch.h>
 #include <linux/audit.h>
 #include <linux/tick.h>
+#include <linux/pkeys.h>
 
 #include "common.h"
 
@@ -364,7 +365,7 @@  noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 		instrumentation_end();
 
 		ret.exit_rcu = true;
-		return ret;
+		goto done;
 	}
 
 	/*
@@ -379,6 +380,8 @@  noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 	trace_hardirqs_off_finish();
 	instrumentation_end();
 
+done:
+	pkrs_save_irq(regs);
 	return ret;
 }
 
@@ -404,7 +407,12 @@  noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state)
 	/* Check whether this returns to user mode */
 	if (user_mode(regs)) {
 		irqentry_exit_to_user_mode(regs);
-	} else if (!regs_irqs_disabled(regs)) {
+		return;
+	}
+
+	pkrs_restore_irq(regs);
+
+	if (!regs_irqs_disabled(regs)) {
 		/*
 		 * If RCU was not watching on entry this needs to be done
 		 * carefully and needs the same ordering of lockdep/tracing
@@ -458,11 +466,13 @@  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
 	ftrace_nmi_enter();
 	instrumentation_end();
 
+	pkrs_save_irq(regs);
 	return irq_state;
 }
 
 void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
+	pkrs_restore_irq(regs);
 	instrumentation_begin();
 	ftrace_nmi_exit();
 	if (irq_state.lockdep) {