
[14/22] x86/fpu: Eager switch PKRU state

Message ID 20190221115020.12385-15-bigeasy@linutronix.de (mailing list archive)
State New, archived
Series [01/22] x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig()

Commit Message

Sebastian Andrzej Siewior Feb. 21, 2019, 11:50 a.m. UTC
From: Rik van Riel <riel@surriel.com>

While most of a task's FPU state is only needed in user space, the
protection keys need to be in place immediately after a context switch.

The reason is that any access to userspace memory while running in
kernel mode also needs to abide by the memory permissions specified in
the protection keys.

The "eager switch" is a preparation for loading the FPU state on return
to userland. Instead of decoupling PKRU state from xstate, I update
PKRU within xstate on write operations by the kernel.

The read/write_pkru() helpers are moved to another header file so they
can easily be accessed from pgtable.h and fpu/internal.h.

For user tasks we should always get the PKRU value from the xsave area;
it should not change anything because the PKRU value was loaded as part
of the FPU restore.
For kernel threads we now write the default "allow everything" value.
Before this commit a kernel thread would end up with a random value
which it inherited from the previous user task.

Signed-off-by: Rik van Riel <riel@surriel.com>
[bigeasy: save pkru to xstate, no cache, don't use __raw_xsave_addr()]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 arch/x86/include/asm/fpu/internal.h | 20 ++++++++++++++++++--
 arch/x86/include/asm/fpu/xstate.h   |  1 +
 2 files changed, 19 insertions(+), 2 deletions(-)

Comments

Dave Hansen Feb. 25, 2019, 6:16 p.m. UTC | #1
On 2/21/19 3:50 AM, Sebastian Andrzej Siewior wrote:
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 67e4805bccb6f..05f6fce62e9f1 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -562,8 +562,24 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>   */
>  static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
>  {
> -	if (static_cpu_has(X86_FEATURE_FPU))
> -		__fpregs_load_activate(new_fpu, cpu);
> +	struct pkru_state *pk;
> +	u32 pkru_val = 0;
> +
> +	if (!static_cpu_has(X86_FEATURE_FPU))
> +		return;
> +
> +	__fpregs_load_activate(new_fpu, cpu);

This is still a bit light on comments.

Maybe:
	/* PKRU state is switched eagerly because... */

> +	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> +		return;
> +
> +	if (current->mm) {
> +		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> +		WARN_ON_ONCE(!pk);

This can trip us up if the 'init optimization' is in play because
get_xsave_addr() checks xsave->header.xfeatures.  That's unlikely today
because we usually set PKRU to a restrictive value.  But, it's also not
*guaranteed*.

Userspace could easily do an XRSTOR that puts PKRU back in its init
state if it wanted to, then this would end up with pk==NULL.

We might actually want a selftest that *does* that.  I don't think we
have one.

> +		if (pk)
> +			pkru_val = pk->pkru;
> +	}
> +	__write_pkru(pkru_val);
>  }

A comment above __write_pkru() would be nice to say that it only
actually does the slow instruction on changes to the value.
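
A minimal sketch of what that check inside __write_pkru() might look
like (hedged: the rdpkru()/wrpkru() helper names are assumptions, not
necessarily the exact code in this series):

/*
 * Sketch only: skip the expensive WRPKRU instruction when the
 * register already holds the requested value.  rdpkru()/wrpkru()
 * stand in for the usual inline-asm wrappers around RDPKRU/WRPKRU.
 */
static inline void __write_pkru(u32 pkru)
{
	/* WRPKRU is only worth executing when the value changes. */
	if (pkru == rdpkru())
		return;
	wrpkru(pkru);
}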

BTW, this has the implicit behavior of always trying to do a
__write_pkru(0) on switches to kernel threads.  That seems a bit weird
and it is likely to impose WRPKRU overhead on switches between user and
kernel threads.

The 0 value is also the most permissive, which is not great considering
that user mm's can be active in the page tables when running kernel
threads if we're being lazy.

Seems like we should either leave PKRU alone or have 'init_pkru_value'
be the default.  That gives good security properties and is likely to
match the application value, removing the WRPKRU overhead.
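
A sketch of that suggested shape (hedged: this assumes init_pkru_value
is visible at this point; it is the reviewer's proposal, not the patch
as posted):

	/* Default to the restrictive init value rather than 0. */
	u32 pkru_val = init_pkru_value;

	if (current->mm) {
		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
		if (pk)
			pkru_val = pk->pkru;
	}
	__write_pkru(pkru_val);
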
Sebastian Andrzej Siewior March 8, 2019, 6:08 p.m. UTC | #2
On 2019-02-25 10:16:24 [-0800], Dave Hansen wrote:
> On 2/21/19 3:50 AM, Sebastian Andrzej Siewior wrote:
> > diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> > index 67e4805bccb6f..05f6fce62e9f1 100644
> > --- a/arch/x86/include/asm/fpu/internal.h
> > +++ b/arch/x86/include/asm/fpu/internal.h
> > @@ -562,8 +562,24 @@ switch_fpu_prepare(struct fpu *old_fpu, int cpu)
> >   */
> >  static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
> >  {
> > -	if (static_cpu_has(X86_FEATURE_FPU))
> > -		__fpregs_load_activate(new_fpu, cpu);
> > +	struct pkru_state *pk;
> > +	u32 pkru_val = 0;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_FPU))
> > +		return;
> > +
> > +	__fpregs_load_activate(new_fpu, cpu);
> 
> This is still a bit light on comments.
> 
> Maybe:
> 	/* PKRU state is switched eagerly because... */

okay, will update.

> > +	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> > +		return;
> > +
> > +	if (current->mm) {
> > +		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> > +		WARN_ON_ONCE(!pk);
> 
> This can trip us up if the 'init optimization' is in play because
> get_xsave_addr() checks xsave->header.xfeatures.  That's unlikely today
> because we usually set PKRU to a restrictive value.  But, it's also not
> *guaranteed*.
> 
> Userspace could easily do an XRSTOR that puts PKRU back in its init
> state if it wanted to, then this would end up with pk==NULL.
> 
> We might actually want a selftest that *does* that.  I don't think we
> have one.

So you are saying that the above warning might trigger and be "okay"?
My understanding is that the in-kernel XSAVE will always save everything
so we should never "lose" the XFEATURE_PKRU no matter what user space
does.

So as a test case you want
	xsave (-1 & ~XFEATURE_PKRU)
	xrstor (-1 & ~XFEATURE_PKRU)

in userland and then a context switch to see if the warning above
triggers?

> > +		if (pk)
> > +			pkru_val = pk->pkru;
> > +	}
> > +	__write_pkru(pkru_val);
> >  }
> 
> A comment above __write_pkru() would be nice to say that it only
> actually does the slow instruction on changes to the value.

Could we please not do this? It is a comment above one of the caller
functions and we have two or three of those. And we already have that
comment within __write_pkru().

> BTW, this has the implicit behavior of always trying to do a
> __write_pkru(0) on switches to kernel threads.  That seems a bit weird
> and it is likely to impose WRPKRU overhead on switches between user and
> kernel threads.
> 
> The 0 value is also the most permissive, which is not great considering
> that user mm's can be active in the page tables when running kernel
> threads if we're being lazy.
> 
> Seems like we should either leave PKRU alone or have 'init_pkru_value'
> be the default.  That gives good security properties and is likely to
> match the application value, removing the WRPKRU overhead.

Last time we talked about this we agreed (or this was my impression) that
0 should be written so that a kernel thread is always able to
write to user space in case it borrowed an mm (otherwise it has none
and the write would fail anyway).
We didn't want to leave PKRU alone because the outcome (whether or not
the write by the kernel thread succeeds) should not depend on the last
running task (and be random) but be deterministic.

I am personally open to whichever outcome you decide :) If you want to
use `init_pkru_value' instead of 0 then I can change this. If you want
to skip the possible update for kernel threads then okay, but maybe
this should be documented somehow.

Sebastian
Dave Hansen March 8, 2019, 7:01 p.m. UTC | #3
On 3/8/19 10:08 AM, Sebastian Andrzej Siewior wrote:
> On 2019-02-25 10:16:24 [-0800], Dave Hansen wrote:
>>> +	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
>>> +		return;
>>> +
>>> +	if (current->mm) {
>>> +		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
>>> +		WARN_ON_ONCE(!pk);
>>
>> This can trip us up if the 'init optimization' is in play because
>> get_xsave_addr() checks xsave->header.xfeatures.  That's unlikely today
>> because we usually set PKRU to a restrictive value.  But, it's also not
>> *guaranteed*.
>>
>> Userspace could easily do an XRSTOR that puts PKRU back in its init
>> state if it wanted to, then this would end up with pk==NULL.
>>
>> We might actually want a selftest that *does* that.  I don't think we
>> have one.
> 
> So you are saying that the above warning might trigger and be "okay"?

Nothing will break, but the warning will trigger, which isn't nice.

> My understanding is that the in-kernel XSAVE will always save everything
> so we should never "lose" the XFEATURE_PKRU no matter what user space
> does.
> 
> So as a test case you want
> 	xsave (-1 & ~XFEATURE_PKRU)
> 	xrstor (-1 & ~XFEATURE_PKRU)
> 
> in userland and then a context switch to see if the warning above
> triggers?

I think you need an XRSTOR with RFBM=-1 (or at least with the PKRU bit
set) and the PKRU bit in the XFEATURES field in the XSAVE buffer set to 0.
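
A hypothetical userspace sketch of that sequence (assumptions: the
xstate header sits at offset 512 of the XSAVE area, XFEATURE_PKRU is
bit 9, and the CPU has XSAVE/OSPKE enabled):

#include <stdint.h>
#include <unistd.h>

#define XFEATURE_MASK_PKRU	(1ULL << 9)

/* XSAVE area: 64-byte aligned, xstate header at offset 512 */
static uint8_t xbuf[4096] __attribute__((aligned(64)));

int main(void)
{
	uint64_t *xstate_bv = (uint64_t *)(xbuf + 512);

	/* Save everything the OS has enabled (RFBM = -1). */
	asm volatile("xsave %0" : "+m" (xbuf) : "a" (-1), "d" (-1));

	/* Mark PKRU as being in its init state ... */
	*xstate_bv &= ~XFEATURE_MASK_PKRU;

	/* ... and restore with RFBM = -1 but the PKRU bit clear in
	 * the buffer's XSTATE_BV, putting PKRU back to init. */
	asm volatile("xrstor %0" : : "m" (xbuf), "a" (-1), "d" (-1));

	/* Stay alive so context switches hit the kernel path above. */
	while (1)
		sleep(1);
}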

>>> +		if (pk)
>>> +			pkru_val = pk->pkru;
>>> +	}
>>> +	__write_pkru(pkru_val);
>>>  }
>>
>> A comment above __write_pkru() would be nice to say that it only
>> actually does the slow instruction on changes to the value.
> 
> Could we please not do this? It is a comment above one of the caller
> functions and we have two or three of those. And we already have that
> comment within __write_pkru().

I looked at this code and thought "writing PKRU is slow", and "this
writes PKRU unconditionally", and "the __ version of the function
shouldn't have much logic in it".

I got 2/3 wrong.  To me that means this site needs a 1-line comment.
Feel free to move one of the other comments to here if you think it's
over-commented, but this site needs one.

>> BTW, this has the implicit behavior of always trying to do a
>> __write_pkru(0) on switches to kernel threads.  That seems a bit weird
>> and it is likely to impose WRPKRU overhead on switches between user and
>> kernel threads.
>>
>> The 0 value is also the most permissive, which is not great considering
>> that user mm's can be active in the page tables when running kernel
>> threads if we're being lazy.
>>
>> Seems like we should either leave PKRU alone or have 'init_pkru_value'
>> be the default.  That gives good security properties and is likely to
>> match the application value, removing the WRPKRU overhead.
> 
> Last time we talked about this we agreed (or this was my impression) that
> 0 should be written so that a kernel thread is always able to
> write to user space in case it borrowed an mm (otherwise it has none
> and the write would fail anyway).

We can't write to userspace when borrowing an mm.  If the kernel borrows
an mm, we might as well be on the init_mm which has no userspace mappings.

> We didn't want to leave PKRU alone because the outcome (whether or not
> the write by the kernel thread succeeds) should not depend on the last
> running task (and be random) but be deterministic.

Right, so let's make it deterministically restrictive: either
init_pkru_value, or -1 since kernel threads shouldn't be touching
userspace in the first place.
Sebastian Andrzej Siewior March 11, 2019, 11:06 a.m. UTC | #4
On 2019-03-08 11:01:25 [-0800], Dave Hansen wrote:
> On 3/8/19 10:08 AM, Sebastian Andrzej Siewior wrote:
> > On 2019-02-25 10:16:24 [-0800], Dave Hansen wrote:
> >>> +	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> >>> +		return;
> >>> +
> >>> +	if (current->mm) {
> >>> +		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> >>> +		WARN_ON_ONCE(!pk);
…
> Nothing will break, but the warning will trigger, which isn't nice.

the warning should trigger if something goes south; I was not expecting
it to happen.

> > My understanding is that the in-kernel XSAVE will always save everything
> > so we should never "lose" the XFEATURE_PKRU no matter what user space
> > does.
> > 
> > So as a test case you want
> > 	xsave (-1 & ~XFEATURE_PKRU)
> > 	xrstor (-1 & ~XFEATURE_PKRU)
> > 
> > in userland and then a context switch to see if the warning above
> > triggers?
> 
> I think you need an XRSTOR with RFBM=-1 (or at least with the PKRU bit
> set) and the PKRU bit in the XFEATURES field in the XSAVE buffer set to 0.

let me check that, write a test case in userland, and come back with
the results. I can remove that warning but I wasn't expecting it to
trigger, so let me verify that first.

> >>> +		if (pk)
> >>> +			pkru_val = pk->pkru;
> >>> +	}
> >>> +	__write_pkru(pkru_val);
> >>>  }
> >>
> >> A comment above __write_pkru() would be nice to say that it only
> >> actually does the slow instruction on changes to the value.
> > 
> > Could we please not do this? It is a comment above one of the caller
> > functions and we have two or three of those. And we already have that
> > comment within __write_pkru().
> 
> I looked at this code and thought "writing PKRU is slow", and "this
> writes PKRU unconditionally", and "the __ version of the function
> shouldn't have much logic in it".
> 
> I got 2/3 wrong.  To me that means this site needs a 1-line comment.
> Feel free to move one of the other comments to here if you think it's
> over-commented, but this site needs one.

right, because things changed as part of the patch series.
You wanted __write_pkru() to have the same semantics as __read_pkru(),
which is currently the case because __write_pkru() has the check. It
would be great if we could rename it to something else and avoid the
comment. (Because if this caller gets a comment then the others should,
too, and I think that is overkill.)

> > Last time we talked about this we agreed (or this was my impression) that
> > 0 should be written so that a kernel thread is always able to
> > write to user space in case it borrowed an mm (otherwise it has none
> > and the write would fail anyway).
> 
> We can't write to userspace when borrowing an mm.  If the kernel borrows
> an mm, we might as well be on the init_mm which has no userspace mappings.

If a kernel thread borrows an mm from a user task via use_mm() then it
_can_ write to that task's userland memory.
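
For illustration, a sketch of that borrowing pattern (hedged:
kthread_write_user() is a made-up helper; use_mm()/unuse_mm() are the
interfaces of this era, later renamed kthread_use_mm()/
kthread_unuse_mm()):

#include <linux/mmu_context.h>
#include <linux/printk.h>
#include <linux/uaccess.h>

/* Hypothetical helper: write one u32 into a user task's memory. */
static void kthread_write_user(struct mm_struct *mm,
			       u32 __user *uaddr, u32 val)
{
	use_mm(mm);		/* borrow the user task's mm */
	if (put_user(val, uaddr))
		pr_warn("write to borrowed mm failed\n");
	unuse_mm(mm);		/* and drop it again */
}

Whether the put_user() above succeeds is exactly what the PKRU value
in effect decides, which is why a deterministic value matters here.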

> > We didn't want to leave PKRU alone because the outcome (whether or not
> > the write by the kernel thread succeeds) should not depend on the last
> > running task (and be random) but be deterministic.
> 
> Right, so let's make it deterministically restrictive: either
> init_pkru_value, or -1 since kernel threads shouldn't be touching
> userspace in the first place.

I'm fine either way, just tell me what you want. Just consider the
use_mm() part I wrote above. (I remember you/luto suggested having an
API for something like that so that the PKRU value can be …

Sebastian
Sebastian Andrzej Siewior March 11, 2019, 2:30 p.m. UTC | #5
On 2019-03-11 12:06:05 [+0100], To Dave Hansen wrote:
> On 2019-03-08 11:01:25 [-0800], Dave Hansen wrote:
> > On 3/8/19 10:08 AM, Sebastian Andrzej Siewior wrote:
> > > On 2019-02-25 10:16:24 [-0800], Dave Hansen wrote:
> > >>> +	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
> > >>> +		return;
> > >>> +
> > >>> +	if (current->mm) {
> > >>> +		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
> > >>> +		WARN_ON_ONCE(!pk);
> …
> > Nothing will break, but the warning will trigger, which isn't nice.
> 
> the warning should trigger if something goes south; I was not expecting
> it to happen.
> 
> > > My understanding is that the in-kernel XSAVE will always save everything
> > > so we should never "lose" the XFEATURE_PKRU no matter what user space
> > > does.
> > > 
> > > So as a test case you want
> > > 	xsave (-1 & ~XFEATURE_PKRU)
> > > 	xrstor (-1 & ~XFEATURE_PKRU)
> > > 
> > > in userland and then a context switch to see if the warning above
> > > triggers?
> > 
> > I think you need an XRSTOR with RFBM=-1 (or at least with the PKRU bit
> > set) and the PKRU bit in the XFEATURES field in the XSAVE buffer set to 0.
> 
> let me check that, write a test case in userland, and come back with
> the results. I can remove that warning but I wasn't expecting it to
> trigger, so let me verify that first.

so I made this:
	https://breakpoint.cc/tc-xsave.c

and it doesn't trigger. 
XSAVE saves what is specified and enabled. XRSTOR restores what is
specified, available in the header, and enabled. Which means that even
if userland disables a bit in the header, it is still available during
a context switch by the kernel's XSAVE as long as it is set in
XSTATE_BV. The user can't use XSETBV (it can only query via XGETBV),
which means that XFEATURE_PKRU can't be removed by the user.
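
(As a side note, a sketch of the only query userland has, XGETBV with
ECX=0 returning XCR0; the helper name is made up:)

static inline uint64_t xgetbv0(void)
{
	uint32_t lo, hi;

	/* XGETBV with ECX=0 returns XCR0 in EDX:EAX. */
	asm volatile("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
	return lo | ((uint64_t)hi << 32);
}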

But you got me thinking: during signal delivery we save the task's FPU
content on the signal stack. If the signal handler removes the
XFEATURE_PKRU bit then __fpu__restore_sig() will load the "missing
state" from init_fpstate, which means that the protection key will be
set to zero. I am not sure whether we want this or not, but this is the
case. An XRSTOR in userland without XFEATURE_PKRU would leave the PKRU
state unchanged.

The test case from above does this if you want to double check.
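
For completeness, a hypothetical sketch of that signal-path variant
(not the linked test case; the 512-byte header offset into the saved
FP state is an assumption about the x86-64 signal frame layout):

#include <signal.h>
#include <stdint.h>
#include <ucontext.h>

#define XFEATURE_MASK_PKRU	(1ULL << 9)

static void handler(int sig, siginfo_t *si, void *ctx)
{
	ucontext_t *uc = ctx;
	/* xstate header sits 512 bytes into the saved FP state */
	uint64_t *xstate_bv =
		(uint64_t *)((uint8_t *)uc->uc_mcontext.fpregs + 512);

	/* Drop PKRU from the frame; sigreturn should then load it
	 * from init_fpstate. */
	*xstate_bv &= ~XFEATURE_MASK_PKRU;
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = handler,
				.sa_flags = SA_SIGINFO };

	sigaction(SIGUSR1, &sa, 0);
	raise(SIGUSR1);
	/* If the description above holds, PKRU now has the init value. */
	return 0;
}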

Sebastian
Sebastian Andrzej Siewior March 21, 2019, 5:10 p.m. UTC | #6
On 2019-03-12 17:06:18 [-0700], Dave Hansen wrote:
> Thanks for doing this.  I'll see if I can dig into the test case and
> figure out why it's not doing what I expected.  It might be that the CPU
> _could_ go back to the init state but chooses not to, or that something
> else is taking us out of the init state before we reach the kernel code
> in question.

I'm going to drop that WARN_ON() check. The reason is that we are fully
preemptible during fpstate_init() and a context switch during memset()
and fpstate_init_xstate() triggers this warning.

Sebastian

Patch

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 67e4805bccb6f..05f6fce62e9f1 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -562,8 +562,24 @@  switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  */
 static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
 {
-	if (static_cpu_has(X86_FEATURE_FPU))
-		__fpregs_load_activate(new_fpu, cpu);
+	struct pkru_state *pk;
+	u32 pkru_val = 0;
+
+	if (!static_cpu_has(X86_FEATURE_FPU))
+		return;
+
+	__fpregs_load_activate(new_fpu, cpu);
+
+	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+		return;
+
+	if (current->mm) {
+		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+		WARN_ON_ONCE(!pk);
+		if (pk)
+			pkru_val = pk->pkru;
+	}
+	__write_pkru(pkru_val);
 }
 
 /*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index fbe41f808e5d8..4e18a837223ff 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -5,6 +5,7 @@ 
 #include <linux/types.h>
 #include <asm/processor.h>
 #include <linux/uaccess.h>
+#include <asm/user.h>
 
 /* Bit 63 of XCR0 is reserved for future expansion */
 #define XFEATURE_MASK_EXTEND	(~(XFEATURE_MASK_FPSSE | (1ULL << 63)))