
[0/6] Introduce CET supervisor state support

Message ID 20240531090331.13713-1-weijiang.yang@intel.com (mailing list archive)

Message

Yang, Weijiang May 31, 2024, 9:03 a.m. UTC
This series spins off from the CET KVM virtualization enabling series [1].
The purpose is to get this preparation work resolved ahead of the KVM part
landing. There was a prior discussion about introducing CET supervisor state
support [2] [3].

The CET supervisor state, i.e., IA32_PL{0,1,2}_SSP, consists of xsave-managed
MSRs and can be enabled via IA32_XSS[bit 12]. KVM relies on host-side CET
supervisor state support to fully enable storage of guest CET MSR contents.
The benefits are: 1) No need to manually save/restore the 3 MSRs when the
vCPU fpu context is scheduled in/out. 2) No need to manually swap the three
MSRs at VM-Exit/VM-Entry for guest/host. 3) Guest CET user/supervisor states
are managed in a consistent manner within the host kernel FPU framework.
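
For reference, a minimal sketch of the xstate component this enables (field
names assumed to follow the existing CET user state conventions, not the
literal patch content):

/*
 * Sketch: CET supervisor xstate component, xfeature bit 12.
 * Three 8-byte MSRs => 24 bytes of per-task state.
 */
struct cet_supervisor_state {
	u64 pl0_ssp;	/* IA32_PL0_SSP: ring-0 shadow stack pointer */
	u64 pl1_ssp;	/* IA32_PL1_SSP: ring-1 shadow stack pointer */
	u64 pl2_ssp;	/* IA32_PL2_SSP: ring-2 shadow stack pointer */
};

/* Enabled by setting IA32_XSS[12], i.e. XFEATURE_MASK_CET_KERNEL = BIT_ULL(12). */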

This series tries to:
1) Fix an existing issue with enabling guest supervisor state support.
2) Add CET supervisor state support in the core kernel.
3) Introduce a new FPU config for guest fpstate setup.

With the preparation work landed, the guest fpstate has xstate_bv[12] ==
xcomp_bv[12] == 1, and the CET supervisor state is saved/reloaded when
xsaves/xrstors execute on the guest fpstate.
Non-guest/normal fpstates have xstate_bv[12] == xcomp_bv[12] == 0, so HW
can optimize xsaves/xrstors operations.
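
Roughly, the split the series aims for looks like this (a sketch based on the
patch titles below; the config names and exact masks are assumptions, not the
literal patch content):

#define XFEATURE_MASK_KERNEL_DYNAMIC	XFEATURE_MASK_CET_KERNEL

static void __init fpu_guest_cfg_init_sketch(void)
{
	/*
	 * Normal tasks: leave kernel-dynamic features out of the default
	 * feature set, so RFBM[12] == 0 and XSAVES skips CET-S entirely.
	 */
	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_KERNEL_DYNAMIC;

	/*
	 * Guest fpstate: keep CET-S (bit 12) in the guest defaults so
	 * xstate_bv[12] == xcomp_bv[12] == 1 and the component is
	 * saved/restored along with the rest of guest FPU state.
	 */
	fpu_guest_cfg.max_features     = fpu_kernel_cfg.max_features;
	fpu_guest_cfg.default_features = fpu_guest_cfg.max_features &
					 ~XFEATURE_MASK_USER_DYNAMIC;
}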

Patch1: Preserve guest supervisor xfeatures in __state_perm.
Patch2: Enable CET supervisor xstate support.
Patch3: Introduce kernel dynamic xfeature set.
Patch4: Initialize fpu_guest_cfg settings.
Patch5: Create guest fpstate with fpu_guest_cfg.
Patch6: Check for an invalid fpstate config before executing xsaves.

[1]: https://lore.kernel.org/all/20240219074733.122080-1-weijiang.yang@intel.com/
[2]: https://lore.kernel.org/all/ZM1jV3UPL0AMpVDI@google.com/
[3]: https://lore.kernel.org/all/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/


Sean Christopherson (1):
  x86/fpu/xstate: Always preserve non-user xfeatures/flags in
    __state_perm

Yang Weijiang (5):
  x86/fpu/xstate: Add CET supervisor mode state support
  x86/fpu/xstate: Introduce XFEATURE_MASK_KERNEL_DYNAMIC xfeature set
  x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU configuration
  x86/fpu/xstate: Create guest fpstate with guest specific config
  x86/fpu/xstate: Warn if CET supervisor state is detected in normal
    fpstate

 arch/x86/include/asm/fpu/types.h  | 16 ++++++++--
 arch/x86/include/asm/fpu/xstate.h | 11 ++++---
 arch/x86/kernel/fpu/core.c        | 53 ++++++++++++++++++++++++-------
 arch/x86/kernel/fpu/xstate.c      | 35 +++++++++++++++-----
 arch/x86/kernel/fpu/xstate.h      |  2 ++
 5 files changed, 90 insertions(+), 27 deletions(-)


base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0

Comments

Yang, Weijiang July 9, 2024, 3:17 a.m. UTC | #1
Hi, maintainers,
I recently ran some tests for this series using this benchmark:
https://github.com/antonblanchard/will-it-scale/blob/master/tests/context_switch1.c

The purpose is to measure the overall performance impact on thread/process context
switch when the CET supervisor state bit, i.e., RFBM[12], is fixed at 0 or 1 in xsaves/xrstors.

3 cases were tested:
case 1: stock v6.10-rc1 kernel.
case 2: v6.10-rc1 kernel + this patch series, so that RFBM[12] == 0 for normal threads/processes.
case 3: v6.10-rc1 kernel + this patch series, with patch 3 (Introduce XFEATURE_MASK_KERNEL_DYNAMIC
xfeature set) reverted, so that RFBM[12] == 1 for normal threads/processes.

I ran the command below 10 times in each case:

./context_switch1_processes -s 20 -t 50 -n

I trimmed the results by removing the top and bottom 10% of the data in each case and got the numbers below:

case 1: 16444675.45   (baseline, set as 1)
case 2: 16412285.61   (~99.80%)
case 3: 16405716.19   (~99.76%)

As you can see from the results, case 2 with the optimization based on
XFEATURE_MASK_KERNEL_DYNAMIC performs essentially the same as case 3 (~0.04%
apart); both are ~0.2% below the stock baseline.

So I'm not sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related changes
are worthwhile for this series.

Could you share your thoughts?

Thanks a lot!
Dave Hansen July 11, 2024, 8:58 p.m. UTC | #2
On 7/8/24 20:17, Yang, Weijiang wrote:
> So I'm not sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related changes
> are worthwhile for this series.
> 
> Could you share your thoughts?

First of all, I really do appreciate when folks make the effort to _try_
to draw their own conclusions before asking the maintainers to share
theirs.  Next time, OK? ;)

But here goes.  So we've basically got three cases.  Here's a fancy table:

> https://docs.google.com/spreadsheets/d/e/2PACX-1vROHIgrtHzUJmdlzT7D7tuVzgM8AMlK2XlorvFIJvk-I0NjD7A-T_qntjz7cUJlCScfWGtSfPK30Xtu/pubhtml

... and the same in ASCII

Case | IA32_XSS[12] | Space | RFBM[12] | Drop%
-----+--------------+-------+----------+------
  1  |      0       | None  |    0     |  0.0%
  2  |      1       | None  |    0     |  0.2%
  3  |      1       | 24B?  |    1     |  0.2%

Case 1 is the baseline of course.  Case 2 avoids allocating space for
CET-S and also leans on the kernel to set RFBM[12]==0 and tell the
hardware not to write CET-S state.  Case 3 wastes the CET-S space in
each task and leans on the hardware init optimization to avoid writing
out CET-S state on each XSAVES.
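
For reference, RFBM is formed by hardware as (XCR0 | IA32_XSS) & EDX:EAX, and
the EDX:EAX mask comes from the fpstate itself; a simplified sketch of the
XSAVES call path (modeled on the kernel's os_xsave(), details trimmed):

static void os_xsave_sketch(struct fpstate *fpstate)
{
	u64 mask  = fpstate->xfeatures;	/* case 2: bit 12 clear, case 3: set */
	u32 lmask = mask;
	u32 hmask = mask >> 32;
	int err;

	/*
	 * Hardware forms RFBM = (XCR0 | IA32_XSS) & EDX:EAX, then skips
	 * any requested component that is in its init state (the init
	 * optimization case 3 relies on).
	 */
	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
	WARN_ON_FPU(err);
}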

#1 is: 0 lines of code.
#2 is: 5 files changed, 90 insertions(+), 27 deletions(-)
#3 is: very few lines of code, nearing zero

#2 and #3 have the same performance.

So we're down to choosing between

 * $BYTES space in 'struct fpu' (on hardware supporting CET-S)

or

 * ~100 loc

$BYTES is 24, right?  Did I get anything wrong?

So, here's my stake in the ground: I think the 100 lines of code is
probably worth it.  But I also hate complicating the FPU code, so I'm
also somewhat drawn to just eating the 24 bytes and moving on.

But I'm still in the "case 2" camp.

Anybody disagree?
Rick Edgecombe July 11, 2024, 10:11 p.m. UTC | #3
On Thu, 2024-07-11 at 13:58 -0700, Dave Hansen wrote:
> So we're down to choosing between
> 
>  * $BYTES space in 'struct fpu' (on hardware supporting CET-S)
> 
> or
> 
>  * ~100 loc
> 
> $BYTES is 24, right?  Did I get anything wrong?

Do we know what the actual memory use is? It would increase the size asked of
the allocator by 24 bytes, but what amount of memory actually gets reserved?

It is sometimes a slab allocated buffer, and sometimes a vmalloc, right? I'm not
sure about slab sizes, but for vmalloc if the increase doesn't cross a page
size, it will be the same size allocation in reality. Or if it is close to a
page size already, it might use a whole extra 4096 bytes.

So we might be looking at a situation where some tasks get an entire extra page
allocated per task, and some see no difference, with only the average being a
24-byte increase.
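
A toy illustration of the rounding effect (the sizes below are made up for
illustration, not measured fpstate sizes):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
	/* Hypothetical allocation sizes, purely to show the boundary effect. */
	unsigned long sizes[] = { 11000UL, 12272UL };

	for (unsigned int i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%5lu bytes -> %lu pages; +24 bytes -> %lu pages\n",
		       sizes[i],
		       PAGE_ALIGN(sizes[i]) / PAGE_SIZE,
		       PAGE_ALIGN(sizes[i] + 24) / PAGE_SIZE);

	return 0;
}

11000 bytes stays at 3 pages with the extra 24 bytes; 12272 bytes rounds up
from 3 pages to 4, i.e. the whole-extra-page worst case.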

Hmm, I was trying to reason out something to use for tie-breaking. Not sure how
successful it was. The worst case of a whole extra page sounds like something
someone might care about?
Dave Hansen July 11, 2024, 10:30 p.m. UTC | #4
On 7/11/24 15:11, Edgecombe, Rick P wrote:
> On Thu, 2024-07-11 at 13:58 -0700, Dave Hansen wrote:
>> So we're down to choosing between
>>
>>  * $BYTES space in 'struct fpu' (on hardware supporting CET-S)
>>
>> or
>>
>>  * ~100 loc
>>
>> $BYTES is 24, right?  Did I get anything wrong?
> 
> Do we know what the actual memory use is? It would increase the size asked of
> the allocator by 24 bytes, but what amount of memory actually gets reserved?
> 
> It is sometimes a slab allocated buffer, and sometimes a vmalloc, right? I'm not
> sure about slab sizes, but for vmalloc if the increase doesn't cross a page
> size, it will be the same size allocation in reality. Or if it is close to a
> page size already, it might use a whole extra 4096 bytes.

Man, I hope I don't have this all mixed up in my head.  Wouldn't be the
first time.  I _think_ you might be confusing thread_info and
thread_struct, though.  I know I've gotten them confused before.

But we get to the 'struct fpu' via:

	current->thread.fpu

Where current is a 'task_struct' which is in /proc/slabinfo and 'struct
thread_struct thread' and 'struct fpu' are embedded in 'task_struct',
not allocated on their own:

	task_struct         2958   3018  10048  3 8 ...

So my current task_struct is 10048 bytes and 3 of them fit in each
8-page slab, leaving 2624 bytes to spare.
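
A quick check of that arithmetic (the slabinfo columns above are
<active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>):

#include <stdio.h>

int main(void)
{
	unsigned long slab_bytes = 8UL * 4096;	/* 8 pages per slab */
	unsigned long objsize    = 10048;	/* task_struct object size */
	unsigned long per_slab   = slab_bytes / objsize;

	printf("%lu objects per slab, %lu bytes spare\n",
	       per_slab, slab_bytes - per_slab * objsize);
	/* -> 3 objects per slab, 2624 bytes spare; +24 bytes still fits 3. */
	return 0;
}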

I don't think we're too dainty about adding things to task_struct.  Are we?

> So we might be looking at a situation where some tasks get an entire extra page
> allocated per task, and some get no difference. And only the average is 24 bytes
> increase.

I think you're right here, at least when it comes to large weirdly-sized
slabs.  But _so_ many things affect task_struct that I've never seen
anyone sweat it too much.
Rick Edgecombe July 11, 2024, 10:55 p.m. UTC | #5
On Thu, 2024-07-11 at 15:30 -0700, Dave Hansen wrote:
> > > $BYTES is 24, right?  Did I get anything wrong?
> > 
> > Do we know what the actual memory use is? It would increase the size asked
> > of the allocator by 24 bytes, but what amount of memory actually gets
> > reserved?
> > 
> > It is sometimes a slab allocated buffer, and sometimes a vmalloc, right?
> > I'm not sure about slab sizes, but for vmalloc if the increase doesn't
> > cross a page size, it will be the same size allocation in reality. Or if
> > it is close to a page size already, it might use a whole extra 4096 bytes.
> 
> Man, I hope I don't have this all mixed up in my head.  Wouldn't be the
> first time.  I _think_ you might be confusing thread_info and
> thread_struct, though.  I know I've gotten them confused before.
> 
> But we get to the 'struct fpu' via:
> 
>         current->thread.fpu
> 
> Where current is a 'task_struct' which is in /proc/slabinfo and 'struct
> thread_struct thread' and 'struct fpu' are embedded in 'task_struct',
> not allocated on their own:

I think thread_struct is always a slab, but the current->thread.fpu.fpstate
pointer can be reallocated to point to a vmalloc in fpstate_realloc(), in the
case of XFD features.
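
For reference, the indirection being described looks roughly like this
(simplified sketch; the real layout lives in arch/x86/include/asm/fpu/types.h):

struct fpstate;				/* xsave buffer + metadata */

struct fpu_sketch {
	struct fpstate	*fpstate;	/* normally points at __fpstate below */
	/* ... other members ... */
	/*
	 * struct fpstate __fpstate;
	 * Embedded default buffer sized for the non-dynamic features;
	 * fpstate_realloc() repoints ->fpstate at a larger vmalloc'ed
	 * buffer when an XFD-controlled feature (e.g. AMX) is enabled.
	 */
};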

> 
>         task_struct         2958   3018  10048  3 8 ...
> 
> So my current task_struct is 10048 bytes and 3 of them fit in each
> 8-page slab, leaving 2624 bytes to spare.
> 
> I don't think we're too dainty about adding thing to task_struct.  Are we?

So for you there would actually not be any extra memory usage from unconditionally
adding 24 bytes to the xstate. But, yes, it all could change for a number of
reasons.

> 
> > So we might be looking at a situation where some tasks get an entire extra
> > page allocated per task, and some see no difference, with only the average
> > being a 24-byte increase.
> 
> I think you're right here, at least when it comes to large weirdly-sized
> slabs.  But _so_ many things affect task_struct that I've never seen
> anyone sweat it too much.

Makes sense. Then I can't think of any argument for moving away from case 2.
Yang, Weijiang July 12, 2024, 6:27 a.m. UTC | #6
On 7/12/2024 4:58 AM, Dave Hansen wrote:
> On 7/8/24 20:17, Yang, Weijiang wrote:
>> So I'm not sure whether XFEATURE_MASK_KERNEL_DYNAMIC and the related changes
>> are worthwhile for this series.
>>
>> Could you share your thoughts?
> First of all, I really do appreciate when folks make the effort to _try_
> to draw their own conclusions before asking the maintainers to share
> theirs.  Next time, OK? ;)

Hi, Dave,
Sorry for not doing that, and thanks for making the conclusion clear!
I personally prefer applying the whole series so as to eliminate the storage space issue and give
the guest FPU config its own settings. But I'm also not sure the changes are worthwhile from the
kernel's point of view.
Dave Hansen July 12, 2024, 2:36 p.m. UTC | #7
On 7/11/24 15:55, Edgecombe, Rick P wrote:
>> Where current is a 'task_struct' which is in /proc/slabinfo and 'struct
>> thread_struct thread' and 'struct fpu' are embedded in 'task_struct',
>> not allocated on their own:
> I think thread_struct is always a slab, but the current->thread.fpu.fpstate
> pointer can be reallocated to point to a vmalloc in fpstate_realloc(), in the
> case of XFD features.

Good point. I was ignoring XFD and AMX.  They're super rare and
(conditionally) add another 8k. -- Well, closer to 11k since we
duplicate some of the XSAVE area. --  But honestly, even if the AMX
'struct fpu' fit perfectly into 4k*3 pages and CET-S made it take 4k*4
pages,  I'm not sure I'd even care.  It would only affect AMX-using apps
on AMX-capable hardware.  So a small minority of tasks on a small
minority of one x86 vendor's CPUs.

The (potential) space consumption from the inline task_struct fpu will
matter a lot more across all Linux users than AMX ever will.  It would
affect all tasks on all CPUs that have CET-S which will hopefully be the
majority of x86 CPUs running Linux some day.