[v4,0/6] KVM: MMU: performance tweaks for heavy CR0.WP users

Message ID 20230322013731.102955-1-minipli@grsecurity.net (mailing list archive)

Message

Mathias Krause March 22, 2023, 1:37 a.m. UTC
v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/

This series is the fourth iteration of resurrecting the missing pieces of
Paolo's previous attempt[1] to avoid needless MMU roots unloading.

It's incorporating Sean's feedback to v3 and rebased on top of
kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
kvm_arch_vm_ioctl() to "int"").

The performance gap between TDP and legacy MMU still exists, and it's
especially noticeable under grsecurity, which implements kernel W^X by
toggling CR0.WP -- something that happens very frequently. This series
tries to fix this needless performance loss.
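
To make it concrete why this hurts so much under KVM: such a CR0.WP-based
W^X scheme boils down to something like the sketch below (purely
illustrative, not grsecurity's actual code; 'ro_ptr' and 'new_val' are
made-up names). Every write to otherwise read-only kernel data is bracketed
by two CR0 writes, and each of them causes a VM-exit as long as CR0.WP is
intercepted:

    unsigned long cr0 = read_cr0();

    write_cr0(cr0 & ~X86_CR0_WP);        /* "open": allow writing r/o data */
    *(unsigned long *)ro_ptr = new_val;  /* the actual (rare path) write   */
    write_cr0(cr0);                      /* "close": CR0.WP back on        */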

Patch 1 is a v3 of [3], addressing Sean's feedback.

Patch 2 implements Sean's feedback[2] to Paolo's original approach and
skips unloading the MMU roots for CR0.WP toggling under TDP.
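
The core of it is a small fast path early in kvm_post_set_cr0(), roughly
like the sketch below -- see the actual patch for the authoritative
version:

    if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
            /*
             * CR0.WP doesn't affect the TDP page tables, so there's no
             * need to unload the MMU roots; refreshing the MMU's metadata
             * via kvm_init_mmu() is sufficient.
             */
            kvm_init_mmu(vcpu);
            return;
    }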

Patch 3 further micro-optimizes this for non-paging guests -- anyone still
running MS Singularity? ;)

Sean suggested another change on top of v2 of this series (and a more
elaborate version of it on top of v3): skip intercepting CR0.WP writes
completely for VMX[4]. That turned out to be yet another performance boost
and is implemented in patch 6.
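
The rough shape of that change -- a simplified sketch only, the patch
itself is what counts -- is to add CR0.WP to the CR0 bits the guest may
own and to keep intercepting it only when EPT is disabled, as shadow
paging has to see CR0.WP changes:

    #define KVM_POSSIBLE_CR0_GUEST_BITS    (X86_CR0_TS | X86_CR0_WP)

    static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
    {
            unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;

            /* Shadow paging must still see CR0.WP changes to build SPTEs. */
            if (!enable_ept)
                    bits &= ~X86_CR0_WP;

            return bits;
    }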

Patches 2 and 6 are the most important ones, as they bring the big
performance gains.

I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
grsecurity L1 VM. The table below shows the results (runtime in seconds,
lower is better):

                              legacy     TDP    shadow
    kvm-x86/next@d8708b        8.43s    9.45s    70.3s
    + patches 1-3              5.39s    5.63s    70.2s
    + patches 4-6              3.51s    3.47s    67.8s

I re-ran the benchmarks (again) and the numbers changed a little from the
ones in v3. We got a better baseline, which is likely due to the rebase onto
a v6.3-rc2 based tree (it was v6.2-rc3 based before).

Patches 1, 4 and 5 didn't change from v3, aside from minor changelog tweaks.

Patches 2 and 6 have been rewritten.

Patch 3 is new to this series.

Bonus rant^Wbug report:

Actually re-running the benchmarks took me a while because my VM was
constantly crashing on me with a #GP during scheduling. Looking a little
closer, I noticed it was for an IA32_PRED_CMD MSR write, which was puzzling,
as the VM's kernel didn't change for my tests (I built it more than a month
ago!), so the old test runs should have triggered that code path (and #GP)
too! Digging through some kernel code showed it's all tied to the x86
feature flag X86_FEATURE_USE_IBPB, which gets set when X86_FEATURE_IBPB is,
i.e. when the CPU supports IBPB.

*head-scratching pause* 

Foolish me probably did a system update of the host and got a microcode
update that added IBPB support to my CPU. Yayyy... NOT! As this implies
announcing IBPB support to the VM as well and, in turn, makes the guest
kernel try to use it, I'm doomed to hit that bug. *bummer*

Something must not be right with KVM / QEMU / the kernel, as this guest
behaviour clearly shouldn't cause KVM to inject a #GP into the guest.

The bugging call chain in the guest kernel is:
  -> switch_mm_irqs_off()
     -> cond_mitigation()
        -> indirect_branch_prediction_barrier()
           -> alternative_msr_write(MSR_IA32_PRED_CMD, val, X86_FEATURE_USE_IBPB))
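
For reference, the last step boils down to the following (paraphrased from
arch/x86/include/asm/nospec-branch.h): the WRMSR is only patched in by the
alternatives framework when X86_FEATURE_USE_IBPB is set, which is why the
#GP only showed up once the (virtual) CPU started advertising IBPB:

    static inline void indirect_branch_prediction_barrier(void)
    {
            u64 val = PRED_CMD_IBPB;

            alternative_msr_write(MSR_IA32_PRED_CMD, val,
                                  X86_FEATURE_USE_IBPB);
    }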

So far I'm working around this by passing 'clearcpuid=ibpb' on the kernel
command line. But this should probably be fixed, which is why I'm mentioning
it. It's just too late here already to debug this further today. Well, for
some definition of "today."


Thanks,
Mathias

[1] https://lore.kernel.org/kvm/20220217210340.312449-1-pbonzini@redhat.com/
[2] https://lore.kernel.org/kvm/YhATewkkO%2Fl4P9UN@google.com/
[3] https://lore.kernel.org/kvm/YhAB1d1%2FnQbx6yvk@google.com/
[4] https://lore.kernel.org/kvm/Y8cTMnyBzNdO5dY3@google.com/
[5] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git


Mathias Krause (5):
  KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP
    enabled
  KVM: x86: Ignore CR0.WP toggles in non-paging mode
  KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
  KVM: x86/mmu: Fix comment typo
  KVM: VMX: Make CR0.WP a guest owned bit

Paolo Bonzini (1):
  KVM: x86/mmu: Avoid indirect call for get_cr3

 arch/x86/kvm/kvm_cache_regs.h  |  2 +-
 arch/x86/kvm/mmu/mmu.c         | 31 ++++++++++++++++++++-----------
 arch/x86/kvm/mmu/paging_tmpl.h |  2 +-
 arch/x86/kvm/mmu/spte.c        |  2 +-
 arch/x86/kvm/pmu.c             |  4 ++--
 arch/x86/kvm/vmx/nested.c      |  4 ++--
 arch/x86/kvm/vmx/vmx.c         |  6 +++---
 arch/x86/kvm/vmx/vmx.h         | 18 ++++++++++++++++++
 arch/x86/kvm/x86.c             | 18 ++++++++++++++++++
 9 files changed, 66 insertions(+), 21 deletions(-)

Comments

Mathias Krause March 22, 2023, 7:41 a.m. UTC | #1
On 22.03.23 02:37, Mathias Krause wrote:
> Bonus rant^Wbug report:
> [VMs #GP'ing on WRMSR(IA32_PRED_CMD,...)]

I see that this is already a well-known bug, fully analyzed and taken
care of:

https://lore.kernel.org/kvm/20230322011440.2195485-1-seanjc@google.com/

Thanks,
Mathias
Sean Christopherson March 23, 2023, 10:50 p.m. UTC | #2
On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> 
> This series is the fourth iteration of resurrecting the missing pieces of
> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> 
> It's incorporating Sean's feedback to v3 and rebased on top of
> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> kvm_arch_vm_ioctl() to "int"").
> 
> [...]

Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!

[1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
      https://github.com/kvm-x86/linux/commit/2fdcc1b32418
[2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
      https://github.com/kvm-x86/linux/commit/01b31714bd90
[3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
      https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
[4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
      https://github.com/kvm-x86/linux/commit/74cdc836919b
[5/6] KVM: x86/mmu: Fix comment typo
      https://github.com/kvm-x86/linux/commit/50f13998451e
[6/6] KVM: VMX: Make CR0.WP a guest owned bit
      https://github.com/kvm-x86/linux/commit/fb509f76acc8

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
Mathias Krause March 25, 2023, 11:39 a.m. UTC | #3
On 23.03.23 23:50, Sean Christopherson wrote:
> On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
>> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
>>
>> This series is the fourth iteration of resurrecting the missing pieces of
>> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
>>
>> It's incorporating Sean's feedback to v3 and rebased on top of
>> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
>> kvm_arch_vm_ioctl() to "int"").
>>
>> [...]
> 
> Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> 
> [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
>       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
>       https://github.com/kvm-x86/linux/commit/01b31714bd90
> [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
>       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
>       https://github.com/kvm-x86/linux/commit/74cdc836919b
> [5/6] KVM: x86/mmu: Fix comment typo
>       https://github.com/kvm-x86/linux/commit/50f13998451e
> [6/6] KVM: VMX: Make CR0.WP a guest owned bit
>       https://github.com/kvm-x86/linux/commit/fb509f76acc8

Thanks a lot, Sean!

As this is a huge performance fix for us, we'd like to get it integrated
into current stable kernels as well -- not without having the changes
get some wider testing, of course, i.e. not before they end up in a
non-rc version released by Linus. But I already did a backport to 5.4 to
get a feeling how hard it would be and for the impact it has on older
kernels.

Using the 'ssdd 10 50000' test I used before, I get promising results
there as well. Without the patches it takes 9.31s, while with them we're
down to 4.64s. Taking into account that this is the runtime of a
workload in a VM that gets cut in half, I hope this qualifies as stable
material, as it's a huge performance fix.

Greg, what's your opinion on it? Original series here:
https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/

Thanks,
Mathias
Greg KH March 25, 2023, 12:25 p.m. UTC | #4
On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
> On 23.03.23 23:50, Sean Christopherson wrote:
> > On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> >> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> >>
> >> This series is the fourth iteration of resurrecting the missing pieces of
> >> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> >>
> >> It's incorporating Sean's feedback to v3 and rebased on top of
> >> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> >> kvm_arch_vm_ioctl() to "int"").
> >>
> >> [...]
> > 
> > Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> > 
> > [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
> >       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> > [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
> >       https://github.com/kvm-x86/linux/commit/01b31714bd90
> > [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
> >       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> > [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> >       https://github.com/kvm-x86/linux/commit/74cdc836919b
> > [5/6] KVM: x86/mmu: Fix comment typo
> >       https://github.com/kvm-x86/linux/commit/50f13998451e
> > [6/6] KVM: VMX: Make CR0.WP a guest owned bit
> >       https://github.com/kvm-x86/linux/commit/fb509f76acc8
> 
> Thanks a lot, Sean!
> 
> As this is a huge performance fix for us, we'd like to get it integrated
> into current stable kernels as well -- not without having the changes
> get some wider testing, of course, i.e. not before they end up in a
> non-rc version released by Linus. But I already did a backport to 5.4 to
> get a feeling how hard it would be and for the impact it has on older
> kernels.
> 
> Using the 'ssdd 10 50000' test I used before, I get promising results
> there as well. Without the patches it takes 9.31s, while with them we're
> down to 4.64s. Taking into account that this is the runtime of a
> workload in a VM that gets cut in half, I hope this qualifies as stable
> material, as it's a huge performance fix.
> 
> Greg, what's your opinion on it? Original series here:
> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/

I'll leave the judgement call up to the KVM maintainers, as they are the
ones that need to ack any KVM patch added to stable trees.

thanks,

greg k-h
Sean Christopherson April 6, 2023, 2:25 a.m. UTC | #5
On Sat, Mar 25, 2023, Greg KH wrote:
> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
> > On 23.03.23 23:50, Sean Christopherson wrote:
> > > On Wed, 22 Mar 2023 02:37:25 +0100, Mathias Krause wrote:
> > >> v3: https://lore.kernel.org/kvm/20230201194604.11135-1-minipli@grsecurity.net/
> > >>
> > >> This series is the fourth iteration of resurrecting the missing pieces of
> > >> Paolo's previous attempt[1] to avoid needless MMU roots unloading.
> > >>
> > >> It's incorporating Sean's feedback to v3 and rebased on top of
> > >> kvm-x86/next, namely commit d8708b80fa0e ("KVM: Change return type of
> > >> kvm_arch_vm_ioctl() to "int"").
> > >>
> > >> [...]
> > > 
> > > Applied 1 and 5 to kvm-x86 mmu, and the rest to kvm-x86 misc, thanks!
> > > 
> > > [1/6] KVM: x86/mmu: Avoid indirect call for get_cr3
> > >       https://github.com/kvm-x86/linux/commit/2fdcc1b32418
> > > [2/6] KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
> > >       https://github.com/kvm-x86/linux/commit/01b31714bd90
> > > [3/6] KVM: x86: Ignore CR0.WP toggles in non-paging mode
> > >       https://github.com/kvm-x86/linux/commit/e40bcf9f3a18
> > > [4/6] KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> > >       https://github.com/kvm-x86/linux/commit/74cdc836919b
> > > [5/6] KVM: x86/mmu: Fix comment typo
> > >       https://github.com/kvm-x86/linux/commit/50f13998451e
> > > [6/6] KVM: VMX: Make CR0.WP a guest owned bit
> > >       https://github.com/kvm-x86/linux/commit/fb509f76acc8
> > 
> > Thanks a lot, Sean!
> > 
> > As this is a huge performance fix for us, we'd like to get it integrated
> > into current stable kernels as well -- not without having the changes
> > get some wider testing, of course, i.e. not before they end up in a
> > non-rc version released by Linus. But I already did a backport to 5.4 to
> > get a feeling how hard it would be and for the impact it has on older
> > kernels.
> > 
> > Using the 'ssdd 10 50000' test I used before, I get promising results
> > there as well. Without the patches it takes 9.31s, while with them we're
> > down to 4.64s. Taking into account that this is the runtime of a
> > workload in a VM that gets cut in half, I hope this qualifies as stable
> > material, as it's a huge performance fix.
> > 
> > Greg, what's your opinion on it? Original series here:
> > https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
> 
> I'll leave the judgement call up to the KVM maintainers, as they are the
> ones that need to ack any KVM patch added to stable trees.

These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
mainline.

I'm not totally opposed to the idea since our tests _should_ provide solid
coverage, e.g. existing tests caught my subtle bug, but I don't think we should
backport these without a solid use case, as there is a fairly high risk of breaking
random KVM users that wouldn't see any meaningful benefit.

In other words, who cares enough about the performance of running grsecurity kernels
in VMs to want these backported, but doesn't have the resources to maintain (or pay
someone to maintain) their own host kernel?

[*] https://lkml.kernel.org/r/20230405002608.418442-1-seanjc%40google.com
Mathias Krause April 6, 2023, 1:22 p.m. UTC | #6
On 06.04.23 04:25, Sean Christopherson wrote:
> On Sat, Mar 25, 2023, Greg KH wrote:
>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>> As this is a huge performance fix for us, we'd like to get it integrated
>>> into current stable kernels as well -- not without having the changes
>>> get some wider testing, of course, i.e. not before they end up in a
>>> non-rc version released by Linus. But I already did a backport to 5.4 to
>>> get a feeling how hard it would be and for the impact it has on older
>>> kernels.
>>>
>>> Using the 'ssdd 10 50000' test I used before, I get promising results
>>> there as well. Without the patches it takes 9.31s, while with them we're
>>> down to 4.64s. Taking into account that this is the runtime of a
>>> workload in a VM that gets cut in half, I hope this qualifies as stable
>>> material, as it's a huge performance fix.
>>>
>>> Greg, what's your opinion on it? Original series here:
>>> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
>>
>> I'll leave the judgement call up to the KVM maintainers, as they are the
>> ones that need to ack any KVM patch added to stable trees.
> 
> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
> mainline.

I totally agree. Getting the changes to work with older kernels needs
more work. The MMU role handling was refactored in 5.14 and down to 5.4
it differs even more, so backports to earlier kernels definitely need
more care.

My plan would be to limit backporting of the whole series to kernels
down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
before that only without patch 6. That would leave out the problematic
change but still give us the benefits of dropping the needless mmu
unloads for only toggling CR0.WP in the VM. This already helps us a lot!

> 
> I'm not totally opposed to the idea since our tests _should_ provide solid
> coverage, e.g. existing tests caught my subtle bug, but I don't think we should
> backport these without a solid use case, as there is a fairly high risk of breaking
> random KVM users that wouldn't see any meaningful benefit.
> 
> In other words, who cares enough about the performance of running grsecurity kernels
> in VMs to want these backported, but doesn't have the resources to maintain (or pay
> someone to maintain) their own host kernel?

The ones who care are, obviously, our customers -- and we, of course!
Customers that can run their own infrastructure don't need these
backports in upstream LTS kernels, as we will provide them as well.
However, customers that rent VMs in the cloud have no control over what
runs as the host kernel. It'll likely be some distribution kernel or a
tailored version of one, based on one of the LTS kernels.

Proxmox[1], for example, is a Debian based virtualization management
system. They do provide their own kernels, based on 5.15. However, the
official Debian stable kernel is based on 5.10. So it would be nice to
get backports down to this version at least.

[1] https://www.proxmox.com/en/proxmox-ve/features

> 
> [*] https://lkml.kernel.org/r/20230405002608.418442-1-seanjc%40google.com
Mathias Krause April 14, 2023, 9:29 a.m. UTC | #7
On 06.04.23 15:22, Mathias Krause wrote:
> On 06.04.23 04:25, Sean Christopherson wrote:
>> On Sat, Mar 25, 2023, Greg KH wrote:
>>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>>> As this is a huge performance fix for us, we'd like to get it integrated
>>>> into current stable kernels as well -- not without having the changes
>>>> get some wider testing, of course, i.e. not before they end up in a
>>>> non-rc version released by Linus. But I already did a backport to 5.4 to
>>>> get a feeling how hard it would be and for the impact it has on older
>>>> kernels.
>>>>
>>>> Using the 'ssdd 10 50000' test I used before, I get promising results
>>>> there as well. Without the patches it takes 9.31s, while with them we're
>>>> down to 4.64s. Taking into account that this is the runtime of a
>>>> workload in a VM that gets cut in half, I hope this qualifies as stable
>>>> material, as it's a huge performance fix.
>>>>
>>>> Greg, what's your opinion on it? Original series here:
>>>> https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
>>>
>>> I'll leave the judgement call up to the KVM maintainers, as they are the
>>> ones that need to ack any KVM patch added to stable trees.
>>
>> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
>> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
>> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
>> mainline.
> 
> I totally agree. Getting the changes to work with older kernels needs
> more work. The MMU role handling was refactored in 5.14 and down to 5.4
> it differs even more, so backports to earlier kernels definitely need
> more care.
> 
> My plan would be to limit backporting of the whole series to kernels
> down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
> before that only without patch 6. That would leave out the problematic
> change but still give us the benefits of dropping the needless mmu
> unloads for only toggling CR0.WP in the VM. This already helps us a lot!

To back up the "helps us a lot" with some numbers, here are the results
I got from running the 'ssdd 10 50000' micro-benchmark on the backports
I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
stated below; runtime in seconds, lower is better):

                          legacy     TDP    shadow
    Linux v5.4.240          -        8.87s   56.8s
    + patches               -        5.84s   55.4s

    Linux v5.10.177       10.37s    88.7s    69.7s
    + patches              4.88s     4.92s   70.1s

    Linux v5.15.106        9.94s    66.1s    64.9s
    + patches              4.81s     4.79s   64.6s

    Linux v6.1.23          7.65s    8.23s    68.7s
    + patches              3.36s    3.36s    69.1s

    Linux v6.2.10          7.61s    7.98s    68.6s
    + patches              3.37s    3.41s    70.2s

I guess we can grossly ignore the shadow MMU numbers, besides noting that
they regress from v5.4 to v5.10 (something to investigate?). The backports
don't help (much) for shadow MMU setups and the flux in the measurements
is likely related to the slab allocations involved.

Another unrelated data point is that TDP MMU is really broken for our
use case on v5.10 and v5.15 -- it's even slower than shadow paging!

OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
on v5.10.

I backported the whole series down to v5.10 but left out the CR0.WP
guest owning patch+fix for v5.4 as the code base is too different to get
all the nuances right, as Sean already hinted. However, even this
limited backport provides a big performance fix for our use case!

Thanks,
Mathias
Sean Christopherson April 14, 2023, 4:49 p.m. UTC | #8
+Jeremi

On Fri, Apr 14, 2023, Mathias Krause wrote:
> On 06.04.23 15:22, Mathias Krause wrote:
> > On 06.04.23 04:25, Sean Christopherson wrote:
> >> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
> >> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
> >> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
> >> mainline.
> > 
> > I totally agree. Getting the changes to work with older kernels needs
> > more work. The MMU role handling was refactored in 5.14 and down to 5.4
> > it differs even more, so backports to earlier kernels definitely need
> > more care.
> > 
> > My plan would be to limit backporting of the whole series to kernels
> > down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
> > before that only without patch 6. That would leave out the problematic
> > change but still give us the benefits of dropping the needless mmu
> > unloads for only toggling CR0.WP in the VM. This already helps us a lot!
> 
> To back up the "helps us a lot" with some numbers, here are the results
> I got from running the 'ssdd 10 50000' micro-benchmark on the backports
> I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
> stated below; runtime in seconds, lower is better):
> 
>                           legacy     TDP    shadow
>     Linux v5.4.240          -        8.87s   56.8s
>     + patches               -        5.84s   55.4s

I believe "legacy" and "TDP" are flip-flopped, the TDP MMU does't exist in v5.4.

>     Linux v5.10.177       10.37s    88.7s    69.7s
>     + patches              4.88s     4.92s   70.1s
> 
>     Linux v5.15.106        9.94s    66.1s    64.9s
>     + patches              4.81s     4.79s   64.6s
> 
>     Linux v6.1.23          7.65s    8.23s    68.7s
>     + patches              3.36s    3.36s    69.1s
> 
>     Linux v6.2.10          7.61s    7.98s    68.6s
>     + patches              3.37s    3.41s    70.2s
> 
> I guess we can grossly ignore the shadow MMU numbers, besides noting that
> they regress from v5.4 to v5.10 (something to investigate?). The backports
> don't help (much) for shadow MMU setups and the flux in the measurements
> is likely related to the slab allocations involved.
> 
> Another unrelated data point is that TDP MMU is really broken for our
> use case on v5.10 and v5.15 -- it's even slower than shadow paging!
> 
> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> on v5.10.

Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
TDP MMU by default until v5.14 for very good reasons.

> I backported the whole series down to v5.10 but left out the CR0.WP
> guest owning patch+fix for v5.4 as the code base is too different to get
> all the nuances right, as Sean already hinted. However, even this
> limited backport provides a big performance fix for our use case!

As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
I think that's perfectly acceptable since KVM has had the suboptimal behavior
literally since EPT/NPT support was first added.

I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
level of confidence that we aren't overlooking some subtle dependency.

For v5.15, I am less confident in the safety of a backport, and more importantly,
I think we should disable the TDP MMU by default to mitigate the underlying flaw
that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
SMM.

We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).

Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
to be rolling their own kernels, i.e. can do the backports themselves if they want
to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
to live migrate them.

[1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
[2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
[3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
[4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com
Jeremi Piotrowski April 14, 2023, 8:09 p.m. UTC | #9
On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
> +Jeremi
> 

Adding myself :)

> On Fri, Apr 14, 2023, Mathias Krause wrote:

...

> > OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> > for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> > on v5.10.
> 
> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> TDP MMU by default until v5.14 for very good reasons.
> 
> > I backported the whole series down to v5.10 but left out the CR0.WP
> > guest owning patch+fix for v5.4 as the code base is too different to get
> > all the nuances right, as Sean already hinted. However, even this
> > limited backport provides a big performance fix for our use case!
> 
> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> I think that's perfectly acceptable since KVM has had the suboptimal behavior
> literally since EPT/NPT support was first added.
> 

Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.

> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> level of confidence that we aren't overlooking some subtle dependency.
> 
> For v5.15, I am less confident in the safety of a backport, and more importantly,
> I think we should disable the TDP MMU by default to mitigate the underlying flaw
> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> SMM.
> 

The interesting thing here is that these CR0.WP fixes seem to improve things
with legacy MMU as well, and legacy MMU is not affected/touched by [3].

So I think you can consider Mathias' ask independent of disabling TDP MMU. On the one
hand: there is no regression here. On the other: the gain is big and seems important
to him.

I didn't have time to review these patches so I can't judge risk-benefit, or
whether any single patch might be a silver bullet on its own.

> We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
> exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
> KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
> finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
> to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
> these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).
> 
> Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
> to be rolling their own kernels, i.e. can do the backports themselves if they want
> to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
> better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
> doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
> to live migrate them.
> 
> [1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
> [2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
> [3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
> [4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com
Sean Christopherson April 14, 2023, 8:17 p.m. UTC | #10
On Fri, Apr 14, 2023, Jeremi Piotrowski wrote:
> On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
> > +Jeremi
> > 
> 
> Adding myself :)

/facepalm

This isn't some mundane detail, Michael!!!

> > On Fri, Apr 14, 2023, Mathias Krause wrote:
> 
> ...
> 
> > > OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
> > > for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
> > > on v5.10.
> > 
> > Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> > TDP MMU by default until v5.14 for very good reasons.
> > 
> > > I backported the whole series down to v5.10 but left out the CR0.WP
> > > guest owning patch+fix for v5.4 as the code base is too different to get
> > > all the nuances right, as Sean already hinted. However, even this
> > > limited backport provides a big performance fix for our use case!
> > 
> > As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> > and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> > I think that's perfectly acceptable since KVM has had the suboptimal behavior
> > literally since EPT/NPT support was first added.
> > 
> 
> Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.
> 
> > I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> > substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> > level of confidence that we aren't overlooking some subtle dependency.
> > 
> > For v5.15, I am less confident in the safety of a backport, and more importantly,
> > I think we should disable the TDP MMU by default to mitigate the underlying flaw
> > that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> > rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> > SMM.
> > 
> 
> The interesting thing here is that these CR0.WP fixes seem to improve things
> with legacy MMU as well, and legacy MMU is not affected/touched by [3].

Yep, that's totally expected.  The final patch in this series allows KVM to elide
VM-Exits when the guest toggles CR0.WP (but only on Intel hardware).  Avoiding
VM-Exit entirely is a big performance win when the guest is constantly toggling
CR0.WP, e.g. each exit is roughly 1500 cycles, versus probably something like ~50
for a native write to CR0.WP.

> So I think you can consider Mathias' ask independent of disabling TDP MMU. On the one
> hand: there is no regression here. On the other: the gain is big and seems important
> to him.

Ya, that's the compromise I am proposing.  Give v6.1 the full tune-up, but only
do the super safe change for v5.15.
Jeremi Piotrowski May 2, 2023, 5:38 p.m. UTC | #11
On 4/14/2023 10:17 PM, Sean Christopherson wrote:
> On Fri, Apr 14, 2023, Jeremi Piotrowski wrote:
>> On Fri, Apr 14, 2023 at 09:49:28AM -0700, Sean Christopherson wrote:
>>> +Jeremi
>>>
>>
>> Adding myself :)
> 
> /facepalm
> 
> This isn't some mundane detail, Michael!!!
> 
>>> On Fri, Apr 14, 2023, Mathias Krause wrote:
>>
>> ...
>>
>>>> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
>>>> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
>>>> on v5.10.
>>>
>>> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
>>> TDP MMU by default until v5.14 for very good reasons.
>>>
>>>> I backported the whole series down to v5.10 but left out the CR0.WP
>>>> guest owning patch+fix for v5.4 as the code base is too different to get
>>>> all the nuances right, as Sean already hinted. However, even this
>>>> limited backport provides a big performance fix for our use case!
>>>
>>> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
>>> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
>>> I think that's perfectly acceptable since KVM has had the suboptimal behavior
>>> literally since EPT/NPT support was first added.
>>>
>>
>> Disabling TDP MMU for v5.15, and backporting things to v6.1 works for me.
>>
>>> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
>>> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
>>> level of confidence that we aren't overlooking some subtle dependency.
>>>
>>> For v5.15, I am less confident in the safety of a backport, and more importantly,
>>> I think we should disable the TDP MMU by default to mitigate the underlying flaw
>>> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
>>> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
>>> SMM.
>>>
>>
So given that there hasn't been any further comms, I assume we stick to the plan outlined
above: disable tdp_mmu by default for 5.15.

Should that just be a revert of 71ba3f3189c78f756a659568fb473600fd78f207
("KVM: x86: enable TDP MMU by default") or a new patch? And more importantly:
do you want to post the patch, Sean, or are you busy and would prefer that
someone else did?

Jeremi
Mathias Krause May 8, 2023, 9:19 a.m. UTC | #12
Sorry for the late reply, I've been traveling the past three weeks.

On 14.04.23 18:49, Sean Christopherson wrote:
> +Jeremi
> 
> On Fri, Apr 14, 2023, Mathias Krause wrote:
>> On 06.04.23 15:22, Mathias Krause wrote:
>>> On 06.04.23 04:25, Sean Christopherson wrote:
>>>> These are quite risky to backport.  E.g. we botched patch 6[*], and my initial
>>>> fix also had a subtle bug.  There have also been quite a few KVM MMU changes since
>>>> 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in
>>>> mainline.
>>>
>>> I totally agree. Getting the changes to work with older kernels needs
>>> more work. The MMU role handling was refactored in 5.14 and down to 5.4
>>> it differs even more, so backports to earlier kernels definitely need
>>> more care.
>>>
>>> My plan would be to limit backporting of the whole series to kernels
>>> down to 5.15 (maybe 5.10 if it turns out to be doable) and for kernels
>>> before that only without patch 6. That would leave out the problematic
>>> change but still give us the benefits of dropping the needless mmu
>>> unloads for only toggling CR0.WP in the VM. This already helps us a lot!
>>
>> To back up the "helps us a lot" with some numbers, here are the results
>> I got from running the 'ssdd 10 50000' micro-benchmark on the backports
>> I did, running on a grsecurity L1 VM (host is a vanilla kernel, as
>> stated below; runtime in seconds, lower is better):
>>
>>                           legacy     TDP    shadow
>>     Linux v5.4.240          -        8.87s   56.8s
>>     + patches               -        5.84s   55.4s
> 
> I believe "legacy" and "TDP" are flip-flopped, the TDP MMU does't exist in v5.4.

Well, whatever the meaning of "TDP" is in v5.4 -- 'tdp_enabled' completely
mirrors the value of 'enable_ept' / 'npt_enabled'. But yeah, it's probably
what "legacy" means for newer kernels.

> 
>>     Linux v5.10.177       10.37s    88.7s    69.7s
>>     + patches              4.88s     4.92s   70.1s
>>
>>     Linux v5.15.106        9.94s    66.1s    64.9s
>>     + patches              4.81s     4.79s   64.6s
>>
>>     Linux v6.1.23          7.65s    8.23s    68.7s
>>     + patches              3.36s    3.36s    69.1s
>>
>>     Linux v6.2.10          7.61s    7.98s    68.6s
>>     + patches              3.37s    3.41s    70.2s
>>
>> I guess we can grossly ignore the shadow MMU numbers, besides noting that
>> they regress from v5.4 to v5.10 (something to investigate?). The backports
>> don't help (much) for shadow MMU setups and the flux in the measurements
>> is likely related to the slab allocations involved.
>>
>> Another unrelated data point is that TDP MMU is really broken for our
>> use case on v5.10 and v5.15 -- it's even slower than shadow paging!
>>
>> OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster
>> for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU
>> on v5.10.
> 
> Anyone that's enabling the TDP MMU on v5.10 is on their own, we didn't enable the
> TDP MMU by default until v5.14 for very good reasons.

Fair enough. But the numbers don't look much better for v5.15, so we
still want to fix that performance degradation when using TDP MMU (we
used to have a patch that disables TDP MMU in grsec by default but this,
of course, has no impact on setups making use of a vanilla / distro host
kernel and using grsec in the guest VM).

> 
>> I backported the whole series down to v5.10 but left out the CR0.WP
>> guest owning patch+fix for v5.4 as the code base is too different to get
>> all the nuances right, as Sean already hinted. However, even this
>> limited backport provides a big performance fix for our use case!
> 
> As a compromise of sorts, I propose that we disable the TDP MMU by default on v5.15,
> and backport these fixes to v6.1.  v5.15 and earlier won't get "ludicrous speed", but
> I think that's perfectly acceptable since KVM has had the suboptimal behavior
> literally since EPT/NPT support was first added.

The issue only started to get really bad when TDP MMU was enabled by
default in 5.14. That's why we reverted that change in grsecurity right
away. Integrating that change upstream will get us back to the pre-5.14
performance numbers but why not do better and fix the underlying bug by
backporting this series?

> 
> I'm comfortable backporting to v6.1 as that is recent enough, and there weren't
> substantial MMU changes between v6.1 and v6.3 in this area.  I.e. I have a decent
> level of confidence that we aren't overlooking some subtle dependency.

Agreed, the backports down to v6.1 were trivial.

> 
> For v5.15, I am less confident in the safety of a backport, and more importantly,
> I think we should disable the TDP MMU by default to mitigate the underlying flaw
> that makes the 18x speedup possible.  That flaw is that KVM can end up freeing and
> rebuilding TDP MMU roots every time CR0.WP is toggled or a vCPU transitions to/from
> SMM.

For v5.15 a few more commits are needed to ensure all requirements are
met, e.g. that no guest-owned CR4 bits overlap with KVM's MMU role. But
that's manageable, IMHO, as some parts already went into previous stable
updates, making the missing net diff for the backport +6 -5 lines.

> 
> We mitigated the CR0.WP case between v5.15 and v6.1[1], which is why v6.1 doesn't
> exhibit the same pain as v5.10, but Jeremi discovered that the SMM case badly affects
> KVM-on-HyperV[2], e.g. when launching KVM guests using WSL.  I posted a fix[3] to
> finally resolve the underlying bug, but as Jeremi discovered[4], backporting the fix
> to v5.15 is going to be gnarly, to say the least.  It'll be far worse than backporting
> these CR0.WP patches, and maybe even infeasible without a large scale rework (no thanks).

So disabling TDP MMU for v5.15 seems to mitigate both performance
degradations, Jeremi's and ours. However, I'd still like to get the extra
speedup. As I wrote, I already did the backports and it wasn't all that bad
in the end. I just had to read the code of older kernels and map it to what
newer kernels would do, and it's not that far away, actually. It's the
cleanup patches that make it look a lot different, but for the MMU role it's
actually not all that different. It's a completely different story for v5.4,
sure. But I'm not proposing to backport the full series to that kernel
either.

> 
> Anyone that will realize meaningful benefits from the TDP MMU is all but guaranteed
> to be rolling their own kernels, i.e. can do the backports themselves if they want
> to use a v5.15 based kernel.  The big selling point of the TDP MMU is that it scales
> better to hundreds of vCPUs, particularly when live migrating such VMs.  I highly
> doubt that anyone running a stock kernel is running 100+ vCPU VMs, let alone trying
> to live migrate them.

I'll post the backports I did and maybe I can convince you as well that
it's not all that bad ;) But I see your proposed patch [3] got merged in
the meantime and is cc:stable. It might make sense to re-do my benchmarks
after it gets applied to the older kernels.

Thanks,
Mathias

> 
> [1] https://lkml.kernel.org/r/20220209170020.1775368-1-pbonzini%40redhat.com
> [2] https://lore.kernel.org/all/959c5bce-beb5-b463-7158-33fc4a4f910c@linux.microsoft.com
> [3] https://lore.kernel.org/all/20230413231251.1481410-1-seanjc@google.com
> [4] https://lore.kernel.org/all/7332d846-fada-eb5c-6068-18ff267bd37f@linux.microsoft.com
Mathias Krause May 8, 2023, 3:57 p.m. UTC | #13
On 08.05.23 11:19, Mathias Krause wrote:
> [...]
> 
> I'll post the backports I did and maybe I can convince you as well that
> it's not all that bad ;) But I see your proposed patch [3] got merged in
> the meantime and is cc:stable. It might make sense to re-do my benchmarks
> after it gets applied to the older kernels.

I just sent out the backports to the mailing list, please have a look!

The benchmark numbers are still the old ones I did three weeks ago.
However, seeing what's needed for the backport might give you a better
feeling for the impact.

I'll refresh the series if there's demand and once edbdb43fc96b ("KVM:
x86: Preserve TDP MMU roots until they are explicitly invalidated") has been
merged into the relevant kernels.

v6.2:
https://lore.kernel.org/stable/20230508154457.29956-1-minipli@grsecurity.net/
v6.1:
https://lore.kernel.org/stable/20230508154602.30008-1-minipli@grsecurity.net/
v5.15:
https://lore.kernel.org/stable/20230508154709.30043-1-minipli@grsecurity.net/
v5.10:
https://lore.kernel.org/stable/20230508154804.30078-1-minipli@grsecurity.net/
v5.4:
https://lore.kernel.org/stable/20230508154943.30113-1-minipli@grsecurity.net/

Thanks,
Mathias