Message ID: 20171114001223.441ea2ca@annuminas.surriel.com (mailing list archive)
State: New, archived
On 14.11.2017 06:12, Rik van Riel wrote:
> Currently, every time a VCPU is scheduled out, the host kernel will
> first save the guest FPU/xstate context, then load the qemu userspace
> FPU context, only to then immediately save the qemu userspace FPU
> context back to memory. When scheduling in a VCPU, the same extraneous
> FPU loads and saves are done.
>
> This could be avoided by moving from a model where the guest FPU is
> loaded and stored with preemption disabled, to a model where the
> qemu userspace FPU is swapped out for the guest FPU context for
> the duration of the KVM_RUN ioctl.
>
> This is done under the VCPU mutex, which is also taken when other
> tasks inspect the VCPU FPU context, so the code should already be
> safe for this change. That should come as no surprise, given that
> s390 already has this optimization.
>
> No performance changes were detected in quick ping-pong tests on
> my 4 socket system, which is expected since an FPU+xstate load is
> on the order of 0.1us, while ping-ponging between CPUs is on the
> order of 20us, and somewhat noisy.
>
> There may be other tests where performance changes are noticeable.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 13 +++++++++++++
>  arch/x86/kvm/x86.c              | 29 ++++++++++++-----------------
>  2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c73e493adf07..92e66685249e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h

We should also get rid of guest_fpu_loaded now, right?

> @@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
>  struct kvm_mmu_memory_cache mmu_page_cache;
>  struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> +  * QEMU userspace and the guest each have their own FPU state.
> +  * In vcpu_run, we switch between the user and guest FPU contexts.
> +  * While running a VCPU, the VCPU thread will have the guest FPU
> +  * context.
> +  *
> +  * Note that while the PKRU state lives inside the fpu registers,
> +  * it is switched out separately at VMENTER and VMEXIT time. The
> +  * "guest_fpu" state here contains the guest FPU context, with the
> +  * host PKRU bits.
> +  */
> + struct fpu user_fpu;
>  struct fpu guest_fpu;
> +
>  u64 xcr0;
>  u64 guest_supported_xcr0;
>  u32 guest_xstate_size;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb7fcd6..59912b20a830 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2917,7 +2917,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  pagefault_enable();
>  kvm_x86_ops->vcpu_put(vcpu);
> - kvm_put_guest_fpu(vcpu);
>  vcpu->arch.last_host_tsc = rdtsc();
>  }
>
> @@ -6908,7 +6907,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  preempt_disable();
>
>  kvm_x86_ops->prepare_guest_switch(vcpu);
> - kvm_load_guest_fpu(vcpu);
>
>  /*
>   * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
> @@ -7095,6 +7093,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>
>  vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> + kvm_load_guest_fpu(vcpu);
> +
>  for (;;) {
>  if (kvm_vcpu_running(vcpu)) {
>  r = vcpu_enter_guest(vcpu);
> @@ -7132,6 +7132,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  }
>  }
>
> + kvm_put_guest_fpu(vcpu);
> +
>  srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>
>  return r;
> @@ -7663,32 +7665,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
>  vcpu->arch.cr0 |= X86_CR0_ET;
>  }
>
> +/* Swap (qemu) user FPU context for the guest FPU context. */
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> - if (vcpu->guest_fpu_loaded)
> -  return;
> -
> - /*
> -  * Restore all possible states in the guest,
> -  * and assume host would use all available bits.
> -  * Guest xcr0 would be loaded later.
> -  */
> - vcpu->guest_fpu_loaded = 1;
> - __kernel_fpu_begin();
> + preempt_disable();
> + copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
>  /* PKRU is separately restored in kvm_x86_ops->run. */
>  __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
>  			~XFEATURE_MASK_PKRU);
> + preempt_enable();
>  trace_kvm_fpu(1);
>  }
>
> +/* When vcpu_run ends, restore user space FPU context. */
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> - if (!vcpu->guest_fpu_loaded)
> -  return;
> -
> - vcpu->guest_fpu_loaded = 0;
> + preempt_disable();
>  copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
> - __kernel_fpu_end();
> + copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
> + preempt_enable();
>  ++vcpu->stat.fpu_reload;
>  trace_kvm_fpu(0);
>  }

emulator_get_fpu() does a kvm_load_guest_fpu(). Doesn't that mean
that this is now not needed anymore? (at least when emulator code
is called from inside the loop?)

Also, what about preempt_disable() at that point, still needed?
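For context, the emulator hooks David is asking about look roughly like this in kernels of this era (a simplified sketch of the arch/x86/kvm/x86.c code, reconstructed from memory rather than copied verbatim):

static void emulator_get_fpu(struct x86_emulate_ctxt *ctxt)
{
	preempt_disable();
	/* Make sure the guest FPU state is live in the registers. */
	kvm_load_guest_fpu(emul_to_vcpu(ctxt));
}

static void emulator_put_fpu(struct x86_emulate_ctxt *ctxt)
{
	preempt_enable();
}

With the guest FPU now resident for the entire KVM_RUN ioctl, the question is whether this extra load, and the preemption disabling around it, still serves any purpose.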
On Tue, 2017-11-14 at 17:57 +0100, David Hildenbrand wrote:
> 
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > index c73e493adf07..92e66685249e 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> 
> We should also get rid of guest_fpu_loaded now, right?

Indeed, we no longer need that member. I'll get rid of it.

> emulator_get_fpu() does a kvm_load_guest_fpu(). Doesn't that mean
> that this is now not needed anymore? (at least when emulator code
> is called from inside the loop?)

Now that is a very good question!

When called from inside the loop, it is indeed not needed.

My question is, can the in-kernel emulator code ever be called from
OUTSIDE the KVM_RUN ioctl loop?

If so, we need to restore the user FPU context before returning from
the emulator code. Given that the current emulator code does not do
that, I suspect this is not the case. I also see no path from the kvm
ioctl into the emulator code, other than via KVM_RUN.

The FPU and XSAVE ioctls all work on the saved vcpu->arch.guest_fpu
data, and never directly on the registers.

Looks like we can completely get rid of .get_fpu and .put_fpu...

Unless Paolo has any objection, I'll go do that :)

> Also, what about preempt_disable() at that point, still needed?

If the guest FPU context is the only FPU context loaded for the task
at that point in time, we should not need to run with preemption
disabled.

After all, if we were to get preempted, the context switch code would
automatically save and restore the guest FPU context for us.
On 14/11/2017 19:07, Rik van Riel wrote:
> My question is, can the in-kernel emulator code ever
> be called from OUTSIDE the KVM_RUN ioctl loop?

No, it can't. This makes the patch much more appealing...

Paolo

> If so, we need to restore the user FPU context before
> returning from the emulator code. Given that the current
> emulator code does not do that, I suspect this is not
> the case. I also see no path from the kvm ioctl into
> the emulator code, other than via KVM_RUN.
>
> The FPU and XSAVE ioctls all work on the saved
> vcpu->arch.guest_fpu data, and never directly on the
> registers.
On 14.11.2017 19:07, Rik van Riel wrote:
> On Tue, 2017-11-14 at 17:57 +0100, David Hildenbrand wrote:
>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index c73e493adf07..92e66685249e 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>
>> We should also get rid of guest_fpu_loaded now, right?
>
> Indeed, we no longer need that member. I'll get rid of it.
>
>> emulator_get_fpu() does a kvm_load_guest_fpu(). Doesn't that mean
>> that this is now not needed anymore? (at least when emulator code
>> is called from inside the loop?)
>
> Now that is a very good question!
>
> When called from inside the loop, it is indeed not needed.
>
> My question is, can the in-kernel emulator code ever be called from
> OUTSIDE the KVM_RUN ioctl loop?
>
> If so, we need to restore the user FPU context before returning from
> the emulator code. Given that the current emulator code does not do
> that, I suspect this is not the case. I also see no path from the kvm
> ioctl into the emulator code, other than via KVM_RUN.
>
> The FPU and XSAVE ioctls all work on the saved vcpu->arch.guest_fpu
> data, and never directly on the registers.
>
> Looks like we can completely get rid of .get_fpu and .put_fpu...
>
> Unless Paolo has any objection, I'll go do that :)

I think we should check all get/put_fpu callers if they need
preempt_disable().

E.g. em_fxrstor() needs disabled preemption as we temporarily
save + restore some host registers (via fxsave + fxrstor) under some
circumstances that are not saved/restored when switching to/back from
another process. We should double check.

@Paolo what about complete_userspace_io? It can end up calling
emulate_instruction(). So maybe we have to move load/put fpu further
out, or add special handling.
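To illustrate the em_fxrstor() pattern David describes, here is a hand-written sketch of the idea, not the actual arch/x86/kvm/emulate.c code; fxrstor_pattern_sketch() is a hypothetical name and the merge step is elided:

static void fxrstor_pattern_sketch(const struct fxregs_state *guest)
{
	struct fxregs_state fx_state;

	/* Snapshot the live host FPU registers into a stack buffer. */
	asm volatile("fxsave %0" : "=m" (fx_state));

	/* Merge guest-specified fields from *guest into fx_state (elided). */

	/*
	 * Reload the merged image. The host registers preserved through
	 * the snapshot only survive preemption between the fxsave and
	 * the fxrstor if the context switch saves and restores exactly
	 * that in-register state.
	 */
	asm volatile("fxrstor %0" : : "m" (fx_state));
}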
On Tue, 2017-11-14 at 20:40 +0100, David Hildenbrand wrote:
> On 14.11.2017 19:07, Rik van Riel wrote:
> > On Tue, 2017-11-14 at 17:57 +0100, David Hildenbrand wrote:
> > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index c73e493adf07..92e66685249e 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > 
> > > We should also get rid of guest_fpu_loaded now, right?
> > 
> > Indeed, we no longer need that member. I'll get rid of it.
> > 
> > > emulator_get_fpu() does a kvm_load_guest_fpu(). Doesn't that mean
> > > that this is now not needed anymore? (at least when emulator code
> > > is called from inside the loop?)
> > 
> > Now that is a very good question!
> > 
> > When called from inside the loop, it is indeed not needed.
> > 
> > My question is, can the in-kernel emulator code ever be called from
> > OUTSIDE the KVM_RUN ioctl loop?
> > 
> > If so, we need to restore the user FPU context before returning
> > from the emulator code. Given that the current emulator code does
> > not do that, I suspect this is not the case. I also see no path
> > from the kvm ioctl into the emulator code, other than via KVM_RUN.
> > 
> > The FPU and XSAVE ioctls all work on the saved vcpu->arch.guest_fpu
> > data, and never directly on the registers.
> > 
> > Looks like we can completely get rid of .get_fpu and .put_fpu...
> > 
> > Unless Paolo has any objection, I'll go do that :)
> 
> I think we should check all get/put_fpu callers if they need
> preempt_disable().
> 
> E.g. em_fxrstor() needs disabled preemption as we temporarily
> save + restore some host registers (via fxsave + fxrstor) under some
> circumstances that are not saved/restored when switching to/back from
> another process. We should double check.
> 
> @Paolo what about complete_userspace_io? It can end up calling
> emulate_instruction(). So maybe we have to move load/put fpu further
> out, or add special handling.

It looks like all complete_userspace_io does is cause the vcpu_run
loop to exit, and return to userspace from the KVM_RUN ioctl code.

In other words, the userspace qemu FPU context should be restored
before we return to userspace, even with my patch (v2 on the way).
On 14/11/2017 20:40, David Hildenbrand wrote:
> I think we should check all get/put_fpu callers if they need
> preempt_disable().
> 
> E.g. em_fxrstor() needs disabled preemption as we temporarily
> save + restore some host registers (via fxsave + fxrstor) under some
> circumstances that are not saved/restored when switching to/back from
> another process. We should double check.

Rik may correct me, but I believe that you don't need
preempt_disable/enable because preempt notifiers do this for you.

Paolo
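For reference, the preempt notifiers Paolo refers to are wired up by generic KVM code in virt/kvm/kvm_main.c, roughly as in the simplified sketch below: while a VCPU thread is between vcpu_load() and vcpu_put(), the scheduler invokes these callbacks around every context switch, so per-CPU VCPU state is migrated on the thread's behalf without the rest of KVM disabling preemption.

static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* Reload per-CPU VCPU state on whichever CPU we landed on. */
	kvm_arch_vcpu_load(vcpu, cpu);
}

static void kvm_sched_out(struct preempt_notifier *pn,
			  struct task_struct *next)
{
	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

	/* Save per-CPU VCPU state before another task runs. */
	kvm_arch_vcpu_put(vcpu);
}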
On 15.11.2017 09:34, Paolo Bonzini wrote:
> On 14/11/2017 20:40, David Hildenbrand wrote:
>> I think we should check all get/put_fpu callers if they need
>> preempt_disable().
>>
>> E.g. em_fxrstor() needs disabled preemption as we temporarily
>> save + restore some host registers (via fxsave + fxrstor) under some
>> circumstances that are not saved/restored when switching to/back
>> from another process. We should double check.
>
> Rik may correct me, but I believe that you don't need
> preempt_disable/enable because preempt notifiers do this for you.

This seems to boil down to what Wanpeng Li asked about preempt
notifiers in response to v2.
On Wed, 2017-11-15 at 09:34 +0100, Paolo Bonzini wrote:
> On 14/11/2017 20:40, David Hildenbrand wrote:
> > I think we should check all get/put_fpu callers if they need
> > preempt_disable().
> > 
> > E.g. em_fxrstor() needs disabled preemption as we temporarily
> > save + restore some host registers (via fxsave + fxrstor) under
> > some circumstances that are not saved/restored when switching
> > to/back from another process. We should double check.
> 
> Rik may correct me, but I believe that you don't need
> preempt_disable/enable because preempt notifiers do this for you.

We no longer even need the preempt notifiers to save and restore the
guest FPU state.

The context switch code itself will save the FPU state from the
registers, into current->thread.fpu.state, when the VCPU thread gets
scheduled out.

When the VCPU thread gets scheduled in, the scheduler will restore
the guest FPU state from current->thread.fpu.state.

At this point, vcpu->arch.guest_fpu may be OUT OF DATE.

However, this is just fine, because we will save the guest FPU state
into vcpu->arch.guest_fpu in kvm_put_guest_fpu, before we leave the
KVM_RUN ioctl, and before we release the vcpu->mutex.

In other words, by the time anybody else can examine the VCPU FPU
state (after they obtain the vcpu->mutex), the vcpu->arch.guest_fpu
area will contain the correct FPU state.
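Rik's invariant can be condensed into a sketch of the KVM_RUN flow under this patch; this is illustrative pseudocode, with vcpu_run_sketch() a hypothetical name and error handling elided:

static int vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
	/* vcpu->mutex is held for the entire KVM_RUN ioctl. */
	kvm_load_guest_fpu(vcpu);	/* registers: qemu user -> guest */

	for (;;) {
		int r = vcpu_enter_guest(vcpu);
		/*
		 * If we are preempted anywhere in this loop, the normal
		 * context switch path saves the live guest FPU state into
		 * current->thread.fpu.state and reloads it on sched-in.
		 * vcpu->arch.guest_fpu is stale during that window, which
		 * is fine: readers must take vcpu->mutex, and we hold it.
		 */
		if (r <= 0)
			break;
	}

	kvm_put_guest_fpu(vcpu);	/* registers: guest -> qemu user */
	return 0;
}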
On 15.11.2017 15:50, Rik van Riel wrote:
> On Wed, 2017-11-15 at 09:34 +0100, Paolo Bonzini wrote:
>> On 14/11/2017 20:40, David Hildenbrand wrote:
>>> I think we should check all get/put_fpu callers if they need
>>> preempt_disable().
>>>
>>> E.g. em_fxrstor() needs disabled preemption as we temporarily
>>> save + restore some host registers (via fxsave + fxrstor) under
>>> some circumstances that are not saved/restored when switching
>>> to/back from another process. We should double check.
>>
>> Rik may correct me, but I believe that you don't need
>> preempt_disable/enable because preempt notifiers do this for you.
>
> We no longer even need the preempt notifiers to save and restore the
> guest FPU state.
>
> The context switch code itself will save the FPU state from the
> registers, into current->thread.fpu.state, when the VCPU thread gets
> scheduled out.
>
> When the VCPU thread gets scheduled in, the scheduler will restore
> the guest FPU state from current->thread.fpu.state.
>
> At this point, vcpu->arch.guest_fpu may be OUT OF DATE.
>
> However, this is just fine, because we will save the guest FPU state
> into vcpu->arch.guest_fpu in kvm_put_guest_fpu, before we leave the
> KVM_RUN ioctl, and before we release the vcpu->mutex.
>
> In other words, by the time anybody else can examine the VCPU FPU
> state (after they obtain the vcpu->mutex), the vcpu->arch.guest_fpu
> area will contain the correct FPU state.

Okay, that answers my question, thanks!
2017-11-14 13:12 GMT+08:00 Rik van Riel <riel@redhat.com>:
> Currently, every time a VCPU is scheduled out, the host kernel will
> first save the guest FPU/xstate context, then load the qemu userspace
> FPU context, only to then immediately save the qemu userspace FPU
> context back to memory. When scheduling in a VCPU, the same extraneous
> FPU loads and saves are done.
>
> This could be avoided by moving from a model where the guest FPU is
> loaded and stored with preemption disabled, to a model where the
> qemu userspace FPU is swapped out for the guest FPU context for
> the duration of the KVM_RUN ioctl.
>
> This is done under the VCPU mutex, which is also taken when other
> tasks inspect the VCPU FPU context, so the code should already be
> safe for this change. That should come as no surprise, given that
> s390 already has this optimization.
>
> No performance changes were detected in quick ping-pong tests on
> my 4 socket system, which is expected since an FPU+xstate load is
> on the order of 0.1us, while ping-ponging between CPUs is on the
> order of 20us, and somewhat noisy.
>
> There may be other tests where performance changes are noticeable.

The kvm/queue has the below splatting:

[  650.866212] Bad FPU state detected at kvm_put_guest_fpu+0x7d/0x210 [kvm], reinitializing FPU registers.
[  650.866232] ------------[ cut here ]------------
[  650.866241] WARNING: CPU: 2 PID: 2583 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x5f/0x70
[  650.866473] libahci wmi hid pinctrl_sunrisepoint video pinctrl_intel
[  650.866496] CPU: 2 PID: 2583 Comm: qemu-system-x86 Not tainted 4.14.0+ #7
[  650.866500] Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
[  650.866503] task: ffff97a095a28000 task.stack: ffffa71c8585c000
[  650.866509] RIP: 0010:ex_handler_fprestore+0x5f/0x70
[  650.866512] RSP: 0018:ffffa71c8585fc28 EFLAGS: 00010282
[  650.866519] RAX: 000000000000005b RBX: ffffa71c8585fc68 RCX: 0000000000000006
[  650.866522] RDX: 0000000000000000 RSI: ffffffffb4d35333 RDI: 0000000000000282
[  650.866526] RBP: 000000000000000d R08: 00000000fddae359 R09: 0000000000000000
[  650.866529] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  650.866532] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
[  650.866536] FS:  00007f6f8f22c700(0000) GS:ffff97a09ca00000(0000) knlGS:0000000000000000
[  650.866540] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  650.866543] CR2: 00007f6f993f3000 CR3: 00000003d4aae005 CR4: 00000000003626e0
[  650.866547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  650.866550] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  650.866554] Call Trace:
[  650.866559]  fixup_exception+0x32/0x40
[  650.866567]  do_general_protection+0xa0/0x1b0
[  650.866574]  general_protection+0x22/0x30
[  650.866595] RIP: 0010:kvm_put_guest_fpu+0x7d/0x210 [kvm]
[  650.866599] RSP: 0018:ffffa71c8585fd18 EFLAGS: 00010246
[  650.866605] RAX: 00000000ffffffff RBX: ffff97a095a30000 RCX: 0000000000000001
[  650.866608] RDX: 00000000ffffffff RSI: 00000000f7d5d46a RDI: ffff97a095a30b80
[  650.866611] RBP: 0000000000000000 R08: 00000000fddae359 R09: ffff97a095a28968
[  650.866615] R10: 0000000000000000 R11: 00000000e8d39b88 R12: ffff97a095a31bc0
[  650.866618] R13: 0000000000000000 R14: ffff97a095a30000 R15: 000055824b58e280
[  650.866650]  ? kvm_put_guest_fpu+0x27/0x210 [kvm]
[  650.866670]  kvm_vcpu_reset+0x1be/0x250 [kvm]
[  650.866689]  kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
[  650.866707]  kvm_vm_ioctl+0x31a/0x820 [kvm]
[  650.866712]  ? __lock_acquire+0x809/0x1410
[  650.866721]  ? __lock_acquire+0x809/0x1410
[  650.866734]  do_vfs_ioctl+0x9f/0x6c0
[  650.866743]  ? __fget+0x108/0x1f0
[  650.866752]  SyS_ioctl+0x74/0x80
[  650.866757]  ? do_syscall_64+0xc4/0x3d0
[  650.866764]  do_syscall_64+0x8a/0x3d0
[  650.866769]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  650.866781]  entry_SYSCALL64_slow_path+0x25/0x25
[  650.866785] RIP: 0033:0x7f6f973a0f07
[  650.866788] RSP: 002b:00007f6f8f22b968 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  650.866795] RAX: ffffffffffffffda RBX: 000000000000ae41 RCX: 00007f6f973a0f07
[  650.866798] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 000000000000000b
[  650.866802] RBP: 0000000000000000 R08: 0000558248e26a40 R09: 000055824b58e280
[  650.866805] R10: 0000558249593f40 R11: 0000000000000246 R12: 000055824b55ec90
[  650.866808] R13: 00007ffd274d79ff R14: 00007f6f8f22c9c0 R15: 000055824b58e280
[  650.867014] ---[ end trace 2c5d6cfaba0ee1b3 ]---

Regards,
Wanpeng Li

>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 13 +++++++++++++
>  arch/x86/kvm/x86.c              | 29 ++++++++++++-----------------
>  2 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c73e493adf07..92e66685249e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
>  struct kvm_mmu_memory_cache mmu_page_cache;
>  struct kvm_mmu_memory_cache mmu_page_header_cache;
>
> + /*
> +  * QEMU userspace and the guest each have their own FPU state.
> +  * In vcpu_run, we switch between the user and guest FPU contexts.
> +  * While running a VCPU, the VCPU thread will have the guest FPU
> +  * context.
> +  *
> +  * Note that while the PKRU state lives inside the fpu registers,
> +  * it is switched out separately at VMENTER and VMEXIT time. The
> +  * "guest_fpu" state here contains the guest FPU context, with the
> +  * host PKRU bits.
> +  */
> + struct fpu user_fpu;
>  struct fpu guest_fpu;
> +
>  u64 xcr0;
>  u64 guest_supported_xcr0;
>  u32 guest_xstate_size;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 03869eb7fcd6..59912b20a830 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2917,7 +2917,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  pagefault_enable();
>  kvm_x86_ops->vcpu_put(vcpu);
> - kvm_put_guest_fpu(vcpu);
>  vcpu->arch.last_host_tsc = rdtsc();
>  }
>
> @@ -6908,7 +6907,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  preempt_disable();
>
>  kvm_x86_ops->prepare_guest_switch(vcpu);
> - kvm_load_guest_fpu(vcpu);
>
>  /*
>   * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
> @@ -7095,6 +7093,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>
>  vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>
> + kvm_load_guest_fpu(vcpu);
> +
>  for (;;) {
>  if (kvm_vcpu_running(vcpu)) {
>  r = vcpu_enter_guest(vcpu);
> @@ -7132,6 +7132,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  }
>  }
>
> + kvm_put_guest_fpu(vcpu);
> +
>  srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>
>  return r;
> @@ -7663,32 +7665,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
>  vcpu->arch.cr0 |= X86_CR0_ET;
>  }
>
> +/* Swap (qemu) user FPU context for the guest FPU context. */
>  void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> - if (vcpu->guest_fpu_loaded)
> -  return;
> -
> - /*
> -  * Restore all possible states in the guest,
> -  * and assume host would use all available bits.
> -  * Guest xcr0 would be loaded later.
> -  */
> - vcpu->guest_fpu_loaded = 1;
> - __kernel_fpu_begin();
> + preempt_disable();
> + copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
>  /* PKRU is separately restored in kvm_x86_ops->run. */
>  __copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
>  			~XFEATURE_MASK_PKRU);
> + preempt_enable();
>  trace_kvm_fpu(1);
>  }
>
> +/* When vcpu_run ends, restore user space FPU context. */
>  void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>  {
> - if (!vcpu->guest_fpu_loaded)
> -  return;
> -
> - vcpu->guest_fpu_loaded = 0;
> + preempt_disable();
>  copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
> - __kernel_fpu_end();
> + copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
> + preempt_enable();
>  ++vcpu->stat.fpu_reload;
>  trace_kvm_fpu(0);
>  }
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c73e493adf07..92e66685249e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -536,7 +536,20 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_page_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
+	/*
+	 * QEMU userspace and the guest each have their own FPU state.
+	 * In vcpu_run, we switch between the user and guest FPU contexts.
+	 * While running a VCPU, the VCPU thread will have the guest FPU
+	 * context.
+	 *
+	 * Note that while the PKRU state lives inside the fpu registers,
+	 * it is switched out separately at VMENTER and VMEXIT time. The
+	 * "guest_fpu" state here contains the guest FPU context, with the
+	 * host PKRU bits.
+	 */
+	struct fpu user_fpu;
 	struct fpu guest_fpu;
+
 	u64 xcr0;
 	u64 guest_supported_xcr0;
 	u32 guest_xstate_size;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb7fcd6..59912b20a830 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2917,7 +2917,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	pagefault_enable();
 	kvm_x86_ops->vcpu_put(vcpu);
-	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 }
 
@@ -6908,7 +6907,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	preempt_disable();
 
 	kvm_x86_ops->prepare_guest_switch(vcpu);
-	kvm_load_guest_fpu(vcpu);
 
 	/*
 	 * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
@@ -7095,6 +7093,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 
 	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 
+	kvm_load_guest_fpu(vcpu);
+
 	for (;;) {
 		if (kvm_vcpu_running(vcpu)) {
 			r = vcpu_enter_guest(vcpu);
@@ -7132,6 +7132,8 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	kvm_put_guest_fpu(vcpu);
+
 	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 
 	return r;
@@ -7663,32 +7665,25 @@ static void fx_init(struct kvm_vcpu *vcpu)
 	vcpu->arch.cr0 |= X86_CR0_ET;
 }
 
+/* Swap (qemu) user FPU context for the guest FPU context. */
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 {
-	if (vcpu->guest_fpu_loaded)
-		return;
-
-	/*
-	 * Restore all possible states in the guest,
-	 * and assume host would use all available bits.
-	 * Guest xcr0 would be loaded later.
-	 */
-	vcpu->guest_fpu_loaded = 1;
-	__kernel_fpu_begin();
+	preempt_disable();
+	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
 	/* PKRU is separately restored in kvm_x86_ops->run. */
 	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
				~XFEATURE_MASK_PKRU);
+	preempt_enable();
 	trace_kvm_fpu(1);
 }
 
+/* When vcpu_run ends, restore user space FPU context. */
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
-	if (!vcpu->guest_fpu_loaded)
-		return;
-
-	vcpu->guest_fpu_loaded = 0;
+	preempt_disable();
 	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
-	__kernel_fpu_end();
+	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
+	preempt_enable();
 	++vcpu->stat.fpu_reload;
 	trace_kvm_fpu(0);
 }
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.

This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.

This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.

No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.

There may be other tests where performance changes are noticeable.

Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/x86/include/asm/kvm_host.h | 13 +++++++++++++
 arch/x86/kvm/x86.c              | 29 ++++++++++++-----------------
 2 files changed, 25 insertions(+), 17 deletions(-)
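Distilled from the diff above, the patch moves the FPU swap points as follows (a summary sketch, not kernel code):

/*
 * Old model: guest/user FPU contexts swapped with preemption disabled,
 * on every scheduling event:
 *
 *	vcpu_enter_guest()  -> kvm_load_guest_fpu()
 *	kvm_arch_vcpu_put() -> kvm_put_guest_fpu()
 *
 * New model: swapped exactly once per KVM_RUN ioctl, under vcpu->mutex:
 *
 *	vcpu_run()
 *		kvm_load_guest_fpu();
 *		for (;;) { vcpu_enter_guest(); ... }
 *		kvm_put_guest_fpu();
 */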