
KVM: x86: Prevent L0 VMM from modifying L2 VM registers via ioctl

Message ID: 20240515080607.919497-1-liangchen.linux@gmail.com
State: New
Series: KVM: x86: Prevent L0 VMM from modifying L2 VM registers via ioctl

Commit Message

Liang Chen May 15, 2024, 8:06 a.m. UTC
In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
registers, corrupting the L2 VM. To avoid this error, registers should be
treated as read-only when the vCPU is actively running an L2 VM.
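
For illustration only, the hazardous sequence looks roughly like this
from userspace (hypothetical fds; the rflags value comes from the
syzkaller reproducer and sets EFLAGS.VM):

	struct kvm_regs regs;

	/* KVM_RUN exits while the vCPU still has L2 state loaded... */
	ioctl(vcpu_fd, KVM_RUN, 0);

	/* ...so these calls read and write L2's registers, not L1's. */
	ioctl(vcpu_fd, KVM_GET_REGS, &regs);
	regs.rflags = 0x60000;	/* EFLAGS.VM|AC: invalid in 64-bit mode */
	ioctl(vcpu_fd, KVM_SET_REGS, &regs);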

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Paolo Bonzini May 15, 2024, 11:08 a.m. UTC | #1
On 5/15/24 10:06, Liang Chen wrote:
> In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> registers, corrupting the L2 VM. To avoid this error, registers should be
> treated as read-only when the vCPU is actively running an L2 VM.

No, this is intentional.  The L0 hypervisor has full control over the CPU
registers, whether or not the vCPU is in guest mode.

Looking at the syzkaller report, the first thing to do is to remove the
threading, because vcpu ioctls are serialized by the vcpu mutex anyway.

Let's assume that the first 7 operations, i.e. all except KVM_RUN and
KVM_SET_REGS, execute in sequence.  There is possible interleaving of
the two syz_kvm_setup_cpu$x86 calls, but because only the second sets
up nested virtualization, we can hope that we're lucky.[1]

To do so, change the reproducer's main() to something like

/* execute_call() and the call table come from the syzkaller reproducer. */
int main(int argc, char **argv)
{
   int i;
   syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
           /*offset=*/0ul);
   syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
           /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
           /*offset=*/0ul);
   syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
           /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
           /*offset=*/0ul);

   // r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000440), 0x0, 0x0)
   execute_call(0);
   // r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
   execute_call(1);
   // r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
   execute_call(2);
   // mmap(&(0x7f0000000000/0x3000)=nil, 0x3000, 0x2, 0x13, r2, 0x0)
   execute_call(3);
   // syz_kvm_setup_cpu$x86(r1, 0xffffffffffffffff, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000200)=[@text16={0x10, 0x0}], 0x1, 0x0, 0x0, 0x0)
   execute_call(4);
   // ioctl$KVM_REGISTER_COALESCED_MMIO(0xffffffffffffffff, 0x4010ae67, &(0x7f0000000000)={0x2})
   execute_call(5);
   // syz_kvm_setup_cpu$x86(0xffffffffffffffff, r2, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000340)=[@text64={0x40, 0x0}], 0x1, 0x50, 0x0, 0x0)
   execute_call(6);
   // ioctl$KVM_RUN(r2, 0xae80, 0x0) (rerun: 64)
   for (i = 0; i < atoi(argv[1]); i++)
     execute_call(7);
   // ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f00000000c0)={[0x7], 0x0, 0x60000})
   // interleaved with KVM_RUN
   execute_call(8);
   // one more KVM_RUN to show the bug
   execute_call(7);
   return 0;
}

This reproduces the fact that KVM_SET_REGS can sneak in between any
two KVM_RUN.  You'll see that changing the argument may cause an entry with
invalid guest state (argument is 0 or >= 3 - that's ok) or the WARN
(argument == 1).

The attached cleaned-up reproducer shows that the problem is simply that
EFLAGS.VM is set in 64-bit mode.  To fix it, it should be enough to do
a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
the code a few lines below.
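
For context, the block a few lines below in __vmx_handle_exit looks
roughly like this (abridged, comments trimmed; details vary by kernel
version):

	/* KVM should never reach this point with a pending nested VM-Enter. */
	if (KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm))
		return -EIO;

	if (is_guest_mode(vcpu)) {
		/*
		 * Synthesize a triple fault if L2 state is invalid.  Nested
		 * VM-Enter normally rejects invalid guest state, but those
		 * checks are skipped when state is stuffed via RSM or
		 * KVM_SET_NESTED_STATE.
		 */
		if (unlikely(vmx->emulation_required)) {
			nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
			return 1;
		}

		if (nested_vmx_reflect_vmexit(vcpu))
			return 1;
	}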

Paolo

[1] In fact, the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
the second passes vmfd==-1.  Each of them does only half of the work.
The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly a dummy because it
passes -1 as the file descriptor, but it happens to set
vcpu->run->request_interrupt_window, which becomes important later!


> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
> ---
>   arch/x86/kvm/x86.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 91478b769af0..30f63a7dd120 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>   
>   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>   {
> +	/*
> +	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
> +	 */
> +	if (is_guest_mode(vcpu))
> +		return -EACCES;
> +
>   	vcpu_load(vcpu);
>   	__set_regs(vcpu, regs);
>   	vcpu_put(vcpu);
Julian Stecklina May 15, 2024, 2:37 p.m. UTC | #2
On Wed, 15 May 2024 13:08:39 +0200 Paolo wrote:
> On 5/15/24 10:06, Liang Chen wrote:
>> In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
>> L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
>> the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
>> registers, corrupting the L2 VM. To avoid this error, registers should be
>> treated as read-only when the vCPU is actively running an L2 VM.
>
> No, this is intentional.  The L0 hypervisor has full control over the CPU
> registers, whether or not the vCPU is in guest mode.

We have a very similar issue, which we already discussed in these two
threads [1, 2]. Our proposed solution is to introduce a flag in
kvm_run to make userspace aware of exits with L2 state.
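
A minimal sketch of how userspace might consume such a flag (flag name
and value follow the proposal in [2] and could still change):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Proposed bit in kvm_run->flags; bits 0/1 are SMM and BUS_LOCK. */
	#ifndef KVM_RUN_X86_GUEST_MODE
	#define KVM_RUN_X86_GUEST_MODE (1 << 2)
	#endif

	/* 'run' is the struct kvm_run mmap'ed from the vCPU fd. */
	ioctl(vcpu_fd, KVM_RUN, 0);
	if (run->flags & KVM_RUN_X86_GUEST_MODE) {
		/* Exit happened with L2 state loaded; KVM_SET_REGS would hit L2. */
	}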

Julian


[1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m2eebd2ab30a86622aea3732112150851ac0768fe
[2] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u
Liang Chen May 17, 2024, 11:37 a.m. UTC | #3
On Wed, May 15, 2024 at 7:08 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/15/24 10:06, Liang Chen wrote:
> > In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> > L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> > the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> > registers, corrupting the L2 VM. To avoid this error, registers should be
> > treated as read-only when the vCPU is actively running an L2 VM.
>
> No, this is intentional.  The L0 hypervisor has full control over the CPU
> registers, whether or not the vCPU is in guest mode.
>

I see. Thank you! The patch [1] will provide a convenient way for
userspace to determine whether the vCPU is in guest mode, which should
be sufficient to avoid mistakenly setting L2 registers.

[1] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u

> Looking at the syzkaller report, the first thing to do is to remove the
> threading, because vcpu ioctls are serialized by the vcpu mutex anyway.
>
> Let's assume that the first 7 operations, i.e. all except KVM_RUN and
> KVM_SET_REGS, execute in sequence.  There is possible interleaving of
> the two syz_kvm_setup_cpu$x86 calls, but because only the second sets
> up nested virtualization, we can hope that we're lucky.[1]
>
> To do so, change the reproducer's main() to something like
>
> /* execute_call() and the call table come from the syzkaller reproducer. */
> int main(int argc, char **argv)
> {
>    int i;
>    syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>            /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>            /*offset=*/0ul);
>    syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
>            /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
>            /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>            /*offset=*/0ul);
>    syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>            /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>            /*offset=*/0ul);
>
>    // r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000440), 0x0, 0x0)
>    execute_call(0);
>    // r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>    execute_call(1);
>    // r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
>    execute_call(2);
>    // mmap(&(0x7f0000000000/0x3000)=nil, 0x3000, 0x2, 0x13, r2, 0x0)
>    execute_call(3);
>    // syz_kvm_setup_cpu$x86(r1, 0xffffffffffffffff, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000200)=[@text16={0x10, 0x0}], 0x1, 0x0, 0x0, 0x0)
>    execute_call(4);
>    // ioctl$KVM_REGISTER_COALESCED_MMIO(0xffffffffffffffff, 0x4010ae67, &(0x7f0000000000)={0x2})
>    execute_call(5);
>    // syz_kvm_setup_cpu$x86(0xffffffffffffffff, r2, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000340)=[@text64={0x40, 0x0}], 0x1, 0x50, 0x0, 0x0)
>    execute_call(6);
>    // ioctl$KVM_RUN(r2, 0xae80, 0x0) (rerun: 64)
>    for (i = 0; i < atoi(argv[1]); i++)
>      execute_call(7);
>    // ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f00000000c0)={[0x7], 0x0, 0x60000})
>    // interleaved with KVM_RUN
>    execute_call(8);
>    // one more KVM_RUN to show the bug
>    execute_call(7);
>    return 0;
> }
>
> This reproduces the fact that KVM_SET_REGS can sneak in between any
> two KVM_RUN.  You'll see that changing the argument may cause an entry with
> invalid guest state (argument is 0 or >= 3 - that's ok) or the WARN
> (argument == 1).
>
> The attached cleaned-up reproducer shows that the problem is simply that
> EFLAGS.VM is set in 64-bit mode.  To fix it, it should be enough to do
> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
> the code a few lines below.
>

Yes, that was the situation we were trying to deal with. However, I am
not quite sure if I fully understand the suggestion, "To fix it, it
should be enough to do a nested_vmx_vmexit(vcpu,
EXIT_REASON_TRIPLE_FAULT, 0, 0); just like the code a few lines
below."  From what I see, "KVM_BUG_ON(vmx->nested.nested_run_pending,
vcpu->kvm) == true" in __vmx_handle_exit can be the result of an
invalid VMCS12 from L1 that somehow escaped checking when we trapped
into L0 in nested_vmx_run. It is not easy to tell whether it was
caused by the userspace register_set ops we are discussing or by an
invalid VMCS12 supplied by L1. Additionally, nested_vmx_vmexit warns
when vmx->nested.nested_run_pending is true, saying that "trying to
cancel vmlaunch/vmresume is a bug".
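
For reference, that warning is this check near the top of
nested_vmx_vmexit (abridged):

	/* trying to cancel vmlaunch/vmresume is a bug */
	WARN_ON_ONCE(vmx->nested.nested_run_pending);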

Thanks,
Liang

> Paolo
>
> [1] In fact, the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
> the second passes vmfd==-1.  Each of them does only half of the work.
> The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly a dummy because it
> passes -1 as the file descriptor, but it happens to set
> vcpu->run->request_interrupt_window, which becomes important later!
>
>
> > Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> > Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
> > ---
> >   arch/x86/kvm/x86.c | 6 ++++++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 91478b769af0..30f63a7dd120 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >
> >   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >   {
> > +     /*
> > +      * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
> > +      */
> > +     if (is_guest_mode(vcpu))
> > +             return -EACCES;
> > +
> >       vcpu_load(vcpu);
> >       __set_regs(vcpu, regs);
> >       vcpu_put(vcpu);
Liang Chen May 17, 2024, 11:40 a.m. UTC | #4
On Wed, May 15, 2024 at 10:38 PM Julian Stecklina
<julian.stecklina@cyberus-technology.de> wrote:
>
> On Wed, 15 May 2024 13:08:39 +0200 Paolo wrote:
> > On 5/15/24 10:06, Liang Chen wrote:
> >> In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> >> L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> >> the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> >> registers, corrupting the L2 VM. To avoid this error, registers should be
> >> treated as read-only when the vCPU is actively running an L2 VM.
> >
> > No, this is intentional.  The L0 hypervisor has full control over the CPU
> > registers, whether or not the vCPU is in guest mode.
>
> We have a very similar issue, which we already discussed in these two
> threads [1, 2]. Our proposed solution is to introduce a flag in
> kvm_run to make userspace aware of exits with L2 state.
>
Thank you for the information. That should be sufficient for userspace
to determine if the vCPU is in guest mode.

Thanks,
Liang

> Julian
>
>
> [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m2eebd2ab30a86622aea3732112150851ac0768fe
> [2] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u
>
Paolo Bonzini May 17, 2024, 4:51 p.m. UTC | #5
On 5/17/24 13:37, Liang Chen wrote:
>>
>> The attached cleaned-up reproducer shows that the problem is simply that
>> EFLAGS.VM is set in 64-bit mode.  To fix it, it should be enough to do
>> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
>> the code a few lines below.
>>
> Yes, that was the situation we were trying to deal with. However, I am
> not quite sure if I fully understand the suggestion, "To fix it, it
> should be enough to do a nested_vmx_vmexit(vcpu,
> EXIT_REASON_TRIPLE_FAULT, 0, 0); just like the code a few lines
> below."  From what I see, "KVM_BUG_ON(vmx->nested.nested_run_pending,
> vcpu->kvm) == true" in __vmx_handle_exit can be the result of an
> invalid VMCS12 from L1 that somehow escaped checking when we trapped
> into L0 in nested_vmx_run. It is not easy to tell whether it was
> caused by the userspace register_set ops we are discussing or by an
> invalid VMCS12 supplied by L1.

Right, KVM assumes that it can delegate the "Checks on Guest Segment 
Registers" to the processor if a field is copied straight from VMCS12 to 
VMCS02.  In this case the segments are not set up for virtual-8086 mode;
interestingly, Intel's manual seems to say that EFLAGS.VM wins even when
"IA-32e mode guest" is 1 for the purpose of checking guest state, while
AMD's manual says that EFLAGS.VM is completely ignored in 64-bit mode.

I need to look more at the sequence of VMLAUNCH/VMRESUME, KVM_SET_REGS and
the failed vmentry to understand exactly what the right fix is.

Paolo

> Additionally, nested_vmx_vmexit warns when
> vmx->nested.nested_run_pending is true, saying that "trying to
> cancel vmlaunch/vmresume is a bug".
Liang Chen May 19, 2024, 5:17 a.m. UTC | #6
On Sat, May 18, 2024 at 12:51 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/17/24 13:37, Liang Chen wrote:
> >>
> >> The attached cleaned-up reproducer shows that the problem is simply that
> >> EFLAGS.VM is set in 64-bit mode.  To fix it, it should be enough to do
> >> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
> >> the code a few lines below.
> >>
> > Yes, that was the situation we were trying to deal with. However, I am
> > not quite sure if I fully understand the suggestion, "To fix it, it
> > should be enough to do a nested_vmx_vmexit(vcpu,
> > EXIT_REASON_TRIPLE_FAULT, 0, 0); just like the code a few lines
> > below."  From what I see, "KVM_BUG_ON(vmx->nested.nested_run_pending,
> > vcpu->kvm) == true" in __vmx_handle_exit can be the result of an
> > invalid VMCS12 from L1 that somehow escaped checking when we trapped
> > into L0 in nested_vmx_run. It is not easy to tell whether it was
> > caused by the userspace register_set ops we are discussing or by an
> > invalid VMCS12 supplied by L1.
>
> Right, KVM assumes that it can delegate the "Checks on Guest Segment
> Registers" to the processor if a field is copied straight from VMCS12 to
> VMCS02.  In this case the segments are not set up for virtual-8086 mode;
> interestingly, Intel's manual seems to say that EFLAGS.VM wins even when
> "IA-32e mode guest" is 1 for the purpose of checking guest state, while
> AMD's manual says that EFLAGS.VM is completely ignored in 64-bit mode.
>
> I need to look more at the sequence of VMLAUNCH/VMRESUME, KVM_SET_REGS and
> the failed vmentry to understand exactly what the right fix is.

Yep. Thanks a lot!

Thanks,
Liang
>
> Paolo
>
> > Additionally, nested_vmx_vmexit warns when
> > vmx->nested.nested_run_pending is true, saying that "trying to
> > cancel vmlaunch/vmresume is a bug".
>

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91478b769af0..30f63a7dd120 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11527,6 +11527,12 @@  static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+	/*
+	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
+	 */
+	if (is_guest_mode(vcpu))
+		return -EACCES;
+
 	vcpu_load(vcpu);
 	__set_regs(vcpu, regs);
 	vcpu_put(vcpu);