Message ID | 20240515080607.919497-1-liangchen.linux@gmail.com (mailing list archive)
---|---
State | New
Series | KVM: x86: Prevent L0 VMM from modifying L2 VM registers via ioctl
On 5/15/24 10:06, Liang Chen wrote:
> In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> registers, corrupting the L2 VM. To avoid this error, registers should be
> treated as read-only when the vCPU is actively running an L2 VM.

No, this is intentional. The L0 hypervisor has full control on the CPU
registers, no matter if the VM is in guest mode or not.

Looking at the syzkaller report, the first thing to do is to remove the
threading, because vcpu ioctls are anyway serialized by the vcpu mutex.

Let's assume that the first 7 operations, i.e. all except KVM_RUN and
KVM_SET_REGS, execute in sequence. There is possible interleaving of
the two syz_kvm_setup_cpu$x86 calls, but because only the second sets
up nested virtualization, we can hope that we're lucky.[1]

To do so, change the main to something like

int main(int argc, char **argv)
{
	int i;

	syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
		/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
		/*offset=*/0ul);
	syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
		/*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
		/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
		/*offset=*/0ul);
	syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
		/*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
		/*offset=*/0ul);

	// r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000440), 0x0, 0x0)
	execute_call(0);
	// r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
	execute_call(1);
	// r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
	execute_call(2);
	// mmap(&(0x7f0000000000/0x3000)=nil, 0x3000, 0x2, 0x13, r2, 0x0)
	execute_call(3);
	// syz_kvm_setup_cpu$x86(r1, 0xffffffffffffffff, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000200)=[@text16={0x10, 0x0}], 0x1, 0x0, 0x0, 0x0)
	execute_call(4);
	// ioctl$KVM_REGISTER_COALESCED_MMIO(0xffffffffffffffff, 0x4010ae67, &(0x7f0000000000)={0x2})
	execute_call(5);
	// syz_kvm_setup_cpu$x86(0xffffffffffffffff, r2, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000340)=[@text64={0x40, 0x0}], 0x1, 0x50, 0x0, 0x0)
	execute_call(6);
	// ioctl$KVM_RUN(r2, 0xae80, 0x0) (rerun: 64)
	for (i = 0; i < atoi(argv[1]); i++)
		execute_call(7);
	// ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f00000000c0)={[0x7], 0x0, 0x60000})
	// interleaved with KVM_RUN
	execute_call(8);
	// one more KVM_RUN to show the bug
	execute_call(7);
	return 0;
}

This reproduces the fact that KVM_SET_REGS can sneak in between any
two KVM_RUN. You'll see that changing the argument may cause an entry
with invalid guest state (argument is 0 or >= 3 - that's ok) or the
WARN (argument == 1).

The attached cleaned-up reproducer shows that the problem is simply that
EFLAGS.VM is set in 64-bit mode. To fix it, it should be enough to do
a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
a few lines below.

Paolo

[1] In fact the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
the second passes vmfd==-1. Each of them does only half of the work.
The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly dummy because it
passes -1 as the file descriptor, but it happens to set
vcpu->run->request_interrupt_window! Which is important later.
> Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
> Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
> ---
>  arch/x86/kvm/x86.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 91478b769af0..30f63a7dd120 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>
>  int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>  {
> +	/*
> +	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
> +	 */
> +	if (is_guest_mode(vcpu))
> +		return -EACCES;
> +
>  	vcpu_load(vcpu);
>  	__set_regs(vcpu, regs);
>  	vcpu_put(vcpu);
On Wed, 15 May 2024 13:08:39 +0200 Paolo wrote:
> On 5/15/24 10:06, Liang Chen wrote:
> > In a nested VM environment, a vCPU can run either an L1 or L2 VM. [...]
>
> No, this is intentional. The L0 hypervisor has full control on the CPU
> registers, no matter if the VM is in guest mode or not.

We have a very similar issue and we already discussed it in these two
threads [1, 2]. Our proposed solution is to introduce a flag in kvm_run
to make userspace aware of exits with L2 state.

Julian

[1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m2eebd2ab30a86622aea3732112150851ac0768fe
[2] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u
On Wed, May 15, 2024 at 7:08 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 5/15/24 10:06, Liang Chen wrote:
> > In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> > L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> > the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> > registers, corrupting the L2 VM. To avoid this error, registers should be
> > treated as read-only when the vCPU is actively running an L2 VM.
>
> No, this is intentional. The L0 hypervisor has full control on the CPU
> registers, no matter if the VM is in guest mode or not.

I see. Thank you! The patch [1] will provide a convenient way for
userspace to determine if the vCPU is in guest mode, which should be
sufficient for userspace to avoid mistakenly setting L2 registers.

[1] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u

> Looking at the syzkaller report, the first thing to do is to remove the
> threading, because vcpu ioctls are anyway serialized by the vcpu mutex.
>
> Let's assume that the first 7 operations, i.e. all except KVM_RUN and
> KVM_SET_REGS, execute in sequence.
>
> [... reproducer snipped ...]
>
> This reproduces the fact that KVM_SET_REGS can sneak in between any
> two KVM_RUN. You'll see that changing the argument may cause an entry
> with invalid guest state (argument is 0 or >= 3 - that's ok) or the
> WARN (argument == 1).
>
> The attached cleaned up reproducer shows that the problem is simply that
> EFLAGS.VM is set in 64-bit mode. To fix it, it should be enough to do
> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
> a few lines below.

Yes, that was the situation we were trying to deal with. However, I am
not quite sure I fully understand the suggestion. From what I see,
KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm) firing in
__vmx_handle_exit can be a result of an invalid VMCS12 from L1 that
somehow escaped checking when trapped into L0 in nested_vmx_run. It is
not convenient to tell whether it was a result of the userspace
register-set ops we are discussing or an invalid VMCS12 supplied by L1.
Additionally, nested_vmx_vmexit warns when vmx->nested.nested_run_pending
is true, saying that "trying to cancel vmlaunch/vmresume is a bug".

Thanks,
Liang

> Paolo
>
> [1] in fact the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
> the second passes vmfd==-1. Each of them does only half of the work.
> The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly dummy because it
> passes -1 as the file descriptor, but it happens to set
> vcpu->run->request_interrupt_window! Which is important later.
On Wed, May 15, 2024 at 10:38 PM Julian Stecklina
<julian.stecklina@cyberus-technology.de> wrote:
>
> We have a very similar issue and we already discussed it in these two
> threads [1, 2]. Our proposed solution is to introduce a flag in kvm_run
> to make userspace aware of exits with L2 state.

Thank you for the information. That should be sufficient for userspace
to determine if the vCPU is in guest mode.

Thanks,
Liang

> Julian
>
> [1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m2eebd2ab30a86622aea3732112150851ac0768fe
> [2] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@cyberus-technology.de/T/#u
On 5/17/24 13:37, Liang Chen wrote:
> Yes, that was the situation we were trying to deal with. However, I am
> not quite sure if I fully understand the suggestion. [...] It is
> not convenient to tell whether it was a result of userspace
> register_set ops, as we are discussing, or an invalid VMCS12 supplied
> by L1.

Right, KVM assumes that it can delegate the "Checks on Guest Segment
Registers" to the processor if a field is copied straight from VMCS12 to
VMCS02. In this case the segments are not set up for virtual-8086 mode;
interestingly, the manual seems to say that EFLAGS.VM wins over "IA-32e
mode guest" = 1 for the purpose of checking guest state. AMD's manual
says that EFLAGS.VM is completely ignored in 64-bit mode instead.

I need to look more at the sequence of VMLAUNCH/RESUME, KVM_SET_MSR and
the failed vmentry to understand exactly what the right fix is.

Paolo

> Additionally, nested_vmx_vmexit warns when
> vmx->nested.nested_run_pending is true, saying that "trying to
> cancel vmlaunch/vmresume is a bug".
On Sat, May 18, 2024 at 12:51 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> Right, KVM assumes that it can delegate the "Checks on Guest Segment
> Registers" to the processor if a field is copied straight from VMCS12 to
> VMCS02. [...]
>
> I need to look more at the sequence of VMLAUNCH/RESUME, KVM_SET_MSR and
> the failed vmentry to understand exactly what the right fix is.

Yep. Thanks a lot!

Thanks,
Liang

> Paolo
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91478b769af0..30f63a7dd120 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+	/*
+	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
+	 */
+	if (is_guest_mode(vcpu))
+		return -EACCES;
+
 	vcpu_load(vcpu);
 	__set_regs(vcpu, regs);
 	vcpu_put(vcpu);
In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
registers, corrupting the L2 VM. To avoid this error, registers should be
treated as read-only when the vCPU is actively running an L2 VM.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)