Message ID | 20191217204041.10815-2-sean.j.christopherson@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: Dynamically size memslot arrays |
Dropping non-x86 folks...

This should be included in 5.5 if possible, even though the bug has existed
for over a decade. It's trivially easy for a malicious userspace to crash
KVM and hang the host. Depending on userspace VMM behavior, it may even be
possible to trigger from a guest.

On Tue, Dec 17, 2019 at 12:40:23PM -0800, Sean Christopherson wrote:
> Reallocate a rmap array and recalculate large page compatibility when
> moving an existing memslot to correctly handle the alignment properties
> of the new memslot. The number of rmap entries required at each level
> is dependent on the alignment of the memslot's base gfn with respect to
> that level, e.g. moving a large-page aligned memslot so that it becomes
> unaligned will increase the number of rmap entries needed at the now
> unaligned level.
>
> Not updating the rmap array is the most obvious bug, as KVM accesses
> garbage data beyond the end of the rmap. KVM interprets the bad data as
> pointers, leading to non-canonical #GPs, unexpected #PFs, etc...
>
>   general protection fault: 0000 [#1] SMP
>   CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:rmap_get_first+0x37/0x50 [kvm]
>   Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3
>   RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246
>   RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012
>   RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570
>   RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8
>   R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000
>   R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8
>   FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
>   CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0
>   Call Trace:
>    kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm]
>    __kvm_set_memory_region.part.64+0x559/0x960 [kvm]
>    kvm_set_memory_region+0x45/0x60 [kvm]
>    kvm_vm_ioctl+0x30f/0x920 [kvm]
>    do_vfs_ioctl+0xa1/0x620
>    ksys_ioctl+0x66/0x70
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x4c/0x170
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   RIP: 0033:0x7f7ed9911f47
>   Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
>   RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>   RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47
>   RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004
>   RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700
>   R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000
>   R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000
>   Modules linked in: kvm_intel kvm irqbypass
>   ---[ end trace 0c5f570b3358ca89 ]---
>
> The disallow_lpage tracking is more subtle. Failure to update results
> in KVM creating large pages when it shouldn't, either due to stale data
> or again due to indexing beyond the end of the metadata arrays, which
> can lead to memory corruption and/or leaking data to guest/userspace.
>
> Note, the arrays for the old memslot are freed by the unconditional call
> to kvm_free_memslot() in __kvm_set_memory_region().
>
> Fixes: 05da45583de9b ("KVM: MMU: large page support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8bb2fb1705ff..04d1bf89da0e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9703,6 +9703,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
>  {
>  	int i;
>
> +	/*
> +	 * Clear out the previous array pointers for the KVM_MR_MOVE case. The
> +	 * old arrays will be freed by __kvm_set_memory_region() if installing
> +	 * the new memslot is successful.
> +	 */
> +	memset(&slot->arch, 0, sizeof(slot->arch));
> +
>  	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
>  		struct kvm_lpage_info *linfo;
>  		unsigned long ugfn;
> @@ -9777,6 +9784,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  				const struct kvm_userspace_memory_region *mem,
>  				enum kvm_mr_change change)
>  {
> +	if (change == KVM_MR_MOVE)
> +		return kvm_arch_create_memslot(kvm, memslot,
> +					       mem->memory_size >> PAGE_SHIFT);
> +
>  	return 0;
>  }
>
> --
> 2.24.1
>
On Tue, Dec 17, 2019 at 12:40:23PM -0800, Sean Christopherson wrote:
> Reallocate a rmap array and recalculate large page compatibility when
> moving an existing memslot to correctly handle the alignment properties
> of the new memslot.

...

> Fixes: 05da45583de9b ("KVM: MMU: large page support")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

I think the error-prone part is:

	new = old = *slot;

Where IMHO it would be better if we only copy pointers explicitly when
under control, rather than blindly copying all the pointers in the
structure which even contains sub-structures.

For example, I see PPC has this:

  struct kvm_arch_memory_slot {
  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
  	unsigned long *rmap;
  #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
  };

I started to look into HV code of it a bit, then I see...

  - kvm_arch_create_memslot(kvmppc_core_create_memslot_hv) init slot->arch.rmap,
  - kvm_arch_flush_shadow_memslot(kvmppc_core_flush_memslot_hv) didn't free it,
  - kvm_arch_prepare_memory_region(kvmppc_core_prepare_memory_region_hv) is nop.

So does it have a similar issue?
On Tue, Dec 17, 2019 at 04:56:40PM -0500, Peter Xu wrote:
> On Tue, Dec 17, 2019 at 12:40:23PM -0800, Sean Christopherson wrote:
> > Reallocate a rmap array and recalculate large page compatibility when
> > moving an existing memslot to correctly handle the alignment properties
> > of the new memslot. The number of rmap entries required at each level
> > is dependent on the alignment of the memslot's base gfn with respect to
> > that level, e.g. moving a large-page aligned memslot so that it becomes
> > unaligned will increase the number of rmap entries needed at the now
> > unaligned level.

...

> I think the error-prone part is:
>
> 	new = old = *slot;

Lol, IMO the error-prone part is the entire memslot mess :-)

> Where IMHO it would be better if we only copy pointers explicitly when
> under control, rather than blindly copying all the pointers in the
> structure which even contains sub-structures.

Long term, yes, that would be ideal. For the immediate bug fix, reworking
common KVM and other arch code would be unnecessarily dangerous and would
make it more difficult to backport the fix to stable branches.

I actually briefly considered moving the slot->arch handling into arch
code as part of the bug fix, but the memslot code has many subtle
dependencies, e.g. PPC and x86 rely on common KVM code to copy slot->arch
when flags are being changed.

I'll happily clean up the slot->arch code once this series is merged.
There is refactoring in this series that will make it a lot easier to do
additional clean up.

> For example, I see PPC has this:
>
>   struct kvm_arch_memory_slot {
>   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>   	unsigned long *rmap;
>   #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
>   };
>
> I started to look into HV code of it a bit, then I see...
>
>   - kvm_arch_create_memslot(kvmppc_core_create_memslot_hv) init slot->arch.rmap,
>   - kvm_arch_flush_shadow_memslot(kvmppc_core_flush_memslot_hv) didn't free it,
>   - kvm_arch_prepare_memory_region(kvmppc_core_prepare_memory_region_hv) is nop.
>
> So does it have a similar issue?

No, KVM doesn't allow a memslot's size to be changed, and PPC's rmap
allocation is directly tied to the size of the memslot. The x86 bug exists
because the size of its metadata arrays varies based on the alignment of
the base gfn.
On Tue, Dec 17, 2019 at 02:20:59PM -0800, Sean Christopherson wrote:
> > For example, I see PPC has this:
> >
> >   struct kvm_arch_memory_slot {
> >   #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >   	unsigned long *rmap;
> >   #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> >   };
> >
> > I started to look into HV code of it a bit, then I see...
> >
> >   - kvm_arch_create_memslot(kvmppc_core_create_memslot_hv) init slot->arch.rmap,
> >   - kvm_arch_flush_shadow_memslot(kvmppc_core_flush_memslot_hv) didn't free it,
> >   - kvm_arch_prepare_memory_region(kvmppc_core_prepare_memory_region_hv) is nop.
> >
> > So does it have a similar issue?
>
> No, KVM doesn't allow a memslot's size to be changed, and PPC's rmap
> allocation is directly tied to the size of the memslot. The x86 bug exists
> because the size of its metadata arrays varies based on the alignment of
> the base gfn.

Yes, I was actually thinking that those rmaps, rather than the size, would
be invalid after the move. But I think kvm_arch_flush_shadow_memslot()
will flush all of them anyways... So yes it seems fine.

Thanks,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8bb2fb1705ff..04d1bf89da0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9703,6 +9703,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 {
 	int i;
 
+	/*
+	 * Clear out the previous array pointers for the KVM_MR_MOVE case. The
+	 * old arrays will be freed by __kvm_set_memory_region() if installing
+	 * the new memslot is successful.
+	 */
+	memset(&slot->arch, 0, sizeof(slot->arch));
+
 	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
 		struct kvm_lpage_info *linfo;
 		unsigned long ugfn;
@@ -9777,6 +9784,10 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				const struct kvm_userspace_memory_region *mem,
 				enum kvm_mr_change change)
 {
+	if (change == KVM_MR_MOVE)
+		return kvm_arch_create_memslot(kvm, memslot,
+					       mem->memory_size >> PAGE_SHIFT);
+
 	return 0;
 }
Reallocate a rmap array and recalculate large page compatibility when
moving an existing memslot to correctly handle the alignment properties
of the new memslot. The number of rmap entries required at each level
is dependent on the alignment of the memslot's base gfn with respect to
that level, e.g. moving a large-page aligned memslot so that it becomes
unaligned will increase the number of rmap entries needed at the now
unaligned level.

Not updating the rmap array is the most obvious bug, as KVM accesses
garbage data beyond the end of the rmap. KVM interprets the bad data as
pointers, leading to non-canonical #GPs, unexpected #PFs, etc...

  general protection fault: 0000 [#1] SMP
  CPU: 0 PID: 1909 Comm: move_memory_reg Not tainted 5.4.0-rc7+ #139
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:rmap_get_first+0x37/0x50 [kvm]
  Code: <48> 8b 3b 48 85 ff 74 ec e8 6c f4 ff ff 85 c0 74 e3 48 89 d8 5b c3
  RSP: 0018:ffffc9000021bbc8 EFLAGS: 00010246
  RAX: ffff00617461642e RBX: ffff00617461642e RCX: 0000000000000012
  RDX: ffff88827400f568 RSI: ffffc9000021bbe0 RDI: ffff88827400f570
  RBP: 0010000000000000 R08: ffffc9000021bd00 R09: ffffc9000021bda8
  R10: ffffc9000021bc48 R11: 0000000000000000 R12: 0030000000000000
  R13: 0000000000000000 R14: ffff88827427d700 R15: ffffc9000021bce8
  FS: 00007f7eda014700(0000) GS:ffff888277a00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f7ed9216ff8 CR3: 0000000274391003 CR4: 0000000000162eb0
  Call Trace:
   kvm_mmu_slot_set_dirty+0xa1/0x150 [kvm]
   __kvm_set_memory_region.part.64+0x559/0x960 [kvm]
   kvm_set_memory_region+0x45/0x60 [kvm]
   kvm_vm_ioctl+0x30f/0x920 [kvm]
   do_vfs_ioctl+0xa1/0x620
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4c/0x170
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f7ed9911f47
  Code: <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 21 6f 2c 00 f7 d8 64 89 01 48
  RSP: 002b:00007ffc00937498 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000001ab0010 RCX: 00007f7ed9911f47
  RDX: 0000000001ab1350 RSI: 000000004020ae46 RDI: 0000000000000004
  RBP: 000000000000000a R08: 0000000000000000 R09: 00007f7ed9214700
  R10: 00007f7ed92149d0 R11: 0000000000000246 R12: 00000000bffff000
  R13: 0000000000000003 R14: 00007f7ed9215000 R15: 0000000000000000
  Modules linked in: kvm_intel kvm irqbypass
  ---[ end trace 0c5f570b3358ca89 ]---

The disallow_lpage tracking is more subtle. Failure to update results
in KVM creating large pages when it shouldn't, either due to stale data
or again due to indexing beyond the end of the metadata arrays, which
can lead to memory corruption and/or leaking data to guest/userspace.

Note, the arrays for the old memslot are freed by the unconditional call
to kvm_free_memslot() in __kvm_set_memory_region().

Fixes: 05da45583de9b ("KVM: MMU: large page support")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/x86.c | 11 +++++++++++
 1 file changed, 11 insertions(+)