Message ID | 20240703021043.13881-1-yan.y.zhao@intel.com |
---|---|
State | New, archived |
Series | Introduce a quirk to control memslot zap behavior |
On Wed, Jul 03, 2024, Yan Zhao wrote:
> Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
> KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
> VMs. Make sure KVM behaves as if the quirk is always disabled for
> non-KVM_X86_DEFAULT_VM VMs.

...

> Suggested-by: Kai Huang <kai.huang@intel.com>
> Suggested-by: Sean Christopherson <seanjc@google.com>

Bad Sean, bad.

> +/*
> + * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
> + *
> + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> + * case scenario we'll have unused shadow pages lying around until they
> + * are recycled due to age or when the VM is destroyed.
> + */
> +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> +{
> +	struct kvm_gfn_range range = {
> +		.slot = slot,
> +		.start = slot->base_gfn,
> +		.end = slot->base_gfn + slot->npages,
> +		.may_block = true,
> +	};
> +	bool flush = false;
> +
> +	write_lock(&kvm->mmu_lock);
> +
> +	if (kvm_memslots_have_rmaps(kvm))
> +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);

This, and Paolo's merged variant, break shadow paging. As was tried in commit
4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting
for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
to the memslot, for all intents and purposes. Deleting the memslot without removing
all relevant shadow pages results in NULL pointer derefs when tearing down the VM.

Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.

[*] https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com

Rather than trying to get this functional for shadow paging (which includes nested
TDP), I think we should scrap the quirk idea and simply make this the behavior for
S-EPT and nothing else.

  BUG: kernel NULL pointer dereference, address: 00000000000000b0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 6085f43067 P4D 608c080067 PUD 608c081067 PMD 0
  Oops: Oops: 0000 [#1] SMP NOPTI
  CPU: 79 UID: 0 PID: 187063 Comm: set_memory_regi Tainted: G W 6.11.0-smp--24867312d167-cpl #395
  Tainted: [W]=WARN
  Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
  RIP: 0010:__kvm_mmu_prepare_zap_page+0x3a9/0x7b0 [kvm]
  Code: <48> 8b 8e b0 00 00 00 48 8b 96 e0 00 00 00 48 c1 e9 09 48 29 c8 8b
  RSP: 0018:ff314a25b19f7c28 EFLAGS: 00010212
  Call Trace:
   <TASK>
   kvm_arch_flush_shadow_all+0x7a/0xf0 [kvm]
   kvm_mmu_notifier_release+0x6c/0xb0 [kvm]
   mmu_notifier_unregister+0x85/0x140
   kvm_put_kvm+0x263/0x410 [kvm]
   kvm_vm_release+0x21/0x30 [kvm]
   __fput+0x8d/0x2c0
   __se_sys_close+0x71/0xc0
   do_syscall_64+0x83/0x160
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> On Wed, Jul 03, 2024, Yan Zhao wrote:
> > Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
> > KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
> > VMs. Make sure KVM behaves as if the quirk is always disabled for
> > non-KVM_X86_DEFAULT_VM VMs.
>
> ...
>
> > Suggested-by: Kai Huang <kai.huang@intel.com>
> > Suggested-by: Sean Christopherson <seanjc@google.com>
>
> Bad Sean, bad.
>
> > +/*
> > + * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
> > + *
> > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
> > + * case scenario we'll have unused shadow pages lying around until they
> > + * are recycled due to age or when the VM is destroyed.
> > + */
> > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > +{
> > +	struct kvm_gfn_range range = {
> > +		.slot = slot,
> > +		.start = slot->base_gfn,
> > +		.end = slot->base_gfn + slot->npages,
> > +		.may_block = true,
> > +	};
> > +	bool flush = false;
> > +
> > +	write_lock(&kvm->mmu_lock);
> > +
> > +	if (kvm_memslots_have_rmaps(kvm))
> > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
>
> This, and Paolo's merged variant, break shadow paging. As was tried in commit
> 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting
> for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> to the memslot, for all intents and purposes. Deleting the memslot without removing
> all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
>
> Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
>
> [*] https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
>
> Rather than trying to get this functional for shadow paging (which includes nested
> TDP), I think we should scrap the quirk idea and simply make this the behavior for
> S-EPT and nothing else.
Ok. Thanks for identifying this error. Will change the code this way.

BTW: some updated findings regarding the previous bug with Nvidia GPU
assignment:
I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
reproducible when only leaf entries of the memslot are zapped.
(No more detailed info due to limited time to debug.)

>
> BUG: kernel NULL pointer dereference, address: 00000000000000b0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 6085f43067 P4D 608c080067 PUD 608c081067 PMD 0
> Oops: Oops: 0000 [#1] SMP NOPTI
> CPU: 79 UID: 0 PID: 187063 Comm: set_memory_regi Tainted: G W 6.11.0-smp--24867312d167-cpl #395
> Tainted: [W]=WARN
> Hardware name: Google Astoria/astoria, BIOS 0.20240617.0-0 06/17/2024
> RIP: 0010:__kvm_mmu_prepare_zap_page+0x3a9/0x7b0 [kvm]
> Code: <48> 8b 8e b0 00 00 00 48 8b 96 e0 00 00 00 48 c1 e9 09 48 29 c8 8b
> RSP: 0018:ff314a25b19f7c28 EFLAGS: 00010212
> Call Trace:
>  <TASK>
>  kvm_arch_flush_shadow_all+0x7a/0xf0 [kvm]
>  kvm_mmu_notifier_release+0x6c/0xb0 [kvm]
>  mmu_notifier_unregister+0x85/0x140
>  kvm_put_kvm+0x263/0x410 [kvm]
>  kvm_vm_release+0x21/0x30 [kvm]
>  __fput+0x8d/0x2c0
>  __se_sys_close+0x71/0xc0
>  do_syscall_64+0x83/0x160
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
On Tue, Sep 24, 2024, Yan Zhao wrote:
> On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> > > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > +{
> > > +	struct kvm_gfn_range range = {
> > > +		.slot = slot,
> > > +		.start = slot->base_gfn,
> > > +		.end = slot->base_gfn + slot->npages,
> > > +		.may_block = true,
> > > +	};
> > > +	bool flush = false;
> > > +
> > > +	write_lock(&kvm->mmu_lock);
> > > +
> > > +	if (kvm_memslots_have_rmaps(kvm))
> > > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> >
> > This, and Paolo's merged variant, break shadow paging. As was tried in commit
> > 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> > all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting
> > for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> > to the memslot, for all intents and purposes. Deleting the memslot without removing
> > all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
> >
> > Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> >
> > [*] https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
> >
> > Rather than trying to get this functional for shadow paging (which includes nested
> > TDP), I think we should scrap the quirk idea and simply make this the behavior for
> > S-EPT and nothing else.
> Ok. Thanks for identifying this error. Will change the code this way.

For now, I think a full revert of the entire series makes sense. Irrespective of
this bug, I don't think KVM should commit to specific implementation behavior,
i.e. KVM shouldn't explicitly say only leaf SPTEs are zapped. The quirk docs
should instead say that if the quirk is disabled, KVM will only guarantee that
the deleted memslot will be inaccessible. That way, KVM can still do a fast zap
when it makes sense, e.g. for large memslots, do a complete fast zap, and for
small memslots, do a targeted zap of the TDP MMU but a fast zap of any shadow
MMUs.

> BTW: some updated findings regarding the previous bug with Nvidia GPU
> assignment:
> I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> reproducible when only leaf entries of the memslot are zapped.
> (No more detailed info due to limited time to debug.)

Heh, I've given up hope on ever finding a root cause for that issue.
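[To make the hybrid zap Sean describes concrete, here is one possible shape for
it. This is a sketch only, not code from this thread: the helper name
kvm_mmu_zap_memslot_hybrid() is invented, and it simplifies the idea by falling
back to the fast invalidate-all zap (which also covers the TDP MMU) whenever
shadow MMUs, i.e. rmaps, are in use; only the helpers already used by the
reverted patch appear below.]

static void kvm_mmu_zap_memslot_hybrid(struct kvm *kvm,
				       struct kvm_memory_slot *slot)
{
	struct kvm_gfn_range range = {
		.slot = slot,
		.start = slot->base_gfn,
		.end = slot->base_gfn + slot->npages,
		.may_block = true,
	};

	/*
	 * Per-memslot zapping of shadow pages is what broke in commit
	 * 4e103134b862, so whenever shadow MMUs are in use, fall back to
	 * the fast invalidate-all zap.  A size-based threshold could
	 * additionally gate the fast zap for very large memslots.
	 */
	if (kvm_memslots_have_rmaps(kvm)) {
		kvm_mmu_zap_all_fast(kvm);
		return;
	}

	/* TDP MMU only: a targeted zap of the memslot's range is safe. */
	write_lock(&kvm->mmu_lock);
	if (tdp_mmu_enabled &&
	    kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
		kvm_flush_remote_tlbs_memslot(kvm, slot);
	write_unlock(&kvm->mmu_lock);
}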
On Tue, Sep 24, 2024 at 12:45:52AM -0700, Sean Christopherson wrote:
> On Tue, Sep 24, 2024, Yan Zhao wrote:
> > On Mon, Sep 23, 2024 at 11:37:14AM -0700, Sean Christopherson wrote:
> > > > +static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > > +{
> > > > +	struct kvm_gfn_range range = {
> > > > +		.slot = slot,
> > > > +		.start = slot->base_gfn,
> > > > +		.end = slot->base_gfn + slot->npages,
> > > > +		.may_block = true,
> > > > +	};
> > > > +	bool flush = false;
> > > > +
> > > > +	write_lock(&kvm->mmu_lock);
> > > > +
> > > > +	if (kvm_memslots_have_rmaps(kvm))
> > > > +		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
> > >
> > > This, and Paolo's merged variant, break shadow paging. As was tried in commit
> > > 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"),
> > > all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting
> > > for a shadow page is tied to the memslot, i.e. the shadow page holds a reference
> > > to the memslot, for all intents and purposes. Deleting the memslot without removing
> > > all relevant shadow pages results in NULL pointer derefs when tearing down the VM.
> > >
> > > Note, that commit is/was buggy, and I suspect my follow-up attempt[*] was as well.
> > >
> > > [*] https://lore.kernel.org/all/20190820200318.GA15808@linux.intel.com
> > >
> > > Rather than trying to get this functional for shadow paging (which includes nested
> > > TDP), I think we should scrap the quirk idea and simply make this the behavior for
> > > S-EPT and nothing else.
> > Ok. Thanks for identifying this error. Will change the code this way.
>
> For now, I think a full revert of the entire series makes sense. Irrespective of
> this bug, I don't think KVM should commit to specific implementation behavior,
> i.e. KVM shouldn't explicitly say only leaf SPTEs are zapped. The quirk docs
> should instead say that if the quirk is disabled, KVM will only guarantee that
> the deleted memslot will be inaccessible. That way, KVM can still do a fast zap
> when it makes sense, e.g. for large memslots, do a complete fast zap, and for
> small memslots, do a targeted zap of the TDP MMU but a fast zap of any shadow
> MMUs.
For TDX, could we do as below after the full revert of this series?

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
{
	kvm_mmu_zap_all_fast(kvm);                   ==> this will skip mirror root
	kvm_mmu_zap_memslot_mirror_leafs(kvm, slot); ==> zap memslot leaf entries in mirror root
}

>
> > BTW: some updated findings regarding the previous bug with Nvidia GPU
> > assignment:
> > I found that after v5.19-rc1+, even with nx_huge_pages=N, the bug is not
> > reproducible when only leaf entries of the memslot are zapped.
> > (No more detailed info due to limited time to debug.)
>
> Heh, I've given up hope on ever finding a root cause for that issue.
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a71d91978d9e..7c1248142a5a 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7997,6 +7997,14 @@ KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS By default, KVM emulates MONITOR/MWAIT (if
                                     guest CPUID on writes to MISC_ENABLE if
                                     KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT is
                                     disabled.
+
+KVM_X86_QUIRK_SLOT_ZAP_ALL          By default, KVM invalidates all SPTEs in
+                                    fast way for memslot deletion when VM type
+                                    is KVM_X86_DEFAULT_VM.
+                                    When this quirk is disabled or when VM type
+                                    is other than KVM_X86_DEFAULT_VM, KVM zaps
+                                    only leaf SPTEs that are within the range of
+                                    the memslot being deleted.
 =================================== ============================================
 
 7.32 KVM_CAP_MAX_VCPU_ID
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9bb2e164c523..a40577911744 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2327,7 +2327,8 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages);
 			      KVM_X86_QUIRK_OUT_7E_INC_RIP |		\
 			      KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT |	\
 			      KVM_X86_QUIRK_FIX_HYPERCALL_INSN |	\
-			      KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS)
+			      KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS |	\
+			      KVM_X86_QUIRK_SLOT_ZAP_ALL)
 
 /*
  * KVM previously used a u32 field in kvm_run to indicate the hypercall was
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 988b5204d636..6a399859c42d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -438,6 +438,7 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT	(1 << 4)
 #define KVM_X86_QUIRK_FIX_HYPERCALL_INSN	(1 << 5)
 #define KVM_X86_QUIRK_MWAIT_NEVER_UD_FAULTS	(1 << 6)
+#define KVM_X86_QUIRK_SLOT_ZAP_ALL		(1 << 7)
 
 #define KVM_STATE_NESTED_FORMAT_VMX	0
 #define KVM_STATE_NESTED_FORMAT_SVM	1
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1432deb75cbb..be3a82a32295 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6915,10 +6915,50 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
 	kvm_mmu_zap_all(kvm);
 }
 
+/*
+ * Zapping leaf SPTEs with memslot range when a memslot is moved/deleted.
+ *
+ * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+ * case scenario we'll have unused shadow pages lying around until they
+ * are recycled due to age or when the VM is destroyed.
+ */
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	struct kvm_gfn_range range = {
+		.slot = slot,
+		.start = slot->base_gfn,
+		.end = slot->base_gfn + slot->npages,
+		.may_block = true,
+	};
+	bool flush = false;
+
+	write_lock(&kvm->mmu_lock);
+
+	if (kvm_memslots_have_rmaps(kvm))
+		flush = kvm_handle_gfn_range(kvm, &range, kvm_zap_rmap);
+
+	if (tdp_mmu_enabled)
+		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush);
+
+	if (flush)
+		kvm_flush_remote_tlbs_memslot(kvm, slot);
+
+	write_unlock(&kvm->mmu_lock);
+}
+
+static inline bool kvm_memslot_flush_zap_all(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_DEFAULT_VM &&
+	       kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL);
+}
+
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
 				   struct kvm_memory_slot *slot)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (kvm_memslot_flush_zap_all(kvm))
+		kvm_mmu_zap_all_fast(kvm);
+	else
+		kvm_mmu_zap_memslot_leafs(kvm, slot);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
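[As a usage note: with this patch applied (it was later reverted per the
discussion above), userspace opts out of the quirk via KVM_ENABLE_CAP with
KVM_CAP_DISABLE_QUIRKS2 on the VM file descriptor. A minimal sketch, assuming
vm_fd came from KVM_CREATE_VM and omitting error handling:]

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Disable KVM_X86_QUIRK_SLOT_ZAP_ALL on an existing VM fd so that
 * memslot deletion zaps only the slot's leaf SPTEs rather than
 * invalidating everything.  Returns the ioctl() result.
 */
static int disable_slot_zap_all(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DISABLE_QUIRKS2,
		.args = { KVM_X86_QUIRK_SLOT_ZAP_ALL },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

[A robust caller would first query KVM_CHECK_EXTENSION(KVM_CAP_DISABLE_QUIRKS2),
which reports the mask of quirks KVM supports disabling, before attempting to
disable this one.]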