Message ID | fdddad066c88c6cd8f2090f11e32e54f7d5c6178.1721092739.git.isaku.yamahata@intel.com (mailing list archive)
---|---
State | New, archived
Series | KVM: x86/mmu: Allow per VM kvm_mmu_max_gfn()
On Mon, 2024-07-15 at 18:21 -0700, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Prepare for TDX support by making kvm_mmu_max_gfn() configurable. Have
> this preparation also be useful by non-TDX changes to improve correctness
> associated with the combination of 4-level EPT and MAXPA > 48. The issue
> is analyzed at [1].

It looks like the part where you were going to describe the TDX and other
testing got left off, can you elaborate?
On Mon, Jul 15, 2024, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Prepare for TDX support by making kvm_mmu_max_gfn() configurable. Have
> this preparation also be useful by non-TDX changes to improve correctness
> associated with the combination of 4-level EPT and MAXPA > 48. The issue
> is analyzed at [1].
>
> Like other confidential computing technologies, TDX has the concept of
> private and shared memory. For TDX, the private and shared mappings of the
> same GFN are mapped at separate GPAs, with a configurable GPA bit selecting
> between private and shared aliases of the same GFN. This means that
> operations working with the concept of a maximum GFN should only go up to
> this configurable GPA bit instead of the existing behavior based on
> shadow_phys_bits. Other TDX changes will handle applying the operation
> across both GFN aliases.

This is going to be confusing.  A TDX guest will be all but guaranteed to
generate an EPT violation on a gfn > kvm->mmu_max_gfn.

> Using the existing kvm_mmu_max_gfn() based on shadow_phys_bits would cause
> functional problems for TDX. Specifically, because the function is used to
> restrict the range where memslots can be created. For TDX, if a memslot is
> created at a GFN that includes the bit that selects between private/shared,
> it would confuse the logic that zaps both aliases. It would end up zapping
> only the higher alias and missing the lower one. In this case, freed pages
> could remain mapped in the TDX guest.

So why don't we simply disallow creating aliased memslots?

> Since this GPA bit is configurable per-VM, make kvm_mmu_max_gfn() per-VM by
> having it take a struct kvm, and keep the max GFN as a member on that
> struct. Future TDX changes will set this member based on the configurable
> position of the private/shared bit.
>
> Besides functional issues, it is generally more correct and easier to

No, it most definitely is not more correct.  There is absolutely no issue
zapping SPTEs that should never exist.  In fact, restricting the zapping path
is far more likely to *cause* correctness issues, e.g. see

  524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
  86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR")

Creating SPTEs is a different matter, but unless I'm missing something, the
only path that _needs_ to be updated is kvm_arch_prepare_memory_region(), to
disallow aliased memslots.

I assume TDX will strip the shared bit from fault->gfn, and shove it back in
when creating MMIO SPTEs in the shared EPT page tables.

Why can't we simply do:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 842a3a4cdfe9..5ea428dde891 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4395,6 +4395,9 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	struct kvm_memory_slot *slot = fault->slot;
 	int ret;
 
+	if (WARN_ON_ONCE(kvm_is_gfn_alias(vcpu->kvm, fault->gfn)))
+		return -EFAULT;
+
 	/*
 	 * Note that the mmu_invalidate_seq also serves to detect a concurrent
 	 * change in attributes.  is_page_fault_stale() will detect an
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 994743266480..091da7607025 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12979,6 +12979,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
 			return -EINVAL;
 
+		if (kvm_is_gfn_alias(kvm, new->base_gfn + new->npages - 1))
+			return -EINVAL;
+
 		return kvm_alloc_memslot_metadata(kvm, new);
 	}
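Note: kvm_is_gfn_alias() is not defined anywhere in this thread. A minimal
sketch of what such a helper could look like, assuming a hypothetical per-VM
mask (called kvm->arch.gfn_alias_bits here) that a TDX VM would set to the
shared/private select bit and that stays zero for every other VM:

static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
{
	/* True if the gfn carries any alias bits, e.g. the TDX shared bit. */
	return gfn & kvm->arch.gfn_alias_bits;
}

With such a mask, the memslot check above rejects slots that reach into an
alias range, and the fault-path WARN documents that fault->gfn is expected to
arrive with the alias bits already stripped.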
On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
>
> No, it most definitely is not more correct.  There is absolutely no issue
> zapping SPTEs that should never exist.  In fact, restricting the zapping
> path is far more likely to *cause* correctness issues, e.g. see
>
>   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
>   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR")

The type of correctness this was going for was around the new treatment of
GFNs not having the shared/alias bit. As you know it can get confusing which
variables have these bits and which have them stripped. Part of the recent
MMU work involved making sure at least GFNs didn't contain the shared bit.

Then in the TDP MMU, where it iterates from start to end, for example:

static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t start, gfn_t end, bool can_yield, bool flush)
{
	struct tdp_iter iter;

	end = min(end, tdp_mmu_max_gfn_exclusive());

	lockdep_assert_held_write(&kvm->mmu_lock);

	rcu_read_lock();

	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
		if (can_yield &&
		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
			flush = false;
			continue;
		}
	...

The math gets a bit confused. For the private/mirror root, start will begin at
0, and iterate to a range that includes the shared bit. No functional problem
because we are zapping things that shouldn't be set. But it means the 'gfn' has
the bit position of the shared bit set. Although it is not acting as the shared
bit in this case, just an out of range bit.

For the shared/direct root, it will iterate from (shared_bit | 0) to
(shared_bit | max_gfn). So where the mirror root iterates through the whole
range, the shared case skips it in the current code anyway.

And then there is the fact that the code already takes care here to avoid
zapping over ranges that exceed the max gfn.

So it's a bit asymmetric, and just overall weird. We are weighing functional
correctness risk against known code weirdness. My inclination was to try to
reduce the places where the TDX MMU relies on paths that happen to work for
subtle reasons, at the cost of the VM field. I think I still lean that way,
but it's not a strong opinion.

>
> Creating SPTEs is a different matter, but unless I'm missing something, the
> only path that _needs_ to be updated is kvm_arch_prepare_memory_region(), to
> disallow aliased memslots.
>
> I assume TDX will strip the shared bit from fault->gfn, and shove it back in
> when creating MMIO SPTEs in the shared EPT page tables.
>
> Why can't we simply do:

Seems ok. It's not dissimilar to what this patch does for memslots.

For the fault path check, it is catching something that 86931ff7207b ("KVM:
x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR") says not
to worry about. But maybe that calculus changes with the TDX module in the
loop. We should also be ok functionally without either.
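To make the two walks concrete, a worked illustration (not from the thread),
assuming the shared/private select bit is GPA bit 47 and host.MAXPHYADDR is 52:

/* The shared bit and the host limit expressed as gfn values. */
#define EX_SHARED_GFN_BIT	BIT_ULL(47 - PAGE_SHIFT)
#define EX_HOST_MAX_GFN		(BIT_ULL(52 - PAGE_SHIFT) - 1)

/*
 * Zap ranges in tdp_mmu_zap_leafs() as described above:
 *
 *   mirror (private) root: [0, EX_HOST_MAX_GFN]
 *       Also walks gfns that have EX_SHARED_GFN_BIT set, where that bit is
 *       merely out of range for private mappings, not a shared bit.
 *
 *   direct (shared) root:  [EX_SHARED_GFN_BIT | 0, EX_SHARED_GFN_BIT | max_gfn]
 *       Never walks [0, EX_SHARED_GFN_BIT), so the two roots end up being
 *       handled asymmetrically.
 */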
On Tue, Jul 16, 2024, Rick P Edgecombe wrote:
> On Tue, 2024-07-16 at 12:31 -0700, Sean Christopherson wrote:
> >
> > No, it most definitely is not more correct.  There is absolutely no issue
> > zapping SPTEs that should never exist.  In fact, restricting the zapping
> > path is far more likely to *cause* correctness issues, e.g. see
> >
> >   524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
> >   86931ff7207b ("KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR")
>
> The type of correctness this was going for was around the new treatment of
> GFNs not having the shared/alias bit. As you know it can get confusing which
> variables have these bits and which have them stripped. Part of the recent
> MMU work involved making sure at least GFNs didn't contain the shared bit.

That's fine, and doesn't conflict with what I'm asserting, which is that it's
a-ok to process SPTEs in a range of GFNs that, barring KVM bugs, should never
have "valid" entries.

> Then in the TDP MMU, where it iterates from start to end, for example:
>
> static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
> 			      gfn_t start, gfn_t end, bool can_yield, bool flush)
> {
> 	struct tdp_iter iter;
>
> 	end = min(end, tdp_mmu_max_gfn_exclusive());
>
> 	lockdep_assert_held_write(&kvm->mmu_lock);
>
> 	rcu_read_lock();
>
> 	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
> 		if (can_yield &&
> 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
> 			flush = false;
> 			continue;
> 		}
> 	...
>
> The math gets a bit confused. For the private/mirror root, start will begin at
> 0, and iterate to a range that includes the shared bit. No functional problem
> because we are zapping things that shouldn't be set. But it means the 'gfn' has
> the bit position of the shared bit set. Although it is not acting as the shared
> bit in this case, just an out of range bit.
>
> For the shared/direct root, it will iterate from (shared_bit | 0) to
> (shared_bit | max_gfn). So where the mirror root iterates through the whole
> range, the shared case skips it in the current code anyway.
>
> And then there is the fact that the code already takes care here to avoid
> zapping over ranges that exceed the max gfn.
>
> So it's a bit asymmetric, and just overall weird. We are weighing functional
> correctness risk against known code weirdness.

IMO, you're looking at it with too much of a TDX lens and not thinking about all
the people that don't care about TDX, which is the majority of KVM developers.

The unaliased GFN is definitely not the max GFN of all the VM's MMUs, since the
shared EPT must be able to process GPAs with bits set above the "max" GFN.  And
to me, _that's_ far more weird than saying that "S-EPT MMUs never set the shared
bit, and shared EPT MMUs never clear the shared bit".  I'm guessing the S-EPT
support ORs in the shared bit, but it's still a GFN.

If you were adding a per-MMU max GFN, then I'd buy that it legitimately is the
max GFN, but why not have a full GFN range for the MMU?  E.g.

static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
			       bool shared, int zap_level)
{
	struct tdp_iter iter;

	gfn_t end = tdp_mmu_max_gfn_exclusive(root);
	gfn_t start = tdp_mmu_min_gfn_inclusive(root);

and then have the helpers incorporate the S-EPT vs. EPT information.  That gets
us optimized, precise zapping without needing to muddy the waters by tracking a
per-VM "max" GFN that is only kinda sorta the max if you close your eyes and
don't think too hard about the shared MMU usage.

> My inclination was to try to reduce the places where the TDX MMU relies on
> paths that happen to work for subtle reasons, at the cost of the VM field.

But it doesn't happen to work for subtle reasons.  It works because it has to
work.  Processing !PRESENT SPTEs should always work, regardless of why KVM can
guarantee there are no SPTEs in a given GFN range.
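A rough sketch of per-root helpers along these lines (is_mirror_root() and
kvm_shared_gfn_bit() are placeholder names, and a kvm parameter is added for
the shared-bit lookup), assuming the shared bit is a single gfn bit and the
private range ends just below it:

static gfn_t tdp_mmu_min_gfn_inclusive(struct kvm *kvm, struct kvm_mmu_page *root)
{
	/* Only the shared EPT root maps gfns with the shared bit set. */
	return is_mirror_root(root) ? 0 : kvm_shared_gfn_bit(kvm);
}

static gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm, struct kvm_mmu_page *root)
{
	gfn_t shared_bit = kvm_shared_gfn_bit(kvm);

	/* Non-TDX VMs have no shared bit; keep the host.MAXPHYADDR bound. */
	if (!shared_bit)
		return kvm_mmu_max_gfn() + 1;

	/* S-EPT stops below the shared bit, shared EPT covers the alias range above it. */
	return is_mirror_root(root) ? shared_bit : shared_bit << 1;
}

Each root then gets a precise [start, end) without introducing a per-VM notion
of a maximum gfn.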
On Tue, 2024-07-16 at 16:08 -0700, Sean Christopherson wrote:
>
> IMO, you're looking at it with too much of a TDX lens and not thinking about
> all the people that don't care about TDX, which is the majority of KVM
> developers.

Well, I don't mean to. Actually trying to do the opposite. I don't see how a
per-VM max gfn makes it easier or harder to look at things from the normal VM
perspective. I guess you'd have to figure out what set kvm->mmu_max_gfn.

>
> The unaliased GFN is definitely not the max GFN of all the VM's MMUs, since
> the shared EPT must be able to process GPAs with bits set above the "max"
> GFN.  And to me, _that's_ far more weird than saying that "S-EPT MMUs never
> set the shared bit, and shared EPT MMUs never clear the shared bit".  I'm
> guessing the S-EPT support ORs in the shared bit, but it's still a GFN.

In the current solution GFNs always have the shared bit stripped. It gets
added back within the TDP MMU iterator. So from the regular VM reader's
perspective, the TDP MMU behaves pretty much like normal for TDX direct
(shared) roots, that is, memslot GFNs get mapped at the same TDP GFNs. The
iterator handles writing the PTE for the memslot GFN at the special position
in the EPT tables (gfn | shared_bit).

>
> If you were adding a per-MMU max GFN, then I'd buy that it legitimately is
> the max GFN, but why not have a full GFN range for the MMU?  E.g.
>
> static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> 			       bool shared, int zap_level)
> {
> 	struct tdp_iter iter;
>
> 	gfn_t end = tdp_mmu_max_gfn_exclusive(root);
> 	gfn_t start = tdp_mmu_min_gfn_inclusive(root);
>
> and then have the helpers incorporate the S-EPT vs. EPT information.  That
> gets us optimized, precise zapping without needing to muddy the waters by
> tracking a per-VM "max" GFN that is only kinda sorta the max if you close
> your eyes and don't think too hard about the shared MMU usage.

This is similar to what we had before with kvm_gfn_for_root(). Have you looked
at a recent version? Here, this recent one has some discussion on it:
https://lore.kernel.org/kvm/20240530210714.364118-7-rick.p.edgecombe@intel.com/#t

Pushing the shared bit re-application into the TDP MMU iterator pretty nicely
hides a lot of the TDX specific behavior. It wraps up the TDX bits so that
other developers working on the various operations *don't* have to think about
it.

> > My inclination was to try to reduce the places where the TDX MMU relies on
> > paths that happen to work for subtle reasons, at the cost of the VM field.
>
> But it doesn't happen to work for subtle reasons.  It works because it has
> to work.  Processing !PRESENT SPTEs should always work, regardless of why
> KVM can guarantee there are no SPTEs in a given GFN range.

The current code (without this patch) doesn't zap the whole range that is
mappable by the EPT for the shared root case. It covers the whole range in the
private case, and only the range that is expected to be mapped in the shared
case. So is this good or bad? I think you are saying bad.

With this patch, it also doesn't zap the whole range mappable by the EPT, but
does it in a consistent way. I think you are saying it's important to zap the
whole range mappable by the EPT. With TDX it is fuzzy, because the direct root
range without the shared bit, or beyond, shouldn't be accessible via that
root, so it is not really mapped. We would still be zapping the whole
accessible paging structure, even if not the whole part documented in the
normal EPT.

But if we want to zap the whole structure, I see the positives. I'd still
rather not go back to a kvm_gfn_for_root()-like solution for the TDP MMU
support if possible. Is that alright with you? What about something like this
in conjunction with your earlier diff?

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 8bc19e65371c..bf19d5a45f87 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -129,6 +129,11 @@ struct tdp_iter {
 		iter.valid && iter.gfn < end;				\
 		tdp_iter_next(&iter))
 
+#define for_each_tdp_pte_min_level_all(iter, root, min_level)		\
+	for (tdp_iter_start(&iter, root, min_level, 0, 0);		\
+	     iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive();	\
+	     tdp_iter_next(&iter))
+
 #define for_each_tdp_pte(iter, kvm, root, start, end)			\
 	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c3ce43ce7b3f..6f425652e396 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -889,10 +889,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_exclusive();
-	gfn_t start = 0;
-
-	for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
+	for_each_tdp_pte_min_level_all(iter, root, zap_level) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
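For context on the shared bit being re-applied inside the iterator, a
conceptual sketch only (not the code from the linked series;
kvm_gfn_root_bits() is a placeholder that would return the shared bit for a
direct root and 0 otherwise):

static gfn_t tdp_iter_raw_gfn(struct kvm_mmu_page *root, gfn_t unaliased_gfn)
{
	/*
	 * The rest of KVM works with memslot gfns that have the shared bit
	 * stripped; only the iterator ORs the root-specific bits back in
	 * when indexing the actual EPT page tables.
	 */
	return unaliased_gfn | kvm_gfn_root_bits(root);
}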
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9bb2e164c523..2e1d330206a4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1519,6 +1519,8 @@ struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	gfn_t mmu_max_gfn;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index dc80e72e4848..670ec63b0434 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -63,7 +63,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
  */
 extern u8 __read_mostly shadow_phys_bits;
 
-static inline gfn_t kvm_mmu_max_gfn(void)
+static inline gfn_t __kvm_mmu_max_gfn(void)
 {
 	/*
 	 * Note that this uses the host MAXPHYADDR, not the guest's.
@@ -81,6 +81,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
 	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
 }
 
+static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
+{
+	return kvm->arch.mmu_max_gfn;
+}
+
 static inline u8 kvm_get_shadow_phys_bits(void)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e0e9963066f..b8f36779898a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3302,7 +3302,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	 * only if L1's MAXPHYADDR is inaccurate with respect to the
 	 * hardware's).
 	 */
-	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+	if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
 		return RET_PF_EMULATE;
 
 	return RET_PF_CONTINUE;
@@ -6483,6 +6483,7 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 
 void kvm_mmu_init_vm(struct kvm *kvm)
 {
+	kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ff27e1eadd54..a903300d3869 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -733,7 +733,7 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
-static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
 {
 	/*
 	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
@@ -741,7 +741,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
 	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
 	 * the slow emulation path every time.
 	 */
-	return kvm_mmu_max_gfn() + 1;
+	return kvm_mmu_max_gfn(kvm) + 1;
 }
 
 static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -749,7 +749,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_exclusive();
+	gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
 	gfn_t start = 0;
 
 	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
@@ -850,7 +850,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	end = min(end, tdp_mmu_max_gfn_exclusive());
+	end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a6968eadd418..f8bfbcb818e3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12939,7 +12939,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 			return -EINVAL;
 
 		if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
-			if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+			if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
 				return -EINVAL;
 
 			return kvm_alloc_memslot_metadata(kvm, new);
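The hunks above only initialize kvm->arch.mmu_max_gfn to the host-derived
default. A hedged sketch of how TDX code might later lower it, assuming the
configured position of the shared/private select GPA bit is known
(tdx_adjust_mmu_max_gfn() and shared_gpa_bit are illustrative, not from any
posted patch):

static void tdx_adjust_mmu_max_gfn(struct kvm *kvm, int shared_gpa_bit)
{
	/* The private range ends just below the shared/private select bit. */
	gfn_t max_private_gfn = gpa_to_gfn(BIT_ULL(shared_gpa_bit)) - 1;

	kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn, max_private_gfn);
}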