Message ID | 20230612042559.375660-5-michael.roth@amd.com (mailing list archive)
---|---
State | New, archived
Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support
On Sun, Jun 11, 2023 at 11:25:12PM -0500, Michael Roth <michael.roth@amd.com> wrote:
> This will be used to determine whether or not an #NPF should be serviced
> using a normal page vs. a guarded/gmem one.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 +++++++
>  arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b3bd24f2a390..c26f76641121 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1445,6 +1445,13 @@ struct kvm_arch {
>  	 */
>  #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
>  	struct kvm_mmu_memory_cache split_desc_cache;
> +
> +	/*
> +	 * When set, used to determine whether a fault should be treated as
> +	 * private in the context of protected VMs which use a separate gmem
> +	 * pool to back private guest pages.
> +	 */
> +	u64 mmu_private_fault_mask;
>  };
>
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 780b91e1da9f..9b9e75aa43f4 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -252,6 +252,39 @@ struct kvm_page_fault {
>
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>
> +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
> +{
> +	struct kvm_memory_slot *slot;
> +	bool private_fault = false;
> +	gfn_t gfn = gpa_to_gfn(gpa);
> +
> +	slot = gfn_to_memslot(kvm, gfn);
> +	if (!slot) {
> +		pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
> +		goto out;
> +	}
> +
> +	if (!kvm_slot_can_be_private(slot)) {
> +		pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
> +		goto out;
> +	}
> +
> +	if (kvm->arch.mmu_private_fault_mask) {
> +		private_fault = !!(err & kvm->arch.mmu_private_fault_mask);
> +		goto out;
> +	}

What's the convention of err? Can we abstract it by introducing a new bit,
PFERR_PRIVATE_MASK? The caller would set it based on the arch-specific
value, and the logic would become:

  .is_private = err & PFERR_PRIVATE_MASK;

> +
> +	/*
> +	 * Handling below is for UPM self-tests and guests that treat userspace
> +	 * as the authority on whether a fault should be private or not.
> +	 */
> +	private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);

This code path is sad: one extra slot lookup and one xarray lookup.
Without the mmu lock held, the result can be changed by another vcpu.
Let's find a better way.

> +
> +out:
> +	pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
> +	return private_fault;
> +}
> +
>  /*
>   * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
>   * and of course kvm_mmu_do_page_fault().
> @@ -301,7 +334,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
>  		.req_level = PG_LEVEL_4K,
>  		.goal_level = PG_LEVEL_4K,
> -		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
> +		.is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
>  	};
>  	int r;
>
> --
> 2.25.1
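The suggestion above can be illustrated with a small standalone sketch. This is not kernel code: the bit positions (`PFERR_PRIVATE_BIT`, `SNP_HW_PRIVATE_MASK`) and helper names are made up for illustration. The point is the division of labor: the arch-specific caller folds whatever its hardware reports into one common software-defined bit, so the common MMU code needs no per-platform mask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Common software-defined bit; the position chosen here is an assumption. */
#define PFERR_PRIVATE_BIT   48
#define PFERR_PRIVATE_MASK  (1ULL << PFERR_PRIVATE_BIT)

/* Hypothetical position of the hardware private-fault bit in the #NPF
 * error code; the real SNP encoding may differ. */
#define SNP_HW_PRIVATE_MASK (1ULL << 31)

/* Arch-specific caller: translate the hardware encoding into the
 * common software-defined bit before handing err to the MMU. */
static uint64_t snp_build_error_code(uint64_t hw_err)
{
	uint64_t err = hw_err;

	if (hw_err & SNP_HW_PRIVATE_MASK)
		err |= PFERR_PRIVATE_MASK;
	return err;
}

/* Common MMU side: a plain mask test, no per-platform knowledge. */
static bool fault_is_private(uint64_t err)
{
	return err & PFERR_PRIVATE_MASK;
}
```

With this split, `.is_private = err & PFERR_PRIVATE_MASK` works unchanged whether the error code came from an SNP #NPF or from some other platform's converted encoding.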
On Sun, Jun 11, 2023 at 11:25:12PM -0500, Michael Roth wrote:
> This will be used to determine whether or not an #NPF should be serviced
> using a normal page vs. a guarded/gmem one.
>
> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 +++++++
>  arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b3bd24f2a390..c26f76641121 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1445,6 +1445,13 @@ struct kvm_arch {
>  	 */
>  #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
>  	struct kvm_mmu_memory_cache split_desc_cache;
> +
> +	/*
> +	 * When set, used to determine whether a fault should be treated as
	   ^^^^^^^^

And when not set? Invalid?

I guess so, judging by the code below.
On Wed, Jun 14, 2023 at 09:47:09AM -0700, Isaku Yamahata wrote: > On Sun, Jun 11, 2023 at 11:25:12PM -0500, > Michael Roth <michael.roth@amd.com> wrote: > > > This will be used to determine whether or not an #NPF should be serviced > > using a normal page vs. a guarded/gmem one. > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > --- > > arch/x86/include/asm/kvm_host.h | 7 +++++++ > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++- > > 2 files changed, 41 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index b3bd24f2a390..c26f76641121 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -1445,6 +1445,13 @@ struct kvm_arch { > > */ > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) > > struct kvm_mmu_memory_cache split_desc_cache; > > + > > + /* > > + * When set, used to determine whether a fault should be treated as > > + * private in the context of protected VMs which use a separate gmem > > + * pool to back private guest pages. 
> > + */ > > + u64 mmu_private_fault_mask; > > }; > > > > struct kvm_vm_stat { > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > > index 780b91e1da9f..9b9e75aa43f4 100644 > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > @@ -252,6 +252,39 @@ struct kvm_page_fault { > > > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err) > > +{ > > + struct kvm_memory_slot *slot; > > + bool private_fault = false; > > + gfn_t gfn = gpa_to_gfn(gpa); > > + > > + slot = gfn_to_memslot(kvm, gfn); > > + if (!slot) { > > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn); > > + goto out; > > + } > > + > > + if (!kvm_slot_can_be_private(slot)) { > > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn); > > + goto out; > > + } > > + > > + if (kvm->arch.mmu_private_fault_mask) { > > + private_fault = !!(err & kvm->arch.mmu_private_fault_mask); > > + goto out; > > + } > > What's the convention of err? Can we abstract it by introducing a new bit > PFERR_PRIVATE_MASK? The caller sets it based on arch specific value. > the logic will be > .is_private = err & PFERR_PRIVATE_MASK; I'm not sure I understand the question. 'err' is just the page fault flags, and arch.mmu_private_fault_mask is something that can be set on a per-platform basis when running in a mode where shared/private access is recorded in the page fault flags during a #NPF. I'm not sure how we'd keep the handling cross-platform by moving to a macro, since TDX uses a different bit, and we'd want to be able to build a SNP+TDX kernel that could run on either type of hardware. Are you suggesting to reverse that and have err be set in a platform-specific way and then use a common PFERR_PRIVATE_MASK that's software-defined and consistent across platforms? 
That could work, but existing handling seems to use page fault flags as-is, keeping the hardware-set values, rather than modifying them to pass additional metadata, so it seems like it might make things more confusing to make an exception to that here. Or are there other cases where it's done that way? > > > > + > > + /* > > + * Handling below is for UPM self-tests and guests that treat userspace > > + * as the authority on whether a fault should be private or not. > > + */ > > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT); > > This code path is sad. One extra slot lookup and xarray look up. > Without mmu lock, the result can change by other vcpu. > Let's find a better way. The intention was to rely on fault->mmu_seq to determine if a KVM_SET_MEMORY_ATTRIBUTES update came in after .private_fault was set so that fault handling could be retried, but that doesn't happen until kvm_faultin_pfn() which is *after* this is logged. So yes, I think there is a race here, and the approach you took in your Misc. series of keeping the kvm_mem_is_private() check inside kvm_faultin_pfn() is more efficient/correct. If we can figure out a way to handle checking the fault flags in a way that works for both TDX/SNP (and KVM self-test use-case) we can consolidate around that. -Mike > > > + > > +out: > > + pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault); > > + return private_fault; > > +} > > + > > /* > > * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(), > > * and of course kvm_mmu_do_page_fault(). 
> > @@ -301,7 +334,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, > > .max_level = KVM_MAX_HUGEPAGE_LEVEL, > > .req_level = PG_LEVEL_4K, > > .goal_level = PG_LEVEL_4K, > > - .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT), > > + .is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err), > > }; > > int r; > > > > -- > > 2.25.1 > > > > -- > Isaku Yamahata <isaku.yamahata@gmail.com>
On Mon, Jun 19, 2023 at 06:27:03PM +0200, Borislav Petkov wrote: > On Sun, Jun 11, 2023 at 11:25:12PM -0500, Michael Roth wrote: > > This will be used to determine whether or not an #NPF should be serviced > > using a normal page vs. a guarded/gmem one. > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > --- > > arch/x86/include/asm/kvm_host.h | 7 +++++++ > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++- > > 2 files changed, 41 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > index b3bd24f2a390..c26f76641121 100644 > > --- a/arch/x86/include/asm/kvm_host.h > > +++ b/arch/x86/include/asm/kvm_host.h > > @@ -1445,6 +1445,13 @@ struct kvm_arch { > > */ > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) > > struct kvm_mmu_memory_cache split_desc_cache; > > + > > + /* > > + * When set, used to determine whether a fault should be treated as > ^^^^^^^^ > > And when not set? Invalid? > > I guess so, judging by the code below. Yes, or more specifically, "When not set, treat the value set by userspace via KVM_SET_MEMORY_ATTRIBUTES as the authority on whether to treat a fault as private or not. In this case, KVM_EXIT_MEMORY_FAULT events won't be generated, so there will never be a mismatch between what hardware indicates via page fault flags vs. what software has assigned via KVM_SET_MEMORY_ATTRIBUTES". -Mike > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Jun 20, 2023 at 03:28:41PM -0500, Michael Roth <michael.roth@amd.com> wrote: > On Wed, Jun 14, 2023 at 09:47:09AM -0700, Isaku Yamahata wrote: > > On Sun, Jun 11, 2023 at 11:25:12PM -0500, > > Michael Roth <michael.roth@amd.com> wrote: > > > > > This will be used to determine whether or not an #NPF should be serviced > > > using a normal page vs. a guarded/gmem one. > > > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > --- > > > arch/x86/include/asm/kvm_host.h | 7 +++++++ > > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++- > > > 2 files changed, 41 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > > index b3bd24f2a390..c26f76641121 100644 > > > --- a/arch/x86/include/asm/kvm_host.h > > > +++ b/arch/x86/include/asm/kvm_host.h > > > @@ -1445,6 +1445,13 @@ struct kvm_arch { > > > */ > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) > > > struct kvm_mmu_memory_cache split_desc_cache; > > > + > > > + /* > > > + * When set, used to determine whether a fault should be treated as > > > + * private in the context of protected VMs which use a separate gmem > > > + * pool to back private guest pages. 
> > > + */ > > > + u64 mmu_private_fault_mask; > > > }; > > > > > > struct kvm_vm_stat { > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > > > index 780b91e1da9f..9b9e75aa43f4 100644 > > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > > @@ -252,6 +252,39 @@ struct kvm_page_fault { > > > > > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > > > > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err) > > > +{ > > > + struct kvm_memory_slot *slot; > > > + bool private_fault = false; > > > + gfn_t gfn = gpa_to_gfn(gpa); > > > + > > > + slot = gfn_to_memslot(kvm, gfn); > > > + if (!slot) { > > > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn); > > > + goto out; > > > + } > > > + > > > + if (!kvm_slot_can_be_private(slot)) { > > > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn); > > > + goto out; > > > + } > > > + > > > + if (kvm->arch.mmu_private_fault_mask) { > > > + private_fault = !!(err & kvm->arch.mmu_private_fault_mask); > > > + goto out; > > > + } > > > > What's the convention of err? Can we abstract it by introducing a new bit > > PFERR_PRIVATE_MASK? The caller sets it based on arch specific value. > > the logic will be > > .is_private = err & PFERR_PRIVATE_MASK; > > I'm not sure I understand the question. 'err' is just the page fault flags, > and arch.mmu_private_fault_mask is something that can be set on a > per-platform basis when running in a mode where shared/private access > is recorded in the page fault flags during a #NPF. > > I'm not sure how we'd keep the handling cross-platform by moving to a macro, > since TDX uses a different bit, and we'd want to be able to build a > SNP+TDX kernel that could run on either type of hardware. 
> > Are you suggesting to reverse that and have err be set in a platform-specific > way and then use a common PFERR_PRIVATE_MASK that's software-defined and > consistent across platforms? That could work, but existing handling seems > to use page fault flags as-is, keeping the hardware-set values, rather than > modifying them to pass additional metadata, so it seems like it might > make things more confusing to make an exception to that here. Or are > there other cases where it's done that way? I meant the latter, making PFERR_PRIVATE_MASK common software-defined. I think the SVM fault handler can use hardware value directly by carefully defining those PFERR values. TDX doesn't have architectural bit in error code to indicate the private fault. It's coded in faulted address as shared bit. GPA bit 51 or 47. PFERR_{USER, WRITE, FETCH, PRESENT} are already software-defined value for VMX (and TDX). The fault handler for VMX, handle_ept_violation(), converts encoding. For TDX, PFERR_PRIVATE_MASK is just one more software defined bit. I'm fine with either way, variable or macro. Which do you prefer? - Define variable mmu_private_fault_mask (global or per struct kvm) The module initialization code, hardware_setup(), sets mmu_private_fault_mask. - Define the software defined value, PFERR_PRIVATE_MASK. The caller of kvm_mmu_page_fault() parses the hardware value and construct software defined error_code. - any other? > > > + > > > + /* > > > + * Handling below is for UPM self-tests and guests that treat userspace > > > + * as the authority on whether a fault should be private or not. > > > + */ > > > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT); > > > > This code path is sad. One extra slot lookup and xarray look up. > > Without mmu lock, the result can change by other vcpu. > > Let's find a better way. 
> > The intention was to rely on fault->mmu_seq to determine if a > KVM_SET_MEMORY_ATTRIBUTES update came in after .private_fault was set so > that fault handling could be retried, but that doesn't happen until > kvm_faultin_pfn() which is *after* this is logged. So yes, I think there > is a race here, and the approach you took in your Misc. series of > keeping the kvm_mem_is_private() check inside kvm_faultin_pfn() is more > efficient/correct. > > If we can figure out a way to handle checking the fault flags in a way > that works for both TDX/SNP (and KVM self-test use-case) we can > consolidate around that. I can think of the following ways. I think the second option is better because we don't need exit bit for error code. - Introduce software defined error code - Add a flags to struct kvm_arch for self-test use-case VM_TYPE_PROTECTED_VM. Set it to true for VM_TYPE_PROTECTED_VM case. - any other? Thanks,
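The first option above (a common software-defined error code, constructed by the caller of kvm_mmu_page_fault()) can be sketched as a toy model for the TDX case described earlier in the thread, where the private/shared distinction lives in the faulted GPA's shared bit (bit 51 or 47 depending on guest physical address width) rather than in the hardware error code. This is a standalone illustration with made-up helper names and an assumed `PFERR_PRIVATE_MASK` position, not actual KVM code; the real VMX handler (handle_ept_violation()) does an analogous conversion for the other PFERR_* bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Common software-defined bit; the position is an assumption. */
#define PFERR_PRIVATE_MASK (1ULL << 48)

/*
 * TDX-style conversion: the shared bit is the top bit of the guest
 * physical address space (bit 47 for a 48-bit GPA width, bit 51 for a
 * 52-bit width). A fault on a GPA with the shared bit CLEAR is a
 * private fault, so fold that into the common software-defined bit.
 */
static uint64_t tdx_build_error_code(uint64_t gpa, uint64_t hw_err,
				     unsigned int gpa_width)
{
	uint64_t shared_bit = 1ULL << (gpa_width - 1);

	return (gpa & shared_bit) ? hw_err : (hw_err | PFERR_PRIVATE_MASK);
}

/* Common MMU side, shared with any other platform's conversion. */
static bool fault_is_private(uint64_t err)
{
	return err & PFERR_PRIVATE_MASK;
}
```

Under this scheme a mixed SNP+TDX kernel needs no per-VM mask in the common code at all; only the per-arch fault handlers know their own hardware encodings.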
On Tue, Jun 20, 2023 at 02:18:45PM -0700, Isaku Yamahata wrote: > On Tue, Jun 20, 2023 at 03:28:41PM -0500, > Michael Roth <michael.roth@amd.com> wrote: > > > On Wed, Jun 14, 2023 at 09:47:09AM -0700, Isaku Yamahata wrote: > > > On Sun, Jun 11, 2023 at 11:25:12PM -0500, > > > Michael Roth <michael.roth@amd.com> wrote: > > > > > > > This will be used to determine whether or not an #NPF should be serviced > > > > using a normal page vs. a guarded/gmem one. > > > > > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > > --- > > > > arch/x86/include/asm/kvm_host.h | 7 +++++++ > > > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++- > > > > 2 files changed, 41 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > > > index b3bd24f2a390..c26f76641121 100644 > > > > --- a/arch/x86/include/asm/kvm_host.h > > > > +++ b/arch/x86/include/asm/kvm_host.h > > > > @@ -1445,6 +1445,13 @@ struct kvm_arch { > > > > */ > > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) > > > > struct kvm_mmu_memory_cache split_desc_cache; > > > > + > > > > + /* > > > > + * When set, used to determine whether a fault should be treated as > > > > + * private in the context of protected VMs which use a separate gmem > > > > + * pool to back private guest pages. 
> > > > + */ > > > > + u64 mmu_private_fault_mask; > > > > }; > > > > > > > > struct kvm_vm_stat { > > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > > > > index 780b91e1da9f..9b9e75aa43f4 100644 > > > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > > > @@ -252,6 +252,39 @@ struct kvm_page_fault { > > > > > > > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > > > > > > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err) > > > > +{ > > > > + struct kvm_memory_slot *slot; > > > > + bool private_fault = false; > > > > + gfn_t gfn = gpa_to_gfn(gpa); > > > > + > > > > + slot = gfn_to_memslot(kvm, gfn); > > > > + if (!slot) { > > > > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn); > > > > + goto out; > > > > + } > > > > + > > > > + if (!kvm_slot_can_be_private(slot)) { > > > > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn); > > > > + goto out; > > > > + } > > > > + > > > > + if (kvm->arch.mmu_private_fault_mask) { > > > > + private_fault = !!(err & kvm->arch.mmu_private_fault_mask); > > > > + goto out; > > > > + } > > > > > > What's the convention of err? Can we abstract it by introducing a new bit > > > PFERR_PRIVATE_MASK? The caller sets it based on arch specific value. > > > the logic will be > > > .is_private = err & PFERR_PRIVATE_MASK; > > > > I'm not sure I understand the question. 'err' is just the page fault flags, > > and arch.mmu_private_fault_mask is something that can be set on a > > per-platform basis when running in a mode where shared/private access > > is recorded in the page fault flags during a #NPF. > > > > I'm not sure how we'd keep the handling cross-platform by moving to a macro, > > since TDX uses a different bit, and we'd want to be able to build a > > SNP+TDX kernel that could run on either type of hardware. 
> > > > Are you suggesting to reverse that and have err be set in a platform-specific > > way and then use a common PFERR_PRIVATE_MASK that's software-defined and > > consistent across platforms? That could work, but existing handling seems > > to use page fault flags as-is, keeping the hardware-set values, rather than > > modifying them to pass additional metadata, so it seems like it might > > make things more confusing to make an exception to that here. Or are > > there other cases where it's done that way? > > I meant the latter, making PFERR_PRIVATE_MASK common software-defined. > > I think the SVM fault handler can use hardware value directly by carefully > defining those PFERR values. > > TDX doesn't have architectural bit in error code to indicate the private fault. > It's coded in faulted address as shared bit. GPA bit 51 or 47. > PFERR_{USER, WRITE, FETCH, PRESENT} are already software-defined value for VMX > (and TDX). The fault handler for VMX, handle_ept_violation(), converts > encoding. For TDX, PFERR_PRIVATE_MASK is just one more software defined bit. > > I'm fine with either way, variable or macro. Which do you prefer? > > - Define variable mmu_private_fault_mask (global or per struct kvm) > The module initialization code, hardware_setup(), sets mmu_private_fault_mask. > - Define the software defined value, PFERR_PRIVATE_MASK. > The caller of kvm_mmu_page_fault() parses the hardware value and construct > software defined error_code. > - any other? > > > > > > + > > > > + /* > > > > + * Handling below is for UPM self-tests and guests that treat userspace > > > > + * as the authority on whether a fault should be private or not. > > > > + */ > > > > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT); > > > > > > This code path is sad. One extra slot lookup and xarray look up. > > > Without mmu lock, the result can change by other vcpu. > > > Let's find a better way. 
> > > > The intention was to rely on fault->mmu_seq to determine if a > > KVM_SET_MEMORY_ATTRIBUTES update came in after .private_fault was set so > > that fault handling could be retried, but that doesn't happen until > > kvm_faultin_pfn() which is *after* this is logged. So yes, I think there > > is a race here, and the approach you took in your Misc. series of > > keeping the kvm_mem_is_private() check inside kvm_faultin_pfn() is more > > efficient/correct. > > > > If we can figure out a way to handle checking the fault flags in a way > > that works for both TDX/SNP (and KVM self-test use-case) we can > > consolidate around that. > > I can think of the following ways. I think the second option is better because > we don't need exit bit for error code. > > - Introduce software defined error code > - Add a flags to struct kvm_arch for self-test use-case VM_TYPE_PROTECTED_VM. > Set it to true for VM_TYPE_PROTECTED_VM case. > - any other? Vishal: hoping to get your thoughts here as well from the perspective of the KVM self-test use-case. I was thinking that once we set fault->is_private, that sort of becomes our "software-defined" bit, and what KVM would use from that point forward to determine whether or not the access should be treated as a private one or not, and that whatever handler sets fault->is_private would encapsulate away all the platform-specific bit-checking needed to do that. So if we were to straight-forwardly implement that based on how TDX currently handles checking for the shared bit in GPA, paired with how SEV-SNP handles checking for private bit in fault flags, it would look something like: bool kvm_fault_is_private(kvm, gpa, err) { /* SEV-SNP handling */ if (kvm->arch.mmu_private_fault_mask) return !!(err & arch.mmu_private_fault_mask); /* TDX handling */ if (kvm->arch.gfn_shared_mask) return !!(gpa & arch.gfn_shared_mask); return false; } kvm_mmu_do_page_fault(vcpu, gpa, err, ...) { struct kvm_page_fault fault = { ... 
.is_private = kvm_fault_is_private(vcpu->kvm, gpa, err) }; ... } And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be set per-KVM-instance, just like they are now with current SNP and TDX patchsets, since stuff like KVM self-test wouldn't be setting those masks, so it makes sense to do it per-instance in that regard. But that still gets a little awkward for the KVM self-test use-case where .is_private should sort of be ignored in favor of whatever the xarray reports via kvm_mem_is_private(). In your Misc. series I believe you handled this by introducing a PFERR_HASATTR_MASK bit so we can determine whether existing value of fault->is_private should be ignored/overwritten or not. So maybe kvm_fault_is_private() needs to return an integer value instead, like: enum { KVM_FAULT_VMM_DEFINED, KVM_FAULT_SHARED, KVM_FAULT_PRIVATE, } bool kvm_fault_is_private(kvm, gpa, err) { /* SEV-SNP handling */ if (kvm->arch.mmu_private_fault_mask) (err & arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED /* TDX handling */ if (kvm->arch.gfn_shared_mask) (gpa & arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE return KVM_FAULT_VMM_DEFINED; } And then down in __kvm_faultin_pfn() we do: if (fault->is_private == KVM_FAULT_VMM_DEFINED) fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn); else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) return kvm_do_memory_fault_exit(vcpu, fault); if (fault->is_private) return kvm_faultin_pfn_private(vcpu, fault); Maybe kvm_fault_is_private() can be simplified based on what direction we end up taking WRT ongoing discussions like whether we decide to define KVM_X86_{SNP,TDX}_VM vm_types in addition to the KVM_X86_PROTECTED_VM type that the selftests uses, but hoping that for this path, any changes along that line can be encapsulated away in kvm_fault_is_private() without any/much further churn at the various call-sites like __kvm_faultin_pfn(). 
We could even push all the above logic down into the KVM self-tests, but have: bool kvm_fault_is_private(kvm, gpa, err) { return KVM_FAULT_VMM_DEFINED; } And that would be enough to run self-tests as standalone series, with TDX/SNP should filling in kvm_fault_is_private() with their platform-specific handling. Does that seem reasonable to you? At least as a starting point. -Mike > > Thanks, > -- > Isaku Yamahata <isaku.yamahata@gmail.com>
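Pulling the flattened pseudocode above together, here is a self-contained toy model of the tri-state proposal. It is a sketch, not kernel code: `struct kvm_model`, `attr_private`, and `faultin_dispatch()` are stand-ins invented for this illustration (the real code would use `struct kvm`, the memory-attributes xarray behind kvm_mem_is_private(), and __kvm_faultin_pfn()), and the SNP mask bit position is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

enum kvm_fault_class {
	KVM_FAULT_VMM_DEFINED,	/* self-tests: defer to memory attributes */
	KVM_FAULT_SHARED,
	KVM_FAULT_PRIVATE,
};

struct kvm_model {
	uint64_t mmu_private_fault_mask;	/* SNP: bit in the error code */
	uint64_t gfn_shared_mask;		/* TDX: shared bit in the GPA */
	bool attr_private;	/* stand-in for the memory-attributes xarray */
};

static enum kvm_fault_class kvm_fault_is_private(const struct kvm_model *kvm,
						 uint64_t gpa, uint64_t err)
{
	if (kvm->mmu_private_fault_mask)	/* SEV-SNP handling */
		return (err & kvm->mmu_private_fault_mask) ?
			KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;

	if (kvm->gfn_shared_mask)		/* TDX handling */
		return (gpa & kvm->gfn_shared_mask) ?
			KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;

	return KVM_FAULT_VMM_DEFINED;
}

/*
 * Dispatch modeled on the __kvm_faultin_pfn() snippet above:
 * -1 => KVM_EXIT_MEMORY_FAULT, 0 => shared path, 1 => private path.
 */
static int faultin_dispatch(const struct kvm_model *kvm, enum kvm_fault_class c)
{
	if (c == KVM_FAULT_VMM_DEFINED)
		c = kvm->attr_private ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
	else if ((c == KVM_FAULT_PRIVATE) != kvm->attr_private)
		return -1;	/* hw-reported state disagrees with attrs */

	return (c == KVM_FAULT_PRIVATE) ? 1 : 0;
}
```

Leaving both masks zero models the self-test case: every fault comes back KVM_FAULT_VMM_DEFINED and the attributes xarray is the sole authority, so no memory-fault exits are ever generated for mismatches.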
On Wed, Jun 21, 2023 at 06:00:31PM -0500, Michael Roth <michael.roth@amd.com> wrote: > On Tue, Jun 20, 2023 at 02:18:45PM -0700, Isaku Yamahata wrote: > > On Tue, Jun 20, 2023 at 03:28:41PM -0500, > > Michael Roth <michael.roth@amd.com> wrote: > > > > > On Wed, Jun 14, 2023 at 09:47:09AM -0700, Isaku Yamahata wrote: > > > > On Sun, Jun 11, 2023 at 11:25:12PM -0500, > > > > Michael Roth <michael.roth@amd.com> wrote: > > > > > > > > > This will be used to determine whether or not an #NPF should be serviced > > > > > using a normal page vs. a guarded/gmem one. > > > > > > > > > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > > > --- > > > > > arch/x86/include/asm/kvm_host.h | 7 +++++++ > > > > > arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++- > > > > > 2 files changed, 41 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > > > > > index b3bd24f2a390..c26f76641121 100644 > > > > > --- a/arch/x86/include/asm/kvm_host.h > > > > > +++ b/arch/x86/include/asm/kvm_host.h > > > > > @@ -1445,6 +1445,13 @@ struct kvm_arch { > > > > > */ > > > > > #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1) > > > > > struct kvm_mmu_memory_cache split_desc_cache; > > > > > + > > > > > + /* > > > > > + * When set, used to determine whether a fault should be treated as > > > > > + * private in the context of protected VMs which use a separate gmem > > > > > + * pool to back private guest pages. 
> > > > > + */ > > > > > + u64 mmu_private_fault_mask; > > > > > }; > > > > > > > > > > struct kvm_vm_stat { > > > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h > > > > > index 780b91e1da9f..9b9e75aa43f4 100644 > > > > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > > > > @@ -252,6 +252,39 @@ struct kvm_page_fault { > > > > > > > > > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); > > > > > > > > > > +static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err) > > > > > +{ > > > > > + struct kvm_memory_slot *slot; > > > > > + bool private_fault = false; > > > > > + gfn_t gfn = gpa_to_gfn(gpa); > > > > > + > > > > > + slot = gfn_to_memslot(kvm, gfn); > > > > > + if (!slot) { > > > > > + pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn); > > > > > + goto out; > > > > > + } > > > > > + > > > > > + if (!kvm_slot_can_be_private(slot)) { > > > > > + pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn); > > > > > + goto out; > > > > > + } > > > > > + > > > > > + if (kvm->arch.mmu_private_fault_mask) { > > > > > + private_fault = !!(err & kvm->arch.mmu_private_fault_mask); > > > > > + goto out; > > > > > + } > > > > > > > > What's the convention of err? Can we abstract it by introducing a new bit > > > > PFERR_PRIVATE_MASK? The caller sets it based on arch specific value. > > > > the logic will be > > > > .is_private = err & PFERR_PRIVATE_MASK; > > > > > > I'm not sure I understand the question. 'err' is just the page fault flags, > > > and arch.mmu_private_fault_mask is something that can be set on a > > > per-platform basis when running in a mode where shared/private access > > > is recorded in the page fault flags during a #NPF. 
> > > > > > I'm not sure how we'd keep the handling cross-platform by moving to a macro, > > > since TDX uses a different bit, and we'd want to be able to build a > > > SNP+TDX kernel that could run on either type of hardware. > > > > > > Are you suggesting to reverse that and have err be set in a platform-specific > > > way and then use a common PFERR_PRIVATE_MASK that's software-defined and > > > consistent across platforms? That could work, but existing handling seems > > > to use page fault flags as-is, keeping the hardware-set values, rather than > > > modifying them to pass additional metadata, so it seems like it might > > > make things more confusing to make an exception to that here. Or are > > > there other cases where it's done that way? > > > > I meant the latter, making PFERR_PRIVATE_MASK common software-defined. > > > > I think the SVM fault handler can use hardware value directly by carefully > > defining those PFERR values. > > > > TDX doesn't have architectural bit in error code to indicate the private fault. > > It's coded in faulted address as shared bit. GPA bit 51 or 47. > > PFERR_{USER, WRITE, FETCH, PRESENT} are already software-defined value for VMX > > (and TDX). The fault handler for VMX, handle_ept_violation(), converts > > encoding. For TDX, PFERR_PRIVATE_MASK is just one more software defined bit. > > > > I'm fine with either way, variable or macro. Which do you prefer? > > > > - Define variable mmu_private_fault_mask (global or per struct kvm) > > The module initialization code, hardware_setup(), sets mmu_private_fault_mask. > > - Define the software defined value, PFERR_PRIVATE_MASK. > > The caller of kvm_mmu_page_fault() parses the hardware value and construct > > software defined error_code. > > - any other? > > > > > > > > > + > > > > > + /* > > > > > + * Handling below is for UPM self-tests and guests that treat userspace > > > > > + * as the authority on whether a fault should be private or not. 
> > > > > + */ > > > > > + private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT); > > > > > > > > This code path is sad. One extra slot lookup and xarray look up. > > > > Without mmu lock, the result can change by other vcpu. > > > > Let's find a better way. > > > > > > The intention was to rely on fault->mmu_seq to determine if a > > > KVM_SET_MEMORY_ATTRIBUTES update came in after .private_fault was set so > > > that fault handling could be retried, but that doesn't happen until > > > kvm_faultin_pfn() which is *after* this is logged. So yes, I think there > > > is a race here, and the approach you took in your Misc. series of > > > keeping the kvm_mem_is_private() check inside kvm_faultin_pfn() is more > > > efficient/correct. > > > > > > If we can figure out a way to handle checking the fault flags in a way > > > that works for both TDX/SNP (and KVM self-test use-case) we can > > > consolidate around that. > > > > I can think of the following ways. I think the second option is better because > > we don't need exit bit for error code. > > > > - Introduce software defined error code > > - Add a flags to struct kvm_arch for self-test use-case VM_TYPE_PROTECTED_VM. > > Set it to true for VM_TYPE_PROTECTED_VM case. > > - any other? > > Vishal: hoping to get your thoughts here as well from the perspective of > the KVM self-test use-case. > > I was thinking that once we set fault->is_private, that sort of > becomes our "software-defined" bit, and what KVM would use from that > point forward to determine whether or not the access should be treated > as a private one or not, and that whatever handler sets > fault->is_private would encapsulate away all the platform-specific > bit-checking needed to do that. 
>
> So if we were to straight-forwardly implement that based on how TDX
> currently handles checking for the shared bit in GPA, paired with how
> SEV-SNP handles checking for the private bit in fault flags, it would look
> something like:
>
>   bool kvm_fault_is_private(kvm, gpa, err)
>   {
>           /* SEV-SNP handling */
>           if (kvm->arch.mmu_private_fault_mask)
>                   return !!(err & arch.mmu_private_fault_mask);
>
>           /* TDX handling */
>           if (kvm->arch.gfn_shared_mask)
>                   return !!(gpa & arch.gfn_shared_mask);
>
>           return false;
>   }
>
>   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
>   {
>           struct kvm_page_fault fault = {
>                   ...
>                   .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
>           };
>
>           ...
>   }
>
> And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> set per-KVM-instance, just like they are now with the current SNP and TDX
> patchsets, since stuff like the KVM self-tests wouldn't be setting those
> masks, so it makes sense to do it per-instance in that regard.
>
> But that still gets a little awkward for the KVM self-test use-case where
> .is_private should sort of be ignored in favor of whatever the xarray
> reports via kvm_mem_is_private(). In your Misc. series I believe you
> handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
> whether the existing value of fault->is_private should be
> ignored/overwritten or not.
>
> So maybe kvm_fault_is_private() needs to return an integer value
> instead, like:
>
>   enum {
>           KVM_FAULT_VMM_DEFINED,
>           KVM_FAULT_SHARED,
>           KVM_FAULT_PRIVATE,
>   };
>
>   int kvm_fault_is_private(kvm, gpa, err)
>   {
>           /* SEV-SNP handling */
>           if (kvm->arch.mmu_private_fault_mask)
>                   return (err & arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
>
>           /* TDX handling */
>           if (kvm->arch.gfn_shared_mask)
>                   return (gpa & arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;
>
>           return KVM_FAULT_VMM_DEFINED;
>   }
>
> And then down in __kvm_faultin_pfn() we do:
>
>   if (fault->is_private == KVM_FAULT_VMM_DEFINED)
>           fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
>   else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
>           return kvm_do_memory_fault_exit(vcpu, fault);
>
>   if (fault->is_private)
>           return kvm_faultin_pfn_private(vcpu, fault);
>
> Maybe kvm_fault_is_private() can be simplified based on what direction
> we end up taking WRT ongoing discussions like whether we decide to define
> KVM_X86_{SNP,TDX}_VM vm_types in addition to the KVM_X86_PROTECTED_VM
> type that the selftests use, but hoping that for this path, any changes
> along that line can be encapsulated away in kvm_fault_is_private() without
> any/much further churn at the various call-sites like __kvm_faultin_pfn().
>
> We could even push all the above logic down into the KVM self-tests, but
> have:
>
>   int kvm_fault_is_private(kvm, gpa, err) {
>           return KVM_FAULT_VMM_DEFINED;
>   }
>
> And that would be enough to run the self-tests as a standalone series, with
> TDX/SNP then filling in kvm_fault_is_private() with their
> platform-specific handling.
>
> Does that seem reasonable to you? At least as a starting point.

I tried to move the logic to the caller site and hide it from KVM MMU code.
You'd like to move the logic down into the KVM MMU side and make the
difference explicit. Either way works for me. Let me respin my patches in
the direction you propose.
>
> So if we were to straight-forwardly implement that based on how TDX
> currently handles checking for the shared bit in GPA, paired with how
> SEV-SNP handles checking for the private bit in fault flags, it would look
> something like:
>
>   bool kvm_fault_is_private(kvm, gpa, err)
>   {
>           /* SEV-SNP handling */
>           if (kvm->arch.mmu_private_fault_mask)
>                   return !!(err & arch.mmu_private_fault_mask);
>
>           /* TDX handling */
>           if (kvm->arch.gfn_shared_mask)
>                   return !!(gpa & arch.gfn_shared_mask);

The logic of the two is identical. I think they need to be converged.

Either SEV-SNP should convert the error code private bit to the
gfn_shared_mask, or TDX's shared bit should be converted to some private
error bit. Perhaps converting SEV-SNP makes more sense because, if I recall
correctly, the SEV guest also has a C-bit, correct?

Or, ...

>
>           return false;
>   }
>
>   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
>   {
>           struct kvm_page_fault fault = {
>                   ...
>                   .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)

... should we do something like:

	.is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, gpa,
							    err);

?

>           };
>
>           ...
>   }
>
> And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> set per-KVM-instance, just like they are now with current SNP and TDX
> patchsets, since stuff like KVM self-test wouldn't be setting those
> masks, so it makes sense to do it per-instance in that regard.
>
> But that still gets a little awkward for the KVM self-test use-case where
> .is_private should sort of be ignored in favor of whatever the xarray
> reports via kvm_mem_is_private().
>

I must have missed something. Why does the KVM self-test have any impact on
how KVM handles private faults?

> In your Misc. series I believe you
> handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
> whether the existing value of fault->is_private should be
> ignored/overwritten or not.
>
> So maybe kvm_fault_is_private() needs to return an integer value
> instead, like:
>
>   enum {
>           KVM_FAULT_VMM_DEFINED,
>           KVM_FAULT_SHARED,
>           KVM_FAULT_PRIVATE,
>   };
>
>   int kvm_fault_is_private(kvm, gpa, err)
>   {
>           /* SEV-SNP handling */
>           if (kvm->arch.mmu_private_fault_mask)
>                   return (err & arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
>
>           /* TDX handling */
>           if (kvm->arch.gfn_shared_mask)
>                   return (gpa & arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;
>
>           return KVM_FAULT_VMM_DEFINED;
>   }
>
> And then down in __kvm_faultin_pfn() we do:
>
>   if (fault->is_private == KVM_FAULT_VMM_DEFINED)
>           fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
>   else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
>           return kvm_do_memory_fault_exit(vcpu, fault);
>
>   if (fault->is_private)
>           return kvm_faultin_pfn_private(vcpu, fault);

What does KVM_FAULT_VMM_DEFINED mean, exactly?

Shouldn't the fault type come from _hardware_?
On Thu, Jun 22, 2023 at 09:55:22AM +0000, Huang, Kai wrote:
> >
> > So if we were to straight-forwardly implement that based on how TDX
> > currently handles checking for the shared bit in GPA, paired with how
> > SEV-SNP handles checking for the private bit in fault flags, it would look
> > something like:
> >
> >   bool kvm_fault_is_private(kvm, gpa, err)
> >   {
> >           /* SEV-SNP handling */
> >           if (kvm->arch.mmu_private_fault_mask)
> >                   return !!(err & arch.mmu_private_fault_mask);
> >
> >           /* TDX handling */
> >           if (kvm->arch.gfn_shared_mask)
> >                   return !!(gpa & arch.gfn_shared_mask);
>
> The logic of the two is identical. I think they need to be converged.

I think they're just different enough that trying too hard to converge
them might obfuscate things. If the determination didn't come from 2
completely different fields (gpa vs. fault flags) maybe it could be
simplified a bit more, but having a well-defined, open-coded handler that
gets called once to set fault->is_private at initial fault time, so that
the ugliness never needs to be looked at again by the KVM MMU, seems like
a good way to keep things simple through the rest of the handling.

>
> Either SEV-SNP should convert the error code private bit to the gfn_shared_mask,
> or TDX's shared bit should be converted to some private error bit.
struct kvm_page_fault seems to be the preferred way to pass additional
params/metadata around, and the .is_private field was introduced to track
this private/shared state as part of the UPM base series:

  https://lore.kernel.org/lkml/20221202061347.1070246-9-chao.p.peng@linux.intel.com/

So it seems like unnecessary complexity to track/encode that state in
other additional places rather than just encapsulating it all in
fault->is_private (or some similar field). Synthesizing all this
platform-specific handling into a well-defined value that's conveyed by
something like fault->is_private, in a way where the KVM MMU doesn't need
to worry as much about platform-specific stuff, seems like a good thing,
and is in line with what the UPM base series was trying to do by adding
the fault->is_private field.

So all I'm really proposing is that whatever SNP and TDX end up doing
should center around setting that fault->is_private field and keeping
everything contained there. If there are better ways to handle *how*
that's done I don't have any complaints there, but moving/adding bits
to GPA/error_flags after fault time just seems unnecessary to me when
the fault->is_private field can serve that purpose just as well.

>
> Perhaps converting SEV-SNP makes more sense because if I recall correctly SEV
> guest also has a C-bit, correct?

That's correct, but the C-bit doesn't show up in the GPA that gets passed
up to KVM during an #NPF, and instead gets encoded into the fault's
error_flags.

>
> Or, ...
>
> >
> >           return false;
> >   }
> >
> >   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
> >   {
> >           struct kvm_page_fault fault = {
> >                   ...
> >                   .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
>
> ...
should we do something like:
>
> 	.is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, gpa,
> 							    err);

We actually had exactly this in v7 of the SNP hypervisor patches:

  https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m17841f5bfdfb8350d69d78c6831dd8f3a4748638

but Sean was hoping to avoid a callback, which is why we ended up using
a bitmask in this version, since that's basically all the callback would
need to do. It's unfortunate that we don't have a common scheme between
SNP/TDX, but maybe that's still possible; I just think that whatever it
ends up being, it should live and be contained inside whatever helper ends
up setting fault->is_private.

There's some other awkwardness with a callback approach. It sort of ties
into your question about selftests, so I'll address it below...

>
> ?
>
> >           };
> >
> >           ...
> >   }
> >
> > And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> > set per-KVM-instance, just like they are now with current SNP and TDX
> > patchsets, since stuff like KVM self-test wouldn't be setting those
> > masks, so it makes sense to do it per-instance in that regard.
> >
> > But that still gets a little awkward for the KVM self-test use-case where
> > .is_private should sort of be ignored in favor of whatever the xarray
> > reports via kvm_mem_is_private().
> >
>
> I must have missed something. Why does the KVM self-test have any impact
> on how KVM handles private faults?

The self-tests I'm referring to here are the ones from Vishal that shipped
with v10 of Chao's UPM/fd-based private memory series, and also as part of
Sean's gmem tree:

  https://github.com/sean-jc/linux/commit/a0f5f8c911804f55935094ad3a277301704330a6

These exercise gmem/UPM handling without the need for any SNP/TDX
hardware support. They do so by "trusting" the shared/private state
that the VMM sets via KVM_SET_MEMORY_ATTRIBUTES.
So if the VMM says it should be private, the KVM MMU will treat it as
private, so we'd never get a mismatch, and KVM_EXIT_MEMORY_FAULT will never
be generated.

>
> > In your Misc. series I believe you
> > handled this by introducing a PFERR_HASATTR_MASK bit so we can determine
> > whether the existing value of fault->is_private should be
> > ignored/overwritten or not.
> >
> > So maybe kvm_fault_is_private() needs to return an integer value
> > instead, like:
> >
> >   enum {
> >           KVM_FAULT_VMM_DEFINED,
> >           KVM_FAULT_SHARED,
> >           KVM_FAULT_PRIVATE,
> >   };
> >
> >   int kvm_fault_is_private(kvm, gpa, err)
> >   {
> >           /* SEV-SNP handling */
> >           if (kvm->arch.mmu_private_fault_mask)
> >                   return (err & arch.mmu_private_fault_mask) ? KVM_FAULT_PRIVATE : KVM_FAULT_SHARED;
> >
> >           /* TDX handling */
> >           if (kvm->arch.gfn_shared_mask)
> >                   return (gpa & arch.gfn_shared_mask) ? KVM_FAULT_SHARED : KVM_FAULT_PRIVATE;
> >
> >           return KVM_FAULT_VMM_DEFINED;
> >   }
> >
> > And then down in __kvm_faultin_pfn() we do:
> >
> >   if (fault->is_private == KVM_FAULT_VMM_DEFINED)
> >           fault->is_private = kvm_mem_is_private(vcpu->kvm, fault->gfn);
> >   else if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
> >           return kvm_do_memory_fault_exit(vcpu, fault);
> >
> >   if (fault->is_private)
> >           return kvm_faultin_pfn_private(vcpu, fault);
>
> What does KVM_FAULT_VMM_DEFINED mean, exactly?
>
> Shouldn't the fault type come from _hardware_?

In the above self-test use-case, there is no reliance on hardware support,
and fault->is_private should always be treated as being whatever was set by
the VMM via KVM_SET_MEMORY_ATTRIBUTES. That's why I proposed the
KVM_FAULT_VMM_DEFINED value: to encode that case into fault->is_private so
the KVM MMU can handle protected self-test VMs of this sort.
In the future, these protected self-test VMs might become the basis of
a new protected VM type where some sort of guest-issued hypercall can be
used to set whether a fault should be treated as shared/private, rather
than relying on a VMM-defined value. There's some current discussion about
that here:

  https://lore.kernel.org/lkml/20230620190443.GU2244082@ls.amr.corp.intel.com/T/#me627bed3d9acf73ea882e8baa76dfcb27759c440

Going back to your callback question above, that makes things a little
awkward, since kvm_x86_ops is statically defined for both kvm_amd/kvm_intel
modules, and either can run these self-test guests as well as these
proposed "non-CC VMs" which rely on enlightened guest kernels instead of
TDX/SNP hardware support for managing private/shared access. So you either
need to duplicate the handling for how to determine private/shared for
these other VM types into the kvm_intel/kvm_amd callbacks, or have some way
for the callback to say "fall back to the common handling for self-tests
and non-CC VMs".

The latter is what we implemented in v8 of this series, but Isaku suggested
it was a bit too heavyweight and proposed dropping the fall-back logic in
favor of updating the kvm_x86_ops at run-time once we know whether or not
it's a TDX/SNP guest:

  https://lkml.iu.edu/hypermail/linux/kernel/2303.2/03009.html

which could work, but it still doesn't address Sean's desire to avoid
callbacks completely, and still amounts to a somewhat convoluted way to
hide away TDX/SNP-specific bit checks for shared/private.

Rather than hide them away in callbacks that are already frowned upon by
the maintainers, I think it makes sense to "open-code" all these checks in
a common handler like kvm_fault_is_private() so we can make some progress
toward a consensus, and then iterate on it from there rather than refining
what may already be a dead-end path.

-Mike
On Thu, 2023-06-22 at 10:32 -0500, Michael Roth wrote: > On Thu, Jun 22, 2023 at 09:55:22AM +0000, Huang, Kai wrote: > > > > > > > > So if we were to straight-forwardly implement that based on how TDX > > > currently handles checking for the shared bit in GPA, paired with how > > > SEV-SNP handles checking for private bit in fault flags, it would look > > > something like: > > > > > > bool kvm_fault_is_private(kvm, gpa, err) > > > { > > > /* SEV-SNP handling */ > > > if (kvm->arch.mmu_private_fault_mask) > > > return !!(err & arch.mmu_private_fault_mask); > > > > > > /* TDX handling */ > > > if (kvm->arch.gfn_shared_mask) > > > return !!(gpa & arch.gfn_shared_mask); > > > > The logic of the two are identical. I think they need to be converged. > > I think they're just different enough that trying too hard to converge > them might obfuscate things. If the determination didn't come from 2 > completely different fields (gpa vs. fault flags) maybe it could be > simplified a bit more, but have well-defined open-coded handler that > gets called once to set fault->is_private during initial fault time so > that that ugliness never needs to be looked at again by KVM MMU seems > like a good way to keep things simple through the rest of the handling. I actually agree with the second half that is after "but ...", but IIUC it doesn't conflict with converging arch.mmu_private_fault_mask and arch.gfn_shared_mask into one. No? > > > > > Either SEV-SNP should convert the error code private bit to the gfn_shared_mask, > > or TDX's shared bit should be converted to some private error bit. > > struct kvm_page_fault seems to be the preferred way to pass additional > params/metadata around, and .is_private field was introduced to track > this private/shared state as part of UPM base series: > > https://lore.kernel.org/lkml/20221202061347.1070246-9-chao.p.peng@linux.intel.com/ Sure. Agreed. 
>
> So it seems like unnecessary complexity to track/encode that state into
> other additional places rather than just encapsulating it all in
> fault->is_private (or some similar field), and synthesizing all this
> platform-specific handling into a well-defined value that's conveyed
> by something like fault->is_private in a way where KVM MMU doesn't need
> to worry as much about platform-specific stuff seems like a good thing,
> and in line with what the UPM base series was trying to do by adding the
> fault->is_private field.

Yes, we should just set fault->is_private and pass it to the rest of the
common KVM MMU code.

>
> So all I'm really proposing is that whatever SNP and TDX end up doing
> should center around setting that fault->is_private field and keeping
> everything contained there.
>

I agree.

> If there are better ways to handle *how*
> that's done I don't have any complaints there, but moving/adding bits
> to GPA/error_flags after fault time just seems unnecessary to me when
> the fault->is_private field can serve that purpose just as well.

Perhaps you missed my point. My point is that arch.mmu_private_fault_mask
and arch.gfn_shared_mask seem redundant, because the logic around them is
exactly the same. I do believe we should have fault->is_private passed to
the common MMU code.

In fact, now I am wondering why we need to have "mmu_private_fault_mask"
and "gfn_shared_mask" in _common_ KVM MMU code. We already have enough
mechanism in KVM common code:

  1) fault->is_private
  2) kvm_mmu_page_role.private
  3) an Xarray to tell whether a GFN is private or shared

I am not convinced that we need to have "mmu_private_fault_mask" and
"gfn_shared_mask" in common KVM MMU code. Instead, they can be in AMD and
Intel's vendor code.
Maybe it makes sense to have "gfn_shared_mask" in the KVM common code so
that the fault handler can just strip away the "shared bit" at the very
beginning (at least for TDX), but for the rest of the time I think we
should already have enough infrastructure to handle private/shared
mappings.

Btw, one minor issue is, if I recall correctly, for TDX the shared bit
must be applied to the GFN for shared mappings in the normal EPT. I guess
AMD doesn't need that for shared mappings. So "gfn_shared_mask" may be
useful in this case, but w/o it I believe we can also achieve the same in
another way via a vendor callback.

>
> >
> > Perhaps converting SEV-SNP makes more sense because if I recall correctly SEV
> > guest also has a C-bit, correct?
>
> That's correct, but the C-bit doesn't show up in the GPA that gets passed
> up to KVM during an #NPF, and instead gets encoded into the fault's
> error_flags.
>
> >
> > Or, ...
> >
> > >
> > >           return false;
> > >   }
> > >
> > >   kvm_mmu_do_page_fault(vcpu, gpa, err, ...)
> > >   {
> > >           struct kvm_page_fault fault = {
> > >                   ...
> > >                   .is_private = kvm_fault_is_private(vcpu->kvm, gpa, err)
> >
> > ... should we do something like:
> >
> > 	.is_private = static_call(kvm_x86_fault_is_private)(vcpu->kvm, gpa,
> > 							    err);
>
> We actually had exactly this in v7 of the SNP hypervisor patches:
>
>   https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m17841f5bfdfb8350d69d78c6831dd8f3a4748638
>
> but Sean was hoping to avoid a callback, which is why we ended up using
> a bitmask in this version, since that's basically all the callback would
> need to do. It's unfortunate that we don't have a common scheme between
> SNP/TDX, but maybe that's still possible; I just think that whatever it
> ends up being, it should live and be contained inside whatever helper
> ends up setting fault->is_private.

Sure. If the function call can be avoided in the fault handler, then we
should avoid it.

>
> There's some other awkwardness with a callback approach.
It sort of ties
> into your question about selftests, so I'll address it below...
>
> >
> > ?
> >
> > >           };
> > >
> > >           ...
> > >   }
> > >
> > > And then arch.mmu_private_fault_mask and arch.gfn_shared_mask would be
> > > set per-KVM-instance, just like they are now with current SNP and TDX
> > > patchsets, since stuff like KVM self-test wouldn't be setting those
> > > masks, so it makes sense to do it per-instance in that regard.
> > >
> > > But that still gets a little awkward for the KVM self-test use-case where
> > > .is_private should sort of be ignored in favor of whatever the xarray
> > > reports via kvm_mem_is_private().
> > >
> >
> > I must have missed something. Why does the KVM self-test have any impact
> > on how KVM handles private faults?
>
> The self-tests I'm referring to here are the ones from Vishal that shipped
> with v10 of Chao's UPM/fd-based private memory series, and also as part of
> Sean's gmem tree:
>
>   https://github.com/sean-jc/linux/commit/a0f5f8c911804f55935094ad3a277301704330a6
>
> These exercise gmem/UPM handling without the need for any SNP/TDX
> hardware support. They do so by "trusting" the shared/private state
> that the VMM sets via KVM_SET_MEMORY_ATTRIBUTES. So if the VMM says it
> should be private, the KVM MMU will treat it as private, so we'd never
> get a mismatch, and KVM_EXIT_MEMORY_FAULT will never be generated.

If KVM supports a test mode, or, per another thread, KVM wants to support a
software-based KVM_X86_PROTECTED_VM and distinguish it from hardware-based
confidential VMs such as TDX and SEV-SNP:

  https://lore.kernel.org/lkml/9e3e99f78fcbd7db21368b5fe1d931feeb4db567.1686858861.git.isaku.yamahata@intel.com/T/#me627bed3d9acf73ea882e8baa76dfcb27759c440

then it's fine to handle that when setting up fault->is_private. But my
point is that the KVM self-tests themselves should not dictate how KVM
implements the fault handler -- it is KVM itself that wants to support
additional software-based things.

[...]

(Thanks for explaining the additional background.)
On Thu, Jun 22, 2023 at 10:31:08PM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > > If there are better ways to handle *how* > > that's done I don't have any complaints there, but moving/adding bits > > to GPA/error_flags after fault time just seems unecessary to me when > > fault->is_private field can serve that purpose just as well. > > Perhaps you missed my point. My point is arch.mmu_private_fault_mask and > arch.gfn_shared_mask seem redundant because the logic around them are exactly > the same. I do believe we should have fault->is_private passing to the common > MMU code. > > In fact, now I am wondering why we need to have "mmu_private_fault_mask" and > "gfn_shared_mask" in _common_ KVM MMU code. We already have enough mechanism in > KVM common code: > > 1) fault->is_private > 2) kvm_mmu_page_role.private > 3) an Xarray to tell whether a GFN is private or shared > > I am not convinced that we need to have "mmu_private_fault_mask" and > "gfn_shared_mask" in common KVM MMU code. Instead, they can be in AMD and > Intel's vendor code. > > Maybe it makes sense to have "gfn_shared_mask" in the KVM common code so that > the fault handler can just strip away the "shared bit" at the very beginning (at > least for TDX), but for the rest of the time I think we should already have > enough infrastructure to handle private/shared mapping. > > Btw, one minor issue is, if I recall correctly, for TDX the shared bit must be > applied to the GFN for shared mapping in normal EPT. I guess AMD doesn't need > that for shared mapping. So "gfn_shared_mask" maybe useful in this case, but > w/o it I believe we can also achieve in another way via vendor callback. "2) kvm_mmu_page_role.private" above has different meaning. a). The fault is private or not. b). page table the fault handler is walking is private or conventional. a.) is common for SNP, TDX and PROTECTED_VM. It makes sense in kvm_mmu_do_page_fault() and __kvm_faultin_pfn(). 
After kvm_faultin_pfn(), the fault handler can mostly forget it for SNP and PROTECTED_VM. (large page adjustment needs it, though.) This is what we're discussing in this thread. b.) is specific to TDX. TDX KVM MMU introduces one more page table.
On Thu, 2023-06-22 at 16:39 -0700, Isaku Yamahata wrote: > On Thu, Jun 22, 2023 at 10:31:08PM +0000, > "Huang, Kai" <kai.huang@intel.com> wrote: > > > > If there are better ways to handle *how* > > > that's done I don't have any complaints there, but moving/adding bits > > > to GPA/error_flags after fault time just seems unecessary to me when > > > fault->is_private field can serve that purpose just as well. > > > > Perhaps you missed my point. My point is arch.mmu_private_fault_mask and > > arch.gfn_shared_mask seem redundant because the logic around them are exactly > > the same. I do believe we should have fault->is_private passing to the common > > MMU code. > > > > In fact, now I am wondering why we need to have "mmu_private_fault_mask" and > > "gfn_shared_mask" in _common_ KVM MMU code. We already have enough mechanism in > > KVM common code: > > > > 1) fault->is_private > > 2) kvm_mmu_page_role.private > > 3) an Xarray to tell whether a GFN is private or shared > > > > I am not convinced that we need to have "mmu_private_fault_mask" and > > "gfn_shared_mask" in common KVM MMU code. Instead, they can be in AMD and > > Intel's vendor code. > > > > Maybe it makes sense to have "gfn_shared_mask" in the KVM common code so that > > the fault handler can just strip away the "shared bit" at the very beginning (at > > least for TDX), but for the rest of the time I think we should already have > > enough infrastructure to handle private/shared mapping. > > > > Btw, one minor issue is, if I recall correctly, for TDX the shared bit must be > > applied to the GFN for shared mapping in normal EPT. I guess AMD doesn't need > > that for shared mapping. So "gfn_shared_mask" maybe useful in this case, but > > w/o it I believe we can also achieve in another way via vendor callback. > > > "2) kvm_mmu_page_role.private" above has different meaning. > > a). The fault is private or not. > b). page table the fault handler is walking is private or conventional. > > a.) 
is common for SNP, TDX and PROTECTED_VM. It makes sense in
> kvm_mmu_do_page_fault() and __kvm_faultin_pfn(). After kvm_faultin_pfn(), the
> fault handler can mostly forget it for SNP and PROTECTED_VM. (large page
> adjustment needs it, though.) This is what we're discussing in this thread.
>
> b.) is specific to TDX. TDX KVM MMU introduces one more page table.
>

I don't buy the last sentence. Even if it's not necessary for AMD from
hardware's perspective, the concept remains true for AMD too. So why can't
we use it for AMD?

Note "gfn_shared_mask" and kvm_mmu_page_role.private are kinda redundant
too. I do believe we should avoid introducing a lot of fields to KVM
*common* code due to vendor differences (i.e., in this case,
"gfn_shared_mask" and "mmu_private_fault_mask").
On Thu, Jun 22, 2023 at 11:52:56PM +0000, "Huang, Kai" <kai.huang@intel.com> wrote: > On Thu, 2023-06-22 at 16:39 -0700, Isaku Yamahata wrote: > > On Thu, Jun 22, 2023 at 10:31:08PM +0000, > > "Huang, Kai" <kai.huang@intel.com> wrote: > > > > > > If there are better ways to handle *how* > > > > that's done I don't have any complaints there, but moving/adding bits > > > > to GPA/error_flags after fault time just seems unecessary to me when > > > > fault->is_private field can serve that purpose just as well. > > > > > > Perhaps you missed my point. My point is arch.mmu_private_fault_mask and > > > arch.gfn_shared_mask seem redundant because the logic around them are exactly > > > the same. I do believe we should have fault->is_private passing to the common > > > MMU code. > > > > > > In fact, now I am wondering why we need to have "mmu_private_fault_mask" and > > > "gfn_shared_mask" in _common_ KVM MMU code. We already have enough mechanism in > > > KVM common code: > > > > > > 1) fault->is_private > > > 2) kvm_mmu_page_role.private > > > 3) an Xarray to tell whether a GFN is private or shared > > > > > > I am not convinced that we need to have "mmu_private_fault_mask" and > > > "gfn_shared_mask" in common KVM MMU code. Instead, they can be in AMD and > > > Intel's vendor code. > > > > > > Maybe it makes sense to have "gfn_shared_mask" in the KVM common code so that > > > the fault handler can just strip away the "shared bit" at the very beginning (at > > > least for TDX), but for the rest of the time I think we should already have > > > enough infrastructure to handle private/shared mapping. > > > > > > Btw, one minor issue is, if I recall correctly, for TDX the shared bit must be > > > applied to the GFN for shared mapping in normal EPT. I guess AMD doesn't need > > > that for shared mapping. So "gfn_shared_mask" maybe useful in this case, but > > > w/o it I believe we can also achieve in another way via vendor callback. 
> > > > > > "2) kvm_mmu_page_role.private" above has different meaning. > > > > a). The fault is private or not. > > b). page table the fault handler is walking is private or conventional. > > > > a.) is common for SNP, TDX and PROTECTED_VM. It makes sense in > > kvm_mmu_do_page_fault() and __kvm_faultin_pfn(). After kvm_faultin_pfn(), the > > fault handler can mostly forget it for SNP and PROTECTED_VM. (large page > > adjustment needs it, though.) This is what we're discussing in this thread. > > > > b.) is specific to TDX. TDX KVM MMU introduces one more page table. > > > > > > I don't buy the last sentence. Even it's not necessarily for AMD from > hardware's perspective, but the concept remains true for AMD too. So why cannot > we use it for AMD? We can use it for AMD. Let me rephrase it. TDX only uses it now. SEV-SNP may or may not use it at their option.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b3bd24f2a390..c26f76641121 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1445,6 +1445,13 @@ struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	/*
+	 * When set, used to determine whether a fault should be treated as
+	 * private in the context of protected VMs which use a separate gmem
+	 * pool to back private guest pages.
+	 */
+	u64 mmu_private_fault_mask;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 780b91e1da9f..9b9e75aa43f4 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -252,6 +252,39 @@ struct kvm_page_fault {
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
+static bool kvm_mmu_fault_is_private(struct kvm *kvm, gpa_t gpa, u64 err)
+{
+	struct kvm_memory_slot *slot;
+	bool private_fault = false;
+	gfn_t gfn = gpa_to_gfn(gpa);
+
+	slot = gfn_to_memslot(kvm, gfn);
+	if (!slot) {
+		pr_debug("%s: no slot, GFN: 0x%llx\n", __func__, gfn);
+		goto out;
+	}
+
+	if (!kvm_slot_can_be_private(slot)) {
+		pr_debug("%s: slot is not private, GFN: 0x%llx\n", __func__, gfn);
+		goto out;
+	}
+
+	if (kvm->arch.mmu_private_fault_mask) {
+		private_fault = !!(err & kvm->arch.mmu_private_fault_mask);
+		goto out;
+	}
+
+	/*
+	 * Handling below is for UPM self-tests and guests that treat userspace
+	 * as the authority on whether a fault should be private or not.
+	 */
+	private_fault = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
+
+out:
+	pr_debug("%s: GFN: 0x%llx, private: %d\n", __func__, gfn, private_fault);
+	return private_fault;
+}
+
 /*
  * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
  * and of course kvm_mmu_do_page_fault().
@@ -301,7 +334,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
-		.is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
+		.is_private = kvm_mmu_fault_is_private(vcpu->kvm, cr2_or_gpa, err),
 	};
 	int r;
This will be used to determine whether or not an #NPF should be serviced
using a normal page vs. a guarded/gmem one.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm_host.h | 7 +++++++
 arch/x86/kvm/mmu/mmu_internal.h | 35 ++++++++++++++++++++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)