Message ID | 20210707183616.5620-34-brijesh.singh@amd.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand |
On Wed, Jul 07, 2021, Brijesh Singh wrote: > +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level) I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out, e.g. __snp_handle_page_state_change() or whatever. I had a hell of a time figuring out what PSC was the first time I saw it in some random context. > +{ > + struct kvm *kvm = vcpu->kvm; > + int rc, tdp_level; > + kvm_pfn_t pfn; > + gpa_t gpa_end; > + > + gpa_end = gpa + page_level_size(level); > + > + while (gpa < gpa_end) { > + /* > + * Get the pfn and level for the gpa from the nested page table. > + * > + * If the TDP walk failed, then its safe to say that we don't have a valid > + * mapping for the gpa in the nested page table. Create a fault to map the > + * page is nested page table. > + */ > + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) { > + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level); > + if (is_error_noslot_pfn(pfn)) > + goto out; > + > + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) > + goto out; > + } > + > + /* Adjust the level so that we don't go higher than the backing page level */ > + level = min_t(size_t, level, tdp_level); > + > + write_lock(&kvm->mmu_lock); Retrieving the PFN and level outside of mmu_lock is not correct. Because the pages are pinned and the VMM is not malicious, it will function as intended, but it is far from correct. The overall approach also feels wrong, e.g. a guest won't be able to convert a 2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the past (from a different conversion). I'd also strongly prefer to have a common flow between SNP and TDX for converting between shared/prviate. I'll circle back to this next week, it'll probably take a few hours of staring to figure out a solution, if a common one for SNP+TDX is even possible. > + > + switch (op) { > + case SNP_PAGE_STATE_SHARED: > + rc = snp_make_page_shared(vcpu, gpa, pfn, level); > + break; > + case SNP_PAGE_STATE_PRIVATE: > + rc = snp_make_page_private(vcpu, gpa, pfn, level); > + break; > + default: > + rc = -EINVAL; > + break; > + } > + > + write_unlock(&kvm->mmu_lock); > + > + if (rc) { > + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n", > + op, gpa, pfn, level, rc); > + goto out; > + } > + > + gpa = gpa + page_level_size(level); > + } > + > +out: > + return rc; > +} > + > static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) > { > struct vmcb_control_area *control = &svm->vmcb->control; > @@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) > GHCB_MSR_INFO_POS); > break; > } > + case GHCB_MSR_PSC_REQ: { > + gfn_t gfn; > + int ret; > + u8 op; > + > + gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS); > + op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS); > + > + ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K); > + > + /* If failed to change the state then spec requires to return all F's */ That doesn't mesh with what I could find: o 0x015 – SNP Page State Change Response ▪ GHCBData[63:32] – Error code ▪ GHCBData[31:12] – Reserved, must be zero Written by the hypervisor in response to a Page State Change request. Any non- zero value for the error code indicates that the page state change was not successful. And if "all Fs" is indeed the error code, 'int ret' probably only works by luck since the return value is a 64-bit value, where as ret is a 32-bit signed int. > + if (ret) > + ret = -1; Uh, this is fubar. You've created a shadow of 'ret', i.e. the outer ret is likely uninitialized. > + > + set_ghcb_msr_bits(svm, ret, GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS); > + set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS); > + set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS); > + break; > + } > case GHCB_MSR_TERM_REQ: { > u64 reason_set, reason_code; > > -- > 2.17.1 >
On 7/16/21 4:00 PM, Sean Christopherson wrote: > On Wed, Jul 07, 2021, Brijesh Singh wrote: >> +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level) > > I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out, > e.g. __snp_handle_page_state_change() or whatever. I had a hell of a time figuring > out what PSC was the first time I saw it in some random context. Based on the previous review feedback I renamed from __snp_handle_page_state_change to __snp_handle_psc(). I will see what others say and based on that will rename accordingly. > >> +{ >> + struct kvm *kvm = vcpu->kvm; >> + int rc, tdp_level; >> + kvm_pfn_t pfn; >> + gpa_t gpa_end; >> + >> + gpa_end = gpa + page_level_size(level); >> + >> + while (gpa < gpa_end) { >> + /* >> + * Get the pfn and level for the gpa from the nested page table. >> + * >> + * If the TDP walk failed, then its safe to say that we don't have a valid >> + * mapping for the gpa in the nested page table. Create a fault to map the >> + * page is nested page table. >> + */ >> + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) { >> + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level); >> + if (is_error_noslot_pfn(pfn)) >> + goto out; >> + >> + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) >> + goto out; >> + } >> + >> + /* Adjust the level so that we don't go higher than the backing page level */ >> + level = min_t(size_t, level, tdp_level); >> + >> + write_lock(&kvm->mmu_lock); > > Retrieving the PFN and level outside of mmu_lock is not correct. Because the > pages are pinned and the VMM is not malicious, it will function as intended, but > it is far from correct. > Good point, I should have retrieved the pfn and level inside the lock. > The overall approach also feels wrong, e.g. a guest won't be able to convert a > 2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the > past (from a different conversion). > Maybe I am missing something, I am not able to follow 'guest won't be able to convert a 2mb chunk back to a 2mb large page'. The page-size used inside the guest have to relationship with the RMP/NPT page-size. e.g, a guest can validate the page range as a 4k and still map the page range as a 2mb or 1gb in its pagetable. > I'd also strongly prefer to have a common flow between SNP and TDX for converting > between shared/prviate. > > I'll circle back to this next week, it'll probably take a few hours of staring > to figure out a solution, if a common one for SNP+TDX is even possible. > Sounds good. >> + >> + switch (op) { >> + case SNP_PAGE_STATE_SHARED: >> + rc = snp_make_page_shared(vcpu, gpa, pfn, level); >> + break; >> + case SNP_PAGE_STATE_PRIVATE: >> + rc = snp_make_page_private(vcpu, gpa, pfn, level); >> + break; >> + default: >> + rc = -EINVAL; >> + break; >> + } >> + >> + write_unlock(&kvm->mmu_lock); >> + >> + if (rc) { >> + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n", >> + op, gpa, pfn, level, rc); >> + goto out; >> + } >> + >> + gpa = gpa + page_level_size(level); >> + } >> + >> +out: >> + return rc; >> +} >> + >> static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) >> { >> struct vmcb_control_area *control = &svm->vmcb->control; >> @@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) >> GHCB_MSR_INFO_POS); >> break; >> } >> + case GHCB_MSR_PSC_REQ: { >> + gfn_t gfn; >> + int ret; >> + u8 op; >> + >> + gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS); >> + op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS); >> + >> + ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K); >> + >> + /* If failed to change the state then spec requires to return all F's */ > > That doesn't mesh with what I could find: > > o 0x015 – SNP Page State Change Response > ▪ GHCBData[63:32] – Error code > ▪ GHCBData[31:12] – Reserved, must be zero > Written by the hypervisor in response to a Page State Change request. Any non- > zero value for the error code indicates that the page state change was not > successful. > > And if "all Fs" is indeed the error code, 'int ret' probably only works by luck > since the return value is a 64-bit value, where as ret is a 32-bit signed int. > >> + if (ret) >> + ret = -1; > > Uh, this is fubar. You've created a shadow of 'ret', i.e. the outer ret is likely > uninitialized. > Ah, let me fix it in next rev. thanks
On Mon, Jul 19, 2021, Brijesh Singh wrote: > > On 7/16/21 4:00 PM, Sean Christopherson wrote: > > On Wed, Jul 07, 2021, Brijesh Singh wrote: > > > +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level) > > > > I can live with e.g. GHCB_MSR_PSC_REQ, but I'd strongly prefer to spell this out, > > e.g. __snp_handle_page_state_change() or whatever. I had a hell of a time figuring > > out what PSC was the first time I saw it in some random context. > > Based on the previous review feedback I renamed from > __snp_handle_page_state_change to __snp_handle_psc(). I will see what others > say and based on that will rename accordingly. I've no objection to using PSC for enums and whatnot, and I'll happily defer to Boris for functions in the core kernel and guest, but for KVM I'd really like to spell out the name for the two or so main handler functions. > > > + while (gpa < gpa_end) { > > > + /* > > > + * Get the pfn and level for the gpa from the nested page table. > > > + * > > > + * If the TDP walk failed, then its safe to say that we don't have a valid > > > + * mapping for the gpa in the nested page table. Create a fault to map the > > > + * page is nested page table. > > > + */ > > > + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) { > > > + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level); > > > + if (is_error_noslot_pfn(pfn)) > > > + goto out; > > > + > > > + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) > > > + goto out; > > > + } > > > + > > > + /* Adjust the level so that we don't go higher than the backing page level */ > > > + level = min_t(size_t, level, tdp_level); > > > + > > > + write_lock(&kvm->mmu_lock); > > > > Retrieving the PFN and level outside of mmu_lock is not correct. Because the > > pages are pinned and the VMM is not malicious, it will function as intended, but > > it is far from correct. > > Good point, I should have retrieved the pfn and level inside the lock. > > > The overall approach also feels wrong, e.g. a guest won't be able to convert a > > 2mb chunk back to a 2mb large page if KVM mapped the GPA as a 4kb page in the > > past (from a different conversion). > > > > Maybe I am missing something, I am not able to follow 'guest won't be able > to convert a 2mb chunk back to a 2mb large page'. The page-size used inside > the guest have to relationship with the RMP/NPT page-size. e.g, a guest can > validate the page range as a 4k and still map the page range as a 2mb or 1gb > in its pagetable. The proposed code walks KVM's TDP and adjusts the RMP level to be the min of the guest+host levels. Once KVM has installed a 4kb TDP SPTE, that walk will find the 4kb TDP SPTE and thus operate on the RMP at a 4kb granularity. To allow full restoration of 2mb PTE+SPTE+RMP, KVM needs to zap the 4kb SPTE(s) at some point to allow rebuilding a 2mb SPTE.
On 7/19/21 1:55 PM, Sean Christopherson wrote: > > I've no objection to using PSC for enums and whatnot, and I'll happily defer to > Boris for functions in the core kernel and guest, but for KVM I'd really like to > spell out the name for the two or so main handler functions. Noted. >> >> Maybe I am missing something, I am not able to follow 'guest won't be able >> to convert a 2mb chunk back to a 2mb large page'. The page-size used inside >> the guest have to relationship with the RMP/NPT page-size. e.g, a guest can >> validate the page range as a 4k and still map the page range as a 2mb or 1gb >> in its pagetable. > > The proposed code walks KVM's TDP and adjusts the RMP level to be the min of the > guest+host levels. Once KVM has installed a 4kb TDP SPTE, that walk will find > the 4kb TDP SPTE and thus operate on the RMP at a 4kb granularity. To allow full > restoration of 2mb PTE+SPTE+RMP, KVM needs to zap the 4kb SPTE(s) at some point > to allow rebuilding a 2mb SPTE. > Ah I see. In that case, SNP firmware provides a command "SNP_PAGE_UNMASH" that can be used by the hypervisor to combines the multiple 4k entry into a single 2mb without affecting the validation. -Brijesh
On Mon, Jul 19, 2021 at 06:55:54PM +0000, Sean Christopherson wrote: > I've no objection to using PSC for enums and whatnot, and I'll happily > defer to Boris for functions in the core kernel and guest, but for KVM > I'd really like to spell out the name for the two or so main handler > functions. Well, - we abbreviate things in the kernel all the time - this is no exception. We don't name it secure_nested_paging_handle_page_state_change() either. :-) - If you don't know what PSC means, now you do :-P Also, comments above the function names help. - I asked for the shortening because those function names were a mouthful and when you see snp_handle_page_state_change() I wonder, it is "handling", is it "changing" or what is it doing. And the page state change is a thing so the "handle" is important here. Thus the snp_handle_psc() variant. Also, shorter function names make the code more readable. I'm sure you've read firmware code or other OS code before to understand what I mean. Anyway, I felt I should give my arguments why I requested the shortening. Thx.
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h index 6990d5a9d73c..2561413cb316 100644 --- a/arch/x86/include/asm/sev-common.h +++ b/arch/x86/include/asm/sev-common.h @@ -81,6 +81,9 @@ #define GHCB_MSR_PSC_RESP 0x015 #define GHCB_MSR_PSC_ERROR_POS 32 +#define GHCB_MSR_PSC_ERROR_MASK GENMASK_ULL(31, 0) +#define GHCB_MSR_PSC_RSVD_POS 12 +#define GHCB_MSR_PSC_RSVD_MASK GENMASK_ULL(19, 0) #define GHCB_MSR_PSC_RESP_VAL(val) ((val) >> GHCB_MSR_PSC_ERROR_POS) /* GHCB Hypervisor Feature Request */ diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 3af5d1ad41bf..68d275b2a660 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -28,6 +28,7 @@ #include "svm_ops.h" #include "cpuid.h" #include "trace.h" +#include "mmu.h" #define __ex(x) __kvm_handle_fault_on_reboot(x) @@ -2843,6 +2844,127 @@ static void set_ghcb_msr(struct vcpu_svm *svm, u64 value) svm->vmcb->control.ghcb_gpa = value; } +static int snp_rmptable_psmash(struct kvm_vcpu *vcpu, kvm_pfn_t pfn) +{ + pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1); + + return psmash(pfn_to_page(pfn)); +} + +static int snp_make_page_shared(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level) +{ + struct rmpupdate val; + int rc, rmp_level; + struct rmpentry *e; + + e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level); + if (!e) + return -EINVAL; + + if (!rmpentry_assigned(e)) + return 0; + + /* Log if the entry is validated */ + if (rmpentry_validated(e)) + pr_warn_ratelimited("Remove RMP entry for a validated gpa 0x%llx\n", gpa); + + /* + * Is the page part of an existing 2M RMP entry ? Split the 2MB into multiple + * of 4K-page before making the memory shared. + */ + if ((level == PG_LEVEL_4K) && (rmp_level == PG_LEVEL_2M)) { + rc = snp_rmptable_psmash(vcpu, pfn); + if (rc) + return rc; + } + + memset(&val, 0, sizeof(val)); + val.pagesize = X86_TO_RMP_PG_LEVEL(level); + return rmpupdate(pfn_to_page(pfn), &val); +} + +static int snp_make_page_private(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn, int level) +{ + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info; + struct rmpupdate val; + struct rmpentry *e; + int rmp_level; + + e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level); + if (!e) + return -EINVAL; + + /* Log if the entry is validated */ + if (rmpentry_validated(e)) + pr_warn_ratelimited("Asked to make a pre-validated gpa %llx private\n", gpa); + + memset(&val, 0, sizeof(val)); + val.gpa = gpa; + val.asid = sev->asid; + val.pagesize = X86_TO_RMP_PG_LEVEL(level); + val.assigned = true; + + return rmpupdate(pfn_to_page(pfn), &val); +} + +static int __snp_handle_psc(struct kvm_vcpu *vcpu, int op, gpa_t gpa, int level) +{ + struct kvm *kvm = vcpu->kvm; + int rc, tdp_level; + kvm_pfn_t pfn; + gpa_t gpa_end; + + gpa_end = gpa + page_level_size(level); + + while (gpa < gpa_end) { + /* + * Get the pfn and level for the gpa from the nested page table. + * + * If the TDP walk failed, then its safe to say that we don't have a valid + * mapping for the gpa in the nested page table. Create a fault to map the + * page is nested page table. + */ + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) { + pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_USER_MASK, level); + if (is_error_noslot_pfn(pfn)) + goto out; + + if (!kvm_mmu_get_tdp_walk(vcpu, gpa, &pfn, &tdp_level)) + goto out; + } + + /* Adjust the level so that we don't go higher than the backing page level */ + level = min_t(size_t, level, tdp_level); + + write_lock(&kvm->mmu_lock); + + switch (op) { + case SNP_PAGE_STATE_SHARED: + rc = snp_make_page_shared(vcpu, gpa, pfn, level); + break; + case SNP_PAGE_STATE_PRIVATE: + rc = snp_make_page_private(vcpu, gpa, pfn, level); + break; + default: + rc = -EINVAL; + break; + } + + write_unlock(&kvm->mmu_lock); + + if (rc) { + pr_err_ratelimited("Error op %d gpa %llx pfn %llx level %d rc %d\n", + op, gpa, pfn, level, rc); + goto out; + } + + gpa = gpa + page_level_size(level); + } + +out: + return rc; +} + static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) { struct vmcb_control_area *control = &svm->vmcb->control; @@ -2941,6 +3063,25 @@ static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) GHCB_MSR_INFO_POS); break; } + case GHCB_MSR_PSC_REQ: { + gfn_t gfn; + int ret; + u8 op; + + gfn = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_GFN_MASK, GHCB_MSR_PSC_GFN_POS); + op = get_ghcb_msr_bits(svm, GHCB_MSR_PSC_OP_MASK, GHCB_MSR_PSC_OP_POS); + + ret = __snp_handle_psc(vcpu, op, gfn_to_gpa(gfn), PG_LEVEL_4K); + + /* If failed to change the state then spec requires to return all F's */ + if (ret) + ret = -1; + + set_ghcb_msr_bits(svm, ret, GHCB_MSR_PSC_ERROR_MASK, GHCB_MSR_PSC_ERROR_POS); + set_ghcb_msr_bits(svm, 0, GHCB_MSR_PSC_RSVD_MASK, GHCB_MSR_PSC_RSVD_POS); + set_ghcb_msr_bits(svm, GHCB_MSR_PSC_RESP, GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS); + break; + } case GHCB_MSR_TERM_REQ: { u64 reason_set, reason_code;
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP table to be private or shared using the Page State Change MSR protocol as defined in the GHCB specification. Before changing the page state in the RMP entry, we lookup the page in the TDP to make sure that there is a valid mapping for it. If the mapping exist then try to find a workable page level between the TDP and RMP for the page. If the page is not mapped in the TDP, then create a fault such that it gets mapped before we change the page state in the RMP entry. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> --- arch/x86/include/asm/sev-common.h | 3 + arch/x86/kvm/svm/sev.c | 141 ++++++++++++++++++++++++++++++ 2 files changed, 144 insertions(+)