Message ID | 20230220183847.59159-48-michael.roth@amd.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Headers | show |
Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand |
On 20.02.23 19:38, Michael Roth wrote: > From: Tom Lendacky <thomas.lendacky@amd.com> > > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP > guests to alter the register state of the APs on their own. This allows > the guest a way of simulating INIT-SIPI. > > A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used > so as to avoid updating the VMSA pointer while the vCPU is running. > > For CREATE > The guest supplies the GPA of the VMSA to be used for the vCPU with > the specified APIC ID. The GPA is saved in the svm struct of the > target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added > to the vCPU and then the vCPU is kicked. > > For CREATE_ON_INIT: > The guest supplies the GPA of the VMSA to be used for the vCPU with > the specified APIC ID the next time an INIT is performed. The GPA is > saved in the svm struct of the target vCPU. > > For DESTROY: > The guest indicates it wishes to stop the vCPU. The GPA is cleared > from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is > added to vCPU and then the vCPU is kicked. > > The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked > as a result of the event or as a result of an INIT. The handler sets the > vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will > leave the vCPU as not runnable. Any previous VMSA pages that were > installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If > a new VMSA is to be installed, the VMSA guest page is pinned and set as > the VMSA in the vCPU VMCB and the vCPU state is set to > KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is > cleared in the vCPU VMCB and the vCPU state is left as > KVM_MP_STATE_UNINITIALIZED to prevent it from being run. > > Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> > [mdr: add handling for restrictedmem] > Signed-off-by: Michael Roth <michael.roth@amd.com> What is the intended boot sequence for SEV-SNP guests? FWIW with this interface in place, guests will typically use in-guest VMSA pages to hold secondary vcpu state. But that means we're now allocating 4kb of memory for every vcpu that we create that will be for most of the guest's lifetime superfluous. Wouldn't it make more sense to have a model where we only allocate the VMSA for the boot CPU and leave secondary allocation to the guest? We already need firmware changes for SEV-SNP - may as well make this one more. [...] > + > +static int sev_snp_ap_creation(struct vcpu_svm *svm) > +{ > + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; > + struct kvm_vcpu *vcpu = &svm->vcpu; > + struct kvm_vcpu *target_vcpu; > + struct vcpu_svm *target_svm; > + unsigned int request; > + unsigned int apic_id; > + bool kick; > + int ret; > + > + request = lower_32_bits(svm->vmcb->control.exit_info_1); > + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); > + > + /* Validate the APIC ID */ > + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); Out of curiosity: The target CPU can be my own vCPU, right? > + if (!target_vcpu) { > + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", > + apic_id); > + return -EINVAL; > + } > + > + ret = 0; > + > + target_svm = to_svm(target_vcpu); > + > + /* > + * The target vCPU is valid, so the vCPU will be kicked unless the > + * request is for CREATE_ON_INIT. For any errors at this stage, the > + * kick will place the vCPU in an non-runnable state. > + */ > + kick = true; > + > + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); > + > + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; > + target_svm->sev_es.snp_ap_create = true; > + > + /* Interrupt injection mode shouldn't change for AP creation */ > + if (request < SVM_VMGEXIT_AP_DESTROY) { > + u64 sev_features; > + > + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; > + sev_features ^= sev->sev_features; > + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { > + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", > + vcpu->arch.regs[VCPU_REGS_RAX]); > + ret = -EINVAL; > + goto out; > + } > + } > + > + switch (request) { > + case SVM_VMGEXIT_AP_CREATE_ON_INIT: > + kick = false; > + fallthrough; > + case SVM_VMGEXIT_AP_CREATE: > + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { > + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", > + svm->vmcb->control.exit_info_2); > + ret = -EINVAL; > + goto out; > + } > + > + /* > + * Malicious guest can RMPADJUST a large page into VMSA which > + * will hit the SNP erratum where the CPU will incorrectly signal > + * an RMP violation #PF if a hugepage collides with the RMP entry > + * of VMSA page, reject the AP CREATE request if VMSA address from > + * guest is 2M aligned. This will break genuine current Linux kernels that just happen to allocate a guest page, no? In fact, given enough vCPUs you're almost guaranteed to hit an aligned structure somewhere. What is the guest supposed to do in that situation? > + */ > + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) { > + vcpu_unimpl(vcpu, > + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n", > + svm->vmcb->control.exit_info_2); > + ret = -EINVAL; > + goto out; > + } > + > + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2; > + break; > + case SVM_VMGEXIT_AP_DESTROY: I don't understand the destroy path. Why does this case destroy anything? > + break; > + default: > + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n", > + request); > + ret = -EINVAL; > + break; > + } > + > +out: > + if (kick) { > + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) > + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; What if the guest AP goes through a create -> destroy -> create cycle? Will it stay runnable while destroyed? Alex > + > + kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu); > + kvm_vcpu_kick(target_vcpu); > + } > + > + mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex); > + > + return ret; > +} > + > static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) > { > struct vmcb_control_area *control = &svm->vmcb->control; Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Fri, 24 Feb 2023 13:37:48 +0100 Alexander Graf <graf@amazon.com> wrote: > > On 20.02.23 19:38, Michael Roth wrote: > > From: Tom Lendacky <thomas.lendacky@amd.com> > > > > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP > > guests to alter the register state of the APs on their own. This allows > > the guest a way of simulating INIT-SIPI. > > > > A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used > > so as to avoid updating the VMSA pointer while the vCPU is running. > > > > For CREATE > > The guest supplies the GPA of the VMSA to be used for the vCPU with > > the specified APIC ID. The GPA is saved in the svm struct of the > > target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added > > to the vCPU and then the vCPU is kicked. > > > > For CREATE_ON_INIT: > > The guest supplies the GPA of the VMSA to be used for the vCPU with > > the specified APIC ID the next time an INIT is performed. The GPA is > > saved in the svm struct of the target vCPU. > > > > For DESTROY: > > The guest indicates it wishes to stop the vCPU. The GPA is cleared > > from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is > > added to vCPU and then the vCPU is kicked. > > > > The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked > > as a result of the event or as a result of an INIT. The handler sets the > > vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will > > leave the vCPU as not runnable. Any previous VMSA pages that were > > installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If > > a new VMSA is to be installed, the VMSA guest page is pinned and set as > > the VMSA in the vCPU VMCB and the vCPU state is set to > > KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is > > cleared in the vCPU VMCB and the vCPU state is left as > > KVM_MP_STATE_UNINITIALIZED to prevent it from being run. > > > > Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> > > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> > > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> > > [mdr: add handling for restrictedmem] > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > What is the intended boot sequence for SEV-SNP guests? FWIW with this > interface in place, guests will typically use in-guest VMSA pages to > hold secondary vcpu state. But that means we're now allocating 4kb of > memory for every vcpu that we create that will be for most of the > guest's lifetime superfluous. > > Wouldn't it make more sense to have a model where we only allocate the > VMSA for the boot CPU and leave secondary allocation to the guest? We > already need firmware changes for SEV-SNP - may as well make this one more. > > [...] > > > + > > +static int sev_snp_ap_creation(struct vcpu_svm *svm) > > +{ > > + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; > > + struct kvm_vcpu *vcpu = &svm->vcpu; > > + struct kvm_vcpu *target_vcpu; > > + struct vcpu_svm *target_svm; > > + unsigned int request; > > + unsigned int apic_id; > > + bool kick; > > + int ret; > > + > > + request = lower_32_bits(svm->vmcb->control.exit_info_1); > > + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); > > + > > + /* Validate the APIC ID */ > > + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); > > > Out of curiosity: The target CPU can be my own vCPU, right? > > > > + if (!target_vcpu) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", > > + apic_id); > > + return -EINVAL; > > + } > > + > > + ret = 0; > > + > > + target_svm = to_svm(target_vcpu); > > + > > + /* > > + * The target vCPU is valid, so the vCPU will be kicked unless the > > + * request is for CREATE_ON_INIT. For any errors at this stage, the > > + * kick will place the vCPU in an non-runnable state. > > + */ > > + kick = true; > > + > > + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); > > + > > + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; > > + target_svm->sev_es.snp_ap_create = true; > > + > > + /* Interrupt injection mode shouldn't change for AP creation */ > > + if (request < SVM_VMGEXIT_AP_DESTROY) { > > + u64 sev_features; > > + > > + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; > > + sev_features ^= sev->sev_features; > > + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", > > + vcpu->arch.regs[VCPU_REGS_RAX]); > > + ret = -EINVAL; > > + goto out; > > + } > > + } > > + > > + switch (request) { > > + case SVM_VMGEXIT_AP_CREATE_ON_INIT: > > + kick = false; > > + fallthrough; > > + case SVM_VMGEXIT_AP_CREATE: > > + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", > > + svm->vmcb->control.exit_info_2); > > + ret = -EINVAL; > > + goto out; > > + } > > + > > + /* > > + * Malicious guest can RMPADJUST a large page into VMSA which > > + * will hit the SNP erratum where the CPU will incorrectly signal > > + * an RMP violation #PF if a hugepage collides with the RMP entry > > + * of VMSA page, reject the AP CREATE request if VMSA address from > > + * guest is 2M aligned. > > > This will break genuine current Linux kernels that just happen to > allocate a guest page, no? In fact, given enough vCPUs you're almost > guaranteed to hit an aligned structure somewhere. What is the guest > supposed to do in that situation? > > > > + */ > > + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) { > > + vcpu_unimpl(vcpu, > > + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n", > > + svm->vmcb->control.exit_info_2); > > + ret = -EINVAL; > > + goto out; > > + } > > + > > + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2; > > + break; > > + case SVM_VMGEXIT_AP_DESTROY: > > > I don't understand the destroy path. Why does this case destroy anything? > > > > + break; > > + default: > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n", > > + request); > > + ret = -EINVAL; > > + break; > > + } > > + > > +out: > > + if (kick) { > > + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) > > + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; > > > What if the guest AP goes through a create -> destroy -> create cycle? > Will it stay runnable while destroyed? The code is not very straightforward. 1) target_svm->sev_es.snp_vmsa_gpa is set as INVALID_PAGE in the beginning of this function. 2) If a DESTROY is hit in this function, target_svm->sev_es.snp_vmsa_gpa will be left as INVALID_PAGE. 3) At the end of this function, it calls kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE). 4) In the vcpu_enter_guest(), the kvm_vcpu_reset()->sev_snp_init_protected_guest_state() ->__sev_snp_init_protected_guest_state() is called. 5) The mp_state is set to KVM_MP_STATE_STOPPED by default and the runtime VMSA is cleared. Then the it will be initialized according to the guest's configuration. 6) As the snp_vmsa_gpa is set as INVALID_PAGE in 1, the mp_state will be left as KVM_MP_STATE_STOPPED. 7) With this code piece: + kvm_vcpu_reset(vcpu, true); + if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) + goto out; vcpu_enter_guest() bails out. > > > Alex > > > + > > + kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu); > > + kvm_vcpu_kick(target_vcpu); > > + } > > + > > + mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex); > > + > > + return ret; > > +} > > + > > static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) > > { > > struct vmcb_control_area *control = &svm->vmcb->control; > > > > Amazon Development Center Germany GmbH > Krausenstr. 38 > 10117 Berlin > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B > Sitz: Berlin > Ust-ID: DE 289 237 879 > >
On 28.02.23 21:47, Zhi Wang wrote: > On Fri, 24 Feb 2023 13:37:48 +0100 > Alexander Graf <graf@amazon.com> wrote: > >> On 20.02.23 19:38, Michael Roth wrote: >>> From: Tom Lendacky <thomas.lendacky@amd.com> >>> >>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP >>> guests to alter the register state of the APs on their own. This allows >>> the guest a way of simulating INIT-SIPI. >>> >>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used >>> so as to avoid updating the VMSA pointer while the vCPU is running. >>> >>> For CREATE >>> The guest supplies the GPA of the VMSA to be used for the vCPU with >>> the specified APIC ID. The GPA is saved in the svm struct of the >>> target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added >>> to the vCPU and then the vCPU is kicked. >>> >>> For CREATE_ON_INIT: >>> The guest supplies the GPA of the VMSA to be used for the vCPU with >>> the specified APIC ID the next time an INIT is performed. The GPA is >>> saved in the svm struct of the target vCPU. >>> >>> For DESTROY: >>> The guest indicates it wishes to stop the vCPU. The GPA is cleared >>> from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is >>> added to vCPU and then the vCPU is kicked. >>> >>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked >>> as a result of the event or as a result of an INIT. The handler sets the >>> vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will >>> leave the vCPU as not runnable. Any previous VMSA pages that were >>> installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If >>> a new VMSA is to be installed, the VMSA guest page is pinned and set as >>> the VMSA in the vCPU VMCB and the vCPU state is set to >>> KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is >>> cleared in the vCPU VMCB and the vCPU state is left as >>> KVM_MP_STATE_UNINITIALIZED to prevent it from being run. >>> >>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> >>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> >>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> >>> [mdr: add handling for restrictedmem] >>> Signed-off-by: Michael Roth <michael.roth@amd.com> >> >> What is the intended boot sequence for SEV-SNP guests? FWIW with this >> interface in place, guests will typically use in-guest VMSA pages to >> hold secondary vcpu state. But that means we're now allocating 4kb of >> memory for every vcpu that we create that will be for most of the >> guest's lifetime superfluous. >> >> Wouldn't it make more sense to have a model where we only allocate the >> VMSA for the boot CPU and leave secondary allocation to the guest? We >> already need firmware changes for SEV-SNP - may as well make this one more. >> >> [...] >> >>> + >>> +static int sev_snp_ap_creation(struct vcpu_svm *svm) >>> +{ >>> + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; >>> + struct kvm_vcpu *vcpu = &svm->vcpu; >>> + struct kvm_vcpu *target_vcpu; >>> + struct vcpu_svm *target_svm; >>> + unsigned int request; >>> + unsigned int apic_id; >>> + bool kick; >>> + int ret; >>> + >>> + request = lower_32_bits(svm->vmcb->control.exit_info_1); >>> + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); >>> + >>> + /* Validate the APIC ID */ >>> + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); >> >> Out of curiosity: The target CPU can be my own vCPU, right? >> >> >>> + if (!target_vcpu) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", >>> + apic_id); >>> + return -EINVAL; >>> + } >>> + >>> + ret = 0; >>> + >>> + target_svm = to_svm(target_vcpu); >>> + >>> + /* >>> + * The target vCPU is valid, so the vCPU will be kicked unless the >>> + * request is for CREATE_ON_INIT. For any errors at this stage, the >>> + * kick will place the vCPU in an non-runnable state. >>> + */ >>> + kick = true; >>> + >>> + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); >>> + >>> + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; >>> + target_svm->sev_es.snp_ap_create = true; >>> + >>> + /* Interrupt injection mode shouldn't change for AP creation */ >>> + if (request < SVM_VMGEXIT_AP_DESTROY) { >>> + u64 sev_features; >>> + >>> + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; >>> + sev_features ^= sev->sev_features; >>> + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", >>> + vcpu->arch.regs[VCPU_REGS_RAX]); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + } >>> + >>> + switch (request) { >>> + case SVM_VMGEXIT_AP_CREATE_ON_INIT: >>> + kick = false; >>> + fallthrough; >>> + case SVM_VMGEXIT_AP_CREATE: >>> + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", >>> + svm->vmcb->control.exit_info_2); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >>> + /* >>> + * Malicious guest can RMPADJUST a large page into VMSA which >>> + * will hit the SNP erratum where the CPU will incorrectly signal >>> + * an RMP violation #PF if a hugepage collides with the RMP entry >>> + * of VMSA page, reject the AP CREATE request if VMSA address from >>> + * guest is 2M aligned. >> >> This will break genuine current Linux kernels that just happen to >> allocate a guest page, no? In fact, given enough vCPUs you're almost >> guaranteed to hit an aligned structure somewhere. What is the guest >> supposed to do in that situation? >> >> >>> + */ >>> + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) { >>> + vcpu_unimpl(vcpu, >>> + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n", >>> + svm->vmcb->control.exit_info_2); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >>> + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2; >>> + break; >>> + case SVM_VMGEXIT_AP_DESTROY: >> >> I don't understand the destroy path. Why does this case destroy anything? >> >> >>> + break; >>> + default: >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n", >>> + request); >>> + ret = -EINVAL; >>> + break; >>> + } >>> + >>> +out: >>> + if (kick) { >>> + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) >>> + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; >> >> What if the guest AP goes through a create -> destroy -> create cycle? >> Will it stay runnable while destroyed? > The code is not very straightforward. > > 1) target_svm->sev_es.snp_vmsa_gpa is set as INVALID_PAGE in the beginning of this function. > > 2) If a DESTROY is hit in this function, target_svm->sev_es.snp_vmsa_gpa will be > left as INVALID_PAGE. > > 3) At the end of this function, it calls kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE). > > 4) In the vcpu_enter_guest(), the kvm_vcpu_reset()->sev_snp_init_protected_guest_state() > ->__sev_snp_init_protected_guest_state() is called. > > 5) The mp_state is set to KVM_MP_STATE_STOPPED by default and the runtime VMSA is > cleared. Then the it will be initialized according to the guest's > configuration. > > 6) As the snp_vmsa_gpa is set as INVALID_PAGE in 1, the mp_state will be left as > KVM_MP_STATE_STOPPED. > > 7) With this code piece: > > + kvm_vcpu_reset(vcpu, true); > + if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) > + goto out; > > vcpu_enter_guest() bails out. Thanks a lot Zhi for the detailed explanation! I think this code flow wants to become slightly more obvious. For example, if we just said case SVM_VMGEXIT_AP_DESTROY: /* This will tell __sev_snp_update_protected_guest_state to unmap the VMSA */ target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; break; We'd get a big win in readability with little effort. It makes it immediately obvious where to look for the destroy operation. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Fri, Feb 24, 2023 at 01:37:48PM +0100, Alexander Graf wrote: > > On 20.02.23 19:38, Michael Roth wrote: > > From: Tom Lendacky <thomas.lendacky@amd.com> > > > > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP > > guests to alter the register state of the APs on their own. This allows > > the guest a way of simulating INIT-SIPI. > > > > A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used > > so as to avoid updating the VMSA pointer while the vCPU is running. > > > > For CREATE > > The guest supplies the GPA of the VMSA to be used for the vCPU with > > the specified APIC ID. The GPA is saved in the svm struct of the > > target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added > > to the vCPU and then the vCPU is kicked. > > > > For CREATE_ON_INIT: > > The guest supplies the GPA of the VMSA to be used for the vCPU with > > the specified APIC ID the next time an INIT is performed. The GPA is > > saved in the svm struct of the target vCPU. > > > > For DESTROY: > > The guest indicates it wishes to stop the vCPU. The GPA is cleared > > from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is > > added to vCPU and then the vCPU is kicked. > > > > The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked > > as a result of the event or as a result of an INIT. The handler sets the > > vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will > > leave the vCPU as not runnable. Any previous VMSA pages that were > > installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If > > a new VMSA is to be installed, the VMSA guest page is pinned and set as > > the VMSA in the vCPU VMCB and the vCPU state is set to > > KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is > > cleared in the vCPU VMCB and the vCPU state is left as > > KVM_MP_STATE_UNINITIALIZED to prevent it from being run. > > > > Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> > > Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> > > Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> > > [mdr: add handling for restrictedmem] > > Signed-off-by: Michael Roth <michael.roth@amd.com> > > > What is the intended boot sequence for SEV-SNP guests? FWIW with this > interface in place, guests will typically use in-guest VMSA pages to hold > secondary vcpu state. But that means we're now allocating 4kb of memory for > every vcpu that we create that will be for most of the guest's lifetime > superfluous. > > Wouldn't it make more sense to have a model where we only allocate the VMSA > for the boot CPU and leave secondary allocation to the guest? We already > need firmware changes for SEV-SNP - may as well make this one more. I don't think we'd necessarily need a firmware change. We could just free original VMSA back to the hypervisor as soon as those APs come online. The down-side to that versus deferring cleaning till guest shutdown is there is some flushing activity (see: sev_flush_encrypted_page()) that would now likely be occuring during guest boot up where the overhead might be more noticeable. But for SNP the host likely supports X86_FEATURE_SME_COHERENT so the overhead probably isn't that bad. > > [...] > > > + > > +static int sev_snp_ap_creation(struct vcpu_svm *svm) > > +{ > > + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; > > + struct kvm_vcpu *vcpu = &svm->vcpu; > > + struct kvm_vcpu *target_vcpu; > > + struct vcpu_svm *target_svm; > > + unsigned int request; > > + unsigned int apic_id; > > + bool kick; > > + int ret; > > + > > + request = lower_32_bits(svm->vmcb->control.exit_info_1); > > + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); > > + > > + /* Validate the APIC ID */ > > + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); > > > Out of curiosity: The target CPU can be my own vCPU, right? I don't think that would be the normal behavior, but maybe with some care it's possible for a guest to do things that way. I haven't seen anything strictly prohibiting this in the relevant specs. > > > > + if (!target_vcpu) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", > > + apic_id); > > + return -EINVAL; > > + } > > + > > + ret = 0; > > + > > + target_svm = to_svm(target_vcpu); > > + > > + /* > > + * The target vCPU is valid, so the vCPU will be kicked unless the > > + * request is for CREATE_ON_INIT. For any errors at this stage, the > > + * kick will place the vCPU in an non-runnable state. > > + */ > > + kick = true; > > + > > + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); > > + > > + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; > > + target_svm->sev_es.snp_ap_create = true; > > + > > + /* Interrupt injection mode shouldn't change for AP creation */ > > + if (request < SVM_VMGEXIT_AP_DESTROY) { > > + u64 sev_features; > > + > > + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; > > + sev_features ^= sev->sev_features; > > + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", > > + vcpu->arch.regs[VCPU_REGS_RAX]); > > + ret = -EINVAL; > > + goto out; > > + } > > + } > > + > > + switch (request) { > > + case SVM_VMGEXIT_AP_CREATE_ON_INIT: > > + kick = false; > > + fallthrough; > > + case SVM_VMGEXIT_AP_CREATE: > > + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", > > + svm->vmcb->control.exit_info_2); > > + ret = -EINVAL; > > + goto out; > > + } > > + > > + /* > > + * Malicious guest can RMPADJUST a large page into VMSA which > > + * will hit the SNP erratum where the CPU will incorrectly signal > > + * an RMP violation #PF if a hugepage collides with the RMP entry > > + * of VMSA page, reject the AP CREATE request if VMSA address from > > + * guest is 2M aligned. > > > This will break genuine current Linux kernels that just happen to allocate a > guest page, no? In fact, given enough vCPUs you're almost guaranteed to hit > an aligned structure somewhere. What is the guest supposed to do in that > situation? The initial SNP support for guest kernels already made use of snp_alloc_vmsa_page() to do the appropriate workaround to avoid allocating 2MB-aligned VMSA pages. -Mike > > > > Amazon Development Center Germany GmbH > Krausenstr. 38 > 10117 Berlin > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B > Sitz: Berlin > Ust-ID: DE 289 237 879 > >
On Wed, Mar 01, 2023 at 10:14:19PM +0100, Alexander Graf wrote: > > On 28.02.23 21:47, Zhi Wang wrote: > > On Fri, 24 Feb 2023 13:37:48 +0100 > > Alexander Graf <graf@amazon.com> wrote: > > > > > On 20.02.23 19:38, Michael Roth wrote: > > > > From: Tom Lendacky <thomas.lendacky@amd.com> > > > > > > > > Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP > > > > guests to alter the register state of the APs on their own. This allows > > > > the guest a way of simulating INIT-SIPI. <snip> > > > > + */ > > > > + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) { > > > > + vcpu_unimpl(vcpu, > > > > + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n", > > > > + svm->vmcb->control.exit_info_2); > > > > + ret = -EINVAL; > > > > + goto out; > > > > + } > > > > + > > > > + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2; > > > > + break; > > > > + case SVM_VMGEXIT_AP_DESTROY: > > > > > > I don't understand the destroy path. Why does this case destroy anything? > > > > > > > > > > + break; > > > > + default: > > > > + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n", > > > > + request); > > > > + ret = -EINVAL; > > > > + break; > > > > + } > > > > + > > > > +out: > > > > + if (kick) { > > > > + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) > > > > + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; > > > > > > What if the guest AP goes through a create -> destroy -> create cycle? > > > Will it stay runnable while destroyed? > > The code is not very straightforward. > > > > 1) target_svm->sev_es.snp_vmsa_gpa is set as INVALID_PAGE in the beginning of this function. > > > > 2) If a DESTROY is hit in this function, target_svm->sev_es.snp_vmsa_gpa will be > > left as INVALID_PAGE. > > > > 3) At the end of this function, it calls kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE). > > > > 4) In the vcpu_enter_guest(), the kvm_vcpu_reset()->sev_snp_init_protected_guest_state() > > ->__sev_snp_init_protected_guest_state() is called. > > > > 5) The mp_state is set to KVM_MP_STATE_STOPPED by default and the runtime VMSA is > > cleared. Then the it will be initialized according to the guest's > > configuration. > > > > 6) As the snp_vmsa_gpa is set as INVALID_PAGE in 1, the mp_state will be left as > > KVM_MP_STATE_STOPPED. > > > > 7) With this code piece: > > > > + kvm_vcpu_reset(vcpu, true); > > + if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) > > + goto out; > > > > vcpu_enter_guest() bails out. > > > Thanks a lot Zhi for the detailed explanation! I think this code flow wants > to become slightly more obvious. For example, if we just said > > case SVM_VMGEXIT_AP_DESTROY: > /* This will tell __sev_snp_update_protected_guest_state to unmap the > VMSA */ > target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; > break; > > We'd get a big win in readability with little effort. It makes it > immediately obvious where to look for the destroy operation. The target->snp_vmsa_gpa value mainly serves to cache the GPA of the new VMSA up until the point where KVM points the target vCPU's VMCB over to using it as part of sev_snp_init_protected_guest_state(). We want to make sure it is either INVALID_PAGE, or whatever GPA the guest want us to set it to. Once sev_snp_init_protected_guest_state() processes the update, it will set it back to INVALID_PAGE. So if we initialized it once during VM creation it wouldn't be necessary to keep setting it to INVALID_PAGE at all, except for one case ... There is a window where the guest could do an AP_CREATE with new GPA, followed immediately by AP_DESTROY. If that AP_DESTROY got processed before the vCPU re-enters the guest, it might instead restart the vCPU (due to taking VALID_PAGE(snp_vmsa_gpa) branch in __sev_snp_update_protected_guest_state()). So for that reason it makes sense to always initialize it to INVALID_PAGE when an AP_CREATION request comes in, to essentially clear any pending updates from a previous pending AP_CREATION for the same vCPU. If we move it to the AP_DESTROY case, but hit a failure elsewhere in sev_snp_ap_creation(), then the calling vCPU will get a #GP, but the target vCPU might get set up with the VMSA for the previous pending request. That might be appropriate though, just because calling vCPU hosed itself doesn't mean a previous valid AP_CREATION request shouldn't go through for target. Hmm... And having said all that... yes, I guess that basically boils down to INVALID_PAGE being used as an indicator of whether or not to unmap VMSA. I'll try to come up with some wording to better convey all this a bit more clearly. Thanks, -Mike > > > Alex > > > > > > Amazon Development Center Germany GmbH > Krausenstr. 38 > 10117 Berlin > Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss > Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B > Sitz: Berlin > Ust-ID: DE 289 237 879 > >
On 4/4/23 17:48, Michael Roth wrote: > On Fri, Feb 24, 2023 at 01:37:48PM +0100, Alexander Graf wrote: >> >> On 20.02.23 19:38, Michael Roth wrote: >>> From: Tom Lendacky <thomas.lendacky@amd.com> >>> >>> Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP >>> guests to alter the register state of the APs on their own. This allows >>> the guest a way of simulating INIT-SIPI. >>> >>> A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used >>> so as to avoid updating the VMSA pointer while the vCPU is running. >>> >>> For CREATE >>> The guest supplies the GPA of the VMSA to be used for the vCPU with >>> the specified APIC ID. The GPA is saved in the svm struct of the >>> target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added >>> to the vCPU and then the vCPU is kicked. >>> >>> For CREATE_ON_INIT: >>> The guest supplies the GPA of the VMSA to be used for the vCPU with >>> the specified APIC ID the next time an INIT is performed. The GPA is >>> saved in the svm struct of the target vCPU. >>> >>> For DESTROY: >>> The guest indicates it wishes to stop the vCPU. The GPA is cleared >>> from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is >>> added to vCPU and then the vCPU is kicked. >>> >>> The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked >>> as a result of the event or as a result of an INIT. The handler sets the >>> vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will >>> leave the vCPU as not runnable. Any previous VMSA pages that were >>> installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If >>> a new VMSA is to be installed, the VMSA guest page is pinned and set as >>> the VMSA in the vCPU VMCB and the vCPU state is set to >>> KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is >>> cleared in the vCPU VMCB and the vCPU state is left as >>> KVM_MP_STATE_UNINITIALIZED to prevent it from being run. >>> >>> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> >>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> >>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> >>> [mdr: add handling for restrictedmem] >>> Signed-off-by: Michael Roth <michael.roth@amd.com> >> >> >> What is the intended boot sequence for SEV-SNP guests? FWIW with this >> interface in place, guests will typically use in-guest VMSA pages to hold >> secondary vcpu state. But that means we're now allocating 4kb of memory for >> every vcpu that we create that will be for most of the guest's lifetime >> superfluous. >> >> Wouldn't it make more sense to have a model where we only allocate the VMSA >> for the boot CPU and leave secondary allocation to the guest? We already >> need firmware changes for SEV-SNP - may as well make this one more. > > I don't think we'd necessarily need a firmware change. We could just > free original VMSA back to the hypervisor as soon as those APs come > online. The down-side to that versus deferring cleaning till guest > shutdown is there is some flushing activity (see: > sev_flush_encrypted_page()) that would now likely be occuring during > guest boot up where the overhead might be more noticeable. But for SNP > the host likely supports X86_FEATURE_SME_COHERENT so the overhead > probably isn't that bad. Currently, OVMF code will perform a broadcast IPI to start all the APs because it doesn't know the APIC IDs until they start for the first time. Until the APIC IDs are known, the guest BSP can't create the VMSAs. However, a new GHCB event is in plan to retrieve the APIC IDs for the guest. Once that is in place, then you could create just a single VMSA for the BSP and then allow the guest to create the remainder (the current OVMF PoC patches to support an SVSM do this). The VMM would have to know that the hypervisor and the firmware both support that, though. That could be advertised as part of the GUID table of the firmware (in the case of OVMF) and as a capability from KVM. Thanks, Tom > >> >> [...] >> >>> + >>> +static int sev_snp_ap_creation(struct vcpu_svm *svm) >>> +{ >>> + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; >>> + struct kvm_vcpu *vcpu = &svm->vcpu; >>> + struct kvm_vcpu *target_vcpu; >>> + struct vcpu_svm *target_svm; >>> + unsigned int request; >>> + unsigned int apic_id; >>> + bool kick; >>> + int ret; >>> + >>> + request = lower_32_bits(svm->vmcb->control.exit_info_1); >>> + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); >>> + >>> + /* Validate the APIC ID */ >>> + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); >> >> >> Out of curiosity: The target CPU can be my own vCPU, right? > > I don't think that would be the normal behavior, but maybe with some > care it's possible for a guest to do things that way. I haven't seen > anything strictly prohibiting this in the relevant specs. > >> >> >>> + if (!target_vcpu) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", >>> + apic_id); >>> + return -EINVAL; >>> + } >>> + >>> + ret = 0; >>> + >>> + target_svm = to_svm(target_vcpu); >>> + >>> + /* >>> + * The target vCPU is valid, so the vCPU will be kicked unless the >>> + * request is for CREATE_ON_INIT. For any errors at this stage, the >>> + * kick will place the vCPU in an non-runnable state. >>> + */ >>> + kick = true; >>> + >>> + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); >>> + >>> + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; >>> + target_svm->sev_es.snp_ap_create = true; >>> + >>> + /* Interrupt injection mode shouldn't change for AP creation */ >>> + if (request < SVM_VMGEXIT_AP_DESTROY) { >>> + u64 sev_features; >>> + >>> + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; >>> + sev_features ^= sev->sev_features; >>> + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", >>> + vcpu->arch.regs[VCPU_REGS_RAX]); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + } >>> + >>> + switch (request) { >>> + case SVM_VMGEXIT_AP_CREATE_ON_INIT: >>> + kick = false; >>> + fallthrough; >>> + case SVM_VMGEXIT_AP_CREATE: >>> + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { >>> + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", >>> + svm->vmcb->control.exit_info_2); >>> + ret = -EINVAL; >>> + goto out; >>> + } >>> + >>> + /* >>> + * Malicious guest can RMPADJUST a large page into VMSA which >>> + * will hit the SNP erratum where the CPU will incorrectly signal >>> + * an RMP violation #PF if a hugepage collides with the RMP entry >>> + * of VMSA page, reject the AP CREATE request if VMSA address from >>> + * guest is 2M aligned. >> >> >> This will break genuine current Linux kernels that just happen to allocate a >> guest page, no? In fact, given enough vCPUs you're almost guaranteed to hit >> an aligned structure somewhere. What is the guest supposed to do in that >> situation? > > The initial SNP support for guest kernels already made use of > snp_alloc_vmsa_page() to do the appropriate workaround to avoid allocating > 2MB-aligned VMSA pages. > > -Mike > >> >> >> >> Amazon Development Center Germany GmbH >> Krausenstr. 38 >> 10117 Berlin >> Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss >> Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B >> Sitz: Berlin >> Ust-ID: DE 289 237 879 >> >>
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 28b01cc7f64d..09b36462582c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -113,6 +113,7 @@ KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) #define KVM_REQ_HV_TLB_FLUSH \ KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) +#define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE KVM_ARCH_REQ(34) #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h index c18d78d5e505..e76ad26ba64f 100644 --- a/arch/x86/include/asm/svm.h +++ b/arch/x86/include/asm/svm.h @@ -278,7 +278,12 @@ enum avic_ipi_failure_cause { #define AVIC_HPA_MASK ~((0xFFFULL << 52) | 0xFFF) #define VMCB_AVIC_APIC_BAR_MASK 0xFFFFFFFFFF000ULL -#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0) +#define SVM_SEV_FEAT_SNP_ACTIVE BIT(0) +#define SVM_SEV_FEAT_RESTRICTED_INJECTION BIT(3) +#define SVM_SEV_FEAT_ALTERNATE_INJECTION BIT(4) +#define SVM_SEV_FEAT_INT_INJ_MODES \ + (SVM_SEV_FEAT_RESTRICTED_INJECTION | \ + SVM_SEV_FEAT_ALTERNATE_INJECTION) struct vmcb_seg { u16 selector; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 6bec2712ecc6..b2f1a12685ed 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -779,6 +779,7 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) static int sev_es_sync_vmsa(struct vcpu_svm *svm) { + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; struct sev_es_save_area *save = svm->sev_es.vmsa; /* Check some debug related fields before encrypting the VMSA */ @@ -824,6 +825,12 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) if (sev_snp_guest(svm->vcpu.kvm)) save->sev_features |= SVM_SEV_FEAT_SNP_ACTIVE; + /* + * Save the VMSA synced SEV features. For now, they are the same for + * all vCPUs, so just save each time. + */ + sev->sev_features = save->sev_features; + pr_debug("Virtual Machine Save Area (VMSA):\n"); print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false); @@ -3161,6 +3168,10 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm) if (!ghcb_sw_scratch_is_valid(ghcb)) goto vmgexit_err; break; + case SVM_VMGEXIT_AP_CREATION: + if (!ghcb_rax_is_valid(ghcb)) + goto vmgexit_err; + break; case SVM_VMGEXIT_NMI_COMPLETE: case SVM_VMGEXIT_AP_HLT_LOOP: case SVM_VMGEXIT_AP_JUMP_TABLE: @@ -3543,6 +3554,226 @@ static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gp ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc); } +static kvm_pfn_t gfn_to_pfn_restricted(struct kvm *kvm, gfn_t gfn) +{ + struct kvm_memory_slot *slot; + kvm_pfn_t pfn; + int order = 0; + + slot = gfn_to_memslot(kvm, gfn); + if (!kvm_slot_can_be_private(slot)) { + pr_err("SEV: Failure retrieving restricted memslot for GFN 0x%llx, flags 0x%x, userspace_addr: 0x%lx\n", + gfn, slot->flags, slot->userspace_addr); + return INVALID_PAGE; + } + + if (!kvm_mem_is_private(kvm, gfn)) { + pr_err("SEV: Failure retrieving restricted PFN for GFN 0x%llx\n", gfn); + return INVALID_PAGE; + } + + if (kvm_restrictedmem_get_pfn(slot, gfn, &pfn, &order)) { + pr_err("SEV: Failure retrieving restricted PFN for GFN 0x%llx\n", gfn); + return INVALID_PAGE; + } + + put_page(pfn_to_page(pfn)); + + return pfn; +} + +static int __sev_snp_update_protected_guest_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + kvm_pfn_t pfn; + hpa_t cur_pa; + + WARN_ON(!mutex_is_locked(&svm->sev_es.snp_vmsa_mutex)); + + /* Save off the current VMSA PA for later checks */ + cur_pa = svm->sev_es.vmsa_pa; + + /* Mark the vCPU as offline and not runnable */ + vcpu->arch.pv.pv_unhalted = false; + vcpu->arch.mp_state = KVM_MP_STATE_STOPPED; + + /* Clear use of the VMSA */ + svm->sev_es.vmsa_pa = INVALID_PAGE; + svm->vmcb->control.vmsa_pa = INVALID_PAGE; + + if (cur_pa != __pa(svm->sev_es.vmsa) && VALID_PAGE(cur_pa)) { + /* + * The svm->sev_es.vmsa_pa field holds the hypervisor physical + * address of the about to be replaced VMSA which will no longer + * be used or referenced, so un-pin it. However, restricted + * pages (e.g. via AP creation) should be left to the + * restrictedmem backend to deal with, so don't release the + * page in that case. + */ + if (!VALID_PAGE(gfn_to_pfn_restricted(vcpu->kvm, + gpa_to_gfn(svm->sev_es.snp_vmsa_gpa)))) + kvm_release_pfn_dirty(__phys_to_pfn(cur_pa)); + } + + if (VALID_PAGE(svm->sev_es.snp_vmsa_gpa)) { + /* + * The VMSA is referenced by the hypervisor physical address, + * so retrieve the PFN and ensure it is restricted memory. + */ + pfn = gfn_to_pfn_restricted(vcpu->kvm, gpa_to_gfn(svm->sev_es.snp_vmsa_gpa)); + if (!VALID_PAGE(pfn)) + return pfn; + + /* Use the new VMSA */ + svm->sev_es.vmsa_pa = pfn_to_hpa(pfn); + svm->vmcb->control.vmsa_pa = svm->sev_es.vmsa_pa; + + /* Mark the vCPU as runnable */ + vcpu->arch.pv.pv_unhalted = false; + vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; + + svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; + } + + /* + * When replacing the VMSA during SEV-SNP AP creation, + * mark the VMCB dirty so that full state is always reloaded. + */ + vmcb_mark_all_dirty(svm->vmcb); + + return 0; +} + +/* + * Invoked as part of svm_vcpu_reset() processing of an init event. + */ +void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu) +{ + struct vcpu_svm *svm = to_svm(vcpu); + int ret; + + if (!sev_snp_guest(vcpu->kvm)) + return; + + mutex_lock(&svm->sev_es.snp_vmsa_mutex); + + if (!svm->sev_es.snp_ap_create) + goto unlock; + + svm->sev_es.snp_ap_create = false; + + ret = __sev_snp_update_protected_guest_state(vcpu); + if (ret) + vcpu_unimpl(vcpu, "snp: AP state update on init failed\n"); + +unlock: + mutex_unlock(&svm->sev_es.snp_vmsa_mutex); +} + +static int sev_snp_ap_creation(struct vcpu_svm *svm) +{ + struct kvm_sev_info *sev = &to_kvm_svm(svm->vcpu.kvm)->sev_info; + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm_vcpu *target_vcpu; + struct vcpu_svm *target_svm; + unsigned int request; + unsigned int apic_id; + bool kick; + int ret; + + request = lower_32_bits(svm->vmcb->control.exit_info_1); + apic_id = upper_32_bits(svm->vmcb->control.exit_info_1); + + /* Validate the APIC ID */ + target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id); + if (!target_vcpu) { + vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n", + apic_id); + return -EINVAL; + } + + ret = 0; + + target_svm = to_svm(target_vcpu); + + /* + * The target vCPU is valid, so the vCPU will be kicked unless the + * request is for CREATE_ON_INIT. For any errors at this stage, the + * kick will place the vCPU in an non-runnable state. + */ + kick = true; + + mutex_lock(&target_svm->sev_es.snp_vmsa_mutex); + + target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE; + target_svm->sev_es.snp_ap_create = true; + + /* Interrupt injection mode shouldn't change for AP creation */ + if (request < SVM_VMGEXIT_AP_DESTROY) { + u64 sev_features; + + sev_features = vcpu->arch.regs[VCPU_REGS_RAX]; + sev_features ^= sev->sev_features; + if (sev_features & SVM_SEV_FEAT_INT_INJ_MODES) { + vcpu_unimpl(vcpu, "vmgexit: invalid AP injection mode [%#lx] from guest\n", + vcpu->arch.regs[VCPU_REGS_RAX]); + ret = -EINVAL; + goto out; + } + } + + switch (request) { + case SVM_VMGEXIT_AP_CREATE_ON_INIT: + kick = false; + fallthrough; + case SVM_VMGEXIT_AP_CREATE: + if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) { + vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n", + svm->vmcb->control.exit_info_2); + ret = -EINVAL; + goto out; + } + + /* + * Malicious guest can RMPADJUST a large page into VMSA which + * will hit the SNP erratum where the CPU will incorrectly signal + * an RMP violation #PF if a hugepage collides with the RMP entry + * of VMSA page, reject the AP CREATE request if VMSA address from + * guest is 2M aligned. + */ + if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) { + vcpu_unimpl(vcpu, + "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n", + svm->vmcb->control.exit_info_2); + ret = -EINVAL; + goto out; + } + + target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2; + break; + case SVM_VMGEXIT_AP_DESTROY: + break; + default: + vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n", + request); + ret = -EINVAL; + break; + } + +out: + if (kick) { + if (target_vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) + target_vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; + + kvm_make_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu); + kvm_vcpu_kick(target_vcpu); + } + + mutex_unlock(&target_svm->sev_es.snp_vmsa_mutex); + + return ret; +} + static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) { struct vmcb_control_area *control = &svm->vmcb->control; @@ -3806,6 +4037,18 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) ret = 1; break; } + case SVM_VMGEXIT_AP_CREATION: + ret = sev_snp_ap_creation(svm); + if (ret) { + ghcb_set_sw_exit_info_1(ghcb, 1); + ghcb_set_sw_exit_info_2(ghcb, + X86_TRAP_GP | + SVM_EVTINJ_TYPE_EXEPT | + SVM_EVTINJ_VALID); + } + + ret = 1; + break; case SVM_VMGEXIT_UNSUPPORTED_EVENT: vcpu_unimpl(vcpu, "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n", @@ -3910,6 +4153,8 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm) set_ghcb_msr(svm, GHCB_MSR_SEV_INFO(GHCB_VERSION_MAX, GHCB_VERSION_MIN, sev_enc_bit)); + + mutex_init(&svm->sev_es.snp_vmsa_mutex); } void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 745f736d9c98..539926b07ee5 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1349,6 +1349,9 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) svm->spec_ctrl = 0; svm->virt_spec_ctrl = 0; + if (init_event) + sev_snp_init_protected_guest_state(vcpu); + init_vmcb(vcpu); if (!init_event) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index b6ca6657aa6c..37bd7b728d52 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -95,6 +95,8 @@ struct kvm_sev_info { void *snp_context; /* SNP guest context page */ void *snp_certs_data; struct mutex guest_req_lock; /* Lock for guest request handling */ + + u64 sev_features; /* Features set at VMSA creation */ }; struct kvm_svm { @@ -209,6 +211,10 @@ struct vcpu_sev_es_state { bool ghcb_sa_free; u64 ghcb_registered_gpa; + + struct mutex snp_vmsa_mutex; + gpa_t snp_vmsa_gpa; + bool snp_ap_create; }; struct vcpu_svm { @@ -718,6 +724,7 @@ void sev_es_unmap_ghcb(struct vcpu_svm *svm); struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu); void sev_adjust_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *level); void handle_rmp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code); +void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu); /* vmenter.S */ diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0154fc7a28c1..9872217e3a06 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10501,6 +10501,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) if (kvm_check_request(KVM_REQ_UPDATE_CPU_DIRTY_LOGGING, vcpu)) static_call(kvm_x86_update_cpu_dirty_logging)(vcpu); + + if (kvm_check_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) { + kvm_vcpu_reset(vcpu, true); + if (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE) + goto out; + } } if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win || @@ -12698,6 +12704,9 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu) return true; #endif + if (kvm_test_request(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, vcpu)) + return true; + if (kvm_arch_interrupt_allowed(vcpu) && (kvm_cpu_has_interrupt(vcpu) || kvm_guest_apic_has_interrupt(vcpu)))