
[v14,09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT

Message ID 20240421180122.1650812-10-michael.roth@amd.com (mailing list archive)
State New
Headers show
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand

Commit Message

Michael Roth April 21, 2024, 6:01 p.m. UTC
From: Brijesh Singh <brijesh.singh@amd.com>

SEV-SNP VMs can ask the hypervisor to change the page state in the RMP
table to be private or shared using the Page State Change MSR protocol
as defined in the GHCB specification.

When using gmem, private/shared memory is allocated through separate
pools, and KVM relies on userspace issuing the KVM_SET_MEMORY_ATTRIBUTES
ioctl to tell the KVM MMU whether a particular GFN should be backed by
private memory.

Forward these page state change requests to userspace so that it can
issue the expected KVM ioctls. The KVM MMU will handle updating the RMP
entries when it is ready to map a private page into a guest.

Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it
so that it can be extended for other cases where VMGEXITs need some
level of handling in userspace.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 Documentation/virt/kvm/api.rst    | 33 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/sev-common.h |  6 ++++++
 arch/x86/kvm/svm/sev.c            | 33 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h          | 17 ++++++++++++++++
 4 files changed, 89 insertions(+)

Comments

Sean Christopherson April 24, 2024, 8:59 p.m. UTC | #1
On Sun, Apr 21, 2024, Michael Roth wrote:
> +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> +{
> +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +
> +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> +		return 1; /* resume guest */
> +	}
> +
> +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> +	vcpu->run->vmgexit.psc_msr.op = op;

Argh, no.

This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
and extend as *needed*.  There is no good reason page state change requests need
*two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
KVM_EXIT_MEMORY_FAULT.

The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
majority of requests are contiguous, i.e. can be combined into a range by KVM,
and that handling any outliers by performing multiple exits to userspace will
provide sufficient performance.

And the non-MSR version that comes in later patch is a complete mess.  It kicks
the PSC out to userspace without *any* validation.  As I complained in the TDX
thread, that will create an unmaintainable ABI for KVM.

KVM needs to have its own, well-defined ABI.  Splitting functionality between
KVM and userspace at seemingly random points is not maintainable.

E.g. if/when KVM supports UNSMASH, upgrading to a newer KVM would arguably break
userspace as PSC requests that previously exited would suddenly be handled by
KVM.  Maybe.  It's impossible to review this because there's no KVM ABI, KVM is
little more than a dumb pipe parroting information to userspace.

I truly do not understand why we would even consider allowing this.  We push back
on people wanting new hypercalls for some specific use case, because we already
have generic ways to achieve things, but then CoCo comes along and we apparently
throw out any thought of maintainability.  I don't get it.

[*] https://lore.kernel.org/all/Zg18ul8Q4PGQMWam@google.com
Michael Roth April 25, 2024, 10 p.m. UTC | #2
On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> On Sun, Apr 21, 2024, Michael Roth wrote:
> > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > +{
> > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +
> > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > +		return 1; /* resume guest */
> > +	}
> > +
> > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > +	vcpu->run->vmgexit.psc_msr.op = op;
> 
> Argh, no.
> 
> This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> and extend as *needed*.  There is no good reason page state change requests need
> *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> KVM_EXIT_MEMORY_FAULT.
> 
> The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> majority of requests are contiguous, i.e. can be combined into a range by KVM,
> and that handling any outliers by performing multiple exits to userspace will
> provide sufficient performance.

That does tend to be the case. We won't have as much granularity with
the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
expected to be for the entire range anyway, and if that fails for
whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
for cases where the entries aren't contiguous however, which would
involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
not a huge deal since it doesn't seem to be a common case.

KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
flexibility to just issue that directly within a guest rather than
relying on SNP/TDX specific hcalls. I don't know if that approach is
practical for a real guest, but it could be useful for having re-usable
guest code in KVM selftests that "just works" for all variants of
SNP/TDX/sw-protected. (though we'd still want stuff that exercises
SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).

I think there is some potential baggage there from the previous SEV live
migration use cases. There's some potential that existing guest kernels
will use it once it gets advertised, issuing those calls alongside
GHCB-based page-state changes. It might make sense to use one of the
reserved bits to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
hardware/software-protected VMs and not interchangeable with calls that
were used for SEV live migration.

If this seems reasonable I'll give it a go and see what it looks like.

> 
> And the non-MSR version that comes in later patch is a complete mess.  It kicks
> the PSC out to userspace without *any* validation.  As I complained in the TDX
> thread, that will create an unmaintainable ABI for KVM.
> 
> KVM needs to have its own, well-defined ABI.  Splitting functionality between
> KVM and userspace at seemingly random points is not maintainable.
> 
> E.g. if/when KVM supports UNSMASH, upgrading to a newer KVM would arguably break
> userspace as PSC requests that previously exited would suddenly be handled by
> KVM.  Maybe.  It's impossible to review this because there's no KVM ABI, KVM is
> little more than a dumb pipe parroting information to userspace.

It leans on the GHCB spec to avoid re-inventing structs/documentation
for things like Page State Change buffers, but we do have as much
control as we want over how much we farm out versus lock into the KVM
ABI. For instance, the accompanying Documentation/ update mentions we
only send a subset of GHCB requests that need to be handled by
userspace, so we could handle SMASH/UNSMASH in KVM without breaking
expectations (or, if SMASH/UNSMASH were intermixed with PSCs, document
that only PSC opcodes can be updated by userspace).

But I'm certainly not arguing it wouldn't be better to have a
guest-agnostic alternative if we can reach an agreement on that, and
KVM_HC_MAP_GPA_RANGE seems like it could work.

-Mike

> 
> I truly do not understand why we would even consider allowing this.  We push back
> on people wanting new hypercalls for some specific use case, because we already
> have generic ways to achieve things, but then CoCo comes along and we apparently
> throw out any thought of maintainability.  I don't get it.
> 
> [*] https://lore.kernel.org/all/Zg18ul8Q4PGQMWam@google.com
Sean Christopherson April 25, 2024, 10:13 p.m. UTC | #3
On Thu, Apr 25, 2024, Michael Roth wrote:
> On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > +{
> > > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > +
> > > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > +		return 1; /* resume guest */
> > > +	}
> > > +
> > > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > +	vcpu->run->vmgexit.psc_msr.op = op;
> > 
> > Argh, no.
> > 
> > This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> > and extend as *needed*.  There is no good reason page state change requests need
> > *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > KVM_EXIT_MEMORY_FAULT.
> > 
> > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > and that handling any outliers by performing multiple exits to userspace will
> > provide sufficient performance.
> 
> That does tend to be the case. We won't have as much granularity with
> the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> expected to be for the entire range anyway, and if that fails for
> whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> for cases where the entries aren't contiguous however, which would
> involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> not a huge deal since it doesn't seem to be a common case.

If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
into a buffer, but I suspect it will be simpler to just have KVM loop until the
PSC request is complete.

> KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> flexibility to just issue that directly within a guest rather than
> relying on SNP/TDX specific hcalls. I don't know if that approach is
> practical for a real guest, but it could be useful for having re-usable
> guest code in KVM selftests that "just works" for all variants of
> SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> 
> I think there is some potential baggage there with the previous SEV
> live migration use cases. There's some potential that existing guest kernels
> will use it once it gets advertised and issue them alongside GHCB-based
> page-state changes. It might make sense to use one of the reserved bits
> to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> hardware/software-protected VMs and not interchangeable with calls that
> were used for SEV live migration stuff.

I don't think I follow, what exactly wouldn't be interchangeable, and why?

> If this seems reasonable I'll give it a go and see what it looks like.
> 
> > 
> > And the non-MSR version that comes in later patch is a complete mess.  It kicks
> > the PSC out to userspace without *any* validation.  As I complained in the TDX
> > thread, that will create an unmaintainable ABI for KVM.
> > 
> > KVM needs to have its own, well-defined ABI.  Splitting functionality between
> > KVM and userspace at seemingly random points is not maintainable.
> > 
> > E.g. if/when KVM supports UNSMASH, upgrading to a newer KVM would arguably break
> > userspace as PSC requests that previously exited would suddenly be handled by
> > KVM.  Maybe.  It's impossible to review this because there's no KVM ABI, KVM is
> > little more than a dumb pipe parroting information to userspace.
> 
> It leans on the GHCB spec to avoid re-inventing structs/documentation
> for things like Page State Change buffers, but we do have as much control
> as we want over how much we farm out versus lock into the KVM ABI. For
> instance the accompanying Documentation/ update mentions we only send a
> subset of GHCB requests that need to be handled by userspace, so we
> could handle SMASH/UNSMASH in KVM without breaking expectations (or if
> SMASH/UNSMASH were intermixed with PSCs, documentation that only PSC
> opcodes could be updated by userspace).
> 
> But I'm certainly not arguing it wouldn't be better to have a
> guest-agnostic alternative if we can reach an agreement on that, and
> KVM_HC_MAP_GPA_RANGE seems like it could work.

Yeah, I want to at least _try_ to achieve common ground, because the basic
functionality of all this stuff is the exact same.
Michael Roth April 26, 2024, 5:16 p.m. UTC | #4
On Thu, Apr 25, 2024 at 03:13:40PM -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Michael Roth wrote:
> > On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > > +{
> > > > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > > +
> > > > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > > +		return 1; /* resume guest */
> > > > +	}
> > > > +
> > > > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > > +	vcpu->run->vmgexit.psc_msr.op = op;
> > > 
> > > Argh, no.
> > > 
> > > This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> > > and extend as *needed*.  There is no good reason page state change requests need
> > > *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> > > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > > KVM_EXIT_MEMORY_FAULT.
> > > 
> > > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > > and that handling any outliers by performing multiple exits to userspace will
> > > provide sufficient performance.
> > 
> > That does tend to be the case. We won't have as much granularity with
> > the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> > expected to be for the entire range anyway, and if that fails for
> > whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> > for cases where the entries aren't contiguous however, which would
> > involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> > not a huge deal since it doesn't seem to be a common case.
> 
> If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
> into a buffer, but I suspect it will be simpler to just have KVM loop until the
> PSC request is complete.

Agreed. But *if* we decided to introduce a buffer, where would you
suggest adding it? The kvm_run union fields are set to 256 bytes, and
we'd need close to 4K to handle a full GHCB PSC buffer in 1 go. Would
additional storage at the end of struct kvm_run be acceptable?

> 
> > KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> > flexibility to just issue that directly within a guest rather than
> > relying on SNP/TDX specific hcalls. I don't know if that approach is
> > practical for a real guest, but it could be useful for having re-usable
> > guest code in KVM selftests that "just works" for all variants of
> > SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> > SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> > 
> > I think there is some potential baggage there with the previous SEV
> > live migration use cases. There's some potential that existing guest kernels
> > will use it once it gets advertised and issue them alongside GHCB-based
> > page-state changes. It might make sense to use one of the reserved bits
> > to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> > hardware/software-protected VMs and not interchangeable with calls that
> > were used for SEV live migration stuff.
> 
> I don't think I follow, what exactly wouldn't be interchangeable, and why?

For instance, if KVM_FEATURE_MIGRATION_CONTROL is advertised, then when
amd_enc_status_change_finish() is triggered as a result of
set_memory_encrypted(), we'd see

  1) a GHCB PSC for SNP, which will get forwarded to userspace via
     KVM_HC_MAP_GPA_RANGE
  2) KVM_HC_MAP_GPA_RANGE issued directly by the guest.

In that case, we'd be duplicating PSCs but it wouldn't necessarily hurt
anything. But ideally we'd be able to distinguish the 2 cases so we
could rightly treat 1) as only being expected for SNP, and 2) as only
being expected for SEV/SEV-ES.

I'm also concerned there's potential for more serious issues, for instance
kvm_init_platform() has a loop that resets all E820_TYPE_RAM ranges to
encrypted, which may step on PSCs that SNP had to do earlier on for
setting up shared pages for things like GHCB buffers. For the SEV live
migration use case, it's not a huge deal if a shared page gets processed
as private, it's just slower because it would need to go through the SEV
firmware APIs. There's also more leeway in that updates can happen at
opportunistic times, since there's a KVM_MIGRATION_READY state that can
be set via MSR_KVM_MIGRATION_CONTROL once everything is set up.

For SNP, all those associated KVM_HC_MAP_GPA_RANGE calls would take
immediate effect, so they have to be in lock-step with the actual state
of each page before access, and there is no leeway for stuff like
leaving a shared page as private.

Also, a lot of that predates lazy acceptance support, which happens
out-of-band versus these pv_ops.mmu.notify_page_* hooks, and kexec
support for SNP guests has some notion of tracking state across kexec
boundaries that might get clobbered too.

I haven't looked too closely though, but it seems like a good idea to at
least set a bit for kvm->arch.has_private_mem use cases. If it's not
needed then worst case we have a bit that userspace doesn't necessarily
need to care about, but for SNP just filtering out the duplicate
requests seems like enough to justify having it.

-Mike
Sean Christopherson April 26, 2024, 8:14 p.m. UTC | #5
On Fri, Apr 26, 2024, Michael Roth wrote:
> On Thu, Apr 25, 2024 at 03:13:40PM -0700, Sean Christopherson wrote:
> > On Thu, Apr 25, 2024, Michael Roth wrote:
> > > On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > > > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > > > +{
> > > > > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > > > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > > > +
> > > > > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > > > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > > > +		return 1; /* resume guest */
> > > > > +	}
> > > > > +
> > > > > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > > > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > > > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > > > +	vcpu->run->vmgexit.psc_msr.op = op;
> > > > 
> > > > Argh, no.
> > > > 
> > > > This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> > > > and extend as *needed*.  There is no good reason page state change requests need
> > > > *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> > > > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > > > KVM_EXIT_MEMORY_FAULT.
> > > > 
> > > > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > > > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > > > and that handling any outliers by performing multiple exits to userspace will
> > > > provide sufficient performance.
> > > 
> > > That does tend to be the case. We won't have as much granularity with
> > > the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> > > expected to be for the entire range anyway, and if that fails for
> > > whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> > > for cases where the entries aren't contiguous however, which would
> > > involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> > > not a huge deal since it doesn't seem to be a common case.
> > 
> > If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
> > into a buffer, but I suspect it will be simpler to just have KVM loop until the
> > PSC request is complete.
> 
> Agreed. But *if* we decided to introduce a buffer, where would you
> suggest adding it? The kvm_run union fields are set to 256 bytes, and
> we'd need close to 4K to handle a full GHCB PSC buffer in 1 go. Would
> additional storage at the end of struct kvm_run be acceptable?

Don't even need more memory, just use vcpu->arch.pio_data, which is always
allocated and is mmap()able by userspace via KVM_PIO_PAGE_OFFSET.

> > > KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> > > flexibility to just issue that directly within a guest rather than
> > > relying on SNP/TDX specific hcalls. I don't know if that approach is
> > > practical for a real guest, but it could be useful for having re-usable
> > > guest code in KVM selftests that "just works" for all variants of
> > > SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> > > SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> > > 
> > > I think there is some potential baggage there with the previous SEV
> > > live migration use cases. There's some potential that existing guest kernels
> > > will use it once it gets advertised and issue them alongside GHCB-based
> > > page-state changes. It might make sense to use one of the reserved bits
> > > to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> > > hardware/software-protected VMs and not interchangeable with calls that
> > > were used for SEV live migration stuff.
> > 
> > I don't think I follow, what exactly wouldn't be interchangeable, and why?
> 
> For instance, if KVM_FEATURE_MIGRATION_CONTROL is advertised, then when
> amd_enc_status_change_finish() is triggered as a result of
> set_memory_encrypted(), we'd see
> 
>   1) a GHCB PSC for SNP, which will get forwarded to userspace via
>      KVM_HC_MAP_GPA_RANGE
>   2) KVM_HC_MAP_GPA_RANGE issued directly by the guest.
> 
> In that case, we'd be duplicating PSCs but it wouldn't necessarily hurt
> anything. But ideally we'd be able to distinguish the 2 cases so we
> could rightly treat 1) as only being expected for SNP, and 2) as only
> being expected for SEV/SEV-ES.

Why would the guest issue both?  That's a guest bug.  Or if suppressing the second
hypercall is an issue, simply don't enumerate MIGRATION_CONTROL for SNP guests.
Michael Roth April 26, 2024, 10:24 p.m. UTC | #6
On Fri, Apr 26, 2024 at 01:14:32PM -0700, Sean Christopherson wrote:
> On Fri, Apr 26, 2024, Michael Roth wrote:
> > On Thu, Apr 25, 2024 at 03:13:40PM -0700, Sean Christopherson wrote:
> > > On Thu, Apr 25, 2024, Michael Roth wrote:
> > > > On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > > > > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > > > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > > > > +{
> > > > > > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > > > > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > > > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > > > > +
> > > > > > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > > > > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > > > > +		return 1; /* resume guest */
> > > > > > +	}
> > > > > > +
> > > > > > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > > > > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > > > > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > > > > +	vcpu->run->vmgexit.psc_msr.op = op;
> > > > > 
> > > > > Argh, no.
> > > > > 
> > > > > This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> > > > > and extend as *needed*.  There is no good reason page state change requests need
> > > > > *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> > > > > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > > > > KVM_EXIT_MEMORY_FAULT.
> > > > > 
> > > > > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > > > > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > > > > and that handling any outliers by performing multiple exits to userspace will
> > > > > provide sufficient performance.
> > > > 
> > > > That does tend to be the case. We won't have as much granularity with
> > > > the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> > > > expected to be for the entire range anyway, and if that fails for
> > > > whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> > > > for cases where the entries aren't contiguous however, which would
> > > > involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> > > > not a huge deal since it doesn't seem to be a common case.
> > > 
> > > If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
> > > into a buffer, but I suspect it will be simpler to just have KVM loop until the
> > > PSC request is complete.
> > 
> > Agreed. But *if* we decided to introduce a buffer, where would you
> > suggest adding it? The kvm_run union fields are set to 256 bytes, and
> > we'd need close to 4K to handle a full GHCB PSC buffer in 1 go. Would
> > additional storage at the end of struct kvm_run be acceptable?
> 
> Don't even need more memory, just use vcpu->arch.pio_data, which is always
> allocated and is mmap()able by userspace via KVM_PIO_PAGE_OFFSET.

Nice, that seems like a good option if needed.

> 
> > > > KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> > > > flexibility to just issue that directly within a guest rather than
> > > > relying on SNP/TDX specific hcalls. I don't know if that approach is
> > > > practical for a real guest, but it could be useful for having re-usable
> > > > guest code in KVM selftests that "just works" for all variants of
> > > > SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> > > > SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> > > > 
> > > > I think there is some potential baggage there with the previous SEV
> > > > live migration use cases. There's some potential that existing guest kernels
> > > > will use it once it gets advertised and issue them alongside GHCB-based
> > > > page-state changes. It might make sense to use one of the reserved bits
> > > > to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> > > > hardware/software-protected VMs and not interchangeable with calls that
> > > > were used for SEV live migration stuff.
> > > 
> > > I don't think I follow, what exactly wouldn't be interchangeable, and why?
> > 
> > For instance, if KVM_FEATURE_MIGRATION_CONTROL is advertised, then when
> > amd_enc_status_change_finish() is triggered as a result of
> > set_memory_encrypted(), we'd see
> > 
> >   1) a GHCB PSC for SNP, which will get forwarded to userspace via
> >      KVM_HC_MAP_GPA_RANGE
> >   2) KVM_HC_MAP_GPA_RANGE issued directly by the guest.
> > 
> > In that case, we'd be duplicating PSCs but it wouldn't necessarily hurt
> > anything. But ideally we'd be able to distinguish the 2 cases so we
> > could rightly treat 1) as only being expected for SNP, and 2) as only
> > being expected for SEV/SEV-ES.
> 
> Why would the guest issue both?  That's a guest bug.  Or if suppressing the second
> hypercall is an issue, simply don't enumerate MIGRATION_CONTROL for SNP guests.

At the time of its inception, KVM_HC_MAP_GPA_RANGE was simply
KVM_HC_PAGE_ENC_STATUS and got a more generic name over the course of
development. But its purpose never changed: to inform the hypervisor of
the current encryption status of a GPA range so VMMs could build up a
list of shared guest regions that don't need to go through firmware for
migration. And it was and still is asynchronous to a degree, since the
migration control MSR signals when that list of shared pages is
usable.

These are very different semantics from the proposal to use
KVM_HC_MAP_GPA_RANGE as a means to set memory attributes via
KVM_SET_MEMORY_ATTRIBUTES, and the 2 purposes aren't necessarily
mutually exclusive of one another. It only really becomes a bug if we
begin to interpret the original use-case as something other than its
initial intent in the case of SNP.

But at the same time, it's hard to imagine this older SEV live
migration use-case being useful for SNP, since userspace will
necessarily have all the information it needs to determine what
is/isn't shared without relying on an additional hypercall.

So treating the older use case as specific to non-SNP and disallowing
the use of MIGRATION_CONTROL does seem reasonable. But it's really the
CPUID bit that advertises it; SEV just happens to only use it when
MIGRATION_CONTROL is also advertised. So we could disable that as well,
but I did like the idea of being able to handle guest-issued
KVM_HC_MAP_GPA_RANGE calls even with SNP/TDX enabled, which is less of
an option if we can't advertise KVM_HC_MAP_GPA_RANGE via CPUID. But I
suppose we could do that with KVM selftests, which is probably where
that's more likely to be useful.

-Mike
Michael Roth April 26, 2024, 10:48 p.m. UTC | #7
On Fri, Apr 26, 2024 at 05:24:57PM -0500, Michael Roth wrote:
> On Fri, Apr 26, 2024 at 01:14:32PM -0700, Sean Christopherson wrote:
> > On Fri, Apr 26, 2024, Michael Roth wrote:
> > > On Thu, Apr 25, 2024 at 03:13:40PM -0700, Sean Christopherson wrote:
> > > > On Thu, Apr 25, 2024, Michael Roth wrote:
> > > > > On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > > > > > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > > > > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > > > > > +{
> > > > > > > +	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > > > > > +	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > > > > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > > > > > +
> > > > > > > +	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > > > > > +		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > > > > > +		return 1; /* resume guest */
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > > > > > +	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > > > > > +	vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > > > > > +	vcpu->run->vmgexit.psc_msr.op = op;
> > > > > > 
> > > > > > Argh, no.
> > > > > > 
> > > > > > This is the same crud that TDX tried to push[*].  Use KVM's existing user exits,
> > > > > > and extend as *needed*.  There is no good reason page state change requests need
> > > > > > *two* exit reasons.  The *only* thing KVM supports right now is private<=>shared
> > > > > > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > > > > > KVM_EXIT_MEMORY_FAULT.
> > > > > > 
> > > > > > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > > > > > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > > > > > and that handling any outliers by performing multiple exits to userspace will
> > > > > > provide sufficient performance.
> > > > > 
> > > > > That does tend to be the case. We won't have as much granularity with
> > > > > the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> > > > > expected to be for the entire range anyway, and if that fails for
> > > > > whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> > > > > for cases where the entries aren't contiguous however, which would
> > > > > involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> > > > > not a huge deal since it doesn't seem to be a common case.
> > > > 
> > > > If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
> > > > into a buffer, but I suspect it will be simpler to just have KVM loop until the
> > > > PSC request is complete.
> > > 
> > > Agreed. But *if* we decided to introduce a buffer, where would you
> > > suggest adding it? The kvm_run union fields are set to 256 bytes, and
> > > we'd need close to 4K to handle a full GHCB PSC buffer in 1 go. Would
> > > additional storage at the end of struct kvm_run be acceptable?
> > 
> > Don't even need more memory, just use vcpu->arch.pio_data, which is always
> > allocated and is mmap()able by userspace via KVM_PIO_PAGE_OFFSET.
> 
> Nice, that seems like a good option if needed.
> 
> > 
> > > > > KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> > > > > flexibility to just issue that directly within a guest rather than
> > > > > relying on SNP/TDX specific hcalls. I don't know if that approach is
> > > > > practical for a real guest, but it could be useful for having re-usable
> > > > > guest code in KVM selftests that "just works" for all variants of
> > > > > SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> > > > > SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> > > > > 
> > > > > I think we'd there is some potential baggage there with the previous SEV
> > > > > live migration use cases. There's some potential that existing guest kernels
> > > > > will use it once it gets advertised and issue them alongside GHCB-based
> > > > > page-state changes. It might make sense to use one of the reserved bits
> > > > > to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> > > > > hardware/software-protected VMs and not interchangeable with calls that
> > > > > were used for SEV live migration stuff.
> > > > 
> > > > I don't think I follow, what exactly wouldn't be interchangeable, and why?
> > > 
> > > For instance, if KVM_FEATURE_MIGRATION_CONTROL is advertised, then when
> > > amd_enc_status_change_finish() is triggered as a result of
> > > set_memory_encrypted(), we'd see
> > > 
> > >   1) a GHCB PSC for SNP, which will get forwarded to userspace via
> > >      KVM_HC_MAP_GPA_RANGE
> > >   2) KVM_HC_MAP_GPA_RANGE issued directly by the guest.
> > > 
> > > In that case, we'd be duplicating PSCs but it wouldn't necessarily hurt
> > > anything. But ideally we'd be able to distinguish the 2 cases so we
> > > could rightly treat 1) as only being expected for SNP, and 2) as only
> > > being expected for SEV/SEV-ES.
> > 
> > Why would the guest issue both?  That's a guest bug.  Or if supressing the second
> > hypercall is an issue, simply don't enumerate MIGRATION_CONTROL for SNP guests.
> 
> At the time of its inception, KVM_HC_MAP_GPA_RANGE was simply
> KVM_HC_PAGE_ENC_STATUS and got a more generic name over the course of
> development. But its purpose never changed: to inform the hypervisor of
> the current encryption status of a GPA range so VMMs could build up a
> list of shared guest regions that don't need to go through firmware for
> migration. And it was and still is asynchronous to a degree, since the
> migration control MSR signals when that list of shared pages is
> usable.
> 
> These are very different semantics from the proposal to use
> KVM_HC_MAP_GPA_RANGE as a means to set memory attributes via
> KVM_SET_MEMORY_ATTRIBUTES, and the 2 purposes aren't necessarily
> mutually exclusive to one another. It only really becomes a bug if we
> begin to interpret the original use case as something other than its
> initial intent in the case of SNP.
> 
> But at the same time, it's hard to imagine this older SEV live migration
> use-case being useful for SNP, since userspace will necessarily have all
> the information it needs to determine what is/isn't shared without relying
> on an additional hypercall.
> 
> So treating the older use case as specific to non-SNP and disallowing the
> use of MIGRATION_CONTROL does seem reasonable. But it's really the CPUID
> bit that advertises it; SEV just happens to only use it when
> MIGRATION_CONTROL is also advertised. So we could disable that as well,
> but I did like the idea of being able to handle guest-issued
> KVM_HC_MAP_GPA_RANGE calls even with SNP/TDX enabled, which is less of an
> option if we can't advertise KVM_HC_MAP_GPA_RANGE via CPUID. But I
> suppose we could do that with KVM selftests, which is probably where
> that's more likely to be useful.

Hmm, well... assuming SNP/TDX guests agree to make those vCPU registers
available via the VMSA/etc. in those cases. So I suppose we'd need some
additional handling to support advertising KVM_HC_MAP_GPA_RANGE via CPUID
either way, and it is best to keep guest-issued KVM_HC_MAP_GPA_RANGE
from being advertised to guests until there's support and a solid use case
for it.

-Mike

> 
> -Mike
>
Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..4a7a2945bc78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7060,6 +7060,39 @@  Please note that the kernel is allowed to use the kvm_run structure as the
 primary storage for certain register types. Therefore, the kernel may use the
 values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
 
+::
+
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit {
+		#define KVM_USER_VMGEXIT_PSC_MSR	1
+			__u32 type; /* KVM_USER_VMGEXIT_* type */
+			union {
+				struct {
+					__u64 gpa;
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+		#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+					__u8 op;
+					__u32 ret;
+				} psc_msr;
+			};
+		};
+
+If exit reason is KVM_EXIT_VMGEXIT then it indicates that an SEV-SNP guest
+has issued a VMGEXIT instruction (as documented by the AMD Architecture
+Programmer's Manual (APM)) to the hypervisor that needs to be serviced by
+userspace. These are generally handled by the host kernel, but in some
+cases some aspects of handling a VMGEXIT are deferred to userspace.
+
+A kvm_user_vmgexit structure is defined to encapsulate the data to be
+sent to or returned by userspace. The type field defines the specific type
+of exit that needs to be serviced, and that type is used as a discriminator
+to determine which union type should be used for input/output.
+
+For the KVM_USER_VMGEXIT_PSC_MSR type, the psc_msr union type is used. The
+kernel will supply the 'gpa' and 'op' fields, and userspace is expected to
+update the private/shared state of the GPA using the corresponding
+KVM_SET_MEMORY_ATTRIBUTES ioctl. The 'ret' field is to be set to 0 by
+userspace on success, or some non-zero value on failure.
 
 6. Capabilities that can be enabled on vCPUs
 ============================================
diff --git a/arch/x86/include/asm/sev-common.h b/arch/x86/include/asm/sev-common.h
index 1006bfffe07a..6d68db812de1 100644
--- a/arch/x86/include/asm/sev-common.h
+++ b/arch/x86/include/asm/sev-common.h
@@ -101,11 +101,17 @@  enum psc_op {
 	/* GHCBData[11:0] */				\
 	GHCB_MSR_PSC_REQ)
 
+#define GHCB_MSR_PSC_REQ_TO_GFN(msr) (((msr) & GENMASK_ULL(51, 12)) >> 12)
+#define GHCB_MSR_PSC_REQ_TO_OP(msr) (((msr) & GENMASK_ULL(55, 52)) >> 52)
+
 #define GHCB_MSR_PSC_RESP		0x015
 #define GHCB_MSR_PSC_RESP_VAL(val)			\
 	/* GHCBData[63:32] */				\
 	(((u64)(val) & GENMASK_ULL(63, 32)) >> 32)
 
+/* Set highest bit as a generic error response */
+#define GHCB_MSR_PSC_RESP_ERROR (BIT_ULL(63) | GHCB_MSR_PSC_RESP)
+
 /* GHCB Hypervisor Feature Request/Response */
 #define GHCB_MSR_HV_FT_REQ		0x080
 #define GHCB_MSR_HV_FT_RESP		0x081
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 76084e109f66..f6f54a889fde 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3463,6 +3463,36 @@  static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
 	svm->vmcb->control.ghcb_gpa = value;
 }
 
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 vmm_ret = vcpu->run->vmgexit.psc_msr.ret;
+
+	set_ghcb_msr(svm, (vmm_ret << 32) | GHCB_MSR_PSC_RESP);
+
+	return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
+{
+	u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+	u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+		set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+		return 1; /* resume guest */
+	}
+
+	vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+	vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
+	vcpu->run->vmgexit.psc_msr.gpa = gpa;
+	vcpu->run->vmgexit.psc_msr.op = op;
+	vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+	return 0; /* forward request to userspace */
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -3561,6 +3591,9 @@  static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 				  GHCB_MSR_INFO_POS);
 		break;
 	}
+	case GHCB_MSR_PSC_REQ:
+		ret = snp_begin_psc_msr(vcpu, control->ghcb_gpa);
+		break;
 	case GHCB_MSR_TERM_REQ: {
 		u64 reason_set, reason_code;
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..54b81e46a9fa 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,20 @@  struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_PSC_MSR	1
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u64 gpa;
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_PRIVATE	1
+#define KVM_USER_VMGEXIT_PSC_MSR_OP_SHARED	2
+			__u8 op;
+			__u32 ret;
+		} psc_msr;
+	};
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +192,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -433,6 +448,8 @@  struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_VMGEXIT */
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};