
[RFC,gmem,v1,4/8] KVM: x86: Add gmem hook for invalidating memory

Message ID 20231016115028.996656-5-michael.roth@amd.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series KVM: gmem hooks/changes needed for x86 (other archs?)

Commit Message

Michael Roth Oct. 16, 2023, 11:50 a.m. UTC
In some cases, like with SEV-SNP, guest memory needs to be updated in a
platform-specific manner before it can be safely freed back to the host.
Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
allow for special handling of this sort when freeing memory in response
to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
ahead and define an arch-specific hook for x86 since it will be needed
for handling memory used for SEV-SNP guests.

Signed-off-by: Michael Roth <michael.roth@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/x86.c                 |  7 +++++++
 include/linux/kvm_host.h           |  4 ++++
 virt/kvm/Kconfig                   |  4 ++++
 virt/kvm/guest_memfd.c             | 14 ++++++++++++++
 6 files changed, 31 insertions(+)

Comments

Steven Price Feb. 9, 2024, 10:11 a.m. UTC | #1
On 16/10/2023 12:50, Michael Roth wrote:
> In some cases, like with SEV-SNP, guest memory needs to be updated in a
> platform-specific manner before it can be safely freed back to the host.
> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> allow for special handling of this sort when freeing memory in response
> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> ahead and define an arch-specific hook for x86 since it will be needed
> for handling memory used for SEV-SNP guests.

Hi all,

Arm CCA has a similar need to prepare/unprepare memory (granule
delegate/undelegate using our terminology) before it is used for
protected memory.

However, I see a problem with the current gmem implementation: the
"invalidations" are not precise enough for our RMI API. When punching a
hole in the memfd the code currently hits the same path (ending in
kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
the shared version). The Arm CCA architecture doesn't allow the
protected memory to be removed and refaulted without the permission of
the guest (the memory contents would be wiped in this case).

One option that I've considered is to implement a separate CCA ioctl to
notify KVM whether the memory should be mapped protected. The
invalidations would then be ignored on ranges that are currently
protected for this guest.

This 'solves' the problem nicely except for the case where the VMM
deliberately punches holes in memory which the guest is using.

The issue in this case is that there's no way of failing the punch hole
operation - we can detect that the memory is in use and shouldn't be
freed, but this callback doesn't give the opportunity to actually block
the freeing of the memory.

Sadly there's no easy way to map from a physical page in a gmem back to
which VM (and where in the VM) the page is mapped. So actually ripping
the page out of the appropriate VM isn't really possible in this case.

How is this situation handled on x86? Is it possible to invalidate and
then refault a protected page without affecting the memory contents? My
guess is yes and that is a CCA specific problem - is my understanding
correct?

My current thoughts for CCA are one of three options:

1. Represent shared and protected memory as two separate memslots. This
matches the underlying architecture more closely (the top address bit is
repurposed as a 'shared' flag), but I don't like it because it's a
deviation from other CoCo architectures (notably pKVM).

2. Allow punch-hole to fail on CCA if the memory is mapped into the
guest's protected space. Again, this is CCA being different and also
creates nasty corner cases where the gmem descriptor could have to
outlive the VMM - so looks like a potential source of memory leaks.

3. 'Fix' the invalidation to provide more precise semantics. I haven't
yet prototyped it but it might be possible to simply provide a flag from
kvm_gmem_invalidate_begin specifying that the invalidation is for the
protected memory. KVM would then only unmap the protected memory when
this flag is set (avoiding issues with VMA updates causing spurious unmaps).

Fairly obviously (3) is my preferred option, but it relies on the
guarantee that the "invalidation" covers precisely the set of
addresses where the memory is actually being freed.
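
As a rough sketch of what (3) could look like on the gmem side, assuming a
hypothetical only_private flag in struct kvm_gfn_range (which this series does
not provide; locking, invalidate begin/end bookkeeping and TLB flushing are
omitted):

    /*
     * Sketch only, loosely based on kvm_gmem_invalidate_begin() from the
     * guest_memfd series; ".only_private" is an assumed new member.
     */
    static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
                                          pgoff_t end)
    {
        struct kvm_memory_slot *slot;
        struct kvm *kvm = gmem->kvm;
        unsigned long index;

        xa_for_each_range(&gmem->bindings, index, slot, start, end - 1) {
            pgoff_t pgoff = slot->gmem.pgoff;
            struct kvm_gfn_range gfn_range = {
                .start = slot->base_gfn + max(pgoff, start) - pgoff,
                .end = slot->base_gfn + min(pgoff + slot->npages, end) - pgoff,
                .slot = slot,
                .may_block = true,
                /* gmem never backs shared mappings, so say so explicitly. */
                .only_private = true,
            };

            kvm_mmu_unmap_gfn_range(kvm, &gfn_range);
        }
    }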

Comments, thoughts, objections welcome!

Steve

> Signed-off-by: Michael Roth <michael.roth@amd.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |  1 +
>  arch/x86/include/asm/kvm_host.h    |  1 +
>  arch/x86/kvm/x86.c                 |  7 +++++++
>  include/linux/kvm_host.h           |  4 ++++
>  virt/kvm/Kconfig                   |  4 ++++
>  virt/kvm/guest_memfd.c             | 14 ++++++++++++++
>  6 files changed, 31 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 0c113f42d5c7..f1505a5fa781 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -135,6 +135,7 @@ KVM_X86_OP(complete_emulated_msr)
>  KVM_X86_OP(vcpu_deliver_sipi_vector)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
>  KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
> +KVM_X86_OP_OPTIONAL(gmem_invalidate)
>  
>  #undef KVM_X86_OP
>  #undef KVM_X86_OP_OPTIONAL
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 66fc89d1858f..dbec74783f48 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1754,6 +1754,7 @@ struct kvm_x86_ops {
>  	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>  
>  	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
> +	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 33a4cc33d86d..0e95c3a95e59 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13308,6 +13308,13 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
>  }
>  #endif
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
> +{
> +	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
> +}
> +#endif
> +
>  int kvm_spec_ctrl_test_value(u64 value)
>  {
>  	/*
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index c7f82c2f1bcf..840a5be5962a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2429,4 +2429,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
>  int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
>  #endif
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
> +#endif
> +
>  #endif
> diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
> index 992cf6ed86ef..7fd1362a7ebe 100644
> --- a/virt/kvm/Kconfig
> +++ b/virt/kvm/Kconfig
> @@ -113,3 +113,7 @@ config KVM_GENERIC_PRIVATE_MEM
>  config HAVE_KVM_GMEM_PREPARE
>         bool
>         depends on KVM_PRIVATE_MEM
> +
> +config HAVE_KVM_GMEM_INVALIDATE
> +       bool
> +       depends on KVM_PRIVATE_MEM
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 72ff8b7b31d5..b4c4df259fb8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -369,12 +369,26 @@ static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
>  	return MF_DELAYED;
>  }
>  
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +static void kvm_gmem_free_folio(struct folio *folio)
> +{
> +	struct page *page = folio_page(folio, 0);
> +	kvm_pfn_t pfn = page_to_pfn(page);
> +	int order = folio_order(folio);
> +
> +	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
> +}
> +#endif
> +
>  static const struct address_space_operations kvm_gmem_aops = {
>  	.dirty_folio = noop_dirty_folio,
>  #ifdef CONFIG_MIGRATION
>  	.migrate_folio	= kvm_gmem_migrate_folio,
>  #endif
>  	.error_remove_page = kvm_gmem_error_page,
> +#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
> +	.free_folio = kvm_gmem_free_folio,
> +#endif
>  };
>  
>  static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
Sean Christopherson Feb. 9, 2024, 2:28 p.m. UTC | #2
On Fri, Feb 09, 2024, Steven Price wrote:
> On 16/10/2023 12:50, Michael Roth wrote:
> > In some cases, like with SEV-SNP, guest memory needs to be updated in a
> > platform-specific manner before it can be safely freed back to the host.
> > Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
> > allow for special handling of this sort when freeing memory in response
> > to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
> > ahead and define an arch-specific hook for x86 since it will be needed
> > for handling memory used for SEV-SNP guests.
> 
> Hi all,
> 
> Arm CCA has a similar need to prepare/unprepare memory (granule
> delegate/undelegate using our terminology) before it is used for
> protected memory.
> 
> However I see a problem with the current gmem implementation that the
> "invalidations" are not precise enough for our RMI API. When punching a
> hole in the memfd the code currently hits the same path (ending in
> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
> the shared version).
>
> The Arm CCA architecture doesn't allow the protected memory to be removed and
> refaulted without the permission of the guest (the memory contents would be
> wiped in this case).

TDX behaves almost exactly like CCA.  Well, that's not technically true, strictly
speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
!PRESENT, but those aren't in play for invalidation events like this.

SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
would do, is destructive, so SNP also behaves the same way for all intents and
purposes.

> One option that I've considered is to implement a seperate CCA ioctl to
> notify KVM whether the memory should be mapped protected.

That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

> The invalidations would then be ignored on ranges that are currently
> protected for this guest.

That's backwards.  Invalidations on a guest_memfd should affect only *protected*
mappings.  And for that, the plan/proposal is to plumb only_{shared,private} flags
into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
mappings, and mmu_notifier invalidation don't zap private mappings.  Sample usage
in the TDX context[2] (disclaimer: I'm pretty sure I didn't write most of that
patch; I only provided a rough sketch).

[1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
[2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
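
For reference, a sketch of the rough shape being discussed (the
only_{shared,private} members and the helper names below are illustrative
assumptions, not the exact upstream code):

    struct kvm_gfn_range {
        struct kvm_memory_slot *slot;
        gfn_t start;
        gfn_t end;
        union kvm_mmu_notifier_arg arg;
        bool only_private;
        bool only_shared;
        bool may_block;
    };

    /*
     * A CoCo-aware consumer could then filter what it zaps; the two helpers
     * are assumed placeholders, not existing KVM functions.
     */
    bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
    {
        bool flush = false;

        if (!range->only_private)
            flush |= zap_shared_mappings(kvm, range);   /* assumed */
        if (!range->only_shared)
            flush |= zap_private_mappings(kvm, range);  /* assumed */

        return flush;
    }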

> This 'solves' the problem nicely except for the case where the VMM
> deliberately punches holes in memory which the guest is using.

I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
so don't do that.

> The issue in this case is that there's no way of failing the punch hole
> operation - we can detect that the memory is in use and shouldn't be
> freed, but this callback doesn't give the opportunity to actually block
> the freeing of the memory.

Why is this KVM's problem?  E.g. the same exact thing happens without guest_memfd
if userspace munmap()s memory the guest is using.

> Sadly there's no easy way to map from a physical page in a gmem back to
> which VM (and where in the VM) the page is mapped. So actually ripping
> the page out of the appropriate VM isn't really possible in this case.

I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
not know what exactly needs to be invalidated?

> How is this situation handled on x86? Is it possible to invalidate and
> then refault a protected page without affecting the memory contents? My
> guess is yes and that is a CCA specific problem - is my understanding
> correct?
> 
> My current thoughts for CCA are one of three options:
> 
> 1. Represent shared and protected memory as two separate memslots. This
> matches the underlying architecture more closely (the top address bit is
> repurposed as a 'shared' flag), but I don't like it because it's a
> deviation from other CoCo architectures (notably pKVM).
> 
> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
> guest's protected space. Again, this is CCA being different and also
> creates nasty corner cases where the gmem descriptor could have to
> outlive the VMM - so looks like a potential source of memory leaks.
> 
> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
> yet prototyped it but it might be possible to simply provide a flag from
> kvm_gmem_invalidate_begin specifying that the invalidation is for the
> protected memory. KVM would then only unmap the protected memory when
> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
> 
> Fairly obviously (3) is my preferred option, but it relies on the
> guarantees that the "invalidation" is actually a precise set of
> addresses where the memory is actually being freed.

#3 is what we are planning for x86, and except for the only_{shared,private} flags,
the requisite functionality should already be in Linus' tree, though it does need
to be wired up for ARM.
Steven Price Feb. 9, 2024, 3:02 p.m. UTC | #3
Hi Sean,

Thanks for the reply.

On 09/02/2024 14:28, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
>> On 16/10/2023 12:50, Michael Roth wrote:
>>> In some cases, like with SEV-SNP, guest memory needs to be updated in a
>>> platform-specific manner before it can be safely freed back to the host.
>>> Wire up arch-defined hooks to the .free_folio kvm_gmem_aops callback to
>>> allow for special handling of this sort when freeing memory in response
>>> to FALLOC_FL_PUNCH_HOLE operations and when releasing the inode, and go
>>> ahead and define an arch-specific hook for x86 since it will be needed
>>> for handling memory used for SEV-SNP guests.
>>
>> Hi all,
>>
>> Arm CCA has a similar need to prepare/unprepare memory (granule
>> delegate/undelegate using our terminology) before it is used for
>> protected memory.
>>
>> However I see a problem with the current gmem implementation that the
>> "invalidations" are not precise enough for our RMI API. When punching a
>> hole in the memfd the code currently hits the same path (ending in
>> kvm_unmap_gfn_range()) as if a VMA is modified in the same range (for
>> the shared version).
>>
>> The Arm CCA architecture doesn't allow the protected memory to be removed and
>> refaulted without the permission of the guest (the memory contents would be
>> wiped in this case).
> 
> TDX behaves almost exactly like CCA.  Well, that's not technically true, strictly
> speaking, as there are TDX APIs that do allow for *temporarily* marking mappings
> !PRESENT, but those aren't in play for invalidation events like this.

Ok, great. I was under the impression they were similar.

> SNP does allow zapping page table mappings, but fully removing a page, as PUNCH_HOLE
> would do, is destructive, so SNP also behaves the same way for all intents and
> purposes.

Zapping page table mappings is what the invalidate calls imply. This is
something CCA can't do. Obviously fully removing the page would be
destructive.

>> One option that I've considered is to implement a seperate CCA ioctl to
>> notify KVM whether the memory should be mapped protected.
> 
> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?

Sorry, I really didn't explain that well. Yes, effectively this is the
attribute flag, but there are corner cases for destruction of the VM. My
thought was that if the VMM wanted to tear down part of the protected
range (without making it shared) then a separate ioctl would be needed
to notify KVM of the unmap.

>> The invalidations would then be ignored on ranges that are currently
>> protected for this guest.
> 
> That's backwards.  Invalidations on a guest_memfd should affect only *protected*
> mappings.  And for that, the plan/proposal is to plumb only_{shared,private} flags
> into "struct kvm_gfn_range"[1] so that guest_memfd invalidations don't zap shared
> mappings, and mmu_notifier invalidation don't zap private mappings.  Sample usage
> in the TDX context[2] (disclaimer, I'm pretty sure I didn't write most of that
> patch despite, I only provided a rough sketch).

Aha, this sounds much like my option 3 below - a way to tell if the
invalidate comes from guest_memfd as opposed to VMA changes.

> [1] https://lore.kernel.org/all/20231027182217.3615211-13-seanjc@google.com
> [2] https://lore.kernel.org/all/0b308fb6dd52bafe7153086c7f54bfad03da74b1.1705965635.git.isaku.yamahata@intel.com
> 
>> This 'solves' the problem nicely except for the case where the VMM
>> deliberately punches holes in memory which the guest is using.
> 
> I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> so don't do that.

A well-behaved VMM wouldn't PUNCH_HOLE memory the guest is using, but
my concern here is a VMM which is trying to break the host. In this case
either the PUNCH_HOLE needs to fail, or we actually need to recover the
memory from the guest (effectively killing the guest in the process).

>> The issue in this case is that there's no way of failing the punch hole
>> operation - we can detect that the memory is in use and shouldn't be
>> freed, but this callback doesn't give the opportunity to actually block
>> the freeing of the memory.
> 
> Why is this KVM's problem?  E.g. the same exact thing happens without guest_memfd
> if userspace munmap()s memory the guest is using.

Indeed. The difference here is that for a normal non-realm guest the
pages can be removed from the page-table and refaulted on a later
access. Indeed there's nothing stopping the VMM from freeing the
pages and reallocating them later.

For a realm guest if the memory is pulled from the guest then the guest
is effectively dead (at least until migration is implemented but even
then there's going to be a specific controlled mechanism).

>> Sadly there's no easy way to map from a physical page in a gmem back to
>> which VM (and where in the VM) the page is mapped. So actually ripping
>> the page out of the appropriate VM isn't really possible in this case.
> 
> I don't follow.  guest_memfd has a 1:1 binding with a VM *and* a gfn, how can you
> not know what exactly needs to be invalidated?

At the point that gmem calls kvm_mmu_unmap_gfn_range(), the fact that the
range is gmem-backed is lost.

>> How is this situation handled on x86? Is it possible to invalidate and
>> then refault a protected page without affecting the memory contents? My
>> guess is yes and that is a CCA specific problem - is my understanding
>> correct?
>>
>> My current thoughts for CCA are one of three options:
>>
>> 1. Represent shared and protected memory as two separate memslots. This
>> matches the underlying architecture more closely (the top address bit is
>> repurposed as a 'shared' flag), but I don't like it because it's a
>> deviation from other CoCo architectures (notably pKVM).
>>
>> 2. Allow punch-hole to fail on CCA if the memory is mapped into the
>> guest's protected space. Again, this is CCA being different and also
>> creates nasty corner cases where the gmem descriptor could have to
>> outlive the VMM - so looks like a potential source of memory leaks.
>>
>> 3. 'Fix' the invalidation to provide more precise semantics. I haven't
>> yet prototyped it but it might be possible to simply provide a flag from
>> kvm_gmem_invalidate_begin specifying that the invalidation is for the
>> protected memory. KVM would then only unmap the protected memory when
>> this flag is set (avoiding issues with VMA updates causing spurious unmaps).
>>
>> Fairly obviously (3) is my preferred option, but it relies on the
>> guarantees that the "invalidation" is actually a precise set of
>> addresses where the memory is actually being freed.
> 
> #3 is what we are planning for x86, and except for the only_{shared,private} flags,
> the requisite functionality should already be in Linus' tree, though it does need
> to be wired up for ARM.

Thanks, looks like the only_{shared,private} flags should do it. My only
worry about that solution was that it implicitly changes the
"invalidation" when only_private==1 to a precise list of pages that are
to be unmapped. Whereas for a normal guest it's only a performance issue
if a larger region is invalidated, for a CoCo guest it would be fatal to
the guest.

I'll cherry-pick the "KVM: Add new members to struct kvm_gfn_range to
operate on" patch from the TDX tree as I think this should do the trick.
I have hacked up something similar and it looks like it should work.

Thanks,

Steve
Sean Christopherson Feb. 9, 2024, 3:13 p.m. UTC | #4
On Fri, Feb 09, 2024, Steven Price wrote:
> >> One option that I've considered is to implement a seperate CCA ioctl to
> >> notify KVM whether the memory should be mapped protected.
> > 
> > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> 
> Sorry, I really didn't explain that well. Yes effectively this is the
> attribute flag, but there's corner cases for destruction of the VM. My
> thought was that if the VMM wanted to tear down part of the protected
> range (without making it shared) then a separate ioctl would be needed
> to notify KVM of the unmap.

No new uAPI should be needed, because the only time a benign VMM should
do this is if the guest also knows the memory is being removed, in which case
PUNCH_HOLE will suffice.

> >> This 'solves' the problem nicely except for the case where the VMM
> >> deliberately punches holes in memory which the guest is using.
> > 
> > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > so don't do that.
> 
> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> my concern here is a VMM which is trying to break the host. In this case
> either the PUNCH_HOLE needs to fail, or we actually need to recover the
> memory from the guest (effectively killing the guest in the process).

The latter.  IIRC, we talked about this exact case somewhere in the hour-long
rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
this multiple times on-list, though I don't know that there is a single thread
that captures the entire plan.

The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
instance that's attached to a given guest_memfd inode when a page is being fully
removed, i.e. when a page is being freed back to the normal memory pool.  Something
like this proposed SNP patch[2].

Mike, do you have WIP patches you can share?

[1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
[2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com
Michael Roth March 11, 2024, 5:24 p.m. UTC | #5
On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> On Fri, Feb 09, 2024, Steven Price wrote:
> > >> One option that I've considered is to implement a seperate CCA ioctl to
> > >> notify KVM whether the memory should be mapped protected.
> > > 
> > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > 
> > Sorry, I really didn't explain that well. Yes effectively this is the
> > attribute flag, but there's corner cases for destruction of the VM. My
> > thought was that if the VMM wanted to tear down part of the protected
> > range (without making it shared) then a separate ioctl would be needed
> > to notify KVM of the unmap.
> 
> No new uAPI should be needed, because the only scenario time a benign VMM should
> do this is if the guest also knows the memory is being removed, in which case
> PUNCH_HOLE will suffice.
> 
> > >> This 'solves' the problem nicely except for the case where the VMM
> > >> deliberately punches holes in memory which the guest is using.
> > > 
> > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > so don't do that.
> > 
> > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > my concern here is a VMM which is trying to break the host. In this case
> > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > memory from the guest (effectively killing the guest in the process).
> 
> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> this multiple times on-list, though I don't know that there is a single thread
> that captures the entire plan.
> 
> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> instance that's attached to a given guest_memfd inode when a page is being fully
> removed, i.e. when a page is being freed back to the normal memory pool.  Something
> like this proposed SNP patch[2].
> 
> Mike, do have WIP patches you can share?

Sorry, I missed this query earlier. I'm a bit confused though, I thought
the kvm_arch_gmem_invalidate() hook provided in this patch was what we
ended up agreeing on during the PUCK call in question.

There was an open question about what to do if a use-case came along
where we needed to pass additional parameters to
kvm_arch_gmem_invalidate() other than just the start/end PFN range for
the pages being freed, but we'd determined that SNP and TDX did not
currently need this, so I didn't have any changes planned in this
regard.

If we now have such a need, what we had proposed was to modify
__filemap_remove_folio()/page_cache_delete() to defer setting
folio->mapping to NULL, so that kvm_gmem_free_folio() could still access
mapping->i_private_list to get the list of gmem/KVM instances and pass
them on via kvm_arch_gmem_invalidate().
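
As a very rough sketch of that idea (it assumes folio->mapping has not yet
been NULLed by page_cache_delete(), and an extended per-VM hook that does not
exist in this series; locking is ignored):

    static void kvm_gmem_free_folio(struct folio *folio)
    {
        struct address_space *mapping = folio->mapping;
        kvm_pfn_t pfn = folio_pfn(folio);
        int order = folio_order(folio);
        struct kvm_gmem *gmem;

        /* Walk every gmem instance bound to this guest_memfd inode. */
        list_for_each_entry(gmem, &mapping->i_private_list, entry)
            kvm_arch_gmem_invalidate_vm(gmem->kvm, pfn,
                                        pfn + (1ul << order));
    }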

So that's doable, but it's not clear from this discussion that that's
needed. If the idea is to block/kill the guest if the VMM tries to hole-punch,
and ARM CCA already has plans to wire up the shared/private flags in
kvm_unmap_gfn_range(), wouldn't that have all the information needed to
kill that guest? At that point, kvm_gmem_free_folio() can handle
additional per-page cleanup (with additional gmem/KVM info plumbed in
if necessary).

-Mike


[1] https://lore.kernel.org/kvm/20240202230611.351544-1-seanjc@google.com/T/


> 
> [1] https://drive.google.com/corp/drive/folders/116YTH1h9yBZmjqeJc03cV4_AhSe-VBkc?resourcekey=0-sOGeFEUi60-znJJmZBsTHQ
> [2] https://lore.kernel.org/all/20231230172351.574091-30-michael.roth@amd.com
Sean Christopherson March 12, 2024, 8:26 p.m. UTC | #6
On Mon, Mar 11, 2024, Michael Roth wrote:
> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
> > On Fri, Feb 09, 2024, Steven Price wrote:
> > > >> One option that I've considered is to implement a seperate CCA ioctl to
> > > >> notify KVM whether the memory should be mapped protected.
> > > > 
> > > > That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
> > > 
> > > Sorry, I really didn't explain that well. Yes effectively this is the
> > > attribute flag, but there's corner cases for destruction of the VM. My
> > > thought was that if the VMM wanted to tear down part of the protected
> > > range (without making it shared) then a separate ioctl would be needed
> > > to notify KVM of the unmap.
> > 
> > No new uAPI should be needed, because the only scenario time a benign VMM should
> > do this is if the guest also knows the memory is being removed, in which case
> > PUNCH_HOLE will suffice.
> > 
> > > >> This 'solves' the problem nicely except for the case where the VMM
> > > >> deliberately punches holes in memory which the guest is using.
> > > > 
> > > > I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
> > > > so don't do that.
> > > 
> > > A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
> > > my concern here is a VMM which is trying to break the host. In this case
> > > either the PUNCH_HOLE needs to fail, or we actually need to recover the
> > > memory from the guest (effectively killing the guest in the process).
> > 
> > The latter.  IIRC, we talked about this exact case somewhere in the hour-long
> > rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
> > this multiple times on-list, though I don't know that there is a single thread
> > that captures the entire plan.
> > 
> > The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
> > instance that's attached to a given guest_memfd inode when a page is being fully
> > removed, i.e. when a page is being freed back to the normal memory pool.  Something
> > like this proposed SNP patch[2].
> > 
> > Mike, do have WIP patches you can share?
> 
> Sorry, I missed this query earlier. I'm a bit confused though, I thought
> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
> ended up agreeing on during the PUCK call in question.

Heh, I trust your memory of things far more than I trust mine.  I'm just proving
Cunningham's Law.  :-)

> There was an open question about what to do if a use-case came along
> where we needed to pass additional parameters to
> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
> the pages being freed, but we'd determined that SNP and TDX did not
> currently need this, so I didn't have any changes planned in this
> regard.
> 
> If we now have such a need, what we had proposed was to modify
> __filemap_remove_folio()/page_cache_delete() to defer setting
> folio->mapping to NULL so that we could still access it in
> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
> to get the list of gmem/KVM instances and pass them on via
> kvm_arch_gmem_invalidate().

Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
need to iterate over all bindings at this time.

> So that's doable, but it's not clear from this discussion that that's
> needed.

Same here.  And even if it is needed, it's not your problem to solve.  The above
blurb about needing to preserve folio->mapping for free_folio() is sufficient
to get the ARM code moving in the right direction.

Thanks!

> If the idea to block/kill the guest if VMM tries to hole-punch,
> and ARM CCA already has plans to wire up the shared/private flags in
> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
> kill that guest? At that point, kvm_gmem_free_folio() can handle
> additional per-page cleanup (with additional gmem/KVM info plumbed in
> if necessary).
Steven Price March 13, 2024, 5:11 p.m. UTC | #7
On 12/03/2024 20:26, Sean Christopherson wrote:
> On Mon, Mar 11, 2024, Michael Roth wrote:
>> On Fri, Feb 09, 2024 at 07:13:13AM -0800, Sean Christopherson wrote:
>>> On Fri, Feb 09, 2024, Steven Price wrote:
>>>>>> One option that I've considered is to implement a seperate CCA ioctl to
>>>>>> notify KVM whether the memory should be mapped protected.
>>>>>
>>>>> That's what KVM_SET_MEMORY_ATTRIBUTES+KVM_MEMORY_ATTRIBUTE_PRIVATE is for, no?
>>>>
>>>> Sorry, I really didn't explain that well. Yes effectively this is the
>>>> attribute flag, but there's corner cases for destruction of the VM. My
>>>> thought was that if the VMM wanted to tear down part of the protected
>>>> range (without making it shared) then a separate ioctl would be needed
>>>> to notify KVM of the unmap.
>>>
>>> No new uAPI should be needed, because the only scenario time a benign VMM should
>>> do this is if the guest also knows the memory is being removed, in which case
>>> PUNCH_HOLE will suffice.
>>>
>>>>>> This 'solves' the problem nicely except for the case where the VMM
>>>>>> deliberately punches holes in memory which the guest is using.
>>>>>
>>>>> I don't see what problem there is to solve in this case.  PUNCH_HOLE is destructive,
>>>>> so don't do that.
>>>>
>>>> A well behaving VMM wouldn't PUNCH_HOLE when the guest is using it, but
>>>> my concern here is a VMM which is trying to break the host. In this case
>>>> either the PUNCH_HOLE needs to fail, or we actually need to recover the
>>>> memory from the guest (effectively killing the guest in the process).
>>>
>>> The latter.  IIRC, we talked about this exact case somewhere in the hour-long
>>> rambling discussion on guest_memfd at PUCK[1].  And we've definitely discussed
>>> this multiple times on-list, though I don't know that there is a single thread
>>> that captures the entire plan.
>>>
>>> The TL;DR is that gmem will invoke an arch hook for every "struct kvm_gmem"
>>> instance that's attached to a given guest_memfd inode when a page is being fully
>>> removed, i.e. when a page is being freed back to the normal memory pool.  Something
>>> like this proposed SNP patch[2].
>>>
>>> Mike, do have WIP patches you can share?
>>
>> Sorry, I missed this query earlier. I'm a bit confused though, I thought
>> the kvm_arch_gmem_invalidate() hook provided in this patch was what we
>> ended up agreeing on during the PUCK call in question.
> 
> Heh, I trust your memory of things far more than I trust mine.  I'm just proving
> Cunningham's Law.  :-)
> 
>> There was an open question about what to do if a use-case came along
>> where we needed to pass additional parameters to
>> kvm_arch_gmem_invalidate() other than just the start/end PFN range for
>> the pages being freed, but we'd determined that SNP and TDX did not
>> currently need this, so I didn't have any changes planned in this
>> regard.
>>
>> If we now have such a need, what we had proposed was to modify
>> __filemap_remove_folio()/page_cache_delete() to defer setting
>> folio->mapping to NULL so that we could still access it in
>> kvm_gmem_free_folio() so that we can still access mapping->i_private_list
>> to get the list of gmem/KVM instances and pass them on via
>> kvm_arch_gmem_invalidate().
> 
> Yeah, this is what I was remembering.  I obviously forgot that we didn't have a
> need to iterate over all bindings at this time.
> 
>> So that's doable, but it's not clear from this discussion that that's
>> needed.
> 
> Same here.  And even if it is needed, it's not your problem to solve.  The above
> blurb about needing to preserve folio->mapping being free_folio() is sufficient
> to get the ARM code moving in the right direction.
> 
> Thanks!
> 
>> If the idea to block/kill the guest if VMM tries to hole-punch,
>> and ARM CCA already has plans to wire up the shared/private flags in
>> kvm_unmap_gfn_range(), wouldn't that have all the information needed to
>> kill that guest? At that point, kvm_gmem_free_folio() can handle
>> additional per-page cleanup (with additional gmem/KVM info plumbed in
>> if necessary).

Yes, the missing piece of the puzzle was provided by "KVM: Prepare for
handling only shared mappings in mmu_notifier events"[1] - namely the
"only_shared" flag. We don't need to actually block/kill the guest until
it attempts access to the memory which has been removed from the guest -
at that point the guest cannot continue because the security properties
have been violated (the protected memory contents have been lost) so
attempts to continue the guest will fail.

You can ignore most of my other ramblings - as long as everyone is happy
with that flag then Arm CCA should be fine. I was just looking at other
options.

Thanks,

Steve

[1]
https://lore.kernel.org/lkml/20231027182217.3615211-13-seanjc@google.com/

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 0c113f42d5c7..f1505a5fa781 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -135,6 +135,7 @@  KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
+KVM_X86_OP_OPTIONAL(gmem_invalidate)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66fc89d1858f..dbec74783f48 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1754,6 +1754,7 @@  struct kvm_x86_ops {
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
 
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
+	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 33a4cc33d86d..0e95c3a95e59 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13308,6 +13308,13 @@  int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
 }
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+	static_call_cond(kvm_x86_gmem_invalidate)(start, end);
+}
+#endif
+
 int kvm_spec_ctrl_test_value(u64 value)
 {
 	/*
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c7f82c2f1bcf..840a5be5962a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2429,4 +2429,8 @@  static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#endif
+
 #endif
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 992cf6ed86ef..7fd1362a7ebe 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -113,3 +113,7 @@  config KVM_GENERIC_PRIVATE_MEM
 config HAVE_KVM_GMEM_PREPARE
        bool
        depends on KVM_PRIVATE_MEM
+
+config HAVE_KVM_GMEM_INVALIDATE
+       bool
+       depends on KVM_PRIVATE_MEM
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 72ff8b7b31d5..b4c4df259fb8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -369,12 +369,26 @@  static int kvm_gmem_error_page(struct address_space *mapping, struct page *page)
 	return MF_DELAYED;
 }
 
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+static void kvm_gmem_free_folio(struct folio *folio)
+{
+	struct page *page = folio_page(folio, 0);
+	kvm_pfn_t pfn = page_to_pfn(page);
+	int order = folio_order(folio);
+
+	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
+}
+#endif
+
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 #ifdef CONFIG_MIGRATION
 	.migrate_folio	= kvm_gmem_migrate_folio,
 #endif
 	.error_remove_page = kvm_gmem_error_page,
+#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
+	.free_folio = kvm_gmem_free_folio,
+#endif
 };
 
 static int kvm_gmem_getattr(struct mnt_idmap *idmap, const struct path *path,
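
For context, a vendor implementation consuming the new optional op might look
roughly like the following (a hedged sketch loosely modeled on the SNP series
referenced in the discussion; snp_reclaim_page_for_host() is an assumed helper,
not an existing kernel API, and the vendor Kconfig would also need to select
HAVE_KVM_GMEM_INVALIDATE):

    static void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
    {
        kvm_pfn_t pfn;

        for (pfn = start; pfn < end; pfn++) {
            /* Return the page to a host-usable state before it is freed. */
            if (snp_reclaim_page_for_host(pfn))
                pr_err("Failed to reclaim gmem PFN 0x%llx\n", pfn);
        }
    }

    /* Advertised via the vendor's struct kvm_x86_ops initializer: */
    .gmem_invalidate = sev_gmem_invalidate,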