[RFC,v3,0/6] Direct Map Removal for guest_memfd

Message ID	20241030134912.515725-1-roypat@amazon.co.uk (mailing list archive)
Headers	show Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77FD613FEE; Wed, 30 Oct 2024 13:49:41 +0000 (UTC) From: Patrick Roy <roypat@amazon.co.uk> To: <tabba@google.com>, <quic_eberman@quicinc.com>, <david@redhat.com>, <seanjc@google.com>, <pbonzini@redhat.com>, <jthoughton@google.com>, <ackerleytng@google.com>, <vannapurve@google.com>, <rppt@kernel.org> CC: Patrick Roy <roypat@amazon.co.uk>, <graf@amazon.com>, <jgowans@amazon.com>, <derekmn@amazon.com>, <kalyazin@amazon.com>, <xmarcalx@amazon.com>, <linux-mm@kvack.org>, <corbet@lwn.net>, <catalin.marinas@arm.com>, <will@kernel.org>, <chenhuacai@kernel.org>, <kernel@xen0n.name>, <paul.walmsley@sifive.com>, <palmer@dabbelt.com>, <aou@eecs.berkeley.edu>, <hca@linux.ibm.com>, <gor@linux.ibm.com>, <agordeev@linux.ibm.com>, <borntraeger@linux.ibm.com>, <svens@linux.ibm.com>, <gerald.schaefer@linux.ibm.com>, <tglx@linutronix.de>, <mingo@redhat.com>, <bp@alien8.de>, <dave.hansen@linux.intel.com>, <x86@kernel.org>, <hpa@zytor.com>, <luto@kernel.org>, <peterz@infradead.org>, <rostedt@goodmis.org>, <mhiramat@kernel.org>, <mathieu.desnoyers@efficios.com>, <shuah@kernel.org>, <kvm@vger.kernel.org>, <linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <linux-arm-kernel@lists.infradead.org>, <loongarch@lists.linux.dev>, <linux-riscv@lists.infradead.org>, <linux-s390@vger.kernel.org>, <linux-trace-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org> Subject: [RFC PATCH v3 0/6] Direct Map Removal for guest_memfd Date: Wed, 30 Oct 2024 13:49:04 +0000 Message-ID: <20241030134912.515725-1-roypat@amazon.co.uk> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Type: text/plain
Series	Direct Map Removal for guest_memfd \| expand [RFC,v3,0/6] Direct Map Removal for guest_memfd [RFC,v3,1/6] arch: introduce set_direct_map_valid_noflush() [RFC,v3,2/6] kvm: gmem: add flag to remove memory from kernel direct map [RFC,v3,3/6] kvm: gmem: implement direct map manipulation routines [RFC,v3,4/6] kvm: gmem: add trace point for direct map state changes [RFC,v3,5/6] kvm: document KVM_GMEM_NO_DIRECT_MAP flag [RFC,v3,6/6] kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set

Patrick Roy Oct. 30, 2024, 1:49 p.m. UTC

Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].

This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.

=== Changes to v2 ===

- Handle direct map removal for physically contiguous pages in arch code
  (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the
  folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory
  (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)

Most significantly, I've reduced the patch series to focus only on
direct map removal for guest_memfd for now, leaving the whole "how to do
non-CoCo VMs in guest_memfd" for later. If this separation is
acceptable, then I think I can drop the RFC tag in the next revision
(I've mainly kept it here because I'm not entirely sure what to do with
patches 3 and 4).

=== Implementation ===

This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD
that causes guest_memfd to remove its pages from the host kernel's
direct map immediately after population/preparation.  It also adds
infrastructure for tracking the direct map state of all gmem folios
inside the guest_memfd inode. Storing this information in the inode has
the advantage that the code is ready for future hugepages extensions,
where only removing/reinserting direct map entries for sub-ranges of a
huge folio is a valid usecase, and it allows pre-configuring the direct
map state of not-yet faulted in parts of memory (for example, when the
VMM is receiving a RX virtio buffer from the guest).

=== Summary ===

Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct
map for ranges of physically contiguous pages, which are used by
guest_memfd in follow up patches. Patch 2 adds the
KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map
state of freshly prepared folios. Patches 3 and 4 mainly serve an
illustrative purpose, to show how the framework from patch 2 can be
extended with routines for runtime direct map manipulation. Patches 5
and 6 deal with documentation and self-tests respectively.

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/

Mike Rapoport (Microsoft) (1):
  arch: introduce set_direct_map_valid_noflush()

Patrick Roy (5):
  kvm: gmem: add flag to remove memory from kernel direct map
  kvm: gmem: implement direct map manipulation routines
  kvm: gmem: add trace point for direct map state changes
  kvm: document KVM_GMEM_NO_DIRECT_MAP flag
  kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set

 Documentation/virt/kvm/api.rst                |  14 +
 arch/arm64/include/asm/set_memory.h           |   1 +
 arch/arm64/mm/pageattr.c                      |  10 +
 arch/loongarch/include/asm/set_memory.h       |   1 +
 arch/loongarch/mm/pageattr.c                  |  21 ++
 arch/riscv/include/asm/set_memory.h           |   1 +
 arch/riscv/mm/pageattr.c                      |  15 +
 arch/s390/include/asm/set_memory.h            |   1 +
 arch/s390/mm/pageattr.c                       |  11 +
 arch/x86/include/asm/set_memory.h             |   1 +
 arch/x86/mm/pat/set_memory.c                  |   8 +
 include/linux/set_memory.h                    |   6 +
 include/trace/events/kvm.h                    |  22 ++
 include/uapi/linux/kvm.h                      |   2 +
 .../testing/selftests/kvm/guest_memfd_test.c  |   2 +-
 .../kvm/x86_64/private_mem_conversions_test.c |   7 +-
 virt/kvm/guest_memfd.c                        | 280 +++++++++++++++++-
 17 files changed, 384 insertions(+), 19 deletions(-)


base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6

David Hildenbrand Oct. 31, 2024, 9:50 a.m. UTC | #1

On 30.10.24 14:49, Patrick Roy wrote:
> Unmapping virtual machine guest memory from the host kernel's direct map
> is a successful mitigation against Spectre-style transient execution
> issues: If the kernel page tables do not contain entries pointing to
> guest memory, then any attempted speculative read through the direct map
> will necessarily be blocked by the MMU before any observable
> microarchitectural side-effects happen. This means that Spectre-gadgets
> and similar cannot be used to target virtual machine memory. Roughly 60%
> of speculative execution issues fall into this category [1, Table 1].
> 
> This patch series extends guest_memfd with the ability to remove its
> memory from the host kernel's direct map, to be able to attain the above
> protection for KVM guests running inside guest_memfd.
> 
> === Changes to v2 ===
> 
> - Handle direct map removal for physically contiguous pages in arch code
>    (Mike R.)
> - Track the direct map state in guest_memfd itself instead of at the
>    folio level, to prepare for huge pages support (Sean C.)
> - Allow configuring direct map state of not-yet faulted in memory
>    (Vishal A.)
> - Pay attention to alignment in ftrace structs (Steven R.)
> 
> Most significantly, I've reduced the patch series to focus only on
> direct map removal for guest_memfd for now, leaving the whole "how to do
> non-CoCo VMs in guest_memfd" for later. If this separation is
> acceptable, then I think I can drop the RFC tag in the next revision
> (I've mainly kept it here because I'm not entirely sure what to do with
> patches 3 and 4).

Hi,

keeping upcoming "shared and private memory in guest_memfd" in mind, I 
assume the focus would be to only remove the direct map for private memory?

So in the current upstream state, you would only be removing the direct 
map for private memory, currently translating to "encrypted"/"protected" 
memory that is inaccessible either way already.

Correct?

Patrick Roy Oct. 31, 2024, 10:42 a.m. UTC | #2

On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> On 30.10.24 14:49, Patrick Roy wrote:
>> Unmapping virtual machine guest memory from the host kernel's direct map
>> is a successful mitigation against Spectre-style transient execution
>> issues: If the kernel page tables do not contain entries pointing to
>> guest memory, then any attempted speculative read through the direct map
>> will necessarily be blocked by the MMU before any observable
>> microarchitectural side-effects happen. This means that Spectre-gadgets
>> and similar cannot be used to target virtual machine memory. Roughly 60%
>> of speculative execution issues fall into this category [1, Table 1].
>>
>> This patch series extends guest_memfd with the ability to remove its
>> memory from the host kernel's direct map, to be able to attain the above
>> protection for KVM guests running inside guest_memfd.
>>
>> === Changes to v2 ===
>>
>> - Handle direct map removal for physically contiguous pages in arch code
>>    (Mike R.)
>> - Track the direct map state in guest_memfd itself instead of at the
>>    folio level, to prepare for huge pages support (Sean C.)
>> - Allow configuring direct map state of not-yet faulted in memory
>>    (Vishal A.)
>> - Pay attention to alignment in ftrace structs (Steven R.)
>>
>> Most significantly, I've reduced the patch series to focus only on
>> direct map removal for guest_memfd for now, leaving the whole "how to do
>> non-CoCo VMs in guest_memfd" for later. If this separation is
>> acceptable, then I think I can drop the RFC tag in the next revision
>> (I've mainly kept it here because I'm not entirely sure what to do with
>> patches 3 and 4).
> 
> Hi,
> 
> keeping upcoming "shared and private memory in guest_memfd" in mind, I
> assume the focus would be to only remove the direct map for private memory?
> 
> So in the current upstream state, you would only be removing the direct
> map for private memory, currently translating to "encrypted"/"protected"
> memory that is inaccessible either way already.
> 
> Correct?

Yea, with the upcomming "shared and private" stuff, I would expect the
the shared<->private conversions would call the routines from patch 3 to
restore direct map entries on private->shared, and zap them on
shared->private.

But as you said, the current upstream state has no notion of "shared"
memory in guest_memfd, so everything is private and thus everything is
direct map removed (although it is indeed already inaccessible anyway
for TDX and friends. That's what makes this patch series a bit awkward
:( )

> -- 
> Cheers,
> 
> David / dhildenb
> 

Best, 
Patrick

Manwaring, Derek Nov. 1, 2024, 12:10 a.m. UTC | #3

On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> > On 30.10.24 14:49, Patrick Roy wrote:
> >> Most significantly, I've reduced the patch series to focus only on
> >> direct map removal for guest_memfd for now, leaving the whole "how to do
> >> non-CoCo VMs in guest_memfd" for later. If this separation is
> >> acceptable, then I think I can drop the RFC tag in the next revision
> >> (I've mainly kept it here because I'm not entirely sure what to do with
> >> patches 3 and 4).
> >
> > Hi,
> >
> > keeping upcoming "shared and private memory in guest_memfd" in mind, I
> > assume the focus would be to only remove the direct map for private memory?
> >
> > So in the current upstream state, you would only be removing the direct
> > map for private memory, currently translating to "encrypted"/"protected"
> > memory that is inaccessible either way already.
> >
> > Correct?
>
> Yea, with the upcomming "shared and private" stuff, I would expect the
> the shared<->private conversions would call the routines from patch 3 to
> restore direct map entries on private->shared, and zap them on
> shared->private.
>
> But as you said, the current upstream state has no notion of "shared"
> memory in guest_memfd, so everything is private and thus everything is
> direct map removed (although it is indeed already inaccessible anyway
> for TDX and friends. That's what makes this patch series a bit awkward
> :( )

TDX and SEV encryption happens between the core and main memory, so
cached guest data we're most concerned about for transient execution
attacks isn't necessarily inaccessible.

I'd be interested what Intel, AMD, and other folks think on this, but I
think direct map removal is worthwhile for CoCo cases as well.

Derek

Sean Christopherson Nov. 1, 2024, 3:18 p.m. UTC | #4

+David Kaplan

On Thu, Oct 31, 2024, Derek Manwaring wrote:
> On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
> > On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> > > On 30.10.24 14:49, Patrick Roy wrote:
> > >> Most significantly, I've reduced the patch series to focus only on
> > >> direct map removal for guest_memfd for now, leaving the whole "how to do
> > >> non-CoCo VMs in guest_memfd" for later. If this separation is
> > >> acceptable, then I think I can drop the RFC tag in the next revision
> > >> (I've mainly kept it here because I'm not entirely sure what to do with
> > >> patches 3 and 4).
> > >
> > > Hi,
> > >
> > > keeping upcoming "shared and private memory in guest_memfd" in mind, I
> > > assume the focus would be to only remove the direct map for private memory?
> > >
> > > So in the current upstream state, you would only be removing the direct
> > > map for private memory, currently translating to "encrypted"/"protected"
> > > memory that is inaccessible either way already.
> > >
> > > Correct?
> >
> > Yea, with the upcomming "shared and private" stuff, I would expect the
> > the shared<->private conversions would call the routines from patch 3 to
> > restore direct map entries on private->shared, and zap them on
> > shared->private.
> >
> > But as you said, the current upstream state has no notion of "shared"
> > memory in guest_memfd, so everything is private and thus everything is
> > direct map removed (although it is indeed already inaccessible anyway
> > for TDX and friends. That's what makes this patch series a bit awkward
> > :( )
> 
> TDX and SEV encryption happens between the core and main memory, so
> cached guest data we're most concerned about for transient execution
> attacks isn't necessarily inaccessible.
> 
> I'd be interested what Intel, AMD, and other folks think on this, but I
> think direct map removal is worthwhile for CoCo cases as well.

Removal of the direct map entries for guest private PFNs likely won't affect the
ability of an attacker to glean information from the unencrypted data that's in
the CPU caches, at least not on x86.  Both TDX and SEV steal physical address
bit(s) for tagging encrypted memory, and unless things have changed on recent
AMD microarchitectures (I'm 99.9% certain Intel CPUs haven't changed), those stolen
address bits are propagated into the caches.  I.e. the encrypted and unencrypted
forms of a given PFN are actually two different physical addresses under the hood.

I don't actually know how SEV uses the stolen PA bits though.  I don't see how it
simply be the ASID, because IIUC, AMD CPUs allow for more unique SEV-capable ASIDs
than uniquely addressable PAs by the number of stolen bits.  But I would be very
surprised if the tag for the cache isn't guaranteed to be unique per encryption key.

David?

Dave Hansen Nov. 1, 2024, 4:06 p.m. UTC | #5

On 10/31/24 17:10, Manwaring, Derek wrote:
> TDX and SEV encryption happens between the core and main memory, so
> cached guest data we're most concerned about for transient execution
> attacks isn't necessarily inaccessible.
> 
> I'd be interested what Intel, AMD, and other folks think on this, but I
> think direct map removal is worthwhile for CoCo cases as well.

I'm not sure specifically which attacks you have in mind.  Also, don't
forget that TDX doesn't get any intra-system security from encryption
itself.  All that the encryption does is prevent some physical attacks.

The main thing I think you want to keep in mind is mentioned in the "TDX
Module v1.5 Base Architecture Specification"[1]:

> Any software except guest TD or TDX module must not be able to
> speculatively or non-speculatively access TD private memory,

That's a pretty broad claim and it involves mitigations in hardware and
the TDX module.

I _think_ you might be thinking of attacks like MDS where some random
microarchitectural buffer contains guest data after a VM exit and then
an attacker extracts it.  Direct map removal doesn't affect these
buffers and doesn't mitigate an attacker getting the data out.  TDX
relies on other defenses.

As for the CPU caches, direct map removal doesn't help or hurt anything.
 A malicious VMM would just map the guest data if it could and try to
extract it if that were possible.  If the attack is mitigated when the
data is _mapped_, then it's certainly not possible _unmapped_.

So why bother with direct map removal for TDX?  A VMM write to TD
private data causes machine checks.  So any kernel bug that even
accidentally writes to kernel memory can bring the whole system down.
Not nice.

1. https://cdrdv2.intel.com/v1/dl/getContent/733575

Manwaring, Derek Nov. 1, 2024, 4:56 p.m. UTC | #6

+Elena

On 2024-11-01 at 16:06+0000, Dave Hansen wrote:
> On 10/31/24 17:10, Manwaring, Derek wrote:
> > TDX and SEV encryption happens between the core and main memory, so
> > cached guest data we're most concerned about for transient execution
> > attacks isn't necessarily inaccessible.
> >
> > I'd be interested what Intel, AMD, and other folks think on this, but I
> > think direct map removal is worthwhile for CoCo cases as well.
>
> I'm not sure specifically which attacks you have in mind.  [...]
>
> I _think_ you might be thinking of attacks like MDS where some random
> microarchitectural buffer contains guest data after a VM exit and then
> an attacker extracts it.  Direct map removal doesn't affect these
> buffers and doesn't mitigate an attacker getting the data out.

Right, the only attacks we can thwart with direct map removal are
transient execution attacks on the host kernel whose leak origin is
"Mapped memory" in Table 1 of the Quarantine paper [2]. Maybe the
simplest hypothetical to consider here is a new spectre v1 gadget in the
host kernel.

> The main thing I think you want to keep in mind is mentioned in the "TDX
> Module v1.5 Base Architecture Specification"[1]:
>
> > Any software except guest TD or TDX module must not be able to
> > speculatively or non-speculatively access TD private memory,
>
> That's a pretty broad claim and it involves mitigations in hardware and
> the TDX module.
>
> 1. https://cdrdv2.intel.com/v1/dl/getContent/733575

Thank you, I hadn't seen that. That is a very strong claim as far as
preventing speculative access; I didn't realize Intel claimed that about
TDX. The comma followed by "to detect if a prior corruption attempt was
successful" makes me wonder a bit if the statement is not quite as broad
as it sounds, but maybe that's just meant to relate it to the integrity
section?

> If the attack is mitigated when the > data is _mapped_, then it's
> certainly not possible _unmapped_.
>
> So why bother with direct map removal for TDX?  A VMM write to TD
> private data causes machine checks.  So any kernel bug that even
> accidentally writes to kernel memory can bring the whole system down.
> Not nice.

Fair enough. It hasn't been clear to me if there is a machine check when
the host kernel accesses guest memory only transiently. I was assuming
there is not. But if other mitigations completely prevent even
speculative access of TD private memory like you're saying, then agree
nothing to gain from direct map removal in the TDX case.

Derek

[2] https://download.vusec.net/papers/quarantine_raid23.pdf

Dave Hansen Nov. 1, 2024, 5:20 p.m. UTC | #7

On 11/1/24 09:56, Manwaring, Derek wrote:
...
>>> Any software except guest TD or TDX module must not be able to
>>> speculatively or non-speculatively access TD private memory,
>>
>> That's a pretty broad claim and it involves mitigations in hardware and
>> the TDX module.
>>
>> 1. https://cdrdv2.intel.com/v1/dl/getContent/733575
> 
> Thank you, I hadn't seen that. That is a very strong claim as far as
> preventing speculative access; I didn't realize Intel claimed that about
> TDX. The comma followed by "to detect if a prior corruption attempt was
> successful" makes me wonder a bit if the statement is not quite as broad
> as it sounds, but maybe that's just meant to relate it to the integrity
> section?

I think it's just relating it to the integrity section.

>> If the attack is mitigated when the > data is _mapped_, then it's
>> certainly not possible _unmapped_.
>>
>> So why bother with direct map removal for TDX?  A VMM write to TD
>> private data causes machine checks.  So any kernel bug that even
>> accidentally writes to kernel memory can bring the whole system down.
>> Not nice.
> 
> Fair enough. It hasn't been clear to me if there is a machine check when
> the host kernel accesses guest memory only transiently. I was assuming
> there is not. 

Previous generations of hardware have had some nastiness in this area.
Speculative accesses were (I think) logged in the machine check banks,
but wouldn't raise an #MC.  I believe TDX-capable hardware won't even
log speculative accesses.

> But if other mitigations completely prevent even speculative access
> of TD private memory like you're saying, then agree nothing to gain
> from direct map removal in the TDX case.
Remember, guest unmapping is done in the VMM.  The VMM is not trusted in
the TDX (or SEV-SNP) model.  If any VMM can harm the protections on
guest memory, then we have a big problem.

That isn't to say big problem can't happen.  Say some crazy attack comes
to light where the VMM can attack TDX if the VMM has mapping for a guest
(or TDX module) memory.  Crazier things have happened, and guest
unmapping _would_ help there, if you trusted the VMM.

Basically, I think guest unmapping only helps system security as a whole
if you must _already_ trust the VMM.

Manwaring, Derek Nov. 1, 2024, 6:31 p.m. UTC | #8

On 2024-11-01 at 17:20+0000, Dave Hansen wrote:
> On 11/1/24 09:56, Manwaring, Derek wrote:
> > But if other mitigations completely prevent even speculative access
> > of TD private memory like you're saying, then agree nothing to gain
> > from direct map removal in the TDX case.
> Remember, guest unmapping is done in the VMM.  The VMM is not trusted in
> the TDX (or SEV-SNP) model.  If any VMM can harm the protections on
> guest memory, then we have a big problem.
>
> That isn't to say big problem can't happen.  Say some crazy attack comes
> to light where the VMM can attack TDX if the VMM has mapping for a guest
> (or TDX module) memory.  Crazier things have happened, and guest
> unmapping _would_ help there, if you trusted the VMM.
>
> Basically, I think guest unmapping only helps system security as a whole
> if you must _already_ trust the VMM.

Yeah that makes a lot of sense. I just view the ideal outcome as a
composition of strong, independent defenses. So as a guest you have the
confidentiality and integrity guarantees of the hardware, *and* you have
an up-to-date, good-hygiene (albeit not attested) host kernel just in
case some crazy attack/gap comes up.

From that standpoint I'm still tempted to turn the question around a bit
for the host kernel's perspective. Like if the host kernel should not
(and indeed cannot with TDX controls in place) access guest private
memory, why not remove it from the direct map?

Derek

Kaplan, David Nov. 1, 2024, 6:32 p.m. UTC | #9

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Friday, November 1, 2024 10:18 AM
> To: Derek Manwaring <derekmn@amazon.com>
> Cc: roypat@amazon.co.uk; ackerleytng@google.com;
> agordeev@linux.ibm.com; aou@eecs.berkeley.edu;
> borntraeger@linux.ibm.com; bp@alien8.de; catalin.marinas@arm.com;
> chenhuacai@kernel.org; corbet@lwn.net; dave.hansen@linux.intel.com;
> david@redhat.com; gerald.schaefer@linux.ibm.com; gor@linux.ibm.com;
> graf@amazon.com; hca@linux.ibm.com; hpa@zytor.com;
> jgowans@amazon.com; jthoughton@google.com; kalyazin@amazon.com;
> kernel@xen0n.name; kvm@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; linux-doc@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org; linux-
> mm@kvack.org; linux-riscv@lists.infradead.org; linux-s390@vger.kernel.org;
> linux-trace-kernel@vger.kernel.org; loongarch@lists.linux.dev;
> luto@kernel.org; mathieu.desnoyers@efficios.com; mhiramat@kernel.org;
> mingo@redhat.com; palmer@dabbelt.com; paul.walmsley@sifive.com;
> pbonzini@redhat.com; peterz@infradead.org; quic_eberman@quicinc.com;
> rostedt@goodmis.org; rppt@kernel.org; shuah@kernel.org;
> svens@linux.ibm.com; tabba@google.com; tglx@linutronix.de;
> vannapurve@google.com; will@kernel.org; x86@kernel.org;
> xmarcalx@amazon.com; Kaplan, David <David.Kaplan@amd.com>
> Subject: Re: [RFC PATCH v3 0/6] Direct Map Removal for guest_memfd
>
> Caution: This message originated from an External Source. Use proper
> caution when opening attachments, clicking links, or responding.
>
>
> +David Kaplan
>
> On Thu, Oct 31, 2024, Derek Manwaring wrote:
> > On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
> > > On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> > > > On 30.10.24 14:49, Patrick Roy wrote:
> > > >> Most significantly, I've reduced the patch series to focus only
> > > >> on direct map removal for guest_memfd for now, leaving the whole
> > > >> "how to do non-CoCo VMs in guest_memfd" for later. If this
> > > >> separation is acceptable, then I think I can drop the RFC tag in
> > > >> the next revision (I've mainly kept it here because I'm not
> > > >> entirely sure what to do with patches 3 and 4).
> > > >
> > > > Hi,
> > > >
> > > > keeping upcoming "shared and private memory in guest_memfd" in
> > > > mind, I assume the focus would be to only remove the direct map for
> private memory?
> > > >
> > > > So in the current upstream state, you would only be removing the
> > > > direct map for private memory, currently translating to
> "encrypted"/"protected"
> > > > memory that is inaccessible either way already.
> > > >
> > > > Correct?
> > >
> > > Yea, with the upcomming "shared and private" stuff, I would expect
> > > the the shared<->private conversions would call the routines from
> > > patch 3 to restore direct map entries on private->shared, and zap
> > > them on
> > > shared->private.
> > >
> > > But as you said, the current upstream state has no notion of "shared"
> > > memory in guest_memfd, so everything is private and thus everything
> > > is direct map removed (although it is indeed already inaccessible
> > > anyway for TDX and friends. That's what makes this patch series a
> > > bit awkward :( )
> >
> > TDX and SEV encryption happens between the core and main memory, so
> > cached guest data we're most concerned about for transient execution
> > attacks isn't necessarily inaccessible.
> >
> > I'd be interested what Intel, AMD, and other folks think on this, but
> > I think direct map removal is worthwhile for CoCo cases as well.
>
> Removal of the direct map entries for guest private PFNs likely won't affect
> the ability of an attacker to glean information from the unencrypted data
> that's in the CPU caches, at least not on x86.  Both TDX and SEV steal physical
> address
> bit(s) for tagging encrypted memory, and unless things have changed on
> recent AMD microarchitectures (I'm 99.9% certain Intel CPUs haven't
> changed), those stolen address bits are propagated into the caches.  I.e. the
> encrypted and unencrypted forms of a given PFN are actually two different
> physical addresses under the hood.
>
> I don't actually know how SEV uses the stolen PA bits though.  I don't see
> how it simply be the ASID, because IIUC, AMD CPUs allow for more unique
> SEV-capable ASIDs than uniquely addressable PAs by the number of stolen
> bits.  But I would be very surprised if the tag for the cache isn't guaranteed to
> be unique per encryption key.
>
> David?

How the stolen PA bits are used is a microarchitectural implementation detail.  It is true that the tag will be unique per encryption key.  Beyond that, I'm not sure what other details are relevant to SW.

--David Kaplan

Dave Hansen Nov. 1, 2024, 6:43 p.m. UTC | #10

On 11/1/24 11:31, Manwaring, Derek wrote:
>>From that standpoint I'm still tempted to turn the question around a bit
> for the host kernel's perspective. Like if the host kernel should not
> (and indeed cannot with TDX controls in place) access guest private
> memory, why not remove it from the direct map?

Pretend that the machine check warts aren't there.

It costs performance and complexity, for an only theoretical gain.  This
is especially true for a VMM that's not doing a just doing confidential
guests.  You fracture the direct map to pieces forever (for now).

Manwaring, Derek Nov. 1, 2024, 7:29 p.m. UTC | #11

On 2024-11-01 at 18:43+0000, Dave Hansen wrote:
> On 11/1/24 11:31, Manwaring, Derek wrote:
> > From that standpoint I'm still tempted to turn the question around a bit
> > for the host kernel's perspective. Like if the host kernel should not
> > (and indeed cannot with TDX controls in place) access guest private
> > memory, why not remove it from the direct map?
>
> Pretend that the machine check warts aren't there.
>
> It costs performance and complexity, for an only theoretical gain.  This
> is especially true for a VMM that's not doing a just doing confidential
> guests.  You fracture the direct map to pieces forever (for now).

I'm hopeful we'll navigate the complexity in a worthwhile way for the
non-CoCo case. Assuming we get there and have the option to remove from
direct map, users with CoCo hardware could choose if they want to do
both on their host. For me that's a sensible choice, but maybe that's
just me.

As far as performance, are you talking about just the fracturing or
something beyond that? The data Mike brought to LSFMMBPF 2023 showed the
perf impact from direct map fragmentation for memfd_secret isn't "that
bad" [1].

Derek

[1] https://lwn.net/Articles/931406/

Dave Hansen Nov. 1, 2024, 7:39 p.m. UTC | #12

On 11/1/24 12:29, Manwaring, Derek wrote:
> As far as performance, are you talking about just the fracturing or
> something beyond that? The data Mike brought to LSFMMBPF 2023 showed the
> perf impact from direct map fragmentation for memfd_secret isn't "that
> bad" [1].

Just the fracturing.

Mike's data are interesting, but I still think we're pretty far away
from saying that the kernel doesn't need large mappings.

Reshetova, Elena Nov. 4, 2024, 8:33 a.m. UTC | #13

> 
> +Elena
> 
> On 2024-11-01 at 16:06+0000, Dave Hansen wrote:
> > On 10/31/24 17:10, Manwaring, Derek wrote:
> > > TDX and SEV encryption happens between the core and main memory, so
> > > cached guest data we're most concerned about for transient execution
> > > attacks isn't necessarily inaccessible.
> > >
> > > I'd be interested what Intel, AMD, and other folks think on this, but I
> > > think direct map removal is worthwhile for CoCo cases as well.
> >
> > I'm not sure specifically which attacks you have in mind.  [...]
> >
> > I _think_ you might be thinking of attacks like MDS where some random
> > microarchitectural buffer contains guest data after a VM exit and then
> > an attacker extracts it.  Direct map removal doesn't affect these
> > buffers and doesn't mitigate an attacker getting the data out.
> 
> Right, the only attacks we can thwart with direct map removal are
> transient execution attacks on the host kernel whose leak origin is
> "Mapped memory" in Table 1 of the Quarantine paper [2]. Maybe the
> simplest hypothetical to consider here is a new spectre v1 gadget in the
> host kernel.
> 
> > The main thing I think you want to keep in mind is mentioned in the "TDX
> > Module v1.5 Base Architecture Specification"[1]:
> >
> > > Any software except guest TD or TDX module must not be able to
> > > speculatively or non-speculatively access TD private memory,
> >
> > That's a pretty broad claim and it involves mitigations in hardware and
> > the TDX module.
> >
> > 1. https://cdrdv2.intel.com/v1/dl/getContent/733575
> 
> Thank you, I hadn't seen that. That is a very strong claim as far as
> preventing speculative access; I didn't realize Intel claimed that about
> TDX. The comma followed by "to detect if a prior corruption attempt was
> successful" makes me wonder a bit if the statement is not quite as broad
> as it sounds, but maybe that's just meant to relate it to the integrity
> section?

This statement *is* for integrity section. We have a separate TDX guidance
on side-channels (including speculative) [3] and some speculative attacks
that affect confidentiality (for example spectre v1) are listed as not covered
by TDX but remaining SW responsibility (as they are now). 

[3] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/best-practices/trusted-domain-security-guidance-for-developers.html

Best Regards,
Elena.

David Hildenbrand Nov. 4, 2024, 12:18 p.m. UTC | #14

On 31.10.24 11:42, Patrick Roy wrote:
> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
>> On 30.10.24 14:49, Patrick Roy wrote:
>>> Unmapping virtual machine guest memory from the host kernel's direct map
>>> is a successful mitigation against Spectre-style transient execution
>>> issues: If the kernel page tables do not contain entries pointing to
>>> guest memory, then any attempted speculative read through the direct map
>>> will necessarily be blocked by the MMU before any observable
>>> microarchitectural side-effects happen. This means that Spectre-gadgets
>>> and similar cannot be used to target virtual machine memory. Roughly 60%
>>> of speculative execution issues fall into this category [1, Table 1].
>>>
>>> This patch series extends guest_memfd with the ability to remove its
>>> memory from the host kernel's direct map, to be able to attain the above
>>> protection for KVM guests running inside guest_memfd.
>>>
>>> === Changes to v2 ===
>>>
>>> - Handle direct map removal for physically contiguous pages in arch code
>>>     (Mike R.)
>>> - Track the direct map state in guest_memfd itself instead of at the
>>>     folio level, to prepare for huge pages support (Sean C.)
>>> - Allow configuring direct map state of not-yet faulted in memory
>>>     (Vishal A.)
>>> - Pay attention to alignment in ftrace structs (Steven R.)
>>>
>>> Most significantly, I've reduced the patch series to focus only on
>>> direct map removal for guest_memfd for now, leaving the whole "how to do
>>> non-CoCo VMs in guest_memfd" for later. If this separation is
>>> acceptable, then I think I can drop the RFC tag in the next revision
>>> (I've mainly kept it here because I'm not entirely sure what to do with
>>> patches 3 and 4).
>>
>> Hi,
>>
>> keeping upcoming "shared and private memory in guest_memfd" in mind, I
>> assume the focus would be to only remove the direct map for private memory?
>>
>> So in the current upstream state, you would only be removing the direct
>> map for private memory, currently translating to "encrypted"/"protected"
>> memory that is inaccessible either way already.
>>
>> Correct?
> 
> Yea, with the upcomming "shared and private" stuff, I would expect the
> the shared<->private conversions would call the routines from patch 3 to
> restore direct map entries on private->shared, and zap them on
> shared->private.

I wanted to follow-up to the discussion we had in the bi-weekly call.

We talked about shared (faultable) vs. private (unfaultable), and how it 
would interact with the directmap patches here.

As discussed, having private (unfaultable) memory with the direct-map 
removed and shared (faultable) memory with the direct-mapping can make 
sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, 
the discussion here seems to indicate that it might currently not be 
required.

So one thing we could do is that shared (faultable) will have a direct 
mapping and be gup-able and private (unfaultable) memory will not have a 
direct mapping and is, by design, not gup-able.

Maybe it could make sense to not have a direct map for all guest_memfd 
memory, making it behave like secretmem (and it would be easy to 
implement)? But I'm not sure if that is really desirable in VM context.

Having a mixture of "has directmap" and "has no directmap" for shared 
(faultable) memory should not be done. Similarly, private memory really 
should stay "unfaultable".

I think one of the points raised during the bi-weekly call was that 
using a viommu/swiotlb might be the right call, such that all memory can 
be considered private (unfaultable) that is not explicitly 
shared/expected to be modified by the hypervisor (-> faultable, -> 
GUP-able).

Further, I think Sean had some good points why we should explore that 
direction, but I recall that there were some issue to be sorted out 
(interpreted instructions requiring direct map when accessing "private" 
memory?), not sure if that is already working/can be made working in KVM.

What's your opinion after the call and the next step for use cases like 
you have in mind (IIRC firecracker, which wants to not have the 
direct-map for guest memory where it can be avoided)?

Patrick Roy Nov. 4, 2024, 1:09 p.m. UTC | #15

Hi David,

On 11/4/24 12:18, David Hildenbrand wrote:
> On 31.10.24 11:42, Patrick Roy wrote:
>> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
>>> On 30.10.24 14:49, Patrick Roy wrote:
>>>> Unmapping virtual machine guest memory from the host kernel's direct map
>>>> is a successful mitigation against Spectre-style transient execution
>>>> issues: If the kernel page tables do not contain entries pointing to
>>>> guest memory, then any attempted speculative read through the direct map
>>>> will necessarily be blocked by the MMU before any observable
>>>> microarchitectural side-effects happen. This means that Spectre-gadgets
>>>> and similar cannot be used to target virtual machine memory. Roughly 60%
>>>> of speculative execution issues fall into this category [1, Table 1].
>>>>
>>>> This patch series extends guest_memfd with the ability to remove its
>>>> memory from the host kernel's direct map, to be able to attain the above
>>>> protection for KVM guests running inside guest_memfd.
>>>>
>>>> === Changes to v2 ===
>>>>
>>>> - Handle direct map removal for physically contiguous pages in arch code
>>>>     (Mike R.)
>>>> - Track the direct map state in guest_memfd itself instead of at the
>>>>     folio level, to prepare for huge pages support (Sean C.)
>>>> - Allow configuring direct map state of not-yet faulted in memory
>>>>     (Vishal A.)
>>>> - Pay attention to alignment in ftrace structs (Steven R.)
>>>>
>>>> Most significantly, I've reduced the patch series to focus only on
>>>> direct map removal for guest_memfd for now, leaving the whole "how to do
>>>> non-CoCo VMs in guest_memfd" for later. If this separation is
>>>> acceptable, then I think I can drop the RFC tag in the next revision
>>>> (I've mainly kept it here because I'm not entirely sure what to do with
>>>> patches 3 and 4).
>>>
>>> Hi,
>>>
>>> keeping upcoming "shared and private memory in guest_memfd" in mind, I
>>> assume the focus would be to only remove the direct map for private memory?
>>>
>>> So in the current upstream state, you would only be removing the direct
>>> map for private memory, currently translating to "encrypted"/"protected"
>>> memory that is inaccessible either way already.
>>>
>>> Correct?
>>
>> Yea, with the upcomming "shared and private" stuff, I would expect the
>> the shared<->private conversions would call the routines from patch 3 to
>> restore direct map entries on private->shared, and zap them on
>> shared->private.
> 
> I wanted to follow-up to the discussion we had in the bi-weekly call.

Thanks for summarizing!

> We talked about shared (faultable) vs. private (unfaultable), and how it
> would interact with the directmap patches here.
> 
> As discussed, having private (unfaultable) memory with the direct-map
> removed and shared (faultable) memory with the direct-mapping can make
> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
> the discussion here seems to indicate that it might currently not be
> required.
>
> So one thing we could do is that shared (faultable) will have a direct
> mapping and be gup-able and private (unfaultable) memory will not have a
> direct mapping and is, by design, not gup-able.> 
> Maybe it could make sense to not have a direct map for all guest_memfd
> memory, making it behave like secretmem (and it would be easy to
> implement)? But I'm not sure if that is really desirable in VM context.

This would work for us (in this scenario, the swiotlb areas would be
"traditional" memory, e.g. set to shared via mem attributes instead of
"shared" inside KVM), it's kinda what I had prototyped in my v1 of this
series (well, we'd need to figure out how to get the mappings of gmem
back into KVM, since in this setup, short-circuiting it into
userspace_addr wouldn't work, unless we banish swiotlb into a different
memslot altogether somehow). But I don't think it'd work for pKVM, iirc
they need GUP on gmem, and also want direct map removal (... but maybe,
the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could be
behave differently?  non-CoCo gets essentially memfd_secret, pKVM gets
GUP+no faults of private mem).

> Having a mixture of "has directmap" and "has no directmap" for shared
> (faultable) memory should not be done. Similarly, private memory really
> should stay "unfaultable".

You've convinced me that having both GUP-able and non GUP-able
memory in the same VMA will be tricky. However, I'm less convinced on
why private memory should stay unfaultable; only that it shouldn't be
faultable into a VMA that also allows GUP. Can we have two VMAs? One
that disallows GUP, but allows userspace access to shared and private,
and one that allows GUP, but disallows accessing private memory? Maybe
via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
different spin of the above idea.

> I think one of the points raised during the bi-weekly call was that
> using a viommu/swiotlb might be the right call, such that all memory can
> be considered private (unfaultable) that is not explicitly
> shared/expected to be modified by the hypervisor (-> faultable, ->
> GUP-able).
> 
> Further, I think Sean had some good points why we should explore that
> direction, but I recall that there were some issue to be sorted out
> (interpreted instructions requiring direct map when accessing "private"
> memory?), not sure if that is already working/can be made working in KVM.

Yeah, the big one is MMIO instruction emulation on x86, which does guest
page table walks and instruction fetch (and particularly the latter
cannot be known ahead-of-time by the guest, aka cannot be explicitly
"shared"). That's what the majority of my v2 series was about. For
traditional memslots, KVM handles these via get_user and friends, but if
we don't have a VMA that allows faulting all of gmem, then that's
impossible, and we're in "temporarily restore direct map" land. Which
comes with significantly performance penalties due to TLB flushes.

> What's your opinion after the call and the next step for use cases like
> you have in mind (IIRC firecracker, which wants to not have the
> direct-map for guest memory where it can be avoided)?

Yea, the usecase is for Firecracker to not have direct map entries for
guest memory, unless needed for I/O (-> swiotlb).

As for next steps, let's determine once and for all if we can do the
KVM-internal guest memory accesses for MMIO emulation through userspace
mappings (although if we can't I'll have some serious soul-searching to
do, because all other solutions we talked about so far also have fairly
big drawbacks; on-demand direct map reinsertion has terrible
performance, protection keys would limit us to 15 VMs on the host, and
the page table swapping runs into problems with NMIs if I understood
Sean correctly last Thursday :( ).

> -- 
> Cheers,
> 
> David / dhildenb

Best, 
Patrick

David Hildenbrand Nov. 4, 2024, 9:30 p.m. UTC | #16

>> We talked about shared (faultable) vs. private (unfaultable), and how it
>> would interact with the directmap patches here.
>>
>> As discussed, having private (unfaultable) memory with the direct-map
>> removed and shared (faultable) memory with the direct-mapping can make
>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>> the discussion here seems to indicate that it might currently not be
>> required.
>>
>> So one thing we could do is that shared (faultable) will have a direct
>> mapping and be gup-able and private (unfaultable) memory will not have a
>> direct mapping and is, by design, not gup-able.>
>> Maybe it could make sense to not have a direct map for all guest_memfd
>> memory, making it behave like secretmem (and it would be easy to
>> implement)? But I'm not sure if that is really desirable in VM context.
> 
> This would work for us (in this scenario, the swiotlb areas would be
> "traditional" memory, e.g. set to shared via mem attributes instead of
> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
> series (well, we'd need to figure out how to get the mappings of gmem
> back into KVM, since in this setup, short-circuiting it into
> userspace_addr wouldn't work, unless we banish swiotlb into a different
> memslot altogether somehow).

Right.

> But I don't think it'd work for pKVM, iirc
> they need GUP on gmem, and also want direct map removal (... but maybe,
> the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could be
> behave differently?  non-CoCo gets essentially memfd_secret, pKVM gets
> GUP+no faults of private mem).

Good question. So far my perception was that the directmap removal on 
"private/unfaultable" would be sufficient.

> 
>> Having a mixture of "has directmap" and "has no directmap" for shared
>> (faultable) memory should not be done. Similarly, private memory really
>> should stay "unfaultable".
> 
> You've convinced me that having both GUP-able and non GUP-able
> memory in the same VMA will be tricky. However, I'm less convinced on
> why private memory should stay unfaultable; only that it shouldn't be
> faultable into a VMA that also allows GUP. Can we have two VMAs? One
> that disallows GUP, but allows userspace access to shared and private,
> and one that allows GUP, but disallows accessing private memory? Maybe
> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
> different spin of the above idea.

What we are trying to achieve is making guest_memfd not behave 
completely different on that level for different "types" of VMs. So one 
of the goals should be to try to unify it as much as possible.

shared -> faultable: GUP-able
private -> unfaultable: unGUP-able


And it makes sense, because a lot of future work will rely on some 
important properties: for example, if private memory cannot be faulted 
in + GUPed, core-MM will never have obtained valid references to such a 
page. There is no need to split large folios into smaller ones for 
tracking purposes; there is no need to maintain per-page refcounts and 
pincounts ...

It doesn't mean that we cannot consider it if really required, but there 
really has to be a strong case for it, because it will all get really messy.

For example, one issue is that a folio only has a single mapping 
(folio->mapping), and that is used in the GUP-fast path (no VMA) to 
determine whether GUP-fast is allowed or not.

So you'd have to force everything through GUP-slow, where you could 
consider VMA properties :( It sounds quite suboptimal.

I don't think multiple VMAs are what we really want. See below.

> 
>> I think one of the points raised during the bi-weekly call was that
>> using a viommu/swiotlb might be the right call, such that all memory can
>> be considered private (unfaultable) that is not explicitly
>> shared/expected to be modified by the hypervisor (-> faultable, ->
>> GUP-able).
>>
>> Further, I think Sean had some good points why we should explore that
>> direction, but I recall that there were some issue to be sorted out
>> (interpreted instructions requiring direct map when accessing "private"
>> memory?), not sure if that is already working/can be made working in KVM.
> 
> Yeah, the big one is MMIO instruction emulation on x86, which does guest
> page table walks and instruction fetch (and particularly the latter
> cannot be known ahead-of-time by the guest, aka cannot be explicitly
> "shared"). That's what the majority of my v2 series was about. For
> traditional memslots, KVM handles these via get_user and friends, but if
> we don't have a VMA that allows faulting all of gmem, then that's
> impossible, and we're in "temporarily restore direct map" land. Which
> comes with significantly performance penalties due to TLB flushes.

Agreed.

 > >> What's your opinion after the call and the next step for use cases 
like
>> you have in mind (IIRC firecracker, which wants to not have the
>> direct-map for guest memory where it can be avoided)?
> 
> Yea, the usecase is for Firecracker to not have direct map entries for
> guest memory, unless needed for I/O (-> swiotlb).
> 
> As for next steps, let's determine once and for all if we can do the
> KVM-internal guest memory accesses for MMIO emulation through userspace
> mappings (although if we can't I'll have some serious soul-searching to
> do, because all other solutions we talked about so far also have fairly
> big drawbacks; on-demand direct map reinsertion has terrible
> performance
So IIUC, KVM would have to access "unfaultable" guest_memfd memory using 
fd+offset, and that's problematic because "no-directmap".

So you'd have to map+unmap the directmap repeatedly, and still expose it 
temporarily in the direct map to others. I see how that is undesirable, 
even when trying to cache hotspots (partly destroying the purpose of the 
directmap removal).


Would a per-MM kernel mapping of these pages work, so KVM can access them?

It sounds a bit like what is required for clean per-MM allocations [1]: 
establish a per-MM kernel mapping of (selected?) pages. Not necessarily 
all of them.

Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything 
involved with ordinary user mappings for these private/unfaultable 
thingies. Just like as discussed in, and similar to [1].

Just throwing it out there, maybe we really want to avoid the directmap 
(keep it unmapped) and maintain a per-mm mapping for a bunch of folios 
that can be easily removed when required by guest_memfd (ftruncate, 
conversion private->shared) on request.

[1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u

Manwaring, Derek Nov. 6, 2024, 5:04 p.m. UTC | #17

On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
> This statement *is* for integrity section. We have a separate TDX guidance
> on side-channels (including speculative) [3] and some speculative attacks
> that affect confidentiality (for example spectre v1) are listed as not covered
> by TDX but remaining SW responsibility (as they are now).

Thanks for the additional info, Elena. Given that clarification, I
definitely see direct map removal and TDX as complementary.

Derek

Reshetova, Elena Nov. 8, 2024, 10:36 a.m. UTC | #18

> On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
> > This statement *is* for integrity section. We have a separate TDX guidance
> > on side-channels (including speculative) [3] and some speculative attacks
> > that affect confidentiality (for example spectre v1) are listed as not covered
> > by TDX but remaining SW responsibility (as they are now).
> 
> Thanks for the additional info, Elena. Given that clarification, I
> definitely see direct map removal and TDX as complementary.

Jus to clarify to make sure my comment is not misunderstood.
What I meant is that we cannot generally assume that confidentiality
leaks from CoCo guests to host/VMM via speculative channels
are completely impossible. Spectre V1 is a prime example of such a
possible leak. Dave also elaborated on other potential vectors earlier.

The usefulness of direct map removal for CoCo guests as a concrete
mitigation for certain types of memory attacks must be precisely
evaluated per each attack vector, attack vector direction (host -> guest,
guest ->host, etc) and per each countermeasure that CoCo vendors
provide for each such case. I don't know if there is any existing study
that examines this for major CoCo vendors. I think this is what must
be done for this work in order to have a strong reasoning for its usefulness.

Best Regards,
Elena.

Patrick Roy Nov. 12, 2024, 2:40 p.m. UTC | #19

Hi David, 

sorry for the late response, I ended up catching the flu last week and
was out of commission for a while :(

On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
>>> We talked about shared (faultable) vs. private (unfaultable), and how it
>>> would interact with the directmap patches here.
>>>
>>> As discussed, having private (unfaultable) memory with the direct-map
>>> removed and shared (faultable) memory with the direct-mapping can make
>>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>>> the discussion here seems to indicate that it might currently not be
>>> required.
>>>
>>> So one thing we could do is that shared (faultable) will have a direct
>>> mapping and be gup-able and private (unfaultable) memory will not have a
>>> direct mapping and is, by design, not gup-able.>
>>> Maybe it could make sense to not have a direct map for all guest_memfd
>>> memory, making it behave like secretmem (and it would be easy to
>>> implement)? But I'm not sure if that is really desirable in VM context.
>>
>> This would work for us (in this scenario, the swiotlb areas would be
>> "traditional" memory, e.g. set to shared via mem attributes instead of
>> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
>> series (well, we'd need to figure out how to get the mappings of gmem
>> back into KVM, since in this setup, short-circuiting it into
>> userspace_addr wouldn't work, unless we banish swiotlb into a different
>> memslot altogether somehow).
> 
> Right.

"right" as in, "yes we could do that"? :p

>> But I don't think it'd work for pKVM, iirc
>> they need GUP on gmem, and also want direct map removal (... but maybe,
>> the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could be
>> behave differently?  non-CoCo gets essentially memfd_secret, pKVM gets
>> GUP+no faults of private mem).
> 
> Good question. So far my perception was that the directmap removal on
> "private/unfaultable" would be sufficient.
> 
>>
>>> Having a mixture of "has directmap" and "has no directmap" for shared
>>> (faultable) memory should not be done. Similarly, private memory really
>>> should stay "unfaultable".
>>
>> You've convinced me that having both GUP-able and non GUP-able
>> memory in the same VMA will be tricky. However, I'm less convinced on
>> why private memory should stay unfaultable; only that it shouldn't be
>> faultable into a VMA that also allows GUP. Can we have two VMAs? One
>> that disallows GUP, but allows userspace access to shared and private,
>> and one that allows GUP, but disallows accessing private memory? Maybe
>> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
>> different spin of the above idea.
> 
> What we are trying to achieve is making guest_memfd not behave
> completely different on that level for different "types" of VMs. So one
> of the goals should be to try to unify it as much as possible.
> 
> shared -> faultable: GUP-able
> private -> unfaultable: unGUP-able
> 
> 
> And it makes sense, because a lot of future work will rely on some
> important properties: for example, if private memory cannot be faulted
> in + GUPed, core-MM will never have obtained valid references to such a
> page. There is no need to split large folios into smaller ones for
> tracking purposes; there is no need to maintain per-page refcounts and
> pincounts ...
> 
> It doesn't mean that we cannot consider it if really required, but there
> really has to be a strong case for it, because it will all get really messy.
> 
> For example, one issue is that a folio only has a single mapping
> (folio->mapping), and that is used in the GUP-fast path (no VMA) to
> determine whether GUP-fast is allowed or not.
> 
> So you'd have to force everything through GUP-slow, where you could
> consider VMA properties :( It sounds quite suboptimal.
> 
> I don't think multiple VMAs are what we really want. See below.

Ah, okay, I see. Thanks for explaining, this all makes a lot of sense to
me now!

>>
>>> I think one of the points raised during the bi-weekly call was that
>>> using a viommu/swiotlb might be the right call, such that all memory can
>>> be considered private (unfaultable) that is not explicitly
>>> shared/expected to be modified by the hypervisor (-> faultable, ->
>>> GUP-able).
>>>
>>> Further, I think Sean had some good points why we should explore that
>>> direction, but I recall that there were some issue to be sorted out
>>> (interpreted instructions requiring direct map when accessing "private"
>>> memory?), not sure if that is already working/can be made working in KVM.
>>
>> Yeah, the big one is MMIO instruction emulation on x86, which does guest
>> page table walks and instruction fetch (and particularly the latter
>> cannot be known ahead-of-time by the guest, aka cannot be explicitly
>> "shared"). That's what the majority of my v2 series was about. For
>> traditional memslots, KVM handles these via get_user and friends, but if
>> we don't have a VMA that allows faulting all of gmem, then that's
>> impossible, and we're in "temporarily restore direct map" land. Which
>> comes with significantly performance penalties due to TLB flushes.
> 
> Agreed.
> 
>> >> What's your opinion after the call and the next step for use cases
> like
>>> you have in mind (IIRC firecracker, which wants to not have the
>>> direct-map for guest memory where it can be avoided)?
>>
>> Yea, the usecase is for Firecracker to not have direct map entries for
>> guest memory, unless needed for I/O (-> swiotlb).
>>
>> As for next steps, let's determine once and for all if we can do the
>> KVM-internal guest memory accesses for MMIO emulation through userspace
>> mappings (although if we can't I'll have some serious soul-searching to
>> do, because all other solutions we talked about so far also have fairly
>> big drawbacks; on-demand direct map reinsertion has terrible
>> performance
> So IIUC, KVM would have to access "unfaultable" guest_memfd memory using
> fd+offset, and that's problematic because "no-directmap".
> 
> So you'd have to map+unmap the directmap repeatedly, and still expose it
> temporarily in the direct map to others. I see how that is undesirable,
> even when trying to cache hotspots (partly destroying the purpose of the
> directmap removal).
> 
> 
> Would a per-MM kernel mapping of these pages work, so KVM can access them?
> 
> It sounds a bit like what is required for clean per-MM allocations [1]:
> establish a per-MM kernel mapping of (selected?) pages. Not necessarily
> all of them.
> 
> Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything
> involved with ordinary user mappings for these private/unfaultable
> thingies. Just like as discussed in, and similar to [1].
> 
> Just throwing it out there, maybe we really want to avoid the directmap
> (keep it unmapped) and maintain a per-mm mapping for a bunch of folios
> that can be easily removed when required by guest_memfd (ftruncate,
> conversion private->shared) on request.

I remember talking to someone at some point about whether we could reuse
the proc-local stuff for guest memory, but I cannot remember the outcome
of that discussion... (or maybe I just wanted to have a discussion about
it, but forgot to follow up on that thought?).  I guess we wouldn't use
proc-local _allocations_, but rather just set up proc-local mappings of
the gmem allocations that have been removed from the direct map.

I'm wondering, where exactly would be the differences to Sean's idea
about messing with the CR3 register inside KVM to temporarily install
page tables that contain all the gmem stuff, conceptually? Wouldn't we
run into the same interrupt problems that Sean foresaw for the CR3
stuff? (which, admittedly, I still don't quite follow what these are :(
).

(I've cc'd Fares Mehanna as well)

> [1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

Best,
Patrick

David Hildenbrand Nov. 12, 2024, 2:52 p.m. UTC | #20

On 12.11.24 15:40, Patrick Roy wrote:
> 
> Hi David,
> 
> sorry for the late response, I ended up catching the flu last week and
> was out of commission for a while :(
> 
> On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
>>>> We talked about shared (faultable) vs. private (unfaultable), and how it
>>>> would interact with the directmap patches here.
>>>>
>>>> As discussed, having private (unfaultable) memory with the direct-map
>>>> removed and shared (faultable) memory with the direct-mapping can make
>>>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>>>> the discussion here seems to indicate that it might currently not be
>>>> required.
>>>>
>>>> So one thing we could do is that shared (faultable) will have a direct
>>>> mapping and be gup-able and private (unfaultable) memory will not have a
>>>> direct mapping and is, by design, not gup-able.>
>>>> Maybe it could make sense to not have a direct map for all guest_memfd
>>>> memory, making it behave like secretmem (and it would be easy to
>>>> implement)? But I'm not sure if that is really desirable in VM context.
>>>
>>> This would work for us (in this scenario, the swiotlb areas would be
>>> "traditional" memory, e.g. set to shared via mem attributes instead of
>>> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
>>> series (well, we'd need to figure out how to get the mappings of gmem
>>> back into KVM, since in this setup, short-circuiting it into
>>> userspace_addr wouldn't work, unless we banish swiotlb into a different
>>> memslot altogether somehow).
>>
>> Right.
> 
> "right" as in, "yes we could do that"? :p

"right" as in "I see how that could work" :)

[...]

> 
> I remember talking to someone at some point about whether we could reuse
> the proc-local stuff for guest memory, but I cannot remember the outcome
> of that discussion... (or maybe I just wanted to have a discussion about
> it, but forgot to follow up on that thought?).  I guess we wouldn't use
> proc-local _allocations_, but rather just set up proc-local mappings of
> the gmem allocations that have been removed from the direct map.

Yes. And likely only for memory we really access / try access, if possible.

> 
> I'm wondering, where exactly would be the differences to Sean's idea
> about messing with the CR3 register inside KVM to temporarily install
> page tables that contain all the gmem stuff, conceptually? Wouldn't we
> run into the same interrupt problems that Sean foresaw for the CR3
> stuff? (which, admittedly, I still don't quite follow what these are :(
> ).

I'd need some more details on that. If anything would rely on the direct 
mapping (from IRQ context?) than ... we obviously cannot remove the 
direct mapping :)

Manwaring, Derek Nov. 13, 2024, 3:31 a.m. UTC | #21

On 2024-11-08 at 10:36, Elena Reshetova wrote:
> On 2024-11-06 at 17:04, Derek Manwaring wrote:
> > On 2024-11-04 at 08:33+0000, Elena Reshetova wrote:
> > > This statement *is* for integrity section. We have a separate TDX guidance
> > > on side-channels (including speculative) [3] and some speculative attacks
> > > that affect confidentiality (for example spectre v1) are listed as not covered
> > > by TDX but remaining SW responsibility (as they are now).
> >
> > Thanks for the additional info, Elena. Given that clarification, I
> > definitely see direct map removal and TDX as complementary.
>
> Jus to clarify to make sure my comment is not misunderstood.
> What I meant is that we cannot generally assume that confidentiality
> leaks from CoCo guests to host/VMM via speculative channels
> are completely impossible. Spectre V1 is a prime example of such a
> possible leak. Dave also elaborated on other potential vectors earlier.
>
> The usefulness of direct map removal for CoCo guests as a concrete
> mitigation for certain types of memory attacks must be precisely
> evaluated per each attack vector, attack vector direction (host -> guest,
> guest ->host, etc) and per each countermeasure that CoCo vendors
> provide for each such case. I don't know if there is any existing study
> that examines this for major CoCo vendors. I think this is what must
> be done for this work in order to have a strong reasoning for its usefulness.

Thanks, that makes sense. I'm a little hyperfocused on guest->host which
is the opposite direction of what is generally used for the CoCo threat
model. I think what both cases care about though is guest->guest. For
me, guest->host matters because it's a route for guest->guest (at least
in the non-CoCo world). There's some good discussion on this in David's
series on attack vector controls [1].

Like you mention, beyond direction it matters which CoCo countermeasures
are at play and how far they go during transient execution. That part is
not clear to me for the host->guest direction involving the direct map,
but agree any countermeasures like direct map removal should be
evaluated based on a better understanding of what those existing
countermeasures already cover and what attack is intended to be
mitigated.

Derek

[1] https://lore.kernel.org/lkml/LV3PR12MB92658EA6CCF18F90DAAA168394532@LV3PR12MB9265.namprd12.prod.outlook.com/

Patrick Roy Nov. 15, 2024, 4:59 p.m. UTC | #22

On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
> On 12.11.24 15:40, Patrick Roy wrote:
>> I remember talking to someone at some point about whether we could reuse
>> the proc-local stuff for guest memory, but I cannot remember the outcome
>> of that discussion... (or maybe I just wanted to have a discussion about
>> it, but forgot to follow up on that thought?).  I guess we wouldn't use
>> proc-local _allocations_, but rather just set up proc-local mappings of
>> the gmem allocations that have been removed from the direct map.
> 
> Yes. And likely only for memory we really access / try access, if possible.

Well, if we start on-demand mm-local mapping the things we want to
access, we're back in TLB flush hell, no? And we can't know
ahead-of-time what needs to be mapped, so everything would need to be
mapped (unless we do something like mm-local mapping a page on first
access, and then just never unmapping it again, under the assumption
that establishing the mapping won't be expensive)

>>
>> I'm wondering, where exactly would be the differences to Sean's idea
>> about messing with the CR3 register inside KVM to temporarily install
>> page tables that contain all the gmem stuff, conceptually? Wouldn't we
>> run into the same interrupt problems that Sean foresaw for the CR3
>> stuff? (which, admittedly, I still don't quite follow what these are :(
>> ).
> 
> I'd need some more details on that. If anything would rely on the direct
> mapping (from IRQ context?) than ... we obviously cannot remove the
> direct mapping :)

I've talked to Fares internally, and it seems that generally doing
mm-local mappings of guest memory would work for us. We also figured out
what the "interrupt problem" is, namely that if we receive an interrupt
while executing in a context that has mm-local mappings available, those
mappings will continue to be available while the interrupt is being
handled. I'm talking to my security folks to see how much of a concern
this is for the speculation hardening we're trying to achieve. Will keep
you in the loop there :)

> -- 
> Cheers,
> 
> David / dhildenb
> 

Best, 
Patrick

David Hildenbrand Nov. 15, 2024, 5:10 p.m. UTC | #23

On 15.11.24 17:59, Patrick Roy wrote:
> 
> 
> On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
>> On 12.11.24 15:40, Patrick Roy wrote:
>>> I remember talking to someone at some point about whether we could reuse
>>> the proc-local stuff for guest memory, but I cannot remember the outcome
>>> of that discussion... (or maybe I just wanted to have a discussion about
>>> it, but forgot to follow up on that thought?).  I guess we wouldn't use
>>> proc-local _allocations_, but rather just set up proc-local mappings of
>>> the gmem allocations that have been removed from the direct map.
>>
>> Yes. And likely only for memory we really access / try access, if possible.
> 
> Well, if we start on-demand mm-local mapping the things we want to
> access, we're back in TLB flush hell, no?

At least the on-demand mapping shouldn't require a TLB flush? Only 
"unmapping" if we want to restrict the size of a "mapped pool" of 
restricted size.

Anyhow, this would be a pure optimization, to avoid the expense of 
mapping everything, when in practice you'd like not access most of it.

(my theory, happy to be told I'm wrong :) )

> And we can't know
> ahead-of-time what needs to be mapped, so everything would need to be
> mapped (unless we do something like mm-local mapping a page on first
> access, and then just never unmapping it again, under the assumption
> that establishing the mapping won't be expensive)

Right, the whole problem is that we don't know that upfront.

> 
>>>
>>> I'm wondering, where exactly would be the differences to Sean's idea
>>> about messing with the CR3 register inside KVM to temporarily install
>>> page tables that contain all the gmem stuff, conceptually? Wouldn't we
>>> run into the same interrupt problems that Sean foresaw for the CR3
>>> stuff? (which, admittedly, I still don't quite follow what these are :(
>>> ).
>>
>> I'd need some more details on that. If anything would rely on the direct
>> mapping (from IRQ context?) than ... we obviously cannot remove the
>> direct mapping :)
> 
> I've talked to Fares internally, and it seems that generally doing
> mm-local mappings of guest memory would work for us. We also figured out
> what the "interrupt problem" is, namely that if we receive an interrupt
> while executing in a context that has mm-local mappings available, those
> mappings will continue to be available while the interrupt is being
> handled.

Isn't that likely also the case with secretmem where we removed the 
directmap, but have an effective per-mm mapping in the (user-space 
portion) of the page table?

> I'm talking to my security folks to see how much of a concern
> this is for the speculation hardening we're trying to achieve. Will keep
> you in the loop there :)

Thanks!

Patrick Roy Nov. 15, 2024, 5:23 p.m. UTC | #24

On Fri, 2024-11-15 at 17:10 +0000, David Hildenbrand wrote:
>> [...]
>>
>> I've talked to Fares internally, and it seems that generally doing
>> mm-local mappings of guest memory would work for us. We also figured out
>> what the "interrupt problem" is, namely that if we receive an interrupt
>> while executing in a context that has mm-local mappings available, those
>> mappings will continue to be available while the interrupt is being
>> handled.
> 
> Isn't that likely also the case with secretmem where we removed the
> directmap, but have an effective per-mm mapping in the (user-space
> portion) of the page table?

Mh, that's an excellent point, I never thought of that. But with
secretmem, the memory would still be protected by SMAP (admittedly, I
have no idea how much this is worth in the face of all these speculative
issues), right?

>> I'm talking to my security folks to see how much of a concern
>> this is for the speculation hardening we're trying to achieve. Will keep
>> you in the loop there :)
> 
> Thanks!
> 
> -- 
> Cheers,
> 
> David / dhildenb

Best,
Patrick

[RFC,v3,0/6] Direct Map Removal for guest_memfd

Message

Comments