Message ID: 20240222161047.402609-1-tabba@google.com (mailing list archive)
Series: KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. I'm especially interested to see if we can factor out much of the common implementation bits between KVM and Gunyah. In principle, we're doing the same thing: the difference is the exact mechanics of interacting with the hypervisor, which (I think) could easily be extracted into an ops structure. [...] > In addition to general feedback, we would like feedback on how we > handle mmap() and faulting-in guest pages at the host (KVM: Add > restricted support for mapping guest_memfd by the host). > > We don't enforce the invariant that only memory shared with the > host can be mapped by the host userspace in > file_operations::mmap(), but in vm_operations_struct:fault(). On > vm_operations_struct::fault(), we check whether the page is > shared with the host. If not, we deliver a SIGBUS to the current > task. The reason for enforcing this at fault() is that mmap() > does not elevate the pagecount(); it's the faulting in of the > page which does. Even if we were to check at mmap() whether an > address can be mapped, we would still need to check again on > fault(), since between mmap() and fault() the status of the page > can change. > > This creates the situation where access to successfully mmap()'d > memory might SIGBUS at page fault. There is precedence for > similar behavior in the kernel I believe, with MADV_HWPOISON and > the hugetlbfs cgroups controller, which could SIGBUS at page > fault time depending on the accounting limit. I added a test: folio_mmapped() [1] which checks if there's a vma covering the corresponding offset into the guest_memfd. I use this test before trying to make a page private to the guest, and I've been able to ensure that userspace can't even mmap() private guest memory. If I try to make memory private, I check that it's not mmapped and, if it is, don't allow the memory to become private. In my testing so far, this is enough to prevent SIGBUS from happening. This test should probably be moved outside Gunyah-specific code, and I was looking for a maintainer to suggest the right home for it :) [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > Another pKVM specific aspect we would like feedback on, is how to > handle memory mapped by the host being unshared by a guest. The > approach we've taken is that on an unshare call from the guest, > the host userspace is notified that the memory has been unshared, > in order to allow it to unmap it and mark it as PRIVATE as > acknowledgment. If the host does not unmap the memory, the > unshare call issued by the guest fails, which the guest is > informed about on return. 
> > Cheers, > /fuad > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > [5] Protected KVM on arm64 (slides) > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > [6] Protected KVM on arm64 (video) > https://www.youtube.com/watch?v=9npebeVFbFw > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > https://lpc.events/event/17/contributions/1487/ > > [8] Drivers for Gunyah (patch series) > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ As of 5 minutes ago when I sent this, there's a v17: https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ Thanks, Elliot
On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > I added a test: folio_mmapped() [1] which checks if there's a vma > covering the corresponding offset into the guest_memfd. I use this > test before trying to make page private to guest and I've been able to > ensure that userspace can't even mmap() private guest memory. If I try > to make memory private, I can test that it's not mmapped and not allow > memory to become private. In my testing so far, this is enough to > prevent SIGBUS from happening. > > This test probably should be moved outside Gunyah specific code, and was > looking for maintainer to suggest the right home for it :) > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ You, um, might have wanted to send an email to linux-mm, not bury it in the middle of a series of 35 patches? So this isn't folio_mapped() because you're interested if anyone _could_ fault this page, not whether the folio is currently present in anyone's page tables. It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure it's worth the effort to use walk_page_mapping(), but I would defer to David.
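For readers following along, here is a minimal sketch of the kind of check being discussed: whether any VMA of the backing file still covers the folio, so userspace could fault it in later even though folio_mapped() is currently false. This is not the folio_mmapped() patch linked above; the function name, locking granularity and the assumption that the caller holds the folio lock are illustrative only.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical illustration: walk the file's reverse-mapping interval
 * tree (mapping->i_mmap) to see whether any VMA covers the folio's
 * range. Unlike folio_mapped(), this also catches VMAs that have not
 * faulted the page in yet. Caller is assumed to hold the folio lock.
 */
static bool gmem_folio_mmapped(struct folio *folio)
{
	struct address_space *mapping = folio->mapping;
	pgoff_t first = folio->index;
	pgoff_t last = first + folio_nr_pages(folio) - 1;
	struct vm_area_struct *vma;
	bool covered = false;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
		covered = true;
		break;
	}
	i_mmap_unlock_read(mapping);

	return covered;
}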
Hi, I have a question regarding memory shared between the host and a protected guest. I scanned the series, and the pKVM patches this series is based on, but I couldn't easily find the answer. When a page is shared, that page is not mapped in the stage 2 tables that the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't make much sense for KVM to maintain its own stage 2 that is never used, but I thought I should double check that to make sure I'm not missing something. Thanks, Alex On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > the remainder (patches 7-26) require the pKVM core series. A git > repo with this series applied can be found here [3]. We have a > (WIP) kvmtool port capable of running the code in this series > [4]. For a technical deep dive into pKVM, please refer to Quentin > Perret's KVM Forum Presentation [5, 6]. > > I've covered some of the issues presented here in my LPC 2023 > presentation [7]. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. > > By design, guest_memfd cannot be mapped, read, or written by the > host userspace. In pKVM, memory shared between a protected guest > and the host is shared in-place, unlike the other confidential > computing solutions that guest_memfd was originally envisaged for > (e.g, TDX). When initializing a guest, as well as when accessing > memory shared by the guest to the host, it would be useful to > support mapping that memory at the host to avoid copying its > contents. > > One of the benefits of guest_memfd is that it prevents a > misbehaving host process from crashing the system when attempting > to access (deliberately or accidentally) protected guest memory, > since this memory isn't mapped to begin with. Without > guest_memfd, the hypervisor would still prevent such accesses, > but in certain cases the host kernel wouldn't be able to recover, > causing the system to crash. > > Support for mmap() in this patch series maintains the invariant > that only memory shared with the host, either explicitly by the > guest or implicitly before the guest has started running (in > order to populate its memory) is allowed to be mapped. At no time > should private memory be mapped at the host. > > This patch series is divided into two parts: > > The first part is to the KVM core code (patches 1-6), and is > based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support > for mapping guest memory only as long as it is shared. For that, > the host needs to know the sharing status of guest memory. > Therefore, the series adds a new KVM memory attribute, accessible > only by the host kernel, that specifies whether the memory is > allowed to be mapped by the host userspace. > > The second part of the series (patches 7-26) adds guest_memfd > support for pKVM/arm64, and is based on the latest version of our > pKVM series [2]. It uses guest_memfd instead of the current > approach in Android (not upstreamed) of maintaining a long-term > GUP on anonymous memory donated to the guest. 
These patches > handle faulting in guest memory for a guest, as well as handling > sharing and unsharing of guest memory while maintaining the > invariant mentioned earlier. > > In addition to general feedback, we would like feedback on how we > handle mmap() and faulting-in guest pages at the host (KVM: Add > restricted support for mapping guest_memfd by the host). > > We don't enforce the invariant that only memory shared with the > host can be mapped by the host userspace in > file_operations::mmap(), but in vm_operations_struct:fault(). On > vm_operations_struct::fault(), we check whether the page is > shared with the host. If not, we deliver a SIGBUS to the current > task. The reason for enforcing this at fault() is that mmap() > does not elevate the pagecount(); it's the faulting in of the > page which does. Even if we were to check at mmap() whether an > address can be mapped, we would still need to check again on > fault(), since between mmap() and fault() the status of the page > can change. > > This creates the situation where access to successfully mmap()'d > memory might SIGBUS at page fault. There is precedence for > similar behavior in the kernel I believe, with MADV_HWPOISON and > the hugetlbfs cgroups controller, which could SIGBUS at page > fault time depending on the accounting limit. > > Another pKVM specific aspect we would like feedback on, is how to > handle memory mapped by the host being unshared by a guest. The > approach we've taken is that on an unshare call from the guest, > the host userspace is notified that the memory has been unshared, > in order to allow it to unmap it and mark it as PRIVATE as > acknowledgment. If the host does not unmap the memory, the > unshare call issued by the guest fails, which the guest is > informed about on return. 
> > Cheers, > /fuad > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > [5] Protected KVM on arm64 (slides) > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > [6] Protected KVM on arm64 (video) > https://www.youtube.com/watch?v=9npebeVFbFw > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > https://lpc.events/event/17/contributions/1487/ > > [8] Drivers for Gunyah (patch series) > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > Fuad Tabba (20): > KVM: Split KVM memory attributes into user and kernel attributes > KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock > KVM: Add restricted support for mapping guestmem by the host > KVM: Don't allow private attribute to be set if mapped by host > KVM: Don't allow private attribute to be removed for unmappable memory > KVM: Implement kvm_(read|/write)_guest_page for private memory slots > KVM: arm64: Create hypercall return handler > KVM: arm64: Refactor code around handling return from host to guest > KVM: arm64: Rename kvm_pinned_page to kvm_guest_page > KVM: arm64: Add a field to indicate whether the guest page was pinned > KVM: arm64: Do not allow changes to private memory slots > KVM: arm64: Skip VMA checks for slots without userspace address > KVM: arm64: Handle guest_memfd()-backed guest page faults > KVM: arm64: Track sharing of memory from protected guest to host > KVM: arm64: Mark a protected VM's memory as unmappable at > initialization > KVM: arm64: Handle unshare on way back to guest entry rather than exit > KVM: arm64: Check that host unmaps memory unshared by guest > KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() > KVM: arm64: Enable private memory support when pKVM is enabled > KVM: arm64: Enable private memory kconfig for arm64 > > Keir Fraser (3): > KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall > KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall > KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall > > Quentin Perret (1): > KVM: arm64: Turn llist of pinned pages into an rb-tree > > Will Deacon (2): > KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL > KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications > > arch/arm64/include/asm/kvm_host.h | 17 +- > arch/arm64/include/asm/kvm_pkvm.h | 1 + > arch/arm64/kvm/Kconfig | 2 + > arch/arm64/kvm/arm.c | 32 ++- > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 +- > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 67 +++++ > arch/arm64/kvm/hyp/nvhe/pkvm.c | 89 +++++- > arch/arm64/kvm/hyp/nvhe/switch.c | 1 + > arch/arm64/kvm/hypercalls.c | 117 +++++++- > arch/arm64/kvm/mmu.c | 138 +++++++++- > arch/arm64/kvm/pkvm.c | 83 +++++- > include/linux/arm-smccc.h | 7 + > include/linux/kvm_host.h | 34 +++ > include/uapi/linux/kvm.h | 4 + > virt/kvm/Kconfig | 4 + > virt/kvm/guest_memfd.c | 89 +++++- > virt/kvm/kvm_main.c | 260 ++++++++++++++++-- > 19 files changed, 904 insertions(+), 68 deletions(-) > > -- > 2.44.0.rc1.240.g4c46232300-goog > >
Hi Elliot, On Thu, Feb 22, 2024 at 11:44 PM Elliot Berman <quic_eberman@quicinc.com> wrote: > > On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > I'm especially interested to see if we can factor out much of the common > implementation bits between KVM and Gunyah. In principle, we're doing > same thing: the difference is the exact mechanics to interact with the > hypervisor which (I think) could be easily extracted into an ops > structure. I agree. We should share and reuse as much code as possible. I'll sync with you before the V2 of this series. > [...] > > > In addition to general feedback, we would like feedback on how we > > handle mmap() and faulting-in guest pages at the host (KVM: Add > > restricted support for mapping guest_memfd by the host). > > > > We don't enforce the invariant that only memory shared with the > > host can be mapped by the host userspace in > > file_operations::mmap(), but in vm_operations_struct:fault(). On > > vm_operations_struct::fault(), we check whether the page is > > shared with the host. If not, we deliver a SIGBUS to the current > > task. The reason for enforcing this at fault() is that mmap() > > does not elevate the pagecount(); it's the faulting in of the > > page which does. Even if we were to check at mmap() whether an > > address can be mapped, we would still need to check again on > > fault(), since between mmap() and fault() the status of the page > > can change. > > > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > I added a test: folio_mmapped() [1] which checks if there's a vma > covering the corresponding offset into the guest_memfd. I use this > test before trying to make page private to guest and I've been able to > ensure that userspace can't even mmap() private guest memory. If I try > to make memory private, I can test that it's not mmapped and not allow > memory to become private. In my testing so far, this is enough to > prevent SIGBUS from happening. > > This test probably should be moved outside Gunyah specific code, and was > looking for maintainer to suggest the right home for it :) > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ Let's see what the mm-folks think about this [*]. Thanks! /fuad [*] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/ > > > > Another pKVM specific aspect we would like feedback on, is how to > > handle memory mapped by the host being unshared by a guest. The > > approach we've taken is that on an unshare call from the guest, > > the host userspace is notified that the memory has been unshared, > > in order to allow it to unmap it and mark it as PRIVATE as > > acknowledgment. If the host does not unmap the memory, the > > unshare call issued by the guest fails, which the guest is > > informed about on return. 
> > > > Cheers, > > /fuad > > > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > > > [5] Protected KVM on arm64 (slides) > > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > > > [6] Protected KVM on arm64 (video) > > https://www.youtube.com/watch?v=9npebeVFbFw > > > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > > https://lpc.events/event/17/contributions/1487/ > > > > [8] Drivers for Gunyah (patch series) > > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > As of 5 min ago when I send this, there's a v17: > https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ > > Thanks, > Elliot >
Hi Alex, On Fri, Feb 23, 2024 at 12:01 PM Alexandru Elisei <alexandru.elisei@arm.com> wrote: > > Hi, > > I have a question regarding memory shared between the host and a protected > guest. I scanned the series, and the pKVM patches this series is based on, > but I couldn't easily find the answer. > > When a page is shared, that page is not mapped in the stage 2 tables that > the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't > make much sense for KVM to maintain its own stage 2 that is never used, but > I thought I should double check that to make sure I'm not missing > something. You're right. In protected mode the stage-2 tables are maintained by the hypervisor in EL2, since we don't trust the host kernel. It is still KVM of course, but not the regular VM structure, like you said. Cheers, /fuad > > Thanks, > Alex > > On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > > the remainder (patches 7-26) require the pKVM core series. A git > > repo with this series applied can be found here [3]. We have a > > (WIP) kvmtool port capable of running the code in this series > > [4]. For a technical deep dive into pKVM, please refer to Quentin > > Perret's KVM Forum Presentation [5, 6]. > > > > I've covered some of the issues presented here in my LPC 2023 > > presentation [7]. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > > > By design, guest_memfd cannot be mapped, read, or written by the > > host userspace. In pKVM, memory shared between a protected guest > > and the host is shared in-place, unlike the other confidential > > computing solutions that guest_memfd was originally envisaged for > > (e.g, TDX). When initializing a guest, as well as when accessing > > memory shared by the guest to the host, it would be useful to > > support mapping that memory at the host to avoid copying its > > contents. > > > > One of the benefits of guest_memfd is that it prevents a > > misbehaving host process from crashing the system when attempting > > to access (deliberately or accidentally) protected guest memory, > > since this memory isn't mapped to begin with. Without > > guest_memfd, the hypervisor would still prevent such accesses, > > but in certain cases the host kernel wouldn't be able to recover, > > causing the system to crash. > > > > Support for mmap() in this patch series maintains the invariant > > that only memory shared with the host, either explicitly by the > > guest or implicitly before the guest has started running (in > > order to populate its memory) is allowed to be mapped. At no time > > should private memory be mapped at the host. > > > > This patch series is divided into two parts: > > > > The first part is to the KVM core code (patches 1-6), and is > > based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support > > for mapping guest memory only as long as it is shared. For that, > > the host needs to know the sharing status of guest memory. 
> > Therefore, the series adds a new KVM memory attribute, accessible > > only by the host kernel, that specifies whether the memory is > > allowed to be mapped by the host userspace. > > > > The second part of the series (patches 7-26) adds guest_memfd > > support for pKVM/arm64, and is based on the latest version of our > > pKVM series [2]. It uses guest_memfd instead of the current > > approach in Android (not upstreamed) of maintaining a long-term > > GUP on anonymous memory donated to the guest. These patches > > handle faulting in guest memory for a guest, as well as handling > > sharing and unsharing of guest memory while maintaining the > > invariant mentioned earlier. > > > > In addition to general feedback, we would like feedback on how we > > handle mmap() and faulting-in guest pages at the host (KVM: Add > > restricted support for mapping guest_memfd by the host). > > > > We don't enforce the invariant that only memory shared with the > > host can be mapped by the host userspace in > > file_operations::mmap(), but in vm_operations_struct:fault(). On > > vm_operations_struct::fault(), we check whether the page is > > shared with the host. If not, we deliver a SIGBUS to the current > > task. The reason for enforcing this at fault() is that mmap() > > does not elevate the pagecount(); it's the faulting in of the > > page which does. Even if we were to check at mmap() whether an > > address can be mapped, we would still need to check again on > > fault(), since between mmap() and fault() the status of the page > > can change. > > > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > > > Another pKVM specific aspect we would like feedback on, is how to > > handle memory mapped by the host being unshared by a guest. The > > approach we've taken is that on an unshare call from the guest, > > the host userspace is notified that the memory has been unshared, > > in order to allow it to unmap it and mark it as PRIVATE as > > acknowledgment. If the host does not unmap the memory, the > > unshare call issued by the guest fails, which the guest is > > informed about on return. 
> > > > Cheers, > > /fuad > > > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > > > [5] Protected KVM on arm64 (slides) > > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > > > [6] Protected KVM on arm64 (video) > > https://www.youtube.com/watch?v=9npebeVFbFw > > > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > > https://lpc.events/event/17/contributions/1487/ > > > > [8] Drivers for Gunyah (patch series) > > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > > > Fuad Tabba (20): > > KVM: Split KVM memory attributes into user and kernel attributes > > KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock > > KVM: Add restricted support for mapping guestmem by the host > > KVM: Don't allow private attribute to be set if mapped by host > > KVM: Don't allow private attribute to be removed for unmappable memory > > KVM: Implement kvm_(read|/write)_guest_page for private memory slots > > KVM: arm64: Create hypercall return handler > > KVM: arm64: Refactor code around handling return from host to guest > > KVM: arm64: Rename kvm_pinned_page to kvm_guest_page > > KVM: arm64: Add a field to indicate whether the guest page was pinned > > KVM: arm64: Do not allow changes to private memory slots > > KVM: arm64: Skip VMA checks for slots without userspace address > > KVM: arm64: Handle guest_memfd()-backed guest page faults > > KVM: arm64: Track sharing of memory from protected guest to host > > KVM: arm64: Mark a protected VM's memory as unmappable at > > initialization > > KVM: arm64: Handle unshare on way back to guest entry rather than exit > > KVM: arm64: Check that host unmaps memory unshared by guest > > KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() > > KVM: arm64: Enable private memory support when pKVM is enabled > > KVM: arm64: Enable private memory kconfig for arm64 > > > > Keir Fraser (3): > > KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall > > KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall > > KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall > > > > Quentin Perret (1): > > KVM: arm64: Turn llist of pinned pages into an rb-tree > > > > Will Deacon (2): > > KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL > > KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications > > > > arch/arm64/include/asm/kvm_host.h | 17 +- > > arch/arm64/include/asm/kvm_pkvm.h | 1 + > > arch/arm64/kvm/Kconfig | 2 + > > arch/arm64/kvm/arm.c | 32 ++- > > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + > > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + > > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 +- > > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 67 +++++ > > arch/arm64/kvm/hyp/nvhe/pkvm.c | 89 +++++- > > arch/arm64/kvm/hyp/nvhe/switch.c | 1 + > > arch/arm64/kvm/hypercalls.c | 117 +++++++- > > arch/arm64/kvm/mmu.c | 138 +++++++++- > > arch/arm64/kvm/pkvm.c | 83 +++++- > > include/linux/arm-smccc.h | 7 + > > include/linux/kvm_host.h | 34 +++ > > include/uapi/linux/kvm.h | 4 + > > virt/kvm/Kconfig | 4 + > > virt/kvm/guest_memfd.c | 89 +++++- > > virt/kvm/kvm_main.c | 260 ++++++++++++++++-- > > 19 files 
changed, 904 insertions(+), 68 deletions(-) > > > > -- > > 2.44.0.rc1.240.g4c46232300-goog > > > >
On 23.02.24 01:35, Matthew Wilcox wrote: > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: >>> This creates the situation where access to successfully mmap()'d >>> memory might SIGBUS at page fault. There is precedence for >>> similar behavior in the kernel I believe, with MADV_HWPOISON and >>> the hugetlbfs cgroups controller, which could SIGBUS at page >>> fault time depending on the accounting limit. >> >> I added a test: folio_mmapped() [1] which checks if there's a vma >> covering the corresponding offset into the guest_memfd. I use this >> test before trying to make page private to guest and I've been able to >> ensure that userspace can't even mmap() private guest memory. If I try >> to make memory private, I can test that it's not mmapped and not allow >> memory to become private. In my testing so far, this is enough to >> prevent SIGBUS from happening. >> >> This test probably should be moved outside Gunyah specific code, and was >> looking for maintainer to suggest the right home for it :) >> >> [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > You, um, might have wanted to send an email to linux-mm, not bury it in > the middle of a series of 35 patches? > > So this isn't folio_mapped() because you're interested if anyone _could_ > fault this page, not whether the folio is currently present in anyone's > page tables. > > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure > it's worth the effort to use walk_page_mapping(), but I would defer to > David. First, I suspect we are not only concerned about current+future VMAs covering the page, we are also interested in any page pins that could have been derived from such a VMA? Imagine user space mmap'ed the file, faulted in a page, took a pin on the page using pin_user_pages() and friends, and then munmap()'ed the VMA. You likely want to catch that as well and not allow a conversion to private? [I assume you want to convert the page to private only if you hold all the folio references -- i.e., if the refcount of a small folio is 1] Now, regarding the original question (disallow mapping the page), I see the following approaches: 1) SIGBUS during page fault. There are other cases that can trigger SIGBUS during page faults: hugetlb when we are out of free hugetlb pages, userfaultfd with UFFD_FEATURE_SIGBUS. -> Simple and should get the job done. 2) folio_mmapped() + preventing new mmaps covering that folio -> More complicated, requires an rmap walk on every conversion. 3) Disallow any mmaps of the file while any page is private -> Likely not what you want. Why was 1) abandoned? It looks a lot easier and harder to mess up. Why are you trying to avoid page faults? What's the use case?
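To make option 1) concrete, here is a rough sketch of a fault handler that enforces the invariant at fault time and fails with SIGBUS when the page is private. This is not the code from either series; gmem_get_shared_folio() is a hypothetical helper standing in for "look up the folio and check the sharing attribute under the appropriate lock".

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of option 1): mmap() always succeeds, but the fault handler
 * refuses to map in pages that are currently private to the guest.
 * gmem_get_shared_folio() is assumed to return the locked folio for the
 * offset, or an ERR_PTR() if the page is not shared with the host.
 */
static vm_fault_t gmem_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct folio *folio;

	folio = gmem_get_shared_folio(inode, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;	/* mmap() succeeded, the access does not */

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return VM_FAULT_LOCKED;		/* folio is handed back locked */
}

static const struct vm_operations_struct gmem_vm_ops = {
	.fault = gmem_fault,
};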
On 22.02.24 17:10, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > the remainder (patches 7-26) require the pKVM core series. A git > repo with this series applied can be found here [3]. We have a > (WIP) kvmtool port capable of running the code in this series > [4]. For a technical deep dive into pKVM, please refer to Quentin > Perret's KVM Forum Presentation [5, 6]. > > I've covered some of the issues presented here in my LPC 2023 > presentation [7]. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. > > By design, guest_memfd cannot be mapped, read, or written by the > host userspace. In pKVM, memory shared between a protected guest > and the host is shared in-place, unlike the other confidential > computing solutions that guest_memfd was originally envisaged for > (e.g, TDX). Can you elaborate (or point to a summary) why pKVM has to be special here? Why can't you use guest_memfd only for private memory and another (ordinary) memfd for shared memory, like the other confidential computing technologies are planning to? What's the main reason for that decision and can it be avoided? (s390x also shares in-place, but doesn't need any special-casing like guest_memfd provides)
On Mon, Feb 26, 2024 at 10:28:11AM +0100, David Hildenbrand wrote: > On 23.02.24 01:35, Matthew Wilcox wrote: > > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: > > > > This creates the situation where access to successfully mmap()'d > > > > memory might SIGBUS at page fault. There is precedence for > > > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > > > the hugetlbfs cgroups controller, which could SIGBUS at page > > > > fault time depending on the accounting limit. > > > > > > I added a test: folio_mmapped() [1] which checks if there's a vma > > > covering the corresponding offset into the guest_memfd. I use this > > > test before trying to make page private to guest and I've been able to > > > ensure that userspace can't even mmap() private guest memory. If I try > > > to make memory private, I can test that it's not mmapped and not allow > > > memory to become private. In my testing so far, this is enough to > > > prevent SIGBUS from happening. > > > > > > This test probably should be moved outside Gunyah specific code, and was > > > looking for maintainer to suggest the right home for it :) > > > > > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > > > You, um, might have wanted to send an email to linux-mm, not bury it in > > the middle of a series of 35 patches? > > > > So this isn't folio_mapped() because you're interested if anyone _could_ > > fault this page, not whether the folio is currently present in anyone's > > page tables. > > > > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure > > it's worth the effort to use walk_page_mapping(), but I would defer to > > David. > > First, I suspect we are not only concerned about current+future VMAs > covering the page, we are also interested in any page pins that could have > been derived from such a VMA? > > Imagine user space mmap'ed the file, faulted in page, took a pin on the page > using pin_user_pages() and friends, and then munmap()'ed the VMA. > > You likely want to catch that as well and not allow a conversion to private? > > [I assume you want to convert the page to private only if you hold all the > folio references -- i.e., if the refcount of a small folio is 1] > Ah, this was something I hadn't thought about. I think both Fuad and I need to update our series to check the refcount rather than mapcount (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > Now, regarding the original question (disallow mapping the page), I see the > following approaches: > > 1) SIGBUS during page fault. There are other cases that can trigger > SIGBUS during page faults: hugetlb when we are out of free hugetlb > pages, userfaultfd with UFFD_FEATURE_SIGBUS. > > -> Simple and should get the job done. > > 2) folio_mmapped() + preventing new mmaps covering that folio > > -> More complicated, requires an rmap walk on every conversion. > > 3) Disallow any mmaps of the file while any page is private > > -> Likely not what you want. > > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are > you trying to avoid page faults? What's the use case? > We were chatting whether we could do better than the SIGBUS approach. SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to return errors early. One difference between hugetlb and this usecase is that running out of free hugetlb pages isn't something we could detect at mmap time. 
at mmap time. In the guest_memfd use case, we should be able to detect when SIGBUS becomes possible due to memory being lent to the guest. I can't think of a reason why userspace would want/be able to resume operation after trying to access a page that it shouldn't be allowed to access, so SIGBUS is functional. The advantage of trying to avoid SIGBUS was better/easier reporting to userspace. - Elliot
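As an aside on the "SIGBUS/FAULT usually crashes userspace" point: a VMM that wants to report the failure rather than die can catch the signal. A minimal userspace sketch follows; the mmap() of the guest_memfd and the offending access are elided, and async-signal-safety is ignored for brevity.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Illustration only: install a SIGBUS handler so that a failed access
 * to a (now) private guest page can be reported instead of crashing
 * the process with the default action.
 */
static void bus_handler(int sig, siginfo_t *info, void *ctx)
{
	fprintf(stderr, "SIGBUS at %p (si_code=%d)\n", info->si_addr, info->si_code);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = bus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... mmap() the guest_memfd and touch a page that may have become private ... */
	return 0;
}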
Hi David, On Mon, Feb 26, 2024 at 9:47 AM David Hildenbrand <david@redhat.com> wrote: > > On 22.02.24 17:10, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > > the remainder (patches 7-26) require the pKVM core series. A git > > repo with this series applied can be found here [3]. We have a > > (WIP) kvmtool port capable of running the code in this series > > [4]. For a technical deep dive into pKVM, please refer to Quentin > > Perret's KVM Forum Presentation [5, 6]. > > > > I've covered some of the issues presented here in my LPC 2023 > > presentation [7]. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > > > By design, guest_memfd cannot be mapped, read, or written by the > > host userspace. In pKVM, memory shared between a protected guest > > and the host is shared in-place, unlike the other confidential > > computing solutions that guest_memfd was originally envisaged for > > (e.g, TDX). > > Can you elaborate (or point to a summary) why pKVM has to be special > here? Why can't you use guest_memfd only for private memory and another > (ordinary) memfd for shared memory, like the other confidential > computing technologies are planning to? Because the same memory location can switch back and forth between being shared and private in-place. The host/vmm doesn't know beforehand which parts of the guest's private memory might be shared with it later, therefore, it cannot use guest_memfd() for the private memory and anonymous memory for the shared memory without resorting to copying. Even if it did know beforehand, it wouldn't help much since that memory can change back to being private later on. Other confidential computing proposals like TDX and Arm CCA don't share in place, and need to copy shared data between private and shared memory. If you're interested, there was also a more detailed discussion about this in an earlier guest_memfd() thread [1] > What's the main reason for that decision and can it be avoided? > (s390x also shares in-place, but doesn't need any special-casing like > guest_memfd provides) In our current implementation of pKVM, we use anonymous memory with a long-term gup, and the host ends up with valid mappings. This isn't just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, accessing guest private memory can be fatal to the host and the system as a whole since it could result in a machine check. In arm64 it's not necessarily fatal to the system as a whole if a userspace process were to attempt the access. However, a userspace process could trick the host kernel to try to access the protected guest memory, e.g., by having a process A strace a malicious process B which passes protected guest memory as argument to a syscall. What makes pKVM and Gunyah special is that both can easily share memory (and its contents) in place, since it's not encrypted, and convert memory locations between shared/unshared. I'm not familiar with how s390x handles sharing in place, or how it handles memory donated to the guest. I assume it's by donating anonymous memory. 
I would also be interested to know how it handles and recovers from similar situations, i.e., host (userspace or kernel) trying to access guest protected memory. Thank you, /fuad [1] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/ [2] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ [3] https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ > > -- > Cheers, > > David / dhildenb >
Hi, >> Can you elaborate (or point to a summary) why pKVM has to be special >> here? Why can't you use guest_memfd only for private memory and another >> (ordinary) memfd for shared memory, like the other confidential >> computing technologies are planning to? > > Because the same memory location can switch back and forth between > being shared and private in-place. The host/vmm doesn't know > beforehand which parts of the guest's private memory might be shared > with it later, therefore, it cannot use guest_memfd() for the private > memory and anonymous memory for the shared memory without resorting to I don't remember the latest details about the guest_memfd incarnation in user space, but I though we'd be using guest_memfd for private memory and an ordinary memfd for shared memory. But maybe it also works with anon memory instead of the memfd and that was just an implementation detail :) > copying. Even if it did know beforehand, it wouldn't help much since > that memory can change back to being private later on. Other > confidential computing proposals like TDX and Arm CCA don't share in > place, and need to copy shared data between private and shared memory. Right. > > If you're interested, there was also a more detailed discussion about > this in an earlier guest_memfd() thread [1] Thanks for the pointer! > >> What's the main reason for that decision and can it be avoided? >> (s390x also shares in-place, but doesn't need any special-casing like >> guest_memfd provides) > > In our current implementation of pKVM, we use anonymous memory with a > long-term gup, and the host ends up with valid mappings. This isn't > just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, > accessing guest private memory can be fatal to the host and the system > as a whole since it could result in a machine check. In arm64 it's not > necessarily fatal to the system as a whole if a userspace process were > to attempt the access. However, a userspace process could trick the > host kernel to try to access the protected guest memory, e.g., by > having a process A strace a malicious process B which passes protected > guest memory as argument to a syscall. Right. > > What makes pKVM and Gunyah special is that both can easily share > memory (and its contents) in place, since it's not encrypted, and > convert memory locations between shared/unshared. I'm not familiar > with how s390x handles sharing in place, or how it handles memory > donated to the guest. I assume it's by donating anonymous memory. I > would be also interested to know how it handles and recovers from > similar situations, i.e., host (userspace or kernel) trying to access > guest protected memory. I don't know all of the s390x "protected VM" details, but it is pretty similar. Take a look at arch/s390/kernel/uv.c if you are interested. There are "ultravisor" calls that can convert a page * from secure (inaccessible by the host) to non-secure (encrypted but accessible by the host) * from non-secure to secure Once the host tries to access a "secure" page -- either from the kernel or from user space, the host gets a page fault and calls arch_make_page_accessible(). That will encrypt page content such that the host can access it (migrate/swapout/whatsoever). The host has to set aside some memory area for the ultravisor to "remember" page state. So you can swapout/migrate these pages, but the host will only read encrypted garbage. In contrast to disallowing access to these pages. 
So you don't need any guest_memfd games to protect from that -- and one doesn't have to travel back in time to have memory that isn't swappable/migratable and only comes in one page size. [I'm not up-to-date which obscure corner-cases CCA requirement the s390x implementation cannot fulfill -- like replacing pages in page tables and such; I suspect pKVM also cannot cover all these corner-cases] Extending guest_memfd (the one that was promised initially to not be mmappable) to be mmappable just to avoid some crashes in corner cases is the right approach. But I'm pretty sure that has all been discussed before, that's why I am asking about some details :)
I missed here "I'm wondering whether ..." > Extending guest_memfd (the one that was promised initially to not be > mmappable) to be mmappable just to avoid some crashes in corner cases is > the right approach. But I'm pretty sure that has all been discussed > before, that's why I am asking about some details :) >
> > Ah, this was something I hadn't thought about. I think both Fuad and I > need to update our series to check the refcount rather than mapcount > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But checking for any unexpected references might be better (there are still some GUP users that don't use FOLL_PIN). At least concurrent migration/swapout (that temporarily unmaps a folio and can give you folio_mapped() "false negatives", which both take a temporary folio reference and hold the page lock) should not be a concern because guest_memfd doesn't support that yet. > >> >> Now, regarding the original question (disallow mapping the page), I see the >> following approaches: >> >> 1) SIGBUS during page fault. There are other cases that can trigger >> SIGBUS during page faults: hugetlb when we are out of free hugetlb >> pages, userfaultfd with UFFD_FEATURE_SIGBUS. >> >> -> Simple and should get the job done. >> >> 2) folio_mmapped() + preventing new mmaps covering that folio >> >> -> More complicated, requires an rmap walk on every conversion. >> >> 3) Disallow any mmaps of the file while any page is private >> >> -> Likely not what you want. >> >> >> Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are >> you trying to avoid page faults? What's the use case? >> > > We were chatting whether we could do better than the SIGBUS approach. > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to > return errors early. One difference between hugetlb and this usecase is > that running out of free hugetlb pages isn't something we could detect With hugetlb reservation one can try detecting it at mmap() time. But as reservations are not NUMA aware, it's not reliable. > at mmap time. In guest_memfd usecase, we should be able to detect when > SIGBUS becomes possible due to memory being lent to guest. > > I can't think of a reason why userspace would want/be able to resume > operation after trying to access a page that it shouldn't be allowed, so > SIGBUS is functional. The advantage of trying to avoid SIGBUS was > better/easier reporting to userspace. To me, it sounds conceptually easier and less error-prone to 1) Converting a page to private only if there are no unexpected references (no mappings, GUP pins, ...) 2) Disallowing mapping private pages and failing the page fault. 3) Handling that small race window only (page lock?) Instead of 1) Converting a page to private only if there are no unexpected references (no mappings, GUP pins, ...) and no VMAs covering it where we could fault it in later 2) Disallowing mmap when the range would contain any private page 3) Handling races between mmap and page conversion
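A hedged sketch of the first step in the sequence David prefers (convert only when there are no unexpected references). The expected-refcount arithmetic assumes the page cache holds one reference per page and that the caller holds the folio lock plus one reference of its own; neither series necessarily does it exactly this way.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: refuse a shared->private conversion unless the
 * folio is unmapped, not DMA-pinned, and has no references beyond the
 * page cache's (assumed one per page) and the caller's own.
 */
static bool gmem_safe_to_make_private(struct folio *folio)
{
	long expected = folio_nr_pages(folio) + 1;

	if (folio_mapped(folio) || folio_maybe_dma_pinned(folio))
		return false;

	return folio_ref_count(folio) == expected;
}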
Hi David, On Tue, Feb 27, 2024 at 2:41 PM David Hildenbrand <david@redhat.com> wrote: > > Hi, > > >> Can you elaborate (or point to a summary) why pKVM has to be special > >> here? Why can't you use guest_memfd only for private memory and another > >> (ordinary) memfd for shared memory, like the other confidential > >> computing technologies are planning to? > > > > Because the same memory location can switch back and forth between > > being shared and private in-place. The host/vmm doesn't know > > beforehand which parts of the guest's private memory might be shared > > with it later, therefore, it cannot use guest_memfd() for the private > > memory and anonymous memory for the shared memory without resorting to > > I don't remember the latest details about the guest_memfd incarnation in > user space, but I though we'd be using guest_memfd for private memory > and an ordinary memfd for shared memory. But maybe it also works with > anon memory instead of the memfd and that was just an implementation > detail :) > > > copying. Even if it did know beforehand, it wouldn't help much since > > that memory can change back to being private later on. Other > > confidential computing proposals like TDX and Arm CCA don't share in > > place, and need to copy shared data between private and shared memory. > > Right. > > > > > If you're interested, there was also a more detailed discussion about > > this in an earlier guest_memfd() thread [1] > > Thanks for the pointer! > > > > >> What's the main reason for that decision and can it be avoided? > >> (s390x also shares in-place, but doesn't need any special-casing like > >> guest_memfd provides) > > > > In our current implementation of pKVM, we use anonymous memory with a > > long-term gup, and the host ends up with valid mappings. This isn't > > just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, > > accessing guest private memory can be fatal to the host and the system > > as a whole since it could result in a machine check. In arm64 it's not > > necessarily fatal to the system as a whole if a userspace process were > > to attempt the access. However, a userspace process could trick the > > host kernel to try to access the protected guest memory, e.g., by > > having a process A strace a malicious process B which passes protected > > guest memory as argument to a syscall. > > Right. > > > > > What makes pKVM and Gunyah special is that both can easily share > > memory (and its contents) in place, since it's not encrypted, and > > convert memory locations between shared/unshared. I'm not familiar > > with how s390x handles sharing in place, or how it handles memory > > donated to the guest. I assume it's by donating anonymous memory. I > > would be also interested to know how it handles and recovers from > > similar situations, i.e., host (userspace or kernel) trying to access > > guest protected memory. > > I don't know all of the s390x "protected VM" details, but it is pretty > similar. Take a look at arch/s390/kernel/uv.c if you are interested. > > There are "ultravisor" calls that can convert a page > * from secure (inaccessible by the host) to non-secure (encrypted but > accessible by the host) > * from non-secure to secure > > Once the host tries to access a "secure" page -- either from the kernel > or from user space, the host gets a page fault and calls > arch_make_page_accessible(). That will encrypt page content such that > the host can access it (migrate/swapout/whatsoever). 
> > The host has to set aside some memory area for the ultravisor to > "remember" page state. > > So you can swapout/migrate these pages, but the host will only read > encrypted garbage. In contrast to disallowing access to these pages. > > So you don't need any guest_memfd games to protect from that -- and one > doesn't have to travel back in time to have memory that isn't > swappable/migratable and only comes in one page size. > > [I'm not up-to-date which obscure corner-cases CCA requirement the s390x > implementation cannot fulfill -- like replacing pages in page tables and > such; I suspect pKVM also cannot cover all these corner-cases] Thanks for this. I'll do some more reading on how things work with s390x. Right, and of course, one key difference is that pKVM doesn't encrypt anything, and only relies on stage-2 protection to protect the guest. > > Extending guest_memfd (the one that was promised initially to not be > mmappable) to be mmappable just to avoid some crashes in corner cases is > the right approach. But I'm pretty sure that has all been discussed > before, that's why I am asking about some details :) Thank you very much for your reviews and comments. They've already been very helpful. I noticed the gmap.h in the s390 source, which might also be something that we could learn from. So please do ask for as much detail as you like. Cheers, /fuad > -- > Cheers, > > David / dhildenb >
>> So you don't need any guest_memfd games to protect from that -- and one >> doesn't have to travel back in time to have memory that isn't >> swappable/migratable and only comes in one page size. >> >> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x >> implementation cannot fulfill -- like replacing pages in page tables and >> such; I suspect pKVM also cannot cover all these corner-cases] > > Thanks for this. I'll do some more reading on how things work with s390x. > > Right, and of course, one key difference of course is that pKVM > doesn't encrypt anything, and only relies on stage-2 protection to > protect the guest. I don't remember what exactly s390x does, but I recall that it might only encrypt the memory content as it transitions a page from secure to non-secure. Something like that could also be implemented using pKVM (unless I am missing something), but it might not be that trivial, of course :)
On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: > > > > Ah, this was something I hadn't thought about. I think both Fuad and I > > need to update our series to check the refcount rather than mapcount > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But > checking for any unexpected references might be better (there are still some > GUP users that don't use FOLL_PIN). As a non-mm person I'm not sure I understand the consequences of holding a GUP pin to a page that is not covered by any VMA. The absence of VMAs implies that userspace cannot access the page, right? Presumably the kernel can't be coerced into accessing that page either? Is that correct? > At least concurrent migration/swapout (that temporarily unmaps a folio and > can give you folio_mapped() "false negatives", which both take a temporary > folio reference and hold the page lock) should not be a concern because > guest_memfd doesn't support that yet. > > > > > > > > > Now, regarding the original question (disallow mapping the page), I see the > > > following approaches: > > > > > > 1) SIGBUS during page fault. There are other cases that can trigger > > > SIGBUS during page faults: hugetlb when we are out of free hugetlb > > > pages, userfaultfd with UFFD_FEATURE_SIGBUS. > > > > > > -> Simple and should get the job done. > > > > > > 2) folio_mmapped() + preventing new mmaps covering that folio > > > > > > -> More complicated, requires an rmap walk on every conversion. > > > > > > 3) Disallow any mmaps of the file while any page is private > > > > > > -> Likely not what you want. > > > > > > > > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are > > > you trying to avoid page faults? What's the use case? > > > > > > > We were chatting whether we could do better than the SIGBUS approach. > > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to > > return errors early. One difference between hugetlb and this usecase is > > that running out of free hugetlb pages isn't something we could detect > > With hugetlb reservation one can try detecting it at mmap() time. But as > reservations are not NUMA aware, it's not reliable. > > > at mmap time. In guest_memfd usecase, we should be able to detect when > > SIGBUS becomes possible due to memory being lent to guest. > > > > I can't think of a reason why userspace would want/be able to resume > > operation after trying to access a page that it shouldn't be allowed, so > > SIGBUS is functional. The advantage of trying to avoid SIGBUS was > > better/easier reporting to userspace. > > To me, it sounds conceptually easier and less error-prone to > > 1) Converting a page to private only if there are no unexpected > references (no mappings, GUP pins, ...) > 2) Disallowing mapping private pages and failing the page fault. > 3) Handling that small race window only (page lock?) > > Instead of > > 1) Converting a page to private only if there are no unexpected > references (no mappings, GUP pins, ...) and no VMAs covering it where > we could fault it in later > 2) Disallowing mmap when the range would contain any private page > 3) Handling races between mmap and page conversion The one thing that makes the second option cleaner from a userspace perspective (IMO) is that the conversion to private is happening lazily during guest faults. 
So whether or not an mmapped page can indeed be accessed from userspace will be entirely nondeterministic as it depends on the guest faulting pattern which userspace is entirely unaware of. Elliot's suggestion would prevent spurious crashes caused by that somewhat odd behaviour, though arguably sane userspace software shouldn't be doing that to start with. To add a layer of paint to the shed, the usage of SIGBUS for something that is really a permission access problem doesn't feel appropriate. Allocating memory via guestmem and donating that to a protected guest is a way for userspace to voluntarily relinquish access permissions to the memory it allocated. So a userspace process violating that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the point that signal would be sent, the page would have been accounted against that userspace process, so not sure the paging examples that were discussed earlier are exactly comparable. To illustrate that differently, given that pKVM and Gunyah use MMU-based protection, there is nothing architecturally that prevents a guest from sharing a page back with Linux as RO. Note that we don't currently support this, so I don't want to conflate this use case, but that hopefully makes it a little more obvious that this is a "there is a page, but you don't currently have the permission to access it" problem rather than "sorry but we ran out of pages" problem. Thanks, Quentin
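A minimal sketch of the fault-time check discussed above (fail the fault on a private page), assuming guest_memfd gains an mmap() path as proposed in this series. kvm_gmem_get_folio() is modeled on the existing guest_memfd code; kvm_gmem_is_shared() and the exact fault return value are placeholders for whatever the series settles on:

static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct folio *folio;

	/* Grab (and lock) the folio backing this offset of the guest_memfd. */
	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
	if (IS_ERR_OR_NULL(folio))
		return VM_FAULT_SIGBUS;

	/* Placeholder shared/private tracking: only shared pages may be mapped. */
	if (!kvm_gmem_is_shared(inode, vmf->pgoff)) {
		folio_unlock(folio);
		folio_put(folio);
		/* SIGBUS here; VM_FAULT_SIGSEGV if SEGV is deemed more appropriate. */
		return VM_FAULT_SIGBUS;
	}

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return VM_FAULT_LOCKED;
}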
On 28.02.24 11:48, Quentin Perret wrote: > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: >>> >>> Ah, this was something I hadn't thought about. I think both Fuad and I >>> need to update our series to check the refcount rather than mapcount >>> (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). >> >> An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But >> checking for any unexpected references might be better (there are still some >> GUP users that don't use FOLL_PIN). > > As a non-mm person I'm not sure to understand to consequences of holding > a GUP pin to a page that is not covered by any VMA. The absence of VMAs > imply that userspace cannot access the page right? Presumably the kernel > can't be coerced into accessing that page either? Is that correct? Simple example: register the page using an iouring fixed buffer, then unmap the VMA. iouring now has the page pinned and can read/write it using an address in the kernel vitual address space (direct map). Then, you can happily make the kernel read/write that page using iouring, even though no VMA still covers/maps that page. [...] >> Instead of >> >> 1) Converting a page to private only if there are no unexpected >> references (no mappings, GUP pins, ...) and no VMAs covering it where >> we could fault it in later >> 2) Disallowing mmap when the range would contain any private page >> 3) Handling races between mmap and page conversion > > The one thing that makes the second option cleaner from a userspace > perspective (IMO) is that the conversion to private is happening lazily > during guest faults. So whether or not an mmapped page can indeed be > accessed from userspace will be entirely undeterministic as it depends > on the guest faulting pattern which userspace is entirely unaware of. > Elliot's suggestion would prevent spurious crashes caused by that > somewhat odd behaviour, though arguably sane userspace software > shouldn't be doing that to start with. The last sentence is the important one. User space should not access that memory. If it does, it gets a slap on the hand. Because it should not access that memory. We might even be able to export to user space which pages are currently accessible and which ones not (e.g., pagemap), although it would be racy as long as the VM is running and can trigger a conversion. > > To add a layer of paint to the shed, the usage of SIGBUS for > something that is really a permission access problem doesn't feel SIGBUS stands for "BUS error (bad memory access)." Which makes sense, if you try accessing something that can no longer be accessed. It's now inaccessible. Even if it is temporarily. Just like a page with an MCE error. Swapin errors. Etc. You cannot access it. It might be a permission problem on the pKVM side, but it's not the traditional "permission problem" as in mprotect() and friends. You cannot resolve that permission problem yourself. It's a higher entity that turned that memory inaccessible. > appropriate. Allocating memory via guestmem and donating that to a > protected guest is a way for userspace to voluntarily relinquish access > permissions to the memory it allocated. So a userspace process violating > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > point that signal would be sent, the page would have been accounted > against that userspace process, so not sure the paging examples that > were discussed earlier are exactly comparable. 
To illustrate that > differently, given that pKVM and Gunyah use MMU-based protection, there > is nothing architecturally that prevents a guest from sharing a page > back with Linux as RO. Sure, then allow page faults that allow for reads and give a signal on write faults. In the scenario, it even makes more sense to not constantly require new mmap's from user space just to access a now-shared page. > Note that we don't currently support this, so I > don't want to conflate this use case, but that hopefully makes it a > little more obvious that this is a "there is a page, but you don't > currently have the permission to access it" problem rather than "sorry > but we ran out of pages" problem. We could use other signals, as long as the semantics are clear and documented. Maybe SIGSEGV would be warranted. I consider that a minor detail, though. Requiring mmap()/munmap() dances just to access a page that is now shared from user space sounds a bit suboptimal. But I don't know all the details of the user space implementation. "mmap() the whole thing once and only access what you are supposed to access" sounds reasonable to me. If you don't play by the rules, you get a signal. But I'm happy to learn more details.
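To make the io_uring fixed-buffer scenario from this mail concrete, a minimal userspace sketch (assumes liburing; error handling omitted). Once the buffer is registered, the kernel holds a pin on the page, so it can keep reading/writing it long after munmap() removed every VMA covering it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <sys/mman.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd = open("dump.bin", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	iov.iov_base = buf;
	iov.iov_len = 4096;

	io_uring_queue_init(8, &ring, 0);
	io_uring_register_buffers(&ring, &iov, 1);	/* the page gets pinned here */

	munmap(buf, 4096);				/* no VMA maps the page anymore */

	/* The write still works: it uses the pinned page, not the (gone) VMA. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write_fixed(sqe, fd, buf, 4096, 0, 0);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	return 0;
}

This is exactly the kind of reference that a mapcount/folio_mapped() check would miss, hence the suggestion above to look at the refcount or folio_maybe_dma_pinned() instead.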
On Wednesday 28 Feb 2024 at 12:11:30 (+0100), David Hildenbrand wrote: > On 28.02.24 11:48, Quentin Perret wrote: > > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: > > > > > > > > Ah, this was something I hadn't thought about. I think both Fuad and I > > > > need to update our series to check the refcount rather than mapcount > > > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > > > > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But > > > checking for any unexpected references might be better (there are still some > > > GUP users that don't use FOLL_PIN). > > > > As a non-mm person I'm not sure to understand to consequences of holding > > a GUP pin to a page that is not covered by any VMA. The absence of VMAs > > imply that userspace cannot access the page right? Presumably the kernel > > can't be coerced into accessing that page either? Is that correct? > > Simple example: register the page using an iouring fixed buffer, then unmap > the VMA. iouring now has the page pinned and can read/write it using an > address in the kernel vitual address space (direct map). > > Then, you can happily make the kernel read/write that page using iouring, > even though no VMA still covers/maps that page. Makes sense, and yes that would be a major bug if we let that happen, thanks for the explanation. > [...] > > > > Instead of > > > > > > 1) Converting a page to private only if there are no unexpected > > > references (no mappings, GUP pins, ...) and no VMAs covering it where > > > we could fault it in later > > > 2) Disallowing mmap when the range would contain any private page > > > 3) Handling races between mmap and page conversion > > > > The one thing that makes the second option cleaner from a userspace > > perspective (IMO) is that the conversion to private is happening lazily > > during guest faults. So whether or not an mmapped page can indeed be > > accessed from userspace will be entirely undeterministic as it depends > > on the guest faulting pattern which userspace is entirely unaware of. > > Elliot's suggestion would prevent spurious crashes caused by that > > somewhat odd behaviour, though arguably sane userspace software > > shouldn't be doing that to start with. > > The last sentence is the important one. User space should not access that > memory. If it does, it gets a slap on the hand. Because it should not access > that memory. > > We might even be able to export to user space which pages are currently > accessible and which ones not (e.g., pagemap), although it would be racy as > long as the VM is running and can trigger a conversion. > > > > > To add a layer of paint to the shed, the usage of SIGBUS for > > something that is really a permission access problem doesn't feel > > SIGBUS stands for "BUS error (bad memory access)." > > Which makes sense, if you try accessing something that can no longer be > accessed. It's now inaccessible. Even if it is temporarily. > > Just like a page with an MCE error. Swapin errors. Etc. You cannot access > it. > > It might be a permission problem on the pKVM side, but it's not the > traditional "permission problem" as in mprotect() and friends. You cannot > resolve that permission problem yourself. It's a higher entity that turned > that memory inaccessible. Well that's where I'm not sure to agree. Userspace can, in fact, get back all of that memory by simply killing the protected VM. 
With the approach suggested here, the guestmem pages are entirely accessible to the host until they are attached to a running protected VM which triggers the protection. It is very much userspace saying "I promise not to touch these pages from now on" when it does that, in a way that I personally find very comparable to the mprotect case. It is not some other entity that pulls the carpet from under userspace's feet, it is userspace being inconsistent with itself that causes the issue here, and that's why SIGBUS feels kinda wrong as it tends to be used to report external errors of some sort. > > appropriate. Allocating memory via guestmem and donating that to a > > protected guest is a way for userspace to voluntarily relinquish access > > permissions to the memory it allocated. So a userspace process violating > > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > > point that signal would be sent, the page would have been accounted > > against that userspace process, so not sure the paging examples that > > were discussed earlier are exactly comparable. To illustrate that > > differently, given that pKVM and Gunyah use MMU-based protection, there > > is nothing architecturally that prevents a guest from sharing a page > > back with Linux as RO. > > Sure, then allow page faults that allow for reads and give a signal on write > faults. > > In the scenario, it even makes more sense to not constantly require new > mmap's from user space just to access a now-shared page. > > > Note that we don't currently support this, so I > > don't want to conflate this use case, but that hopefully makes it a > > little more obvious that this is a "there is a page, but you don't > > currently have the permission to access it" problem rather than "sorry > > but we ran out of pages" problem. > > We could user other signals, at least as the semantics are clear and it's > documented. Maybe SIGSEGV would be warranted. > > I consider that a minor detail, though. > > Requiring mmap()/munmap() dances just to access a page that is now shared > from user space sounds a bit suboptimal. But I don't know all the details of > the user space implementation. Agreed, if we could save having to mmap() each page that gets shared back that would be a nice performance optimization. > "mmap() the whole thing once and only access what you are supposed to > access" sounds reasonable to me. If you don't play by the rules, you get a > signal. "... you get a signal, or maybe you don't". But yes I understand your point, and as per the above there are real benefits to this approach so why not. What do we expect userspace to do when a page goes from shared back to being guest-private, because e.g. the guest decides to unshare? Use munmap() on that page? Or perhaps an madvise() call of some sort? Note that this will be needed when starting a guest as well, as userspace needs to copy the guest payload in the guestmem file prior to starting the protected VM. Thanks, Quentin
>>> To add a layer of paint to the shed, the usage of SIGBUS for >>> something that is really a permission access problem doesn't feel >> >> SIGBUS stands for "BUS error (bad memory access)." >> >> Which makes sense, if you try accessing something that can no longer be >> accessed. It's now inaccessible. Even if it is temporarily. >> >> Just like a page with an MCE error. Swapin errors. Etc. You cannot access >> it. >> >> It might be a permission problem on the pKVM side, but it's not the >> traditional "permission problem" as in mprotect() and friends. You cannot >> resolve that permission problem yourself. It's a higher entity that turned >> that memory inaccessible. > > Well that's where I'm not sure to agree. Userspace can, in fact, get > back all of that memory by simply killing the protected VM. With the Right, but that would likely "wipe" the pages so they can be made accessible again, right? That's the whole point why we are handing the pages over to the "higher entity", and allow someone else (the VM) to turn them into a state where we can no longer read them. (if you follow the other discussion, it would actually be nice if we could read them and would get encrypted content back, like s390x does; but that's a different discussion and I assume pretty much out of scope :) ) > approach suggested here, the guestmem pages are entirely accessible to > the host until they are attached to a running protected VM which > triggers the protection. It is very much userspace saying "I promise not > to touch these pages from now on" when it does that, in a way that I > personally find very comparable to the mprotect case. It is not some > other entity that pulls the carpet from under userspace's feet, it is > userspace being inconsistent with itself that causes the issue here, and > that's why SIGBUS feels kinda wrong as it tends to be used to report > external errors of some sort. I recall that user space can also trigger SIGBUS when doing some mmap()+truncate() thingies, and probably a bunch more, that could be fixed up later. I don't see a problem with SIUGBUS here, but I do understand your view. I consider the exact signal a minor detail, though. > >>> appropriate. Allocating memory via guestmem and donating that to a >>> protected guest is a way for userspace to voluntarily relinquish access >>> permissions to the memory it allocated. So a userspace process violating >>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the >>> point that signal would be sent, the page would have been accounted >>> against that userspace process, so not sure the paging examples that >>> were discussed earlier are exactly comparable. To illustrate that >>> differently, given that pKVM and Gunyah use MMU-based protection, there >>> is nothing architecturally that prevents a guest from sharing a page >>> back with Linux as RO. >> >> Sure, then allow page faults that allow for reads and give a signal on write >> faults. >> >> In the scenario, it even makes more sense to not constantly require new >> mmap's from user space just to access a now-shared page. >> >>> Note that we don't currently support this, so I >>> don't want to conflate this use case, but that hopefully makes it a >>> little more obvious that this is a "there is a page, but you don't >>> currently have the permission to access it" problem rather than "sorry >>> but we ran out of pages" problem. >> >> We could user other signals, at least as the semantics are clear and it's >> documented. Maybe SIGSEGV would be warranted. 
>> >> I consider that a minor detail, though. >> >> Requiring mmap()/munmap() dances just to access a page that is now shared >> from user space sounds a bit suboptimal. But I don't know all the details of >> the user space implementation. > > Agreed, if we could save having to mmap() each page that gets shared > back that would be a nice performance optimization. > >> "mmap() the whole thing once and only access what you are supposed to >> access" sounds reasonable to me. If you don't play by the rules, you get a >> signal. > > "... you get a signal, or maybe you don't". But yes I understand your > point, and as per the above there are real benefits to this approach so > why not. > > What do we expect userspace to do when a page goes from shared back to > being guest-private, because e.g. the guest decides to unshare? Use > munmap() on that page? Or perhaps an madvise() call of some sort? Note > that this will be needed when starting a guest as well, as userspace > needs to copy the guest payload in the guestmem file prior to starting > the protected VM. Let's assume we have the whole guest_memfd mapped exactly once in our process, a single VMA. When setting up the VM, we'll write the payload and then fire up the VM. That will (I assume) trigger some shared -> private conversion. When we want to convert shared -> private in the kernel, we would first check if the page is currently mapped. If it is, we could try unmapping that page using an rmap walk. Then, we'd make sure that there are really no other references and protect against concurrent mapping of the page. Now we can convert the page to private. As we want to avoid the rmap walk, user space can be nice and simply MADV_DONTNEED the shared memory portions once it's done with it. For example, after writing the payload. Just a thought, maybe I am missing some details.
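A rough sketch of that shared -> private flow under the assumptions in this thread. folio_mapped(), unmap_mapping_range(), folio_maybe_dma_pinned() and friends are existing primitives; kvm_gmem_mark_private() and the expected_refs baseline are assumptions about how the conversion would track state, and the folio lock is assumed to be what the fault path synchronizes against:

static int kvm_gmem_convert_to_private(struct inode *inode, struct folio *folio,
				       int expected_refs)
{
	/* Zap host userspace PTEs; this is the (file) rmap walk over i_mmap. */
	if (folio_mapped(folio))
		unmap_mapping_range(inode->i_mapping, folio_pos(folio),
				    folio_size(folio), 0);

	/* Assumption: the fault path takes the folio lock, closing the race window. */
	folio_lock(folio);
	if (folio_mapped(folio) || folio_maybe_dma_pinned(folio) ||
	    folio_ref_count(folio) != expected_refs) {
		folio_unlock(folio);
		return -EBUSY;	/* e.g. exit to userspace and let it drop its refs */
	}

	kvm_gmem_mark_private(inode, folio->index);	/* assumed tracking helper */
	folio_unlock(folio);
	return 0;
}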
On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote: > > > > To add a layer of paint to the shed, the usage of SIGBUS for > > > > something that is really a permission access problem doesn't feel > > > > > > SIGBUS stands for "BUS error (bad memory access)." > > > > > > Which makes sense, if you try accessing something that can no longer be > > > accessed. It's now inaccessible. Even if it is temporarily. > > > > > > Just like a page with an MCE error. Swapin errors. Etc. You cannot access > > > it. > > > > > > It might be a permission problem on the pKVM side, but it's not the > > > traditional "permission problem" as in mprotect() and friends. You cannot > > > resolve that permission problem yourself. It's a higher entity that turned > > > that memory inaccessible. > > > > Well that's where I'm not sure to agree. Userspace can, in fact, get > > back all of that memory by simply killing the protected VM. With the > > Right, but that would likely "wipe" the pages so they can be made accessible > again, right? Yep, indeed. > That's the whole point why we are handing the pages over to the "higher > entity", and allow someone else (the VM) to turn them into a state where we > can no longer read them. > > (if you follow the other discussion, it would actually be nice if we could > read them and would get encrypted content back, like s390x does; but that's > a different discussion and I assume pretty much out of scope :) ) Interesting, I'll read up. On a side note, I'm also considering adding a guest-facing hypervisor interface to let the guest decide to opt out of the hypervisor wipe as discussed above. That would be useful for a guest that is shutting itself down (which could be cooperating with the host Linux) and that knows it has erased its secrets. That is in general difficult to do for an OS, but a simple approach could be to poison all its memory (or maybe encrypt it?) before opting out of that wipe. The hypervisor wipe is done in hypervisor context (obviously), which is non-preemptible, so avoiding wiping (or encrypting) loads of memory there is highly desirable. Also pKVM doesn't have a linear map of all memory for security reasons, so we need to map/unmap the pages one by one, which sucks as much as it sounds. But yes, we're digressing, that is all for later :) > > approach suggested here, the guestmem pages are entirely accessible to > > the host until they are attached to a running protected VM which > > triggers the protection. It is very much userspace saying "I promise not > > to touch these pages from now on" when it does that, in a way that I > > personally find very comparable to the mprotect case. It is not some > > other entity that pulls the carpet from under userspace's feet, it is > > userspace being inconsistent with itself that causes the issue here, and > > that's why SIGBUS feels kinda wrong as it tends to be used to report > > external errors of some sort. > > I recall that user space can also trigger SIGBUS when doing some > mmap()+truncate() thingies, and probably a bunch more, that could be fixed > up later. Right, so that probably still falls into "there is no page" bucket rather than the "there is a page that is already accounted against the userspace process, but it doesn't have the permission to access it bucket. But yes that's probably an infinite debate. > I don't see a problem with SIUGBUS here, but I do understand your view. I > consider the exact signal a minor detail, though. > > > > > > > appropriate. 
Allocating memory via guestmem and donating that to a > > > > protected guest is a way for userspace to voluntarily relinquish access > > > > permissions to the memory it allocated. So a userspace process violating > > > > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > > > > point that signal would be sent, the page would have been accounted > > > > against that userspace process, so not sure the paging examples that > > > > were discussed earlier are exactly comparable. To illustrate that > > > > differently, given that pKVM and Gunyah use MMU-based protection, there > > > > is nothing architecturally that prevents a guest from sharing a page > > > > back with Linux as RO. > > > > > > Sure, then allow page faults that allow for reads and give a signal on write > > > faults. > > > > > > In the scenario, it even makes more sense to not constantly require new > > > mmap's from user space just to access a now-shared page. > > > > > > > Note that we don't currently support this, so I > > > > don't want to conflate this use case, but that hopefully makes it a > > > > little more obvious that this is a "there is a page, but you don't > > > > currently have the permission to access it" problem rather than "sorry > > > > but we ran out of pages" problem. > > > > > > We could user other signals, at least as the semantics are clear and it's > > > documented. Maybe SIGSEGV would be warranted. > > > > > > I consider that a minor detail, though. > > > > > > Requiring mmap()/munmap() dances just to access a page that is now shared > > > from user space sounds a bit suboptimal. But I don't know all the details of > > > the user space implementation. > > > > Agreed, if we could save having to mmap() each page that gets shared > > back that would be a nice performance optimization. > > > > > "mmap() the whole thing once and only access what you are supposed to (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > > > signal. > > > > "... you get a signal, or maybe you don't". But yes I understand your > > point, and as per the above there are real benefits to this approach so > > why not. > > > > What do we expect userspace to do when a page goes from shared back to > > being guest-private, because e.g. the guest decides to unshare? Use > > munmap() on that page? Or perhaps an madvise() call of some sort? Note > > that this will be needed when starting a guest as well, as userspace > > needs to copy the guest payload in the guestmem file prior to starting > > the protected VM. > > Let's assume we have the whole guest_memfd mapped exactly once in our > process, a single VMA. > > When setting up the VM, we'll write the payload and then fire up the VM. > > That will (I assume) trigger some shared -> private conversion. > > When we want to convert shared -> private in the kernel, we would first > check if the page is currently mapped. If it is, we could try unmapping that > page using an rmap walk. I had not considered that. That would most certainly be slow, but a well behaved userspace process shouldn't hit it so, that's probably not a problem... > Then, we'd make sure that there are really no other references and protect > against concurrent mapping of the page. Now we can convert the page to > private. Right. 
Alternatively, the shared->private conversion happens in the KVM vcpu run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a new exit reason saying "can't donate that page while it's shared" and have userspace use MADV_DONTNEED or munmap, or whatever on the back of that. But I tend to prefer the rmap option if it's workable as that avoids adding new KVM userspace ABI. > As we want to avoid the rmap walk, user space can be nice and simply > MADV_DONTNEED the shared memory portions once it's done with it. For > example, after writing the payload. That makes sense to me. Thanks, Quentin
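From the VMM side, the "be nice" variant could look roughly like the sketch below. KVM_CREATE_GUEST_MEMFD and struct kvm_create_guest_memfd come from the current uapi headers; the assumption that the resulting fd can be mmap()ed at all is exactly what this series proposes:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Sketch: vm_fd is a KVM VM fd; error handling omitted. */
static void *load_payload(int vm_fd, const void *payload, size_t payload_size,
			  size_t mem_size)
{
	struct kvm_create_guest_memfd args = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);

	memcpy(mem, payload, payload_size);	/* guest image, DT, ... */

	/*
	 * Be nice: zap our PTEs so a later shared -> private conversion finds
	 * nothing to unmap. The data itself stays in the guest_memfd file.
	 */
	madvise(mem, mem_size, MADV_DONTNEED);
	return mem;
}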
On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote: > > > So you don't need any guest_memfd games to protect from that -- and one > > > doesn't have to travel back in time to have memory that isn't > > > swappable/migratable and only comes in one page size. > > > > > > [I'm not up-to-date which obscure corner-cases CCA requirement the s390x > > > implementation cannot fulfill -- like replacing pages in page tables and > > > such; I suspect pKVM also cannot cover all these corner-cases] > > > > Thanks for this. I'll do some more reading on how things work with s390x. > > > > Right, and of course, one key difference of course is that pKVM > > doesn't encrypt anything, and only relies on stage-2 protection to > > protect the guest. > > I don't remember what exactly s390x does, but I recall that it might only > encrypt the memory content as it transitions a page from secure to > non-secure. > > Something like that could also be implemented using pKVM (unless I am > missing something), but it might not be that trivial, of course :) One of the complicated aspects of having the host migrate pages like so is for the hypervisor to make sure the content of the page has not been tampered with when the new page is re-mapped in the guest. That means having additional tracking in the hypervisor of pages that have been encrypted and returned to the host, indexed by IPA, with something like a 'checksum' of some sort, which is non-trivial to do securely from a cryptographic PoV. A simpler and secure way to do this (I think) is to do hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa, old_pa, new_pa) hypercall which Linux can call to migrate a page. pKVM unmaps the new page from the host stage-2, unmaps the old page from guest stage-2, does the copy, wipes the old page, maps the pages in the respective page-tables, and off we go. That way the content is never visible to Linux and that avoids the problems I highlighted above by construction. The downside is that it doesn't work for swapping, but that is quite hard to do in general...
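Roughly, the hypervisor-assisted flow described above, as pseudocode. Every function name here is illustrative rather than an existing pKVM interface:

/* Illustrative pseudocode only -- none of these helpers exist as such. */
int pkvm_hyp_migrate_page(u64 ipa, phys_addr_t old_pa, phys_addr_t new_pa)
{
	/* Take the destination page away from the host. */
	host_stage2_unmap(new_pa, PAGE_SIZE);

	/* Unmap the source from the guest so neither side can race the copy. */
	guest_stage2_unmap(ipa, PAGE_SIZE);

	/* Copy and wipe entirely at EL2; the content is never visible to Linux. */
	hyp_copy_page(new_pa, old_pa);
	hyp_wipe_page(old_pa);

	/* Swap ownership: new page to the guest, old page back to the host. */
	guest_stage2_map(ipa, new_pa);
	host_stage2_map(old_pa, PAGE_SIZE);

	return 0;
}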
On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote: > Alternatively, the shared->private conversion happens in the KVM vcpu > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > new exit reason saying "can't donate that page while it's shared" and > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > of that. But I tend to prefer the rmap option if it's workable as that > avoids adding new KVM userspace ABI. > You'll still probably need the new exit reason saying "can't donate that page while it's shared" if the refcount tests fail. David's iouring example shows how some other part of the kernel can hold a reference to the page. I can't think of anything to do other than exiting to userspace because we don't know how to drop that extra ref. Thanks, Elliot
On Wednesday 28 Feb 2024 at 10:43:27 (-0800), Elliot Berman wrote: > On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote: > > Alternatively, the shared->private conversion happens in the KVM vcpu > > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > > new exit reason saying "can't donate that page while it's shared" and > > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > > of that. But I tend to prefer the rmap option if it's workable as that > > avoids adding new KVM userspace ABI. > > > > You'll still probably need the new exit reason saying "can't donate that > page while it's shared" if the refcount tests fail. Can use David's > iouring as example of some other part of the kernel has a reference to > the page. I can't think of anything to do other than exiting to > userspace because we don't know how to drop that extra ref. Ack, I realized that later on. I guess there may be cases where userspace doesn't know how to drop that pin, but that's not the kernel's fault and it can't do anything about it, so a userspace exit is our best chance... Thanks, Quentin
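For illustration, the userspace side of such an exit could be as small as the sketch below. Neither the exit reason nor the way the offending range would be reported exists today; both are precisely the new ABI being debated:

#include <sys/mman.h>

/* Hypothetical handler for a "can't donate that page while it's shared" exit. */
static void handle_unshare_exit(unsigned char *gmem_base, size_t offset, size_t len)
{
	/* Drop our PTEs (and/or unregister any pins) over the range, then retry KVM_RUN. */
	madvise(gmem_base + offset, len, MADV_DONTNEED);
}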
On 28.02.24 15:01, Quentin Perret wrote: > On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote: >>>> So you don't need any guest_memfd games to protect from that -- and one >>>> doesn't have to travel back in time to have memory that isn't >>>> swappable/migratable and only comes in one page size. >>>> >>>> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x >>>> implementation cannot fulfill -- like replacing pages in page tables and >>>> such; I suspect pKVM also cannot cover all these corner-cases] >>> >>> Thanks for this. I'll do some more reading on how things work with s390x. >>> >>> Right, and of course, one key difference of course is that pKVM >>> doesn't encrypt anything, and only relies on stage-2 protection to >>> protect the guest. >> >> I don't remember what exactly s390x does, but I recall that it might only >> encrypt the memory content as it transitions a page from secure to >> non-secure. >> >> Something like that could also be implemented using pKVM (unless I am >> missing something), but it might not be that trivial, of course :) > > One of the complicated aspects of having the host migrate pages like so > is for the hypervisor to make sure the content of the page has not been > tempered with when the new page is re-mapped in the guest. That means > having additional tracking in the hypervisor of pages that have been > encrypted and returned to the host, indexed by IPA, with something like > a 'checksum' of some sort, which is non-trivial to do securely from a > cryptographic PoV. > I don't know what exactly s390x does, and how it does it -- and which CCA cases they can handle. Details are scarce, for example: https://www.ibm.com/docs/en/linux-on-systems?topic=virtualization-introducing-secure-execution-linux I suspect they do it in some way you describe, and I fully agree with the "non-trivial" aspect :) > A simpler and secure way to do this is (I think) is to do > hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa, > old_pa, new_pa) hypercall which Linux can call to migrate a page. > pKVM unmaps the new page from the host stage-2, unmap the old page from > guest stage-2, does the copy, wipes the old page, maps the pages in the > respective page-tables, and off we go. That way the content is never > visible to Linux and that avoids the problems I highlighted above by > construction. The downside is that it doesn't work for swapping, but > that is quite hard to do in general... The main "problem" with that is that you still end up with these inaccessible pages, that require the use of guest_memfd for your in-place conversion requirement in the first place. I'm sure at some point we'll see migration support also for TDX and friends. For pKVM it might be even easier to support.
On 28.02.24 14:34, Quentin Perret wrote: > On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote: >>>>> To add a layer of paint to the shed, the usage of SIGBUS for >>>>> something that is really a permission access problem doesn't feel >>>> >>>> SIGBUS stands for "BUS error (bad memory access)." >>>> >>>> Which makes sense, if you try accessing something that can no longer be >>>> accessed. It's now inaccessible. Even if it is temporarily. >>>> >>>> Just like a page with an MCE error. Swapin errors. Etc. You cannot access >>>> it. >>>> >>>> It might be a permission problem on the pKVM side, but it's not the >>>> traditional "permission problem" as in mprotect() and friends. You cannot >>>> resolve that permission problem yourself. It's a higher entity that turned >>>> that memory inaccessible. >>> >>> Well that's where I'm not sure to agree. Userspace can, in fact, get >>> back all of that memory by simply killing the protected VM. With the >> >> Right, but that would likely "wipe" the pages so they can be made accessible >> again, right? > > Yep, indeed. > >> That's the whole point why we are handing the pages over to the "higher >> entity", and allow someone else (the VM) to turn them into a state where we >> can no longer read them. >> >> (if you follow the other discussion, it would actually be nice if we could >> read them and would get encrypted content back, like s390x does; but that's >> a different discussion and I assume pretty much out of scope :) ) > > Interesting, I'll read up. On a side note, I'm also considering adding a > guest-facing hypervisor interface to let the guest decide to opt out of > the hypervisor wipe as discussed above. That would be useful for a guest > that is shutting itself down (which could be cooperating with the host > Linux) and that knows it has erased its secrets. That is in general > difficult to do for an OS, but a simple approach could be to poison all > its memory (or maybe encrypt it?) before opting out of that wipe. > > The hypervisor wipe is done in hypervisor context (obviously), which is > non-preemptible, so avoiding wiping (or encrypting) loads of memory > there is highly desirable. Also pKVM doesn't have a linear map of all > memory for security reasons, so we need to map/unmap the pages one by > one, which sucks as much as it sounds. > > But yes, we're digressing, that is all for later :) :) Sounds like an interesting optimization. An alternative would be to remember in pKVM that a page needs a wipe before reaccess. Once re-accessed by anybody (hypervisor or new guest), it first has to be wiped by pKVM. ... but that also sounds complicated and similar requires the linear map+unmap in pKVM page-by-page as they are reused. But at least a guest shutdown would be faster. > >>> approach suggested here, the guestmem pages are entirely accessible to >>> the host until they are attached to a running protected VM which >>> triggers the protection. It is very much userspace saying "I promise not >>> to touch these pages from now on" when it does that, in a way that I >>> personally find very comparable to the mprotect case. It is not some >>> other entity that pulls the carpet from under userspace's feet, it is >>> userspace being inconsistent with itself that causes the issue here, and >>> that's why SIGBUS feels kinda wrong as it tends to be used to report >>> external errors of some sort. 
>> >> I recall that user space can also trigger SIGBUS when doing some >> mmap()+truncate() thingies, and probably a bunch more, that could be fixed >> up later. > > Right, so that probably still falls into "there is no page" bucket > rather than the "there is a page that is already accounted against the > userspace process, but it doesn't have the permission to access it > bucket. But yes that's probably an infinite debate. Yes, we should rather focus on the bigger idea: have inaccessible memory that fails a pagefault instead of the mmap. > >> I don't see a problem with SIUGBUS here, but I do understand your view. I >> consider the exact signal a minor detail, though. >> >>> >>>>> appropriate. Allocating memory via guestmem and donating that to a >>>>> protected guest is a way for userspace to voluntarily relinquish access >>>>> permissions to the memory it allocated. So a userspace process violating >>>>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the >>>>> point that signal would be sent, the page would have been accounted >>>>> against that userspace process, so not sure the paging examples that >>>>> were discussed earlier are exactly comparable. To illustrate that >>>>> differently, given that pKVM and Gunyah use MMU-based protection, there >>>>> is nothing architecturally that prevents a guest from sharing a page >>>>> back with Linux as RO. >>>> >>>> Sure, then allow page faults that allow for reads and give a signal on write >>>> faults. >>>> >>>> In the scenario, it even makes more sense to not constantly require new >>>> mmap's from user space just to access a now-shared page. >>>> >>>>> Note that we don't currently support this, so I >>>>> don't want to conflate this use case, but that hopefully makes it a >>>>> little more obvious that this is a "there is a page, but you don't >>>>> currently have the permission to access it" problem rather than "sorry >>>>> but we ran out of pages" problem. >>>> >>>> We could user other signals, at least as the semantics are clear and it's >>>> documented. Maybe SIGSEGV would be warranted. >>>> >>>> I consider that a minor detail, though. >>>> >>>> Requiring mmap()/munmap() dances just to access a page that is now shared >>>> from user space sounds a bit suboptimal. But I don't know all the details of >>>> the user space implementation. >>> >>> Agreed, if we could save having to mmap() each page that gets shared >>> back that would be a nice performance optimization. >>> >>>> "mmap() the whole thing once and only access what you are supposed to > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a >>>> signal. >>> >>> "... you get a signal, or maybe you don't". But yes I understand your >>> point, and as per the above there are real benefits to this approach so >>> why not. >>> >>> What do we expect userspace to do when a page goes from shared back to >>> being guest-private, because e.g. the guest decides to unshare? Use >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note >>> that this will be needed when starting a guest as well, as userspace >>> needs to copy the guest payload in the guestmem file prior to starting >>> the protected VM. >> >> Let's assume we have the whole guest_memfd mapped exactly once in our >> process, a single VMA. >> >> When setting up the VM, we'll write the payload and then fire up the VM. >> >> That will (I assume) trigger some shared -> private conversion. 
>> >> When we want to convert shared -> private in the kernel, we would first >> check if the page is currently mapped. If it is, we could try unmapping that >> page using an rmap walk. > > I had not considered that. That would most certainly be slow, but a well > behaved userspace process shouldn't hit it so, that's probably not a > problem... If there really only is a single VMA that covers the page (or even mmaps the guest_memfd), it should not be too bad. For example, any fallocate(PUNCHHOLE) has to do the same, to unmap the page before discarding it from the pagecache. But of course, no rmap walk is always better. > >> Then, we'd make sure that there are really no other references and protect >> against concurrent mapping of the page. Now we can convert the page to >> private. > > Right. > > Alternatively, the shared->private conversion happens in the KVM vcpu > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > new exit reason saying "can't donate that page while it's shared" and > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > of that. But I tend to prefer the rmap option if it's workable as that > avoids adding new KVM userspace ABI. As discussed in the sub-thread, that might still be required. One could think about completely forbidding GUP on these mmap'ed guest-memfds. But likely, there might be use cases in the future where you want to use GUP on shared memory inside a guest_memfd. (the iouring example I gave might currently not work because FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some details)
Hi David, ... >>>> "mmap() the whole thing once and only access what you are supposed to > > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > >>>> signal. > >>> > >>> "... you get a signal, or maybe you don't". But yes I understand your > >>> point, and as per the above there are real benefits to this approach so > >>> why not. > >>> > >>> What do we expect userspace to do when a page goes from shared back to > >>> being guest-private, because e.g. the guest decides to unshare? Use > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note > >>> that this will be needed when starting a guest as well, as userspace > >>> needs to copy the guest payload in the guestmem file prior to starting > >>> the protected VM. > >> > >> Let's assume we have the whole guest_memfd mapped exactly once in our > >> process, a single VMA. > >> > >> When setting up the VM, we'll write the payload and then fire up the VM. > >> > >> That will (I assume) trigger some shared -> private conversion. > >> > >> When we want to convert shared -> private in the kernel, we would first > >> check if the page is currently mapped. If it is, we could try unmapping that > >> page using an rmap walk. > > > > I had not considered that. That would most certainly be slow, but a well > > behaved userspace process shouldn't hit it so, that's probably not a > > problem... > > If there really only is a single VMA that covers the page (or even mmaps > the guest_memfd), it should not be too bad. For example, any > fallocate(PUNCHHOLE) has to do the same, to unmap the page before > discarding it from the pagecache. I don't think that we can assume that only a single VMA covers a page. > But of course, no rmap walk is always better. We've been thinking some more about how to handle the case where the host userspace has a mapping of a page that later becomes private. One idea is to refuse to run the guest (i.e., exit vcpu_run() to back to the host with a meaningful exit reason) until the host unmaps that page, and check for the refcount to the page as you mentioned earlier. This is essentially what the RFC I sent does (minus the bugs :) ) . The other idea is to use the rmap walk as you suggested to zap that page. If the host tries to access that page again, it would get a SIGBUS on the fault. This has the advantage that, as you'd mentioned, the host doesn't need to constantly mmap() and munmap() pages. It could potentially be optimised further as suggested if we have a cooperating VMM that would issue a MADV_DONTNEED or something like that, but that's just an optimisation and we would still need to have the option of the rmap walk. However, I was wondering how practical this idea would be if more than a single VMA covers a page? Also, there's the question of what to do if the page is gupped? In this case I think the only thing we can do is refuse to run the guest until the gup (and all references) are released, which also brings us back to the way things (kind of) are... Thanks, /fuad
On Thu, Feb 29, 2024 at 07:01:51PM +0000, Fuad Tabba wrote: > Hi David, > > ... > > >>>> "mmap() the whole thing once and only access what you are supposed to > > > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > > >>>> signal. > > >>> > > >>> "... you get a signal, or maybe you don't". But yes I understand your > > >>> point, and as per the above there are real benefits to this approach so > > >>> why not. > > >>> > > >>> What do we expect userspace to do when a page goes from shared back to > > >>> being guest-private, because e.g. the guest decides to unshare? Use > > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note > > >>> that this will be needed when starting a guest as well, as userspace > > >>> needs to copy the guest payload in the guestmem file prior to starting > > >>> the protected VM. > > >> > > >> Let's assume we have the whole guest_memfd mapped exactly once in our > > >> process, a single VMA. > > >> > > >> When setting up the VM, we'll write the payload and then fire up the VM. > > >> > > >> That will (I assume) trigger some shared -> private conversion. > > >> > > >> When we want to convert shared -> private in the kernel, we would first > > >> check if the page is currently mapped. If it is, we could try unmapping that > > >> page using an rmap walk. > > > > > > I had not considered that. That would most certainly be slow, but a well > > > behaved userspace process shouldn't hit it so, that's probably not a > > > problem... > > > > If there really only is a single VMA that covers the page (or even mmaps > > the guest_memfd), it should not be too bad. For example, any > > fallocate(PUNCHHOLE) has to do the same, to unmap the page before > > discarding it from the pagecache. > > I don't think that we can assume that only a single VMA covers a page. > > > But of course, no rmap walk is always better. > > We've been thinking some more about how to handle the case where the > host userspace has a mapping of a page that later becomes private. > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > to the host with a meaningful exit reason) until the host unmaps that > page, and check for the refcount to the page as you mentioned earlier. > This is essentially what the RFC I sent does (minus the bugs :) ) . > > The other idea is to use the rmap walk as you suggested to zap that > page. If the host tries to access that page again, it would get a > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > the host doesn't need to constantly mmap() and munmap() pages. It > could potentially be optimised further as suggested if we have a > cooperating VMM that would issue a MADV_DONTNEED or something like > that, but that's just an optimisation and we would still need to have > the option of the rmap walk. However, I was wondering how practical > this idea would be if more than a single VMA covers a page? > Agree with all your points here. I changed Gunyah's implementation to do the unmap instead of erroring out. I didn't observe a significant performance difference. However, doing unmap might be a little faster because we can check folio_mapped() before doing the rmap walk. When erroring out at mmap() level, we always have to do the walk. > Also, there's the question of what to do if the page is gupped? 
In > this case I think the only thing we can do is refuse to run the guest > until the gup (and all references) are released, which also brings us > back to the way things (kind of) are... > If there are gup users who don't do FOLL_PIN, I think we either need to fix them or live with the possibility here? We don't have a reliable refcount for a folio to be safe to unmap: it might be that another vCPU is trying to get the same page, has incremented the refcount, and is waiting for the folio_lock. This problem exists whether we block the mmap() or do SIGBUS. Thanks, Elliot
On 29.02.24 20:01, Fuad Tabba wrote: > Hi David, > > ... > Hi! >>>>> "mmap() the whole thing once and only access what you are supposed to >>> (> > > access" sounds reasonable to me. If you don't play by the rules, you get a >>>>>> signal. >>>>> >>>>> "... you get a signal, or maybe you don't". But yes I understand your >>>>> point, and as per the above there are real benefits to this approach so >>>>> why not. >>>>> >>>>> What do we expect userspace to do when a page goes from shared back to >>>>> being guest-private, because e.g. the guest decides to unshare? Use >>>>> munmap() on that page? Or perhaps an madvise() call of some sort? Note >>>>> that this will be needed when starting a guest as well, as userspace >>>>> needs to copy the guest payload in the guestmem file prior to starting >>>>> the protected VM. >>>> >>>> Let's assume we have the whole guest_memfd mapped exactly once in our >>>> process, a single VMA. >>>> >>>> When setting up the VM, we'll write the payload and then fire up the VM. >>>> >>>> That will (I assume) trigger some shared -> private conversion. >>>> >>>> When we want to convert shared -> private in the kernel, we would first >>>> check if the page is currently mapped. If it is, we could try unmapping that >>>> page using an rmap walk. >>> >>> I had not considered that. That would most certainly be slow, but a well >>> behaved userspace process shouldn't hit it so, that's probably not a >>> problem... >> >> If there really only is a single VMA that covers the page (or even mmaps >> the guest_memfd), it should not be too bad. For example, any >> fallocate(PUNCHHOLE) has to do the same, to unmap the page before >> discarding it from the pagecache. > > I don't think that we can assume that only a single VMA covers a page. > >> But of course, no rmap walk is always better. > > We've been thinking some more about how to handle the case where the > host userspace has a mapping of a page that later becomes private. > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > to the host with a meaningful exit reason) until the host unmaps that > page, and check for the refcount to the page as you mentioned earlier. > This is essentially what the RFC I sent does (minus the bugs :) ) . :) > > The other idea is to use the rmap walk as you suggested to zap that > page. If the host tries to access that page again, it would get a > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > the host doesn't need to constantly mmap() and munmap() pages. It > could potentially be optimised further as suggested if we have a > cooperating VMM that would issue a MADV_DONTNEED or something like > that, but that's just an optimisation and we would still need to have > the option of the rmap walk. However, I was wondering how practical > this idea would be if more than a single VMA covers a page? A few VMAs won't make a difference I assume. Any idea how many "sharers" you'd expect in a sane configuration? > > Also, there's the question of what to do if the page is gupped? In > this case I think the only thing we can do is refuse to run the guest > until the gup (and all references) are released, which also brings us > back to the way things (kind of) are... Indeed. There is no way you could possibly make progress.
>> I don't think that we can assume that only a single VMA covers a page. >> >>> But of course, no rmap walk is always better. >> >> We've been thinking some more about how to handle the case where the >> host userspace has a mapping of a page that later becomes private. >> >> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back >> to the host with a meaningful exit reason) until the host unmaps that >> page, and check for the refcount to the page as you mentioned earlier. >> This is essentially what the RFC I sent does (minus the bugs :) ) . >> >> The other idea is to use the rmap walk as you suggested to zap that >> page. If the host tries to access that page again, it would get a >> SIGBUS on the fault. This has the advantage that, as you'd mentioned, >> the host doesn't need to constantly mmap() and munmap() pages. It >> could potentially be optimised further as suggested if we have a >> cooperating VMM that would issue a MADV_DONTNEED or something like >> that, but that's just an optimisation and we would still need to have >> the option of the rmap walk. However, I was wondering how practical >> this idea would be if more than a single VMA covers a page? >> > > Agree with all your points here. I changed Gunyah's implementation to do > the unmap instead of erroring out. I didn't observe a significant > performance difference. However, doing unmap might be a little faster > because we can check folio_mapped() before doing the rmap walk. When > erroring out at mmap() level, we always have to do the walk. Right. On the mmap() level you won't really have to walk page tables, as the the munmap() already zapped the page and removed the "problematic" VMA. Likely, you really want to avoid repeatedly calling mmap()+munmap() just to access shared memory; but that's just my best guess about your user space app :) > >> Also, there's the question of what to do if the page is gupped? In >> this case I think the only thing we can do is refuse to run the guest >> until the gup (and all references) are released, which also brings us >> back to the way things (kind of) are... >> > > If there are gup users who don't do FOLL_PIN, I think we either need to > fix them or live with possibility here? We don't have a reliable > refcount for a folio to be safe to unmap: it might be that another vCPU > is trying to get the same page, has incremented the refcount, and > waiting for the folio_lock. Likely there could be a way to detect that when only the vCPUs are your concern? But yes, it's nasty. (has to be handled in either case :() Disallowing any FOLL_GET|FOLL_PIN could work. Not sure how some core-kernel FOLL_GET users would react to that, though. See vma_is_secretmem() and folio_is_secretmem() in mm/gup.c, where we disallow any FOLL_GET|FOLL_PIN of secretmem pages. We'd need a way to teach core-mm similarly about guest_memfd, which might end up rather tricky, but not impossible :) > This problem exists whether we block the > mmap() or do SIGBUS. There is work on doing more conversion to FOLL_PIN, but some cases are harder to convert. Most of O_DIRECT should be using it nowadays, but some other known use cases don't. The simplest and readily-available example is still vmsplice(). I don't think it was fixed yet to use FOLL_PIN. Use vmsplice() to pin the page in the pipe (read-only). Unmap the VMA. You can read the page any time later by reading from the pipe. So I wouldn't bet on all relevant cases being gone in the near future.
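The vmsplice() case mentioned above is easy to reproduce from userspace; a minimal sketch (error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	char buf[16];
	struct iovec iov;
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	pipe(pipefd);
	strcpy(page, "secret");

	iov.iov_base = page;
	iov.iov_len = 4096;
	vmsplice(pipefd[1], &iov, 1, 0);	/* takes a reference on the page */

	munmap(page, 4096);			/* no VMA covers the page anymore */

	read(pipefd[0], buf, sizeof(buf));	/* ...yet its contents are still readable */
	return 0;
}

Unlike the io_uring fixed-buffer case, vmsplice() still takes plain references rather than FOLL_PIN pins, which is why checking only folio_maybe_dma_pinned() would not be enough.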
On Thursday 29 Feb 2024 at 11:04:09 (+0100), David Hildenbrand wrote: > An alternative would be to remember in pKVM that a page needs a wipe before > reaccess. Once re-accessed by anybody (hypervisor or new guest), it first > has to be wiped by pKVM. > > ... but that also sounds complicated and similar requires the linear > map+unmap in pKVM page-by-page as they are reused. But at least a guest > shutdown would be faster. Yep, FWIW we did try that, but ended up having issues with Linux trying to DMA to these pages before 'touching' them from the CPU side. pKVM can keep the pages unmapped from the CPU and IOMMU stage-2 page-tables, and we can easily handle the CPU faults lazily, but not faults from other masters, our hardware generally doesn't support that. <snip> > As discussed in the sub-thread, that might still be required. > > One could think about completely forbidding GUP on these mmap'ed > guest-memfds. But likely, there might be use cases in the future where you > want to use GUP on shared memory inside a guest_memfd. > > (the iouring example I gave might currently not work because > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > details) Perhaps it would be wise to start with GUP being forbidden if the current users do not need it (not sure if that is the case in Android, I'll check) ? We can always relax this constraint later when/if the use-cases arise, which is obviously much harder to do the other way around. Thanks, Quentin
On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote: > > > I don't think that we can assume that only a single VMA covers a page. > > > > > > > But of course, no rmap walk is always better. > > > > > > We've been thinking some more about how to handle the case where the > > > host userspace has a mapping of a page that later becomes private. > > > > > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > > > to the host with a meaningful exit reason) until the host unmaps that > > > page, and check for the refcount to the page as you mentioned earlier. > > > This is essentially what the RFC I sent does (minus the bugs :) ) . > > > > > > The other idea is to use the rmap walk as you suggested to zap that > > > page. If the host tries to access that page again, it would get a > > > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > > > the host doesn't need to constantly mmap() and munmap() pages. It > > > could potentially be optimised further as suggested if we have a > > > cooperating VMM that would issue a MADV_DONTNEED or something like > > > that, but that's just an optimisation and we would still need to have > > > the option of the rmap walk. However, I was wondering how practical > > > this idea would be if more than a single VMA covers a page? > > > > > > > Agree with all your points here. I changed Gunyah's implementation to do > > the unmap instead of erroring out. I didn't observe a significant > > performance difference. However, doing unmap might be a little faster > > because we can check folio_mapped() before doing the rmap walk. When > > erroring out at mmap() level, we always have to do the walk. > > Right. On the mmap() level you won't really have to walk page tables, as the > the munmap() already zapped the page and removed the "problematic" VMA. > > Likely, you really want to avoid repeatedly calling mmap()+munmap() just to > access shared memory; but that's just my best guess about your user space > app :) Ack, and expecting userspace to munmap the pages whenever we hit a valid mapping in userspace page-tables in the KVM faults path makes for a somewhat unusual interface IMO. Userspace can munmap, mmap again, and if it doesn't touch the pages, it can proceed to run the guest just fine, is that the expectation? If so, it feels like we're 'leaking' internal kernel state somehow. The kernel is normally well within its rights to zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a weird case, but even in that case, IIRC the kernel still has a certain amount of flexibility and can use compaction and friends). Similarly, it should be well within its right to proactively create them. How would this scheme work if, 10 years from now, something like Speculative Page Faults makes it into the kernel in a different form? Not requiring to userspace to unmap makes the userspace interface a lot simpler I think -- once a protected guest starts, you better not touch its memory if it's not been shared back or you'll get slapped on the wrist. Whether or not those pages have been accessed beforehand for example is irrelevant.
On Mon, Mar 04, 2024, Quentin Perret wrote: > > As discussed in the sub-thread, that might still be required. > > > > One could think about completely forbidding GUP on these mmap'ed > > guest-memfds. But likely, there might be use cases in the future where you > > want to use GUP on shared memory inside a guest_memfd. > > > > (the iouring example I gave might currently not work because > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > details) > > Perhaps it would be wise to start with GUP being forbidden if the > current users do not need it (not sure if that is the case in Android, > I'll check) ? We can always relax this constraint later when/if the > use-cases arise, which is obviously much harder to do the other way > around. +1000. At least on the KVM side, I would like to be as conservative as possible when it comes to letting anything other than the guest access guest_memfd.
On 04.03.24 20:04, Sean Christopherson wrote: > On Mon, Mar 04, 2024, Quentin Perret wrote: >>> As discussed in the sub-thread, that might still be required. >>> >>> One could think about completely forbidding GUP on these mmap'ed >>> guest-memfds. But likely, there might be use cases in the future where you >>> want to use GUP on shared memory inside a guest_memfd. >>> >>> (the iouring example I gave might currently not work because >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>> details) >> >> Perhaps it would be wise to start with GUP being forbidden if the >> current users do not need it (not sure if that is the case in Android, >> I'll check) ? We can always relax this constraint later when/if the >> use-cases arise, which is obviously much harder to do the other way >> around. > > +1000. At least on the KVM side, I would like to be as conservative as possible > when it comes to letting anything other than the guest access guest_memfd. So we'll have to do it similarly to any occurrences of "secretmem" in gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code similar to e.g., folio_is_secretmem(). IIRC, we might not be able to de-reference the actual mapping because it could get freed concurrently ... That will then prohibit any kind of GUP access to these pages, including reading/writing for ptrace/debugging purposes, for core dumping purposes etc. But at least, you know that nobody was able to obtain page references using GUP that might be used for reading/writing later.
On 04.03.24 13:53, Quentin Perret wrote: > On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote: >>>> I don't think that we can assume that only a single VMA covers a page. >>>> >>>>> But of course, no rmap walk is always better. >>>> >>>> We've been thinking some more about how to handle the case where the >>>> host userspace has a mapping of a page that later becomes private. >>>> >>>> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back >>>> to the host with a meaningful exit reason) until the host unmaps that >>>> page, and check for the refcount to the page as you mentioned earlier. >>>> This is essentially what the RFC I sent does (minus the bugs :) ) . >>>> >>>> The other idea is to use the rmap walk as you suggested to zap that >>>> page. If the host tries to access that page again, it would get a >>>> SIGBUS on the fault. This has the advantage that, as you'd mentioned, >>>> the host doesn't need to constantly mmap() and munmap() pages. It >>>> could potentially be optimised further as suggested if we have a >>>> cooperating VMM that would issue a MADV_DONTNEED or something like >>>> that, but that's just an optimisation and we would still need to have >>>> the option of the rmap walk. However, I was wondering how practical >>>> this idea would be if more than a single VMA covers a page? >>>> >>> >>> Agree with all your points here. I changed Gunyah's implementation to do >>> the unmap instead of erroring out. I didn't observe a significant >>> performance difference. However, doing unmap might be a little faster >>> because we can check folio_mapped() before doing the rmap walk. When >>> erroring out at mmap() level, we always have to do the walk. >> >> Right. On the mmap() level you won't really have to walk page tables, as the >> the munmap() already zapped the page and removed the "problematic" VMA. >> >> Likely, you really want to avoid repeatedly calling mmap()+munmap() just to >> access shared memory; but that's just my best guess about your user space >> app :) > > Ack, and expecting userspace to munmap the pages whenever we hit a valid > mapping in userspace page-tables in the KVM faults path makes for a > somewhat unusual interface IMO. Userspace can munmap, mmap again, and if > it doesn't touch the pages, it can proceed to run the guest just fine, > is that the expectation? If so, it feels like we're 'leaking' internal It would be weird, and I would not suggest that. It's either (1) you can leave it mmap'ed, but any access to private memory will SIGBUS. The kernel will try zapping pages inside a VMA itself. (2) you cannot leave it mmap'ed. In order to convert shared -> private, you have to munmap. mmap will fail if it would cover a currently-private page. So for (1) you could mmap once in user space and be done with it. For (2) you would have to mmap+munmap when accessing shared memory. > kernel state somehow. The kernel is normally well within its rights to > zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a > weird case, but even in that case, IIRC the kernel still has a certain > amount of flexibility and can use compaction and friends). Similarly, > it should be well within its right to proactively create them. How > would this scheme work if, 10 years from now, something like > Speculative Page Faults makes it into the kernel in a different form? At least with (2), speculative page faults would never generate a SIGBUS. Just like HW must not generate a fault on speculative access. 
> > Not requiring to userspace to unmap makes the userspace interface a lot > simpler I think -- once a protected guest starts, you better not touch > its memory if it's not been shared back or you'll get slapped on the > wrist. Whether or not those pages have been accessed beforehand for > example is irrelevant. Yes. So the theory :)
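For concreteness, a minimal sketch of what option (1) could look like on the guest_memfd side, with made-up helper names (kvm_gmem_is_shared() and the shared/private tracking behind it are assumptions, not existing KVM code): the shared->private conversion zaps any host userspace PTEs, and the fault handler refuses to fault private memory back in.

static void kvm_gmem_zap_for_private(struct inode *inode, pgoff_t index)
{
    struct folio *folio = filemap_lock_folio(inode->i_mapping, index);

    if (IS_ERR(folio))
        return;

    /* As noted above, folio_mapped() lets us skip the rmap walk. */
    if (folio_mapped(folio))
        unmap_mapping_pages(inode->i_mapping, index, 1, false);

    folio_unlock(folio);
    folio_put(folio);
}

static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
    struct inode *inode = file_inode(vmf->vma->vm_file);

    /* Host access to currently-private memory gets a SIGBUS. */
    if (!kvm_gmem_is_shared(inode, vmf->pgoff))
        return VM_FAULT_SIGBUS;

    return filemap_fault(vmf);
}

With something like this, userspace can keep a single long-lived mapping of the whole guest_memfd, which is the property Quentin argues for above.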
On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: > On 04.03.24 20:04, Sean Christopherson wrote: > > On Mon, Mar 04, 2024, Quentin Perret wrote: > > > > As discussed in the sub-thread, that might still be required. > > > > > > > > One could think about completely forbidding GUP on these mmap'ed > > > > guest-memfds. But likely, there might be use cases in the future where you > > > > want to use GUP on shared memory inside a guest_memfd. > > > > > > > > (the iouring example I gave might currently not work because > > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > > > details) > > > > > > Perhaps it would be wise to start with GUP being forbidden if the > > > current users do not need it (not sure if that is the case in Android, > > > I'll check) ? We can always relax this constraint later when/if the > > > use-cases arise, which is obviously much harder to do the other way > > > around. > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > when it comes to letting anything other than the guest access guest_memfd. > > So we'll have to do it similar to any occurrences of "secretmem" in gup.c. > We'll have to see how to marry KVM guest_memfd with core-mm code similar to > e.g., folio_is_secretmem(). > > IIRC, we might not be able to de-reference the actual mapping because it > could get free concurrently ... > > That will then prohibit any kind of GUP access to these pages, including > reading/writing for ptrace/debugging purposes, for core dumping purposes > etc. But at least, you know that nobody was able to optain page references > using GUP that might be used for reading/writing later. Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and replacing folio_is_secretmem() with a test of this bit instead of comparing the a_ops? I think it scales better. Thanks, Elliot
On 04.03.24 22:43, Elliot Berman wrote: > On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: >> On 04.03.24 20:04, Sean Christopherson wrote: >>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>> As discussed in the sub-thread, that might still be required. >>>>> >>>>> One could think about completely forbidding GUP on these mmap'ed >>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>> want to use GUP on shared memory inside a guest_memfd. >>>>> >>>>> (the iouring example I gave might currently not work because >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>> details) >>>> >>>> Perhaps it would be wise to start with GUP being forbidden if the >>>> current users do not need it (not sure if that is the case in Android, >>>> I'll check) ? We can always relax this constraint later when/if the >>>> use-cases arise, which is obviously much harder to do the other way >>>> around. >>> >>> +1000. At least on the KVM side, I would like to be as conservative as possible >>> when it comes to letting anything other than the guest access guest_memfd. >> >> So we'll have to do it similar to any occurrences of "secretmem" in gup.c. >> We'll have to see how to marry KVM guest_memfd with core-mm code similar to >> e.g., folio_is_secretmem(). >> >> IIRC, we might not be able to de-reference the actual mapping because it >> could get free concurrently ... >> >> That will then prohibit any kind of GUP access to these pages, including >> reading/writing for ptrace/debugging purposes, for core dumping purposes >> etc. But at least, you know that nobody was able to optain page references >> using GUP that might be used for reading/writing later. > > Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and > replacing folio_is_secretmem() with a test of this bit instead of > comparing the a_ops? I think it scales better. The only concern I have are races, but let's look into the details: In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs etc. We had a similar discussion recently about possible races. It's documented in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. "inodes and thus their mappings are freed under RCU, which means the mapping cannot be freed beneath us and thus we can safely dereference it." So if we follow the same rules as folio_fast_pin_allowed(), we can de-reference folio->mapping, for example comparing mapping->a_ops. [folio_is_secretmem should better follow the same approach] So likely that should just work!
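As a concrete strawman of Elliot's AS_NOGUP idea combined with the folio_fast_pin_allowed() rules David describes: AS_NOGUP and the helper name below are hypothetical, not existing kernel API.

/* Would sit next to folio_is_secretmem() and be called from GUP-fast. */
static inline bool folio_is_gup_forbidden(struct folio *folio)
{
    struct address_space *mapping;

    /*
     * GUP-fast runs with IRQs disabled, and inodes (and thus their
     * mappings) are freed under RCU, so the mapping cannot be freed
     * beneath us while we peek at it; see folio_fast_pin_allowed().
     */
    mapping = READ_ONCE(folio->mapping);
    if (!mapping || ((unsigned long)mapping & PAGE_MAPPING_FLAGS))
        return false;   /* anon, KSM or movable: not a pagecache mapping */

    return test_bit(AS_NOGUP, &mapping->flags);
}

guest_memfd (and secretmem, if converted over) would then set the flag once at inode creation time, e.g. via set_bit(AS_NOGUP, &inode->i_mapping->flags).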
On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: > > On 04.03.24 20:04, Sean Christopherson wrote: > > On Mon, Mar 04, 2024, Quentin Perret wrote: > >>> As discussed in the sub-thread, that might still be required. > >>> > >>> One could think about completely forbidding GUP on these mmap'ed > >>> guest-memfds. But likely, there might be use cases in the future where you > >>> want to use GUP on shared memory inside a guest_memfd. > >>> > >>> (the iouring example I gave might currently not work because > >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > >>> details) > >> > >> Perhaps it would be wise to start with GUP being forbidden if the > >> current users do not need it (not sure if that is the case in Android, > >> I'll check) ? We can always relax this constraint later when/if the > >> use-cases arise, which is obviously much harder to do the other way > >> around. > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > when it comes to letting anything other than the guest access guest_memfd. > > So we'll have to do it similar to any occurrences of "secretmem" in > gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code > similar to e.g., folio_is_secretmem(). > > IIRC, we might not be able to de-reference the actual mapping because it > could get free concurrently ... > > That will then prohibit any kind of GUP access to these pages, including > reading/writing for ptrace/debugging purposes, for core dumping purposes > etc. But at least, you know that nobody was able to optain page > references using GUP that might be used for reading/writing later. > There has been little discussion about supporting 1G pages with guest_memfd for TDX/SNP or pKVM. I would like to restart this discussion [1]. 1G pages should be a very important usecase for guest memfd, especially considering large VM sizes supporting confidential GPU/TPU workloads. Using separate backing stores for private and shared memory ranges is not going to work effectively when using 1G pages. Consider the following scenario of memory conversion when using 1G pages to back private memory: * Guest requests conversion of 4KB range from private to shared, host in response ideally does following steps: a) Updates the guest memory attributes b) Unbacks the corresponding private memory c) Allocates corresponding shared memory or let it be faulted in when guest accesses it Step b above can't be skipped here, otherwise we would have two physical pages (1 backing private memory, another backing the shared memory) for the same GPA range causing "double allocation". With 1G pages, it would be difficult to punch KBs or even MBs sized hole since to support that: 1G page would need to be split (which hugetlbfs doesn't support today because of right reasons), causing - - loss of vmemmap optimization [3] - losing ability to reconstitute the huge page again, especially as private pages in CVMs are not relocatable today, increasing overall fragmentation over time. - unless a smarter algorithm is devised for memory reclaim to reconstitute large pages for unmovable memory. 
With the above limitations in place, best thing could be to allow: - single backing store for both shared and private memory ranges - host userspace to mmap the guest memfd (as this series is trying to do) - allow userspace to fault in memfd file ranges that correspond to shared GPA ranges - pagetable mappings will need to be restricted to shared memory ranges causing higher granularity mappings (somewhat similar to what HGM series from James [2] was trying to do) than 1G. - Allow IOMMU also to map those pages (pfns would be requested using get_user_pages* APIs) to allow devices to access shared memory. IOMMU management code would have to be enlightened or somehow restricted to map only shared regions of guest memfd. - Upon conversion from shared to private, host will have to ensure that there are no mappings/references present for the memory ranges being converted to private. If the above usecase sounds reasonable, GUP access to guest memfd pages should be allowed. [1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@google.com/ [3] https://docs.kernel.org/mm/vmemmap_dedup.html > -- > Cheers, > > David / dhildenb >
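For reference, steps (a) and (b) of the conversion flow described above already have a concrete shape in the merged guest_memfd ABI; a rough userspace sketch (vm_fd, gmem_fd and the GPA<->offset relationship are assumed to be set up elsewhere, and error handling is minimal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int convert_private_to_shared(int vm_fd, int gmem_fd, __u64 gpa,
                                     __u64 gmem_offset, __u64 size)
{
    struct kvm_memory_attributes attr = {
        .address = gpa,
        .size = size,
        .attributes = 0,    /* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
    };

    /* (a) update the guest memory attributes */
    if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
        return -1;

    /* (b) unback the private memory to avoid the "double allocation" */
    return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     gmem_offset, size);
}

Step (c) then happens naturally when the shared alias is touched. The hole punch in step (b) is exactly where 1G backing would force a split, which is the problem being described.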
On 18.03.24 18:06, Vishal Annapurve wrote: > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.03.24 20:04, Sean Christopherson wrote: >>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>> As discussed in the sub-thread, that might still be required. >>>>> >>>>> One could think about completely forbidding GUP on these mmap'ed >>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>> want to use GUP on shared memory inside a guest_memfd. >>>>> >>>>> (the iouring example I gave might currently not work because >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>> details) >>>> >>>> Perhaps it would be wise to start with GUP being forbidden if the >>>> current users do not need it (not sure if that is the case in Android, >>>> I'll check) ? We can always relax this constraint later when/if the >>>> use-cases arise, which is obviously much harder to do the other way >>>> around. >>> >>> +1000. At least on the KVM side, I would like to be as conservative as possible >>> when it comes to letting anything other than the guest access guest_memfd. >> >> So we'll have to do it similar to any occurrences of "secretmem" in >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code >> similar to e.g., folio_is_secretmem(). >> >> IIRC, we might not be able to de-reference the actual mapping because it >> could get free concurrently ... >> >> That will then prohibit any kind of GUP access to these pages, including >> reading/writing for ptrace/debugging purposes, for core dumping purposes >> etc. But at least, you know that nobody was able to optain page >> references using GUP that might be used for reading/writing later. >> > > There has been little discussion about supporting 1G pages with > guest_memfd for TDX/SNP or pKVM. I would like to restart this > discussion [1]. 1G pages should be a very important usecase for guest > memfd, especially considering large VM sizes supporting confidential > GPU/TPU workloads. > > Using separate backing stores for private and shared memory ranges is > not going to work effectively when using 1G pages. Consider the > following scenario of memory conversion when using 1G pages to back > private memory: > * Guest requests conversion of 4KB range from private to shared, host > in response ideally does following steps: > a) Updates the guest memory attributes > b) Unbacks the corresponding private memory > c) Allocates corresponding shared memory or let it be faulted in > when guest accesses it > > Step b above can't be skipped here, otherwise we would have two > physical pages (1 backing private memory, another backing the shared > memory) for the same GPA range causing "double allocation". > > With 1G pages, it would be difficult to punch KBs or even MBs sized > hole since to support that: > 1G page would need to be split (which hugetlbfs doesn't support today > because of right reasons), causing - > - loss of vmemmap optimization [3] > - losing ability to reconstitute the huge page again, > especially as private pages in CVMs are not relocatable today, > increasing overall fragmentation over time. > - unless a smarter algorithm is devised for memory > reclaim to reconstitute large pages for unmovable memory. 
> > With the above limitations in place, best thing could be to allow: > - single backing store for both shared and private memory ranges > - host userspace to mmap the guest memfd (as this series is trying to do) > - allow userspace to fault in memfd file ranges that correspond to > shared GPA ranges > - pagetable mappings will need to be restricted to shared memory > ranges causing higher granularity mappings (somewhat similar to what > HGM series from James [2] was trying to do) than 1G. > - Allow IOMMU also to map those pages (pfns would be requested using > get_user_pages* APIs) to allow devices to access shared memory. IOMMU > management code would have to be enlightened or somehow restricted to > map only shared regions of guest memfd. > - Upon conversion from shared to private, host will have to ensure > that there are no mappings/references present for the memory ranges > being converted to private. > > If the above usecase sounds reasonable, GUP access to guest memfd > pages should be allowed. To say it with nice words: "Not a fan". First, I don't think only 1 GiB will be problematic. Already 2 MiB ones will be problematic and so far it is even unclear how guest_memfd will consume them in a way acceptable to upstream MM. Likely not using hugetlb from what I recall after the previous discussions with Mike. Second, we should find better ways to let an IOMMU map these pages, *not* using GUP. There were already discussions on providing a similar fd+offset-style interface instead. GUP really sounds like the wrong approach here. Maybe we should look into passing not only guest_memfd, but also "ordinary" memfds. Third, I don't think we should be using huge pages where huge pages don't make any sense. Using a 1 GiB page so the VM will convert some pieces to map it using PTEs will destroy the whole purpose of using 1 GiB pages. It doesn't make any sense. A direction that might make sense is either (A) enlightening the VM about the granularity in which memory can be converted (but also problematic for 1 GiB pages) and/or (B) physically restricting the memory that can be converted. For example, one could create a GPA layout where some regions are backed by gigantic pages that cannot be converted/can only be converted as a whole, and some are backed by 4k pages that can be converted back and forth. We'd use multiple guest_memfds for that. I recall that physically restricting such conversions/locations (e.g., for bounce buffers) in Linux was already discussed somewhere, but I don't recall the details. It's all not trivial and not easy to get "clean". Concluding that individual pieces of a 1 GiB / 2 MiB huge page should not be converted back and forth might be reasonable. Although I'm sure people will argue the opposite and develop hackish solutions in desperate ways to make it work somehow. Huge pages, and especially gigantic pages, are simply a bad fit if the VM will convert individual 4k pages. But to answer your last question: we might be able to avoid GUP by using a different mapping API, similar to the ones KVM now provides.
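To make option (B) a bit more concrete, a userspace sketch using the existing KVM_CREATE_GUEST_MEMFD / KVM_SET_USER_MEMORY_REGION2 ABI to build such a GPA layout out of two guest_memfds. Slot numbers, sizes and GPAs are arbitrary, and how each guest_memfd would actually end up backed by gigantic vs. 4k pages is precisely the open question in this thread.

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_gmem_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
                         void *shared_hva)
{
    struct kvm_create_guest_memfd gmem = { .size = size };
    struct kvm_userspace_memory_region2 region = {
        .slot = slot,
        .flags = KVM_MEM_GUEST_MEMFD,
        .guest_phys_addr = gpa,
        .memory_size = size,
        .userspace_addr = (__u64)(unsigned long)shared_hva,
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    if (gmem_fd < 0)
        return gmem_fd;

    region.guest_memfd = gmem_fd;
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

/* A large region that is only ever converted as a whole ...           */
/*   add_gmem_slot(vm_fd, 1, 1ULL << 32, 1ULL << 30, big_shared_buf);  */
/* ... and a small one for fine-grained shared<->private conversion.   */
/*   add_gmem_slot(vm_fd, 2, 1ULL << 31, 64ULL << 20, small_shared_buf); */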
On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > On 18.03.24 18:06, Vishal Annapurve wrote: > > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 04.03.24 20:04, Sean Christopherson wrote: > >>> On Mon, Mar 04, 2024, Quentin Perret wrote: > >>>>> As discussed in the sub-thread, that might still be required. > >>>>> > >>>>> One could think about completely forbidding GUP on these mmap'ed > >>>>> guest-memfds. But likely, there might be use cases in the future where you > >>>>> want to use GUP on shared memory inside a guest_memfd. > >>>>> > >>>>> (the iouring example I gave might currently not work because > >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > >>>>> details) > >>>> > >>>> Perhaps it would be wise to start with GUP being forbidden if the > >>>> current users do not need it (not sure if that is the case in Android, > >>>> I'll check) ? We can always relax this constraint later when/if the > >>>> use-cases arise, which is obviously much harder to do the other way > >>>> around. > >>> > >>> +1000. At least on the KVM side, I would like to be as conservative as possible > >>> when it comes to letting anything other than the guest access guest_memfd. > >> > >> So we'll have to do it similar to any occurrences of "secretmem" in > >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code > >> similar to e.g., folio_is_secretmem(). > >> > >> IIRC, we might not be able to de-reference the actual mapping because it > >> could get free concurrently ... > >> > >> That will then prohibit any kind of GUP access to these pages, including > >> reading/writing for ptrace/debugging purposes, for core dumping purposes > >> etc. But at least, you know that nobody was able to optain page > >> references using GUP that might be used for reading/writing later. > >> > > > > There has been little discussion about supporting 1G pages with > > guest_memfd for TDX/SNP or pKVM. I would like to restart this > > discussion [1]. 1G pages should be a very important usecase for guest > > memfd, especially considering large VM sizes supporting confidential > > GPU/TPU workloads. > > > > Using separate backing stores for private and shared memory ranges is > > not going to work effectively when using 1G pages. Consider the > > following scenario of memory conversion when using 1G pages to back > > private memory: > > * Guest requests conversion of 4KB range from private to shared, host > > in response ideally does following steps: > > a) Updates the guest memory attributes > > b) Unbacks the corresponding private memory > > c) Allocates corresponding shared memory or let it be faulted in > > when guest accesses it > > > > Step b above can't be skipped here, otherwise we would have two > > physical pages (1 backing private memory, another backing the shared > > memory) for the same GPA range causing "double allocation". > > > > With 1G pages, it would be difficult to punch KBs or even MBs sized > > hole since to support that: > > 1G page would need to be split (which hugetlbfs doesn't support today > > because of right reasons), causing - > > - loss of vmemmap optimization [3] > > - losing ability to reconstitute the huge page again, > > especially as private pages in CVMs are not relocatable today, > > increasing overall fragmentation over time. 
> > - unless a smarter algorithm is devised for memory > > reclaim to reconstitute large pages for unmovable memory. > > > > With the above limitations in place, best thing could be to allow: > > - single backing store for both shared and private memory ranges > > - host userspace to mmap the guest memfd (as this series is trying to do) > > - allow userspace to fault in memfd file ranges that correspond to > > shared GPA ranges > > - pagetable mappings will need to be restricted to shared memory > > ranges causing higher granularity mappings (somewhat similar to what > > HGM series from James [2] was trying to do) than 1G. > > - Allow IOMMU also to map those pages (pfns would be requested using > > get_user_pages* APIs) to allow devices to access shared memory. IOMMU > > management code would have to be enlightened or somehow restricted to > > map only shared regions of guest memfd. > > - Upon conversion from shared to private, host will have to ensure > > that there are no mappings/references present for the memory ranges > > being converted to private. > > > > If the above usecase sounds reasonable, GUP access to guest memfd > > pages should be allowed. > > To say it with nice words: "Not a fan". > > First, I don't think only 1 GiB will be problematic. Already 2 MiB ones > will be problematic and so far it is even unclear how guest_memfd will > consume them in a way acceptable to upstream MM. Likely not using > hugetlb from what I recall after the previous discussions with Mike. > Agree, the support for 1G pages with guest memfd is yet to be figured out, but it remains a scenario to be considered here. > Second, we should find better ways to let an IOMMU map these pages, > *not* using GUP. There were already discussions on providing a similar > fd+offset-style interface instead. GUP really sounds like the wrong > approach here. Maybe we should look into passing not only guest_memfd, > but also "ordinary" memfds. I need to dig into past discussions around this, but agree that passing guest memfds to VFIO drivers in addition to HVAs seems worth exploring. This may be required anyways for devices supporting TDX connect [1]. If we are talking about the same file catering to both private and shared memory, there has to be some way to keep track of references on the shared memory from both host userspace and IOMMU. > > Third, I don't think we should be using huge pages where huge pages > don't make any sense. Using a 1 GiB page so the VM will convert some > pieces to map it using PTEs will destroy the whole purpose of using 1 > GiB pages. It doesn't make any sense. I had started a discussion for this [2] using an RFC series. Main challenge here remain: 1) Unifying all the conversions under one layer 2) Ensuring shared memory allocations are huge page aligned at boot time and runtime. Using any kind of unified shared memory allocator (today this part is played by SWIOTLB) will need to support huge page aligned dynamic increments, which can be only guaranteed by carving out enough memory at boot time for CMA area and using CMA area for allocation at runtime. - Since it's hard to come up with a maximum amount of shared memory needed by VM, especially with GPUs/TPUs around, it's difficult to come up with CMA area size at boot time. I think it's arguable that even if a VM converts 10 % of its memory to shared using 4k granularity, we still have fewer page table walks on the rest of the memory when using 1G/2M pages, which is a significant portion. 
> > A direction that might make sense is either (A) enlighting the VM about > the granularity in which memory can be converted (but also problematic > for 1 GiB pages) and/or (B) physically restricting the memory that can > be converted. Physically restricting the memory will still need a safe maximum bound to be calculated based on all the shared memory usecases that VM can encounter. > > For example, one could create a GPA layout where some regions are backed > by gigantic pages that cannot be converted/can only be converted as a > whole, and some are backed by 4k pages that can be converted back and > forth. We'd use multiple guest_memfds for that. I recall that physically > restricting such conversions/locations (e.g., for bounce buffers) in > Linux was already discussed somewhere, but I don't recall the details. > > It's all not trivial and not easy to get "clean". Yeah, agree with this point, it's difficult to get a clean solution here, but the host side solution might be easier to deploy (not necessarily easier to implement) and possibly cleaner than attempts to regulate the guest side. > > Concluding that individual pieces of a 1 GiB / 2 MiB huge page should > not be converted back and forth might be a reasonable. Although I'm sure > people will argue the opposite and develop hackish solutions in > desperate ways to make it work somehow. > > Huge pages, and especially gigantic pages, are simply a bad fit if the > VM will convert individual 4k pages. > > > But to answer your last question: we might be able to avoid GUP by using > a different mapping API, similar to the once KVM now provides. > > -- > Cheers, > > David / dhildenb > [1] -> https://cdrdv2.intel.com/v1/dl/getContent/773614 [2] https://lore.kernel.org/lkml/20240112055251.36101-2-vannapurve@google.com/
On Mon, Mar 18, 2024, Vishal Annapurve wrote: > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > Second, we should find better ways to let an IOMMU map these pages, > > *not* using GUP. There were already discussions on providing a similar > > fd+offset-style interface instead. GUP really sounds like the wrong > > approach here. Maybe we should look into passing not only guest_memfd, > > but also "ordinary" memfds. +1. I am not completely opposed to letting SNP and TDX effectively convert pages between private and shared, but I also completely agree that letting anything gup() guest_memfd memory is likely to end in tears. > I need to dig into past discussions around this, but agree that > passing guest memfds to VFIO drivers in addition to HVAs seems worth > exploring. This may be required anyways for devices supporting TDX > connect [1]. > > If we are talking about the same file catering to both private and > shared memory, there has to be some way to keep track of references on > the shared memory from both host userspace and IOMMU. > > > > > Third, I don't think we should be using huge pages where huge pages > > don't make any sense. Using a 1 GiB page so the VM will convert some > > pieces to map it using PTEs will destroy the whole purpose of using 1 > > GiB pages. It doesn't make any sense. I don't disagree, but the fundamental problem is that we have no guarantees as to what that guest will or will not do. We can certainly make very educated guesses, and probably be right 99.99% of the time, but being wrong 0.01% of the time probably means a lot of broken VMs, and a lot of unhappy customers. > I had started a discussion for this [2] using an RFC series. David is talking about the host side of things, AFAICT you're talking about the guest side... > challenge here remain: > 1) Unifying all the conversions under one layer > 2) Ensuring shared memory allocations are huge page aligned at boot > time and runtime. > > Using any kind of unified shared memory allocator (today this part is > played by SWIOTLB) will need to support huge page aligned dynamic > increments, which can be only guaranteed by carving out enough memory > at boot time for CMA area and using CMA area for allocation at > runtime. > - Since it's hard to come up with a maximum amount of shared memory > needed by VM, especially with GPUs/TPUs around, it's difficult to come > up with CMA area size at boot time. ...which is very relevant as carving out memory in the guest is nigh impossible, but carving out memory in the host for systems whose sole purpose is to run VMs is very doable. > I think it's arguable that even if a VM converts 10 % of its memory to > shared using 4k granularity, we still have fewer page table walks on > the rest of the memory when using 1G/2M pages, which is a significant > portion. Performance is a secondary concern. If this were _just_ about guest performance, I would unequivocally side with David: the guest gets to keep the pieces if it fragments a 1GiB page. The main problem we're trying to solve is that we want to provision a host such that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs would also get the benefits of 1GiB mappings, that's not the driving motiviation for this discussion. 
As HugeTLB exists today, supporting that use case isn't really feasible because there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute the 1GiB when the sliver is converted/freed back). Peeking ahead at my next comment, I don't think that solving this in the guest is a realistic option, i.e. IMO, we need to figure out a way to handle this in the host, without relying on the guest to cooperate. Luckily, we haven't added hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate to work with. The other big advantage that we should lean into is that we can make assumptions about guest_memfd usage that would never fly for general purpose backing stores, e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not desirable, for (almost?) all of the CoCo use cases. I don't have any concrete ideas at this time, but my gut feeling is that this won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being general purpose, and if we account for conversion scenarios when designing hugepage support for guest_memfd. > > For example, one could create a GPA layout where some regions are backed > > by gigantic pages that cannot be converted/can only be converted as a > > whole, and some are backed by 4k pages that can be converted back and > > forth. We'd use multiple guest_memfds for that. I recall that physically > > restricting such conversions/locations (e.g., for bounce buffers) in > > Linux was already discussed somewhere, but I don't recall the details. > > > > It's all not trivial and not easy to get "clean". > > Yeah, agree with this point, it's difficult to get a clean solution > here, but the host side solution might be easier to deploy (not > necessarily easier to implement) and possibly cleaner than attempts to > regulate the guest side. I think we missed the opportunity to regulate the guest side by several years. To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than willing to put my foot down on things where the guest is being blatantly ridiculous, wanting to convert an arbitrary 4KiB chunk of memory between private and shared isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm not willing to have KVM refuse conversions that are legal according to the SNP and TDX specs (and presumably the CCA spec, too). That's why I think we're years too late; this sort of restriction needs to go in the "hardware" spec, and that ship has sailed.
On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote: > On 04.03.24 22:43, Elliot Berman wrote: > > On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: > > > On 04.03.24 20:04, Sean Christopherson wrote: > > > > On Mon, Mar 04, 2024, Quentin Perret wrote: > > > > > > As discussed in the sub-thread, that might still be required. > > > > > > > > > > > > One could think about completely forbidding GUP on these mmap'ed > > > > > > guest-memfds. But likely, there might be use cases in the future where you > > > > > > want to use GUP on shared memory inside a guest_memfd. > > > > > > > > > > > > (the iouring example I gave might currently not work because > > > > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > > > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > > > > > details) > > > > > > > > > > Perhaps it would be wise to start with GUP being forbidden if the > > > > > current users do not need it (not sure if that is the case in Android, > > > > > I'll check) ? We can always relax this constraint later when/if the > > > > > use-cases arise, which is obviously much harder to do the other way > > > > > around. > > > > > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > > > when it comes to letting anything other than the guest access guest_memfd. > > > > > > So we'll have to do it similar to any occurrences of "secretmem" in gup.c. > > > We'll have to see how to marry KVM guest_memfd with core-mm code similar to > > > e.g., folio_is_secretmem(). > > > > > > IIRC, we might not be able to de-reference the actual mapping because it > > > could get free concurrently ... > > > > > > That will then prohibit any kind of GUP access to these pages, including > > > reading/writing for ptrace/debugging purposes, for core dumping purposes > > > etc. But at least, you know that nobody was able to optain page references > > > using GUP that might be used for reading/writing later. > > > > Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and > > replacing folio_is_secretmem() with a test of this bit instead of > > comparing the a_ops? I think it scales better. > > The only concern I have are races, but let's look into the details: > > In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs > etc. > > We had a similar discussion recently about possible races. It's documented > in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. > > "inodes and thus their mappings are freed under RCU, which means the mapping > cannot be freed beneath us and thus we can safely dereference it." > > So if we follow the same rules as folio_fast_pin_allowed(), we can > de-reference folio->mapping, for example comparing mapping->a_ops. > > [folio_is_secretmem should better follow the same approach] Resurecting this discussion, I had discussions internally and as it turns out Android makes extensive use of vhost/vsock when communicating with guest VMs, which requires GUP. So, my bad, not supporting GUP for the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to support it from the start. But again this should be a matter of 'simply' having a dedicated KVM exit reason so hopefully it's not too bad. Thanks, Quentin
On 19.03.24 10:47, Quentin Perret wrote: > On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote: >> On 04.03.24 22:43, Elliot Berman wrote: >>> On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: >>>> On 04.03.24 20:04, Sean Christopherson wrote: >>>>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>>>> As discussed in the sub-thread, that might still be required. >>>>>>> >>>>>>> One could think about completely forbidding GUP on these mmap'ed >>>>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>>>> want to use GUP on shared memory inside a guest_memfd. >>>>>>> >>>>>>> (the iouring example I gave might currently not work because >>>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>>>> details) >>>>>> >>>>>> Perhaps it would be wise to start with GUP being forbidden if the >>>>>> current users do not need it (not sure if that is the case in Android, >>>>>> I'll check) ? We can always relax this constraint later when/if the >>>>>> use-cases arise, which is obviously much harder to do the other way >>>>>> around. >>>>> >>>>> +1000. At least on the KVM side, I would like to be as conservative as possible >>>>> when it comes to letting anything other than the guest access guest_memfd. >>>> >>>> So we'll have to do it similar to any occurrences of "secretmem" in gup.c. >>>> We'll have to see how to marry KVM guest_memfd with core-mm code similar to >>>> e.g., folio_is_secretmem(). >>>> >>>> IIRC, we might not be able to de-reference the actual mapping because it >>>> could get free concurrently ... >>>> >>>> That will then prohibit any kind of GUP access to these pages, including >>>> reading/writing for ptrace/debugging purposes, for core dumping purposes >>>> etc. But at least, you know that nobody was able to optain page references >>>> using GUP that might be used for reading/writing later. >>> >>> Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and >>> replacing folio_is_secretmem() with a test of this bit instead of >>> comparing the a_ops? I think it scales better. >> >> The only concern I have are races, but let's look into the details: >> >> In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs >> etc. >> >> We had a similar discussion recently about possible races. It's documented >> in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. >> >> "inodes and thus their mappings are freed under RCU, which means the mapping >> cannot be freed beneath us and thus we can safely dereference it." >> >> So if we follow the same rules as folio_fast_pin_allowed(), we can >> de-reference folio->mapping, for example comparing mapping->a_ops. >> >> [folio_is_secretmem should better follow the same approach] > > Resurecting this discussion, I had discussions internally and as it > turns out Android makes extensive use of vhost/vsock when communicating > with guest VMs, which requires GUP. So, my bad, not supporting GUP for > the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to > support it from the start. But again this should be a matter of 'simply' > having a dedicated KVM exit reason so hopefully it's not too bad. Likely you should look into ways to teach these interfaces that require GUP to consume fd+offset instead. Yes, there is no such thing as free lunch :P
On 19.03.24 01:10, Sean Christopherson wrote: > On Mon, Mar 18, 2024, Vishal Annapurve wrote: >> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>> Second, we should find better ways to let an IOMMU map these pages, >>> *not* using GUP. There were already discussions on providing a similar >>> fd+offset-style interface instead. GUP really sounds like the wrong >>> approach here. Maybe we should look into passing not only guest_memfd, >>> but also "ordinary" memfds. > > +1. I am not completely opposed to letting SNP and TDX effectively convert > pages between private and shared, but I also completely agree that letting > anything gup() guest_memfd memory is likely to end in tears. Yes. Avoid it right from the start, if possible. People wanted guest_memfd to *not* have to mmap guest memory ("even for ordinary VMs"). Now people are saying we have to be able to mmap it in order to GUP it. It's getting tiring, really. > >> I need to dig into past discussions around this, but agree that >> passing guest memfds to VFIO drivers in addition to HVAs seems worth >> exploring. This may be required anyways for devices supporting TDX >> connect [1]. >> >> If we are talking about the same file catering to both private and >> shared memory, there has to be some way to keep track of references on >> the shared memory from both host userspace and IOMMU. >> >>> >>> Third, I don't think we should be using huge pages where huge pages >>> don't make any sense. Using a 1 GiB page so the VM will convert some >>> pieces to map it using PTEs will destroy the whole purpose of using 1 >>> GiB pages. It doesn't make any sense. > > I don't disagree, but the fundamental problem is that we have no guarantees as to > what that guest will or will not do. We can certainly make very educated guesses, > and probably be right 99.99% of the time, but being wrong 0.01% of the time > probably means a lot of broken VMs, and a lot of unhappy customers. > Right, then don't use huge/gigantic pages. Because it doesn't make any sense. You have no guarantees about memory waste. You have no guarantee about performance. Then just don't use huge/gigantic pages. Use them where reasonable, they are an expensive resource. See below. >> I had started a discussion for this [2] using an RFC series. > > David is talking about the host side of things, AFAICT you're talking about the > guest side... > >> challenge here remain: >> 1) Unifying all the conversions under one layer >> 2) Ensuring shared memory allocations are huge page aligned at boot >> time and runtime. >> >> Using any kind of unified shared memory allocator (today this part is >> played by SWIOTLB) will need to support huge page aligned dynamic >> increments, which can be only guaranteed by carving out enough memory >> at boot time for CMA area and using CMA area for allocation at >> runtime. >> - Since it's hard to come up with a maximum amount of shared memory >> needed by VM, especially with GPUs/TPUs around, it's difficult to come >> up with CMA area size at boot time. > > ...which is very relevant as carving out memory in the guest is nigh impossible, > but carving out memory in the host for systems whose sole purpose is to run VMs > is very doable. > >> I think it's arguable that even if a VM converts 10 % of its memory to >> shared using 4k granularity, we still have fewer page table walks on >> the rest of the memory when using 1G/2M pages, which is a significant >> portion. > > Performance is a secondary concern. 
If this were _just_ about guest performance, > I would unequivocally side with David: the guest gets to keep the pieces if it > fragments a 1GiB page. > > The main problem we're trying to solve is that we want to provision a host such > that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously > run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, > 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs > would also get the benefits of 1GiB mappings, that's not the driving motiviation > for this discussion. Supporting 1 GiB mappings there sounds like unnecessary complexity and opening a big can of worms, especially if "it's not the driving motivation". If I understand you correctly, the scenario is (1) We have free 1 GiB hugetlb pages lying around (2) We want to start a CoCo VM (3) We don't care about 1 GiB mappings for that CoCo VM, but hguetlb pages is all we have. (4) We want to be able to use the 1 GiB hugetlb page in the future. With hugetlb, it's possible to reserve a CMA area from which to later allocate 1 GiB pages. While not allocated, they can be used for movable allocations. So in the scenario above, free the hugetlb pages back to CMA. Then, consume them as 4K pages for the CoCo VM. When wanting to start a non-CoCo VM, re-allocate them from CMA. One catch with that is that (a) CMA pages cannot get longterm-pinned: for obvious reasons, we wouldn't be able to migrate them in order to free up the 1 GiB page. (b) guest_memfd pages are not movable and cannot currently end up on CMA memory. But maybe that's not actually required in this scenario and we'd like to have slightly different semantics: if you were to give the CoCo VM the 1 GiB pages, they would similarly be unusable until that VM quit and freed up the memory! So it might be acceptable to get "selected" unmovable allocations (from guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free up when wanting to re-allocate that memory. Otherwise, the CMA allocation will simply fail. If we need improvements in that area to support this case, we can talk. Just an idea to avoid HGM and friends just to make it somehow work with 1 GiB pages ... > > As HugeTLB exists today, supporting that use case isn't really feasible because > there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute > the 1GiB when the sliver is converted/freed back). > > Peeking ahead at my next comment, I don't think that solving this in the guest > is a realistic option, i.e. IMO, we need to figure out a way to handle this in > the host, without relying on the guest to cooperate. Luckily, we haven't added > hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate > to work with. > > The other big advantage that we should lean into is that we can make assumptions > about guest_memfd usage that would never fly for a general purpose backing stores, > e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not > desirable, for (almost?) all of the CoCo use cases. > > I don't have any concrete ideas at this time, but my gut feeling is that this > won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being > general purposes, and if we we account for conversion scenarios when designing > hugepage support for guest_memfd. 
I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) > >>> For example, one could create a GPA layout where some regions are backed >>> by gigantic pages that cannot be converted/can only be converted as a >>> whole, and some are backed by 4k pages that can be converted back and >>> forth. We'd use multiple guest_memfds for that. I recall that physically >>> restricting such conversions/locations (e.g., for bounce buffers) in >>> Linux was already discussed somewhere, but I don't recall the details. >>> >>> It's all not trivial and not easy to get "clean". >> >> Yeah, agree with this point, it's difficult to get a clean solution >> here, but the host side solution might be easier to deploy (not >> necessarily easier to implement) and possibly cleaner than attempts to >> regulate the guest side. > > I think we missed the opportunity to regulate the guest side by several years. > To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer > VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than > willing to put my foot down on things where the guest is being blatantly ridiculous, > wanting to convert an arbitrary 4KiB chunk of memory between private and shared > isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm not willing > to have KVM refuse conversions that are legal according to the SNP and TDX specs > (and presumably the CCA spec, too). > > That's why I think we're years too late; this sort of restriction needs to go in > the "hardware" spec, and that ship has sailed. One could consider extending the spec and gluing huge+gigantic page support to new hardware. But ideally, we could just avoid any partial conversion / HGM just to support a scenario where we might not actually want 1 GiB pages, but somehow want to make it work with 1 GiB pages.
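A kernel-side sketch of the hugetlb-CMA reuse idea above, heavily hypothetical: gmem_cma is an invented handle, and there is currently no way for guest_memfd to allocate from a CMA area; the point is only that the "owner" hands the memory back to CMA when the gigantic page is wanted again, and that the allocation simply fails if the area got fragmented.

static struct page *gmem_alloc_from_cma(struct cma *gmem_cma,
                                        unsigned int order)
{
    /* Fails (returns NULL) if unmovable allocations already landed here. */
    return cma_alloc(gmem_cma, 1UL << order, order, false);
}

static void gmem_release_to_cma(struct cma *gmem_cma, struct page *page,
                                unsigned int order)
{
    /* Once everything is handed back, the 1 GiB page can be re-formed. */
    cma_release(gmem_cma, page, 1UL << order);
}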
>>> I had started a discussion for this [2] using an RFC series. >> >> David is talking about the host side of things, AFAICT you're talking about the >> guest side... >> >>> challenge here remain: >>> 1) Unifying all the conversions under one layer >>> 2) Ensuring shared memory allocations are huge page aligned at boot >>> time and runtime. >>> >>> Using any kind of unified shared memory allocator (today this part is >>> played by SWIOTLB) will need to support huge page aligned dynamic >>> increments, which can be only guaranteed by carving out enough memory >>> at boot time for CMA area and using CMA area for allocation at >>> runtime. >>> - Since it's hard to come up with a maximum amount of shared memory >>> needed by VM, especially with GPUs/TPUs around, it's difficult to come >>> up with CMA area size at boot time. >> >> ...which is very relevant as carving out memory in the guest is nigh impossible, >> but carving out memory in the host for systems whose sole purpose is to run VMs >> is very doable. >> >>> I think it's arguable that even if a VM converts 10 % of its memory to >>> shared using 4k granularity, we still have fewer page table walks on >>> the rest of the memory when using 1G/2M pages, which is a significant >>> portion. >> >> Performance is a secondary concern. If this were _just_ about guest performance, >> I would unequivocally side with David: the guest gets to keep the pieces if it >> fragments a 1GiB page. >> >> The main problem we're trying to solve is that we want to provision a host such >> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously >> run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, >> 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs >> would also get the benefits of 1GiB mappings, that's not the driving motiviation >> for this discussion. > > Supporting 1 GiB mappings there sounds like unnecessary complexity and > opening a big can of worms, especially if "it's not the driving motivation". > > If I understand you correctly, the scenario is > > (1) We have free 1 GiB hugetlb pages lying around > (2) We want to start a CoCo VM > (3) We don't care about 1 GiB mappings for that CoCo VM, but hguetlb > pages is all we have. > (4) We want to be able to use the 1 GiB hugetlb page in the future. > > With hugetlb, it's possible to reserve a CMA area from which to later > allocate 1 GiB pages. While not allocated, they can be used for movable > allocations. > > So in the scenario above, free the hugetlb pages back to CMA. Then, > consume them as 4K pages for the CoCo VM. When wanting to start a > non-CoCo VM, re-allocate them from CMA. > > One catch with that is that > (a) CMA pages cannot get longterm-pinned: for obvious reasons, we > wouldn't be able to migrate them in order to free up the 1 GiB page. > (b) guest_memfd pages are not movable and cannot currently end up on CMA > memory. > > But maybe that's not actually required in this scenario and we'd like to > have slightly different semantics: if you were to give the CoCo VM the 1 > GiB pages, they would similarly be unusable until that VM quit and freed > up the memory! > > So it might be acceptable to get "selected" unmovable allocations (from > guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free > up when wanting to re-allocate that memory. Otherwise, the CMA > allocation will simply fail. > > If we need improvements in that area to support this case, we can talk. 
> Just an idea to avoid HGM and friends just to make it somehow work with > 1 GiB pages ... Thought about that some more and some cases can also be tricky (avoiding fragmenting multiple 1 GiB pages ...). It's all tricky, especially once multiple (guest_)memfds are involved and we really want to avoid most waste. Knowing that large mappings for CoCo are rather "optional" and that the challenge is in "reusing" large pages is valuable, though.
Hi David, On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > On 19.03.24 01:10, Sean Christopherson wrote: > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > *not* using GUP. There were already discussions on providing a similar > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > but also "ordinary" memfds. > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > pages between private and shared, but I also completely agree that letting > > anything gup() guest_memfd memory is likely to end in tears. > > Yes. Avoid it right from the start, if possible. > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > ordinary VMs"). Now people are saying we have to be able to mmap it in order > to GUP it. It's getting tiring, really. From the pKVM side, we're working on guest_memfd primarily to avoid diverging from what other CoCo solutions end up using, but if it gets de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do today with anonymous memory, then it's a really hard sell to switch over from what we have in production. We're also hoping that, over time, guest_memfd will become more closely integrated with the mm subsystem to enable things like hypervisor-assisted page migration, which we would love to have. Today, we use the existing KVM interfaces (i.e. based on anonymous memory) and it mostly works with the one significant exception that accessing private memory via a GUP pin will crash the host kernel. If all guest_memfd() can offer to solve that problem is preventing GUP altogether, then I'd sooner just add that same restriction to what we currently have instead of overhauling the user ABI in favour of something which offers us very little in return. On the mmap() side of things for guest_memfd, a simpler option for us than what has currently been proposed might be to enforce that the VMM has unmapped all private pages on vCPU run, failing the ioctl if that's not the case. It needs a little more tracking in guest_memfd but I think GUP will then fall out in the wash because only shared pages will be mapped by userspace and so GUP will fail by construction for private pages. We're happy to pursue alternative approaches using anonymous memory if you'd prefer to keep guest_memfd limited in functionality (e.g. preventing GUP of private pages by extending mapping_flags as per [1]), but we're equally willing to contribute to guest_memfd if extensions are welcome. What do you prefer? Cheers, Will [1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com
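A sketch of the check Will proposes, with invented helpers and without the guest_memfd-internal tracking it would need: before entering the guest, refuse to run while any folio of a currently-private range is still mapped by host userspace.

static int kvm_gmem_private_range_unmapped(struct inode *inode,
                                           pgoff_t start, pgoff_t end)
{
    pgoff_t index;

    for (index = start; index < end; index++) {
        struct folio *folio = filemap_get_folio(inode->i_mapping, index);

        if (IS_ERR(folio))
            continue;

        if (folio_mapped(folio)) {
            folio_put(folio);
            return -EPERM;  /* or a dedicated KVM exit reason */
        }
        folio_put(folio);
    }

    return 0;
}

GUP on private memory then fails by construction, as Will notes, because only shared pages can still be mapped by userspace at the point the vCPU is allowed to run.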
On Tue, Mar 19, 2024, David Hildenbrand wrote: > On 19.03.24 01:10, Sean Christopherson wrote: > > Performance is a secondary concern. If this were _just_ about guest performance, > > I would unequivocally side with David: the guest gets to keep the pieces if it > > fragments a 1GiB page. > > > > The main problem we're trying to solve is that we want to provision a host such > > that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously > > run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, > > 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs > > would also get the benefits of 1GiB mappings, that's not the driving motiviation > > for this discussion. > > Supporting 1 GiB mappings there sounds like unnecessary complexity and > opening a big can of worms, especially if "it's not the driving motivation". > > If I understand you correctly, the scenario is > > (1) We have free 1 GiB hugetlb pages lying around > (2) We want to start a CoCo VM > (3) We don't care about 1 GiB mappings for that CoCo VM, We care about 1GiB mappings for CoCo VMs. My comment about performance being a secondary concern was specifically saying that it's the guest's responsibility to play nice with huge mappings if the guest cares about its performance. For guests that are well behaved, we most definitely want to provide a configuration that performs as close to non-CoCo VMs as we can reasonably make it. And we can do that today, but it requires some amount of host memory to NOT be in the HugeTLB pool, and instead be kept in reserve so that it can be used for shared memory for CoCo VMs. That approach has many downsides, as the extra memory overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo and CoCo VMs, and so on and so forth. > but hguetlb pages is all we have. > (4) We want to be able to use the 1 GiB hugetlb page in the future. ... > > The other big advantage that we should lean into is that we can make assumptions > > about guest_memfd usage that would never fly for a general purpose backing stores, > > e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not > > desirable, for (almost?) all of the CoCo use cases. > > > > I don't have any concrete ideas at this time, but my gut feeling is that this > > won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being > > general purposes, and if we we account for conversion scenarios when designing > > hugepage support for guest_memfd. > > I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) Quite the opposite, I'm saying we should be very deliberate in how we add hugepage support and other features to guest_memfd, so that guest_memfd doesn't become a hacky mess. And I'm saying we should stand firm in what guest_memfd _won't_ support, e.g. swap/reclaim and probably page migration should get a hard "no". In other words, ditch the complexity for features that are well served by existing general purpose solutions, so that guest_memfd can take on a bit of complexity to serve use cases that are unique to KVM guests, without becoming an unmaintainable mess due to cross-products.
On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > On 19.03.24 01:10, Sean Christopherson wrote: > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > > *not* using GUP. There were already discussions on providing a similar > > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > > but also "ordinary" memfds. > > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > pages between private and shared, but I also completely agree that letting > > > anything gup() guest_memfd memory is likely to end in tears. > > > > Yes. Avoid it right from the start, if possible. > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > to GUP it. It's getting tiring, really. > > From the pKVM side, we're working on guest_memfd primarily to avoid > diverging from what other CoCo solutions end up using, but if it gets > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > today with anonymous memory, then it's a really hard sell to switch over > from what we have in production. We're also hoping that, over time, > guest_memfd will become more closely integrated with the mm subsystem to > enable things like hypervisor-assisted page migration, which we would > love to have. > > Today, we use the existing KVM interfaces (i.e. based on anonymous > memory) and it mostly works with the one significant exception that > accessing private memory via a GUP pin will crash the host kernel. If > all guest_memfd() can offer to solve that problem is preventing GUP > altogether, then I'd sooner just add that same restriction to what we > currently have instead of overhauling the user ABI in favour of > something which offers us very little in return. How would we add the restriction to anonymous memory? Thinking aloud -- do you mean like some sort of "exclusive GUP" flag where mm can ensure that the exclusive GUP pin is the only pin? If the refcount for the page is >1, then the exclusive GUP fails. Any future GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > On the mmap() side of things for guest_memfd, a simpler option for us > than what has currently been proposed might be to enforce that the VMM > has unmapped all private pages on vCPU run, failing the ioctl if that's > not the case. It needs a little more tracking in guest_memfd but I think > GUP will then fall out in the wash because only shared pages will be > mapped by userspace and so GUP will fail by construction for private > pages. We can prevent GUP after the pages are marked private, but the pages could be marked private after the pages were already GUP'd. I don't have a good way to detect this, so converting a page to private is difficult. > We're happy to pursue alternative approaches using anonymous memory if > you'd prefer to keep guest_memfd limited in functionality (e.g. > preventing GUP of private pages by extending mapping_flags as per [1]), > but we're equally willing to contribute to guest_memfd if extensions are > welcome. > > What do you prefer? > I like this as a stepping stone. 
For the Android use cases, we don't need to be able to convert a private page to shared and then also be able to GUP it. If you want to GUP a page, use anonymous memory and that memory will always be shared. If you don't care about GUP'ing (e.g. it's going to be guest-private or you otherwise know you won't be GUP'ing), you can use guest_memfd. I don't think this design prevents us from adding "sometimes you can GUP" to guest_memfd in the future. I don't think it creates extra changes for KVM since anonymous memory is already supported; although I'd have to add the support for Gunyah. > [1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com Thanks, Elliot
Hi Elliot, On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > pages between private and shared, but I also completely agree that letting > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > to GUP it. It's getting tiring, really. > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > diverging from what other CoCo solutions end up using, but if it gets > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > today with anonymous memory, then it's a really hard sell to switch over > > from what we have in production. We're also hoping that, over time, > > guest_memfd will become more closely integrated with the mm subsystem to > > enable things like hypervisor-assisted page migration, which we would > > love to have. > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > memory) and it mostly works with the one significant exception that > > accessing private memory via a GUP pin will crash the host kernel. If > > all guest_memfd() can offer to solve that problem is preventing GUP > > altogether, then I'd sooner just add that same restriction to what we > > currently have instead of overhauling the user ABI in favour of > > something which offers us very little in return. > > How would we add the restriction to anonymous memory? > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > where mm can ensure that the exclusive GUP pin is the only pin? If the > refcount for the page is >1, then the exclusive GUP fails. Any future > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. Yes, I think we'd want something like that, but I don't think using a bias on its own is a good idea as false positives due to a large number of page references will then actually lead to problems (i.e. rejecting GUP spuriously), no? I suppose if you only considered the new bias in conjunction with the AS_NOGUP flag you proposed then it might be ok (i.e. when you see the bias, you then go check the address space to confirm). What do you think? > > On the mmap() side of things for guest_memfd, a simpler option for us > > than what has currently been proposed might be to enforce that the VMM > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > not the case. It needs a little more tracking in guest_memfd but I think > > GUP will then fall out in the wash because only shared pages will be > > mapped by userspace and so GUP will fail by construction for private > > pages. > > We can prevent GUP after the pages are marked private, but the pages > could be marked private after the pages were already GUP'd. I don't have > a good way to detect this, so converting a page to private is difficult. For anonymous memory, marking the page as private is going to involve an exclusive GUP so that the page can safely be donated to the guest. In that case, any existing GUP pin should cause that to fail gracefully. What is the situation you are concerned about here? 
> > We're happy to pursue alternative approaches using anonymous memory if > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > preventing GUP of private pages by extending mapping_flags as per [1]), > > but we're equally willing to contribute to guest_memfd if extensions are > > welcome. > > > > What do you prefer? > > > > I like this as a stepping stone. For the Android use cases, we don't > need to be able to convert a private page to shared and then also be > able to GUP it. I wouldn't want to rule that out, though. The VMM should be able to use shared pages just like it can with normal anonymous pages. > I don't think this design prevents us from adding "sometimes you can > GUP" to guest_memfd in the future. Technically, I think we can add all the stuff we need to guest_memfd, but there's a desire to keep that as simple as possible for now, which is why I'm keen to explore alternatives to unblock the pKVM upstreaming. Will
On 19.03.24 16:04, Sean Christopherson wrote: > On Tue, Mar 19, 2024, David Hildenbrand wrote: >> On 19.03.24 01:10, Sean Christopherson wrote: >>> Performance is a secondary concern. If this were _just_ about guest performance, >>> I would unequivocally side with David: the guest gets to keep the pieces if it >>> fragments a 1GiB page. >>> >>> The main problem we're trying to solve is that we want to provision a host such >>> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously >>> run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, >>> 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs >>> would also get the benefits of 1GiB mappings, that's not the driving motiviation >>> for this discussion. >> >> Supporting 1 GiB mappings there sounds like unnecessary complexity and >> opening a big can of worms, especially if "it's not the driving motivation". >> >> If I understand you correctly, the scenario is >> >> (1) We have free 1 GiB hugetlb pages lying around >> (2) We want to start a CoCo VM >> (3) We don't care about 1 GiB mappings for that CoCo VM, > > We care about 1GiB mappings for CoCo VMs. My comment about performance being a > secondary concern was specifically saying that it's the guest's responsilibity > to play nice with huge mappings if the guest cares about its performance. For > guests that are well behaved, we most definitely want to provide a configuration > that performs as close to non-CoCo VMs as we can reasonably make it. How does the guest know the granularity? I suspect it's just implicit knowledge that "PUD granularity might be nice". > > And we can do that today, but it requires some amount of host memory to NOT be > in the HugeTLB pool, and instead be kept in reserved so that it can be used for > shared memory for CoCo VMs. That approach has many downsides, as the extra memory > overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo > and CoCo VMs, and so on and so forth. Right. But avoiding memory waste as soon as hugetlb is involved (and we have two separate memfds for private/shared memory) is not feasible. > >> but hguetlb pages is all we have. >> (4) We want to be able to use the 1 GiB hugetlb page in the future. > > ... > >>> The other big advantage that we should lean into is that we can make assumptions >>> about guest_memfd usage that would never fly for a general purpose backing stores, >>> e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not >>> desirable, for (almost?) all of the CoCo use cases. >>> >>> I don't have any concrete ideas at this time, but my gut feeling is that this >>> won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being >>> general purposes, and if we we account for conversion scenarios when designing >>> hugepage support for guest_memfd. >> >> I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) > > Quite the opposite, I'm saying we should be very deliberate in how we add hugepage > support and others features to guest_memfd, so that guest_memfd doesn't become a > hacky mess. Good. > > And I'm saying say we should stand firm in what guest_memfd _won't_ support, e.g. > swap/reclaim and probably page migration should get a hard "no". I thought people wanted to support at least page migration in the future? 
(for example, see the reply from Will) > > In other words, ditch the complexity for features that are well served by existing > general purpose solutions, so that guest_memfd can take on a bit of complexity to > serve use cases that are unique to KVM guests, without becoming an unmaintainble > mess due to cross-products. And I believed that was true until people started wanting to mmap() this thing and brought GUP into the picture ... and then talk about HGM and all that. *shivers*
On 19.03.24 15:31, Will Deacon wrote: > Hi David, Hi Will, sorry for the late reply! > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >> On 19.03.24 01:10, Sean Christopherson wrote: >>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>> *not* using GUP. There were already discussions on providing a similar >>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>> but also "ordinary" memfds. >>> >>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>> pages between private and shared, but I also completely agree that letting >>> anything gup() guest_memfd memory is likely to end in tears. >> >> Yes. Avoid it right from the start, if possible. >> >> People wanted guest_memfd to *not* have to mmap guest memory ("even for >> ordinary VMs"). Now people are saying we have to be able to mmap it in order >> to GUP it. It's getting tiring, really. > > From the pKVM side, we're working on guest_memfd primarily to avoid > diverging from what other CoCo solutions end up using, but if it gets > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > today with anonymous memory, then it's a really hard sell to switch over > from what we have in production. We're also hoping that, over time, > guest_memfd will become more closely integrated with the mm subsystem to > enable things like hypervisor-assisted page migration, which we would > love to have. Reading Sean's reply, he has a different view on that. And I think that's the main issue: there are too many different use cases and too many different requirements that could turn guest_memfd into something that maybe it really shouldn't be. > > Today, we use the existing KVM interfaces (i.e. based on anonymous > memory) and it mostly works with the one significant exception that > accessing private memory via a GUP pin will crash the host kernel. If > all guest_memfd() can offer to solve that problem is preventing GUP > altogether, then I'd sooner just add that same restriction to what we > currently have instead of overhauling the user ABI in favour of > something which offers us very little in return. > > On the mmap() side of things for guest_memfd, a simpler option for us > than what has currently been proposed might be to enforce that the VMM > has unmapped all private pages on vCPU run, failing the ioctl if that's > not the case. It needs a little more tracking in guest_memfd but I think > GUP will then fall out in the wash because only shared pages will be > mapped by userspace and so GUP will fail by construction for private > pages. > > We're happy to pursue alternative approaches using anonymous memory if > you'd prefer to keep guest_memfd limited in functionality (e.g. > preventing GUP of private pages by extending mapping_flags as per [1]), > but we're equally willing to contribute to guest_memfd if extensions are > welcome. > > What do you prefer? Let me summarize the history: AMD had its thing running and it worked for them (but I recall it was hacky :) ). TDX made it possible to crash the machine when accessing secure memory from user space (MCE). So secure memory must not be mapped into user space -- no page tables. 
Prototypes with anonymous memory existed (and I didn't hate them, although hacky), but one of the other selling points of guest_memfd was that we could create VMs that wouldn't need any page tables at all, which I found interesting. There was a bit more to that (easier conversion, avoiding GUP, specifying on allocation that the memory was unmovable ...), but I'll get to that later. The design principle was: nasty private memory (unmovable, unswappable, inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" memory is allocated from an ordinary memfd. This makes sense: shared memory is neither nasty nor special. You can migrate it, swap it out, map it into page tables, GUP it, ... without any issues. So if I would describe some key characteristics of guest_memfd as of today, it would probably be: 1) Memory is unmovable and unswappable. Right from the beginning, it is allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). 2) Memory is inaccessible. It cannot be read from user space or the kernel, and it cannot be GUP'ed ... only some mechanisms (e.g., hibernation, /proc/kcore) might end up touching it "by accident", and we usually can handle these cases. 3) Memory can be discarded at page granularity. There should be no cases where you cannot discard memory to over-allocate memory for private pages that have been replaced by shared pages otherwise. 4) Page tables are not required (well, it's a memfd), and the fd could in theory be passed to other processes. Having "ordinary shared" memory in there implies that 1) and 2) will have to be adjusted for them, which kind-of turns it "partially" into ordinary shmem again. Going back to the beginning: with pKVM, we likely want the following: 1) Convert pages private<->shared in-place 2) Stop user space + kernel from accessing private memory in process context. Likely for pKVM we would only crash the process, which would be acceptable. 3) Prevent GUP to private memory. Otherwise we could crash the kernel. 4) Prevent private pages from swapout+migration until supported. I suspect your current solution with anonymous memory gets all but 3) sorted out, correct? I'm curious: might there be a requirement in the future that shared memory could be mapped into other processes? (thinking vhost-user and such things). Of course that's impossible with anonymous memory; teaching shmem to contain private memory would kind-of lead to ... guest_memfd, just that we don't have shared memory there.
On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote: > Hi Elliot, > > On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > > On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > > pages between private and shared, but I also completely agree that letting > > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > > to GUP it. It's getting tiring, really. > > > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > > memory) and it mostly works with the one significant exception that > > > accessing private memory via a GUP pin will crash the host kernel. If > > > all guest_memfd() can offer to solve that problem is preventing GUP > > > altogether, then I'd sooner just add that same restriction to what we > > > currently have instead of overhauling the user ABI in favour of > > > something which offers us very little in return. > > > > How would we add the restriction to anonymous memory? > > > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > > where mm can ensure that the exclusive GUP pin is the only pin? If the > > refcount for the page is >1, then the exclusive GUP fails. Any future > > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > > Yes, I think we'd want something like that, but I don't think using a > bias on its own is a good idea as false positives due to a large number > of page references will then actually lead to problems (i.e. rejecting > GUP spuriously), no? I suppose if you only considered the new bias in > conjunction with the AS_NOGUP flag you proposed then it might be ok > (i.e. when you see the bias, you then go check the address space to > confirm). What do you think? > I think the AS_NOGUP would prevent GUPing the first place. If we set the EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried about there being INT_MAX-1 valid GUPs and wanting to add another? From the GUPer's perspective, I don't think it would be much different from overflowing the refcount. > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > than what has currently been proposed might be to enforce that the VMM > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > not the case. It needs a little more tracking in guest_memfd but I think > > > GUP will then fall out in the wash because only shared pages will be > > > mapped by userspace and so GUP will fail by construction for private > > > pages. 
> > > > We can prevent GUP after the pages are marked private, but the pages > > could be marked private after the pages were already GUP'd. I don't have > > a good way to detect this, so converting a page to private is difficult. > > For anonymous memory, marking the page as private is going to involve an > exclusive GUP so that the page can safely be donated to the guest. In > that case, any existing GUP pin should cause that to fail gracefully. > What is the situation you are concerned about here? > I wasn't thinking about exclusive GUP here. The exclusive GUP should be able to get the guarantees we need. I was thinking about making sure we gracefully handle a race to provide the same page. The kernel should detect the difference between "we're already providing the page" and "somebody has an unexpected pin". We can easily read the refcount if we couldn't take the exclusive pin to know. Thanks, Elliot > > > We're happy to pursue alternative approaches using anonymous memory if > > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > > preventing GUP of private pages by extending mapping_flags as per [1]), > > > but we're equally willing to contribute to guest_memfd if extensions are > > > welcome. > > > > > > What do you prefer? > > > > > > > I like this as a stepping stone. For the Android use cases, we don't > > need to be able to convert a private page to shared and then also be > > able to GUP it. > > I wouldn't want to rule that out, though. The VMM should be able to use > shared pages just like it can with normal anonymous pages. > > > I don't think this design prevents us from adding "sometimes you can > > GUP" to guest_memfd in the future. > > Technically, I think we can add all the stuff we need to guest_memfd, > but there's a desire to keep that as simple as possible for now, which > is why I'm keen to explore alternatives to unblock the pKVM upstreaming. > > Will >
On 22.03.24 18:52, David Hildenbrand wrote: > On 19.03.24 15:31, Will Deacon wrote: >> Hi David, > > Hi Will, > > sorry for the late reply! > >> >> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>> On 19.03.24 01:10, Sean Christopherson wrote: >>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>>> *not* using GUP. There were already discussions on providing a similar >>>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>>> but also "ordinary" memfds. >>>> >>>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>>> pages between private and shared, but I also completely agree that letting >>>> anything gup() guest_memfd memory is likely to end in tears. >>> >>> Yes. Avoid it right from the start, if possible. >>> >>> People wanted guest_memfd to *not* have to mmap guest memory ("even for >>> ordinary VMs"). Now people are saying we have to be able to mmap it in order >>> to GUP it. It's getting tiring, really. >> >> From the pKVM side, we're working on guest_memfd primarily to avoid >> diverging from what other CoCo solutions end up using, but if it gets >> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >> today with anonymous memory, then it's a really hard sell to switch over >> from what we have in production. We're also hoping that, over time, >> guest_memfd will become more closely integrated with the mm subsystem to >> enable things like hypervisor-assisted page migration, which we would >> love to have. > > Reading Sean's reply, he has a different view on that. And I think > that's the main issue: there are too many different use cases and too > many different requirements that could turn guest_memfd into something > that maybe it really shouldn't be. > >> >> Today, we use the existing KVM interfaces (i.e. based on anonymous >> memory) and it mostly works with the one significant exception that >> accessing private memory via a GUP pin will crash the host kernel. If >> all guest_memfd() can offer to solve that problem is preventing GUP >> altogether, then I'd sooner just add that same restriction to what we >> currently have instead of overhauling the user ABI in favour of >> something which offers us very little in return. >> >> On the mmap() side of things for guest_memfd, a simpler option for us >> than what has currently been proposed might be to enforce that the VMM >> has unmapped all private pages on vCPU run, failing the ioctl if that's >> not the case. It needs a little more tracking in guest_memfd but I think >> GUP will then fall out in the wash because only shared pages will be >> mapped by userspace and so GUP will fail by construction for private >> pages. >> >> We're happy to pursue alternative approaches using anonymous memory if >> you'd prefer to keep guest_memfd limited in functionality (e.g. >> preventing GUP of private pages by extending mapping_flags as per [1]), >> but we're equally willing to contribute to guest_memfd if extensions are >> welcome. >> >> What do you prefer? > > Let me summarize the history: > > AMD had its thing running and it worked for them (but I recall it was > hacky :) ). > > TDX made it possible to crash the machine when accessing secure memory > from user space (MCE). 
> > So secure memory must not be mapped into user space -- no page tables. > Prototypes with anonymous memory existed (and I didn't hate them, > although hacky), but one of the other selling points of guest_memfd was > that we could create VMs that wouldn't need any page tables at all, > which I found interesting. > > There was a bit more to that (easier conversion, avoiding GUP, > specifying on allocation that the memory was unmovable ...), but I'll > get to that later. > > The design principle was: nasty private memory (unmovable, unswappable, > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary > "shared" memory is allocated from an ordinary memfd. > > This makes sense: shared memory is neither nasty nor special. You can > migrate it, swap it out, map it into page tables, GUP it, ... without > any issues. > > > So if I would describe some key characteristics of guest_memfd as of > today, it would probably be: > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > 2) Memory is inaccessible. It cannot be read from user space, the > kernel, it cannot be GUP'ed ... only some mechanisms might end up > touching that memory (e.g., hibernation, /proc/kcore) might end up > touching it "by accident", and we usually can handle these cases. > 3) Memory can be discarded in page granularity. There should be no cases > where you cannot discard memory to over-allocate memory for private > pages that have been replaced by shared pages otherwise. > 4) Page tables are not required (well, it's an memfd), and the fd could > in theory be passed to other processes. > > Having "ordinary shared" memory in there implies that 1) and 2) will > have to be adjusted for them, which kind-of turns it "partially" into > ordinary shmem again. > > > Going back to the beginning: with pKVM, we likely want the following > > 1) Convert pages private<->shared in-place > 2) Stop user space + kernel from accessing private memory in process > context. Likely for pKVM we would only crash the process, which > would be acceptable. > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > 4) Prevent private pages from swapout+migration until supported. > > > I suspect your current solution with anonymous memory gets all but 3) > sorted out, correct? > > I'm curious, may there be a requirement in the future that shared memory > could be mapped into other processes? (thinking vhost-user and such > things). Of course that's impossible with anonymous memory; teaching > shmem to contain private memory would kind-of lead to ... guest_memfd, > just that we don't have shared memory there. > I was just thinking of something stupid, not sure if it makes any sense. I'll raise it here before I forget over the weekend. ... what if we glued one guest_memfd and a memfd (shmem) together in the kernel somehow? (1) A to-shared conversion moves a page from the guest_memfd to the memfd. (2) A to-private conversion moves a page from the memfd to the guest_memfd. Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave like any shmem pages: migratable, swappable etc. Of course, (2) is only possible if the page is not pinned, not mapped (we can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. We'd have to decide what to do when we access a "hole" in the memfd -- instead of allocating a fresh page and filling the hole, we'd want to SIGBUS.
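[Editor's note: a minimal sketch of the "SIGBUS instead of filling the hole" behaviour for the shared-side memfd described above. This is purely illustrative; no such fault handler exists, and the function name is invented.]

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: a fault handler for the shared-side memfd that never
 * allocates on access. If the offset is a "hole" (i.e. the page currently
 * lives on the guest_memfd side), the access gets SIGBUS.
 */
static vm_fault_t shared_memfd_fault(struct vm_fault *vmf)
{
	struct address_space *mapping = file_inode(vmf->vma->vm_file)->i_mapping;
	struct folio *folio;

	folio = filemap_get_folio(mapping, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;		/* hole: do not fill it */

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return 0;	/* core mm locks the returned page itself */
}
```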
On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote: > On 22.03.24 18:52, David Hildenbrand wrote: > > On 19.03.24 15:31, Will Deacon wrote: > > > Hi David, > > > > Hi Will, > > > > sorry for the late reply! > > > > > > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > > > > *not* using GUP. There were already discussions on providing a similar > > > > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > > > > but also "ordinary" memfds. > > > > > > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > > pages between private and shared, but I also completely agree that letting > > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > > to GUP it. It's getting tiring, really. > > > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > Reading Sean's reply, he has a different view on that. And I think > > that's the main issue: there are too many different use cases and too > > many different requirements that could turn guest_memfd into something > > that maybe it really shouldn't be. > > > > > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > > memory) and it mostly works with the one significant exception that > > > accessing private memory via a GUP pin will crash the host kernel. If > > > all guest_memfd() can offer to solve that problem is preventing GUP > > > altogether, then I'd sooner just add that same restriction to what we > > > currently have instead of overhauling the user ABI in favour of > > > something which offers us very little in return. > > > > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > than what has currently been proposed might be to enforce that the VMM > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > not the case. It needs a little more tracking in guest_memfd but I think > > > GUP will then fall out in the wash because only shared pages will be > > > mapped by userspace and so GUP will fail by construction for private > > > pages. > > > > > > We're happy to pursue alternative approaches using anonymous memory if > > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > > preventing GUP of private pages by extending mapping_flags as per [1]), > > > but we're equally willing to contribute to guest_memfd if extensions are > > > welcome. 
> > > > > > What do you prefer? > > > > Let me summarize the history: > > > > AMD had its thing running and it worked for them (but I recall it was > > hacky :) ). > > > > TDX made it possible to crash the machine when accessing secure memory > > from user space (MCE). > > > > So secure memory must not be mapped into user space -- no page tables. > > Prototypes with anonymous memory existed (and I didn't hate them, > > although hacky), but one of the other selling points of guest_memfd was > > that we could create VMs that wouldn't need any page tables at all, > > which I found interesting. > > > > There was a bit more to that (easier conversion, avoiding GUP, > > specifying on allocation that the memory was unmovable ...), but I'll > > get to that later. > > > > The design principle was: nasty private memory (unmovable, unswappable, > > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary > > "shared" memory is allocated from an ordinary memfd. > > > > This makes sense: shared memory is neither nasty nor special. You can > > migrate it, swap it out, map it into page tables, GUP it, ... without > > any issues. > > > > > > So if I would describe some key characteristics of guest_memfd as of > > today, it would probably be: > > > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > > 2) Memory is inaccessible. It cannot be read from user space, the > > kernel, it cannot be GUP'ed ... only some mechanisms might end up > > touching that memory (e.g., hibernation, /proc/kcore) might end up > > touching it "by accident", and we usually can handle these cases. > > 3) Memory can be discarded in page granularity. There should be no cases > > where you cannot discard memory to over-allocate memory for private > > pages that have been replaced by shared pages otherwise. > > 4) Page tables are not required (well, it's an memfd), and the fd could > > in theory be passed to other processes. > > > > Having "ordinary shared" memory in there implies that 1) and 2) will > > have to be adjusted for them, which kind-of turns it "partially" into > > ordinary shmem again. > > > > > > Going back to the beginning: with pKVM, we likely want the following > > > > 1) Convert pages private<->shared in-place > > 2) Stop user space + kernel from accessing private memory in process > > context. Likely for pKVM we would only crash the process, which > > would be acceptable. > > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > > 4) Prevent private pages from swapout+migration until supported. > > > > > > I suspect your current solution with anonymous memory gets all but 3) > > sorted out, correct? > > > > I'm curious, may there be a requirement in the future that shared memory > > could be mapped into other processes? (thinking vhost-user and such > > things). Of course that's impossible with anonymous memory; teaching > > shmem to contain private memory would kind-of lead to ... guest_memfd, > > just that we don't have shared memory there. > > > > I was just thinking of something stupid, not sure if it makes any sense. > I'll raise it here before I forget over the weekend. > > ... what if we glued one guest_memfd and a memfd (shmem) together in the > kernel somehow? > > (1) A to-shared conversion moves a page from the guest_memfd to the memfd. > > (2) A to-private conversion moves a page from the memfd to the guest_memfd. > > Only the memfd can be mmap'ed/read/written/GUP'ed. 
Pages in the memfd behave > like any shmem pages: migratable, swappable etc. > > > Of course, (2) is only possible if the page is not pinned, not mapped (we > can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. > Quentin gave an idea offline of using splice to achieve the conversions. I'd want to use the in-kernel APIs on page-fault to do the conversion; not requiring userspace to make the splice() syscall. One thing splice currently requires is the source (in) file; the KVM UAPI today only gives a userspace address. We could resolve that with for_each_vma_range(). I've just started looking into splice(), but I believe it takes care of the not-pinned and not-mapped requirements. guest_memfd would have to migrate the page out of ZONE_MOVABLE / MIGRATE_CMA. Does this seem like a good path to pursue further, or are there other ideas for doing the conversion? > We'd have to decide what to do when we access a "hole" in the memfd -- > instead of allocating a fresh page and filling the hole, we'd want to > SIGBUS. Since the KVM UAPI is based on userspace addresses and not fds for the shared memory part, maybe we could add an mmu_notifier_ops that allows KVM to intercept and reject faults if we couldn't reclaim the memory. I think it would be conceptually similar to userfaultfd, except in the kernel; I'm not sure whether re-using userfaultfd makes sense. Thanks, Elliot
On 26.03.24 23:04, Elliot Berman wrote: > On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote: >> On 22.03.24 18:52, David Hildenbrand wrote: >>> On 19.03.24 15:31, Will Deacon wrote: >>>> Hi David, >>> >>> Hi Will, >>> >>> sorry for the late reply! >>> >>>> >>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>>>> On 19.03.24 01:10, Sean Christopherson wrote: >>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>>>>> *not* using GUP. There were already discussions on providing a similar >>>>>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>>>>> but also "ordinary" memfds. >>>>>> >>>>>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>>>>> pages between private and shared, but I also completely agree that letting >>>>>> anything gup() guest_memfd memory is likely to end in tears. >>>>> >>>>> Yes. Avoid it right from the start, if possible. >>>>> >>>>> People wanted guest_memfd to *not* have to mmap guest memory ("even for >>>>> ordinary VMs"). Now people are saying we have to be able to mmap it in order >>>>> to GUP it. It's getting tiring, really. >>>> >>>> From the pKVM side, we're working on guest_memfd primarily to avoid >>>> diverging from what other CoCo solutions end up using, but if it gets >>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >>>> today with anonymous memory, then it's a really hard sell to switch over >>>> from what we have in production. We're also hoping that, over time, >>>> guest_memfd will become more closely integrated with the mm subsystem to >>>> enable things like hypervisor-assisted page migration, which we would >>>> love to have. >>> >>> Reading Sean's reply, he has a different view on that. And I think >>> that's the main issue: there are too many different use cases and too >>> many different requirements that could turn guest_memfd into something >>> that maybe it really shouldn't be. >>> >>>> >>>> Today, we use the existing KVM interfaces (i.e. based on anonymous >>>> memory) and it mostly works with the one significant exception that >>>> accessing private memory via a GUP pin will crash the host kernel. If >>>> all guest_memfd() can offer to solve that problem is preventing GUP >>>> altogether, then I'd sooner just add that same restriction to what we >>>> currently have instead of overhauling the user ABI in favour of >>>> something which offers us very little in return. >>>> >>>> On the mmap() side of things for guest_memfd, a simpler option for us >>>> than what has currently been proposed might be to enforce that the VMM >>>> has unmapped all private pages on vCPU run, failing the ioctl if that's >>>> not the case. It needs a little more tracking in guest_memfd but I think >>>> GUP will then fall out in the wash because only shared pages will be >>>> mapped by userspace and so GUP will fail by construction for private >>>> pages. >>>> >>>> We're happy to pursue alternative approaches using anonymous memory if >>>> you'd prefer to keep guest_memfd limited in functionality (e.g. >>>> preventing GUP of private pages by extending mapping_flags as per [1]), >>>> but we're equally willing to contribute to guest_memfd if extensions are >>>> welcome. >>>> >>>> What do you prefer? 
>>> >>> Let me summarize the history: >>> >>> AMD had its thing running and it worked for them (but I recall it was >>> hacky :) ). >>> >>> TDX made it possible to crash the machine when accessing secure memory >>> from user space (MCE). >>> >>> So secure memory must not be mapped into user space -- no page tables. >>> Prototypes with anonymous memory existed (and I didn't hate them, >>> although hacky), but one of the other selling points of guest_memfd was >>> that we could create VMs that wouldn't need any page tables at all, >>> which I found interesting. >>> >>> There was a bit more to that (easier conversion, avoiding GUP, >>> specifying on allocation that the memory was unmovable ...), but I'll >>> get to that later. >>> >>> The design principle was: nasty private memory (unmovable, unswappable, >>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary >>> "shared" memory is allocated from an ordinary memfd. >>> >>> This makes sense: shared memory is neither nasty nor special. You can >>> migrate it, swap it out, map it into page tables, GUP it, ... without >>> any issues. >>> >>> >>> So if I would describe some key characteristics of guest_memfd as of >>> today, it would probably be: >>> >>> 1) Memory is unmovable and unswappable. Right from the beginning, it is >>> allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). >>> 2) Memory is inaccessible. It cannot be read from user space, the >>> kernel, it cannot be GUP'ed ... only some mechanisms might end up >>> touching that memory (e.g., hibernation, /proc/kcore) might end up >>> touching it "by accident", and we usually can handle these cases. >>> 3) Memory can be discarded in page granularity. There should be no cases >>> where you cannot discard memory to over-allocate memory for private >>> pages that have been replaced by shared pages otherwise. >>> 4) Page tables are not required (well, it's an memfd), and the fd could >>> in theory be passed to other processes. >>> >>> Having "ordinary shared" memory in there implies that 1) and 2) will >>> have to be adjusted for them, which kind-of turns it "partially" into >>> ordinary shmem again. >>> >>> >>> Going back to the beginning: with pKVM, we likely want the following >>> >>> 1) Convert pages private<->shared in-place >>> 2) Stop user space + kernel from accessing private memory in process >>> context. Likely for pKVM we would only crash the process, which >>> would be acceptable. >>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel. >>> 4) Prevent private pages from swapout+migration until supported. >>> >>> >>> I suspect your current solution with anonymous memory gets all but 3) >>> sorted out, correct? >>> >>> I'm curious, may there be a requirement in the future that shared memory >>> could be mapped into other processes? (thinking vhost-user and such >>> things). Of course that's impossible with anonymous memory; teaching >>> shmem to contain private memory would kind-of lead to ... guest_memfd, >>> just that we don't have shared memory there. >>> >> >> I was just thinking of something stupid, not sure if it makes any sense. >> I'll raise it here before I forget over the weekend. >> >> ... what if we glued one guest_memfd and a memfd (shmem) together in the >> kernel somehow? >> >> (1) A to-shared conversion moves a page from the guest_memfd to the memfd. >> >> (2) A to-private conversion moves a page from the memfd to the guest_memfd. >> >> Only the memfd can be mmap'ed/read/written/GUP'ed. 
Pages in the memfd behave >> like any shmem pages: migratable, swappable etc. >> >> >> Of course, (2) is only possible if the page is not pinned, not mapped (we >> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. >> > > Quentin gave idea offline of using splice to achieve the conversions. > I'd want to use the in-kernel APIs on page-fault to do the conversion; > not requiring userspace to make the splice() syscall. One thing splice > currently requires is the source (in) file; KVM UAPI today only gives > userspace address. We could resolve that by for_each_vma_range(). I've > just started looking into splice(), but I believe it takes care of not > pinned and not mapped. guest_memfd would have to migrate the page out of > ZONE_MOVABLE / MIGRATE_CMA. I don't think we want to involve splice. Conceptually, I think KVM should create a pair of FDs: guest_memfd for private memory and "ordinary shmem/memfd" for shared memory. Conversion back and forth can either be triggered using a KVM API (TDX use case), or internally from KVM (pkvm use case). Maybe it does something internally that splice would do and that we can reuse; otherwise we have to do the plumbing ourselves. Then, we have some logic on how to handle access to unbacked regions (SIGBUS instead of allocating memory) inside both memfds, and allow memory to be allocated for parts of the fds explicitly. No offset can be populated in both fds at the same time. That is, pages can be moved back and forth, but allocating a fresh page in an fd is only possible if there is nothing at that location in the other fd. No memory over-allocation. Coming up with a KVM API for that should be possible. > > Does this seem like a good path to pursue further or any other ideas for > doing the conversion? > >> We'd have to decide what to do when we access a "hole" in the memfd -- >> instead of allocating a fresh page and filling the hole, we'd want to >> SIGBUS. > > Since the KVM UAPI is based on userspace addresses and not fds for the > shared memory part, maybe we could add a mmu_notifier_ops that allows > KVM to intercept and reject faults if we couldn't reclaim the memory. I > think it would be conceptually similar to userfaultfd except in the > kernel; not sure if re-using userfaultfd makes sense? Or, if KVM exposes this other fd as well, we extend the UAPI so that the shared part is also consumed as fd+offset.
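[Editor's note: a sketch of the invariant described above — an offset is backed in at most one of the two fds, and conversion moves the folio across. Every type and helper below is invented for illustration; nothing like this exists upstream.]

```c
/*
 * Purely illustrative: convert one offset of the paired fds to shared by
 * moving the folio from guest_memfd to the ordinary memfd.
 */
static int gmem_pair_convert_to_shared(struct gmem_pair *pair, pgoff_t index)
{
	struct folio *folio;
	int ret;

	folio = gmem_private_lookup(pair->private, index);	/* hypothetical */
	if (!folio)
		return -ENOENT;		/* nothing private at this offset */

	/* hypervisor-specific unshare/undonate step (pKVM, TDX, ...) */
	ret = gmem_pair_arch_unshare(pair, folio);		/* hypothetical */
	if (ret)
		return ret;

	/*
	 * Moving, not copying, keeps the invariant: the offset becomes a hole
	 * in guest_memfd and is populated in the shared memfd, never both.
	 */
	return gmem_pair_move_folio(pair->private, pair->shared, index, folio);
}
```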
Hi Elliot, On Fri, Mar 22, 2024 at 11:46:10AM -0700, Elliot Berman wrote: > On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote: > > On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > > > How would we add the restriction to anonymous memory? > > > > > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > > > where mm can ensure that the exclusive GUP pin is the only pin? If the > > > refcount for the page is >1, then the exclusive GUP fails. Any future > > > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > > > > Yes, I think we'd want something like that, but I don't think using a > > bias on its own is a good idea as false positives due to a large number > > of page references will then actually lead to problems (i.e. rejecting > > GUP spuriously), no? I suppose if you only considered the new bias in > > conjunction with the AS_NOGUP flag you proposed then it might be ok > > (i.e. when you see the bias, you then go check the address space to > > confirm). What do you think? > > > > I think the AS_NOGUP would prevent GUPing the first place. If we set the > EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried > about there being INT_MAX-1 valid GUPs and wanting to add another? From > the GUPer's perspective, I don't think it would be much different from > overflowing the refcount. I don't think we should end up in a position where a malicious user can take a tonne of references and cause functional problems. For example, look at page_maybe_dma_pinned(); the worst case is we end up treating heavily referenced pages as pinned and the soft-dirty logic leaves them perpetually dirtied. The side-effects of what we're talking about here seem to be much worse than that unless we can confirm that the page really is exclusive. > > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > > than what has currently been proposed might be to enforce that the VMM > > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > > not the case. It needs a little more tracking in guest_memfd but I think > > > > GUP will then fall out in the wash because only shared pages will be > > > > mapped by userspace and so GUP will fail by construction for private > > > > pages. > > > > > > We can prevent GUP after the pages are marked private, but the pages > > > could be marked private after the pages were already GUP'd. I don't have > > > a good way to detect this, so converting a page to private is difficult. > > > > For anonymous memory, marking the page as private is going to involve an > > exclusive GUP so that the page can safely be donated to the guest. In > > that case, any existing GUP pin should cause that to fail gracefully. > > What is the situation you are concerned about here? > > > > I wasn't thinking about exclusive GUP here. The exclusive GUP should be > able to get the guarantees we need. > > I was thinking about making sure we gracefully handle a race to provide > the same page. The kernel should detect the difference between "we're > already providing the page" and "somebody has an unexpected pin". We can > easily read the refcount if we couldn't take the exclusive pin to know. Thanks, that makes sense to me. For pKVM, the architecture code also tracks all the donated pages, so we should be able to provide additional metadata here if we shuffle things around a little. Will
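[Editor's note: for concreteness, a sketch of how the EXCLUSIVE_BIAS and AS_NOGUP proposals discussed above might fit together. Neither exists upstream, the bias and bit values are invented, and as per Will's concern the bias is only treated as a hint that must be confirmed against the mapping flag.]

```c
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page_ref.h>

#define EXCLUSIVE_BIAS	INT_MAX		/* value floated in this thread; not upstream */
#define AS_NOGUP	8		/* hypothetical address_space flag bit; not upstream */

/* Sketch: take the one-and-only pin before donating the folio to the guest. */
static bool folio_try_exclusive_pin(struct folio *folio, int expected_refs)
{
	/* Succeeds only if @expected_refs is everything that references the folio. */
	if (!folio_ref_freeze(folio, expected_refs))
		return false;		/* someone else, e.g. a GUP pin, holds a reference */
	folio_ref_unfreeze(folio, EXCLUSIVE_BIAS);
	return true;
}

/*
 * Sketch: GUP-side check. A huge refcount alone could be a false positive,
 * so only treat the folio as exclusively pinned when the mapping also
 * carries the proposed AS_NOGUP flag.
 */
static bool folio_exclusively_pinned(struct folio *folio)
{
	struct address_space *mapping = folio_mapping(folio);

	if (folio_ref_count(folio) < EXCLUSIVE_BIAS)
		return false;
	return mapping && test_bit(AS_NOGUP, &mapping->flags);
}
```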
Hi again, David, On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: > On 19.03.24 15:31, Will Deacon wrote: > sorry for the late reply! Bah, you and me both! > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > From the pKVM side, we're working on guest_memfd primarily to avoid > > diverging from what other CoCo solutions end up using, but if it gets > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > today with anonymous memory, then it's a really hard sell to switch over > > from what we have in production. We're also hoping that, over time, > > guest_memfd will become more closely integrated with the mm subsystem to > > enable things like hypervisor-assisted page migration, which we would > > love to have. > > Reading Sean's reply, he has a different view on that. And I think that's > the main issue: there are too many different use cases and too many > different requirements that could turn guest_memfd into something that maybe > it really shouldn't be. No argument there, and we're certainly not tied to any specific mechanism on the pKVM side. Maybe Sean can chime in, but we've definitely spoken about migration being a goal in the past, so I guess something changed since then on the guest_memfd side. Regardless, from our point of view, we just need to make sure that whatever we settle on for pKVM does the things we need it to do (or can at least be extended to do them) and we're happy to implement that in whatever way works best for upstream, guest_memfd or otherwise. > > We're happy to pursue alternative approaches using anonymous memory if > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > preventing GUP of private pages by extending mapping_flags as per [1]), > > but we're equally willing to contribute to guest_memfd if extensions are > > welcome. > > > > What do you prefer? > > Let me summarize the history: First off, thanks for piecing together the archaeology... > AMD had its thing running and it worked for them (but I recall it was hacky > :) ). > > TDX made it possible to crash the machine when accessing secure memory from > user space (MCE). > > So secure memory must not be mapped into user space -- no page tables. > Prototypes with anonymous memory existed (and I didn't hate them, although > hacky), but one of the other selling points of guest_memfd was that we could > create VMs that wouldn't need any page tables at all, which I found > interesting. Are the prototypes you refer to here based on the old stuff from Kirill? We followed that work at the time, thinking we were going to be using that before guest_memfd came along, so we've sadly been collecting out-of-tree patches for a little while :/ > There was a bit more to that (easier conversion, avoiding GUP, specifying on > allocation that the memory was unmovable ...), but I'll get to that later. > > The design principle was: nasty private memory (unmovable, unswappable, > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" > memory is allocated from an ordinary memfd. > > This makes sense: shared memory is neither nasty nor special. You can > migrate it, swap it out, map it into page tables, GUP it, ... without any > issues. 
Slight aside and not wanting to derail the discussion, but we have a few different types of sharing which we'll have to consider: * Memory shared from the host to the guest. This remains owned by the host and the normal mm stuff can be made to work with it. * Memory shared from the guest to the host. This remains owned by the guest, so there's a pin on the pages and the normal mm stuff can't work without co-operation from the guest (see next point). * Memory relinquished from the guest to the host. This actually unmaps the pages from the host and transfers ownership back to the host, after which the pin is dropped and the normal mm stuff can work. We use this to implement ballooning. I suppose the main thing is that the architecture backend can deal with these states, so the core code shouldn't really care as long as it's aware that shared memory may be pinned. > So if I would describe some key characteristics of guest_memfd as of today, > it would probably be: > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > 2) Memory is inaccessible. It cannot be read from user space, the > kernel, it cannot be GUP'ed ... only some mechanisms might end up > touching that memory (e.g., hibernation, /proc/kcore) might end up > touching it "by accident", and we usually can handle these cases. > 3) Memory can be discarded in page granularity. There should be no cases > where you cannot discard memory to over-allocate memory for private > pages that have been replaced by shared pages otherwise. > 4) Page tables are not required (well, it's an memfd), and the fd could > in theory be passed to other processes. > > Having "ordinary shared" memory in there implies that 1) and 2) will have to > be adjusted for them, which kind-of turns it "partially" into ordinary shmem > again. Yes, and we'd also need a way to establish hugepages (where possible) even for the *private* memory so as to reduce the depth of the guest's stage-2 walk. > Going back to the beginning: with pKVM, we likely want the following > > 1) Convert pages private<->shared in-place > 2) Stop user space + kernel from accessing private memory in process > context. Likely for pKVM we would only crash the process, which > would be acceptable. > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > 4) Prevent private pages from swapout+migration until supported. > > > I suspect your current solution with anonymous memory gets all but 3) sorted > out, correct? I agree on all of these and, yes, (3) is the problem for us. We've also been thinking a bit about CoW recently and I suspect the use of vm_normal_page() in do_wp_page() could lead to issues similar to those we hit with GUP. There are various ways to approach that, but I'm not sure what's best. > I'm curious, may there be a requirement in the future that shared memory > could be mapped into other processes? (thinking vhost-user and such things). It's not impossible. We use crosvm as our VMM, and that has a multi-process sandbox mode which I think relies on just that... Cheers, Will (btw: I'm getting some time away from the computer over Easter, so I'll be a little slow on email again. Nothing personal!).
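For illustration only, the sharing flavours listed above could be labelled by the architecture backend along these lines; the enum and its names are invented for this sketch and are not pKVM's actual state tracking:

enum pkvm_share_state {
        PKVM_HOST_OWNED,        /* plain host memory, not visible to the guest */
        PKVM_SHARED_BY_HOST,    /* host -> guest share, host keeps ownership */
        PKVM_SHARED_BY_GUEST,   /* guest -> host share, guest keeps ownership,
                                 * so the host holds a pin and cannot
                                 * migrate or swap the page */
        PKVM_GUEST_OWNED,       /* private, unmapped from the host */
};

/*
 * Relinquishing (e.g. ballooning) transitions PKVM_SHARED_BY_GUEST or
 * PKVM_GUEST_OWNED back to PKVM_HOST_OWNED, at which point the pin is
 * dropped and the normal mm machinery applies again.
 */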
On 27.03.24 20:34, Will Deacon wrote: > Hi again, David, > > On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: >> On 19.03.24 15:31, Will Deacon wrote: >> sorry for the late reply! > > Bah, you and me both! This time I'm faster! :) > >>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>>> On 19.03.24 01:10, Sean Christopherson wrote: >>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>> From the pKVM side, we're working on guest_memfd primarily to avoid >>> diverging from what other CoCo solutions end up using, but if it gets >>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >>> today with anonymous memory, then it's a really hard sell to switch over >>> from what we have in production. We're also hoping that, over time, >>> guest_memfd will become more closely integrated with the mm subsystem to >>> enable things like hypervisor-assisted page migration, which we would >>> love to have. >> >> Reading Sean's reply, he has a different view on that. And I think that's >> the main issue: there are too many different use cases and too many >> different requirements that could turn guest_memfd into something that maybe >> it really shouldn't be. > > No argument there, and we're certainly not tied to any specific > mechanism on the pKVM side. Maybe Sean can chime in, but we've > definitely spoken about migration being a goal in the past, so I guess > something changed since then on the guest_memfd side. > > Regardless, from our point of view, we just need to make sure that > whatever we settle on for pKVM does the things we need it to do (or can > at least be extended to do them) and we're happy to implement that in > whatever way works best for upstream, guest_memfd or otherwise. > >>> We're happy to pursue alternative approaches using anonymous memory if >>> you'd prefer to keep guest_memfd limited in functionality (e.g. >>> preventing GUP of private pages by extending mapping_flags as per [1]), >>> but we're equally willing to contribute to guest_memfd if extensions are >>> welcome. >>> >>> What do you prefer? >> >> Let me summarize the history: > > First off, thanks for piecing together the archaeology... > >> AMD had its thing running and it worked for them (but I recall it was hacky >> :) ). >> >> TDX made it possible to crash the machine when accessing secure memory from >> user space (MCE). >> >> So secure memory must not be mapped into user space -- no page tables. >> Prototypes with anonymous memory existed (and I didn't hate them, although >> hacky), but one of the other selling points of guest_memfd was that we could >> create VMs that wouldn't need any page tables at all, which I found >> interesting. > > Are the prototypes you refer to here based on the old stuff from Kirill? Yes. > We followed that work at the time, thinking we were going to be using > that before guest_memfd came along, so we've sadly been collecting > out-of-tree patches for a little while :/ :/ > >> There was a bit more to that (easier conversion, avoiding GUP, specifying on >> allocation that the memory was unmovable ...), but I'll get to that later. >> >> The design principle was: nasty private memory (unmovable, unswappable, >> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" >> memory is allocated from an ordinary memfd. >> >> This makes sense: shared memory is neither nasty nor special. 
You can >> migrate it, swap it out, map it into page tables, GUP it, ... without any >> issues. > > Slight aside and not wanting to derail the discussion, but we have a few > different types of sharing which we'll have to consider: Thanks for sharing! > > * Memory shared from the host to the guest. This remains owned by the > host and the normal mm stuff can be made to work with it. Okay, host and guest can access it. We can just migrate memory around, swap it out ... like ordinary guest memory today. > > * Memory shared from the guest to the host. This remains owned by the > guest, so there's a pin on the pages and the normal mm stuff can't > work without co-operation from the guest (see next point). Okay, host and guest can access it, but we cannot migrate memory around or swap it out ... like ordinary guest memory today that is longterm pinned. > > * Memory relinquished from the guest to the host. This actually unmaps > the pages from the host and transfers ownership back to the host, > after which the pin is dropped and the normal mm stuff can work. We > use this to implement ballooning. > Okay, so this is essentially just a state transition between the two above. > I suppose the main thing is that the architecture backend can deal with > these states, so the core code shouldn't really care as long as it's > aware that shared memory may be pinned. So IIUC, the states are: (1) Private: inaccessible by the host, accessible by the guest, "owned by the guest" (2) Host Shared: accessible by the host + guest, "owned by the host" (3) Guest Shared: accessible by the host, "owned by the guest" Memory ballooning is simply transitioning from (3) to (2), and then discarding the memory. Any state I am missing? Which transitions are possible? (1) <-> (2) ? Not sure if the direct transition is possible. (2) <-> (3) ? IIUC yes. (1) <-> (3) ? IIUC yes. There is ongoing work on longterm-pinning memory from a memfd/shmem. So thinking in terms of my vague "guest_memfd fd + memfd fd" pair idea, that approach could look like the following: (1) guest_memfd (could be "with longterm pin") (2) memfd (3) memfd with a longterm pin But again, just some possible idea to make it work with guest_memfd. > >> So if I would describe some key characteristics of guest_memfd as of today, >> it would probably be: >> >> 1) Memory is unmovable and unswappable. Right from the beginning, it is >> allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). >> 2) Memory is inaccessible. It cannot be read from user space, the >> kernel, it cannot be GUP'ed ... only some mechanisms might end up >> touching that memory (e.g., hibernation, /proc/kcore) might end up >> touching it "by accident", and we usually can handle these cases. >> 3) Memory can be discarded in page granularity. There should be no cases >> where you cannot discard memory to over-allocate memory for private >> pages that have been replaced by shared pages otherwise. >> 4) Page tables are not required (well, it's an memfd), and the fd could >> in theory be passed to other processes. >> >> Having "ordinary shared" memory in there implies that 1) and 2) will have to >> be adjusted for them, which kind-of turns it "partially" into ordinary shmem >> again. > > Yes, and we'd also need a way to establish hugepages (where possible) > even for the *private* memory so as to reduce the depth of the guest's > stage-2 walk. > Understood, and as discussed, that's a bit more "hairy". 
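Purely to visualise the "fd pair" idea above, with states (1)-(3) mapped onto two backing files; nothing here is an existing or proposed API:

struct gmem_fd_pair {
        struct file *guest_memfd;       /* state (1): private, unmovable, inaccessible */
        struct file *memfd;             /* state (2): ordinary shared memory;
                                         * state (3): same pages, but longterm-pinned */
};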
>> Going back to the beginning: with pKVM, we likely want the following >> >> 1) Convert pages private<->shared in-place >> 2) Stop user space + kernel from accessing private memory in process >> context. Likely for pKVM we would only crash the process, which >> would be acceptable. >> 3) Prevent GUP to private memory. Otherwise we could crash the kernel. >> 4) Prevent private pages from swapout+migration until supported. >> >> >> I suspect your current solution with anonymous memory gets all but 3) sorted >> out, correct? > > I agree on all of these and, yes, (3) is the problem for us. We've also > been thinking a bit about CoW recently and I suspect the use of > vm_normal_page() in do_wp_page() could lead to issues similar to those > we hit with GUP. There are various ways to approach that, but I'm not > sure what's best. Would COW be required or is that just the nasty side-effect of trying to use anonymous memory? > >> I'm curious, may there be a requirement in the future that shared memory >> could be mapped into other processes? (thinking vhost-user and such things). > > It's not impossible. We use crosvm as our VMM, and that has a > multi-process sandbox mode which I think relies on just that... > Okay, so basing the design on anonymous memory might not be the best choice ... :/ > Cheers, > > Will > > (btw: I'm getting some time away from the computer over Easter, so I'll be > a little slow on email again. Nothing personal!). Sure, no worries! Enjoy!
Hey David, I'll try to pick up the baton while Will is away :-) On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote: > On 27.03.24 20:34, Will Deacon wrote: > > I suppose the main thing is that the architecture backend can deal with > > these states, so the core code shouldn't really care as long as it's > > aware that shared memory may be pinned. > > So IIUC, the states are: > > (1) Private: inaccessible by the host, accessible by the guest, "owned by > the guest" > > (2) Host Shared: accessible by the host + guest, "owned by the host" > > (3) Guest Shared: accessible by the host, "owned by the guest" Yup. > Memory ballooning is simply transitioning from (3) to (2), and then > discarding the memory. Well, not quite actually, see below. > Any state I am missing? So there is probably state (0) which is 'owned only by the host'. It's a bit obvious, but I'll make it explicit because it has its importance for the rest of the discussion. And while at it, there are other cases (memory shared/owned with/by the hypervisor and/or TrustZone) but they're somewhat irrelevant to this discussion. These pages are usually backed by kernel allocations, so much less problematic to deal with. So let's ignore those. > Which transitions are possible? Basically a page must be in the 'exclusively owned' state for an owner to initiate a share or donation. So e.g. a shared page must be unshared before it can be donated to someone else (that is true regardless of the owner, host, guest, hypervisor, ...). That significantly simplifies the state tracking in pKVM. > (1) <-> (2) ? Not sure if the direct transition is possible. Yep, not possible. > (2) <-> (3) ? IIUC yes. Actually it's not directly possible as is. The ballooning procedure is essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a single hypercall when doing ballooning, but it's technically just a (3) -> (1) -> (0) sequence that has been micro-optimized). Note that state (2) is actually never used for protected VMs. It's mainly used to implement standard non-protected VMs. The biggest difference in pKVM between protected and non-protected VMs is basically that in the former case, in the fault path KVM does a (0) -> (1) transition, but in the latter it's (0) -> (2). That implies that in the unprotected case, the host remains the page owner and is allowed to decide to unshare arbitrary pages, to restrict the guest permissions for the shared pages etc, which paves the way for implementing migration, swap, ... relatively easily. > (1) <-> (3) ? IIUC yes. Yep. <snip> > > I agree on all of these and, yes, (3) is the problem for us. We've also > > been thinking a bit about CoW recently and I suspect the use of > > vm_normal_page() in do_wp_page() could lead to issues similar to those > > we hit with GUP. There are various ways to approach that, but I'm not > > sure what's best. > > Would COW be required or is that just the nasty side-effect of trying to use > anonymous memory? That'd qualify as an undesirable side effect I think. > > > > > I'm curious, may there be a requirement in the future that shared memory > > > could be mapped into other processes? (thinking vhost-user and such things). > > > > It's not impossible. We use crosvm as our VMM, and that has a > > multi-process sandbox mode which I think relies on just that... > > > > Okay, so basing the design on anonymous memory might not be the best choice > ... 
:/ So, while we're at this stage, let me throw another idea at the wall to see if it sticks :-) One observation is that a standard memfd would work relatively well for pKVM if we had a way to enforce that all mappings to it are MAP_SHARED. KVM would still need to take an 'exclusive GUP' from the fault path (which may fail in case of a pre-existing GUP, but that's fine), but then CoW and friends largely become a non-issue by construction I think. Is there any way we could enforce that cleanly? Perhaps introducing a sort of 'mmap notifier' would do the trick? By that I mean something a bit similar to an MMU notifier offered by memfd that KVM could register against whenever the memfd is attached to a protected VM memslot. One of the nice things here is that we could retain an entire mapping of the whole of guest memory in userspace, so conversions wouldn't require any additional effort from userspace. A bad thing is that a process that is being passed such a memfd may not expect the new semantic and the inability to map !MAP_SHARED. But I guess a process that receives a handle to private memory must be enlightened regardless of the type of fd, so maybe it's not so bad. Thoughts? Thanks, Quentin
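A small sketch of the ownership rule described earlier in this mail, namely that a page must be exclusively owned before a share or donation can start; the structure and the names are hypothetical rather than the hypervisor's real state machine:

#include <linux/errno.h>

enum pkvm_owner_id { PKVM_ID_HOST, PKVM_ID_GUEST, PKVM_ID_HYP };

struct pkvm_page_state {
        enum pkvm_owner_id owner;
        bool shared;                    /* currently shared with another entity? */
};

/* Shares and donations may only start from the exclusively-owned state. */
static int pkvm_may_share_or_donate(const struct pkvm_page_state *ps,
                                    enum pkvm_owner_id initiator)
{
        if (ps->owner != initiator)
                return -EPERM;          /* only the current owner may transition */
        if (ps->shared)
                return -EBUSY;          /* must unshare first, e.g. (3) -> (1) */
        return 0;
}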
Hi! [...] > >> Any state I am missing? > > So there is probably state (0) which is 'owned only by the host'. It's a > bit obvious, but I'll make it explicit because it has its importance for > the rest of the discussion. Yes, I treated it as "simply not mapped into the VM". > > And while at it, there are other cases (memory shared/owned with/by the > hypervisor and/or TrustZone) but they're somewhat irrelevant to this > discussion. These pages are usually backed by kernel allocations, so > much less problematic to deal with. So let's ignore those. > >> Which transitions are possible? > > Basically a page must be in the 'exclusively owned' state for an owner > to initiate a share or donation. So e.g. a shared page must be unshared > before it can be donated to someone else (that is true regardless of the > owner, host, guest, hypervisor, ...). That simplifies significantly the > state tracking in pKVM. Makes sense! > >> (1) <-> (2) ? Not sure if the direct transition is possible. > > Yep, not possible. > >> (2) <-> (3) ? IIUC yes. > > Actually it's not directly possible as is. The ballooning procedure is > essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a > single hypercall when doing ballooning, but it's technically just a > (3) -> (1) -> (0) sequence that has been micro-optimized). > > Note that state (2) is actually never used for protected VMs. It's > mainly used to implement standard non-protected VMs. The biggest Interesting. > difference in pKVM between protected and non-protected VMs is basically > that in the former case, in the fault path KVM does a (0) -> (1) > transition, but in the latter it's (0) -> (2). That implies that in the > unprotected case, the host remains the page owner and is allowed to > decide to unshare arbitrary pages, to restrict the guest permissions for > the shared pages etc, which paves the way for implementing migration, > swap, ... relatively easily. I'll have to digest that :) ... does that mean that for pKVM with protected VMs, "shared" pages are also never migratable/swappable? > >> (1) <-> (3) ? IIUC yes. > > Yep. > > <snip> >>> I agree on all of these and, yes, (3) is the problem for us. We've also >>> been thinking a bit about CoW recently and I suspect the use of >>> vm_normal_page() in do_wp_page() could lead to issues similar to those >>> we hit with GUP. There are various ways to approach that, but I'm not >>> sure what's best. >> >> Would COW be required or is that just the nasty side-effect of trying to use >> anonymous memory? > > That'd qualify as an undesirable side effect I think. Makes sense! > >>> >>>> I'm curious, may there be a requirement in the future that shared memory >>>> could be mapped into other processes? (thinking vhost-user and such things). >>> >>> It's not impossible. We use crosvm as our VMM, and that has a >>> multi-process sandbox mode which I think relies on just that... >>> >> >> Okay, so basing the design on anonymous memory might not be the best choice >> ... :/ > > So, while we're at this stage, let me throw another idea at the wall to > see if it sticks :-) > > One observation is that a standard memfd would work relatively well for > pKVM if we had a way to enforce that all mappings to it are MAP_SHARED. It should be fairly easy to enforce, I wouldn't worry too much about that. > KVM would still need to take an 'exclusive GUP' from the fault path > (which may fail in case of a pre-existing GUP, but that's fine), but > then CoW and friends largely become a non-issue by construction I think. 
> Is there any way we could enforce that cleanly? Perhaps introducing a > sort of 'mmap notifier' would do the trick? By that I mean something a > bit similar to an MMU notifier offered by memfd that KVM could register > against whenever the memfd is attached to a protected VM memslot. > > One of the nice things here is that we could retain an entire mapping of > the whole of guest memory in userspace, conversions wouldn't require any > additional efforts from userspace. A bad thing is that a process that is > being passed such a memfd may not expect the new semantic and the > inability to map !MAP_SHARED. But I guess a process that receives a I wouldn't worry about the !MAP_SHARED requirement. vhost-user and friends all *must* map it MAP_SHARED to do anything reasonable, so that's what they do. > handle to private memory must be enlightened regardless of the type of > fd, so maybe it's not so bad. > > Thoughts? The whole reason I brought up the guest_memfd+memfd pair idea is that you would similarly be able to do the conversion in the kernel, BUT, you'd never be able to mmap+GUP encrypted pages. Essentially you're using guest_memfd for what it was designed for: private memory that is inaccessible.
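Enforcing the MAP_SHARED-only rule at mmap() time is indeed a well-worn pattern; a minimal sketch, where the function and vm_ops names are placeholders and the fault handler is elided:

#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct restricted_memfd_vm_ops; /* .fault etc. elided */

static int restricted_memfd_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* MAP_PRIVATE mappings would reintroduce CoW and anonymous pages. */
        if (!(vma->vm_flags & VM_SHARED))
                return -EINVAL;

        file_accessed(file);
        vma->vm_ops = &restricted_memfd_vm_ops;
        return 0;
}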
On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote: > ... does that mean that for pKVM with protected VMs, "shared" pages are also > never migratable/swappable? In our current implementation, yes, KVM keeps its longterm GUP pin on pages that are shared back. And we might want to retain this behaviour in the short term, even with guest_memfd or using the hybrid approach you suggested. But that could totally be relaxed in the future, it's "just" a matter of adding extra support to the hypervisor for that. That has not been prioritized yet since the number of shared pages in practice is relatively small for current use-cases, so ballooning was a better option (and in the case of ballooning, we do drop the GUP pin). But that's clearly on the TODO list! > The whole reason I brought up the guest_memfd+memfd pair idea is that you > would similarly be able to do the conversion in the kernel, BUT, you'd never > be able to mmap+GUP encrypted pages. > > Essentially you're using guest_memfd for what it was designed for: private > memory that is inaccessible. Ack, that sounds pretty reasonable to me. But I think we'd still want to make sure the other users of guest_memfd have the _desire_ to support huge pages, migration, swap (probably longer term), and related features, otherwise I don't think a guest_memfd-based option will really work for us :-) Thanks, Quentin
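A rough sketch of the relinquish path mentioned above: once the hypervisor has reclaimed the page, KVM can drop the longterm GUP pin it took when the guest shared the page back. pkvm_hyp_reclaim_page() is a made-up stand-in for the real hypercall wrapper, and the error handling is minimal:

#include <linux/mm.h>

static int relinquish_shared_page(struct page *page, u64 ipa)
{
        int ret;

        /* Ask the hypervisor to unmap and reclaim the page (hypothetical call). */
        ret = pkvm_hyp_reclaim_page(ipa);
        if (ret)
                return ret;

        /* The pin was taken with FOLL_LONGTERM when the page was shared back. */
        unpin_user_page(page);
        return 0;
}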
On 28.03.24 11:58, Quentin Perret wrote: > On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote: >> ... does that mean that for pKVM with protected VMs, "shared" pages are also >> never migratable/swappable? > > In our current implementation, yes, KVM keeps its longterm GUP pin on > pages that are shared back. And we might want to retain this behaviour > in the short term, even with guest_memfd or using the hybrid approach > you suggested. But that could totally be relaxed in the future, it's > "just" a matter of adding extra support to the hypervisor for that. That > has not been prioritized yet since the number of shared pages in > practice is relatively small for current use-cases, so ballooning was a > better option (and in the case of ballooning, we do drop the GUP pin). > But that's clearly on the TODO list! Okay, so nothing "fundamental", good! > >> The whole reason I brought up the guest_memfd+memfd pair idea is that you >> would similarly be able to do the conversion in the kernel, BUT, you'd never >> be able to mmap+GUP encrypted pages. >> >> Essentially you're using guest_memfd for what it was designed for: private >> memory that is inaccessible. > > Ack, that sounds pretty reasonable to me. But I think we'd still want to > make sure the other users of guest_memfd have the _desire_ to support > huge pages, migration, swap (probably longer term), and related > features, otherwise I don't think a guest_memfd-based option will > really work for us :-) *Probably* some easy way to get hugetlb pages into a guest_memfd would be by allocating them for a memfd and then converting/moving them into the guest_memfd part of the "fd pair" on conversion to private :) (but the "partial shared, partial private" case is and remains the ugly thing that is hard and I still don't think it makes sense. Maybe it could be handled somehow in such a dual approach with some enlightenment in the fds ... hard to find solutions for things that don't make any sense :P ) I also do strongly believe that we want to see some HW-assisted migration support for guest_memfd pages. Swap, as you say, maybe in the long-term. After all, we're not interested in having MM features for backing memory that you could similarly find under Windows 95. Wait, that one did support swapping! :P But unfortunately, that's what the shiny new CoCo world currently offers. Well, excluding s390x secure execution, as discussed.
On Thu, Mar 28, 2024 at 4:41 AM David Hildenbrand <david@redhat.com> wrote: > > .... > > > >> The whole reason I brought up the guest_memfd+memfd pair idea is that you > >> would similarly be able to do the conversion in the kernel, BUT, you'd never > >> be able to mmap+GUP encrypted pages. > >> > >> Essentially you're using guest_memfd for what it was designed for: private > >> memory that is inaccessible. > > > > Ack, that sounds pretty reasonable to me. But I think we'd still want to > > make sure the other users of guest_memfd have the _desire_ to support > > huge pages, migration, swap (probably longer term), and related > > features, otherwise I don't think a guest_memfd-based option will > > really work for us :-) > > *Probably* some easy way to get hugetlb pages into a guest_memfd would > be by allocating them for a memfd and then converting/moving them into > the guest_memfd part of the "fd pair" on conversion to private :) > > (but the "partial shared, partial private" case is and remains the ugly > thing that is hard and I still don't think it makes sense. Maybe it > could be handled somehow in such a dual approach with some enlightenment > in the fds ... hard to find solutions for things that don't make any > sense :P ) > I would again emphasize that this use case exists for Confidential VMs, whether we like it or not. 1) TDX hardware allows usage of 1G pages to back guest memory. 2) Larger VM sizes benefit more with 1G page sizes, which would be the norm for VMs exposing GPU/TPU devices. 3) Confidential VMs will need to share host resources with non-confidential VMs using 1G pages. 4) When using normal shmem/hugetlbfs files to back guest memory, this use case was achievable by just manipulating guest page tables (although at the cost of host safety, which led to the invention of guest_memfd). Something equivalent "might be possible" with guest_memfd. Without handling "partial shared, partial private", it is impractical to support 1G pages for Confidential VMs (discounting any long term efforts to tame the guest VMs to play nice). Maybe to handle this use case, all the host-side shared memory usage of guest_memfd (userspace, IOMMU, etc.) should be associated with (or tracked via) file ranges rather than offsets within huge pages (like it's done for faulting in private memory pages when populating guest EPTs/NPTs). Given the current guest behavior, host MMU and IOMMU may have to be forced to always map shared memory regions via 4KB mappings. > I also do strongly believe that we want to see some HW-assisted > migration support for guest_memfd pages. Swap, as you say, maybe in the > long-term. After all, we're not interested in having MM features for > backing memory that you could similarly find under Windows 95. Wait, > that one did support swapping! :P > > But unfortunately, that's what the shiny new CoCo world currently > offers. Well, excluding s390x secure execution, as discussed. > > -- > Cheers, > > David / dhildenb >
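One way to picture the "track by file range" suggestion above is to keep a shared/private attribute per 4KiB file index, so that a 1GiB backing folio can be partially shared while host and IOMMU mappings of the shared parts stay at 4KiB. The xarray usage and the names below are illustrative only, not the guest_memfd implementation:

#include <linux/xarray.h>

enum gmem_attr { GMEM_PRIVATE, GMEM_SHARED };

/* Record the attribute for one 4KiB index within the file. */
static int gmem_set_attr(struct xarray *attrs, pgoff_t index, enum gmem_attr attr)
{
        return xa_err(xa_store(attrs, index, xa_mk_value(attr), GFP_KERNEL));
}

/* Host MMU/IOMMU mapping code would consult this at 4KiB granularity. */
static enum gmem_attr gmem_get_attr(struct xarray *attrs, pgoff_t index)
{
        void *entry = xa_load(attrs, index);

        return entry ? xa_to_value(entry) : GMEM_PRIVATE;
}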
On Wed, Mar 27, 2024, Will Deacon wrote: > Hi again, David, > > On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: > > On 19.03.24 15:31, Will Deacon wrote: > > sorry for the late reply! > > Bah, you and me both! Hold my beer ;-) > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > Reading Sean's reply, he has a different view on that. And I think that's > > the main issue: there are too many different use cases and too many > > different requirements that could turn guest_memfd into something that maybe > > it really shouldn't be. > > No argument there, and we're certainly not tied to any specific > mechanism on the pKVM side. Maybe Sean can chime in, but we've > definitely spoken about migration being a goal in the past, so I guess > something changed since then on the guest_memfd side. What's "hypervisor-assisted page migration"? More specifically, what's the mechanism that drives it? I am not opposed to page migration itself, what I am opposed to is adding deep integration with core MM to do some of the fancy/complex things that lead to page migration. Another thing I want to avoid is taking a hard dependency on "struct page", so that we can have line of sight to eliminating "struct page" overhead for guest_memfd, but that's definitely a more distant future concern. > > This makes sense: shared memory is neither nasty nor special. You can > > migrate it, swap it out, map it into page tables, GUP it, ... without any > > issues. > > Slight aside and not wanting to derail the discussion, but we have a few > different types of sharing which we'll have to consider: > > * Memory shared from the host to the guest. This remains owned by the > host and the normal mm stuff can be made to work with it. This seems like it should be !guest_memfd, i.e. can't be converted to guest private (without first unmapping it from the host, but at that point it's completely different memory, for all intents and purposes). > * Memory shared from the guest to the host. This remains owned by the > guest, so there's a pin on the pages and the normal mm stuff can't > work without co-operation from the guest (see next point). Do you happen to have a list of exactly what you mean by "normal mm stuff"? I am not at all opposed to supporting .mmap(), because long term I also want to use guest_memfd for non-CoCo VMs. But I want to be very conservative with respect to what is allowed for guest_memfd. E.g. host userspace can map guest_memfd, and do operations that are directly related to its mapping, but that's about it. > * Memory relinquished from the guest to the host. This actually unmaps > the pages from the host and transfers ownership back to the host, > after which the pin is dropped and the normal mm stuff can work. 
We > use this to implement ballooning. > > I suppose the main thing is that the architecture backend can deal with > these states, so the core code shouldn't really care as long as it's > aware that shared memory may be pinned. > > > So if I would describe some key characteristics of guest_memfd as of today, > > it would probably be: > > > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > > 2) Memory is inaccessible. It cannot be read from user space, the > > kernel, it cannot be GUP'ed ... only some mechanisms might end up > > touching that memory (e.g., hibernation, /proc/kcore) might end up > > touching it "by accident", and we usually can handle these cases. > > 3) Memory can be discarded in page granularity. There should be no cases > > where you cannot discard memory to over-allocate memory for private > > pages that have been replaced by shared pages otherwise. > > 4) Page tables are not required (well, it's an memfd), and the fd could > > in theory be passed to other processes. More broadly, no VMAs are required. The lack of stage-1 page tables is nice to have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. it's not subject to VMA protections, isn't restricted to host mapping size, etc. > > Having "ordinary shared" memory in there implies that 1) and 2) will have to > > be adjusted for them, which kind-of turns it "partially" into ordinary shmem > > again. > > Yes, and we'd also need a way to establish hugepages (where possible) > even for the *private* memory so as to reduce the depth of the guest's > stage-2 walk. Yeah, hugepage support for guest_memfd is very much a WIP. Getting _something_ is easy, getting the right thing is much harder.