Message ID: 20240222161047.402609-1-tabba@google.com (mailing list archive)
Series: KVM: Restricted mapping of guest_memfd at the host and pKVM/arm64 support
On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. I'm especially interested to see if we can factor out much of the common implementation bits between KVM and Gunyah. In principle, we're doing the same thing: the difference is the exact mechanics of interacting with the hypervisor, which (I think) could easily be extracted into an ops structure. [...] > In addition to general feedback, we would like feedback on how we > handle mmap() and faulting-in guest pages at the host (KVM: Add > restricted support for mapping guest_memfd by the host). > > We don't enforce the invariant that only memory shared with the > host can be mapped by the host userspace in > file_operations::mmap(), but in vm_operations_struct:fault(). On > vm_operations_struct::fault(), we check whether the page is > shared with the host. If not, we deliver a SIGBUS to the current > task. The reason for enforcing this at fault() is that mmap() > does not elevate the pagecount(); it's the faulting in of the > page which does. Even if we were to check at mmap() whether an > address can be mapped, we would still need to check again on > fault(), since between mmap() and fault() the status of the page > can change. > > This creates the situation where access to successfully mmap()'d > memory might SIGBUS at page fault. There is precedence for > similar behavior in the kernel I believe, with MADV_HWPOISON and > the hugetlbfs cgroups controller, which could SIGBUS at page > fault time depending on the accounting limit. I added a test: folio_mmapped() [1] which checks if there's a vma covering the corresponding offset into the guest_memfd. I use this test before trying to make a page private to the guest, and I've been able to ensure that userspace can't even mmap() private guest memory. If I try to make memory private, I check that it's not mmapped and, if it is, don't allow the memory to become private. In my testing so far, this is enough to prevent SIGBUS from happening. This test should probably be moved outside Gunyah-specific code, and I was looking for a maintainer to suggest the right home for it :) [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > Another pKVM specific aspect we would like feedback on, is how to > handle memory mapped by the host being unshared by a guest. The > approach we've taken is that on an unshare call from the guest, > the host userspace is notified that the memory has been unshared, > in order to allow it to unmap it and mark it as PRIVATE as > acknowledgment. If the host does not unmap the memory, the > unshare call issued by the guest fails, which the guest is > informed about on return. 
> > Cheers, > /fuad > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > [5] Protected KVM on arm64 (slides) > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > [6] Protected KVM on arm64 (video) > https://www.youtube.com/watch?v=9npebeVFbFw > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > https://lpc.events/event/17/contributions/1487/ > > [8] Drivers for Gunyah (patch series) > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ As of 5 minutes ago when I sent this, there's a v17: https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ Thanks, Elliot
On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > I added a test: folio_mmapped() [1] which checks if there's a vma > covering the corresponding offset into the guest_memfd. I use this > test before trying to make page private to guest and I've been able to > ensure that userspace can't even mmap() private guest memory. If I try > to make memory private, I can test that it's not mmapped and not allow > memory to become private. In my testing so far, this is enough to > prevent SIGBUS from happening. > > This test probably should be moved outside Gunyah specific code, and was > looking for maintainer to suggest the right home for it :) > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ You, um, might have wanted to send an email to linux-mm, not bury it in the middle of a series of 35 patches? So this isn't folio_mapped() because you're interested if anyone _could_ fault this page, not whether the folio is currently present in anyone's page tables. It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure it's worth the effort to use walk_page_mapping(), but I would defer to David.
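For readers following along, here is a minimal sketch of the kind of check being discussed: whether any VMA of the backing file still covers the folio, so userspace could fault it in later even though folio_mapped() is currently false. This is not the folio_mmapped() patch linked above; the function name, locking granularity and the assumption that the caller holds the folio lock are illustrative only.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical illustration: walk the file's reverse-mapping interval
 * tree (mapping->i_mmap) to see whether any VMA covers the folio's
 * range. Unlike folio_mapped(), this also catches VMAs that have not
 * faulted the page in yet. Caller is assumed to hold the folio lock.
 */
static bool gmem_folio_mmapped(struct folio *folio)
{
	struct address_space *mapping = folio->mapping;
	pgoff_t first = folio->index;
	pgoff_t last = first + folio_nr_pages(folio) - 1;
	struct vm_area_struct *vma;
	bool covered = false;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
		covered = true;
		break;
	}
	i_mmap_unlock_read(mapping);

	return covered;
}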
Hi, I have a question regarding memory shared between the host and a protected guest. I scanned the series, and the pKVM patches this series is based on, but I couldn't easily find the answer. When a page is shared, that page is not mapped in the stage 2 tables that the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't make much sense for KVM to maintain its own stage 2 that is never used, but I thought I should double check that to make sure I'm not missing something. Thanks, Alex On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > the remainder (patches 7-26) require the pKVM core series. A git > repo with this series applied can be found here [3]. We have a > (WIP) kvmtool port capable of running the code in this series > [4]. For a technical deep dive into pKVM, please refer to Quentin > Perret's KVM Forum Presentation [5, 6]. > > I've covered some of the issues presented here in my LPC 2023 > presentation [7]. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. > > By design, guest_memfd cannot be mapped, read, or written by the > host userspace. In pKVM, memory shared between a protected guest > and the host is shared in-place, unlike the other confidential > computing solutions that guest_memfd was originally envisaged for > (e.g, TDX). When initializing a guest, as well as when accessing > memory shared by the guest to the host, it would be useful to > support mapping that memory at the host to avoid copying its > contents. > > One of the benefits of guest_memfd is that it prevents a > misbehaving host process from crashing the system when attempting > to access (deliberately or accidentally) protected guest memory, > since this memory isn't mapped to begin with. Without > guest_memfd, the hypervisor would still prevent such accesses, > but in certain cases the host kernel wouldn't be able to recover, > causing the system to crash. > > Support for mmap() in this patch series maintains the invariant > that only memory shared with the host, either explicitly by the > guest or implicitly before the guest has started running (in > order to populate its memory) is allowed to be mapped. At no time > should private memory be mapped at the host. > > This patch series is divided into two parts: > > The first part is to the KVM core code (patches 1-6), and is > based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support > for mapping guest memory only as long as it is shared. For that, > the host needs to know the sharing status of guest memory. > Therefore, the series adds a new KVM memory attribute, accessible > only by the host kernel, that specifies whether the memory is > allowed to be mapped by the host userspace. > > The second part of the series (patches 7-26) adds guest_memfd > support for pKVM/arm64, and is based on the latest version of our > pKVM series [2]. It uses guest_memfd instead of the current > approach in Android (not upstreamed) of maintaining a long-term > GUP on anonymous memory donated to the guest. 
These patches > handle faulting in guest memory for a guest, as well as handling > sharing and unsharing of guest memory while maintaining the > invariant mentioned earlier. > > In addition to general feedback, we would like feedback on how we > handle mmap() and faulting-in guest pages at the host (KVM: Add > restricted support for mapping guest_memfd by the host). > > We don't enforce the invariant that only memory shared with the > host can be mapped by the host userspace in > file_operations::mmap(), but in vm_operations_struct:fault(). On > vm_operations_struct::fault(), we check whether the page is > shared with the host. If not, we deliver a SIGBUS to the current > task. The reason for enforcing this at fault() is that mmap() > does not elevate the pagecount(); it's the faulting in of the > page which does. Even if we were to check at mmap() whether an > address can be mapped, we would still need to check again on > fault(), since between mmap() and fault() the status of the page > can change. > > This creates the situation where access to successfully mmap()'d > memory might SIGBUS at page fault. There is precedence for > similar behavior in the kernel I believe, with MADV_HWPOISON and > the hugetlbfs cgroups controller, which could SIGBUS at page > fault time depending on the accounting limit. > > Another pKVM specific aspect we would like feedback on, is how to > handle memory mapped by the host being unshared by a guest. The > approach we've taken is that on an unshare call from the guest, > the host userspace is notified that the memory has been unshared, > in order to allow it to unmap it and mark it as PRIVATE as > acknowledgment. If the host does not unmap the memory, the > unshare call issued by the guest fails, which the guest is > informed about on return. 
> > Cheers, > /fuad > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > [5] Protected KVM on arm64 (slides) > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > [6] Protected KVM on arm64 (video) > https://www.youtube.com/watch?v=9npebeVFbFw > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > https://lpc.events/event/17/contributions/1487/ > > [8] Drivers for Gunyah (patch series) > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > Fuad Tabba (20): > KVM: Split KVM memory attributes into user and kernel attributes > KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock > KVM: Add restricted support for mapping guestmem by the host > KVM: Don't allow private attribute to be set if mapped by host > KVM: Don't allow private attribute to be removed for unmappable memory > KVM: Implement kvm_(read|/write)_guest_page for private memory slots > KVM: arm64: Create hypercall return handler > KVM: arm64: Refactor code around handling return from host to guest > KVM: arm64: Rename kvm_pinned_page to kvm_guest_page > KVM: arm64: Add a field to indicate whether the guest page was pinned > KVM: arm64: Do not allow changes to private memory slots > KVM: arm64: Skip VMA checks for slots without userspace address > KVM: arm64: Handle guest_memfd()-backed guest page faults > KVM: arm64: Track sharing of memory from protected guest to host > KVM: arm64: Mark a protected VM's memory as unmappable at > initialization > KVM: arm64: Handle unshare on way back to guest entry rather than exit > KVM: arm64: Check that host unmaps memory unshared by guest > KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() > KVM: arm64: Enable private memory support when pKVM is enabled > KVM: arm64: Enable private memory kconfig for arm64 > > Keir Fraser (3): > KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall > KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall > KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall > > Quentin Perret (1): > KVM: arm64: Turn llist of pinned pages into an rb-tree > > Will Deacon (2): > KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL > KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications > > arch/arm64/include/asm/kvm_host.h | 17 +- > arch/arm64/include/asm/kvm_pkvm.h | 1 + > arch/arm64/kvm/Kconfig | 2 + > arch/arm64/kvm/arm.c | 32 ++- > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 +- > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 67 +++++ > arch/arm64/kvm/hyp/nvhe/pkvm.c | 89 +++++- > arch/arm64/kvm/hyp/nvhe/switch.c | 1 + > arch/arm64/kvm/hypercalls.c | 117 +++++++- > arch/arm64/kvm/mmu.c | 138 +++++++++- > arch/arm64/kvm/pkvm.c | 83 +++++- > include/linux/arm-smccc.h | 7 + > include/linux/kvm_host.h | 34 +++ > include/uapi/linux/kvm.h | 4 + > virt/kvm/Kconfig | 4 + > virt/kvm/guest_memfd.c | 89 +++++- > virt/kvm/kvm_main.c | 260 ++++++++++++++++-- > 19 files changed, 904 insertions(+), 68 deletions(-) > > -- > 2.44.0.rc1.240.g4c46232300-goog > >
Hi Elliot, On Thu, Feb 22, 2024 at 11:44 PM Elliot Berman <quic_eberman@quicinc.com> wrote: > > On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > I'm especially interested to see if we can factor out much of the common > implementation bits between KVM and Gunyah. In principle, we're doing > same thing: the difference is the exact mechanics to interact with the > hypervisor which (I think) could be easily extracted into an ops > structure. I agree. We should share and reuse as much code as possible. I'll sync with you before the V2 of this series. > [...] > > > In addition to general feedback, we would like feedback on how we > > handle mmap() and faulting-in guest pages at the host (KVM: Add > > restricted support for mapping guest_memfd by the host). > > > > We don't enforce the invariant that only memory shared with the > > host can be mapped by the host userspace in > > file_operations::mmap(), but in vm_operations_struct:fault(). On > > vm_operations_struct::fault(), we check whether the page is > > shared with the host. If not, we deliver a SIGBUS to the current > > task. The reason for enforcing this at fault() is that mmap() > > does not elevate the pagecount(); it's the faulting in of the > > page which does. Even if we were to check at mmap() whether an > > address can be mapped, we would still need to check again on > > fault(), since between mmap() and fault() the status of the page > > can change. > > > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > I added a test: folio_mmapped() [1] which checks if there's a vma > covering the corresponding offset into the guest_memfd. I use this > test before trying to make page private to guest and I've been able to > ensure that userspace can't even mmap() private guest memory. If I try > to make memory private, I can test that it's not mmapped and not allow > memory to become private. In my testing so far, this is enough to > prevent SIGBUS from happening. > > This test probably should be moved outside Gunyah specific code, and was > looking for maintainer to suggest the right home for it :) > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ Let's see what the mm-folks think about this [*]. Thanks! /fuad [*] https://lore.kernel.org/all/ZdfoR3nCEP3HTtm1@casper.infradead.org/ > > > > Another pKVM specific aspect we would like feedback on, is how to > > handle memory mapped by the host being unshared by a guest. The > > approach we've taken is that on an unshare call from the guest, > > the host userspace is notified that the memory has been unshared, > > in order to allow it to unmap it and mark it as PRIVATE as > > acknowledgment. If the host does not unmap the memory, the > > unshare call issued by the guest fails, which the guest is > > informed about on return. 
> > > > Cheers, > > /fuad > > > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > > > [5] Protected KVM on arm64 (slides) > > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > > > [6] Protected KVM on arm64 (video) > > https://www.youtube.com/watch?v=9npebeVFbFw > > > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > > https://lpc.events/event/17/contributions/1487/ > > > > [8] Drivers for Gunyah (patch series) > > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > As of 5 min ago when I send this, there's a v17: > https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ > > Thanks, > Elliot >
Hi Alex, On Fri, Feb 23, 2024 at 12:01 PM Alexandru Elisei <alexandru.elisei@arm.com> wrote: > > Hi, > > I have a question regarding memory shared between the host and a protected > guest. I scanned the series, and the pKVM patches this series is based on, > but I couldn't easily find the answer. > > When a page is shared, that page is not mapped in the stage 2 tables that > the host maintains for a regular VM (kvm->arch.mmu), right? It wouldn't > make much sense for KVM to maintain its own stage 2 that is never used, but > I thought I should double check that to make sure I'm not missing > something. You're right. In protected mode the stage-2 tables are maintained by the hypervisor in EL2, since we don't trust the host kernel. It is still KVM of course, but not the regular VM structure, like you said. Cheers, /fuad > > Thanks, > Alex > > On Thu, Feb 22, 2024 at 04:10:21PM +0000, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > > the remainder (patches 7-26) require the pKVM core series. A git > > repo with this series applied can be found here [3]. We have a > > (WIP) kvmtool port capable of running the code in this series > > [4]. For a technical deep dive into pKVM, please refer to Quentin > > Perret's KVM Forum Presentation [5, 6]. > > > > I've covered some of the issues presented here in my LPC 2023 > > presentation [7]. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > > > By design, guest_memfd cannot be mapped, read, or written by the > > host userspace. In pKVM, memory shared between a protected guest > > and the host is shared in-place, unlike the other confidential > > computing solutions that guest_memfd was originally envisaged for > > (e.g, TDX). When initializing a guest, as well as when accessing > > memory shared by the guest to the host, it would be useful to > > support mapping that memory at the host to avoid copying its > > contents. > > > > One of the benefits of guest_memfd is that it prevents a > > misbehaving host process from crashing the system when attempting > > to access (deliberately or accidentally) protected guest memory, > > since this memory isn't mapped to begin with. Without > > guest_memfd, the hypervisor would still prevent such accesses, > > but in certain cases the host kernel wouldn't be able to recover, > > causing the system to crash. > > > > Support for mmap() in this patch series maintains the invariant > > that only memory shared with the host, either explicitly by the > > guest or implicitly before the guest has started running (in > > order to populate its memory) is allowed to be mapped. At no time > > should private memory be mapped at the host. > > > > This patch series is divided into two parts: > > > > The first part is to the KVM core code (patches 1-6), and is > > based on guest_memfd as of Linux 6.8-rc4. It adds opt-in support > > for mapping guest memory only as long as it is shared. For that, > > the host needs to know the sharing status of guest memory. 
> > Therefore, the series adds a new KVM memory attribute, accessible > > only by the host kernel, that specifies whether the memory is > > allowed to be mapped by the host userspace. > > > > The second part of the series (patches 7-26) adds guest_memfd > > support for pKVM/arm64, and is based on the latest version of our > > pKVM series [2]. It uses guest_memfd instead of the current > > approach in Android (not upstreamed) of maintaining a long-term > > GUP on anonymous memory donated to the guest. These patches > > handle faulting in guest memory for a guest, as well as handling > > sharing and unsharing of guest memory while maintaining the > > invariant mentioned earlier. > > > > In addition to general feedback, we would like feedback on how we > > handle mmap() and faulting-in guest pages at the host (KVM: Add > > restricted support for mapping guest_memfd by the host). > > > > We don't enforce the invariant that only memory shared with the > > host can be mapped by the host userspace in > > file_operations::mmap(), but in vm_operations_struct:fault(). On > > vm_operations_struct::fault(), we check whether the page is > > shared with the host. If not, we deliver a SIGBUS to the current > > task. The reason for enforcing this at fault() is that mmap() > > does not elevate the pagecount(); it's the faulting in of the > > page which does. Even if we were to check at mmap() whether an > > address can be mapped, we would still need to check again on > > fault(), since between mmap() and fault() the status of the page > > can change. > > > > This creates the situation where access to successfully mmap()'d > > memory might SIGBUS at page fault. There is precedence for > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > the hugetlbfs cgroups controller, which could SIGBUS at page > > fault time depending on the accounting limit. > > > > Another pKVM specific aspect we would like feedback on, is how to > > handle memory mapped by the host being unshared by a guest. The > > approach we've taken is that on an unshare call from the guest, > > the host userspace is notified that the memory has been unshared, > > in order to allow it to unmap it and mark it as PRIVATE as > > acknowledgment. If the host does not unmap the memory, the > > unshare call issued by the guest fails, which the guest is > > informed about on return. 
> > > > Cheers, > > /fuad > > > > [1] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ > > > > [2] https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-core > > > > [3] https://android-kvm.googlesource.com/linux/+/refs/heads/tabba/guestmem-6.8-rfc-v1 > > > > [4] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-6.8 > > > > [5] Protected KVM on arm64 (slides) > > https://static.sched.com/hosted_files/kvmforum2022/88/KVM%20forum%202022%20-%20pKVM%20deep%20dive.pdf > > > > [6] Protected KVM on arm64 (video) > > https://www.youtube.com/watch?v=9npebeVFbFw > > > > [7] Supporting guest private memory in Protected KVM on Android (presentation) > > https://lpc.events/event/17/contributions/1487/ > > > > [8] Drivers for Gunyah (patch series) > > https://lore.kernel.org/all/20240109-gunyah-v16-0-634904bf4ce9@quicinc.com/ > > > > Fuad Tabba (20): > > KVM: Split KVM memory attributes into user and kernel attributes > > KVM: Introduce kvm_gmem_get_pfn_locked(), which retains the folio lock > > KVM: Add restricted support for mapping guestmem by the host > > KVM: Don't allow private attribute to be set if mapped by host > > KVM: Don't allow private attribute to be removed for unmappable memory > > KVM: Implement kvm_(read|/write)_guest_page for private memory slots > > KVM: arm64: Create hypercall return handler > > KVM: arm64: Refactor code around handling return from host to guest > > KVM: arm64: Rename kvm_pinned_page to kvm_guest_page > > KVM: arm64: Add a field to indicate whether the guest page was pinned > > KVM: arm64: Do not allow changes to private memory slots > > KVM: arm64: Skip VMA checks for slots without userspace address > > KVM: arm64: Handle guest_memfd()-backed guest page faults > > KVM: arm64: Track sharing of memory from protected guest to host > > KVM: arm64: Mark a protected VM's memory as unmappable at > > initialization > > KVM: arm64: Handle unshare on way back to guest entry rather than exit > > KVM: arm64: Check that host unmaps memory unshared by guest > > KVM: arm64: Add handlers for kvm_arch_*_set_memory_attributes() > > KVM: arm64: Enable private memory support when pKVM is enabled > > KVM: arm64: Enable private memory kconfig for arm64 > > > > Keir Fraser (3): > > KVM: arm64: Implement MEM_RELINQUISH SMCCC hypercall > > KVM: arm64: Strictly check page type in MEM_RELINQUISH hypercall > > KVM: arm64: Avoid unnecessary unmap walk in MEM_RELINQUISH hypercall > > > > Quentin Perret (1): > > KVM: arm64: Turn llist of pinned pages into an rb-tree > > > > Will Deacon (2): > > KVM: arm64: Add initial support for KVM_CAP_EXIT_HYPERCALL > > KVM: arm64: Allow userspace to receive SHARE and UNSHARE notifications > > > > arch/arm64/include/asm/kvm_host.h | 17 +- > > arch/arm64/include/asm/kvm_pkvm.h | 1 + > > arch/arm64/kvm/Kconfig | 2 + > > arch/arm64/kvm/arm.c | 32 ++- > > arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 2 + > > arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 1 + > > arch/arm64/kvm/hyp/nvhe/hyp-main.c | 24 +- > > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 67 +++++ > > arch/arm64/kvm/hyp/nvhe/pkvm.c | 89 +++++- > > arch/arm64/kvm/hyp/nvhe/switch.c | 1 + > > arch/arm64/kvm/hypercalls.c | 117 +++++++- > > arch/arm64/kvm/mmu.c | 138 +++++++++- > > arch/arm64/kvm/pkvm.c | 83 +++++- > > include/linux/arm-smccc.h | 7 + > > include/linux/kvm_host.h | 34 +++ > > include/uapi/linux/kvm.h | 4 + > > virt/kvm/Kconfig | 4 + > > virt/kvm/guest_memfd.c | 89 +++++- > > virt/kvm/kvm_main.c | 260 ++++++++++++++++-- > > 19 files 
changed, 904 insertions(+), 68 deletions(-) > > > > -- > > 2.44.0.rc1.240.g4c46232300-goog > > > >
On 23.02.24 01:35, Matthew Wilcox wrote: > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: >>> This creates the situation where access to successfully mmap()'d >>> memory might SIGBUS at page fault. There is precedence for >>> similar behavior in the kernel I believe, with MADV_HWPOISON and >>> the hugetlbfs cgroups controller, which could SIGBUS at page >>> fault time depending on the accounting limit. >> >> I added a test: folio_mmapped() [1] which checks if there's a vma >> covering the corresponding offset into the guest_memfd. I use this >> test before trying to make page private to guest and I've been able to >> ensure that userspace can't even mmap() private guest memory. If I try >> to make memory private, I can test that it's not mmapped and not allow >> memory to become private. In my testing so far, this is enough to >> prevent SIGBUS from happening. >> >> This test probably should be moved outside Gunyah specific code, and was >> looking for maintainer to suggest the right home for it :) >> >> [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > You, um, might have wanted to send an email to linux-mm, not bury it in > the middle of a series of 35 patches? > > So this isn't folio_mapped() because you're interested if anyone _could_ > fault this page, not whether the folio is currently present in anyone's > page tables. > > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure > it's worth the effort to use walk_page_mapping(), but I would defer to > David. First, I suspect we are not only concerned about current+future VMAs covering the page, we are also interested in any page pins that could have been derived from such a VMA? Imagine user space mmap'ed the file, faulted in a page, took a pin on the page using pin_user_pages() and friends, and then munmap()'ed the VMA. You likely want to catch that as well and not allow a conversion to private? [I assume you want to convert the page to private only if you hold all the folio references -- i.e., if the refcount of a small folio is 1] Now, regarding the original question (disallow mapping the page), I see the following approaches: 1) SIGBUS during page fault. There are other cases that can trigger SIGBUS during page faults: hugetlb when we are out of free hugetlb pages, userfaultfd with UFFD_FEATURE_SIGBUS. -> Simple and should get the job done. 2) folio_mmapped() + preventing new mmaps covering that folio -> More complicated, requires an rmap walk on every conversion. 3) Disallow any mmaps of the file while any page is private -> Likely not what you want. Why was 1) abandoned? It looks a lot easier and harder to mess up. Why are you trying to avoid page faults? What's the use case?
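To make option 1) concrete, here is a rough sketch of a fault handler that enforces the invariant at fault time and fails with SIGBUS when the page is private. This is not the code from either series; gmem_get_shared_folio() is a hypothetical helper standing in for "look up the folio and check the sharing attribute under the appropriate lock".

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Sketch of option 1): mmap() always succeeds, but the fault handler
 * refuses to map in pages that are currently private to the guest.
 * gmem_get_shared_folio() is assumed to return the locked folio for the
 * offset, or an ERR_PTR() if the page is not shared with the host.
 */
static vm_fault_t gmem_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct folio *folio;

	folio = gmem_get_shared_folio(inode, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;	/* mmap() succeeded, the access does not */

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return VM_FAULT_LOCKED;		/* folio is handed back locked */
}

static const struct vm_operations_struct gmem_vm_ops = {
	.fault = gmem_fault,
};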
On 22.02.24 17:10, Fuad Tabba wrote: > This series adds restricted mmap() support to guest_memfd [1], as > well as support guest_memfd on pKVM/arm64. > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > the remainder (patches 7-26) require the pKVM core series. A git > repo with this series applied can be found here [3]. We have a > (WIP) kvmtool port capable of running the code in this series > [4]. For a technical deep dive into pKVM, please refer to Quentin > Perret's KVM Forum Presentation [5, 6]. > > I've covered some of the issues presented here in my LPC 2023 > presentation [7]. > > We haven't started using this in Android yet, but we aim to move > away from anonymous memory to guest_memfd once we have the > necessary support merged upstream. Others (e.g., Gunyah [8]) are > also looking into guest_memfd for similar reasons as us. > > By design, guest_memfd cannot be mapped, read, or written by the > host userspace. In pKVM, memory shared between a protected guest > and the host is shared in-place, unlike the other confidential > computing solutions that guest_memfd was originally envisaged for > (e.g, TDX). Can you elaborate (or point to a summary) why pKVM has to be special here? Why can't you use guest_memfd only for private memory and another (ordinary) memfd for shared memory, like the other confidential computing technologies are planning to? What's the main reason for that decision and can it be avoided? (s390x also shares in-place, but doesn't need any special-casing like guest_memfd provides)
On Mon, Feb 26, 2024 at 10:28:11AM +0100, David Hildenbrand wrote: > On 23.02.24 01:35, Matthew Wilcox wrote: > > On Thu, Feb 22, 2024 at 03:43:56PM -0800, Elliot Berman wrote: > > > > This creates the situation where access to successfully mmap()'d > > > > memory might SIGBUS at page fault. There is precedence for > > > > similar behavior in the kernel I believe, with MADV_HWPOISON and > > > > the hugetlbfs cgroups controller, which could SIGBUS at page > > > > fault time depending on the accounting limit. > > > > > > I added a test: folio_mmapped() [1] which checks if there's a vma > > > covering the corresponding offset into the guest_memfd. I use this > > > test before trying to make page private to guest and I've been able to > > > ensure that userspace can't even mmap() private guest memory. If I try > > > to make memory private, I can test that it's not mmapped and not allow > > > memory to become private. In my testing so far, this is enough to > > > prevent SIGBUS from happening. > > > > > > This test probably should be moved outside Gunyah specific code, and was > > > looking for maintainer to suggest the right home for it :) > > > > > > [1]: https://lore.kernel.org/all/20240222-gunyah-v17-20-1e9da6763d38@quicinc.com/ > > > > You, um, might have wanted to send an email to linux-mm, not bury it in > > the middle of a series of 35 patches? > > > > So this isn't folio_mapped() because you're interested if anyone _could_ > > fault this page, not whether the folio is currently present in anyone's > > page tables. > > > > It's like walk_page_mapping() but with a trivial mm_walk_ops; not sure > > it's worth the effort to use walk_page_mapping(), but I would defer to > > David. > > First, I suspect we are not only concerned about current+future VMAs > covering the page, we are also interested in any page pins that could have > been derived from such a VMA? > > Imagine user space mmap'ed the file, faulted in page, took a pin on the page > using pin_user_pages() and friends, and then munmap()'ed the VMA. > > You likely want to catch that as well and not allow a conversion to private? > > [I assume you want to convert the page to private only if you hold all the > folio references -- i.e., if the refcount of a small folio is 1] > Ah, this was something I hadn't thought about. I think both Fuad and I need to update our series to check the refcount rather than mapcount (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > Now, regarding the original question (disallow mapping the page), I see the > following approaches: > > 1) SIGBUS during page fault. There are other cases that can trigger > SIGBUS during page faults: hugetlb when we are out of free hugetlb > pages, userfaultfd with UFFD_FEATURE_SIGBUS. > > -> Simple and should get the job done. > > 2) folio_mmapped() + preventing new mmaps covering that folio > > -> More complicated, requires an rmap walk on every conversion. > > 3) Disallow any mmaps of the file while any page is private > > -> Likely not what you want. > > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are > you trying to avoid page faults? What's the use case? > We were chatting whether we could do better than the SIGBUS approach. SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to return errors early. One difference between hugetlb and this usecase is that running out of free hugetlb pages isn't something we could detect at mmap time. 
at mmap time. In the guest_memfd use case, we should be able to detect when SIGBUS becomes possible due to memory being lent to the guest. I can't think of a reason why userspace would want/be able to resume operation after trying to access a page that it shouldn't be allowed to access, so SIGBUS is functional. The advantage of trying to avoid SIGBUS was better/easier reporting to userspace. - Elliot
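As an aside on the "SIGBUS/FAULT usually crashes userspace" point: a VMM that wants to report the failure rather than die can catch the signal. A minimal userspace sketch follows; the mmap() of the guest_memfd and the offending access are elided, and async-signal-safety is ignored for brevity.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Illustration only: install a SIGBUS handler so that a failed access
 * to a (now) private guest page can be reported instead of crashing
 * the process with the default action.
 */
static void bus_handler(int sig, siginfo_t *info, void *ctx)
{
	fprintf(stderr, "SIGBUS at %p (si_code=%d)\n", info->si_addr, info->si_code);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = bus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... mmap() the guest_memfd and touch a page that may have become private ... */
	return 0;
}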
Hi David, On Mon, Feb 26, 2024 at 9:47 AM David Hildenbrand <david@redhat.com> wrote: > > On 22.02.24 17:10, Fuad Tabba wrote: > > This series adds restricted mmap() support to guest_memfd [1], as > > well as support guest_memfd on pKVM/arm64. > > > > This series is based on Linux 6.8-rc4 + our pKVM core series [2]. > > The KVM core patches apply to Linux 6.8-rc4 (patches 1-6), but > > the remainder (patches 7-26) require the pKVM core series. A git > > repo with this series applied can be found here [3]. We have a > > (WIP) kvmtool port capable of running the code in this series > > [4]. For a technical deep dive into pKVM, please refer to Quentin > > Perret's KVM Forum Presentation [5, 6]. > > > > I've covered some of the issues presented here in my LPC 2023 > > presentation [7]. > > > > We haven't started using this in Android yet, but we aim to move > > away from anonymous memory to guest_memfd once we have the > > necessary support merged upstream. Others (e.g., Gunyah [8]) are > > also looking into guest_memfd for similar reasons as us. > > > > By design, guest_memfd cannot be mapped, read, or written by the > > host userspace. In pKVM, memory shared between a protected guest > > and the host is shared in-place, unlike the other confidential > > computing solutions that guest_memfd was originally envisaged for > > (e.g, TDX). > > Can you elaborate (or point to a summary) why pKVM has to be special > here? Why can't you use guest_memfd only for private memory and another > (ordinary) memfd for shared memory, like the other confidential > computing technologies are planning to? Because the same memory location can switch back and forth between being shared and private in-place. The host/vmm doesn't know beforehand which parts of the guest's private memory might be shared with it later, therefore, it cannot use guest_memfd() for the private memory and anonymous memory for the shared memory without resorting to copying. Even if it did know beforehand, it wouldn't help much since that memory can change back to being private later on. Other confidential computing proposals like TDX and Arm CCA don't share in place, and need to copy shared data between private and shared memory. If you're interested, there was also a more detailed discussion about this in an earlier guest_memfd() thread [1] > What's the main reason for that decision and can it be avoided? > (s390x also shares in-place, but doesn't need any special-casing like > guest_memfd provides) In our current implementation of pKVM, we use anonymous memory with a long-term gup, and the host ends up with valid mappings. This isn't just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, accessing guest private memory can be fatal to the host and the system as a whole since it could result in a machine check. In arm64 it's not necessarily fatal to the system as a whole if a userspace process were to attempt the access. However, a userspace process could trick the host kernel to try to access the protected guest memory, e.g., by having a process A strace a malicious process B which passes protected guest memory as argument to a syscall. What makes pKVM and Gunyah special is that both can easily share memory (and its contents) in place, since it's not encrypted, and convert memory locations between shared/unshared. I'm not familiar with how s390x handles sharing in place, or how it handles memory donated to the guest. I assume it's by donating anonymous memory. 
I would also be interested to know how it handles and recovers from similar situations, i.e., host (userspace or kernel) trying to access guest protected memory. Thank you, /fuad [1] https://lore.kernel.org/all/YkcTTY4YjQs5BRhE@google.com/ [2] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/ [3] https://lore.kernel.org/all/20240222-gunyah-v17-0-1e9da6763d38@quicinc.com/ > > -- > Cheers, > > David / dhildenb >
Hi, >> Can you elaborate (or point to a summary) why pKVM has to be special >> here? Why can't you use guest_memfd only for private memory and another >> (ordinary) memfd for shared memory, like the other confidential >> computing technologies are planning to? > > Because the same memory location can switch back and forth between > being shared and private in-place. The host/vmm doesn't know > beforehand which parts of the guest's private memory might be shared > with it later, therefore, it cannot use guest_memfd() for the private > memory and anonymous memory for the shared memory without resorting to I don't remember the latest details about the guest_memfd incarnation in user space, but I though we'd be using guest_memfd for private memory and an ordinary memfd for shared memory. But maybe it also works with anon memory instead of the memfd and that was just an implementation detail :) > copying. Even if it did know beforehand, it wouldn't help much since > that memory can change back to being private later on. Other > confidential computing proposals like TDX and Arm CCA don't share in > place, and need to copy shared data between private and shared memory. Right. > > If you're interested, there was also a more detailed discussion about > this in an earlier guest_memfd() thread [1] Thanks for the pointer! > >> What's the main reason for that decision and can it be avoided? >> (s390x also shares in-place, but doesn't need any special-casing like >> guest_memfd provides) > > In our current implementation of pKVM, we use anonymous memory with a > long-term gup, and the host ends up with valid mappings. This isn't > just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, > accessing guest private memory can be fatal to the host and the system > as a whole since it could result in a machine check. In arm64 it's not > necessarily fatal to the system as a whole if a userspace process were > to attempt the access. However, a userspace process could trick the > host kernel to try to access the protected guest memory, e.g., by > having a process A strace a malicious process B which passes protected > guest memory as argument to a syscall. Right. > > What makes pKVM and Gunyah special is that both can easily share > memory (and its contents) in place, since it's not encrypted, and > convert memory locations between shared/unshared. I'm not familiar > with how s390x handles sharing in place, or how it handles memory > donated to the guest. I assume it's by donating anonymous memory. I > would be also interested to know how it handles and recovers from > similar situations, i.e., host (userspace or kernel) trying to access > guest protected memory. I don't know all of the s390x "protected VM" details, but it is pretty similar. Take a look at arch/s390/kernel/uv.c if you are interested. There are "ultravisor" calls that can convert a page * from secure (inaccessible by the host) to non-secure (encrypted but accessible by the host) * from non-secure to secure Once the host tries to access a "secure" page -- either from the kernel or from user space, the host gets a page fault and calls arch_make_page_accessible(). That will encrypt page content such that the host can access it (migrate/swapout/whatsoever). The host has to set aside some memory area for the ultravisor to "remember" page state. So you can swapout/migrate these pages, but the host will only read encrypted garbage. In contrast to disallowing access to these pages. 
So you don't need any guest_memfd games to protect from that -- and one doesn't have to travel back in time to have memory that isn't swappable/migratable and only comes in one page size. [I'm not up-to-date which obscure corner-cases CCA requirement the s390x implementation cannot fulfill -- like replacing pages in page tables and such; I suspect pKVM also cannot cover all these corner-cases] Extending guest_memfd (the one that was promised initially to not be mmappable) to be mmappable just to avoid some crashes in corner cases is the right approach. But I'm pretty sure that has all been discussed before, that's why I am asking about some details :)
I missed here "I'm wondering whether ..." > Extending guest_memfd (the one that was promised initially to not be > mmappable) to be mmappable just to avoid some crashes in corner cases is > the right approach. But I'm pretty sure that has all been discussed > before, that's why I am asking about some details :) >
> > Ah, this was something I hadn't thought about. I think both Fuad and I > need to update our series to check the refcount rather than mapcount > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But checking for any unexpected references might be better (there are still some GUP users that don't use FOLL_PIN). At least concurrent migration/swapout (that temporarily unmaps a folio and can give you folio_mapped() "false negatives", which both take a temporary folio reference and hold the page lock) should not be a concern because guest_memfd doesn't support that yet. > >> >> Now, regarding the original question (disallow mapping the page), I see the >> following approaches: >> >> 1) SIGBUS during page fault. There are other cases that can trigger >> SIGBUS during page faults: hugetlb when we are out of free hugetlb >> pages, userfaultfd with UFFD_FEATURE_SIGBUS. >> >> -> Simple and should get the job done. >> >> 2) folio_mmapped() + preventing new mmaps covering that folio >> >> -> More complicated, requires an rmap walk on every conversion. >> >> 3) Disallow any mmaps of the file while any page is private >> >> -> Likely not what you want. >> >> >> Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are >> you trying to avoid page faults? What's the use case? >> > > We were chatting whether we could do better than the SIGBUS approach. > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to > return errors early. One difference between hugetlb and this usecase is > that running out of free hugetlb pages isn't something we could detect With hugetlb reservation one can try detecting it at mmap() time. But as reservations are not NUMA aware, it's not reliable. > at mmap time. In guest_memfd usecase, we should be able to detect when > SIGBUS becomes possible due to memory being lent to guest. > > I can't think of a reason why userspace would want/be able to resume > operation after trying to access a page that it shouldn't be allowed, so > SIGBUS is functional. The advantage of trying to avoid SIGBUS was > better/easier reporting to userspace. To me, it sounds conceptually easier and less error-prone to 1) Converting a page to private only if there are no unexpected references (no mappings, GUP pins, ...) 2) Disallowing mapping private pages and failing the page fault. 3) Handling that small race window only (page lock?) Instead of 1) Converting a page to private only if there are no unexpected references (no mappings, GUP pins, ...) and no VMAs covering it where we could fault it in later 2) Disallowing mmap when the range would contain any private page 3) Handling races between mmap and page conversion
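A hedged sketch of the first step in the sequence David prefers (convert only when there are no unexpected references). The expected-refcount arithmetic assumes the page cache holds one reference per page and that the caller holds the folio lock plus one reference of its own; neither series necessarily does it exactly this way.

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: refuse a shared->private conversion unless the
 * folio is unmapped, not DMA-pinned, and has no references beyond the
 * page cache's (assumed one per page) and the caller's own.
 */
static bool gmem_safe_to_make_private(struct folio *folio)
{
	long expected = folio_nr_pages(folio) + 1;

	if (folio_mapped(folio) || folio_maybe_dma_pinned(folio))
		return false;

	return folio_ref_count(folio) == expected;
}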
Hi David, On Tue, Feb 27, 2024 at 2:41 PM David Hildenbrand <david@redhat.com> wrote: > > Hi, > > >> Can you elaborate (or point to a summary) why pKVM has to be special > >> here? Why can't you use guest_memfd only for private memory and another > >> (ordinary) memfd for shared memory, like the other confidential > >> computing technologies are planning to? > > > > Because the same memory location can switch back and forth between > > being shared and private in-place. The host/vmm doesn't know > > beforehand which parts of the guest's private memory might be shared > > with it later, therefore, it cannot use guest_memfd() for the private > > memory and anonymous memory for the shared memory without resorting to > > I don't remember the latest details about the guest_memfd incarnation in > user space, but I though we'd be using guest_memfd for private memory > and an ordinary memfd for shared memory. But maybe it also works with > anon memory instead of the memfd and that was just an implementation > detail :) > > > copying. Even if it did know beforehand, it wouldn't help much since > > that memory can change back to being private later on. Other > > confidential computing proposals like TDX and Arm CCA don't share in > > place, and need to copy shared data between private and shared memory. > > Right. > > > > > If you're interested, there was also a more detailed discussion about > > this in an earlier guest_memfd() thread [1] > > Thanks for the pointer! > > > > >> What's the main reason for that decision and can it be avoided? > >> (s390x also shares in-place, but doesn't need any special-casing like > >> guest_memfd provides) > > > > In our current implementation of pKVM, we use anonymous memory with a > > long-term gup, and the host ends up with valid mappings. This isn't > > just a problem for pKVM, but also for TDX and Gunyah [2, 3]. In TDX, > > accessing guest private memory can be fatal to the host and the system > > as a whole since it could result in a machine check. In arm64 it's not > > necessarily fatal to the system as a whole if a userspace process were > > to attempt the access. However, a userspace process could trick the > > host kernel to try to access the protected guest memory, e.g., by > > having a process A strace a malicious process B which passes protected > > guest memory as argument to a syscall. > > Right. > > > > > What makes pKVM and Gunyah special is that both can easily share > > memory (and its contents) in place, since it's not encrypted, and > > convert memory locations between shared/unshared. I'm not familiar > > with how s390x handles sharing in place, or how it handles memory > > donated to the guest. I assume it's by donating anonymous memory. I > > would be also interested to know how it handles and recovers from > > similar situations, i.e., host (userspace or kernel) trying to access > > guest protected memory. > > I don't know all of the s390x "protected VM" details, but it is pretty > similar. Take a look at arch/s390/kernel/uv.c if you are interested. > > There are "ultravisor" calls that can convert a page > * from secure (inaccessible by the host) to non-secure (encrypted but > accessible by the host) > * from non-secure to secure > > Once the host tries to access a "secure" page -- either from the kernel > or from user space, the host gets a page fault and calls > arch_make_page_accessible(). That will encrypt page content such that > the host can access it (migrate/swapout/whatsoever). 
> > The host has to set aside some memory area for the ultravisor to > "remember" page state. > > So you can swapout/migrate these pages, but the host will only read > encrypted garbage. In contrast to disallowing access to these pages. > > So you don't need any guest_memfd games to protect from that -- and one > doesn't have to travel back in time to have memory that isn't > swappable/migratable and only comes in one page size. > > [I'm not up-to-date which obscure corner-cases CCA requirement the s390x > implementation cannot fulfill -- like replacing pages in page tables and > such; I suspect pKVM also cannot cover all these corner-cases] Thanks for this. I'll do some more reading on how things work with s390x. Right, and of course, one key difference is that pKVM doesn't encrypt anything, and only relies on stage-2 protection to protect the guest. > > Extending guest_memfd (the one that was promised initially to not be > mmappable) to be mmappable just to avoid some crashes in corner cases is > the right approach. But I'm pretty sure that has all been discussed > before, that's why I am asking about some details :) Thank you very much for your reviews and comments. They've already been very helpful. I noticed the gmap.h in the s390 source, which might also be something that we could learn from. So please do ask for as much detail as you like. Cheers, /fuad > -- > Cheers, > > David / dhildenb >
>> So you don't need any guest_memfd games to protect from that -- and one >> doesn't have to travel back in time to have memory that isn't >> swappable/migratable and only comes in one page size. >> >> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x >> implementation cannot fulfill -- like replacing pages in page tables and >> such; I suspect pKVM also cannot cover all these corner-cases] > > Thanks for this. I'll do some more reading on how things work with s390x. > > Right, and of course, one key difference of course is that pKVM > doesn't encrypt anything, and only relies on stage-2 protection to > protect the guest. I don't remember what exactly s390x does, but I recall that it might only encrypt the memory content as it transitions a page from secure to non-secure. Something like that could also be implemented using pKVM (unless I am missing something), but it might not be that trivial, of course :)
On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: > > > > Ah, this was something I hadn't thought about. I think both Fuad and I > > need to update our series to check the refcount rather than mapcount > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But > checking for any unexpected references might be better (there are still some > GUP users that don't use FOLL_PIN). As a non-mm person I'm not sure I understand the consequences of holding a GUP pin to a page that is not covered by any VMA. The absence of VMAs implies that userspace cannot access the page, right? Presumably the kernel can't be coerced into accessing that page either? Is that correct? > At least concurrent migration/swapout (that temporarily unmaps a folio and > can give you folio_mapped() "false negatives", which both take a temporary > folio reference and hold the page lock) should not be a concern because > guest_memfd doesn't support that yet. > > > > > > > > > Now, regarding the original question (disallow mapping the page), I see the > > > following approaches: > > > > > > 1) SIGBUS during page fault. There are other cases that can trigger > > > SIGBUS during page faults: hugetlb when we are out of free hugetlb > > > pages, userfaultfd with UFFD_FEATURE_SIGBUS. > > > > > > -> Simple and should get the job done. > > > > > > 2) folio_mmapped() + preventing new mmaps covering that folio > > > > > > -> More complicated, requires an rmap walk on every conversion. > > > > > > 3) Disallow any mmaps of the file while any page is private > > > > > > -> Likely not what you want. > > > > > > > > > Why was 1) abandoned? I looks a lot easier and harder to mess up. Why are > > > you trying to avoid page faults? What's the use case? > > > > > > > We were chatting whether we could do better than the SIGBUS approach. > > SIGBUS/FAULT usually crashes userspace, so I was brainstorming ways to > > return errors early. One difference between hugetlb and this usecase is > > that running out of free hugetlb pages isn't something we could detect > > With hugetlb reservation one can try detecting it at mmap() time. But as > reservations are not NUMA aware, it's not reliable. > > > at mmap time. In guest_memfd usecase, we should be able to detect when > > SIGBUS becomes possible due to memory being lent to guest. > > > > I can't think of a reason why userspace would want/be able to resume > > operation after trying to access a page that it shouldn't be allowed, so > > SIGBUS is functional. The advantage of trying to avoid SIGBUS was > > better/easier reporting to userspace. > > To me, it sounds conceptually easier and less error-prone to > > 1) Converting a page to private only if there are no unexpected > references (no mappings, GUP pins, ...) > 2) Disallowing mapping private pages and failing the page fault. > 3) Handling that small race window only (page lock?) > > Instead of > > 1) Converting a page to private only if there are no unexpected > references (no mappings, GUP pins, ...) and no VMAs covering it where > we could fault it in later > 2) Disallowing mmap when the range would contain any private page > 3) Handling races between mmap and page conversion The one thing that makes the second option cleaner from a userspace perspective (IMO) is that the conversion to private is happening lazily during guest faults. 
So whether or not an mmapped page can indeed be accessed from userspace will be entirely nondeterministic as it depends on the guest faulting pattern which userspace is entirely unaware of. Elliot's suggestion would prevent spurious crashes caused by that somewhat odd behaviour, though arguably sane userspace software shouldn't be doing that to start with. To add a layer of paint to the shed, the usage of SIGBUS for something that is really a permission access problem doesn't feel appropriate. Allocating memory via guestmem and donating that to a protected guest is a way for userspace to voluntarily relinquish access permissions to the memory it allocated. So a userspace process violating that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the point that signal would be sent, the page would have been accounted against that userspace process, so not sure the paging examples that were discussed earlier are exactly comparable. To illustrate that differently, given that pKVM and Gunyah use MMU-based protection, there is nothing architecturally that prevents a guest from sharing a page back with Linux as RO. Note that we don't currently support this, so I don't want to conflate this use case, but that hopefully makes it a little more obvious that this is a "there is a page, but you don't currently have the permission to access it" problem rather than "sorry but we ran out of pages" problem. Thanks, Quentin
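A minimal sketch of the fault-time check discussed above (fail the fault on a private page), assuming guest_memfd gains an mmap() path as proposed in this series. kvm_gmem_get_folio() is modeled on the existing guest_memfd code; kvm_gmem_is_shared() and the exact fault return value are placeholders for whatever the series settles on:

static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	struct folio *folio;

	/* Grab (and lock) the folio backing this offset of the guest_memfd. */
	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
	if (IS_ERR_OR_NULL(folio))
		return VM_FAULT_SIGBUS;

	/* Placeholder shared/private tracking: only shared pages may be mapped. */
	if (!kvm_gmem_is_shared(inode, vmf->pgoff)) {
		folio_unlock(folio);
		folio_put(folio);
		/* SIGBUS here; VM_FAULT_SIGSEGV if SEGV is deemed more appropriate. */
		return VM_FAULT_SIGBUS;
	}

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return VM_FAULT_LOCKED;
}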
On 28.02.24 11:48, Quentin Perret wrote: > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: >>> >>> Ah, this was something I hadn't thought about. I think both Fuad and I >>> need to update our series to check the refcount rather than mapcount >>> (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). >> >> An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But >> checking for any unexpected references might be better (there are still some >> GUP users that don't use FOLL_PIN). > > As a non-mm person I'm not sure to understand to consequences of holding > a GUP pin to a page that is not covered by any VMA. The absence of VMAs > imply that userspace cannot access the page right? Presumably the kernel > can't be coerced into accessing that page either? Is that correct? Simple example: register the page using an iouring fixed buffer, then unmap the VMA. iouring now has the page pinned and can read/write it using an address in the kernel vitual address space (direct map). Then, you can happily make the kernel read/write that page using iouring, even though no VMA still covers/maps that page. [...] >> Instead of >> >> 1) Converting a page to private only if there are no unexpected >> references (no mappings, GUP pins, ...) and no VMAs covering it where >> we could fault it in later >> 2) Disallowing mmap when the range would contain any private page >> 3) Handling races between mmap and page conversion > > The one thing that makes the second option cleaner from a userspace > perspective (IMO) is that the conversion to private is happening lazily > during guest faults. So whether or not an mmapped page can indeed be > accessed from userspace will be entirely undeterministic as it depends > on the guest faulting pattern which userspace is entirely unaware of. > Elliot's suggestion would prevent spurious crashes caused by that > somewhat odd behaviour, though arguably sane userspace software > shouldn't be doing that to start with. The last sentence is the important one. User space should not access that memory. If it does, it gets a slap on the hand. Because it should not access that memory. We might even be able to export to user space which pages are currently accessible and which ones not (e.g., pagemap), although it would be racy as long as the VM is running and can trigger a conversion. > > To add a layer of paint to the shed, the usage of SIGBUS for > something that is really a permission access problem doesn't feel SIGBUS stands for "BUS error (bad memory access)." Which makes sense, if you try accessing something that can no longer be accessed. It's now inaccessible. Even if it is temporarily. Just like a page with an MCE error. Swapin errors. Etc. You cannot access it. It might be a permission problem on the pKVM side, but it's not the traditional "permission problem" as in mprotect() and friends. You cannot resolve that permission problem yourself. It's a higher entity that turned that memory inaccessible. > appropriate. Allocating memory via guestmem and donating that to a > protected guest is a way for userspace to voluntarily relinquish access > permissions to the memory it allocated. So a userspace process violating > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > point that signal would be sent, the page would have been accounted > against that userspace process, so not sure the paging examples that > were discussed earlier are exactly comparable. 
To illustrate that > differently, given that pKVM and Gunyah use MMU-based protection, there > is nothing architecturally that prevents a guest from sharing a page > back with Linux as RO. Sure, then allow page faults that allow for reads and give a signal on write faults. In the scenario, it even makes more sense to not constantly require new mmap's from user space just to access a now-shared page. > Note that we don't currently support this, so I > don't want to conflate this use case, but that hopefully makes it a > little more obvious that this is a "there is a page, but you don't > currently have the permission to access it" problem rather than "sorry > but we ran out of pages" problem. We could use other signals, as long as the semantics are clear and documented. Maybe SIGSEGV would be warranted. I consider that a minor detail, though. Requiring mmap()/munmap() dances just to access a page that is now shared from user space sounds a bit suboptimal. But I don't know all the details of the user space implementation. "mmap() the whole thing once and only access what you are supposed to access" sounds reasonable to me. If you don't play by the rules, you get a signal. But I'm happy to learn more details.
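To make the io_uring fixed-buffer scenario from this mail concrete, a minimal userspace sketch (assumes liburing; error handling omitted). Once the buffer is registered, the kernel holds a pin on the page, so it can keep reading/writing it long after munmap() removed every VMA covering it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <sys/mman.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	int fd = open("dump.bin", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	iov.iov_base = buf;
	iov.iov_len = 4096;

	io_uring_queue_init(8, &ring, 0);
	io_uring_register_buffers(&ring, &iov, 1);	/* the page gets pinned here */

	munmap(buf, 4096);				/* no VMA maps the page anymore */

	/* The write still works: it uses the pinned page, not the (gone) VMA. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write_fixed(sqe, fd, buf, 4096, 0, 0);
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	return 0;
}

This is exactly the kind of reference that a mapcount/folio_mapped() check would miss, hence the suggestion above to look at the refcount or folio_maybe_dma_pinned() instead.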
On Wednesday 28 Feb 2024 at 12:11:30 (+0100), David Hildenbrand wrote: > On 28.02.24 11:48, Quentin Perret wrote: > > On Tuesday 27 Feb 2024 at 15:59:37 (+0100), David Hildenbrand wrote: > > > > > > > > Ah, this was something I hadn't thought about. I think both Fuad and I > > > > need to update our series to check the refcount rather than mapcount > > > > (kvm_is_gmem_mapped for Fuad, gunyah_folio_lend_safe for me). > > > > > > An alternative might be !folio_mapped() && !folio_maybe_dma_pinned(). But > > > checking for any unexpected references might be better (there are still some > > > GUP users that don't use FOLL_PIN). > > > > As a non-mm person I'm not sure to understand to consequences of holding > > a GUP pin to a page that is not covered by any VMA. The absence of VMAs > > imply that userspace cannot access the page right? Presumably the kernel > > can't be coerced into accessing that page either? Is that correct? > > Simple example: register the page using an iouring fixed buffer, then unmap > the VMA. iouring now has the page pinned and can read/write it using an > address in the kernel vitual address space (direct map). > > Then, you can happily make the kernel read/write that page using iouring, > even though no VMA still covers/maps that page. Makes sense, and yes that would be a major bug if we let that happen, thanks for the explanation. > [...] > > > > Instead of > > > > > > 1) Converting a page to private only if there are no unexpected > > > references (no mappings, GUP pins, ...) and no VMAs covering it where > > > we could fault it in later > > > 2) Disallowing mmap when the range would contain any private page > > > 3) Handling races between mmap and page conversion > > > > The one thing that makes the second option cleaner from a userspace > > perspective (IMO) is that the conversion to private is happening lazily > > during guest faults. So whether or not an mmapped page can indeed be > > accessed from userspace will be entirely undeterministic as it depends > > on the guest faulting pattern which userspace is entirely unaware of. > > Elliot's suggestion would prevent spurious crashes caused by that > > somewhat odd behaviour, though arguably sane userspace software > > shouldn't be doing that to start with. > > The last sentence is the important one. User space should not access that > memory. If it does, it gets a slap on the hand. Because it should not access > that memory. > > We might even be able to export to user space which pages are currently > accessible and which ones not (e.g., pagemap), although it would be racy as > long as the VM is running and can trigger a conversion. > > > > > To add a layer of paint to the shed, the usage of SIGBUS for > > something that is really a permission access problem doesn't feel > > SIGBUS stands for "BUS error (bad memory access)." > > Which makes sense, if you try accessing something that can no longer be > accessed. It's now inaccessible. Even if it is temporarily. > > Just like a page with an MCE error. Swapin errors. Etc. You cannot access > it. > > It might be a permission problem on the pKVM side, but it's not the > traditional "permission problem" as in mprotect() and friends. You cannot > resolve that permission problem yourself. It's a higher entity that turned > that memory inaccessible. Well that's where I'm not sure to agree. Userspace can, in fact, get back all of that memory by simply killing the protected VM. 
With the approach suggested here, the guestmem pages are entirely accessible to the host until they are attached to a running protected VM which triggers the protection. It is very much userspace saying "I promise not to touch these pages from now on" when it does that, in a way that I personally find very comparable to the mprotect case. It is not some other entity that pulls the carpet from under userspace's feet, it is userspace being inconsistent with itself that causes the issue here, and that's why SIGBUS feels kinda wrong as it tends to be used to report external errors of some sort. > > appropriate. Allocating memory via guestmem and donating that to a > > protected guest is a way for userspace to voluntarily relinquish access > > permissions to the memory it allocated. So a userspace process violating > > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > > point that signal would be sent, the page would have been accounted > > against that userspace process, so not sure the paging examples that > > were discussed earlier are exactly comparable. To illustrate that > > differently, given that pKVM and Gunyah use MMU-based protection, there > > is nothing architecturally that prevents a guest from sharing a page > > back with Linux as RO. > > Sure, then allow page faults that allow for reads and give a signal on write > faults. > > In the scenario, it even makes more sense to not constantly require new > mmap's from user space just to access a now-shared page. > > > Note that we don't currently support this, so I > > don't want to conflate this use case, but that hopefully makes it a > > little more obvious that this is a "there is a page, but you don't > > currently have the permission to access it" problem rather than "sorry > > but we ran out of pages" problem. > > We could user other signals, at least as the semantics are clear and it's > documented. Maybe SIGSEGV would be warranted. > > I consider that a minor detail, though. > > Requiring mmap()/munmap() dances just to access a page that is now shared > from user space sounds a bit suboptimal. But I don't know all the details of > the user space implementation. Agreed, if we could save having to mmap() each page that gets shared back that would be a nice performance optimization. > "mmap() the whole thing once and only access what you are supposed to > access" sounds reasonable to me. If you don't play by the rules, you get a > signal. "... you get a signal, or maybe you don't". But yes I understand your point, and as per the above there are real benefits to this approach so why not. What do we expect userspace to do when a page goes from shared back to being guest-private, because e.g. the guest decides to unshare? Use munmap() on that page? Or perhaps an madvise() call of some sort? Note that this will be needed when starting a guest as well, as userspace needs to copy the guest payload in the guestmem file prior to starting the protected VM. Thanks, Quentin
>>> To add a layer of paint to the shed, the usage of SIGBUS for >>> something that is really a permission access problem doesn't feel >> >> SIGBUS stands for "BUS error (bad memory access)." >> >> Which makes sense, if you try accessing something that can no longer be >> accessed. It's now inaccessible. Even if it is temporarily. >> >> Just like a page with an MCE error. Swapin errors. Etc. You cannot access >> it. >> >> It might be a permission problem on the pKVM side, but it's not the >> traditional "permission problem" as in mprotect() and friends. You cannot >> resolve that permission problem yourself. It's a higher entity that turned >> that memory inaccessible. > > Well that's where I'm not sure to agree. Userspace can, in fact, get > back all of that memory by simply killing the protected VM. With the Right, but that would likely "wipe" the pages so they can be made accessible again, right? That's the whole point why we are handing the pages over to the "higher entity", and allow someone else (the VM) to turn them into a state where we can no longer read them. (if you follow the other discussion, it would actually be nice if we could read them and would get encrypted content back, like s390x does; but that's a different discussion and I assume pretty much out of scope :) ) > approach suggested here, the guestmem pages are entirely accessible to > the host until they are attached to a running protected VM which > triggers the protection. It is very much userspace saying "I promise not > to touch these pages from now on" when it does that, in a way that I > personally find very comparable to the mprotect case. It is not some > other entity that pulls the carpet from under userspace's feet, it is > userspace being inconsistent with itself that causes the issue here, and > that's why SIGBUS feels kinda wrong as it tends to be used to report > external errors of some sort. I recall that user space can also trigger SIGBUS when doing some mmap()+truncate() thingies, and probably a bunch more, that could be fixed up later. I don't see a problem with SIUGBUS here, but I do understand your view. I consider the exact signal a minor detail, though. > >>> appropriate. Allocating memory via guestmem and donating that to a >>> protected guest is a way for userspace to voluntarily relinquish access >>> permissions to the memory it allocated. So a userspace process violating >>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the >>> point that signal would be sent, the page would have been accounted >>> against that userspace process, so not sure the paging examples that >>> were discussed earlier are exactly comparable. To illustrate that >>> differently, given that pKVM and Gunyah use MMU-based protection, there >>> is nothing architecturally that prevents a guest from sharing a page >>> back with Linux as RO. >> >> Sure, then allow page faults that allow for reads and give a signal on write >> faults. >> >> In the scenario, it even makes more sense to not constantly require new >> mmap's from user space just to access a now-shared page. >> >>> Note that we don't currently support this, so I >>> don't want to conflate this use case, but that hopefully makes it a >>> little more obvious that this is a "there is a page, but you don't >>> currently have the permission to access it" problem rather than "sorry >>> but we ran out of pages" problem. >> >> We could user other signals, at least as the semantics are clear and it's >> documented. Maybe SIGSEGV would be warranted. 
>> >> I consider that a minor detail, though. >> >> Requiring mmap()/munmap() dances just to access a page that is now shared >> from user space sounds a bit suboptimal. But I don't know all the details of >> the user space implementation. > > Agreed, if we could save having to mmap() each page that gets shared > back that would be a nice performance optimization. > >> "mmap() the whole thing once and only access what you are supposed to >> access" sounds reasonable to me. If you don't play by the rules, you get a >> signal. > > "... you get a signal, or maybe you don't". But yes I understand your > point, and as per the above there are real benefits to this approach so > why not. > > What do we expect userspace to do when a page goes from shared back to > being guest-private, because e.g. the guest decides to unshare? Use > munmap() on that page? Or perhaps an madvise() call of some sort? Note > that this will be needed when starting a guest as well, as userspace > needs to copy the guest payload in the guestmem file prior to starting > the protected VM. Let's assume we have the whole guest_memfd mapped exactly once in our process, a single VMA. When setting up the VM, we'll write the payload and then fire up the VM. That will (I assume) trigger some shared -> private conversion. When we want to convert shared -> private in the kernel, we would first check if the page is currently mapped. If it is, we could try unmapping that page using an rmap walk. Then, we'd make sure that there are really no other references and protect against concurrent mapping of the page. Now we can convert the page to private. As we want to avoid the rmap walk, user space can be nice and simply MADV_DONTNEED the shared memory portions once it's done with it. For example, after writing the payload. Just a thought, maybe I am missing some details.
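A rough sketch of that shared -> private flow under the assumptions in this thread. folio_mapped(), unmap_mapping_range(), folio_maybe_dma_pinned() and friends are existing primitives; kvm_gmem_mark_private() and the expected_refs baseline are assumptions about how the conversion would track state, and the folio lock is assumed to be what the fault path synchronizes against:

static int kvm_gmem_convert_to_private(struct inode *inode, struct folio *folio,
				       int expected_refs)
{
	/* Zap host userspace PTEs; this is the (file) rmap walk over i_mmap. */
	if (folio_mapped(folio))
		unmap_mapping_range(inode->i_mapping, folio_pos(folio),
				    folio_size(folio), 0);

	/* Assumption: the fault path takes the folio lock, closing the race window. */
	folio_lock(folio);
	if (folio_mapped(folio) || folio_maybe_dma_pinned(folio) ||
	    folio_ref_count(folio) != expected_refs) {
		folio_unlock(folio);
		return -EBUSY;	/* e.g. exit to userspace and let it drop its refs */
	}

	kvm_gmem_mark_private(inode, folio->index);	/* assumed tracking helper */
	folio_unlock(folio);
	return 0;
}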
On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote: > > > > To add a layer of paint to the shed, the usage of SIGBUS for > > > > something that is really a permission access problem doesn't feel > > > > > > SIGBUS stands for "BUS error (bad memory access)." > > > > > > Which makes sense, if you try accessing something that can no longer be > > > accessed. It's now inaccessible. Even if it is temporarily. > > > > > > Just like a page with an MCE error. Swapin errors. Etc. You cannot access > > > it. > > > > > > It might be a permission problem on the pKVM side, but it's not the > > > traditional "permission problem" as in mprotect() and friends. You cannot > > > resolve that permission problem yourself. It's a higher entity that turned > > > that memory inaccessible. > > > > Well that's where I'm not sure to agree. Userspace can, in fact, get > > back all of that memory by simply killing the protected VM. With the > > Right, but that would likely "wipe" the pages so they can be made accessible > again, right? Yep, indeed. > That's the whole point why we are handing the pages over to the "higher > entity", and allow someone else (the VM) to turn them into a state where we > can no longer read them. > > (if you follow the other discussion, it would actually be nice if we could > read them and would get encrypted content back, like s390x does; but that's > a different discussion and I assume pretty much out of scope :) ) Interesting, I'll read up. On a side note, I'm also considering adding a guest-facing hypervisor interface to let the guest decide to opt out of the hypervisor wipe as discussed above. That would be useful for a guest that is shutting itself down (which could be cooperating with the host Linux) and that knows it has erased its secrets. That is in general difficult to do for an OS, but a simple approach could be to poison all its memory (or maybe encrypt it?) before opting out of that wipe. The hypervisor wipe is done in hypervisor context (obviously), which is non-preemptible, so avoiding wiping (or encrypting) loads of memory there is highly desirable. Also pKVM doesn't have a linear map of all memory for security reasons, so we need to map/unmap the pages one by one, which sucks as much as it sounds. But yes, we're digressing, that is all for later :) > > approach suggested here, the guestmem pages are entirely accessible to > > the host until they are attached to a running protected VM which > > triggers the protection. It is very much userspace saying "I promise not > > to touch these pages from now on" when it does that, in a way that I > > personally find very comparable to the mprotect case. It is not some > > other entity that pulls the carpet from under userspace's feet, it is > > userspace being inconsistent with itself that causes the issue here, and > > that's why SIGBUS feels kinda wrong as it tends to be used to report > > external errors of some sort. > > I recall that user space can also trigger SIGBUS when doing some > mmap()+truncate() thingies, and probably a bunch more, that could be fixed > up later. Right, so that probably still falls into "there is no page" bucket rather than the "there is a page that is already accounted against the userspace process, but it doesn't have the permission to access it bucket. But yes that's probably an infinite debate. > I don't see a problem with SIUGBUS here, but I do understand your view. I > consider the exact signal a minor detail, though. > > > > > > > appropriate. 
Allocating memory via guestmem and donating that to a > > > > protected guest is a way for userspace to voluntarily relinquish access > > > > permissions to the memory it allocated. So a userspace process violating > > > > that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the > > > > point that signal would be sent, the page would have been accounted > > > > against that userspace process, so not sure the paging examples that > > > > were discussed earlier are exactly comparable. To illustrate that > > > > differently, given that pKVM and Gunyah use MMU-based protection, there > > > > is nothing architecturally that prevents a guest from sharing a page > > > > back with Linux as RO. > > > > > > Sure, then allow page faults that allow for reads and give a signal on write > > > faults. > > > > > > In the scenario, it even makes more sense to not constantly require new > > > mmap's from user space just to access a now-shared page. > > > > > > > Note that we don't currently support this, so I > > > > don't want to conflate this use case, but that hopefully makes it a > > > > little more obvious that this is a "there is a page, but you don't > > > > currently have the permission to access it" problem rather than "sorry > > > > but we ran out of pages" problem. > > > > > > We could user other signals, at least as the semantics are clear and it's > > > documented. Maybe SIGSEGV would be warranted. > > > > > > I consider that a minor detail, though. > > > > > > Requiring mmap()/munmap() dances just to access a page that is now shared > > > from user space sounds a bit suboptimal. But I don't know all the details of > > > the user space implementation. > > > > Agreed, if we could save having to mmap() each page that gets shared > > back that would be a nice performance optimization. > > > > > "mmap() the whole thing once and only access what you are supposed to (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > > > signal. > > > > "... you get a signal, or maybe you don't". But yes I understand your > > point, and as per the above there are real benefits to this approach so > > why not. > > > > What do we expect userspace to do when a page goes from shared back to > > being guest-private, because e.g. the guest decides to unshare? Use > > munmap() on that page? Or perhaps an madvise() call of some sort? Note > > that this will be needed when starting a guest as well, as userspace > > needs to copy the guest payload in the guestmem file prior to starting > > the protected VM. > > Let's assume we have the whole guest_memfd mapped exactly once in our > process, a single VMA. > > When setting up the VM, we'll write the payload and then fire up the VM. > > That will (I assume) trigger some shared -> private conversion. > > When we want to convert shared -> private in the kernel, we would first > check if the page is currently mapped. If it is, we could try unmapping that > page using an rmap walk. I had not considered that. That would most certainly be slow, but a well behaved userspace process shouldn't hit it so, that's probably not a problem... > Then, we'd make sure that there are really no other references and protect > against concurrent mapping of the page. Now we can convert the page to > private. Right. 
Alternatively, the shared->private conversion happens in the KVM vcpu run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a new exit reason saying "can't donate that page while it's shared" and have userspace use MADV_DONTNEED or munmap, or whatever on the back of that. But I tend to prefer the rmap option if it's workable as that avoids adding new KVM userspace ABI. > As we want to avoid the rmap walk, user space can be nice and simply > MADV_DONTNEED the shared memory portions once it's done with it. For > example, after writing the payload. That makes sense to me. Thanks, Quentin
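From the VMM side, the "be nice" variant could look roughly like the sketch below. KVM_CREATE_GUEST_MEMFD and struct kvm_create_guest_memfd come from the current uapi headers; the assumption that the resulting fd can be mmap()ed at all is exactly what this series proposes:

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Sketch: vm_fd is a KVM VM fd; error handling omitted. */
static void *load_payload(int vm_fd, const void *payload, size_t payload_size,
			  size_t mem_size)
{
	struct kvm_create_guest_memfd args = { .size = mem_size };
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	void *mem = mmap(NULL, mem_size, PROT_READ | PROT_WRITE, MAP_SHARED,
			 gmem_fd, 0);

	memcpy(mem, payload, payload_size);	/* guest image, DT, ... */

	/*
	 * Be nice: zap our PTEs so a later shared -> private conversion finds
	 * nothing to unmap. The data itself stays in the guest_memfd file.
	 */
	madvise(mem, mem_size, MADV_DONTNEED);
	return mem;
}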
On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote: > > > So you don't need any guest_memfd games to protect from that -- and one > > > doesn't have to travel back in time to have memory that isn't > > > swappable/migratable and only comes in one page size. > > > > > > [I'm not up-to-date which obscure corner-cases CCA requirement the s390x > > > implementation cannot fulfill -- like replacing pages in page tables and > > > such; I suspect pKVM also cannot cover all these corner-cases] > > > > Thanks for this. I'll do some more reading on how things work with s390x. > > > > Right, and of course, one key difference of course is that pKVM > > doesn't encrypt anything, and only relies on stage-2 protection to > > protect the guest. > > I don't remember what exactly s390x does, but I recall that it might only > encrypt the memory content as it transitions a page from secure to > non-secure. > > Something like that could also be implemented using pKVM (unless I am > missing something), but it might not be that trivial, of course :) One of the complicated aspects of having the host migrate pages like so is for the hypervisor to make sure the content of the page has not been tampered with when the new page is re-mapped in the guest. That means having additional tracking in the hypervisor of pages that have been encrypted and returned to the host, indexed by IPA, with something like a 'checksum' of some sort, which is non-trivial to do securely from a cryptographic PoV. A simpler and secure way to do this (I think) is to do hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa, old_pa, new_pa) hypercall which Linux can call to migrate a page. pKVM unmaps the new page from the host stage-2, unmaps the old page from guest stage-2, does the copy, wipes the old page, maps the pages in the respective page-tables, and off we go. That way the content is never visible to Linux and that avoids the problems I highlighted above by construction. The downside is that it doesn't work for swapping, but that is quite hard to do in general...
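Roughly, the hypervisor-assisted flow described above, as pseudocode. Every function name here is illustrative rather than an existing pKVM interface:

/* Illustrative pseudocode only -- none of these helpers exist as such. */
int pkvm_hyp_migrate_page(u64 ipa, phys_addr_t old_pa, phys_addr_t new_pa)
{
	/* Take the destination page away from the host. */
	host_stage2_unmap(new_pa, PAGE_SIZE);

	/* Unmap the source from the guest so neither side can race the copy. */
	guest_stage2_unmap(ipa, PAGE_SIZE);

	/* Copy and wipe entirely at EL2; the content is never visible to Linux. */
	hyp_copy_page(new_pa, old_pa);
	hyp_wipe_page(old_pa);

	/* Swap ownership: new page to the guest, old page back to the host. */
	guest_stage2_map(ipa, new_pa);
	host_stage2_map(old_pa, PAGE_SIZE);

	return 0;
}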
On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote: > Alternatively, the shared->private conversion happens in the KVM vcpu > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > new exit reason saying "can't donate that page while it's shared" and > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > of that. But I tend to prefer the rmap option if it's workable as that > avoids adding new KVM userspace ABI. > You'll still probably need the new exit reason saying "can't donate that page while it's shared" if the refcount tests fail. David's iouring example shows how some other part of the kernel can hold a reference to the page. I can't think of anything to do other than exiting to userspace because we don't know how to drop that extra ref. Thanks, Elliot
On Wednesday 28 Feb 2024 at 10:43:27 (-0800), Elliot Berman wrote: > On Wed, Feb 28, 2024 at 01:34:15PM +0000, Quentin Perret wrote: > > Alternatively, the shared->private conversion happens in the KVM vcpu > > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > > new exit reason saying "can't donate that page while it's shared" and > > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > > of that. But I tend to prefer the rmap option if it's workable as that > > avoids adding new KVM userspace ABI. > > > > You'll still probably need the new exit reason saying "can't donate that > page while it's shared" if the refcount tests fail. Can use David's > iouring as example of some other part of the kernel has a reference to > the page. I can't think of anything to do other than exiting to > userspace because we don't know how to drop that extra ref. Ack, I realized that later on. I guess there may be cases where userspace doesn't know how to drop that pin, but that's not the kernel's fault and it can't do anything about it, so a userspace exit is our best chance... Thanks, Quentin
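For illustration, the userspace side of such an exit could be as small as the sketch below. Neither the exit reason nor the way the offending range would be reported exists today; both are precisely the new ABI being debated:

#include <sys/mman.h>

/* Hypothetical handler for a "can't donate that page while it's shared" exit. */
static void handle_unshare_exit(unsigned char *gmem_base, size_t offset, size_t len)
{
	/* Drop our PTEs (and/or unregister any pins) over the range, then retry KVM_RUN. */
	madvise(gmem_base + offset, len, MADV_DONTNEED);
}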
On 28.02.24 15:01, Quentin Perret wrote: > On Wednesday 28 Feb 2024 at 11:12:18 (+0100), David Hildenbrand wrote: >>>> So you don't need any guest_memfd games to protect from that -- and one >>>> doesn't have to travel back in time to have memory that isn't >>>> swappable/migratable and only comes in one page size. >>>> >>>> [I'm not up-to-date which obscure corner-cases CCA requirement the s390x >>>> implementation cannot fulfill -- like replacing pages in page tables and >>>> such; I suspect pKVM also cannot cover all these corner-cases] >>> >>> Thanks for this. I'll do some more reading on how things work with s390x. >>> >>> Right, and of course, one key difference of course is that pKVM >>> doesn't encrypt anything, and only relies on stage-2 protection to >>> protect the guest. >> >> I don't remember what exactly s390x does, but I recall that it might only >> encrypt the memory content as it transitions a page from secure to >> non-secure. >> >> Something like that could also be implemented using pKVM (unless I am >> missing something), but it might not be that trivial, of course :) > > One of the complicated aspects of having the host migrate pages like so > is for the hypervisor to make sure the content of the page has not been > tempered with when the new page is re-mapped in the guest. That means > having additional tracking in the hypervisor of pages that have been > encrypted and returned to the host, indexed by IPA, with something like > a 'checksum' of some sort, which is non-trivial to do securely from a > cryptographic PoV. > I don't know what exactly s390x does, and how it does it -- and which CCA cases they can handle. Details are scarce, for example: https://www.ibm.com/docs/en/linux-on-systems?topic=virtualization-introducing-secure-execution-linux I suspect they do it in some way you describe, and I fully agree with the "non-trivial" aspect :) > A simpler and secure way to do this is (I think) is to do > hypervisor-assisted migration. IOW, pKVM exposes a new migrate_page(ipa, > old_pa, new_pa) hypercall which Linux can call to migrate a page. > pKVM unmaps the new page from the host stage-2, unmap the old page from > guest stage-2, does the copy, wipes the old page, maps the pages in the > respective page-tables, and off we go. That way the content is never > visible to Linux and that avoids the problems I highlighted above by > construction. The downside is that it doesn't work for swapping, but > that is quite hard to do in general... The main "problem" with that is that you still end up with these inaccessible pages, that require the use of guest_memfd for your in-place conversion requirement in the first place. I'm sure at some point we'll see migration support also for TDX and friends. For pKVM it might be even easier to support.
On 28.02.24 14:34, Quentin Perret wrote: > On Wednesday 28 Feb 2024 at 14:00:50 (+0100), David Hildenbrand wrote: >>>>> To add a layer of paint to the shed, the usage of SIGBUS for >>>>> something that is really a permission access problem doesn't feel >>>> >>>> SIGBUS stands for "BUS error (bad memory access)." >>>> >>>> Which makes sense, if you try accessing something that can no longer be >>>> accessed. It's now inaccessible. Even if it is temporarily. >>>> >>>> Just like a page with an MCE error. Swapin errors. Etc. You cannot access >>>> it. >>>> >>>> It might be a permission problem on the pKVM side, but it's not the >>>> traditional "permission problem" as in mprotect() and friends. You cannot >>>> resolve that permission problem yourself. It's a higher entity that turned >>>> that memory inaccessible. >>> >>> Well that's where I'm not sure to agree. Userspace can, in fact, get >>> back all of that memory by simply killing the protected VM. With the >> >> Right, but that would likely "wipe" the pages so they can be made accessible >> again, right? > > Yep, indeed. > >> That's the whole point why we are handing the pages over to the "higher >> entity", and allow someone else (the VM) to turn them into a state where we >> can no longer read them. >> >> (if you follow the other discussion, it would actually be nice if we could >> read them and would get encrypted content back, like s390x does; but that's >> a different discussion and I assume pretty much out of scope :) ) > > Interesting, I'll read up. On a side note, I'm also considering adding a > guest-facing hypervisor interface to let the guest decide to opt out of > the hypervisor wipe as discussed above. That would be useful for a guest > that is shutting itself down (which could be cooperating with the host > Linux) and that knows it has erased its secrets. That is in general > difficult to do for an OS, but a simple approach could be to poison all > its memory (or maybe encrypt it?) before opting out of that wipe. > > The hypervisor wipe is done in hypervisor context (obviously), which is > non-preemptible, so avoiding wiping (or encrypting) loads of memory > there is highly desirable. Also pKVM doesn't have a linear map of all > memory for security reasons, so we need to map/unmap the pages one by > one, which sucks as much as it sounds. > > But yes, we're digressing, that is all for later :) :) Sounds like an interesting optimization. An alternative would be to remember in pKVM that a page needs a wipe before reaccess. Once re-accessed by anybody (hypervisor or new guest), it first has to be wiped by pKVM. ... but that also sounds complicated and similar requires the linear map+unmap in pKVM page-by-page as they are reused. But at least a guest shutdown would be faster. > >>> approach suggested here, the guestmem pages are entirely accessible to >>> the host until they are attached to a running protected VM which >>> triggers the protection. It is very much userspace saying "I promise not >>> to touch these pages from now on" when it does that, in a way that I >>> personally find very comparable to the mprotect case. It is not some >>> other entity that pulls the carpet from under userspace's feet, it is >>> userspace being inconsistent with itself that causes the issue here, and >>> that's why SIGBUS feels kinda wrong as it tends to be used to report >>> external errors of some sort. 
>> >> I recall that user space can also trigger SIGBUS when doing some >> mmap()+truncate() thingies, and probably a bunch more, that could be fixed >> up later. > > Right, so that probably still falls into "there is no page" bucket > rather than the "there is a page that is already accounted against the > userspace process, but it doesn't have the permission to access it > bucket. But yes that's probably an infinite debate. Yes, we should rather focus on the bigger idea: have inaccessible memory that fails a pagefault instead of the mmap. > >> I don't see a problem with SIUGBUS here, but I do understand your view. I >> consider the exact signal a minor detail, though. >> >>> >>>>> appropriate. Allocating memory via guestmem and donating that to a >>>>> protected guest is a way for userspace to voluntarily relinquish access >>>>> permissions to the memory it allocated. So a userspace process violating >>>>> that could, IMO, reasonably expect a SEGV instead of SIGBUS. By the >>>>> point that signal would be sent, the page would have been accounted >>>>> against that userspace process, so not sure the paging examples that >>>>> were discussed earlier are exactly comparable. To illustrate that >>>>> differently, given that pKVM and Gunyah use MMU-based protection, there >>>>> is nothing architecturally that prevents a guest from sharing a page >>>>> back with Linux as RO. >>>> >>>> Sure, then allow page faults that allow for reads and give a signal on write >>>> faults. >>>> >>>> In the scenario, it even makes more sense to not constantly require new >>>> mmap's from user space just to access a now-shared page. >>>> >>>>> Note that we don't currently support this, so I >>>>> don't want to conflate this use case, but that hopefully makes it a >>>>> little more obvious that this is a "there is a page, but you don't >>>>> currently have the permission to access it" problem rather than "sorry >>>>> but we ran out of pages" problem. >>>> >>>> We could user other signals, at least as the semantics are clear and it's >>>> documented. Maybe SIGSEGV would be warranted. >>>> >>>> I consider that a minor detail, though. >>>> >>>> Requiring mmap()/munmap() dances just to access a page that is now shared >>>> from user space sounds a bit suboptimal. But I don't know all the details of >>>> the user space implementation. >>> >>> Agreed, if we could save having to mmap() each page that gets shared >>> back that would be a nice performance optimization. >>> >>>> "mmap() the whole thing once and only access what you are supposed to > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a >>>> signal. >>> >>> "... you get a signal, or maybe you don't". But yes I understand your >>> point, and as per the above there are real benefits to this approach so >>> why not. >>> >>> What do we expect userspace to do when a page goes from shared back to >>> being guest-private, because e.g. the guest decides to unshare? Use >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note >>> that this will be needed when starting a guest as well, as userspace >>> needs to copy the guest payload in the guestmem file prior to starting >>> the protected VM. >> >> Let's assume we have the whole guest_memfd mapped exactly once in our >> process, a single VMA. >> >> When setting up the VM, we'll write the payload and then fire up the VM. >> >> That will (I assume) trigger some shared -> private conversion. 
>> >> When we want to convert shared -> private in the kernel, we would first >> check if the page is currently mapped. If it is, we could try unmapping that >> page using an rmap walk. > > I had not considered that. That would most certainly be slow, but a well > behaved userspace process shouldn't hit it so, that's probably not a > problem... If there really only is a single VMA that covers the page (or even mmaps the guest_memfd), it should not be too bad. For example, any fallocate(PUNCHHOLE) has to do the same, to unmap the page before discarding it from the pagecache. But of course, no rmap walk is always better. > >> Then, we'd make sure that there are really no other references and protect >> against concurrent mapping of the page. Now we can convert the page to >> private. > > Right. > > Alternatively, the shared->private conversion happens in the KVM vcpu > run loop, so we'd be in a good position to exit the VCPU_RUN ioctl with a > new exit reason saying "can't donate that page while it's shared" and > have userspace use MADVISE_DONTNEED or munmap, or whatever on the back > of that. But I tend to prefer the rmap option if it's workable as that > avoids adding new KVM userspace ABI. As discussed in the sub-thread, that might still be required. One could think about completely forbidding GUP on these mmap'ed guest-memfds. But likely, there might be use cases in the future where you want to use GUP on shared memory inside a guest_memfd. (the iouring example I gave might currently not work because FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some details)
Hi David, ... >>>> "mmap() the whole thing once and only access what you are supposed to > > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > >>>> signal. > >>> > >>> "... you get a signal, or maybe you don't". But yes I understand your > >>> point, and as per the above there are real benefits to this approach so > >>> why not. > >>> > >>> What do we expect userspace to do when a page goes from shared back to > >>> being guest-private, because e.g. the guest decides to unshare? Use > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note > >>> that this will be needed when starting a guest as well, as userspace > >>> needs to copy the guest payload in the guestmem file prior to starting > >>> the protected VM. > >> > >> Let's assume we have the whole guest_memfd mapped exactly once in our > >> process, a single VMA. > >> > >> When setting up the VM, we'll write the payload and then fire up the VM. > >> > >> That will (I assume) trigger some shared -> private conversion. > >> > >> When we want to convert shared -> private in the kernel, we would first > >> check if the page is currently mapped. If it is, we could try unmapping that > >> page using an rmap walk. > > > > I had not considered that. That would most certainly be slow, but a well > > behaved userspace process shouldn't hit it so, that's probably not a > > problem... > > If there really only is a single VMA that covers the page (or even mmaps > the guest_memfd), it should not be too bad. For example, any > fallocate(PUNCHHOLE) has to do the same, to unmap the page before > discarding it from the pagecache. I don't think that we can assume that only a single VMA covers a page. > But of course, no rmap walk is always better. We've been thinking some more about how to handle the case where the host userspace has a mapping of a page that later becomes private. One idea is to refuse to run the guest (i.e., exit vcpu_run() to back to the host with a meaningful exit reason) until the host unmaps that page, and check for the refcount to the page as you mentioned earlier. This is essentially what the RFC I sent does (minus the bugs :) ) . The other idea is to use the rmap walk as you suggested to zap that page. If the host tries to access that page again, it would get a SIGBUS on the fault. This has the advantage that, as you'd mentioned, the host doesn't need to constantly mmap() and munmap() pages. It could potentially be optimised further as suggested if we have a cooperating VMM that would issue a MADV_DONTNEED or something like that, but that's just an optimisation and we would still need to have the option of the rmap walk. However, I was wondering how practical this idea would be if more than a single VMA covers a page? Also, there's the question of what to do if the page is gupped? In this case I think the only thing we can do is refuse to run the guest until the gup (and all references) are released, which also brings us back to the way things (kind of) are... Thanks, /fuad
On Thu, Feb 29, 2024 at 07:01:51PM +0000, Fuad Tabba wrote: > Hi David, > > ... > > >>>> "mmap() the whole thing once and only access what you are supposed to > > > (> > > access" sounds reasonable to me. If you don't play by the rules, you get a > > >>>> signal. > > >>> > > >>> "... you get a signal, or maybe you don't". But yes I understand your > > >>> point, and as per the above there are real benefits to this approach so > > >>> why not. > > >>> > > >>> What do we expect userspace to do when a page goes from shared back to > > >>> being guest-private, because e.g. the guest decides to unshare? Use > > >>> munmap() on that page? Or perhaps an madvise() call of some sort? Note > > >>> that this will be needed when starting a guest as well, as userspace > > >>> needs to copy the guest payload in the guestmem file prior to starting > > >>> the protected VM. > > >> > > >> Let's assume we have the whole guest_memfd mapped exactly once in our > > >> process, a single VMA. > > >> > > >> When setting up the VM, we'll write the payload and then fire up the VM. > > >> > > >> That will (I assume) trigger some shared -> private conversion. > > >> > > >> When we want to convert shared -> private in the kernel, we would first > > >> check if the page is currently mapped. If it is, we could try unmapping that > > >> page using an rmap walk. > > > > > > I had not considered that. That would most certainly be slow, but a well > > > behaved userspace process shouldn't hit it so, that's probably not a > > > problem... > > > > If there really only is a single VMA that covers the page (or even mmaps > > the guest_memfd), it should not be too bad. For example, any > > fallocate(PUNCHHOLE) has to do the same, to unmap the page before > > discarding it from the pagecache. > > I don't think that we can assume that only a single VMA covers a page. > > > But of course, no rmap walk is always better. > > We've been thinking some more about how to handle the case where the > host userspace has a mapping of a page that later becomes private. > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > to the host with a meaningful exit reason) until the host unmaps that > page, and check for the refcount to the page as you mentioned earlier. > This is essentially what the RFC I sent does (minus the bugs :) ) . > > The other idea is to use the rmap walk as you suggested to zap that > page. If the host tries to access that page again, it would get a > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > the host doesn't need to constantly mmap() and munmap() pages. It > could potentially be optimised further as suggested if we have a > cooperating VMM that would issue a MADV_DONTNEED or something like > that, but that's just an optimisation and we would still need to have > the option of the rmap walk. However, I was wondering how practical > this idea would be if more than a single VMA covers a page? > Agree with all your points here. I changed Gunyah's implementation to do the unmap instead of erroring out. I didn't observe a significant performance difference. However, doing unmap might be a little faster because we can check folio_mapped() before doing the rmap walk. When erroring out at mmap() level, we always have to do the walk. > Also, there's the question of what to do if the page is gupped? 
In > this case I think the only thing we can do is refuse to run the guest > until the gup (and all references) are released, which also brings us > back to the way things (kind of) are... > If there are gup users who don't do FOLL_PIN, I think we either need to fix them or live with the possibility here? We don't have a reliable refcount for a folio to be safe to unmap: it might be that another vCPU is trying to get the same page, has incremented the refcount, and is waiting for the folio_lock. This problem exists whether we block the mmap() or do SIGBUS. Thanks, Elliot
On 29.02.24 20:01, Fuad Tabba wrote: > Hi David, > > ... > Hi! >>>>> "mmap() the whole thing once and only access what you are supposed to >>> (> > > access" sounds reasonable to me. If you don't play by the rules, you get a >>>>>> signal. >>>>> >>>>> "... you get a signal, or maybe you don't". But yes I understand your >>>>> point, and as per the above there are real benefits to this approach so >>>>> why not. >>>>> >>>>> What do we expect userspace to do when a page goes from shared back to >>>>> being guest-private, because e.g. the guest decides to unshare? Use >>>>> munmap() on that page? Or perhaps an madvise() call of some sort? Note >>>>> that this will be needed when starting a guest as well, as userspace >>>>> needs to copy the guest payload in the guestmem file prior to starting >>>>> the protected VM. >>>> >>>> Let's assume we have the whole guest_memfd mapped exactly once in our >>>> process, a single VMA. >>>> >>>> When setting up the VM, we'll write the payload and then fire up the VM. >>>> >>>> That will (I assume) trigger some shared -> private conversion. >>>> >>>> When we want to convert shared -> private in the kernel, we would first >>>> check if the page is currently mapped. If it is, we could try unmapping that >>>> page using an rmap walk. >>> >>> I had not considered that. That would most certainly be slow, but a well >>> behaved userspace process shouldn't hit it so, that's probably not a >>> problem... >> >> If there really only is a single VMA that covers the page (or even mmaps >> the guest_memfd), it should not be too bad. For example, any >> fallocate(PUNCHHOLE) has to do the same, to unmap the page before >> discarding it from the pagecache. > > I don't think that we can assume that only a single VMA covers a page. > >> But of course, no rmap walk is always better. > > We've been thinking some more about how to handle the case where the > host userspace has a mapping of a page that later becomes private. > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > to the host with a meaningful exit reason) until the host unmaps that > page, and check for the refcount to the page as you mentioned earlier. > This is essentially what the RFC I sent does (minus the bugs :) ) . :) > > The other idea is to use the rmap walk as you suggested to zap that > page. If the host tries to access that page again, it would get a > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > the host doesn't need to constantly mmap() and munmap() pages. It > could potentially be optimised further as suggested if we have a > cooperating VMM that would issue a MADV_DONTNEED or something like > that, but that's just an optimisation and we would still need to have > the option of the rmap walk. However, I was wondering how practical > this idea would be if more than a single VMA covers a page? A few VMAs won't make a difference I assume. Any idea how many "sharers" you'd expect in a sane configuration? > > Also, there's the question of what to do if the page is gupped? In > this case I think the only thing we can do is refuse to run the guest > until the gup (and all references) are released, which also brings us > back to the way things (kind of) are... Indeed. There is no way you could possibly make progress.
>> I don't think that we can assume that only a single VMA covers a page. >> >>> But of course, no rmap walk is always better. >> >> We've been thinking some more about how to handle the case where the >> host userspace has a mapping of a page that later becomes private. >> >> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back >> to the host with a meaningful exit reason) until the host unmaps that >> page, and check for the refcount to the page as you mentioned earlier. >> This is essentially what the RFC I sent does (minus the bugs :) ) . >> >> The other idea is to use the rmap walk as you suggested to zap that >> page. If the host tries to access that page again, it would get a >> SIGBUS on the fault. This has the advantage that, as you'd mentioned, >> the host doesn't need to constantly mmap() and munmap() pages. It >> could potentially be optimised further as suggested if we have a >> cooperating VMM that would issue a MADV_DONTNEED or something like >> that, but that's just an optimisation and we would still need to have >> the option of the rmap walk. However, I was wondering how practical >> this idea would be if more than a single VMA covers a page? >> > > Agree with all your points here. I changed Gunyah's implementation to do > the unmap instead of erroring out. I didn't observe a significant > performance difference. However, doing unmap might be a little faster > because we can check folio_mapped() before doing the rmap walk. When > erroring out at mmap() level, we always have to do the walk. Right. On the mmap() level you won't really have to walk page tables, as the the munmap() already zapped the page and removed the "problematic" VMA. Likely, you really want to avoid repeatedly calling mmap()+munmap() just to access shared memory; but that's just my best guess about your user space app :) > >> Also, there's the question of what to do if the page is gupped? In >> this case I think the only thing we can do is refuse to run the guest >> until the gup (and all references) are released, which also brings us >> back to the way things (kind of) are... >> > > If there are gup users who don't do FOLL_PIN, I think we either need to > fix them or live with possibility here? We don't have a reliable > refcount for a folio to be safe to unmap: it might be that another vCPU > is trying to get the same page, has incremented the refcount, and > waiting for the folio_lock. Likely there could be a way to detect that when only the vCPUs are your concern? But yes, it's nasty. (has to be handled in either case :() Disallowing any FOLL_GET|FOLL_PIN could work. Not sure how some core-kernel FOLL_GET users would react to that, though. See vma_is_secretmem() and folio_is_secretmem() in mm/gup.c, where we disallow any FOLL_GET|FOLL_PIN of secretmem pages. We'd need a way to teach core-mm similarly about guest_memfd, which might end up rather tricky, but not impossible :) > This problem exists whether we block the > mmap() or do SIGBUS. There is work on doing more conversion to FOLL_PIN, but some cases are harder to convert. Most of O_DIRECT should be using it nowadays, but some other known use cases don't. The simplest and readily-available example is still vmsplice(). I don't think it was fixed yet to use FOLL_PIN. Use vmsplice() to pin the page in the pipe (read-only). Unmap the VMA. You can read the page any time later by reading from the pipe. So I wouldn't bet on all relevant cases being gone in the near future.
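The vmsplice() case mentioned above is easy to reproduce from userspace; a minimal sketch (error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	char buf[16];
	struct iovec iov;
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	pipe(pipefd);
	strcpy(page, "secret");

	iov.iov_base = page;
	iov.iov_len = 4096;
	vmsplice(pipefd[1], &iov, 1, 0);	/* takes a reference on the page */

	munmap(page, 4096);			/* no VMA covers the page anymore */

	read(pipefd[0], buf, sizeof(buf));	/* ...yet its contents are still readable */
	return 0;
}

Unlike the io_uring fixed-buffer case, vmsplice() still takes plain references rather than FOLL_PIN pins, which is why checking only folio_maybe_dma_pinned() would not be enough.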
On Thursday 29 Feb 2024 at 11:04:09 (+0100), David Hildenbrand wrote: > An alternative would be to remember in pKVM that a page needs a wipe before > reaccess. Once re-accessed by anybody (hypervisor or new guest), it first > has to be wiped by pKVM. > > ... but that also sounds complicated and similar requires the linear > map+unmap in pKVM page-by-page as they are reused. But at least a guest > shutdown would be faster. Yep, FWIW we did try that, but ended up having issues with Linux trying to DMA to these pages before 'touching' them from the CPU side. pKVM can keep the pages unmapped from the CPU and IOMMU stage-2 page-tables, and we can easily handle the CPU faults lazily, but not faults from other masters, our hardware generally doesn't support that. <snip> > As discussed in the sub-thread, that might still be required. > > One could think about completely forbidding GUP on these mmap'ed > guest-memfds. But likely, there might be use cases in the future where you > want to use GUP on shared memory inside a guest_memfd. > > (the iouring example I gave might currently not work because > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > details) Perhaps it would be wise to start with GUP being forbidden if the current users do not need it (not sure if that is the case in Android, I'll check) ? We can always relax this constraint later when/if the use-cases arise, which is obviously much harder to do the other way around. Thanks, Quentin
On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote: > > > I don't think that we can assume that only a single VMA covers a page. > > > > > > > But of course, no rmap walk is always better. > > > > > > We've been thinking some more about how to handle the case where the > > > host userspace has a mapping of a page that later becomes private. > > > > > > One idea is to refuse to run the guest (i.e., exit vcpu_run() to back > > > to the host with a meaningful exit reason) until the host unmaps that > > > page, and check for the refcount to the page as you mentioned earlier. > > > This is essentially what the RFC I sent does (minus the bugs :) ) . > > > > > > The other idea is to use the rmap walk as you suggested to zap that > > > page. If the host tries to access that page again, it would get a > > > SIGBUS on the fault. This has the advantage that, as you'd mentioned, > > > the host doesn't need to constantly mmap() and munmap() pages. It > > > could potentially be optimised further as suggested if we have a > > > cooperating VMM that would issue a MADV_DONTNEED or something like > > > that, but that's just an optimisation and we would still need to have > > > the option of the rmap walk. However, I was wondering how practical > > > this idea would be if more than a single VMA covers a page? > > > > > > > Agree with all your points here. I changed Gunyah's implementation to do > > the unmap instead of erroring out. I didn't observe a significant > > performance difference. However, doing unmap might be a little faster > > because we can check folio_mapped() before doing the rmap walk. When > > erroring out at mmap() level, we always have to do the walk. > > Right. On the mmap() level you won't really have to walk page tables, as the > the munmap() already zapped the page and removed the "problematic" VMA. > > Likely, you really want to avoid repeatedly calling mmap()+munmap() just to > access shared memory; but that's just my best guess about your user space > app :) Ack, and expecting userspace to munmap the pages whenever we hit a valid mapping in userspace page-tables in the KVM faults path makes for a somewhat unusual interface IMO. Userspace can munmap, mmap again, and if it doesn't touch the pages, it can proceed to run the guest just fine, is that the expectation? If so, it feels like we're 'leaking' internal kernel state somehow. The kernel is normally well within its rights to zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a weird case, but even in that case, IIRC the kernel still has a certain amount of flexibility and can use compaction and friends). Similarly, it should be well within its right to proactively create them. How would this scheme work if, 10 years from now, something like Speculative Page Faults makes it into the kernel in a different form? Not requiring to userspace to unmap makes the userspace interface a lot simpler I think -- once a protected guest starts, you better not touch its memory if it's not been shared back or you'll get slapped on the wrist. Whether or not those pages have been accessed beforehand for example is irrelevant.
On Mon, Mar 04, 2024, Quentin Perret wrote: > > As discussed in the sub-thread, that might still be required. > > > > One could think about completely forbidding GUP on these mmap'ed > > guest-memfds. But likely, there might be use cases in the future where you > > want to use GUP on shared memory inside a guest_memfd. > > > > (the iouring example I gave might currently not work because > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > details) > > Perhaps it would be wise to start with GUP being forbidden if the > current users do not need it (not sure if that is the case in Android, > I'll check) ? We can always relax this constraint later when/if the > use-cases arise, which is obviously much harder to do the other way > around. +1000. At least on the KVM side, I would like to be as conservative as possible when it comes to letting anything other than the guest access guest_memfd.
On 04.03.24 20:04, Sean Christopherson wrote: > On Mon, Mar 04, 2024, Quentin Perret wrote: >>> As discussed in the sub-thread, that might still be required. >>> >>> One could think about completely forbidding GUP on these mmap'ed >>> guest-memfds. But likely, there might be use cases in the future where you >>> want to use GUP on shared memory inside a guest_memfd. >>> >>> (the iouring example I gave might currently not work because >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>> details) >> >> Perhaps it would be wise to start with GUP being forbidden if the >> current users do not need it (not sure if that is the case in Android, >> I'll check) ? We can always relax this constraint later when/if the >> use-cases arise, which is obviously much harder to do the other way >> around. > > +1000. At least on the KVM side, I would like to be as conservative as possible > when it comes to letting anything other than the guest access guest_memfd. So we'll have to do it similarly to any occurrences of "secretmem" in gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code similar to e.g., folio_is_secretmem(). IIRC, we might not be able to de-reference the actual mapping because it could get freed concurrently ... That will then prohibit any kind of GUP access to these pages, including reading/writing for ptrace/debugging purposes, for core dumping purposes etc. But at least, you know that nobody was able to obtain page references using GUP that might be used for reading/writing later.
On 04.03.24 13:53, Quentin Perret wrote: > On Friday 01 Mar 2024 at 12:16:54 (+0100), David Hildenbrand wrote: >>>> I don't think that we can assume that only a single VMA covers a page. >>>> >>>>> But of course, no rmap walk is always better. >>>> >>>> We've been thinking some more about how to handle the case where the >>>> host userspace has a mapping of a page that later becomes private. >>>> >>>> One idea is to refuse to run the guest (i.e., exit vcpu_run() to back >>>> to the host with a meaningful exit reason) until the host unmaps that >>>> page, and check for the refcount to the page as you mentioned earlier. >>>> This is essentially what the RFC I sent does (minus the bugs :) ) . >>>> >>>> The other idea is to use the rmap walk as you suggested to zap that >>>> page. If the host tries to access that page again, it would get a >>>> SIGBUS on the fault. This has the advantage that, as you'd mentioned, >>>> the host doesn't need to constantly mmap() and munmap() pages. It >>>> could potentially be optimised further as suggested if we have a >>>> cooperating VMM that would issue a MADV_DONTNEED or something like >>>> that, but that's just an optimisation and we would still need to have >>>> the option of the rmap walk. However, I was wondering how practical >>>> this idea would be if more than a single VMA covers a page? >>>> >>> >>> Agree with all your points here. I changed Gunyah's implementation to do >>> the unmap instead of erroring out. I didn't observe a significant >>> performance difference. However, doing unmap might be a little faster >>> because we can check folio_mapped() before doing the rmap walk. When >>> erroring out at mmap() level, we always have to do the walk. >> >> Right. On the mmap() level you won't really have to walk page tables, as the >> the munmap() already zapped the page and removed the "problematic" VMA. >> >> Likely, you really want to avoid repeatedly calling mmap()+munmap() just to >> access shared memory; but that's just my best guess about your user space >> app :) > > Ack, and expecting userspace to munmap the pages whenever we hit a valid > mapping in userspace page-tables in the KVM faults path makes for a > somewhat unusual interface IMO. Userspace can munmap, mmap again, and if > it doesn't touch the pages, it can proceed to run the guest just fine, > is that the expectation? If so, it feels like we're 'leaking' internal It would be weird, and I would not suggest that. It's either (1) you can leave it mmap'ed, but any access to private memory will SIGBUS. The kernel will try zapping pages inside a VMA itself. (2) you cannot leave it mmap'ed. In order to convert shared -> private, you have to munmap. mmap will fail if it would cover a currently-private page. So for (1) you could mmap once in user space and be done with it. For (2) you would have to mmap+munmap when accessing shared memory. > kernel state somehow. The kernel is normally well within its rights to > zap userspace mappings if it wants to e.g. swap. (Obviously mlock is a > weird case, but even in that case, IIRC the kernel still has a certain > amount of flexibility and can use compaction and friends). Similarly, > it should be well within its right to proactively create them. How > would this scheme work if, 10 years from now, something like > Speculative Page Faults makes it into the kernel in a different form? At least with (2), speculative page faults would never generate a SIGBUS. Just like HW must not generate a fault on speculative access. 
> > Not requiring to userspace to unmap makes the userspace interface a lot > simpler I think -- once a protected guest starts, you better not touch > its memory if it's not been shared back or you'll get slapped on the > wrist. Whether or not those pages have been accessed beforehand for > example is irrelevant. Yes. So the theory :)
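For concreteness, a minimal sketch of what option (1) could look like on the guest_memfd side, with made-up helper names (kvm_gmem_is_shared() and the shared/private tracking behind it are assumptions, not existing KVM code): the shared->private conversion zaps any host userspace PTEs, and the fault handler refuses to fault private memory back in.

static void kvm_gmem_zap_for_private(struct inode *inode, pgoff_t index)
{
    struct folio *folio = filemap_lock_folio(inode->i_mapping, index);

    if (IS_ERR(folio))
        return;

    /* As noted above, folio_mapped() lets us skip the rmap walk. */
    if (folio_mapped(folio))
        unmap_mapping_pages(inode->i_mapping, index, 1, false);

    folio_unlock(folio);
    folio_put(folio);
}

static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
{
    struct inode *inode = file_inode(vmf->vma->vm_file);

    /* Host access to currently-private memory gets a SIGBUS. */
    if (!kvm_gmem_is_shared(inode, vmf->pgoff))
        return VM_FAULT_SIGBUS;

    return filemap_fault(vmf);
}

With something like this, userspace can keep a single long-lived mapping of the whole guest_memfd, which is the property Quentin argues for above.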
On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: > On 04.03.24 20:04, Sean Christopherson wrote: > > On Mon, Mar 04, 2024, Quentin Perret wrote: > > > > As discussed in the sub-thread, that might still be required. > > > > > > > > One could think about completely forbidding GUP on these mmap'ed > > > > guest-memfds. But likely, there might be use cases in the future where you > > > > want to use GUP on shared memory inside a guest_memfd. > > > > > > > > (the iouring example I gave might currently not work because > > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > > > details) > > > > > > Perhaps it would be wise to start with GUP being forbidden if the > > > current users do not need it (not sure if that is the case in Android, > > > I'll check) ? We can always relax this constraint later when/if the > > > use-cases arise, which is obviously much harder to do the other way > > > around. > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > when it comes to letting anything other than the guest access guest_memfd. > > So we'll have to do it similar to any occurrences of "secretmem" in gup.c. > We'll have to see how to marry KVM guest_memfd with core-mm code similar to > e.g., folio_is_secretmem(). > > IIRC, we might not be able to de-reference the actual mapping because it > could get free concurrently ... > > That will then prohibit any kind of GUP access to these pages, including > reading/writing for ptrace/debugging purposes, for core dumping purposes > etc. But at least, you know that nobody was able to optain page references > using GUP that might be used for reading/writing later. Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and replacing folio_is_secretmem() with a test of this bit instead of comparing the a_ops? I think it scales better. Thanks, Elliot
On 04.03.24 22:43, Elliot Berman wrote: > On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: >> On 04.03.24 20:04, Sean Christopherson wrote: >>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>> As discussed in the sub-thread, that might still be required. >>>>> >>>>> One could think about completely forbidding GUP on these mmap'ed >>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>> want to use GUP on shared memory inside a guest_memfd. >>>>> >>>>> (the iouring example I gave might currently not work because >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>> details) >>>> >>>> Perhaps it would be wise to start with GUP being forbidden if the >>>> current users do not need it (not sure if that is the case in Android, >>>> I'll check) ? We can always relax this constraint later when/if the >>>> use-cases arise, which is obviously much harder to do the other way >>>> around. >>> >>> +1000. At least on the KVM side, I would like to be as conservative as possible >>> when it comes to letting anything other than the guest access guest_memfd. >> >> So we'll have to do it similar to any occurrences of "secretmem" in gup.c. >> We'll have to see how to marry KVM guest_memfd with core-mm code similar to >> e.g., folio_is_secretmem(). >> >> IIRC, we might not be able to de-reference the actual mapping because it >> could get free concurrently ... >> >> That will then prohibit any kind of GUP access to these pages, including >> reading/writing for ptrace/debugging purposes, for core dumping purposes >> etc. But at least, you know that nobody was able to optain page references >> using GUP that might be used for reading/writing later. > > Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and > replacing folio_is_secretmem() with a test of this bit instead of > comparing the a_ops? I think it scales better. The only concern I have are races, but let's look into the details: In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs etc. We had a similar discussion recently about possible races. It's documented in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. "inodes and thus their mappings are freed under RCU, which means the mapping cannot be freed beneath us and thus we can safely dereference it." So if we follow the same rules as folio_fast_pin_allowed(), we can de-reference folio->mapping, for example comparing mapping->a_ops. [folio_is_secretmem should better follow the same approach] So likely that should just work!
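As a concrete strawman of Elliot's AS_NOGUP idea combined with the folio_fast_pin_allowed() rules David describes: AS_NOGUP and the helper name below are hypothetical, not existing kernel API.

/* Would sit next to folio_is_secretmem() and be called from GUP-fast. */
static inline bool folio_is_gup_forbidden(struct folio *folio)
{
    struct address_space *mapping;

    /*
     * GUP-fast runs with IRQs disabled, and inodes (and thus their
     * mappings) are freed under RCU, so the mapping cannot be freed
     * beneath us while we peek at it; see folio_fast_pin_allowed().
     */
    mapping = READ_ONCE(folio->mapping);
    if (!mapping || ((unsigned long)mapping & PAGE_MAPPING_FLAGS))
        return false;   /* anon, KSM or movable: not a pagecache mapping */

    return test_bit(AS_NOGUP, &mapping->flags);
}

guest_memfd (and secretmem, if converted over) would then set the flag once at inode creation time, e.g. via set_bit(AS_NOGUP, &inode->i_mapping->flags).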
On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: > > On 04.03.24 20:04, Sean Christopherson wrote: > > On Mon, Mar 04, 2024, Quentin Perret wrote: > >>> As discussed in the sub-thread, that might still be required. > >>> > >>> One could think about completely forbidding GUP on these mmap'ed > >>> guest-memfds. But likely, there might be use cases in the future where you > >>> want to use GUP on shared memory inside a guest_memfd. > >>> > >>> (the iouring example I gave might currently not work because > >>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > >>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > >>> details) > >> > >> Perhaps it would be wise to start with GUP being forbidden if the > >> current users do not need it (not sure if that is the case in Android, > >> I'll check) ? We can always relax this constraint later when/if the > >> use-cases arise, which is obviously much harder to do the other way > >> around. > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > when it comes to letting anything other than the guest access guest_memfd. > > So we'll have to do it similar to any occurrences of "secretmem" in > gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code > similar to e.g., folio_is_secretmem(). > > IIRC, we might not be able to de-reference the actual mapping because it > could get free concurrently ... > > That will then prohibit any kind of GUP access to these pages, including > reading/writing for ptrace/debugging purposes, for core dumping purposes > etc. But at least, you know that nobody was able to optain page > references using GUP that might be used for reading/writing later. > There has been little discussion about supporting 1G pages with guest_memfd for TDX/SNP or pKVM. I would like to restart this discussion [1]. 1G pages should be a very important usecase for guest memfd, especially considering large VM sizes supporting confidential GPU/TPU workloads. Using separate backing stores for private and shared memory ranges is not going to work effectively when using 1G pages. Consider the following scenario of memory conversion when using 1G pages to back private memory: * Guest requests conversion of 4KB range from private to shared, host in response ideally does following steps: a) Updates the guest memory attributes b) Unbacks the corresponding private memory c) Allocates corresponding shared memory or let it be faulted in when guest accesses it Step b above can't be skipped here, otherwise we would have two physical pages (1 backing private memory, another backing the shared memory) for the same GPA range causing "double allocation". With 1G pages, it would be difficult to punch KBs or even MBs sized hole since to support that: 1G page would need to be split (which hugetlbfs doesn't support today because of right reasons), causing - - loss of vmemmap optimization [3] - losing ability to reconstitute the huge page again, especially as private pages in CVMs are not relocatable today, increasing overall fragmentation over time. - unless a smarter algorithm is devised for memory reclaim to reconstitute large pages for unmovable memory. 
With the above limitations in place, best thing could be to allow: - single backing store for both shared and private memory ranges - host userspace to mmap the guest memfd (as this series is trying to do) - allow userspace to fault in memfd file ranges that correspond to shared GPA ranges - pagetable mappings will need to be restricted to shared memory ranges causing higher granularity mappings (somewhat similar to what HGM series from James [2] was trying to do) than 1G. - Allow IOMMU also to map those pages (pfns would be requested using get_user_pages* APIs) to allow devices to access shared memory. IOMMU management code would have to be enlightened or somehow restricted to map only shared regions of guest memfd. - Upon conversion from shared to private, host will have to ensure that there are no mappings/references present for the memory ranges being converted to private. If the above usecase sounds reasonable, GUP access to guest memfd pages should be allowed. [1] https://lore.kernel.org/lkml/CAGtprH_H1afUJ2cUnznWqYLTZVuEcOogRwXF6uBAeHbLMQsrsQ@mail.gmail.com/ [2] https://lore.kernel.org/lkml/20230218002819.1486479-2-jthoughton@google.com/ [3] https://docs.kernel.org/mm/vmemmap_dedup.html > -- > Cheers, > > David / dhildenb >
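For reference, steps (a) and (b) of the conversion flow described above already have a concrete shape in the merged guest_memfd ABI; a rough userspace sketch (vm_fd, gmem_fd and the GPA<->offset relationship are assumed to be set up elsewhere, and error handling is minimal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int convert_private_to_shared(int vm_fd, int gmem_fd, __u64 gpa,
                                     __u64 gmem_offset, __u64 size)
{
    struct kvm_memory_attributes attr = {
        .address = gpa,
        .size = size,
        .attributes = 0,    /* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
    };

    /* (a) update the guest memory attributes */
    if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr))
        return -1;

    /* (b) unback the private memory to avoid the "double allocation" */
    return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     gmem_offset, size);
}

Step (c) then happens naturally when the shared alias is touched. The hole punch in step (b) is exactly where 1G backing would force a split, which is the problem being described.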
On 18.03.24 18:06, Vishal Annapurve wrote: > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 04.03.24 20:04, Sean Christopherson wrote: >>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>> As discussed in the sub-thread, that might still be required. >>>>> >>>>> One could think about completely forbidding GUP on these mmap'ed >>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>> want to use GUP on shared memory inside a guest_memfd. >>>>> >>>>> (the iouring example I gave might currently not work because >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>> details) >>>> >>>> Perhaps it would be wise to start with GUP being forbidden if the >>>> current users do not need it (not sure if that is the case in Android, >>>> I'll check) ? We can always relax this constraint later when/if the >>>> use-cases arise, which is obviously much harder to do the other way >>>> around. >>> >>> +1000. At least on the KVM side, I would like to be as conservative as possible >>> when it comes to letting anything other than the guest access guest_memfd. >> >> So we'll have to do it similar to any occurrences of "secretmem" in >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code >> similar to e.g., folio_is_secretmem(). >> >> IIRC, we might not be able to de-reference the actual mapping because it >> could get free concurrently ... >> >> That will then prohibit any kind of GUP access to these pages, including >> reading/writing for ptrace/debugging purposes, for core dumping purposes >> etc. But at least, you know that nobody was able to optain page >> references using GUP that might be used for reading/writing later. >> > > There has been little discussion about supporting 1G pages with > guest_memfd for TDX/SNP or pKVM. I would like to restart this > discussion [1]. 1G pages should be a very important usecase for guest > memfd, especially considering large VM sizes supporting confidential > GPU/TPU workloads. > > Using separate backing stores for private and shared memory ranges is > not going to work effectively when using 1G pages. Consider the > following scenario of memory conversion when using 1G pages to back > private memory: > * Guest requests conversion of 4KB range from private to shared, host > in response ideally does following steps: > a) Updates the guest memory attributes > b) Unbacks the corresponding private memory > c) Allocates corresponding shared memory or let it be faulted in > when guest accesses it > > Step b above can't be skipped here, otherwise we would have two > physical pages (1 backing private memory, another backing the shared > memory) for the same GPA range causing "double allocation". > > With 1G pages, it would be difficult to punch KBs or even MBs sized > hole since to support that: > 1G page would need to be split (which hugetlbfs doesn't support today > because of right reasons), causing - > - loss of vmemmap optimization [3] > - losing ability to reconstitute the huge page again, > especially as private pages in CVMs are not relocatable today, > increasing overall fragmentation over time. > - unless a smarter algorithm is devised for memory > reclaim to reconstitute large pages for unmovable memory. 
> > With the above limitations in place, best thing could be to allow: > - single backing store for both shared and private memory ranges > - host userspace to mmap the guest memfd (as this series is trying to do) > - allow userspace to fault in memfd file ranges that correspond to > shared GPA ranges > - pagetable mappings will need to be restricted to shared memory > ranges causing higher granularity mappings (somewhat similar to what > HGM series from James [2] was trying to do) than 1G. > - Allow IOMMU also to map those pages (pfns would be requested using > get_user_pages* APIs) to allow devices to access shared memory. IOMMU > management code would have to be enlightened or somehow restricted to > map only shared regions of guest memfd. > - Upon conversion from shared to private, host will have to ensure > that there are no mappings/references present for the memory ranges > being converted to private. > > If the above usecase sounds reasonable, GUP access to guest memfd > pages should be allowed. To say it with nice words: "Not a fan". First, I don't think only 1 GiB will be problematic. Already 2 MiB ones will be problematic and so far it is even unclear how guest_memfd will consume them in a way acceptable to upstream MM. Likely not using hugetlb from what I recall after the previous discussions with Mike. Second, we should find better ways to let an IOMMU map these pages, *not* using GUP. There were already discussions on providing a similar fd+offset-style interface instead. GUP really sounds like the wrong approach here. Maybe we should look into passing not only guest_memfd, but also "ordinary" memfds. Third, I don't think we should be using huge pages where huge pages don't make any sense. Using a 1 GiB page so the VM will convert some pieces to map it using PTEs will destroy the whole purpose of using 1 GiB pages. It doesn't make any sense. A direction that might make sense is either (A) enlightening the VM about the granularity in which memory can be converted (but also problematic for 1 GiB pages) and/or (B) physically restricting the memory that can be converted. For example, one could create a GPA layout where some regions are backed by gigantic pages that cannot be converted/can only be converted as a whole, and some are backed by 4k pages that can be converted back and forth. We'd use multiple guest_memfds for that. I recall that physically restricting such conversions/locations (e.g., for bounce buffers) in Linux was already discussed somewhere, but I don't recall the details. It's all not trivial and not easy to get "clean". Concluding that individual pieces of a 1 GiB / 2 MiB huge page should not be converted back and forth might be reasonable. Although I'm sure people will argue the opposite and develop hackish solutions in desperate ways to make it work somehow. Huge pages, and especially gigantic pages, are simply a bad fit if the VM will convert individual 4k pages. But to answer your last question: we might be able to avoid GUP by using a different mapping API, similar to the ones KVM now provides.
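To make option (B) a bit more concrete, a userspace sketch using the existing KVM_CREATE_GUEST_MEMFD / KVM_SET_USER_MEMORY_REGION2 ABI to build such a GPA layout out of two guest_memfds. Slot numbers, sizes and GPAs are arbitrary, and how each guest_memfd would actually end up backed by gigantic vs. 4k pages is precisely the open question in this thread.

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int add_gmem_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
                         void *shared_hva)
{
    struct kvm_create_guest_memfd gmem = { .size = size };
    struct kvm_userspace_memory_region2 region = {
        .slot = slot,
        .flags = KVM_MEM_GUEST_MEMFD,
        .guest_phys_addr = gpa,
        .memory_size = size,
        .userspace_addr = (__u64)(unsigned long)shared_hva,
    };
    int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

    if (gmem_fd < 0)
        return gmem_fd;

    region.guest_memfd = gmem_fd;
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}

/* A large region that is only ever converted as a whole ...           */
/*   add_gmem_slot(vm_fd, 1, 1ULL << 32, 1ULL << 30, big_shared_buf);  */
/* ... and a small one for fine-grained shared<->private conversion.   */
/*   add_gmem_slot(vm_fd, 2, 1ULL << 31, 64ULL << 20, small_shared_buf); */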
On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > On 18.03.24 18:06, Vishal Annapurve wrote: > > On Mon, Mar 4, 2024 at 12:17 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 04.03.24 20:04, Sean Christopherson wrote: > >>> On Mon, Mar 04, 2024, Quentin Perret wrote: > >>>>> As discussed in the sub-thread, that might still be required. > >>>>> > >>>>> One could think about completely forbidding GUP on these mmap'ed > >>>>> guest-memfds. But likely, there might be use cases in the future where you > >>>>> want to use GUP on shared memory inside a guest_memfd. > >>>>> > >>>>> (the iouring example I gave might currently not work because > >>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > >>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > >>>>> details) > >>>> > >>>> Perhaps it would be wise to start with GUP being forbidden if the > >>>> current users do not need it (not sure if that is the case in Android, > >>>> I'll check) ? We can always relax this constraint later when/if the > >>>> use-cases arise, which is obviously much harder to do the other way > >>>> around. > >>> > >>> +1000. At least on the KVM side, I would like to be as conservative as possible > >>> when it comes to letting anything other than the guest access guest_memfd. > >> > >> So we'll have to do it similar to any occurrences of "secretmem" in > >> gup.c. We'll have to see how to marry KVM guest_memfd with core-mm code > >> similar to e.g., folio_is_secretmem(). > >> > >> IIRC, we might not be able to de-reference the actual mapping because it > >> could get free concurrently ... > >> > >> That will then prohibit any kind of GUP access to these pages, including > >> reading/writing for ptrace/debugging purposes, for core dumping purposes > >> etc. But at least, you know that nobody was able to optain page > >> references using GUP that might be used for reading/writing later. > >> > > > > There has been little discussion about supporting 1G pages with > > guest_memfd for TDX/SNP or pKVM. I would like to restart this > > discussion [1]. 1G pages should be a very important usecase for guest > > memfd, especially considering large VM sizes supporting confidential > > GPU/TPU workloads. > > > > Using separate backing stores for private and shared memory ranges is > > not going to work effectively when using 1G pages. Consider the > > following scenario of memory conversion when using 1G pages to back > > private memory: > > * Guest requests conversion of 4KB range from private to shared, host > > in response ideally does following steps: > > a) Updates the guest memory attributes > > b) Unbacks the corresponding private memory > > c) Allocates corresponding shared memory or let it be faulted in > > when guest accesses it > > > > Step b above can't be skipped here, otherwise we would have two > > physical pages (1 backing private memory, another backing the shared > > memory) for the same GPA range causing "double allocation". > > > > With 1G pages, it would be difficult to punch KBs or even MBs sized > > hole since to support that: > > 1G page would need to be split (which hugetlbfs doesn't support today > > because of right reasons), causing - > > - loss of vmemmap optimization [3] > > - losing ability to reconstitute the huge page again, > > especially as private pages in CVMs are not relocatable today, > > increasing overall fragmentation over time. 
> > - unless a smarter algorithm is devised for memory > > reclaim to reconstitute large pages for unmovable memory. > > > > With the above limitations in place, best thing could be to allow: > > - single backing store for both shared and private memory ranges > > - host userspace to mmap the guest memfd (as this series is trying to do) > > - allow userspace to fault in memfd file ranges that correspond to > > shared GPA ranges > > - pagetable mappings will need to be restricted to shared memory > > ranges causing higher granularity mappings (somewhat similar to what > > HGM series from James [2] was trying to do) than 1G. > > - Allow IOMMU also to map those pages (pfns would be requested using > > get_user_pages* APIs) to allow devices to access shared memory. IOMMU > > management code would have to be enlightened or somehow restricted to > > map only shared regions of guest memfd. > > - Upon conversion from shared to private, host will have to ensure > > that there are no mappings/references present for the memory ranges > > being converted to private. > > > > If the above usecase sounds reasonable, GUP access to guest memfd > > pages should be allowed. > > To say it with nice words: "Not a fan". > > First, I don't think only 1 GiB will be problematic. Already 2 MiB ones > will be problematic and so far it is even unclear how guest_memfd will > consume them in a way acceptable to upstream MM. Likely not using > hugetlb from what I recall after the previous discussions with Mike. > Agree, the support for 1G pages with guest memfd is yet to be figured out, but it remains a scenario to be considered here. > Second, we should find better ways to let an IOMMU map these pages, > *not* using GUP. There were already discussions on providing a similar > fd+offset-style interface instead. GUP really sounds like the wrong > approach here. Maybe we should look into passing not only guest_memfd, > but also "ordinary" memfds. I need to dig into past discussions around this, but agree that passing guest memfds to VFIO drivers in addition to HVAs seems worth exploring. This may be required anyways for devices supporting TDX connect [1]. If we are talking about the same file catering to both private and shared memory, there has to be some way to keep track of references on the shared memory from both host userspace and IOMMU. > > Third, I don't think we should be using huge pages where huge pages > don't make any sense. Using a 1 GiB page so the VM will convert some > pieces to map it using PTEs will destroy the whole purpose of using 1 > GiB pages. It doesn't make any sense. I had started a discussion for this [2] using an RFC series. Main challenge here remain: 1) Unifying all the conversions under one layer 2) Ensuring shared memory allocations are huge page aligned at boot time and runtime. Using any kind of unified shared memory allocator (today this part is played by SWIOTLB) will need to support huge page aligned dynamic increments, which can be only guaranteed by carving out enough memory at boot time for CMA area and using CMA area for allocation at runtime. - Since it's hard to come up with a maximum amount of shared memory needed by VM, especially with GPUs/TPUs around, it's difficult to come up with CMA area size at boot time. I think it's arguable that even if a VM converts 10 % of its memory to shared using 4k granularity, we still have fewer page table walks on the rest of the memory when using 1G/2M pages, which is a significant portion. 
> > A direction that might make sense is either (A) enlighting the VM about > the granularity in which memory can be converted (but also problematic > for 1 GiB pages) and/or (B) physically restricting the memory that can > be converted. Physically restricting the memory will still need a safe maximum bound to be calculated based on all the shared memory usecases that VM can encounter. > > For example, one could create a GPA layout where some regions are backed > by gigantic pages that cannot be converted/can only be converted as a > whole, and some are backed by 4k pages that can be converted back and > forth. We'd use multiple guest_memfds for that. I recall that physically > restricting such conversions/locations (e.g., for bounce buffers) in > Linux was already discussed somewhere, but I don't recall the details. > > It's all not trivial and not easy to get "clean". Yeah, agree with this point, it's difficult to get a clean solution here, but the host side solution might be easier to deploy (not necessarily easier to implement) and possibly cleaner than attempts to regulate the guest side. > > Concluding that individual pieces of a 1 GiB / 2 MiB huge page should > not be converted back and forth might be a reasonable. Although I'm sure > people will argue the opposite and develop hackish solutions in > desperate ways to make it work somehow. > > Huge pages, and especially gigantic pages, are simply a bad fit if the > VM will convert individual 4k pages. > > > But to answer your last question: we might be able to avoid GUP by using > a different mapping API, similar to the once KVM now provides. > > -- > Cheers, > > David / dhildenb > [1] -> https://cdrdv2.intel.com/v1/dl/getContent/773614 [2] https://lore.kernel.org/lkml/20240112055251.36101-2-vannapurve@google.com/
On Mon, Mar 18, 2024, Vishal Annapurve wrote: > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > Second, we should find better ways to let an IOMMU map these pages, > > *not* using GUP. There were already discussions on providing a similar > > fd+offset-style interface instead. GUP really sounds like the wrong > > approach here. Maybe we should look into passing not only guest_memfd, > > but also "ordinary" memfds. +1. I am not completely opposed to letting SNP and TDX effectively convert pages between private and shared, but I also completely agree that letting anything gup() guest_memfd memory is likely to end in tears. > I need to dig into past discussions around this, but agree that > passing guest memfds to VFIO drivers in addition to HVAs seems worth > exploring. This may be required anyways for devices supporting TDX > connect [1]. > > If we are talking about the same file catering to both private and > shared memory, there has to be some way to keep track of references on > the shared memory from both host userspace and IOMMU. > > > > > Third, I don't think we should be using huge pages where huge pages > > don't make any sense. Using a 1 GiB page so the VM will convert some > > pieces to map it using PTEs will destroy the whole purpose of using 1 > > GiB pages. It doesn't make any sense. I don't disagree, but the fundamental problem is that we have no guarantees as to what that guest will or will not do. We can certainly make very educated guesses, and probably be right 99.99% of the time, but being wrong 0.01% of the time probably means a lot of broken VMs, and a lot of unhappy customers. > I had started a discussion for this [2] using an RFC series. David is talking about the host side of things, AFAICT you're talking about the guest side... > challenge here remain: > 1) Unifying all the conversions under one layer > 2) Ensuring shared memory allocations are huge page aligned at boot > time and runtime. > > Using any kind of unified shared memory allocator (today this part is > played by SWIOTLB) will need to support huge page aligned dynamic > increments, which can be only guaranteed by carving out enough memory > at boot time for CMA area and using CMA area for allocation at > runtime. > - Since it's hard to come up with a maximum amount of shared memory > needed by VM, especially with GPUs/TPUs around, it's difficult to come > up with CMA area size at boot time. ...which is very relevant as carving out memory in the guest is nigh impossible, but carving out memory in the host for systems whose sole purpose is to run VMs is very doable. > I think it's arguable that even if a VM converts 10 % of its memory to > shared using 4k granularity, we still have fewer page table walks on > the rest of the memory when using 1G/2M pages, which is a significant > portion. Performance is a secondary concern. If this were _just_ about guest performance, I would unequivocally side with David: the guest gets to keep the pieces if it fragments a 1GiB page. The main problem we're trying to solve is that we want to provision a host such that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs would also get the benefits of 1GiB mappings, that's not the driving motiviation for this discussion. 
As HugeTLB exists today, supporting that use case isn't really feasible because there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute the 1GiB when the sliver is converted/freed back). Peeking ahead at my next comment, I don't think that solving this in the guest is a realistic option, i.e. IMO, we need to figure out a way to handle this in the host, without relying on the guest to cooperate. Luckily, we haven't added hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate to work with. The other big advantage that we should lean into is that we can make assumptions about guest_memfd usage that would never fly for general purpose backing stores, e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not desirable, for (almost?) all of the CoCo use cases. I don't have any concrete ideas at this time, but my gut feeling is that this won't be _that_ crazy hard to solve if we commit hard to guest_memfd _not_ being general purpose, and if we account for conversion scenarios when designing hugepage support for guest_memfd. > > For example, one could create a GPA layout where some regions are backed > > by gigantic pages that cannot be converted/can only be converted as a > > whole, and some are backed by 4k pages that can be converted back and > > forth. We'd use multiple guest_memfds for that. I recall that physically > > restricting such conversions/locations (e.g., for bounce buffers) in > > Linux was already discussed somewhere, but I don't recall the details. > > > > It's all not trivial and not easy to get "clean". > > Yeah, agree with this point, it's difficult to get a clean solution > here, but the host side solution might be easier to deploy (not > necessarily easier to implement) and possibly cleaner than attempts to > regulate the guest side. I think we missed the opportunity to regulate the guest side by several years. To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than willing to put my foot down on things where the guest is being blatantly ridiculous, wanting to convert an arbitrary 4KiB chunk of memory between private and shared isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm not willing to have KVM refuse conversions that are legal according to the SNP and TDX specs (and presumably the CCA spec, too). That's why I think we're years too late; this sort of restriction needs to go in the "hardware" spec, and that ship has sailed.
On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote: > On 04.03.24 22:43, Elliot Berman wrote: > > On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: > > > On 04.03.24 20:04, Sean Christopherson wrote: > > > > On Mon, Mar 04, 2024, Quentin Perret wrote: > > > > > > As discussed in the sub-thread, that might still be required. > > > > > > > > > > > > One could think about completely forbidding GUP on these mmap'ed > > > > > > guest-memfds. But likely, there might be use cases in the future where you > > > > > > want to use GUP on shared memory inside a guest_memfd. > > > > > > > > > > > > (the iouring example I gave might currently not work because > > > > > > FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and > > > > > > guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some > > > > > > details) > > > > > > > > > > Perhaps it would be wise to start with GUP being forbidden if the > > > > > current users do not need it (not sure if that is the case in Android, > > > > > I'll check) ? We can always relax this constraint later when/if the > > > > > use-cases arise, which is obviously much harder to do the other way > > > > > around. > > > > > > > > +1000. At least on the KVM side, I would like to be as conservative as possible > > > > when it comes to letting anything other than the guest access guest_memfd. > > > > > > So we'll have to do it similar to any occurrences of "secretmem" in gup.c. > > > We'll have to see how to marry KVM guest_memfd with core-mm code similar to > > > e.g., folio_is_secretmem(). > > > > > > IIRC, we might not be able to de-reference the actual mapping because it > > > could get free concurrently ... > > > > > > That will then prohibit any kind of GUP access to these pages, including > > > reading/writing for ptrace/debugging purposes, for core dumping purposes > > > etc. But at least, you know that nobody was able to optain page references > > > using GUP that might be used for reading/writing later. > > > > Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and > > replacing folio_is_secretmem() with a test of this bit instead of > > comparing the a_ops? I think it scales better. > > The only concern I have are races, but let's look into the details: > > In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs > etc. > > We had a similar discussion recently about possible races. It's documented > in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. > > "inodes and thus their mappings are freed under RCU, which means the mapping > cannot be freed beneath us and thus we can safely dereference it." > > So if we follow the same rules as folio_fast_pin_allowed(), we can > de-reference folio->mapping, for example comparing mapping->a_ops. > > [folio_is_secretmem should better follow the same approach] Resurecting this discussion, I had discussions internally and as it turns out Android makes extensive use of vhost/vsock when communicating with guest VMs, which requires GUP. So, my bad, not supporting GUP for the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to support it from the start. But again this should be a matter of 'simply' having a dedicated KVM exit reason so hopefully it's not too bad. Thanks, Quentin
On 19.03.24 10:47, Quentin Perret wrote: > On Monday 04 Mar 2024 at 22:58:49 (+0100), David Hildenbrand wrote: >> On 04.03.24 22:43, Elliot Berman wrote: >>> On Mon, Mar 04, 2024 at 09:17:05PM +0100, David Hildenbrand wrote: >>>> On 04.03.24 20:04, Sean Christopherson wrote: >>>>> On Mon, Mar 04, 2024, Quentin Perret wrote: >>>>>>> As discussed in the sub-thread, that might still be required. >>>>>>> >>>>>>> One could think about completely forbidding GUP on these mmap'ed >>>>>>> guest-memfds. But likely, there might be use cases in the future where you >>>>>>> want to use GUP on shared memory inside a guest_memfd. >>>>>>> >>>>>>> (the iouring example I gave might currently not work because >>>>>>> FOLL_PIN|FOLL_LONGTERM|FOLL_WRITE only works on shmem+hugetlb, and >>>>>>> guest_memfd will likely not be detected as shmem; 8ac268436e6d contains some >>>>>>> details) >>>>>> >>>>>> Perhaps it would be wise to start with GUP being forbidden if the >>>>>> current users do not need it (not sure if that is the case in Android, >>>>>> I'll check) ? We can always relax this constraint later when/if the >>>>>> use-cases arise, which is obviously much harder to do the other way >>>>>> around. >>>>> >>>>> +1000. At least on the KVM side, I would like to be as conservative as possible >>>>> when it comes to letting anything other than the guest access guest_memfd. >>>> >>>> So we'll have to do it similar to any occurrences of "secretmem" in gup.c. >>>> We'll have to see how to marry KVM guest_memfd with core-mm code similar to >>>> e.g., folio_is_secretmem(). >>>> >>>> IIRC, we might not be able to de-reference the actual mapping because it >>>> could get free concurrently ... >>>> >>>> That will then prohibit any kind of GUP access to these pages, including >>>> reading/writing for ptrace/debugging purposes, for core dumping purposes >>>> etc. But at least, you know that nobody was able to optain page references >>>> using GUP that might be used for reading/writing later. >>> >>> Do you have any concerns to add to enum mapping_flags, AS_NOGUP, and >>> replacing folio_is_secretmem() with a test of this bit instead of >>> comparing the a_ops? I think it scales better. >> >> The only concern I have are races, but let's look into the details: >> >> In GUP-fast, we can essentially race with unmap of folios, munmap() of VMAs >> etc. >> >> We had a similar discussion recently about possible races. It's documented >> in folio_fast_pin_allowed() regarding disabled IRQs and RCU grace periods. >> >> "inodes and thus their mappings are freed under RCU, which means the mapping >> cannot be freed beneath us and thus we can safely dereference it." >> >> So if we follow the same rules as folio_fast_pin_allowed(), we can >> de-reference folio->mapping, for example comparing mapping->a_ops. >> >> [folio_is_secretmem should better follow the same approach] > > Resurecting this discussion, I had discussions internally and as it > turns out Android makes extensive use of vhost/vsock when communicating > with guest VMs, which requires GUP. So, my bad, not supporting GUP for > the pKVM variant of guest_memfd is a bit of a non-starter, we'll need to > support it from the start. But again this should be a matter of 'simply' > having a dedicated KVM exit reason so hopefully it's not too bad. Likely you should look into ways to teach these interfaces that require GUP to consume fd+offset instead. Yes, there is no such thing as free lunch :P
On 19.03.24 01:10, Sean Christopherson wrote: > On Mon, Mar 18, 2024, Vishal Annapurve wrote: >> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>> Second, we should find better ways to let an IOMMU map these pages, >>> *not* using GUP. There were already discussions on providing a similar >>> fd+offset-style interface instead. GUP really sounds like the wrong >>> approach here. Maybe we should look into passing not only guest_memfd, >>> but also "ordinary" memfds. > > +1. I am not completely opposed to letting SNP and TDX effectively convert > pages between private and shared, but I also completely agree that letting > anything gup() guest_memfd memory is likely to end in tears. Yes. Avoid it right from the start, if possible. People wanted guest_memfd to *not* have to mmap guest memory ("even for ordinary VMs"). Now people are saying we have to be able to mmap it in order to GUP it. It's getting tiring, really. > >> I need to dig into past discussions around this, but agree that >> passing guest memfds to VFIO drivers in addition to HVAs seems worth >> exploring. This may be required anyways for devices supporting TDX >> connect [1]. >> >> If we are talking about the same file catering to both private and >> shared memory, there has to be some way to keep track of references on >> the shared memory from both host userspace and IOMMU. >> >>> >>> Third, I don't think we should be using huge pages where huge pages >>> don't make any sense. Using a 1 GiB page so the VM will convert some >>> pieces to map it using PTEs will destroy the whole purpose of using 1 >>> GiB pages. It doesn't make any sense. > > I don't disagree, but the fundamental problem is that we have no guarantees as to > what that guest will or will not do. We can certainly make very educated guesses, > and probably be right 99.99% of the time, but being wrong 0.01% of the time > probably means a lot of broken VMs, and a lot of unhappy customers. > Right, then don't use huge/gigantic pages. Because it doesn't make any sense. You have no guarantees about memory waste. You have no guarantee about performance. Then just don't use huge/gigantic pages. Use them where reasonable, they are an expensive resource. See below. >> I had started a discussion for this [2] using an RFC series. > > David is talking about the host side of things, AFAICT you're talking about the > guest side... > >> challenge here remain: >> 1) Unifying all the conversions under one layer >> 2) Ensuring shared memory allocations are huge page aligned at boot >> time and runtime. >> >> Using any kind of unified shared memory allocator (today this part is >> played by SWIOTLB) will need to support huge page aligned dynamic >> increments, which can be only guaranteed by carving out enough memory >> at boot time for CMA area and using CMA area for allocation at >> runtime. >> - Since it's hard to come up with a maximum amount of shared memory >> needed by VM, especially with GPUs/TPUs around, it's difficult to come >> up with CMA area size at boot time. > > ...which is very relevant as carving out memory in the guest is nigh impossible, > but carving out memory in the host for systems whose sole purpose is to run VMs > is very doable. > >> I think it's arguable that even if a VM converts 10 % of its memory to >> shared using 4k granularity, we still have fewer page table walks on >> the rest of the memory when using 1G/2M pages, which is a significant >> portion. > > Performance is a secondary concern. 
If this were _just_ about guest performance, > I would unequivocally side with David: the guest gets to keep the pieces if it > fragments a 1GiB page. > > The main problem we're trying to solve is that we want to provision a host such > that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously > run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, > 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs > would also get the benefits of 1GiB mappings, that's not the driving motiviation > for this discussion. Supporting 1 GiB mappings there sounds like unnecessary complexity and opening a big can of worms, especially if "it's not the driving motivation". If I understand you correctly, the scenario is (1) We have free 1 GiB hugetlb pages lying around (2) We want to start a CoCo VM (3) We don't care about 1 GiB mappings for that CoCo VM, but hguetlb pages is all we have. (4) We want to be able to use the 1 GiB hugetlb page in the future. With hugetlb, it's possible to reserve a CMA area from which to later allocate 1 GiB pages. While not allocated, they can be used for movable allocations. So in the scenario above, free the hugetlb pages back to CMA. Then, consume them as 4K pages for the CoCo VM. When wanting to start a non-CoCo VM, re-allocate them from CMA. One catch with that is that (a) CMA pages cannot get longterm-pinned: for obvious reasons, we wouldn't be able to migrate them in order to free up the 1 GiB page. (b) guest_memfd pages are not movable and cannot currently end up on CMA memory. But maybe that's not actually required in this scenario and we'd like to have slightly different semantics: if you were to give the CoCo VM the 1 GiB pages, they would similarly be unusable until that VM quit and freed up the memory! So it might be acceptable to get "selected" unmovable allocations (from guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free up when wanting to re-allocate that memory. Otherwise, the CMA allocation will simply fail. If we need improvements in that area to support this case, we can talk. Just an idea to avoid HGM and friends just to make it somehow work with 1 GiB pages ... > > As HugeTLB exists today, supporting that use case isn't really feasible because > there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute > the 1GiB when the sliver is converted/freed back). > > Peeking ahead at my next comment, I don't think that solving this in the guest > is a realistic option, i.e. IMO, we need to figure out a way to handle this in > the host, without relying on the guest to cooperate. Luckily, we haven't added > hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate > to work with. > > The other big advantage that we should lean into is that we can make assumptions > about guest_memfd usage that would never fly for a general purpose backing stores, > e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not > desirable, for (almost?) all of the CoCo use cases. > > I don't have any concrete ideas at this time, but my gut feeling is that this > won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being > general purposes, and if we we account for conversion scenarios when designing > hugepage support for guest_memfd. 
I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) > >>> For example, one could create a GPA layout where some regions are backed >>> by gigantic pages that cannot be converted/can only be converted as a >>> whole, and some are backed by 4k pages that can be converted back and >>> forth. We'd use multiple guest_memfds for that. I recall that physically >>> restricting such conversions/locations (e.g., for bounce buffers) in >>> Linux was already discussed somewhere, but I don't recall the details. >>> >>> It's all not trivial and not easy to get "clean". >> >> Yeah, agree with this point, it's difficult to get a clean solution >> here, but the host side solution might be easier to deploy (not >> necessarily easier to implement) and possibly cleaner than attempts to >> regulate the guest side. > > I think we missed the opportunity to regulate the guest side by several years. > To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer > VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than > willing to put my foot down on things where the guest is being blatantly ridiculous, > wanting to convert an arbitrary 4KiB chunk of memory between private and shared > isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm not willing > to have KVM refuse conversions that are legal according to the SNP and TDX specs > (and presumably the CCA spec, too). > > That's why I think we're years too late; this sort of restriction needs to go in > the "hardware" spec, and that ship has sailed. One could consider extending the spec and gluing huge+gigantic page support to new hardware. But ideally, we could just avoid any partial conversion / HGM just to support a scenario where we might not actually want 1 GiB pages, but somehow want to make it work with 1 GiB pages.
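A kernel-side sketch of the hugetlb-CMA reuse idea above, heavily hypothetical: gmem_cma is an invented handle, and there is currently no way for guest_memfd to allocate from a CMA area; the point is only that the "owner" hands the memory back to CMA when the gigantic page is wanted again, and that the allocation simply fails if the area got fragmented.

static struct page *gmem_alloc_from_cma(struct cma *gmem_cma,
                                        unsigned int order)
{
    /* Fails (returns NULL) if unmovable allocations already landed here. */
    return cma_alloc(gmem_cma, 1UL << order, order, false);
}

static void gmem_release_to_cma(struct cma *gmem_cma, struct page *page,
                                unsigned int order)
{
    /* Once everything is handed back, the 1 GiB page can be re-formed. */
    cma_release(gmem_cma, page, 1UL << order);
}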
>>> I had started a discussion for this [2] using an RFC series. >> >> David is talking about the host side of things, AFAICT you're talking about the >> guest side... >> >>> challenge here remain: >>> 1) Unifying all the conversions under one layer >>> 2) Ensuring shared memory allocations are huge page aligned at boot >>> time and runtime. >>> >>> Using any kind of unified shared memory allocator (today this part is >>> played by SWIOTLB) will need to support huge page aligned dynamic >>> increments, which can be only guaranteed by carving out enough memory >>> at boot time for CMA area and using CMA area for allocation at >>> runtime. >>> - Since it's hard to come up with a maximum amount of shared memory >>> needed by VM, especially with GPUs/TPUs around, it's difficult to come >>> up with CMA area size at boot time. >> >> ...which is very relevant as carving out memory in the guest is nigh impossible, >> but carving out memory in the host for systems whose sole purpose is to run VMs >> is very doable. >> >>> I think it's arguable that even if a VM converts 10 % of its memory to >>> shared using 4k granularity, we still have fewer page table walks on >>> the rest of the memory when using 1G/2M pages, which is a significant >>> portion. >> >> Performance is a secondary concern. If this were _just_ about guest performance, >> I would unequivocally side with David: the guest gets to keep the pieces if it >> fragments a 1GiB page. >> >> The main problem we're trying to solve is that we want to provision a host such >> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously >> run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, >> 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs >> would also get the benefits of 1GiB mappings, that's not the driving motiviation >> for this discussion. > > Supporting 1 GiB mappings there sounds like unnecessary complexity and > opening a big can of worms, especially if "it's not the driving motivation". > > If I understand you correctly, the scenario is > > (1) We have free 1 GiB hugetlb pages lying around > (2) We want to start a CoCo VM > (3) We don't care about 1 GiB mappings for that CoCo VM, but hguetlb > pages is all we have. > (4) We want to be able to use the 1 GiB hugetlb page in the future. > > With hugetlb, it's possible to reserve a CMA area from which to later > allocate 1 GiB pages. While not allocated, they can be used for movable > allocations. > > So in the scenario above, free the hugetlb pages back to CMA. Then, > consume them as 4K pages for the CoCo VM. When wanting to start a > non-CoCo VM, re-allocate them from CMA. > > One catch with that is that > (a) CMA pages cannot get longterm-pinned: for obvious reasons, we > wouldn't be able to migrate them in order to free up the 1 GiB page. > (b) guest_memfd pages are not movable and cannot currently end up on CMA > memory. > > But maybe that's not actually required in this scenario and we'd like to > have slightly different semantics: if you were to give the CoCo VM the 1 > GiB pages, they would similarly be unusable until that VM quit and freed > up the memory! > > So it might be acceptable to get "selected" unmovable allocations (from > guest_memfd) on selected (hugetlb) CMA area, that the "owner" will free > up when wanting to re-allocate that memory. Otherwise, the CMA > allocation will simply fail. > > If we need improvements in that area to support this case, we can talk. 
> Just an idea to avoid HGM and friends just to make it somehow work with > 1 GiB pages ... Thought about that some more and some cases can also be tricky (avoiding fragmenting multiple 1 GiB pages ...). It's all tricky, especially once multiple (guest_)memfds are involved and we really want to avoid most waste. Knowing that large mappings for CoCo are rather "optional" and that the challenge is in "reusing" large pages is valuable, though.
Hi David, On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > On 19.03.24 01:10, Sean Christopherson wrote: > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > *not* using GUP. There were already discussions on providing a similar > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > but also "ordinary" memfds. > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > pages between private and shared, but I also completely agree that letting > > anything gup() guest_memfd memory is likely to end in tears. > > Yes. Avoid it right from the start, if possible. > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > ordinary VMs"). Now people are saying we have to be able to mmap it in order > to GUP it. It's getting tiring, really. From the pKVM side, we're working on guest_memfd primarily to avoid diverging from what other CoCo solutions end up using, but if it gets de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do today with anonymous memory, then it's a really hard sell to switch over from what we have in production. We're also hoping that, over time, guest_memfd will become more closely integrated with the mm subsystem to enable things like hypervisor-assisted page migration, which we would love to have. Today, we use the existing KVM interfaces (i.e. based on anonymous memory) and it mostly works with the one significant exception that accessing private memory via a GUP pin will crash the host kernel. If all guest_memfd() can offer to solve that problem is preventing GUP altogether, then I'd sooner just add that same restriction to what we currently have instead of overhauling the user ABI in favour of something which offers us very little in return. On the mmap() side of things for guest_memfd, a simpler option for us than what has currently been proposed might be to enforce that the VMM has unmapped all private pages on vCPU run, failing the ioctl if that's not the case. It needs a little more tracking in guest_memfd but I think GUP will then fall out in the wash because only shared pages will be mapped by userspace and so GUP will fail by construction for private pages. We're happy to pursue alternative approaches using anonymous memory if you'd prefer to keep guest_memfd limited in functionality (e.g. preventing GUP of private pages by extending mapping_flags as per [1]), but we're equally willing to contribute to guest_memfd if extensions are welcome. What do you prefer? Cheers, Will [1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com
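A sketch of the check Will proposes, with invented helpers and without the guest_memfd-internal tracking it would need: before entering the guest, refuse to run while any folio of a currently-private range is still mapped by host userspace.

static int kvm_gmem_private_range_unmapped(struct inode *inode,
                                           pgoff_t start, pgoff_t end)
{
    pgoff_t index;

    for (index = start; index < end; index++) {
        struct folio *folio = filemap_get_folio(inode->i_mapping, index);

        if (IS_ERR(folio))
            continue;

        if (folio_mapped(folio)) {
            folio_put(folio);
            return -EPERM;  /* or a dedicated KVM exit reason */
        }
        folio_put(folio);
    }

    return 0;
}

GUP on private memory then fails by construction, as Will notes, because only shared pages can still be mapped by userspace at the point the vCPU is allowed to run.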
On Tue, Mar 19, 2024, David Hildenbrand wrote: > On 19.03.24 01:10, Sean Christopherson wrote: > > Performance is a secondary concern. If this were _just_ about guest performance, > > I would unequivocally side with David: the guest gets to keep the pieces if it > > fragments a 1GiB page. > > > > The main problem we're trying to solve is that we want to provision a host such > > that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously > > run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, > > 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs > > would also get the benefits of 1GiB mappings, that's not the driving motiviation > > for this discussion. > > Supporting 1 GiB mappings there sounds like unnecessary complexity and > opening a big can of worms, especially if "it's not the driving motivation". > > If I understand you correctly, the scenario is > > (1) We have free 1 GiB hugetlb pages lying around > (2) We want to start a CoCo VM > (3) We don't care about 1 GiB mappings for that CoCo VM, We care about 1GiB mappings for CoCo VMs. My comment about performance being a secondary concern was specifically saying that it's the guest's responsibility to play nice with huge mappings if the guest cares about its performance. For guests that are well behaved, we most definitely want to provide a configuration that performs as close to non-CoCo VMs as we can reasonably make it. And we can do that today, but it requires some amount of host memory to NOT be in the HugeTLB pool, and instead be kept in reserve so that it can be used for shared memory for CoCo VMs. That approach has many downsides, as the extra memory overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo and CoCo VMs, and so on and so forth. > but hguetlb pages is all we have. > (4) We want to be able to use the 1 GiB hugetlb page in the future. ... > > The other big advantage that we should lean into is that we can make assumptions > > about guest_memfd usage that would never fly for a general purpose backing stores, > > e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not > > desirable, for (almost?) all of the CoCo use cases. > > > > I don't have any concrete ideas at this time, but my gut feeling is that this > > won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being > > general purposes, and if we we account for conversion scenarios when designing > > hugepage support for guest_memfd. > > I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) Quite the opposite, I'm saying we should be very deliberate in how we add hugepage support and other features to guest_memfd, so that guest_memfd doesn't become a hacky mess. And I'm saying we should stand firm in what guest_memfd _won't_ support, e.g. swap/reclaim and probably page migration should get a hard "no". In other words, ditch the complexity for features that are well served by existing general purpose solutions, so that guest_memfd can take on a bit of complexity to serve use cases that are unique to KVM guests, without becoming an unmaintainable mess due to cross-products.
On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > On 19.03.24 01:10, Sean Christopherson wrote: > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > > *not* using GUP. There were already discussions on providing a similar > > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > > but also "ordinary" memfds. > > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > pages between private and shared, but I also completely agree that letting > > > anything gup() guest_memfd memory is likely to end in tears. > > > > Yes. Avoid it right from the start, if possible. > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > to GUP it. It's getting tiring, really. > > From the pKVM side, we're working on guest_memfd primarily to avoid > diverging from what other CoCo solutions end up using, but if it gets > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > today with anonymous memory, then it's a really hard sell to switch over > from what we have in production. We're also hoping that, over time, > guest_memfd will become more closely integrated with the mm subsystem to > enable things like hypervisor-assisted page migration, which we would > love to have. > > Today, we use the existing KVM interfaces (i.e. based on anonymous > memory) and it mostly works with the one significant exception that > accessing private memory via a GUP pin will crash the host kernel. If > all guest_memfd() can offer to solve that problem is preventing GUP > altogether, then I'd sooner just add that same restriction to what we > currently have instead of overhauling the user ABI in favour of > something which offers us very little in return. How would we add the restriction to anonymous memory? Thinking aloud -- do you mean like some sort of "exclusive GUP" flag where mm can ensure that the exclusive GUP pin is the only pin? If the refcount for the page is >1, then the exclusive GUP fails. Any future GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > On the mmap() side of things for guest_memfd, a simpler option for us > than what has currently been proposed might be to enforce that the VMM > has unmapped all private pages on vCPU run, failing the ioctl if that's > not the case. It needs a little more tracking in guest_memfd but I think > GUP will then fall out in the wash because only shared pages will be > mapped by userspace and so GUP will fail by construction for private > pages. We can prevent GUP after the pages are marked private, but the pages could be marked private after the pages were already GUP'd. I don't have a good way to detect this, so converting a page to private is difficult. > We're happy to pursue alternative approaches using anonymous memory if > you'd prefer to keep guest_memfd limited in functionality (e.g. > preventing GUP of private pages by extending mapping_flags as per [1]), > but we're equally willing to contribute to guest_memfd if extensions are > welcome. > > What do you prefer? > I like this as a stepping stone. 
For the Android use cases, we don't need to be able to convert a private page to shared and then also be able to GUP it. If you want to GUP a page, use anonymous memory and that memory will always be shared. If you don't care about GUP'ing (e.g. it's going to be guest-private or you otherwise know you won't be GUP'ing), you can use guest_memfd. I don't think this design prevents us from adding "sometimes you can GUP" to guest_memfd in the future. I don't think it creates extra changes for KVM since anonymous memory is already supported; although I'd have to add the support for Gunyah. > [1] https://lore.kernel.org/r/4b0fd46a-cc4f-4cb7-9f6f-ce19a2d3064e@redhat.com Thanks, Elliot
Hi Elliot, On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > pages between private and shared, but I also completely agree that letting > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > to GUP it. It's getting tiring, really. > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > diverging from what other CoCo solutions end up using, but if it gets > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > today with anonymous memory, then it's a really hard sell to switch over > > from what we have in production. We're also hoping that, over time, > > guest_memfd will become more closely integrated with the mm subsystem to > > enable things like hypervisor-assisted page migration, which we would > > love to have. > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > memory) and it mostly works with the one significant exception that > > accessing private memory via a GUP pin will crash the host kernel. If > > all guest_memfd() can offer to solve that problem is preventing GUP > > altogether, then I'd sooner just add that same restriction to what we > > currently have instead of overhauling the user ABI in favour of > > something which offers us very little in return. > > How would we add the restriction to anonymous memory? > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > where mm can ensure that the exclusive GUP pin is the only pin? If the > refcount for the page is >1, then the exclusive GUP fails. Any future > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. Yes, I think we'd want something like that, but I don't think using a bias on its own is a good idea as false positives due to a large number of page references will then actually lead to problems (i.e. rejecting GUP spuriously), no? I suppose if you only considered the new bias in conjunction with the AS_NOGUP flag you proposed then it might be ok (i.e. when you see the bias, you then go check the address space to confirm). What do you think? > > On the mmap() side of things for guest_memfd, a simpler option for us > > than what has currently been proposed might be to enforce that the VMM > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > not the case. It needs a little more tracking in guest_memfd but I think > > GUP will then fall out in the wash because only shared pages will be > > mapped by userspace and so GUP will fail by construction for private > > pages. > > We can prevent GUP after the pages are marked private, but the pages > could be marked private after the pages were already GUP'd. I don't have > a good way to detect this, so converting a page to private is difficult. For anonymous memory, marking the page as private is going to involve an exclusive GUP so that the page can safely be donated to the guest. In that case, any existing GUP pin should cause that to fail gracefully. What is the situation you are concerned about here? 
> > We're happy to pursue alternative approaches using anonymous memory if > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > preventing GUP of private pages by extending mapping_flags as per [1]), > > but we're equally willing to contribute to guest_memfd if extensions are > > welcome. > > > > What do you prefer? > > > > I like this as a stepping stone. For the Android use cases, we don't > need to be able to convert a private page to shared and then also be > able to GUP it. I wouldn't want to rule that out, though. The VMM should be able to use shared pages just like it can with normal anonymous pages. > I don't think this design prevents us from adding "sometimes you can > GUP" to guest_memfd in the future. Technically, I think we can add all the stuff we need to guest_memfd, but there's a desire to keep that as simple as possible for now, which is why I'm keen to explore alternatives to unblock the pKVM upstreaming. Will
On 19.03.24 16:04, Sean Christopherson wrote: > On Tue, Mar 19, 2024, David Hildenbrand wrote: >> On 19.03.24 01:10, Sean Christopherson wrote: >>> Performance is a secondary concern. If this were _just_ about guest performance, >>> I would unequivocally side with David: the guest gets to keep the pieces if it >>> fragments a 1GiB page. >>> >>> The main problem we're trying to solve is that we want to provision a host such >>> that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously >>> run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, >>> 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs >>> would also get the benefits of 1GiB mappings, that's not the driving motiviation >>> for this discussion. >> >> Supporting 1 GiB mappings there sounds like unnecessary complexity and >> opening a big can of worms, especially if "it's not the driving motivation". >> >> If I understand you correctly, the scenario is >> >> (1) We have free 1 GiB hugetlb pages lying around >> (2) We want to start a CoCo VM >> (3) We don't care about 1 GiB mappings for that CoCo VM, > > We care about 1GiB mappings for CoCo VMs. My comment about performance being a > secondary concern was specifically saying that it's the guest's responsilibity > to play nice with huge mappings if the guest cares about its performance. For > guests that are well behaved, we most definitely want to provide a configuration > that performs as close to non-CoCo VMs as we can reasonably make it. How does the guest know the granularity? I suspect it's just implicit knowledge that "PUD granularity might be nice". > > And we can do that today, but it requires some amount of host memory to NOT be > in the HugeTLB pool, and instead be kept in reserved so that it can be used for > shared memory for CoCo VMs. That approach has many downsides, as the extra memory > overhead affects CoCo VM shapes, our ability to use a common pool for non-CoCo > and CoCo VMs, and so on and so forth. Right. But avoiding memory waste as soon as hugetlb is involved (and we have two separate memfds for private/shared memory) is not feasible. > >> but hguetlb pages is all we have. >> (4) We want to be able to use the 1 GiB hugetlb page in the future. > > ... > >>> The other big advantage that we should lean into is that we can make assumptions >>> about guest_memfd usage that would never fly for a general purpose backing stores, >>> e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not >>> desirable, for (almost?) all of the CoCo use cases. >>> >>> I don't have any concrete ideas at this time, but my gut feeling is that this >>> won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being >>> general purposes, and if we we account for conversion scenarios when designing >>> hugepage support for guest_memfd. >> >> I'm hoping guest_memfd won't end up being the wild west of hacky MM ideas ;) > > Quite the opposite, I'm saying we should be very deliberate in how we add hugepage > support and others features to guest_memfd, so that guest_memfd doesn't become a > hacky mess. Good. > > And I'm saying say we should stand firm in what guest_memfd _won't_ support, e.g. > swap/reclaim and probably page migration should get a hard "no". I thought people wanted to support at least page migration in the future? 
(for example, see the reply from Will) > > In other words, ditch the complexity for features that are well served by existing > general purpose solutions, so that guest_memfd can take on a bit of complexity to > serve use cases that are unique to KVM guests, without becoming an unmaintainble > mess due to cross-products. And I believed that was true until people started wanting to mmap() this thing and brought GUP into the picture ... and then talk about HGM and all that. *shivers*
On 19.03.24 15:31, Will Deacon wrote: > Hi David, Hi Will, sorry for the late reply! > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >> On 19.03.24 01:10, Sean Christopherson wrote: >>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>> *not* using GUP. There were already discussions on providing a similar >>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>> but also "ordinary" memfds. >>> >>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>> pages between private and shared, but I also completely agree that letting >>> anything gup() guest_memfd memory is likely to end in tears. >> >> Yes. Avoid it right from the start, if possible. >> >> People wanted guest_memfd to *not* have to mmap guest memory ("even for >> ordinary VMs"). Now people are saying we have to be able to mmap it in order >> to GUP it. It's getting tiring, really. > > From the pKVM side, we're working on guest_memfd primarily to avoid > diverging from what other CoCo solutions end up using, but if it gets > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > today with anonymous memory, then it's a really hard sell to switch over > from what we have in production. We're also hoping that, over time, > guest_memfd will become more closely integrated with the mm subsystem to > enable things like hypervisor-assisted page migration, which we would > love to have. Reading Sean's reply, he has a different view on that. And I think that's the main issue: there are too many different use cases and too many different requirements that could turn guest_memfd into something that maybe it really shouldn't be. > > Today, we use the existing KVM interfaces (i.e. based on anonymous > memory) and it mostly works with the one significant exception that > accessing private memory via a GUP pin will crash the host kernel. If > all guest_memfd() can offer to solve that problem is preventing GUP > altogether, then I'd sooner just add that same restriction to what we > currently have instead of overhauling the user ABI in favour of > something which offers us very little in return. > > On the mmap() side of things for guest_memfd, a simpler option for us > than what has currently been proposed might be to enforce that the VMM > has unmapped all private pages on vCPU run, failing the ioctl if that's > not the case. It needs a little more tracking in guest_memfd but I think > GUP will then fall out in the wash because only shared pages will be > mapped by userspace and so GUP will fail by construction for private > pages. > > We're happy to pursue alternative approaches using anonymous memory if > you'd prefer to keep guest_memfd limited in functionality (e.g. > preventing GUP of private pages by extending mapping_flags as per [1]), > but we're equally willing to contribute to guest_memfd if extensions are > welcome. > > What do you prefer? Let me summarize the history: AMD had its thing running and it worked for them (but I recall it was hacky :) ). TDX made it possible to crash the machine when accessing secure memory from user space (MCE). So secure memory must not be mapped into user space -- no page tables. 
Prototypes with anonymous memory existed (and I didn't hate them, although hacky), but one of the other selling points of guest_memfd was that we could create VMs that wouldn't need any page tables at all, which I found interesting. There was a bit more to that (easier conversion, avoiding GUP, specifying on allocation that the memory was unmovable ...), but I'll get to that later. The design principle was: nasty private memory (unmovable, unswappable, inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" memory is allocated from an ordinary memfd. This makes sense: shared memory is neither nasty nor special. You can migrate it, swap it out, map it into page tables, GUP it, ... without any issues. So if I would describe some key characteristics of guest_memfd as of today, it would probably be: 1) Memory is unmovable and unswappable. Right from the beginning, it is allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). 2) Memory is inaccessible. It cannot be read from user space or the kernel, and it cannot be GUP'ed ... only some mechanisms (e.g., hibernation, /proc/kcore) might end up touching it "by accident", and we usually can handle these cases. 3) Memory can be discarded at page granularity. There should be no cases where you cannot discard memory to over-allocate memory for private pages that have been replaced by shared pages otherwise. 4) Page tables are not required (well, it's a memfd), and the fd could in theory be passed to other processes. Having "ordinary shared" memory in there implies that 1) and 2) will have to be adjusted for them, which kind-of turns it "partially" into ordinary shmem again. Going back to the beginning: with pKVM, we likely want the following: 1) Convert pages private<->shared in-place 2) Stop user space + kernel from accessing private memory in process context. Likely for pKVM we would only crash the process, which would be acceptable. 3) Prevent GUP to private memory. Otherwise we could crash the kernel. 4) Prevent private pages from swapout+migration until supported. I suspect your current solution with anonymous memory gets all but 3) sorted out, correct? I'm curious: might there be a requirement in the future that shared memory could be mapped into other processes? (thinking vhost-user and such things). Of course that's impossible with anonymous memory; teaching shmem to contain private memory would kind-of lead to ... guest_memfd, just that we don't have shared memory there.
On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote: > Hi Elliot, > > On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > > On Tue, Mar 19, 2024 at 02:31:19PM +0000, Will Deacon wrote: > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > > pages between private and shared, but I also completely agree that letting > > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > > to GUP it. It's getting tiring, really. > > > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > > memory) and it mostly works with the one significant exception that > > > accessing private memory via a GUP pin will crash the host kernel. If > > > all guest_memfd() can offer to solve that problem is preventing GUP > > > altogether, then I'd sooner just add that same restriction to what we > > > currently have instead of overhauling the user ABI in favour of > > > something which offers us very little in return. > > > > How would we add the restriction to anonymous memory? > > > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > > where mm can ensure that the exclusive GUP pin is the only pin? If the > > refcount for the page is >1, then the exclusive GUP fails. Any future > > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > > Yes, I think we'd want something like that, but I don't think using a > bias on its own is a good idea as false positives due to a large number > of page references will then actually lead to problems (i.e. rejecting > GUP spuriously), no? I suppose if you only considered the new bias in > conjunction with the AS_NOGUP flag you proposed then it might be ok > (i.e. when you see the bias, you then go check the address space to > confirm). What do you think? > I think the AS_NOGUP would prevent GUPing the first place. If we set the EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried about there being INT_MAX-1 valid GUPs and wanting to add another? From the GUPer's perspective, I don't think it would be much different from overflowing the refcount. > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > than what has currently been proposed might be to enforce that the VMM > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > not the case. It needs a little more tracking in guest_memfd but I think > > > GUP will then fall out in the wash because only shared pages will be > > > mapped by userspace and so GUP will fail by construction for private > > > pages. 
> > > > We can prevent GUP after the pages are marked private, but the pages > > could be marked private after the pages were already GUP'd. I don't have > > a good way to detect this, so converting a page to private is difficult. > > For anonymous memory, marking the page as private is going to involve an > exclusive GUP so that the page can safely be donated to the guest. In > that case, any existing GUP pin should cause that to fail gracefully. > What is the situation you are concerned about here? > I wasn't thinking about exclusive GUP here. The exclusive GUP should be able to get the guarantees we need. I was thinking about making sure we gracefully handle a race to provide the same page. The kernel should detect the difference between "we're already providing the page" and "somebody has an unexpected pin". We can easily read the refcount if we couldn't take the exclusive pin to know. Thanks, Elliot > > > We're happy to pursue alternative approaches using anonymous memory if > > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > > preventing GUP of private pages by extending mapping_flags as per [1]), > > > but we're equally willing to contribute to guest_memfd if extensions are > > > welcome. > > > > > > What do you prefer? > > > > > > > I like this as a stepping stone. For the Android use cases, we don't > > need to be able to convert a private page to shared and then also be > > able to GUP it. > > I wouldn't want to rule that out, though. The VMM should be able to use > shared pages just like it can with normal anonymous pages. > > > I don't think this design prevents us from adding "sometimes you can > > GUP" to guest_memfd in the future. > > Technically, I think we can add all the stuff we need to guest_memfd, > but there's a desire to keep that as simple as possible for now, which > is why I'm keen to explore alternatives to unblock the pKVM upstreaming. > > Will >
On 22.03.24 18:52, David Hildenbrand wrote: > On 19.03.24 15:31, Will Deacon wrote: >> Hi David, > > Hi Will, > > sorry for the late reply! > >> >> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>> On 19.03.24 01:10, Sean Christopherson wrote: >>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>>> *not* using GUP. There were already discussions on providing a similar >>>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>>> but also "ordinary" memfds. >>>> >>>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>>> pages between private and shared, but I also completely agree that letting >>>> anything gup() guest_memfd memory is likely to end in tears. >>> >>> Yes. Avoid it right from the start, if possible. >>> >>> People wanted guest_memfd to *not* have to mmap guest memory ("even for >>> ordinary VMs"). Now people are saying we have to be able to mmap it in order >>> to GUP it. It's getting tiring, really. >> >> From the pKVM side, we're working on guest_memfd primarily to avoid >> diverging from what other CoCo solutions end up using, but if it gets >> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >> today with anonymous memory, then it's a really hard sell to switch over >> from what we have in production. We're also hoping that, over time, >> guest_memfd will become more closely integrated with the mm subsystem to >> enable things like hypervisor-assisted page migration, which we would >> love to have. > > Reading Sean's reply, he has a different view on that. And I think > that's the main issue: there are too many different use cases and too > many different requirements that could turn guest_memfd into something > that maybe it really shouldn't be. > >> >> Today, we use the existing KVM interfaces (i.e. based on anonymous >> memory) and it mostly works with the one significant exception that >> accessing private memory via a GUP pin will crash the host kernel. If >> all guest_memfd() can offer to solve that problem is preventing GUP >> altogether, then I'd sooner just add that same restriction to what we >> currently have instead of overhauling the user ABI in favour of >> something which offers us very little in return. >> >> On the mmap() side of things for guest_memfd, a simpler option for us >> than what has currently been proposed might be to enforce that the VMM >> has unmapped all private pages on vCPU run, failing the ioctl if that's >> not the case. It needs a little more tracking in guest_memfd but I think >> GUP will then fall out in the wash because only shared pages will be >> mapped by userspace and so GUP will fail by construction for private >> pages. >> >> We're happy to pursue alternative approaches using anonymous memory if >> you'd prefer to keep guest_memfd limited in functionality (e.g. >> preventing GUP of private pages by extending mapping_flags as per [1]), >> but we're equally willing to contribute to guest_memfd if extensions are >> welcome. >> >> What do you prefer? > > Let me summarize the history: > > AMD had its thing running and it worked for them (but I recall it was > hacky :) ). > > TDX made it possible to crash the machine when accessing secure memory > from user space (MCE). 
> > So secure memory must not be mapped into user space -- no page tables. > Prototypes with anonymous memory existed (and I didn't hate them, > although hacky), but one of the other selling points of guest_memfd was > that we could create VMs that wouldn't need any page tables at all, > which I found interesting. > > There was a bit more to that (easier conversion, avoiding GUP, > specifying on allocation that the memory was unmovable ...), but I'll > get to that later. > > The design principle was: nasty private memory (unmovable, unswappable, > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary > "shared" memory is allocated from an ordinary memfd. > > This makes sense: shared memory is neither nasty nor special. You can > migrate it, swap it out, map it into page tables, GUP it, ... without > any issues. > > > So if I would describe some key characteristics of guest_memfd as of > today, it would probably be: > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > 2) Memory is inaccessible. It cannot be read from user space, the > kernel, it cannot be GUP'ed ... only some mechanisms might end up > touching that memory (e.g., hibernation, /proc/kcore) might end up > touching it "by accident", and we usually can handle these cases. > 3) Memory can be discarded in page granularity. There should be no cases > where you cannot discard memory to over-allocate memory for private > pages that have been replaced by shared pages otherwise. > 4) Page tables are not required (well, it's an memfd), and the fd could > in theory be passed to other processes. > > Having "ordinary shared" memory in there implies that 1) and 2) will > have to be adjusted for them, which kind-of turns it "partially" into > ordinary shmem again. > > > Going back to the beginning: with pKVM, we likely want the following > > 1) Convert pages private<->shared in-place > 2) Stop user space + kernel from accessing private memory in process > context. Likely for pKVM we would only crash the process, which > would be acceptable. > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > 4) Prevent private pages from swapout+migration until supported. > > > I suspect your current solution with anonymous memory gets all but 3) > sorted out, correct? > > I'm curious, may there be a requirement in the future that shared memory > could be mapped into other processes? (thinking vhost-user and such > things). Of course that's impossible with anonymous memory; teaching > shmem to contain private memory would kind-of lead to ... guest_memfd, > just that we don't have shared memory there. > I was just thinking of something stupid, not sure if it makes any sense. I'll raise it here before I forget over the weekend. ... what if we glued one guest_memfd and a memfd (shmem) together in the kernel somehow? (1) A to-shared conversion moves a page from the guest_memfd to the memfd. (2) A to-private conversion moves a page from the memfd to the guest_memfd. Only the memfd can be mmap'ed/read/written/GUP'ed. Pages in the memfd behave like any shmem pages: migratable, swappable etc. Of course, (2) is only possible if the page is not pinned, not mapped (we can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. We'd have to decide what to do when we access a "hole" in the memfd -- instead of allocating a fresh page and filling the hole, we'd want to SIGBUS.
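[Editor's note: a minimal sketch of the "SIGBUS instead of filling the hole" behaviour for the shared-side memfd described above. This is purely illustrative; no such fault handler exists, and the function name is invented.]

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: a fault handler for the shared-side memfd that never
 * allocates on access. If the offset is a "hole" (i.e. the page currently
 * lives on the guest_memfd side), the access gets SIGBUS.
 */
static vm_fault_t shared_memfd_fault(struct vm_fault *vmf)
{
	struct address_space *mapping = file_inode(vmf->vma->vm_file)->i_mapping;
	struct folio *folio;

	folio = filemap_get_folio(mapping, vmf->pgoff);
	if (IS_ERR(folio))
		return VM_FAULT_SIGBUS;		/* hole: do not fill it */

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return 0;	/* core mm locks the returned page itself */
}
```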
On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote: > On 22.03.24 18:52, David Hildenbrand wrote: > > On 19.03.24 15:31, Will Deacon wrote: > > > Hi David, > > > > Hi Will, > > > > sorry for the late reply! > > > > > > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > > > > > Second, we should find better ways to let an IOMMU map these pages, > > > > > > > *not* using GUP. There were already discussions on providing a similar > > > > > > > fd+offset-style interface instead. GUP really sounds like the wrong > > > > > > > approach here. Maybe we should look into passing not only guest_memfd, > > > > > > > but also "ordinary" memfds. > > > > > > > > > > +1. I am not completely opposed to letting SNP and TDX effectively convert > > > > > pages between private and shared, but I also completely agree that letting > > > > > anything gup() guest_memfd memory is likely to end in tears. > > > > > > > > Yes. Avoid it right from the start, if possible. > > > > > > > > People wanted guest_memfd to *not* have to mmap guest memory ("even for > > > > ordinary VMs"). Now people are saying we have to be able to mmap it in order > > > > to GUP it. It's getting tiring, really. > > > > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > Reading Sean's reply, he has a different view on that. And I think > > that's the main issue: there are too many different use cases and too > > many different requirements that could turn guest_memfd into something > > that maybe it really shouldn't be. > > > > > > > > Today, we use the existing KVM interfaces (i.e. based on anonymous > > > memory) and it mostly works with the one significant exception that > > > accessing private memory via a GUP pin will crash the host kernel. If > > > all guest_memfd() can offer to solve that problem is preventing GUP > > > altogether, then I'd sooner just add that same restriction to what we > > > currently have instead of overhauling the user ABI in favour of > > > something which offers us very little in return. > > > > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > than what has currently been proposed might be to enforce that the VMM > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > not the case. It needs a little more tracking in guest_memfd but I think > > > GUP will then fall out in the wash because only shared pages will be > > > mapped by userspace and so GUP will fail by construction for private > > > pages. > > > > > > We're happy to pursue alternative approaches using anonymous memory if > > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > > preventing GUP of private pages by extending mapping_flags as per [1]), > > > but we're equally willing to contribute to guest_memfd if extensions are > > > welcome. 
> > > > > > What do you prefer? > > > > Let me summarize the history: > > > > AMD had its thing running and it worked for them (but I recall it was > > hacky :) ). > > > > TDX made it possible to crash the machine when accessing secure memory > > from user space (MCE). > > > > So secure memory must not be mapped into user space -- no page tables. > > Prototypes with anonymous memory existed (and I didn't hate them, > > although hacky), but one of the other selling points of guest_memfd was > > that we could create VMs that wouldn't need any page tables at all, > > which I found interesting. > > > > There was a bit more to that (easier conversion, avoiding GUP, > > specifying on allocation that the memory was unmovable ...), but I'll > > get to that later. > > > > The design principle was: nasty private memory (unmovable, unswappable, > > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary > > "shared" memory is allocated from an ordinary memfd. > > > > This makes sense: shared memory is neither nasty nor special. You can > > migrate it, swap it out, map it into page tables, GUP it, ... without > > any issues. > > > > > > So if I would describe some key characteristics of guest_memfd as of > > today, it would probably be: > > > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > > 2) Memory is inaccessible. It cannot be read from user space, the > > kernel, it cannot be GUP'ed ... only some mechanisms might end up > > touching that memory (e.g., hibernation, /proc/kcore) might end up > > touching it "by accident", and we usually can handle these cases. > > 3) Memory can be discarded in page granularity. There should be no cases > > where you cannot discard memory to over-allocate memory for private > > pages that have been replaced by shared pages otherwise. > > 4) Page tables are not required (well, it's an memfd), and the fd could > > in theory be passed to other processes. > > > > Having "ordinary shared" memory in there implies that 1) and 2) will > > have to be adjusted for them, which kind-of turns it "partially" into > > ordinary shmem again. > > > > > > Going back to the beginning: with pKVM, we likely want the following > > > > 1) Convert pages private<->shared in-place > > 2) Stop user space + kernel from accessing private memory in process > > context. Likely for pKVM we would only crash the process, which > > would be acceptable. > > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > > 4) Prevent private pages from swapout+migration until supported. > > > > > > I suspect your current solution with anonymous memory gets all but 3) > > sorted out, correct? > > > > I'm curious, may there be a requirement in the future that shared memory > > could be mapped into other processes? (thinking vhost-user and such > > things). Of course that's impossible with anonymous memory; teaching > > shmem to contain private memory would kind-of lead to ... guest_memfd, > > just that we don't have shared memory there. > > > > I was just thinking of something stupid, not sure if it makes any sense. > I'll raise it here before I forget over the weekend. > > ... what if we glued one guest_memfd and a memfd (shmem) together in the > kernel somehow? > > (1) A to-shared conversion moves a page from the guest_memfd to the memfd. > > (2) A to-private conversion moves a page from the memfd to the guest_memfd. > > Only the memfd can be mmap'ed/read/written/GUP'ed. 
Pages in the memfd behave > like any shmem pages: migratable, swappable etc. > > > Of course, (2) is only possible if the page is not pinned, not mapped (we > can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. > Quentin gave an idea offline of using splice to achieve the conversions. I'd want to use the in-kernel APIs on page-fault to do the conversion; not requiring userspace to make the splice() syscall. One thing splice currently requires is the source (in) file; the KVM UAPI today only gives a userspace address. We could resolve that with for_each_vma_range(). I've just started looking into splice(), but I believe it takes care of the not-pinned and not-mapped requirements. guest_memfd would have to migrate the page out of ZONE_MOVABLE / MIGRATE_CMA. Does this seem like a good path to pursue further, or are there other ideas for doing the conversion? > We'd have to decide what to do when we access a "hole" in the memfd -- > instead of allocating a fresh page and filling the hole, we'd want to > SIGBUS. Since the KVM UAPI is based on userspace addresses and not fds for the shared memory part, maybe we could add an mmu_notifier_ops that allows KVM to intercept and reject faults if we couldn't reclaim the memory. I think it would be conceptually similar to userfaultfd, except in the kernel; I'm not sure whether re-using userfaultfd makes sense. Thanks, Elliot
On 26.03.24 23:04, Elliot Berman wrote: > On Fri, Mar 22, 2024 at 10:21:09PM +0100, David Hildenbrand wrote: >> On 22.03.24 18:52, David Hildenbrand wrote: >>> On 19.03.24 15:31, Will Deacon wrote: >>>> Hi David, >>> >>> Hi Will, >>> >>> sorry for the late reply! >>> >>>> >>>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>>>> On 19.03.24 01:10, Sean Christopherson wrote: >>>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>>>>>>> Second, we should find better ways to let an IOMMU map these pages, >>>>>>>> *not* using GUP. There were already discussions on providing a similar >>>>>>>> fd+offset-style interface instead. GUP really sounds like the wrong >>>>>>>> approach here. Maybe we should look into passing not only guest_memfd, >>>>>>>> but also "ordinary" memfds. >>>>>> >>>>>> +1. I am not completely opposed to letting SNP and TDX effectively convert >>>>>> pages between private and shared, but I also completely agree that letting >>>>>> anything gup() guest_memfd memory is likely to end in tears. >>>>> >>>>> Yes. Avoid it right from the start, if possible. >>>>> >>>>> People wanted guest_memfd to *not* have to mmap guest memory ("even for >>>>> ordinary VMs"). Now people are saying we have to be able to mmap it in order >>>>> to GUP it. It's getting tiring, really. >>>> >>>> From the pKVM side, we're working on guest_memfd primarily to avoid >>>> diverging from what other CoCo solutions end up using, but if it gets >>>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >>>> today with anonymous memory, then it's a really hard sell to switch over >>>> from what we have in production. We're also hoping that, over time, >>>> guest_memfd will become more closely integrated with the mm subsystem to >>>> enable things like hypervisor-assisted page migration, which we would >>>> love to have. >>> >>> Reading Sean's reply, he has a different view on that. And I think >>> that's the main issue: there are too many different use cases and too >>> many different requirements that could turn guest_memfd into something >>> that maybe it really shouldn't be. >>> >>>> >>>> Today, we use the existing KVM interfaces (i.e. based on anonymous >>>> memory) and it mostly works with the one significant exception that >>>> accessing private memory via a GUP pin will crash the host kernel. If >>>> all guest_memfd() can offer to solve that problem is preventing GUP >>>> altogether, then I'd sooner just add that same restriction to what we >>>> currently have instead of overhauling the user ABI in favour of >>>> something which offers us very little in return. >>>> >>>> On the mmap() side of things for guest_memfd, a simpler option for us >>>> than what has currently been proposed might be to enforce that the VMM >>>> has unmapped all private pages on vCPU run, failing the ioctl if that's >>>> not the case. It needs a little more tracking in guest_memfd but I think >>>> GUP will then fall out in the wash because only shared pages will be >>>> mapped by userspace and so GUP will fail by construction for private >>>> pages. >>>> >>>> We're happy to pursue alternative approaches using anonymous memory if >>>> you'd prefer to keep guest_memfd limited in functionality (e.g. >>>> preventing GUP of private pages by extending mapping_flags as per [1]), >>>> but we're equally willing to contribute to guest_memfd if extensions are >>>> welcome. >>>> >>>> What do you prefer? 
>>> >>> Let me summarize the history: >>> >>> AMD had its thing running and it worked for them (but I recall it was >>> hacky :) ). >>> >>> TDX made it possible to crash the machine when accessing secure memory >>> from user space (MCE). >>> >>> So secure memory must not be mapped into user space -- no page tables. >>> Prototypes with anonymous memory existed (and I didn't hate them, >>> although hacky), but one of the other selling points of guest_memfd was >>> that we could create VMs that wouldn't need any page tables at all, >>> which I found interesting. >>> >>> There was a bit more to that (easier conversion, avoiding GUP, >>> specifying on allocation that the memory was unmovable ...), but I'll >>> get to that later. >>> >>> The design principle was: nasty private memory (unmovable, unswappable, >>> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary >>> "shared" memory is allocated from an ordinary memfd. >>> >>> This makes sense: shared memory is neither nasty nor special. You can >>> migrate it, swap it out, map it into page tables, GUP it, ... without >>> any issues. >>> >>> >>> So if I would describe some key characteristics of guest_memfd as of >>> today, it would probably be: >>> >>> 1) Memory is unmovable and unswappable. Right from the beginning, it is >>> allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). >>> 2) Memory is inaccessible. It cannot be read from user space, the >>> kernel, it cannot be GUP'ed ... only some mechanisms might end up >>> touching that memory (e.g., hibernation, /proc/kcore) might end up >>> touching it "by accident", and we usually can handle these cases. >>> 3) Memory can be discarded in page granularity. There should be no cases >>> where you cannot discard memory to over-allocate memory for private >>> pages that have been replaced by shared pages otherwise. >>> 4) Page tables are not required (well, it's an memfd), and the fd could >>> in theory be passed to other processes. >>> >>> Having "ordinary shared" memory in there implies that 1) and 2) will >>> have to be adjusted for them, which kind-of turns it "partially" into >>> ordinary shmem again. >>> >>> >>> Going back to the beginning: with pKVM, we likely want the following >>> >>> 1) Convert pages private<->shared in-place >>> 2) Stop user space + kernel from accessing private memory in process >>> context. Likely for pKVM we would only crash the process, which >>> would be acceptable. >>> 3) Prevent GUP to private memory. Otherwise we could crash the kernel. >>> 4) Prevent private pages from swapout+migration until supported. >>> >>> >>> I suspect your current solution with anonymous memory gets all but 3) >>> sorted out, correct? >>> >>> I'm curious, may there be a requirement in the future that shared memory >>> could be mapped into other processes? (thinking vhost-user and such >>> things). Of course that's impossible with anonymous memory; teaching >>> shmem to contain private memory would kind-of lead to ... guest_memfd, >>> just that we don't have shared memory there. >>> >> >> I was just thinking of something stupid, not sure if it makes any sense. >> I'll raise it here before I forget over the weekend. >> >> ... what if we glued one guest_memfd and a memfd (shmem) together in the >> kernel somehow? >> >> (1) A to-shared conversion moves a page from the guest_memfd to the memfd. >> >> (2) A to-private conversion moves a page from the memfd to the guest_memfd. >> >> Only the memfd can be mmap'ed/read/written/GUP'ed. 
Pages in the memfd behave >> like any shmem pages: migratable, swappable etc. >> >> >> Of course, (2) is only possible if the page is not pinned, not mapped (we >> can unmap it). AND, the page must not reside on ZONE_MOVABLE / MIGRATE_CMA. >> > > Quentin gave idea offline of using splice to achieve the conversions. > I'd want to use the in-kernel APIs on page-fault to do the conversion; > not requiring userspace to make the splice() syscall. One thing splice > currently requires is the source (in) file; KVM UAPI today only gives > userspace address. We could resolve that by for_each_vma_range(). I've > just started looking into splice(), but I believe it takes care of not > pinned and not mapped. guest_memfd would have to migrate the page out of > ZONE_MOVABLE / MIGRATE_CMA. I don't think we want to involve splice. Conceptually, I think KVM should create a pair of FDs: guest_memfd for private memory and "ordinary shmem/memfd" for shared memory. Conversion back and forth can either be triggered using a KVM API (TDX use case), or internally from KVM (pkvm use case). Maybe it does something internally that splice would do and that we can reuse; otherwise we have to do the plumbing ourselves. Then, we have some logic on how to handle access to unbacked regions (SIGBUS instead of allocating memory) inside both memfds, and allow memory to be allocated for parts of the fds explicitly. No offset can be populated in both fds at the same time. That is, pages can be moved back and forth, but allocating a fresh page in an fd is only possible if there is nothing at that location in the other fd. No memory over-allocation. Coming up with a KVM API for that should be possible. > > Does this seem like a good path to pursue further or any other ideas for > doing the conversion? > >> We'd have to decide what to do when we access a "hole" in the memfd -- >> instead of allocating a fresh page and filling the hole, we'd want to >> SIGBUS. > > Since the KVM UAPI is based on userspace addresses and not fds for the > shared memory part, maybe we could add a mmu_notifier_ops that allows > KVM to intercept and reject faults if we couldn't reclaim the memory. I > think it would be conceptually similar to userfaultfd except in the > kernel; not sure if re-using userfaultfd makes sense? Or, if KVM exposes this other fd as well, we extend the UAPI so that the shared part is also consumed as fd+offset.
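[Editor's note: a sketch of the invariant described above — an offset is backed in at most one of the two fds, and conversion moves the folio across. Every type and helper below is invented for illustration; nothing like this exists upstream.]

```c
/*
 * Purely illustrative: convert one offset of the paired fds to shared by
 * moving the folio from guest_memfd to the ordinary memfd.
 */
static int gmem_pair_convert_to_shared(struct gmem_pair *pair, pgoff_t index)
{
	struct folio *folio;
	int ret;

	folio = gmem_private_lookup(pair->private, index);	/* hypothetical */
	if (!folio)
		return -ENOENT;		/* nothing private at this offset */

	/* hypervisor-specific unshare/undonate step (pKVM, TDX, ...) */
	ret = gmem_pair_arch_unshare(pair, folio);		/* hypothetical */
	if (ret)
		return ret;

	/*
	 * Moving, not copying, keeps the invariant: the offset becomes a hole
	 * in guest_memfd and is populated in the shared memfd, never both.
	 */
	return gmem_pair_move_folio(pair->private, pair->shared, index, folio);
}
```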
Hi Elliot, On Fri, Mar 22, 2024 at 11:46:10AM -0700, Elliot Berman wrote: > On Fri, Mar 22, 2024 at 04:36:55PM +0000, Will Deacon wrote: > > On Tue, Mar 19, 2024 at 04:54:10PM -0700, Elliot Berman wrote: > > > How would we add the restriction to anonymous memory? > > > > > > Thinking aloud -- do you mean like some sort of "exclusive GUP" flag > > > where mm can ensure that the exclusive GUP pin is the only pin? If the > > > refcount for the page is >1, then the exclusive GUP fails. Any future > > > GUP pin attempts would fail if the refcount has the EXCLUSIVE_BIAS. > > > > Yes, I think we'd want something like that, but I don't think using a > > bias on its own is a good idea as false positives due to a large number > > of page references will then actually lead to problems (i.e. rejecting > > GUP spuriously), no? I suppose if you only considered the new bias in > > conjunction with the AS_NOGUP flag you proposed then it might be ok > > (i.e. when you see the bias, you then go check the address space to > > confirm). What do you think? > > > > I think the AS_NOGUP would prevent GUPing the first place. If we set the > EXCLUSIVE_BIAS value to something like INT_MAX, do we need to be worried > about there being INT_MAX-1 valid GUPs and wanting to add another? From > the GUPer's perspective, I don't think it would be much different from > overflowing the refcount. I don't think we should end up in a position where a malicious user can take a tonne of references and cause functional problems. For example, look at page_maybe_dma_pinned(); the worst case is we end up treating heavily referenced pages as pinned and the soft-dirty logic leaves them perpetually dirtied. The side-effects of what we're talking about here seem to be much worse than that unless we can confirm that the page really is exclusive. > > > > On the mmap() side of things for guest_memfd, a simpler option for us > > > > than what has currently been proposed might be to enforce that the VMM > > > > has unmapped all private pages on vCPU run, failing the ioctl if that's > > > > not the case. It needs a little more tracking in guest_memfd but I think > > > > GUP will then fall out in the wash because only shared pages will be > > > > mapped by userspace and so GUP will fail by construction for private > > > > pages. > > > > > > We can prevent GUP after the pages are marked private, but the pages > > > could be marked private after the pages were already GUP'd. I don't have > > > a good way to detect this, so converting a page to private is difficult. > > > > For anonymous memory, marking the page as private is going to involve an > > exclusive GUP so that the page can safely be donated to the guest. In > > that case, any existing GUP pin should cause that to fail gracefully. > > What is the situation you are concerned about here? > > > > I wasn't thinking about exclusive GUP here. The exclusive GUP should be > able to get the guarantees we need. > > I was thinking about making sure we gracefully handle a race to provide > the same page. The kernel should detect the difference between "we're > already providing the page" and "somebody has an unexpected pin". We can > easily read the refcount if we couldn't take the exclusive pin to know. Thanks, that makes sense to me. For pKVM, the architecture code also tracks all the donated pages, so we should be able to provide additional metadata here if we shuffle things around a little. Will
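[Editor's note: for concreteness, a sketch of how the EXCLUSIVE_BIAS and AS_NOGUP proposals discussed above might fit together. Neither exists upstream, the bias and bit values are invented, and as per Will's concern the bias is only treated as a hint that must be confirmed against the mapping flag.]

```c
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page_ref.h>

#define EXCLUSIVE_BIAS	INT_MAX		/* value floated in this thread; not upstream */
#define AS_NOGUP	8		/* hypothetical address_space flag bit; not upstream */

/* Sketch: take the one-and-only pin before donating the folio to the guest. */
static bool folio_try_exclusive_pin(struct folio *folio, int expected_refs)
{
	/* Succeeds only if @expected_refs is everything that references the folio. */
	if (!folio_ref_freeze(folio, expected_refs))
		return false;		/* someone else, e.g. a GUP pin, holds a reference */
	folio_ref_unfreeze(folio, EXCLUSIVE_BIAS);
	return true;
}

/*
 * Sketch: GUP-side check. A huge refcount alone could be a false positive,
 * so only treat the folio as exclusively pinned when the mapping also
 * carries the proposed AS_NOGUP flag.
 */
static bool folio_exclusively_pinned(struct folio *folio)
{
	struct address_space *mapping = folio_mapping(folio);

	if (folio_ref_count(folio) < EXCLUSIVE_BIAS)
		return false;
	return mapping && test_bit(AS_NOGUP, &mapping->flags);
}
```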
Hi again, David, On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: > On 19.03.24 15:31, Will Deacon wrote: > sorry for the late reply! Bah, you and me both! > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > From the pKVM side, we're working on guest_memfd primarily to avoid > > diverging from what other CoCo solutions end up using, but if it gets > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > today with anonymous memory, then it's a really hard sell to switch over > > from what we have in production. We're also hoping that, over time, > > guest_memfd will become more closely integrated with the mm subsystem to > > enable things like hypervisor-assisted page migration, which we would > > love to have. > > Reading Sean's reply, he has a different view on that. And I think that's > the main issue: there are too many different use cases and too many > different requirements that could turn guest_memfd into something that maybe > it really shouldn't be. No argument there, and we're certainly not tied to any specific mechanism on the pKVM side. Maybe Sean can chime in, but we've definitely spoken about migration being a goal in the past, so I guess something changed since then on the guest_memfd side. Regardless, from our point of view, we just need to make sure that whatever we settle on for pKVM does the things we need it to do (or can at least be extended to do them) and we're happy to implement that in whatever way works best for upstream, guest_memfd or otherwise. > > We're happy to pursue alternative approaches using anonymous memory if > > you'd prefer to keep guest_memfd limited in functionality (e.g. > > preventing GUP of private pages by extending mapping_flags as per [1]), > > but we're equally willing to contribute to guest_memfd if extensions are > > welcome. > > > > What do you prefer? > > Let me summarize the history: First off, thanks for piecing together the archaeology... > AMD had its thing running and it worked for them (but I recall it was hacky > :) ). > > TDX made it possible to crash the machine when accessing secure memory from > user space (MCE). > > So secure memory must not be mapped into user space -- no page tables. > Prototypes with anonymous memory existed (and I didn't hate them, although > hacky), but one of the other selling points of guest_memfd was that we could > create VMs that wouldn't need any page tables at all, which I found > interesting. Are the prototypes you refer to here based on the old stuff from Kirill? We followed that work at the time, thinking we were going to be using that before guest_memfd came along, so we've sadly been collecting out-of-tree patches for a little while :/ > There was a bit more to that (easier conversion, avoiding GUP, specifying on > allocation that the memory was unmovable ...), but I'll get to that later. > > The design principle was: nasty private memory (unmovable, unswappable, > inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" > memory is allocated from an ordinary memfd. > > This makes sense: shared memory is neither nasty nor special. You can > migrate it, swap it out, map it into page tables, GUP it, ... without any > issues. 
Slight aside and not wanting to derail the discussion, but we have a few different types of sharing which we'll have to consider: * Memory shared from the host to the guest. This remains owned by the host and the normal mm stuff can be made to work with it. * Memory shared from the guest to the host. This remains owned by the guest, so there's a pin on the pages and the normal mm stuff can't work without co-operation from the guest (see next point). * Memory relinquished from the guest to the host. This actually unmaps the pages from the host and transfers ownership back to the host, after which the pin is dropped and the normal mm stuff can work. We use this to implement ballooning. I suppose the main thing is that the architecture backend can deal with these states, so the core code shouldn't really care as long as it's aware that shared memory may be pinned. > So if I would describe some key characteristics of guest_memfd as of today, > it would probably be: > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > 2) Memory is inaccessible. It cannot be read from user space, the > kernel, it cannot be GUP'ed ... only some mechanisms might end up > touching that memory (e.g., hibernation, /proc/kcore) might end up > touching it "by accident", and we usually can handle these cases. > 3) Memory can be discarded in page granularity. There should be no cases > where you cannot discard memory to over-allocate memory for private > pages that have been replaced by shared pages otherwise. > 4) Page tables are not required (well, it's an memfd), and the fd could > in theory be passed to other processes. > > Having "ordinary shared" memory in there implies that 1) and 2) will have to > be adjusted for them, which kind-of turns it "partially" into ordinary shmem > again. Yes, and we'd also need a way to establish hugepages (where possible) even for the *private* memory so as to reduce the depth of the guest's stage-2 walk. > Going back to the beginning: with pKVM, we likely want the following > > 1) Convert pages private<->shared in-place > 2) Stop user space + kernel from accessing private memory in process > context. Likely for pKVM we would only crash the process, which > would be acceptable. > 3) Prevent GUP to private memory. Otherwise we could crash the kernel. > 4) Prevent private pages from swapout+migration until supported. > > > I suspect your current solution with anonymous memory gets all but 3) sorted > out, correct? I agree on all of these and, yes, (3) is the problem for us. We've also been thinking a bit about CoW recently and I suspect the use of vm_normal_page() in do_wp_page() could lead to issues similar to those we hit with GUP. There are various ways to approach that, but I'm not sure what's best. > I'm curious, may there be a requirement in the future that shared memory > could be mapped into other processes? (thinking vhost-user and such things). It's not impossible. We use crosvm as our VMM, and that has a multi-process sandbox mode which I think relies on just that... Cheers, Will (btw: I'm getting some time away from the computer over Easter, so I'll be a little slow on email again. Nothing personal!).
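For illustration only, the sharing flavours listed above could be labelled by the architecture backend along these lines; the enum and its names are invented for this sketch and are not pKVM's actual state tracking:

enum pkvm_share_state {
        PKVM_HOST_OWNED,        /* plain host memory, not visible to the guest */
        PKVM_SHARED_BY_HOST,    /* host -> guest share, host keeps ownership */
        PKVM_SHARED_BY_GUEST,   /* guest -> host share, guest keeps ownership,
                                 * so the host holds a pin and cannot
                                 * migrate or swap the page */
        PKVM_GUEST_OWNED,       /* private, unmapped from the host */
};

/*
 * Relinquishing (e.g. ballooning) transitions PKVM_SHARED_BY_GUEST or
 * PKVM_GUEST_OWNED back to PKVM_HOST_OWNED, at which point the pin is
 * dropped and the normal mm machinery applies again.
 */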
On 27.03.24 20:34, Will Deacon wrote: > Hi again, David, > > On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: >> On 19.03.24 15:31, Will Deacon wrote: >> sorry for the late reply! > > Bah, you and me both! This time I'm faster! :) > >>> On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: >>>> On 19.03.24 01:10, Sean Christopherson wrote: >>>>> On Mon, Mar 18, 2024, Vishal Annapurve wrote: >>>>>> On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: >>> From the pKVM side, we're working on guest_memfd primarily to avoid >>> diverging from what other CoCo solutions end up using, but if it gets >>> de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do >>> today with anonymous memory, then it's a really hard sell to switch over >>> from what we have in production. We're also hoping that, over time, >>> guest_memfd will become more closely integrated with the mm subsystem to >>> enable things like hypervisor-assisted page migration, which we would >>> love to have. >> >> Reading Sean's reply, he has a different view on that. And I think that's >> the main issue: there are too many different use cases and too many >> different requirements that could turn guest_memfd into something that maybe >> it really shouldn't be. > > No argument there, and we're certainly not tied to any specific > mechanism on the pKVM side. Maybe Sean can chime in, but we've > definitely spoken about migration being a goal in the past, so I guess > something changed since then on the guest_memfd side. > > Regardless, from our point of view, we just need to make sure that > whatever we settle on for pKVM does the things we need it to do (or can > at least be extended to do them) and we're happy to implement that in > whatever way works best for upstream, guest_memfd or otherwise. > >>> We're happy to pursue alternative approaches using anonymous memory if >>> you'd prefer to keep guest_memfd limited in functionality (e.g. >>> preventing GUP of private pages by extending mapping_flags as per [1]), >>> but we're equally willing to contribute to guest_memfd if extensions are >>> welcome. >>> >>> What do you prefer? >> >> Let me summarize the history: > > First off, thanks for piecing together the archaeology... > >> AMD had its thing running and it worked for them (but I recall it was hacky >> :) ). >> >> TDX made it possible to crash the machine when accessing secure memory from >> user space (MCE). >> >> So secure memory must not be mapped into user space -- no page tables. >> Prototypes with anonymous memory existed (and I didn't hate them, although >> hacky), but one of the other selling points of guest_memfd was that we could >> create VMs that wouldn't need any page tables at all, which I found >> interesting. > > Are the prototypes you refer to here based on the old stuff from Kirill? Yes. > We followed that work at the time, thinking we were going to be using > that before guest_memfd came along, so we've sadly been collecting > out-of-tree patches for a little while :/ :/ > >> There was a bit more to that (easier conversion, avoiding GUP, specifying on >> allocation that the memory was unmovable ...), but I'll get to that later. >> >> The design principle was: nasty private memory (unmovable, unswappable, >> inaccessible, un-GUPable) is allocated from guest_memfd, ordinary "shared" >> memory is allocated from an ordinary memfd. >> >> This makes sense: shared memory is neither nasty nor special. 
You can >> migrate it, swap it out, map it into page tables, GUP it, ... without any >> issues. > > Slight aside and not wanting to derail the discussion, but we have a few > different types of sharing which we'll have to consider: Thanks for sharing! > > * Memory shared from the host to the guest. This remains owned by the > host and the normal mm stuff can be made to work with it. Okay, host and guest can access it. We can just migrate memory around, swap it out ... like ordinary guest memory today. > > * Memory shared from the guest to the host. This remains owned by the > guest, so there's a pin on the pages and the normal mm stuff can't > work without co-operation from the guest (see next point). Okay, host and guest can access it, but we cannot migrate memory around or swap it out ... like ordinary guest memory today that is longterm pinned. > > * Memory relinquished from the guest to the host. This actually unmaps > the pages from the host and transfers ownership back to the host, > after which the pin is dropped and the normal mm stuff can work. We > use this to implement ballooning. > Okay, so this is essentially just a state transition between the two above. > I suppose the main thing is that the architecture backend can deal with > these states, so the core code shouldn't really care as long as it's > aware that shared memory may be pinned. So IIUC, the states are: (1) Private: inaccessible by the host, accessible by the guest, "owned by the guest" (2) Host Shared: accessible by the host + guest, "owned by the host" (3) Guest Shared: accessible by the host, "owned by the guest" Memory ballooning is simply transitioning from (3) to (2), and then discarding the memory. Any state I am missing? Which transitions are possible? (1) <-> (2) ? Not sure if the direct transition is possible. (2) <-> (3) ? IIUC yes. (1) <-> (3) ? IIUC yes. There is ongoing work on longterm-pinning memory from a memfd/shmem. So thinking in terms of my vague "guest_memfd fd + memfd fd" pair idea, that approach could look like the following: (1) guest_memfd (could be "with longterm pin") (2) memfd (3) memfd with a longterm pin But again, just some possible idea to make it work with guest_memfd. > >> So if I would describe some key characteristics of guest_memfd as of today, >> it would probably be: >> >> 1) Memory is unmovable and unswappable. Right from the beginning, it is >> allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). >> 2) Memory is inaccessible. It cannot be read from user space, the >> kernel, it cannot be GUP'ed ... only some mechanisms might end up >> touching that memory (e.g., hibernation, /proc/kcore) might end up >> touching it "by accident", and we usually can handle these cases. >> 3) Memory can be discarded in page granularity. There should be no cases >> where you cannot discard memory to over-allocate memory for private >> pages that have been replaced by shared pages otherwise. >> 4) Page tables are not required (well, it's an memfd), and the fd could >> in theory be passed to other processes. >> >> Having "ordinary shared" memory in there implies that 1) and 2) will have to >> be adjusted for them, which kind-of turns it "partially" into ordinary shmem >> again. > > Yes, and we'd also need a way to establish hugepages (where possible) > even for the *private* memory so as to reduce the depth of the guest's > stage-2 walk. > Understood, and as discussed, that's a bit more "hairy". 
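Purely to visualise the "fd pair" idea above, with states (1)-(3) mapped onto two backing files; nothing here is an existing or proposed API:

struct gmem_fd_pair {
        struct file *guest_memfd;       /* state (1): private, unmovable, inaccessible */
        struct file *memfd;             /* state (2): ordinary shared memory;
                                         * state (3): same pages, but longterm-pinned */
};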
>> Going back to the beginning: with pKVM, we likely want the following >> >> 1) Convert pages private<->shared in-place >> 2) Stop user space + kernel from accessing private memory in process >> context. Likely for pKVM we would only crash the process, which >> would be acceptable. >> 3) Prevent GUP to private memory. Otherwise we could crash the kernel. >> 4) Prevent private pages from swapout+migration until supported. >> >> >> I suspect your current solution with anonymous memory gets all but 3) sorted >> out, correct? > > I agree on all of these and, yes, (3) is the problem for us. We've also > been thinking a bit about CoW recently and I suspect the use of > vm_normal_page() in do_wp_page() could lead to issues similar to those > we hit with GUP. There are various ways to approach that, but I'm not > sure what's best. Would COW be required or is that just the nasty side-effect of trying to use anonymous memory? > >> I'm curious, may there be a requirement in the future that shared memory >> could be mapped into other processes? (thinking vhost-user and such things). > > It's not impossible. We use crosvm as our VMM, and that has a > multi-process sandbox mode which I think relies on just that... > Okay, so basing the design on anonymous memory might not be the best choice ... :/ > Cheers, > > Will > > (btw: I'm getting some time away from the computer over Easter, so I'll be > a little slow on email again. Nothing personal!). Sure, no worries! Enjoy!
Hey David, I'll try to pick up the baton while Will is away :-) On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote: > On 27.03.24 20:34, Will Deacon wrote: > > I suppose the main thing is that the architecture backend can deal with > > these states, so the core code shouldn't really care as long as it's > > aware that shared memory may be pinned. > > So IIUC, the states are: > > (1) Private: inaccessible by the host, accessible by the guest, "owned by > the guest" > > (2) Host Shared: accessible by the host + guest, "owned by the host" > > (3) Guest Shared: accessible by the host, "owned by the guest" Yup. > Memory ballooning is simply transitioning from (3) to (2), and then > discarding the memory. Well, not quite actually, see below. > Any state I am missing? So there is probably state (0) which is 'owned only by the host'. It's a bit obvious, but I'll make it explicit because it has its importance for the rest of the discussion. And while at it, there are other cases (memory shared/owned with/by the hypervisor and/or TrustZone) but they're somewhat irrelevant to this discussion. These pages are usually backed by kernel allocations, so much less problematic to deal with. So let's ignore those. > Which transitions are possible? Basically a page must be in the 'exclusively owned' state for an owner to initiate a share or donation. So e.g. a shared page must be unshared before it can be donated to someone else (that is true regardless of the owner, host, guest, hypervisor, ...). That significantly simplifies the state tracking in pKVM. > (1) <-> (2) ? Not sure if the direct transition is possible. Yep, not possible. > (2) <-> (3) ? IIUC yes. Actually it's not directly possible as is. The ballooning procedure is essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a single hypercall when doing ballooning, but it's technically just a (3) -> (1) -> (0) sequence that has been micro-optimized). Note that state (2) is actually never used for protected VMs. It's mainly used to implement standard non-protected VMs. The biggest difference in pKVM between protected and non-protected VMs is basically that in the former case, in the fault path KVM does a (0) -> (1) transition, but in the latter it's (0) -> (2). That implies that in the unprotected case, the host remains the page owner and is allowed to decide to unshare arbitrary pages, to restrict the guest permissions for the shared pages etc, which paves the way for implementing migration, swap, ... relatively easily. > (1) <-> (3) ? IIUC yes. Yep. <snip> > > I agree on all of these and, yes, (3) is the problem for us. We've also > > been thinking a bit about CoW recently and I suspect the use of > > vm_normal_page() in do_wp_page() could lead to issues similar to those > > we hit with GUP. There are various ways to approach that, but I'm not > > sure what's best. > > Would COW be required or is that just the nasty side-effect of trying to use > anonymous memory? That'd qualify as an undesirable side effect I think. > > > > > I'm curious, may there be a requirement in the future that shared memory > > > could be mapped into other processes? (thinking vhost-user and such things). > > > > It's not impossible. We use crosvm as our VMM, and that has a > > multi-process sandbox mode which I think relies on just that... > > > > Okay, so basing the design on anonymous memory might not be the best choice > ... 
:/ So, while we're at this stage, let me throw another idea at the wall to see if it sticks :-) One observation is that a standard memfd would work relatively well for pKVM if we had a way to enforce that all mappings to it are MAP_SHARED. KVM would still need to take an 'exclusive GUP' from the fault path (which may fail in case of a pre-existing GUP, but that's fine), but then CoW and friends largely become a non-issue by construction I think. Is there any way we could enforce that cleanly? Perhaps introducing a sort of 'mmap notifier' would do the trick? By that I mean something a bit similar to an MMU notifier offered by memfd that KVM could register against whenever the memfd is attached to a protected VM memslot. One of the nice things here is that we could retain an entire mapping of the whole of guest memory in userspace, so conversions wouldn't require any additional effort from userspace. A bad thing is that a process that is being passed such a memfd may not expect the new semantic and the inability to map !MAP_SHARED. But I guess a process that receives a handle to private memory must be enlightened regardless of the type of fd, so maybe it's not so bad. Thoughts? Thanks, Quentin
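A small sketch of the ownership rule described earlier in this mail, namely that a page must be exclusively owned before a share or donation can start; the structure and the names are hypothetical rather than the hypervisor's real state machine:

#include <linux/errno.h>

enum pkvm_owner_id { PKVM_ID_HOST, PKVM_ID_GUEST, PKVM_ID_HYP };

struct pkvm_page_state {
        enum pkvm_owner_id owner;
        bool shared;                    /* currently shared with another entity? */
};

/* Shares and donations may only start from the exclusively-owned state. */
static int pkvm_may_share_or_donate(const struct pkvm_page_state *ps,
                                    enum pkvm_owner_id initiator)
{
        if (ps->owner != initiator)
                return -EPERM;          /* only the current owner may transition */
        if (ps->shared)
                return -EBUSY;          /* must unshare first, e.g. (3) -> (1) */
        return 0;
}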
Hi! [...] > >> Any state I am missing? > > So there is probably state (0) which is 'owned only by the host'. It's a > bit obvious, but I'll make it explicit because it has its importance for > the rest of the discussion. Yes, I treated it as "simply not mapped into the VM". > > And while at it, there are other cases (memory shared/owned with/by the > hypervisor and/or TrustZone) but they're somewhat irrelevant to this > discussion. These pages are usually backed by kernel allocations, so > much less problematic to deal with. So let's ignore those. > >> Which transitions are possible? > > Basically a page must be in the 'exclusively owned' state for an owner > to initiate a share or donation. So e.g. a shared page must be unshared > before it can be donated to someone else (that is true regardless of the > owner, host, guest, hypervisor, ...). That simplifies significantly the > state tracking in pKVM. Makes sense! > >> (1) <-> (2) ? Not sure if the direct transition is possible. > > Yep, not possible. > >> (2) <-> (3) ? IIUC yes. > > Actually it's not directly possible as is. The ballooning procedure is > essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a > single hypercall when doing ballooning, but it's technically just a > (3) -> (1) -> (0) sequence that has been micro-optimized). > > Note that state (2) is actually never used for protected VMs. It's > mainly used to implement standard non-protected VMs. The biggest Interesting. > difference in pKVM between protected and non-protected VMs is basically > that in the former case, in the fault path KVM does a (0) -> (1) > transition, but in the latter it's (0) -> (2). That implies that in the > unprotected case, the host remains the page owner and is allowed to > decide to unshare arbitrary pages, to restrict the guest permissions for > the shared pages etc, which paves the way for implementing migration, > swap, ... relatively easily. I'll have to digest that :) ... does that mean that for pKVM with protected VMs, "shared" pages are also never migratable/swappable? > >> (1) <-> (3) ? IIUC yes. > > Yep. > > <snip> >>> I agree on all of these and, yes, (3) is the problem for us. We've also >>> been thinking a bit about CoW recently and I suspect the use of >>> vm_normal_page() in do_wp_page() could lead to issues similar to those >>> we hit with GUP. There are various ways to approach that, but I'm not >>> sure what's best. >> >> Would COW be required or is that just the nasty side-effect of trying to use >> anonymous memory? > > That'd qualify as an undesirable side effect I think. Makes sense! > >>> >>>> I'm curious, may there be a requirement in the future that shared memory >>>> could be mapped into other processes? (thinking vhost-user and such things). >>> >>> It's not impossible. We use crosvm as our VMM, and that has a >>> multi-process sandbox mode which I think relies on just that... >>> >> >> Okay, so basing the design on anonymous memory might not be the best choice >> ... :/ > > So, while we're at this stage, let me throw another idea at the wall to > see if it sticks :-) > > One observation is that a standard memfd would work relatively well for > pKVM if we had a way to enforce that all mappings to it are MAP_SHARED. It should be fairly easy to enforce, I wouldn't worry too much about that. > KVM would still need to take an 'exclusive GUP' from the fault path > (which may fail in case of a pre-existing GUP, but that's fine), but > then CoW and friends largely become a non-issue by construction I think. 
> Is there any way we could enforce that cleanly? Perhaps introducing a > sort of 'mmap notifier' would do the trick? By that I mean something a > bit similar to an MMU notifier offered by memfd that KVM could register > against whenever the memfd is attached to a protected VM memslot. > > One of the nice things here is that we could retain an entire mapping of > the whole of guest memory in userspace, conversions wouldn't require any > additional efforts from userspace. A bad thing is that a process that is > being passed such a memfd may not expect the new semantic and the > inability to map !MAP_SHARED. But I guess a process that receives a I wouldn't worry about the !MAP_SHARED requirement. vhost-user and friends all *must* map it MAP_SHARED to do anything reasonable, so that's what they do. > handle to private memory must be enlightened regardless of the type of > fd, so maybe it's not so bad. > > Thoughts? The whole reason I brought up the guest_memfd+memfd pair idea is that you would similarly be able to do the conversion in the kernel, BUT, you'd never be able to mmap+GUP encrypted pages. Essentially you're using guest_memfd for what it was designed for: private memory that is inaccessible.
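Enforcing the MAP_SHARED-only rule at mmap() time is indeed a well-worn pattern; a minimal sketch, where the function and vm_ops names are placeholders and the fault handler is elided:

#include <linux/fs.h>
#include <linux/mm.h>

static const struct vm_operations_struct restricted_memfd_vm_ops; /* .fault etc. elided */

static int restricted_memfd_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* MAP_PRIVATE mappings would reintroduce CoW and anonymous pages. */
        if (!(vma->vm_flags & VM_SHARED))
                return -EINVAL;

        file_accessed(file);
        vma->vm_ops = &restricted_memfd_vm_ops;
        return 0;
}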
On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote: > ... does that mean that for pKVM with protected VMs, "shared" pages are also > never migratable/swappable? In our current implementation, yes, KVM keeps its longterm GUP pin on pages that are shared back. And we might want to retain this behaviour in the short term, even with guest_memfd or using the hybrid approach you suggested. But that could totally be relaxed in the future, it's "just" a matter of adding extra support to the hypervisor for that. That has not been prioritized yet since the number of shared pages in practice is relatively small for current use-cases, so ballooning was a better option (and in the case of ballooning, we do drop the GUP pin). But that's clearly on the TODO list! > The whole reason I brought up the guest_memfd+memfd pair idea is that you > would similarly be able to do the conversion in the kernel, BUT, you'd never > be able to mmap+GUP encrypted pages. > > Essentially you're using guest_memfd for what it was designed for: private > memory that is inaccessible. Ack, that sounds pretty reasonable to me. But I think we'd still want to make sure the other users of guest_memfd have the _desire_ to support huge pages, migration, swap (probably longer term), and related features, otherwise I don't think a guest_memfd-based option will really work for us :-) Thanks, Quentin
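A rough sketch of the relinquish path mentioned above: once the hypervisor has reclaimed the page, KVM can drop the longterm GUP pin it took when the guest shared the page back. pkvm_hyp_reclaim_page() is a made-up stand-in for the real hypercall wrapper, and the error handling is minimal:

#include <linux/mm.h>

static int relinquish_shared_page(struct page *page, u64 ipa)
{
        int ret;

        /* Ask the hypervisor to unmap and reclaim the page (hypothetical call). */
        ret = pkvm_hyp_reclaim_page(ipa);
        if (ret)
                return ret;

        /* The pin was taken with FOLL_LONGTERM when the page was shared back. */
        unpin_user_page(page);
        return 0;
}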
On 28.03.24 11:58, Quentin Perret wrote: > On Thursday 28 Mar 2024 at 11:32:21 (+0100), David Hildenbrand wrote: >> ... does that mean that for pKVM with protected VMs, "shared" pages are also >> never migratable/swappable? > > In our current implementation, yes, KVM keeps its longterm GUP pin on > pages that are shared back. And we might want to retain this behaviour > in the short term, even with guest_memfd or using the hybrid approach > you suggested. But that could totally be relaxed in the future, it's > "just" a matter of adding extra support to the hypervisor for that. That > has not been prioritized yet since the number of shared pages in > practice is relatively small for current use-cases, so ballooning was a > better option (and in the case of ballooning, we do drop the GUP pin). > But that's clearly on the TODO list! Okay, so nothing "fundamental", good! > >> The whole reason I brought up the guest_memfd+memfd pair idea is that you >> would similarly be able to do the conversion in the kernel, BUT, you'd never >> be able to mmap+GUP encrypted pages. >> >> Essentially you're using guest_memfd for what it was designed for: private >> memory that is inaccessible. > > Ack, that sounds pretty reasonable to me. But I think we'd still want to > make sure the other users of guest_memfd have the _desire_ to support > huge pages, migration, swap (probably longer term), and related > features, otherwise I don't think a guest_memfd-based option will > really work for us :-) *Probably* some easy way to get hugetlb pages into a guest_memfd would be by allocating them for a memfd and then converting/moving them into the guest_memfd part of the "fd pair" on conversion to private :) (but the "partial shared, partial private" case is and remains the ugly thing that is hard and I still don't think it makes sense. Maybe it could be handled somehow in such a dual approach with some enlightenment in the fds ... hard to find solutions for things that don't make any sense :P ) I also do strongly believe that we want to see some HW-assisted migration support for guest_memfd pages. Swap, as you say, maybe in the long-term. After all, we're not interested in having MM features for backing memory that you could similarly find under Windows 95. Wait, that one did support swapping! :P But unfortunately, that's what the shiny new CoCo world currently offers. Well, excluding s390x secure execution, as discussed.
On Thu, Mar 28, 2024 at 4:41 AM David Hildenbrand <david@redhat.com> wrote: > > .... > > > >> The whole reason I brought up the guest_memfd+memfd pair idea is that you > >> would similarly be able to do the conversion in the kernel, BUT, you'd never > >> be able to mmap+GUP encrypted pages. > >> > >> Essentially you're using guest_memfd for what it was designed for: private > >> memory that is inaccessible. > > > > Ack, that sounds pretty reasonable to me. But I think we'd still want to > > make sure the other users of guest_memfd have the _desire_ to support > > huge pages, migration, swap (probably longer term), and related > > features, otherwise I don't think a guest_memfd-based option will > > really work for us :-) > > *Probably* some easy way to get hugetlb pages into a guest_memfd would > be by allocating them for a memfd and then converting/moving them into > the guest_memfd part of the "fd pair" on conversion to private :) > > (but the "partial shared, partial private" case is and remains the ugly > thing that is hard and I still don't think it makes sense. Maybe it > could be handled somehow in such a dual approach with some enlightenment > in the fds ... hard to find solutions for things that don't make any > sense :P ) > I would again emphasize that this use case exists for Confidential VMs, whether we like it or not. 1) TDX hardware allows usage of 1G pages to back guest memory. 2) Larger VM sizes benefit more with 1G page sizes, which would be the norm for VMs exposing GPU/TPU devices. 3) Confidential VMs will need to share host resources with non-confidential VMs using 1G pages. 4) When using normal shmem/hugetlbfs files to back guest memory, this use case was achievable by just manipulating guest page tables (although at the cost of host safety, which led to the invention of guest_memfd). Something equivalent "might be possible" with guest_memfd. Without handling "partial shared, partial private", it is impractical to support 1G pages for Confidential VMs (discounting any long term efforts to tame the guest VMs to play nice). Maybe to handle this use case, all the host-side shared memory usage of guest_memfd (userspace, IOMMU, etc.) should be associated with (or tracked via) file ranges rather than offsets within huge pages (like it's done for faulting in private memory pages when populating guest EPTs/NPTs). Given the current guest behavior, host MMU and IOMMU may have to be forced to always map shared memory regions via 4KB mappings. > I also do strongly believe that we want to see some HW-assisted > migration support for guest_memfd pages. Swap, as you say, maybe in the > long-term. After all, we're not interested in having MM features for > backing memory that you could similarly find under Windows 95. Wait, > that one did support swapping! :P > > But unfortunately, that's what the shiny new CoCo world currently > offers. Well, excluding s390x secure execution, as discussed. > > -- > Cheers, > > David / dhildenb >
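One way to picture the "track by file range" suggestion above is to keep a shared/private attribute per 4KiB file index, so that a 1GiB backing folio can be partially shared while host and IOMMU mappings of the shared parts stay at 4KiB. The xarray usage and the names below are illustrative only, not the guest_memfd implementation:

#include <linux/xarray.h>

enum gmem_attr { GMEM_PRIVATE, GMEM_SHARED };

/* Record the attribute for one 4KiB index within the file. */
static int gmem_set_attr(struct xarray *attrs, pgoff_t index, enum gmem_attr attr)
{
        return xa_err(xa_store(attrs, index, xa_mk_value(attr), GFP_KERNEL));
}

/* Host MMU/IOMMU mapping code would consult this at 4KiB granularity. */
static enum gmem_attr gmem_get_attr(struct xarray *attrs, pgoff_t index)
{
        void *entry = xa_load(attrs, index);

        return entry ? xa_to_value(entry) : GMEM_PRIVATE;
}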
On Wed, Mar 27, 2024, Will Deacon wrote: > Hi again, David, > > On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote: > > On 19.03.24 15:31, Will Deacon wrote: > > sorry for the late reply! > > Bah, you and me both! Hold my beer ;-) > > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote: > > > > On 19.03.24 01:10, Sean Christopherson wrote: > > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote: > > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > > From the pKVM side, we're working on guest_memfd primarily to avoid > > > diverging from what other CoCo solutions end up using, but if it gets > > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do > > > today with anonymous memory, then it's a really hard sell to switch over > > > from what we have in production. We're also hoping that, over time, > > > guest_memfd will become more closely integrated with the mm subsystem to > > > enable things like hypervisor-assisted page migration, which we would > > > love to have. > > > > Reading Sean's reply, he has a different view on that. And I think that's > > the main issue: there are too many different use cases and too many > > different requirements that could turn guest_memfd into something that maybe > > it really shouldn't be. > > No argument there, and we're certainly not tied to any specific > mechanism on the pKVM side. Maybe Sean can chime in, but we've > definitely spoken about migration being a goal in the past, so I guess > something changed since then on the guest_memfd side. What's "hypervisor-assisted page migration"? More specifically, what's the mechanism that drives it? I am not opposed to page migration itself, what I am opposed to is adding deep integration with core MM to do some of the fancy/complex things that lead to page migration. Another thing I want to avoid is taking a hard dependency on "struct page", so that we can have line of sight to eliminating "struct page" overhead for guest_memfd, but that's definitely a more distant future concern. > > This makes sense: shared memory is neither nasty nor special. You can > > migrate it, swap it out, map it into page tables, GUP it, ... without any > > issues. > > Slight aside and not wanting to derail the discussion, but we have a few > different types of sharing which we'll have to consider: > > * Memory shared from the host to the guest. This remains owned by the > host and the normal mm stuff can be made to work with it. This seems like it should be !guest_memfd, i.e. can't be converted to guest private (without first unmapping it from the host, but at that point it's completely different memory, for all intents and purposes). > * Memory shared from the guest to the host. This remains owned by the > guest, so there's a pin on the pages and the normal mm stuff can't > work without co-operation from the guest (see next point). Do you happen to have a list of exactly what you mean by "normal mm stuff"? I am not at all opposed to supporting .mmap(), because long term I also want to use guest_memfd for non-CoCo VMs. But I want to be very conservative with respect to what is allowed for guest_memfd. E.g. host userspace can map guest_memfd, and do operations that are directly related to its mapping, but that's about it. > * Memory relinquished from the guest to the host. This actually unmaps > the pages from the host and transfers ownership back to the host, > after which the pin is dropped and the normal mm stuff can work. 
We > use this to implement ballooning. > > I suppose the main thing is that the architecture backend can deal with > these states, so the core code shouldn't really care as long as it's > aware that shared memory may be pinned. > > > So if I would describe some key characteristics of guest_memfd as of today, > > it would probably be: > > > > 1) Memory is unmovable and unswappable. Right from the beginning, it is > > allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...). > > 2) Memory is inaccessible. It cannot be read from user space, the > > kernel, it cannot be GUP'ed ... only some mechanisms might end up > > touching that memory (e.g., hibernation, /proc/kcore) might end up > > touching it "by accident", and we usually can handle these cases. > > 3) Memory can be discarded in page granularity. There should be no cases > > where you cannot discard memory to over-allocate memory for private > > pages that have been replaced by shared pages otherwise. > > 4) Page tables are not required (well, it's an memfd), and the fd could > > in theory be passed to other processes. More broadly, no VMAs are required. The lack of stage-1 page tables is nice to have; the lack of VMAs means that guest_memfd isn't playing second fiddle, e.g. it's not subject to VMA protections, isn't restricted to host mapping size, etc. > > Having "ordinary shared" memory in there implies that 1) and 2) will have to > > be adjusted for them, which kind-of turns it "partially" into ordinary shmem > > again. > > Yes, and we'd also need a way to establish hugepages (where possible) > even for the *private* memory so as to reduce the depth of the guest's > stage-2 walk. Yeah, hugepage support for guest_memfd is very much a WIP. Getting _something_ is easy, getting the right thing is much harder.