
[RFC,0/6] Enable shared device assignment

Message ID 20240725072118.358923-1-chenyi.qiang@intel.com (mailing list archive)

Message

Chenyi Qiang July 25, 2024, 7:21 a.m. UTC
Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
discard") effectively disables device assignment with guest_memfd.
guest_memfd is required for confidential guests, so device assignment to
confidential guests is disabled. A supporting assumption for disabling
device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
etc...) solves the confidential-guest device-assignment problem [1].
That turns out not to be the case because TEE I/O depends on being able
to operate devices against "shared"/untrusted memory for device
initialization and error recovery scenarios.

This series utilizes an existing framework named RamDiscardManager to
notify VFIO of page conversions. However, there is still one concern
related to the semantics of RamDiscardManager, which is designed to
manage the memory plug/unplug state. That is slightly different from the
shared/private state we need to track here. See the "Open" section below
for more details.

Background
==========
Confidential VMs have two classes of memory: shared and private memory.
Shared memory is accessible from the host/VMM while private memory is
not. Confidential VMs can decide which memory is shared/private and
convert memory between shared/private at runtime.

"guest_memfd" is a new kind of fd whose primary goal is to serve guest
private memory. The key differences between guest_memfd and normal memfd
are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
cannot be mapped, read or written by userspace.
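
For readers unfamiliar with guest_memfd, the sketch below shows roughly how
such an fd is created and bound to a memslot, assuming the Linux 6.8+ KVM
UAPI. It is illustrative only (not code from this series); structure layouts
and flag names should be checked against <linux/kvm.h>.

#include <stdint.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Create a guest_memfd owned by the VM; it cannot be mmap()ed by userspace. */
static int create_guest_memfd(int vm_fd, uint64_t size)
{
    struct kvm_create_guest_memfd gmem = {
        .size  = size,
        .flags = 0,
    };

    return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
}

/* Bind one memslot with two backends: userspace_addr backs the shared view,
 * guest_memfd backs the private view. */
static int set_memslot(int vm_fd, uint64_t gpa, uint64_t size,
                       void *shared_hva, int gmem_fd)
{
    struct kvm_userspace_memory_region2 region = {
        .slot               = 0,
        .flags              = KVM_MEM_GUEST_MEMFD,
        .guest_phys_addr    = gpa,
        .memory_size        = size,
        .userspace_addr     = (uint64_t)(uintptr_t)shared_hva,
        .guest_memfd        = gmem_fd,
        .guest_memfd_offset = 0,
    };

    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
}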

In QEMU's implementation, shared memory is allocated with normal methods
(e.g. mmap or fallocate) while private memory is allocated from
guest_memfd. When a VM performs a memory conversion, QEMU frees the pages
on one side via madvise() or via PUNCH_HOLE on the memfd or guest_memfd,
and allocates new pages on the other side.
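
For illustration, here is a minimal sketch of the "free on one side" step
described above, using the Linux madvise()/fallocate() interfaces. The real
QEMU code (around ram_block_discard_range() in system/physmem.c) additionally
handles alignment, hugetlb and error cases; the helper names below are made up.

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <fcntl.h>

/* Free the shared copy of a range (shared -> private conversion). */
static int discard_shared_range(void *hva, int memfd, uint64_t offset,
                                uint64_t size)
{
    if (memfd >= 0) {
        /* fd-backed shared memory: punch a hole in the memfd. */
        return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, size);
    }
    /* anonymous shared memory: drop the pages backing the mapping. */
    return madvise(hva, size, MADV_DONTNEED);
}

/* Free the private copy of a range (private -> shared conversion).
 * guest_memfd pages are only reachable through the fd, so punch a hole. */
static int discard_private_range(int guest_memfd, uint64_t offset,
                                 uint64_t size)
{
    return fallocate(guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, size);
}

Pages on the destination side are then allocated lazily on first access.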

Problem
=======
Device assignment in QEMU is implemented via the VFIO subsystem. In a
normal VM, guest memory is pinned by VFIO at setup time. In a
confidential VM, the guest can convert memory, and when that happens
nothing currently tells VFIO that its mappings are stale. This means
that page conversion leaks memory and leaves stale IOMMU mappings. For
example, a sequence like the following can result in stale IOMMU mappings:

1. allocate shared page
2. convert page shared->private
3. discard shared page
4. convert page private->shared
5. allocate shared page
6. issue DMA operations against that shared page

After step 3, VFIO is still pinning the page. However, DMA operations in
step 6 will hit the old mapping that was set up for the page allocated in
step 1, causing the device to access invalid data.

Commit 852f0048f3 ("RAMBlock: make guest_memfd require
uncoordinated discard") currently blocks device assignment with
guest_memfd to avoid this problem.
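
To make the missing coordination concrete, here is a rough sketch (not QEMU
code) of the legacy VFIO type1 UAPI calls involved. Without an unmap at step
3, the translation installed at step 1 keeps pointing at the freed page:

#include <linux/vfio.h>
#include <sys/ioctl.h>

/* Step 1 / VM setup: VFIO pins the shared page and installs an IOMMU
 * mapping from the IOVA to the host virtual address. */
static int map_shared_page(int container_fd, __u64 iova, __u64 hva, __u64 size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = hva,
        .iova  = iova,
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

/* What is missing today: nothing issues this unmap when the guest converts
 * the page to private and QEMU discards the shared copy (step 3), so the
 * stale translation survives until the DMA in step 6. */
static int unmap_converted_page(int container_fd, __u64 iova, __u64 size)
{
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = iova,
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}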

Solution
========
The key to enable shared device assignment is to solve the stale IOMMU
mappings problem.

Given the constraints and assumptions, here is a solution that satisfies
the use cases. RamDiscardManager, an existing interface currently
utilized by virtio-mem, offers a means to modify IOMMU mappings in
accordance with VM page assignment. Page conversion is similar to
hot-removing a page in one mode and adding it back in the other.

This series implements a RamDiscardManager for confidential VMs and
utilizes its infrastructure to notify VFIO of page conversions.
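
At a high level, the notification path looks roughly like the sketch below.
This is not the exact code from the series: the listener callbacks follow
QEMU's RamDiscardListener interface, while the manager internals (the
rdl_list field, the section_from_range() helper) are simplified placeholders.

static void guest_memfd_notify_conversion(GuestMemfdManager *gmm,
                                          uint64_t offset, uint64_t size,
                                          bool to_private)
{
    RamDiscardListener *rdl;

    QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
        /* Clip the converted range to the listener's registered section
         * (placeholder helper; details omitted). */
        MemoryRegionSection section = section_from_range(gmm->mr, offset, size);

        if (to_private) {
            /* shared -> private: treated as a discard, VFIO unmaps the range. */
            rdl->notify_discard(rdl, &section);
        } else {
            /* private -> shared: treated as a populate, VFIO maps the range. */
            rdl->notify_populate(rdl, &section);
        }
    }
}

With this in place, VFIO's existing RamDiscardListener handlers perform the
actual DMA map/unmap, just as they do for virtio-mem today.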

Another possible attempt [2] was to not discard shared pages in step 3
above. This was an incomplete band-aid because guests would consume
twice the memory since shared pages wouldn't be freed even after they
were converted to private.

Open
====
Implementing a RamDiscardManager to notify VFIO of page conversions
causes changes in semantics: private memory is treated as discarded (or
hot-removed) memory. This isn't aligned with the expectation of current
RamDiscardManager users (e.g. VFIO or live migration), who expect that
discarded memory is hot-removed and can therefore be skipped when they
process guest memory. Treating private memory as discarded won't work in
the future if VFIO or live migration needs to handle private memory. For
example, VFIO may need to map private memory to support Trusted I/O, and
live migration for confidential VMs needs to migrate private memory.

There are two possible ways to mitigate these semantic changes.
1. Develop a new mechanism to notify about page conversions between
private and shared. For example, utilize the notifier_list in QEMU: VFIO
registers its own handler and gets notified upon page conversions. This
is a clean approach which only touches the notifier workflow. A
challenge is that for device hotplug, existing shared memory should be
mapped in the IOMMU, which will need additional changes.

2. Extend the existing RamDiscardManager interface to manage not only
the discarded/populated status of guest memory but also the
shared/private status. RamDiscardManager users like VFIO would be
notified with one more argument indicating what change is happening and
could take action accordingly. This also has challenges: QEMU allows
only one RamDiscardManager per memory region, so supporting virtio-mem
for confidential VMs would be a problem, and some APIs exposed by
RamDiscardManager, such as .is_populated(), are meaningless for
shared/private memory and may need adjustment. (A rough sketch of this
option follows below.)
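
For illustration only, option 2 could look roughly like the following. The
enum, the callback type and the extra state arguments are hypothetical and
not an existing QEMU interface:

/* Hypothetical sketch for option 2 -- not existing QEMU code. */
typedef enum {
    RAM_STATE_DISCARDED,   /* neither shared nor private is populated */
    RAM_STATE_SHARED,      /* host-accessible, should be IOMMU-mapped */
    RAM_STATE_PRIVATE,     /* guest_memfd-backed, inaccessible to userspace */
} RamState;

typedef int (*NotifyRamStateChange)(RamDiscardListener *rdl,
                                    MemoryRegionSection *section,
                                    RamState old_state, RamState new_state);

A VFIO listener could then decide per transition: map on a transition to
SHARED, unmap on a transition away from SHARED, and leave PRIVATE transitions
to whatever mechanism TEE I/O eventually uses.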

Testing
=======
This patch series is tested based on the internal TDX KVM/QEMU tree.

To facilitate shared device assignment with a NIC, use the legacy
type1 VFIO backend with the QEMU command:

qemu-system-x86_64 [...]
    -device vfio-pci,host=XX:XX.X

The vfio_iommu_type1 module parameter dma_entry_limit needs to be raised.
For example, a 16GB guest needs something like
vfio_iommu_type1.dma_entry_limit=4194304.
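
(With 4 KiB conversion granularity, a fully shared 16 GiB guest can require
up to 16 GiB / 4 KiB = 4,194,304 distinct DMA mappings, which is where the
value above comes from; the type1 default limit is only 65535.)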

If using the iommufd-backed VFIO with the QEMU command:

qemu-system-x86_64 [...]
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=XX:XX.X,iommufd=iommufd0

No additional adjustment is required.

After the TD guest boots, its IP address becomes visible, and iperf is
able to successfully send and receive data.

Related link
============
[1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
[2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/

Chenyi Qiang (6):
  guest_memfd: Introduce an object to manage the guest-memfd with
    RamDiscardManager
  guest_memfd: Introduce a helper to notify the shared/private state
    change
  KVM: Notify the state change via RamDiscardManager helper during
    shared/private conversion
  memory: Register the RamDiscardManager instance upon guest_memfd
    creation
  guest-memfd: Default to discarded (private) in guest_memfd_manager
  RAMBlock: make guest_memfd require coordinate discard

 accel/kvm/kvm-all.c                  |   7 +
 include/sysemu/guest-memfd-manager.h |  49 +++
 system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
 system/meson.build                   |   1 +
 system/physmem.c                     |  11 +-
 5 files changed, 492 insertions(+), 1 deletion(-)
 create mode 100644 include/sysemu/guest-memfd-manager.h
 create mode 100644 system/guest-memfd-manager.c


base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819

Comments

David Hildenbrand July 25, 2024, 2:04 p.m. UTC | #1
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions
> causes changes in semantics: private memory is treated as discarded (or
> hot-removed) memory. This isn't aligned with the expectation of current
> RamDiscardManager users (e.g. VFIO or live migration) who really
> expect that discarded memory is hot-removed and thus can be skipped when
> the users are processing guest memory. Treating private memory as
> discarded won't work in future if VFIO or live migration needs to handle
> private memory. e.g. VFIO may need to map private memory to support
> Trusted IO and live migration for confidential VMs need to migrate
> private memory.

"VFIO may need to map private memory to support Trusted IO"

I've been told that the way we handle shared memory won't be the way 
this is going to work with guest_memfd. KVM will coordinate directly 
with VFIO or $whatever and update the IOMMU tables itself right in the 
kernel; the pages are pinned/owned by guest_memfd, so that will just 
work. So I don't consider that currently a concern. guest_memfd private 
memory is not mapped into user page tables and as it currently seems it 
never will be.

Similarly: live migration. We cannot simply migrate that memory the 
traditional way. We even have to track the dirty state differently.

So IMHO, treating both memory as discarded == don't touch it the usual 
way might actually be a feature not a bug ;)

> 
> There are two possible ways to mitigate the semantics changes.
> 1. Develop a new mechanism to notify the page conversions between
> private and shared. For example, utilize the notifier_list in QEMU. VFIO
> registers its own handler and gets notified upon page conversions. This
> is a clean approach which only touches the notifier workflow. A
> challenge is that for device hotplug, existing shared memory should be
> mapped in IOMMU. This will need additional changes.
> 
> 2. Extend the existing RamDiscardManager interface to manage not only
> the discarded/populated status of guest memory but also the
> shared/private status. RamDiscardManager users like VFIO will be
> notified with one more argument indicating what change is happening and
> can take action accordingly. It also has challenges e.g. QEMU allows
> only one RamDiscardManager, how to support virtio-mem for confidential
> VMs would be a problem. And some APIs like .is_populated() exposed by
> RamDiscardManager are meaningless to shared/private memory. So they may
> need some adjustments.

Think of all of that in terms of "shared memory is populated, private 
memory is some inaccessible stuff that needs very special way and other 
means for device assignment, live migration, etc.". Then it actually 
quite makes sense to use of RamDiscardManager (AFAIKS :) ).

> 
> Testing
> =======
> This patch series is tested based on the internal TDX KVM/QEMU tree.
> 
> To facilitate shared device assignment with the NIC, employ the legacy
> type1 VFIO with the QEMU command:
> 
> qemu-system-x86_64 [...]
>      -device vfio-pci,host=XX:XX.X
> 
> The parameter of dma_entry_limit needs to be adjusted. For example, a
> 16GB guest needs to adjust the parameter like
> vfio_iommu_type1.dma_entry_limit=4194304.

But here you note the biggest real issue I see (not related to 
RAMDiscardManager, but that we have to prepare for conversion of each 
possible private page to shared and back): we need a single IOMMU 
mapping for each 4 KiB page.

Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB. 
Does it even scale then?


There is the alternative of having in-place private/shared conversion 
when we also let guest_memfd manage some shared memory. It has plenty of 
downsides, but for the problem at hand it would mean that we don't 
discard on shared/private conversion.

But whenever we want to convert memory shared->private we would 
similarly have to unmap from the IOMMU page tables via VFIO. (The in-place 
conversion will only be allowed if any additional references on a page 
are gone -- when it is inaccessible by userspace/kernel.)

Again, if IOMMU page tables would be managed by KVM in the kernel 
without user space intervention/vfio this would work with device 
assignment just fine. But I guess it will take a while until we actually 
have that option.
Tian, Kevin July 26, 2024, 5:02 a.m. UTC | #2
> From: David Hildenbrand <david@redhat.com>
> Sent: Thursday, July 25, 2024 10:04 PM
> 
> > Open
> > ====
> > Implementing a RamDiscardManager to notify VFIO of page conversions
> > causes changes in semantics: private memory is treated as discarded (or
> > hot-removed) memory. This isn't aligned with the expectation of current
> > RamDiscardManager users (e.g. VFIO or live migration) who really
> > expect that discarded memory is hot-removed and thus can be skipped
> when
> > the users are processing guest memory. Treating private memory as
> > discarded won't work in future if VFIO or live migration needs to handle
> > private memory. e.g. VFIO may need to map private memory to support
> > Trusted IO and live migration for confidential VMs need to migrate
> > private memory.
> 
> "VFIO may need to map private memory to support Trusted IO"
> 
> I've been told that the way we handle shared memory won't be the way
> this is going to work with guest_memfd. KVM will coordinate directly
> with VFIO or $whatever and update the IOMMU tables itself right in the
> kernel; the pages are pinned/owned by guest_memfd, so that will just
> work. So I don't consider that currently a concern. guest_memfd private
> memory is not mapped into user page tables and as it currently seems it
> never will be.

Or could extend MAP_DMA to accept guest_memfd+offset in place of
'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
the pinned pfn.

IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
to manage the mapping of the private memory instead of the use of
guest_memfd.

e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
to check the HPA after the IOMMU walks the existing I/O page tables. 
So reasonably VFIO/IOMMUFD could continue to manage those I/O
page tables including both private and shared memory, with a hint to
know where to find the pfn (host page table or guest_memfd).

But TDX Connect introduces a new I/O page table format (same as secure
EPT) for mapping the private memory and further requires sharing the
secure-EPT between CPU/IOMMU for private. Then it appears to be
a different story.
Chenyi Qiang July 26, 2024, 6:20 a.m. UTC | #3
On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>> Open
>> ====
>> Implementing a RamDiscardManager to notify VFIO of page conversions
>> causes changes in semantics: private memory is treated as discarded (or
>> hot-removed) memory. This isn't aligned with the expectation of current
>> RamDiscardManager users (e.g. VFIO or live migration) who really
>> expect that discarded memory is hot-removed and thus can be skipped when
>> the users are processing guest memory. Treating private memory as
>> discarded won't work in future if VFIO or live migration needs to handle
>> private memory. e.g. VFIO may need to map private memory to support
>> Trusted IO and live migration for confidential VMs need to migrate
>> private memory.
> 
> "VFIO may need to map private memory to support Trusted IO"
> 
> I've been told that the way we handle shared memory won't be the way
> this is going to work with guest_memfd. KVM will coordinate directly
> with VFIO or $whatever and update the IOMMU tables itself right in the
> kernel; the pages are pinned/owned by guest_memfd, so that will just
> work. So I don't consider that currently a concern. guest_memfd private
> memory is not mapped into user page tables and as it currently seems it
> never will be.

That's correct. AFAIK, some TEE I/O solutions like TDX Connect would let
the kernel coordinate and update private mappings in the IOMMU tables. Here,
it mentions that VFIO "may" need to map private memory. I want to make this
more generic to account for potential future TEE I/O solutions that may
require such functionality. :)

> 
> Similarly: live migration. We cannot simply migrate that memory the
> traditional way. We even have to track the dirty state differently.
> 
> So IMHO, treating both memory as discarded == don't touch it the usual
> way might actually be a feature not a bug ;)

Do you mean treating the private memory in both VFIO and live migration
as discarded? That is what this patch series does. And as you mentioned,
these RDM users cannot follow the traditional RDM way. Because of this,
we also considered whether we should use RDM or a more generic mechanism
like notifier_list below.

> 
>>
>> There are two possible ways to mitigate the semantics changes.
>> 1. Develop a new mechanism to notify the page conversions between
>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>> registers its own handler and gets notified upon page conversions. This
>> is a clean approach which only touches the notifier workflow. A
>> challenge is that for device hotplug, existing shared memory should be
>> mapped in IOMMU. This will need additional changes.
>>
>> 2. Extend the existing RamDiscardManager interface to manage not only
>> the discarded/populated status of guest memory but also the
>> shared/private status. RamDiscardManager users like VFIO will be
>> notified with one more argument indicating what change is happening and
>> can take action accordingly. It also has challenges e.g. QEMU allows
>> only one RamDiscardManager, how to support virtio-mem for confidential
>> VMs would be a problem. And some APIs like .is_populated() exposed by
>> RamDiscardManager are meaningless to shared/private memory. So they may
>> need some adjustments.
> 
> Think of all of that in terms of "shared memory is populated, private
> memory is some inaccessible stuff that needs very special way and other
> means for device assignment, live migration, etc.". Then it actually
> quite makes sense to use of RamDiscardManager (AFAIKS :) ).

Yes, such a notification mechanism is what we want. But for the users of
RDM, it would require additional changes accordingly. Current users just
skip inaccessible stuff, but in the private memory case, it can't simply be
skipped. Maybe renaming RamDiscardManager to RamStateManager would be more
accurate then. :)

> 
>>
>> Testing
>> =======
>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>
>> To facilitate shared device assignment with the NIC, employ the legacy
>> type1 VFIO with the QEMU command:
>>
>> qemu-system-x86_64 [...]
>>      -device vfio-pci,host=XX:XX.X
>>
>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>> 16GB guest needs to adjust the parameter like
>> vfio_iommu_type1.dma_entry_limit=4194304.
> 
> But here you note the biggest real issue I see (not related to
> RAMDiscardManager, but that we have to prepare for conversion of each
> possible private page to shared and back): we need a single IOMMU
> mapping for each 4 KiB page.
> 
> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
> Does it even scale then?

The entry limit needs to be increased as the guest memory size
increases. For this issue, are you concerned that having too many
entries might cause performance issues? Maybe we could introduce
some PV mechanism to coordinate with the guest to convert memory only at
2M granularity. This may help mitigate the problem.

> 
> 
> There is the alternative of having in-place private/shared conversion
> when we also let guest_memfd manage some shared memory. It has plenty of
> downsides, but for the problem at hand it would mean that we don't
> discard on shared/private conversion.
>
> But whenever we want to convert memory shared->private we would
> similarly have to from IOMMU page tables via VFIO. (the in-place
> conversion will only be allowed if any additional references on a page
> are gone -- when it is inaccessible by userspace/kernel).

I'm not clear about this in-place private/shared conversion. Can you
elaborate a little bit? It seems this alternative changes private and
shared management in current guest_memfd?

> 
> Again, if IOMMU page tables would be managed by KVM in the kernel
> without user space intervention/vfio this would work with device
> assignment just fine. But I guess it will take a while until we actually
> have that option.
>
David Hildenbrand July 26, 2024, 7:08 a.m. UTC | #4
On 26.07.24 07:02, Tian, Kevin wrote:
>> From: David Hildenbrand <david@redhat.com>
>> Sent: Thursday, July 25, 2024 10:04 PM
>>
>>> Open
>>> ====
>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>> causes changes in semantics: private memory is treated as discarded (or
>>> hot-removed) memory. This isn't aligned with the expectation of current
>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>> expect that discarded memory is hot-removed and thus can be skipped
>> when
>>> the users are processing guest memory. Treating private memory as
>>> discarded won't work in future if VFIO or live migration needs to handle
>>> private memory. e.g. VFIO may need to map private memory to support
>>> Trusted IO and live migration for confidential VMs need to migrate
>>> private memory.
>>
>> "VFIO may need to map private memory to support Trusted IO"
>>
>> I've been told that the way we handle shared memory won't be the way
>> this is going to work with guest_memfd. KVM will coordinate directly
>> with VFIO or $whatever and update the IOMMU tables itself right in the
>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>> work. So I don't consider that currently a concern. guest_memfd private
>> memory is not mapped into user page tables and as it currently seems it
>> never will be.
> 
> Or could extend MAP_DMA to accept guest_memfd+offset in place of
> 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
> the pinned pfn.

In theory yes, and I've been thinking of the same for a while. Until 
people told me that it is unlikely that it will work that way in the future.

> 
> IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
> to manage the mapping of the private memory instead of the use of
> guest_memfd.
> 
> e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
> to check the HPA after the IOMMU walks the existing I/O page tables.
> So reasonably VFIO/IOMMUFD could continue to manage those I/O
> page tables including both private and shared memory, with a hint to
> know where to find the pfn (host page table or guest_memfd).
> 
> But TDX Connect introduces a new I/O page table format (same as secure
> EPT) for mapping the private memory and further requires sharing the
> secure-EPT between CPU/IOMMU for private. Then it appears to be
> a different story.

Yes. This seems to be the future and more in-line with 
in-place/in-kernel conversion as e.g., pKVM wants to have it. If you 
want to avoid user space altogether when doing shared<->private 
conversions, then letting user space manage the IOMMUs is not going to work.


If we ever have to go down that path (MAP_DMA of guest_memfd), we could 
have two RAMDiscardManager for a RAM region, just like we have two 
memory backends: one for shared memory populate/discard (what this 
series tries to achieve), one for private memory populate/discard.

The thing is, that private memory will always have to be special-cased 
all over the place either way, unfortunately.
David Hildenbrand July 26, 2024, 7:20 a.m. UTC | #5
On 26.07.24 08:20, Chenyi Qiang wrote:
> 
> 
> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>> Open
>>> ====
>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>> causes changes in semantics: private memory is treated as discarded (or
>>> hot-removed) memory. This isn't aligned with the expectation of current
>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>> expect that discarded memory is hot-removed and thus can be skipped when
>>> the users are processing guest memory. Treating private memory as
>>> discarded won't work in future if VFIO or live migration needs to handle
>>> private memory. e.g. VFIO may need to map private memory to support
>>> Trusted IO and live migration for confidential VMs need to migrate
>>> private memory.
>>
>> "VFIO may need to map private memory to support Trusted IO"
>>
>> I've been told that the way we handle shared memory won't be the way
>> this is going to work with guest_memfd. KVM will coordinate directly
>> with VFIO or $whatever and update the IOMMU tables itself right in the
>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>> work. So I don't consider that currently a concern. guest_memfd private
>> memory is not mapped into user page tables and as it currently seems it
>> never will be.
> 
> That's correct. AFAIK, some TEE IO solution like TDX Connect would let
> kernel coordinate and update private mapping in IOMMU tables. Here, It
> mentions that VFIO "may" need map private memory. I want to make this
> more generic to account for potential future TEE IO solutions that may
> require such functionality. :)

Careful to not over-engineer something that is not even real or 
close-to-be-real yet, though. :) Nobody really knows how that will look, 
besides that we know for Intel that we won't need that.

> 
>>
>> Similarly: live migration. We cannot simply migrate that memory the
>> traditional way. We even have to track the dirty state differently.
>>
>> So IMHO, treating both memory as discarded == don't touch it the usual
>> way might actually be a feature not a bug ;)
> 
> Do you mean treating the private memory in both VFIO and live migration
> as discarded? That is what this patch series does. And as you mentioned,
> these RDM users cannot follow the traditional RDM way. Because of this,
> we also considered whether we should use RDM or a more generic mechanism
> like notifier_list below.

Yes, the shared memory is logically discarded. At the same time we 
*might* get private memory effectively populated. See my reply to Kevin 
that there might be ways of having shared vs. private populate/discard 
in the future, if required. Just some idea, though.

> 
>>
>>>
>>> There are two possible ways to mitigate the semantics changes.
>>> 1. Develop a new mechanism to notify the page conversions between
>>> private and shared. For example, utilize the notifier_list in QEMU. VFIO
>>> registers its own handler and gets notified upon page conversions. This
>>> is a clean approach which only touches the notifier workflow. A
>>> challenge is that for device hotplug, existing shared memory should be
>>> mapped in IOMMU. This will need additional changes.
>>>
>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>> the discarded/populated status of guest memory but also the
>>> shared/private status. RamDiscardManager users like VFIO will be
>>> notified with one more argument indicating what change is happening and
>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>> need some adjustments.
>>
>> Think of all of that in terms of "shared memory is populated, private
>> memory is some inaccessible stuff that needs very special way and other
>> means for device assignment, live migration, etc.". Then it actually
>> quite makes sense to use of RamDiscardManager (AFAIKS :) ).
> 
> Yes, such notification mechanism is what we want. But for the users of
> RDM, it would require additional change accordingly. Current users just
> skip inaccessible stuff, but in private memory case, it can't be simply
> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
> accurate then. :)

Current users must skip it, yes. How private memory would have to be 
handled, and who would handle it, is rather unclear.

Again, maybe we'd want separate RamDiscardManager for private and shared 
memory (after all, these are two separate memory backends).

Not sure that "RamStateManager" terminology would be reasonable in that 
approach.

> 
>>
>>>
>>> Testing
>>> =======
>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>
>>> To facilitate shared device assignment with the NIC, employ the legacy
>>> type1 VFIO with the QEMU command:
>>>
>>> qemu-system-x86_64 [...]
>>>       -device vfio-pci,host=XX:XX.X
>>>
>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>> 16GB guest needs to adjust the parameter like
>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>
>> But here you note the biggest real issue I see (not related to
>> RAMDiscardManager, but that we have to prepare for conversion of each
>> possible private page to shared and back): we need a single IOMMU
>> mapping for each 4 KiB page.
>>
>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>> Does it even scale then?
> 
> The entry limitation needs to be increased as the guest memory size
> increases. For this issue, are you concerned that having too many
> entries might bring some performance issue? Maybe we could introduce
> some PV mechanism to coordinate with guest to convert memory only in 2M
> granularity. This may help mitigate the problem.

I've had this talk with Intel, because the 4K granularity is a pain. I 
was told that ship has sailed ... and we have to cope with random 4K 
conversions :(

The many mappings will likely add both memory and runtime overheads in 
the kernel. But we only know once we measure.

Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB 
of shared memory :/

> 
>>
>>
>> There is the alternative of having in-place private/shared conversion
>> when we also let guest_memfd manage some shared memory. It has plenty of
>> downsides, but for the problem at hand it would mean that we don't
>> discard on shared/private conversion.
>>
>> But whenever we want to convert memory shared->private we would
>> similarly have to from IOMMU page tables via VFIO. (the in-place
>> conversion will only be allowed if any additional references on a page
>> are gone -- when it is inaccessible by userspace/kernel).
> 
> I'm not clear about this in-place private/shared conversion. Can you
> elaborate a little bit? It seems this alternative changes private and
> shared management in current guest_memfd?

Yes, there have been discussions about that, also in the context of 
supporting huge pages while allowing for the guest to still convert 
individual 4K chunks ...

A summary is here [1]. Likely more things will be covered at Linux Plumbers.


[1] 
https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
Chenyi Qiang July 26, 2024, 10:56 a.m. UTC | #6
On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> On 26.07.24 08:20, Chenyi Qiang wrote:
>>
>>
>> On 7/25/2024 10:04 PM, David Hildenbrand wrote:
>>>> Open
>>>> ====
>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>> causes changes in semantics: private memory is treated as discarded (or
>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>> the users are processing guest memory. Treating private memory as
>>>> discarded won't work in future if VFIO or live migration needs to
>>>> handle
>>>> private memory. e.g. VFIO may need to map private memory to support
>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>> private memory.
>>>
>>> "VFIO may need to map private memory to support Trusted IO"
>>>
>>> I've been told that the way we handle shared memory won't be the way
>>> this is going to work with guest_memfd. KVM will coordinate directly
>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>> work. So I don't consider that currently a concern. guest_memfd private
>>> memory is not mapped into user page tables and as it currently seems it
>>> never will be.
>>
>> That's correct. AFAIK, some TEE IO solution like TDX Connect would let
>> kernel coordinate and update private mapping in IOMMU tables. Here, It
>> mentions that VFIO "may" need map private memory. I want to make this
>> more generic to account for potential future TEE IO solutions that may
>> require such functionality. :)
> 
> Careful to not over-enginner something that is not even real or
> close-to-be-real yet, though. :) Nobody really knows who that will look
> like, besides that we know for Intel that we won't need that.

OK, Thanks for the reminder!

> 
>>
>>>
>>> Similarly: live migration. We cannot simply migrate that memory the
>>> traditional way. We even have to track the dirty state differently.
>>>
>>> So IMHO, treating both memory as discarded == don't touch it the usual
>>> way might actually be a feature not a bug ;)
>>
>> Do you mean treating the private memory in both VFIO and live migration
>> as discarded? That is what this patch series does. And as you mentioned,
>> these RDM users cannot follow the traditional RDM way. Because of this,
>> we also considered whether we should use RDM or a more generic mechanism
>> like notifier_list below.
> 
> Yes, the shared memory is logically discarded. At the same time we
> *might* get private memory effectively populated. See my reply to Kevin
> that there might be ways of having shared vs. private populate/discard
> in the future, if required. Just some idea, though.
> 
>>
>>>
>>>>
>>>> There are two possible ways to mitigate the semantics changes.
>>>> 1. Develop a new mechanism to notify the page conversions between
>>>> private and shared. For example, utilize the notifier_list in QEMU.
>>>> VFIO
>>>> registers its own handler and gets notified upon page conversions. This
>>>> is a clean approach which only touches the notifier workflow. A
>>>> challenge is that for device hotplug, existing shared memory should be
>>>> mapped in IOMMU. This will need additional changes.
>>>>
>>>> 2. Extend the existing RamDiscardManager interface to manage not only
>>>> the discarded/populated status of guest memory but also the
>>>> shared/private status. RamDiscardManager users like VFIO will be
>>>> notified with one more argument indicating what change is happening and
>>>> can take action accordingly. It also has challenges e.g. QEMU allows
>>>> only one RamDiscardManager, how to support virtio-mem for confidential
>>>> VMs would be a problem. And some APIs like .is_populated() exposed by
>>>> RamDiscardManager are meaningless to shared/private memory. So they may
>>>> need some adjustments.
>>>
>>> Think of all of that in terms of "shared memory is populated, private
>>> memory is some inaccessible stuff that needs very special way and other
>>> means for device assignment, live migration, etc.". Then it actually
>>> quite makes sense to use of RamDiscardManager (AFAIKS :) ).
>>
>> Yes, such notification mechanism is what we want. But for the users of
>> RDM, it would require additional change accordingly. Current users just
>> skip inaccessible stuff, but in private memory case, it can't be simply
>> skipped. Maybe renaming RamDiscardManager to RamStateManager is more
>> accurate then. :)
> 
> Current users must skip it, yes. How private memory would have to be
> handled, and who would handle it, is rather unclear.
> 
> Again, maybe we'd want separate RamDiscardManager for private and shared
> memory (after all, these are two separate memory backends).

We also considered distinguishing the populate and discard operations for
private and shared memory separately. As in method 2 above, we mentioned
adding a new argument to indicate the memory attribute to operate on.
These seem to be similar ideas.

> 
> Not sure that "RamStateManager" terminology would be reasonable in that
> approach.
> 
>>
>>>
>>>>
>>>> Testing
>>>> =======
>>>> This patch series is tested based on the internal TDX KVM/QEMU tree.
>>>>
>>>> To facilitate shared device assignment with the NIC, employ the legacy
>>>> type1 VFIO with the QEMU command:
>>>>
>>>> qemu-system-x86_64 [...]
>>>>       -device vfio-pci,host=XX:XX.X
>>>>
>>>> The parameter of dma_entry_limit needs to be adjusted. For example, a
>>>> 16GB guest needs to adjust the parameter like
>>>> vfio_iommu_type1.dma_entry_limit=4194304.
>>>
>>> But here you note the biggest real issue I see (not related to
>>> RAMDiscardManager, but that we have to prepare for conversion of each
>>> possible private page to shared and back): we need a single IOMMU
>>> mapping for each 4 KiB page.
>>>
>>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB.
>>> Does it even scale then?
>>
>> The entry limitation needs to be increased as the guest memory size
>> increases. For this issue, are you concerned that having too many
>> entries might bring some performance issue? Maybe we could introduce
>> some PV mechanism to coordinate with guest to convert memory only in 2M
>> granularity. This may help mitigate the problem.
> 
> I've had this talk with Intel, because the 4K granularity is a pain. I
> was told that ship has sailed ... and we have to cope with random 4K
> conversions :(
> 
> The many mappings will likely add both memory and runtime overheads in
> the kernel. But we only know once we measure.

In the normal case, the main runtime overhead comes from
private<->shared flips in SWIOTLB, which defaults to 6% of memory with a
maximum of 1 GByte. I think this overhead is acceptable. In non-default
cases, e.g. dynamically allocated DMA buffers, the runtime overhead will
increase. As for the memory overhead, it is indeed unavoidable.

Will these performance issues be a deal breaker for enabling shared
device assignment in this way?

> 
> Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB
> of shared memory :/
> 
>>
>>>
>>>
>>> There is the alternative of having in-place private/shared conversion
>>> when we also let guest_memfd manage some shared memory. It has plenty of
>>> downsides, but for the problem at hand it would mean that we don't
>>> discard on shared/private conversion.
>>>
>>> But whenever we want to convert memory shared->private we would
>>> similarly have to from IOMMU page tables via VFIO. (the in-place
>>> conversion will only be allowed if any additional references on a page
>>> are gone -- when it is inaccessible by userspace/kernel).
>>
>> I'm not clear about this in-place private/shared conversion. Can you
>> elaborate a little bit? It seems this alternative changes private and
>> shared management in current guest_memfd?
> 
> Yes, there have been discussions about that, also in the context of
> supporting huge pages while allowing for the guest to still convert
> individual 4K chunks ...
> 
> A summary is here [1]. Likely more things will be covered at Linux
> Plumbers.
> 
> 
> [1]
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
> 

Thanks for your sharing.
Xu Yilun July 31, 2024, 7:12 a.m. UTC | #7
On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote:
> On 26.07.24 07:02, Tian, Kevin wrote:
> > > From: David Hildenbrand <david@redhat.com>
> > > Sent: Thursday, July 25, 2024 10:04 PM
> > > 
> > > > Open
> > > > ====
> > > > Implementing a RamDiscardManager to notify VFIO of page conversions
> > > > causes changes in semantics: private memory is treated as discarded (or
> > > > hot-removed) memory. This isn't aligned with the expectation of current
> > > > RamDiscardManager users (e.g. VFIO or live migration) who really
> > > > expect that discarded memory is hot-removed and thus can be skipped
> > > when
> > > > the users are processing guest memory. Treating private memory as
> > > > discarded won't work in future if VFIO or live migration needs to handle
> > > > private memory. e.g. VFIO may need to map private memory to support
> > > > Trusted IO and live migration for confidential VMs need to migrate
> > > > private memory.
> > > 
> > > "VFIO may need to map private memory to support Trusted IO"
> > > 
> > > I've been told that the way we handle shared memory won't be the way
> > > this is going to work with guest_memfd. KVM will coordinate directly
> > > with VFIO or $whatever and update the IOMMU tables itself right in the
> > > kernel; the pages are pinned/owned by guest_memfd, so that will just
> > > work. So I don't consider that currently a concern. guest_memfd private
> > > memory is not mapped into user page tables and as it currently seems it
> > > never will be.
> > 
> > Or could extend MAP_DMA to accept guest_memfd+offset in place of

With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO
owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO
owned private MMIO. These buffers cannot be found via user page tables
anymore. I'm wondering whether it would be messy to have specific
PFN-finding methods for each FD type. Is it possible to have a unified way
for buffer sharing and PFN finding? Is dma-buf a candidate?

> > 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
> > the pinned pfn.
> 
> In theory yes, and I've been thinking of the same for a while. Until people
> told me that it is unlikely that it will work that way in the future.

Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO
may still allow userspace to manage the IOMMU mapping for private. I'm
not sure how they map private memory for IOMMU without touching gmemfd.

Thanks,
Yilun

> 
> > 
> > IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs
> > to manage the mapping of the private memory instead of the use of
> > guest_memfd.
> > 
> > e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP)
> > to check the HPA after the IOMMU walks the existing I/O page tables.
> > So reasonably VFIO/IOMMUFD could continue to manage those I/O
> > page tables including both private and shared memory, with a hint to
> > know where to find the pfn (host page table or guest_memfd).
> > 
> > But TDX Connect introduces a new I/O page table format (same as secure
> > EPT) for mapping the private memory and further requires sharing the
> > secure-EPT between CPU/IOMMU for private. Then it appears to be
> > a different story.
> 
> Yes. This seems to be the future and more in-line with in-place/in-kernel
> conversion as e.g., pKVM wants to have it. If you want to avoid user space
> altogether when doing shared<->private conversions, then letting user space
> manage the IOMMUs is not going to work.
> 
> 
> If we ever have to go down that path (MAP_DMA of guest_memfd), we could have
> two RAMDiscardManager for a RAM region, just like we have two memory
> backends: one for shared memory populate/discard (what this series tries to
> achieve), one for private memory populate/discard.
> 
> The thing is, that private memory will always have to be special-cased all
> over the place either way, unfortunately.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
>
David Hildenbrand July 31, 2024, 11:05 a.m. UTC | #8
On 31.07.24 09:12, Xu Yilun wrote:
> On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote:
>> On 26.07.24 07:02, Tian, Kevin wrote:
>>>> From: David Hildenbrand <david@redhat.com>
>>>> Sent: Thursday, July 25, 2024 10:04 PM
>>>>
>>>>> Open
>>>>> ====
>>>>> Implementing a RamDiscardManager to notify VFIO of page conversions
>>>>> causes changes in semantics: private memory is treated as discarded (or
>>>>> hot-removed) memory. This isn't aligned with the expectation of current
>>>>> RamDiscardManager users (e.g. VFIO or live migration) who really
>>>>> expect that discarded memory is hot-removed and thus can be skipped
>>>> when
>>>>> the users are processing guest memory. Treating private memory as
>>>>> discarded won't work in future if VFIO or live migration needs to handle
>>>>> private memory. e.g. VFIO may need to map private memory to support
>>>>> Trusted IO and live migration for confidential VMs need to migrate
>>>>> private memory.
>>>>
>>>> "VFIO may need to map private memory to support Trusted IO"
>>>>
>>>> I've been told that the way we handle shared memory won't be the way
>>>> this is going to work with guest_memfd. KVM will coordinate directly
>>>> with VFIO or $whatever and update the IOMMU tables itself right in the
>>>> kernel; the pages are pinned/owned by guest_memfd, so that will just
>>>> work. So I don't consider that currently a concern. guest_memfd private
>>>> memory is not mapped into user page tables and as it currently seems it
>>>> never will be.
>>>
>>> Or could extend MAP_DMA to accept guest_memfd+offset in place of
> 
> With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO
> owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO
> owned private MMIO. These buffers cannot be found by user page table
> anymore. I'm wondering it would be messy to have specific PFN finding
> methods for each FD type. Is it possible we have a unified way for
> buffer sharing and PFN finding, is dma-buf a candidate?

No expert on that, so I'm afraid I can't help.

> 
>>> 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve
>>> the pinned pfn.
>>
>> In theory yes, and I've been thinking of the same for a while. Until people
>> told me that it is unlikely that it will work that way in the future.
> 
> Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO
> may still allow userspace to manage the IOMMU mapping for private. I'm
> not sure how they map private memory for IOMMU without touching gmemfd.

I raised that question in [1]:

"How would the device be able to grab/access "private memory", if not 
via the user page tables?"

Jason summarized it as "The approaches I'm aware of require the secure 
world to own the IOMMU and generate the IOMMU page tables. So we will 
not use a GUP approach with VFIO today as the kernel will not have any 
reason to generate a page table in the first place. Instead we will say 
"this PCI device translates through the secure world" and walk away."

I think for some cVM approaches it really cannot work without letting 
KVM/secure world handle the IOMMU (e.g., sharing of page tables between 
IOMMU and KVM).

For your use case it *might* work, but I am wondering if this is how it 
should be done, and if there are better alternatives.


[1] https://lkml.org/lkml/2024/6/20/920
David Hildenbrand July 31, 2024, 11:18 a.m. UTC | #9
Sorry for the late reply!

>> Current users must skip it, yes. How private memory would have to be
>> handled, and who would handle it, is rather unclear.
>>
>> Again, maybe we'd want separate RamDiscardManager for private and shared
>> memory (after all, these are two separate memory backends).
> 
> We also considered distinguishing the populate and discard operation for
> private and shared memory separately. As in method 2 above, we mentioned
> to add a new argument to indicate the memory attribute to operate on.
> They seem to have a similar idea.

Yes. Likely it's just some implementation detail. I think the following 
states would be possible:

* Discarded in shared + discarded in private (not populated)
* Discarded in shared + populated in private (private populated)
* Populated in shared + discarded in private (shared populated)

One could map these to states discarded/private/shared indeed.

[...]

>> I've had this talk with Intel, because the 4K granularity is a pain. I
>> was told that ship has sailed ... and we have to cope with random 4K
>> conversions :(
>>
>> The many mappings will likely add both memory and runtime overheads in
>> the kernel. But we only know once we measure.
> 
> In the normal case, the main runtime overhead comes from
> private<->shared flip in SWIOTLB, which defaults to 6% of memory with a
> maximum of 1Gbyte. I think this overhead is acceptable. In non-default
> case, e.g. dynamic allocated DMA buffer, the runtime overhead will
> increase. As for the memory overheads, It is indeed unavoidable.
> 
> Will these performance issues be a deal breaker for enabling shared
> device assignment in this way?

I see the most problematic part being the dma_entry_limit and all of 
these individual MAP/UNMAP calls on 4KiB granularity.

dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the 
possible maximum should be 4294967296, and the default is 65535.

So we should be able to have a maximum of 16 TiB shared memory all in 
4KiB chunks.

sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying 
a per-page overhead of ~2.4%, excluding the actual rbtree.

Tree lookup/modifications with that many nodes might also get a bit 
slower, but likely still tolerable as you note.

Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable 
for your use case?
Yin Fengwei Aug. 1, 2024, 7:32 a.m. UTC | #10
Hi David,

On 7/26/2024 3:20 PM, David Hildenbrand wrote:
> Yes, there have been discussions about that, also in the context of 
> supporting huge pages while allowing for the guest to still convert 
> individual 4K chunks ...
> 
> A summary is here [1]. Likely more things will be covered at Linux 
> Plumbers.
> 
> 
> [1] 
> https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
This is a very valuable link. Thanks a lot for sharing.

Aaron and I are particularly interested in huge page (both hugetlb
and THP) support for gmem_fd (per our testing, at least a 10%+ performance
gain with it with TDX for many workloads). We will monitor linux-mm
for this kind of discussion. I am wondering whether it's possible for
you to involve Aaron and me if the discussion is still open but not on
the mailing list (I suppose you will always be included in such
discussions). Thanks.


Regards
Yin, Fengwei
Chenyi Qiang Aug. 2, 2024, 7 a.m. UTC | #11
On 7/31/2024 7:18 PM, David Hildenbrand wrote:
> Sorry for the late reply!
> 
>>> Current users must skip it, yes. How private memory would have to be
>>> handled, and who would handle it, is rather unclear.
>>>
>>> Again, maybe we'd want separate RamDiscardManager for private and shared
>>> memory (after all, these are two separate memory backends).
>>
>> We also considered distinguishing the populate and discard operation for
>> private and shared memory separately. As in method 2 above, we mentioned
>> to add a new argument to indicate the memory attribute to operate on.
>> They seem to have a similar idea.
> 
> Yes. Likely it's just some implementation detail. I think the following
> states would be possible:
> 
> * Discarded in shared + discarded in private (not populated)
> * Discarded in shared + populated in private (private populated)
> * Populated in shared + discarded in private (shared populated)
> 
> One could map these to states discarded/private/shared indeed.

Makes sense. We can follow this if the RamDiscardManager mechanism is
acceptable and there are no other concerns.

> 
> [...]
> 
>>> I've had this talk with Intel, because the 4K granularity is a pain. I
>>> was told that ship has sailed ... and we have to cope with random 4K
>>> conversions :(
>>>
>>> The many mappings will likely add both memory and runtime overheads in
>>> the kernel. But we only know once we measure.
>>
>> In the normal case, the main runtime overhead comes from
>> private<->shared flip in SWIOTLB, which defaults to 6% of memory with a
>> maximum of 1Gbyte. I think this overhead is acceptable. In non-default
>> case, e.g. dynamic allocated DMA buffer, the runtime overhead will
>> increase. As for the memory overheads, It is indeed unavoidable.
>>
>> Will these performance issues be a deal breaker for enabling shared
>> device assignment in this way?
> 
> I see the most problematic part being the dma_entry_limit and all of
> these individual MAP/UNMAP calls on 4KiB granularity.
> 
> dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the
> possible maximum should be 4294967296, and the default is 65535.
> 
> So we should be able to have a maximum of 16 TiB shared memory all in
> 4KiB chunks.
> 
> sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying
> a per-page overhead of ~2.4%, excluding the actual rbtree.
> 
> Tree lookup/modifications with that many nodes might also get a bit
> slower, but likely still tolerable as you note.
> 
> Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable
> for your use case?

Yes. We can't guarantee the behavior of the guest, so the overhead would be
uncertain and unavoidable.

>
Chenyi Qiang Aug. 16, 2024, 3:02 a.m. UTC | #12
Hi Paolo,

Hope to draw your attention. TEE I/O would depend on shared device
assignment, and we introduce this RDM solution in QEMU. Given the
in-place private/shared conversion option mentioned by David, do you
think we should continue to add pass-through support with this in-QEMU
page conversion method, or wait for that discussion to see whether it
will change to in-kernel conversion?

Thanks
Chenyi

On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated
> discard") effectively disables device assignment with guest_memfd.
> guest_memfd is required for confidential guests, so device assignment to
> confidential guests is disabled. A supporting assumption for disabling
> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO
> etc...) solves the confidential-guest device-assignment problem [1].
> That turns out not to be the case because TEE I/O depends on being able
> to operate devices against "shared"/untrusted memory for device
> initialization and error recovery scenarios.
> 
> This series utilizes an existing framework named RamDiscardManager to
> notify VFIO of page conversions. However, there's still one concern
> related to the semantics of RamDiscardManager which is used to manage
> the memory plug/unplug state. This is a little different from the memory
> shared/private in our requirement. See the "Open" section below for more
> details.
> 
> Background
> ==========
> Confidential VMs have two classes of memory: shared and private memory.
> Shared memory is accessible from the host/VMM while private memory is
> not. Confidential VMs can decide which memory is shared/private and
> convert memory between shared/private at runtime.
> 
> "guest_memfd" is a new kind of fd whose primary goal is to serve guest
> private memory. The key differences between guest_memfd and normal memfd
> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and
> cannot be mapped, read or written by userspace.
> 
> In QEMU's implementation, shared memory is allocated with normal methods
> (e.g. mmap or fallocate) while private memory is allocated from
> guest_memfd. When a VM performs memory conversions, QEMU frees pages via
> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and
> allocates new pages from the other side.
> 
> Problem
> =======
> Device assignment in QEMU is implemented via VFIO system. In the normal
> VM, VM memory is pinned at the beginning of time by VFIO. In the
> confidential VM, the VM can convert memory and when that happens
> nothing currently tells VFIO that its mappings are stale. This means
> that page conversion leaks memory and leaves stale IOMMU mappings. For
> example, sequence like the following can result in stale IOMMU mappings:
> 
> 1. allocate shared page
> 2. convert page shared->private
> 3. discard shared page
> 4. convert page private->shared
> 5. allocate shared page
> 6. issue DMA operations against that shared page
> 
> After step 3, VFIO is still pinning the page. However, DMA operations in
> step 6 will hit the old mapping that was allocated in step 1, which
> causes the device to access the invalid data.
> 
> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require
> uncoordinated discard") has blocked the device assignment with
> guest_memfd to avoid this problem.
> 
> Solution
> ========
> The key to enable shared device assignment is to solve the stale IOMMU
> mappings problem.
> 
> Given the constraints and assumptions here is a solution that satisfied
> the use cases. RamDiscardManager, an existing interface currently
> utilized by virtio-mem, offers a means to modify IOMMU mappings in
> accordance with VM page assignment. Page conversion is similar to
> hot-removing a page in one mode and adding it back in the other.
> 
> This series implements a RamDiscardManager for confidential VMs and
> utilizes its infrastructure to notify VFIO of page conversions.
> 
> Another possible attempt [2] was to not discard shared pages in step 3
> above. This was an incomplete band-aid because guests would consume
> twice the memory since shared pages wouldn't be freed even after they
> were converted to private.
> 
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions
> causes a change in semantics: private memory is treated as discarded (or
> hot-removed) memory. This isn't aligned with the expectation of current
> RamDiscardManager users (e.g. VFIO or live migration), who expect that
> discarded memory is hot-removed and can therefore be skipped when
> processing guest memory. Treating private memory as discarded won't work
> in the future if VFIO or live migration needs to handle private memory:
> e.g. VFIO may need to map private memory to support Trusted I/O, and
> live migration for confidential VMs needs to migrate private memory.
> 
> There are two possible ways to mitigate this change in semantics (rough
> sketches of both options follow after the list).
> 1. Develop a new mechanism to notify about page conversions between
> private and shared, for example by utilizing the notifier_list in QEMU.
> VFIO registers its own handler and gets notified upon page conversions.
> This is a clean approach which only touches the notifier workflow. A
> challenge is that for device hotplug, existing shared memory must
> already be mapped in the IOMMU, which will need additional changes.
> 
> 2. Extend the existing RamDiscardManager interface to manage not only
> the discarded/populated status of guest memory but also the
> shared/private status. RamDiscardManager users like VFIO would be
> notified with one more argument indicating what change is happening and
> could take action accordingly. This also has challenges: e.g. QEMU
> allows only one RamDiscardManager per memory region, so supporting
> virtio-mem for confidential VMs would be a problem, and some APIs
> exposed by RamDiscardManager, such as .is_populated(), are meaningless
> for shared/private memory and may need some adjustments.
> 
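> A sketch of what option 1 could look like using QEMU's existing
> Notifier/NotifierList API (the event structure and function names are
> made up for illustration):
> 
>     #include "qemu/osdep.h"
>     #include "qemu/notify.h"
> 
>     /* hypothetical event describing one conversion */
>     typedef struct PageConversionEvent {
>         uint64_t offset;      /* offset into the RAMBlock */
>         uint64_t size;
>         bool to_private;
>     } PageConversionEvent;
> 
>     static NotifierList conversion_notifiers =
>         NOTIFIER_LIST_INITIALIZER(conversion_notifiers);
> 
>     /* VFIO would register a Notifier whose .notify callback maps or
>      * unmaps the converted range */
>     void page_conversion_notifier_add(Notifier *n)
>     {
>         notifier_list_add(&conversion_notifiers, n);
>     }
> 
>     /* called from the KVM conversion path */
>     void page_conversion_notify(PageConversionEvent *event)
>     {
>         notifier_list_notify(&conversion_notifiers, event);
>     }
> 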
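> And a purely illustrative sketch of option 2, where the listener
> callback would gain an argument describing the state change (none of
> these names exist today):
> 
>     #include "qemu/osdep.h"
>     #include "exec/memory.h"
> 
>     /* hypothetical range states a listener could be told about */
>     typedef enum {
>         RAM_RANGE_POPULATED_SHARED,
>         RAM_RANGE_POPULATED_PRIVATE,
>         RAM_RANGE_DISCARDED,
>     } RamRangeState;
> 
>     /* hypothetical replacement for notify_populate()/notify_discard():
>      * VFIO would map on *_SHARED and unmap otherwise, while a future
>      * Trusted I/O capable VFIO could also map *_PRIVATE ranges */
>     typedef int (*NotifyRamStateChange)(RamDiscardListener *rdl,
>                                         MemoryRegionSection *section,
>                                         RamRangeState new_state);
> 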
> Testing
> =======
> This patch series was tested on top of the internal TDX KVM/QEMU tree.
> 
> To test shared device assignment with a NIC, use the legacy type1 VFIO
> backend with the QEMU command:
> 
> qemu-system-x86_64 [...]
>     -device vfio-pci,host=XX:XX.X
> 
> The dma_entry_limit module parameter of vfio_iommu_type1 needs to be
> raised. For example, a 16GB guest needs something like
> vfio_iommu_type1.dma_entry_limit=4194304 (16GB of 4KB pages is 4194304
> possible DMA entries).
> 
> If instead the iommufd-backed VFIO is used, with the QEMU command:
> 
> qemu-system-x86_64 [...]
>     -object iommufd,id=iommufd0 \
>     -device vfio-pci,host=XX:XX.X,iommufd=iommufd0
> 
> then no additional adjustment is required.
> 
> Following the bootup of the TD guest, the guest's IP address becomes
> visible, and iperf is able to successfully send and receive data.
> 
> Related link
> ============
> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/
> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/
> 
> Chenyi Qiang (6):
>   guest_memfd: Introduce an object to manage the guest-memfd with
>     RamDiscardManager
>   guest_memfd: Introduce a helper to notify the shared/private state
>     change
>   KVM: Notify the state change via RamDiscardManager helper during
>     shared/private conversion
>   memory: Register the RamDiscardManager instance upon guest_memfd
>     creation
>   guest-memfd: Default to discarded (private) in guest_memfd_manager
>   RAMBlock: make guest_memfd require coordinate discard
> 
>  accel/kvm/kvm-all.c                  |   7 +
>  include/sysemu/guest-memfd-manager.h |  49 +++
>  system/guest-memfd-manager.c         | 425 +++++++++++++++++++++++++++
>  system/meson.build                   |   1 +
>  system/physmem.c                     |  11 +-
>  5 files changed, 492 insertions(+), 1 deletion(-)
>  create mode 100644 include/sysemu/guest-memfd-manager.h
>  create mode 100644 system/guest-memfd-manager.c
> 
> 
> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
Chenyi Qiang Oct. 8, 2024, 8:59 a.m. UTC | #13
Hi Paolo,

Kindly ping on this thread. In-place page conversion was discussed at
Linux Plumbers. Does that give any direction for the shared device
assignment enabling work?

Thanks
Chenyi

On 8/16/2024 11:02 AM, Chenyi Qiang wrote:
> Hi Paolo,
> 
> Hope to draw your attention. TEE I/O would depend on shared device
> assignment, and we introduce this RDM solution in QEMU. Now, given the
> in-place private/shared conversion option mentioned by David, do you
> think we should continue to add pass-through support with this in-QEMU
> page conversion method, or wait for that discussion to see whether it
> will change to in-kernel conversion?
> 
> Thanks
> Chenyi
> 
> On 7/25/2024 3:21 PM, Chenyi Qiang wrote:
>> [...]
Rob Nertney Nov. 15, 2024, 4:47 p.m. UTC | #14
On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote:
> Hi Paolo,
> 
> Kindly ping on this thread. In-place page conversion was discussed at
> Linux Plumbers. Does that give any direction for the shared device
> assignment enabling work?
>
Hi everybody.

Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to
provide AI acceleration within TEE CVMs. We require passing through the GPU via
VFIO stubbing, which means that we are impacted by the absence of an API to
inform VFIO about page conversions.

The CSPs have enough kernel engineers who handle this process in their own host
kernels, but we have several enterprise customers who are eager to begin using
this solution in the upstream. AMD has successfully ported enough of the
SEV-SNP support into 6.11 and our initial testing shows successful operation,
but only by disabling discard via these two QEMU patches:
- https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8
- https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095

This "workaround" is a bit of a hack, as it effectively requires greater than
double the amount of host memory than as to be allocated to the guest CVM. The
proposal here appears to be a promising workaround; are there other solutions
that are recommended for this use case?

This configuration is in GA right now and NVIDIA is committed to supporting and
testing this bounce-buffer mailbox solution for many years into the future, so
we're highly invested in seeing a converged solution in the upstream.

Thanks,
Rob

> [...]
David Hildenbrand Nov. 15, 2024, 5:20 p.m. UTC | #15
On 15.11.24 17:47, Rob Nertney wrote:
> On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote:
>> Hi Paolo,
>>
>> Kindly ping on this thread. In-place page conversion was discussed at
>> Linux Plumbers. Does that give any direction for the shared device
>> assignment enabling work?
>>
> Hi everybody.

Hi,

> 
> Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to
> provide AI acceleration within TEE CVMs. We require passing through the GPU via
> VFIO stubbing, which means that we are impacted by the absence of an API to
> inform VFIO about page conversions.
> 
> The CSPs have enough kernel engineers who handle this process in their own host
> kernels, but we have several enterprise customers who are eager to begin using
> this solution in the upstream. AMD has successfully ported enough of the
> SEV-SNP support into 6.11 and our initial testing shows successful operation,
> but only by disabling discard via these two QEMU patches:
> - https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8
> - https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095
> 
> This "workaround" is a bit of a hack, as it effectively requires greater than
> double the amount of host memory than as to be allocated to the guest CVM. The
> proposal here appears to be a promising workaround; are there other solutions
> that are recommended for this use case?

What we are working on is supporting private and shared memory in
guest_memfd, and allowing an in-place conversion between shared and
private: this avoids discards + reallocation and consequently any double
memory allocation.

To get stuff into VFIO, we must only map the currently shared pages 
(VFIO will pin + map them), and unmap them (VFIO will unmap + unpin 
them) before converting them to private.

This series should likely achieve the 
unmap-before-conversion-to-private, and map-after-conversion-to-shared, 
such that it could be compatible with guest_memfd.

QEMU would simply mmap the guest_memfd to obtain a user space mapping, 
from which it can pass address ranges to VFIO like we already do. This 
user space mapping only allows for shared pages to be faulted in. 
Currently private pages cannot be faulted in (inaccessible -> SIGBUS). 
So far the theory.
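
In code terms, a minimal sketch of that user space mapping (assuming a
future guest_memfd that allows mmap() of shared pages; names are
placeholders):

    #include <stddef.h>
    #include <sys/mman.h>

    /* user-space view of guest_memfd RAM: only shared pages can be
     * faulted in, touching private pages raises SIGBUS */
    static void *map_guest_memfd(int gmem_fd, size_t ram_size)
    {
        return mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, gmem_fd, 0);
    }

    /* shared ranges inside this mapping are handed to VFIO with
     * VFIO_IOMMU_MAP_DMA as usual, and unmapped again with
     * VFIO_IOMMU_UNMAP_DMA before they are converted back to private */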

I'll note that this is likely not the most elegant solution, but
something that would give us one solution to the problem in a
reasonable timeframe.

Cheers!