Message ID | 20240725072118.358923-1-chenyi.qiang@intel.com (mailing list archive)
---|---
Series | Enable shared device assignment
> Open
> ====
> Implementing a RamDiscardManager to notify VFIO of page conversions causes changes in semantics: private memory is treated as discarded (or hot-removed) memory. This isn't aligned with the expectation of current RamDiscardManager users (e.g. VFIO or live migration) who really expect that discarded memory is hot-removed and thus can be skipped when the users are processing guest memory. Treating private memory as discarded won't work in future if VFIO or live migration needs to handle private memory. e.g. VFIO may need to map private memory to support Trusted IO and live migration for confidential VMs need to migrate private memory.

"VFIO may need to map private memory to support Trusted IO"

I've been told that the way we handle shared memory won't be the way this is going to work with guest_memfd. KVM will coordinate directly with VFIO or $whatever and update the IOMMU tables itself right in the kernel; the pages are pinned/owned by guest_memfd, so that will just work. So I don't consider that currently a concern. guest_memfd private memory is not mapped into user page tables and as it currently seems it never will be.

Similarly: live migration. We cannot simply migrate that memory the traditional way. We even have to track the dirty state differently.

So IMHO, treating both memory as discarded == don't touch it the usual way might actually be a feature not a bug ;)

> There are two possible ways to mitigate the semantics changes.
> 1. Develop a new mechanism to notify the page conversions between private and shared. For example, utilize the notifier_list in QEMU. VFIO registers its own handler and gets notified upon page conversions. This is a clean approach which only touches the notifier workflow. A challenge is that for device hotplug, existing shared memory should be mapped in IOMMU. This will need additional changes.
>
> 2. Extend the existing RamDiscardManager interface to manage not only the discarded/populated status of guest memory but also the shared/private status. RamDiscardManager users like VFIO will be notified with one more argument indicating what change is happening and can take action accordingly. It also has challenges e.g. QEMU allows only one RamDiscardManager, how to support virtio-mem for confidential VMs would be a problem. And some APIs like .is_populated() exposed by RamDiscardManager are meaningless to shared/private memory. So they may need some adjustments.

Think of all of that in terms of "shared memory is populated, private memory is some inaccessible stuff that needs a very special way and other means for device assignment, live migration, etc.". Then it actually quite makes sense to make use of RamDiscardManager (AFAIKS :) ).

> Testing
> =======
> This patch series is tested based on the internal TDX KVM/QEMU tree.
>
> To facilitate shared device assignment with the NIC, employ the legacy type1 VFIO with the QEMU command:
>
> qemu-system-x86_64 [...]
> -device vfio-pci,host=XX:XX.X
>
> The parameter of dma_entry_limit needs to be adjusted. For example, a 16GB guest needs to adjust the parameter like vfio_iommu_type1.dma_entry_limit=4194304.

But here you note the biggest real issue I see (not related to RAMDiscardManager, but that we have to prepare for conversion of each possible private page to shared and back): we need a single IOMMU mapping for each 4 KiB page.

Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB? Does it even scale then?
There is the alternative of having in-place private/shared conversion when we also let guest_memfd manage some shared memory. It has plenty of downsides, but for the problem at hand it would mean that we don't discard on shared/private conversion.

But whenever we want to convert memory shared->private we would similarly have to remove it from the IOMMU page tables via VFIO. (The in-place conversion will only be allowed once any additional references on a page are gone -- when it is inaccessible by userspace/kernel.)

Again, if the IOMMU page tables were managed by KVM in the kernel without user-space intervention/VFIO, this would work with device assignment just fine. But I guess it will take a while until we actually have that option.
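For a rough sense of the ceiling being questioned here, the sketch below (an illustration added for this discussion, not code from the series) computes how much shared memory a given vfio_iommu_type1.dma_entry_limit can cover when every mapping is a single 4 KiB page:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t page = 4096;                 /* one DMA mapping per 4 KiB page */
    const uint64_t limits[] = { 65535,          /* type1 default (U16_MAX)        */
                                4194304 };      /* value quoted for a 16GB guest  */

    for (int i = 0; i < 2; i++) {
        printf("dma_entry_limit=%-8llu -> at most %llu MiB of shared memory\n",
               (unsigned long long)limits[i],
               (unsigned long long)(limits[i] * page >> 20));
    }
    return 0;
}
```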
> From: David Hildenbrand <david@redhat.com> > Sent: Thursday, July 25, 2024 10:04 PM > > > Open > > ==== > > Implementing a RamDiscardManager to notify VFIO of page conversions > > causes changes in semantics: private memory is treated as discarded (or > > hot-removed) memory. This isn't aligned with the expectation of current > > RamDiscardManager users (e.g. VFIO or live migration) who really > > expect that discarded memory is hot-removed and thus can be skipped > when > > the users are processing guest memory. Treating private memory as > > discarded won't work in future if VFIO or live migration needs to handle > > private memory. e.g. VFIO may need to map private memory to support > > Trusted IO and live migration for confidential VMs need to migrate > > private memory. > > "VFIO may need to map private memory to support Trusted IO" > > I've been told that the way we handle shared memory won't be the way > this is going to work with guest_memfd. KVM will coordinate directly > with VFIO or $whatever and update the IOMMU tables itself right in the > kernel; the pages are pinned/owned by guest_memfd, so that will just > work. So I don't consider that currently a concern. guest_memfd private > memory is not mapped into user page tables and as it currently seems it > never will be. Or could extend MAP_DMA to accept guest_memfd+offset in place of 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve the pinned pfn. IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs to manage the mapping of the private memory instead of the use of guest_memfd. e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP) to check the HPA after the IOMMU walks the existing I/O page tables. So reasonably VFIO/IOMMUFD could continue to manage those I/O page tables including both private and shared memory, with a hint to know where to find the pfn (host page table or guest_memfd). But TDX Connect introduces a new I/O page table format (same as secure EPT) for mapping the private memory and further requires sharing the secure-EPT between CPU/IOMMU for private. Then it appears to be a different story.
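For readers less familiar with the type1 UAPI, the sketch below shows what a MAP_DMA call looks like today; it is only meant to illustrate where a "guest_memfd + offset" input would have to slot in. That variant is purely hypothetical -- no such field or flag exists in the current UAPI.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Current type1 mapping of *shared* memory via a process virtual address. */
static int map_shared_range(int container_fd, void *vaddr,
                            uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)vaddr,   /* the hypothetical extension would pass
                                        a guest_memfd fd + offset here instead */
        .iova  = iova,
        .size  = size,
    };

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
```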
On 7/25/2024 10:04 PM, David Hildenbrand wrote: >> Open >> ==== >> Implementing a RamDiscardManager to notify VFIO of page conversions >> causes changes in semantics: private memory is treated as discarded (or >> hot-removed) memory. This isn't aligned with the expectation of current >> RamDiscardManager users (e.g. VFIO or live migration) who really >> expect that discarded memory is hot-removed and thus can be skipped when >> the users are processing guest memory. Treating private memory as >> discarded won't work in future if VFIO or live migration needs to handle >> private memory. e.g. VFIO may need to map private memory to support >> Trusted IO and live migration for confidential VMs need to migrate >> private memory. > > "VFIO may need to map private memory to support Trusted IO" > > I've been told that the way we handle shared memory won't be the way > this is going to work with guest_memfd. KVM will coordinate directly > with VFIO or $whatever and update the IOMMU tables itself right in the > kernel; the pages are pinned/owned by guest_memfd, so that will just > work. So I don't consider that currently a concern. guest_memfd private > memory is not mapped into user page tables and as it currently seems it > never will be. That's correct. AFAIK, some TEE IO solution like TDX Connect would let kernel coordinate and update private mapping in IOMMU tables. Here, It mentions that VFIO "may" need map private memory. I want to make this more generic to account for potential future TEE IO solutions that may require such functionality. :) > > Similarly: live migration. We cannot simply migrate that memory the > traditional way. We even have to track the dirty state differently. > > So IMHO, treating both memory as discarded == don't touch it the usual > way might actually be a feature not a bug ;) Do you mean treating the private memory in both VFIO and live migration as discarded? That is what this patch series does. And as you mentioned, these RDM users cannot follow the traditional RDM way. Because of this, we also considered whether we should use RDM or a more generic mechanism like notifier_list below. > >> >> There are two possible ways to mitigate the semantics changes. >> 1. Develop a new mechanism to notify the page conversions between >> private and shared. For example, utilize the notifier_list in QEMU. VFIO >> registers its own handler and gets notified upon page conversions. This >> is a clean approach which only touches the notifier workflow. A >> challenge is that for device hotplug, existing shared memory should be >> mapped in IOMMU. This will need additional changes. >> >> 2. Extend the existing RamDiscardManager interface to manage not only >> the discarded/populated status of guest memory but also the >> shared/private status. RamDiscardManager users like VFIO will be >> notified with one more argument indicating what change is happening and >> can take action accordingly. It also has challenges e.g. QEMU allows >> only one RamDiscardManager, how to support virtio-mem for confidential >> VMs would be a problem. And some APIs like .is_populated() exposed by >> RamDiscardManager are meaningless to shared/private memory. So they may >> need some adjustments. > > Think of all of that in terms of "shared memory is populated, private > memory is some inaccessible stuff that needs very special way and other > means for device assignment, live migration, etc.". Then it actually > quite makes sense to use of RamDiscardManager (AFAIKS :) ). 
Yes, such notification mechanism is what we want. But for the users of RDM, it would require additional change accordingly. Current users just skip inaccessible stuff, but in private memory case, it can't be simply skipped. Maybe renaming RamDiscardManager to RamStateManager is more accurate then. :) > >> >> Testing >> ======= >> This patch series is tested based on the internal TDX KVM/QEMU tree. >> >> To facilitate shared device assignment with the NIC, employ the legacy >> type1 VFIO with the QEMU command: >> >> qemu-system-x86_64 [...] >> -device vfio-pci,host=XX:XX.X >> >> The parameter of dma_entry_limit needs to be adjusted. For example, a >> 16GB guest needs to adjust the parameter like >> vfio_iommu_type1.dma_entry_limit=4194304. > > But here you note the biggest real issue I see (not related to > RAMDiscardManager, but that we have to prepare for conversion of each > possible private page to shared and back): we need a single IOMMU > mapping for each 4 KiB page. > > Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB. > Does it even scale then? The entry limitation needs to be increased as the guest memory size increases. For this issue, are you concerned that having too many entries might bring some performance issue? Maybe we could introduce some PV mechanism to coordinate with guest to convert memory only in 2M granularity. This may help mitigate the problem. > > > There is the alternative of having in-place private/shared conversion > when we also let guest_memfd manage some shared memory. It has plenty of > downsides, but for the problem at hand it would mean that we don't > discard on shared/private conversion.> > But whenever we want to convert memory shared->private we would > similarly have to from IOMMU page tables via VFIO. (the in-place > conversion will only be allowed if any additional references on a page > are gone -- when it is inaccessible by userspace/kernel). I'm not clear about this in-place private/shared conversion. Can you elaborate a little bit? It seems this alternative changes private and shared management in current guest_memfd? > > Again, if IOMMU page tables would be managed by KVM in the kernel > without user space intervention/vfio this would work with device > assignment just fine. But I guess it will take a while until we actually > have that option. >
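To put numbers on the 2M-granularity mitigation suggested above, here is a small illustration (not part of the series) of how many type1 DMA entries a fully shared 16GB guest would need at each conversion granularity:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t guest_mem = 16ULL << 30;      /* 16 GiB of guest memory */
    const uint64_t gran[]    = { 4096, 2 << 20 };/* 4 KiB vs. 2 MiB chunks */

    for (int i = 0; i < 2; i++) {
        printf("granularity %8llu bytes -> up to %llu DMA entries\n",
               (unsigned long long)gran[i],
               (unsigned long long)(guest_mem / gran[i]));
    }
    return 0;
}
```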
On 26.07.24 07:02, Tian, Kevin wrote: >> From: David Hildenbrand <david@redhat.com> >> Sent: Thursday, July 25, 2024 10:04 PM >> >>> Open >>> ==== >>> Implementing a RamDiscardManager to notify VFIO of page conversions >>> causes changes in semantics: private memory is treated as discarded (or >>> hot-removed) memory. This isn't aligned with the expectation of current >>> RamDiscardManager users (e.g. VFIO or live migration) who really >>> expect that discarded memory is hot-removed and thus can be skipped >> when >>> the users are processing guest memory. Treating private memory as >>> discarded won't work in future if VFIO or live migration needs to handle >>> private memory. e.g. VFIO may need to map private memory to support >>> Trusted IO and live migration for confidential VMs need to migrate >>> private memory. >> >> "VFIO may need to map private memory to support Trusted IO" >> >> I've been told that the way we handle shared memory won't be the way >> this is going to work with guest_memfd. KVM will coordinate directly >> with VFIO or $whatever and update the IOMMU tables itself right in the >> kernel; the pages are pinned/owned by guest_memfd, so that will just >> work. So I don't consider that currently a concern. guest_memfd private >> memory is not mapped into user page tables and as it currently seems it >> never will be. > > Or could extend MAP_DMA to accept guest_memfd+offset in place of > 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve > the pinned pfn. In theory yes, and I've been thinking of the same for a while. Until people told me that it is unlikely that it will work that way in the future. > > IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs > to manage the mapping of the private memory instead of the use of > guest_memfd. > > e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP) > to check the HPA after the IOMMU walks the existing I/O page tables. > So reasonably VFIO/IOMMUFD could continue to manage those I/O > page tables including both private and shared memory, with a hint to > know where to find the pfn (host page table or guest_memfd). > > But TDX Connect introduces a new I/O page table format (same as secure > EPT) for mapping the private memory and further requires sharing the > secure-EPT between CPU/IOMMU for private. Then it appears to be > a different story. Yes. This seems to be the future and more in-line with in-place/in-kernel conversion as e.g., pKVM wants to have it. If you want to avoid user space altogether when doing shared<->private conversions, then letting user space manage the IOMMUs is not going to work. If we ever have to go down that path (MAP_DMA of guest_memfd), we could have two RAMDiscardManager for a RAM region, just like we have two memory backends: one for shared memory populate/discard (what this series tries to achieve), one for private memory populate/discard. The thing is, that private memory will always have to be special-cased all over the place either way, unfortunately.
On 26.07.24 08:20, Chenyi Qiang wrote: > > > On 7/25/2024 10:04 PM, David Hildenbrand wrote: >>> Open >>> ==== >>> Implementing a RamDiscardManager to notify VFIO of page conversions >>> causes changes in semantics: private memory is treated as discarded (or >>> hot-removed) memory. This isn't aligned with the expectation of current >>> RamDiscardManager users (e.g. VFIO or live migration) who really >>> expect that discarded memory is hot-removed and thus can be skipped when >>> the users are processing guest memory. Treating private memory as >>> discarded won't work in future if VFIO or live migration needs to handle >>> private memory. e.g. VFIO may need to map private memory to support >>> Trusted IO and live migration for confidential VMs need to migrate >>> private memory. >> >> "VFIO may need to map private memory to support Trusted IO" >> >> I've been told that the way we handle shared memory won't be the way >> this is going to work with guest_memfd. KVM will coordinate directly >> with VFIO or $whatever and update the IOMMU tables itself right in the >> kernel; the pages are pinned/owned by guest_memfd, so that will just >> work. So I don't consider that currently a concern. guest_memfd private >> memory is not mapped into user page tables and as it currently seems it >> never will be. > > That's correct. AFAIK, some TEE IO solution like TDX Connect would let > kernel coordinate and update private mapping in IOMMU tables. Here, It > mentions that VFIO "may" need map private memory. I want to make this > more generic to account for potential future TEE IO solutions that may > require such functionality. :) Careful to not over-enginner something that is not even real or close-to-be-real yet, though. :) Nobody really knows who that will look like, besides that we know for Intel that we won't need that. > >> >> Similarly: live migration. We cannot simply migrate that memory the >> traditional way. We even have to track the dirty state differently. >> >> So IMHO, treating both memory as discarded == don't touch it the usual >> way might actually be a feature not a bug ;) > > Do you mean treating the private memory in both VFIO and live migration > as discarded? That is what this patch series does. And as you mentioned, > these RDM users cannot follow the traditional RDM way. Because of this, > we also considered whether we should use RDM or a more generic mechanism > like notifier_list below. Yes, the shared memory is logically discarded. At the same time we *might* get private memory effectively populated. See my reply to Kevin that there might be ways of having shared vs. private populate/discard in the future, if required. Just some idea, though. > >> >>> >>> There are two possible ways to mitigate the semantics changes. >>> 1. Develop a new mechanism to notify the page conversions between >>> private and shared. For example, utilize the notifier_list in QEMU. VFIO >>> registers its own handler and gets notified upon page conversions. This >>> is a clean approach which only touches the notifier workflow. A >>> challenge is that for device hotplug, existing shared memory should be >>> mapped in IOMMU. This will need additional changes. >>> >>> 2. Extend the existing RamDiscardManager interface to manage not only >>> the discarded/populated status of guest memory but also the >>> shared/private status. RamDiscardManager users like VFIO will be >>> notified with one more argument indicating what change is happening and >>> can take action accordingly. It also has challenges e.g. 
QEMU allows >>> only one RamDiscardManager, how to support virtio-mem for confidential >>> VMs would be a problem. And some APIs like .is_populated() exposed by >>> RamDiscardManager are meaningless to shared/private memory. So they may >>> need some adjustments. >> >> Think of all of that in terms of "shared memory is populated, private >> memory is some inaccessible stuff that needs very special way and other >> means for device assignment, live migration, etc.". Then it actually >> quite makes sense to use of RamDiscardManager (AFAIKS :) ). > > Yes, such notification mechanism is what we want. But for the users of > RDM, it would require additional change accordingly. Current users just > skip inaccessible stuff, but in private memory case, it can't be simply > skipped. Maybe renaming RamDiscardManager to RamStateManager is more > accurate then. :) Current users must skip it, yes. How private memory would have to be handled, and who would handle it, is rather unclear. Again, maybe we'd want separate RamDiscardManager for private and shared memory (after all, these are two separate memory backends). Not sure that "RamStateManager" terminology would be reasonable in that approach. > >> >>> >>> Testing >>> ======= >>> This patch series is tested based on the internal TDX KVM/QEMU tree. >>> >>> To facilitate shared device assignment with the NIC, employ the legacy >>> type1 VFIO with the QEMU command: >>> >>> qemu-system-x86_64 [...] >>> -device vfio-pci,host=XX:XX.X >>> >>> The parameter of dma_entry_limit needs to be adjusted. For example, a >>> 16GB guest needs to adjust the parameter like >>> vfio_iommu_type1.dma_entry_limit=4194304. >> >> But here you note the biggest real issue I see (not related to >> RAMDiscardManager, but that we have to prepare for conversion of each >> possible private page to shared and back): we need a single IOMMU >> mapping for each 4 KiB page. >> >> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB. >> Does it even scale then? > > The entry limitation needs to be increased as the guest memory size > increases. For this issue, are you concerned that having too many > entries might bring some performance issue? Maybe we could introduce > some PV mechanism to coordinate with guest to convert memory only in 2M > granularity. This may help mitigate the problem. I've had this talk with Intel, because the 4K granularity is a pain. I was told that ship has sailed ... and we have to cope with random 4K conversions :( The many mappings will likely add both memory and runtime overheads in the kernel. But we only know once we measure. Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB of shared memory :/ > >> >> >> There is the alternative of having in-place private/shared conversion >> when we also let guest_memfd manage some shared memory. It has plenty of >> downsides, but for the problem at hand it would mean that we don't >> discard on shared/private conversion.> >> But whenever we want to convert memory shared->private we would >> similarly have to from IOMMU page tables via VFIO. (the in-place >> conversion will only be allowed if any additional references on a page >> are gone -- when it is inaccessible by userspace/kernel). > > I'm not clear about this in-place private/shared conversion. Can you > elaborate a little bit? It seems this alternative changes private and > shared management in current guest_memfd? 
Yes, there have been discussions about that, also in the context of supporting huge pages while allowing for the guest to still convert individual 4K chunks ... A summary is here [1]. Likely more things will be covered at Linux Plumbers. [1] https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/
On 7/26/2024 3:20 PM, David Hildenbrand wrote: > On 26.07.24 08:20, Chenyi Qiang wrote: >> >> >> On 7/25/2024 10:04 PM, David Hildenbrand wrote: >>>> Open >>>> ==== >>>> Implementing a RamDiscardManager to notify VFIO of page conversions >>>> causes changes in semantics: private memory is treated as discarded (or >>>> hot-removed) memory. This isn't aligned with the expectation of current >>>> RamDiscardManager users (e.g. VFIO or live migration) who really >>>> expect that discarded memory is hot-removed and thus can be skipped >>>> when >>>> the users are processing guest memory. Treating private memory as >>>> discarded won't work in future if VFIO or live migration needs to >>>> handle >>>> private memory. e.g. VFIO may need to map private memory to support >>>> Trusted IO and live migration for confidential VMs need to migrate >>>> private memory. >>> >>> "VFIO may need to map private memory to support Trusted IO" >>> >>> I've been told that the way we handle shared memory won't be the way >>> this is going to work with guest_memfd. KVM will coordinate directly >>> with VFIO or $whatever and update the IOMMU tables itself right in the >>> kernel; the pages are pinned/owned by guest_memfd, so that will just >>> work. So I don't consider that currently a concern. guest_memfd private >>> memory is not mapped into user page tables and as it currently seems it >>> never will be. >> >> That's correct. AFAIK, some TEE IO solution like TDX Connect would let >> kernel coordinate and update private mapping in IOMMU tables. Here, It >> mentions that VFIO "may" need map private memory. I want to make this >> more generic to account for potential future TEE IO solutions that may >> require such functionality. :) > > Careful to not over-enginner something that is not even real or > close-to-be-real yet, though. :) Nobody really knows who that will look > like, besides that we know for Intel that we won't need that. OK, Thanks for the reminder! > >> >>> >>> Similarly: live migration. We cannot simply migrate that memory the >>> traditional way. We even have to track the dirty state differently. >>> >>> So IMHO, treating both memory as discarded == don't touch it the usual >>> way might actually be a feature not a bug ;) >> >> Do you mean treating the private memory in both VFIO and live migration >> as discarded? That is what this patch series does. And as you mentioned, >> these RDM users cannot follow the traditional RDM way. Because of this, >> we also considered whether we should use RDM or a more generic mechanism >> like notifier_list below. > > Yes, the shared memory is logically discarded. At the same time we > *might* get private memory effectively populated. See my reply to Kevin > that there might be ways of having shared vs. private populate/discard > in the future, if required. Just some idea, though. > >> >>> >>>> >>>> There are two possible ways to mitigate the semantics changes. >>>> 1. Develop a new mechanism to notify the page conversions between >>>> private and shared. For example, utilize the notifier_list in QEMU. >>>> VFIO >>>> registers its own handler and gets notified upon page conversions. This >>>> is a clean approach which only touches the notifier workflow. A >>>> challenge is that for device hotplug, existing shared memory should be >>>> mapped in IOMMU. This will need additional changes. >>>> >>>> 2. Extend the existing RamDiscardManager interface to manage not only >>>> the discarded/populated status of guest memory but also the >>>> shared/private status. 
RamDiscardManager users like VFIO will be >>>> notified with one more argument indicating what change is happening and >>>> can take action accordingly. It also has challenges e.g. QEMU allows >>>> only one RamDiscardManager, how to support virtio-mem for confidential >>>> VMs would be a problem. And some APIs like .is_populated() exposed by >>>> RamDiscardManager are meaningless to shared/private memory. So they may >>>> need some adjustments. >>> >>> Think of all of that in terms of "shared memory is populated, private >>> memory is some inaccessible stuff that needs very special way and other >>> means for device assignment, live migration, etc.". Then it actually >>> quite makes sense to use of RamDiscardManager (AFAIKS :) ). >> >> Yes, such notification mechanism is what we want. But for the users of >> RDM, it would require additional change accordingly. Current users just >> skip inaccessible stuff, but in private memory case, it can't be simply >> skipped. Maybe renaming RamDiscardManager to RamStateManager is more >> accurate then. :) > > Current users must skip it, yes. How private memory would have to be > handled, and who would handle it, is rather unclear. > > Again, maybe we'd want separate RamDiscardManager for private and shared > memory (after all, these are two separate memory backends). We also considered distinguishing the populate and discard operation for private and shared memory separately. As in method 2 above, we mentioned to add a new argument to indicate the memory attribute to operate on. They seem to have a similar idea. > > Not sure that "RamStateManager" terminology would be reasonable in that > approach. > >> >>> >>>> >>>> Testing >>>> ======= >>>> This patch series is tested based on the internal TDX KVM/QEMU tree. >>>> >>>> To facilitate shared device assignment with the NIC, employ the legacy >>>> type1 VFIO with the QEMU command: >>>> >>>> qemu-system-x86_64 [...] >>>> -device vfio-pci,host=XX:XX.X >>>> >>>> The parameter of dma_entry_limit needs to be adjusted. For example, a >>>> 16GB guest needs to adjust the parameter like >>>> vfio_iommu_type1.dma_entry_limit=4194304. >>> >>> But here you note the biggest real issue I see (not related to >>> RAMDiscardManager, but that we have to prepare for conversion of each >>> possible private page to shared and back): we need a single IOMMU >>> mapping for each 4 KiB page. >>> >>> Doesn't that mean that we limit shared memory to 4194304*4096 == 16 GiB. >>> Does it even scale then? >> >> The entry limitation needs to be increased as the guest memory size >> increases. For this issue, are you concerned that having too many >> entries might bring some performance issue? Maybe we could introduce >> some PV mechanism to coordinate with guest to convert memory only in 2M >> granularity. This may help mitigate the problem. > > I've had this talk with Intel, because the 4K granularity is a pain. I > was told that ship has sailed ... and we have to cope with random 4K > conversions :( > > The many mappings will likely add both memory and runtime overheads in > the kernel. But we only know once we measure. In the normal case, the main runtime overhead comes from private<->shared flip in SWIOTLB, which defaults to 6% of memory with a maximum of 1Gbyte. I think this overhead is acceptable. In non-default case, e.g. dynamic allocated DMA buffer, the runtime overhead will increase. As for the memory overheads, It is indeed unavoidable. 
Will these performance issues be a deal breaker for enabling shared device assignment in this way? > > Key point is that even 4194304 "only" allows for 16 GiB. Imagine 1 TiB > of shared memory :/ > >> >>> >>> >>> There is the alternative of having in-place private/shared conversion >>> when we also let guest_memfd manage some shared memory. It has plenty of >>> downsides, but for the problem at hand it would mean that we don't >>> discard on shared/private conversion.> >>> But whenever we want to convert memory shared->private we would >>> similarly have to from IOMMU page tables via VFIO. (the in-place >>> conversion will only be allowed if any additional references on a page >>> are gone -- when it is inaccessible by userspace/kernel). >> >> I'm not clear about this in-place private/shared conversion. Can you >> elaborate a little bit? It seems this alternative changes private and >> shared management in current guest_memfd? > > Yes, there have been discussions about that, also in the context of > supporting huge pages while allowing for the guest to still convert > individual 4K chunks ... > > A summary is here [1]. Likely more things will be covered at Linux > Plumbers. > > > [1] > https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/ > Thanks for your sharing.
On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote: > On 26.07.24 07:02, Tian, Kevin wrote: > > > From: David Hildenbrand <david@redhat.com> > > > Sent: Thursday, July 25, 2024 10:04 PM > > > > > > > Open > > > > ==== > > > > Implementing a RamDiscardManager to notify VFIO of page conversions > > > > causes changes in semantics: private memory is treated as discarded (or > > > > hot-removed) memory. This isn't aligned with the expectation of current > > > > RamDiscardManager users (e.g. VFIO or live migration) who really > > > > expect that discarded memory is hot-removed and thus can be skipped > > > when > > > > the users are processing guest memory. Treating private memory as > > > > discarded won't work in future if VFIO or live migration needs to handle > > > > private memory. e.g. VFIO may need to map private memory to support > > > > Trusted IO and live migration for confidential VMs need to migrate > > > > private memory. > > > > > > "VFIO may need to map private memory to support Trusted IO" > > > > > > I've been told that the way we handle shared memory won't be the way > > > this is going to work with guest_memfd. KVM will coordinate directly > > > with VFIO or $whatever and update the IOMMU tables itself right in the > > > kernel; the pages are pinned/owned by guest_memfd, so that will just > > > work. So I don't consider that currently a concern. guest_memfd private > > > memory is not mapped into user page tables and as it currently seems it > > > never will be. > > > > Or could extend MAP_DMA to accept guest_memfd+offset in place of With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO owned private MMIO. These buffers cannot be found by user page table anymore. I'm wondering it would be messy to have specific PFN finding methods for each FD type. Is it possible we have a unified way for buffer sharing and PFN finding, is dma-buf a candidate? > > 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve > > the pinned pfn. > > In theory yes, and I've been thinking of the same for a while. Until people > told me that it is unlikely that it will work that way in the future. Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO may still allow userspace to manage the IOMMU mapping for private. I'm not sure how they map private memory for IOMMU without touching gmemfd. Thanks, Yilun > > > > > IMHO it's more the TIO arch deciding whether VFIO/IOMMUFD needs > > to manage the mapping of the private memory instead of the use of > > guest_memfd. > > > > e.g. SEV-TIO, iiuc, introduces a new-layer page ownership tracker (RMP) > > to check the HPA after the IOMMU walks the existing I/O page tables. > > So reasonably VFIO/IOMMUFD could continue to manage those I/O > > page tables including both private and shared memory, with a hint to > > know where to find the pfn (host page table or guest_memfd). > > > > But TDX Connect introduces a new I/O page table format (same as secure > > EPT) for mapping the private memory and further requires sharing the > > secure-EPT between CPU/IOMMU for private. Then it appears to be > > a different story. > > Yes. This seems to be the future and more in-line with in-place/in-kernel > conversion as e.g., pKVM wants to have it. If you want to avoid user space > altogether when doing shared<->private conversions, then letting user space > manage the IOMMUs is not going to work. 
> > > If we ever have to go down that path (MAP_DMA of guest_memfd), we could have > two RAMDiscardManager for a RAM region, just like we have two memory > backends: one for shared memory populate/discard (what this series tries to > achieve), one for private memory populate/discard. > > The thing is, that private memory will always have to be special-cased all > over the place either way, unfortunately. > > -- > Cheers, > > David / dhildenb > >
On 31.07.24 09:12, Xu Yilun wrote: > On Fri, Jul 26, 2024 at 09:08:51AM +0200, David Hildenbrand wrote: >> On 26.07.24 07:02, Tian, Kevin wrote: >>>> From: David Hildenbrand <david@redhat.com> >>>> Sent: Thursday, July 25, 2024 10:04 PM >>>> >>>>> Open >>>>> ==== >>>>> Implementing a RamDiscardManager to notify VFIO of page conversions >>>>> causes changes in semantics: private memory is treated as discarded (or >>>>> hot-removed) memory. This isn't aligned with the expectation of current >>>>> RamDiscardManager users (e.g. VFIO or live migration) who really >>>>> expect that discarded memory is hot-removed and thus can be skipped >>>> when >>>>> the users are processing guest memory. Treating private memory as >>>>> discarded won't work in future if VFIO or live migration needs to handle >>>>> private memory. e.g. VFIO may need to map private memory to support >>>>> Trusted IO and live migration for confidential VMs need to migrate >>>>> private memory. >>>> >>>> "VFIO may need to map private memory to support Trusted IO" >>>> >>>> I've been told that the way we handle shared memory won't be the way >>>> this is going to work with guest_memfd. KVM will coordinate directly >>>> with VFIO or $whatever and update the IOMMU tables itself right in the >>>> kernel; the pages are pinned/owned by guest_memfd, so that will just >>>> work. So I don't consider that currently a concern. guest_memfd private >>>> memory is not mapped into user page tables and as it currently seems it >>>> never will be. >>> >>> Or could extend MAP_DMA to accept guest_memfd+offset in place of > > With TIO, I can imagine several buffer sharing requirements: KVM maps VFIO > owned private MMIO, IOMMU maps gmem owned private memory, IOMMU maps VFIO > owned private MMIO. These buffers cannot be found by user page table > anymore. I'm wondering it would be messy to have specific PFN finding > methods for each FD type. Is it possible we have a unified way for > buffer sharing and PFN finding, is dma-buf a candidate? No expert on that, so I'm afraid I can't help. > >>> 'vaddr' and have VFIO/IOMMUFD call guest_memfd helpers to retrieve >>> the pinned pfn. >> >> In theory yes, and I've been thinking of the same for a while. Until people >> told me that it is unlikely that it will work that way in the future. > > Could you help specify why it won't work? As Kevin mentioned below, SEV-TIO > may still allow userspace to manage the IOMMU mapping for private. I'm > not sure how they map private memory for IOMMU without touching gmemfd. I raised that question in [1]: "How would the device be able to grab/access "private memory", if not via the user page tables?" Jason summarized it as "The approaches I'm aware of require the secure world to own the IOMMU and generate the IOMMU page tables. So we will not use a GUP approach with VFIO today as the kernel will not have any reason to generate a page table in the first place. Instead we will say "this PCI device translates through the secure world" and walk away." I think for some cVM approaches it really cannot work without letting KVM/secure world handle the IOMMU (e.g., sharing of page tables between IOMMU and KVM). For your use case it *might* work, but I am wondering if this is how it should be done, and if there are better alternatives. [1] https://lkml.org/lkml/2024/6/20/920
Sorry for the late reply! >> Current users must skip it, yes. How private memory would have to be >> handled, and who would handle it, is rather unclear. >> >> Again, maybe we'd want separate RamDiscardManager for private and shared >> memory (after all, these are two separate memory backends). > > We also considered distinguishing the populate and discard operation for > private and shared memory separately. As in method 2 above, we mentioned > to add a new argument to indicate the memory attribute to operate on. > They seem to have a similar idea. Yes. Likely it's just some implementation detail. I think the following states would be possible: * Discarded in shared + discarded in private (not populated) * Discarded in shared + populated in private (private populated) * Populated in shared + discarded in private (shared populated) One could map these to states discarded/private/shared indeed. [...] >> I've had this talk with Intel, because the 4K granularity is a pain. I >> was told that ship has sailed ... and we have to cope with random 4K >> conversions :( >> >> The many mappings will likely add both memory and runtime overheads in >> the kernel. But we only know once we measure. > > In the normal case, the main runtime overhead comes from > private<->shared flip in SWIOTLB, which defaults to 6% of memory with a > maximum of 1Gbyte. I think this overhead is acceptable. In non-default > case, e.g. dynamic allocated DMA buffer, the runtime overhead will > increase. As for the memory overheads, It is indeed unavoidable. > > Will these performance issues be a deal breaker for enabling shared > device assignment in this way? I see the most problematic part being the dma_entry_limit and all of these individual MAP/UNMAP calls on 4KiB granularity. dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the possible maximum should be 4294967296, and the default is 65535. So we should be able to have a maximum of 16 TiB shared memory all in 4KiB chunks. sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying a per-page overhead of ~2.4%, excluding the actual rbtree. Tree lookup/modifications with that many nodes might also get a bit slower, but likely still tolerable as you note. Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable for your use case?
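As a sanity check of the estimate above, the following sketch reproduces the arithmetic; the 96-byte figure for struct vfio_dma is the rough upper bound assumed in this mail, not a number taken from any specific kernel build:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const double entry_size = 96.0;    /* assumed sizeof(struct vfio_dma)    */
    const double page_size  = 4096.0;  /* one vfio_dma per 4 KiB shared page */
    const uint64_t max_entries = 4294967295ULL;  /* dma_entry_limit is an
                                                    unsigned int module param */

    printf("per-page metadata overhead: %.2f%% (rbtree nodes excluded)\n",
           100.0 * entry_size / page_size);
    printf("max shared memory at that limit: %llu GiB\n",
           (unsigned long long)(max_entries * 4096 >> 30));
    return 0;
}
```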
Hi David, On 7/26/2024 3:20 PM, David Hildenbrand wrote: > Yes, there have been discussions about that, also in the context of > supporting huge pages while allowing for the guest to still convert > individual 4K chunks ... > > A summary is here [1]. Likely more things will be covered at Linux > Plumbers. > > > [1] > https://lore.kernel.org/kvm/20240712232937.2861788-1-ackerleytng@google.com/ This is a very valuable link. Thanks a lot for sharing. Aaron and I are particular interesting to the huge page (both hugetlb and THP) support for gmem_fd (per our testing, at least 10%+ performance gain with it with TDX for many workloads). We will monitor the linux-mm for such kind of discussion. I am wondering whether it's possible that you can involve Aaron and me if the discussion is still open but not on the mailing list (I suppose you will be included always for such kind of discussion). Thanks. Regards Yin, Fengwei
On 7/31/2024 7:18 PM, David Hildenbrand wrote: > Sorry for the late reply! > >>> Current users must skip it, yes. How private memory would have to be >>> handled, and who would handle it, is rather unclear. >>> >>> Again, maybe we'd want separate RamDiscardManager for private and shared >>> memory (after all, these are two separate memory backends). >> >> We also considered distinguishing the populate and discard operation for >> private and shared memory separately. As in method 2 above, we mentioned >> to add a new argument to indicate the memory attribute to operate on. >> They seem to have a similar idea. > > Yes. Likely it's just some implementation detail. I think the following > states would be possible: > > * Discarded in shared + discarded in private (not populated) > * Discarded in shared + populated in private (private populated) > * Populated in shared + discarded in private (shared populated) > > One could map these to states discarded/private/shared indeed. Make sense. We can follow this if the mechanism of RamDiscardManager is acceptable and no other concerns. > > [...] > >>> I've had this talk with Intel, because the 4K granularity is a pain. I >>> was told that ship has sailed ... and we have to cope with random 4K >>> conversions :( >>> >>> The many mappings will likely add both memory and runtime overheads in >>> the kernel. But we only know once we measure. >> >> In the normal case, the main runtime overhead comes from >> private<->shared flip in SWIOTLB, which defaults to 6% of memory with a >> maximum of 1Gbyte. I think this overhead is acceptable. In non-default >> case, e.g. dynamic allocated DMA buffer, the runtime overhead will >> increase. As for the memory overheads, It is indeed unavoidable. >> >> Will these performance issues be a deal breaker for enabling shared >> device assignment in this way? > > I see the most problematic part being the dma_entry_limit and all of > these individual MAP/UNMAP calls on 4KiB granularity. > > dma_entry_limit is "unsigned int", and defaults to U16_MAX. So the > possible maximum should be 4294967296, and the default is 65535. > > So we should be able to have a maximum of 16 TiB shared memory all in > 4KiB chunks. > > sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying > a per-page overhead of ~2.4%, excluding the actual rbtree. > > Tree lookup/modifications with that many nodes might also get a bit > slower, but likely still tolerable as you note. > > Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable > for your use case? Yes. We can't guarantee the behavior of guest, so the overhead would be uncertain and unavoidable. >
Hi Paolo, Hope to draw your attention. As TEE I/O would depend on shared device assignment and we introduce this RDM solution in QEMU. Now, Observe the in-place private/shared conversion option mentioned by David, do you think we should continue to add pass-thru support for this in-qemu page conversion method? Or wait for the option discussion to see if it will change to in-kernel conversion. Thanks Chenyi On 7/25/2024 3:21 PM, Chenyi Qiang wrote: > Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated > discard") effectively disables device assignment with guest_memfd. > guest_memfd is required for confidential guests, so device assignment to > confidential guests is disabled. A supporting assumption for disabling > device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO > etc...) solves the confidential-guest device-assignment problem [1]. > That turns out not to be the case because TEE I/O depends on being able > to operate devices against "shared"/untrusted memory for device > initialization and error recovery scenarios. > > This series utilizes an existing framework named RamDiscardManager to > notify VFIO of page conversions. However, there's still one concern > related to the semantics of RamDiscardManager which is used to manage > the memory plug/unplug state. This is a little different from the memory > shared/private in our requirement. See the "Open" section below for more > details. > > Background > ========== > Confidential VMs have two classes of memory: shared and private memory. > Shared memory is accessible from the host/VMM while private memory is > not. Confidential VMs can decide which memory is shared/private and > convert memory between shared/private at runtime. > > "guest_memfd" is a new kind of fd whose primary goal is to serve guest > private memory. The key differences between guest_memfd and normal memfd > are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and > cannot be mapped, read or written by userspace. > > In QEMU's implementation, shared memory is allocated with normal methods > (e.g. mmap or fallocate) while private memory is allocated from > guest_memfd. When a VM performs memory conversions, QEMU frees pages via > madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and > allocates new pages from the other side. > > Problem > ======= > Device assignment in QEMU is implemented via VFIO system. In the normal > VM, VM memory is pinned at the beginning of time by VFIO. In the > confidential VM, the VM can convert memory and when that happens > nothing currently tells VFIO that its mappings are stale. This means > that page conversion leaks memory and leaves stale IOMMU mappings. For > example, sequence like the following can result in stale IOMMU mappings: > > 1. allocate shared page > 2. convert page shared->private > 3. discard shared page > 4. convert page private->shared > 5. allocate shared page > 6. issue DMA operations against that shared page > > After step 3, VFIO is still pinning the page. However, DMA operations in > step 6 will hit the old mapping that was allocated in step 1, which > causes the device to access the invalid data. > > Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require > uncoordinated discard") has blocked the device assignment with > guest_memfd to avoid this problem. > > Solution > ======== > The key to enable shared device assignment is to solve the stale IOMMU > mappings problem. 
> > Given the constraints and assumptions here is a solution that satisfied > the use cases. RamDiscardManager, an existing interface currently > utilized by virtio-mem, offers a means to modify IOMMU mappings in > accordance with VM page assignment. Page conversion is similar to > hot-removing a page in one mode and adding it back in the other. > > This series implements a RamDiscardManager for confidential VMs and > utilizes its infrastructure to notify VFIO of page conversions. > > Another possible attempt [2] was to not discard shared pages in step 3 > above. This was an incomplete band-aid because guests would consume > twice the memory since shared pages wouldn't be freed even after they > were converted to private. > > Open > ==== > Implementing a RamDiscardManager to notify VFIO of page conversions > causes changes in semantics: private memory is treated as discarded (or > hot-removed) memory. This isn't aligned with the expectation of current > RamDiscardManager users (e.g. VFIO or live migration) who really > expect that discarded memory is hot-removed and thus can be skipped when > the users are processing guest memory. Treating private memory as > discarded won't work in future if VFIO or live migration needs to handle > private memory. e.g. VFIO may need to map private memory to support > Trusted IO and live migration for confidential VMs need to migrate > private memory. > > There are two possible ways to mitigate the semantics changes. > 1. Develop a new mechanism to notify the page conversions between > private and shared. For example, utilize the notifier_list in QEMU. VFIO > registers its own handler and gets notified upon page conversions. This > is a clean approach which only touches the notifier workflow. A > challenge is that for device hotplug, existing shared memory should be > mapped in IOMMU. This will need additional changes. > > 2. Extend the existing RamDiscardManager interface to manage not only > the discarded/populated status of guest memory but also the > shared/private status. RamDiscardManager users like VFIO will be > notified with one more argument indicating what change is happening and > can take action accordingly. It also has challenges e.g. QEMU allows > only one RamDiscardManager, how to support virtio-mem for confidential > VMs would be a problem. And some APIs like .is_populated() exposed by > RamDiscardManager are meaningless to shared/private memory. So they may > need some adjustments. > > Testing > ======= > This patch series is tested based on the internal TDX KVM/QEMU tree. > > To facilitate shared device assignment with the NIC, employ the legacy > type1 VFIO with the QEMU command: > > qemu-system-x86_64 [...] > -device vfio-pci,host=XX:XX.X > > The parameter of dma_entry_limit needs to be adjusted. For example, a > 16GB guest needs to adjust the parameter like > vfio_iommu_type1.dma_entry_limit=4194304. > > If use the iommufd-backed VFIO with the qemu command: > > qemu-system-x86_64 [...] > -object iommufd,id=iommufd0 \ > -device vfio-pci,host=XX:XX.X,iommufd=iommufd0 > > No additional adjustment required. > > Following the bootup of the TD guest, the guest's IP address becomes > visible, and iperf is able to successfully send and receive data. 
> > Related link > ============ > [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/ > [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/ > > Chenyi Qiang (6): > guest_memfd: Introduce an object to manage the guest-memfd with > RamDiscardManager > guest_memfd: Introduce a helper to notify the shared/private state > change > KVM: Notify the state change via RamDiscardManager helper during > shared/private conversion > memory: Register the RamDiscardManager instance upon guest_memfd > creation > guest-memfd: Default to discarded (private) in guest_memfd_manager > RAMBlock: make guest_memfd require coordinate discard > > accel/kvm/kvm-all.c | 7 + > include/sysemu/guest-memfd-manager.h | 49 +++ > system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++ > system/meson.build | 1 + > system/physmem.c | 11 +- > 5 files changed, 492 insertions(+), 1 deletion(-) > create mode 100644 include/sysemu/guest-memfd-manager.h > create mode 100644 system/guest-memfd-manager.c > > > base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
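As a rough illustration of the approach described in the quoted cover letter (page conversions forwarded to RamDiscardManager listeners so that VFIO can unmap or re-map the affected range), here is a minimal sketch. GuestMemfdManager and its listener list are hypothetical names used only for this illustration; just the RamDiscardListener callback shape follows QEMU's existing interface.

```c
#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical manager object, standing in for the series' guest_memfd manager. */
typedef struct GuestMemfdManager {
    QLIST_HEAD(, RamDiscardListener) rdl_list;
} GuestMemfdManager;

/*
 * Forward a page conversion to registered listeners: shared->private behaves
 * like a discard (VFIO unpins/unmaps), private->shared like a population
 * (VFIO pins/maps).
 */
static int guest_memfd_notify_conversion(GuestMemfdManager *gmm,
                                         MemoryRegionSection *section,
                                         bool to_private)
{
    RamDiscardListener *rdl;

    QLIST_FOREACH(rdl, &gmm->rdl_list, next) {
        if (to_private) {
            rdl->notify_discard(rdl, section);
        } else {
            int ret = rdl->notify_populate(rdl, section);
            if (ret) {
                return ret;
            }
        }
    }
    return 0;
}
```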
Hi Paolo, Kindly ping for this thread. The in-place page conversion is discussed at Linux Plumbers. Does it give some direction for shared device assignment enabling work? Thanks Chenyi On 8/16/2024 11:02 AM, Chenyi Qiang wrote: > Hi Paolo, > > Hope to draw your attention. As TEE I/O would depend on shared device > assignment and we introduce this RDM solution in QEMU. Now, Observe the > in-place private/shared conversion option mentioned by David, do you > think we should continue to add pass-thru support for this in-qemu page > conversion method? Or wait for the option discussion to see if it will > change to in-kernel conversion. > > Thanks > Chenyi > > On 7/25/2024 3:21 PM, Chenyi Qiang wrote: >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated >> discard") effectively disables device assignment with guest_memfd. >> guest_memfd is required for confidential guests, so device assignment to >> confidential guests is disabled. A supporting assumption for disabling >> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO >> etc...) solves the confidential-guest device-assignment problem [1]. >> That turns out not to be the case because TEE I/O depends on being able >> to operate devices against "shared"/untrusted memory for device >> initialization and error recovery scenarios. >> >> This series utilizes an existing framework named RamDiscardManager to >> notify VFIO of page conversions. However, there's still one concern >> related to the semantics of RamDiscardManager which is used to manage >> the memory plug/unplug state. This is a little different from the memory >> shared/private in our requirement. See the "Open" section below for more >> details. >> >> Background >> ========== >> Confidential VMs have two classes of memory: shared and private memory. >> Shared memory is accessible from the host/VMM while private memory is >> not. Confidential VMs can decide which memory is shared/private and >> convert memory between shared/private at runtime. >> >> "guest_memfd" is a new kind of fd whose primary goal is to serve guest >> private memory. The key differences between guest_memfd and normal memfd >> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and >> cannot be mapped, read or written by userspace. >> >> In QEMU's implementation, shared memory is allocated with normal methods >> (e.g. mmap or fallocate) while private memory is allocated from >> guest_memfd. When a VM performs memory conversions, QEMU frees pages via >> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and >> allocates new pages from the other side. >> >> Problem >> ======= >> Device assignment in QEMU is implemented via VFIO system. In the normal >> VM, VM memory is pinned at the beginning of time by VFIO. In the >> confidential VM, the VM can convert memory and when that happens >> nothing currently tells VFIO that its mappings are stale. This means >> that page conversion leaks memory and leaves stale IOMMU mappings. For >> example, sequence like the following can result in stale IOMMU mappings: >> >> 1. allocate shared page >> 2. convert page shared->private >> 3. discard shared page >> 4. convert page private->shared >> 5. allocate shared page >> 6. issue DMA operations against that shared page >> >> After step 3, VFIO is still pinning the page. However, DMA operations in >> step 6 will hit the old mapping that was allocated in step 1, which >> causes the device to access the invalid data. 
>> >> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require >> uncoordinated discard") has blocked the device assignment with >> guest_memfd to avoid this problem. >> >> Solution >> ======== >> The key to enable shared device assignment is to solve the stale IOMMU >> mappings problem. >> >> Given the constraints and assumptions here is a solution that satisfied >> the use cases. RamDiscardManager, an existing interface currently >> utilized by virtio-mem, offers a means to modify IOMMU mappings in >> accordance with VM page assignment. Page conversion is similar to >> hot-removing a page in one mode and adding it back in the other. >> >> This series implements a RamDiscardManager for confidential VMs and >> utilizes its infrastructure to notify VFIO of page conversions. >> >> Another possible attempt [2] was to not discard shared pages in step 3 >> above. This was an incomplete band-aid because guests would consume >> twice the memory since shared pages wouldn't be freed even after they >> were converted to private. >> >> Open >> ==== >> Implementing a RamDiscardManager to notify VFIO of page conversions >> causes changes in semantics: private memory is treated as discarded (or >> hot-removed) memory. This isn't aligned with the expectation of current >> RamDiscardManager users (e.g. VFIO or live migration) who really >> expect that discarded memory is hot-removed and thus can be skipped when >> the users are processing guest memory. Treating private memory as >> discarded won't work in future if VFIO or live migration needs to handle >> private memory. e.g. VFIO may need to map private memory to support >> Trusted IO and live migration for confidential VMs need to migrate >> private memory. >> >> There are two possible ways to mitigate the semantics changes. >> 1. Develop a new mechanism to notify the page conversions between >> private and shared. For example, utilize the notifier_list in QEMU. VFIO >> registers its own handler and gets notified upon page conversions. This >> is a clean approach which only touches the notifier workflow. A >> challenge is that for device hotplug, existing shared memory should be >> mapped in IOMMU. This will need additional changes. >> >> 2. Extend the existing RamDiscardManager interface to manage not only >> the discarded/populated status of guest memory but also the >> shared/private status. RamDiscardManager users like VFIO will be >> notified with one more argument indicating what change is happening and >> can take action accordingly. It also has challenges e.g. QEMU allows >> only one RamDiscardManager, how to support virtio-mem for confidential >> VMs would be a problem. And some APIs like .is_populated() exposed by >> RamDiscardManager are meaningless to shared/private memory. So they may >> need some adjustments. >> >> Testing >> ======= >> This patch series is tested based on the internal TDX KVM/QEMU tree. >> >> To facilitate shared device assignment with the NIC, employ the legacy >> type1 VFIO with the QEMU command: >> >> qemu-system-x86_64 [...] >> -device vfio-pci,host=XX:XX.X >> >> The parameter of dma_entry_limit needs to be adjusted. For example, a >> 16GB guest needs to adjust the parameter like >> vfio_iommu_type1.dma_entry_limit=4194304. >> >> If use the iommufd-backed VFIO with the qemu command: >> >> qemu-system-x86_64 [...] >> -object iommufd,id=iommufd0 \ >> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0 >> >> No additional adjustment required. 
>> >> Following the bootup of the TD guest, the guest's IP address becomes >> visible, and iperf is able to successfully send and receive data. >> >> Related link >> ============ >> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/ >> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/ >> >> Chenyi Qiang (6): >> guest_memfd: Introduce an object to manage the guest-memfd with >> RamDiscardManager >> guest_memfd: Introduce a helper to notify the shared/private state >> change >> KVM: Notify the state change via RamDiscardManager helper during >> shared/private conversion >> memory: Register the RamDiscardManager instance upon guest_memfd >> creation >> guest-memfd: Default to discarded (private) in guest_memfd_manager >> RAMBlock: make guest_memfd require coordinate discard >> >> accel/kvm/kvm-all.c | 7 + >> include/sysemu/guest-memfd-manager.h | 49 +++ >> system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++ >> system/meson.build | 1 + >> system/physmem.c | 11 +- >> 5 files changed, 492 insertions(+), 1 deletion(-) >> create mode 100644 include/sysemu/guest-memfd-manager.h >> create mode 100644 system/guest-memfd-manager.c >> >> >> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819
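To make the notification flow described in the cover letter concrete, here is a minimal sketch of how a RamDiscardManager consumer such as VFIO hooks into the same listener interface that virtio-mem already uses in QEMU. The vfio_notify_* callbacks are placeholders for illustration, not the functions added by this series, and error handling is omitted.

/* Sketch only: registering a RamDiscardListener so that "populate"
 * (shared) and "discard" (private) transitions reach the consumer.
 * The callback bodies are placeholders. */
#include "qemu/osdep.h"
#include "exec/memory.h"

static int vfio_notify_populate(RamDiscardListener *rdl,
                                MemoryRegionSection *section)
{
    /* Range became "populated" (shared): VFIO would DMA-map + pin it here. */
    return 0;
}

static void vfio_notify_discard(RamDiscardListener *rdl,
                                MemoryRegionSection *section)
{
    /* Range became "discarded" (private): VFIO would DMA-unmap + unpin it. */
}

static void register_conversion_listener(MemoryRegionSection *section)
{
    RamDiscardManager *rdm =
        memory_region_get_ram_discard_manager(section->mr);
    RamDiscardListener *rdl = g_new0(RamDiscardListener, 1);

    ram_discard_listener_init(rdl, vfio_notify_populate,
                              vfio_notify_discard, true);
    ram_discard_manager_register_listener(rdm, rdl, section);
}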
On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote: > Hi Paolo, > > Kindly ping for this thread. The in-place page conversion was discussed > at Linux Plumbers. Does it give some direction for the shared device > assignment enabling work? > Hi everybody. Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to provide AI acceleration within TEE CVMs. We require passing through the GPU via VFIO stubbing, which means that we are impacted by the absence of an API to inform VFIO about page conversions. The CSPs have enough kernel engineers who handle this process in their own host kernels, but we have several enterprise customers who are eager to begin using this solution in the upstream. AMD has successfully ported enough of the SEV-SNP support into 6.11 and our initial testing shows successful operation, but only by disabling discard via these two QEMU patches: - https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8 - https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095 This "workaround" is a bit of a hack, as it effectively requires more than double the amount of host memory that is allocated to the guest CVM. The proposal here appears to be a promising workaround; are there other solutions that are recommended for this use case? This configuration is in GA right now and NVIDIA is committed to support and test this bounce-buffer mailbox solution for many years into the future, so we're highly invested in seeing a converged solution in the upstream. Thanks, Rob > Thanks > Chenyi > > On 8/16/2024 11:02 AM, Chenyi Qiang wrote: > > Hi Paolo, > > > > Hope to draw your attention. TEE I/O would depend on shared device > > assignment, so we introduced this RDM solution in QEMU. Now, observing the > > in-place private/shared conversion option mentioned by David, do you > > think we should continue to add pass-through support for this in-QEMU page > > conversion method? Or wait for the option discussion to see if it will > > change to in-kernel conversion. > > > > Thanks > > Chenyi > > > > On 7/25/2024 3:21 PM, Chenyi Qiang wrote: > >> Commit 852f0048f3 ("RAMBlock: make guest_memfd require uncoordinated > >> discard") effectively disables device assignment with guest_memfd. > >> guest_memfd is required for confidential guests, so device assignment to > >> confidential guests is disabled. A supporting assumption for disabling > >> device-assignment was that TEE I/O (SEV-TIO, TDX Connect, COVE-IO > >> etc...) solves the confidential-guest device-assignment problem [1]. > >> That turns out not to be the case because TEE I/O depends on being able > >> to operate devices against "shared"/untrusted memory for device > >> initialization and error recovery scenarios. > >> > >> This series utilizes an existing framework named RamDiscardManager to > >> notify VFIO of page conversions. However, there's still one concern > >> related to the semantics of RamDiscardManager, which is used to manage > >> the memory plug/unplug state. This is a little different from the > >> shared/private memory state in our requirement. See the "Open" section below for more > >> details. > >> > >> Background > >> ========== > >> Confidential VMs have two classes of memory: shared and private memory. > >> Shared memory is accessible from the host/VMM while private memory is > >> not. Confidential VMs can decide which memory is shared/private and > >> convert memory between shared/private at runtime.
> >> > >> "guest_memfd" is a new kind of fd whose primary goal is to serve guest > >> private memory. The key differences between guest_memfd and normal memfd > >> are that guest_memfd is spawned by a KVM ioctl, bound to its owner VM and > >> cannot be mapped, read or written by userspace. > >> > >> In QEMU's implementation, shared memory is allocated with normal methods > >> (e.g. mmap or fallocate) while private memory is allocated from > >> guest_memfd. When a VM performs memory conversions, QEMU frees pages via > >> madvise() or via PUNCH_HOLE on memfd or guest_memfd from one side and > >> allocates new pages from the other side. > >> > >> Problem > >> ======= > >> Device assignment in QEMU is implemented via VFIO system. In the normal > >> VM, VM memory is pinned at the beginning of time by VFIO. In the > >> confidential VM, the VM can convert memory and when that happens > >> nothing currently tells VFIO that its mappings are stale. This means > >> that page conversion leaks memory and leaves stale IOMMU mappings. For > >> example, sequence like the following can result in stale IOMMU mappings: > >> > >> 1. allocate shared page > >> 2. convert page shared->private > >> 3. discard shared page > >> 4. convert page private->shared > >> 5. allocate shared page > >> 6. issue DMA operations against that shared page > >> > >> After step 3, VFIO is still pinning the page. However, DMA operations in > >> step 6 will hit the old mapping that was allocated in step 1, which > >> causes the device to access the invalid data. > >> > >> Currently, the commit 852f0048f3 ("RAMBlock: make guest_memfd require > >> uncoordinated discard") has blocked the device assignment with > >> guest_memfd to avoid this problem. > >> > >> Solution > >> ======== > >> The key to enable shared device assignment is to solve the stale IOMMU > >> mappings problem. > >> > >> Given the constraints and assumptions here is a solution that satisfied > >> the use cases. RamDiscardManager, an existing interface currently > >> utilized by virtio-mem, offers a means to modify IOMMU mappings in > >> accordance with VM page assignment. Page conversion is similar to > >> hot-removing a page in one mode and adding it back in the other. > >> > >> This series implements a RamDiscardManager for confidential VMs and > >> utilizes its infrastructure to notify VFIO of page conversions. > >> > >> Another possible attempt [2] was to not discard shared pages in step 3 > >> above. This was an incomplete band-aid because guests would consume > >> twice the memory since shared pages wouldn't be freed even after they > >> were converted to private. > >> > >> Open > >> ==== > >> Implementing a RamDiscardManager to notify VFIO of page conversions > >> causes changes in semantics: private memory is treated as discarded (or > >> hot-removed) memory. This isn't aligned with the expectation of current > >> RamDiscardManager users (e.g. VFIO or live migration) who really > >> expect that discarded memory is hot-removed and thus can be skipped when > >> the users are processing guest memory. Treating private memory as > >> discarded won't work in future if VFIO or live migration needs to handle > >> private memory. e.g. VFIO may need to map private memory to support > >> Trusted IO and live migration for confidential VMs need to migrate > >> private memory. > >> > >> There are two possible ways to mitigate the semantics changes. > >> 1. Develop a new mechanism to notify the page conversions between > >> private and shared. 
For example, utilize the notifier_list in QEMU. VFIO > >> registers its own handler and gets notified upon page conversions. This > >> is a clean approach which only touches the notifier workflow. A > >> challenge is that for device hotplug, existing shared memory should be > >> mapped in IOMMU. This will need additional changes. > >> > >> 2. Extend the existing RamDiscardManager interface to manage not only > >> the discarded/populated status of guest memory but also the > >> shared/private status. RamDiscardManager users like VFIO will be > >> notified with one more argument indicating what change is happening and > >> can take action accordingly. It also has challenges e.g. QEMU allows > >> only one RamDiscardManager, how to support virtio-mem for confidential > >> VMs would be a problem. And some APIs like .is_populated() exposed by > >> RamDiscardManager are meaningless to shared/private memory. So they may > >> need some adjustments. > >> > >> Testing > >> ======= > >> This patch series is tested based on the internal TDX KVM/QEMU tree. > >> > >> To facilitate shared device assignment with the NIC, employ the legacy > >> type1 VFIO with the QEMU command: > >> > >> qemu-system-x86_64 [...] > >> -device vfio-pci,host=XX:XX.X > >> > >> The parameter of dma_entry_limit needs to be adjusted. For example, a > >> 16GB guest needs to adjust the parameter like > >> vfio_iommu_type1.dma_entry_limit=4194304. > >> > >> If use the iommufd-backed VFIO with the qemu command: > >> > >> qemu-system-x86_64 [...] > >> -object iommufd,id=iommufd0 \ > >> -device vfio-pci,host=XX:XX.X,iommufd=iommufd0 > >> > >> No additional adjustment required. > >> > >> Following the bootup of the TD guest, the guest's IP address becomes > >> visible, and iperf is able to successfully send and receive data. > >> > >> Related link > >> ============ > >> [1] https://lore.kernel.org/all/d6acfbef-96a1-42bc-8866-c12a4de8c57c@redhat.com/ > >> [2] https://lore.kernel.org/all/20240320083945.991426-20-michael.roth@amd.com/ > >> > >> Chenyi Qiang (6): > >> guest_memfd: Introduce an object to manage the guest-memfd with > >> RamDiscardManager > >> guest_memfd: Introduce a helper to notify the shared/private state > >> change > >> KVM: Notify the state change via RamDiscardManager helper during > >> shared/private conversion > >> memory: Register the RamDiscardManager instance upon guest_memfd > >> creation > >> guest-memfd: Default to discarded (private) in guest_memfd_manager > >> RAMBlock: make guest_memfd require coordinate discard > >> > >> accel/kvm/kvm-all.c | 7 + > >> include/sysemu/guest-memfd-manager.h | 49 +++ > >> system/guest-memfd-manager.c | 425 +++++++++++++++++++++++++++ > >> system/meson.build | 1 + > >> system/physmem.c | 11 +- > >> 5 files changed, 492 insertions(+), 1 deletion(-) > >> create mode 100644 include/sysemu/guest-memfd-manager.h > >> create mode 100644 system/guest-memfd-manager.c > >> > >> > >> base-commit: 900536d3e97aed7fdd9cb4dadd3bf7023360e819 > >
On 15.11.24 17:47, Rob Nertney wrote: > On Tue, Oct 08, 2024 at 04:59:45PM +0800, Chenyi Qiang wrote: >> Hi Paolo, >> >> Kindly ping for this thread. The in-place page conversion was discussed >> at Linux Plumbers. Does it give some direction for the shared device >> assignment enabling work? >> > Hi everybody. Hi, > > Our NVIDIA GPUs currently support this shared-memory/bounce-buffer method to > provide AI acceleration within TEE CVMs. We require passing through the GPU via > VFIO stubbing, which means that we are impacted by the absence of an API to > inform VFIO about page conversions. > > The CSPs have enough kernel engineers who handle this process in their own host > kernels, but we have several enterprise customers who are eager to begin using > this solution in the upstream. AMD has successfully ported enough of the > SEV-SNP support into 6.11 and our initial testing shows successful operation, > but only by disabling discard via these two QEMU patches: > - https://github.com/AMDESE/qemu/commit/0c9ae28d3e199de9a40876a492e0f03a11c6f5d8 > - https://github.com/AMDESE/qemu/commit/5256c41fb3055961ea7ac368acc0b86a6632d095 > > This "workaround" is a bit of a hack, as it effectively requires more than > double the amount of host memory that is allocated to the guest CVM. The > proposal here appears to be a promising workaround; are there other solutions > that are recommended for this use case? What we are working on is supporting private and shared memory in guest_memfd, and allowing an in-place conversion between shared and private: this avoids discards + reallocation and consequently any double memory allocation. To get stuff into VFIO, we must only map the currently shared pages (VFIO will pin + map them), and unmap them (VFIO will unmap + unpin them) before converting them to private. This series should likely achieve the unmap-before-conversion-to-private, and map-after-conversion-to-shared, such that it could be compatible with guest_memfd. QEMU would simply mmap the guest_memfd to obtain a user space mapping, from which it can pass address ranges to VFIO like we already do. This user space mapping only allows for shared pages to be faulted in. Currently private pages cannot be faulted in (inaccessible -> SIGBUS). So far the theory. I'll note that this is likely not the most elegant solution, but something that would achieve one solution to the problem in a reasonable timeframe. Cheers!
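Sketching the ordering described above, under the assumption that guest_memfd gains in-place shared/private conversion and that its shared parts can be mmap()ed by QEMU: every helper below is a placeholder, not an existing kernel or QEMU interface.

/* Sketch only: unmap-before-conversion-to-private and
 * map-after-conversion-to-shared, against a user space mapping of
 * guest_memfd. All gmem_* / notify_vfio_* helpers are placeholders. */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Placeholder declarations -- NOT existing QEMU/kernel APIs. */
void notify_vfio_map(void *hva, size_t size);      /* VFIO maps + pins     */
void notify_vfio_unmap(void *hva, size_t size);    /* VFIO unmaps + unpins */
void gmem_convert_to_private(int gmem_fd, size_t offset, size_t size);
void gmem_convert_to_shared(int gmem_fd, size_t offset, size_t size);

/* QEMU's user space view of guest_memfd: only shared pages can be
 * faulted in; touching a private page would raise SIGBUS. */
static uint8_t *map_gmem(int gmem_fd, size_t ram_size)
{
    return mmap(NULL, ram_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                gmem_fd, 0);
}

static void shared_to_private(int gmem_fd, uint8_t *hva, size_t off, size_t len)
{
    notify_vfio_unmap(hva + off, len);           /* 1. unmap first ...    */
    gmem_convert_to_private(gmem_fd, off, len);  /* 2. ... then convert   */
}

static void private_to_shared(int gmem_fd, uint8_t *hva, size_t off, size_t len)
{
    gmem_convert_to_shared(gmem_fd, off, len);   /* 1. convert first ...  */
    notify_vfio_map(hva + off, len);             /* 2. ... then map + pin */
}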