Message ID | 20240823132137.336874-13-aik@amd.com
---|---
State | New, archived
Series | Secure VFIO, TDISP, SEV TIO
+Jason/David > From: Alexey Kardashevskiy <aik@amd.com> > Sent: Friday, August 23, 2024 9:21 PM > > IOMMUFD calls get_user_pages() for every mapping which will allocate > shared memory instead of using private memory managed by the KVM and > MEMFD. > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE > API > similar to already existing VFIO device and VFIO group fds. > This addition registers the KVM in IOMMUFD with a callback to get a pfn > for guest private memory for mapping it later in the IOMMU. > No callback for free as it is generic folio_put() for now. > > The aforementioned callback uses uptr to calculate the offset into > the KVM memory slot and find private backing pfn, copies > kvm_gmem_get_pfn() pretty much. > > This relies on private pages to be pinned beforehand. > There was a related discussion [1] which leans toward the conclusion that the IOMMU page table for private memory will be managed by the secure world i.e. the KVM path. Obviously the work here confirms that it doesn't hold for SEV-TIO which still expects the host to manage the IOMMU page table. btw going down this path it's clearer to extend the MAP_DMA uAPI to accept {gmemfd, offset} than adding a callback to KVM.
On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote: > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > shared memory instead of using private memory managed by the KVM and > > MEMFD. > > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE > > API > > similar to already existing VFIO device and VFIO group fds. > > This addition registers the KVM in IOMMUFD with a callback to get a pfn > > for guest private memory for mapping it later in the IOMMU. > > No callback for free as it is generic folio_put() for now. > > > > The aforementioned callback uses uptr to calculate the offset into > > the KVM memory slot and find private backing pfn, copies > > kvm_gmem_get_pfn() pretty much. > > > > This relies on private pages to be pinned beforehand. > > > > There was a related discussion [1] which leans toward the conclusion > that the IOMMU page table for private memory will be managed by > the secure world i.e. the KVM path. It is still effectively true, AMD's design has duplication, the RMP table has the mappings to validate GPA and that is all managed in the secure world. They just want another copy of that information in the unsecure world in the form of page tables :\ > btw going down this path it's clearer to extend the MAP_DMA > uAPI to accept {gmemfd, offset} than adding a callback to KVM. Yes, we want a DMA MAP from memfd sort of API in general. So it should go directly to guest memfd with no kvm entanglement. Jason
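As a rough illustration of the {gmemfd, offset} idea being discussed, a MAP_DMA-style iommufd command could carry a guest_memfd file descriptor and an offset instead of a user pointer. The struct layout and the IOMMU_IOAS_MAP_GMEM name below are purely hypothetical sketches, not an existing kernel interface:

#include <linux/types.h>

/* Hypothetical "map DMA from a guest memfd" command, for illustration only */
struct iommu_ioas_map_gmem {
        __u32 size;                     /* sizeof(struct iommu_ioas_map_gmem) */
        __u32 flags;                    /* e.g. writeable, fixed-IOVA */
        __u32 ioas_id;                  /* IOAS to map into */
        __s32 gmem_fd;                  /* guest_memfd file descriptor */
        __aligned_u64 gmem_offset;      /* offset into the guest_memfd */
        __aligned_u64 length;           /* bytes to map */
        __aligned_u64 iova;             /* in: requested IOVA, out: allocated IOVA */
};
/* #define IOMMU_IOAS_MAP_GMEM ...     ioctl number would be assigned on the list */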
On 26/8/24 18:39, Tian, Kevin wrote: > +Jason/David > >> From: Alexey Kardashevskiy <aik@amd.com> >> Sent: Friday, August 23, 2024 9:21 PM >> >> IOMMUFD calls get_user_pages() for every mapping which will allocate >> shared memory instead of using private memory managed by the KVM and >> MEMFD. >> >> Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE >> API >> similar to already existing VFIO device and VFIO group fds. >> This addition registers the KVM in IOMMUFD with a callback to get a pfn >> for guest private memory for mapping it later in the IOMMU. >> No callback for free as it is generic folio_put() for now. >> >> The aforementioned callback uses uptr to calculate the offset into >> the KVM memory slot and find private backing pfn, copies >> kvm_gmem_get_pfn() pretty much. >> >> This relies on private pages to be pinned beforehand. >> > > There was a related discussion [1] which leans toward the conclusion Forgot [1]? > that the IOMMU page table for private memory will be managed by > the secure world i.e. the KVM path. > > Obviously the work here confirms that it doesn't hold for SEV-TIO > which still expects the host to manage the IOMMU page table. > > btw going down this path it's clearer to extend the MAP_DMA > uAPI to accept {gmemfd, offset} than adding a callback to KVM. Thanks for the comment, makes sense, this should make the interface cleaner. It was just a bit messy (but doable nevertheless) at the time to push this new mapping flag/type all the way down to pfn_reader_user_pin: iommufd_ioas_map -> iopt_map_user_pages -> iopt_map_pages -> iopt_fill_domains_pages -> iopt_area_fill_domains -> pfn_reader_first -> pfn_reader_next -> pfn_reader_fill_span -> pfn_reader_user_pin Thanks,
> From: Alexey Kardashevskiy <aik@amd.com> > Sent: Tuesday, August 27, 2024 10:28 AM > > On 26/8/24 18:39, Tian, Kevin wrote: > > +Jason/David > > > >> From: Alexey Kardashevskiy <aik@amd.com> > >> Sent: Friday, August 23, 2024 9:21 PM > >> > >> IOMMUFD calls get_user_pages() for every mapping which will allocate > >> shared memory instead of using private memory managed by the KVM > and > >> MEMFD. > >> > >> Add support for IOMMUFD fd to the VFIO KVM device's > KVM_DEV_VFIO_FILE > >> API > >> similar to already existing VFIO device and VFIO group fds. > >> This addition registers the KVM in IOMMUFD with a callback to get a pfn > >> for guest private memory for mapping it later in the IOMMU. > >> No callback for free as it is generic folio_put() for now. > >> > >> The aforementioned callback uses uptr to calculate the offset into > >> the KVM memory slot and find private backing pfn, copies > >> kvm_gmem_get_pfn() pretty much. > >> > >> This relies on private pages to be pinned beforehand. > >> > > > > There was a related discussion [1] which leans toward the conclusion > > Forgot [1]? [1] https://lore.kernel.org/kvm/20240620143406.GJ2494510@nvidia.com/ > > > that the IOMMU page table for private memory will be managed by > > the secure world i.e. the KVM path. > > > > Obviously the work here confirms that it doesn't hold for SEV-TIO > > which still expects the host to manage the IOMMU page table. > > > > btw going down this path it's clearer to extend the MAP_DMA > > uAPI to accept {gmemfd, offset} than adding a callback to KVM. > > Thanks for the comment, makes sense, this should make the interface > cleaner. It was just a bit messy (but doable nevertheless) at the time > to push this new mapping flag/type all the way down to > pfn_reader_user_pin: > > iommufd_ioas_map -> iopt_map_user_pages -> iopt_map_pages -> > iopt_fill_domains_pages -> iopt_area_fill_domains -> pfn_reader_first -> > pfn_reader_next -> pfn_reader_fill_span -> pfn_reader_user_pin > > > Thanks, > > -- > Alexey
On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote: > On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote: > > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > > shared memory instead of using private memory managed by the KVM and > > > MEMFD. > > > > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE > > > API > > > similar to already existing VFIO device and VFIO group fds. > > > This addition registers the KVM in IOMMUFD with a callback to get a pfn > > > for guest private memory for mapping it later in the IOMMU. > > > No callback for free as it is generic folio_put() for now. > > > > > > The aforementioned callback uses uptr to calculate the offset into > > > the KVM memory slot and find private backing pfn, copies > > > kvm_gmem_get_pfn() pretty much. > > > > > > This relies on private pages to be pinned beforehand. > > > > > > > There was a related discussion [1] which leans toward the conclusion > > that the IOMMU page table for private memory will be managed by > > the secure world i.e. the KVM path. > > It is still effectively true, AMD's design has duplication, the RMP > table has the mappings to validate GPA and that is all managed in the > secure world. > > They just want another copy of that information in the unsecure world > in the form of page tables :\ > > > btw going down this path it's clearer to extend the MAP_DMA > > uAPI to accept {gmemfd, offset} than adding a callback to KVM. > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > go directly to guest memfd with no kvm entanglement. A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace takes control of the IOMMU mapping in the unsecure world. But as mentioned, the unsecure world mapping is just a "copy" and has no generic meaning without the CoCo-VM context. Seems no need for userspace to repeat the "copy" for IOMMU. Maybe userspace could just find a way to link the KVM context to IOMMU at the first place, then let KVM & IOMMU directly negotiate the mapping at runtime. Thanks, Yilun > > Jason >
On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote: > On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote: > > On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote: > > > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > > > shared memory instead of using private memory managed by the KVM and > > > > MEMFD. > > > > > > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE > > > > API > > > > similar to already existing VFIO device and VFIO group fds. > > > > This addition registers the KVM in IOMMUFD with a callback to get a pfn > > > > for guest private memory for mapping it later in the IOMMU. > > > > No callback for free as it is generic folio_put() for now. > > > > > > > > The aforementioned callback uses uptr to calculate the offset into > > > > the KVM memory slot and find private backing pfn, copies > > > > kvm_gmem_get_pfn() pretty much. > > > > > > > > This relies on private pages to be pinned beforehand. > > > > > > > > > > There was a related discussion [1] which leans toward the conclusion > > > that the IOMMU page table for private memory will be managed by > > > the secure world i.e. the KVM path. > > > > It is still effectively true, AMD's design has duplication, the RMP > > table has the mappings to validate GPA and that is all managed in the > > secure world. > > > > They just want another copy of that information in the unsecure world > > in the form of page tables :\ > > > > > btw going down this path it's clearer to extend the MAP_DMA > > > uAPI to accept {gmemfd, offset} than adding a callback to KVM. > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > > go directly to guest memfd with no kvm entanglement. > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace > takes control of the IOMMU mapping in the unsecure world. Yes, such is how it seems to work. It doesn't actually have much control, it has to build a mapping that matches the RMP table exactly but still has to build it.. > But as mentioned, the unsecure world mapping is just a "copy" and > has no generic meaning without the CoCo-VM context. Seems no need > for userspace to repeat the "copy" for IOMMU. Well, here I say copy from the information already in the PSP secure world in the form fo their RMP, but in a different format. There is another copy in KVM in it's stage 2 translation but.. > Maybe userspace could just find a way to link the KVM context to IOMMU > at the first place, then let KVM & IOMMU directly negotiate the mapping > at runtime. I think the KVM folks have said no to sharing the KVM stage 2 directly with the iommu. They do too many operations that are incompatible with the iommu requirements for the stage 2. If that is true for the confidential compute, I don't know. Still, continuing to duplicate the two mappings as we have always done seems like a reasonable place to start and we want a memfd map anyhow for other reasons: https://lore.kernel.org/linux-iommu/20240806125602.GJ478300@nvidia.com/ Jason
On 29/8/24 22:15, Jason Gunthorpe wrote: > On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote: >> On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote: >>> On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote: >>>>> IOMMUFD calls get_user_pages() for every mapping which will allocate >>>>> shared memory instead of using private memory managed by the KVM and >>>>> MEMFD. >>>>> >>>>> Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE >>>>> API >>>>> similar to already existing VFIO device and VFIO group fds. >>>>> This addition registers the KVM in IOMMUFD with a callback to get a pfn >>>>> for guest private memory for mapping it later in the IOMMU. >>>>> No callback for free as it is generic folio_put() for now. >>>>> >>>>> The aforementioned callback uses uptr to calculate the offset into >>>>> the KVM memory slot and find private backing pfn, copies >>>>> kvm_gmem_get_pfn() pretty much. >>>>> >>>>> This relies on private pages to be pinned beforehand. >>>>> >>>> >>>> There was a related discussion [1] which leans toward the conclusion >>>> that the IOMMU page table for private memory will be managed by >>>> the secure world i.e. the KVM path. >>> >>> It is still effectively true, AMD's design has duplication, the RMP >>> table has the mappings to validate GPA and that is all managed in the >>> secure world. >>> >>> They just want another copy of that information in the unsecure world >>> in the form of page tables :\ >>> >>>> btw going down this path it's clearer to extend the MAP_DMA >>>> uAPI to accept {gmemfd, offset} than adding a callback to KVM. >>> >>> Yes, we want a DMA MAP from memfd sort of API in general. So it should >>> go directly to guest memfd with no kvm entanglement. >> >> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace >> takes control of the IOMMU mapping in the unsecure world. > > Yes, such is how it seems to work. > > It doesn't actually have much control, it has to build a mapping that > matches the RMP table exactly but still has to build it.. Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
On Thu, Aug 29, 2024 at 09:15:49AM -0300, Jason Gunthorpe wrote: > On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote: > > On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote: > > > On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote: > > > > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > > > > shared memory instead of using private memory managed by the KVM and > > > > > MEMFD. > > > > > > > > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE > > > > > API > > > > > similar to already existing VFIO device and VFIO group fds. > > > > > This addition registers the KVM in IOMMUFD with a callback to get a pfn > > > > > for guest private memory for mapping it later in the IOMMU. > > > > > No callback for free as it is generic folio_put() for now. > > > > > > > > > > The aforementioned callback uses uptr to calculate the offset into > > > > > the KVM memory slot and find private backing pfn, copies > > > > > kvm_gmem_get_pfn() pretty much. > > > > > > > > > > This relies on private pages to be pinned beforehand. > > > > > > > > > > > > > There was a related discussion [1] which leans toward the conclusion > > > > that the IOMMU page table for private memory will be managed by > > > > the secure world i.e. the KVM path. > > > > > > It is still effectively true, AMD's design has duplication, the RMP > > > table has the mappings to validate GPA and that is all managed in the > > > secure world. > > > > > > They just want another copy of that information in the unsecure world > > > in the form of page tables :\ > > > > > > > btw going down this path it's clearer to extend the MAP_DMA > > > > uAPI to accept {gmemfd, offset} than adding a callback to KVM. > > > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > > > go directly to guest memfd with no kvm entanglement. > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace > > takes control of the IOMMU mapping in the unsecure world. > > Yes, such is how it seems to work. > > It doesn't actually have much control, it has to build a mapping that > matches the RMP table exactly but still has to build it.. > > > But as mentioned, the unsecure world mapping is just a "copy" and > > has no generic meaning without the CoCo-VM context. Seems no need > > for userspace to repeat the "copy" for IOMMU. > > Well, here I say copy from the information already in the PSP secure > world in the form fo their RMP, but in a different format. > > There is another copy in KVM in it's stage 2 translation but.. > > > Maybe userspace could just find a way to link the KVM context to IOMMU > > at the first place, then let KVM & IOMMU directly negotiate the mapping > > at runtime. > > I think the KVM folks have said no to sharing the KVM stage 2 directly > with the iommu. They do too many operations that are incompatible with > the iommu requirements for the stage 2. I kind of agree. I'm not considering the page table sharing for AMD's case. I was just thinking about the way we sync up the secure mapping for KVM & IOMMU, when Page attribute conversion happens, still via userspace or KVM directly notifies IOMMU. > > If that is true for the confidential compute, I don't know. For Intel TDX TEE-IO, there may be a different story. Architechturely the secure IOMMU page table has to share with KVM secure stage 2 (SEPT). 
The SEPT is managed by firmware (the TDX Module), and the TDX Module ensures the SEPT operations are also good for the secure IOMMU, so there is not much trick to play for the SEPT. > > Still, continuing to duplicate the two mappings as we have always done > seems like a reasonable place to start and we want a memfd map anyhow > for other reasons: > > https://lore.kernel.org/linux-iommu/20240806125602.GJ478300@nvidia.com/ > > Jason
On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote: > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > > > > go directly to guest memfd with no kvm entanglement. > > > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace > > > takes control of the IOMMU mapping in the unsecure world. > > > > Yes, such is how it seems to work. > > > > It doesn't actually have much control, it has to build a mapping that > > matches the RMP table exactly but still has to build it.. > > Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host > physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will > help to map IOVA (in fact, guest pfn) to host pfn? Thanks, It is the explanation for why this is safe. For CC the translation of IOVA to physical must not be controlled by the hypervisor, for security. This can result in translation based attacks. AMD is weird because it puts the IOMMU page table in untrusted hypervisor memory, everyone else seems to put it in the trusted world's memory. This works for AMD because they have two copies of this translation, in two different formats, one in the RMP which is in trusted memory and one in the IO page table which is not trusted. Yes you can't use the RMP to do an IOVA lookup, but it does encode exactly the same information. Both must agree on the IOVA to physical mapping otherwise the HW rejects it. Meaning the IOMMU configuration must perfectly match the RMP configuration. Jason
On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > If that is true for the confidential compute, I don't know. > > For Intel TDX TEE-IO, there may be a different story. > > Architechturely the secure IOMMU page table has to share with KVM secure > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module > ensures the SEPT operations good for secure IOMMU, so there is no much > trick to play for SEPT. Yes, I think ARM will do the same as well. From a uAPI perspective we need some way to create a secure vPCI function linked to a KVM and some IOMMUs will implicitly get a translation from the secure world and some IOMMUs will need to manage it in untrusted hypervisor memory. Jason
On 30/8/24 22:35, Jason Gunthorpe wrote: > On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote: >>>>> Yes, we want a DMA MAP from memfd sort of API in general. So it should >>>>> go directly to guest memfd with no kvm entanglement. >>>> >>>> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace >>>> takes control of the IOMMU mapping in the unsecure world. >>> >>> Yes, such is how it seems to work. >>> >>> It doesn't actually have much control, it has to build a mapping that >>> matches the RMP table exactly but still has to build it.. >> >> Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host >> physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will >> help to map IOVA (in fact, guest pfn) to host pfn? Thanks, > > It is the explanation for why this is safe. > > For CC the translation of IOVA to physical must not be controlled by > the hypervisor, for security. This can result in translation based > attacks. > > AMD is weird because it puts the IOMMU page table in untrusted > hypervisor memory, everyone else seems to put it in the trusted > world's memory. > > This works for AMD because they have two copies of this translation, > in two different formats, one in the RMP which is in trusted memory > and one in the IO page table which is not trusted. Yes you can't use > the RMP to do an IOVA lookup, but it does encode exactly the same > information. It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on x86 (and some/most other architectures) but it is not for when guests get hardware-assisted vIOMMU support. Thanks, > Both must agree on the IOVA to physical mapping otherwise the HW > rejects it. Meaning the IOMMU configuration must perfectly match the > RMP configuration.
On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote: > On 30/8/24 22:35, Jason Gunthorpe wrote: > > On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote: > > > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > > > > > > go directly to guest memfd with no kvm entanglement. > > > > > > > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace > > > > > takes control of the IOMMU mapping in the unsecure world. > > > > > > > > Yes, such is how it seems to work. > > > > > > > > It doesn't actually have much control, it has to build a mapping that > > > > matches the RMP table exactly but still has to build it.. > > > > > > Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host > > > physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will > > > help to map IOVA (in fact, guest pfn) to host pfn? Thanks, > > > > It is the explanation for why this is safe. > > > > For CC the translation of IOVA to physical must not be controlled by > > the hypervisor, for security. This can result in translation based > > attacks. > > > > AMD is weird because it puts the IOMMU page table in untrusted > > hypervisor memory, everyone else seems to put it in the trusted > > world's memory. > > > > This works for AMD because they have two copies of this translation, > > in two different formats, one in the RMP which is in trusted memory > > and one in the IO page table which is not trusted. Yes you can't use > > the RMP to do an IOVA lookup, but it does encode exactly the same > > information. > > It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on > x86 (and some/most other architectures) but it is not for when guests get > hardware-assisted vIOMMU support. Yes, you are forced into a nesting IOMMU architecture with CC guests. Jason
On 3/9/24 09:52, Jason Gunthorpe wrote: > On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote: >> On 30/8/24 22:35, Jason Gunthorpe wrote: >>> On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote: >>>>>>> Yes, we want a DMA MAP from memfd sort of API in general. So it should >>>>>>> go directly to guest memfd with no kvm entanglement. >>>>>> >>>>>> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace >>>>>> takes control of the IOMMU mapping in the unsecure world. >>>>> >>>>> Yes, such is how it seems to work. >>>>> >>>>> It doesn't actually have much control, it has to build a mapping that >>>>> matches the RMP table exactly but still has to build it.. >>>> >>>> Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host >>>> physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will >>>> help to map IOVA (in fact, guest pfn) to host pfn? Thanks, >>> >>> It is the explanation for why this is safe. >>> >>> For CC the translation of IOVA to physical must not be controlled by >>> the hypervisor, for security. This can result in translation based >>> attacks. >>> >>> AMD is weird because it puts the IOMMU page table in untrusted >>> hypervisor memory, everyone else seems to put it in the trusted >>> world's memory. >>> >>> This works for AMD because they have two copies of this translation, >>> in two different formats, one in the RMP which is in trusted memory >>> and one in the IO page table which is not trusted. Yes you can't use >>> the RMP to do an IOVA lookup, but it does encode exactly the same >>> information. >> >> It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on >> x86 (and some/most other architectures) but it is not for when guests get >> hardware-assisted vIOMMU support. > > Yes, you are forced into a nesting IOMMU architecture with CC guests. Up to two I/O page tables and the RMP table allow both 1:1 and vIOMMU, what am I forced into, and by what? Thanks, > > Jason
On Tue, Sep 03, 2024 at 10:03:53AM +1000, Alexey Kardashevskiy wrote: > > > On 3/9/24 09:52, Jason Gunthorpe wrote: > > On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote: > > > On 30/8/24 22:35, Jason Gunthorpe wrote: > > > > On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote: > > > > > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should > > > > > > > > go directly to guest memfd with no kvm entanglement. > > > > > > > > > > > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace > > > > > > > takes control of the IOMMU mapping in the unsecure world. > > > > > > > > > > > > Yes, such is how it seems to work. > > > > > > > > > > > > It doesn't actually have much control, it has to build a mapping that > > > > > > matches the RMP table exactly but still has to build it.. > > > > > > > > > > Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host > > > > > physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will > > > > > help to map IOVA (in fact, guest pfn) to host pfn? Thanks, > > > > > > > > It is the explanation for why this is safe. > > > > > > > > For CC the translation of IOVA to physical must not be controlled by > > > > the hypervisor, for security. This can result in translation based > > > > attacks. > > > > > > > > AMD is weird because it puts the IOMMU page table in untrusted > > > > hypervisor memory, everyone else seems to put it in the trusted > > > > world's memory. > > > > > > > > This works for AMD because they have two copies of this translation, > > > > in two different formats, one in the RMP which is in trusted memory > > > > and one in the IO page table which is not trusted. Yes you can't use > > > > the RMP to do an IOVA lookup, but it does encode exactly the same > > > > information. > > > > > > It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on > > > x86 (and some/most other architectures) but it is not for when guests get > > > hardware-assisted vIOMMU support. > > > > Yes, you are forced into a nesting IOMMU architecture with CC guests. > > Up to two I/O page tables and the RMP table allow both 1:1 and vIOMMU, what > am I forced into, and by what? A key point of CC security is that the hypervisor cannot control the IOVA translation. AMD is securing non-viommu by using the RMP to limit the IOVA translation to 1:1 But, viommu models require a secured non 1:1 mapping. How do you intend to secure this other than using actual iommu nesting? Presumably relying on the PSP to program the secure DTE's GCR3 pointer. Jason
Jason Gunthorpe wrote: > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > > > If that is true for the confidential compute, I don't know. > > > > For Intel TDX TEE-IO, there may be a different story. > > > > Architechturely the secure IOMMU page table has to share with KVM secure > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module > > ensures the SEPT operations good for secure IOMMU, so there is no much > > trick to play for SEPT. > > Yes, I think ARM will do the same as well. > > From a uAPI perspective we need some way to create a secure vPCI > function linked to a KVM and some IOMMUs will implicitly get a > translation from the secure world and some IOMMUs will need to manage > it in untrusted hypervisor memory. Yes. This matches the line of thought I had for the PCI TSM core interface. It allows establishing the connection to the device's security manager and facilitates linking that to a KVM context. So part of the uAPI is charged with managing device-security independent of a VM, and binding a vPCI device involves a rendezvous of the secure-world IOMMU setup with secure-world PCI via IOMMU and PCI-TSM coordination.
On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote: > Jason Gunthorpe wrote: > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > > > > > If that is true for the confidential compute, I don't know. > > > > > > For Intel TDX TEE-IO, there may be a different story. > > > > > > Architechturely the secure IOMMU page table has to share with KVM secure > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module > > > ensures the SEPT operations good for secure IOMMU, so there is no much > > > trick to play for SEPT. > > > > Yes, I think ARM will do the same as well. > > > > From a uAPI perspective we need some way to create a secure vPCI > > function linked to a KVM and some IOMMUs will implicitly get a > > translation from the secure world and some IOMMUs will need to manage > > it in untrusted hypervisor memory. > > Yes. This matches the line of though I had for the PCI TSM core > interface. Okay, but I don't think you should ever be binding any PCI stuff to KVM without involving VFIO in some way. VFIO is the security proof that userspace is even permitted to touch that PCI Device at all. > It allows establishing the connection to the device's > security manager and facilitates linking that to a KVM context. So part > of the uAPI is charged with managing device-security independent of a > VM Yes, the PCI core should have stuff for managing device-secuirty for any bound driver, especially assuming an operational standard kernel driver only. > and binding a vPCI device involves a rendezvous of the secure-world > IOMMU setup with secure-world PCI via IOMMU and PCI-TSM > coordination. And this stuff needs to start with VFIO and we can figure out of it is in the iommufd subcomponent or not. I'd really like to see a clearly written proposal for what the uAPI would look like for vPCI function lifecycle and binding that at least one of the platforms is happy with :) It would be a good starting point for other platforms to pick at. Try iommufd first (I'm guessing this is correct) and if it doesn't work explain why. Jason
Jason Gunthorpe wrote: > On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote: > > Jason Gunthorpe wrote: > > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > > > > > > > If that is true for the confidential compute, I don't know. > > > > > > > > For Intel TDX TEE-IO, there may be a different story. > > > > > > > > Architechturely the secure IOMMU page table has to share with KVM secure > > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module > > > > ensures the SEPT operations good for secure IOMMU, so there is no much > > > > trick to play for SEPT. > > > > > > Yes, I think ARM will do the same as well. > > > > > > From a uAPI perspective we need some way to create a secure vPCI > > > function linked to a KVM and some IOMMUs will implicitly get a > > > translation from the secure world and some IOMMUs will need to manage > > > it in untrusted hypervisor memory. > > > > Yes. This matches the line of though I had for the PCI TSM core > > interface. > > Okay, but I don't think you should ever be binding any PCI stuff to > KVM without involving VFIO in some way. > > VFIO is the security proof that userspace is even permitted to touch > that PCI Device at all. Right, I think VFIO grows a uAPI to make a vPCI device "bind capable" which ties together the PCI/TSM security context, the assignable device context and the KVM context. > > It allows establishing the connection to the device's > > security manager and facilitates linking that to a KVM context. So part > > of the uAPI is charged with managing device-security independent of a > > VM > > Yes, the PCI core should have stuff for managing device-secuirty for > any bound driver, especially assuming an operational standard kernel > driver only. > > > and binding a vPCI device involves a rendezvous of the secure-world > > IOMMU setup with secure-world PCI via IOMMU and PCI-TSM > > coordination. > > And this stuff needs to start with VFIO and we can figure out of it is > in the iommufd subcomponent or not. > > I'd really like to see a clearly written proposal for what the uAPI > would look like for vPCI function lifecycle and binding that at least > one of the platforms is happy with :) > > It would be a good starting point for other platforms to pick at. Try > iommufd first (I'm guessing this is correct) and if it doesn't work > explain why. Yes, makes sense. Will take a look at that also to prevent more disconnects on what this PCI device-security community is actually building.
> From: Williams, Dan J <dan.j.williams@intel.com> > Sent: Wednesday, September 4, 2024 9:00 AM > > Jason Gunthorpe wrote: > > On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote: > > > Jason Gunthorpe wrote: > > > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > > > > > > > > > If that is true for the confidential compute, I don't know. > > > > > > > > > > For Intel TDX TEE-IO, there may be a different story. > > > > > > > > > > Architechturely the secure IOMMU page table has to share with KVM > secure > > > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX > Module > > > > > ensures the SEPT operations good for secure IOMMU, so there is no > much > > > > > trick to play for SEPT. > > > > > > > > Yes, I think ARM will do the same as well. > > > > > > > > From a uAPI perspective we need some way to create a secure vPCI > > > > function linked to a KVM and some IOMMUs will implicitly get a > > > > translation from the secure world and some IOMMUs will need to > manage > > > > it in untrusted hypervisor memory. > > > > > > Yes. This matches the line of though I had for the PCI TSM core > > > interface. > > > > Okay, but I don't think you should ever be binding any PCI stuff to > > KVM without involving VFIO in some way. > > > > VFIO is the security proof that userspace is even permitted to touch > > that PCI Device at all. > > Right, I think VFIO grows a uAPI to make a vPCI device "bind capable" > which ties together the PCI/TSM security context, the assignable device > context and the KVM context. > Could you elaborate why the new uAPI is for making vPCI "bind capable" instead of doing the actual binding to KVM?
On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote: > Jason Gunthorpe wrote: > > On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote: > > > Jason Gunthorpe wrote: > > > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote: > > > > > > > > > > If that is true for the confidential compute, I don't know. > > > > > > > > > > For Intel TDX TEE-IO, there may be a different story. > > > > > > > > > > Architechturely the secure IOMMU page table has to share with KVM secure > > > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module > > > > > ensures the SEPT operations good for secure IOMMU, so there is no much > > > > > trick to play for SEPT. > > > > > > > > Yes, I think ARM will do the same as well. > > > > > > > > From a uAPI perspective we need some way to create a secure vPCI > > > > function linked to a KVM and some IOMMUs will implicitly get a > > > > translation from the secure world and some IOMMUs will need to manage > > > > it in untrusted hypervisor memory. > > > > > > Yes. This matches the line of though I had for the PCI TSM core > > > interface. > > > > Okay, but I don't think you should ever be binding any PCI stuff to > > KVM without involving VFIO in some way. > > > > VFIO is the security proof that userspace is even permitted to touch > > that PCI Device at all. > > Right, I think VFIO grows a uAPI to make a vPCI device "bind capable" > which ties together the PCI/TSM security context, the assignable device > context and the KVM context. I think it is more than just "bind capable", I understand various situations are going to require information passed from the VMM to the secure world to specify details about what vPCI function will appear in the VM. AMD probably needs very little here, but others will need more. > > It would be a good starting point for other platforms to pick at. Try > > iommufd first (I'm guessing this is correct) and if it doesn't work > > explain why. > > Yes, makes sense. Will take a look at that also to prevent more > disconnects on what this PCI device-security community is actually > building. We are already adding a VIOMMU object and that is going to be the linkage to the KVM side So we could have new actions: - Create a CC VIOMMU with XYZ parameters - Create a CC vPCI function on the vIOMMU with XYZ parameters - Query stuff? - ??? - Destroy a vPCI function Jason
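To make the action list above concrete, the commands could surface as iommufd object-allocation ioctls tied to the existing vIOMMU object. Everything below is an illustrative sketch; none of these structures, fields or names exist today, and the real shape would come out of the vIOMMU series review:

#include <linux/types.h>

/* "Create a CC VIOMMU with XYZ parameters" (hypothetical) */
struct iommu_viommu_alloc_cc {
        __u32 size;
        __u32 flags;
        __u32 dev_id;                   /* device naming the physical IOMMU */
        __u32 out_viommu_id;
        __aligned_u64 trusted_params;   /* opaque, platform-specific blob */
};

/* "Create a CC vPCI function on the vIOMMU with XYZ parameters" (hypothetical) */
struct iommu_viommu_vpci_alloc {
        __u32 size;
        __u32 viommu_id;
        __u32 dev_id;                   /* the assigned physical function */
        __u32 out_vpci_id;
        __aligned_u64 guest_rid;        /* vRID the guest will see */
};

/* "Destroy a vPCI function" (hypothetical) */
struct iommu_viommu_vpci_destroy {
        __u32 size;
        __u32 vpci_id;
};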
On Thu, Sep 05, 2024 at 08:29:16AM +0000, Tian, Kevin wrote: > > Could you elaborate why the new uAPI is for making vPCI "bind capable" > > instead of doing the actual binding to KVM? I don't see why you'd do any of this in KVM. I mean you could, but you also don't have to, and the KVM people don't really know about all the VFIO parts anyhow. It is like a bunch of our other viommu stuff: KVM has to share some of the HW and interfaces with the iommu driver. In this case it would be the secure VM context and the handles to talk to the trusted world. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, September 5, 2024 8:02 PM > > On Thu, Sep 05, 2024 at 08:29:16AM +0000, Tian, Kevin wrote: > > > Could you elaborate why the new uAPI is for making vPCI "bind capable" > > instead of doing the actual binding to KVM? > > I don't see why you'd do any of this in KVM, I mean you could, but you > also don't have to and KVM people don't really know about all the VFIO > parts anyhow. > That's not my point. I was asking why this VFIO uAPI is not for linking/binding a vPCI device to KVM (not doing it in KVM), instead of just making it 'bind capable'.
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, September 5, 2024 8:01 PM > > On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote: > > Jason Gunthorpe wrote: > > > It would be a good starting point for other platforms to pick at. Try > > > iommufd first (I'm guessing this is correct) and if it doesn't work > > > explain why. > > > > Yes, makes sense. Will take a look at that also to prevent more > > disconnects on what this PCI device-security community is actually > > building. > > We are already adding a VIOMMU object and that is going to be the > linkage to the KVM side > > So we could have new actions: > - Create a CC VIOMMU with XYZ parameters > - Create a CC vPCI function on the vIOMMU with XYZ parameters > - Query stuff? > - ??? > - Destroy a vPCI function > I'll look at the vIOMMU series soon. Just double confirm here. the so-called vIOMMU object here is the uAPI between iommufd and userspace. Not exactly suggesting a vIOMMU visible to guest. otherwise this solution will be tied to implementations supporting trusted vIOMMU. Then you expect to build CC/vPCI stuff around the vIOMMU object given it already connects to KVM?
On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Thursday, September 5, 2024 8:01 PM > > > > On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote: > > > Jason Gunthorpe wrote: > > > > It would be a good starting point for other platforms to pick at. Try > > > > iommufd first (I'm guessing this is correct) and if it doesn't work > > > > explain why. > > > > > > Yes, makes sense. Will take a look at that also to prevent more > > > disconnects on what this PCI device-security community is actually > > > building. > > > > We are already adding a VIOMMU object and that is going to be the > > linkage to the KVM side > > > > So we could have new actions: > > - Create a CC VIOMMU with XYZ parameters > > - Create a CC vPCI function on the vIOMMU with XYZ parameters > > - Query stuff? > > - ??? > > - Destroy a vPCI function > > > > I'll look at the vIOMMU series soon. Just double confirm here. > > the so-called vIOMMU object here is the uAPI between iommufd > and userspace. Not exactly suggesting a vIOMMU visible to guest. > otherwise this solution will be tied to implementations supporting > trusted vIOMMU. Right, the viommu object today just wraps elements of HW that are connected to the VM in some way. It is sort of a security container. If the VMM goes on to expose a vIOMMU to the guest or not should be orthogonal. I expect people will explicitly request a secure vIOMMU if they intend to expose the vIOMMU to the CC VM. This can trigger any actions in the trusted world that are required to support a secure vIOMMU. For instance any secure vIOMMU will require some way for the guest to issue secure invalidations, and that will require some level of trusted world involvement. At the minimum the trusted world has to attest the validity of the vIOMMU to the guest. > Then you expect to build CC/vPCI stuff around the vIOMMU > object given it already connects to KVM? Yes, it is my thought We alreay have a binding of devices to the viommu, increasing that to also include creating vPCI objects in the trusted world is a small step. Jason
Jason Gunthorpe wrote: > On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Thursday, September 5, 2024 8:01 PM > > > > > > On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote: > > > > Jason Gunthorpe wrote: > > > > > It would be a good starting point for other platforms to pick at. Try > > > > > iommufd first (I'm guessing this is correct) and if it doesn't work > > > > > explain why. > > > > > > > > Yes, makes sense. Will take a look at that also to prevent more > > > > disconnects on what this PCI device-security community is actually > > > > building. > > > > > > We are already adding a VIOMMU object and that is going to be the > > > linkage to the KVM side > > > > > > So we could have new actions: > > > - Create a CC VIOMMU with XYZ parameters > > > - Create a CC vPCI function on the vIOMMU with XYZ parameters > > > - Query stuff? > > > - ??? > > > - Destroy a vPCI function > > > > > > > I'll look at the vIOMMU series soon. Just double confirm here. > > > > the so-called vIOMMU object here is the uAPI between iommufd > > and userspace. Not exactly suggesting a vIOMMU visible to guest. > > otherwise this solution will be tied to implementations supporting > > trusted vIOMMU. > > Right, the viommu object today just wraps elements of HW that are > connected to the VM in some way. It is sort of a security container. > > If the VMM goes on to expose a vIOMMU to the guest or not should be > orthogonal. > > I expect people will explicitly request a secure vIOMMU if they intend > to expose the vIOMMU to the CC VM. This can trigger any actions in the > trusted world that are required to support a secure vIOMMU. > > For instance any secure vIOMMU will require some way for the guest to > issue secure invalidations, and that will require some level of > trusted world involvement. At the minimum the trusted world has to > attest the validity of the vIOMMU to the guest. > > > Then you expect to build CC/vPCI stuff around the vIOMMU > > object given it already connects to KVM? > > Yes, it is my thought > > We alreay have a binding of devices to the viommu, increasing that to > also include creating vPCI objects in the trusted world is a small > step. Sounds reasonable to me. To answer Kevin's question about what "bind capable" means I need to clarify this oversubscribed "bind" term means. "Bind" in the TDISP sense is transitioning the device to the LOCKED state so that its configuration is static and ready for the VM to run attestation without worrying about TOCTOU races. The VMM is not in a good position to know when the assigned device can be locked. There are updates, configuration changes, and reset/recovery scenarios the VM may want to perform before transitioning the device to the LOCKED state. So, the "bind capable" concept is: pre-condition VFIO with the context that "this vPCI device is known to VFIO as a device that can attach to the secure world, all the linkage between VFIO and the secure world is prepared for a VM to trigger entry into the LOCKED state, and later the RUN state". As mentioned in another thread this entry into the LOCKED state is likely nearly as violent as hotplug event since the DMA layer currently has no concept of a device having a foot in the secure world and the shared world at the same time.
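Since the TDISP state names carry most of the meaning of "bind" here, a compact summary of the TDI state machine being referenced (state names per the TDISP spec; the C enum itself is only an illustration, not an existing kernel definition):

/* Illustrative only: the TDI states referenced in the discussion above */
enum tdisp_tdi_state {
        TDISP_CONFIG_UNLOCKED,  /* VMM/host may still update/configure/reset the device */
        TDISP_CONFIG_LOCKED,    /* "bind": configuration frozen so the guest can attest it */
        TDISP_RUN,              /* guest accepted the device; secure DMA/MMIO are live */
        TDISP_ERROR,            /* any violation parks the TDI here until torn down */
};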
On Thu, Sep 05, 2024 at 01:53:27PM -0700, Dan Williams wrote: > As mentioned in another thread this entry into the LOCKED state is > likely nearly as violent as hotplug event since the DMA layer currently > has no concept of a device having a foot in the secure world and the > shared world at the same time. There is also something of a complicated situation where the VM also must validate that it has received the complete and correct device before it can lock it. I.e. that the MMIO ranges belong to the device, the DMA goes to the right place (the vRID:pRID is trusted), and so on. Further, the vIOMMU, and its parameters, in the VM must also be validated and trusted before the VM can lock the device. The VM and the trusted world must verify they have exclusive control over the translation. This is where AMD's model of having the hypervisor control things gets a little bit confusing for me. I suppose there will be some way that the confidential VM asks the trusted world to control the secure DTE such that it can select between GCR3, BLOCKED and IDENTITY. Regardless, I think everyone will need some metadata from the vIOMMU world into the trusted world to do all of this. Jason
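As a reading aid for the GCR3/BLOCKED/IDENTITY choice mentioned above, the three secure-DTE translation options could be summarized as follows; the enum and its comments are purely illustrative, not an actual driver definition:

/* Illustrative summary of the secure DTE choices named above */
enum secure_dte_mode {
        SDTE_BLOCKED,   /* no DMA permitted for this TDI */
        SDTE_IDENTITY,  /* 1:1 GPA == IOVA, cross-checked against the RMP */
        SDTE_GCR3,      /* nested translation via a guest-controlled GCR3 table (vIOMMU case) */
};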
> From: Williams, Dan J <dan.j.williams@intel.com> > Sent: Friday, September 6, 2024 4:53 AM > > Jason Gunthorpe wrote: > > On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote: > > > Then you expect to build CC/vPCI stuff around the vIOMMU > > > object given it already connects to KVM? > > > > Yes, it is my thought > > > > We alreay have a binding of devices to the viommu, increasing that to > > also include creating vPCI objects in the trusted world is a small > > step. > > Sounds reasonable to me. > > To answer Kevin's question about what "bind capable" means I need to > clarify this oversubscribed "bind" term means. "Bind" in the TDISP sense > is transitioning the device to the LOCKED state so that its > configuration is static and ready for the VM to run attestation without > worrying about TOCTOU races. > > The VMM is not in a good position to know when the assigned device can > be locked. There are updates, configuration changes, and reset/recovery > scenarios the VM may want to perform before transitioning the device to > the LOCKED state. So, the "bind capable" concept is: pre-condition VFIO > with the context that "this vPCI device is known to VFIO as a device > that can attach to the secure world, all the linkage between VFIO and > the secure world is prepared for a VM to trigger entry into the LOCKED > state, and later the RUN state". Okay this makes sense. So the point is that the TDI state machine is fully managed by the TSM driver so here 'bind capable' is a necessary preparation step for that state machine to enter the LOCKED state. > > As mentioned in another thread this entry into the LOCKED state is > likely nearly as violent as hotplug event since the DMA layer currently > has no concept of a device having a foot in the secure world and the > shared world at the same time. Is the DMA layer relevant in this context? Here we are talking about VFIO/IOMMUFD which can be hinted by VMM for whatever side effect caused by the entry into the LOCKED state...
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, September 6, 2024 7:07 AM > > On Thu, Sep 05, 2024 at 01:53:27PM -0700, Dan Williams wrote: > > > As mentioned in another thread this entry into the LOCKED state is > > likely nearly as violent as hotplug event since the DMA layer currently > > has no concept of a device having a foot in the secure world and the > > shared world at the same time. > > There is also something of a complicated situation where the VM also > must validate that it has received the complete and correct device > before it can lock it. Ie that the MMIO ranges belong to the device, > the DMA goes to the right place (the vRID:pRID is trusted), and so on. > > Further, the vIOMMU, and it's parameters, in the VM must also be > validated and trusted before the VM can lock the device. The VM and > the trusted world must verify they have the exclusive control over the > translation. Looking at the TDISP spec it's the host to lock the device (as Dan described the entry into the LOCKED state) while the VM is allowed to put the device into the RUN state after validation. I guess you actually meant the entry into RUN here? otherwise there might be some disconnect here. > > This is where AMDs model of having the hypervisor control things get a > little bit confusing for me. I suppose there will be some way that the > confidential VM asks the trusted world to control the secure DTE such > that it can select between GCR3, BLOCKED and IDENTITY. this matches what I read from the SEV-TIO spec. > > Regardless, I think everyone will need some metadata from the vIOMMU > world into the trusted world to do all of this. Agree.
On Fri, Sep 06, 2024 at 02:46:24AM +0000, Tian, Kevin wrote: > > Further, the vIOMMU, and it's parameters, in the VM must also be > > validated and trusted before the VM can lock the device. The VM and > > the trusted world must verify they have the exclusive control over the > > translation. > > Looking at the TDISP spec it's the host to lock the device (as Dan > described the entry into the LOCKED state) while the VM is allowed > to put the device into the RUN state after validation. > > I guess you actually meant the entry into RUN here? otherwise > there might be some disconnect here. Yeah Jason
On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > IOMMUFD calls get_user_pages() for every mapping which will allocate > shared memory instead of using private memory managed by the KVM and > MEMFD. Please check this series, it is much more how I would expect this to work. Use the guest memfd directly and forget about kvm in the iommufd code: https://lore.kernel.org/r/1726319158-283074-1-git-send-email-steven.sistare@oracle.com I would imagine you'd detect the guest memfd when accepting the FD and then having some different path in the pinning logic to pin and get the physical ranges out. Probably we would also need some CAP interaction with the iommu driver to understand if it can accept private pages to even allow this in the first place. Thanks, Jason
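As a sketch of the "different path in the pinning logic" suggestion, the map path could branch on whether the backing file is a guest_memfd; file_is_guest_memfd() and guest_memfd_pin_pages() below are invented helper names under that assumption, not existing kernel functions:

#include <linux/fs.h>
#include <linux/mm.h>

static int iopt_pin_backing(struct file *file, unsigned long start,
                            int npages, struct page **pages)
{
        /* Hypothetical check: is this fd backed by guest_memfd? */
        if (file && file_is_guest_memfd(file))
                /* Hypothetical pin that returns private-memory pages */
                return guest_memfd_pin_pages(file, start, npages, pages);

        /* Today's shared-memory path: plain GUP with a longterm pin */
        return pin_user_pages_fast(start, npages,
                                   FOLL_WRITE | FOLL_LONGTERM, pages);
}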
On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > shared memory instead of using private memory managed by the KVM and > > MEMFD. > > Please check this series, it is much more how I would expect this to > work. Use the guest memfd directly and forget about kvm in the iommufd code: > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-steven.sistare@oracle.com > > I would imagine you'd detect the guest memfd when accepting the FD and > then having some different path in the pinning logic to pin and get > the physical ranges out. According to the discussion at KVM microconference around hugepage support for guest_memfd [1], it's imperative that guest private memory is not long term pinned. Ideal way to implement this integration would be to support a notifier that can be invoked by guest_memfd when memory ranges get truncated so that IOMMU can unmap the corresponding ranges. Such a notifier should also get called during memory conversion, it would be interesting to discuss how conversion flow would work in this case. [1] https://lpc.events/event/18/contributions/1764/ (checkout the slide 12 from attached presentation) > > Probably we would also need some CAP interaction with the iommu driver > to understand if it can accept private pages to even allow this in the > first place. > > Thanks, > Jason >
> From: Vishal Annapurve <vannapurve@google.com> > Sent: Saturday, September 21, 2024 5:11 AM > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > > shared memory instead of using private memory managed by the KVM > and > > > MEMFD. > > > > Please check this series, it is much more how I would expect this to > > work. Use the guest memfd directly and forget about kvm in the iommufd > code: > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > steven.sistare@oracle.com > > > > I would imagine you'd detect the guest memfd when accepting the FD and > > then having some different path in the pinning logic to pin and get > > the physical ranges out. > > According to the discussion at KVM microconference around hugepage > support for guest_memfd [1], it's imperative that guest private memory > is not long term pinned. Ideal way to implement this integration would > be to support a notifier that can be invoked by guest_memfd when > memory ranges get truncated so that IOMMU can unmap the corresponding > ranges. Such a notifier should also get called during memory > conversion, it would be interesting to discuss how conversion flow > would work in this case. > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > slide 12 from attached presentation) > Most devices don't support I/O page fault hence can only DMA to long term pinned buffers. The notifier might be helpful for in-kernel conversion but as a basic requirement there needs a way for IOMMUFD to call into guest memfd to request long term pinning for a given range. That is how I interpreted "different path" in Jason's comment.
On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > From: Vishal Annapurve <vannapurve@google.com> > > Sent: Saturday, September 21, 2024 5:11 AM > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > > > IOMMUFD calls get_user_pages() for every mapping which will allocate > > > > shared memory instead of using private memory managed by the KVM > > and > > > > MEMFD. > > > > > > Please check this series, it is much more how I would expect this to > > > work. Use the guest memfd directly and forget about kvm in the iommufd > > code: > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > steven.sistare@oracle.com > > > > > > I would imagine you'd detect the guest memfd when accepting the FD and > > > then having some different path in the pinning logic to pin and get > > > the physical ranges out. > > > > According to the discussion at KVM microconference around hugepage > > support for guest_memfd [1], it's imperative that guest private memory > > is not long term pinned. Ideal way to implement this integration would > > be to support a notifier that can be invoked by guest_memfd when > > memory ranges get truncated so that IOMMU can unmap the corresponding > > ranges. Such a notifier should also get called during memory > > conversion, it would be interesting to discuss how conversion flow > > would work in this case. > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > slide 12 from attached presentation) > > > > Most devices don't support I/O page fault hence can only DMA to long > term pinned buffers. The notifier might be helpful for in-kernel conversion > but as a basic requirement there needs a way for IOMMUFD to call into > guest memfd to request long term pinning for a given range. That is > how I interpreted "different path" in Jason's comment. Policy that is being aimed here: 1) guest_memfd will pin the pages backing guest memory for all users. 2) kvm_gmem_get_pfn users will get a locked folio with elevated refcount when asking for the pfn/page from guest_memfd. Users will drop the refcount and release the folio lock when they are done using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio lock is supposed to be held for short durations. 3) Users can assume the pfn is around until they are notified by guest_memfd on truncation or memory conversion. Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. TDX VMs especially need to have secure EPT entries always mapped (once faulted-in) while the guest memory ranges are private.
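A minimal sketch of what point 3 above could look like from IOMMUFD's side, i.e. a callback registration that guest_memfd invokes on truncation or shared/private conversion; all names below are invented for illustration and are not an existing guest_memfd API:

#include <linux/fs.h>
#include <linux/types.h>

/* Hypothetical callback table a user (e.g. IOMMUFD) registers with guest_memfd */
struct gmem_invalidate_ops {
        /* Called before [offset, offset + nr) is truncated or converted */
        void (*invalidate_begin)(void *priv, pgoff_t offset, pgoff_t nr);
        /* Called after the range has been removed/converted */
        void (*invalidate_end)(void *priv, pgoff_t offset, pgoff_t nr);
};

/* Hypothetical registration hook exported by guest_memfd */
int guest_memfd_register_invalidate(struct file *gmem_file,
                                    const struct gmem_invalidate_ops *ops,
                                    void *priv);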
> From: Vishal Annapurve <vannapurve@google.com> > Sent: Monday, September 23, 2024 2:34 PM > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > Sent: Saturday, September 21, 2024 5:11 AM > > > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> > wrote: > > > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > > > > IOMMUFD calls get_user_pages() for every mapping which will > allocate > > > > > shared memory instead of using private memory managed by the > KVM > > > and > > > > > MEMFD. > > > > > > > > Please check this series, it is much more how I would expect this to > > > > work. Use the guest memfd directly and forget about kvm in the > iommufd > > > code: > > > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > > steven.sistare@oracle.com > > > > > > > > I would imagine you'd detect the guest memfd when accepting the FD > and > > > > then having some different path in the pinning logic to pin and get > > > > the physical ranges out. > > > > > > According to the discussion at KVM microconference around hugepage > > > support for guest_memfd [1], it's imperative that guest private memory > > > is not long term pinned. Ideal way to implement this integration would > > > be to support a notifier that can be invoked by guest_memfd when > > > memory ranges get truncated so that IOMMU can unmap the > corresponding > > > ranges. Such a notifier should also get called during memory > > > conversion, it would be interesting to discuss how conversion flow > > > would work in this case. > > > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > > slide 12 from attached presentation) > > > > > > > Most devices don't support I/O page fault hence can only DMA to long > > term pinned buffers. The notifier might be helpful for in-kernel conversion > > but as a basic requirement there needs a way for IOMMUFD to call into > > guest memfd to request long term pinning for a given range. That is > > how I interpreted "different path" in Jason's comment. > > Policy that is being aimed here: > 1) guest_memfd will pin the pages backing guest memory for all users. > 2) kvm_gmem_get_pfn users will get a locked folio with elevated > refcount when asking for the pfn/page from guest_memfd. Users will > drop the refcount and release the folio lock when they are done > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio > lock is supposed to be held for short durations. > 3) Users can assume the pfn is around until they are notified by > guest_memfd on truncation or memory conversion. > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. > TDX VMs especially need to have secure EPT entries always mapped (once > faulted-in) while the guest memory ranges are private. 'faulted-in' doesn't work for device DMAs (w/o IOPF). and above is based on the assumption that CoCo VM will always map/pin the private memory pages until a conversion happens. Conversion is initiated by the guest so ideally the guest is responsible for not leaving any in-fly DMAs to the page which is being converted. From this angle it is fine for IOMMUFD to receive a notification from guest memfd when such a conversion happens. But I'm not sure whether the TDX way is architectural or just an implementation choice which could be changed later, or whether it applies to other arch. 
If that behavior cannot be guaranteed, then we may still need a way for IOMMUFD to request a long-term pin.
On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote: > > From: Vishal Annapurve <vannapurve@google.com> > > Sent: Monday, September 23, 2024 2:34 PM > > > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > > Sent: Saturday, September 21, 2024 5:11 AM > > > > > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> > > wrote: > > > > > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > IOMMUFD calls get_user_pages() for every mapping which will > > allocate > > > > > > shared memory instead of using private memory managed by the > > KVM > > > > and > > > > > > MEMFD. > > > > > > > > > > Please check this series, it is much more how I would expect this to > > > > > work. Use the guest memfd directly and forget about kvm in the > > iommufd > > > > code: > > > > > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > > > steven.sistare@oracle.com > > > > > > > > > > I would imagine you'd detect the guest memfd when accepting the FD > > and > > > > > then having some different path in the pinning logic to pin and get > > > > > the physical ranges out. > > > > > > > > According to the discussion at KVM microconference around hugepage > > > > support for guest_memfd [1], it's imperative that guest private memory > > > > is not long term pinned. Ideal way to implement this integration would > > > > be to support a notifier that can be invoked by guest_memfd when > > > > memory ranges get truncated so that IOMMU can unmap the > > corresponding > > > > ranges. Such a notifier should also get called during memory > > > > conversion, it would be interesting to discuss how conversion flow > > > > would work in this case. > > > > > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > > > slide 12 from attached presentation) > > > > > > > > > > Most devices don't support I/O page fault hence can only DMA to long > > > term pinned buffers. The notifier might be helpful for in-kernel conversion > > > but as a basic requirement there needs a way for IOMMUFD to call into > > > guest memfd to request long term pinning for a given range. That is > > > how I interpreted "different path" in Jason's comment. > > > > Policy that is being aimed here: > > 1) guest_memfd will pin the pages backing guest memory for all users. > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated > > refcount when asking for the pfn/page from guest_memfd. Users will > > drop the refcount and release the folio lock when they are done > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio > > lock is supposed to be held for short durations. > > 3) Users can assume the pfn is around until they are notified by > > guest_memfd on truncation or memory conversion. > > > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. > > TDX VMs especially need to have secure EPT entries always mapped (once > > faulted-in) while the guest memory ranges are private. > > 'faulted-in' doesn't work for device DMAs (w/o IOPF). > > and above is based on the assumption that CoCo VM will always > map/pin the private memory pages until a conversion happens. > > Conversion is initiated by the guest so ideally the guest is responsible > for not leaving any in-fly DMAs to the page which is being converted. 
> From this angle it is fine for IOMMUFD to receive a notification from > guest memfd when such a conversion happens. Right, I think the expectation is if a guest has active DMA on a page it is changing between shared/private there is no expectation that the DMA will succeed. So we don't need page fault, we just need to allow it to safely fail. IMHO we should try to do as best we can here, and the ideal interface would be a notifier to switch the shared/private pages in some portion of the guestmemfd. With the idea that iommufd could perhaps do it atomically. When the notifier returns then the old pages are fenced off at the HW. But this would have to be a sleepable notifier that can do memory allocation. It is actually pretty complicated and we will need a reliable cut operation to make this work on AMD v1. Jason
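To make the proposed interface concrete, a sketch of what such a sleepable notifier could look like; none of these structures or names exist, and the callback semantics (invoked on conversion, allowed to sleep and allocate, must fence the old pages before returning) are only what the discussion above suggests.

#include <linux/list.h>
#include <linux/types.h>

struct gmem_iommu_notifier;

struct gmem_iommu_notifier_ops {
	/*
	 * Called by guest_memfd when [offset, offset + nr_pages) flips
	 * between shared and private; may sleep and allocate memory.
	 * The IOMMU side must have fenced off the old physical pages
	 * by the time it returns.
	 */
	int (*convert)(struct gmem_iommu_notifier *n, pgoff_t offset,
		       unsigned long nr_pages, bool to_private);
};

struct gmem_iommu_notifier {
	const struct gmem_iommu_notifier_ops *ops;
	struct list_head node;	/* linked on the guest_memfd's notifier list */
};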
On Mon, Sep 23, 2024, 10:24 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > From: Vishal Annapurve <vannapurve@google.com> > > Sent: Monday, September 23, 2024 2:34 PM > > > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > > Sent: Saturday, September 21, 2024 5:11 AM > > > > > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> > > wrote: > > > > > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote: > > > > > > IOMMUFD calls get_user_pages() for every mapping which will > > allocate > > > > > > shared memory instead of using private memory managed by the > > KVM > > > > and > > > > > > MEMFD. > > > > > > > > > > Please check this series, it is much more how I would expect this to > > > > > work. Use the guest memfd directly and forget about kvm in the > > iommufd > > > > code: > > > > > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > > > steven.sistare@oracle.com > > > > > > > > > > I would imagine you'd detect the guest memfd when accepting the FD > > and > > > > > then having some different path in the pinning logic to pin and get > > > > > the physical ranges out. > > > > > > > > According to the discussion at KVM microconference around hugepage > > > > support for guest_memfd [1], it's imperative that guest private memory > > > > is not long term pinned. Ideal way to implement this integration would > > > > be to support a notifier that can be invoked by guest_memfd when > > > > memory ranges get truncated so that IOMMU can unmap the > > corresponding > > > > ranges. Such a notifier should also get called during memory > > > > conversion, it would be interesting to discuss how conversion flow > > > > would work in this case. > > > > > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > > > slide 12 from attached presentation) > > > > > > > > > > Most devices don't support I/O page fault hence can only DMA to long > > > term pinned buffers. The notifier might be helpful for in-kernel conversion > > > but as a basic requirement there needs a way for IOMMUFD to call into > > > guest memfd to request long term pinning for a given range. That is > > > how I interpreted "different path" in Jason's comment. > > > > Policy that is being aimed here: > > 1) guest_memfd will pin the pages backing guest memory for all users. > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated > > refcount when asking for the pfn/page from guest_memfd. Users will > > drop the refcount and release the folio lock when they are done > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio > > lock is supposed to be held for short durations. > > 3) Users can assume the pfn is around until they are notified by > > guest_memfd on truncation or memory conversion. > > > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. > > TDX VMs especially need to have secure EPT entries always mapped (once > > faulted-in) while the guest memory ranges are private. > > 'faulted-in' doesn't work for device DMAs (w/o IOPF). faulted-in can be replaced with mapped-in for the context of IOMMU operations. > > and above is based on the assumption that CoCo VM will always > map/pin the private memory pages until a conversion happens. Host physical memory is pinned by the host software stack. 
If you are talking about arch specific logic in KVM, then the expectation again is that guest_memfd will give pinned memory to it's users. > > Conversion is initiated by the guest so ideally the guest is responsible > for not leaving any in-fly DMAs to the page which is being converted. > From this angle it is fine for IOMMUFD to receive a notification from > guest memfd when such a conversion happens. > > But I'm not sure whether the TDX way is architectural or just an > implementation choice which could be changed later, or whether it > applies to other arch. All private memory accesses from TDX VMs go via Secure EPT. If host removes secure EPT entries without guest intervention then linux guest has a logic to generate a panic when it encounters EPT violation on private memory accesses [1]. > > If that behavior cannot be guaranteed, then we may still need a way > for IOMMUFD to request long term pin. [1] https://elixir.bootlin.com/linux/v6.11/source/arch/x86/coco/tdx/tdx.c#L677
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Tuesday, September 24, 2024 12:03 AM > > On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote: > > > From: Vishal Annapurve <vannapurve@google.com> > > > Sent: Monday, September 23, 2024 2:34 PM > > > > > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> > wrote: > > > > > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > > > Sent: Saturday, September 21, 2024 5:11 AM > > > > > > > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> > > > wrote: > > > > > > > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy > wrote: > > > > > > > IOMMUFD calls get_user_pages() for every mapping which will > > > allocate > > > > > > > shared memory instead of using private memory managed by the > > > KVM > > > > > and > > > > > > > MEMFD. > > > > > > > > > > > > Please check this series, it is much more how I would expect this to > > > > > > work. Use the guest memfd directly and forget about kvm in the > > > iommufd > > > > > code: > > > > > > > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > > > > steven.sistare@oracle.com > > > > > > > > > > > > I would imagine you'd detect the guest memfd when accepting the > FD > > > and > > > > > > then having some different path in the pinning logic to pin and get > > > > > > the physical ranges out. > > > > > > > > > > According to the discussion at KVM microconference around > hugepage > > > > > support for guest_memfd [1], it's imperative that guest private > memory > > > > > is not long term pinned. Ideal way to implement this integration > would > > > > > be to support a notifier that can be invoked by guest_memfd when > > > > > memory ranges get truncated so that IOMMU can unmap the > > > corresponding > > > > > ranges. Such a notifier should also get called during memory > > > > > conversion, it would be interesting to discuss how conversion flow > > > > > would work in this case. > > > > > > > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > > > > slide 12 from attached presentation) > > > > > > > > > > > > > Most devices don't support I/O page fault hence can only DMA to long > > > > term pinned buffers. The notifier might be helpful for in-kernel > conversion > > > > but as a basic requirement there needs a way for IOMMUFD to call into > > > > guest memfd to request long term pinning for a given range. That is > > > > how I interpreted "different path" in Jason's comment. > > > > > > Policy that is being aimed here: > > > 1) guest_memfd will pin the pages backing guest memory for all users. > > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated > > > refcount when asking for the pfn/page from guest_memfd. Users will > > > drop the refcount and release the folio lock when they are done > > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio > > > lock is supposed to be held for short durations. > > > 3) Users can assume the pfn is around until they are notified by > > > guest_memfd on truncation or memory conversion. > > > > > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. > > > TDX VMs especially need to have secure EPT entries always mapped > (once > > > faulted-in) while the guest memory ranges are private. > > > > 'faulted-in' doesn't work for device DMAs (w/o IOPF). > > > > and above is based on the assumption that CoCo VM will always > > map/pin the private memory pages until a conversion happens. 
> > > > Conversion is initiated by the guest so ideally the guest is responsible > > for not leaving any in-fly DMAs to the page which is being converted. > > From this angle it is fine for IOMMUFD to receive a notification from > > guest memfd when such a conversion happens. > > Right, I think the expectation is if a guest has active DMA on a page > it is changing between shared/private there is no expectation that the > DMA will succeed. So we don't need page fault, we just need to allow > it to safely fail. > > IMHO we should try to do as best we can here, and the ideal interface > would be a notifier to switch the shared/private pages in some portion > of the guestmemfd. With the idea that iommufd could perhaps do it > atomically. > yes atomic replacement is necessary here, as there might be in-fly DMAs to pages adjacent to the one being converted in the same 1G hunk. Unmap/remap could potentially break it.
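An IOPTE-level sketch of why the replacement has to happen in place rather than by unmap/remap: iopt_split_huge() and iopt_replace_leaf() are invented names for the split and hitless-swap primitives being contemplated, so the other 2M leaves of the 1G hunk stay mapped for the whole operation.

#include <linux/sizes.h>
#include <linux/types.h>

struct io_pagetable;

/* hypothetical primitives, not existing iommufd APIs */
int iopt_split_huge(struct io_pagetable *iopt, unsigned long iova,
		    size_t from_size, size_t to_size);	/* 1G -> 512 x 2M */
int iopt_replace_leaf(struct io_pagetable *iopt, unsigned long iova,
		      phys_addr_t new_phys, size_t size);	/* swap one leaf */

static int convert_one_2m(struct io_pagetable *iopt, unsigned long iova_1g,
			  unsigned long iova_2m, phys_addr_t new_phys)
{
	int ret;

	/* keep the rest of the 1G range live while one 2M piece changes */
	ret = iopt_split_huge(iopt, iova_1g, SZ_1G, SZ_2M);
	if (ret)
		return ret;

	/* hitless swap of the converted 2M chunk only */
	return iopt_replace_leaf(iopt, iova_2m, new_phys, SZ_2M);
}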
> From: Vishal Annapurve <vannapurve@google.com> > Sent: Tuesday, September 24, 2024 4:54 AM > > On Mon, Sep 23, 2024, 10:24 AM Tian, Kevin <kevin.tian@intel.com> wrote: > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > Sent: Monday, September 23, 2024 2:34 PM > > > > > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> > wrote: > > > > > > > > > From: Vishal Annapurve <vannapurve@google.com> > > > > > Sent: Saturday, September 21, 2024 5:11 AM > > > > > > > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> > > > wrote: > > > > > > > > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy > wrote: > > > > > > > IOMMUFD calls get_user_pages() for every mapping which will > > > allocate > > > > > > > shared memory instead of using private memory managed by the > > > KVM > > > > > and > > > > > > > MEMFD. > > > > > > > > > > > > Please check this series, it is much more how I would expect this to > > > > > > work. Use the guest memfd directly and forget about kvm in the > > > iommufd > > > > > code: > > > > > > > > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email- > > > > > steven.sistare@oracle.com > > > > > > > > > > > > I would imagine you'd detect the guest memfd when accepting the > FD > > > and > > > > > > then having some different path in the pinning logic to pin and get > > > > > > the physical ranges out. > > > > > > > > > > According to the discussion at KVM microconference around > hugepage > > > > > support for guest_memfd [1], it's imperative that guest private > memory > > > > > is not long term pinned. Ideal way to implement this integration > would > > > > > be to support a notifier that can be invoked by guest_memfd when > > > > > memory ranges get truncated so that IOMMU can unmap the > > > corresponding > > > > > ranges. Such a notifier should also get called during memory > > > > > conversion, it would be interesting to discuss how conversion flow > > > > > would work in this case. > > > > > > > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the > > > > > slide 12 from attached presentation) > > > > > > > > > > > > > Most devices don't support I/O page fault hence can only DMA to long > > > > term pinned buffers. The notifier might be helpful for in-kernel > conversion > > > > but as a basic requirement there needs a way for IOMMUFD to call into > > > > guest memfd to request long term pinning for a given range. That is > > > > how I interpreted "different path" in Jason's comment. > > > > > > Policy that is being aimed here: > > > 1) guest_memfd will pin the pages backing guest memory for all users. > > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated > > > refcount when asking for the pfn/page from guest_memfd. Users will > > > drop the refcount and release the folio lock when they are done > > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio > > > lock is supposed to be held for short durations. > > > 3) Users can assume the pfn is around until they are notified by > > > guest_memfd on truncation or memory conversion. > > > > > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs. > > > TDX VMs especially need to have secure EPT entries always mapped > (once > > > faulted-in) while the guest memory ranges are private. > > > > 'faulted-in' doesn't work for device DMAs (w/o IOPF). > > faulted-in can be replaced with mapped-in for the context of IOMMU > operations. 
> > > > > and above is based on the assumption that CoCo VM will always > > map/pin the private memory pages until a conversion happens. > > Host physical memory is pinned by the host software stack. If you are > talking about arch specific logic in KVM, then the expectation again > is that guest_memfd will give pinned memory to it's users. sorry it's a typo. I meant the host does it for CoCo VM. > > > > > Conversion is initiated by the guest so ideally the guest is responsible > > for not leaving any in-fly DMAs to the page which is being converted. > > From this angle it is fine for IOMMUFD to receive a notification from > > guest memfd when such a conversion happens. > > > > But I'm not sure whether the TDX way is architectural or just an > > implementation choice which could be changed later, or whether it > > applies to other arch. > > All private memory accesses from TDX VMs go via Secure EPT. If host > removes secure EPT entries without guest intervention then linux guest > has a logic to generate a panic when it encounters EPT violation on > private memory accesses [1]. Yeah, that sounds good. > > > > > If that behavior cannot be guaranteed, then we may still need a way > > for IOMMUFD to request long term pin. > > [1] > https://elixir.bootlin.com/linux/v6.11/source/arch/x86/coco/tdx/tdx.c#L677
On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote: > > IMHO we should try to do as best we can here, and the ideal interface > > would be a notifier to switch the shared/private pages in some portion > > of the guestmemfd. With the idea that iommufd could perhaps do it > > atomically. > > yes atomic replacement is necessary here, as there might be in-fly > DMAs to pages adjacent to the one being converted in the same > 1G hunk. Unmap/remap could potentially break it. Yeah.. This integration is going to be much more complicated than I originally thought. It will need the generic pt stuff, as the hitless page table manipulations we are contemplating here are pretty complex. Jason
On Tue, Sep 24, 2024 at 2:07 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote: > > > IMHO we should try to do as best we can here, and the ideal interface > > > would be a notifier to switch the shared/private pages in some portion > > > of the guestmemfd. With the idea that iommufd could perhaps do it > > > atomically. > > > > yes atomic replacement is necessary here, as there might be in-fly > > DMAs to pages adjacent to the one being converted in the same > > 1G hunk. Unmap/remap could potentially break it. > > Yeah.. This integration is going to be much more complicated than I > originally thought about. It will need the generic pt stuff as the > hitless page table manipulations we are contemplating here are pretty > complex. > > Jason To ensure that I understand your concern properly, the complexity of handling hitless page manipulations is because guests can convert memory at smaller granularity than the physical page size used by the host software. Complexity remains the same irrespective of whether kvm/guest_memfd is notifying iommu driver to unmap converted ranges or if its userspace notifying iommu driver.
On Wed, Sep 25, 2024 at 10:44:12AM +0200, Vishal Annapurve wrote: > On Tue, Sep 24, 2024 at 2:07 PM Jason Gunthorpe <jgg@nvidia.com> wrote: > > > > On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote: > > > > IMHO we should try to do as best we can here, and the ideal interface > > > > would be a notifier to switch the shared/private pages in some portion > > > > of the guestmemfd. With the idea that iommufd could perhaps do it > > > > atomically. > > > > > > yes atomic replacement is necessary here, as there might be in-fly > > > DMAs to pages adjacent to the one being converted in the same > > > 1G hunk. Unmap/remap could potentially break it. > > > > Yeah.. This integration is going to be much more complicated than I > > originally thought about. It will need the generic pt stuff as the > > hitless page table manipulations we are contemplating here are pretty > > complex. > > > > Jason > > To ensure that I understand your concern properly, the complexity of > handling hitless page manipulations is because guests can convert > memory at smaller granularity than the physical page size used by the > host software. Yes You want to, say, break up a 1G private page into 2M chunks and then hitlessly replace a 2M chunk with a shared one. Unlike the MM side you don't really want to just non-present the whole thing and fault it back in. So it is more complex. We already plan to build the 1G -> 2M transformation for dirty tracking, the atomic replace will be a further operation. In the short term you could experiment on this using unmap/remap, but that isn't really going to work well as a solution. You really can't unmap an entire 1G page just to poke a 2M hole into it without disrupting the guest DMA. Fortunately the work needed to resolve this is well in progress, I had not realized there was a guest memfd connection, but this is good to know. It means more people will be intersted in helping :) :) > Complexity remains the same irrespective of whether kvm/guest_memfd > is notifying iommu driver to unmap converted ranges or if its > userspace notifying iommu driver. You don't want to use the verb 'unmap'. What you want is a verb more like 'refresh' which can only make sense in the kernel. 'refresh' would cause the iommu copy of the physical addresses to update to match the current data in the guestmemfd. So the private/shared sequence would be like: 1) Guest asks for private -> shared 2) Guestmemfd figures out what the new physicals should be for the shared 3) Guestmemfd does 'refresh' on all of its notifiers. This will pick up the new shared physical and remove the old private physical from the iommus 4) Guestmemfd can be sure nothing in iommu is touching the old memory. There are some other small considerations that increase complexity, like AMD needs an IOPTE boundary at any transition between shared/private. This is a current active bug in the AMD stuff, fixing it automatically and preserving huge pages via special guestmemfd support sounds very appealing to me. Jason
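Restating that refresh sequence as a sketch from the guest_memfd side; every helper below is a hypothetical placeholder whose only purpose is to show the ordering of steps 1)-4).

#include <linux/types.h>

struct gmem_range {
	pgoff_t off;
	unsigned long nr;
};

/* placeholders for the steps above, not real APIs */
int gmem_populate_shared(struct gmem_range *r);	/* 2) pick new shared physicals */
int gmem_notify_refresh(struct gmem_range *r);	/* 3) iommufd re-reads and swaps IOPTEs */
void gmem_discard_private(struct gmem_range *r);	/* 4) old private pages unreachable */

/* 1) guest asked for private -> shared on this range */
static int gmem_convert_to_shared(struct gmem_range *r)
{
	int ret;

	ret = gmem_populate_shared(r);
	if (ret)
		return ret;

	/*
	 * 'refresh' rather than 'unmap': each registered user picks up the
	 * current physical addresses and atomically replaces the private
	 * IOPTEs; on AMD this also implies an IOPTE boundary at every
	 * shared/private transition.
	 */
	ret = gmem_notify_refresh(r);
	if (ret)
		return ret;

	/* nothing in the IOMMU can reach the old private memory anymore */
	gmem_discard_private(r);
	return 0;
}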
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h index 0ec3509b7e33..fc9239fc94c0 100644 --- a/drivers/iommu/iommufd/io_pagetable.h +++ b/drivers/iommu/iommufd/io_pagetable.h @@ -204,6 +204,9 @@ struct iopt_pages { struct rb_root_cached access_itree; /* Of iopt_area::pages_node */ struct rb_root_cached domains_itree; + + struct kvm *kvm; + gmem_pin_t gmem_pin; }; struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length, diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index 92efe30a8f0d..bd5573ddcd9c 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -10,6 +10,7 @@ #include <linux/uaccess.h> #include <linux/iommu.h> #include <linux/iova_bitmap.h> +#include <linux/iommufd.h> #include <uapi/linux/iommufd.h> #include "../iommu-priv.h" @@ -28,6 +29,9 @@ struct iommufd_ctx { /* Compatibility with VFIO no iommu */ u8 no_iommu_mode; struct iommufd_ioas *vfio_ioas; + + struct kvm *kvm; + gmem_pin_t gmem_pin; }; /* diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ffc3a949f837..a990f604c044 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -9,6 +9,7 @@ #include <linux/types.h> #include <linux/errno.h> #include <linux/err.h> +#include <linux/kvm_types.h> struct device; struct iommufd_device; @@ -57,6 +58,11 @@ void iommufd_ctx_get(struct iommufd_ctx *ictx); #if IS_ENABLED(CONFIG_IOMMUFD) struct iommufd_ctx *iommufd_ctx_from_file(struct file *file); struct iommufd_ctx *iommufd_ctx_from_fd(int fd); +bool iommufd_file_is_valid(struct file *file); +typedef int (*gmem_pin_t)(struct kvm *kvm, void __user *uptr, gfn_t *gfn, + kvm_pfn_t *pfn, int *max_order); +void iommufd_file_set_kvm(struct file *file, struct kvm *kvm, + gmem_pin_t gmem_pin); void iommufd_ctx_put(struct iommufd_ctx *ictx); bool iommufd_ctx_has_group(struct iommufd_ctx *ictx, struct iommu_group *group); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index fdb331b3e0d3..a09a346ba3ca 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1297,6 +1297,7 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); +struct kvm_memory_slot *uptr_to_memslot(struct kvm *kvm, void __user *uptr); bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn); bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn); unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn); @@ -1713,6 +1714,22 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn) return NULL; } +static inline struct kvm_memory_slot * +try_get_memslot_uptr(struct kvm_memory_slot *slot, void __user *uptr) +{ + unsigned long base_upn; + unsigned long upn = (unsigned long) uptr >> PAGE_SHIFT; + + if (!slot) + return NULL; + + base_upn = slot->userspace_addr >> PAGE_SHIFT; + if (upn >= base_upn && upn < base_upn + slot->npages) + return slot; + else + return NULL; +} + /* * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL. * @@ -1741,6 +1758,22 @@ search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx) return approx ? 
slot : NULL; } +static inline struct kvm_memory_slot * +search_memslots_uptr(struct kvm_memslots *slots, void __user *uptr) +{ + unsigned long upn = (unsigned long) uptr >> PAGE_SHIFT; + struct kvm_memslot_iter iter; + + kvm_for_each_memslot_in_gfn_range(&iter, slots, 0, 512ULL * SZ_1T) { + struct kvm_memory_slot *slot = iter.slot; + unsigned long base_upn = slot->userspace_addr >> PAGE_SHIFT; + + if (upn >= base_upn && upn < base_upn + slot->npages) + return slot; + } + return NULL; +} + static inline struct kvm_memory_slot * ____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx) { @@ -1760,6 +1793,25 @@ ____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx) return NULL; } +static inline struct kvm_memory_slot * +____uptr_to_memslot(struct kvm_memslots *slots, void __user *uptr) +{ + struct kvm_memory_slot *slot; + + slot = (struct kvm_memory_slot *)atomic_long_read(&slots->last_used_slot); + slot = try_get_memslot_uptr(slot, uptr); + if (slot) + return slot; + + slot = search_memslots_uptr(slots, uptr); + if (slot) { + atomic_long_set(&slots->last_used_slot, (unsigned long)slot); + return slot; + } + + return NULL; +} + /* * __gfn_to_memslot() and its descendants are here to allow arch code to inline * the lookups in hot paths. gfn_to_memslot() itself isn't here as an inline @@ -1771,6 +1823,12 @@ __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn) return ____gfn_to_memslot(slots, gfn, false); } +static inline struct kvm_memory_slot * +__uptr_to_memslot(struct kvm_memslots *slots, void __user *uptr) +{ + return ____uptr_to_memslot(slots, uptr); +} + static inline unsigned long __gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn) { @@ -2446,6 +2504,8 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) #ifdef CONFIG_KVM_PRIVATE_MEM int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn, int *max_order); +int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn, + kvm_pfn_t *pfn, int *max_order); #else static inline int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, @@ -2454,6 +2514,12 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm, KVM_BUG_ON(1, kvm); return -EIO; } +static inline int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn, + kvm_pfn_t *pfn, int *max_order) +{ + KVM_BUG_ON(1, kvm); + return -EIO; +} #endif /* CONFIG_KVM_PRIVATE_MEM */ #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c index 05fd9d3abf1b..aa7584d4a2b8 100644 --- a/drivers/iommu/iommufd/io_pagetable.c +++ b/drivers/iommu/iommufd/io_pagetable.c @@ -412,6 +412,8 @@ int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt, elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM; elm.start_byte = uptr - elm.pages->uptr; elm.length = length; + elm.pages->kvm = ictx->kvm; + elm.pages->gmem_pin = ictx->gmem_pin; list_add(&elm.next, &pages_list); rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags); diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 83bbd7c5d160..b6039f7c1cce 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -17,6 +17,7 @@ #include <linux/bug.h> #include <uapi/linux/iommufd.h> #include <linux/iommufd.h> +#include <linux/kvm_host.h> #include "io_pagetable.h" #include "iommufd_private.h" @@ -488,6 +489,26 @@ struct iommufd_ctx *iommufd_ctx_from_fd(int fd) } 
EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_fd, IOMMUFD); +bool iommufd_file_is_valid(struct file *file) +{ + return file->f_op == &iommufd_fops; +} +EXPORT_SYMBOL_NS_GPL(iommufd_file_is_valid, IOMMUFD); + +void iommufd_file_set_kvm(struct file *file, struct kvm *kvm, gmem_pin_t gmem_pin) +{ + struct iommufd_ctx *ictx = iommufd_ctx_from_file(file); + + if (WARN_ON(!ictx)) + return; + + ictx->kvm = kvm; + ictx->gmem_pin = gmem_pin; + + iommufd_ctx_put(ictx); +} +EXPORT_SYMBOL_NS_GPL(iommufd_file_set_kvm, IOMMUFD); + /** * iommufd_ctx_put - Put back a reference * @ictx: Context to put back diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c index 117f644a0c5b..d85b6969d9ea 100644 --- a/drivers/iommu/iommufd/pages.c +++ b/drivers/iommu/iommufd/pages.c @@ -52,6 +52,8 @@ #include <linux/highmem.h> #include <linux/kthread.h> #include <linux/iommufd.h> +#include <linux/kvm_host.h> +#include <linux/pagemap.h> #include "io_pagetable.h" #include "double_span.h" @@ -622,6 +624,33 @@ static void batch_from_pages(struct pfn_batch *batch, struct page **pages, break; } +static void memfd_unpin_user_page_range_dirty_lock(struct page *page, + unsigned long npages, + bool make_dirty) +{ + unsigned long i, nr; + + for (i = 0; i < npages; i += nr) { + struct page *next = nth_page(page, i); + struct folio *folio = page_folio(next); + + if (folio_test_large(folio)) + nr = min_t(unsigned int, npages - i, + folio_nr_pages(folio) - + folio_page_idx(folio, next)); + else + nr = 1; + + if (make_dirty && !folio_test_dirty(folio)) { + // FIXME: do we need this? private memory does not swap + folio_lock(folio); + folio_mark_dirty(folio); + folio_unlock(folio); + } + folio_put(folio); + } +} + static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages, unsigned int first_page_off, size_t npages) { @@ -638,9 +667,14 @@ static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages, size_t to_unpin = min_t(size_t, npages, batch->npfns[cur] - first_page_off); - unpin_user_page_range_dirty_lock( - pfn_to_page(batch->pfns[cur] + first_page_off), - to_unpin, pages->writable); + if (pages->kvm) + memfd_unpin_user_page_range_dirty_lock( + pfn_to_page(batch->pfns[cur] + first_page_off), + to_unpin, pages->writable); + else + unpin_user_page_range_dirty_lock( + pfn_to_page(batch->pfns[cur] + first_page_off), + to_unpin, pages->writable); iopt_pages_sub_npinned(pages, to_unpin); cur++; first_page_off = 0; @@ -777,17 +811,51 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user, return -EFAULT; uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE); - if (!remote_mm) - rc = pin_user_pages_fast(uptr, npages, user->gup_flags, - user->upages); - else { - if (!user->locked) { - mmap_read_lock(pages->source_mm); - user->locked = 1; + + if (pages->kvm) { + if (WARN_ON(!pages->gmem_pin)) + return -EFAULT; + + rc = 0; + for (unsigned long i = 0; i < npages; ++i, uptr += PAGE_SIZE) { + gfn_t gfn = 0; + kvm_pfn_t pfn = 0; + int max_order = 0, rc1; + + rc1 = pages->gmem_pin(pages->kvm, (void *) uptr, + &gfn, &pfn, &max_order); + if (rc1 == -EINVAL && i == 0) { + pr_err_once("Must be vfio mmio at gfn=%llx pfn=%llx, skipping\n", + gfn, pfn); + goto the_usual; + } + + if (rc1) { + pr_err("%s: %d %ld %lx -> %lx\n", __func__, + rc1, i, (unsigned long) uptr, (unsigned long) pfn); + rc = rc1; + break; + } + + user->upages[i] = pfn_to_page(pfn); + } + + if (!rc) + rc = npages; + } else { +the_usual: + if (!remote_mm) { + rc = pin_user_pages_fast(uptr, npages, user->gup_flags, + user->upages); + } 
else { + if (!user->locked) { + mmap_read_lock(pages->source_mm); + user->locked = 1; + } + rc = pin_user_pages_remote(pages->source_mm, uptr, npages, + user->gup_flags, user->upages, + &user->locked); } - rc = pin_user_pages_remote(pages->source_mm, uptr, npages, - user->gup_flags, user->upages, - &user->locked); } if (rc <= 0) { if (WARN_ON(!rc)) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index e930014b4bdc..07ff561208fd 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -659,6 +659,46 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot, return folio; } +int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn, + kvm_pfn_t *pfn, int *max_order) +{ + struct kvm_memory_slot *slot = __uptr_to_memslot(kvm_memslots(kvm), + uptr); + bool is_prepared = false; + unsigned long upn_off; + struct folio *folio; + struct file *file; + int r; + + if (!slot) + return -EFAULT; + + file = kvm_gmem_get_file(slot); + if (!file) + return -EFAULT; + + upn_off = ((unsigned long) uptr - slot->userspace_addr) >> PAGE_SHIFT; + *gfn = slot->base_gfn + upn_off; + + folio = __kvm_gmem_get_pfn(file, slot, *gfn, pfn, &is_prepared, max_order, true); + if (IS_ERR(folio)) { + r = PTR_ERR(folio); + goto out; + } + + if (!is_prepared) + r = kvm_gmem_prepare_folio(kvm, slot, *gfn, folio); + + folio_unlock(folio); + if (r < 0) + folio_put(folio); + +out: + fput(file); + return r; +} +EXPORT_SYMBOL_GPL(kvm_gmem_uptr_to_pfn); + int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot, gfn_t gfn, kvm_pfn_t *pfn, int *max_order) { diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c index a4e9db212adc..7c1d859a58e8 100644 --- a/virt/kvm/vfio.c +++ b/virt/kvm/vfio.c @@ -16,6 +16,7 @@ #include <linux/uaccess.h> #include <linux/vfio.h> #include <linux/tsm.h> +#include <linux/iommufd.h> #include "vfio.h" #ifdef CONFIG_SPAPR_TCE_IOMMU @@ -25,6 +26,7 @@ struct kvm_vfio_file { struct list_head node; struct file *file; + bool is_iommufd; #ifdef CONFIG_SPAPR_TCE_IOMMU struct iommu_group *iommu_group; #endif @@ -87,6 +89,36 @@ static bool kvm_vfio_file_is_valid(struct file *file) return ret; } +static bool kvm_iommufd_file_is_valid(struct file *file) +{ + bool (*fn)(struct file *file); + bool ret; + + fn = symbol_get(iommufd_file_is_valid); + if (!fn) + return false; + + ret = fn(file); + + symbol_put(iommufd_file_is_valid); + + return ret; +} + +static void kvm_iommufd_file_set_kvm(struct file *file, struct kvm *kvm, + gmem_pin_t gmem_pin) +{ + void (*fn)(struct file *file, struct kvm *kvm, gmem_pin_t gmem_pin); + + fn = symbol_get(iommufd_file_set_kvm); + if (!fn) + return; + + fn(file, kvm, gmem_pin); + + symbol_put(iommufd_file_set_kvm); +} + static struct vfio_device *kvm_vfio_file_device(struct file *file) { struct vfio_device *(*fn)(struct file *file); @@ -167,7 +199,7 @@ static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd) { struct kvm_vfio *kv = dev->private; struct kvm_vfio_file *kvf; - struct file *filp; + struct file *filp = NULL; int ret = 0; filp = fget(fd); @@ -175,7 +207,7 @@ static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd) return -EBADF; /* Ensure the FD is a vfio FD. 
*/ - if (!kvm_vfio_file_is_valid(filp)) { + if (!kvm_vfio_file_is_valid(filp) && !kvm_iommufd_file_is_valid(filp)) { ret = -EINVAL; goto out_fput; } @@ -196,11 +228,18 @@ static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd) } kvf->file = get_file(filp); + list_add_tail(&kvf->node, &kv->file_list); kvm_arch_start_assignment(dev->kvm); - kvm_vfio_file_set_kvm(kvf->file, dev->kvm); - kvm_vfio_update_coherency(dev); + kvf->is_iommufd = kvm_iommufd_file_is_valid(filp); + + if (kvf->is_iommufd) { + kvm_iommufd_file_set_kvm(kvf->file, dev->kvm, kvm_gmem_uptr_to_pfn); + } else { + kvm_vfio_file_set_kvm(kvf->file, dev->kvm); + kvm_vfio_update_coherency(dev); + } out_unlock: mutex_unlock(&kv->lock); @@ -233,7 +272,11 @@ static int kvm_vfio_file_del(struct kvm_device *dev, unsigned int fd) #ifdef CONFIG_SPAPR_TCE_IOMMU kvm_spapr_tce_release_vfio_group(dev->kvm, kvf); #endif - kvm_vfio_file_set_kvm(kvf->file, NULL); + if (kvf->is_iommufd) + kvm_iommufd_file_set_kvm(kvf->file, NULL, NULL); + else + kvm_vfio_file_set_kvm(kvf->file, NULL); + fput(kvf->file); kfree(kvf); ret = 0; @@ -476,7 +519,10 @@ static void kvm_vfio_release(struct kvm_device *dev) #ifdef CONFIG_SPAPR_TCE_IOMMU kvm_spapr_tce_release_vfio_group(dev->kvm, kvf); #endif - kvm_vfio_file_set_kvm(kvf->file, NULL); + if (kvf->is_iommufd) + kvm_iommufd_file_set_kvm(kvf->file, NULL, NULL); + else + kvm_vfio_file_set_kvm(kvf->file, NULL); fput(kvf->file); list_del(&kvf->node); kfree(kvf);
IOMMUFD calls get_user_pages() for every mapping, which allocates shared memory instead of using the private memory managed by KVM and the MEMFD.

Add support for an IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE API, similar to the already existing VFIO device and VFIO group fds. This registers the KVM in IOMMUFD with a callback to get a pfn for guest private memory so it can later be mapped in the IOMMU. There is no callback for freeing as it is a plain folio_put() for now.

The callback uses uptr to calculate the offset into the KVM memory slot and find the private backing pfn; it largely copies kvm_gmem_get_pfn().

This relies on private pages being pinned beforehand.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 drivers/iommu/iommufd/io_pagetable.h    |  3 +
 drivers/iommu/iommufd/iommufd_private.h |  4 +
 include/linux/iommufd.h                 |  6 ++
 include/linux/kvm_host.h                | 66 ++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    |  2 +
 drivers/iommu/iommufd/main.c            | 21 +++++
 drivers/iommu/iommufd/pages.c           | 94 +++++++++++++++++---
 virt/kvm/guest_memfd.c                  | 40 +++++++++
 virt/kvm/vfio.c                         | 58 ++++++++++--
 9 files changed, 275 insertions(+), 19 deletions(-)
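For reference, the userspace flow is the existing KVM VFIO device uAPI; what the patch changes is only that KVM_DEV_VFIO_FILE_ADD now also accepts an IOMMUFD fd (from /dev/iommu) in addition to VFIO device and group fds. A minimal sketch, with error handling omitted:

#include <sys/ioctl.h>
#include <linux/kvm.h>

static void attach_iommufd_to_kvm(int vm_fd, int iommufd)
{
	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_FILE,
		.attr = KVM_DEV_VFIO_FILE_ADD,
		.addr = (__u64)(unsigned long)&iommufd,
	};

	/* create (or reuse) the vfio-kvm device for this VM */
	ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);

	/* hand the IOMMUFD fd to KVM; with this patch it registers the
	 * kvm pointer and the gmem pin callback in IOMMUFD */
	ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}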