[RFC,12/21] KVM: IOMMUFD: MEMFD: Map private pages

Message ID 20240823132137.336874-13-aik@amd.com (mailing list archive)
State New, archived
Series Secure VFIO, TDISP, SEV TIO

Commit Message

Alexey Kardashevskiy Aug. 23, 2024, 1:21 p.m. UTC
IOMMUFD calls get_user_pages() for every mapping which will allocate
shared memory instead of using private memory managed by the KVM and
MEMFD.

Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE API
similar to already existing VFIO device and VFIO group fds.
This addition registers the KVM in IOMMUFD with a callback to get a pfn
for guest private memory for mapping it later in the IOMMU.
No callback for free as it is generic folio_put() for now.

The aforementioned callback uses uptr to calculate the offset into
the KVM memory slot and find private backing pfn, copies
kvm_gmem_get_pfn() pretty much.

This relies on private pages to be pinned beforehand.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 drivers/iommu/iommufd/io_pagetable.h    |  3 +
 drivers/iommu/iommufd/iommufd_private.h |  4 +
 include/linux/iommufd.h                 |  6 ++
 include/linux/kvm_host.h                | 66 ++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    |  2 +
 drivers/iommu/iommufd/main.c            | 21 +++++
 drivers/iommu/iommufd/pages.c           | 94 +++++++++++++++++---
 virt/kvm/guest_memfd.c                  | 40 +++++++++
 virt/kvm/vfio.c                         | 58 ++++++++++--
 9 files changed, 275 insertions(+), 19 deletions(-)
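
The patch body is not reproduced on this page; as a rough sketch of the
callback described above (the memslot lookup helper is an assumption for
illustration, not the actual patch code), the uptr-to-pfn translation could
look roughly like this:

/*
 * Illustrative sketch only: translate the user address registered for a
 * memslot into a gfn, then resolve the private backing pfn the same way
 * kvm_gmem_get_pfn() does. kvm_find_memslot_by_uptr() is a hypothetical
 * helper standing in for whatever lookup the real patch performs.
 */
static int kvm_gmem_uptr_to_private_pfn(struct kvm *kvm, unsigned long uptr,
                                        kvm_pfn_t *pfn, int *max_order)
{
        struct kvm_memory_slot *slot;
        gfn_t gfn;

        slot = kvm_find_memslot_by_uptr(kvm, uptr);     /* assumed helper */
        if (!slot)
                return -ENOENT;

        gfn = slot->base_gfn + ((uptr - slot->userspace_addr) >> PAGE_SHIFT);

        /* reuse KVM's existing guest_memfd lookup */
        return kvm_gmem_get_pfn(kvm, slot, gfn, pfn, max_order);
}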

Comments

Tian, Kevin Aug. 26, 2024, 8:39 a.m. UTC | #1
+Jason/David

> From: Alexey Kardashevskiy <aik@amd.com>
> Sent: Friday, August 23, 2024 9:21 PM
> 
> IOMMUFD calls get_user_pages() for every mapping which will allocate
> shared memory instead of using private memory managed by the KVM and
> MEMFD.
> 
> Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
> API
> similar to already existing VFIO device and VFIO group fds.
> This addition registers the KVM in IOMMUFD with a callback to get a pfn
> for guest private memory for mapping it later in the IOMMU.
> No callback for free as it is generic folio_put() for now.
> 
> The aforementioned callback uses uptr to calculate the offset into
> the KVM memory slot and find private backing pfn, copies
> kvm_gmem_get_pfn() pretty much.
> 
> This relies on private pages to be pinned beforehand.
> 

There was a related discussion [1] which leans toward the conclusion
that the IOMMU page table for private memory will be managed by
the secure world i.e. the KVM path.

Obviously the work here confirms that it doesn't hold for SEV-TIO
which still expects the host to manage the IOMMU page table.

btw going down this path it's clearer to extend the MAP_DMA
uAPI to accept {gmemfd, offset} than adding a callback to KVM.
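
For the sake of discussion, one possible shape for such an extension (the
structure and command names below are hypothetical, only mirroring the
existing IOMMU_IOAS_MAP layout, not a posted proposal):

/* Hypothetical fd-based variant of struct iommu_ioas_map */
struct iommu_ioas_map_fd {
        __u32 size;
        __u32 flags;
        __u32 ioas_id;
        __s32 fd;                /* guest_memfd (or other memfd) */
        __aligned_u64 fd_offset; /* offset into the fd, replaces user_va */
        __aligned_u64 length;
        __aligned_u64 iova;
};
#define IOMMU_IOAS_MAP_FD _IO(IOMMUFD_TYPE, 0xff)  /* opcode made up here */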
Jason Gunthorpe Aug. 26, 2024, 12:30 p.m. UTC | #2
On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote:
> > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > shared memory instead of using private memory managed by the KVM and
> > MEMFD.
> > 
> > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
> > API
> > similar to already existing VFIO device and VFIO group fds.
> > This addition registers the KVM in IOMMUFD with a callback to get a pfn
> > for guest private memory for mapping it later in the IOMMU.
> > No callback for free as it is generic folio_put() for now.
> > 
> > The aforementioned callback uses uptr to calculate the offset into
> > the KVM memory slot and find private backing pfn, copies
> > kvm_gmem_get_pfn() pretty much.
> > 
> > This relies on private pages to be pinned beforehand.
> > 
> 
> There was a related discussion [1] which leans toward the conclusion
> that the IOMMU page table for private memory will be managed by
> the secure world i.e. the KVM path.

It is still effectively true, AMD's design has duplication, the RMP
table has the mappings to validate GPA and that is all managed in the
secure world.

They just want another copy of that information in the unsecure world
in the form of page tables :\

> btw going down this path it's clearer to extend the MAP_DMA
> uAPI to accept {gmemfd, offset} than adding a callback to KVM.

Yes, we want a DMA MAP from memfd sort of API in general. So it should
go directly to guest memfd with no kvm entanglement.

Jason
Alexey Kardashevskiy Aug. 27, 2024, 2:27 a.m. UTC | #3
On 26/8/24 18:39, Tian, Kevin wrote:
> +Jason/David
> 
>> From: Alexey Kardashevskiy <aik@amd.com>
>> Sent: Friday, August 23, 2024 9:21 PM
>>
>> IOMMUFD calls get_user_pages() for every mapping which will allocate
>> shared memory instead of using private memory managed by the KVM and
>> MEMFD.
>>
>> Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
>> API
>> similar to already existing VFIO device and VFIO group fds.
>> This addition registers the KVM in IOMMUFD with a callback to get a pfn
>> for guest private memory for mapping it later in the IOMMU.
>> No callback for free as it is generic folio_put() for now.
>>
>> The aforementioned callback uses uptr to calculate the offset into
>> the KVM memory slot and find private backing pfn, copies
>> kvm_gmem_get_pfn() pretty much.
>>
>> This relies on private pages to be pinned beforehand.
>>
> 
> There was a related discussion [1] which leans toward the conclusion

Forgot [1]?

> that the IOMMU page table for private memory will be managed by
> the secure world i.e. the KVM path.
> 
> Obviously the work here confirms that it doesn't hold for SEV-TIO
> which still expects the host to manage the IOMMU page table.
> 
> btw going down this path it's clearer to extend the MAP_DMA
> uAPI to accept {gmemfd, offset} than adding a callback to KVM.

Thanks for the comment, makes sense, this should make the interface 
cleaner. It was just a bit messy (but doable nevertheless) at the time 
to push this new mapping flag/type all the way down to pfn_reader_user_pin:

iommufd_ioas_map -> iopt_map_user_pages -> iopt_map_pages -> 
iopt_fill_domains_pages -> iopt_area_fill_domains -> pfn_reader_first -> 
pfn_reader_next -> pfn_reader_fill_span -> pfn_reader_user_pin
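
For illustration, the kind of branch that ends up being needed at the bottom
of that chain might look like this (the context and the guest_memfd helper
are made up, not the actual iommufd internals):

/* Sketch only: choose between GUP and a guest_memfd lookup when pinning */
static int pin_one_user_page(unsigned long uptr, struct file *gmem_file,
                             unsigned long gmem_off, struct page **page)
{
        if (gmem_file)
                /* hypothetical helper resolving a pinned private page */
                return guest_memfd_get_pinned_page(gmem_file, gmem_off, page);

        /* today's path: GUP on the user VA, which faults in shared memory */
        if (pin_user_pages_fast(uptr, 1, FOLL_WRITE | FOLL_LONGTERM, page) != 1)
                return -EFAULT;
        return 0;
}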


Thanks,
Tian, Kevin Aug. 27, 2024, 2:31 a.m. UTC | #4
> From: Alexey Kardashevskiy <aik@amd.com>
> Sent: Tuesday, August 27, 2024 10:28 AM
> 
> On 26/8/24 18:39, Tian, Kevin wrote:
> > +Jason/David
> >
> >> From: Alexey Kardashevskiy <aik@amd.com>
> >> Sent: Friday, August 23, 2024 9:21 PM
> >>
> >> IOMMUFD calls get_user_pages() for every mapping which will allocate
> >> shared memory instead of using private memory managed by the KVM
> and
> >> MEMFD.
> >>
> >> Add support for IOMMUFD fd to the VFIO KVM device's
> KVM_DEV_VFIO_FILE
> >> API
> >> similar to already existing VFIO device and VFIO group fds.
> >> This addition registers the KVM in IOMMUFD with a callback to get a pfn
> >> for guest private memory for mapping it later in the IOMMU.
> >> No callback for free as it is generic folio_put() for now.
> >>
> >> The aforementioned callback uses uptr to calculate the offset into
> >> the KVM memory slot and find private backing pfn, copies
> >> kvm_gmem_get_pfn() pretty much.
> >>
> >> This relies on private pages to be pinned beforehand.
> >>
> >
> > There was a related discussion [1] which leans toward the conclusion
> 
> Forgot [1]?

[1] https://lore.kernel.org/kvm/20240620143406.GJ2494510@nvidia.com/

> 
> > that the IOMMU page table for private memory will be managed by
> > the secure world i.e. the KVM path.
> >
> > Obviously the work here confirms that it doesn't hold for SEV-TIO
> > which still expects the host to manage the IOMMU page table.
> >
> > btw going down this path it's clearer to extend the MAP_DMA
> > uAPI to accept {gmemfd, offset} than adding a callback to KVM.
> 
> Thanks for the comment, makes sense, this should make the interface
> cleaner. It was just a bit messy (but doable nevertheless) at the time
> to push this new mapping flag/type all the way down to
> pfn_reader_user_pin:
> 
> iommufd_ioas_map -> iopt_map_user_pages -> iopt_map_pages ->
> iopt_fill_domains_pages -> iopt_area_fill_domains -> pfn_reader_first ->
> pfn_reader_next -> pfn_reader_fill_span -> pfn_reader_user_pin
> 
> 
> Thanks,
> 
> --
> Alexey
Xu Yilun Aug. 29, 2024, 9:34 a.m. UTC | #5
On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote:
> > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > shared memory instead of using private memory managed by the KVM and
> > > MEMFD.
> > > 
> > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
> > > API
> > > similar to already existing VFIO device and VFIO group fds.
> > > This addition registers the KVM in IOMMUFD with a callback to get a pfn
> > > for guest private memory for mapping it later in the IOMMU.
> > > No callback for free as it is generic folio_put() for now.
> > > 
> > > The aforementioned callback uses uptr to calculate the offset into
> > > the KVM memory slot and find private backing pfn, copies
> > > kvm_gmem_get_pfn() pretty much.
> > > 
> > > This relies on private pages to be pinned beforehand.
> > > 
> > 
> > There was a related discussion [1] which leans toward the conclusion
> > that the IOMMU page table for private memory will be managed by
> > the secure world i.e. the KVM path.
> 
> It is still effectively true, AMD's design has duplication, the RMP
> table has the mappings to validate GPA and that is all managed in the
> secure world.
> 
> They just want another copy of that information in the unsecure world
> in the form of page tables :\
> 
> > btw going down this path it's clearer to extend the MAP_DMA
> > uAPI to accept {gmemfd, offset} than adding a callback to KVM.
> 
> Yes, we want a DMA MAP from memfd sort of API in general. So it should
> go directly to guest memfd with no kvm entanglement.

A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
takes control of the IOMMU mapping in the unsecure world. But as
mentioned, the unsecure world mapping is just a "copy" and has no
generic meaning without the CoCo-VM context. Seems no need for userspace
to repeat the "copy" for IOMMU.

Maybe userspace could just find a way to link the KVM context to IOMMU
at the first place, then let KVM & IOMMU directly negotiate the mapping
at runtime.

Thanks,
Yilun

> 
> Jason
>
Jason Gunthorpe Aug. 29, 2024, 12:15 p.m. UTC | #6
On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote:
> On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote:
> > > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > > shared memory instead of using private memory managed by the KVM and
> > > > MEMFD.
> > > > 
> > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
> > > > API
> > > > similar to already existing VFIO device and VFIO group fds.
> > > > This addition registers the KVM in IOMMUFD with a callback to get a pfn
> > > > for guest private memory for mapping it later in the IOMMU.
> > > > No callback for free as it is generic folio_put() for now.
> > > > 
> > > > The aforementioned callback uses uptr to calculate the offset into
> > > > the KVM memory slot and find private backing pfn, copies
> > > > kvm_gmem_get_pfn() pretty much.
> > > > 
> > > > This relies on private pages to be pinned beforehand.
> > > > 
> > > 
> > > There was a related discussion [1] which leans toward the conclusion
> > > that the IOMMU page table for private memory will be managed by
> > > the secure world i.e. the KVM path.
> > 
> > It is still effectively true, AMD's design has duplication, the RMP
> > table has the mappings to validate GPA and that is all managed in the
> > secure world.
> > 
> > They just want another copy of that information in the unsecure world
> > in the form of page tables :\
> > 
> > > btw going down this path it's clearer to extend the MAP_DMA
> > > uAPI to accept {gmemfd, offset} than adding a callback to KVM.
> > 
> > Yes, we want a DMA MAP from memfd sort of API in general. So it should
> > go directly to guest memfd with no kvm entanglement.
> 
> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
> takes control of the IOMMU mapping in the unsecure world. 

Yes, such is how it seems to work.

It doesn't actually have much control, it has to build a mapping that
matches the RMP table exactly but still has to build it..

> But as mentioned, the unsecure world mapping is just a "copy" and
> has no generic meaning without the CoCo-VM context. Seems no need
> for userspace to repeat the "copy" for IOMMU.

Well, here I say copy from the information already in the PSP secure
world in the form of their RMP, but in a different format.

There is another copy in KVM in its stage 2 translation but..

> Maybe userspace could just find a way to link the KVM context to IOMMU
> at the first place, then let KVM & IOMMU directly negotiate the mapping
> at runtime.

I think the KVM folks have said no to sharing the KVM stage 2 directly
with the iommu. They do too many operations that are incompatible with
the iommu requirements for the stage 2.

If that is true for the confidential compute, I don't know.

Still, continuing to duplicate the two mappings as we have always done
seems like a reasonable place to start and we want a memfd map anyhow
for other reasons:

https://lore.kernel.org/linux-iommu/20240806125602.GJ478300@nvidia.com/

Jason
Alexey Kardashevskiy Aug. 30, 2024, 3:47 a.m. UTC | #7
On 29/8/24 22:15, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote:
>> On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote:
>>> On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote:
>>>>> IOMMUFD calls get_user_pages() for every mapping which will allocate
>>>>> shared memory instead of using private memory managed by the KVM and
>>>>> MEMFD.
>>>>>
>>>>> Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
>>>>> API
>>>>> similar to already existing VFIO device and VFIO group fds.
>>>>> This addition registers the KVM in IOMMUFD with a callback to get a pfn
>>>>> for guest private memory for mapping it later in the IOMMU.
>>>>> No callback for free as it is generic folio_put() for now.
>>>>>
>>>>> The aforementioned callback uses uptr to calculate the offset into
>>>>> the KVM memory slot and find private backing pfn, copies
>>>>> kvm_gmem_get_pfn() pretty much.
>>>>>
>>>>> This relies on private pages to be pinned beforehand.
>>>>>
>>>>
>>>> There was a related discussion [1] which leans toward the conclusion
>>>> that the IOMMU page table for private memory will be managed by
>>>> the secure world i.e. the KVM path.
>>>
>>> It is still effectively true, AMD's design has duplication, the RMP
>>> table has the mappings to validate GPA and that is all managed in the
>>> secure world.
>>>
>>> They just want another copy of that information in the unsecure world
>>> in the form of page tables :\
>>>
>>>> btw going down this path it's clearer to extend the MAP_DMA
>>>> uAPI to accept {gmemfd, offset} than adding a callback to KVM.
>>>
>>> Yes, we want a DMA MAP from memfd sort of API in general. So it should
>>> go directly to guest memfd with no kvm entanglement.
>>
>> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
>> takes control of the IOMMU mapping in the unsecure world.
> 
> Yes, such is how it seems to work.
> 
> It doesn't actually have much control, it has to build a mapping that
> matches the RMP table exactly but still has to build it..


Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to 
host physical, if we skip IOMMU, then how RMP (maps host pfns to guest 
pfns) will help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
Xu Yilun Aug. 30, 2024, 5:20 a.m. UTC | #8
On Thu, Aug 29, 2024 at 09:15:49AM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 29, 2024 at 05:34:52PM +0800, Xu Yilun wrote:
> > On Mon, Aug 26, 2024 at 09:30:24AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 26, 2024 at 08:39:25AM +0000, Tian, Kevin wrote:
> > > > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > > > shared memory instead of using private memory managed by the KVM and
> > > > > MEMFD.
> > > > > 
> > > > > Add support for IOMMUFD fd to the VFIO KVM device's KVM_DEV_VFIO_FILE
> > > > > API
> > > > > similar to already existing VFIO device and VFIO group fds.
> > > > > This addition registers the KVM in IOMMUFD with a callback to get a pfn
> > > > > for guest private memory for mapping it later in the IOMMU.
> > > > > No callback for free as it is generic folio_put() for now.
> > > > > 
> > > > > The aforementioned callback uses uptr to calculate the offset into
> > > > > the KVM memory slot and find private backing pfn, copies
> > > > > kvm_gmem_get_pfn() pretty much.
> > > > > 
> > > > > This relies on private pages to be pinned beforehand.
> > > > > 
> > > > 
> > > > There was a related discussion [1] which leans toward the conclusion
> > > > that the IOMMU page table for private memory will be managed by
> > > > the secure world i.e. the KVM path.
> > > 
> > > It is still effectively true, AMD's design has duplication, the RMP
> > > table has the mappings to validate GPA and that is all managed in the
> > > secure world.
> > > 
> > > They just want another copy of that information in the unsecure world
> > > in the form of page tables :\
> > > 
> > > > btw going down this path it's clearer to extend the MAP_DMA
> > > > uAPI to accept {gmemfd, offset} than adding a callback to KVM.
> > > 
> > > Yes, we want a DMA MAP from memfd sort of API in general. So it should
> > > go directly to guest memfd with no kvm entanglement.
> > 
> > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
> > takes control of the IOMMU mapping in the unsecure world. 
> 
> Yes, such is how it seems to work.
> 
> It doesn't actually have much control, it has to build a mapping that
> matches the RMP table exactly but still has to build it..
> 
> > But as mentioned, the unsecure world mapping is just a "copy" and
> > has no generic meaning without the CoCo-VM context. Seems no need
> > for userspace to repeat the "copy" for IOMMU.
> 
> Well, here I say copy from the information already in the PSP secure
> world in the form fo their RMP, but in a different format.
> 
> There is another copy in KVM in it's stage 2 translation but..
> 
> > Maybe userspace could just find a way to link the KVM context to IOMMU
> > at the first place, then let KVM & IOMMU directly negotiate the mapping
> > at runtime.
> 
> I think the KVM folks have said no to sharing the KVM stage 2 directly
> with the iommu. They do too many operations that are incompatible with
> the iommu requirements for the stage 2.

I kind of agree.

I'm not considering page table sharing for AMD's case. I was just
thinking about how we sync up the secure mapping between KVM & IOMMU
when a page attribute conversion happens: still via userspace, or with
KVM notifying the IOMMU directly.

> 
> If that is true for the confidential compute, I don't know.

For Intel TDX TEE-IO, there may be a different story.

Architecturally the secure IOMMU page table has to be shared with the KVM
secure stage 2 (SEPT). The SEPT is managed by firmware (the TDX Module), and
the TDX Module ensures the SEPT operations are also good for the secure
IOMMU, so there is not much trick to play for SEPT.

> 
> Still, continuing to duplicate the two mappings as we have always done
> seems like a reasonable place to start and we want a memfd map anyhow
> for other reasons:
> 
> https://lore.kernel.org/linux-iommu/20240806125602.GJ478300@nvidia.com/
> 
> Jason
Jason Gunthorpe Aug. 30, 2024, 12:35 p.m. UTC | #9
On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote:
> > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should
> > > > go directly to guest memfd with no kvm entanglement.
> > > 
> > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
> > > takes control of the IOMMU mapping in the unsecure world.
> > 
> > Yes, such is how it seems to work.
> > 
> > It doesn't actually have much control, it has to build a mapping that
> > matches the RMP table exactly but still has to build it..
> 
> Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host
> physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will
> help to map IOVA (in fact, guest pfn) to host pfn? Thanks,

It is the explanation for why this is safe.

For CC the translation of IOVA to physical must not be controlled by
the hypervisor, for security. This can result in translation based
attacks.

AMD is weird because it puts the IOMMU page table in untrusted
hypervisor memory, everyone else seems to put it in the trusted
world's memory.

This works for AMD because they have two copies of this translation,
in two different formats, one in the RMP which is in trusted memory
and one in the IO page table which is not trusted. Yes you can't use
the RMP to do an IOVA lookup, but it does encode exactly the same
information.

Both must agree on the IOVA to physical mapping otherwise the HW
rejects it. Meaning the IOMMU configuration must perfectly match the
RMP configuration.

Jason
Jason Gunthorpe Aug. 30, 2024, 12:36 p.m. UTC | #10
On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:

> > If that is true for the confidential compute, I don't know.
> 
> For Intel TDX TEE-IO, there may be a different story.
> 
> Architechturely the secure IOMMU page table has to share with KVM secure
> stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module
> ensures the SEPT operations good for secure IOMMU, so there is no much
> trick to play for SEPT.

Yes, I think ARM will do the same as well.

From a uAPI perspective we need some way to create a secure vPCI
function linked to a KVM and some IOMMUs will implicitly get a
translation from the secure world and some IOMMUs will need to manage
it in untrusted hypervisor memory.

Jason
Alexey Kardashevskiy Sept. 2, 2024, 1:09 a.m. UTC | #11
On 30/8/24 22:35, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote:
>>>>> Yes, we want a DMA MAP from memfd sort of API in general. So it should
>>>>> go directly to guest memfd with no kvm entanglement.
>>>>
>>>> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
>>>> takes control of the IOMMU mapping in the unsecure world.
>>>
>>> Yes, such is how it seems to work.
>>>
>>> It doesn't actually have much control, it has to build a mapping that
>>> matches the RMP table exactly but still has to build it..
>>
>> Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host
>> physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will
>> help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
> 
> It is the explanation for why this is safe.
> 
> For CC the translation of IOVA to physical must not be controlled by
> the hypervisor, for security. This can result in translation based
> attacks.
> 
> AMD is weird because it puts the IOMMU page table in untrusted
> hypervisor memory, everyone else seems to put it in the trusted
> world's memory.
> 
> This works for AMD because they have two copies of this translation,
> in two different formats, one in the RMP which is in trusted memory
> and one in the IO page table which is not trusted. Yes you can't use
> the RMP to do an IOVA lookup, but it does encode exactly the same
> information.

It is exactly the same because today VFIO does 1:1 IOVA->guest mapping 
on x86 (and some/most other architectures), but that will no longer hold 
once guests get hardware-assisted vIOMMU support. Thanks,

> Both must agree on the IOVA to physical mapping otherwise the HW
> rejects it. Meaning the IOMMU configuration must perfectly match the
> RMP configuration.
Jason Gunthorpe Sept. 2, 2024, 11:52 p.m. UTC | #12
On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote:
> On 30/8/24 22:35, Jason Gunthorpe wrote:
> > On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote:
> > > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should
> > > > > > go directly to guest memfd with no kvm entanglement.
> > > > > 
> > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
> > > > > takes control of the IOMMU mapping in the unsecure world.
> > > > 
> > > > Yes, such is how it seems to work.
> > > > 
> > > > It doesn't actually have much control, it has to build a mapping that
> > > > matches the RMP table exactly but still has to build it..
> > > 
> > > Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host
> > > physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will
> > > help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
> > 
> > It is the explanation for why this is safe.
> > 
> > For CC the translation of IOVA to physical must not be controlled by
> > the hypervisor, for security. This can result in translation based
> > attacks.
> > 
> > AMD is weird because it puts the IOMMU page table in untrusted
> > hypervisor memory, everyone else seems to put it in the trusted
> > world's memory.
> > 
> > This works for AMD because they have two copies of this translation,
> > in two different formats, one in the RMP which is in trusted memory
> > and one in the IO page table which is not trusted. Yes you can't use
> > the RMP to do an IOVA lookup, but it does encode exactly the same
> > information.
> 
> It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on
> x86 (and some/most other architectures) but it is not for when guests get
> hardware-assisted vIOMMU support. 

Yes, you are forced into a nesting IOMMU architecture with CC guests.

Jason
Alexey Kardashevskiy Sept. 3, 2024, 12:03 a.m. UTC | #13
On 3/9/24 09:52, Jason Gunthorpe wrote:
> On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote:
>> On 30/8/24 22:35, Jason Gunthorpe wrote:
>>> On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote:
>>>>>>> Yes, we want a DMA MAP from memfd sort of API in general. So it should
>>>>>>> go directly to guest memfd with no kvm entanglement.
>>>>>>
>>>>>> A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
>>>>>> takes control of the IOMMU mapping in the unsecure world.
>>>>>
>>>>> Yes, such is how it seems to work.
>>>>>
>>>>> It doesn't actually have much control, it has to build a mapping that
>>>>> matches the RMP table exactly but still has to build it..
>>>>
>>>> Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host
>>>> physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will
>>>> help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
>>>
>>> It is the explanation for why this is safe.
>>>
>>> For CC the translation of IOVA to physical must not be controlled by
>>> the hypervisor, for security. This can result in translation based
>>> attacks.
>>>
>>> AMD is weird because it puts the IOMMU page table in untrusted
>>> hypervisor memory, everyone else seems to put it in the trusted
>>> world's memory.
>>>
>>> This works for AMD because they have two copies of this translation,
>>> in two different formats, one in the RMP which is in trusted memory
>>> and one in the IO page table which is not trusted. Yes you can't use
>>> the RMP to do an IOVA lookup, but it does encode exactly the same
>>> information.
>>
>> It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on
>> x86 (and some/most other architectures) but it is not for when guests get
>> hardware-assisted vIOMMU support.
> 
> Yes, you are forced into a nesting IOMMU architecture with CC guests.

Up to two I/O page tables and the RMP table allow both 1:1 and vIOMMU, 
what am I forced into, and by what? Thanks,


> 
> Jason
Jason Gunthorpe Sept. 3, 2024, 12:37 a.m. UTC | #14
On Tue, Sep 03, 2024 at 10:03:53AM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 3/9/24 09:52, Jason Gunthorpe wrote:
> > On Mon, Sep 02, 2024 at 11:09:49AM +1000, Alexey Kardashevskiy wrote:
> > > On 30/8/24 22:35, Jason Gunthorpe wrote:
> > > > On Fri, Aug 30, 2024 at 01:47:40PM +1000, Alexey Kardashevskiy wrote:
> > > > > > > > Yes, we want a DMA MAP from memfd sort of API in general. So it should
> > > > > > > > go directly to guest memfd with no kvm entanglement.
> > > > > > > 
> > > > > > > A uAPI like ioctl(MAP_DMA, gmemfd, offset, iova) still means userspace
> > > > > > > takes control of the IOMMU mapping in the unsecure world.
> > > > > > 
> > > > > > Yes, such is how it seems to work.
> > > > > > 
> > > > > > It doesn't actually have much control, it has to build a mapping that
> > > > > > matches the RMP table exactly but still has to build it..
> > > > > 
> > > > > Sorry, I am missing the point here. IOMMU maps bus addresses (IOVAs) to host
> > > > > physical, if we skip IOMMU, then how RMP (maps host pfns to guest pfns) will
> > > > > help to map IOVA (in fact, guest pfn) to host pfn? Thanks,
> > > > 
> > > > It is the explanation for why this is safe.
> > > > 
> > > > For CC the translation of IOVA to physical must not be controlled by
> > > > the hypervisor, for security. This can result in translation based
> > > > attacks.
> > > > 
> > > > AMD is weird because it puts the IOMMU page table in untrusted
> > > > hypervisor memory, everyone else seems to put it in the trusted
> > > > world's memory.
> > > > 
> > > > This works for AMD because they have two copies of this translation,
> > > > in two different formats, one in the RMP which is in trusted memory
> > > > and one in the IO page table which is not trusted. Yes you can't use
> > > > the RMP to do an IOVA lookup, but it does encode exactly the same
> > > > information.
> > > 
> > > It is exactly the same because today VFIO does 1:1 IOVA->guest mapping on
> > > x86 (and some/most other architectures) but it is not for when guests get
> > > hardware-assisted vIOMMU support.
> > 
> > Yes, you are forced into a nesting IOMMU architecture with CC guests.
> 
> Up to two I/O page tables and the RMP table allow both 1:1 and vIOMMU, what
> am I forced into, and by what?

A key point of CC security is that the hypervisor cannot control the
IOVA translation.

AMD is securing non-viommu by using the RMP to limit the IOVA
translation to 1:1

But, viommu models require a secured non 1:1 mapping.

How do you intend to secure this other than using actual iommu
nesting? Presumably relying on the PSP to program the secure DTE's
GCR3 pointer.

Jason
Dan Williams Sept. 3, 2024, 8:34 p.m. UTC | #15
Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:
> 
> > > If that is true for the confidential compute, I don't know.
> > 
> > For Intel TDX TEE-IO, there may be a different story.
> > 
> > Architechturely the secure IOMMU page table has to share with KVM secure
> > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module
> > ensures the SEPT operations good for secure IOMMU, so there is no much
> > trick to play for SEPT.
> 
> Yes, I think ARM will do the same as well.
> 
> From a uAPI perspective we need some way to create a secure vPCI
> function linked to a KVM and some IOMMUs will implicitly get a
> translation from the secure world and some IOMMUs will need to manage
> it in untrusted hypervisor memory.

Yes. This matches the line of thought I had for the PCI TSM core
interface. It allows establishing the connection to the device's
security manager and facilitates linking that to a KVM context. So part
of the uAPI is charged with managing device-security independent of a
VM, and binding a vPCI device involves a rendezvous of the
secure-world IOMMU setup with secure-world PCI via IOMMU and PCI-TSM
coordination.
Jason Gunthorpe Sept. 4, 2024, 12:02 a.m. UTC | #16
On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:
> > 
> > > > If that is true for the confidential compute, I don't know.
> > > 
> > > For Intel TDX TEE-IO, there may be a different story.
> > > 
> > > Architechturely the secure IOMMU page table has to share with KVM secure
> > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module
> > > ensures the SEPT operations good for secure IOMMU, so there is no much
> > > trick to play for SEPT.
> > 
> > Yes, I think ARM will do the same as well.
> > 
> > From a uAPI perspective we need some way to create a secure vPCI
> > function linked to a KVM and some IOMMUs will implicitly get a
> > translation from the secure world and some IOMMUs will need to manage
> > it in untrusted hypervisor memory.
> 
> Yes. This matches the line of though I had for the PCI TSM core
> interface. 

Okay, but I don't think you should ever be binding any PCI stuff to
KVM without involving VFIO in some way.

VFIO is the security proof that userspace is even permitted to touch
that PCI Device at all.

> It allows establishing the connection to the device's
> security manager and facilitates linking that to a KVM context. So part
> of the uAPI is charged with managing device-security independent of a
> VM

Yes, the PCI core should have stuff for managing device-security for
any bound driver, especially assuming an operational standard kernel
driver only.

> and binding a vPCI device involves a rendezvous of the secure-world
> IOMMU setup with secure-world PCI via IOMMU and PCI-TSM
> coordination.

And this stuff needs to start with VFIO and we can figure out if it is
in the iommufd subcomponent or not.

I'd really like to see a clearly written proposal for what the uAPI
would look like for vPCI function lifecycle and binding that at least
one of the platforms is happy with :)

It would be a good starting point for other platforms to pick at. Try
iommufd first (I'm guessing this is correct) and if it doesn't work
explain why.

Jason
Dan Williams Sept. 4, 2024, 12:59 a.m. UTC | #17
Jason Gunthorpe wrote:
> On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:
> > > 
> > > > > If that is true for the confidential compute, I don't know.
> > > > 
> > > > For Intel TDX TEE-IO, there may be a different story.
> > > > 
> > > > Architechturely the secure IOMMU page table has to share with KVM secure
> > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module
> > > > ensures the SEPT operations good for secure IOMMU, so there is no much
> > > > trick to play for SEPT.
> > > 
> > > Yes, I think ARM will do the same as well.
> > > 
> > > From a uAPI perspective we need some way to create a secure vPCI
> > > function linked to a KVM and some IOMMUs will implicitly get a
> > > translation from the secure world and some IOMMUs will need to manage
> > > it in untrusted hypervisor memory.
> > 
> > Yes. This matches the line of though I had for the PCI TSM core
> > interface. 
> 
> Okay, but I don't think you should ever be binding any PCI stuff to
> KVM without involving VFIO in some way.
> 
> VFIO is the security proof that userspace is even permitted to touch
> that PCI Device at all.

Right, I think VFIO grows a uAPI to make a vPCI device "bind capable"
which ties together the PCI/TSM security context, the assignable device
context and the KVM context.

> > It allows establishing the connection to the device's
> > security manager and facilitates linking that to a KVM context. So part
> > of the uAPI is charged with managing device-security independent of a
> > VM
> 
> Yes, the PCI core should have stuff for managing device-secuirty for
> any bound driver, especially assuming an operational standard kernel
> driver only.
> 
> > and binding a vPCI device involves a rendezvous of the secure-world
> > IOMMU setup with secure-world PCI via IOMMU and PCI-TSM
> > coordination.
> 
> And this stuff needs to start with VFIO and we can figure out of it is
> in the iommufd subcomponent or not.
> 
> I'd really like to see a clearly written proposal for what the uAPI
> would look like for vPCI function lifecycle and binding that at least
> one of the platforms is happy with :)
> 
> It would be a good starting point for other platforms to pick at. Try
> iommufd first (I'm guessing this is correct) and if it doesn't work
> explain why.

Yes, makes sense. Will take a look at that also to prevent more
disconnects on what this PCI device-security community is actually
building.
Tian, Kevin Sept. 5, 2024, 8:29 a.m. UTC | #18
> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Wednesday, September 4, 2024 9:00 AM
> 
> Jason Gunthorpe wrote:
> > On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:
> > > >
> > > > > > If that is true for the confidential compute, I don't know.
> > > > >
> > > > > For Intel TDX TEE-IO, there may be a different story.
> > > > >
> > > > > Architechturely the secure IOMMU page table has to share with KVM
> secure
> > > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX
> Module
> > > > > ensures the SEPT operations good for secure IOMMU, so there is no
> much
> > > > > trick to play for SEPT.
> > > >
> > > > Yes, I think ARM will do the same as well.
> > > >
> > > > From a uAPI perspective we need some way to create a secure vPCI
> > > > function linked to a KVM and some IOMMUs will implicitly get a
> > > > translation from the secure world and some IOMMUs will need to
> manage
> > > > it in untrusted hypervisor memory.
> > >
> > > Yes. This matches the line of though I had for the PCI TSM core
> > > interface.
> >
> > Okay, but I don't think you should ever be binding any PCI stuff to
> > KVM without involving VFIO in some way.
> >
> > VFIO is the security proof that userspace is even permitted to touch
> > that PCI Device at all.
> 
> Right, I think VFIO grows a uAPI to make a vPCI device "bind capable"
> which ties together the PCI/TSM security context, the assignable device
> context and the KVM context.
> 

Could you elaborate why the new uAPI is for making vPCI "bind capable"
instead of doing the actual binding to KVM?
Jason Gunthorpe Sept. 5, 2024, noon UTC | #19
On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > On Tue, Sep 03, 2024 at 01:34:29PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > On Fri, Aug 30, 2024 at 01:20:12PM +0800, Xu Yilun wrote:
> > > > 
> > > > > > If that is true for the confidential compute, I don't know.
> > > > > 
> > > > > For Intel TDX TEE-IO, there may be a different story.
> > > > > 
> > > > > Architechturely the secure IOMMU page table has to share with KVM secure
> > > > > stage 2 (SEPT). The SEPT is managed by firmware (TDX Module), TDX Module
> > > > > ensures the SEPT operations good for secure IOMMU, so there is no much
> > > > > trick to play for SEPT.
> > > > 
> > > > Yes, I think ARM will do the same as well.
> > > > 
> > > > From a uAPI perspective we need some way to create a secure vPCI
> > > > function linked to a KVM and some IOMMUs will implicitly get a
> > > > translation from the secure world and some IOMMUs will need to manage
> > > > it in untrusted hypervisor memory.
> > > 
> > > Yes. This matches the line of though I had for the PCI TSM core
> > > interface. 
> > 
> > Okay, but I don't think you should ever be binding any PCI stuff to
> > KVM without involving VFIO in some way.
> > 
> > VFIO is the security proof that userspace is even permitted to touch
> > that PCI Device at all.
> 
> Right, I think VFIO grows a uAPI to make a vPCI device "bind capable"
> which ties together the PCI/TSM security context, the assignable device
> context and the KVM context.

I think it is more than just "bind capable", I understand various
situations are going to require information passed from the VMM to the
secure world to specify details about what vPCI function will appear
in the VM.

AMD probably needs very little here, but others will need more.

> > It would be a good starting point for other platforms to pick at. Try
> > iommufd first (I'm guessing this is correct) and if it doesn't work
> > explain why.
> 
> Yes, makes sense. Will take a look at that also to prevent more
> disconnects on what this PCI device-security community is actually
> building.

We are already adding a VIOMMU object and that is going to be the
linkage to the KVM side

So we could have new actions:
 - Create a CC VIOMMU with XYZ parameters
 - Create a CC vPCI function on the vIOMMU with XYZ parameters
 - Query stuff?
 - ???
 - Destroy a vPCI function
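
Purely as a strawman, those actions could surface as iommufd commands shaped
something like this (every name and field below is hypothetical):

/* Hypothetical: allocate a CC-capable vIOMMU bound to a CC VM context */
struct iommu_viommu_alloc_cc {
        __u32 size;
        __u32 flags;
        __u32 iommu_dev_id;     /* physical IOMMU / parent object */
        __u32 viommu_type;      /* vendor-specific CC-capable type */
        __s32 kvm_fd;           /* the CC VM this vIOMMU serves */
        __u32 out_viommu_id;
};

/* Hypothetical: create a CC vPCI function (TDI) under that vIOMMU */
struct iommu_vdevice_alloc_cc {
        __u32 size;
        __u32 flags;
        __u32 viommu_id;
        __u32 dev_id;           /* the assigned physical device */
        __aligned_u64 virt_id;  /* vRID / guest BDF the VM will see */
        __u32 out_vdevice_id;
        __u32 __reserved;
};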

Jason
Jason Gunthorpe Sept. 5, 2024, 12:02 p.m. UTC | #20
On Thu, Sep 05, 2024 at 08:29:16AM +0000, Tian, Kevin wrote:

> Could you elaborate why the new uAPI is for making vPCI "bind capable"
> instead of doing the actual binding to KVM? 

I don't see why you'd do any of this in KVM, I mean you could, but you
also don't have to and KVM people don't really know about all the VFIO
parts anyhow.

It is like a bunch of our other viommu stuff: KVM has to share some of
the HW and interfaces with the iommu driver. In this case it would be
the secure VM context and the handles to talk to the trusted world.

Jason
Tian, Kevin Sept. 5, 2024, 12:07 p.m. UTC | #21
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 5, 2024 8:02 PM
> 
> On Thu, Sep 05, 2024 at 08:29:16AM +0000, Tian, Kevin wrote:
> 
> > Could you elaborate why the new uAPI is for making vPCI "bind capable"
> > instead of doing the actual binding to KVM?
> 
> I don't see why you'd do any of this in KVM, I mean you could, but you
> also don't have to and KVM people don't really know about all the VFIO
> parts anyhow.
> 

that's not my point. I was asking why this VFIO uAPI is not for
linking/binding a vPCI device to KVM (not doing it in KVM), but instead
only making it 'bind capable'.
Tian, Kevin Sept. 5, 2024, 12:17 p.m. UTC | #22
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 5, 2024 8:01 PM
> 
> On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > It would be a good starting point for other platforms to pick at. Try
> > > iommufd first (I'm guessing this is correct) and if it doesn't work
> > > explain why.
> >
> > Yes, makes sense. Will take a look at that also to prevent more
> > disconnects on what this PCI device-security community is actually
> > building.
> 
> We are already adding a VIOMMU object and that is going to be the
> linkage to the KVM side
> 
> So we could have new actions:
>  - Create a CC VIOMMU with XYZ parameters
>  - Create a CC vPCI function on the vIOMMU with XYZ parameters
>  - Query stuff?
>  - ???
>  - Destroy a vPCI function
> 

I'll look at the vIOMMU series soon. Just double confirm here.

the so-called vIOMMU object here is the uAPI between iommufd
and userspace, not exactly suggesting a vIOMMU visible to the guest.
Otherwise this solution will be tied to implementations supporting
a trusted vIOMMU.

Then you expect to build CC/vPCI stuff around the vIOMMU
object given it already connects to KVM?
Jason Gunthorpe Sept. 5, 2024, 12:23 p.m. UTC | #23
On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, September 5, 2024 8:01 PM
> > 
> > On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote:
> > > Jason Gunthorpe wrote:
> > > > It would be a good starting point for other platforms to pick at. Try
> > > > iommufd first (I'm guessing this is correct) and if it doesn't work
> > > > explain why.
> > >
> > > Yes, makes sense. Will take a look at that also to prevent more
> > > disconnects on what this PCI device-security community is actually
> > > building.
> > 
> > We are already adding a VIOMMU object and that is going to be the
> > linkage to the KVM side
> > 
> > So we could have new actions:
> >  - Create a CC VIOMMU with XYZ parameters
> >  - Create a CC vPCI function on the vIOMMU with XYZ parameters
> >  - Query stuff?
> >  - ???
> >  - Destroy a vPCI function
> > 
> 
> I'll look at the vIOMMU series soon. Just double confirm here.
> 
> the so-called vIOMMU object here is the uAPI between iommufd
> and userspace. Not exactly suggesting a vIOMMU visible to guest.
> otherwise this solution will be tied to implementations supporting
> trusted vIOMMU.

Right, the viommu object today just wraps elements of HW that are
connected to the VM in some way. It is sort of a security container.

If the VMM goes on to expose a vIOMMU to the guest or not should be
orthogonal.

I expect people will explicitly request a secure vIOMMU if they intend
to expose the vIOMMU to the CC VM. This can trigger any actions in the
trusted world that are required to support a secure vIOMMU.

For instance any secure vIOMMU will require some way for the guest to
issue secure invalidations, and that will require some level of
trusted world involvement. At the minimum the trusted world has to
attest the validity of the vIOMMU to the guest.

> Then you expect to build CC/vPCI stuff around the vIOMMU
> object given it already connects to KVM?

Yes, it is my thought

We already have a binding of devices to the viommu; increasing that to
also include creating vPCI objects in the trusted world is a small
step.

Jason
Dan Williams Sept. 5, 2024, 8:53 p.m. UTC | #24
Jason Gunthorpe wrote:
> On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, September 5, 2024 8:01 PM
> > > 
> > > On Tue, Sep 03, 2024 at 05:59:38PM -0700, Dan Williams wrote:
> > > > Jason Gunthorpe wrote:
> > > > > It would be a good starting point for other platforms to pick at. Try
> > > > > iommufd first (I'm guessing this is correct) and if it doesn't work
> > > > > explain why.
> > > >
> > > > Yes, makes sense. Will take a look at that also to prevent more
> > > > disconnects on what this PCI device-security community is actually
> > > > building.
> > > 
> > > We are already adding a VIOMMU object and that is going to be the
> > > linkage to the KVM side
> > > 
> > > So we could have new actions:
> > >  - Create a CC VIOMMU with XYZ parameters
> > >  - Create a CC vPCI function on the vIOMMU with XYZ parameters
> > >  - Query stuff?
> > >  - ???
> > >  - Destroy a vPCI function
> > > 
> > 
> > I'll look at the vIOMMU series soon. Just double confirm here.
> > 
> > the so-called vIOMMU object here is the uAPI between iommufd
> > and userspace. Not exactly suggesting a vIOMMU visible to guest.
> > otherwise this solution will be tied to implementations supporting
> > trusted vIOMMU.
> 
> Right, the viommu object today just wraps elements of HW that are
> connected to the VM in some way. It is sort of a security container.
> 
> If the VMM goes on to expose a vIOMMU to the guest or not should be
> orthogonal.
> 
> I expect people will explicitly request a secure vIOMMU if they intend
> to expose the vIOMMU to the CC VM. This can trigger any actions in the
> trusted world that are required to support a secure vIOMMU.
> 
> For instance any secure vIOMMU will require some way for the guest to
> issue secure invalidations, and that will require some level of
> trusted world involvement. At the minimum the trusted world has to
> attest the validity of the vIOMMU to the guest.
> 
> > Then you expect to build CC/vPCI stuff around the vIOMMU
> > object given it already connects to KVM?
> 
> Yes, it is my thought
> 
> We alreay have a binding of devices to the viommu, increasing that to
> also include creating vPCI objects in the trusted world is a small
> step.

Sounds reasonable to me.

To answer Kevin's question about what "bind capable" means I need to
clarify what this oversubscribed "bind" term means. "Bind" in the TDISP sense
is transitioning the device to the LOCKED state so that its
configuration is static and ready for the VM to run attestation without
worrying about TOCTOU races.

The VMM is not in a good position to know when the assigned device can
be locked. There are updates, configuration changes, and reset/recovery
scenarios the VM may want to perform before transitioning the device to
the LOCKED state. So, the "bind capable" concept is: pre-condition VFIO
with the context that "this vPCI device is known to VFIO as a device
that can attach to the secure world, all the linkage between VFIO and
the secure world is prepared for a VM to trigger entry into the LOCKED
state, and later the RUN state".

As mentioned in another thread this entry into the LOCKED state is
likely nearly as violent as a hotplug event since the DMA layer currently
has no concept of a device having a foot in the secure world and the
shared world at the same time.
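
To make the "bind capable" idea above concrete, a strawman (every name below
is invented for discussion, not a proposed uAPI) could be a VFIO device
feature payload that only records the linkage and leaves the LOCKED/RUN
transitions to the guest and the TSM:

/* Hypothetical payload for a "TSM bind capable" VFIO_DEVICE_FEATURE */
struct vfio_device_feature_tsm_bind {
        __s32 kvm_fd;            /* CC VM this vPCI device may be bound to */
        __u32 tsm_handle;        /* platform TSM / secure-world context */
        __aligned_u64 guest_rid; /* vRID the guest will see for the TDI */
};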
Jason Gunthorpe Sept. 5, 2024, 11:06 p.m. UTC | #25
On Thu, Sep 05, 2024 at 01:53:27PM -0700, Dan Williams wrote:

> As mentioned in another thread this entry into the LOCKED state is
> likely nearly as violent as hotplug event since the DMA layer currently
> has no concept of a device having a foot in the secure world and the
> shared world at the same time.

There is also something of a complicated situation where the VM also
must validate that it has received the complete and correct device
before it can lock it. Ie that the MMIO ranges belong to the device,
the DMA goes to the right place (the vRID:pRID is trusted), and so on.

Further, the vIOMMU, and its parameters, in the VM must also be
validated and trusted before the VM can lock the device. The VM and
the trusted world must verify they have the exclusive control over the
translation.

This is where AMD's model of having the hypervisor control things gets a
little bit confusing for me. I suppose there will be some way that the
confidential VM asks the trusted world to control the secure DTE such
that it can select between GCR3, BLOCKED and IDENTITY.

Regardless, I think everyone will need some metadata from the vIOMMU
world into the trusted world to do all of this.

Jason
Tian, Kevin Sept. 6, 2024, 2:41 a.m. UTC | #26
> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Friday, September 6, 2024 4:53 AM
> 
> Jason Gunthorpe wrote:
> > On Thu, Sep 05, 2024 at 12:17:14PM +0000, Tian, Kevin wrote:
> > > Then you expect to build CC/vPCI stuff around the vIOMMU
> > > object given it already connects to KVM?
> >
> > Yes, it is my thought
> >
> > We alreay have a binding of devices to the viommu, increasing that to
> > also include creating vPCI objects in the trusted world is a small
> > step.
> 
> Sounds reasonable to me.
> 
> To answer Kevin's question about what "bind capable" means I need to
> clarify this oversubscribed "bind" term means. "Bind" in the TDISP sense
> is transitioning the device to the LOCKED state so that its
> configuration is static and ready for the VM to run attestation without
> worrying about TOCTOU races.
> 
> The VMM is not in a good position to know when the assigned device can
> be locked. There are updates, configuration changes, and reset/recovery
> scenarios the VM may want to perform before transitioning the device to
> the LOCKED state. So, the "bind capable" concept is: pre-condition VFIO
> with the context that "this vPCI device is known to VFIO as a device
> that can attach to the secure world, all the linkage between VFIO and
> the secure world is prepared for a VM to trigger entry into the LOCKED
> state, and later the RUN state".

Okay this makes sense. So the point is that the TDI state machine is
fully managed by the TSM driver so here 'bind capable' is a necessary
preparation step for that state machine to enter the LOCKED state.

> 
> As mentioned in another thread this entry into the LOCKED state is
> likely nearly as violent as hotplug event since the DMA layer currently
> has no concept of a device having a foot in the secure world and the
> shared world at the same time.

Is the DMA layer relevant in this context? Here we are talking about
VFIO/IOMMUFD, which can be hinted by the VMM about whatever side
effect is caused by the entry into the LOCKED state...
Tian, Kevin Sept. 6, 2024, 2:46 a.m. UTC | #27
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, September 6, 2024 7:07 AM
> 
> On Thu, Sep 05, 2024 at 01:53:27PM -0700, Dan Williams wrote:
> 
> > As mentioned in another thread this entry into the LOCKED state is
> > likely nearly as violent as hotplug event since the DMA layer currently
> > has no concept of a device having a foot in the secure world and the
> > shared world at the same time.
> 
> There is also something of a complicated situation where the VM also
> must validate that it has received the complete and correct device
> before it can lock it. Ie that the MMIO ranges belong to the device,
> the DMA goes to the right place (the vRID:pRID is trusted), and so on.
> 
> Further, the vIOMMU, and it's parameters, in the VM must also be
> validated and trusted before the VM can lock the device. The VM and
> the trusted world must verify they have the exclusive control over the
> translation.

Looking at the TDISP spec it's the host that locks the device (the entry
into the LOCKED state, as Dan described) while the VM is allowed
to put the device into the RUN state after validation.

I guess you actually meant the entry into RUN here? Otherwise
there might be some disconnect here.

> 
> This is where AMDs model of having the hypervisor control things get a
> little bit confusing for me. I suppose there will be some way that the
> confidential VM asks the trusted world to control the secure DTE such
> that it can select between GCR3, BLOCKED and IDENTITY.

this matches what I read from the SEV-TIO spec.

> 
> Regardless, I think everyone will need some metadata from the vIOMMU
> world into the trusted world to do all of this.

Agree.
Jason Gunthorpe Sept. 6, 2024, 1:54 p.m. UTC | #28
On Fri, Sep 06, 2024 at 02:46:24AM +0000, Tian, Kevin wrote:
> > Further, the vIOMMU, and it's parameters, in the VM must also be
> > validated and trusted before the VM can lock the device. The VM and
> > the trusted world must verify they have the exclusive control over the
> > translation.
> 
> Looking at the TDISP spec it's the host to lock the device (as Dan
> described the entry into the LOCKED state) while the VM is allowed
> to put the device into the RUN state after validation.
> 
> I guess you actually meant the entry into RUN here? otherwise 
> there might be some disconnect here.

Yeah

Jason
Jason Gunthorpe Sept. 15, 2024, 9:07 p.m. UTC | #29
On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> IOMMUFD calls get_user_pages() for every mapping which will allocate
> shared memory instead of using private memory managed by the KVM and
> MEMFD.

Please check this series, it is much more how I would expect this to
work. Use the guest memfd directly and forget about kvm in the iommufd code:

https://lore.kernel.org/r/1726319158-283074-1-git-send-email-steven.sistare@oracle.com

I would imagine you'd detect the guest memfd when accepting the FD and
then have some different path in the pinning logic to pin and get
the physical ranges out.

Probably we would also need some CAP interaction with the iommu driver
to understand if it can accept private pages to even allow this in the
first place.

Thanks,
Jason
Vishal Annapurve Sept. 20, 2024, 9:10 p.m. UTC | #30
On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > shared memory instead of using private memory managed by the KVM and
> > MEMFD.
>
> Please check this series, it is much more how I would expect this to
> work. Use the guest memfd directly and forget about kvm in the iommufd code:
>
> https://lore.kernel.org/r/1726319158-283074-1-git-send-email-steven.sistare@oracle.com
>
> I would imagine you'd detect the guest memfd when accepting the FD and
> then having some different path in the pinning logic to pin and get
> the physical ranges out.

According to the discussion at KVM microconference around hugepage
support for guest_memfd [1], it's imperative that guest private memory
is not long-term pinned. The ideal way to implement this integration would
be to support a notifier that can be invoked by guest_memfd when
memory ranges get truncated so that IOMMU can unmap the corresponding
ranges. Such a notifier should also get called during memory
conversion, it would be interesting to discuss how conversion flow
would work in this case.

[1] https://lpc.events/event/18/contributions/1764/ (checkout the
slide 12 from attached presentation)

>
> Probably we would also need some CAP interaction with the iommu driver
> to understand if it can accept private pages to even allow this in the
> first place.
>
> Thanks,
> Jason
>
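
A sketch of what such a notifier could look like; the
guest_memfd_notifier names below are assumptions for illustration and
do not exist in this series:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/pagemap.h>

/*
 * Illustrative only: a callback guest_memfd could invoke when a range
 * of the file is truncated or converted between shared and private,
 * so that an IOMMU user can unmap (or refresh) the affected IOVAs.
 */
struct guest_memfd_notifier;

struct guest_memfd_notifier_ops {
        /* Pages in [start, end) are going away or changing attributes. */
        void (*invalidate)(struct guest_memfd_notifier *n,
                           pgoff_t start, pgoff_t end);
};

struct guest_memfd_notifier {
        const struct guest_memfd_notifier_ops *ops;
        struct list_head node;  /* on the guest memfd's notifier list */
};

/* Assumed registration entry points, also not existing APIs: */
int guest_memfd_register_notifier(struct file *gmem_file,
                                  struct guest_memfd_notifier *n);
void guest_memfd_unregister_notifier(struct file *gmem_file,
                                     struct guest_memfd_notifier *n);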
Tian, Kevin Sept. 23, 2024, 5:35 a.m. UTC | #31
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Saturday, September 21, 2024 5:11 AM
> 
> On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > shared memory instead of using private memory managed by the KVM
> and
> > > MEMFD.
> >
> > Please check this series, it is much more how I would expect this to
> > work. Use the guest memfd directly and forget about kvm in the iommufd
> code:
> >
> > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> steven.sistare@oracle.com
> >
> > I would imagine you'd detect the guest memfd when accepting the FD and
> > then having some different path in the pinning logic to pin and get
> > the physical ranges out.
> 
> According to the discussion at KVM microconference around hugepage
> support for guest_memfd [1], it's imperative that guest private memory
> is not long term pinned. Ideal way to implement this integration would
> be to support a notifier that can be invoked by guest_memfd when
> memory ranges get truncated so that IOMMU can unmap the corresponding
> ranges. Such a notifier should also get called during memory
> conversion, it would be interesting to discuss how conversion flow
> would work in this case.
> 
> [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> slide 12 from attached presentation)
> 

Most devices don't support I/O page faults, hence they can only DMA to long
term pinned buffers. The notifier might be helpful for in-kernel conversion,
but as a basic requirement there needs to be a way for IOMMUFD to call into
guest memfd to request long term pinning for a given range. That is
how I interpreted "different path" in Jason's comment.
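
As an interface sketch of that basic requirement (assumed helper names
only, nothing here exists today), a long term pin/unpin pair that
IOMMUFD would call at map and unmap time:

#include <linux/fs.h>
#include <linux/mm_types.h>

/*
 * Illustrative only: a long term pin request from IOMMUFD into guest
 * memfd.  The pinned range must stay resident for the whole lifetime
 * of the IOVA mapping, because devices without I/O page fault support
 * cannot tolerate pages moving underneath an established DMA.
 */
int guest_memfd_pin_range(struct file *gmem_file, pgoff_t start,
                          unsigned long npages, struct page **pages);
void guest_memfd_unpin_range(struct file *gmem_file, pgoff_t start,
                             unsigned long npages);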
Vishal Annapurve Sept. 23, 2024, 6:34 a.m. UTC | #32
On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Vishal Annapurve <vannapurve@google.com>
> > Sent: Saturday, September 21, 2024 5:11 AM
> >
> > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > IOMMUFD calls get_user_pages() for every mapping which will allocate
> > > > shared memory instead of using private memory managed by the KVM
> > and
> > > > MEMFD.
> > >
> > > Please check this series, it is much more how I would expect this to
> > > work. Use the guest memfd directly and forget about kvm in the iommufd
> > code:
> > >
> > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > steven.sistare@oracle.com
> > >
> > > I would imagine you'd detect the guest memfd when accepting the FD and
> > > then having some different path in the pinning logic to pin and get
> > > the physical ranges out.
> >
> > According to the discussion at KVM microconference around hugepage
> > support for guest_memfd [1], it's imperative that guest private memory
> > is not long term pinned. Ideal way to implement this integration would
> > be to support a notifier that can be invoked by guest_memfd when
> > memory ranges get truncated so that IOMMU can unmap the corresponding
> > ranges. Such a notifier should also get called during memory
> > conversion, it would be interesting to discuss how conversion flow
> > would work in this case.
> >
> > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > slide 12 from attached presentation)
> >
>
> Most devices don't support I/O page fault hence can only DMA to long
> term pinned buffers. The notifier might be helpful for in-kernel conversion
> but as a basic requirement there needs a way for IOMMUFD to call into
> guest memfd to request long term pinning for a given range. That is
> how I interpreted "different path" in Jason's comment.

The policy being aimed for here:
1) guest_memfd will pin the pages backing guest memory for all users.
2) kvm_gmem_get_pfn users will get a locked folio with an elevated
refcount when asking for the pfn/page from guest_memfd. Users will
drop the refcount and release the folio lock when they are done
using/installing it (e.g. in KVM EPT/IOMMU PT entries). This folio
lock is supposed to be held for short durations.
3) Users can assume the pfn is around until they are notified by
guest_memfd on truncation or memory conversion.

Step 3 above is already followed by the KVM EPT setup logic for CoCo VMs.
TDX VMs especially need to have secure EPT entries always mapped (once
faulted-in) while the guest memory ranges are private.
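
A consumer-side sketch of points 2) and 3), assuming kvm_gmem_get_pfn()
(with the signature used in this series) returns the pfn with the
backing folio locked and a reference held:

#include <linux/kvm_host.h>
#include <linux/pagemap.h>

/*
 * Sketch only.  Assumes kvm_gmem_get_pfn() returns with the backing
 * folio locked and a reference held, per the policy described above;
 * error handling is trimmed.
 */
static int install_one_private_pfn(struct kvm *kvm,
                                   struct kvm_memory_slot *slot, gfn_t gfn)
{
        struct folio *folio;
        kvm_pfn_t pfn;
        int max_order, rc;

        rc = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &max_order);
        if (rc)
                return rc;

        folio = page_folio(pfn_to_page(pfn));

        /* ... install pfn into the KVM EPT / IOMMU page table here ... */

        /*
         * Done installing: drop the folio lock and the reference.  Per
         * point 3), the pfn remains valid until guest_memfd notifies a
         * truncation or a conversion of this range.
         */
        folio_unlock(folio);
        folio_put(folio);
        return 0;
}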
Tian, Kevin Sept. 23, 2024, 8:24 a.m. UTC | #33
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Monday, September 23, 2024 2:34 PM
> 
> On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Vishal Annapurve <vannapurve@google.com>
> > > Sent: Saturday, September 21, 2024 5:11 AM
> > >
> > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com>
> wrote:
> > > >
> > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > IOMMUFD calls get_user_pages() for every mapping which will
> allocate
> > > > > shared memory instead of using private memory managed by the
> KVM
> > > and
> > > > > MEMFD.
> > > >
> > > > Please check this series, it is much more how I would expect this to
> > > > work. Use the guest memfd directly and forget about kvm in the
> iommufd
> > > code:
> > > >
> > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > steven.sistare@oracle.com
> > > >
> > > > I would imagine you'd detect the guest memfd when accepting the FD
> and
> > > > then having some different path in the pinning logic to pin and get
> > > > the physical ranges out.
> > >
> > > According to the discussion at KVM microconference around hugepage
> > > support for guest_memfd [1], it's imperative that guest private memory
> > > is not long term pinned. Ideal way to implement this integration would
> > > be to support a notifier that can be invoked by guest_memfd when
> > > memory ranges get truncated so that IOMMU can unmap the
> corresponding
> > > ranges. Such a notifier should also get called during memory
> > > conversion, it would be interesting to discuss how conversion flow
> > > would work in this case.
> > >
> > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > slide 12 from attached presentation)
> > >
> >
> > Most devices don't support I/O page fault hence can only DMA to long
> > term pinned buffers. The notifier might be helpful for in-kernel conversion
> > but as a basic requirement there needs a way for IOMMUFD to call into
> > guest memfd to request long term pinning for a given range. That is
> > how I interpreted "different path" in Jason's comment.
> 
> Policy that is being aimed here:
> 1) guest_memfd will pin the pages backing guest memory for all users.
> 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> refcount when asking for the pfn/page from guest_memfd. Users will
> drop the refcount and release the folio lock when they are done
> using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> lock is supposed to be held for short durations.
> 3) Users can assume the pfn is around until they are notified by
> guest_memfd on truncation or memory conversion.
> 
> Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> TDX VMs especially need to have secure EPT entries always mapped (once
> faulted-in) while the guest memory ranges are private.

'faulted-in' doesn't work for device DMAs (w/o IOPF).

And the above is based on the assumption that a CoCo VM will always
map/pin the private memory pages until a conversion happens.

Conversion is initiated by the guest, so ideally the guest is responsible
for not leaving any in-flight DMAs to the page which is being converted.
From this angle it is fine for IOMMUFD to receive a notification from
guest memfd when such a conversion happens.

But I'm not sure whether the TDX way is architectural or just an
implementation choice which could be changed later, or whether it
applies to other archs.

If that behavior cannot be guaranteed, then we may still need a way
for IOMMUFD to request a long term pin.
Jason Gunthorpe Sept. 23, 2024, 4:02 p.m. UTC | #34
On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote:
> > From: Vishal Annapurve <vannapurve@google.com>
> > Sent: Monday, September 23, 2024 2:34 PM
> > 
> > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Vishal Annapurve <vannapurve@google.com>
> > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > >
> > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com>
> > wrote:
> > > > >
> > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > > IOMMUFD calls get_user_pages() for every mapping which will
> > allocate
> > > > > > shared memory instead of using private memory managed by the
> > KVM
> > > > and
> > > > > > MEMFD.
> > > > >
> > > > > Please check this series, it is much more how I would expect this to
> > > > > work. Use the guest memfd directly and forget about kvm in the
> > iommufd
> > > > code:
> > > > >
> > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > > steven.sistare@oracle.com
> > > > >
> > > > > I would imagine you'd detect the guest memfd when accepting the FD
> > and
> > > > > then having some different path in the pinning logic to pin and get
> > > > > the physical ranges out.
> > > >
> > > > According to the discussion at KVM microconference around hugepage
> > > > support for guest_memfd [1], it's imperative that guest private memory
> > > > is not long term pinned. Ideal way to implement this integration would
> > > > be to support a notifier that can be invoked by guest_memfd when
> > > > memory ranges get truncated so that IOMMU can unmap the
> > corresponding
> > > > ranges. Such a notifier should also get called during memory
> > > > conversion, it would be interesting to discuss how conversion flow
> > > > would work in this case.
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > slide 12 from attached presentation)
> > > >
> > >
> > > Most devices don't support I/O page fault hence can only DMA to long
> > > term pinned buffers. The notifier might be helpful for in-kernel conversion
> > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > guest memfd to request long term pinning for a given range. That is
> > > how I interpreted "different path" in Jason's comment.
> > 
> > Policy that is being aimed here:
> > 1) guest_memfd will pin the pages backing guest memory for all users.
> > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > refcount when asking for the pfn/page from guest_memfd. Users will
> > drop the refcount and release the folio lock when they are done
> > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > lock is supposed to be held for short durations.
> > 3) Users can assume the pfn is around until they are notified by
> > guest_memfd on truncation or memory conversion.
> > 
> > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > TDX VMs especially need to have secure EPT entries always mapped (once
> > faulted-in) while the guest memory ranges are private.
> 
> 'faulted-in' doesn't work for device DMAs (w/o IOPF).
> 
> and above is based on the assumption that CoCo VM will always
> map/pin the private memory pages until a conversion happens.
> 
> Conversion is initiated by the guest so ideally the guest is responsible 
> for not leaving any in-fly DMAs to the page which is being converted.
> From this angle it is fine for IOMMUFD to receive a notification from
> guest memfd when such a conversion happens.

Right, I think the expectation is that if a guest has active DMA on a page
it is changing between shared/private, there is no expectation that the
DMA will succeed. So we don't need page faults, we just need to allow
it to fail safely.

IMHO we should try to do the best we can here, and the ideal interface
would be a notifier to switch the shared/private pages in some portion
of the guestmemfd. With the idea that iommufd could perhaps do it
atomically.

When the notifier returns then the old pages are fenced off at the
HW.

But this would have to be a sleepable notifier that can do memory
allocation.

It is actually pretty complicated and we will need a reliable cut
operation to make this work on AMD v1.

Jason
Vishal Annapurve Sept. 23, 2024, 8:53 p.m. UTC | #35
On Mon, Sep 23, 2024, 10:24 AM Tian, Kevin <kevin.tian@intel.com> wrote:
>
> > From: Vishal Annapurve <vannapurve@google.com>
> > Sent: Monday, September 23, 2024 2:34 PM
> >
> > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> > >
> > > > From: Vishal Annapurve <vannapurve@google.com>
> > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > >
> > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com>
> > wrote:
> > > > >
> > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy wrote:
> > > > > > IOMMUFD calls get_user_pages() for every mapping which will
> > allocate
> > > > > > shared memory instead of using private memory managed by the
> > KVM
> > > > and
> > > > > > MEMFD.
> > > > >
> > > > > Please check this series, it is much more how I would expect this to
> > > > > work. Use the guest memfd directly and forget about kvm in the
> > iommufd
> > > > code:
> > > > >
> > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > > steven.sistare@oracle.com
> > > > >
> > > > > I would imagine you'd detect the guest memfd when accepting the FD
> > and
> > > > > then having some different path in the pinning logic to pin and get
> > > > > the physical ranges out.
> > > >
> > > > According to the discussion at KVM microconference around hugepage
> > > > support for guest_memfd [1], it's imperative that guest private memory
> > > > is not long term pinned. Ideal way to implement this integration would
> > > > be to support a notifier that can be invoked by guest_memfd when
> > > > memory ranges get truncated so that IOMMU can unmap the
> > corresponding
> > > > ranges. Such a notifier should also get called during memory
> > > > conversion, it would be interesting to discuss how conversion flow
> > > > would work in this case.
> > > >
> > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > slide 12 from attached presentation)
> > > >
> > >
> > > Most devices don't support I/O page fault hence can only DMA to long
> > > term pinned buffers. The notifier might be helpful for in-kernel conversion
> > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > guest memfd to request long term pinning for a given range. That is
> > > how I interpreted "different path" in Jason's comment.
> >
> > Policy that is being aimed here:
> > 1) guest_memfd will pin the pages backing guest memory for all users.
> > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > refcount when asking for the pfn/page from guest_memfd. Users will
> > drop the refcount and release the folio lock when they are done
> > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > lock is supposed to be held for short durations.
> > 3) Users can assume the pfn is around until they are notified by
> > guest_memfd on truncation or memory conversion.
> >
> > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > TDX VMs especially need to have secure EPT entries always mapped (once
> > faulted-in) while the guest memory ranges are private.
>
> 'faulted-in' doesn't work for device DMAs (w/o IOPF).

'faulted-in' can be replaced with 'mapped-in' in the context of IOMMU operations.

>
> and above is based on the assumption that CoCo VM will always
> map/pin the private memory pages until a conversion happens.

Host physical memory is pinned by the host software stack. If you are
talking about arch-specific logic in KVM, then the expectation again
is that guest_memfd will give pinned memory to its users.

>
> Conversion is initiated by the guest so ideally the guest is responsible
> for not leaving any in-fly DMAs to the page which is being converted.
> From this angle it is fine for IOMMUFD to receive a notification from
> guest memfd when such a conversion happens.
>
> But I'm not sure whether the TDX way is architectural or just an
> implementation choice which could be changed later, or whether it
> applies to other arch.

All private memory accesses from TDX VMs go via Secure EPT. If the host
removes Secure EPT entries without guest intervention, then the Linux guest
has logic to generate a panic when it encounters an EPT violation on
a private memory access [1].

>
> If that behavior cannot be guaranteed, then we may still need a way
> for IOMMUFD to request long term pin.

[1] https://elixir.bootlin.com/linux/v6.11/source/arch/x86/coco/tdx/tdx.c#L677
Tian, Kevin Sept. 23, 2024, 11:52 p.m. UTC | #36
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 24, 2024 12:03 AM
> 
> On Mon, Sep 23, 2024 at 08:24:40AM +0000, Tian, Kevin wrote:
> > > From: Vishal Annapurve <vannapurve@google.com>
> > > Sent: Monday, September 23, 2024 2:34 PM
> > >
> > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > >
> > > > > From: Vishal Annapurve <vannapurve@google.com>
> > > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > > >
> > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com>
> > > wrote:
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy
> wrote:
> > > > > > > IOMMUFD calls get_user_pages() for every mapping which will
> > > allocate
> > > > > > > shared memory instead of using private memory managed by the
> > > KVM
> > > > > and
> > > > > > > MEMFD.
> > > > > >
> > > > > > Please check this series, it is much more how I would expect this to
> > > > > > work. Use the guest memfd directly and forget about kvm in the
> > > iommufd
> > > > > code:
> > > > > >
> > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > > > steven.sistare@oracle.com
> > > > > >
> > > > > > I would imagine you'd detect the guest memfd when accepting the
> FD
> > > and
> > > > > > then having some different path in the pinning logic to pin and get
> > > > > > the physical ranges out.
> > > > >
> > > > > According to the discussion at KVM microconference around
> hugepage
> > > > > support for guest_memfd [1], it's imperative that guest private
> memory
> > > > > is not long term pinned. Ideal way to implement this integration
> would
> > > > > be to support a notifier that can be invoked by guest_memfd when
> > > > > memory ranges get truncated so that IOMMU can unmap the
> > > corresponding
> > > > > ranges. Such a notifier should also get called during memory
> > > > > conversion, it would be interesting to discuss how conversion flow
> > > > > would work in this case.
> > > > >
> > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > > slide 12 from attached presentation)
> > > > >
> > > >
> > > > Most devices don't support I/O page fault hence can only DMA to long
> > > > term pinned buffers. The notifier might be helpful for in-kernel
> conversion
> > > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > > guest memfd to request long term pinning for a given range. That is
> > > > how I interpreted "different path" in Jason's comment.
> > >
> > > Policy that is being aimed here:
> > > 1) guest_memfd will pin the pages backing guest memory for all users.
> > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > > refcount when asking for the pfn/page from guest_memfd. Users will
> > > drop the refcount and release the folio lock when they are done
> > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > > lock is supposed to be held for short durations.
> > > 3) Users can assume the pfn is around until they are notified by
> > > guest_memfd on truncation or memory conversion.
> > >
> > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > > TDX VMs especially need to have secure EPT entries always mapped
> (once
> > > faulted-in) while the guest memory ranges are private.
> >
> > 'faulted-in' doesn't work for device DMAs (w/o IOPF).
> >
> > and above is based on the assumption that CoCo VM will always
> > map/pin the private memory pages until a conversion happens.
> >
> > Conversion is initiated by the guest so ideally the guest is responsible
> > for not leaving any in-fly DMAs to the page which is being converted.
> > From this angle it is fine for IOMMUFD to receive a notification from
> > guest memfd when such a conversion happens.
> 
> Right, I think the expectation is if a guest has active DMA on a page
> it is changing between shared/private there is no expectation that the
> DMA will succeed. So we don't need page fault, we just need to allow
> it to safely fail.
> 
> IMHO we should try to do as best we can here, and the ideal interface
> would be a notifier to switch the shared/private pages in some portion
> of the guestmemfd. With the idea that iommufd could perhaps do it
> atomically.
> 

Yes, atomic replacement is necessary here, as there might be in-flight
DMAs to pages adjacent to the one being converted in the same
1G hunk. Unmap/remap could potentially break them.
Tian, Kevin Sept. 23, 2024, 11:55 p.m. UTC | #37
> From: Vishal Annapurve <vannapurve@google.com>
> Sent: Tuesday, September 24, 2024 4:54 AM
> 
> On Mon, Sep 23, 2024, 10:24 AM Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > > From: Vishal Annapurve <vannapurve@google.com>
> > > Sent: Monday, September 23, 2024 2:34 PM
> > >
> > > On Mon, Sep 23, 2024 at 7:36 AM Tian, Kevin <kevin.tian@intel.com>
> wrote:
> > > >
> > > > > From: Vishal Annapurve <vannapurve@google.com>
> > > > > Sent: Saturday, September 21, 2024 5:11 AM
> > > > >
> > > > > On Sun, Sep 15, 2024 at 11:08 PM Jason Gunthorpe <jgg@nvidia.com>
> > > wrote:
> > > > > >
> > > > > > On Fri, Aug 23, 2024 at 11:21:26PM +1000, Alexey Kardashevskiy
> wrote:
> > > > > > > IOMMUFD calls get_user_pages() for every mapping which will
> > > allocate
> > > > > > > shared memory instead of using private memory managed by the
> > > KVM
> > > > > and
> > > > > > > MEMFD.
> > > > > >
> > > > > > Please check this series, it is much more how I would expect this to
> > > > > > work. Use the guest memfd directly and forget about kvm in the
> > > iommufd
> > > > > code:
> > > > > >
> > > > > > https://lore.kernel.org/r/1726319158-283074-1-git-send-email-
> > > > > steven.sistare@oracle.com
> > > > > >
> > > > > > I would imagine you'd detect the guest memfd when accepting the
> FD
> > > and
> > > > > > then having some different path in the pinning logic to pin and get
> > > > > > the physical ranges out.
> > > > >
> > > > > According to the discussion at KVM microconference around
> hugepage
> > > > > support for guest_memfd [1], it's imperative that guest private
> memory
> > > > > is not long term pinned. Ideal way to implement this integration
> would
> > > > > be to support a notifier that can be invoked by guest_memfd when
> > > > > memory ranges get truncated so that IOMMU can unmap the
> > > corresponding
> > > > > ranges. Such a notifier should also get called during memory
> > > > > conversion, it would be interesting to discuss how conversion flow
> > > > > would work in this case.
> > > > >
> > > > > [1] https://lpc.events/event/18/contributions/1764/ (checkout the
> > > > > slide 12 from attached presentation)
> > > > >
> > > >
> > > > Most devices don't support I/O page fault hence can only DMA to long
> > > > term pinned buffers. The notifier might be helpful for in-kernel
> conversion
> > > > but as a basic requirement there needs a way for IOMMUFD to call into
> > > > guest memfd to request long term pinning for a given range. That is
> > > > how I interpreted "different path" in Jason's comment.
> > >
> > > Policy that is being aimed here:
> > > 1) guest_memfd will pin the pages backing guest memory for all users.
> > > 2) kvm_gmem_get_pfn users will get a locked folio with elevated
> > > refcount when asking for the pfn/page from guest_memfd. Users will
> > > drop the refcount and release the folio lock when they are done
> > > using/installing (e.g. in KVM EPT/IOMMU PT entries) it. This folio
> > > lock is supposed to be held for short durations.
> > > 3) Users can assume the pfn is around until they are notified by
> > > guest_memfd on truncation or memory conversion.
> > >
> > > Step 3 above is already followed by KVM EPT setup logic for CoCo VMs.
> > > TDX VMs especially need to have secure EPT entries always mapped
> (once
> > > faulted-in) while the guest memory ranges are private.
> >
> > 'faulted-in' doesn't work for device DMAs (w/o IOPF).
> 
> faulted-in can be replaced with mapped-in for the context of IOMMU
> operations.
> 
> >
> > and above is based on the assumption that CoCo VM will always
> > map/pin the private memory pages until a conversion happens.
> 
> Host physical memory is pinned by the host software stack. If you are
> talking about arch specific logic in KVM, then the expectation again
> is that guest_memfd will give pinned memory to it's users.

Sorry, it's a typo. I meant the host does it for the CoCo VM.

> 
> >
> > Conversion is initiated by the guest so ideally the guest is responsible
> > for not leaving any in-fly DMAs to the page which is being converted.
> > From this angle it is fine for IOMMUFD to receive a notification from
> > guest memfd when such a conversion happens.
> >
> > But I'm not sure whether the TDX way is architectural or just an
> > implementation choice which could be changed later, or whether it
> > applies to other arch.
> 
> All private memory accesses from TDX VMs go via Secure EPT. If host
> removes secure EPT entries without guest intervention then linux guest
> has a logic to generate a panic when it encounters EPT violation on
> private memory accesses [1].

Yeah, that sounds good.

> 
> >
> > If that behavior cannot be guaranteed, then we may still need a way
> > for IOMMUFD to request long term pin.
> 
> [1]
> https://elixir.bootlin.com/linux/v6.11/source/arch/x86/coco/tdx/tdx.c#L677
Jason Gunthorpe Sept. 24, 2024, 12:07 p.m. UTC | #38
On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote:
> > IMHO we should try to do as best we can here, and the ideal interface
> > would be a notifier to switch the shared/private pages in some portion
> > of the guestmemfd. With the idea that iommufd could perhaps do it
> > atomically.
> 
> yes atomic replacement is necessary here, as there might be in-fly
> DMAs to pages adjacent to the one being converted in the same
> 1G hunk. Unmap/remap could potentially break it.

Yeah.. This integration is going to be much more complicated than I
originally thought. It will need the generic pt stuff, as the
hitless page table manipulations we are contemplating here are pretty
complex.

Jason
Vishal Annapurve Sept. 25, 2024, 8:44 a.m. UTC | #39
On Tue, Sep 24, 2024 at 2:07 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote:
> > > IMHO we should try to do as best we can here, and the ideal interface
> > > would be a notifier to switch the shared/private pages in some portion
> > > of the guestmemfd. With the idea that iommufd could perhaps do it
> > > atomically.
> >
> > yes atomic replacement is necessary here, as there might be in-fly
> > DMAs to pages adjacent to the one being converted in the same
> > 1G hunk. Unmap/remap could potentially break it.
>
> Yeah.. This integration is going to be much more complicated than I
> originally thought about. It will need the generic pt stuff as the
> hitless page table manipulations we are contemplating here are pretty
> complex.
>
> Jason

To ensure that I understand your concern properly: the complexity of
handling hitless page manipulations comes from guests being able to convert
memory at a smaller granularity than the physical page size used by the
host software. The complexity remains the same irrespective of whether it is
kvm/guest_memfd or userspace that notifies the iommu driver to unmap
converted ranges.
Jason Gunthorpe Sept. 25, 2024, 3:41 p.m. UTC | #40
On Wed, Sep 25, 2024 at 10:44:12AM +0200, Vishal Annapurve wrote:
> On Tue, Sep 24, 2024 at 2:07 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 11:52:19PM +0000, Tian, Kevin wrote:
> > > > IMHO we should try to do as best we can here, and the ideal interface
> > > > would be a notifier to switch the shared/private pages in some portion
> > > > of the guestmemfd. With the idea that iommufd could perhaps do it
> > > > atomically.
> > >
> > > yes atomic replacement is necessary here, as there might be in-fly
> > > DMAs to pages adjacent to the one being converted in the same
> > > 1G hunk. Unmap/remap could potentially break it.
> >
> > Yeah.. This integration is going to be much more complicated than I
> > originally thought about. It will need the generic pt stuff as the
> > hitless page table manipulations we are contemplating here are pretty
> > complex.
> >
> > Jason
> 
>  To ensure that I understand your concern properly, the complexity of
> handling hitless page manipulations is because guests can convert
> memory at smaller granularity than the physical page size used by the
> host software.

Yes

You want to, say, break up a 1G private page into 2M chunks and then
hitlessly replace a 2M chunk with a shared one. Unlike the MM side you
don't really want to just non-present the whole thing and fault it
back in. So it is more complex.

We already plan to build the 1G -> 2M transformation for dirty
tracking, the atomic replace will be a further operation.

In the short term you could experiment on this using unmap/remap, but
that isn't really going to work well as a solution. You really can't
unmap an entire 1G page just to poke a 2M hole into it without
disrupting the guest DMA.

Fortunately the work needed to resolve this is well in progress. I had
not realized there was a guest memfd connection, but this is good to
know. It means more people will be interested in helping :) :)

> Complexity remains the same irrespective of whether kvm/guest_memfd
> is notifying iommu driver to unmap converted ranges or if its
> userspace notifying iommu driver.

You don't want to use the verb 'unmap'.

What you want is a verb more like 'refresh' which can only make sense
in the kernel. 'refresh' would cause the iommu copy of the physical
addresses to update to match the current data in the guestmemfd.

So the private/shared sequence would be like:

1) Guest asks for private -> shared
2) Guestmemfd figures out what the new physicals should be for the
   shared
3) Guestmemfd does 'refresh' on all of its notifiers. This will pick
   up the new shared physical and remove the old private physical from
   the iommus
4) Guestmemfd can be sure nothing in iommu is touching the old memory.

There are some other small considerations that increase complexity,
like AMD needing an IOPTE boundary at any transition between
shared/private. This is a currently active bug in the AMD stuff; fixing
it automatically and preserving huge pages via special guestmemfd
support sounds very appealing to me.

Jason
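
A sketch of how such a 'refresh' hook might plug into steps 1)-4) above;
the notifier shape, iommufd_gmem_binding and iommufd_ioas_refresh() are
assumed names for illustration, not existing interfaces:

#include <linux/container_of.h>
#include <linux/pagemap.h>

/* Assumed types and helper, for illustration only: */
struct iommufd_ioas;
struct gmem_refresh_notifier {
        void (*refresh)(struct gmem_refresh_notifier *n,
                        pgoff_t start, pgoff_t end);
};
struct iommufd_gmem_binding {
        struct iommufd_ioas *ioas;
        unsigned long iova_base;        /* IOVA mapped at file offset 0 */
        struct gmem_refresh_notifier notifier;
};
int iommufd_ioas_refresh(struct iommufd_ioas *ioas, unsigned long iova,
                         unsigned long length);

/*
 * Step 3) of the sequence above: guest memfd invokes this after it has
 * worked out the new physicals for [start, end).  The iommu copy of the
 * physical addresses is updated in place (atomically, so in-flight DMA
 * to adjacent pages of the same huge mapping is not disrupted), and the
 * old private (or shared) pages are no longer reachable on return.
 * May sleep and allocate page table memory.
 */
static void gmem_refresh_notify(struct gmem_refresh_notifier *n,
                                pgoff_t start, pgoff_t end)
{
        struct iommufd_gmem_binding *bind =
                container_of(n, struct iommufd_gmem_binding, notifier);

        iommufd_ioas_refresh(bind->ioas,
                             bind->iova_base +
                             ((unsigned long)start << PAGE_SHIFT),
                             (unsigned long)(end - start) << PAGE_SHIFT);
}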
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0ec3509b7e33..fc9239fc94c0 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -204,6 +204,9 @@  struct iopt_pages {
 	struct rb_root_cached access_itree;
 	/* Of iopt_area::pages_node */
 	struct rb_root_cached domains_itree;
+
+	struct kvm *kvm;
+	gmem_pin_t gmem_pin;
 };
 
 struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 92efe30a8f0d..bd5573ddcd9c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -10,6 +10,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/iommu.h>
 #include <linux/iova_bitmap.h>
+#include <linux/iommufd.h>
 #include <uapi/linux/iommufd.h>
 #include "../iommu-priv.h"
 
@@ -28,6 +29,9 @@  struct iommufd_ctx {
 	/* Compatibility with VFIO no iommu */
 	u8 no_iommu_mode;
 	struct iommufd_ioas *vfio_ioas;
+
+	struct kvm *kvm;
+	gmem_pin_t gmem_pin;
 };
 
 /*
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index ffc3a949f837..a990f604c044 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -9,6 +9,7 @@ 
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/err.h>
+#include <linux/kvm_types.h>
 
 struct device;
 struct iommufd_device;
@@ -57,6 +58,11 @@  void iommufd_ctx_get(struct iommufd_ctx *ictx);
 #if IS_ENABLED(CONFIG_IOMMUFD)
 struct iommufd_ctx *iommufd_ctx_from_file(struct file *file);
 struct iommufd_ctx *iommufd_ctx_from_fd(int fd);
+bool iommufd_file_is_valid(struct file *file);
+typedef int (*gmem_pin_t)(struct kvm *kvm, void __user *uptr, gfn_t *gfn,
+			  kvm_pfn_t *pfn, int *max_order);
+void iommufd_file_set_kvm(struct file *file, struct kvm *kvm,
+			  gmem_pin_t gmem_pin);
 void iommufd_ctx_put(struct iommufd_ctx *ictx);
 bool iommufd_ctx_has_group(struct iommufd_ctx *ictx, struct iommu_group *group);
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fdb331b3e0d3..a09a346ba3ca 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1297,6 +1297,7 @@  int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
+struct kvm_memory_slot *uptr_to_memslot(struct kvm *kvm, void __user *uptr);
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
 bool kvm_vcpu_is_visible_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);
 unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn);
@@ -1713,6 +1714,22 @@  try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 		return NULL;
 }
 
+static inline struct kvm_memory_slot *
+try_get_memslot_uptr(struct kvm_memory_slot *slot, void __user *uptr)
+{
+	unsigned long base_upn;
+	unsigned long upn = (unsigned long) uptr >> PAGE_SHIFT;
+
+	if (!slot)
+		return NULL;
+
+	base_upn = slot->userspace_addr >> PAGE_SHIFT;
+	if (upn >= base_upn && upn < base_upn + slot->npages)
+		return slot;
+	else
+		return NULL;
+}
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
@@ -1741,6 +1758,22 @@  search_memslots(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 	return approx ? slot : NULL;
 }
 
+static inline struct kvm_memory_slot *
+search_memslots_uptr(struct kvm_memslots *slots, void __user *uptr)
+{
+	unsigned long upn = (unsigned long) uptr >> PAGE_SHIFT;
+	struct kvm_memslot_iter iter;
+
+	kvm_for_each_memslot_in_gfn_range(&iter, slots, 0, 512ULL * SZ_1T) {
+		struct kvm_memory_slot *slot = iter.slot;
+		unsigned long base_upn = slot->userspace_addr >> PAGE_SHIFT;
+
+		if (upn >= base_upn && upn < base_upn + slot->npages)
+			return slot;
+	}
+	return NULL;
+}
+
 static inline struct kvm_memory_slot *
 ____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 {
@@ -1760,6 +1793,25 @@  ____gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn, bool approx)
 	return NULL;
 }
 
+static inline struct kvm_memory_slot *
+____uptr_to_memslot(struct kvm_memslots *slots, void __user *uptr)
+{
+	struct kvm_memory_slot *slot;
+
+	slot = (struct kvm_memory_slot *)atomic_long_read(&slots->last_used_slot);
+	slot = try_get_memslot_uptr(slot, uptr);
+	if (slot)
+		return slot;
+
+	slot = search_memslots_uptr(slots, uptr);
+	if (slot) {
+		atomic_long_set(&slots->last_used_slot, (unsigned long)slot);
+		return slot;
+	}
+
+	return NULL;
+}
+
 /*
  * __gfn_to_memslot() and its descendants are here to allow arch code to inline
  * the lookups in hot paths.  gfn_to_memslot() itself isn't here as an inline
@@ -1771,6 +1823,12 @@  __gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
 	return ____gfn_to_memslot(slots, gfn, false);
 }
 
+static inline struct kvm_memory_slot *
+__uptr_to_memslot(struct kvm_memslots *slots, void __user *uptr)
+{
+	return ____uptr_to_memslot(slots, uptr);
+}
+
 static inline unsigned long
 __gfn_to_hva_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
@@ -2446,6 +2504,8 @@  static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #ifdef CONFIG_KVM_PRIVATE_MEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn,
+			 kvm_pfn_t *pfn, int *max_order);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2454,6 +2514,12 @@  static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+static inline int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn,
+				       kvm_pfn_t *pfn, int *max_order)
+{
+	KVM_BUG_ON(1, kvm);
+	return -EIO;
+}
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 05fd9d3abf1b..aa7584d4a2b8 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -412,6 +412,8 @@  int iopt_map_user_pages(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
 		elm.pages->account_mode = IOPT_PAGES_ACCOUNT_MM;
 	elm.start_byte = uptr - elm.pages->uptr;
 	elm.length = length;
+	elm.pages->kvm = ictx->kvm;
+	elm.pages->gmem_pin = ictx->gmem_pin;
 	list_add(&elm.next, &pages_list);
 
 	rc = iopt_map_pages(iopt, &pages_list, length, iova, iommu_prot, flags);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 83bbd7c5d160..b6039f7c1cce 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -17,6 +17,7 @@ 
 #include <linux/bug.h>
 #include <uapi/linux/iommufd.h>
 #include <linux/iommufd.h>
+#include <linux/kvm_host.h>
 
 #include "io_pagetable.h"
 #include "iommufd_private.h"
@@ -488,6 +489,26 @@  struct iommufd_ctx *iommufd_ctx_from_fd(int fd)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_ctx_from_fd, IOMMUFD);
 
+bool iommufd_file_is_valid(struct file *file)
+{
+	return file->f_op == &iommufd_fops;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_file_is_valid, IOMMUFD);
+
+void iommufd_file_set_kvm(struct file *file, struct kvm *kvm, gmem_pin_t gmem_pin)
+{
+	struct iommufd_ctx *ictx = iommufd_ctx_from_file(file);
+
+	if (WARN_ON(!ictx))
+		return;
+
+	ictx->kvm = kvm;
+	ictx->gmem_pin = gmem_pin;
+
+	iommufd_ctx_put(ictx);
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_file_set_kvm, IOMMUFD);
+
 /**
  * iommufd_ctx_put - Put back a reference
  * @ictx: Context to put back
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 117f644a0c5b..d85b6969d9ea 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -52,6 +52,8 @@ 
 #include <linux/highmem.h>
 #include <linux/kthread.h>
 #include <linux/iommufd.h>
+#include <linux/kvm_host.h>
+#include <linux/pagemap.h>
 
 #include "io_pagetable.h"
 #include "double_span.h"
@@ -622,6 +624,33 @@  static void batch_from_pages(struct pfn_batch *batch, struct page **pages,
 			break;
 }
 
+static void memfd_unpin_user_page_range_dirty_lock(struct page *page,
+						   unsigned long npages,
+						   bool make_dirty)
+{
+	unsigned long i, nr;
+
+	for (i = 0; i < npages; i += nr) {
+		struct page *next = nth_page(page, i);
+		struct folio *folio = page_folio(next);
+
+		if (folio_test_large(folio))
+			nr = min_t(unsigned int, npages - i,
+				   folio_nr_pages(folio) -
+				   folio_page_idx(folio, next));
+		else
+			nr = 1;
+
+		if (make_dirty && !folio_test_dirty(folio)) {
+			// FIXME: do we need this? private memory does not swap
+			folio_lock(folio);
+			folio_mark_dirty(folio);
+			folio_unlock(folio);
+		}
+		folio_put(folio);
+	}
+}
+
 static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
 			unsigned int first_page_off, size_t npages)
 {
@@ -638,9 +667,14 @@  static void batch_unpin(struct pfn_batch *batch, struct iopt_pages *pages,
 		size_t to_unpin = min_t(size_t, npages,
 					batch->npfns[cur] - first_page_off);
 
-		unpin_user_page_range_dirty_lock(
-			pfn_to_page(batch->pfns[cur] + first_page_off),
-			to_unpin, pages->writable);
+		if (pages->kvm)
+			memfd_unpin_user_page_range_dirty_lock(
+				pfn_to_page(batch->pfns[cur] + first_page_off),
+				to_unpin, pages->writable);
+		else
+			unpin_user_page_range_dirty_lock(
+				pfn_to_page(batch->pfns[cur] + first_page_off),
+				to_unpin, pages->writable);
 		iopt_pages_sub_npinned(pages, to_unpin);
 		cur++;
 		first_page_off = 0;
@@ -777,17 +811,51 @@  static int pfn_reader_user_pin(struct pfn_reader_user *user,
 		return -EFAULT;
 
 	uptr = (uintptr_t)(pages->uptr + start_index * PAGE_SIZE);
-	if (!remote_mm)
-		rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
-					 user->upages);
-	else {
-		if (!user->locked) {
-			mmap_read_lock(pages->source_mm);
-			user->locked = 1;
+
+	if (pages->kvm) {
+		if (WARN_ON(!pages->gmem_pin))
+			return -EFAULT;
+
+		rc = 0;
+		for (unsigned long i = 0; i < npages; ++i, uptr += PAGE_SIZE) {
+			gfn_t gfn = 0;
+			kvm_pfn_t pfn = 0;
+			int max_order = 0, rc1;
+
+			rc1 = pages->gmem_pin(pages->kvm, (void *) uptr,
+					      &gfn, &pfn, &max_order);
+			if (rc1 == -EINVAL && i == 0) {
+				pr_err_once("Must be vfio mmio at gfn=%llx pfn=%llx, skipping\n",
+					    gfn, pfn);
+				goto the_usual;
+			}
+
+			if (rc1) {
+				pr_err("%s: %d %ld %lx -> %lx\n", __func__,
+				       rc1, i, (unsigned long) uptr, (unsigned long) pfn);
+				rc = rc1;
+				break;
+			}
+
+			user->upages[i] = pfn_to_page(pfn);
+		}
+
+		if (!rc)
+			rc = npages;
+	} else {
+the_usual:
+		if (!remote_mm) {
+			rc = pin_user_pages_fast(uptr, npages, user->gup_flags,
+						 user->upages);
+		} else {
+			if (!user->locked) {
+				mmap_read_lock(pages->source_mm);
+				user->locked = 1;
+			}
+			rc = pin_user_pages_remote(pages->source_mm, uptr, npages,
+						   user->gup_flags, user->upages,
+						   &user->locked);
 		}
-		rc = pin_user_pages_remote(pages->source_mm, uptr, npages,
-					   user->gup_flags, user->upages,
-					   &user->locked);
 	}
 	if (rc <= 0) {
 		if (WARN_ON(!rc))
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index e930014b4bdc..07ff561208fd 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -659,6 +659,46 @@  __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 	return folio;
 }
 
+int kvm_gmem_uptr_to_pfn(struct kvm *kvm, void __user *uptr, gfn_t *gfn,
+			 kvm_pfn_t *pfn, int *max_order)
+{
+	struct kvm_memory_slot *slot = __uptr_to_memslot(kvm_memslots(kvm),
+							 uptr);
+	bool is_prepared = false;
+	unsigned long upn_off;
+	struct folio *folio;
+	struct file *file;
+	int r;
+
+	if (!slot)
+		return -EFAULT;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return -EFAULT;
+
+	upn_off = ((unsigned long) uptr - slot->userspace_addr) >> PAGE_SHIFT;
+	*gfn = slot->base_gfn + upn_off;
+
+	folio = __kvm_gmem_get_pfn(file, slot, *gfn, pfn, &is_prepared, max_order, true);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
+
+	if (!is_prepared)
+		r = kvm_gmem_prepare_folio(kvm, slot, *gfn, folio);
+
+	folio_unlock(folio);
+	if (r < 0)
+		folio_put(folio);
+
+out:
+	fput(file);
+	return r;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_uptr_to_pfn);
+
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, int *max_order)
 {
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index a4e9db212adc..7c1d859a58e8 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -16,6 +16,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/tsm.h>
+#include <linux/iommufd.h>
 #include "vfio.h"
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
@@ -25,6 +26,7 @@ 
 struct kvm_vfio_file {
 	struct list_head node;
 	struct file *file;
+	bool is_iommufd;
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 	struct iommu_group *iommu_group;
 #endif
@@ -87,6 +89,36 @@  static bool kvm_vfio_file_is_valid(struct file *file)
 	return ret;
 }
 
+static bool kvm_iommufd_file_is_valid(struct file *file)
+{
+	bool (*fn)(struct file *file);
+	bool ret;
+
+	fn = symbol_get(iommufd_file_is_valid);
+	if (!fn)
+		return false;
+
+	ret = fn(file);
+
+	symbol_put(iommufd_file_is_valid);
+
+	return ret;
+}
+
+static void kvm_iommufd_file_set_kvm(struct file *file, struct kvm *kvm,
+				     gmem_pin_t gmem_pin)
+{
+	void (*fn)(struct file *file, struct kvm *kvm, gmem_pin_t gmem_pin);
+
+	fn = symbol_get(iommufd_file_set_kvm);
+	if (!fn)
+		return;
+
+	fn(file, kvm, gmem_pin);
+
+	symbol_put(iommufd_file_set_kvm);
+}
+
 static struct vfio_device *kvm_vfio_file_device(struct file *file)
 {
 	struct vfio_device *(*fn)(struct file *file);
@@ -167,7 +199,7 @@  static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd)
 {
 	struct kvm_vfio *kv = dev->private;
 	struct kvm_vfio_file *kvf;
-	struct file *filp;
+	struct file *filp = NULL;
 	int ret = 0;
 
 	filp = fget(fd);
@@ -175,7 +207,7 @@  static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd)
 		return -EBADF;
 
 	/* Ensure the FD is a vfio FD. */
-	if (!kvm_vfio_file_is_valid(filp)) {
+	if (!kvm_vfio_file_is_valid(filp) && !kvm_iommufd_file_is_valid(filp)) {
 		ret = -EINVAL;
 		goto out_fput;
 	}
@@ -196,11 +228,18 @@  static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd)
 	}
 
 	kvf->file = get_file(filp);
+
 	list_add_tail(&kvf->node, &kv->file_list);
 
 	kvm_arch_start_assignment(dev->kvm);
-	kvm_vfio_file_set_kvm(kvf->file, dev->kvm);
-	kvm_vfio_update_coherency(dev);
+	kvf->is_iommufd = kvm_iommufd_file_is_valid(filp);
+
+	if (kvf->is_iommufd) {
+		kvm_iommufd_file_set_kvm(kvf->file, dev->kvm, kvm_gmem_uptr_to_pfn);
+	} else {
+		kvm_vfio_file_set_kvm(kvf->file, dev->kvm);
+		kvm_vfio_update_coherency(dev);
+	}
 
 out_unlock:
 	mutex_unlock(&kv->lock);
@@ -233,7 +272,11 @@  static int kvm_vfio_file_del(struct kvm_device *dev, unsigned int fd)
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 		kvm_spapr_tce_release_vfio_group(dev->kvm, kvf);
 #endif
-		kvm_vfio_file_set_kvm(kvf->file, NULL);
+		if (kvf->is_iommufd)
+			kvm_iommufd_file_set_kvm(kvf->file, NULL, NULL);
+		else
+			kvm_vfio_file_set_kvm(kvf->file, NULL);
+
 		fput(kvf->file);
 		kfree(kvf);
 		ret = 0;
@@ -476,7 +519,10 @@  static void kvm_vfio_release(struct kvm_device *dev)
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 		kvm_spapr_tce_release_vfio_group(dev->kvm, kvf);
 #endif
-		kvm_vfio_file_set_kvm(kvf->file, NULL);
+		if (kvf->is_iommufd)
+			kvm_iommufd_file_set_kvm(kvf->file, NULL, NULL);
+		else
+			kvm_vfio_file_set_kvm(kvf->file, NULL);
 		fput(kvf->file);
 		list_del(&kvf->node);
 		kfree(kvf);