Message ID | 20240115103735.132209-1-zhenzhong.duan@intel.com (mailing list archive) |
---|---|
Series | intel_iommu: Enable stage-1 translation |
On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
> Hi,
>
> This series enables stage-1 translation support in intel iommu, which
> we call "modern" mode. In this mode, we don't shadow the guest page
> table for passthrough devices; instead we pass the stage-1 page table
> to the host side to construct a nested domain. We also support emulated
> devices by translating the stage-1 page table. There was some effort
> to enable this feature in the old days, see [1] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in the host IOMMU.
> As the diagram below shows, the guest I/O page table pointer in GPA
> (guest physical address) is passed to the host and used to perform
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> by an IOTLB invalidation.
>
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest I/O page table      |
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush --+
>     '-------------'                        |
>     |             |                        V
>     |             |          I/O page table pointer in GPA
>     '-------------'
> Guest
> ------| Shadow |---------------------------|--------
>       v        v                           v
> Host
>     .-------------.  .------------------------.
>     |   pIOMMU    |  |  FS for GIOVA->GPA     |
>     |             |  '------------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.----------------------------------.
>     |             |   | SS for GPA->HPA, unmanaged domain|
>     |             |   '----------------------------------'
>     '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU.
> * vIOMMU registers PCIIOMMUOps to the PCI subsystem, which VFIO can
>   use to register/unregister an IOMMUDevice object.
> * VFIO registers an IOMMUFDDevice object with vIOMMU at the vfio device
>   realize stage; this is implemented as a prerequisite series[2].
> * vIOMMU calls the IOMMUFDDevice interface callback IOMMUFDDeviceOps
>   to bind/unbind the device to IOMMUFD backed domains, either nested
>   domain or not.
>
> See below diagram:
>
>         VFIO Device                                  Intel IOMMU
>     .-----------------.                         .-------------------.
>     |                 |                         |                   |
>     |       .---------|PCIIOMMUOps              |.-------------.    |
>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>     |       | Device  |------------------------>|| Device list |    |
>     |       .---------|(unset_iommu_device)     |.-------------.    |
>     |                 |                         |       |           |
>     |                 |                         |       V           |
>     |       .---------|         IOMMUFDDeviceOps|  .---------.      |
>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>     |       | link    |<------------------------|  | Device  |      |
>     |       .---------|            (detach_hwpt)|  .---------.      |
>     |                 |                         |       |           |
>     |                 |                         |       ...         |
>     .-----------------.                         .-------------------.
>
> Based on Yi's suggestion, we adopted a new design for managing ioas and
> hwpt that supports multiple iommufd objects and the ERRATA_772415
> case, while trying to share ioas and hwpt whenever possible.
>
> A stage-2 page table can be shared by different devices if there is
> no conflict and the devices link to the same iommufd object, i.e. devices
> under the same host IOMMU can share the same stage-2 page table. If there
> is a conflict, e.g. one device is in non cache coherency mode while
> the others are not, it requires a separate stage-2 page table in
> non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in the stage-2 page table. This series supports creating a VTDIOASContainer
> with no readonly mappings. I'm not sure whether there are rare cases
> where only some IOMMUs on a multi-IOMMU host have ERRATA_772415, but
> this design survives even in that case.
>
> See below example diagram for a full view:
>
>       IntelIOMMUState
>              |
>              V
>     .------------------.    .------------------.    .-------------------.
>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>     .------------------.    .------------------.    .-------------------.
>              |                       |                        |
>              |                       .-->...                  |
>              V                                                V
>     .-------------------.    .-------------------.        .---------------.
>     | VTDS2Hwpt(CC)     |--->| VTDS2Hwpt(non-CC) |-->...  | VTDS2Hwpt(CC) |-->...
>     .-------------------.    .-------------------.        .---------------.
>           |          |                  |                       |
>           |          |                  |                       |
>     .-----------.  .-----------.  .------------.          .------------.
>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |          | IOMMUFD    |
>     | Device(CC)|  | Device(CC)|  | Device     |          | Device(CC) |
>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |          | (errata)   |
>     |           |  |           |  | (iommufd0) |          | (iommufd0) |
>     .-----------.  .-----------.  .------------.          .------------.
>
> This series is also prerequisite work for vSVA, i.e. sharing
> guest application address space with passthrough devices.
>
> To enable "modern" mode, one only needs to add "x-scalable-mode=modern",
> e.g. -device intel-iommu,x-scalable-mode=modern,...
>
> A passthrough device should use the iommufd backend to work in "modern" mode,
> e.g. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If the host doesn't support nested translation, qemu will fail
> with an unsupported report.
>
> Test done:
> - devices hotplug/unplug
> - different devices linked to different iommufds
>
> PATCH1-2: Some preparing work to update header and IOMMUFD uAPI
> PATCH3-4: Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
> PATCH5: Introduce a placeholder variable for scalable modern mode
> PATCH6: Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
> PATCH7-22: Implement first stage page table for passthrough and emulated device

Can we split the series and start from the emulated devices (and have
a qtest for that)? This might help for reviewing.

Thanks
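
For readers following the thread, the GIOVA->GPA->HPA chain in the nested translation diagram can be illustrated with a toy model. This is only a sketch and is not code from the series: the flat dictionaries stand in for real multi-level page tables, the names (`walk`, `nested_translate`, `TranslationFault`) and the example addresses are made up for illustration, and the model ignores that the hardware also translates the stage-1 table pointers themselves through stage 2.

```python
# Toy model of nested (dual-stage) translation; illustrative only.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TranslationFault(Exception):
    """Raised when a stage has no mapping for the requested page."""


def walk(table, addr, stage):
    """Translate addr through a flat {page number: page number} map."""
    pfn, offset = addr >> PAGE_SHIFT, addr & PAGE_MASK
    if pfn not in table:
        raise TranslationFault(f"{stage}: no mapping for {addr:#x}")
    return (table[pfn] << PAGE_SHIFT) | offset


def nested_translate(first_stage, second_stage, giova):
    """First stage (guest-owned) GIOVA->GPA, then second stage (host-owned) GPA->HPA."""
    gpa = walk(first_stage, giova, "FS")
    return walk(second_stage, gpa, "SS")


fs = {0x10: 0x200}    # hypothetical guest mapping: GIOVA page 0x10 -> GPA page 0x200
ss = {0x200: 0x7F3}   # hypothetical host mapping:  GPA page 0x200 -> HPA page 0x7f3

print(hex(nested_translate(fs, ss, 0x10ABC)))  # prints 0x7f3abc
```

In this picture the guest only ever edits `fs`, which is why a change to a present mapping must be followed by an IOTLB invalidation: nothing in the host observes the edit otherwise.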
>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
>
>On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Hi,
>>
>> This series enables stage-1 translation support in intel iommu which
>> we called "modern" mode.
>> [...]
>> PATCH7-22: Implement first stage page table for passthrough and
>> emulated device
>
>Can we split the series and start from the emulated devices (and have
>a qtest for that)? This might help for reviewing.

Sure, will do in rfcv2.

Thanks
Zhenzhong
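
The ioas/hwpt sharing rule described in the cover letter can be sketched as a find-or-create lookup. This is a hedged model, not the series' implementation: the class and field names below are hypothetical stand-ins for VTDIOASContainer and VTDS2Hwpt, and it assumes a container is reused only when both the iommufd object and the "no readonly mappings" requirement match exactly, and a stage-2 hwpt only when the cache coherency mode matches.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    iommufd: str                  # id of the iommufd object the device links to
    cache_coherent: bool = True
    errata: bool = False          # host IOMMU is affected by ERRATA_772415


@dataclass
class S2Hwpt:                     # hypothetical stand-in for VTDS2Hwpt
    cache_coherent: bool
    devices: list = field(default_factory=list)


@dataclass
class IOASContainer:              # hypothetical stand-in for VTDIOASContainer
    iommufd: str
    rw_only: bool                 # True: read-only mappings are not allowed
    hwpts: list = field(default_factory=list)


def attach(containers, dev):
    """Find or create the container and stage-2 hwpt for dev, then attach it."""
    container = next((c for c in containers
                      if c.iommufd == dev.iommufd and c.rw_only == dev.errata), None)
    if container is None:
        container = IOASContainer(dev.iommufd, dev.errata)
        containers.append(container)
    hwpt = next((h for h in container.hwpts
                 if h.cache_coherent == dev.cache_coherent), None)
    if hwpt is None:
        hwpt = S2Hwpt(dev.cache_coherent)
        container.hwpts.append(hwpt)
    hwpt.devices.append(dev.name)
    return container, hwpt


containers = []
for dev in (Device("dev0", "iommufd0"),
            Device("dev1", "iommufd0"),
            Device("dev2", "iommufd0", cache_coherent=False),
            Device("dev3", "iommufd1"),
            Device("dev4", "iommufd0", errata=True)):
    attach(containers, dev)

for c in containers:
    print(c.iommufd, "RW only" if c.rw_only else "RW&RO",
          [(h.cache_coherent, h.devices) for h in c.hwpts])
```

With these five made-up devices the model ends up with three containers, matching the shape of the example diagram: dev0 and dev1 share one CC hwpt, dev2 gets its own non-CC hwpt in the same container, and dev4 lands in a separate RW-only container.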