Message ID: 20220414104710.28534-1-yi.l.liu@intel.com
Series: vfio: Adopt iommufd
Hi,
Thanks for the work!
On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
> - More tests
I did a quick test on my ARM64 platform, using "iommu=smmuv3"
string. The behaviors are different between using default and
using legacy "iommufd=off".
The legacy pathway exits the VM with:
vfio 0002:01:00.0:
failed to setup container for group 1:
memory listener initialization failed:
Region smmuv3-iommu-memory-region-16-0:
device 00.02.0 requires iommu MAP notifier which is not currently supported
while the iommufd pathway started the VM but reported errors
from host kernel about address translation failures, probably
because of accessing unmapped addresses.
I found iommufd pathway also calls error_propagate_prepend()
to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
doesn't get a chance to print errp out. Perhaps there should
be a final error check somewhere to exit?
Nic
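
A minimal sketch of the kind of final check being suggested here, using QEMU's
generic Error API; the function and variable names are invented for
illustration and this is not code from the series:

    #include "qemu/osdep.h"
    #include "qapi/error.h"

    /* Illustrative only: turn an error accumulated by the memory listener
     * into a hard failure instead of leaving it unreported. */
    static int vfio_check_listener_error(Error *listener_err, Error **errp)
    {
        if (listener_err) {
            error_propagate_prepend(errp, listener_err,
                                    "memory listener initialization failed: ");
            return -EINVAL;     /* caller fails realize and the VM exits */
        }
        return 0;
    }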
Hi Nicolin,

On 4/15/22 10:37 AM, Nicolin Chen wrote:
> Hi,
>
> Thanks for the work!
>
> On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
>
>> - More tests
> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> string. The behaviors are different between using default and
> using legacy "iommufd=off".
>
> The legacy pathway exits the VM with:
> vfio 0002:01:00.0:
> failed to setup container for group 1:
> memory listener initialization failed:
> Region smmuv3-iommu-memory-region-16-0:
> device 00.02.0 requires iommu MAP notifier which is not currently supported
>
> while the iommufd pathway started the VM but reported errors
> from host kernel about address translation failures, probably
> because of accessing unmapped addresses.
>
> I found iommufd pathway also calls error_propagate_prepend()
> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> doesn't get a chance to print errp out. Perhaps there should
> be a final error check somewhere to exit?

Thank you for giving it a try.

vsmmuv3 + vfio is not supported as we miss the HW nested stage support
and SMMU does not support cache mode. If you want to test viommu on ARM
you shall test virtio-iommu+vfio. This should work but this is not yet
tested.

I pushed a fix for the error notification issue:
qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Thanks

Eric

> Nic
> From: Liu, Yi L <yi.l.liu@intel.com> > Sent: Thursday, April 14, 2022 6:47 PM > > With the introduction of iommufd[1], the linux kernel provides a generic > interface for userspace drivers to propagate their DMA mappings to kernel > for assigned devices. This series does the porting of the VFIO devices > onto the /dev/iommu uapi and let it coexist with the legacy implementation. > Other devices like vpda, vfio mdev and etc. are not considered yet. vfio mdev has no special support in Qemu. Just that it's not supported by iommufd yet thus can only be operated in legacy container interface at this point. Later once it's supported by the kernel suppose no additional enabling work is required for mdev in Qemu. > > For vfio devices, the new interface is tied with device fd and iommufd > as the iommufd solution is device-centric. This is different from legacy > vfio which is group-centric. To support both interfaces in QEMU, this > series introduces the iommu backend concept in the form of different > container classes. The existing vfio container is named legacy container > (equivalent with legacy iommu backend in this series), while the new > iommufd based container is named as iommufd container (may also be > mentioned > as iommufd backend in this series). The two backend types have their own > way to setup secure context and dma management interface. Below diagram > shows how it looks like with both BEs. > > VFIO AddressSpace/Memory > +-------+ +----------+ +-----+ +-----+ > | pci | | platform | | ap | | ccw | > +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ > | | | | | AddressSpace | > | | | | +------------+---------+ > +---V-----------V-----------V--------V----+ / > | VFIOAddressSpace | <------------+ > | | | MemoryListener > | VFIOContainer list | > +-------+----------------------------+----+ > | | > | | > +-------V------+ +--------V----------+ > | iommufd | | vfio legacy | > | container | | container | > +-------+------+ +--------+----------+ > | | > | /dev/iommu | /dev/vfio/vfio > | /dev/vfio/devices/vfioX | /dev/vfio/$group_id > Userspace | | > > ===========+============================+======================= > ========= > Kernel | device fd | > +---------------+ | group/container fd > | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) > | ATTACH_IOAS) | | device fd > | | | > | +-------V------------V-----------------+ > iommufd | | vfio | > (map/unmap | +---------+--------------------+-------+ > ioas_copy) | | | map/unmap > | | | > +------V------+ +-----V------+ +------V--------+ > | iommfd core | | device | | vfio iommu | > +-------------+ +------------+ +---------------+ last row: s/iommfd/iommufd/ overall this sounds a reasonable abstraction. Later when vdpa starts supporting iommufd probably the iommufd BE will become even smaller with more logic shareable between vfio and vdpa. > > [Secure Context setup] > - iommufd BE: uses device fd and iommufd to setup secure context > (bind_iommufd, attach_ioas) > - vfio legacy BE: uses group fd and container fd to setup secure context > (set_container, set_iommu) > [Device access] > - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX > - vfio legacy BE: device fd is retrieved from group fd ioctl > [DMA Mapping flow] > - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener > - VFIO populates DMA map/unmap via the container BEs > *) iommufd BE: uses iommufd > *) vfio legacy BE: uses container fd > > This series qomifies the VFIOContainer object which acts as a base class what does 'qomify' mean? 
I didn't find this word from dictionary... > for a container. This base class is derived into the legacy VFIO container > and the new iommufd based container. The base class implements generic > code > such as code related to memory_listener and address space management > whereas > the derived class implements callbacks that depend on the kernel user space 'the kernel user space'? > being used. > > The selection of the backend is made on a device basis using the new > iommufd option (on/off/auto). By default the iommufd backend is selected > if supported by the host and by QEMU (iommufd KConfig). This option is > currently available only for the vfio-pci device. For other types of > devices, it does not yet exist and the legacy BE is chosen by default. > > Test done: > - PCI and Platform device were tested In this case PCI uses iommufd while platform device uses legacy? > - ccw and ap were only compile-tested > - limited device hotplug test > - vIOMMU test run for both legacy and iommufd backends (limited tests) > > This series was co-developed by Eric Auger and me based on the exploration > iommufd kernel[2], complete code of this series is available in[3]. As > iommufd kernel is in the early step (only iommufd generic interface is in > mailing list), so this series hasn't made the iommufd backend fully on par > with legacy backend w.r.t. features like p2p mappings, coherency tracking, what does 'coherency tracking' mean here? if related to iommu enforce snoop it is fully handled by the kernel so far. I didn't find any use of VFIO_DMA_CC_IOMMU in current Qemu. > live migration, etc. This series hasn't supported PCI devices without FLR > neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when > userspace > is using iommufd. The kernel needs to be updated to accept device fd list for > reset when userspace is using iommufd. Related work is in progress by > Jason[4]. > > TODOs: > - Add DMA alias check for iommufd BE (group level) > - Make pci.c to be BE agnostic. 
Needs kernel change as well to fix the > VFIO_DEVICE_PCI_HOT_RESET gap > - Cleanup the VFIODevice fields as it's used in both BEs > - Add locks > - Replace list with g_tree > - More tests > > Patch Overview: > > - Preparation: > 0001-scripts-update-linux-headers-Add-iommufd.h.patch > 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch > 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch > 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch > 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into- > iommu_m.patch 3-5 are pure cleanups which could be sent out separately > 0006-vfio-common-Split-common.c-into-common.c-container.c.patch > > - Introduce container object and covert existing vfio to use it: > 0007-vfio-Add-base-object-for-VFIOContainer.patch > 0008-vfio-container-Introduce-vfio_attach-detach_device.patch > 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch > 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch > 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch > 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch > 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch > > - Introduce iommufd based container: > 0014-hw-iommufd-Creation.patch > 0015-vfio-iommufd-Implement-iommufd-backend.patch > 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch > > - Add backend selection for vfio-pci: > 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch > 0018-vfio-pci-Add-an-iommufd-option.patch > > [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6- > iommufd_jgg@nvidia.com/ > [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 > [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 > [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd- > vfio_mdev_no_group_jgg@nvidia.com/ Following is probably more relevant to [4]: https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/ Thanks Kevin
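
To make the container-backend abstraction discussed above concrete, here is a
sketch of the kind of callback table involved; the type and field names are
invented for illustration and are not the series' actual definitions:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative only: each container backend plugs its own callbacks
     * behind a common object, so the MemoryListener code stays generic. */
    typedef struct VFIOContainerOps {
        /* secure context setup: bind_iommufd/attach_ioas for the iommufd BE,
         * set_container/set_iommu for the legacy BE */
        int (*attach_device)(void *container, void *vbasedev);
        /* DMA mapping path driven by the MemoryListener */
        int (*dma_map)(void *container, uint64_t iova, uint64_t size,
                       void *vaddr, bool readonly);
        int (*dma_unmap)(void *container, uint64_t iova, uint64_t size);
    } VFIOContainerOps;

    /* one ops table per backend; a future vdpa user could share the
     * iommufd one, which is the sharing Kevin refers to */
    extern const VFIOContainerOps iommufd_container_ops; /* /dev/iommu ioctls  */
    extern const VFIOContainerOps legacy_container_ops;  /* VFIO_IOMMU_MAP_DMA */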
Hi Kevin, On 2022/4/18 16:49, Tian, Kevin wrote: >> From: Liu, Yi L <yi.l.liu@intel.com> >> Sent: Thursday, April 14, 2022 6:47 PM >> >> With the introduction of iommufd[1], the linux kernel provides a generic >> interface for userspace drivers to propagate their DMA mappings to kernel >> for assigned devices. This series does the porting of the VFIO devices >> onto the /dev/iommu uapi and let it coexist with the legacy implementation. >> Other devices like vpda, vfio mdev and etc. are not considered yet. > > vfio mdev has no special support in Qemu. Just that it's not supported > by iommufd yet thus can only be operated in legacy container interface at > this point. Later once it's supported by the kernel suppose no additional > enabling work is required for mdev in Qemu. yes. will make it more precise in next version. >> >> For vfio devices, the new interface is tied with device fd and iommufd >> as the iommufd solution is device-centric. This is different from legacy >> vfio which is group-centric. To support both interfaces in QEMU, this >> series introduces the iommu backend concept in the form of different >> container classes. The existing vfio container is named legacy container >> (equivalent with legacy iommu backend in this series), while the new >> iommufd based container is named as iommufd container (may also be >> mentioned >> as iommufd backend in this series). The two backend types have their own >> way to setup secure context and dma management interface. Below diagram >> shows how it looks like with both BEs. >> >> VFIO AddressSpace/Memory >> +-------+ +----------+ +-----+ +-----+ >> | pci | | platform | | ap | | ccw | >> +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ >> | | | | | AddressSpace | >> | | | | +------------+---------+ >> +---V-----------V-----------V--------V----+ / >> | VFIOAddressSpace | <------------+ >> | | | MemoryListener >> | VFIOContainer list | >> +-------+----------------------------+----+ >> | | >> | | >> +-------V------+ +--------V----------+ >> | iommufd | | vfio legacy | >> | container | | container | >> +-------+------+ +--------+----------+ >> | | >> | /dev/iommu | /dev/vfio/vfio >> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id >> Userspace | | >> >> ===========+============================+======================= >> ========= >> Kernel | device fd | >> +---------------+ | group/container fd >> | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) >> | ATTACH_IOAS) | | device fd >> | | | >> | +-------V------------V-----------------+ >> iommufd | | vfio | >> (map/unmap | +---------+--------------------+-------+ >> ioas_copy) | | | map/unmap >> | | | >> +------V------+ +-----V------+ +------V--------+ >> | iommfd core | | device | | vfio iommu | >> +-------------+ +------------+ +---------------+ > > last row: s/iommfd/iommufd/ thanks. a typo. > overall this sounds a reasonable abstraction. Later when vdpa starts > supporting iommufd probably the iommufd BE will become even > smaller with more logic shareable between vfio and vdpa. let's see if Jason Wang will give some idea. 
:-) >> >> [Secure Context setup] >> - iommufd BE: uses device fd and iommufd to setup secure context >> (bind_iommufd, attach_ioas) >> - vfio legacy BE: uses group fd and container fd to setup secure context >> (set_container, set_iommu) >> [Device access] >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX >> - vfio legacy BE: device fd is retrieved from group fd ioctl >> [DMA Mapping flow] >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener >> - VFIO populates DMA map/unmap via the container BEs >> *) iommufd BE: uses iommufd >> *) vfio legacy BE: uses container fd >> >> This series qomifies the VFIOContainer object which acts as a base class > > what does 'qomify' mean? I didn't find this word from dictionary... > >> for a container. This base class is derived into the legacy VFIO container >> and the new iommufd based container. The base class implements generic >> code >> such as code related to memory_listener and address space management >> whereas >> the derived class implements callbacks that depend on the kernel user space > > 'the kernel user space'? aha, just want to express different BE callbacks will use different user interface exposed by kernel. will refine the wording. > >> being used. >> >> The selection of the backend is made on a device basis using the new >> iommufd option (on/off/auto). By default the iommufd backend is selected >> if supported by the host and by QEMU (iommufd KConfig). This option is >> currently available only for the vfio-pci device. For other types of >> devices, it does not yet exist and the legacy BE is chosen by default. >> >> Test done: >> - PCI and Platform device were tested > > In this case PCI uses iommufd while platform device uses legacy? For PCI, both legacy and iommufd were tested. The exploration kernel branch doesn't have the new device uapi for platform device, so I didn't test it. But I remember Eric should have tested it with iommufd. Eric? >> - ccw and ap were only compile-tested >> - limited device hotplug test >> - vIOMMU test run for both legacy and iommufd backends (limited tests) >> >> This series was co-developed by Eric Auger and me based on the exploration >> iommufd kernel[2], complete code of this series is available in[3]. As >> iommufd kernel is in the early step (only iommufd generic interface is in >> mailing list), so this series hasn't made the iommufd backend fully on par >> with legacy backend w.r.t. features like p2p mappings, coherency tracking, > > what does 'coherency tracking' mean here? if related to iommu enforce > snoop it is fully handled by the kernel so far. I didn't find any use of > VFIO_DMA_CC_IOMMU in current Qemu. It's the kvm_group add/del stuffs.perhaps say kvm_group add/del equivalence would be better? >> live migration, etc. This series hasn't supported PCI devices without FLR >> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when >> userspace >> is using iommufd. The kernel needs to be updated to accept device fd list for >> reset when userspace is using iommufd. Related work is in progress by >> Jason[4]. >> >> TODOs: >> - Add DMA alias check for iommufd BE (group level) >> - Make pci.c to be BE agnostic. 
Needs kernel change as well to fix the >> VFIO_DEVICE_PCI_HOT_RESET gap >> - Cleanup the VFIODevice fields as it's used in both BEs >> - Add locks >> - Replace list with g_tree >> - More tests >> >> Patch Overview: >> >> - Preparation: >> 0001-scripts-update-linux-headers-Add-iommufd.h.patch >> 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch >> 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch >> 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch >> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into- >> iommu_m.patch > > 3-5 are pure cleanups which could be sent out separately yes. may send later after checking with Eric. :-) >> 0006-vfio-common-Split-common.c-into-common.c-container.c.patch >> >> - Introduce container object and covert existing vfio to use it: >> 0007-vfio-Add-base-object-for-VFIOContainer.patch >> 0008-vfio-container-Introduce-vfio_attach-detach_device.patch >> 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch >> 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch >> 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch >> 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch >> 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch >> >> - Introduce iommufd based container: >> 0014-hw-iommufd-Creation.patch >> 0015-vfio-iommufd-Implement-iommufd-backend.patch >> 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch >> >> - Add backend selection for vfio-pci: >> 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch >> 0018-vfio-pci-Add-an-iommufd-option.patch >> >> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6- >> iommufd_jgg@nvidia.com/ >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd- >> vfio_mdev_no_group_jgg@nvidia.com/ > > Following is probably more relevant to [4]: > > https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/ absolutely.:-) thanks. > Thanks > Kevin
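
For readers unfamiliar with the "kvm_group add/del" reference above: in the
legacy flow, QEMU registers each VFIO group with KVM through the KVM-VFIO
pseudo device. A rough sketch of that kernel uAPI follows (a standalone
illustration with error handling omitted, not the QEMU code itself):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Rough sketch of the KVM-VFIO pseudo-device uAPI (no error handling). */
    static void kvm_vfio_group_add(int vm_fd, int group_fd)
    {
        struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };

        ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);            /* cd.fd: device fd */

        struct kvm_device_attr attr = {
            .group = KVM_DEV_VFIO_GROUP,
            .attr  = KVM_DEV_VFIO_GROUP_ADD,             /* _DEL on teardown */
            .addr  = (uint64_t)(uintptr_t)&group_fd,     /* pointer to group fd */
        };

        ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
    }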
On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:

> >> - More tests
> > I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> > string. The behaviors are different between using default and
> > using legacy "iommufd=off".
> >
> > The legacy pathway exits the VM with:
> > vfio 0002:01:00.0:
> > failed to setup container for group 1:
> > memory listener initialization failed:
> > Region smmuv3-iommu-memory-region-16-0:
> > device 00.02.0 requires iommu MAP notifier which is not currently supported
> >
> > while the iommufd pathway started the VM but reported errors
> > from host kernel about address translation failures, probably
> > because of accessing unmapped addresses.
> >
> > I found iommufd pathway also calls error_propagate_prepend()
> > to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> > doesn't get a chance to print errp out. Perhaps there should
> > be a final error check somewhere to exit?
>
> thank you for giving it a try.
>
> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
> and SMMU does not support cache mode. If you want to test viommu on ARM
> you shall test virtio-iommu+vfio. This should work but this is not yet
> tested.

I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
separately with vfio-pci, but neither seems to work. The host
SMMU driver reports Translation Faults.

Do you know what commands I should use to run QEMU for that
combination?

> I pushed a fix for the error notification issue:
> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Yes. This fixes the problem. Thanks!
[Cc +libvirt folks] On Thu, 14 Apr 2022 03:46:52 -0700 Yi Liu <yi.l.liu@intel.com> wrote: > With the introduction of iommufd[1], the linux kernel provides a generic > interface for userspace drivers to propagate their DMA mappings to kernel > for assigned devices. This series does the porting of the VFIO devices > onto the /dev/iommu uapi and let it coexist with the legacy implementation. > Other devices like vpda, vfio mdev and etc. are not considered yet. > > For vfio devices, the new interface is tied with device fd and iommufd > as the iommufd solution is device-centric. This is different from legacy > vfio which is group-centric. To support both interfaces in QEMU, this > series introduces the iommu backend concept in the form of different > container classes. The existing vfio container is named legacy container > (equivalent with legacy iommu backend in this series), while the new > iommufd based container is named as iommufd container (may also be mentioned > as iommufd backend in this series). The two backend types have their own > way to setup secure context and dma management interface. Below diagram > shows how it looks like with both BEs. > > VFIO AddressSpace/Memory > +-------+ +----------+ +-----+ +-----+ > | pci | | platform | | ap | | ccw | > +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ > | | | | | AddressSpace | > | | | | +------------+---------+ > +---V-----------V-----------V--------V----+ / > | VFIOAddressSpace | <------------+ > | | | MemoryListener > | VFIOContainer list | > +-------+----------------------------+----+ > | | > | | > +-------V------+ +--------V----------+ > | iommufd | | vfio legacy | > | container | | container | > +-------+------+ +--------+----------+ > | | > | /dev/iommu | /dev/vfio/vfio > | /dev/vfio/devices/vfioX | /dev/vfio/$group_id > Userspace | | > ===========+============================+================================ > Kernel | device fd | > +---------------+ | group/container fd > | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) > | ATTACH_IOAS) | | device fd > | | | > | +-------V------------V-----------------+ > iommufd | | vfio | > (map/unmap | +---------+--------------------+-------+ > ioas_copy) | | | map/unmap > | | | > +------V------+ +-----V------+ +------V--------+ > | iommfd core | | device | | vfio iommu | > +-------------+ +------------+ +---------------+ > > [Secure Context setup] > - iommufd BE: uses device fd and iommufd to setup secure context > (bind_iommufd, attach_ioas) > - vfio legacy BE: uses group fd and container fd to setup secure context > (set_container, set_iommu) > [Device access] > - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX > - vfio legacy BE: device fd is retrieved from group fd ioctl > [DMA Mapping flow] > - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener > - VFIO populates DMA map/unmap via the container BEs > *) iommufd BE: uses iommufd > *) vfio legacy BE: uses container fd > > This series qomifies the VFIOContainer object which acts as a base class > for a container. This base class is derived into the legacy VFIO container > and the new iommufd based container. The base class implements generic code > such as code related to memory_listener and address space management whereas > the derived class implements callbacks that depend on the kernel user space > being used. > > The selection of the backend is made on a device basis using the new > iommufd option (on/off/auto). 
By default the iommufd backend is selected > if supported by the host and by QEMU (iommufd KConfig). This option is > currently available only for the vfio-pci device. For other types of > devices, it does not yet exist and the legacy BE is chosen by default. I've discussed this a bit with Eric, but let me propose a different command line interface. Libvirt generally likes to pass file descriptors to QEMU rather than grant it access to those files directly. This was problematic with vfio-pci because libvirt can't easily know when QEMU will want to grab another /dev/vfio/vfio container. Therefore we abandoned this approach and instead libvirt grants file permissions. However, with iommufd there's no reason that QEMU ever needs more than a single instance of /dev/iommufd and we're using per device vfio file descriptors, so it seems like a good time to revisit this. The interface I was considering would be to add an iommufd object to QEMU, so we might have a: -device iommufd[,fd=#][,id=foo] For non-libivrt usage this would have the ability to open /dev/iommufd itself if an fd is not provided. This object could be shared with other iommufd users in the VM and maybe we'd allow multiple instances for more esoteric use cases. [NB, maybe this should be a -object rather than -device since the iommufd is not a guest visible device?] The vfio-pci device might then become: -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] So essentially we can specify the device via host, sysfsdev, or passing an fd to the vfio device file. When an iommufd object is specified, "foo" in the example above, each of those options would use the vfio-device access mechanism, essentially the same as iommufd=on in your example. With the fd passing option, an iommufd object would be required and necessarily use device level access. In your example, the iommufd=auto seems especially troublesome for libvirt because QEMU is going to have different locked memory requirements based on whether we're using type1 or iommufd, where the latter resolves the duplicate accounting issues. libvirt needs to know deterministically which backed is being used, which this proposal seems to provide, while at the same time bringing us more in line with fd passing. Thoughts? Thanks, Alex
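
A tiny sketch of the fd-passing idea described above; the helper name is
invented, the /dev/iommu path is taken from the diagram in the cover letter,
and a real backend would of course do more than pick a descriptor:

    #include <fcntl.h>
    #include <unistd.h>

    /* Illustrative only: adopt a descriptor passed in by the management
     * layer (libvirt) or, for direct command-line use, open the character
     * device ourselves. */
    static int iommufd_backend_get_fd(int user_fd)
    {
        if (user_fd >= 0) {
            return user_fd;                 /* fd=# passed by libvirt */
        }
        return open("/dev/iommu", O_RDWR);  /* non-libvirt usage */
    }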
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote: > [Cc +libvirt folks] > > On Thu, 14 Apr 2022 03:46:52 -0700 > Yi Liu <yi.l.liu@intel.com> wrote: > > > With the introduction of iommufd[1], the linux kernel provides a generic > > interface for userspace drivers to propagate their DMA mappings to kernel > > for assigned devices. This series does the porting of the VFIO devices > > onto the /dev/iommu uapi and let it coexist with the legacy implementation. > > Other devices like vpda, vfio mdev and etc. are not considered yet. snip > > The selection of the backend is made on a device basis using the new > > iommufd option (on/off/auto). By default the iommufd backend is selected > > if supported by the host and by QEMU (iommufd KConfig). This option is > > currently available only for the vfio-pci device. For other types of > > devices, it does not yet exist and the legacy BE is chosen by default. > > I've discussed this a bit with Eric, but let me propose a different > command line interface. Libvirt generally likes to pass file > descriptors to QEMU rather than grant it access to those files > directly. This was problematic with vfio-pci because libvirt can't > easily know when QEMU will want to grab another /dev/vfio/vfio > container. Therefore we abandoned this approach and instead libvirt > grants file permissions. > > However, with iommufd there's no reason that QEMU ever needs more than > a single instance of /dev/iommufd and we're using per device vfio file > descriptors, so it seems like a good time to revisit this. I assume access to '/dev/iommufd' gives the process somewhat elevated privileges, such that you don't want to unconditionally give QEMU access to this device ? > The interface I was considering would be to add an iommufd object to > QEMU, so we might have a: > > -device iommufd[,fd=#][,id=foo] > > For non-libivrt usage this would have the ability to open /dev/iommufd > itself if an fd is not provided. This object could be shared with > other iommufd users in the VM and maybe we'd allow multiple instances > for more esoteric use cases. [NB, maybe this should be a -object rather than > -device since the iommufd is not a guest visible device?] Yes, -object would be the right answer for something that's purely a host side backend impl selector. > The vfio-pci device might then become: > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] > > So essentially we can specify the device via host, sysfsdev, or passing > an fd to the vfio device file. When an iommufd object is specified, > "foo" in the example above, each of those options would use the > vfio-device access mechanism, essentially the same as iommufd=on in > your example. With the fd passing option, an iommufd object would be > required and necessarily use device level access. > > In your example, the iommufd=auto seems especially troublesome for > libvirt because QEMU is going to have different locked memory > requirements based on whether we're using type1 or iommufd, where the > latter resolves the duplicate accounting issues. libvirt needs to know > deterministically which backed is being used, which this proposal seems > to provide, while at the same time bringing us more in line with fd > passing. Thoughts? Thanks, Yep, I agree that libvirt needs to have more direct control over this. This is also even more important if there are notable feature differences in the 2 backends. 
I wonder if anyone has considered an even more distinct impl, whereby
we have a completely different device type on the backend, eg

    -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

If a vendor wants to fully remove the legacy impl, they can then use the
Kconfig mechanism to disable the build of the legacy impl device, while
keeping the iommu impl (or vice-versa if the new iommu impl isn't considered
reliable enough for them to support yet).

Libvirt would use

    -object iommu,id=iommu0,fd=NNN
    -device vfio-iommu-pci,fd=MMM,iommu=iommu0

Non-libvirt would use a simpler

    -device vfio-iommu-pci,host=0000:03:22.1

with QEMU auto-creating an 'iommu' object in the background.

This would fit into libvirt's existing modelling better. We currently have
a concept of a PCI assignment backend, which previously supported the
legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
feels like a 3rd PCI assignment approach, and so fits with how we modelled
it as a different device type in the past.

With regards,
Daniel
On Mon, Apr 25, 2022 at 11:10:14AM +0100, Daniel P. Berrangé wrote:
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
>
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It doesn't give much; at worst it allows userspace to allocate kernel
memory and pin pages, which can already be done through all sorts of
other interfaces QEMU already has access to.

Jason
On Mon, 25 Apr 2022 11:10:14 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote: > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote: > > [Cc +libvirt folks] > > > > On Thu, 14 Apr 2022 03:46:52 -0700 > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > With the introduction of iommufd[1], the linux kernel provides a generic > > > interface for userspace drivers to propagate their DMA mappings to kernel > > > for assigned devices. This series does the porting of the VFIO devices > > > onto the /dev/iommu uapi and let it coexist with the legacy implementation. > > > Other devices like vpda, vfio mdev and etc. are not considered yet. > > snip > > > > The selection of the backend is made on a device basis using the new > > > iommufd option (on/off/auto). By default the iommufd backend is selected > > > if supported by the host and by QEMU (iommufd KConfig). This option is > > > currently available only for the vfio-pci device. For other types of > > > devices, it does not yet exist and the legacy BE is chosen by default. > > > > I've discussed this a bit with Eric, but let me propose a different > > command line interface. Libvirt generally likes to pass file > > descriptors to QEMU rather than grant it access to those files > > directly. This was problematic with vfio-pci because libvirt can't > > easily know when QEMU will want to grab another /dev/vfio/vfio > > container. Therefore we abandoned this approach and instead libvirt > > grants file permissions. > > > > However, with iommufd there's no reason that QEMU ever needs more than > > a single instance of /dev/iommufd and we're using per device vfio file > > descriptors, so it seems like a good time to revisit this. > > I assume access to '/dev/iommufd' gives the process somewhat elevated > privileges, such that you don't want to unconditionally give QEMU > access to this device ? It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged interface which should have limited scope for abuse, but more so here the goal would be to de-privilege QEMU that one step further that it cannot open the device file itself. > > The interface I was considering would be to add an iommufd object to > > QEMU, so we might have a: > > > > -device iommufd[,fd=#][,id=foo] > > > > For non-libivrt usage this would have the ability to open /dev/iommufd > > itself if an fd is not provided. This object could be shared with > > other iommufd users in the VM and maybe we'd allow multiple instances > > for more esoteric use cases. [NB, maybe this should be a -object rather than > > -device since the iommufd is not a guest visible device?] > > Yes, -object would be the right answer for something that's purely > a host side backend impl selector. > > > The vfio-pci device might then become: > > > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] > > > > So essentially we can specify the device via host, sysfsdev, or passing > > an fd to the vfio device file. When an iommufd object is specified, > > "foo" in the example above, each of those options would use the > > vfio-device access mechanism, essentially the same as iommufd=on in > > your example. With the fd passing option, an iommufd object would be > > required and necessarily use device level access. 
> > > > In your example, the iommufd=auto seems especially troublesome for > > libvirt because QEMU is going to have different locked memory > > requirements based on whether we're using type1 or iommufd, where the > > latter resolves the duplicate accounting issues. libvirt needs to know > > deterministically which backed is being used, which this proposal seems > > to provide, while at the same time bringing us more in line with fd > > passing. Thoughts? Thanks, > > Yep, I agree that libvirt needs to have more direct control over this. > This is also even more important if there are notable feature differences > in the 2 backends. > > I wonder if anyone has considered an even more distinct impl, whereby > we have a completely different device type on the backend, eg > > -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] > > If a vendor wants to fully remove the legacy impl, they can then use the > Kconfig mechanism to disable the build of the legacy impl device, while > keeping the iommu impl (or vica-verca if the new iommu impl isn't considered > reliable enough for them to support yet). > > Libvirt would use > > -object iommu,id=iommu0,fd=NNN > -device vfio-iommu-pci,fd=MMM,iommu=iommu0 > > Non-libvirt would use a simpler > > -device vfio-iommu-pci,host=0000:03:22.1 > > with QEMU auto-creating a 'iommu' object in the background. > > This would fit into libvirt's existing modelling better. We currently have > a concept of a PCI assignment backend, which previously supported the > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl > feels like a 3rd PCI assignment approach, and so fits with how we modelled > it as a different device type in the past. I don't think we want to conflate "iommu" and "iommufd", we're creating an object that interfaces into the iommufd uAPI, not an iommu itself. Likewise "vfio-iommu-pci" is just confusing, there was an iommu interface previously, it's just a different implementation now and as far as the VM interface to the device, it's identical. Note that a "vfio-iommufd-pci" device multiplies the matrix of every vfio device for a rather subtle implementation detail. My expectation would be that libvirt uses: -object iommufd,id=iommufd0,fd=NNN -device vfio-pci,fd=MMM,iommufd=iommufd0 Whereas simple QEMU command line would be: -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 The iommufd object would open /dev/iommufd itself. Creating an implicit iommufd object is someone problematic because one of the things I forgot to highlight in my previous description is that the iommufd object is meant to be shared across not only various vfio devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. vdpa. If the old style were used: -device vfio-pci,host=0000:02:00.0 Then QEMU would use vfio for the IOMMU backend. If libvirt/userspace wants to query whether "legacy" vfio is still supported by the host kernel, I think it'd only need to look for whether the /dev/vfio/vfio container interface still exists. If we need some means for QEMU to remove legacy support, I'd rather find a way to do it via probing device options. It's easy enough to see if iommufd support exists by looking for the presence of the iommufd option for the vfio-pci device and Kconfig within QEMU could be used regardless of whether we define a new device name. Thanks, Alex
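
A bare-bones sketch of what such an "-object iommufd" backend could look like
in QOM terms; the type name, the struct layout and the omitted "fd" property
wiring are assumptions for illustration, not the eventual implementation:

    #include "qemu/osdep.h"
    #include "qom/object.h"
    #include "qom/object_interfaces.h"

    #define TYPE_IOMMUFD_BACKEND "iommufd"
    OBJECT_DECLARE_SIMPLE_TYPE(IOMMUFDBackend, IOMMUFD_BACKEND)

    struct IOMMUFDBackend {
        Object parent_obj;
        int fd;                 /* -1: open /dev/iommu ourselves */
    };

    static const TypeInfo iommufd_backend_info = {
        .name          = TYPE_IOMMUFD_BACKEND,
        .parent        = TYPE_OBJECT,
        .instance_size = sizeof(IOMMUFDBackend),
        /* TYPE_USER_CREATABLE is what makes "-object iommufd,..." work;
         * an "fd" property would be added in class_init (omitted here). */
        .interfaces    = (InterfaceInfo[]) {
            { TYPE_USER_CREATABLE },
            { }
        },
    };

    static void iommufd_register_types(void)
    {
        type_register_static(&iommufd_backend_info);
    }
    type_init(iommufd_register_types);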
Hi Nicolin,

On 4/19/22 5:26 AM, Nicolin Chen wrote:
> On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:
>
>>>> - More tests
>>> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
>>> string. The behaviors are different between using default and
>>> using legacy "iommufd=off".
>>>
>>> The legacy pathway exits the VM with:
>>> vfio 0002:01:00.0:
>>> failed to setup container for group 1:
>>> memory listener initialization failed:
>>> Region smmuv3-iommu-memory-region-16-0:
>>> device 00.02.0 requires iommu MAP notifier which is not currently supported
>>>
>>> while the iommufd pathway started the VM but reported errors
>>> from host kernel about address translation failures, probably
>>> because of accessing unmapped addresses.
>>>
>>> I found iommufd pathway also calls error_propagate_prepend()
>>> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
>>> doesn't get a chance to print errp out. Perhaps there should
>>> be a final error check somewhere to exit?
>> thank you for giving it a try.
>>
>> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
>> and SMMU does not support cache mode. If you want to test viommu on ARM
>> you shall test virtio-iommu+vfio. This should work but this is not yet
>> tested.
> I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
> separately with vfio-pci, but neither seems to work. The host
> SMMU driver reports Translation Faults.
>
> Do you know what commands I should use to run QEMU for that
> combination?

You shall use:

    -device virtio-iommu-pci -device vfio-pci,host=<BDF>

Please make sure the "-device virtio-iommu-pci" is set *before* the
"-device vfio-pci". Otherwise the IOMMU MR notifiers are not set properly
and this may be the cause of your physical SMMU translation faults.

Eric

>> I pushed a fix for the error notification issue:
>> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git
> Yes. This fixes the problem. Thanks!
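
For reference, the "IOMMU MR notifiers" mentioned above are what VFIO
registers on a vIOMMU memory region so that guest mappings are replayed to
the host. A rough sketch against the QEMU memory API of that period, not the
series' exact code; the function name and the caller-supplied callback are
assumptions:

    #include "qemu/osdep.h"
    #include "exec/memory.h"

    /* Illustrative only: ask the vIOMMU region for MAP and UNMAP events in
     * a given IOVA range and register the notifier.  The callback passed in
     * by the caller is what replays each mapping via the container BE. */
    static void vfio_hook_viommu(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n,
                                 IOMMUNotify map_notify,
                                 hwaddr start, hwaddr end, Error **errp)
    {
        int idx = memory_region_iommu_attrs_to_index(iommu_mr,
                                                     MEMTXATTRS_UNSPECIFIED);

        iommu_notifier_init(n, map_notify, IOMMU_NOTIFIER_IOTLB_EVENTS,
                            start, end, idx);
        memory_region_register_iommu_notifier(MEMORY_REGION(iommu_mr), n, errp);
    }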
Hi, On 4/18/22 2:09 PM, Yi Liu wrote: > Hi Kevin, > > On 2022/4/18 16:49, Tian, Kevin wrote: >>> From: Liu, Yi L <yi.l.liu@intel.com> >>> Sent: Thursday, April 14, 2022 6:47 PM >>> >>> With the introduction of iommufd[1], the linux kernel provides a >>> generic >>> interface for userspace drivers to propagate their DMA mappings to >>> kernel >>> for assigned devices. This series does the porting of the VFIO devices >>> onto the /dev/iommu uapi and let it coexist with the legacy >>> implementation. >>> Other devices like vpda, vfio mdev and etc. are not considered yet. >> >> vfio mdev has no special support in Qemu. Just that it's not supported >> by iommufd yet thus can only be operated in legacy container >> interface at >> this point. Later once it's supported by the kernel suppose no >> additional >> enabling work is required for mdev in Qemu. > > yes. will make it more precise in next version. > >>> >>> For vfio devices, the new interface is tied with device fd and iommufd >>> as the iommufd solution is device-centric. This is different from >>> legacy >>> vfio which is group-centric. To support both interfaces in QEMU, this >>> series introduces the iommu backend concept in the form of different >>> container classes. The existing vfio container is named legacy >>> container >>> (equivalent with legacy iommu backend in this series), while the new >>> iommufd based container is named as iommufd container (may also be >>> mentioned >>> as iommufd backend in this series). The two backend types have their >>> own >>> way to setup secure context and dma management interface. Below diagram >>> shows how it looks like with both BEs. >>> >>> VFIO AddressSpace/Memory >>> +-------+ +----------+ +-----+ +-----+ >>> | pci | | platform | | ap | | ccw | >>> +---+---+ +----+-----+ +--+--+ +--+--+ >>> +----------------------+ >>> | | | | | >>> AddressSpace | >>> | | | | >>> +------------+---------+ >>> +---V-----------V-----------V--------V----+ / >>> | VFIOAddressSpace | <------------+ >>> | | | MemoryListener >>> | VFIOContainer list | >>> +-------+----------------------------+----+ >>> | | >>> | | >>> +-------V------+ +--------V----------+ >>> | iommufd | | vfio legacy | >>> | container | | container | >>> +-------+------+ +--------+----------+ >>> | | >>> | /dev/iommu | /dev/vfio/vfio >>> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id >>> Userspace | | >>> >>> ===========+============================+======================= >>> ========= >>> Kernel | device fd | >>> +---------------+ | group/container fd >>> | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) >>> | ATTACH_IOAS) | | device fd >>> | | | >>> | +-------V------------V-----------------+ >>> iommufd | | vfio | >>> (map/unmap | +---------+--------------------+-------+ >>> ioas_copy) | | | map/unmap >>> | | | >>> +------V------+ +-----V------+ +------V--------+ >>> | iommfd core | | device | | vfio iommu | >>> +-------------+ +------------+ +---------------+ >> >> last row: s/iommfd/iommufd/ > > thanks. a typo. > >> overall this sounds a reasonable abstraction. Later when vdpa starts >> supporting iommufd probably the iommufd BE will become even >> smaller with more logic shareable between vfio and vdpa. > > let's see if Jason Wang will give some idea. 
:-) > >>> >>> [Secure Context setup] >>> - iommufd BE: uses device fd and iommufd to setup secure context >>> (bind_iommufd, attach_ioas) >>> - vfio legacy BE: uses group fd and container fd to setup secure >>> context >>> (set_container, set_iommu) >>> [Device access] >>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX >>> - vfio legacy BE: device fd is retrieved from group fd ioctl >>> [DMA Mapping flow] >>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener >>> - VFIO populates DMA map/unmap via the container BEs >>> *) iommufd BE: uses iommufd >>> *) vfio legacy BE: uses container fd >>> >>> This series qomifies the VFIOContainer object which acts as a base >>> class >> >> what does 'qomify' mean? I didn't find this word from dictionary... >> >>> for a container. This base class is derived into the legacy VFIO >>> container >>> and the new iommufd based container. The base class implements generic >>> code >>> such as code related to memory_listener and address space management >>> whereas >>> the derived class implements callbacks that depend on the kernel >>> user space >> >> 'the kernel user space'? > > aha, just want to express different BE callbacks will use different > user interface exposed by kernel. will refine the wording. > >> >>> being used. >>> >>> The selection of the backend is made on a device basis using the new >>> iommufd option (on/off/auto). By default the iommufd backend is >>> selected >>> if supported by the host and by QEMU (iommufd KConfig). This option is >>> currently available only for the vfio-pci device. For other types of >>> devices, it does not yet exist and the legacy BE is chosen by default. >>> >>> Test done: >>> - PCI and Platform device were tested >> >> In this case PCI uses iommufd while platform device uses legacy? > > For PCI, both legacy and iommufd were tested. The exploration kernel > branch doesn't have the new device uapi for platform device, so I > didn't test it. > But I remember Eric should have tested it with iommufd. Eric? No I just ran non regression tests for vfio-platform, in legacy mode. I did not integrate with the new device uapi for platform device. > >>> - ccw and ap were only compile-tested >>> - limited device hotplug test >>> - vIOMMU test run for both legacy and iommufd backends (limited tests) >>> >>> This series was co-developed by Eric Auger and me based on the >>> exploration >>> iommufd kernel[2], complete code of this series is available in[3]. As >>> iommufd kernel is in the early step (only iommufd generic interface >>> is in >>> mailing list), so this series hasn't made the iommufd backend fully >>> on par >>> with legacy backend w.r.t. features like p2p mappings, coherency >>> tracking, >> >> what does 'coherency tracking' mean here? if related to iommu enforce >> snoop it is fully handled by the kernel so far. I didn't find any use of >> VFIO_DMA_CC_IOMMU in current Qemu. > > It's the kvm_group add/del stuffs.perhaps say kvm_group add/del > equivalence > would be better? > >>> live migration, etc. This series hasn't supported PCI devices >>> without FLR >>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when >>> userspace >>> is using iommufd. The kernel needs to be updated to accept device fd >>> list for >>> reset when userspace is using iommufd. Related work is in progress by >>> Jason[4]. >>> >>> TODOs: >>> - Add DMA alias check for iommufd BE (group level) >>> - Make pci.c to be BE agnostic. 
Needs kernel change as well to fix the >>> VFIO_DEVICE_PCI_HOT_RESET gap >>> - Cleanup the VFIODevice fields as it's used in both BEs >>> - Add locks >>> - Replace list with g_tree >>> - More tests >>> >>> Patch Overview: >>> >>> - Preparation: >>> 0001-scripts-update-linux-headers-Add-iommufd.h.patch >>> 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch >>> 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch >>> 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch >>> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into- >>> iommu_m.patch >> >> 3-5 are pure cleanups which could be sent out separately > > yes. may send later after checking with Eric. :-) yes makes sense to send them separately. Thanks Eric > >>> 0006-vfio-common-Split-common.c-into-common.c-container.c.patch >>> >>> - Introduce container object and covert existing vfio to use it: >>> 0007-vfio-Add-base-object-for-VFIOContainer.patch >>> 0008-vfio-container-Introduce-vfio_attach-detach_device.patch >>> 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch >>> 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch >>> 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch >>> 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch >>> 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch >>> >>> - Introduce iommufd based container: >>> 0014-hw-iommufd-Creation.patch >>> 0015-vfio-iommufd-Implement-iommufd-backend.patch >>> 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch >>> >>> - Add backend selection for vfio-pci: >>> 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch >>> 0018-vfio-pci-Add-an-iommufd-option.patch >>> >>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6- >>> iommufd_jgg@nvidia.com/ >>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd- >>> vfio_mdev_no_group_jgg@nvidia.com/ >> >> Following is probably more relevant to [4]: >> >> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/ >> > > absolutely.:-) thanks. > >> Thanks >> Kevin >
Hi Kevin, On 4/18/22 10:49 AM, Tian, Kevin wrote: >> From: Liu, Yi L <yi.l.liu@intel.com> >> Sent: Thursday, April 14, 2022 6:47 PM >> >> With the introduction of iommufd[1], the linux kernel provides a generic >> interface for userspace drivers to propagate their DMA mappings to kernel >> for assigned devices. This series does the porting of the VFIO devices >> onto the /dev/iommu uapi and let it coexist with the legacy implementation. >> Other devices like vpda, vfio mdev and etc. are not considered yet. > vfio mdev has no special support in Qemu. Just that it's not supported > by iommufd yet thus can only be operated in legacy container interface at > this point. Later once it's supported by the kernel suppose no additional > enabling work is required for mdev in Qemu. > >> For vfio devices, the new interface is tied with device fd and iommufd >> as the iommufd solution is device-centric. This is different from legacy >> vfio which is group-centric. To support both interfaces in QEMU, this >> series introduces the iommu backend concept in the form of different >> container classes. The existing vfio container is named legacy container >> (equivalent with legacy iommu backend in this series), while the new >> iommufd based container is named as iommufd container (may also be >> mentioned >> as iommufd backend in this series). The two backend types have their own >> way to setup secure context and dma management interface. Below diagram >> shows how it looks like with both BEs. >> >> VFIO AddressSpace/Memory >> +-------+ +----------+ +-----+ +-----+ >> | pci | | platform | | ap | | ccw | >> +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ >> | | | | | AddressSpace | >> | | | | +------------+---------+ >> +---V-----------V-----------V--------V----+ / >> | VFIOAddressSpace | <------------+ >> | | | MemoryListener >> | VFIOContainer list | >> +-------+----------------------------+----+ >> | | >> | | >> +-------V------+ +--------V----------+ >> | iommufd | | vfio legacy | >> | container | | container | >> +-------+------+ +--------+----------+ >> | | >> | /dev/iommu | /dev/vfio/vfio >> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id >> Userspace | | >> >> ===========+============================+======================= >> ========= >> Kernel | device fd | >> +---------------+ | group/container fd >> | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) >> | ATTACH_IOAS) | | device fd >> | | | >> | +-------V------------V-----------------+ >> iommufd | | vfio | >> (map/unmap | +---------+--------------------+-------+ >> ioas_copy) | | | map/unmap >> | | | >> +------V------+ +-----V------+ +------V--------+ >> | iommfd core | | device | | vfio iommu | >> +-------------+ +------------+ +---------------+ > last row: s/iommfd/iommufd/ > > overall this sounds a reasonable abstraction. Later when vdpa starts > supporting iommufd probably the iommufd BE will become even > smaller with more logic shareable between vfio and vdpa. 
> >> [Secure Context setup] >> - iommufd BE: uses device fd and iommufd to setup secure context >> (bind_iommufd, attach_ioas) >> - vfio legacy BE: uses group fd and container fd to setup secure context >> (set_container, set_iommu) >> [Device access] >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX >> - vfio legacy BE: device fd is retrieved from group fd ioctl >> [DMA Mapping flow] >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener >> - VFIO populates DMA map/unmap via the container BEs >> *) iommufd BE: uses iommufd >> *) vfio legacy BE: uses container fd >> >> This series qomifies the VFIOContainer object which acts as a base class > what does 'qomify' mean? I didn't find this word from dictionary... sorry this is pure QEMU terminology. This stands for "QEMU Object Model" additional info at: https://qemu.readthedocs.io/en/latest/devel/qom.html Eric > >> for a container. This base class is derived into the legacy VFIO container >> and the new iommufd based container. The base class implements generic >> code >> such as code related to memory_listener and address space management >> whereas >> the derived class implements callbacks that depend on the kernel user space > 'the kernel user space'? > >> being used. >> >> The selection of the backend is made on a device basis using the new >> iommufd option (on/off/auto). By default the iommufd backend is selected >> if supported by the host and by QEMU (iommufd KConfig). This option is >> currently available only for the vfio-pci device. For other types of >> devices, it does not yet exist and the legacy BE is chosen by default. >> >> Test done: >> - PCI and Platform device were tested > In this case PCI uses iommufd while platform device uses legacy? > >> - ccw and ap were only compile-tested >> - limited device hotplug test >> - vIOMMU test run for both legacy and iommufd backends (limited tests) >> >> This series was co-developed by Eric Auger and me based on the exploration >> iommufd kernel[2], complete code of this series is available in[3]. As >> iommufd kernel is in the early step (only iommufd generic interface is in >> mailing list), so this series hasn't made the iommufd backend fully on par >> with legacy backend w.r.t. features like p2p mappings, coherency tracking, > what does 'coherency tracking' mean here? if related to iommu enforce > snoop it is fully handled by the kernel so far. I didn't find any use of > VFIO_DMA_CC_IOMMU in current Qemu. > >> live migration, etc. This series hasn't supported PCI devices without FLR >> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when >> userspace >> is using iommufd. The kernel needs to be updated to accept device fd list for >> reset when userspace is using iommufd. Related work is in progress by >> Jason[4]. >> >> TODOs: >> - Add DMA alias check for iommufd BE (group level) >> - Make pci.c to be BE agnostic. 
Needs kernel change as well to fix the >> VFIO_DEVICE_PCI_HOT_RESET gap >> - Cleanup the VFIODevice fields as it's used in both BEs >> - Add locks >> - Replace list with g_tree >> - More tests >> >> Patch Overview: >> >> - Preparation: >> 0001-scripts-update-linux-headers-Add-iommufd.h.patch >> 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch >> 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch >> 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch >> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into- >> iommu_m.patch > 3-5 are pure cleanups which could be sent out separately > >> 0006-vfio-common-Split-common.c-into-common.c-container.c.patch >> >> - Introduce container object and covert existing vfio to use it: >> 0007-vfio-Add-base-object-for-VFIOContainer.patch >> 0008-vfio-container-Introduce-vfio_attach-detach_device.patch >> 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch >> 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch >> 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch >> 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch >> 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch >> >> - Introduce iommufd based container: >> 0014-hw-iommufd-Creation.patch >> 0015-vfio-iommufd-Implement-iommufd-backend.patch >> 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch >> >> - Add backend selection for vfio-pci: >> 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch >> 0018-vfio-pci-Add-an-iommufd-option.patch >> >> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6- >> iommufd_jgg@nvidia.com/ >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd- >> vfio_mdev_no_group_jgg@nvidia.com/ > Following is probably more relevant to [4]: > > https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/ > > Thanks > Kevin >
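
To make the "QOM-ify" terminology concrete for readers following the link
above, here is a sketch of an abstract base type whose class structure
carries the hooks that the legacy and iommufd subclasses override; the names
are illustrative, not the series' actual ones:

    #include "qemu/osdep.h"
    #include "qom/object.h"

    #define TYPE_VFIO_CONTAINER_OBJ "vfio-container-obj"
    OBJECT_DECLARE_TYPE(VFIOContainerObj, VFIOContainerObjClass,
                        VFIO_CONTAINER_OBJ)

    struct VFIOContainerObj {
        Object parent_obj;
        /* generic state shared by both backends (listener, AS, ...) */
    };

    struct VFIOContainerObjClass {
        ObjectClass parent_class;
        /* hooks the "legacy" and "iommufd" subclasses override; the names
         * mirror the attach/detach and reset patches in the overview above */
        int  (*attach_device)(VFIOContainerObj *c, void *vbasedev, Error **errp);
        void (*detach_device)(VFIOContainerObj *c, void *vbasedev);
        int  (*reset)(VFIOContainerObj *c);
    };

    static const TypeInfo vfio_container_obj_info = {
        .name          = TYPE_VFIO_CONTAINER_OBJ,
        .parent        = TYPE_OBJECT,
        .instance_size = sizeof(VFIOContainerObj),
        .class_size    = sizeof(VFIOContainerObjClass),
        .abstract      = true,   /* only the two subclasses are instantiated */
    };

    static void vfio_container_obj_register(void)
    {
        type_register_static(&vfio_container_obj_info);
    }
    type_init(vfio_container_obj_register);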
Hi Alex, On 4/23/22 12:09 AM, Alex Williamson wrote: > [Cc +libvirt folks] > > On Thu, 14 Apr 2022 03:46:52 -0700 > Yi Liu <yi.l.liu@intel.com> wrote: > >> With the introduction of iommufd[1], the linux kernel provides a generic >> interface for userspace drivers to propagate their DMA mappings to kernel >> for assigned devices. This series does the porting of the VFIO devices >> onto the /dev/iommu uapi and let it coexist with the legacy implementation. >> Other devices like vpda, vfio mdev and etc. are not considered yet. >> >> For vfio devices, the new interface is tied with device fd and iommufd >> as the iommufd solution is device-centric. This is different from legacy >> vfio which is group-centric. To support both interfaces in QEMU, this >> series introduces the iommu backend concept in the form of different >> container classes. The existing vfio container is named legacy container >> (equivalent with legacy iommu backend in this series), while the new >> iommufd based container is named as iommufd container (may also be mentioned >> as iommufd backend in this series). The two backend types have their own >> way to setup secure context and dma management interface. Below diagram >> shows how it looks like with both BEs. >> >> VFIO AddressSpace/Memory >> +-------+ +----------+ +-----+ +-----+ >> | pci | | platform | | ap | | ccw | >> +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ >> | | | | | AddressSpace | >> | | | | +------------+---------+ >> +---V-----------V-----------V--------V----+ / >> | VFIOAddressSpace | <------------+ >> | | | MemoryListener >> | VFIOContainer list | >> +-------+----------------------------+----+ >> | | >> | | >> +-------V------+ +--------V----------+ >> | iommufd | | vfio legacy | >> | container | | container | >> +-------+------+ +--------+----------+ >> | | >> | /dev/iommu | /dev/vfio/vfio >> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id >> Userspace | | >> ===========+============================+================================ >> Kernel | device fd | >> +---------------+ | group/container fd >> | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) >> | ATTACH_IOAS) | | device fd >> | | | >> | +-------V------------V-----------------+ >> iommufd | | vfio | >> (map/unmap | +---------+--------------------+-------+ >> ioas_copy) | | | map/unmap >> | | | >> +------V------+ +-----V------+ +------V--------+ >> | iommfd core | | device | | vfio iommu | >> +-------------+ +------------+ +---------------+ >> >> [Secure Context setup] >> - iommufd BE: uses device fd and iommufd to setup secure context >> (bind_iommufd, attach_ioas) >> - vfio legacy BE: uses group fd and container fd to setup secure context >> (set_container, set_iommu) >> [Device access] >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX >> - vfio legacy BE: device fd is retrieved from group fd ioctl >> [DMA Mapping flow] >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener >> - VFIO populates DMA map/unmap via the container BEs >> *) iommufd BE: uses iommufd >> *) vfio legacy BE: uses container fd >> >> This series qomifies the VFIOContainer object which acts as a base class >> for a container. This base class is derived into the legacy VFIO container >> and the new iommufd based container. The base class implements generic code >> such as code related to memory_listener and address space management whereas >> the derived class implements callbacks that depend on the kernel user space >> being used. 
>> >> The selection of the backend is made on a device basis using the new >> iommufd option (on/off/auto). By default the iommufd backend is selected >> if supported by the host and by QEMU (iommufd KConfig). This option is >> currently available only for the vfio-pci device. For other types of >> devices, it does not yet exist and the legacy BE is chosen by default. > I've discussed this a bit with Eric, but let me propose a different > command line interface. Libvirt generally likes to pass file > descriptors to QEMU rather than grant it access to those files > directly. This was problematic with vfio-pci because libvirt can't > easily know when QEMU will want to grab another /dev/vfio/vfio > container. Therefore we abandoned this approach and instead libvirt > grants file permissions. > > However, with iommufd there's no reason that QEMU ever needs more than > a single instance of /dev/iommufd and we're using per device vfio file > descriptors, so it seems like a good time to revisit this. > > The interface I was considering would be to add an iommufd object to > QEMU, so we might have a: > > -device iommufd[,fd=#][,id=foo] > > For non-libivrt usage this would have the ability to open /dev/iommufd > itself if an fd is not provided. This object could be shared with > other iommufd users in the VM and maybe we'd allow multiple instances > for more esoteric use cases. [NB, maybe this should be a -object rather than > -device since the iommufd is not a guest visible device?] > > The vfio-pci device might then become: > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] > > So essentially we can specify the device via host, sysfsdev, or passing > an fd to the vfio device file. When an iommufd object is specified, > "foo" in the example above, each of those options would use the > vfio-device access mechanism, essentially the same as iommufd=on in > your example. With the fd passing option, an iommufd object would be > required and necessarily use device level access. What is the use case you foresee for the "fd=#" option? > > In your example, the iommufd=auto seems especially troublesome for > libvirt because QEMU is going to have different locked memory > requirements based on whether we're using type1 or iommufd, where the > latter resolves the duplicate accounting issues. libvirt needs to know > deterministically which backed is being used, which this proposal seems > to provide, while at the same time bringing us more in line with fd > passing. Thoughts? Thanks, I like your proposal (based on the -object iommufd). The only thing that may be missing I think is for a qemu end-user who actually does not care about the iommu backend being used but just wishes to use the most recent available one it adds some extra complexity. But this is not the most important use case ;) Thanks Eric > > Alex >
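To make the base-class/backend split in the quoted cover letter a bit more concrete, here is a rough sketch of the kind of callback table the two container backends would implement while the MemoryListener-driven code stays shared. Every name below is an assumption made for illustration; the actual series QOM-ifies VFIOContainer rather than using a plain ops struct, and its callback names may differ. The stand-in typedefs exist only so the fragment compiles outside a QEMU tree.

/*
 * Rough, hypothetical sketch of the backend split: generic container
 * state plus per-backend callbacks for secure-context setup and DMA
 * map/unmap. Not code from the series.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;              /* stand-in for exec/hwaddr.h */
typedef struct DemoDevice DemoDevice; /* stand-in for VFIODevice */

typedef struct DemoContainer DemoContainer;

typedef struct DemoContainerOps {
    /* Secure context setup: group/container fds vs. device fd + iommufd */
    int  (*attach_device)(DemoContainer *c, DemoDevice *dev);
    void (*detach_device)(DemoContainer *c, DemoDevice *dev);
    /* DMA mapping path, invoked from the shared MemoryListener code */
    int  (*dma_map)(DemoContainer *c, hwaddr iova, uint64_t size,
                    void *vaddr, bool readonly);
    int  (*dma_unmap)(DemoContainer *c, hwaddr iova, uint64_t size);
    /* reset callback, along the lines of patch 0013 */
    void (*reset)(DemoContainer *c);
} DemoContainerOps;

struct DemoContainer {
    const DemoContainerOps *ops;   /* legacy-vfio ops or iommufd ops */
    /* generic state: address space, mapping list/tree, listeners, ... */
};

/* The shared code only ever calls through the ops table. */
static inline int demo_container_dma_map(DemoContainer *c, hwaddr iova,
                                         uint64_t size, void *vaddr,
                                         bool readonly)
{
    return c->ops->dma_map(c, iova, size, vaddr, readonly);
}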
On Mon, 25 Apr 2022 22:23:05 +0200 Eric Auger <eric.auger@redhat.com> wrote: > Hi Alex, > > On 4/23/22 12:09 AM, Alex Williamson wrote: > > [Cc +libvirt folks] > > > > On Thu, 14 Apr 2022 03:46:52 -0700 > > Yi Liu <yi.l.liu@intel.com> wrote: > > > >> With the introduction of iommufd[1], the linux kernel provides a generic > >> interface for userspace drivers to propagate their DMA mappings to kernel > >> for assigned devices. This series does the porting of the VFIO devices > >> onto the /dev/iommu uapi and let it coexist with the legacy implementation. > >> Other devices like vpda, vfio mdev and etc. are not considered yet. > >> > >> For vfio devices, the new interface is tied with device fd and iommufd > >> as the iommufd solution is device-centric. This is different from legacy > >> vfio which is group-centric. To support both interfaces in QEMU, this > >> series introduces the iommu backend concept in the form of different > >> container classes. The existing vfio container is named legacy container > >> (equivalent with legacy iommu backend in this series), while the new > >> iommufd based container is named as iommufd container (may also be mentioned > >> as iommufd backend in this series). The two backend types have their own > >> way to setup secure context and dma management interface. Below diagram > >> shows how it looks like with both BEs. > >> > >> VFIO AddressSpace/Memory > >> +-------+ +----------+ +-----+ +-----+ > >> | pci | | platform | | ap | | ccw | > >> +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ > >> | | | | | AddressSpace | > >> | | | | +------------+---------+ > >> +---V-----------V-----------V--------V----+ / > >> | VFIOAddressSpace | <------------+ > >> | | | MemoryListener > >> | VFIOContainer list | > >> +-------+----------------------------+----+ > >> | | > >> | | > >> +-------V------+ +--------V----------+ > >> | iommufd | | vfio legacy | > >> | container | | container | > >> +-------+------+ +--------+----------+ > >> | | > >> | /dev/iommu | /dev/vfio/vfio > >> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id > >> Userspace | | > >> ===========+============================+================================ > >> Kernel | device fd | > >> +---------------+ | group/container fd > >> | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU) > >> | ATTACH_IOAS) | | device fd > >> | | | > >> | +-------V------------V-----------------+ > >> iommufd | | vfio | > >> (map/unmap | +---------+--------------------+-------+ > >> ioas_copy) | | | map/unmap > >> | | | > >> +------V------+ +-----V------+ +------V--------+ > >> | iommfd core | | device | | vfio iommu | > >> +-------------+ +------------+ +---------------+ > >> > >> [Secure Context setup] > >> - iommufd BE: uses device fd and iommufd to setup secure context > >> (bind_iommufd, attach_ioas) > >> - vfio legacy BE: uses group fd and container fd to setup secure context > >> (set_container, set_iommu) > >> [Device access] > >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX > >> - vfio legacy BE: device fd is retrieved from group fd ioctl > >> [DMA Mapping flow] > >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener > >> - VFIO populates DMA map/unmap via the container BEs > >> *) iommufd BE: uses iommufd > >> *) vfio legacy BE: uses container fd > >> > >> This series qomifies the VFIOContainer object which acts as a base class > >> for a container. This base class is derived into the legacy VFIO container > >> and the new iommufd based container. 
The base class implements generic code > >> such as code related to memory_listener and address space management whereas > >> the derived class implements callbacks that depend on the kernel user space > >> being used. > >> > >> The selection of the backend is made on a device basis using the new > >> iommufd option (on/off/auto). By default the iommufd backend is selected > >> if supported by the host and by QEMU (iommufd KConfig). This option is > >> currently available only for the vfio-pci device. For other types of > >> devices, it does not yet exist and the legacy BE is chosen by default. > > I've discussed this a bit with Eric, but let me propose a different > > command line interface. Libvirt generally likes to pass file > > descriptors to QEMU rather than grant it access to those files > > directly. This was problematic with vfio-pci because libvirt can't > > easily know when QEMU will want to grab another /dev/vfio/vfio > > container. Therefore we abandoned this approach and instead libvirt > > grants file permissions. > > > > However, with iommufd there's no reason that QEMU ever needs more than > > a single instance of /dev/iommufd and we're using per device vfio file > > descriptors, so it seems like a good time to revisit this. > > > > The interface I was considering would be to add an iommufd object to > > QEMU, so we might have a: > > > > -device iommufd[,fd=#][,id=foo] > > > > For non-libivrt usage this would have the ability to open /dev/iommufd > > itself if an fd is not provided. This object could be shared with > > other iommufd users in the VM and maybe we'd allow multiple instances > > for more esoteric use cases. [NB, maybe this should be a -object rather than > > -device since the iommufd is not a guest visible device?] > > > > The vfio-pci device might then become: > > > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo] > > > > So essentially we can specify the device via host, sysfsdev, or passing > > an fd to the vfio device file. When an iommufd object is specified, > > "foo" in the example above, each of those options would use the > > vfio-device access mechanism, essentially the same as iommufd=on in > > your example. With the fd passing option, an iommufd object would be > > required and necessarily use device level access. > What is the use case you foresee for the "fd=#" option? On the vfio-pci device this was intended to be the actual vfio device file descriptor. Once we have a file per device, QEMU doesn't really have any need to navigate through sysfs to determine which fd to use other than for user convenience on the command line. For libvirt usage, I assume QEMU could accept the device fd, without ever really knowing anything about the host address or sysfs path of the device. > > > > In your example, the iommufd=auto seems especially troublesome for > > libvirt because QEMU is going to have different locked memory > > requirements based on whether we're using type1 or iommufd, where the > > latter resolves the duplicate accounting issues. libvirt needs to know > > deterministically which backed is being used, which this proposal seems > > to provide, while at the same time bringing us more in line with fd > > passing. Thoughts? Thanks, > I like your proposal (based on the -object iommufd). The only thing that > may be missing I think is for a qemu end-user who actually does not care > about the iommu backend being used but just wishes to use the most > recent available one it adds some extra complexity. 
But this is not the > most important use case ;) Yeah, I can sympathize with that, but isn't that also why we're pursing a vfio compatibility interface at the kernel level? Eventually, once the native vfio IOMMU backends go away, the vfio "container" device file will be provided by iommufd and that transition to the new interface can be both seamless to the user and apparent to tools like libvirt. An end-user with a fixed command line should continue to work and will eventually get iommufd via compatibility, but taking care of an end-user that "does not care" and "wishes to use the most recent" is a non-goal for me. That would be more troublesome for tools and use cases that we do care about imo. Thanks, Alex
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Monday, April 25, 2022 10:38 PM > > On Mon, 25 Apr 2022 11:10:14 +0100 > Daniel P. Berrangé <berrange@redhat.com> wrote: > > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote: > > > [Cc +libvirt folks] > > > > > > On Thu, 14 Apr 2022 03:46:52 -0700 > > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > > > With the introduction of iommufd[1], the linux kernel provides a > generic > > > > interface for userspace drivers to propagate their DMA mappings to > kernel > > > > for assigned devices. This series does the porting of the VFIO devices > > > > onto the /dev/iommu uapi and let it coexist with the legacy > implementation. > > > > Other devices like vpda, vfio mdev and etc. are not considered yet. > > > > snip > > > > > > The selection of the backend is made on a device basis using the new > > > > iommufd option (on/off/auto). By default the iommufd backend is > selected > > > > if supported by the host and by QEMU (iommufd KConfig). This option > is > > > > currently available only for the vfio-pci device. For other types of > > > > devices, it does not yet exist and the legacy BE is chosen by default. > > > > > > I've discussed this a bit with Eric, but let me propose a different > > > command line interface. Libvirt generally likes to pass file > > > descriptors to QEMU rather than grant it access to those files > > > directly. This was problematic with vfio-pci because libvirt can't > > > easily know when QEMU will want to grab another /dev/vfio/vfio > > > container. Therefore we abandoned this approach and instead libvirt > > > grants file permissions. > > > > > > However, with iommufd there's no reason that QEMU ever needs more > than > > > a single instance of /dev/iommufd and we're using per device vfio file > > > descriptors, so it seems like a good time to revisit this. > > > > I assume access to '/dev/iommufd' gives the process somewhat elevated > > privileges, such that you don't want to unconditionally give QEMU > > access to this device ? > > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged > interface which should have limited scope for abuse, but more so here > the goal would be to de-privilege QEMU that one step further that it > cannot open the device file itself. > > > > The interface I was considering would be to add an iommufd object to > > > QEMU, so we might have a: > > > > > > -device iommufd[,fd=#][,id=foo] > > > > > > For non-libivrt usage this would have the ability to open /dev/iommufd > > > itself if an fd is not provided. This object could be shared with > > > other iommufd users in the VM and maybe we'd allow multiple instances > > > for more esoteric use cases. [NB, maybe this should be a -object rather > than > > > -device since the iommufd is not a guest visible device?] > > > > Yes, -object would be the right answer for something that's purely > > a host side backend impl selector. > > > > > The vfio-pci device might then become: > > > > > > -device vfio- > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f > oo] > > > > > > So essentially we can specify the device via host, sysfsdev, or passing > > > an fd to the vfio device file. When an iommufd object is specified, > > > "foo" in the example above, each of those options would use the > > > vfio-device access mechanism, essentially the same as iommufd=on in > > > your example. With the fd passing option, an iommufd object would be > > > required and necessarily use device level access. 
> > > > > > In your example, the iommufd=auto seems especially troublesome for > > > libvirt because QEMU is going to have different locked memory > > > requirements based on whether we're using type1 or iommufd, where > the > > > latter resolves the duplicate accounting issues. libvirt needs to know Based on current plan there is probably a transition window between the point where the first vfio device type (vfio-pci) gaining iommufd support and the point where all vfio types supporting iommufd. Libvirt can figure out which one to use iommufd by checking the presence of /dev/vfio/devices/vfioX. But what would be the resource limit policy in Libvirt in such transition window when both type1 and iommufd might be used? Or do we just expect Libvirt to support iommufd only after the transition window ends to avoid handling such mess? > > > deterministically which backed is being used, which this proposal seems > > > to provide, while at the same time bringing us more in line with fd > > > passing. Thoughts? Thanks, > > > > Yep, I agree that libvirt needs to have more direct control over this. > > This is also even more important if there are notable feature differences > > in the 2 backends. > > > > I wonder if anyone has considered an even more distinct impl, whereby > > we have a completely different device type on the backend, eg > > > > -device vfio-iommu- > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f > oo] > > > > If a vendor wants to fully remove the legacy impl, they can then use the > > Kconfig mechanism to disable the build of the legacy impl device, while > > keeping the iommu impl (or vica-verca if the new iommu impl isn't > considered > > reliable enough for them to support yet). > > > > Libvirt would use > > > > -object iommu,id=iommu0,fd=NNN > > -device vfio-iommu-pci,fd=MMM,iommu=iommu0 > > > > Non-libvirt would use a simpler > > > > -device vfio-iommu-pci,host=0000:03:22.1 > > > > with QEMU auto-creating a 'iommu' object in the background. > > > > This would fit into libvirt's existing modelling better. We currently have > > a concept of a PCI assignment backend, which previously supported the > > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl > > feels like a 3rd PCI assignment approach, and so fits with how we modelled > > it as a different device type in the past. > > I don't think we want to conflate "iommu" and "iommufd", we're creating > an object that interfaces into the iommufd uAPI, not an iommu itself. > Likewise "vfio-iommu-pci" is just confusing, there was an iommu > interface previously, it's just a different implementation now and as > far as the VM interface to the device, it's identical. Note that a > "vfio-iommufd-pci" device multiplies the matrix of every vfio device > for a rather subtle implementation detail. > > My expectation would be that libvirt uses: > > -object iommufd,id=iommufd0,fd=NNN > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > Whereas simple QEMU command line would be: > > -object iommufd,id=iommufd0 > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > The iommufd object would open /dev/iommufd itself. Creating an > implicit iommufd object is someone problematic because one of the > things I forgot to highlight in my previous description is that the > iommufd object is meant to be shared across not only various vfio > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. > vdpa. 
Out of curiosity - in concept one iommufd is sufficient to support all ioas requirements across subsystems while having multiple iommufd's instead lose the benefit of centralized accounting. The latter will also cause some trouble when we start virtualizing ENQCMD which requires VM-wide PASID virtualization thus further needs to share that information across iommufd's. Not unsolvable but really no gain by adding such complexity. So I'm curious whether Qemu provide a way to restrict that certain object type can only have one instance to discourage such multi-iommufd attempt? > > If the old style were used: > > -device vfio-pci,host=0000:02:00.0 > > Then QEMU would use vfio for the IOMMU backend. > > If libvirt/userspace wants to query whether "legacy" vfio is still > supported by the host kernel, I think it'd only need to look for > whether the /dev/vfio/vfio container interface still exists. > > If we need some means for QEMU to remove legacy support, I'd rather > find a way to do it via probing device options. It's easy enough to > see if iommufd support exists by looking for the presence of the > iommufd option for the vfio-pci device and Kconfig within QEMU could be > used regardless of whether we define a new device name. Thanks, > > Alex
> From: Eric Auger <eric.auger@redhat.com> > Sent: Tuesday, April 26, 2022 3:55 AM > > Hi Kevin, > > On 4/18/22 10:49 AM, Tian, Kevin wrote: > >> From: Liu, Yi L <yi.l.liu@intel.com> > >> Sent: Thursday, April 14, 2022 6:47 PM > >> > >> This series qomifies the VFIOContainer object which acts as a base class > > what does 'qomify' mean? I couldn't find this word in the dictionary... > sorry, this is pure QEMU terminology. This stands for "QEMU Object Model" > additional info at: > https://qemu.readthedocs.io/en/latest/devel/qom.html > Nice to know!
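Since the question came up, a minimal illustration of what "QOM-ifying" a type involves in practice: declaring an abstract base type and a concrete subtype and registering both with QEMU's type system. This is generic QOM boilerplate reusing the invented DemoContainer name from the sketch earlier in the thread, not the actual code from the series, and it only builds inside a QEMU source tree.

/* Minimal QOM illustration: an abstract base type plus a concrete
 * subtype registered with QEMU's type system. Type names are invented. */
#include "qemu/osdep.h"
#include "qom/object.h"

#define TYPE_DEMO_CONTAINER        "demo-container"
#define TYPE_DEMO_LEGACY_CONTAINER "demo-legacy-container"

typedef struct DemoContainer {
    Object parent_obj;
} DemoContainer;

typedef struct DemoContainerClass {
    ObjectClass parent_class;
    /* backend hook; subclasses override it in their class_init */
    int (*dma_map)(DemoContainer *c);
} DemoContainerClass;

static const TypeInfo demo_container_info = {
    .name          = TYPE_DEMO_CONTAINER,
    .parent        = TYPE_OBJECT,
    .instance_size = sizeof(DemoContainer),
    .class_size    = sizeof(DemoContainerClass),
    .abstract      = true,           /* base class, never instantiated */
};

static const TypeInfo demo_legacy_container_info = {
    .name   = TYPE_DEMO_LEGACY_CONTAINER,
    .parent = TYPE_DEMO_CONTAINER,   /* inherits instance/class sizes */
};

static void demo_register_types(void)
{
    type_register_static(&demo_container_info);
    type_register_static(&demo_legacy_container_info);
}

type_init(demo_register_types)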
> -----Original Message----- > From: Yi Liu [mailto:yi.l.liu@intel.com] > Sent: 14 April 2022 11:47 > To: alex.williamson@redhat.com; cohuck@redhat.com; > qemu-devel@nongnu.org > Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; > jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com; > eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com; > chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com > Subject: [RFC 00/18] vfio: Adopt iommufd > > With the introduction of iommufd[1], the linux kernel provides a generic > interface for userspace drivers to propagate their DMA mappings to kernel > for assigned devices. This series does the porting of the VFIO devices > onto the /dev/iommu uapi and let it coexist with the legacy implementation. > Other devices like vpda, vfio mdev and etc. are not considered yet. > > For vfio devices, the new interface is tied with device fd and iommufd > as the iommufd solution is device-centric. This is different from legacy > vfio which is group-centric. To support both interfaces in QEMU, this > series introduces the iommu backend concept in the form of different > container classes. The existing vfio container is named legacy container > (equivalent with legacy iommu backend in this series), while the new > iommufd based container is named as iommufd container (may also be > mentioned > as iommufd backend in this series). The two backend types have their own > way to setup secure context and dma management interface. Below diagram > shows how it looks like with both BEs. > > VFIO > AddressSpace/Memory > +-------+ +----------+ +-----+ +-----+ > | pci | | platform | | ap | | ccw | > +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ > | | | | | AddressSpace > | > | | | | +------------+---------+ > +---V-----------V-----------V--------V----+ / > | VFIOAddressSpace | <------------+ > | | | MemoryListener > | VFIOContainer list | > +-------+----------------------------+----+ > | | > | | > +-------V------+ +--------V----------+ > | iommufd | | vfio legacy | > | container | | container | > +-------+------+ +--------+----------+ > | | > | /dev/iommu | /dev/vfio/vfio > | /dev/vfio/devices/vfioX | /dev/vfio/$group_id > Userspace | | > > ===========+============================+========================== > ====== > Kernel | device fd | > +---------------+ | group/container fd > | (BIND_IOMMUFD | | > (SET_CONTAINER/SET_IOMMU) > | ATTACH_IOAS) | | device fd > | | | > | +-------V------------V-----------------+ > iommufd | | vfio | > (map/unmap | +---------+--------------------+-------+ > ioas_copy) | | | map/unmap > | | | > +------V------+ +-----V------+ +------V--------+ > | iommfd core | | device | | vfio iommu | > +-------------+ +------------+ +---------------+ > > [Secure Context setup] > - iommufd BE: uses device fd and iommufd to setup secure context > (bind_iommufd, attach_ioas) > - vfio legacy BE: uses group fd and container fd to setup secure context > (set_container, set_iommu) > [Device access] > - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX > - vfio legacy BE: device fd is retrieved from group fd ioctl > [DMA Mapping flow] > - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener > - VFIO populates DMA map/unmap via the container BEs > *) iommufd BE: uses iommufd > *) vfio legacy BE: uses container fd > > This series qomifies the 
VFIOContainer object which acts as a base class > for a container. This base class is derived into the legacy VFIO container > and the new iommufd based container. The base class implements generic > code > such as code related to memory_listener and address space management > whereas > the derived class implements callbacks that depend on the kernel user space > being used. > > The selection of the backend is made on a device basis using the new > iommufd option (on/off/auto). By default the iommufd backend is selected > if supported by the host and by QEMU (iommufd KConfig). This option is > currently available only for the vfio-pci device. For other types of > devices, it does not yet exist and the legacy BE is chosen by default. > > Test done: > - PCI and Platform device were tested > - ccw and ap were only compile-tested > - limited device hotplug test > - vIOMMU test run for both legacy and iommufd backends (limited tests) > > This series was co-developed by Eric Auger and me based on the exploration > iommufd kernel[2], complete code of this series is available in[3]. As > iommufd kernel is in the early step (only iommufd generic interface is in > mailing list), so this series hasn't made the iommufd backend fully on par > with legacy backend w.r.t. features like p2p mappings, coherency tracking, > live migration, etc. This series hasn't supported PCI devices without FLR > neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when > userspace > is using iommufd. The kernel needs to be updated to accept device fd list for > reset when userspace is using iommufd. Related work is in progress by > Jason[4]. > > TODOs: > - Add DMA alias check for iommufd BE (group level) > - Make pci.c to be BE agnostic. Needs kernel change as well to fix the > VFIO_DEVICE_PCI_HOT_RESET gap > - Cleanup the VFIODevice fields as it's used in both BEs > - Add locks > - Replace list with g_tree > - More tests > > Patch Overview: > > - Preparation: > 0001-scripts-update-linux-headers-Add-iommufd.h.patch > 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch > 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch > 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch > > 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch > 0006-vfio-common-Split-common.c-into-common.c-container.c.patch > > - Introduce container object and covert existing vfio to use it: > 0007-vfio-Add-base-object-for-VFIOContainer.patch > 0008-vfio-container-Introduce-vfio_attach-detach_device.patch > 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch > 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch > 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch > 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch > 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch > > - Introduce iommufd based container: > 0014-hw-iommufd-Creation.patch > 0015-vfio-iommufd-Implement-iommufd-backend.patch > 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch > > - Add backend selection for vfio-pci: > 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch > 0018-vfio-pci-Add-an-iommufd-option.patch > > [1] > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com > / > [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 > [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 Hi, I had a go with the above branches on our ARM64 platform trying to pass-through a VF dev, but Qemu reports an error as below, [ 0.444728] hisi_sec2 
0000:00:01.0: enabling device (0000 -> 0002) qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) I think this happens for the dev BAR addr range. I haven't debugged the kernel yet to see where it actually reports that. Maybe I am missing something. Please let me know. Thanks, Shameer
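For reference, the call that fails above ends up as an IOMMU_IOAS_MAP ioctl on the iommufd. Below is a stand-alone sketch of a successful map of ordinary anonymous memory, written against the upstream <linux/iommufd.h> uAPI; the layout on the RFC kernel branch may differ slightly, and all values are illustrative. The failing case above maps a BAR mmap rather than ordinary memory, which is discussed further down the thread.

/*
 * Sketch of the user-space side of IOMMU_IOAS_MAP, using the upstream
 * uAPI: allocate an IOAS, then map one page of anonymous memory at a
 * fixed IOVA.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/iommufd.h>

int main(void)
{
    int fd = open("/dev/iommu", O_RDWR);
    if (fd < 0) {
        perror("open /dev/iommu");
        return 1;
    }

    struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
    if (ioctl(fd, IOMMU_IOAS_ALLOC, &alloc)) {
        perror("IOMMU_IOAS_ALLOC");
        return 1;
    }

    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    struct iommu_ioas_map map = {
        .size = sizeof(map),
        .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                 IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
        .ioas_id = alloc.out_ioas_id,
        .user_va = (uintptr_t)buf,
        .length = 4096,
        .iova = 0x8000000000ULL,      /* same IOVA as in the error above */
    };
    if (ioctl(fd, IOMMU_IOAS_MAP, &map)) {
        perror("IOMMU_IOAS_MAP");     /* a BAR mmap here fails as above */
        return 1;
    }

    printf("mapped %llu bytes at iova 0x%llx\n",
           (unsigned long long)map.length, (unsigned long long)map.iova);
    close(fd);
    return 0;
}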
Hi Shameer, On 4/26/22 11:47 AM, Shameerali Kolothum Thodi wrote: > >> -----Original Message----- >> From: Yi Liu [mailto:yi.l.liu@intel.com] >> Sent: 14 April 2022 11:47 >> To: alex.williamson@redhat.com; cohuck@redhat.com; >> qemu-devel@nongnu.org >> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; >> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com; >> eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com; >> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com >> Subject: [RFC 00/18] vfio: Adopt iommufd >> >> With the introduction of iommufd[1], the linux kernel provides a generic >> interface for userspace drivers to propagate their DMA mappings to kernel >> for assigned devices. This series does the porting of the VFIO devices >> onto the /dev/iommu uapi and let it coexist with the legacy implementation. >> Other devices like vpda, vfio mdev and etc. are not considered yet. >> >> For vfio devices, the new interface is tied with device fd and iommufd >> as the iommufd solution is device-centric. This is different from legacy >> vfio which is group-centric. To support both interfaces in QEMU, this >> series introduces the iommu backend concept in the form of different >> container classes. The existing vfio container is named legacy container >> (equivalent with legacy iommu backend in this series), while the new >> iommufd based container is named as iommufd container (may also be >> mentioned >> as iommufd backend in this series). The two backend types have their own >> way to setup secure context and dma management interface. Below diagram >> shows how it looks like with both BEs. 
>> >> VFIO >> AddressSpace/Memory >> +-------+ +----------+ +-----+ +-----+ >> | pci | | platform | | ap | | ccw | >> +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+ >> | | | | | AddressSpace >> | >> | | | | +------------+---------+ >> +---V-----------V-----------V--------V----+ / >> | VFIOAddressSpace | <------------+ >> | | | MemoryListener >> | VFIOContainer list | >> +-------+----------------------------+----+ >> | | >> | | >> +-------V------+ +--------V----------+ >> | iommufd | | vfio legacy | >> | container | | container | >> +-------+------+ +--------+----------+ >> | | >> | /dev/iommu | /dev/vfio/vfio >> | /dev/vfio/devices/vfioX | /dev/vfio/$group_id >> Userspace | | >> >> ===========+============================+========================== >> ====== >> Kernel | device fd | >> +---------------+ | group/container fd >> | (BIND_IOMMUFD | | >> (SET_CONTAINER/SET_IOMMU) >> | ATTACH_IOAS) | | device fd >> | | | >> | +-------V------------V-----------------+ >> iommufd | | vfio | >> (map/unmap | +---------+--------------------+-------+ >> ioas_copy) | | | map/unmap >> | | | >> +------V------+ +-----V------+ +------V--------+ >> | iommfd core | | device | | vfio iommu | >> +-------------+ +------------+ +---------------+ >> >> [Secure Context setup] >> - iommufd BE: uses device fd and iommufd to setup secure context >> (bind_iommufd, attach_ioas) >> - vfio legacy BE: uses group fd and container fd to setup secure context >> (set_container, set_iommu) >> [Device access] >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX >> - vfio legacy BE: device fd is retrieved from group fd ioctl >> [DMA Mapping flow] >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener >> - VFIO populates DMA map/unmap via the container BEs >> *) iommufd BE: uses iommufd >> *) vfio legacy BE: uses container fd >> >> This series qomifies the VFIOContainer object which acts as a base class >> for a container. This base class is derived into the legacy VFIO container >> and the new iommufd based container. The base class implements generic >> code >> such as code related to memory_listener and address space management >> whereas >> the derived class implements callbacks that depend on the kernel user space >> being used. >> >> The selection of the backend is made on a device basis using the new >> iommufd option (on/off/auto). By default the iommufd backend is selected >> if supported by the host and by QEMU (iommufd KConfig). This option is >> currently available only for the vfio-pci device. For other types of >> devices, it does not yet exist and the legacy BE is chosen by default. >> >> Test done: >> - PCI and Platform device were tested >> - ccw and ap were only compile-tested >> - limited device hotplug test >> - vIOMMU test run for both legacy and iommufd backends (limited tests) >> >> This series was co-developed by Eric Auger and me based on the exploration >> iommufd kernel[2], complete code of this series is available in[3]. As >> iommufd kernel is in the early step (only iommufd generic interface is in >> mailing list), so this series hasn't made the iommufd backend fully on par >> with legacy backend w.r.t. features like p2p mappings, coherency tracking, >> live migration, etc. This series hasn't supported PCI devices without FLR >> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when >> userspace >> is using iommufd. The kernel needs to be updated to accept device fd list for >> reset when userspace is using iommufd. 
Related work is in progress by >> Jason[4]. >> >> TODOs: >> - Add DMA alias check for iommufd BE (group level) >> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the >> VFIO_DEVICE_PCI_HOT_RESET gap >> - Cleanup the VFIODevice fields as it's used in both BEs >> - Add locks >> - Replace list with g_tree >> - More tests >> >> Patch Overview: >> >> - Preparation: >> 0001-scripts-update-linux-headers-Add-iommufd.h.patch >> 0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch >> 0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch >> 0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch >> >> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch >> 0006-vfio-common-Split-common.c-into-common.c-container.c.patch >> >> - Introduce container object and covert existing vfio to use it: >> 0007-vfio-Add-base-object-for-VFIOContainer.patch >> 0008-vfio-container-Introduce-vfio_attach-detach_device.patch >> 0009-vfio-platform-Use-vfio_-attach-detach-_device.patch >> 0010-vfio-ap-Use-vfio_-attach-detach-_device.patch >> 0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch >> 0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch >> 0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch >> >> - Introduce iommufd based container: >> 0014-hw-iommufd-Creation.patch >> 0015-vfio-iommufd-Implement-iommufd-backend.patch >> 0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch >> >> - Add backend selection for vfio-pci: >> 0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch >> 0018-vfio-pci-Add-an-iommufd-option.patch >> >> [1] >> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com >> / >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 > Hi, > > I had a go with the above branches on our ARM64 platform trying to pass-through > a VF dev, but Qemu reports an error as below, > > [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002) > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) > > I think this happens for the dev BAR addr range. I haven't debugged the kernel > yet to see where it actually reports that. Does it prevent your assigned device from working? I have such errors too but this is a known issue. This is due to the fact P2P DMA is not supported yet. Thanks Eric > > Maybe I am missing something. Please let me know. > > Thanks, > Shameer >
On Tue, Apr 26, 2022 at 08:37:41AM +0000, Tian, Kevin wrote: > Based on current plan there is probably a transition window between the > point where the first vfio device type (vfio-pci) gaining iommufd support > and the point where all vfio types supporting iommufd. I am still hoping to do it all in one shot, let's see :) Jason
> -----Original Message----- > From: Eric Auger [mailto:eric.auger@redhat.com] > Sent: 26 April 2022 12:45 > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi > Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com; > qemu-devel@nongnu.org > Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; > jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; > kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; > peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> > Subject: Re: [RFC 00/18] vfio: Adopt iommufd [...] > >> > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com > >> / > >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 > >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 > > Hi, > > > > I had a go with the above branches on our ARM64 platform trying to > pass-through > > a VF dev, but Qemu reports an error as below, > > > > [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002) > > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address > > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, > 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) > > > > I think this happens for the dev BAR addr range. I haven't debugged the > kernel > > yet to see where it actually reports that. > Does it prevent your assigned device from working? I have such errors > too but this is a known issue. This is due to the fact P2P DMA is not > supported yet. > Yes, the basic tests all good so far. I am still not very clear how it works if the map() fails though. It looks like it fails in, iommufd_ioas_map() iopt_map_user_pages() iopt_map_pages() .. pfn_reader_pin_pages() So does it mean it just works because the page is resident()? Thanks, Shameer
On Tue, 26 Apr 2022 08:37:41 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Monday, April 25, 2022 10:38 PM > > > > On Mon, 25 Apr 2022 11:10:14 +0100 > > Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote: > > > > [Cc +libvirt folks] > > > > > > > > On Thu, 14 Apr 2022 03:46:52 -0700 > > > > Yi Liu <yi.l.liu@intel.com> wrote: > > > > > > > > > With the introduction of iommufd[1], the linux kernel provides a > > generic > > > > > interface for userspace drivers to propagate their DMA mappings to > > kernel > > > > > for assigned devices. This series does the porting of the VFIO devices > > > > > onto the /dev/iommu uapi and let it coexist with the legacy > > implementation. > > > > > Other devices like vpda, vfio mdev and etc. are not considered yet. > > > > > > snip > > > > > > > > The selection of the backend is made on a device basis using the new > > > > > iommufd option (on/off/auto). By default the iommufd backend is > > selected > > > > > if supported by the host and by QEMU (iommufd KConfig). This option > > is > > > > > currently available only for the vfio-pci device. For other types of > > > > > devices, it does not yet exist and the legacy BE is chosen by default. > > > > > > > > I've discussed this a bit with Eric, but let me propose a different > > > > command line interface. Libvirt generally likes to pass file > > > > descriptors to QEMU rather than grant it access to those files > > > > directly. This was problematic with vfio-pci because libvirt can't > > > > easily know when QEMU will want to grab another /dev/vfio/vfio > > > > container. Therefore we abandoned this approach and instead libvirt > > > > grants file permissions. > > > > > > > > However, with iommufd there's no reason that QEMU ever needs more > > than > > > > a single instance of /dev/iommufd and we're using per device vfio file > > > > descriptors, so it seems like a good time to revisit this. > > > > > > I assume access to '/dev/iommufd' gives the process somewhat elevated > > > privileges, such that you don't want to unconditionally give QEMU > > > access to this device ? > > > > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged > > interface which should have limited scope for abuse, but more so here > > the goal would be to de-privilege QEMU that one step further that it > > cannot open the device file itself. > > > > > > The interface I was considering would be to add an iommufd object to > > > > QEMU, so we might have a: > > > > > > > > -device iommufd[,fd=#][,id=foo] > > > > > > > > For non-libivrt usage this would have the ability to open /dev/iommufd > > > > itself if an fd is not provided. This object could be shared with > > > > other iommufd users in the VM and maybe we'd allow multiple instances > > > > for more esoteric use cases. [NB, maybe this should be a -object rather > > than > > > > -device since the iommufd is not a guest visible device?] > > > > > > Yes, -object would be the right answer for something that's purely > > > a host side backend impl selector. > > > > > > > The vfio-pci device might then become: > > > > > > > > -device vfio- > > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f > > oo] > > > > > > > > So essentially we can specify the device via host, sysfsdev, or passing > > > > an fd to the vfio device file. 
When an iommufd object is specified, > > > > "foo" in the example above, each of those options would use the > > > > vfio-device access mechanism, essentially the same as iommufd=on in > > > > your example. With the fd passing option, an iommufd object would be > > > > required and necessarily use device level access. > > > > > > > > In your example, the iommufd=auto seems especially troublesome for > > > > libvirt because QEMU is going to have different locked memory > > > > requirements based on whether we're using type1 or iommufd, where > > the > > > > latter resolves the duplicate accounting issues. libvirt needs to know > > Based on current plan there is probably a transition window between the > point where the first vfio device type (vfio-pci) gaining iommufd support > and the point where all vfio types supporting iommufd. Libvirt can figure > out which one to use iommufd by checking the presence of > /dev/vfio/devices/vfioX. But what would be the resource limit policy > in Libvirt in such transition window when both type1 and iommufd might > be used? Or do we just expect Libvirt to support iommufd only after the > transition window ends to avoid handling such mess? Good point regarding libvirt testing for the vfio device files for use with iommufd, so libvirt would test if /dev/iommufd exists and if the device they want to assign maps to a /dev/vfio/devices/vfioX file. This was essentially implicit in the fd=# option to the vfio-pci device. In mixed combinations, I'd expect libvirt to continue to add the full VM memory to the locked memory limit for each non-iommufd device added. > > > > deterministically which backed is being used, which this proposal seems > > > > to provide, while at the same time bringing us more in line with fd > > > > passing. Thoughts? Thanks, > > > > > > Yep, I agree that libvirt needs to have more direct control over this. > > > This is also even more important if there are notable feature differences > > > in the 2 backends. > > > > > > I wonder if anyone has considered an even more distinct impl, whereby > > > we have a completely different device type on the backend, eg > > > > > > -device vfio-iommu- > > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f > > oo] > > > > > > If a vendor wants to fully remove the legacy impl, they can then use the > > > Kconfig mechanism to disable the build of the legacy impl device, while > > > keeping the iommu impl (or vica-verca if the new iommu impl isn't > > considered > > > reliable enough for them to support yet). > > > > > > Libvirt would use > > > > > > -object iommu,id=iommu0,fd=NNN > > > -device vfio-iommu-pci,fd=MMM,iommu=iommu0 > > > > > > Non-libvirt would use a simpler > > > > > > -device vfio-iommu-pci,host=0000:03:22.1 > > > > > > with QEMU auto-creating a 'iommu' object in the background. > > > > > > This would fit into libvirt's existing modelling better. We currently have > > > a concept of a PCI assignment backend, which previously supported the > > > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl > > > feels like a 3rd PCI assignment approach, and so fits with how we modelled > > > it as a different device type in the past. > > > > I don't think we want to conflate "iommu" and "iommufd", we're creating > > an object that interfaces into the iommufd uAPI, not an iommu itself. 
> > Likewise "vfio-iommu-pci" is just confusing, there was an iommu > > interface previously, it's just a different implementation now and as > > far as the VM interface to the device, it's identical. Note that a > > "vfio-iommufd-pci" device multiplies the matrix of every vfio device > > for a rather subtle implementation detail. > > > > My expectation would be that libvirt uses: > > > > -object iommufd,id=iommufd0,fd=NNN > > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > > > Whereas simple QEMU command line would be: > > > > -object iommufd,id=iommufd0 > > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > > > The iommufd object would open /dev/iommufd itself. Creating an > > implicit iommufd object is someone problematic because one of the > > things I forgot to highlight in my previous description is that the > > iommufd object is meant to be shared across not only various vfio > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. > > vdpa. > > Out of curiosity - in concept one iommufd is sufficient to support all > ioas requirements across subsystems while having multiple iommufd's > instead lose the benefit of centralized accounting. The latter will also > cause some trouble when we start virtualizing ENQCMD which requires > VM-wide PASID virtualization thus further needs to share that > information across iommufd's. Not unsolvable but really no gain by > adding such complexity. So I'm curious whether Qemu provide > a way to restrict that certain object type can only have one instance > to discourage such multi-iommufd attempt? I don't see any reason for QEMU to restrict iommufd objects. The QEMU philosophy seems to be to let users create whatever configuration they want. For libvirt though, the assumption would be that a single iommufd object can be used across subsystems, so libvirt would never automatically create multiple objects. We also need to be able to advise libvirt as to how each iommufd object or user of that object factors into the VM locked memory requirement. When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt to set the locked memory limit to the size of VM RAM per iommufd, regardless of the number of devices using a given iommufd. However, I don't know if all users of iommufd will be exclusively mapping VM RAM. Combinations of devices where some map VM RAM and others map QEMU buffer space could still require some incremental increase per device (I'm not sure if vfio-nvme is such a device). It seems like heuristics will still be involved even after iommufd solves the per-device vfio-pci locked memory limit issue. Thanks, Alex
On Tue, 26 Apr 2022 12:43:35 +0000 Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > > -----Original Message----- > > From: Eric Auger [mailto:eric.auger@redhat.com] > > Sent: 26 April 2022 12:45 > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi > > Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com; > > qemu-devel@nongnu.org > > Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; > > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; > > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; > > jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; > > kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; > > peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> > > Subject: Re: [RFC 00/18] vfio: Adopt iommufd > > [...] > > > >> > > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com > > >> / > > >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 > > >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 > > > Hi, > > > > > > I had a go with the above branches on our ARM64 platform trying to > > pass-through > > > a VF dev, but Qemu reports an error as below, > > > > > > [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002) > > > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address > > > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, > > 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) > > > > > > I think this happens for the dev BAR addr range. I haven't debugged the > > kernel > > > yet to see where it actually reports that. > > Does it prevent your assigned device from working? I have such errors > > too but this is a known issue. This is due to the fact P2P DMA is not > > supported yet. > > > > Yes, the basic tests all good so far. I am still not very clear how it works if > the map() fails though. It looks like it fails in, > > iommufd_ioas_map() > iopt_map_user_pages() > iopt_map_pages() > .. > pfn_reader_pin_pages() > > So does it mean it just works because the page is resident()? No, it just means that you're not triggering any accesses that require peer-to-peer DMA support. Any sort of test where the device is only performing DMA to guest RAM, which is by far the standard use case, will work fine. This also doesn't affect vCPU access to BAR space. It's only a failure of the mappings of the BAR space into the IOAS, which is only used when a device tries to directly target another device's BAR space via DMA. Thanks, Alex
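As a side note on the mechanics: the BAR mappings in question reach the container through the shared MemoryListener as "ram device" sections. The hypothetical helper below, which uses QEMU's real MemoryRegion predicates but is not code from the series, shows how a backend could tell those sections apart from ordinary guest RAM; whether to skip them or simply tolerate the map failure until P2P DMA is supported is a policy choice for the series. It only builds inside a QEMU tree.

/* Hypothetical helper, not from the series: distinguish BAR mmap
 * ("ram device") sections from ordinary guest RAM. */
#include "qemu/osdep.h"
#include "exec/memory.h"

bool demo_section_is_mappable_ram(MemoryRegionSection *section)
{
    /* BAR mmaps look like RAM to the vCPU but are backed by MMIO;
     * mapping them into an IOAS is what requires P2P DMA support. */
    if (memory_region_is_ram_device(section->mr)) {
        return false;
    }
    return memory_region_is_ram(section->mr);
}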
On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote: > We also need to be able to advise libvirt as to how each iommufd object > or user of that object factors into the VM locked memory requirement. > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt > to set the locked memory limit to the size of VM RAM per iommufd, > regardless of the number of devices using a given iommufd. However, I > don't know if all users of iommufd will be exclusively mapping VM RAM. > Combinations of devices where some map VM RAM and others map QEMU > buffer space could still require some incremental increase per device > (I'm not sure if vfio-nvme is such a device). It seems like heuristics > will still be involved even after iommufd solves the per-device > vfio-pci locked memory limit issue. Thanks, If the model is to pass the FD, how about we put a limit on the FD itself instead of abusing the locked memory limit? We could have a no-way-out ioctl that directly limits the # of PFNs covered by iopt_pages inside an iommufd. Jason
On Tue, 26 Apr 2022 13:42:17 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote: > > We also need to be able to advise libvirt as to how each iommufd object > > or user of that object factors into the VM locked memory requirement. > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt > > to set the locked memory limit to the size of VM RAM per iommufd, > > regardless of the number of devices using a given iommufd. However, I > > don't know if all users of iommufd will be exclusively mapping VM RAM. > > Combinations of devices where some map VM RAM and others map QEMU > > buffer space could still require some incremental increase per device > > (I'm not sure if vfio-nvme is such a device). It seems like heuristics > > will still be involved even after iommufd solves the per-device > > vfio-pci locked memory limit issue. Thanks, > > If the model is to pass the FD, how about we put a limit on the FD > itself instead of abusing the locked memory limit? > > We could have a no-way-out ioctl that directly limits the # of PFNs > covered by iopt_pages inside an iommufd. FD passing would likely only be the standard for libvirt invoked VMs. The QEMU vfio-pci device would still parse a host= or sysfsdev= option when invoked by mortals and associate to use the legacy vfio group interface or the new vfio device interface based on whether an iommufd is specified. Does that rule out your suggestion? I don't know, please reveal more about the mechanics of putting a limit on the FD itself and this no-way-out ioctl. The latter name suggests to me that I should also note that we need to support memory hotplug with these devices. Thanks, Alex
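Since fd passing is central to this proposal: libvirt hands pre-opened file descriptors to QEMU over a Unix socket using SCM_RIGHTS ancillary data. The sketch below shows the sender side of that generic mechanism in plain POSIX; it is not libvirt's or QEMU's actual code, and sock is assumed to be a connected AF_UNIX stream socket.

/* Generic illustration of fd passing with SCM_RIGHTS. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int demo_send_fd(int sock, int fd_to_pass)
{
    char data = 'F';                       /* one byte of real payload */
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cmsg;

    memset(ctrl.buf, 0, sizeof(ctrl.buf));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    /* The fd is duplicated into the receiver when it calls recvmsg(). */
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}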
On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote: > On Tue, 26 Apr 2022 13:42:17 -0300 > Jason Gunthorpe <jgg@nvidia.com> wrote: > > > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote: > > > We also need to be able to advise libvirt as to how each iommufd object > > > or user of that object factors into the VM locked memory requirement. > > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt > > > to set the locked memory limit to the size of VM RAM per iommufd, > > > regardless of the number of devices using a given iommufd. However, I > > > don't know if all users of iommufd will be exclusively mapping VM RAM. > > > Combinations of devices where some map VM RAM and others map QEMU > > > buffer space could still require some incremental increase per device > > > (I'm not sure if vfio-nvme is such a device). It seems like heuristics > > > will still be involved even after iommufd solves the per-device > > > vfio-pci locked memory limit issue. Thanks, > > > > If the model is to pass the FD, how about we put a limit on the FD > > itself instead of abusing the locked memory limit? > > > > We could have a no-way-out ioctl that directly limits the # of PFNs > > covered by iopt_pages inside an iommufd. > > FD passing would likely only be the standard for libvirt invoked VMs. > The QEMU vfio-pci device would still parse a host= or sysfsdev= option > when invoked by mortals and associate to use the legacy vfio group > interface or the new vfio device interface based on whether an iommufd > is specified. Yes, but perhaps we don't need resource limits in the mortals case.. > Does that rule out your suggestion? I don't know, please reveal more > about the mechanics of putting a limit on the FD itself and this > no-way-out ioctl. The latter name suggests to me that I should also > note that we need to support memory hotplug with these devices. Thanks, So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in realtime today? It could still work, instead of no way out iommufd would have to check for CAP_SYS_RESOURCE to make the limit higher. It is a pretty simple idea, we just attach a resource limit to the FD and every PFN that gets mapped into the iommufd counts against that limit, regardless if it is pinned or not. An ioctl on the FD would set the limit, defaulting to unlimited. To me this has the appeal that what is being resourced controlled is strictly defined - address space mapped into an iommufd - which has a bunch of nice additional consequences like partially bounding the amount of kernel memory an iommufd can consume and so forth. Doesn't interact with iouring or rdma however. Though we could certianly consider allowing RDMA to consume an iommufd to access pinned pages much like a vfio-mdev would - I'm not sure what is ideal for the qemu usage of RDMA for migration.. Jason
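For reference on the mechanics Jason mentions: raising RLIMIT_MEMLOCK on an already running process is a single prlimit(2) call, as sketched below. The per-iommufd PFN-limit ioctl being proposed does not exist yet, so only the existing rlimit mechanism is shown; raising another process's hard limit needs CAP_SYS_RESOURCE or matching credentials.

/* Minimal sketch: adjust RLIMIT_MEMLOCK on a running process. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <memlock-bytes>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)atoi(argv[1]);
    unsigned long long bytes = strtoull(argv[2], NULL, 0);
    struct rlimit new_limit = { .rlim_cur = bytes, .rlim_max = bytes };
    struct rlimit old_limit;

    if (prlimit(pid, RLIMIT_MEMLOCK, &new_limit, &old_limit)) {
        perror("prlimit");
        return 1;
    }
    printf("RLIMIT_MEMLOCK: %llu -> %llu bytes\n",
           (unsigned long long)old_limit.rlim_cur, bytes);
    return 0;
}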
> From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, April 27, 2022 12:22 AM > > > > > > My expectation would be that libvirt uses: > > > > > > -object iommufd,id=iommufd0,fd=NNN > > > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > > > > > Whereas simple QEMU command line would be: > > > > > > -object iommufd,id=iommufd0 > > > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > > > > > The iommufd object would open /dev/iommufd itself. Creating an > > > implicit iommufd object is someone problematic because one of the > > > things I forgot to highlight in my previous description is that the > > > iommufd object is meant to be shared across not only various vfio > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. > > > vdpa. > > > > Out of curiosity - in concept one iommufd is sufficient to support all > > ioas requirements across subsystems while having multiple iommufd's > > instead lose the benefit of centralized accounting. The latter will also > > cause some trouble when we start virtualizing ENQCMD which requires > > VM-wide PASID virtualization thus further needs to share that > > information across iommufd's. Not unsolvable but really no gain by > > adding such complexity. So I'm curious whether Qemu provide > > a way to restrict that certain object type can only have one instance > > to discourage such multi-iommufd attempt? > > I don't see any reason for QEMU to restrict iommufd objects. The QEMU > philosophy seems to be to let users create whatever configuration they > want. For libvirt though, the assumption would be that a single > iommufd object can be used across subsystems, so libvirt would never > automatically create multiple objects. I like the flexibility what the objection approach gives in your proposal. But with the said complexity in mind (with no foreseen benefit), I wonder whether an alternative approach which treats iommufd as a global property instead of an object is acceptable in Qemu, i.e.: -iommufd on/off -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0] All devices with iommufd specified then implicitly share a single iommufd object within Qemu. This still allows vfio devices to be specified via fd but just requires Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be considered or just not a typical way in Qemu philosophy e.g. any object associated with a device must be explicitly specified? Thanks Kevin
On Thu, 28 Apr 2022 03:21:45 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, April 27, 2022 12:22 AM > > > > > > > > My expectation would be that libvirt uses: > > > > > > > > -object iommufd,id=iommufd0,fd=NNN > > > > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > > > > > > > Whereas simple QEMU command line would be: > > > > > > > > -object iommufd,id=iommufd0 > > > > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > > > > > > > The iommufd object would open /dev/iommufd itself. Creating an > > > > implicit iommufd object is someone problematic because one of the > > > > things I forgot to highlight in my previous description is that the > > > > iommufd object is meant to be shared across not only various vfio > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. > > > > vdpa. > > > > > > Out of curiosity - in concept one iommufd is sufficient to support all > > > ioas requirements across subsystems while having multiple iommufd's > > > instead lose the benefit of centralized accounting. The latter will also > > > cause some trouble when we start virtualizing ENQCMD which requires > > > VM-wide PASID virtualization thus further needs to share that > > > information across iommufd's. Not unsolvable but really no gain by > > > adding such complexity. So I'm curious whether Qemu provide > > > a way to restrict that certain object type can only have one instance > > > to discourage such multi-iommufd attempt? > > > > I don't see any reason for QEMU to restrict iommufd objects. The QEMU > > philosophy seems to be to let users create whatever configuration they > > want. For libvirt though, the assumption would be that a single > > iommufd object can be used across subsystems, so libvirt would never > > automatically create multiple objects. > > I like the flexibility what the objection approach gives in your proposal. > But with the said complexity in mind (with no foreseen benefit), I wonder What's the actual complexity? Front-end/backend splits are very common in QEMU. We're making the object connection via name, why is it significantly more complicated to allow multiple iommufd objects? On the contrary, it seems to me that we'd need to go out of our way to add code to block multiple iommufd objects. > whether an alternative approach which treats iommufd as a global > property instead of an object is acceptable in Qemu, i.e.: > > -iommufd on/off > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0] > > All devices with iommufd specified then implicitly share a single iommufd > object within Qemu. QEMU requires key-value pairs AFAIK, so the above doesn't work, then we're just back to the iommufd=on/off. > This still allows vfio devices to be specified via fd but just requires Libvirt > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be > considered or just not a typical way in Qemu philosophy e.g. any object > associated with a device must be explicitly specified? Avoiding QEMU opening files was a significant focus of my alternate proposal. Also note that we must be able to support hotplug, so we need to be able to dynamically add and remove the iommufd object, I don't see that a global property allows for that. Implicit associations of devices to shared resources doesn't seem particularly desirable to me. Thanks, Alex
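On the hotplug point above, one advantage of the -object form is that the backend and the device can both be created and torn down at runtime. Assuming the iommufd object type and the vfio-pci iommufd= property proposed in this RFC (neither is an upstream interface yet), a monitor session could look roughly like:

(qemu) object_add iommufd,id=iommufd0
(qemu) device_add vfio-pci,host=0000:02:00.0,iommufd=iommufd0,id=hostdev0
... later, on unplug ...
(qemu) device_del hostdev0
(qemu) object_del iommufd0

A global -iommufd on/off switch has no comparable per-instance lifecycle, which is part of the objection to it.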
On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote: > On Thu, 28 Apr 2022 03:21:45 +0000 > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > Sent: Wednesday, April 27, 2022 12:22 AM > > > > > > > > > > My expectation would be that libvirt uses: > > > > > > > > > > -object iommufd,id=iommufd0,fd=NNN > > > > > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > > > > > > > > > Whereas simple QEMU command line would be: > > > > > > > > > > -object iommufd,id=iommufd0 > > > > > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > > > > > > > > > The iommufd object would open /dev/iommufd itself. Creating an > > > > > implicit iommufd object is someone problematic because one of the > > > > > things I forgot to highlight in my previous description is that the > > > > > iommufd object is meant to be shared across not only various vfio > > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex. > > > > > vdpa. > > > > > > > > Out of curiosity - in concept one iommufd is sufficient to support all > > > > ioas requirements across subsystems while having multiple iommufd's > > > > instead lose the benefit of centralized accounting. The latter will also > > > > cause some trouble when we start virtualizing ENQCMD which requires > > > > VM-wide PASID virtualization thus further needs to share that > > > > information across iommufd's. Not unsolvable but really no gain by > > > > adding such complexity. So I'm curious whether Qemu provide > > > > a way to restrict that certain object type can only have one instance > > > > to discourage such multi-iommufd attempt? > > > > > > I don't see any reason for QEMU to restrict iommufd objects. The QEMU > > > philosophy seems to be to let users create whatever configuration they > > > want. For libvirt though, the assumption would be that a single > > > iommufd object can be used across subsystems, so libvirt would never > > > automatically create multiple objects. > > > > I like the flexibility what the objection approach gives in your proposal. > > But with the said complexity in mind (with no foreseen benefit), I wonder > > What's the actual complexity? Front-end/backend splits are very common > in QEMU. We're making the object connection via name, why is it > significantly more complicated to allow multiple iommufd objects? On > the contrary, it seems to me that we'd need to go out of our way to add > code to block multiple iommufd objects. > > > whether an alternative approach which treats iommufd as a global > > property instead of an object is acceptable in Qemu, i.e.: > > > > -iommufd on/off > > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0] > > > > All devices with iommufd specified then implicitly share a single iommufd > > object within Qemu. > > QEMU requires key-value pairs AFAIK, so the above doesn't work, then > we're just back to the iommufd=on/off. > > > This still allows vfio devices to be specified via fd but just requires Libvirt > > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be > > considered or just not a typical way in Qemu philosophy e.g. any object > > associated with a device must be explicitly specified? > > Avoiding QEMU opening files was a significant focus of my alternate > proposal. Also note that we must be able to support hotplug, so we > need to be able to dynamically add and remove the iommufd object, I > don't see that a global property allows for that. 
Implicit > associations of devices to shared resources doesn't seem particularly > desirable to me. Thanks, Adding new global properties/options is rather an anti-pattern for QEMU these days. Using -object is the right approach. If you only want to allow for one of them, just document this requirement. We've got other objects which are singletons like all the confidential guest classes for each arch. With regards, Daniel
> From: Daniel P. Berrangé <berrange@redhat.com> > Sent: Friday, April 29, 2022 12:20 AM > > On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote: > > On Thu, 28 Apr 2022 03:21:45 +0000 > > "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > > > From: Alex Williamson <alex.williamson@redhat.com> > > > > Sent: Wednesday, April 27, 2022 12:22 AM > > > > > > > > > > > > My expectation would be that libvirt uses: > > > > > > > > > > > > -object iommufd,id=iommufd0,fd=NNN > > > > > > -device vfio-pci,fd=MMM,iommufd=iommufd0 > > > > > > > > > > > > Whereas simple QEMU command line would be: > > > > > > > > > > > > -object iommufd,id=iommufd0 > > > > > > -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0 > > > > > > > > > > > > The iommufd object would open /dev/iommufd itself. Creating an > > > > > > implicit iommufd object is someone problematic because one of the > > > > > > things I forgot to highlight in my previous description is that the > > > > > > iommufd object is meant to be shared across not only various vfio > > > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, > ex. > > > > > > vdpa. > > > > > > > > > > Out of curiosity - in concept one iommufd is sufficient to support all > > > > > ioas requirements across subsystems while having multiple > iommufd's > > > > > instead lose the benefit of centralized accounting. The latter will also > > > > > cause some trouble when we start virtualizing ENQCMD which > requires > > > > > VM-wide PASID virtualization thus further needs to share that > > > > > information across iommufd's. Not unsolvable but really no gain by > > > > > adding such complexity. So I'm curious whether Qemu provide > > > > > a way to restrict that certain object type can only have one instance > > > > > to discourage such multi-iommufd attempt? > > > > > > > > I don't see any reason for QEMU to restrict iommufd objects. The > QEMU > > > > philosophy seems to be to let users create whatever configuration they > > > > want. For libvirt though, the assumption would be that a single > > > > iommufd object can be used across subsystems, so libvirt would never > > > > automatically create multiple objects. > > > > > > I like the flexibility what the objection approach gives in your proposal. > > > But with the said complexity in mind (with no foreseen benefit), I wonder > > > > What's the actual complexity? Front-end/backend splits are very common > > in QEMU. We're making the object connection via name, why is it > > significantly more complicated to allow multiple iommufd objects? On > > the contrary, it seems to me that we'd need to go out of our way to add > > code to block multiple iommufd objects. Probably it's just a hypothetical concern when I thought about the need of managing certain global information (e.g. PASID virtualization) cross iommufd's down the road. With your and Daniel's replies I think we'll first try to follow the common practice in Qemu first given there are more positive reasons to do so than the hypothetical concern itself. > > > > > whether an alternative approach which treats iommufd as a global > > > property instead of an object is acceptable in Qemu, i.e.: > > > > > > -iommufd on/off > > > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0] > > > > > > All devices with iommufd specified then implicitly share a single iommufd > > > object within Qemu. > > > > QEMU requires key-value pairs AFAIK, so the above doesn't work, then > > we're just back to the iommufd=on/off. 
> > > > > This still allows vfio devices to be specified via fd but just requires Libvirt > > > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be > > > considered or just not a typical way in Qemu philosophy e.g. any object > > > associated with a device must be explicitly specified? > > > > Avoiding QEMU opening files was a significant focus of my alternate > > proposal. Also note that we must be able to support hotplug, so we > > need to be able to dynamically add and remove the iommufd object, I > > don't see that a global property allows for that. Implicit > > associations of devices to shared resources doesn't seem particularly > > desirable to me. Thanks, > > Adding new global properties/options is rather an anti-pattern for QEMU > these days. Using -object is the right approach. If you only want to > allow for one of them, just document this requirement. We've got other > objects which are singletons like all the confidential guest classes > for each arch. > Good to know such last resort. As said we'll try to avoid this restriction and follow Alex's proposal unless there are unexpectedly unreasonable complexities arising later. Thanks Kevin
Hi, Alex On 2022/4/27 上午12:35, Alex Williamson wrote: > On Tue, 26 Apr 2022 12:43:35 +0000 > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > >>> -----Original Message----- >>> From: Eric Auger [mailto:eric.auger@redhat.com] >>> Sent: 26 April 2022 12:45 >>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi >>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com; >>> qemu-devel@nongnu.org >>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; >>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; >>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; >>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> >>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd >> [...] >> >>>>> >>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com >>>>> / >>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >>>> Hi, >>>> >>>> I had a go with the above branches on our ARM64 platform trying to >>> pass-through >>>> a VF dev, but Qemu reports an error as below, >>>> >>>> [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002) >>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address >>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, >>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) >>>> I think this happens for the dev BAR addr range. I haven't debugged the >>> kernel >>>> yet to see where it actually reports that. >>> Does it prevent your assigned device from working? I have such errors >>> too but this is a known issue. This is due to the fact P2P DMA is not >>> supported yet. >>> >> Yes, the basic tests all good so far. I am still not very clear how it works if >> the map() fails though. It looks like it fails in, >> >> iommufd_ioas_map() >> iopt_map_user_pages() >> iopt_map_pages() >> .. >> pfn_reader_pin_pages() >> >> So does it mean it just works because the page is resident()? > No, it just means that you're not triggering any accesses that require > peer-to-peer DMA support. Any sort of test where the device is only > performing DMA to guest RAM, which is by far the standard use case, > will work fine. This also doesn't affect vCPU access to BAR space. > It's only a failure of the mappings of the BAR space into the IOAS, > which is only used when a device tries to directly target another > device's BAR space via DMA. Thanks, I also get this issue when trying adding prereg listenner + container->prereg_listener = vfio_memory_prereg_listener; + memory_listener_register(&container->prereg_listener, + &address_space_memory); host kernel log: iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000 iopt_alloc_area input area=859a2d00 iova=8000000000 iopt_alloc_area area=859a2d00 iova=8000000000 pin_user_pages_remote rc=-14 qemu log: vfio_prereg_listener_region_add iommufd_map iova=0x8000000000 qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, 0xffff9c495000) = -14 (Bad address) qemu-system-aarch64: (null) double free or corruption (fasttop) Aborted (core dumped) With hack of ignoring address 0x8000000000 in map and unmap, kernel can boot. Thanks
Hi Zhangfei, On 2022/5/9 22:24, Zhangfei Gao wrote: > Hi, Alex > > On 2022/4/27 上午12:35, Alex Williamson wrote: >> On Tue, 26 Apr 2022 12:43:35 +0000 >> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: >> >>>> -----Original Message----- >>>> From: Eric Auger [mailto:eric.auger@redhat.com] >>>> Sent: 26 April 2022 12:45 >>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi >>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com; >>>> qemu-devel@nongnu.org >>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; >>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; >>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; >>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> >>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd >>> [...] >>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com >>>>>> / >>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >>>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >>>>> Hi, >>>>> >>>>> I had a go with the above branches on our ARM64 platform trying to >>>> pass-through >>>>> a VF dev, but Qemu reports an error as below, >>>>> >>>>> [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002) >>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address >>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, >>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) >>>>> I think this happens for the dev BAR addr range. I haven't debugged the >>>> kernel >>>>> yet to see where it actually reports that. >>>> Does it prevent your assigned device from working? I have such errors >>>> too but this is a known issue. This is due to the fact P2P DMA is not >>>> supported yet. >>> Yes, the basic tests all good so far. I am still not very clear how it >>> works if >>> the map() fails though. It looks like it fails in, >>> >>> iommufd_ioas_map() >>> iopt_map_user_pages() >>> iopt_map_pages() >>> .. >>> pfn_reader_pin_pages() >>> >>> So does it mean it just works because the page is resident()? >> No, it just means that you're not triggering any accesses that require >> peer-to-peer DMA support. Any sort of test where the device is only >> performing DMA to guest RAM, which is by far the standard use case, >> will work fine. This also doesn't affect vCPU access to BAR space. >> It's only a failure of the mappings of the BAR space into the IOAS, >> which is only used when a device tries to directly target another >> device's BAR space via DMA. 
Thanks, > > I also get this issue when trying adding prereg listenner > > + container->prereg_listener = vfio_memory_prereg_listener; > + memory_listener_register(&container->prereg_listener, > + &address_space_memory); > > host kernel log: > iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, cmd->iova=8000000000, > cmd->user_va=9c495000, cmd->length=10000 > iopt_alloc_area input area=859a2d00 iova=8000000000 > iopt_alloc_area area=859a2d00 iova=8000000000 > pin_user_pages_remote rc=-14 > > qemu log: > vfio_prereg_listener_region_add > iommufd_map iova=0x8000000000 > qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address > qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, > 0xffff9c495000) = -14 (Bad address) > qemu-system-aarch64: (null) > double free or corruption (fasttop) > Aborted (core dumped) > > With hack of ignoring address 0x8000000000 in map and unmap, kernel can boot. Do you know if the iova 0x8000000000 is guest RAM or MMIO? Currently, the iommufd kernel part doesn't support mapping device BAR MMIO. This is a known gap.
Hi Hi, Zhangfei, On 5/10/22 05:17, Yi Liu wrote: > Hi Zhangfei, > > On 2022/5/9 22:24, Zhangfei Gao wrote: >> Hi, Alex >> >> On 2022/4/27 上午12:35, Alex Williamson wrote: >>> On Tue, 26 Apr 2022 12:43:35 +0000 >>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: >>> >>>>> -----Original Message----- >>>>> From: Eric Auger [mailto:eric.auger@redhat.com] >>>>> Sent: 26 April 2022 12:45 >>>>> To: Shameerali Kolothum Thodi >>>>> <shameerali.kolothum.thodi@huawei.com>; Yi >>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; >>>>> cohuck@redhat.com; >>>>> qemu-devel@nongnu.org >>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; >>>>> farman@linux.ibm.com; >>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; >>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; >>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> >>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd >>>> [...] >>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com >>>>> >>>>>>> / >>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >>>>>>> [3] >>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >>>>>> Hi, >>>>>> >>>>>> I had a go with the above branches on our ARM64 platform trying to >>>>> pass-through >>>>>> a VF dev, but Qemu reports an error as below, >>>>>> >>>>>> [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> >>>>>> 0002) >>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address >>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, >>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) >>>>>> I think this happens for the dev BAR addr range. I haven't >>>>>> debugged the >>>>> kernel >>>>>> yet to see where it actually reports that. >>>>> Does it prevent your assigned device from working? I have such errors >>>>> too but this is a known issue. This is due to the fact P2P DMA is not >>>>> supported yet. >>>> Yes, the basic tests all good so far. I am still not very clear how >>>> it works if >>>> the map() fails though. It looks like it fails in, >>>> >>>> iommufd_ioas_map() >>>> iopt_map_user_pages() >>>> iopt_map_pages() >>>> .. >>>> pfn_reader_pin_pages() >>>> >>>> So does it mean it just works because the page is resident()? >>> No, it just means that you're not triggering any accesses that require >>> peer-to-peer DMA support. Any sort of test where the device is only >>> performing DMA to guest RAM, which is by far the standard use case, >>> will work fine. This also doesn't affect vCPU access to BAR space. >>> It's only a failure of the mappings of the BAR space into the IOAS, >>> which is only used when a device tries to directly target another >>> device's BAR space via DMA. 
Thanks, >> >> I also get this issue when trying adding prereg listenner >> >> + container->prereg_listener = vfio_memory_prereg_listener; >> + memory_listener_register(&container->prereg_listener, >> + &address_space_memory); >> >> host kernel log: >> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, >> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000 >> iopt_alloc_area input area=859a2d00 iova=8000000000 >> iopt_alloc_area area=859a2d00 iova=8000000000 >> pin_user_pages_remote rc=-14 >> >> qemu log: >> vfio_prereg_listener_region_add >> iommufd_map iova=0x8000000000 >> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address >> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, >> 0x10000, 0xffff9c495000) = -14 (Bad address) >> qemu-system-aarch64: (null) >> double free or corruption (fasttop) >> Aborted (core dumped) >> >> With hack of ignoring address 0x8000000000 in map and unmap, kernel >> can boot. > > do you know if the iova 0x8000000000 guest RAM or MMIO? Currently, > iommufd kernel part doesn't support mapping device BAR MMIO. This is a > known gap. In qemu arm virt machine this indeed matches the PCI MMIO region. Thanks Eric
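A sketch of the kind of filtering the WIP prereg listener could do instead of hard-coding 0x8000000000: skip any section that is not guest RAM, so the PCI MMIO window never reaches IOMMU_IOAS_MAP. memory_region_is_ram() and memory_region_is_ram_device() are existing QEMU helpers; vfio_prereg_listener_region_add() here is only a stand-in for the listener being prototyped in this sub-thread, not code from the RFC.

#include "qemu/osdep.h"
#include "exec/memory.h"

static bool prereg_listener_skipped_section(MemoryRegionSection *section)
{
    /* Only guest RAM can currently be pinned and mapped through
     * IOMMU_IOAS_MAP; device BAR MMIO (a "ram device" region in QEMU
     * terms) hits the known iommufd gap, so skip it rather than fail. */
    return !memory_region_is_ram(section->mr) ||
           memory_region_is_ram_device(section->mr);
}

static void vfio_prereg_listener_region_add(MemoryListener *listener,
                                            MemoryRegionSection *section)
{
    if (prereg_listener_skipped_section(section)) {
        return;
    }
    /* ... translate the section to hva/iova and go through the
     * IOMMU_IOAS_MAP path here ... */
}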
On 2022/5/10 下午2:51, Eric Auger wrote: > Hi Hi, Zhangfei, > > On 5/10/22 05:17, Yi Liu wrote: >> Hi Zhangfei, >> >> On 2022/5/9 22:24, Zhangfei Gao wrote: >>> Hi, Alex >>> >>> On 2022/4/27 上午12:35, Alex Williamson wrote: >>>> On Tue, 26 Apr 2022 12:43:35 +0000 >>>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: >>>> >>>>>> -----Original Message----- >>>>>> From: Eric Auger [mailto:eric.auger@redhat.com] >>>>>> Sent: 26 April 2022 12:45 >>>>>> To: Shameerali Kolothum Thodi >>>>>> <shameerali.kolothum.thodi@huawei.com>; Yi >>>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; >>>>>> cohuck@redhat.com; >>>>>> qemu-devel@nongnu.org >>>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; >>>>>> farman@linux.ibm.com; >>>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >>>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >>>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com; >>>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com; >>>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org> >>>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd >>>>> [...] >>>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com >>>>>> >>>>>>>> / >>>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6 >>>>>>>> [3] >>>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1 >>>>>>> Hi, >>>>>>> >>>>>>> I had a go with the above branches on our ARM64 platform trying to >>>>>> pass-through >>>>>>> a VF dev, but Qemu reports an error as below, >>>>>>> >>>>>>> [ 0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> >>>>>>> 0002) >>>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address >>>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, >>>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address) >>>>>>> I think this happens for the dev BAR addr range. I haven't >>>>>>> debugged the >>>>>> kernel >>>>>>> yet to see where it actually reports that. >>>>>> Does it prevent your assigned device from working? I have such errors >>>>>> too but this is a known issue. This is due to the fact P2P DMA is not >>>>>> supported yet. >>>>> Yes, the basic tests all good so far. I am still not very clear how >>>>> it works if >>>>> the map() fails though. It looks like it fails in, >>>>> >>>>> iommufd_ioas_map() >>>>> iopt_map_user_pages() >>>>> iopt_map_pages() >>>>> .. >>>>> pfn_reader_pin_pages() >>>>> >>>>> So does it mean it just works because the page is resident()? >>>> No, it just means that you're not triggering any accesses that require >>>> peer-to-peer DMA support. Any sort of test where the device is only >>>> performing DMA to guest RAM, which is by far the standard use case, >>>> will work fine. This also doesn't affect vCPU access to BAR space. >>>> It's only a failure of the mappings of the BAR space into the IOAS, >>>> which is only used when a device tries to directly target another >>>> device's BAR space via DMA. 
Thanks, >>> I also get this issue when trying adding prereg listenner >>> >>> + container->prereg_listener = vfio_memory_prereg_listener; >>> + memory_listener_register(&container->prereg_listener, >>> + &address_space_memory); >>> >>> host kernel log: >>> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, >>> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000 >>> iopt_alloc_area input area=859a2d00 iova=8000000000 >>> iopt_alloc_area area=859a2d00 iova=8000000000 >>> pin_user_pages_remote rc=-14 >>> >>> qemu log: >>> vfio_prereg_listener_region_add >>> iommufd_map iova=0x8000000000 >>> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address >>> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, >>> 0x10000, 0xffff9c495000) = -14 (Bad address) >>> qemu-system-aarch64: (null) >>> double free or corruption (fasttop) >>> Aborted (core dumped) >>> >>> With hack of ignoring address 0x8000000000 in map and unmap, kernel >>> can boot. >> do you know if the iova 0x8000000000 guest RAM or MMIO? Currently, >> iommufd kernel part doesn't support mapping device BAR MMIO. This is a >> known gap. > In qemu arm virt machine this indeed matches the PCI MMIO region. Thanks Yi and Eric, Then will wait for the updated iommufd kernel for the PCI MMIO region. Another question, How to get the iommu_domain in the ioctl. qemu can get container->ioas_id. kernel can get ioas via the ioas_id. But how to get the domain? Currently I am hacking with ioas->iopt.next_domain_id, which is increasing. domain = xa_load(&ioas->iopt.domains, ioas->iopt.next_domain_id-1); Any idea? Thanks
On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: > Thanks Yi and Eric, > Then will wait for the updated iommufd kernel for the PCI MMIO region. > > Another question, > How to get the iommu_domain in the ioctl. The ID of the iommu_domain (called the hwpt) should be returned by the vfio attach ioctl. Jason
On 2022/5/10 20:45, Jason Gunthorpe wrote: > On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >> Thanks Yi and Eric, >> Then will wait for the updated iommufd kernel for the PCI MMIO region. >> >> Another question, >> How to get the iommu_domain in the ioctl. > > The ID of the iommu_domain (called the hwpt) it should be returned by > the vfio attach ioctl. yes, hwpt_id is returned by the vfio attach ioctl and recorded in qemu. You can query page table related capabilities with this id. https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
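For reference, the QEMU side of that attach step looks roughly like the sketch below: issue the attach ioctl and record the hwpt id the kernel returns. VFIO_DEVICE_ATTACH_IOAS and the struct layout follow the RFC only approximately (the exact uapi lives in the RFC branches, not in mainline linux/vfio.h, and the ioctl number here is a placeholder); vfio_device_attach_ioas() is just an illustrative helper name.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Approximation of the RFC uapi; field names and the ioctl number may not
 * match the posted patches exactly. */
struct vfio_device_attach_ioas {
        uint32_t argsz;
        uint32_t flags;
        int32_t  iommufd;
        uint32_t ioas_id;
        uint32_t out_hwpt_id;   /* filled in by the kernel on success */
};
#define VFIO_DEVICE_ATTACH_IOAS _IO(VFIO_TYPE, VFIO_BASE + 19) /* placeholder */

static int vfio_device_attach_ioas(int device_fd, int iommufd,
                                   uint32_t ioas_id, uint32_t *out_hwpt_id)
{
        struct vfio_device_attach_ioas attach = {
                .argsz = sizeof(attach),
                .iommufd = iommufd,
                .ioas_id = ioas_id,
        };

        if (ioctl(device_fd, VFIO_DEVICE_ATTACH_IOAS, &attach)) {
                return -1;
        }
        /* QEMU records this per attachment (e.g. on the hwpt_list of the
         * VFIOIOMMUFDContainer) and can later use it to query page-table
         * capabilities or to set up nesting. */
        *out_hwpt_id = attach.out_hwpt_id;
        return 0;
}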
On 2022/5/10 下午10:08, Yi Liu wrote: > On 2022/5/10 20:45, Jason Gunthorpe wrote: >> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>> Thanks Yi and Eric, >>> Then will wait for the updated iommufd kernel for the PCI MMIO region. >>> >>> Another question, >>> How to get the iommu_domain in the ioctl. >> >> The ID of the iommu_domain (called the hwpt) it should be returned by >> the vfio attach ioctl. > > yes, hwpt_id is returned by the vfio attach ioctl and recorded in > qemu. You can query page table related capabilities with this id. > > https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ > Thanks Yi, Do we use iommufd_hw_pagetable_from_id in the kernel? QEMU sends the hwpt_id via ioctl. Currently VFIOIOMMUFDContainer has hwpt_list; which member is a good place to save the hwpt_id / IOMMUTLBEntry? In the kernel ioctl (iommufd_vfio_ioctl): @dev: Device to get an iommu_domain for iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct device *dev) But iommufd_vfio_ioctl seems to have no dev parameter? Thanks
Hi, Yi On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: > > > On 2022/5/10 下午10:08, Yi Liu wrote: >> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>> Thanks Yi and Eric, >>>> Then will wait for the updated iommufd kernel for the PCI MMIO region. >>>> >>>> Another question, >>>> How to get the iommu_domain in the ioctl. >>> >>> The ID of the iommu_domain (called the hwpt) it should be returned by >>> the vfio attach ioctl. >> >> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >> qemu. You can query page table related capabilities with this id. >> >> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >> > Thanks Yi, > > Do we use iommufd_hw_pagetable_from_id in kernel? > > The qemu send hwpt_id via ioctl. > Currently VFIOIOMMUFDContainer has hwpt_list, > Which member is good to save hwpt_id, IOMMUTLBEntry? Can VFIOIOMMUFDContainer have multi hwpt? Since VFIOIOMMUFDContainer has hwpt_list now. If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, where no vbasedev can be used for compare. I am testing with a workaround, adding VFIOIOASHwpt *hwpt in VFIOIOMMUFDContainer. And save hwpt when vfio_device_attach_container. > > In kernel ioctl: iommufd_vfio_ioctl > @dev: Device to get an iommu_domain for > iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, > struct device *dev) > But iommufd_vfio_ioctl seems no para dev? We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev. iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) Thanks >
Hi Zhangfei, On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote: > > > On 2022/5/10 下午10:08, Yi Liu wrote: >> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>> Thanks Yi and Eric, >>>> Then will wait for the updated iommufd kernel for the PCI MMIO region. >>>> >>>> Another question, >>>> How to get the iommu_domain in the ioctl. >>> >>> The ID of the iommu_domain (called the hwpt) it should be returned by >>> the vfio attach ioctl. >> >> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >> qemu. You can query page table related capabilities with this id. >> >> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >> > Thanks Yi, > > Do we use iommufd_hw_pagetable_from_id in kernel? > > The qemu send hwpt_id via ioctl. > Currently VFIOIOMMUFDContainer has hwpt_list, > Which member is good to save hwpt_id, IOMMUTLBEntry? currently, we don't make use of hwpt yet in the version we have in the qemu branch. I have a change to make use of it. Also, it would be used in future for nested translation setup and also dirty page bit support query for a given domain. > > In kernel ioctl: iommufd_vfio_ioctl > @dev: Device to get an iommu_domain for > iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct > device *dev) > But iommufd_vfio_ioctl seems no para dev? there is. you can look at the vfio_group_set_iommufd(), it loops the device_list provided by vfio. And the device info is passed to iommufd. > Thanks > > > >
Hi Zhangfei, On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote: > > Hi, Yi > > On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: >> >> >> On 2022/5/10 下午10:08, Yi Liu wrote: >>> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>>> Thanks Yi and Eric, >>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region. >>>>> >>>>> Another question, >>>>> How to get the iommu_domain in the ioctl. >>>> >>>> The ID of the iommu_domain (called the hwpt) it should be returned by >>>> the vfio attach ioctl. >>> >>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >>> qemu. You can query page table related capabilities with this id. >>> >>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >>> >> Thanks Yi, >> >> Do we use iommufd_hw_pagetable_from_id in kernel? >> >> The qemu send hwpt_id via ioctl. >> Currently VFIOIOMMUFDContainer has hwpt_list, >> Which member is good to save hwpt_id, IOMMUTLBEntry? > > Can VFIOIOMMUFDContainer have multi hwpt? yes, it is possible > Since VFIOIOMMUFDContainer has hwpt_list now. > If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, > where no vbasedev can be used for compare. > > I am testing with a workaround, adding VFIOIOASHwpt *hwpt in > VFIOIOMMUFDContainer. > And save hwpt when vfio_device_attach_container. > >> >> In kernel ioctl: iommufd_vfio_ioctl >> @dev: Device to get an iommu_domain for >> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct >> device *dev) >> But iommufd_vfio_ioctl seems no para dev? > > We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev. > iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) this is not good. dev is passed in to this function to allocate domain and also check sw_msi things. If you pass in a NULL, it may even unable to get a domain for the hwpt. It won't work I guess. > Thanks >> >
On 2022/5/17 下午4:55, Yi Liu wrote: > Hi Zhangfei, > > On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote: >> >> Hi, Yi >> >> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: >>> >>> >>> On 2022/5/10 下午10:08, Yi Liu wrote: >>>> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>>>> Thanks Yi and Eric, >>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO >>>>>> region. >>>>>> >>>>>> Another question, >>>>>> How to get the iommu_domain in the ioctl. >>>>> >>>>> The ID of the iommu_domain (called the hwpt) it should be returned by >>>>> the vfio attach ioctl. >>>> >>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >>>> qemu. You can query page table related capabilities with this id. >>>> >>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >>>> >>>> >>> Thanks Yi, >>> >>> Do we use iommufd_hw_pagetable_from_id in kernel? >>> >>> The qemu send hwpt_id via ioctl. >>> Currently VFIOIOMMUFDContainer has hwpt_list, >>> Which member is good to save hwpt_id, IOMMUTLBEntry? >> >> Can VFIOIOMMUFDContainer have multi hwpt? > > yes, it is possible Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb) > >> Since VFIOIOMMUFDContainer has hwpt_list now. >> If so, how to get specific hwpt from map/unmap_notify in >> hw/vfio/as.c, where no vbasedev can be used for compare. >> >> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in >> VFIOIOMMUFDContainer. >> And save hwpt when vfio_device_attach_container. >> >>> >>> In kernel ioctl: iommufd_vfio_ioctl >>> @dev: Device to get an iommu_domain for >>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, >>> struct device *dev) >>> But iommufd_vfio_ioctl seems no para dev? >> >> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev. >> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) > > this is not good. dev is passed in to this function to allocate domain > and also check sw_msi things. If you pass in a NULL, it may even unable > to get a domain for the hwpt. It won't work I guess. The iommufd_hw_pagetable_from_id can be used for 1, allocate domain, which need para dev case IOMMUFD_OBJ_IOAS hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev); 2. Just return allocated domain via hwpt_id, which does not need dev. case IOMMUFD_OBJ_HW_PAGETABLE: return container_of(obj, struct iommufd_hw_pagetable, obj); By the way, any plan of the nested mode? Thanks
On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote: > > > On 2022/5/17 下午4:55, Yi Liu wrote: >> Hi Zhangfei, >> >> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote: >>> >>> Hi, Yi >>> >>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: >>>> >>>> >>>> On 2022/5/10 下午10:08, Yi Liu wrote: >>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>>>>> Thanks Yi and Eric, >>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region. >>>>>>> >>>>>>> Another question, >>>>>>> How to get the iommu_domain in the ioctl. >>>>>> >>>>>> The ID of the iommu_domain (called the hwpt) it should be returned by >>>>>> the vfio attach ioctl. >>>>> >>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >>>>> qemu. You can query page table related capabilities with this id. >>>>> >>>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >>>>> >>>> Thanks Yi, >>>> >>>> Do we use iommufd_hw_pagetable_from_id in kernel? >>>> >>>> The qemu send hwpt_id via ioctl. >>>> Currently VFIOIOMMUFDContainer has hwpt_list, >>>> Which member is good to save hwpt_id, IOMMUTLBEntry? >>> >>> Can VFIOIOMMUFDContainer have multi hwpt? >> >> yes, it is possible > Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry > *iotlb) in map/unmap, should use ioas_id instead of hwpt_id > >> >>> Since VFIOIOMMUFDContainer has hwpt_list now. >>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, >>> where no vbasedev can be used for compare. >>> >>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in >>> VFIOIOMMUFDContainer. >>> And save hwpt when vfio_device_attach_container. >>> >>>> >>>> In kernel ioctl: iommufd_vfio_ioctl >>>> @dev: Device to get an iommu_domain for >>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, >>>> struct device *dev) >>>> But iommufd_vfio_ioctl seems no para dev? >>> >>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev. >>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) >> >> this is not good. dev is passed in to this function to allocate domain >> and also check sw_msi things. If you pass in a NULL, it may even unable >> to get a domain for the hwpt. It won't work I guess. > > The iommufd_hw_pagetable_from_id can be used for > 1, allocate domain, which need para dev > case IOMMUFD_OBJ_IOAS > hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev); this is used when attaching ioas. > 2. Just return allocated domain via hwpt_id, which does not need dev. > case IOMMUFD_OBJ_HW_PAGETABLE: > return container_of(obj, struct iommufd_hw_pagetable, obj); yes, this would be the usage in nesting. you may check my below branch. It's for nesting integration. https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting > By the way, any plan of the nested mode? I'm working with Eric, Nic on it. Currently, I've got the above kernel branch, QEMU side is also WIP. > Thanks
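To illustrate why the map/unmap path only needs the ioas_id: IOMMU_IOAS_MAP is keyed by the IOAS, not by any hwpt, so every hwpt attached to that IOAS sees the same mapping. The sketch below uses the iommufd uapi field names that also appear in the kernel logs earlier in the thread (ioas_id, user_va, length, iova); the header is linux/iommufd.h in later kernels, and the RFC branch may carry a slightly different snapshot of it.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int ioas_map(int iommufd, uint32_t ioas_id, uint64_t iova,
                    void *vaddr, uint64_t length)
{
        struct iommu_ioas_map cmd = {
                .size = sizeof(cmd),
                .flags = IOMMU_IOAS_MAP_FIXED_IOVA |
                         IOMMU_IOAS_MAP_READABLE |
                         IOMMU_IOAS_MAP_WRITEABLE,
                .ioas_id = ioas_id,
                .user_va = (uintptr_t)vaddr,
                .length = length,
                .iova = iova,
        };

        /* No hwpt_id anywhere: the kernel updates all domains/hwpts attached
         * to this IOAS, which is why the MemoryListener callbacks in
         * hw/vfio/as.c do not have to pick a hwpt. */
        return ioctl(iommufd, IOMMU_IOAS_MAP, &cmd);
}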
> -----Original Message----- > From: Yi Liu [mailto:yi.l.liu@intel.com] > Sent: 18 May 2022 15:01 > To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>; > Zhangfei Gao <zhangfei.gao@linaro.org> > Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>; > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > cohuck@redhat.com; qemu-devel@nongnu.org; > david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; > nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com; > chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com > Subject: Re: [RFC 00/18] vfio: Adopt iommufd > > On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote: > > > > > > On 2022/5/17 下午4:55, Yi Liu wrote: > >> Hi Zhangfei, > >> > >> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote: > >>> > >>> Hi, Yi > >>> > >>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: > >>>> > >>>> > >>>> On 2022/5/10 下午10:08, Yi Liu wrote: > >>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote: > >>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: > >>>>>>> Thanks Yi and Eric, > >>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO > region. > >>>>>>> > >>>>>>> Another question, > >>>>>>> How to get the iommu_domain in the ioctl. > >>>>>> > >>>>>> The ID of the iommu_domain (called the hwpt) it should be returned > by > >>>>>> the vfio attach ioctl. > >>>>> > >>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in > >>>>> qemu. You can query page table related capabilities with this id. > >>>>> > >>>>> > https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ > >>>>> > >>>> Thanks Yi, > >>>> > >>>> Do we use iommufd_hw_pagetable_from_id in kernel? > >>>> > >>>> The qemu send hwpt_id via ioctl. > >>>> Currently VFIOIOMMUFDContainer has hwpt_list, > >>>> Which member is good to save hwpt_id, IOMMUTLBEntry? > >>> > >>> Can VFIOIOMMUFDContainer have multi hwpt? > >> > >> yes, it is possible > > Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, > IOMMUTLBEntry > > *iotlb) > > in map/unmap, should use ioas_id instead of hwpt_id > > > > >> > >>> Since VFIOIOMMUFDContainer has hwpt_list now. > >>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, > >>> where no vbasedev can be used for compare. > >>> > >>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in > >>> VFIOIOMMUFDContainer. > >>> And save hwpt when vfio_device_attach_container. > >>> > >>>> > >>>> In kernel ioctl: iommufd_vfio_ioctl > >>>> @dev: Device to get an iommu_domain for > >>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, > >>>> struct device *dev) > >>>> But iommufd_vfio_ioctl seems no para dev? > >>> > >>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not > need dev. > >>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) > >> > >> this is not good. dev is passed in to this function to allocate domain > >> and also check sw_msi things. If you pass in a NULL, it may even unable > >> to get a domain for the hwpt. It won't work I guess. > > > > The iommufd_hw_pagetable_from_id can be used for > > 1, allocate domain, which need para dev > > case IOMMUFD_OBJ_IOAS > > hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev); > > this is used when attaching ioas. > > > 2. Just return allocated domain via hwpt_id, which does not need dev. 
> > case IOMMUFD_OBJ_HW_PAGETABLE: > > return container_of(obj, struct iommufd_hw_pagetable, obj); > > yes, this would be the usage in nesting. you may check my below > branch. It's for nesting integration. > > https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting > > > By the way, any plan of the nested mode? > I'm working with Eric, Nic on it. Currently, I've got the above kernel > branch, QEMU side is also WIP. Hi Yi/Eric, I had a look at the above nesting kernel and QEMU branches and, as mentioned in the cover letter, it is not working on ARM yet. IIUC, to get it working via iommufd the main thing we need is a way to configure the physical SMMU in nested mode and set up the stage 2 mappings. The cache/PASID related changes look more straightforward. I have quite a few hacks to get it working on ARM, but it is still a WIP. So I am just wondering whether you have something that can be shared yet? Please let me know. Thanks, Shameer
Hi Shameer, On 6/28/22 10:14, Shameerali Kolothum Thodi wrote: > >> -----Original Message----- >> From: Yi Liu [mailto:yi.l.liu@intel.com] >> Sent: 18 May 2022 15:01 >> To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>; >> Zhangfei Gao <zhangfei.gao@linaro.org> >> Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>; >> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; >> cohuck@redhat.com; qemu-devel@nongnu.org; >> david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com; >> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com; >> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org; >> nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com; >> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com >> Subject: Re: [RFC 00/18] vfio: Adopt iommufd >> >> On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote: >>> >>> On 2022/5/17 下午4:55, Yi Liu wrote: >>>> Hi Zhangfei, >>>> >>>> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote: >>>>> Hi, Yi >>>>> >>>>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote: >>>>>> >>>>>> On 2022/5/10 下午10:08, Yi Liu wrote: >>>>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote: >>>>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote: >>>>>>>>> Thanks Yi and Eric, >>>>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO >> region. >>>>>>>>> Another question, >>>>>>>>> How to get the iommu_domain in the ioctl. >>>>>>>> The ID of the iommu_domain (called the hwpt) it should be returned >> by >>>>>>>> the vfio attach ioctl. >>>>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in >>>>>>> qemu. You can query page table related capabilities with this id. >>>>>>> >>>>>>> >> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ >>>>>> Thanks Yi, >>>>>> >>>>>> Do we use iommufd_hw_pagetable_from_id in kernel? >>>>>> >>>>>> The qemu send hwpt_id via ioctl. >>>>>> Currently VFIOIOMMUFDContainer has hwpt_list, >>>>>> Which member is good to save hwpt_id, IOMMUTLBEntry? >>>>> Can VFIOIOMMUFDContainer have multi hwpt? >>>> yes, it is possible >>> Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, >> IOMMUTLBEntry >>> *iotlb) >> in map/unmap, should use ioas_id instead of hwpt_id >> >>>>> Since VFIOIOMMUFDContainer has hwpt_list now. >>>>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, >>>>> where no vbasedev can be used for compare. >>>>> >>>>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in >>>>> VFIOIOMMUFDContainer. >>>>> And save hwpt when vfio_device_attach_container. >>>>> >>>>>> In kernel ioctl: iommufd_vfio_ioctl >>>>>> @dev: Device to get an iommu_domain for >>>>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, >>>>>> struct device *dev) >>>>>> But iommufd_vfio_ioctl seems no para dev? >>>>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not >> need dev. >>>>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL) >>>> this is not good. dev is passed in to this function to allocate domain >>>> and also check sw_msi things. If you pass in a NULL, it may even unable >>>> to get a domain for the hwpt. It won't work I guess. >>> The iommufd_hw_pagetable_from_id can be used for >>> 1, allocate domain, which need para dev >>> case IOMMUFD_OBJ_IOAS >>> hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev); >> this is used when attaching ioas. >> >>> 2. Just return allocated domain via hwpt_id, which does not need dev. 
>>> case IOMMUFD_OBJ_HW_PAGETABLE: >>> return container_of(obj, struct iommufd_hw_pagetable, obj); >> yes, this would be the usage in nesting. you may check my below >> branch. It's for nesting integration. >> >> https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting >> >>> By the way, any plan of the nested mode? >> I'm working with Eric, Nic on it. Currently, I've got the above kernel >> branch, QEMU side is also WIP. > Hi Yi/Eric, > > I had a look at the above nesting kernel and Qemu branches and as mentioned > in the cover letter it is not working on ARM yet. > > IIUC, to get it working via the iommufd the main thing is we need a way to configure > the phys SMMU in nested mode and setup the mappings for the stage 2. The > Cache/PASID related changes looks more straight forward. > > I had quite a few hacks to get it working on ARM, but still a WIP. So just wondering > do you guys have something that can be shared yet? I am working on the respin based on latest iommufd kernel branches and qemu RFC v2 but it is still WIP. I will share as soon as possible. Eric > > Please let me know. > > Thanks, > Shameer