[v1,00/22] vfio: Adopt iommufd

Message ID 20230830103754.36461-1-zhenzhong.duan@intel.com (mailing list archive)

Duan, Zhenzhong Aug. 30, 2023, 10:37 a.m. UTC
Hi All,

As the kernel-side iommufd cdev and hot reset features have been queued,
and hwpt alloc has been added to Jason's for_next branch [1], I'd like
to post a new version that matches the kernel-side updates, with the RFC
flag removed. The QEMU code can be found at [2]. I look forward to more
comments!


We have tested a wide range of combinations, e.g.:

- PCI devices were tested
- FD passing and hot reset, with some tricks
- device hotplug with the legacy and iommufd backends
- with or without vIOMMU for the legacy and iommufd backends
- devices linked to different iommufds
- VFIO migration with an E800 net card passed through (no dirty sync support)
- platform, ccw and ap were only compile-tested due to environment limitations


Given some iommufd kernel limitations, the iommufd backend is
not yet fully on par with the legacy backend w.r.t. features like:
- p2p mappings (you will see related error traces)
- dirty page sync
- etc.


Changelog:
v1:
- Alloc hwpt instead of using auto hwpt
- elaborate iommufd code per Nicolin
- consolidate two patches and drop as.c
- typo fixes and function renames

I didn't list the changelog of the RFC stage; see [3] if you are interested.


[1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
[2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
[3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html


--------------------------------------------------------------------------

With the introduction of iommufd, the Linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to the
kernel for assigned devices. This series ports the VFIO devices onto the
/dev/iommu uAPI and lets the new backend coexist with the legacy
implementation.

This QEMU integration is the result of collaborative work between
Yi Liu, Yi Sun, Nicolin Chen and Eric Auger.

At the QEMU level, interactions with /dev/iommu are abstracted by a new
iommufd object (compiled in with the CONFIG_IOMMUFD option).

Any QEMU device (e.g. a vfio device) wishing to use /dev/iommu must be
linked with an iommufd object. In this series, the vfio-pci device is
given this capability (other VFIO devices are not yet ready).

It gets a new optional parameter named iommufd, which allows an iommufd
object to be passed:

    -object iommufd,id=iommufd0
    -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0

Note that /dev/iommu and the vfio cdev can be opened externally by a
management layer. In that case the fds are passed:

    -object iommufd,id=iommufd0,fd=22
    -device vfio-pci,iommufd=iommufd0,fd=23

If the fd parameter is not passed, the fd is opened by QEMU.
See https://www.mail-archive.com/qemu-devel@nongnu.org/msg937155.html
for a detailed discussion of this requirement.

If no iommufd option is passed to the vfio-pci device, iommufd is not
used and the end-user gets the behavior based on the legacy vfio iommu
interfaces:

    -device vfio-pci,host=0000:02:00.0

While the legacy kernel interface is group-centric, the new iommufd
interface is device-centric, relying on device fd and iommufd.

To support both interfaces in the QEMU VFIO device we reworked the vfio
container abstraction so that the generic VFIO code can use either
backend.

The VFIOContainer object becomes a base object from which two objects
are derived:
a) the legacy VFIO container and
b) the new iommufd-based container.

The base object implements generic code, such as the code related to
memory_listener and address space management, whereas the derived
objects implement the callbacks specific to each BE (legacy or
iommufd). Indeed, each backend has its own way to set up the secure
context and its own DMA management interface. The diagram below shows
how this looks with both BEs.

                    VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |           VFIOAddressSpace              | <------------+
    |                  |                      |  MemoryListener
    |          VFIOContainer list             |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+            +--------V----------+
    |   iommufd    |            |    vfio legacy    |
    |  container   |            |     container     |
    +-------+------+            +--------+----------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
Userspace   |                            |
============+============================+===========================
Kernel      |  device fd                 |
            +---------------+            | group/container fd
            | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS) |            | device fd
            |               |            |
            |       +-------V------------V-----------------+
    iommufd |       |                vfio                  |
(map/unmap  |       +---------+--------------------+-------+
ioas_copy)  |                 |                    | map/unmap
            |                 |                    |
     +------V------+    +-----V------+      +------V--------+
     |iommufd core |    |  device    |      |  vfio iommu   |
     +-------------+    +------------+      +---------------+
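
The base/derived split in the diagram can be sketched in C as a base
object holding an ops table, with each backend embedding the base object
as its first member. All Sketch* names below are illustrative
assumptions and do not match the actual types in the series; the
callbacks only print what a real backend would do instead of issuing
ioctls.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct SketchContainer SketchContainer;

/* Callbacks that each backend provides (hypothetical names). */
typedef struct SketchContainerOps {
    const char *name;
    int (*dma_map)(SketchContainer *c, uint64_t iova, uint64_t size);
    int (*dma_unmap)(SketchContainer *c, uint64_t iova, uint64_t size);
} SketchContainerOps;

/* Base object: state used by the generic code only. */
struct SketchContainer {
    const SketchContainerOps *ops;
    unsigned int nr_mappings;
};

/* Derived objects embed the base object as their first member. */
typedef struct SketchLegacyContainer {
    SketchContainer base;
    int container_fd;            /* fd on /dev/vfio/vfio */
} SketchLegacyContainer;

typedef struct SketchIommufdContainer {
    SketchContainer base;
    int iommufd;                 /* fd on /dev/iommu */
    uint32_t ioas_id;
} SketchIommufdContainer;

static int sketch_legacy_map(SketchContainer *c, uint64_t iova, uint64_t size)
{
    SketchLegacyContainer *lc = (SketchLegacyContainer *)c;

    /* A real backend would issue a map ioctl on the container fd. */
    printf("legacy: map iova=0x%" PRIx64 " size=0x%" PRIx64 " fd=%d\n",
           iova, size, lc->container_fd);
    return 0;
}

static int sketch_legacy_unmap(SketchContainer *c, uint64_t iova, uint64_t size)
{
    (void)c;
    printf("legacy: unmap iova=0x%" PRIx64 " size=0x%" PRIx64 "\n", iova, size);
    return 0;
}

static int sketch_iommufd_map(SketchContainer *c, uint64_t iova, uint64_t size)
{
    SketchIommufdContainer *ic = (SketchIommufdContainer *)c;

    /* A real backend would issue a map ioctl on the iommufd for this IOAS. */
    printf("iommufd: map iova=0x%" PRIx64 " size=0x%" PRIx64 " ioas=%" PRIu32 "\n",
           iova, size, ic->ioas_id);
    return 0;
}

static int sketch_iommufd_unmap(SketchContainer *c, uint64_t iova, uint64_t size)
{
    (void)c;
    printf("iommufd: unmap iova=0x%" PRIx64 " size=0x%" PRIx64 "\n", iova, size);
    return 0;
}

static const SketchContainerOps sketch_legacy_ops = {
    .name = "legacy",
    .dma_map = sketch_legacy_map,
    .dma_unmap = sketch_legacy_unmap,
};

static const SketchContainerOps sketch_iommufd_ops = {
    .name = "iommufd",
    .dma_map = sketch_iommufd_map,
    .dma_unmap = sketch_iommufd_unmap,
};

void sketch_legacy_init(SketchLegacyContainer *lc, int container_fd)
{
    lc->base.ops = &sketch_legacy_ops;
    lc->base.nr_mappings = 0;
    lc->container_fd = container_fd;
}

void sketch_iommufd_init(SketchIommufdContainer *ic, int iommufd, uint32_t ioas_id)
{
    ic->base.ops = &sketch_iommufd_ops;
    ic->base.nr_mappings = 0;
    ic->iommufd = iommufd;
    ic->ioas_id = ioas_id;
}

/* Generic code: only sees the base object and dispatches through ops. */
int sketch_container_dma_map(SketchContainer *c, uint64_t iova, uint64_t size)
{
    int ret = c->ops->dma_map(c, iova, size);

    if (ret == 0) {
        c->nr_mappings++;
    }
    return ret;
}

int sketch_container_dma_unmap(SketchContainer *c, uint64_t iova, uint64_t size)
{
    int ret = c->ops->dma_unmap(c, iova, size);

    if (ret == 0 && c->nr_mappings > 0) {
        c->nr_mappings--;
    }
    return ret;
}
```

The point of the sketch is that a memory listener would only ever call
sketch_container_dma_map() and sketch_container_dma_unmap() on the base
object, so it needs no knowledge of which backend is behind it.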

[Secure Context setup]
- iommufd BE: uses device fd and iommufd to set up the secure context
              (bind_iommufd, attach_ioas)
- vfio legacy BE: uses group fd and container fd to set up the secure context
                  (set_container, set_iommu)
[Device access]
- iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: device fd is retrieved from group fd ioctl
[DMA Mapping flow]
1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
2. VFIO populates DMA map/unmap via the container BEs
   *) iommufd BE: uses iommufd
   *) vfio legacy BE: uses container fd
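
For reference, the legacy BE steps listed above correspond roughly to
the following sequence on the long-standing VFIO type1 uAPI from
<linux/vfio.h>. This is only a sketch: the function names are
hypothetical, error handling is reduced to returning -errno, and the
choice of VFIO_TYPE1v2_IOMMU is an assumption made for the example. The
iommufd BE follows the same overall shape, but uses the device fd and
the iommufd instead of the group and container fds.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Secure context setup and device access for the legacy BE.
 * container_fd: open fd on /dev/vfio/vfio
 * group_fd:     open fd on /dev/vfio/$group_id
 * name:         device name, e.g. "0000:02:00.0"
 * Returns the device fd on success, -errno on failure.
 */
int sketch_legacy_get_device_fd(int container_fd, int group_fd,
                                const char *name)
{
    int device_fd;

    /* set_container: attach the group to the container. */
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd) < 0) {
        return -errno;
    }
    /* set_iommu: select the IOMMU model (type1 v2 assumed here). */
    if (ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU) < 0) {
        return -errno;
    }
    /* Device access: the device fd is retrieved from the group fd. */
    device_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, name);
    return device_fd < 0 ? -errno : device_fd;
}

/* DMA map through the container fd. Returns 0 or -errno. */
int sketch_legacy_dma_map(int container_fd, void *vaddr,
                          uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uintptr_t)vaddr;
    map.iova = iova;
    map.size = size;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0 ? -errno : 0;
}

/* DMA unmap through the container fd. Returns 0 or -errno. */
int sketch_legacy_dma_unmap(int container_fd, uint64_t iova, uint64_t size)
{
    struct vfio_iommu_type1_dma_unmap unmap;

    memset(&unmap, 0, sizeof(unmap));
    unmap.argsz = sizeof(unmap);
    unmap.iova = iova;
    unmap.size = size;

    return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0 ? -errno : 0;
}
```

In the flow above, step 1 (the MemoryListener callback) would end up
calling the map or unmap helper for each MemoryRegion section that is
added or removed.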


Thanks,
Yi, Yi, Eric, Zhenzhong


Eric Auger (8):
  scripts/update-linux-headers: Add iommufd.h
  vfio/common: Introduce vfio_container_add|del_section_window()
  vfio/container: Introduce vfio_[attach/detach]_device
  vfio/platform: Use vfio_[attach/detach]_device
  vfio/ap: Use vfio_[attach/detach]_device
  vfio/ccw: Use vfio_[attach/detach]_device
  backends/iommufd: Introduce the iommufd object
  vfio/pci: Allow the selection of a given iommu backend

Yi Liu (5):
  vfio/common: Move IOMMU agnostic helpers to a separate file
  vfio/common: Move legacy VFIO backend code into separate container.c
  vfio: Add base container
  util/char_dev: Add open_cdev()
  vfio/iommufd: Implement the iommufd backend

Zhenzhong Duan (9):
  Update linux-header to support iommufd cdev and hwpt alloc
  vfio/common: Extract out vfio_kvm_device_[add/del]_fd
  vfio/common: Add a vfio device iterator
  vfio/common: Refactor vfio_viommu_preset() to be group agnostic
  vfio/common: Simplify vfio_viommu_preset()
  Add iommufd configure option
  vfio/iommufd: Add vfio device iterator callback for iommufd
  vfio/pci: Adapt vfio pci hot reset support with iommufd BE
  vfio/pci: Make vfio cdev pre-openable by passing a file handle

 MAINTAINERS                           |   13 +
 backends/Kconfig                      |    4 +
 backends/iommufd.c                    |  291 ++++
 backends/meson.build                  |    3 +
 backends/trace-events                 |   13 +
 hw/vfio/ap.c                          |   68 +-
 hw/vfio/ccw.c                         |  120 +-
 hw/vfio/common.c                      | 1948 +++----------------------
 hw/vfio/container-base.c              |  160 ++
 hw/vfio/container.c                   | 1208 +++++++++++++++
 hw/vfio/helpers.c                     |  626 ++++++++
 hw/vfio/iommufd.c                     |  554 +++++++
 hw/vfio/meson.build                   |    6 +
 hw/vfio/pci.c                         |  319 +++-
 hw/vfio/platform.c                    |   43 +-
 hw/vfio/spapr.c                       |   22 +-
 hw/vfio/trace-events                  |   21 +-
 include/hw/vfio/vfio-common.h         |  111 +-
 include/hw/vfio/vfio-container-base.h |  158 ++
 include/qemu/char_dev.h               |   16 +
 include/standard-headers/linux/fuse.h |    3 +
 include/sysemu/iommufd.h              |   49 +
 linux-headers/linux/iommufd.h         |  444 ++++++
 linux-headers/linux/kvm.h             |   13 +-
 linux-headers/linux/vfio.h            |  148 +-
 meson.build                           |    6 +
 meson_options.txt                     |    2 +
 qapi/qom.json                         |   18 +-
 qemu-options.hx                       |   13 +
 scripts/meson-buildoptions.sh         |    3 +
 scripts/update-linux-headers.sh       |    3 +-
 util/chardev_open.c                   |   61 +
 util/meson.build                      |    1 +
 33 files changed, 4395 insertions(+), 2073 deletions(-)
 create mode 100644 backends/iommufd.c
 create mode 100644 hw/vfio/container-base.c
 create mode 100644 hw/vfio/container.c
 create mode 100644 hw/vfio/helpers.c
 create mode 100644 hw/vfio/iommufd.c
 create mode 100644 include/hw/vfio/vfio-container-base.h
 create mode 100644 include/qemu/char_dev.h
 create mode 100644 include/sysemu/iommufd.h
 create mode 100644 linux-headers/linux/iommufd.h
 create mode 100644 util/chardev_open.c

Comments

Eric Auger Sept. 14, 2023, 9:04 a.m. UTC | #1
Hi Zhenzhong

On 8/30/23 12:37, Zhenzhong Duan wrote:
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
> [2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
> [3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html

Do you have a branch to share?

The series does not apply to upstream.

Thanks

Eric
Duan, Zhenzhong Sept. 14, 2023, 9:27 a.m. UTC | #2
Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Thursday, September 14, 2023 5:04 PM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>; qemu-devel@nongnu.org
>Cc: alex.williamson@redhat.com; clg@redhat.com; jgg@nvidia.com;
>nicolinc@nvidia.com; Martins, Joao <joao.m.martins@oracle.com>;
>peterx@redhat.com; jasowang@redhat.com; Tian, Kevin <kevin.tian@intel.com>;
>Liu, Yi L <yi.l.liu@intel.com>; Sun, Yi Y <yi.y.sun@intel.com>; Peng, Chao P
><chao.p.peng@intel.com>
>Subject: Re: [PATCH v1 00/22] vfio: Adopt iommufd
>
>Hi Zhenzhong
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git
>> [2] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_cdev_v1
>> [3] https://lists.nongnu.org/archive/html/qemu-devel/2023-07/msg02529.html
>
>Do you have a branch to share?
>
>It does not apply to upstream

Sure, https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_cdev_v1_rebased
I think this one is already based on today's upstream.

Thanks
Zhenzhong
Cédric Le Goater Sept. 15, 2023, 12:42 p.m. UTC | #3
On 8/30/23 12:37, Zhenzhong Duan wrote:
> Hi All,
> 
> As the kernel-side iommufd cdev and hot reset features have been queued,
> and hwpt alloc has been added to Jason's for_next branch [1], I'd like
> to post a new version that matches the kernel-side updates, with the RFC
> flag removed. The QEMU code can be found at [2]. I look forward to more
> comments!

FYI, I have started cleaning up the VFIO support in QEMU PPC. First
is the removal of nvlink2, which was dropped from the kernel 2.5 years
ago. Next is probably the removal of all the PPC bits in VFIO. The code
is bitrotting and, AFAICT, VFIO has been broken on these platforms since
5.18 or so.

The consequence for this patchset should be less code movement between
files. I think code movement is something we should reduce in order to
preserve history.

Thanks,

C.
  

Duan, Zhenzhong Sept. 15, 2023, 1:14 p.m. UTC | #4
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Sent: Friday, September 15, 2023 8:43 PM
>Subject: Re: [PATCH v1 00/22] vfio: Adopt iommufd
>
>On 8/30/23 12:37, Zhenzhong Duan wrote:
>> Hi All,
>>
>> As the kernel side iommufd cdev and hot reset feature have been queued,
>> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
>> to update a new version matching kernel side update and with rfc flag
>> removed. Qemu code can be found at [2], look forward more comments!
>
>FYI, I have started cleaning up the VFIO support in QEMU PPC. First
>is the removal of nvlink2, which was dropped from the kernel 2.5 years
>ago. Next is probably removal of all the PPC bits in VFIO. Code is
>bitrotting and AFAICT VFIO has been broken on these platforms since
>5.18 or so.
>
>The consequences on this patchset should be less movement of code
>between files. I think this is something we should reduce to maintain
>history.

Glad to know I'll need to move less code. I'll rebase this patchset
after you finish.

Thanks
Zhenzhong
Jason Gunthorpe Sept. 18, 2023, 11:51 a.m. UTC | #5
On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> On 8/30/23 12:37, Zhenzhong Duan wrote:
> > Hi All,
> > 
> > As the kernel side iommufd cdev and hot reset feature have been queued,
> > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > to update a new version matching kernel side update and with rfc flag
> > removed. Qemu code can be found at [2], look forward more comments!
> 
> FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> is the removal of nvlink2, which was dropped from the kernel 2.5 years
> ago. Next is probably removal of all the PPC bits in VFIO. Code is
> bitrotting and AFAICT VFIO has been broken on these platforms since
> 5.18 or so.

It was fixed since then - at least one company (not IBM) still cares
about vfio on ppc, though I think it is for a DPDK use case not VFIO.

Jason
Cédric Le Goater Sept. 18, 2023, 12:23 p.m. UTC | #6
On 9/18/23 13:51, Jason Gunthorpe wrote:
> On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
>> On 8/30/23 12:37, Zhenzhong Duan wrote:
>>> Hi All,
>>>
>>> As the kernel side iommufd cdev and hot reset feature have been queued,
>>> also hwpt alloc has been added in Jason's for_next branch [1], I'd like
>>> to update a new version matching kernel side update and with rfc flag
>>> removed. Qemu code can be found at [2], look forward more comments!
>>
>> FYI, I have started cleaning up the VFIO support in QEMU PPC. First
>> is the removal of nvlink2, which was dropped from the kernel 2.5 years
>> ago. Next is probably removal of all the PPC bits in VFIO. Code is
>> bitrotting and AFAICT VFIO has been broken on these platforms since
>> 5.18 or so.
> 
> It was fixed since then - at least one company (not IBM) still cares
> about vfio on ppc, though I think it is for a DPDK use case not VFIO.

Indeed.
I just checked on a POWER9 box running Debian sid (6.4), and device
assignment of a simple NIC (e1000e) in an Ubuntu 23.04 guest worked
correctly. Using 6.6-rc1 on the host also worked. One improvement
would be to reflect in the Kconfig files that CONFIG_IOMMUFD is not
supported on PPC so that it cannot be selected.

Thanks,

C.
Jason Gunthorpe Sept. 18, 2023, 5:56 p.m. UTC | #7
On Mon, Sep 18, 2023 at 02:23:48PM +0200, Cédric Le Goater wrote:
> On 9/18/23 13:51, Jason Gunthorpe wrote:
> > On Fri, Sep 15, 2023 at 02:42:48PM +0200, Cédric Le Goater wrote:
> > > On 8/30/23 12:37, Zhenzhong Duan wrote:
> > > > Hi All,
> > > > 
> > > > As the kernel side iommufd cdev and hot reset feature have been queued,
> > > > also hwpt alloc has been added in Jason's for_next branch [1], I'd like
> > > > to update a new version matching kernel side update and with rfc flag
> > > > removed. Qemu code can be found at [2], look forward more comments!
> > > 
> > > FYI, I have started cleaning up the VFIO support in QEMU PPC. First
> > > is the removal of nvlink2, which was dropped from the kernel 2.5 years
> > > ago. Next is probably removal of all the PPC bits in VFIO. Code is
> > > bitrotting and AFAICT VFIO has been broken on these platforms since
> > > 5.18 or so.
> > 
> > It was fixed since then - at least one company (not IBM) still cares
> > about vfio on ppc, though I think it is for a DPDK use case not VFIO.
> 
> Indeed.
> I just checked on a POWER9 box running Debian sid (6.4), and device
> assignment of a simple NIC (e1000e) in an Ubuntu 23.04 guest worked
> correctly. Using 6.6-rc1 on the host also worked. One improvement
> would be to reflect in the Kconfig files that CONFIG_IOMMUFD is not
> supported on PPC so that it cannot be selected.

When we did this I thought there were other iommu drivers on Power
that did work with VFIO (fsl_pamu specifically), but it turns out that
this ppc iommu driver doesn't support VFIO and the VFIO FSL stuff is
for ARM only.

So it could be done...

These days I believe we have the capacity to do the PPC stuff without
making it so special - it would be a lot of work but the road is pretty
clear. At least if qemu wants to remove PPC VFIO support I would not
object.

Jason