[RFC,00/18] vfio: Adopt iommufd

Message ID 20220414104710.28534-1-yi.l.liu@intel.com (mailing list archive)

Message

Yi Liu April 14, 2022, 10:46 a.m. UTC
With the introduction of iommufd[1], the Linux kernel provides a generic
interface for userspace drivers to propagate their DMA mappings to the
kernel for assigned devices. This series ports the VFIO devices onto the
/dev/iommu uAPI and lets the new interface coexist with the legacy
implementation. Other devices such as vdpa and vfio mdev are not
considered yet.

For vfio devices, the new interface is tied to the device fd and iommufd,
as the iommufd solution is device-centric. This is different from legacy
vfio, which is group-centric. To support both interfaces in QEMU, this
series introduces the iommu backend concept in the form of different
container classes. The existing vfio container is named the legacy container
(equivalent to the legacy iommu backend in this series), while the new
iommufd based container is named the iommufd container (also referred to
as the iommufd backend in this series). The two backend types have their own
way of setting up the secure context and their own DMA management interface.
The diagram below shows how it looks with both BEs.

                    VFIO                           AddressSpace/Memory
    +-------+  +----------+  +-----+  +-----+
    |  pci  |  | platform |  |  ap |  | ccw |
    +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
        |           |           |        |        |   AddressSpace       |
        |           |           |        |        +------------+---------+
    +---V-----------V-----------V--------V----+               /
    |           VFIOAddressSpace              | <------------+
    |                  |                      |  MemoryListener
    |          VFIOContainer list             |
    +-------+----------------------------+----+
            |                            |
            |                            |
    +-------V------+            +--------V----------+
    |   iommufd    |            |    vfio legacy    |
    |  container   |            |     container     |
    +-------+------+            +--------+----------+
            |                            |
            | /dev/iommu                 | /dev/vfio/vfio
            | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
 Userspace  |                            |
 ===========+============================+================================
 Kernel     |  device fd                 |
            +---------------+            | group/container fd
            | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
            |  ATTACH_IOAS) |            | device fd
            |               |            |
            |       +-------V------------V-----------------+
    iommufd |       |                vfio                  |
(map/unmap  |       +---------+--------------------+-------+
 ioas_copy) |                 |                    | map/unmap
            |                 |                    |
     +------V------+    +-----V------+      +------V--------+
     | iommufd core|    |  device    |      |  vfio iommu   |
     +-------------+    +------------+      +---------------+

[Secure Context setup]
- iommufd BE: uses the device fd and iommufd to set up the secure context
              (bind_iommufd, attach_ioas)
- vfio legacy BE: uses the group fd and container fd to set up the secure
                  context (set_container, set_iommu)
[Device access]
- iommufd BE: the device fd is opened through /dev/vfio/devices/vfioX
- vfio legacy BE: the device fd is retrieved via a group fd ioctl
[DMA Mapping flow]
- VFIOAddressSpace receives MemoryRegion add/del via the MemoryListener
- VFIO issues DMA map/unmap via the container BEs (see the sketch below)
  *) iommufd BE: uses iommufd
  *) vfio legacy BE: uses the container fd
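
As an illustration only, here is a minimal sketch of how the common layer
could dispatch DMA map/unmap to the selected backend; the identifiers
(VFIOContainerObj, VFIOContainerOps, vfio_container_dma_map) are
hypothetical and not necessarily the names used in the patches:

    /* Hypothetical sketch of the per-backend DMA dispatch */
    #include "qemu/osdep.h"
    #include "exec/hwaddr.h"
    #include "qom/object.h"

    typedef struct VFIOContainerObj VFIOContainerObj;

    typedef struct VFIOContainerOps {
        /* legacy BE: VFIO_IOMMU_MAP_DMA ioctl on the container fd;
         * iommufd BE: map request on /dev/iommu for the attached IOAS */
        int (*dma_map)(VFIOContainerObj *c, hwaddr iova, uint64_t size,
                       void *vaddr, bool readonly);
        int (*dma_unmap)(VFIOContainerObj *c, hwaddr iova, uint64_t size);
    } VFIOContainerOps;

    struct VFIOContainerObj {
        Object parent_obj;             /* QOM base object */
        const VFIOContainerOps *ops;   /* installed by the chosen backend */
    };

    /* Called from the MemoryListener region_add/region_del callbacks */
    static int vfio_container_dma_map(VFIOContainerObj *c, hwaddr iova,
                                      uint64_t size, void *vaddr,
                                      bool readonly)
    {
        return c->ops->dma_map(c, iova, size, vaddr, readonly);
    }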

This series QOM-ifies the VFIOContainer object, which acts as a base class
for a container. This base class is derived into the legacy VFIO container
and the new iommufd based container. The base class implements the generic
code, such as the memory listener and address space management, whereas
the derived classes implement the callbacks specific to the kernel
user-space interface (legacy container fd or iommufd) being used.
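
As a rough sketch of the QOM plumbing (the type names and registration
details here are illustrative, not the exact code of the series; the
VFIOContainerObj instance struct is the one from the sketch above):

    #include "qom/object.h"

    #define TYPE_VFIO_CONTAINER_OBJ     "vfio-container-obj"
    #define TYPE_VFIO_LEGACY_CONTAINER  "vfio-legacy-container"
    #define TYPE_VFIO_IOMMUFD_CONTAINER "vfio-iommufd-container"

    static const TypeInfo vfio_container_obj_info = {
        .name          = TYPE_VFIO_CONTAINER_OBJ,
        .parent        = TYPE_OBJECT,
        .instance_size = sizeof(VFIOContainerObj),
        .abstract      = true,  /* only the derived backends get instantiated */
    };

    static const TypeInfo vfio_legacy_container_info = {
        .name   = TYPE_VFIO_LEGACY_CONTAINER,
        .parent = TYPE_VFIO_CONTAINER_OBJ,
        /* instance_init would install the legacy dma_map/dma_unmap ops */
    };

    static const TypeInfo vfio_iommufd_container_info = {
        .name   = TYPE_VFIO_IOMMUFD_CONTAINER,
        .parent = TYPE_VFIO_CONTAINER_OBJ,
        /* ... and here the iommufd-based ops */
    };

    static void vfio_register_container_types(void)
    {
        type_register_static(&vfio_container_obj_info);
        type_register_static(&vfio_legacy_container_info);
        type_register_static(&vfio_iommufd_container_info);
    }

    type_init(vfio_register_container_types);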

The backend is selected on a per-device basis using the new iommufd
option (on/off/auto). By default the iommufd backend is selected if it is
supported by both the host and QEMU (iommufd Kconfig). This option is
currently available only for the vfio-pci device; for other device types
it does not exist yet and the legacy BE is chosen by default.
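
For example (illustrative command lines; the iommufd option is what patch
0018 introduces, the BDF below is made up):

    # explicitly request the iommufd backend for one device
    -device vfio-pci,host=0000:02:00.0,iommufd=on
    # force the legacy container backend
    -device vfio-pci,host=0000:02:00.0,iommufd=off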

Tests done:
- PCI and platform devices were tested
- ccw and ap were only compile-tested
- limited device hotplug tests
- vIOMMU tests run for both the legacy and iommufd backends (limited tests)

This series was co-developed by Eric Auger and me based on the exploration
iommufd kernel[2]; the complete code of this series is available in [3]. As
the iommufd kernel work is still at an early stage (only the iommufd generic
interface is on the mailing list), this series has not yet brought the
iommufd backend fully on par with the legacy backend w.r.t. features like
p2p mappings, coherency tracking, live migration, etc. PCI devices without
FLR are not supported either, as the kernel does not support
VFIO_DEVICE_PCI_HOT_RESET when userspace is using iommufd. The kernel needs
to be updated to accept a device fd list for reset when userspace is using
iommufd; related work is in progress by Jason[4].

TODOs:
- Add a DMA alias check for the iommufd BE (group level)
- Make pci.c BE-agnostic; needs a kernel change as well to fix the
  VFIO_DEVICE_PCI_HOT_RESET gap
- Clean up the VFIODevice fields as the structure is used in both BEs
- Add locks
- Replace lists with g_tree
- More tests

Patch Overview:

- Preparation:
  0001-scripts-update-linux-headers-Add-iommufd.h.patch
  0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
  0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
  0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
  0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
  0006-vfio-common-Split-common.c-into-common.c-container.c.patch

- Introduce container object and convert existing vfio to use it:
  0007-vfio-Add-base-object-for-VFIOContainer.patch
  0008-vfio-container-Introduce-vfio_attach-detach_device.patch
  0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
  0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
  0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
  0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
  0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch

- Introduce iommufd based container:
  0014-hw-iommufd-Creation.patch
  0015-vfio-iommufd-Implement-iommufd-backend.patch
  0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch

- Add backend selection for vfio-pci:
  0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
  0018-vfio-pci-Add-an-iommufd-option.patch

[1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
[2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
[3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
[4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-vfio_mdev_no_group_jgg@nvidia.com/

Base commit: 4bf58c7 virtio-iommu: use-after-free fix

Thanks,
Yi & Eric

Eric Auger (12):
  scripts/update-linux-headers: Add iommufd.h
  linux-headers: Import latest vfio.h and iommufd.h
  hw/vfio/pci: fix vfio_pci_hot_reset_result trace point
  vfio/pci: Use vbasedev local variable in vfio_realize()
  vfio/container: Introduce vfio_[attach/detach]_device
  vfio/platform: Use vfio_[attach/detach]_device
  vfio/ap: Use vfio_[attach/detach]_device
  vfio/ccw: Use vfio_[attach/detach]_device
  vfio/container-obj: Introduce [attach/detach]_device container
    callbacks
  vfio/container-obj: Introduce VFIOContainer reset callback
  vfio/as: Allow the selection of a given iommu backend
  vfio/pci: Add an iommufd option

Yi Liu (6):
  vfio/common: Rename VFIOGuestIOMMU::iommu into ::iommu_mr
  vfio/common: Split common.c into common.c, container.c and as.c
  vfio: Add base object for VFIOContainer
  hw/iommufd: Creation
  vfio/iommufd: Implement iommufd backend
  vfio/iommufd: Add IOAS_COPY_DMA support

 MAINTAINERS                          |    7 +
 hw/Kconfig                           |    1 +
 hw/iommufd/Kconfig                   |    4 +
 hw/iommufd/iommufd.c                 |  209 +++
 hw/iommufd/meson.build               |    1 +
 hw/iommufd/trace-events              |   11 +
 hw/iommufd/trace.h                   |    1 +
 hw/meson.build                       |    1 +
 hw/vfio/ap.c                         |   62 +-
 hw/vfio/as.c                         | 1042 ++++++++++++
 hw/vfio/ccw.c                        |  118 +-
 hw/vfio/common.c                     | 2340 ++------------------------
 hw/vfio/container-obj.c              |  221 +++
 hw/vfio/container.c                  | 1308 ++++++++++++++
 hw/vfio/iommufd.c                    |  570 +++++++
 hw/vfio/meson.build                  |    6 +
 hw/vfio/migration.c                  |    4 +-
 hw/vfio/pci.c                        |  133 +-
 hw/vfio/platform.c                   |   42 +-
 hw/vfio/spapr.c                      |   22 +-
 hw/vfio/trace-events                 |   11 +
 include/hw/iommufd/iommufd.h         |   37 +
 include/hw/vfio/vfio-common.h        |   96 +-
 include/hw/vfio/vfio-container-obj.h |  169 ++
 linux-headers/linux/iommufd.h        |  223 +++
 linux-headers/linux/vfio.h           |   84 +
 meson.build                          |    1 +
 scripts/update-linux-headers.sh      |    2 +-
 28 files changed, 4258 insertions(+), 2468 deletions(-)
 create mode 100644 hw/iommufd/Kconfig
 create mode 100644 hw/iommufd/iommufd.c
 create mode 100644 hw/iommufd/meson.build
 create mode 100644 hw/iommufd/trace-events
 create mode 100644 hw/iommufd/trace.h
 create mode 100644 hw/vfio/as.c
 create mode 100644 hw/vfio/container-obj.c
 create mode 100644 hw/vfio/container.c
 create mode 100644 hw/vfio/iommufd.c
 create mode 100644 include/hw/iommufd/iommufd.h
 create mode 100644 include/hw/vfio/vfio-container-obj.h
 create mode 100644 linux-headers/linux/iommufd.h

Comments

Nicolin Chen April 15, 2022, 8:37 a.m. UTC | #1
Hi,

Thanks for the work!

On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
 
> - More tests

I did a quick test on my ARM64 platform, using "iommu=smmuv3"
string. The behaviors are different between using default and
using legacy "iommufd=off".

The legacy pathway exits the VM with:
    vfio 0002:01:00.0:
    failed to setup container for group 1:
    memory listener initialization failed:
    Region smmuv3-iommu-memory-region-16-0:
    device 00.02.0 requires iommu MAP notifier which is not currently supported

while the iommufd pathway started the VM but reported errors
from host kernel about address translation failures, probably
because of accessing unmapped addresses.

I found iommufd pathway also calls error_propagate_prepend()
to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
doesn't get a chance to print errp out. Perhaps there should
be a final error check somewhere to exit?

Nic
Eric Auger April 17, 2022, 10:30 a.m. UTC | #2
Hi Nicolin,

On 4/15/22 10:37 AM, Nicolin Chen wrote:
> Hi,
>
> Thanks for the work!
>
> On Thu, Apr 14, 2022 at 03:46:52AM -0700, Yi Liu wrote:
>  
>> - More tests
> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> string. The behaviors are different between using default and
> using legacy "iommufd=off".
>
> The legacy pathway exits the VM with:
>     vfio 0002:01:00.0:
>     failed to setup container for group 1:
>     memory listener initialization failed:
>     Region smmuv3-iommu-memory-region-16-0:
>     device 00.02.0 requires iommu MAP notifier which is not currently supported
>
> while the iommufd pathway started the VM but reported errors
> from host kernel about address translation failures, probably
> because of accessing unmapped addresses.
>
> I found iommufd pathway also calls error_propagate_prepend()
> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> doesn't get a chance to print errp out. Perhaps there should
> be a final error check somewhere to exit?

thank you for giving it a try.

vsmmuv3 + vfio is not supported as we miss the HW nested stage support
and SMMU does not support cache mode. If you want to test viommu on ARM
you shall test virtio-iommu+vfio. This should work but this is not yet
tested.

I pushed a fix for the error notification issue:
qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Thanks

Eric
>
> Nic
>
Tian, Kevin April 18, 2022, 8:49 a.m. UTC | #3
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, April 14, 2022 6:47 PM
> 
> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.

vfio mdev has no special support in Qemu. Just that it's not supported
by iommufd yet thus can only be operated in legacy container interface at
this point. Later once it's supported by the kernel suppose no additional
enabling work is required for mdev in Qemu.

> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be
> mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
> 
> ===========+============================+=======================
> =========
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+

last row: s/iommfd/iommufd/

overall this sounds a reasonable abstraction. Later when vdpa starts
supporting iommufd probably the iommufd BE will become even
smaller with more logic shareable between vfio and vdpa.

> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class

what does 'qomify' mean? I didn't find this word from dictionary...

> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic
> code
> such as code related to memory_listener and address space management
> whereas
> the derived class implements callbacks that depend on the kernel user space

'the kernel user space'?

> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.
> 
> Test done:
> - PCI and Platform device were tested

In this case PCI uses iommufd while platform device uses legacy?

> - ccw and ap were only compile-tested
> - limited device hotplug test
> - vIOMMU test run for both legacy and iommufd backends (limited tests)
> 
> This series was co-developed by Eric Auger and me based on the exploration
> iommufd kernel[2], complete code of this series is available in[3]. As
> iommufd kernel is in the early step (only iommufd generic interface is in
> mailing list), so this series hasn't made the iommufd backend fully on par
> with legacy backend w.r.t. features like p2p mappings, coherency tracking,

what does 'coherency tracking' mean here? if related to iommu enforce
snoop it is fully handled by the kernel so far. I didn't find any use of
VFIO_DMA_CC_IOMMU in current Qemu.

> live migration, etc. This series hasn't supported PCI devices without FLR
> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
> userspace
> is using iommufd. The kernel needs to be updated to accept device fd list for
> reset when userspace is using iommufd. Related work is in progress by
> Jason[4].
> 
> TODOs:
> - Add DMA alias check for iommufd BE (group level)
> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>   VFIO_DEVICE_PCI_HOT_RESET gap
> - Cleanup the VFIODevice fields as it's used in both BEs
> - Add locks
> - Replace list with g_tree
> - More tests
> 
> Patch Overview:
> 
> - Preparation:
>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>   0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
> iommu_m.patch

3-5 are pure cleanups which could be sent out separately 

>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
> 
> - Introduce container object and covert existing vfio to use it:
>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
> 
> - Introduce iommufd based container:
>   0014-hw-iommufd-Creation.patch
>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
> 
> - Add backend selection for vfio-pci:
>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>   0018-vfio-pci-Add-an-iommufd-option.patch
> 
> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
> iommufd_jgg@nvidia.com/
> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
> vfio_mdev_no_group_jgg@nvidia.com/

Following is probably more relevant to [4]:

https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/

Thanks
Kevin
Yi Liu April 18, 2022, 12:09 p.m. UTC | #4
Hi Kevin,

On 2022/4/18 16:49, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> vfio mdev has no special support in Qemu. Just that it's not supported
> by iommufd yet thus can only be operated in legacy container interface at
> this point. Later once it's supported by the kernel suppose no additional
> enabling work is required for mdev in Qemu.

yes. will make it more precise in next version.

>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                      VFIO                           AddressSpace/Memory
>>      +-------+  +----------+  +-----+  +-----+
>>      |  pci  |  | platform |  |  ap |  | ccw |
>>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>          |           |           |        |        |   AddressSpace       |
>>          |           |           |        |        +------------+---------+
>>      +---V-----------V-----------V--------V----+               /
>>      |           VFIOAddressSpace              | <------------+
>>      |                  |                      |  MemoryListener
>>      |          VFIOContainer list             |
>>      +-------+----------------------------+----+
>>              |                            |
>>              |                            |
>>      +-------V------+            +--------V----------+
>>      |   iommufd    |            |    vfio legacy    |
>>      |  container   |            |     container     |
>>      +-------+------+            +--------+----------+
>>              |                            |
>>              | /dev/iommu                 | /dev/vfio/vfio
>>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>   Userspace  |                            |
>>
>> ===========+============================+=======================
>> =========
>>   Kernel     |  device fd                 |
>>              +---------------+            | group/container fd
>>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>              |  ATTACH_IOAS) |            | device fd
>>              |               |            |
>>              |       +-------V------------V-----------------+
>>      iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>   ioas_copy) |                 |                    | map/unmap
>>              |                 |                    |
>>       +------V------+    +-----V------+      +------V--------+
>>       | iommfd core |    |  device    |      |  vfio iommu   |
>>       +-------------+    +------------+      +---------------+
> 
> last row: s/iommfd/iommufd/

thanks. a typo.

> overall this sounds a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.

let's see if Jason Wang will give some idea. :-)

>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>                (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                    (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>    *) iommufd BE: uses iommufd
>>    *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> 
> what does 'qomify' mean? I didn't find this word from dictionary...
> 
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
> 
> 'the kernel user space'?

aha, I just wanted to express that different BE callbacks use different
user interfaces exposed by the kernel. Will refine the wording.

> 
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> 
> In this case PCI uses iommufd while platform device uses legacy?

For PCI, both legacy and iommufd were tested. The exploration kernel branch 
doesn't have the new device uapi for platform device, so I didn't test it.
But I remember Eric should have tested it with iommufd. Eric?

>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> 
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.

It's the kvm_group add/del stuff. Perhaps saying kvm_group add/del
equivalence would be better?

>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>    VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>    0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>    0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>    0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>    0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>    0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>> iommu_m.patch
> 
> 3-5 are pure cleanups which could be sent out separately

yes. may send later after checking with Eric. :-)

>>    0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>    0007-vfio-Add-base-object-for-VFIOContainer.patch
>>    0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>    0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>    0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>    0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>    0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>    0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>    0014-hw-iommufd-Creation.patch
>>    0015-vfio-iommufd-Implement-iommufd-backend.patch
>>    0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>    0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>    0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>> vfio_mdev_no_group_jgg@nvidia.com/
> 
> Following is probably more relevant to [4]:
> 
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/

absolutely. :-) thanks.

> Thanks
> Kevin
Nicolin Chen April 19, 2022, 3:26 a.m. UTC | #5
On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:

> >> - More tests
> > I did a quick test on my ARM64 platform, using "iommu=smmuv3"
> > string. The behaviors are different between using default and
> > using legacy "iommufd=off".
> >
> > The legacy pathway exits the VM with:
> >     vfio 0002:01:00.0:
> >     failed to setup container for group 1:
> >     memory listener initialization failed:
> >     Region smmuv3-iommu-memory-region-16-0:
> >     device 00.02.0 requires iommu MAP notifier which is not currently supported
> >
> > while the iommufd pathway started the VM but reported errors
> > from host kernel about address translation failures, probably
> > because of accessing unmapped addresses.
> >
> > I found iommufd pathway also calls error_propagate_prepend()
> > to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
> > doesn't get a chance to print errp out. Perhaps there should
> > be a final error check somewhere to exit?
> 
> thank you for giving it a try.
> 
> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
> and SMMU does not support cache mode. If you want to test viommu on ARM
> you shall test virtio-iommu+vfio. This should work but this is not yet
> tested.

I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
separately with vfio-pci, but neither seems to work. The host
SMMU driver reports Translation Faults.

Do you know what commands I should use to run QEMU for that
combination?

> I pushed a fix for the error notification issue:
> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git

Yes. This fixes the problem. Thanks!
Alex Williamson April 22, 2022, 10:09 p.m. UTC | #6
[Cc +libvirt folks]

On Thu, 14 Apr 2022 03:46:52 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO                           AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace       |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
>  ===========+============================+================================
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic code
> such as code related to memory_listener and address space management whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.

I've discussed this a bit with Eric, but let me propose a different
command line interface.  Libvirt generally likes to pass file
descriptors to QEMU rather than grant it access to those files
directly.  This was problematic with vfio-pci because libvirt can't
easily know when QEMU will want to grab another /dev/vfio/vfio
container.  Therefore we abandoned this approach and instead libvirt
grants file permissions.

However, with iommufd there's no reason that QEMU ever needs more than
a single instance of /dev/iommufd and we're using per device vfio file
descriptors, so it seems like a good time to revisit this.

The interface I was considering would be to add an iommufd object to
QEMU, so we might have a:

-device iommufd[,fd=#][,id=foo]

For non-libivrt usage this would have the ability to open /dev/iommufd
itself if an fd is not provided.  This object could be shared with
other iommufd users in the VM and maybe we'd allow multiple instances
for more esoteric use cases.  [NB, maybe this should be a -object rather than
-device since the iommufd is not a guest visible device?]

The vfio-pci device might then become:

-device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

So essentially we can specify the device via host, sysfsdev, or passing
an fd to the vfio device file.  When an iommufd object is specified,
"foo" in the example above, each of those options would use the
vfio-device access mechanism, essentially the same as iommufd=on in
your example.  With the fd passing option, an iommufd object would be
required and necessarily use device level access.

In your example, the iommufd=auto seems especially troublesome for
libvirt because QEMU is going to have different locked memory
requirements based on whether we're using type1 or iommufd, where the
latter resolves the duplicate accounting issues.  libvirt needs to know
deterministically which backend is being used, which this proposal seems
to provide, while at the same time bringing us more in line with fd
passing.  Thoughts?  Thanks,

Alex
Daniel P. Berrangé April 25, 2022, 10:10 a.m. UTC | #7
On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> [Cc +libvirt folks]
> 
> On Thu, 14 Apr 2022 03:46:52 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > With the introduction of iommufd[1], the linux kernel provides a generic
> > interface for userspace drivers to propagate their DMA mappings to kernel
> > for assigned devices. This series does the porting of the VFIO devices
> > onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> > Other devices like vpda, vfio mdev and etc. are not considered yet.

snip

> > The selection of the backend is made on a device basis using the new
> > iommufd option (on/off/auto). By default the iommufd backend is selected
> > if supported by the host and by QEMU (iommufd KConfig). This option is
> > currently available only for the vfio-pci device. For other types of
> > devices, it does not yet exist and the legacy BE is chosen by default.
> 
> I've discussed this a bit with Eric, but let me propose a different
> command line interface.  Libvirt generally likes to pass file
> descriptors to QEMU rather than grant it access to those files
> directly.  This was problematic with vfio-pci because libvirt can't
> easily know when QEMU will want to grab another /dev/vfio/vfio
> container.  Therefore we abandoned this approach and instead libvirt
> grants file permissions.
> 
> However, with iommufd there's no reason that QEMU ever needs more than
> a single instance of /dev/iommufd and we're using per device vfio file
> descriptors, so it seems like a good time to revisit this.

I assume access to '/dev/iommufd' gives the process somewhat elevated
privileges, such that you don't want to unconditionally give QEMU
access to this device?

> The interface I was considering would be to add an iommufd object to
> QEMU, so we might have a:
> 
> -device iommufd[,fd=#][,id=foo]
> 
> For non-libivrt usage this would have the ability to open /dev/iommufd
> itself if an fd is not provided.  This object could be shared with
> other iommufd users in the VM and maybe we'd allow multiple instances
> for more esoteric use cases.  [NB, maybe this should be a -object rather than
> -device since the iommufd is not a guest visible device?]

Yes,  -object would be the right answer for something that's purely
a host side backend impl selector.

> The vfio-pci device might then become:
> 
> -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> So essentially we can specify the device via host, sysfsdev, or passing
> an fd to the vfio device file.  When an iommufd object is specified,
> "foo" in the example above, each of those options would use the
> vfio-device access mechanism, essentially the same as iommufd=on in
> your example.  With the fd passing option, an iommufd object would be
> required and necessarily use device level access.
> 
> In your example, the iommufd=auto seems especially troublesome for
> libvirt because QEMU is going to have different locked memory
> requirements based on whether we're using type1 or iommufd, where the
> latter resolves the duplicate accounting issues.  libvirt needs to know
> deterministically which backed is being used, which this proposal seems
> to provide, while at the same time bringing us more in line with fd
> passing.  Thoughts?  Thanks,

Yep, I agree that libvirt needs to have more direct control over this.
This is also even more important if there are notable feature differences
in the 2 backends.

I wonder if anyone has considered an even more distinct impl, whereby
we have a completely different device type on the backend, eg

  -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]

If a vendor wants to fully remove the legacy impl, they can then use the
Kconfig mechanism to disable the build of the legacy impl device, while
keeping the iommu impl (or vice-versa if the new iommu impl isn't considered
reliable enough for them to support yet).

Libvirt would use

   -object iommu,id=iommu0,fd=NNN
   -device vfio-iommu-pci,fd=MMM,iommu=iommu0

Non-libvirt would use a simpler

   -device vfio-iommu-pci,host=0000:03:22.1

with QEMU auto-creating a 'iommu' object in the background.

This would fit into libvirt's existing modelling better. We currently have
a concept of a PCI assignment backend, which previously supported the
legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
feels like a 3rd PCI assignment approach, and so fits with how we modelled
it as a different device type in the past.

With regards,
Daniel
Jason Gunthorpe April 25, 2022, 1:36 p.m. UTC | #8
On Mon, Apr 25, 2022 at 11:10:14AM +0100, Daniel P. Berrangé wrote:

> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It doesn't give much; at worst it allows userspace to allocate kernel
memory and pin pages, which can already be done through all sorts of
other interfaces qemu already has access to.

Jason
Alex Williamson April 25, 2022, 2:37 p.m. UTC | #9
On Mon, 25 Apr 2022 11:10:14 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > [Cc +libvirt folks]
> > 
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > With the introduction of iommufd[1], the linux kernel provides a generic
> > > interface for userspace drivers to propagate their DMA mappings to kernel
> > > for assigned devices. This series does the porting of the VFIO devices
> > > onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> 
> snip
> 
> > > The selection of the backend is made on a device basis using the new
> > > iommufd option (on/off/auto). By default the iommufd backend is selected
> > > if supported by the host and by QEMU (iommufd KConfig). This option is
> > > currently available only for the vfio-pci device. For other types of
> > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > 
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> > 
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.  
> 
> I assume access to '/dev/iommufd' gives the process somewhat elevated
> privileges, such that you don't want to unconditionally give QEMU
> access to this device ?

It's not that dissimilar to /dev/vfio/vfio; it's an unprivileged
interface which should have limited scope for abuse, but here the goal
would be to de-privilege QEMU one step further so that it cannot open
the device file itself.

> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> > 
> > -device iommufd[,fd=#][,id=foo]
> > 
> > For non-libivrt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather than
> > -device since the iommufd is not a guest visible device?]  
> 
> Yes,  -object would be the right answer for something that's purely
> a host side backend impl selector.
> 
> > The vfio-pci device might then become:
> > 
> > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > 
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.
> > 
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backed is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> 
> Yep, I agree that libvirt needs to have more direct control over this.
> This is also even more important if there are notable feature differences
> in the 2 backends.
> 
> I wonder if anyone has considered an even more distinct impl, whereby
> we have a completely different device type on the backend, eg
> 
>   -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> 
> If a vendor wants to fully remove the legacy impl, they can then use the
> Kconfig mechanism to disable the build of the legacy impl device, while
> keeping the iommu impl (or vica-verca if the new iommu impl isn't considered
> reliable enough for them to support yet).
> 
> Libvirt would use
> 
>    -object iommu,id=iommu0,fd=NNN
>    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> 
> Non-libvirt would use a simpler
> 
>    -device vfio-iommu-pci,host=0000:03:22.1
> 
> with QEMU auto-creating a 'iommu' object in the background.
> 
> This would fit into libvirt's existing modelling better. We currently have
> a concept of a PCI assignment backend, which previously supported the
> legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> feels like a 3rd PCI assignment approach, and so fits with how we modelled
> it as a different device type in the past.

I don't think we want to conflate "iommu" and "iommufd", we're creating
an object that interfaces into the iommufd uAPI, not an iommu itself.
Likewise "vfio-iommu-pci" is just confusing, there was an iommu
interface previously, it's just a different implementation now and as
far as the VM interface to the device, it's identical.  Note that a
"vfio-iommufd-pci" device multiplies the matrix of every vfio device
for a rather subtle implementation detail.

My expectation would be that libvirt uses:

 -object iommufd,id=iommufd0,fd=NNN
 -device vfio-pci,fd=MMM,iommufd=iommufd0

Whereas simple QEMU command line would be:

 -object iommufd,id=iommufd0
 -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0

The iommufd object would open /dev/iommufd itself.  Creating an
implicit iommufd object is somewhat problematic because one of the
things I forgot to highlight in my previous description is that the
iommufd object is meant to be shared across not only various vfio
devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
vdpa.

If the old style were used:

 -device vfio-pci,host=0000:02:00.0

Then QEMU would use vfio for the IOMMU backend.

If libvirt/userspace wants to query whether "legacy" vfio is still
supported by the host kernel, I think it'd only need to look for
whether the /dev/vfio/vfio container interface still exists.

If we need some means for QEMU to remove legacy support, I'd rather
find a way to do it via probing device options.  It's easy enough to
see if iommufd support exists by looking for the presence of the
iommufd option for the vfio-pci device and Kconfig within QEMU could be
used regardless of whether we define a new device name.  Thanks,

Alex
Eric Auger April 25, 2022, 7:40 p.m. UTC | #10
Hi Nicolin,

On 4/19/22 5:26 AM, Nicolin Chen wrote:
> On Sun, Apr 17, 2022 at 12:30:40PM +0200, Eric Auger wrote:
>
>>>> - More tests
>>> I did a quick test on my ARM64 platform, using "iommu=smmuv3"
>>> string. The behaviors are different between using default and
>>> using legacy "iommufd=off".
>>>
>>> The legacy pathway exits the VM with:
>>>     vfio 0002:01:00.0:
>>>     failed to setup container for group 1:
>>>     memory listener initialization failed:
>>>     Region smmuv3-iommu-memory-region-16-0:
>>>     device 00.02.0 requires iommu MAP notifier which is not currently supported
>>>
>>> while the iommufd pathway started the VM but reported errors
>>> from host kernel about address translation failures, probably
>>> because of accessing unmapped addresses.
>>>
>>> I found iommufd pathway also calls error_propagate_prepend()
>>> to add to errp for not supporting IOMMU_NOTIFIER_MAP, but it
>>> doesn't get a chance to print errp out. Perhaps there should
>>> be a final error check somewhere to exit?
>> thank you for giving it a try.
>>
>> vsmmuv3 + vfio is not supported as we miss the HW nested stage support
>> and SMMU does not support cache mode. If you want to test viommu on ARM
>> you shall test virtio-iommu+vfio. This should work but this is not yet
>> tested.
> I tried "-device virtio-iommu" and "-device virtio-iommu-pci"
> separately with vfio-pci, but neither seems to work. The host
> SMMU driver reports Translation Faults.
>
> Do you know what commands I should use to run QEMU for that
> combination?
you should use:

 -device virtio-iommu-pci -device vfio-pci,host=<BDF>

Please make sure the "-device virtio-iommu-pci" is set *before* the
"-device vfio-pci,"

Otherwise the IOMMU MR notifiers are not set properly and this may be
the cause of your physical SMMU translation faults.

Eric
>
>> I pushed a fix for the error notification issue:
>> qemu-for-5.17-rc6-vm-rfcv2-rc0 on my git https://github.com/eauger/qemu.git
> Yes. This fixes the problem. Thanks!
>
Eric Auger April 25, 2022, 7:51 p.m. UTC | #11
Hi,

On 4/18/22 2:09 PM, Yi Liu wrote:
> Hi Kevin,
>
> On 2022/4/18 16:49, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Thursday, April 14, 2022 6:47 PM
>>>
>>> With the introduction of iommufd[1], the linux kernel provides a
>>> generic
>>> interface for userspace drivers to propagate their DMA mappings to
>>> kernel
>>> for assigned devices. This series does the porting of the VFIO devices
>>> onto the /dev/iommu uapi and let it coexist with the legacy
>>> implementation.
>>> Other devices like vpda, vfio mdev and etc. are not considered yet.
>>
>> vfio mdev has no special support in Qemu. Just that it's not supported
>> by iommufd yet thus can only be operated in legacy container
>> interface at
>> this point. Later once it's supported by the kernel suppose no
>> additional
>> enabling work is required for mdev in Qemu.
>
> yes. will make it more precise in next version.
>
>>>
>>> For vfio devices, the new interface is tied with device fd and iommufd
>>> as the iommufd solution is device-centric. This is different from
>>> legacy
>>> vfio which is group-centric. To support both interfaces in QEMU, this
>>> series introduces the iommu backend concept in the form of different
>>> container classes. The existing vfio container is named legacy
>>> container
>>> (equivalent with legacy iommu backend in this series), while the new
>>> iommufd based container is named as iommufd container (may also be
>>> mentioned
>>> as iommufd backend in this series). The two backend types have their
>>> own
>>> way to setup secure context and dma management interface. Below diagram
>>> shows how it looks like with both BEs.
>>>
>>>                      VFIO                           AddressSpace/Memory
>>>      +-------+  +----------+  +-----+  +-----+
>>>      |  pci  |  | platform |  |  ap |  | ccw |
>>>      +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>>          |           |           |        |        |   AddressSpace       |
>>>          |           |           |        |        +------------+---------+
>>>      +---V-----------V-----------V--------V----+               /
>>>      |           VFIOAddressSpace              | <------------+
>>>      |                  |                      |  MemoryListener
>>>      |          VFIOContainer list             |
>>>      +-------+----------------------------+----+
>>>              |                            |
>>>              |                            |
>>>      +-------V------+            +--------V----------+
>>>      |   iommufd    |            |    vfio legacy    |
>>>      |  container   |            |     container     |
>>>      +-------+------+            +--------+----------+
>>>              |                            |
>>>              | /dev/iommu                 | /dev/vfio/vfio
>>>              | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>>   Userspace  |                            |
>>>
>>> ===========+============================+=======================
>>> =========
>>>   Kernel     |  device fd                 |
>>>              +---------------+            | group/container fd
>>>              | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>>              |  ATTACH_IOAS) |            | device fd
>>>              |               |            |
>>>              |       +-------V------------V-----------------+
>>>      iommufd |       |                vfio                  |
>>> (map/unmap  |       +---------+--------------------+-------+
>>>   ioas_copy) |                 |                    | map/unmap
>>>              |                 |                    |
>>>       +------V------+    +-----V------+      +------V--------+
>>>       | iommfd core |    |  device    |      |  vfio iommu   |
>>>       +-------------+    +------------+      +---------------+
>>
>> last row: s/iommfd/iommufd/
>
> thanks. a typo.
>
>> overall this sounds a reasonable abstraction. Later when vdpa starts
>> supporting iommufd probably the iommufd BE will become even
>> smaller with more logic shareable between vfio and vdpa.
>
> let's see if Jason Wang will give some idea. :-)
>
>>>
>>> [Secure Context setup]
>>> - iommufd BE: uses device fd and iommufd to setup secure context
>>>                (bind_iommufd, attach_ioas)
>>> - vfio legacy BE: uses group fd and container fd to setup secure
>>> context
>>>                    (set_container, set_iommu)
>>> [Device access]
>>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>>> [DMA Mapping flow]
>>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>>> - VFIO populates DMA map/unmap via the container BEs
>>>    *) iommufd BE: uses iommufd
>>>    *) vfio legacy BE: uses container fd
>>>
>>> This series qomifies the VFIOContainer object which acts as a base
>>> class
>>
>> what does 'qomify' mean? I didn't find this word from dictionary...
>>
>>> for a container. This base class is derived into the legacy VFIO
>>> container
>>> and the new iommufd based container. The base class implements generic
>>> code
>>> such as code related to memory_listener and address space management
>>> whereas
>>> the derived class implements callbacks that depend on the kernel
>>> user space
>>
>> 'the kernel user space'?
>
> aha, just want to express different BE callbacks will use different
> user interface exposed by kernel. will refine the wording.
>
>>
>>> being used.
>>>
>>> The selection of the backend is made on a device basis using the new
>>> iommufd option (on/off/auto). By default the iommufd backend is
>>> selected
>>> if supported by the host and by QEMU (iommufd KConfig). This option is
>>> currently available only for the vfio-pci device. For other types of
>>> devices, it does not yet exist and the legacy BE is chosen by default.
>>>
>>> Test done:
>>> - PCI and Platform device were tested
>>
>> In this case PCI uses iommufd while platform device uses legacy?
>
> For PCI, both legacy and iommufd were tested. The exploration kernel
> branch doesn't have the new device uapi for platform device, so I
> didn't test it.
> But I remember Eric should have tested it with iommufd. Eric?
No, I just ran non-regression tests for vfio-platform, in legacy mode. I
did not integrate with the new device uapi for platform devices.
>
>>> - ccw and ap were only compile-tested
>>> - limited device hotplug test
>>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>>
>>> This series was co-developed by Eric Auger and me based on the
>>> exploration
>>> iommufd kernel[2], complete code of this series is available in[3]. As
>>> iommufd kernel is in the early step (only iommufd generic interface
>>> is in
>>> mailing list), so this series hasn't made the iommufd backend fully
>>> on par
>>> with legacy backend w.r.t. features like p2p mappings, coherency
>>> tracking,
>>
>> what does 'coherency tracking' mean here? if related to iommu enforce
>> snoop it is fully handled by the kernel so far. I didn't find any use of
>> VFIO_DMA_CC_IOMMU in current Qemu.
>
> It's the kvm_group add/del stuffs.perhaps say kvm_group add/del
> equivalence
> would be better?
>
>>> live migration, etc. This series hasn't supported PCI devices
>>> without FLR
>>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>>> userspace
>>> is using iommufd. The kernel needs to be updated to accept device fd
>>> list for
>>> reset when userspace is using iommufd. Related work is in progress by
>>> Jason[4].
>>>
>>> TODOs:
>>> - Add DMA alias check for iommufd BE (group level)
>>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>>    VFIO_DEVICE_PCI_HOT_RESET gap
>>> - Cleanup the VFIODevice fields as it's used in both BEs
>>> - Add locks
>>> - Replace list with g_tree
>>> - More tests
>>>
>>> Patch Overview:
>>>
>>> - Preparation:
>>>    0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>>    0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>>    0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>>    0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>>    0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>>> iommu_m.patch
>>
>> 3-5 are pure cleanups which could be sent out separately
>
> yes. may send later after checking with Eric. :-)
Yes, it makes sense to send them separately.

Thanks

Eric
>
>>>    0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>>
>>> - Introduce container object and covert existing vfio to use it:
>>>    0007-vfio-Add-base-object-for-VFIOContainer.patch
>>>    0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>>    0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>>    0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>>    0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>>    0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>>    0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>>
>>> - Introduce iommufd based container:
>>>    0014-hw-iommufd-Creation.patch
>>>    0015-vfio-iommufd-Implement-iommufd-backend.patch
>>>    0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>>
>>> - Add backend selection for vfio-pci:
>>>    0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>>    0018-vfio-pci-Add-an-iommufd-option.patch
>>>
>>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>>> iommufd_jgg@nvidia.com/
>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>>> vfio_mdev_no_group_jgg@nvidia.com/
>>
>> Following is probably more relevant to [4]:
>>
>> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/
>>
>
> absolutely.:-) thanks.
>
>> Thanks
>> Kevin
>
Eric Auger April 25, 2022, 7:55 p.m. UTC | #12
Hi Kevin,

On 4/18/22 10:49 AM, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 14, 2022 6:47 PM
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
> vfio mdev has no special support in Qemu. Just that it's not supported
> by iommufd yet thus can only be operated in legacy container interface at
> this point. Later once it's supported by the kernel suppose no additional
> enabling work is required for mdev in Qemu.
>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                     VFIO                           AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace       |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>
>> ===========+============================+=======================
>> =========
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
> last row: s/iommfd/iommufd/
>
> overall this sounds a reasonable abstraction. Later when vdpa starts
> supporting iommufd probably the iommufd BE will become even
> smaller with more logic shareable between vfio and vdpa.
>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
> what does 'qomify' mean? I didn't find this word from dictionary...
Sorry, this is pure QEMU terminology. It stands for "QEMU Object Model"
(QOM); additional info at:
https://qemu.readthedocs.io/en/latest/devel/qom.html
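
For a flavour of what that means in practice, here is a minimal, purely
illustrative sketch of registering an abstract QOM base type whose class
carries backend callbacks; the type, struct and callback names are made
up for the example and are not the ones used in the series:

/*
 * Illustration only: QOM registration of an abstract "container" base
 * type plus backend-specific callbacks in its class.  All names are
 * assumptions for the example, not this series' actual code.
 */
#include "qemu/osdep.h"
#include "qemu/module.h"
#include "qom/object.h"

#define TYPE_EXAMPLE_VFIO_CONTAINER "example-vfio-container"

typedef struct ExampleVFIOContainer {
    Object parent_obj;
    /* generic state shared by all backends (listener, mappings, ...) */
} ExampleVFIOContainer;

typedef struct ExampleVFIOContainerClass {
    ObjectClass parent_class;
    /* callbacks implemented differently by the legacy and iommufd BEs */
    int (*dma_map)(ExampleVFIOContainer *c, uint64_t iova, uint64_t size,
                   void *vaddr, bool readonly);
    int (*dma_unmap)(ExampleVFIOContainer *c, uint64_t iova, uint64_t size);
} ExampleVFIOContainerClass;

static const TypeInfo example_vfio_container_info = {
    .name          = TYPE_EXAMPLE_VFIO_CONTAINER,
    .parent        = TYPE_OBJECT,
    .instance_size = sizeof(ExampleVFIOContainer),
    .class_size    = sizeof(ExampleVFIOContainerClass),
    .abstract      = true,  /* only the derived backend types are created */
};

static void example_vfio_container_register_types(void)
{
    type_register_static(&example_vfio_container_info);
}

type_init(example_vfio_container_register_types);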

Eric
>
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
> 'the kernel user space'?
>
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
> In this case PCI uses iommufd while platform device uses legacy?
>
>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> what does 'coherency tracking' mean here? if related to iommu enforce
> snoop it is fully handled by the kernel so far. I didn't find any use of
> VFIO_DMA_CC_IOMMU in current Qemu.
>
>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>   VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>   0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-
>> iommu_m.patch
> 3-5 are pure cleanups which could be sent out separately 
>
>>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>   0014-hw-iommufd-Creation.patch
>>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>   0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-
>> iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>> [4] https://lore.kernel.org/kvm/0-v1-a8faf768d202+125dd-
>> vfio_mdev_no_group_jgg@nvidia.com/
> Following is probably more relevant to [4]:
>
> https://lore.kernel.org/all/10-v1-33906a626da1+16b0-vfio_kvm_no_group_jgg@nvidia.com/
>
> Thanks
> Kevin
>
Eric Auger April 25, 2022, 8:23 p.m. UTC | #13
Hi Alex,

On 4/23/22 12:09 AM, Alex Williamson wrote:
> [Cc +libvirt folks]
>
> On Thu, 14 Apr 2022 03:46:52 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                     VFIO                           AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace       |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>  ===========+============================+================================
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic code
>> such as code related to memory_listener and address space management whereas
>> the derived class implements callbacks that depend on the kernel user space
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
> I've discussed this a bit with Eric, but let me propose a different
> command line interface.  Libvirt generally likes to pass file
> descriptors to QEMU rather than grant it access to those files
> directly.  This was problematic with vfio-pci because libvirt can't
> easily know when QEMU will want to grab another /dev/vfio/vfio
> container.  Therefore we abandoned this approach and instead libvirt
> grants file permissions.
>
> However, with iommufd there's no reason that QEMU ever needs more than
> a single instance of /dev/iommufd and we're using per device vfio file
> descriptors, so it seems like a good time to revisit this.
>
> The interface I was considering would be to add an iommufd object to
> QEMU, so we might have a:
>
> -device iommufd[,fd=#][,id=foo]
>
> For non-libivrt usage this would have the ability to open /dev/iommufd
> itself if an fd is not provided.  This object could be shared with
> other iommufd users in the VM and maybe we'd allow multiple instances
> for more esoteric use cases.  [NB, maybe this should be a -object rather than
> -device since the iommufd is not a guest visible device?]
>
> The vfio-pci device might then become:
>
> -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
>
> So essentially we can specify the device via host, sysfsdev, or passing
> an fd to the vfio device file.  When an iommufd object is specified,
> "foo" in the example above, each of those options would use the
> vfio-device access mechanism, essentially the same as iommufd=on in
> your example.  With the fd passing option, an iommufd object would be
> required and necessarily use device level access.
What is the use case you foresee for the "fd=#" option?
>
> In your example, the iommufd=auto seems especially troublesome for
> libvirt because QEMU is going to have different locked memory
> requirements based on whether we're using type1 or iommufd, where the
> latter resolves the duplicate accounting issues.  libvirt needs to know
> deterministically which backed is being used, which this proposal seems
> to provide, while at the same time bringing us more in line with fd
> passing.  Thoughts?  Thanks,
I like your proposal (based on the -object iommufd). The only thing that
may be missing, I think, is that for a QEMU end-user who does not
actually care about which iommu backend is used but simply wishes to use
the most recent one available, it adds some extra complexity. But this is
not the most important use case ;)

Thanks

Eric
>
> Alex
>
Alex Williamson April 25, 2022, 10:53 p.m. UTC | #14
On Mon, 25 Apr 2022 22:23:05 +0200
Eric Auger <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 4/23/22 12:09 AM, Alex Williamson wrote:
> > [Cc +libvirt folks]
> >
> > On Thu, 14 Apr 2022 03:46:52 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >  
> >> With the introduction of iommufd[1], the linux kernel provides a generic
> >> interface for userspace drivers to propagate their DMA mappings to kernel
> >> for assigned devices. This series does the porting of the VFIO devices
> >> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> >> Other devices like vpda, vfio mdev and etc. are not considered yet.
> >>
> >> For vfio devices, the new interface is tied with device fd and iommufd
> >> as the iommufd solution is device-centric. This is different from legacy
> >> vfio which is group-centric. To support both interfaces in QEMU, this
> >> series introduces the iommu backend concept in the form of different
> >> container classes. The existing vfio container is named legacy container
> >> (equivalent with legacy iommu backend in this series), while the new
> >> iommufd based container is named as iommufd container (may also be mentioned
> >> as iommufd backend in this series). The two backend types have their own
> >> way to setup secure context and dma management interface. Below diagram
> >> shows how it looks like with both BEs.
> >>
> >>                     VFIO                           AddressSpace/Memory
> >>     +-------+  +----------+  +-----+  +-----+
> >>     |  pci  |  | platform |  |  ap |  | ccw |
> >>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
> >>         |           |           |        |        |   AddressSpace       |
> >>         |           |           |        |        +------------+---------+
> >>     +---V-----------V-----------V--------V----+               /
> >>     |           VFIOAddressSpace              | <------------+
> >>     |                  |                      |  MemoryListener
> >>     |          VFIOContainer list             |
> >>     +-------+----------------------------+----+
> >>             |                            |
> >>             |                            |
> >>     +-------V------+            +--------V----------+
> >>     |   iommufd    |            |    vfio legacy    |
> >>     |  container   |            |     container     |
> >>     +-------+------+            +--------+----------+
> >>             |                            |
> >>             | /dev/iommu                 | /dev/vfio/vfio
> >>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
> >>  Userspace  |                            |
> >>  ===========+============================+================================
> >>  Kernel     |  device fd                 |
> >>             +---------------+            | group/container fd
> >>             | (BIND_IOMMUFD |            | (SET_CONTAINER/SET_IOMMU)
> >>             |  ATTACH_IOAS) |            | device fd
> >>             |               |            |
> >>             |       +-------V------------V-----------------+
> >>     iommufd |       |                vfio                  |
> >> (map/unmap  |       +---------+--------------------+-------+
> >>  ioas_copy) |                 |                    | map/unmap
> >>             |                 |                    |
> >>      +------V------+    +-----V------+      +------V--------+
> >>      | iommfd core |    |  device    |      |  vfio iommu   |
> >>      +-------------+    +------------+      +---------------+
> >>
> >> [Secure Context setup]
> >> - iommufd BE: uses device fd and iommufd to setup secure context
> >>               (bind_iommufd, attach_ioas)
> >> - vfio legacy BE: uses group fd and container fd to setup secure context
> >>                   (set_container, set_iommu)
> >> [Device access]
> >> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> >> - vfio legacy BE: device fd is retrieved from group fd ioctl
> >> [DMA Mapping flow]
> >> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> >> - VFIO populates DMA map/unmap via the container BEs
> >>   *) iommufd BE: uses iommufd
> >>   *) vfio legacy BE: uses container fd
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> >> for a container. This base class is derived into the legacy VFIO container
> >> and the new iommufd based container. The base class implements generic code
> >> such as code related to memory_listener and address space management whereas
> >> the derived class implements callbacks that depend on the kernel user space
> >> being used.
> >>
> >> The selection of the backend is made on a device basis using the new
> >> iommufd option (on/off/auto). By default the iommufd backend is selected
> >> if supported by the host and by QEMU (iommufd KConfig). This option is
> >> currently available only for the vfio-pci device. For other types of
> >> devices, it does not yet exist and the legacy BE is chosen by default.  
> > I've discussed this a bit with Eric, but let me propose a different
> > command line interface.  Libvirt generally likes to pass file
> > descriptors to QEMU rather than grant it access to those files
> > directly.  This was problematic with vfio-pci because libvirt can't
> > easily know when QEMU will want to grab another /dev/vfio/vfio
> > container.  Therefore we abandoned this approach and instead libvirt
> > grants file permissions.
> >
> > However, with iommufd there's no reason that QEMU ever needs more than
> > a single instance of /dev/iommufd and we're using per device vfio file
> > descriptors, so it seems like a good time to revisit this.
> >
> > The interface I was considering would be to add an iommufd object to
> > QEMU, so we might have a:
> >
> > -device iommufd[,fd=#][,id=foo]
> >
> > For non-libivrt usage this would have the ability to open /dev/iommufd
> > itself if an fd is not provided.  This object could be shared with
> > other iommufd users in the VM and maybe we'd allow multiple instances
> > for more esoteric use cases.  [NB, maybe this should be a -object rather than
> > -device since the iommufd is not a guest visible device?]
> >
> > The vfio-pci device might then become:
> >
> > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> >
> > So essentially we can specify the device via host, sysfsdev, or passing
> > an fd to the vfio device file.  When an iommufd object is specified,
> > "foo" in the example above, each of those options would use the
> > vfio-device access mechanism, essentially the same as iommufd=on in
> > your example.  With the fd passing option, an iommufd object would be
> > required and necessarily use device level access.  
> What is the use case you foresee for the "fd=#" option?

On the vfio-pci device this was intended to be the actual vfio device
file descriptor.  Once we have a file per device, QEMU doesn't really
have any need to navigate through sysfs to determine which fd to use
other than for user convenience on the command line.  For libvirt usage,
I assume QEMU could accept the device fd, without ever really knowing
anything about the host address or sysfs path of the device.
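
As a purely illustrative sketch (assuming the command-line proposal
above; the option spellings and device path are examples, not an
existing QEMU interface), fd passing could look like this from the
management side:

/*
 * Sketch only: a management process opens the per-device vfio node
 * itself and hands the open fd to QEMU across exec, referencing it by
 * number with the proposed fd= option.  The option spelling follows the
 * proposal in this thread, not an existing QEMU option, and the device
 * path is just an example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* O_CLOEXEC is deliberately not set so the fd survives the exec */
    int fd = open("/dev/vfio/devices/vfio0", O_RDWR);
    if (fd < 0) {
        perror("open /dev/vfio/devices/vfio0");
        return 1;
    }

    char devopt[64];
    snprintf(devopt, sizeof(devopt), "vfio-pci,fd=%d,iommufd=iommufd0", fd);

    execlp("qemu-system-x86_64", "qemu-system-x86_64",
           "-object", "iommufd,id=iommufd0",
           "-device", devopt,
           (char *)NULL);
    perror("execlp");  /* only reached if the exec fails */
    return 1;
}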

> >
> > In your example, the iommufd=auto seems especially troublesome for
> > libvirt because QEMU is going to have different locked memory
> > requirements based on whether we're using type1 or iommufd, where the
> > latter resolves the duplicate accounting issues.  libvirt needs to know
> > deterministically which backed is being used, which this proposal seems
> > to provide, while at the same time bringing us more in line with fd
> > passing.  Thoughts?  Thanks,  
> I like your proposal (based on the -object iommufd). The only thing that
> may be missing I think is for a qemu end-user who actually does not care
> about the iommu backend being used but just wishes to use the most
> recent available one it adds some extra complexity. But this is not the
> most important use case ;)

Yeah, I can sympathize with that, but isn't that also why we're pursuing
a vfio compatibility interface at the kernel level?  Eventually, once
the native vfio IOMMU backends go away, the vfio "container" device
file will be provided by iommufd and that transition to the new
interface can be both seamless to the user and apparent to tools like
libvirt.

An end-user with a fixed command line should continue to work and will
eventually get iommufd via compatibility, but taking care of an
end-user that "does not care" and "wishes to use the most recent" is a
non-goal for me.  That would be more troublesome for tools and use cases
that we do care about imo.  Thanks,

Alex
Tian, Kevin April 26, 2022, 8:37 a.m. UTC | #15
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 25, 2022 10:38 PM
> 
> On Mon, 25 Apr 2022 11:10:14 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > > [Cc +libvirt folks]
> > >
> > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > With the introduction of iommufd[1], the linux kernel provides a
> generic
> > > > interface for userspace drivers to propagate their DMA mappings to
> kernel
> > > > for assigned devices. This series does the porting of the VFIO devices
> > > > onto the /dev/iommu uapi and let it coexist with the legacy
> implementation.
> > > > Other devices like vpda, vfio mdev and etc. are not considered yet.
> >
> > snip
> >
> > > > The selection of the backend is made on a device basis using the new
> > > > iommufd option (on/off/auto). By default the iommufd backend is
> selected
> > > > if supported by the host and by QEMU (iommufd KConfig). This option
> is
> > > > currently available only for the vfio-pci device. For other types of
> > > > devices, it does not yet exist and the legacy BE is chosen by default.
> > >
> > > I've discussed this a bit with Eric, but let me propose a different
> > > command line interface.  Libvirt generally likes to pass file
> > > descriptors to QEMU rather than grant it access to those files
> > > directly.  This was problematic with vfio-pci because libvirt can't
> > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > container.  Therefore we abandoned this approach and instead libvirt
> > > grants file permissions.
> > >
> > > However, with iommufd there's no reason that QEMU ever needs more
> than
> > > a single instance of /dev/iommufd and we're using per device vfio file
> > > descriptors, so it seems like a good time to revisit this.
> >
> > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > privileges, such that you don't want to unconditionally give QEMU
> > access to this device ?
> 
> It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> interface which should have limited scope for abuse, but more so here
> the goal would be to de-privilege QEMU that one step further that it
> cannot open the device file itself.
> 
> > > The interface I was considering would be to add an iommufd object to
> > > QEMU, so we might have a:
> > >
> > > -device iommufd[,fd=#][,id=foo]
> > >
> > > For non-libivrt usage this would have the ability to open /dev/iommufd
> > > itself if an fd is not provided.  This object could be shared with
> > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > for more esoteric use cases.  [NB, maybe this should be a -object rather
> than
> > > -device since the iommufd is not a guest visible device?]
> >
> > Yes,  -object would be the right answer for something that's purely
> > a host side backend impl selector.
> >
> > > The vfio-pci device might then become:
> > >
> > > -device vfio-
> pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> oo]
> > >
> > > So essentially we can specify the device via host, sysfsdev, or passing
> > > an fd to the vfio device file.  When an iommufd object is specified,
> > > "foo" in the example above, each of those options would use the
> > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > your example.  With the fd passing option, an iommufd object would be
> > > required and necessarily use device level access.
> > >
> > > In your example, the iommufd=auto seems especially troublesome for
> > > libvirt because QEMU is going to have different locked memory
> > > requirements based on whether we're using type1 or iommufd, where
> the
> > > latter resolves the duplicate accounting issues.  libvirt needs to know

Based on the current plan there is probably a transition window between
the point where the first vfio device type (vfio-pci) gains iommufd
support and the point where all vfio types support iommufd. Libvirt can
figure out whether to use iommufd by checking for the presence of
/dev/vfio/devices/vfioX. But what would the resource limit policy in
libvirt be in such a transition window, when both type1 and iommufd might
be used? Or do we just expect libvirt to support iommufd only after the
transition window ends, to avoid handling such a mess?
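
(For illustration, a minimal sketch of such a probe, assuming the
/dev/vfio/devices/vfioX layout used by this series; the helper name is
made up:)

/*
 * Sketch only: probe whether the host kernel exposes the device-centric
 * vfio node for a given device index.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool vfio_device_cdev_present(unsigned int idx)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof(path), "/dev/vfio/devices/vfio%u", idx);
    return stat(path, &st) == 0 && S_ISCHR(st.st_mode);
}

int main(void)
{
    printf("vfio0 cdev present: %s\n",
           vfio_device_cdev_present(0) ? "yes" : "no");
    return 0;
}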

> > > deterministically which backed is being used, which this proposal seems
> > > to provide, while at the same time bringing us more in line with fd
> > > passing.  Thoughts?  Thanks,
> >
> > Yep, I agree that libvirt needs to have more direct control over this.
> > This is also even more important if there are notable feature differences
> > in the 2 backends.
> >
> > I wonder if anyone has considered an even more distinct impl, whereby
> > we have a completely different device type on the backend, eg
> >
> >   -device vfio-iommu-
> pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> oo]
> >
> > If a vendor wants to fully remove the legacy impl, they can then use the
> > Kconfig mechanism to disable the build of the legacy impl device, while
> > keeping the iommu impl (or vica-verca if the new iommu impl isn't
> considered
> > reliable enough for them to support yet).
> >
> > Libvirt would use
> >
> >    -object iommu,id=iommu0,fd=NNN
> >    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> >
> > Non-libvirt would use a simpler
> >
> >    -device vfio-iommu-pci,host=0000:03:22.1
> >
> > with QEMU auto-creating a 'iommu' object in the background.
> >
> > This would fit into libvirt's existing modelling better. We currently have
> > a concept of a PCI assignment backend, which previously supported the
> > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> > feels like a 3rd PCI assignment approach, and so fits with how we modelled
> > it as a different device type in the past.
> 
> I don't think we want to conflate "iommu" and "iommufd", we're creating
> an object that interfaces into the iommufd uAPI, not an iommu itself.
> Likewise "vfio-iommu-pci" is just confusing, there was an iommu
> interface previously, it's just a different implementation now and as
> far as the VM interface to the device, it's identical.  Note that a
> "vfio-iommufd-pci" device multiplies the matrix of every vfio device
> for a rather subtle implementation detail.
> 
> My expectation would be that libvirt uses:
> 
>  -object iommufd,id=iommufd0,fd=NNN
>  -device vfio-pci,fd=MMM,iommufd=iommufd0
> 
> Whereas simple QEMU command line would be:
> 
>  -object iommufd,id=iommufd0
>  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> 
> The iommufd object would open /dev/iommufd itself.  Creating an
> implicit iommufd object is somewhat problematic because one of the
> things I forgot to highlight in my previous description is that the
> iommufd object is meant to be shared across not only various vfio
> devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> vdpa.

Out of curiosity - in concept one iommufd is sufficient to support all
ioas requirements across subsystems, while having multiple iommufds
instead loses the benefit of centralized accounting. The latter will also
cause some trouble when we start virtualizing ENQCMD, which requires
VM-wide PASID virtualization and thus further needs to share that
information across iommufds. Not unsolvable, but there is really no gain
from adding such complexity. So I'm curious whether QEMU provides a way
to restrict a certain object type to a single instance, to discourage
such multi-iommufd attempts?
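
(For illustration, one conceivable way to do that with a QOM
user-creatable object - a sketch only, where the type name, struct and
static flag are assumptions rather than existing QEMU code:)

/*
 * Sketch only: refuse a second instance of a user-creatable object by
 * checking a flag in the UserCreatable "complete" hook.  A refinement
 * would clear the flag again in instance_finalize.
 */
#include "qemu/osdep.h"
#include "qemu/module.h"
#include "qapi/error.h"
#include "qom/object.h"
#include "qom/object_interfaces.h"

#define TYPE_IOMMUFD_EXAMPLE "iommufd-example"

typedef struct IOMMUFDExample {
    Object parent_obj;
    int fd;   /* /dev/iommufd fd, passed in or opened in complete() */
} IOMMUFDExample;

static bool iommufd_example_created;

static void iommufd_example_complete(UserCreatable *uc, Error **errp)
{
    if (iommufd_example_created) {
        error_setg(errp, "only one %s object may be created",
                   TYPE_IOMMUFD_EXAMPLE);
        return;
    }
    iommufd_example_created = true;
    /* open /dev/iommufd here if no fd property was supplied, etc. */
}

static void iommufd_example_class_init(ObjectClass *oc, void *data)
{
    UserCreatableClass *ucc = USER_CREATABLE_CLASS(oc);

    ucc->complete = iommufd_example_complete;
}

static const TypeInfo iommufd_example_info = {
    .name          = TYPE_IOMMUFD_EXAMPLE,
    .parent        = TYPE_OBJECT,
    .instance_size = sizeof(IOMMUFDExample),
    .class_init    = iommufd_example_class_init,
    .interfaces    = (InterfaceInfo[]) {
        { TYPE_USER_CREATABLE },
        { }
    },
};

static void iommufd_example_register_types(void)
{
    type_register_static(&iommufd_example_info);
}

type_init(iommufd_example_register_types);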

> 
> If the old style were used:
> 
>  -device vfio-pci,host=0000:02:00.0
> 
> Then QEMU would use vfio for the IOMMU backend.
> 
> If libvirt/userspace wants to query whether "legacy" vfio is still
> supported by the host kernel, I think it'd only need to look for
> whether the /dev/vfio/vfio container interface still exists.
> 
> If we need some means for QEMU to remove legacy support, I'd rather
> find a way to do it via probing device options.  It's easy enough to
> see if iommufd support exists by looking for the presence of the
> iommufd option for the vfio-pci device and Kconfig within QEMU could be
> used regardless of whether we define a new device name.  Thanks,
> 
> Alex
Tian, Kevin April 26, 2022, 8:39 a.m. UTC | #16
> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 26, 2022 3:55 AM
> 
> Hi Kevin,
> 
> On 4/18/22 10:49 AM, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Thursday, April 14, 2022 6:47 PM
> >>
> >> This series qomifies the VFIOContainer object which acts as a base class
> > what does 'qomify' mean? I didn't find this word from dictionary...
> sorry this is pure QEMU terminology. This stands for "QEMU Object Model"
> additional info at:
> https://qemu.readthedocs.io/en/latest/devel/qom.html
> 

Nice to know!
Shameerali Kolothum Thodi April 26, 2022, 9:47 a.m. UTC | #17
> -----Original Message-----
> From: Yi Liu [mailto:yi.l.liu@intel.com]
> Sent: 14 April 2022 11:47
> To: alex.williamson@redhat.com; cohuck@redhat.com;
> qemu-devel@nongnu.org
> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com;
> eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com;
> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
> Subject: [RFC 00/18] vfio: Adopt iommufd
> 
> With the introduction of iommufd[1], the linux kernel provides a generic
> interface for userspace drivers to propagate their DMA mappings to kernel
> for assigned devices. This series does the porting of the VFIO devices
> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
> Other devices like vpda, vfio mdev and etc. are not considered yet.
> 
> For vfio devices, the new interface is tied with device fd and iommufd
> as the iommufd solution is device-centric. This is different from legacy
> vfio which is group-centric. To support both interfaces in QEMU, this
> series introduces the iommu backend concept in the form of different
> container classes. The existing vfio container is named legacy container
> (equivalent with legacy iommu backend in this series), while the new
> iommufd based container is named as iommufd container (may also be
> mentioned
> as iommufd backend in this series). The two backend types have their own
> way to setup secure context and dma management interface. Below diagram
> shows how it looks like with both BEs.
> 
>                     VFIO
> AddressSpace/Memory
>     +-------+  +----------+  +-----+  +-----+
>     |  pci  |  | platform |  |  ap |  | ccw |
>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>         |           |           |        |        |   AddressSpace
> |
>         |           |           |        |        +------------+---------+
>     +---V-----------V-----------V--------V----+               /
>     |           VFIOAddressSpace              | <------------+
>     |                  |                      |  MemoryListener
>     |          VFIOContainer list             |
>     +-------+----------------------------+----+
>             |                            |
>             |                            |
>     +-------V------+            +--------V----------+
>     |   iommufd    |            |    vfio legacy    |
>     |  container   |            |     container     |
>     +-------+------+            +--------+----------+
>             |                            |
>             | /dev/iommu                 | /dev/vfio/vfio
>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>  Userspace  |                            |
> 
> ===========+============================+==========================
> ======
>  Kernel     |  device fd                 |
>             +---------------+            | group/container fd
>             | (BIND_IOMMUFD |            |
> (SET_CONTAINER/SET_IOMMU)
>             |  ATTACH_IOAS) |            | device fd
>             |               |            |
>             |       +-------V------------V-----------------+
>     iommufd |       |                vfio                  |
> (map/unmap  |       +---------+--------------------+-------+
>  ioas_copy) |                 |                    | map/unmap
>             |                 |                    |
>      +------V------+    +-----V------+      +------V--------+
>      | iommfd core |    |  device    |      |  vfio iommu   |
>      +-------------+    +------------+      +---------------+
> 
> [Secure Context setup]
> - iommufd BE: uses device fd and iommufd to setup secure context
>               (bind_iommufd, attach_ioas)
> - vfio legacy BE: uses group fd and container fd to setup secure context
>                   (set_container, set_iommu)
> [Device access]
> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
> - vfio legacy BE: device fd is retrieved from group fd ioctl
> [DMA Mapping flow]
> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
> - VFIO populates DMA map/unmap via the container BEs
>   *) iommufd BE: uses iommufd
>   *) vfio legacy BE: uses container fd
> 
> This series qomifies the VFIOContainer object which acts as a base class
> for a container. This base class is derived into the legacy VFIO container
> and the new iommufd based container. The base class implements generic
> code
> such as code related to memory_listener and address space management
> whereas
> the derived class implements callbacks that depend on the kernel user space
> being used.
> 
> The selection of the backend is made on a device basis using the new
> iommufd option (on/off/auto). By default the iommufd backend is selected
> if supported by the host and by QEMU (iommufd KConfig). This option is
> currently available only for the vfio-pci device. For other types of
> devices, it does not yet exist and the legacy BE is chosen by default.
> 
> Test done:
> - PCI and Platform device were tested
> - ccw and ap were only compile-tested
> - limited device hotplug test
> - vIOMMU test run for both legacy and iommufd backends (limited tests)
> 
> This series was co-developed by Eric Auger and me based on the exploration
> iommufd kernel[2], complete code of this series is available in[3]. As
> iommufd kernel is in the early step (only iommufd generic interface is in
> mailing list), so this series hasn't made the iommufd backend fully on par
> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
> live migration, etc. This series hasn't supported PCI devices without FLR
> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
> userspace
> is using iommufd. The kernel needs to be updated to accept device fd list for
> reset when userspace is using iommufd. Related work is in progress by
> Jason[4].
> 
> TODOs:
> - Add DMA alias check for iommufd BE (group level)
> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>   VFIO_DEVICE_PCI_HOT_RESET gap
> - Cleanup the VFIODevice fields as it's used in both BEs
> - Add locks
> - Replace list with g_tree
> - More tests
> 
> Patch Overview:
> 
> - Preparation:
>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
> 
> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
> 
> - Introduce container object and covert existing vfio to use it:
>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
> 
> - Introduce iommufd based container:
>   0014-hw-iommufd-Creation.patch
>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
> 
> - Add backend selection for vfio-pci:
>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>   0018-vfio-pci-Add-an-iommufd-option.patch
> 
> [1]
> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
> /
> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1

Hi,

I had a go with the above branches on our ARM64 platform, trying to pass
through a VF device, but QEMU reports the error below:

[    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)

I think this happens for the device BAR address range. I haven't debugged
the kernel yet to see where it actually reports that.

Maybe I am missing something. Please let me know.

Thanks,
Shameer
Eric Auger April 26, 2022, 11:44 a.m. UTC | #18
Hi Shameer,

On 4/26/22 11:47 AM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yi Liu [mailto:yi.l.liu@intel.com]
>> Sent: 14 April 2022 11:47
>> To: alex.williamson@redhat.com; cohuck@redhat.com;
>> qemu-devel@nongnu.org
>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger@redhat.com;
>> eric.auger.pro@gmail.com; kevin.tian@intel.com; yi.l.liu@intel.com;
>> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
>> Subject: [RFC 00/18] vfio: Adopt iommufd
>>
>> With the introduction of iommufd[1], the linux kernel provides a generic
>> interface for userspace drivers to propagate their DMA mappings to kernel
>> for assigned devices. This series does the porting of the VFIO devices
>> onto the /dev/iommu uapi and let it coexist with the legacy implementation.
>> Other devices like vpda, vfio mdev and etc. are not considered yet.
>>
>> For vfio devices, the new interface is tied with device fd and iommufd
>> as the iommufd solution is device-centric. This is different from legacy
>> vfio which is group-centric. To support both interfaces in QEMU, this
>> series introduces the iommu backend concept in the form of different
>> container classes. The existing vfio container is named legacy container
>> (equivalent with legacy iommu backend in this series), while the new
>> iommufd based container is named as iommufd container (may also be
>> mentioned
>> as iommufd backend in this series). The two backend types have their own
>> way to setup secure context and dma management interface. Below diagram
>> shows how it looks like with both BEs.
>>
>>                     VFIO
>> AddressSpace/Memory
>>     +-------+  +----------+  +-----+  +-----+
>>     |  pci  |  | platform |  |  ap |  | ccw |
>>     +---+---+  +----+-----+  +--+--+  +--+--+     +----------------------+
>>         |           |           |        |        |   AddressSpace
>> |
>>         |           |           |        |        +------------+---------+
>>     +---V-----------V-----------V--------V----+               /
>>     |           VFIOAddressSpace              | <------------+
>>     |                  |                      |  MemoryListener
>>     |          VFIOContainer list             |
>>     +-------+----------------------------+----+
>>             |                            |
>>             |                            |
>>     +-------V------+            +--------V----------+
>>     |   iommufd    |            |    vfio legacy    |
>>     |  container   |            |     container     |
>>     +-------+------+            +--------+----------+
>>             |                            |
>>             | /dev/iommu                 | /dev/vfio/vfio
>>             | /dev/vfio/devices/vfioX    | /dev/vfio/$group_id
>>  Userspace  |                            |
>>
>> ===========+============================+==========================
>> ======
>>  Kernel     |  device fd                 |
>>             +---------------+            | group/container fd
>>             | (BIND_IOMMUFD |            |
>> (SET_CONTAINER/SET_IOMMU)
>>             |  ATTACH_IOAS) |            | device fd
>>             |               |            |
>>             |       +-------V------------V-----------------+
>>     iommufd |       |                vfio                  |
>> (map/unmap  |       +---------+--------------------+-------+
>>  ioas_copy) |                 |                    | map/unmap
>>             |                 |                    |
>>      +------V------+    +-----V------+      +------V--------+
>>      | iommfd core |    |  device    |      |  vfio iommu   |
>>      +-------------+    +------------+      +---------------+
>>
>> [Secure Context setup]
>> - iommufd BE: uses device fd and iommufd to setup secure context
>>               (bind_iommufd, attach_ioas)
>> - vfio legacy BE: uses group fd and container fd to setup secure context
>>                   (set_container, set_iommu)
>> [Device access]
>> - iommufd BE: device fd is opened through /dev/vfio/devices/vfioX
>> - vfio legacy BE: device fd is retrieved from group fd ioctl
>> [DMA Mapping flow]
>> - VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
>> - VFIO populates DMA map/unmap via the container BEs
>>   *) iommufd BE: uses iommufd
>>   *) vfio legacy BE: uses container fd
>>
>> This series qomifies the VFIOContainer object which acts as a base class
>> for a container. This base class is derived into the legacy VFIO container
>> and the new iommufd based container. The base class implements generic
>> code
>> such as code related to memory_listener and address space management
>> whereas
>> the derived class implements callbacks that depend on the kernel user space
>> being used.
>>
>> The selection of the backend is made on a device basis using the new
>> iommufd option (on/off/auto). By default the iommufd backend is selected
>> if supported by the host and by QEMU (iommufd KConfig). This option is
>> currently available only for the vfio-pci device. For other types of
>> devices, it does not yet exist and the legacy BE is chosen by default.
>>
>> Test done:
>> - PCI and Platform device were tested
>> - ccw and ap were only compile-tested
>> - limited device hotplug test
>> - vIOMMU test run for both legacy and iommufd backends (limited tests)
>>
>> This series was co-developed by Eric Auger and me based on the exploration
>> iommufd kernel[2], complete code of this series is available in[3]. As
>> iommufd kernel is in the early step (only iommufd generic interface is in
>> mailing list), so this series hasn't made the iommufd backend fully on par
>> with legacy backend w.r.t. features like p2p mappings, coherency tracking,
>> live migration, etc. This series hasn't supported PCI devices without FLR
>> neither as the kernel doesn't support VFIO_DEVICE_PCI_HOT_RESET when
>> userspace
>> is using iommufd. The kernel needs to be updated to accept device fd list for
>> reset when userspace is using iommufd. Related work is in progress by
>> Jason[4].
>>
>> TODOs:
>> - Add DMA alias check for iommufd BE (group level)
>> - Make pci.c to be BE agnostic. Needs kernel change as well to fix the
>>   VFIO_DEVICE_PCI_HOT_RESET gap
>> - Cleanup the VFIODevice fields as it's used in both BEs
>> - Add locks
>> - Replace list with g_tree
>> - More tests
>>
>> Patch Overview:
>>
>> - Preparation:
>>   0001-scripts-update-linux-headers-Add-iommufd.h.patch
>>   0002-linux-headers-Import-latest-vfio.h-and-iommufd.h.patch
>>   0003-hw-vfio-pci-fix-vfio_pci_hot_reset_result-trace-poin.patch
>>   0004-vfio-pci-Use-vbasedev-local-variable-in-vfio_realize.patch
>>
>> 0005-vfio-common-Rename-VFIOGuestIOMMU-iommu-into-iommu_m.patch
>>   0006-vfio-common-Split-common.c-into-common.c-container.c.patch
>>
>> - Introduce container object and covert existing vfio to use it:
>>   0007-vfio-Add-base-object-for-VFIOContainer.patch
>>   0008-vfio-container-Introduce-vfio_attach-detach_device.patch
>>   0009-vfio-platform-Use-vfio_-attach-detach-_device.patch
>>   0010-vfio-ap-Use-vfio_-attach-detach-_device.patch
>>   0011-vfio-ccw-Use-vfio_-attach-detach-_device.patch
>>   0012-vfio-container-obj-Introduce-attach-detach-_device-c.patch
>>   0013-vfio-container-obj-Introduce-VFIOContainer-reset-cal.patch
>>
>> - Introduce iommufd based container:
>>   0014-hw-iommufd-Creation.patch
>>   0015-vfio-iommufd-Implement-iommufd-backend.patch
>>   0016-vfio-iommufd-Add-IOAS_COPY_DMA-support.patch
>>
>> - Add backend selection for vfio-pci:
>>   0017-vfio-as-Allow-the-selection-of-a-given-iommu-backend.patch
>>   0018-vfio-pci-Add-an-iommufd-option.patch
>>
>> [1] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com/
>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
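A rough sketch of the backend split described in the quoted cover letter:
the generic container code dispatches into a per-backend set of callbacks.
This is a simplification, not the actual QOM class hierarchy from the
series; the VFIOContainerOps name and the callback signatures below are
illustrative assumptions.

#include <stdint.h>

/* Simplified sketch only; the real series models this split with QOM. */
typedef struct VFIOContainer VFIOContainer;

typedef struct VFIOContainerOps {
    /* implemented differently by the legacy and the iommufd backends */
    int (*attach_device)(VFIOContainer *c, const char *sysfsdev);
    int (*dma_map)(VFIOContainer *c, uint64_t iova, uint64_t size,
                   void *vaddr);
    int (*dma_unmap)(VFIOContainer *c, uint64_t iova, uint64_t size);
} VFIOContainerOps;

struct VFIOContainer {
    const VFIOContainerOps *ops;  /* legacy or iommufd backend */
    /* generic state shared by both backends: address space, listener, ... */
};

/* Dispatch helper named after the one seen in the error logs later in this
 * thread; its real signature differs, this version is simplified. The
 * MemoryListener callbacks live in the generic code and simply dispatch to
 * whichever backend the container was created with. */
int vfio_container_dma_map(VFIOContainer *c, uint64_t iova, uint64_t size,
                           void *vaddr)
{
    return c->ops->dma_map(c, iova, size, vaddr);
}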
> Hi,
>
> I had a go with the above branches on our ARM64 platform trying to pass-through
> a VF dev, but Qemu reports an error as below,
>
> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0, 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>
> I think this happens for the dev BAR addr range. I haven't debugged the kernel
> yet to see where it actually reports that. 
Does it prevent your assigned device from working? I have such errors
too, but this is a known issue; it is due to the fact that P2P DMA is not
supported yet.

Thanks

Eric

>
> Maybe I am missing something. Please let me know.
>
> Thanks,
> Shameer
>
Jason Gunthorpe April 26, 2022, 12:33 p.m. UTC | #19
On Tue, Apr 26, 2022 at 08:37:41AM +0000, Tian, Kevin wrote:

> Based on the current plan there is probably a transition window between the
> point where the first vfio device type (vfio-pci) gains iommufd support
> and the point where all vfio types support iommufd.

I am still hoping to do it all in one shot, let's see :)

Jason
Shameerali Kolothum Thodi April 26, 2022, 12:43 p.m. UTC | #20
> -----Original Message-----
> From: Eric Auger [mailto:eric.auger@redhat.com]
> Sent: 26 April 2022 12:45
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
> qemu-devel@nongnu.org
> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
> Subject: Re: [RFC 00/18] vfio: Adopt iommufd

[...]
 
> >>
> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
> >> /
> >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
> > Hi,
> >
> > I had a go with the above branches on our ARM64 platform trying to
> pass-through
> > a VF dev, but Qemu reports an error as below,
> >
> > [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
> >
> > I think this happens for the dev BAR addr range. I haven't debugged the
> kernel
> > yet to see where it actually reports that.
> Does it prevent your assigned device from working? I have such errors
> too but this is a known issue. This is due to the fact P2P DMA is not
> supported yet.
> 

Yes, the basic tests are all good so far. I am still not very clear on how it
works if the map() fails, though. It looks like it fails in:

iommufd_ioas_map()
  iopt_map_user_pages()
   iopt_map_pages()
   ..
     pfn_reader_pin_pages()

So does it mean it just works because the page is resident?

Thanks,
Shameer
Alex Williamson April 26, 2022, 4:21 p.m. UTC | #21
On Tue, 26 Apr 2022 08:37:41 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 25, 2022 10:38 PM
> > 
> > On Mon, 25 Apr 2022 11:10:14 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >   
> > > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:  
> > > > [Cc +libvirt folks]
> > > >
> > > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > With the introduction of iommufd[1], the linux kernel provides a  
> > generic  
> > > > > interface for userspace drivers to propagate their DMA mappings to  
> > kernel  
> > > > > for assigned devices. This series does the porting of the VFIO devices
> > > > > onto the /dev/iommu uapi and let it coexist with the legacy  
> > implementation.  
> > > > > Other devices like vpda, vfio mdev and etc. are not considered yet.  
> > >
> > > snip
> > >  
> > > > > The selection of the backend is made on a device basis using the new
> > > > > iommufd option (on/off/auto). By default the iommufd backend is  
> > selected  
> > > > > if supported by the host and by QEMU (iommufd KConfig). This option  
> > is  
> > > > > currently available only for the vfio-pci device. For other types of
> > > > > devices, it does not yet exist and the legacy BE is chosen by default.  
> > > >
> > > > I've discussed this a bit with Eric, but let me propose a different
> > > > command line interface.  Libvirt generally likes to pass file
> > > > descriptors to QEMU rather than grant it access to those files
> > > > directly.  This was problematic with vfio-pci because libvirt can't
> > > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > > container.  Therefore we abandoned this approach and instead libvirt
> > > > grants file permissions.
> > > >
> > > > However, with iommufd there's no reason that QEMU ever needs more  
> > than  
> > > > a single instance of /dev/iommufd and we're using per device vfio file
> > > > descriptors, so it seems like a good time to revisit this.  
> > >
> > > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > > privileges, such that you don't want to unconditionally give QEMU
> > > access to this device ?  
> > 
> > It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> > interface which should have limited scope for abuse, but more so here
> > the goal would be to de-privilege QEMU that one step further that it
> > cannot open the device file itself.
> >   
> > > > The interface I was considering would be to add an iommufd object to
> > > > QEMU, so we might have a:
> > > >
> > > > -device iommufd[,fd=#][,id=foo]
> > > >
> > > > For non-libvirt usage this would have the ability to open /dev/iommufd
> > > > itself if an fd is not provided.  This object could be shared with
> > > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > > for more esoteric use cases.  [NB, maybe this should be a -object rather  
> > than  
> > > > -device since the iommufd is not a guest visible device?]  
> > >
> > > Yes,  -object would be the right answer for something that's purely
> > > a host side backend impl selector.
> > >  
> > > > The vfio-pci device might then become:
> > > >
> > > > -device vfio-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > > >
> > > > So essentially we can specify the device via host, sysfsdev, or passing
> > > > an fd to the vfio device file.  When an iommufd object is specified,
> > > > "foo" in the example above, each of those options would use the
> > > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > > your example.  With the fd passing option, an iommufd object would be
> > > > required and necessarily use device level access.
> > > >
> > > > In your example, the iommufd=auto seems especially troublesome for
> > > > libvirt because QEMU is going to have different locked memory
> > > > requirements based on whether we're using type1 or iommufd, where  
> > the  
> > > > latter resolves the duplicate accounting issues.  libvirt needs to know  
> 
> Based on the current plan there is probably a transition window between the
> point where the first vfio device type (vfio-pci) gains iommufd support
> and the point where all vfio types support iommufd. Libvirt can figure
> out whether to use iommufd by checking the presence of
> /dev/vfio/devices/vfioX. But what would be the resource limit policy
> in Libvirt in such a transition window, when both type1 and iommufd might
> be used? Or do we just expect Libvirt to support iommufd only after the
> transition window ends, to avoid handling such a mess?

Good point regarding libvirt testing for the vfio device files for use
with iommufd: libvirt would test whether /dev/iommufd exists and whether the
device it wants to assign maps to a /dev/vfio/devices/vfioX file.  This
was essentially implicit in the fd=# option to the vfio-pci device.

In mixed combinations, I'd expect libvirt to continue to add the full
VM memory to the locked memory limit for each non-iommufd device added.

> > > > deterministically which backed is being used, which this proposal seems
> > > > to provide, while at the same time bringing us more in line with fd
> > > > passing.  Thoughts?  Thanks,  
> > >
> > > Yep, I agree that libvirt needs to have more direct control over this.
> > > This is also even more important if there are notable feature differences
> > > in the 2 backends.
> > >
> > > I wonder if anyone has considered an even more distinct impl, whereby
> > > we have a completely different device type on the backend, eg
> > >
> > >   -device vfio-iommu-  
> > pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=f
> > oo]  
> > >
> > > If a vendor wants to fully remove the legacy impl, they can then use the
> > > Kconfig mechanism to disable the build of the legacy impl device, while
> > > keeping the iommu impl (or vica-verca if the new iommu impl isn't  
> > considered  
> > > reliable enough for them to support yet).
> > >
> > > Libvirt would use
> > >
> > >    -object iommu,id=iommu0,fd=NNN
> > >    -device vfio-iommu-pci,fd=MMM,iommu=iommu0
> > >
> > > Non-libvirt would use a simpler
> > >
> > >    -device vfio-iommu-pci,host=0000:03:22.1
> > >
> > > with QEMU auto-creating a 'iommu' object in the background.
> > >
> > > This would fit into libvirt's existing modelling better. We currently have
> > > a concept of a PCI assignment backend, which previously supported the
> > > legacy PCI assignment, vs the VFIO PCI assignment. This new iommu impl
> > > feels like a 3rd PCI assignment approach, and so fits with how we modelled
> > > it as a different device type in the past.  
> > 
> > I don't think we want to conflate "iommu" and "iommufd", we're creating
> > an object that interfaces into the iommufd uAPI, not an iommu itself.
> > Likewise "vfio-iommu-pci" is just confusing, there was an iommu
> > interface previously, it's just a different implementation now and as
> > far as the VM interface to the device, it's identical.  Note that a
> > "vfio-iommufd-pci" device multiplies the matrix of every vfio device
> > for a rather subtle implementation detail.
> > 
> > My expectation would be that libvirt uses:
> > 
> >  -object iommufd,id=iommufd0,fd=NNN
> >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > 
> > Whereas simple QEMU command line would be:
> > 
> >  -object iommufd,id=iommufd0
> >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > 
> > The iommufd object would open /dev/iommufd itself.  Creating an
> > implicit iommufd object is somewhat problematic because one of the
> > things I forgot to highlight in my previous description is that the
> > iommufd object is meant to be shared across not only various vfio
> > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > vdpa.  
> 
> Out of curiosity - in concept one iommufd is sufficient to support all
> ioas requirements across subsystems while having multiple iommufd's
> instead lose the benefit of centralized accounting. The latter will also
> cause some trouble when we start virtualizing ENQCMD which requires
> VM-wide PASID virtualization thus further needs to share that 
> information across iommufd's. Not unsolvable but really no gain by
> adding such complexity. So I'm curious whether Qemu provide
> a way to restrict that certain object type can only have one instance
> to discourage such multi-iommufd attempt?

I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
philosophy seems to be to let users create whatever configuration they
want.  For libvirt though, the assumption would be that a single
iommufd object can be used across subsystems, so libvirt would never
automatically create multiple objects.

We also need to be able to advise libvirt as to how each iommufd object
or user of that object factors into the VM locked memory requirement.
When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
to set the locked memory limit to the size of VM RAM per iommufd,
regardless of the number of devices using a given iommufd.  However, I
don't know if all users of iommufd will be exclusively mapping VM RAM.
Combinations of devices where some map VM RAM and others map QEMU
buffer space could still require some incremental increase per device
(I'm not sure if vfio-nvme is such a device).  It seems like heuristics
will still be involved even after iommufd solves the per-device
vfio-pci locked memory limit issue.  Thanks,

Alex
Alex Williamson April 26, 2022, 4:35 p.m. UTC | #22
On Tue, 26 Apr 2022 12:43:35 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Eric Auger [mailto:eric.auger@redhat.com]
> > Sent: 26 April 2022 12:45
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
> > Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
> > qemu-devel@nongnu.org
> > Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> > mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> > jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> > jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
> > kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
> > peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
> > Subject: Re: [RFC 00/18] vfio: Adopt iommufd  
> 
> [...]
>  
> > >>  
> > https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com  
> > >> /
> > >> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
> > >> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1  
> > > Hi,
> > >
> > > I had a go with the above branches on our ARM64 platform trying to  
> > pass-through  
> > > a VF dev, but Qemu reports an error as below,
> > >
> > > [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
> > > qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
> > > qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,  
> > 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)  
> > >
> > > I think this happens for the dev BAR addr range. I haven't debugged the  
> > kernel  
> > > yet to see where it actually reports that.  
> > Does it prevent your assigned device from working? I have such errors
> > too but this is a known issue. This is due to the fact P2P DMA is not
> > supported yet.
> >   
> 
> Yes, the basic tests all good so far. I am still not very clear how it works if
> the map() fails though. It looks like it fails in,
> 
> iommufd_ioas_map()
>   iopt_map_user_pages()
>    iopt_map_pages()
>    ..
>      pfn_reader_pin_pages()
> 
> So does it mean it just works because the page is resident()?

No, it just means that you're not triggering any accesses that require
peer-to-peer DMA support.  Any sort of test where the device is only
performing DMA to guest RAM, which is by far the standard use case,
will work fine.  This also doesn't affect vCPU access to BAR space.
It's only a failure of the mappings of the BAR space into the IOAS,
which is only used when a device tries to directly target another
device's BAR space via DMA.  Thanks,

Alex
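To make the distinction Alex describes concrete, here is a conceptual
sketch: sections backed by guest RAM must map successfully, while BAR
(MMIO) sections only matter for peer-to-peer DMA and a failure there can be
tolerated. The DmaSection type and the backend_dma_map()/map_section()
helpers are invented for the sketch; this is not QEMU's actual listener
code.

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t iova;
    uint64_t size;
    void *vaddr;
    bool is_ram;        /* guest RAM vs. device BAR / MMIO */
} DmaSection;

/* Stand-in for the real backend call (type1 MAP_DMA or IOMMU_IOAS_MAP). */
static int backend_dma_map(const DmaSection *s)
{
    /* Mimics the observed behaviour: MMIO-backed ranges cannot be pinned
     * by the RFC iommufd kernel yet, so they fail with -14 (-EFAULT). */
    return s->is_ram ? 0 : -14;
}

static int map_section(const DmaSection *s)
{
    int ret = backend_dma_map(s);

    if (ret && !s->is_ram) {
        /* P2P mapping failed: warn and carry on; DMA to guest RAM and
         * vCPU accesses to the BAR keep working. */
        fprintf(stderr, "skipping MMIO mapping at 0x%" PRIx64 " (%d)\n",
                s->iova, ret);
        return 0;
    }
    return ret;  /* a guest RAM mapping failure remains an error */
}

int main(void)
{
    DmaSection ram = { .iova = 0x40000000, .size = 0x10000, .is_ram = true };
    DmaSection bar = { .iova = 0x8000000000, .size = 0x10000, .is_ram = false };

    return map_section(&ram) || map_section(&bar);
}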
Jason Gunthorpe April 26, 2022, 4:42 p.m. UTC | #23
On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> We also need to be able to advise libvirt as to how each iommufd object
> or user of that object factors into the VM locked memory requirement.
> When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> to set the locked memory limit to the size of VM RAM per iommufd,
> regardless of the number of devices using a given iommufd.  However, I
> don't know if all users of iommufd will be exclusively mapping VM RAM.
> Combinations of devices where some map VM RAM and others map QEMU
> buffer space could still require some incremental increase per device
> (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> will still be involved even after iommufd solves the per-device
> vfio-pci locked memory limit issue.  Thanks,

If the model is to pass the FD, how about we put a limit on the FD
itself instead of abusing the locked memory limit?

We could have a no-way-out ioctl that directly limits the # of PFNs
covered by iopt_pages inside an iommufd.

Jason
Alex Williamson April 26, 2022, 7:24 p.m. UTC | #24
On Tue, 26 Apr 2022 13:42:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > We also need to be able to advise libvirt as to how each iommufd object
> > or user of that object factors into the VM locked memory requirement.
> > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > to set the locked memory limit to the size of VM RAM per iommufd,
> > regardless of the number of devices using a given iommufd.  However, I
> > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > Combinations of devices where some map VM RAM and others map QEMU
> > buffer space could still require some incremental increase per device
> > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > will still be involved even after iommufd solves the per-device
> > vfio-pci locked memory limit issue.  Thanks,  
> 
> If the model is to pass the FD, how about we put a limit on the FD
> itself instead of abusing the locked memory limit?
> 
> We could have a no-way-out ioctl that directly limits the # of PFNs
> covered by iopt_pages inside an iommufd.

FD passing would likely only be the standard for libvirt-invoked VMs.
The QEMU vfio-pci device would still parse a host= or sysfsdev= option
when invoked by mortals, and would use either the legacy vfio group
interface or the new vfio device interface based on whether an iommufd
is specified.

Does that rule out your suggestion?  I don't know, please reveal more
about the mechanics of putting a limit on the FD itself and this
no-way-out ioctl.  The latter name suggests to me that I should also
note that we need to support memory hotplug with these devices.  Thanks,

Alex
Jason Gunthorpe April 26, 2022, 7:36 p.m. UTC | #25
On Tue, Apr 26, 2022 at 01:24:35PM -0600, Alex Williamson wrote:
> On Tue, 26 Apr 2022 13:42:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Apr 26, 2022 at 10:21:59AM -0600, Alex Williamson wrote:
> > > We also need to be able to advise libvirt as to how each iommufd object
> > > or user of that object factors into the VM locked memory requirement.
> > > When used by vfio-pci, we're only mapping VM RAM, so we'd ask libvirt
> > > to set the locked memory limit to the size of VM RAM per iommufd,
> > > regardless of the number of devices using a given iommufd.  However, I
> > > don't know if all users of iommufd will be exclusively mapping VM RAM.
> > > Combinations of devices where some map VM RAM and others map QEMU
> > > buffer space could still require some incremental increase per device
> > > (I'm not sure if vfio-nvme is such a device).  It seems like heuristics
> > > will still be involved even after iommufd solves the per-device
> > > vfio-pci locked memory limit issue.  Thanks,  
> > 
> > If the model is to pass the FD, how about we put a limit on the FD
> > itself instead of abusing the locked memory limit?
> > 
> > We could have a no-way-out ioctl that directly limits the # of PFNs
> > covered by iopt_pages inside an iommufd.
> 
> FD passing would likely only be the standard for libvirt invoked VMs.
> The QEMU vfio-pci device would still parse a host= or sysfsdev= option
> when invoked by mortals and associate to use the legacy vfio group
> interface or the new vfio device interface based on whether an iommufd
> is specified.

Yes, but perhaps we don't need resource limits in the mortals case.

> Does that rule out your suggestion?  I don't know, please reveal more
> about the mechanics of putting a limit on the FD itself and this
> no-way-out ioctl.  The latter name suggests to me that I should also
> note that we need to support memory hotplug with these devices.  Thanks,

So libvirt uses CAP_SYS_RESOURCE and prlimit to adjust things in
realtime today?

It could still work; instead of being no-way-out, iommufd would have to check
for CAP_SYS_RESOURCE to allow raising the limit.
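For reference, the prlimit/CAP_SYS_RESOURCE adjustment mentioned above can
be sketched as below: a manager process with CAP_SYS_RESOURCE raises
RLIMIT_MEMLOCK on an already running QEMU. This is only an illustration of
the existing mechanism, not libvirt code; the pid and byte count are
placeholders.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <memlock-bytes>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t)strtol(argv[1], NULL, 0);
    unsigned long long bytes = strtoull(argv[2], NULL, 0);
    struct rlimit lim = { .rlim_cur = bytes, .rlim_max = bytes };

    /* prlimit(2) needs CAP_SYS_RESOURCE to raise another process's hard
     * limit, which is exactly the situation being discussed above. */
    if (prlimit(pid, RLIMIT_MEMLOCK, &lim, NULL) != 0) {
        perror("prlimit");
        return 1;
    }
    return 0;
}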

It is a pretty simple idea, we just attach a resource limit to the FD
and every PFN that gets mapped into the iommufd counts against that
limit, regardless if it is pinned or not. An ioctl on the FD would set
the limit, defaulting to unlimited.

To me this has the appeal that what is being resource-controlled is
strictly defined - address space mapped into an iommufd - which has a
bunch of nice additional consequences like partially bounding the
amount of kernel memory an iommufd can consume and so forth.
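A toy sketch of that accounting: every PFN mapped through the iommufd FD is
charged against a per-FD budget that can be set once. None of these names
exist in the iommufd code; they are invented purely to illustrate the
proposal.

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct iommufd_fd_account {
    uint64_t limit_pfns;   /* 0 == unlimited (the proposed default)      */
    uint64_t mapped_pfns;  /* PFNs currently covered by iopt mappings    */
    bool     limit_set;    /* "no way out": once set it cannot be raised */
};

static int account_set_limit(struct iommufd_fd_account *a, uint64_t pfns)
{
    if (a->limit_set)
        return -EPERM;     /* raising it again would need CAP_SYS_RESOURCE */
    a->limit_pfns = pfns;
    a->limit_set = true;
    return 0;
}

static int account_charge(struct iommufd_fd_account *a, uint64_t npfns)
{
    if (a->limit_pfns && a->mapped_pfns + npfns > a->limit_pfns)
        return -ENOSPC;    /* mapping would exceed the FD's budget */
    a->mapped_pfns += npfns;
    return 0;
}

static void account_uncharge(struct iommufd_fd_account *a, uint64_t npfns)
{
    a->mapped_pfns -= npfns;
}

int main(void)
{
    struct iommufd_fd_account acct = { 0 };

    account_set_limit(&acct, 1 << 20);   /* 1M PFNs of 4k pages = 4 GiB */
    if (account_charge(&acct, 256) == 0)
        account_uncharge(&acct, 256);
    return 0;
}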

Doesn't interact with iouring or rdma however.

Though we could certainly consider allowing RDMA to consume an iommufd
to access pinned pages much like a vfio-mdev would - I'm not sure what
is ideal for the qemu usage of RDMA for migration.

Jason
Tian, Kevin April 28, 2022, 3:21 a.m. UTC | #26
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 27, 2022 12:22 AM
> > >
> > > My expectation would be that libvirt uses:
> > >
> > >  -object iommufd,id=iommufd0,fd=NNN
> > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > >
> > > Whereas simple QEMU command line would be:
> > >
> > >  -object iommufd,id=iommufd0
> > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > >
> > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > implicit iommufd object is someone problematic because one of the
> > > things I forgot to highlight in my previous description is that the
> > > iommufd object is meant to be shared across not only various vfio
> > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > vdpa.
> >
> > Out of curiosity - in concept one iommufd is sufficient to support all
> > ioas requirements across subsystems while having multiple iommufd's
> > instead lose the benefit of centralized accounting. The latter will also
> > cause some trouble when we start virtualizing ENQCMD which requires
> > VM-wide PASID virtualization thus further needs to share that
> > information across iommufd's. Not unsolvable but really no gain by
> > adding such complexity. So I'm curious whether Qemu provide
> > a way to restrict that certain object type can only have one instance
> > to discourage such multi-iommufd attempt?
> 
> I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> philosophy seems to be to let users create whatever configuration they
> want.  For libvirt though, the assumption would be that a single
> iommufd object can be used across subsystems, so libvirt would never
> automatically create multiple objects.

I like the flexibility that the object approach gives in your proposal.
But with the said complexity in mind (with no foreseen benefit), I wonder
whether an alternative approach which treats iommufd as a global
property instead of an object is acceptable in Qemu, i.e.:

-iommufd on/off
-device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]

All devices with iommufd specified then implicitly share a single iommufd
object within Qemu.

This still allows vfio devices to be specified via fd but just requires Libvirt
to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to
consider, or is it just not typical of the Qemu philosophy, e.g. that any
object associated with a device must be explicitly specified?

Thanks
Kevin
Alex Williamson April 28, 2022, 2:24 p.m. UTC | #27
On Thu, 28 Apr 2022 03:21:45 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > >
> > > > My expectation would be that libvirt uses:
> > > >
> > > >  -object iommufd,id=iommufd0,fd=NNN
> > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > >
> > > > Whereas simple QEMU command line would be:
> > > >
> > > >  -object iommufd,id=iommufd0
> > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > >
> > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > implicit iommufd object is someone problematic because one of the
> > > > things I forgot to highlight in my previous description is that the
> > > > iommufd object is meant to be shared across not only various vfio
> > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > vdpa.  
> > >
> > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > ioas requirements across subsystems while having multiple iommufd's
> > > instead lose the benefit of centralized accounting. The latter will also
> > > cause some trouble when we start virtualizing ENQCMD which requires
> > > VM-wide PASID virtualization thus further needs to share that
> > > information across iommufd's. Not unsolvable but really no gain by
> > > adding such complexity. So I'm curious whether Qemu provide
> > > a way to restrict that certain object type can only have one instance
> > > to discourage such multi-iommufd attempt?  
> > 
> > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > philosophy seems to be to let users create whatever configuration they
> > want.  For libvirt though, the assumption would be that a single
> > iommufd object can be used across subsystems, so libvirt would never
> > automatically create multiple objects.  
> 
> I like the flexibility what the objection approach gives in your proposal.
> But with the said complexity in mind (with no foreseen benefit), I wonder

What's the actual complexity?  Front-end/backend splits are very common
in QEMU.  We're making the object connection via name, why is it
significantly more complicated to allow multiple iommufd objects?  On
the contrary, it seems to me that we'd need to go out of our way to add
code to block multiple iommufd objects.

> whether an alternative approach which treats iommufd as a global
> property instead of an object is acceptable in Qemu, i.e.:
> 
> -iommufd on/off
> -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> 
> All devices with iommufd specified then implicitly share a single iommufd
> object within Qemu.

QEMU requires key-value pairs AFAIK, so the above doesn't work; then
we're just back to the iommufd=on/off.
 
> This still allows vfio devices to be specified via fd but just requires Libvirt
> to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> considered or just not a typical way in Qemu philosophy e.g. any object
> associated with a device must be explicitly specified?

Avoiding QEMU opening files was a significant focus of my alternate
proposal.  Also note that we must be able to support hotplug, so we
need to be able to dynamically add and remove the iommufd object, I
don't see that a global property allows for that.  Implicit
associations of devices to shared resources don't seem particularly
desirable to me.  Thanks,

Alex
Daniel P. Berrangé April 28, 2022, 4:20 p.m. UTC | #28
On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
> On Thu, 28 Apr 2022 03:21:45 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 27, 2022 12:22 AM  
> > > > >
> > > > > My expectation would be that libvirt uses:
> > > > >
> > > > >  -object iommufd,id=iommufd0,fd=NNN
> > > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > > >
> > > > > Whereas simple QEMU command line would be:
> > > > >
> > > > >  -object iommufd,id=iommufd0
> > > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > > >
> > > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > > implicit iommufd object is someone problematic because one of the
> > > > > things I forgot to highlight in my previous description is that the
> > > > > iommufd object is meant to be shared across not only various vfio
> > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > > > vdpa.  
> > > >
> > > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > > ioas requirements across subsystems while having multiple iommufd's
> > > > instead lose the benefit of centralized accounting. The latter will also
> > > > cause some trouble when we start virtualizing ENQCMD which requires
> > > > VM-wide PASID virtualization thus further needs to share that
> > > > information across iommufd's. Not unsolvable but really no gain by
> > > > adding such complexity. So I'm curious whether Qemu provide
> > > > a way to restrict that certain object type can only have one instance
> > > > to discourage such multi-iommufd attempt?  
> > > 
> > > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > > philosophy seems to be to let users create whatever configuration they
> > > want.  For libvirt though, the assumption would be that a single
> > > iommufd object can be used across subsystems, so libvirt would never
> > > automatically create multiple objects.  
> > 
> > I like the flexibility what the objection approach gives in your proposal.
> > But with the said complexity in mind (with no foreseen benefit), I wonder
> 
> What's the actual complexity?  Front-end/backend splits are very common
> in QEMU.  We're making the object connection via name, why is it
> significantly more complicated to allow multiple iommufd objects?  On
> the contrary, it seems to me that we'd need to go out of our way to add
> code to block multiple iommufd objects.
> 
> > whether an alternative approach which treats iommufd as a global
> > property instead of an object is acceptable in Qemu, i.e.:
> > 
> > -iommufd on/off
> > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> > 
> > All devices with iommufd specified then implicitly share a single iommufd
> > object within Qemu.
> 
> QEMU requires key-value pairs AFAIK, so the above doesn't work, then
> we're just back to the iommufd=on/off.
>  
> > This still allows vfio devices to be specified via fd but just requires Libvirt
> > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> > considered or just not a typical way in Qemu philosophy e.g. any object
> > associated with a device must be explicitly specified?
> 
> Avoiding QEMU opening files was a significant focus of my alternate
> proposal.  Also note that we must be able to support hotplug, so we
> need to be able to dynamically add and remove the iommufd object, I
> don't see that a global property allows for that.  Implicit
> associations of devices to shared resources doesn't seem particularly
> desirable to me.  Thanks,

Adding new global properties/options is rather an anti-pattern for QEMU
these days. Using -object is the right approach. If you only want to
allow for one of them, just document this requirement. We've got other
objects which are singletons like all the confidential guest classes
for each arch.

With regards,
Daniel
Tian, Kevin April 29, 2022, 12:45 a.m. UTC | #29
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Friday, April 29, 2022 12:20 AM
> 
> On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
> > On Thu, 28 Apr 2022 03:21:45 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, April 27, 2022 12:22 AM
> > > > > >
> > > > > > My expectation would be that libvirt uses:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0,fd=NNN
> > > > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > > > >
> > > > > > Whereas simple QEMU command line would be:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0
> > > > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > > > >
> > > > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > > > implicit iommufd object is someone problematic because one of the
> > > > > > things I forgot to highlight in my previous description is that the
> > > > > > iommufd object is meant to be shared across not only various vfio
> > > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems,
> ex.
> > > > > > vdpa.
> > > > >
> > > > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > > > ioas requirements across subsystems while having multiple
> iommufd's
> > > > > instead lose the benefit of centralized accounting. The latter will also
> > > > > cause some trouble when we start virtualizing ENQCMD which
> requires
> > > > > VM-wide PASID virtualization thus further needs to share that
> > > > > information across iommufd's. Not unsolvable but really no gain by
> > > > > adding such complexity. So I'm curious whether Qemu provide
> > > > > a way to restrict that certain object type can only have one instance
> > > > > to discourage such multi-iommufd attempt?
> > > >
> > > > I don't see any reason for QEMU to restrict iommufd objects.  The
> QEMU
> > > > philosophy seems to be to let users create whatever configuration they
> > > > want.  For libvirt though, the assumption would be that a single
> > > > iommufd object can be used across subsystems, so libvirt would never
> > > > automatically create multiple objects.
> > >
> > > I like the flexibility what the objection approach gives in your proposal.
> > > But with the said complexity in mind (with no foreseen benefit), I wonder
> >
> > What's the actual complexity?  Front-end/backend splits are very common
> > in QEMU.  We're making the object connection via name, why is it
> > significantly more complicated to allow multiple iommufd objects?  On
> > the contrary, it seems to me that we'd need to go out of our way to add
> > code to block multiple iommufd objects.

Probably it's just a hypothetical concern, raised when I thought about the need
to manage certain global information (e.g. PASID virtualization) across
iommufd's down the road. With your and Daniel's replies I think we'll
first try to follow the common practice in Qemu, given there are
more positive reasons to do so than the hypothetical concern itself.

> >
> > > whether an alternative approach which treats iommufd as a global
> > > property instead of an object is acceptable in Qemu, i.e.:
> > >
> > > -iommufd on/off
> > > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> > >
> > > All devices with iommufd specified then implicitly share a single iommufd
> > > object within Qemu.
> >
> > QEMU requires key-value pairs AFAIK, so the above doesn't work, then
> > we're just back to the iommufd=on/off.
> >
> > > This still allows vfio devices to be specified via fd but just requires Libvirt
> > > to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
> > > considered or just not a typical way in Qemu philosophy e.g. any object
> > > associated with a device must be explicitly specified?
> >
> > Avoiding QEMU opening files was a significant focus of my alternate
> > proposal.  Also note that we must be able to support hotplug, so we
> > need to be able to dynamically add and remove the iommufd object, I
> > don't see that a global property allows for that.  Implicit
> > associations of devices to shared resources doesn't seem particularly
> > desirable to me.  Thanks,
> 
> Adding new global properties/options is rather an anti-pattern for QEMU
> these days. Using -object is the right approach. If you only want to
> allow for one of them, just document this requirement. We've got other
> objects which are singletons like all the confidential guest classes
> for each arch.
> 

Good to know about such a last resort. As said, we'll try to avoid this
restriction and follow Alex's proposal unless unexpectedly unreasonable
complexities arise later.

Thanks
Kevin
Zhangfei Gao May 9, 2022, 2:24 p.m. UTC | #30
Hi, Alex

On 2022/4/27 上午12:35, Alex Williamson wrote:
> On Tue, 26 Apr 2022 12:43:35 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>
>>> -----Original Message-----
>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>> Sent: 26 April 2022 12:45
>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
>>> qemu-devel@nongnu.org
>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>> [...]
>>   
>>>>>   
>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>> /
>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>> Hi,
>>>>
>>>> I had a go with the above branches on our ARM64 platform trying to
>>> pass-through
>>>> a VF dev, but Qemu reports an error as below,
>>>>
>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>> I think this happens for the dev BAR addr range. I haven't debugged the
>>> kernel
>>>> yet to see where it actually reports that.
>>> Does it prevent your assigned device from working? I have such errors
>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>> supported yet.
>>>    
>> Yes, the basic tests all good so far. I am still not very clear how it works if
>> the map() fails though. It looks like it fails in,
>>
>> iommufd_ioas_map()
>>    iopt_map_user_pages()
>>     iopt_map_pages()
>>     ..
>>       pfn_reader_pin_pages()
>>
>> So does it mean it just works because the page is resident()?
> No, it just means that you're not triggering any accesses that require
> peer-to-peer DMA support.  Any sort of test where the device is only
> performing DMA to guest RAM, which is by far the standard use case,
> will work fine.  This also doesn't affect vCPU access to BAR space.
> It's only a failure of the mappings of the BAR space into the IOAS,
> which is only used when a device tries to directly target another
> device's BAR space via DMA.  Thanks,

I also get this issue when trying to add a prereg listener

+    container->prereg_listener = vfio_memory_prereg_listener;
+    memory_listener_register(&container->prereg_listener,
+                            &address_space_memory);

host kernel log:
iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, 
cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
iopt_alloc_area input area=859a2d00 iova=8000000000
iopt_alloc_area area=859a2d00 iova=8000000000
pin_user_pages_remote rc=-14

qemu log:
vfio_prereg_listener_region_add
iommufd_map iova=0x8000000000
qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, 
0xffff9c495000) = -14 (Bad address)
qemu-system-aarch64: (null)
double free or corruption (fasttop)
Aborted (core dumped)

With a hack of ignoring address 0x8000000000 in map and unmap, the kernel
can boot.

Thanks
Yi Liu May 10, 2022, 3:17 a.m. UTC | #31
Hi Zhangfei,

On 2022/5/9 22:24, Zhangfei Gao wrote:
> Hi, Alex
> 
> On 2022/4/27 上午12:35, Alex Williamson wrote:
>> On Tue, 26 Apr 2022 12:43:35 +0000
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>
>>>> -----Original Message-----
>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>> Sent: 26 April 2022 12:45
>>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; Yi
>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com; cohuck@redhat.com;
>>>> qemu-devel@nongnu.org
>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>> [...]
>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>> /
>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>> [3] https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>> Hi,
>>>>>
>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>> pass-through
>>>>> a VF dev, but Qemu reports an error as below,
>>>>>
>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 -> 0002)
>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>> I think this happens for the dev BAR addr range. I haven't debugged the
>>>> kernel
>>>>> yet to see where it actually reports that.
>>>> Does it prevent your assigned device from working? I have such errors
>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>> supported yet.
>>> Yes, the basic tests all good so far. I am still not very clear how it 
>>> works if
>>> the map() fails though. It looks like it fails in,
>>>
>>> iommufd_ioas_map()
>>>    iopt_map_user_pages()
>>>     iopt_map_pages()
>>>     ..
>>>       pfn_reader_pin_pages()
>>>
>>> So does it mean it just works because the page is resident()?
>> No, it just means that you're not triggering any accesses that require
>> peer-to-peer DMA support.  Any sort of test where the device is only
>> performing DMA to guest RAM, which is by far the standard use case,
>> will work fine.  This also doesn't affect vCPU access to BAR space.
>> It's only a failure of the mappings of the BAR space into the IOAS,
>> which is only used when a device tries to directly target another
>> device's BAR space via DMA.  Thanks,
> 
> I also get this issue when trying adding prereg listenner
> 
> +    container->prereg_listener = vfio_memory_prereg_listener;
> +    memory_listener_register(&container->prereg_listener,
> +                            &address_space_memory);
> 
> host kernel log:
> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000, cmd->iova=8000000000, 
> cmd->user_va=9c495000, cmd->length=10000
> iopt_alloc_area input area=859a2d00 iova=8000000000
> iopt_alloc_area area=859a2d00 iova=8000000000
> pin_user_pages_remote rc=-14
> 
> qemu log:
> vfio_prereg_listener_region_add
> iommufd_map iova=0x8000000000
> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000, 0x10000, 
> 0xffff9c495000) = -14 (Bad address)
> qemu-system-aarch64: (null)
> double free or corruption (fasttop)
> Aborted (core dumped)
> 
> With hack of ignoring address 0x8000000000 in map and unmap, kernel can boot.

do you know if the iova 0x8000000000 is guest RAM or MMIO? Currently, the
iommufd kernel part doesn't support mapping device BAR MMIO. This is a known
gap.
Eric Auger May 10, 2022, 6:51 a.m. UTC | #32
Hi Zhangfei,

On 5/10/22 05:17, Yi Liu wrote:
> Hi Zhangfei,
>
> On 2022/5/9 22:24, Zhangfei Gao wrote:
>> Hi, Alex
>>
>> On 2022/4/27 上午12:35, Alex Williamson wrote:
>>> On Tue, 26 Apr 2022 12:43:35 +0000
>>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>>
>>>>> -----Original Message-----
>>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>>> Sent: 26 April 2022 12:45
>>>>> To: Shameerali Kolothum Thodi
>>>>> <shameerali.kolothum.thodi@huawei.com>; Yi
>>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com;
>>>>> cohuck@redhat.com;
>>>>> qemu-devel@nongnu.org
>>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com;
>>>>> farman@linux.ibm.com;
>>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>>> [...]
>>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>
>>>>>>> /
>>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>>> [3]
>>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>>> Hi,
>>>>>>
>>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>>> pass-through
>>>>>> a VF dev, but Qemu reports an error as below,
>>>>>>
>>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 ->
>>>>>> 0002)
>>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>>> I think this happens for the dev BAR addr range. I haven't
>>>>>> debugged the
>>>>> kernel
>>>>>> yet to see where it actually reports that.
>>>>> Does it prevent your assigned device from working? I have such errors
>>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>>> supported yet.
>>>> Yes, the basic tests all good so far. I am still not very clear how
>>>> it works if
>>>> the map() fails though. It looks like it fails in,
>>>>
>>>> iommufd_ioas_map()
>>>>    iopt_map_user_pages()
>>>>     iopt_map_pages()
>>>>     ..
>>>>       pfn_reader_pin_pages()
>>>>
>>>> So does it mean it just works because the page is resident()?
>>> No, it just means that you're not triggering any accesses that require
>>> peer-to-peer DMA support.  Any sort of test where the device is only
>>> performing DMA to guest RAM, which is by far the standard use case,
>>> will work fine.  This also doesn't affect vCPU access to BAR space.
>>> It's only a failure of the mappings of the BAR space into the IOAS,
>>> which is only used when a device tries to directly target another
>>> device's BAR space via DMA.  Thanks,
>>
>> I also get this issue when trying adding prereg listenner
>>
>> +    container->prereg_listener = vfio_memory_prereg_listener;
>> +    memory_listener_register(&container->prereg_listener,
>> +                            &address_space_memory);
>>
>> host kernel log:
>> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000,
>> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
>> iopt_alloc_area input area=859a2d00 iova=8000000000
>> iopt_alloc_area area=859a2d00 iova=8000000000
>> pin_user_pages_remote rc=-14
>>
>> qemu log:
>> vfio_prereg_listener_region_add
>> iommufd_map iova=0x8000000000
>> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
>> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000,
>> 0x10000, 0xffff9c495000) = -14 (Bad address)
>> qemu-system-aarch64: (null)
>> double free or corruption (fasttop)
>> Aborted (core dumped)
>>
>> With hack of ignoring address 0x8000000000 in map and unmap, kernel
>> can boot.
>
> do you know if the iova 0x8000000000 guest RAM or MMIO? Currently,
> iommufd kernel part doesn't support mapping device BAR MMIO. This is a
> known gap.
In the qemu arm virt machine this indeed matches the PCI MMIO region.

Thanks

Eric
Zhangfei Gao May 10, 2022, 12:35 p.m. UTC | #33
On 2022/5/10 下午2:51, Eric Auger wrote:
> Hi Hi, Zhangfei,
>
> On 5/10/22 05:17, Yi Liu wrote:
>> Hi Zhangfei,
>>
>> On 2022/5/9 22:24, Zhangfei Gao wrote:
>>> Hi, Alex
>>>
>>> On 2022/4/27 上午12:35, Alex Williamson wrote:
>>>> On Tue, 26 Apr 2022 12:43:35 +0000
>>>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
>>>>
>>>>>> -----Original Message-----
>>>>>> From: Eric Auger [mailto:eric.auger@redhat.com]
>>>>>> Sent: 26 April 2022 12:45
>>>>>> To: Shameerali Kolothum Thodi
>>>>>> <shameerali.kolothum.thodi@huawei.com>; Yi
>>>>>> Liu <yi.l.liu@intel.com>; alex.williamson@redhat.com;
>>>>>> cohuck@redhat.com;
>>>>>> qemu-devel@nongnu.org
>>>>>> Cc: david@gibson.dropbear.id.au; thuth@redhat.com;
>>>>>> farman@linux.ibm.com;
>>>>>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>>>>>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>>>>>> jgg@nvidia.com; nicolinc@nvidia.com; eric.auger.pro@gmail.com;
>>>>>> kevin.tian@intel.com; chao.p.peng@intel.com; yi.y.sun@intel.com;
>>>>>> peterx@redhat.com; Zhangfei Gao <zhangfei.gao@linaro.org>
>>>>>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>>>> [...]
>>>>>> https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@nvidia.com
>>>>>>
>>>>>>>> /
>>>>>>>> [2] https://github.com/luxis1999/iommufd/tree/iommufd-v5.17-rc6
>>>>>>>> [3]
>>>>>>>> https://github.com/luxis1999/qemu/tree/qemu-for-5.17-rc6-vm-rfcv1
>>>>>>> Hi,
>>>>>>>
>>>>>>> I had a go with the above branches on our ARM64 platform trying to
>>>>>> pass-through
>>>>>>> a VF dev, but Qemu reports an error as below,
>>>>>>>
>>>>>>> [    0.444728] hisi_sec2 0000:00:01.0: enabling device (0000 ->
>>>>>>> 0002)
>>>>>>> qemu-system-aarch64-iommufd: IOMMU_IOAS_MAP failed: Bad address
>>>>>>> qemu-system-aarch64-iommufd: vfio_container_dma_map(0xaaaafeb40ce0,
>>>>>> 0x8000000000, 0x10000, 0xffffb40ef000) = -14 (Bad address)
>>>>>>> I think this happens for the dev BAR addr range. I haven't
>>>>>>> debugged the
>>>>>> kernel
>>>>>>> yet to see where it actually reports that.
>>>>>> Does it prevent your assigned device from working? I have such errors
>>>>>> too but this is a known issue. This is due to the fact P2P DMA is not
>>>>>> supported yet.
>>>>> Yes, the basic tests all good so far. I am still not very clear how
>>>>> it works if
>>>>> the map() fails though. It looks like it fails in,
>>>>>
>>>>> iommufd_ioas_map()
>>>>>     iopt_map_user_pages()
>>>>>      iopt_map_pages()
>>>>>      ..
>>>>>        pfn_reader_pin_pages()
>>>>>
>>>>> So does it mean it just works because the page is resident()?
>>>> No, it just means that you're not triggering any accesses that require
>>>> peer-to-peer DMA support.  Any sort of test where the device is only
>>>> performing DMA to guest RAM, which is by far the standard use case,
>>>> will work fine.  This also doesn't affect vCPU access to BAR space.
>>>> It's only a failure of the mappings of the BAR space into the IOAS,
>>>> which is only used when a device tries to directly target another
>>>> device's BAR space via DMA.  Thanks,
>>> I also get this issue when trying adding prereg listenner
>>>
>>> +    container->prereg_listener = vfio_memory_prereg_listener;
>>> +    memory_listener_register(&container->prereg_listener,
>>> +                            &address_space_memory);
>>>
>>> host kernel log:
>>> iommufd_ioas_map 1 iova=8000000000, iova1=8000000000,
>>> cmd->iova=8000000000, cmd->user_va=9c495000, cmd->length=10000
>>> iopt_alloc_area input area=859a2d00 iova=8000000000
>>> iopt_alloc_area area=859a2d00 iova=8000000000
>>> pin_user_pages_remote rc=-14
>>>
>>> qemu log:
>>> vfio_prereg_listener_region_add
>>> iommufd_map iova=0x8000000000
>>> qemu-system-aarch64: IOMMU_IOAS_MAP failed: Bad address
>>> qemu-system-aarch64: vfio_dma_map(0xaaaafb96a930, 0x8000000000,
>>> 0x10000, 0xffff9c495000) = -14 (Bad address)
>>> qemu-system-aarch64: (null)
>>> double free or corruption (fasttop)
>>> Aborted (core dumped)
>>>
>>> With hack of ignoring address 0x8000000000 in map and unmap, kernel
>>> can boot.
>> do you know if the iova 0x8000000000 guest RAM or MMIO? Currently,
>> iommufd kernel part doesn't support mapping device BAR MMIO. This is a
>> known gap.
> In qemu arm virt machine this indeed matches the PCI MMIO region.

Thanks Yi and Eric,
Then I will wait for the updated iommufd kernel for the PCI MMIO region.

Another question:
how do we get the iommu_domain in the ioctl?

qemu can get container->ioas_id.

The kernel can get the ioas via the ioas_id.
But how does it get the domain?
Currently I am hacking with ioas->iopt.next_domain_id, which keeps increasing:
domain = xa_load(&ioas->iopt.domains, ioas->iopt.next_domain_id - 1);

Any ideas?

Thanks
Jason Gunthorpe May 10, 2022, 12:45 p.m. UTC | #34
On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
> Thanks Yi and Eric,
> Then will wait for the updated iommufd kernel for the PCI MMIO region.
> 
> Another question,
> How to get the iommu_domain in the ioctl.

The ID of the iommu_domain (called the hwpt) should be returned by
the vfio attach ioctl.

Jason
Yi Liu May 10, 2022, 2:08 p.m. UTC | #35
On 2022/5/10 20:45, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>> Thanks Yi and Eric,
>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>
>> Another question,
>> How to get the iommu_domain in the ioctl.
> 
> The ID of the iommu_domain (called the hwpt) it should be returned by
> the vfio attach ioctl.

yes, hwpt_id is returned by the vfio attach ioctl and recorded in
qemu. You can query page table related capabilities with this id.

https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
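A rough sketch of that flow: userspace fills the attach arguments, the
kernel returns the hw pagetable id it attached the device to, and qemu
records it on the container for later queries. The struct layout and the
placeholder request number below are assumptions loosely modelled on the
RFC uAPI; see the linked patch for the authoritative definitions.

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Hypothetical mirror of the RFC's attach argument; the real definition
 * lives in the vfio.h imported by this series. */
struct example_attach_ioas {
    uint32_t argsz;
    uint32_t flags;
    int32_t  iommufd;      /* in:  the /dev/iommu fd              */
    uint32_t ioas_id;      /* in:  IOAS to attach the device to   */
    uint32_t out_hwpt_id;  /* out: hw pagetable (iommu_domain) id */
};

/* Placeholder; the real series defines the actual request number. */
#define EXAMPLE_VFIO_DEVICE_ATTACH_IOAS 0UL

struct example_container {
    uint32_t ioas_id;
    uint32_t hwpt_id;      /* recorded here so later queries can use it */
};

int example_attach(int device_fd, int iommufd,
                   struct example_container *container)
{
    struct example_attach_ioas attach = {
        .argsz = sizeof(attach),
        .iommufd = iommufd,
        .ioas_id = container->ioas_id,
    };

    if (ioctl(device_fd, EXAMPLE_VFIO_DEVICE_ATTACH_IOAS, &attach) < 0) {
        perror("attach ioas");
        return -1;
    }
    container->hwpt_id = attach.out_hwpt_id;  /* keep for capability queries */
    return 0;
}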
zhangfei.gao@foxmail.com May 11, 2022, 2:17 p.m. UTC | #36
On 2022/5/10 下午10:08, Yi Liu wrote:
> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>> Thanks Yi and Eric,
>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>
>>> Another question,
>>> How to get the iommu_domain in the ioctl.
>>
>> The ID of the iommu_domain (called the hwpt) it should be returned by
>> the vfio attach ioctl.
>
> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
> qemu. You can query page table related capabilities with this id.
>
> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>
Thanks Yi,

Do we use iommufd_hw_pagetable_from_id() in the kernel?

Qemu sends the hwpt_id via ioctl.
Currently VFIOIOMMUFDContainer has hwpt_list;
which member is good for saving the hwpt_id, IOMMUTLBEntry?

In the kernel ioctl path, iommufd_vfio_ioctl:
@dev: Device to get an iommu_domain for
iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct device *dev)
But iommufd_vfio_ioctl seems to have no dev parameter?

Thanks
zhangfei.gao@foxmail.com May 12, 2022, 9:01 a.m. UTC | #37
Hi, Yi

On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
>
>
> On 2022/5/10 下午10:08, Yi Liu wrote:
>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>> Thanks Yi and Eric,
>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>
>>>> Another question,
>>>> How to get the iommu_domain in the ioctl.
>>>
>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>> the vfio attach ioctl.
>>
>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>> qemu. You can query page table related capabilities with this id.
>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>
> Thanks Yi,
>
> Do we use iommufd_hw_pagetable_from_id in kernel?
>
> The qemu send hwpt_id via ioctl.
> Currently VFIOIOMMUFDContainer has hwpt_list,
> Which member is good to save hwpt_id, IOMMUTLBEntry?

Can VFIOIOMMUFDContainer have multiple hwpts,
since VFIOIOMMUFDContainer has hwpt_list now?
If so, how do we get the specific hwpt from map/unmap_notify in hw/vfio/as.c,
where no vbasedev is available for comparison?

I am testing with a workaround: adding a VFIOIOASHwpt *hwpt field to
VFIOIOMMUFDContainer and saving the hwpt in vfio_device_attach_container().


>
> In kernel ioctl: iommufd_vfio_ioctl
> @dev: Device to get an iommu_domain for
> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
> struct device *dev)
> But iommufd_vfio_ioctl seems no para dev?

We can set dev=NULL since IOMMUFD_OBJ_HW_PAGETABLE does not need dev:
iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)

Thanks
>
Yi Liu May 17, 2022, 8:52 a.m. UTC | #38
Hi Zhangfei,

On 2022/5/11 22:17, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/5/10 下午10:08, Yi Liu wrote:
>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>> Thanks Yi and Eric,
>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>
>>>> Another question,
>>>> How to get the iommu_domain in the ioctl.
>>>
>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>> the vfio attach ioctl.
>>
>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>> qemu. You can query page table related capabilities with this id.
>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>
> Thanks Yi,
> 
> Do we use iommufd_hw_pagetable_from_id in kernel?
> 
> The qemu send hwpt_id via ioctl.
> Currently VFIOIOMMUFDContainer has hwpt_list,
> Which member is good to save hwpt_id, IOMMUTLBEntry?

Currently, we don't make use of the hwpt yet in the version we have
in the QEMU branch. I have a change that makes use of it. It would
also be used in the future for nested translation setup and for
querying dirty page tracking support for a given domain.

> 
> In kernel ioctl: iommufd_vfio_ioctl
> @dev: Device to get an iommu_domain for
> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct 
> device *dev)
> But iommufd_vfio_ioctl seems no para dev?

There is. You can look at vfio_group_set_iommufd(): it loops over the
device_list provided by vfio, and the device info is passed to iommufd.
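
Roughly the shape of what is being described here (not the RFC's exact
code; iommufd_bind_one_device() is a hypothetical stand-in for whatever
per-device call the series makes, and locking is omitted):

/* vfio walks the group's device list, so every iommufd call gets a real
 * struct device for domain allocation and sw_msi handling. */
static int vfio_group_bind_devices(struct vfio_group *group,
                                   struct iommufd_ctx *ictx, u32 ioas_id)
{
    struct vfio_device *vdev;
    int ret = 0;

    list_for_each_entry(vdev, &group->device_list, group_next) {
        ret = iommufd_bind_one_device(ictx, ioas_id, vdev->dev); /* hypothetical */
        if (ret)
            break;
    }
    return ret;
}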

> Thanks
> 
> 
> 
>
Yi Liu May 17, 2022, 8:55 a.m. UTC | #39
Hi Zhangfei,

On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
> 
> Hi, Yi
> 
> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
>>
>>
>> On 2022/5/10 下午10:08, Yi Liu wrote:
>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>> Thanks Yi and Eric,
>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>>
>>>>> Another question,
>>>>> How to get the iommu_domain in the ioctl.
>>>>
>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>> the vfio attach ioctl.
>>>
>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>> qemu. You can query page table related capabilities with this id.
>>>
>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>
>> Thanks Yi,
>>
>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>
>> The qemu send hwpt_id via ioctl.
>> Currently VFIOIOMMUFDContainer has hwpt_list,
>> Which member is good to save hwpt_id, IOMMUTLBEntry?
> 
> Can VFIOIOMMUFDContainer  have multi hwpt?

Yes, it is possible.

> Since VFIOIOMMUFDContainer has hwpt_list now.
> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, 
> where no vbasedev can be used for compare.
> 
> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
> VFIOIOMMUFDContainer.
> And save hwpt when vfio_device_attach_container.
> 
>>
>> In kernel ioctl: iommufd_vfio_ioctl
>> @dev: Device to get an iommu_domain for
>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, struct 
>> device *dev)
>> But iommufd_vfio_ioctl seems no para dev?
> 
> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)

This is not good. dev is passed into this function to allocate the domain
and also to check the sw_msi things. If you pass in NULL, it may not even
be able to get a domain for the hwpt. It won't work, I guess.

> Thanks
>>
>
zhangfei.gao@foxmail.com May 18, 2022, 7:22 a.m. UTC | #40
On 2022/5/17 4:55 PM, Yi Liu wrote:
> Hi Zhangfei,
>
> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>
>> Hi, Yi
>>
>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
>>>
>>>
>>> On 2022/5/10 下午10:08, Yi Liu wrote:
>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>> Thanks Yi and Eric,
>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO 
>>>>>> region.
>>>>>>
>>>>>> Another question,
>>>>>> How to get the iommu_domain in the ioctl.
>>>>>
>>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>>> the vfio attach ioctl.
>>>>
>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>> qemu. You can query page table related capabilities with this id.
>>>>
>>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/ 
>>>>
>>>>
>>> Thanks Yi,
>>>
>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>
>>> The qemu send hwpt_id via ioctl.
>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>
>> Can VFIOIOMMUFDContainer  have multi hwpt?
>
> yes, it is possible
Then how do we get the hwpt_id in map/unmap_notify(IOMMUNotifier *n,
IOMMUTLBEntry *iotlb)?

>
>> Since VFIOIOMMUFDContainer has hwpt_list now.
>> If so, how to get specific hwpt from map/unmap_notify in 
>> hw/vfio/as.c, where no vbasedev can be used for compare.
>>
>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
>> VFIOIOMMUFDContainer.
>> And save hwpt when vfio_device_attach_container.
>>
>>>
>>> In kernel ioctl: iommufd_vfio_ioctl
>>> @dev: Device to get an iommu_domain for
>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
>>> struct device *dev)
>>> But iommufd_vfio_ioctl seems no para dev?
>>
>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>
> this is not good. dev is passed in to this function to allocate domain
> and also check sw_msi things. If you pass in a NULL, it may even unable
> to get a domain for the hwpt. It won't work I guess.

iommufd_hw_pagetable_from_id can be used for two things:

1. Allocating a domain, which needs the dev parameter:
   case IOMMUFD_OBJ_IOAS:
       hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);

2. Just returning an already-allocated domain via hwpt_id, which does not
   need dev:
   case IOMMUFD_OBJ_HW_PAGETABLE:
       return container_of(obj, struct iommufd_hw_pagetable, obj);
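
A simplified kernel-side sketch of that dispatch, assembled only from the
two cases above; the real iommufd_hw_pagetable_from_id() in the RFC branch
has more object types, refcounting and error handling:

/* Sketch only: return the hwpt for an existing HW_PAGETABLE object, or
 * allocate/reuse a domain for an IOAS object, which needs the device. */
struct iommufd_hw_pagetable *
iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
                             struct device *dev)
{
    struct iommufd_object *obj;

    obj = iommufd_get_object(ictx, pt_id, IOMMUFD_OBJ_ANY);
    if (IS_ERR(obj))
        return ERR_CAST(obj);

    switch (obj->type) {
    case IOMMUFD_OBJ_IOAS:
        /* allocating (or reusing) a domain needs the struct device */
        return iommufd_hw_pagetable_auto_get(
                ictx, container_of(obj, struct iommufd_ioas, obj), dev);
    case IOMMUFD_OBJ_HW_PAGETABLE:
        /* the domain already exists; dev is not strictly needed here */
        return container_of(obj, struct iommufd_hw_pagetable, obj);
    default:
        return ERR_PTR(-EINVAL);
    }
}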

By the way, is there any plan for the nested mode?

Thanks
Yi Liu May 18, 2022, 2 p.m. UTC | #41
On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
> 
> 
> On 2022/5/17 下午4:55, Yi Liu wrote:
>> Hi Zhangfei,
>>
>> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>>
>>> Hi, Yi
>>>
>>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
>>>>
>>>>
>>>> On 2022/5/10 下午10:08, Yi Liu wrote:
>>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>>> Thanks Yi and Eric,
>>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO region.
>>>>>>>
>>>>>>> Another question,
>>>>>>> How to get the iommu_domain in the ioctl.
>>>>>>
>>>>>> The ID of the iommu_domain (called the hwpt) it should be returned by
>>>>>> the vfio attach ioctl.
>>>>>
>>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>>> qemu. You can query page table related capabilities with this id.
>>>>>
>>>>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>>>
>>>> Thanks Yi,
>>>>
>>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>>
>>>> The qemu send hwpt_id via ioctl.
>>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>>
>>> Can VFIOIOMMUFDContainer  have multi hwpt?
>>
>> yes, it is possible
> Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry 
> *iotlb)

In map/unmap, you should use the ioas_id instead of the hwpt_id.
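
For the map side, a minimal sketch using the iommufd IOMMU_IOAS_MAP ioctl
(struct and flag names assumed from the iommufd uapi in this series; the
exact names may differ in the revision being discussed):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>   /* uapi from the RFC kernel branch (assumed) */

/* Map a host-virtual range into the IOAS at a fixed IOVA. */
static int ioas_map_fixed(int iommufd, uint32_t ioas_id, void *vaddr,
                          uint64_t length, uint64_t iova)
{
    struct iommu_ioas_map map = {
        .size = sizeof(map),
        .flags = IOMMU_IOAS_MAP_FIXED_IOVA | IOMMU_IOAS_MAP_READABLE |
                 IOMMU_IOAS_MAP_WRITEABLE,
        .ioas_id = ioas_id,
        .user_va = (uintptr_t)vaddr,
        .length = length,
        .iova = iova,
    };

    return ioctl(iommufd, IOMMU_IOAS_MAP, &map) ? -errno : 0;
}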

> 
>>
>>> Since VFIOIOMMUFDContainer has hwpt_list now.
>>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c, 
>>> where no vbasedev can be used for compare.
>>>
>>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in 
>>> VFIOIOMMUFDContainer.
>>> And save hwpt when vfio_device_attach_container.
>>>
>>>>
>>>> In kernel ioctl: iommufd_vfio_ioctl
>>>> @dev: Device to get an iommu_domain for
>>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id, 
>>>> struct device *dev)
>>>> But iommufd_vfio_ioctl seems no para dev?
>>>
>>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not need dev.
>>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>>
>> this is not good. dev is passed in to this function to allocate domain
>> and also check sw_msi things. If you pass in a NULL, it may even unable
>> to get a domain for the hwpt. It won't work I guess.
> 
> The iommufd_hw_pagetable_from_id can be used for
> 1, allocate domain, which need para dev
> case IOMMUFD_OBJ_IOAS
> hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);

This is used when attaching an IOAS.

> 2. Just return allocated domain via hwpt_id, which does not need dev.
> case IOMMUFD_OBJ_HW_PAGETABLE:
> return container_of(obj, struct iommufd_hw_pagetable, obj);

Yes, this would be the usage in nesting. You may check my branch below;
it's for nesting integration.

https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting

> By the way, any plan of the nested mode?
I'm working with Eric and Nic on it. Currently, I've got the above kernel
branch; the QEMU side is also WIP.

> Thanks
Shameerali Kolothum Thodi June 28, 2022, 8:14 a.m. UTC | #42
> -----Original Message-----
> From: Yi Liu [mailto:yi.l.liu@intel.com]
> Sent: 18 May 2022 15:01
> To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>;
> Zhangfei Gao <zhangfei.gao@linaro.org>
> Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> cohuck@redhat.com; qemu-devel@nongnu.org;
> david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
> nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com;
> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
> 
> On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
> >
> >
> > On 2022/5/17 下午4:55, Yi Liu wrote:
> >> Hi Zhangfei,
> >>
> >> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
> >>>
> >>> Hi, Yi
> >>>
> >>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
> >>>>
> >>>>
> >>>> On 2022/5/10 下午10:08, Yi Liu wrote:
> >>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
> >>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
> >>>>>>> Thanks Yi and Eric,
> >>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO
> region.
> >>>>>>>
> >>>>>>> Another question,
> >>>>>>> How to get the iommu_domain in the ioctl.
> >>>>>>
> >>>>>> The ID of the iommu_domain (called the hwpt) it should be returned
> by
> >>>>>> the vfio attach ioctl.
> >>>>>
> >>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
> >>>>> qemu. You can query page table related capabilities with this id.
> >>>>>
> >>>>>
> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
> >>>>>
> >>>> Thanks Yi,
> >>>>
> >>>> Do we use iommufd_hw_pagetable_from_id in kernel?
> >>>>
> >>>> The qemu send hwpt_id via ioctl.
> >>>> Currently VFIOIOMMUFDContainer has hwpt_list,
> >>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
> >>>
> >>> Can VFIOIOMMUFDContainer  have multi hwpt?
> >>
> >> yes, it is possible
> > Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n,
> IOMMUTLBEntry
> > *iotlb)
> 
> in map/unmap, should use ioas_id instead of hwpt_id
> 
> >
> >>
> >>> Since VFIOIOMMUFDContainer has hwpt_list now.
> >>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c,
> >>> where no vbasedev can be used for compare.
> >>>
> >>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in
> >>> VFIOIOMMUFDContainer.
> >>> And save hwpt when vfio_device_attach_container.
> >>>
> >>>>
> >>>> In kernel ioctl: iommufd_vfio_ioctl
> >>>> @dev: Device to get an iommu_domain for
> >>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
> >>>> struct device *dev)
> >>>> But iommufd_vfio_ioctl seems no para dev?
> >>>
> >>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not
> need dev.
> >>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
> >>
> >> this is not good. dev is passed in to this function to allocate domain
> >> and also check sw_msi things. If you pass in a NULL, it may even unable
> >> to get a domain for the hwpt. It won't work I guess.
> >
> > The iommufd_hw_pagetable_from_id can be used for
> > 1, allocate domain, which need para dev
> > case IOMMUFD_OBJ_IOAS
> > hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);
> 
> this is used when attaching ioas.
> 
> > 2. Just return allocated domain via hwpt_id, which does not need dev.
> > case IOMMUFD_OBJ_HW_PAGETABLE:
> > return container_of(obj, struct iommufd_hw_pagetable, obj);
> 
> yes, this would be the usage in nesting. you may check my below
> branch. It's for nesting integration.
> 
> https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting
> 
> > By the way, any plan of the nested mode?
> I'm working with Eric, Nic on it. Currently, I've got the above kernel
> branch, QEMU side is also WIP.

Hi Yi/Eric,

I had a look at the above nesting kernel and QEMU branches and, as mentioned
in the cover letter, it is not working on ARM yet.

IIUC, to get it working via iommufd the main thing is that we need a way to
configure the physical SMMU in nested mode and set up the mappings for
stage 2. The cache/PASID related changes look more straightforward.

I have quite a few hacks to get it working on ARM, but it is still a WIP. So
I am just wondering: do you guys have something that can be shared yet?

Please let me know.

Thanks,
Shameer
Eric Auger June 28, 2022, 8:58 a.m. UTC | #43
Hi Shameer,

On 6/28/22 10:14, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Yi Liu [mailto:yi.l.liu@intel.com]
>> Sent: 18 May 2022 15:01
>> To: zhangfei.gao@foxmail.com; Jason Gunthorpe <jgg@nvidia.com>;
>> Zhangfei Gao <zhangfei.gao@linaro.org>
>> Cc: eric.auger@redhat.com; Alex Williamson <alex.williamson@redhat.com>;
>> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
>> cohuck@redhat.com; qemu-devel@nongnu.org;
>> david@gibson.dropbear.id.au; thuth@redhat.com; farman@linux.ibm.com;
>> mjrosato@linux.ibm.com; akrowiak@linux.ibm.com; pasic@linux.ibm.com;
>> jjherne@linux.ibm.com; jasowang@redhat.com; kvm@vger.kernel.org;
>> nicolinc@nvidia.com; eric.auger.pro@gmail.com; kevin.tian@intel.com;
>> chao.p.peng@intel.com; yi.y.sun@intel.com; peterx@redhat.com
>> Subject: Re: [RFC 00/18] vfio: Adopt iommufd
>>
>> On 2022/5/18 15:22, zhangfei.gao@foxmail.com wrote:
>>>
>>> On 2022/5/17 下午4:55, Yi Liu wrote:
>>>> Hi Zhangfei,
>>>>
>>>> On 2022/5/12 17:01, zhangfei.gao@foxmail.com wrote:
>>>>> Hi, Yi
>>>>>
>>>>> On 2022/5/11 下午10:17, zhangfei.gao@foxmail.com wrote:
>>>>>>
>>>>>> On 2022/5/10 下午10:08, Yi Liu wrote:
>>>>>>> On 2022/5/10 20:45, Jason Gunthorpe wrote:
>>>>>>>> On Tue, May 10, 2022 at 08:35:00PM +0800, Zhangfei Gao wrote:
>>>>>>>>> Thanks Yi and Eric,
>>>>>>>>> Then will wait for the updated iommufd kernel for the PCI MMIO
>> region.
>>>>>>>>> Another question,
>>>>>>>>> How to get the iommu_domain in the ioctl.
>>>>>>>> The ID of the iommu_domain (called the hwpt) it should be returned
>> by
>>>>>>>> the vfio attach ioctl.
>>>>>>> yes, hwpt_id is returned by the vfio attach ioctl and recorded in
>>>>>>> qemu. You can query page table related capabilities with this id.
>>>>>>>
>>>>>>>
>> https://lore.kernel.org/kvm/20220414104710.28534-16-yi.l.liu@intel.com/
>>>>>> Thanks Yi,
>>>>>>
>>>>>> Do we use iommufd_hw_pagetable_from_id in kernel?
>>>>>>
>>>>>> The qemu send hwpt_id via ioctl.
>>>>>> Currently VFIOIOMMUFDContainer has hwpt_list,
>>>>>> Which member is good to save hwpt_id, IOMMUTLBEntry?
>>>>> Can VFIOIOMMUFDContainer  have multi hwpt?
>>>> yes, it is possible
>>> Then how to get hwpt_id in map/unmap_notify(IOMMUNotifier *n,
>> IOMMUTLBEntry
>>> *iotlb)
>> in map/unmap, should use ioas_id instead of hwpt_id
>>
>>>>> Since VFIOIOMMUFDContainer has hwpt_list now.
>>>>> If so, how to get specific hwpt from map/unmap_notify in hw/vfio/as.c,
>>>>> where no vbasedev can be used for compare.
>>>>>
>>>>> I am testing with a workaround, adding VFIOIOASHwpt *hwpt in
>>>>> VFIOIOMMUFDContainer.
>>>>> And save hwpt when vfio_device_attach_container.
>>>>>
>>>>>> In kernel ioctl: iommufd_vfio_ioctl
>>>>>> @dev: Device to get an iommu_domain for
>>>>>> iommufd_hw_pagetable_from_id(struct iommufd_ctx *ictx, u32 pt_id,
>>>>>> struct device *dev)
>>>>>> But iommufd_vfio_ioctl seems no para dev?
>>>>> We can set dev=Null since IOMMUFD_OBJ_HW_PAGETABLE does not
>> need dev.
>>>>> iommufd_hw_pagetable_from_id(ictx, hwpt_id, NULL)
>>>> this is not good. dev is passed in to this function to allocate domain
>>>> and also check sw_msi things. If you pass in a NULL, it may even unable
>>>> to get a domain for the hwpt. It won't work I guess.
>>> The iommufd_hw_pagetable_from_id can be used for
>>> 1, allocate domain, which need para dev
>>> case IOMMUFD_OBJ_IOAS
>>> hwpt = iommufd_hw_pagetable_auto_get(ictx, ioas, dev);
>> this is used when attaching ioas.
>>
>>> 2. Just return allocated domain via hwpt_id, which does not need dev.
>>> case IOMMUFD_OBJ_HW_PAGETABLE:
>>> return container_of(obj, struct iommufd_hw_pagetable, obj);
>> yes, this would be the usage in nesting. you may check my below
>> branch. It's for nesting integration.
>>
>> https://github.com/luxis1999/iommufd/tree/iommufd-v5.18-rc4-nesting
>>
>>> By the way, any plan of the nested mode?
>> I'm working with Eric, Nic on it. Currently, I've got the above kernel
>> branch, QEMU side is also WIP.
> Hi Yi/Eric,
>
> I had a look at the above nesting kernel and Qemu branches and as mentioned
> in the cover letter it is not working on ARM yet.
>
> IIUC, to get it working via the iommufd the main thing is we need a way to configure
> the phys SMMU in nested mode and setup the mappings for the stage 2. The
> Cache/PASID related changes looks more straight forward. 
>
> I had quite a few hacks to get it working on ARM, but still a WIP. So just wondering
> do you guys have something that can be shared yet?

I am working on the respin based on the latest iommufd kernel branches and
the QEMU RFC v2, but it is still WIP.

I will share it as soon as possible.

Eric
>
> Please let me know.
>
> Thanks,
> Shameer