Message ID: 0-v1-7612f88c19f5+2f21-iommufd_alloc_jgg@nvidia.com
Series: Add iommufd physical device operations for replace and alloc hwpt
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, February 25, 2023 8:28 AM
> [...]
> The implementation is complicated because we have to introduce some
> per-iommu_group memory in iommufd and redo how we think about
> multi-device groups to be more explicit. This solves all the locking
> problems in the prior attempts.

Now think about the pasid case.

pasid attach is managed as a device operation today:
iommu_attach_device_pasid()

Following it naturally we'll have a pasid array per iommufd_device to
track the attached HWPT per pasid.

But internally there is only one pasid table per iommu group, i.e. the
same story as RID attach: once dev1 replaces the hwpt on pasid1, it
takes effect on all other devices in the same group.

Then confusion comes. If we must have an explicit group object in
iommufd to manage domain replacement per RID, then do we need the same
explicit mechanism for pasid, e.g. tracking the pasid-attached hwpt in
iommufd_group instead of in iommufd_device, plus an
iommu_attach_group_pasid() API?

In contrast, if it's fine to continue managing pasid attach/replace per
device, does it imply there is another way of solving the RID case
cleanly w/o an explicit group object?

Thanks
Kevin
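As a rough sketch of the per-device tracking described above (every
name here is hypothetical, invented for this discussion rather than
taken from the iommufd code; only xa_insert()/xa_erase() and
iommu_attach_device_pasid() are real kernel APIs):

#include <linux/iommu.h>
#include <linux/xarray.h>

/* Hypothetical iommufd-side types, for illustration only */
struct hwpt_sketch {
	struct iommu_domain *domain;
};

struct iommufd_device_sketch {
	struct device *dev;
	struct xarray pasid_hwpts;	/* pasid -> struct hwpt_sketch * */
};

static int device_pasid_attach_sketch(struct iommufd_device_sketch *idev,
				      ioasid_t pasid,
				      struct hwpt_sketch *hwpt)
{
	int ret;

	/* Claim the pasid slot first; -EBUSY if something is attached */
	ret = xa_insert(&idev->pasid_hwpts, pasid, hwpt, GFP_KERNEL);
	if (ret)
		return ret;

	ret = iommu_attach_device_pasid(hwpt->domain, idev->dev, pasid);
	if (ret)
		xa_erase(&idev->pasid_hwpts, pasid);
	return ret;
}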
On Tue, Mar 07, 2023 at 08:42:06AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, February 25, 2023 8:28 AM
> > [...]
> > The implementation is complicated because we have to introduce some
> > per-iommu_group memory in iommufd and redo how we think about
> > multi-device groups to be more explicit. This solves all the locking
> > problems in the prior attempts.
>
> Now think about the pasid case.
>
> pasid attach is managed as a device operation today:
> iommu_attach_device_pasid()
>
> Following it naturally we'll have a pasid array per iommufd_device to
> track the attached HWPT per pasid.
>
> But internally there is only one pasid table per iommu group, i.e. the
> same story as RID attach: once dev1 replaces the hwpt on pasid1, it
> takes effect on all other devices in the same group.

IMHO I can't believe that any actual system that supports PASID has a
RID aliasing problem too.

I think we should fix the iommu core to make PASID per-device and
require systems that have a RID aliasing problem to block PASID.

There is a bigger picture here: if drivers have to optionally share
their PASID tables with other drivers then we can't have per-driver
PASID allocators at all either.

> Then confusion comes. If we must have an explicit group object in
> iommufd to manage domain replacement per RID, then do we need the same
> explicit mechanism for pasid, e.g. tracking the pasid-attached hwpt in
> iommufd_group instead of in iommufd_device, plus an
> iommu_attach_group_pasid() API?

If we make PASID per-group then yes.

But no actual HW models per-group PASID to the VM, so this is a
complete API disaster for vIOMMU. Better not to do it.

Jason
On 3/7/23 8:46 PM, Jason Gunthorpe wrote:
> On Tue, Mar 07, 2023 at 08:42:06AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Saturday, February 25, 2023 8:28 AM
>>> [...]
>> Now think about the pasid case.
>>
>> pasid attach is managed as a device operation today:
>> iommu_attach_device_pasid()
>>
>> Following it naturally we'll have a pasid array per iommufd_device to
>> track the attached HWPT per pasid.
>>
>> But internally there is only one pasid table per iommu group, i.e. the
>> same story as RID attach: once dev1 replaces the hwpt on pasid1, it
>> takes effect on all other devices in the same group.
>
> IMHO I can't believe that any actual system that supports PASID has a
> RID aliasing problem too.
>
> I think we should fix the iommu core to make PASID per-device and
> require systems that have a RID aliasing problem to block PASID.
>
> There is a bigger picture here: if drivers have to optionally share
> their PASID tables with other drivers then we can't have per-driver
> PASID allocators at all either.

This is actually enforced by the PCI and IOMMU core. pci_enable_pasid()
requires full ACS support on the device's upstream path:

	if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
		return -EINVAL;

and, for such a PCI topology, the iommu core always allocates an
exclusive iommu group.

The only place that seems to be a little messy is:

static int __iommu_set_group_pasid(struct iommu_domain *domain,
				   struct iommu_group *group, ioasid_t pasid)
{
	struct group_device *device;
	int ret = 0;

	list_for_each_entry(device, &group->devices, list) {
		ret = domain->ops->set_dev_pasid(domain, device->dev, pasid);
		if (ret)
			break;
	}

	return ret;
}

Perhaps we need a check on singleton group?

Best regards,
baolu
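If such a check is wanted, one possible shape for it is sketched below,
using the stock list_is_singular() helper from <linux/list.h>; whether
this function is the right place for the restriction is exactly the
open question:

static int __iommu_set_group_pasid(struct iommu_domain *domain,
				   struct iommu_group *group, ioasid_t pasid)
{
	struct group_device *device;
	int ret = 0;

	/* Sketch: refuse PASID attach for multi-device groups */
	if (!list_is_singular(&group->devices))
		return -EINVAL;

	list_for_each_entry(device, &group->devices, list) {
		ret = domain->ops->set_dev_pasid(domain, device->dev, pasid);
		if (ret)
			break;
	}

	return ret;
}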
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Wednesday, March 8, 2023 10:08 AM
>
> On 3/7/23 8:46 PM, Jason Gunthorpe wrote:
> > On Tue, Mar 07, 2023 at 08:42:06AM +0000, Tian, Kevin wrote:
> > [...]
> > I think we should fix the iommu core to make PASID per-device and
> > require systems that have a RID aliasing problem to block PASID.
> >
> > There is a bigger picture here: if drivers have to optionally share
> > their PASID tables with other drivers then we can't have per-driver
> > PASID allocators at all either.
>
> This is actually enforced by the PCI and IOMMU core. pci_enable_pasid()
> requires full ACS support on the device's upstream path:
>
> 	if (!pci_acs_path_enabled(pdev, NULL, PCI_ACS_RR | PCI_ACS_UF))
> 		return -EINVAL;
>
> and, for such a PCI topology, the iommu core always allocates an
> exclusive iommu group.
>
> The only place that seems to be a little messy is:
>
> static int __iommu_set_group_pasid(struct iommu_domain *domain,
> 				   struct iommu_group *group, ioasid_t pasid)
> {
> 	struct group_device *device;
> 	int ret = 0;
>
> 	list_for_each_entry(device, &group->devices, list) {
> 		ret = domain->ops->set_dev_pasid(domain, device->dev,
> 						 pasid);
> 		if (ret)
> 			break;
> 	}
>
> 	return ret;
> }

Yes, this is exactly where my confusion comes from. I thought we had
agreed that PASID usage won't be supported on multi-device groups. Then
I saw the above code (but didn't catch the implication of the PCI ACS
check), which led me to think about the group mess.

Sticking to per-device pasid would certainly simplify the logic in
iommufd.
On Wed, Mar 08, 2023 at 07:38:38AM +0000, Tian, Kevin wrote:
> > Perhaps we need a check on singleton group?
>
> or move the xarray to device? Storing it in group while adding a
> singleton restriction only causes confusion.

Makes sense

Jason
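Concretely, the move Kevin suggests would look something like the
sketch below. The "before" side reflects the iommu core of this era,
where the pasid tracking xarray lives in the group; the dev_iommu
placement on the "after" side is hypothetical:

/* Today: pasid -> domain tracking is per-group (drivers/iommu/iommu.c) */
struct iommu_group {
	/* ... */
	struct xarray pasid_array;	/* pasid -> struct iommu_domain * */
};

/* Sketch: move it into the per-device dev_iommu instead, so the attach
 * path can act on a single device directly rather than walking
 * group->devices as __iommu_set_group_pasid() does above. */
struct dev_iommu {
	/* ... existing members ... */
	struct xarray pasid_array;	/* hypothetical new home */
};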