diff mbox series

[1/8] iommu: Introduce a replace API for device pasid

Message ID 20231127063428.127436-2-yi.l.liu@intel.com (mailing list archive)
State New, archived
Headers show
Series iommufd support pasid attach/replace | expand

Commit Message

Yi Liu Nov. 27, 2023, 6:34 a.m. UTC
From: Lu Baolu <baolu.lu@linux.intel.com>

Provide a high-level API to allow replacements of one domain with
another for specific pasid of a device. This is similar to
iommu_group_replace_domain() and it is also expected to be used
only by IOMMUFD.

Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommu-priv.h |  2 +
 drivers/iommu/iommu.c      | 82 +++++++++++++++++++++++++++++++-------
 2 files changed, 70 insertions(+), 14 deletions(-)

Comments

Jason Gunthorpe Jan. 15, 2024, 5:19 p.m. UTC | #1
On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
> +int iommu_replace_device_pasid(struct iommu_domain *domain,
> +			       struct device *dev, ioasid_t pasid)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct iommu_domain *old_domain;
> +	int ret;
> +
> +	if (!domain)
> +		return -EINVAL;
> +
> +	if (!group)
> +		return -ENODEV;
> +
> +	mutex_lock(&group->mutex);
> +	__iommu_remove_group_pasid(group, pasid);

It is not replace if you do remove first.

Replace must just call set_dev_pasid and nothing much else..

Jason
Yi Liu March 10, 2024, 1:05 p.m. UTC | #2
On 2024/1/16 01:19, Jason Gunthorpe wrote:
> On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
>> +int iommu_replace_device_pasid(struct iommu_domain *domain,
>> +			       struct device *dev, ioasid_t pasid)
>> +{
>> +	struct iommu_group *group = dev->iommu_group;
>> +	struct iommu_domain *old_domain;
>> +	int ret;
>> +
>> +	if (!domain)
>> +		return -EINVAL;
>> +
>> +	if (!group)
>> +		return -ENODEV;
>> +
>> +	mutex_lock(&group->mutex);
>> +	__iommu_remove_group_pasid(group, pasid);
> 
> It is not replace if you do remove first.
> 
> Replace must just call set_dev_pasid and nothing much else..

Seems uneasy to do it so far. The VT-d driver needs to get the old domain
first in order to do replacement. However, VT-d driver does not track the
attached domains of pasids. It gets domain of a pasid
by iommu_get_domain_for_dev_pasid(). Like intel_iommu_remove_dev_pasid)
in link [1]. While the iommu layer exchanges the domain in the
group->pasid_array before calling into VT-d driver. So when calling into
VT-d driver, the domain got by iommu_get_domain_for_dev_pasid() is already
the new domain. To solve it, we may need to let VT-d driver have its
own tracking on the domains. How about your thoughts? @Jason, @Kevin, @Baoplu?

[1] 
https://github.com/torvalds/linux/blob/master/drivers/iommu/intel/iommu.c#L4621C19-L4621C20
Tian, Kevin March 11, 2024, 9:26 a.m. UTC | #3
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 10, 2024 9:06 PM
> 
> On 2024/1/16 01:19, Jason Gunthorpe wrote:
> > On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
> >> +int iommu_replace_device_pasid(struct iommu_domain *domain,
> >> +			       struct device *dev, ioasid_t pasid)
> >> +{
> >> +	struct iommu_group *group = dev->iommu_group;
> >> +	struct iommu_domain *old_domain;
> >> +	int ret;
> >> +
> >> +	if (!domain)
> >> +		return -EINVAL;
> >> +
> >> +	if (!group)
> >> +		return -ENODEV;
> >> +
> >> +	mutex_lock(&group->mutex);
> >> +	__iommu_remove_group_pasid(group, pasid);
> >
> > It is not replace if you do remove first.
> >
> > Replace must just call set_dev_pasid and nothing much else..
> 
> Seems uneasy to do it so far. The VT-d driver needs to get the old domain
> first in order to do replacement. However, VT-d driver does not track the
> attached domains of pasids. It gets domain of a pasid
> by iommu_get_domain_for_dev_pasid(). Like
> intel_iommu_remove_dev_pasid)
> in link [1]. While the iommu layer exchanges the domain in the
> group->pasid_array before calling into VT-d driver. So when calling into
> VT-d driver, the domain got by iommu_get_domain_for_dev_pasid() is
> already
> the new domain. To solve it, we may need to let VT-d driver have its
> own tracking on the domains. How about your thoughts? @Jason, @Kevin,
> @Baoplu?
> 
> [1]
> https://github.com/torvalds/linux/blob/master/drivers/iommu/intel/iommu
> .c#L4621C19-L4621C20
> 

Jason's point was that the core helper should directly call set_dev_pasid
and underlying iommu driver decides whether atomic replacement
can be implemented inside the callback. If you first remove in the core
then one can never implement a replace semantics.
Yi Liu March 12, 2024, 3:07 a.m. UTC | #4
On 2024/3/11 17:26, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Sunday, March 10, 2024 9:06 PM
>>
>> On 2024/1/16 01:19, Jason Gunthorpe wrote:
>>> On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
>>>> +int iommu_replace_device_pasid(struct iommu_domain *domain,
>>>> +			       struct device *dev, ioasid_t pasid)
>>>> +{
>>>> +	struct iommu_group *group = dev->iommu_group;
>>>> +	struct iommu_domain *old_domain;
>>>> +	int ret;
>>>> +
>>>> +	if (!domain)
>>>> +		return -EINVAL;
>>>> +
>>>> +	if (!group)
>>>> +		return -ENODEV;
>>>> +
>>>> +	mutex_lock(&group->mutex);
>>>> +	__iommu_remove_group_pasid(group, pasid);
>>>
>>> It is not replace if you do remove first.
>>>
>>> Replace must just call set_dev_pasid and nothing much else..
>>
>> Seems uneasy to do it so far. The VT-d driver needs to get the old domain
>> first in order to do replacement. However, VT-d driver does not track the
>> attached domains of pasids. It gets domain of a pasid
>> by iommu_get_domain_for_dev_pasid(). Like
>> intel_iommu_remove_dev_pasid)
>> in link [1]. While the iommu layer exchanges the domain in the
>> group->pasid_array before calling into VT-d driver. So when calling into
>> VT-d driver, the domain got by iommu_get_domain_for_dev_pasid() is
>> already
>> the new domain. To solve it, we may need to let VT-d driver have its
>> own tracking on the domains. How about your thoughts? @Jason, @Kevin,
>> @Baoplu?
>>
>> [1]
>> https://github.com/torvalds/linux/blob/master/drivers/iommu/intel/iommu
>> .c#L4621C19-L4621C20
>>
> 
> Jason's point was that the core helper should directly call set_dev_pasid
> and underlying iommu driver decides whether atomic replacement
> can be implemented inside the callback. If you first remove in the core
> then one can never implement a replace semantics.

yeah, I got Jason's point. I'm raising an open to make the set_dev_pasid
callback to handle domain replacement. The existing intel iommu driver
does not track attached domains for pasid. But it's needed to know the
attached domain of a pasid and find the dev_pasid under the domain, hence
be able to clean up the old attachment and do the new attachment. Existing
code cannot work as I mentioned above. The group->pasid_xarray is updated
before calling set_dev_pasid callback. This means the underlying iommu
driver cannot get the old domain in the set_dev_pasid callback by the
iommu_get_domain_for_dev_pasid() helper.

As above, I'm considering the possibility to track attached domains for
pasid by an xarray in the struct device_domain_info. Hence, intel iommu
driver could store/retrieve attached domain for pasids. If it's
replacement, the set_dev_pasid callback could find the attached domain and
the dev_pasid instance in the domain. Then be able to do detach and clean
up the tracking structures (e.g. dev_pasid).

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e58263dcfadd..03a847a406e1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4398,6 +4398,7 @@ static struct iommu_device 
*intel_iommu_probe_device(struct device *dev)

  	info->dev = dev;
  	info->iommu = iommu;
+	xa_init(&info->pasid_array);
  	if (dev_is_pci(dev)) {
  		if (ecap_dev_iotlb_support(iommu->ecap) &&
  		    pci_ats_supported(pdev) &&
@@ -4464,6 +4465,7 @@ static void intel_iommu_release_device(struct device 
*dev)
  	mutex_unlock(&iommu->iopf_lock);
  	dmar_remove_one_dev_info(dev);
  	intel_pasid_free_table(dev);
+	WARN_ON(!xa_empty(&info->pasid_array));
  	intel_iommu_debugfs_remove_dev(info);
  	kfree(info);
  	set_dma_ops(dev, NULL);
@@ -4707,7 +4709,13 @@ static int intel_iommu_iotlb_sync_map(struct 
iommu_domain *domain,
  	return 0;
  }

-static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
+/*
+ * Clear the pasid table entries so that all DMA requests with specific
+ * PASID from the device are blocked. If the page table has been set, clean
+ * up the data structures.
+ */
+static void device_pasid_block_translation(struct device *dev, ioasid_t pasid,
+					   bool remove)
  {
  	struct device_domain_info *info = dev_iommu_priv_get(dev);
  	struct dev_pasid_info *curr, *dev_pasid = NULL;
@@ -4716,9 +4724,11 @@ static void intel_iommu_remove_dev_pasid(struct 
device *dev, ioasid_t pasid)
  	struct iommu_domain *domain;
  	unsigned long flags;

-	domain = iommu_get_domain_for_dev_pasid(dev, pasid, 0);
-	if (WARN_ON_ONCE(!domain))
+	domain = xa_erase(&info->pasid_array, pasid);
+	if (!domain) {
+		WARN_ON_ONCE(remove);
  		return;
+	}

  	/*
  	 * The SVA implementation needs to handle its own stuffs like the mm
@@ -4753,6 +4763,11 @@ static void intel_iommu_remove_dev_pasid(struct 
device *dev, ioasid_t pasid)
  	intel_drain_pasid_prq(dev, pasid);
  }

+static void intel_iommu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
+{
+	device_pasid_block_translation(dev, pasid, true);
+}
+
  static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
  				     struct device *dev, ioasid_t pasid)
  {
@@ -4772,6 +4787,9 @@ static int intel_iommu_set_dev_pasid(struct 
iommu_domain *domain,
  	if (context_copied(iommu, info->bus, info->devfn))
  		return -EBUSY;

+	/* Try to block translation if it already has */
+	device_pasid_block_translation(dev, pasid, false);
+
  	ret = prepare_domain_attach_device(domain, dev);
  	if (ret)
  		return ret;
@@ -4801,6 +4819,8 @@ static int intel_iommu_set_dev_pasid(struct 
iommu_domain *domain,
  	list_add(&dev_pasid->link_domain, &dmar_domain->dev_pasids);
  	spin_unlock_irqrestore(&dmar_domain->lock, flags);

+	xa_store(&info->pasid_array, pasid, dmar_domain, GFP_KERNEL);
+
  	if (domain->type & __IOMMU_DOMAIN_PAGING)
  		intel_iommu_debugfs_create_dev_pasid(dev_pasid);

diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 94e2ecc3c73f..a466555f61fe 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -767,6 +767,7 @@ struct device_domain_info {
  #ifdef CONFIG_INTEL_IOMMU_DEBUGFS
  	struct dentry *debugfs_dentry; /* pointer to device directory dentry */
  #endif
+	struct xarray pasid_array; /* Attached domains per PASID */
  };

  struct dev_pasid_info {
Baolu Lu March 13, 2024, 3:13 a.m. UTC | #5
On 2024/3/12 11:07, Yi Liu wrote:
> On 2024/3/11 17:26, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Sunday, March 10, 2024 9:06 PM
>>>
>>> On 2024/1/16 01:19, Jason Gunthorpe wrote:
>>>> On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
>>>>> +int iommu_replace_device_pasid(struct iommu_domain *domain,
>>>>> +                   struct device *dev, ioasid_t pasid)
>>>>> +{
>>>>> +    struct iommu_group *group = dev->iommu_group;
>>>>> +    struct iommu_domain *old_domain;
>>>>> +    int ret;
>>>>> +
>>>>> +    if (!domain)
>>>>> +        return -EINVAL;
>>>>> +
>>>>> +    if (!group)
>>>>> +        return -ENODEV;
>>>>> +
>>>>> +    mutex_lock(&group->mutex);
>>>>> +    __iommu_remove_group_pasid(group, pasid);
>>>>
>>>> It is not replace if you do remove first.
>>>>
>>>> Replace must just call set_dev_pasid and nothing much else..
>>>
>>> Seems uneasy to do it so far. The VT-d driver needs to get the old 
>>> domain
>>> first in order to do replacement. However, VT-d driver does not track 
>>> the
>>> attached domains of pasids. It gets domain of a pasid
>>> by iommu_get_domain_for_dev_pasid(). Like
>>> intel_iommu_remove_dev_pasid)
>>> in link [1]. While the iommu layer exchanges the domain in the
>>> group->pasid_array before calling into VT-d driver. So when calling into
>>> VT-d driver, the domain got by iommu_get_domain_for_dev_pasid() is
>>> already
>>> the new domain. To solve it, we may need to let VT-d driver have its
>>> own tracking on the domains. How about your thoughts? @Jason, @Kevin,
>>> @Baoplu?
>>>
>>> [1]
>>> https://github.com/torvalds/linux/blob/master/drivers/iommu/intel/iommu
>>> .c#L4621C19-L4621C20
>>>
>>
>> Jason's point was that the core helper should directly call set_dev_pasid
>> and underlying iommu driver decides whether atomic replacement
>> can be implemented inside the callback. If you first remove in the core
>> then one can never implement a replace semantics.
> 
> yeah, I got Jason's point. I'm raising an open to make the set_dev_pasid
> callback to handle domain replacement. The existing intel iommu driver
> does not track attached domains for pasid. But it's needed to know the
> attached domain of a pasid and find the dev_pasid under the domain, hence
> be able to clean up the old attachment and do the new attachment. Existing
> code cannot work as I mentioned above. The group->pasid_xarray is updated
> before calling set_dev_pasid callback. This means the underlying iommu
> driver cannot get the old domain in the set_dev_pasid callback by the
> iommu_get_domain_for_dev_pasid() helper.
> 
> As above, I'm considering the possibility to track attached domains for
> pasid by an xarray in the struct device_domain_info. Hence, intel iommu
> driver could store/retrieve attached domain for pasids. If it's
> replacement, the set_dev_pasid callback could find the attached domain and
> the dev_pasid instance in the domain. Then be able to do detach and clean
> up the tracking structures (e.g. dev_pasid).

Maintaining the same data structure in both core and individual iommu
drivers seems to be duplicate effort. It appears that what you want here
is just to get the currently used domain in the set_dev_pasid path. Is
it possible to adjust the core code so that the pasid array is updated
after the driver's set_dev_pasid callback returns success?

Best regards,
baolu
Yi Liu March 13, 2024, 8:11 a.m. UTC | #6
On 2024/3/13 11:13, Baolu Lu wrote:
> On 2024/3/12 11:07, Yi Liu wrote:
>> On 2024/3/11 17:26, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Sunday, March 10, 2024 9:06 PM
>>>>
>>>> On 2024/1/16 01:19, Jason Gunthorpe wrote:
>>>>> On Sun, Nov 26, 2023 at 10:34:21PM -0800, Yi Liu wrote:
>>>>>> +int iommu_replace_device_pasid(struct iommu_domain *domain,
>>>>>> +                   struct device *dev, ioasid_t pasid)
>>>>>> +{
>>>>>> +    struct iommu_group *group = dev->iommu_group;
>>>>>> +    struct iommu_domain *old_domain;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    if (!domain)
>>>>>> +        return -EINVAL;
>>>>>> +
>>>>>> +    if (!group)
>>>>>> +        return -ENODEV;
>>>>>> +
>>>>>> +    mutex_lock(&group->mutex);
>>>>>> +    __iommu_remove_group_pasid(group, pasid);
>>>>>
>>>>> It is not replace if you do remove first.
>>>>>
>>>>> Replace must just call set_dev_pasid and nothing much else..
>>>>
>>>> Seems uneasy to do it so far. The VT-d driver needs to get the old domain
>>>> first in order to do replacement. However, VT-d driver does not track the
>>>> attached domains of pasids. It gets domain of a pasid
>>>> by iommu_get_domain_for_dev_pasid(). Like
>>>> intel_iommu_remove_dev_pasid)
>>>> in link [1]. While the iommu layer exchanges the domain in the
>>>> group->pasid_array before calling into VT-d driver. So when calling into
>>>> VT-d driver, the domain got by iommu_get_domain_for_dev_pasid() is
>>>> already
>>>> the new domain. To solve it, we may need to let VT-d driver have its
>>>> own tracking on the domains. How about your thoughts? @Jason, @Kevin,
>>>> @Baoplu?
>>>>
>>>> [1]
>>>> https://github.com/torvalds/linux/blob/master/drivers/iommu/intel/iommu
>>>> .c#L4621C19-L4621C20
>>>>
>>>
>>> Jason's point was that the core helper should directly call set_dev_pasid
>>> and underlying iommu driver decides whether atomic replacement
>>> can be implemented inside the callback. If you first remove in the core
>>> then one can never implement a replace semantics.
>>
>> yeah, I got Jason's point. I'm raising an open to make the set_dev_pasid
>> callback to handle domain replacement. The existing intel iommu driver
>> does not track attached domains for pasid. But it's needed to know the
>> attached domain of a pasid and find the dev_pasid under the domain, hence
>> be able to clean up the old attachment and do the new attachment. Existing
>> code cannot work as I mentioned above. The group->pasid_xarray is updated
>> before calling set_dev_pasid callback. This means the underlying iommu
>> driver cannot get the old domain in the set_dev_pasid callback by the
>> iommu_get_domain_for_dev_pasid() helper.
>>
>> As above, I'm considering the possibility to track attached domains for
>> pasid by an xarray in the struct device_domain_info. Hence, intel iommu
>> driver could store/retrieve attached domain for pasids. If it's
>> replacement, the set_dev_pasid callback could find the attached domain and
>> the dev_pasid instance in the domain. Then be able to do detach and clean
>> up the tracking structures (e.g. dev_pasid).
> 
> Maintaining the same data structure in both core and individual iommu
> drivers seems to be duplicate effort. It appears that what you want here
> is just to get the currently used domain in the set_dev_pasid path. Is
> it possible to adjust the core code so that the pasid array is updated
> after the driver's set_dev_pasid callback returns success?

yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback 
and pasid_array update is under the group->lock, so update it should be
fine to adjust the order to update pasid_array after set_dev_pasid returns.
Jason Gunthorpe March 18, 2024, 4:52 p.m. UTC | #7
On Wed, Mar 13, 2024 at 04:11:41PM +0800, Yi Liu wrote:

> yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback
> and pasid_array update is under the group->lock, so update it should be
> fine to adjust the order to update pasid_array after set_dev_pasid returns.

Yes, it makes some sense

But, also I would like it very much if we just have the core pass in
the actual old domain as a an addition function argument.

I think we have some small mistakes in multi-device group error
unwinding for remove because the global xarray can't isn't actually
going to be correct in all scenarios.

Jason
Yi Liu March 19, 2024, 7:29 a.m. UTC | #8
On 2024/3/19 00:52, Jason Gunthorpe wrote:
> On Wed, Mar 13, 2024 at 04:11:41PM +0800, Yi Liu wrote:
> 
>> yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback
>> and pasid_array update is under the group->lock, so update it should be
>> fine to adjust the order to update pasid_array after set_dev_pasid returns.
> 
> Yes, it makes some sense
> 
> But, also I would like it very much if we just have the core pass in
> the actual old domain as a an addition function argument.

ok, this works too. For normal attach, just pass in a NULL old domain.

> I think we have some small mistakes in multi-device group error
> unwinding for remove because the global xarray can't isn't actually
> going to be correct in all scenarios.

do you mean the __iommu_remove_group_pasid() call in the below function?
Currently, it is called when __iommu_set_group_pasid() failed. However,
__iommu_set_group_pasid() may need to do remove itself when error happens,
so the helper can be more self-contained. Or you mean something else?

int iommu_attach_device_pasid(struct iommu_domain *domain,
			      struct device *dev, ioasid_t pasid)
{
	/* Caller must be a probed driver on dev */
	struct iommu_group *group = dev->iommu_group;
	void *curr;
	int ret;

	if (!domain->ops->set_dev_pasid)
		return -EOPNOTSUPP;

	if (!group)
		return -ENODEV;

	if (!dev_has_iommu(dev) || dev_iommu_ops(dev) != domain->owner)
		return -EINVAL;

	mutex_lock(&group->mutex);
	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
	if (curr) {
		ret = xa_err(curr) ? : -EBUSY;
		goto out_unlock;
	}

	ret = __iommu_set_group_pasid(domain, group, pasid);
	if (ret) {
		__iommu_remove_group_pasid(group, pasid);
		xa_erase(&group->pasid_array, pasid);
	}
out_unlock:
	mutex_unlock(&group->mutex);
	return ret;
}
EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);

https://github.com/torvalds/linux/blob/b3603fcb79b1036acae10602bffc4855a4b9af80/drivers/iommu/iommu.c#L3376
Jason Gunthorpe March 20, 2024, 12:38 p.m. UTC | #9
On Tue, Mar 19, 2024 at 03:29:39PM +0800, Yi Liu wrote:
> On 2024/3/19 00:52, Jason Gunthorpe wrote:
> > On Wed, Mar 13, 2024 at 04:11:41PM +0800, Yi Liu wrote:
> > 
> > > yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback
> > > and pasid_array update is under the group->lock, so update it should be
> > > fine to adjust the order to update pasid_array after set_dev_pasid returns.
> > 
> > Yes, it makes some sense
> > 
> > But, also I would like it very much if we just have the core pass in
> > the actual old domain as a an addition function argument.
> 
> ok, this works too. For normal attach, just pass in a NULL old domain.
> 
> > I think we have some small mistakes in multi-device group error
> > unwinding for remove because the global xarray can't isn't actually
> > going to be correct in all scenarios.
> 
> do you mean the __iommu_remove_group_pasid() call in the below function?
> Currently, it is called when __iommu_set_group_pasid() failed. However,
> __iommu_set_group_pasid() may need to do remove itself when error happens,
> so the helper can be more self-contained. Or you mean something else?

Yes..

> int iommu_attach_device_pasid(struct iommu_domain *domain,
> 			      struct device *dev, ioasid_t pasid)
> {
> 	/* Caller must be a probed driver on dev */
> 	struct iommu_group *group = dev->iommu_group;
> 	void *curr;
> 	int ret;
> 
> 	if (!domain->ops->set_dev_pasid)
> 		return -EOPNOTSUPP;
> 
> 	if (!group)
> 		return -ENODEV;
> 
> 	if (!dev_has_iommu(dev) || dev_iommu_ops(dev) != domain->owner)
> 		return -EINVAL;
> 
> 	mutex_lock(&group->mutex);
> 	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
> 	if (curr) {
> 		ret = xa_err(curr) ? : -EBUSY;
> 		goto out_unlock;
> 	}
> 
> 	ret = __iommu_set_group_pasid(domain, group, pasid);

So here we have the xa set to the new domain

> 	if (ret) {
> 		__iommu_remove_group_pasid(group, pasid);

And here we still have it set to the new domain even though some of
the devices within the group failed to attach. The logic needs to be
more like the main domain attach path where iterate and then undo only
what we did

And the whole thing is easier to reason about if an input argument
specifies the current attached domain instead of having the driver
read it from the xarray.

Jason
Yi Liu March 21, 2024, 6:16 a.m. UTC | #10
On 2024/3/20 20:38, Jason Gunthorpe wrote:
> On Tue, Mar 19, 2024 at 03:29:39PM +0800, Yi Liu wrote:
>> On 2024/3/19 00:52, Jason Gunthorpe wrote:
>>> On Wed, Mar 13, 2024 at 04:11:41PM +0800, Yi Liu wrote:
>>>
>>>> yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback
>>>> and pasid_array update is under the group->lock, so update it should be
>>>> fine to adjust the order to update pasid_array after set_dev_pasid returns.
>>>
>>> Yes, it makes some sense
>>>
>>> But, also I would like it very much if we just have the core pass in
>>> the actual old domain as a an addition function argument.
>>
>> ok, this works too. For normal attach, just pass in a NULL old domain.
>>
>>> I think we have some small mistakes in multi-device group error
>>> unwinding for remove because the global xarray can't isn't actually
>>> going to be correct in all scenarios.
>>
>> do you mean the __iommu_remove_group_pasid() call in the below function?
>> Currently, it is called when __iommu_set_group_pasid() failed. However,
>> __iommu_set_group_pasid() may need to do remove itself when error happens,
>> so the helper can be more self-contained. Or you mean something else?
> 
> Yes..
> 
>> int iommu_attach_device_pasid(struct iommu_domain *domain,
>> 			      struct device *dev, ioasid_t pasid)
>> {
>> 	/* Caller must be a probed driver on dev */
>> 	struct iommu_group *group = dev->iommu_group;
>> 	void *curr;
>> 	int ret;
>>
>> 	if (!domain->ops->set_dev_pasid)
>> 		return -EOPNOTSUPP;
>>
>> 	if (!group)
>> 		return -ENODEV;
>>
>> 	if (!dev_has_iommu(dev) || dev_iommu_ops(dev) != domain->owner)
>> 		return -EINVAL;
>>
>> 	mutex_lock(&group->mutex);
>> 	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
>> 	if (curr) {
>> 		ret = xa_err(curr) ? : -EBUSY;
>> 		goto out_unlock;
>> 	}
>>
>> 	ret = __iommu_set_group_pasid(domain, group, pasid);
> 
> So here we have the xa set to the new domain

aha, yes. The underlying driver won't be able to get the correct domain
in the remove_dev_pasid callback.

>> 	if (ret) {
>> 		__iommu_remove_group_pasid(group, pasid);
> 
> And here we still have it set to the new domain even though some of
> the devices within the group failed to attach. The logic needs to be
> more like the main domain attach path where iterate and then undo only
> what we did

yes, the correct way is to undo what have been done before the fail
device. However, I somehow remember that pasid capability is only
available when the group is singleton. So iterate all devices of the
devices just means one device in fact. If this is true, then the
current code is fine although a bit confusing.

> And the whole thing is easier to reason about if an input argument
> specifies the current attached domain instead of having the driver
> read it from the xarray.

yep, will correct it as a fix patch.
Yi Liu March 21, 2024, 11:26 a.m. UTC | #11
On 2024/3/21 14:16, Yi Liu wrote:
> On 2024/3/20 20:38, Jason Gunthorpe wrote:
>> On Tue, Mar 19, 2024 at 03:29:39PM +0800, Yi Liu wrote:
>>> On 2024/3/19 00:52, Jason Gunthorpe wrote:
>>>> On Wed, Mar 13, 2024 at 04:11:41PM +0800, Yi Liu wrote:
>>>>
>>>>> yes. how about your opinion? @Jason. I noticed the set_dev_pasid callback
>>>>> and pasid_array update is under the group->lock, so update it should be
>>>>> fine to adjust the order to update pasid_array after set_dev_pasid 
>>>>> returns.
>>>>
>>>> Yes, it makes some sense
>>>>
>>>> But, also I would like it very much if we just have the core pass in
>>>> the actual old domain as a an addition function argument.
>>>
>>> ok, this works too. For normal attach, just pass in a NULL old domain.
>>>
>>>> I think we have some small mistakes in multi-device group error
>>>> unwinding for remove because the global xarray can't isn't actually
>>>> going to be correct in all scenarios.
>>>
>>> do you mean the __iommu_remove_group_pasid() call in the below function?
>>> Currently, it is called when __iommu_set_group_pasid() failed. However,
>>> __iommu_set_group_pasid() may need to do remove itself when error happens,
>>> so the helper can be more self-contained. Or you mean something else?
>>
>> Yes..
>>
>>> int iommu_attach_device_pasid(struct iommu_domain *domain,
>>>                   struct device *dev, ioasid_t pasid)
>>> {
>>>     /* Caller must be a probed driver on dev */
>>>     struct iommu_group *group = dev->iommu_group;
>>>     void *curr;
>>>     int ret;
>>>
>>>     if (!domain->ops->set_dev_pasid)
>>>         return -EOPNOTSUPP;
>>>
>>>     if (!group)
>>>         return -ENODEV;
>>>
>>>     if (!dev_has_iommu(dev) || dev_iommu_ops(dev) != domain->owner)
>>>         return -EINVAL;
>>>
>>>     mutex_lock(&group->mutex);
>>>     curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, 
>>> GFP_KERNEL);
>>>     if (curr) {
>>>         ret = xa_err(curr) ? : -EBUSY;
>>>         goto out_unlock;
>>>     }
>>>
>>>     ret = __iommu_set_group_pasid(domain, group, pasid);
>>
>> So here we have the xa set to the new domain
> 
> aha, yes. The underlying driver won't be able to get the correct domain
> in the remove_dev_pasid callback.
> 
>>>     if (ret) {
>>>         __iommu_remove_group_pasid(group, pasid);
>>
>> And here we still have it set to the new domain even though some of
>> the devices within the group failed to attach. The logic needs to be
>> more like the main domain attach path where iterate and then undo only
>> what we did
> 
> yes, the correct way is to undo what have been done before the fail
> device. However, I somehow remember that pasid capability is only
> available when the group is singleton. So iterate all devices of the
> devices just means one device in fact. If this is true, then the
> current code is fine although a bit confusing.
> 
>> And the whole thing is easier to reason about if an input argument
>> specifies the current attached domain instead of having the driver
>> read it from the xarray.
> 
> yep, will correct it as a fix patch.

Hi Jason,

It appears there are two solutions here.

First, only undo the devices that have set_dev_pasid successfully in
the __iommu_set_group_pasid(), so the problematic
__iommu_remove_group_pasid() call at line 3378 [1] would go away.
This also makes the helper more self-contained. Draft patch in [2]

Second, pass in the domain to remove_dev_pasid(). Draft patch in [3]

Either of the above two should be able to solve the mistake you mentioned.
BTW. They are orthogonal, so it's also possible to apply both of them.
Which one is your preference then?

[1] 
https://github.com/torvalds/linux/blob/b3603fcb79b1036acae10602bffc4855a4b9af80/drivers/iommu/iommu.c#L3378
[2] 
https://github.com/yiliu1765/iommufd/commit/bb83270767ab460978a3452750f5e032b43e59bf
[3] 
https://github.com/yiliu1765/iommufd/commit/bd270b7b6619b8ba35dfaf9870df8728e370163f

Regards,
Yi Liu
Jason Gunthorpe March 21, 2024, 12:20 p.m. UTC | #12
On Thu, Mar 21, 2024 at 07:26:41PM +0800, Yi Liu wrote:
> > yes, the correct way is to undo what have been done before the fail
> > device. However, I somehow remember that pasid capability is only
> > available when the group is singleton. So iterate all devices of the
> > devices just means one device in fact. If this is true, then the
> > current code is fine although a bit confusing.

Platform devicse don't have that limitation.. It is PCI only.

> > > And the whole thing is easier to reason about if an input argument
> > > specifies the current attached domain instead of having the driver
> > > read it from the xarray.
> > 
> > yep, will correct it as a fix patch.
> 
> Hi Jason,
> 
> It appears there are two solutions here.
> 
> First, only undo the devices that have set_dev_pasid successfully in
> the __iommu_set_group_pasid(), so the problematic
> __iommu_remove_group_pasid() call at line 3378 [1] would go away.
> This also makes the helper more self-contained. Draft patch in [2]
> 
> Second, pass in the domain to remove_dev_pasid(). Draft patch in [3]
> 
> Either of the above two should be able to solve the mistake you mentioned.
> BTW. They are orthogonal, so it's also possible to apply both of them.
> Which one is your preference then?

I would do both because I also think it is not nice that the drivers
always have to have the boiler plate to read the xarray in their
remove..

Jason
Yi Liu March 21, 2024, 1:58 p.m. UTC | #13
On 2024/3/21 20:20, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2024 at 07:26:41PM +0800, Yi Liu wrote:
>>> yes, the correct way is to undo what have been done before the fail
>>> device. However, I somehow remember that pasid capability is only
>>> available when the group is singleton. So iterate all devices of the
>>> devices just means one device in fact. If this is true, then the
>>> current code is fine although a bit confusing.
> 
> Platform devicse don't have that limitation.. It is PCI only.

ok.

>>>> And the whole thing is easier to reason about if an input argument
>>>> specifies the current attached domain instead of having the driver
>>>> read it from the xarray.
>>>
>>> yep, will correct it as a fix patch.
>>
>> Hi Jason,
>>
>> It appears there are two solutions here.
>>
>> First, only undo the devices that have set_dev_pasid successfully in
>> the __iommu_set_group_pasid(), so the problematic
>> __iommu_remove_group_pasid() call at line 3378 [1] would go away.
>> This also makes the helper more self-contained. Draft patch in [2]
>>
>> Second, pass in the domain to remove_dev_pasid(). Draft patch in [3]
>>
>> Either of the above two should be able to solve the mistake you mentioned.
>> BTW. They are orthogonal, so it's also possible to apply both of them.
>> Which one is your preference then?
> 
> I would do both because I also think it is not nice that the drivers
> always have to have the boiler plate to read the xarray in their
> remove..

sure. I'll send the two patches as Fix series.
diff mbox series

Patch

diff --git a/drivers/iommu/iommu-priv.h b/drivers/iommu/iommu-priv.h
index 2024a2313348..5c32637f6325 100644
--- a/drivers/iommu/iommu-priv.h
+++ b/drivers/iommu/iommu-priv.h
@@ -19,6 +19,8 @@  static inline const struct iommu_ops *dev_iommu_ops(struct device *dev)
 
 int iommu_group_replace_domain(struct iommu_group *group,
 			       struct iommu_domain *new_domain);
+int iommu_replace_device_pasid(struct iommu_domain *domain,
+			       struct device *dev, ioasid_t pasid);
 
 int iommu_device_register_bus(struct iommu_device *iommu,
 			      const struct iommu_ops *ops, struct bus_type *bus,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9d573e971aff..ec213ebd5ecc 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3430,6 +3430,27 @@  static void __iommu_remove_group_pasid(struct iommu_group *group,
 	}
 }
 
+static int __iommu_group_attach_pasid(struct iommu_domain *domain,
+				      struct iommu_group *group, ioasid_t pasid)
+{
+	void *curr;
+	int ret;
+
+	lockdep_assert_held(&group->mutex);
+
+	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
+	if (curr)
+		return xa_err(curr) ? : -EBUSY;
+
+	ret = __iommu_set_group_pasid(domain, group, pasid);
+	if (ret) {
+		__iommu_remove_group_pasid(group, pasid);
+		xa_erase(&group->pasid_array, pasid);
+	}
+
+	return ret;
+}
+
 /*
  * iommu_attach_device_pasid() - Attach a domain to pasid of device
  * @domain: the iommu domain.
@@ -3453,19 +3474,9 @@  int iommu_attach_device_pasid(struct iommu_domain *domain,
 		return -ENODEV;
 
 	mutex_lock(&group->mutex);
-	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL, domain, GFP_KERNEL);
-	if (curr) {
-		ret = xa_err(curr) ? : -EBUSY;
-		goto out_unlock;
-	}
-
-	ret = __iommu_set_group_pasid(domain, group, pasid);
-	if (ret) {
-		__iommu_remove_group_pasid(group, pasid);
-		xa_erase(&group->pasid_array, pasid);
-	}
-out_unlock:
+	ret = __iommu_group_attach_pasid(domain, group, pasid);
 	mutex_unlock(&group->mutex);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);
@@ -3479,8 +3490,8 @@  EXPORT_SYMBOL_GPL(iommu_attach_device_pasid);
  * The @domain must have been attached to @pasid of the @dev with
  * iommu_attach_device_pasid().
  */
-void iommu_detach_device_pasid(struct iommu_domain *domain, struct device *dev,
-			       ioasid_t pasid)
+void iommu_detach_device_pasid(struct iommu_domain *domain,
+			       struct device *dev, ioasid_t pasid)
 {
 	/* Caller must be a probed driver on dev */
 	struct iommu_group *group = dev->iommu_group;
@@ -3492,6 +3503,49 @@  void iommu_detach_device_pasid(struct iommu_domain *domain, struct device *dev,
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device_pasid);
 
+/**
+ * iommu_replace_device_pasid - replace the domain that a pasid is attached to
+ * @domain: new IOMMU domain to replace with
+ * @dev: the physical device
+ * @pasid: pasid that will be attached to the new domain
+ *
+ * This API allows the pasid to switch domains. Return 0 on success, or an
+ * error. The pasid will roll back to use the old domain if failure. The
+ * caller could call iommu_detach_device_pasid() before free the old domain
+ * in order to avoid use-after-free case.
+ */
+int iommu_replace_device_pasid(struct iommu_domain *domain,
+			       struct device *dev, ioasid_t pasid)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct iommu_domain *old_domain;
+	int ret;
+
+	if (!domain)
+		return -EINVAL;
+
+	if (!group)
+		return -ENODEV;
+
+	mutex_lock(&group->mutex);
+	__iommu_remove_group_pasid(group, pasid);
+	old_domain = xa_erase(&group->pasid_array, pasid);
+	ret = __iommu_group_attach_pasid(domain, group, pasid);
+	if (ret)
+		goto err_rollback;
+	mutex_unlock(&group->mutex);
+
+	return 0;
+
+err_rollback:
+	if (old_domain)
+		__iommu_group_attach_pasid(old_domain, group, pasid);
+	mutex_unlock(&group->mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(iommu_replace_device_pasid, IOMMUFD_INTERNAL);
+
 /*
  * iommu_get_domain_for_dev_pasid() - Retrieve domain for @pasid of @dev
  * @dev: the queried device