[v10,10/10] iommu/vt-d: Add iotlb flush for nested domain

Message ID	20240102143834.146165-11-yi.l.liu@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BAF916414; Tue, 2 Jan 2024 14:38:50 +0000 (UTC) From: Yi Liu <yi.l.liu@intel.com> To: joro@8bytes.org, alex.williamson@redhat.com, jgg@nvidia.com, kevin.tian@intel.com, robin.murphy@arm.com, baolu.lu@linux.intel.com Cc: cohuck@redhat.com, eric.auger@redhat.com, nicolinc@nvidia.com, kvm@vger.kernel.org, mjrosato@linux.ibm.com, chao.p.peng@linux.intel.com, yi.l.liu@intel.com, yi.y.sun@linux.intel.com, peterx@redhat.com, jasowang@redhat.com, shameerali.kolothum.thodi@huawei.com, lulu@redhat.com, suravee.suthikulpanit@amd.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, zhenzhong.duan@intel.com, joao.m.martins@oracle.com, xin.zeng@intel.com, yan.y.zhao@intel.com, j.granados@samsung.com Subject: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain Date: Tue, 2 Jan 2024 06:38:34 -0800 Message-Id: <20240102143834.146165-11-yi.l.liu@intel.com> In-Reply-To: <20240102143834.146165-1-yi.l.liu@intel.com> References: <20240102143834.146165-1-yi.l.liu@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Add iommufd nesting (part 2/2)
	[v10,00/10] Add iommufd nesting (part 2/2)
	[v10,01/10] iommu: Add cache_invalidate_user op
	[v10,02/10] iommufd: Add IOMMU_HWPT_INVALIDATE
	[v10,03/10] iommu: Add iommu_copy_struct_from_user_array helper
	[v10,04/10] iommufd/selftest: Add mock_domain_cache_invalidate_user support
	[v10,05/10] iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
	[v10,06/10] iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl
	[v10,07/10] iommu/vt-d: Allow qi_submit_sync() to return the QI faults
	[v10,08/10] iommu/vt-d: Convert stage-1 cache invalidation to return QI fault
	[v10,09/10] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation
	[v10,10/10] iommu/vt-d: Add iotlb flush for nested domain

Headers

From: Yi Liu <yi.l.liu@intel.com>
To: joro@8bytes.org,
	alex.williamson@redhat.com,
	jgg@nvidia.com,
	kevin.tian@intel.com,
	robin.murphy@arm.com,
	baolu.lu@linux.intel.com
Cc: cohuck@redhat.com,
	eric.auger@redhat.com,
	nicolinc@nvidia.com,
	kvm@vger.kernel.org,
	mjrosato@linux.ibm.com,
	chao.p.peng@linux.intel.com,
	yi.l.liu@intel.com,
	yi.y.sun@linux.intel.com,
	peterx@redhat.com,
	jasowang@redhat.com,
	shameerali.kolothum.thodi@huawei.com,
	lulu@redhat.com,
	suravee.suthikulpanit@amd.com,
	iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org,
	zhenzhong.duan@intel.com,
	joao.m.martins@oracle.com,
	xin.zeng@intel.com,
	yan.y.zhao@intel.com,
	j.granados@samsung.com
Subject: [PATCH v10 10/10] iommu/vt-d: Add iotlb flush for nested domain
Date: Tue,  2 Jan 2024 06:38:34 -0800
Message-Id: <20240102143834.146165-11-yi.l.liu@intel.com>
In-Reply-To: <20240102143834.146165-1-yi.l.liu@intel.com>
References: <20240102143834.146165-1-yi.l.liu@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Commit Message

Yi Liu Jan. 2, 2024, 2:38 p.m. UTC

From: Lu Baolu <baolu.lu@linux.intel.com>

This implements the .cache_invalidate_user() callback to support iotlb
flush for nested domain.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/intel/nested.c | 107 +++++++++++++++++++++++++++++++++++
 1 file changed, 107 insertions(+)

Comments

Jason Gunthorpe Jan. 2, 2024, 6:44 p.m. UTC | #1

On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:

> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
> +				     unsigned long npages, bool ih, u32 *error)
> +{
> +	struct iommu_domain_info *info;
> +	unsigned long i;
> +	unsigned mask;
> +	u32 fault;
> +
> +	xa_for_each(&domain->iommu_array, i, info)
> +		qi_flush_piotlb(info->iommu,
> +				domain_id_iommu(domain, info->iommu),
> +				IOMMU_NO_PASID, addr, npages, ih, NULL);

This locking on the xarray is messed up throughout the driver. There
could be a concurrent detach at this point which will free info and
UAF this.

This seems to be a systemic issue, so I'm going to ignore it here, but
please make a series to fix it completely.

xarray is probably a bad data structure to manage attachment, a linked
list is going to use less memory in most cases and you need a mutex
lock anyhow.

Jason
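
The linked-list-plus-mutex arrangement suggested above can be sketched in
userspace as follows. This is only an illustration under stated assumptions,
not the driver's code: struct attach_info, struct domain and the domain_*()
helpers are hypothetical names, and a pthread mutex stands in for a kernel
mutex. The point it shows is that the flush walk and the detach path take the
same lock, so an entry cannot be freed while the walker is dereferencing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct iommu_domain_info. */
struct attach_info {
	int iommu_id;
	unsigned int refcnt;
	struct attach_info *next;
};

/* Hypothetical stand-in for the per-domain attachment tracking. */
struct domain {
	pthread_mutex_t lock;
	struct attach_info *head;
};

static void domain_init(struct domain *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->head = NULL;
}

/* Attach: bump the refcount of an existing entry or link a new one. */
static int domain_attach(struct domain *d, int iommu_id)
{
	struct attach_info *info;
	int ret = 0;

	pthread_mutex_lock(&d->lock);
	for (info = d->head; info; info = info->next) {
		if (info->iommu_id == iommu_id) {
			info->refcnt++;
			goto out;
		}
	}
	info = calloc(1, sizeof(*info));
	if (!info) {
		ret = -1;
		goto out;
	}
	info->iommu_id = iommu_id;
	info->refcnt = 1;
	info->next = d->head;
	d->head = info;
out:
	pthread_mutex_unlock(&d->lock);
	return ret;
}

/* Detach: unlink and free under the same lock the flush walk holds. */
static void domain_detach(struct domain *d, int iommu_id)
{
	struct attach_info **pp, *info;

	pthread_mutex_lock(&d->lock);
	for (pp = &d->head; (info = *pp) != NULL; pp = &info->next) {
		if (info->iommu_id != iommu_id)
			continue;
		if (--info->refcnt == 0) {
			*pp = info->next;
			free(info);
		}
		break;
	}
	pthread_mutex_unlock(&d->lock);
}

/* Flush walk: every dereference of info happens with the lock held. */
static int domain_flush_all(struct domain *d, void (*flush)(int iommu_id))
{
	struct attach_info *info;
	int visited = 0;

	pthread_mutex_lock(&d->lock);
	for (info = d->head; info; info = info->next) {
		if (flush)
			flush(info->iommu_id);
		visited++;
	}
	pthread_mutex_unlock(&d->lock);
	return visited;
}
```

A caller would run domain_init() once, pair each domain_attach() with a
domain_detach(), and call domain_flush_all() from the invalidation path; the
return value is just the number of entries visited.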

Yi Liu Jan. 3, 2024, 1:33 a.m. UTC | #2

On 2024/1/3 02:44, Jason Gunthorpe wrote:
> On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
> 
>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
>> +				     unsigned long npages, bool ih, u32 *error)
>> +{
>> +	struct iommu_domain_info *info;
>> +	unsigned long i;
>> +	unsigned mask;
>> +	u32 fault;
>> +
>> +	xa_for_each(&domain->iommu_array, i, info)
>> +		qi_flush_piotlb(info->iommu,
>> +				domain_id_iommu(domain, info->iommu),
>> +				IOMMU_NO_PASID, addr, npages, ih, NULL);
> 
> This locking on the xarray is messed up throughout the driver. There
> could be a concurrent detach at this point which will free info and
> UAF this.

Hmm, xa_for_each() takes and releases the RCU lock, and according to
domain_detach_iommu(), info is freed after xa_erase(). For an info already
stored in the xarray, xa_erase() should return only after the RCU lock is
released, shouldn't it? Any idea? @Baolu

void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
{
	struct iommu_domain_info *info;

	spin_lock(&iommu->lock);
	info = xa_load(&domain->iommu_array, iommu->seq_id);
	if (--info->refcnt == 0) {
		clear_bit(info->did, iommu->domain_ids);
		xa_erase(&domain->iommu_array, iommu->seq_id);
		domain->nid = NUMA_NO_NODE;
		domain_update_iommu_cap(domain);
		kfree(info);
	}
	spin_unlock(&iommu->lock);
}

> This seems to be a systemic issue, so I'm going to ignore it here, but
> please make a series to fix it completely.

Yeah, this code follows the same pattern as the other places that reference
iommu_array. If there is a real problem, we can check with Baolu and Kevin.

> xarray is probably a bad data structure to manage attachment, a linked
> list is going to use less memory in most cases and you need a mutex
> lock anyhow.

below is the commit that introduces iommu_array.

commit ba949f4cd4c39c587e9b722ac7eb7f7e8a42dace
Author: Lu Baolu <baolu.lu@linux.intel.com>
Date:   Tue Jul 12 08:09:05 2022 +0800

     iommu/vt-d: Refactor iommu information of each domain

     When a DMA domain is attached to a device, it needs to allocate a domain
     ID from its IOMMU. Currently, the domain ID information is stored in two
     static arrays embedded in the domain structure. This can lead to memory
     waste when the driver is running on a small platform.

     This optimizes these static arrays by replacing them with an xarray and
     consuming memory on demand.

     Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
     Reviewed-by: Kevin Tian <kevin.tian@intel.com>
     Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
     Link: https://lore.kernel.org/r/20220702015610.2849494-4-baolu.lu@linux.intel.com
     Signed-off-by: Joerg Roedel <jroedel@suse.de>

Baolu Lu Jan. 3, 2024, 3:06 a.m. UTC | #3

On 2024/1/3 9:33, Yi Liu wrote:
> On 2024/1/3 02:44, Jason Gunthorpe wrote:
>> On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
>>
>>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 
>>> addr,
>>> +                     unsigned long npages, bool ih, u32 *error)
>>> +{
>>> +    struct iommu_domain_info *info;
>>> +    unsigned long i;
>>> +    unsigned mask;
>>> +    u32 fault;
>>> +
>>> +    xa_for_each(&domain->iommu_array, i, info)
>>> +        qi_flush_piotlb(info->iommu,
>>> +                domain_id_iommu(domain, info->iommu),
>>> +                IOMMU_NO_PASID, addr, npages, ih, NULL);
>>
>> This locking on the xarray is messed up throughout the driver. There
>> could be a concurrent detach at this point which will free info and
>> UAF this.
> 
> Hmm, xa_for_each() takes and releases the RCU lock, and according to
> domain_detach_iommu(), info is freed after xa_erase(). For an info already
> stored in the xarray, xa_erase() should return only after the RCU lock is
> released, shouldn't it? Any idea? @Baolu

I once thought locking for xarray is self-contained. I need more thought
on this before taking further action.

Best regards,
baolu

Jason Gunthorpe Jan. 3, 2024, 12:44 p.m. UTC | #4

On Wed, Jan 03, 2024 at 11:06:19AM +0800, Baolu Lu wrote:
> On 2024/1/3 9:33, Yi Liu wrote:
> > On 2024/1/3 02:44, Jason Gunthorpe wrote:
> > > On Tue, Jan 02, 2024 at 06:38:34AM -0800, Yi Liu wrote:
> > > 
> > > > +static void intel_nested_flush_cache(struct dmar_domain
> > > > *domain, u64 addr,
> > > > +                     unsigned long npages, bool ih, u32 *error)
> > > > +{
> > > > +    struct iommu_domain_info *info;
> > > > +    unsigned long i;
> > > > +    unsigned mask;
> > > > +    u32 fault;
> > > > +
> > > > +    xa_for_each(&domain->iommu_array, i, info)
> > > > +        qi_flush_piotlb(info->iommu,
> > > > +                domain_id_iommu(domain, info->iommu),
> > > > +                IOMMU_NO_PASID, addr, npages, ih, NULL);
> > > 
> > > This locking on the xarray is messed up throughout the driver. There
> > > could be a concurrent detach at this point which will free info and
> > > UAF this.
> > 
> > > Hmm, xa_for_each() takes and releases the RCU lock, and according to
> > > domain_detach_iommu(), info is freed after xa_erase(). For an info already
> > > stored in the xarray, xa_erase() should return only after the RCU lock is
> > > released, shouldn't it? Any idea? @Baolu
> 
> I once thought locking for xarray is self-contained. I need more thought
> on this before taking further action.

The locking of xarray itself is self-contained, but once it returns a
value then the user has to provide locking to protect the value.

In this case the xarray storage memory itself will not UAF but the
info pointer to memory returned from the xarray will.

I've been thinking arm/amd/intel all need the same datastructure here,
and it is a bit complicated. We should try to make a library to handle
it..

It is straightforward except for the RCU list walk for invalidation..

Jason
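
The distinction drawn above, that the container protects its own slots but not
the object a lookup hands back, can be illustrated with a small userspace
sketch. Everything here is hypothetical (struct table, struct info, the
table_*() helpers, the live_infos counter), and reference counting is only one
possible way for the user to protect the returned value; it is not what the
driver does. The lookup takes a reference while the table lock is held, so a
later erase removes the slot without freeing the object out from under the
caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NSLOTS 4

/* Hypothetical stand-in for the object stored in the container. */
struct info {
	int did;
	unsigned int refs;
};

/* Hypothetical container whose lock protects only its own slots. */
struct table {
	pthread_mutex_t lock;
	struct info *slot[NSLOTS];
};

static int live_infos;	/* objects allocated and not yet freed (demo only) */

static void table_init(struct table *t)
{
	pthread_mutex_init(&t->lock, NULL);
	for (int i = 0; i < NSLOTS; i++)
		t->slot[i] = NULL;
}

/* Drop one reference and free at zero; the caller holds t->lock. */
static void info_put_locked(struct info *info)
{
	if (--info->refs == 0) {
		free(info);
		live_infos--;
	}
}

/* Store a new object; the table keeps one reference of its own. */
static int table_store(struct table *t, unsigned int idx, int did)
{
	struct info *info;
	int ret = -1;

	if (idx >= NSLOTS)
		return -1;
	info = calloc(1, sizeof(*info));
	if (!info)
		return -1;
	info->did = did;
	info->refs = 1;

	pthread_mutex_lock(&t->lock);
	if (!t->slot[idx]) {
		t->slot[idx] = info;
		live_infos++;
		ret = 0;
	}
	pthread_mutex_unlock(&t->lock);
	if (ret)
		free(info);
	return ret;
}

/* Lookup that pins the object before the lock is dropped. */
static struct info *table_get(struct table *t, unsigned int idx)
{
	struct info *info;

	pthread_mutex_lock(&t->lock);
	info = idx < NSLOTS ? t->slot[idx] : NULL;
	if (info)
		info->refs++;
	pthread_mutex_unlock(&t->lock);
	return info;
}

/* Release a reference obtained from table_get(). */
static void table_put(struct table *t, struct info *info)
{
	pthread_mutex_lock(&t->lock);
	info_put_locked(info);
	pthread_mutex_unlock(&t->lock);
}

/* Erase: clear the slot and drop only the table's own reference. */
static void table_erase(struct table *t, unsigned int idx)
{
	struct info *info;

	pthread_mutex_lock(&t->lock);
	info = idx < NSLOTS ? t->slot[idx] : NULL;
	if (info) {
		t->slot[idx] = NULL;
		info_put_locked(info);
	}
	pthread_mutex_unlock(&t->lock);
}
```

With this shape, a caller that did table_get() before a concurrent
table_erase() still holds a valid object until its own table_put(). A plain
lookup that returned the pointer without pinning it would leave the caller
exposed to the use-after-free described above.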

diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index b5a5563ab32c..f1f86437939c 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -73,9 +73,116 @@  static void intel_nested_domain_free(struct iommu_domain *domain)
 	kfree(to_dmar_domain(domain));
 }
 
+static void nested_flush_dev_iotlb(struct dmar_domain *domain, u64 addr,
+				   unsigned mask, u32 *fault)
+{
+	struct device_domain_info *info;
+	unsigned long flags;
+	u16 sid, qdep;
+
+	spin_lock_irqsave(&domain->lock, flags);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!info->ats_enabled)
+			continue;
+		sid = info->bus << 8 | info->devfn;
+		qdep = info->ats_qdep;
+		qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
+				   qdep, addr, mask, fault);
+		quirk_extra_dev_tlb_flush(info, addr, mask,
+					  IOMMU_NO_PASID, qdep);
+	}
+	spin_unlock_irqrestore(&domain->lock, flags);
+}
+
+static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
+				     unsigned long npages, bool ih, u32 *error)
+{
+	struct iommu_domain_info *info;
+	unsigned long i;
+	unsigned mask;
+	u32 fault;
+
+	xa_for_each(&domain->iommu_array, i, info)
+		qi_flush_piotlb(info->iommu,
+				domain_id_iommu(domain, info->iommu),
+				IOMMU_NO_PASID, addr, npages, ih, NULL);
+
+	if (!domain->has_iotlb_device)
+		return;
+
+	if (npages == U64_MAX)
+		mask = 64 - VTD_PAGE_SHIFT;
+	else
+		mask = ilog2(__roundup_pow_of_two(npages));
+
+	nested_flush_dev_iotlb(domain, addr, mask, &fault);
+
+	*error = 0;
+	/*
+	 * Invalidation queue error (i.e. IQE) will not be reported to user
+	 * as it's caused only by driver internal bug.
+	 */
+	if (fault & DMA_FSTS_ICE)
+		*error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ICE;
+	if (fault & DMA_FSTS_ITE)
+		*error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ITE;
+}
+
+static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
+					      struct iommu_user_data_array *array)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	struct iommu_hwpt_vtd_s1_invalidate inv_entry;
+	u32 processed = 0;
+	int ret = 0;
+	u32 index;
+
+	if (array->type != IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	for (index = 0; index < array->entry_num; index++) {
+		ret = iommu_copy_struct_from_user_array(&inv_entry, array,
+							IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+							index, hw_error);
+		if (ret)
+			break;
+
+		if (inv_entry.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) {
+			ret = -EOPNOTSUPP;
+			break;
+		}
+
+		if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
+		    ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
+			ret = -EINVAL;
+			break;
+		}
+
+		intel_nested_flush_cache(dmar_domain, inv_entry.addr,
+					 inv_entry.npages,
+					 inv_entry.flags & IOMMU_VTD_INV_FLAGS_LEAF,
+					 &inv_entry.hw_error);
+
+		ret = iommu_respond_struct_to_user_array(array, index,
+							 (void *)&inv_entry,
+							 sizeof(inv_entry));
+		if (ret)
+			break;
+
+		processed++;
+	}
+
+out:
+	array->entry_num = processed;
+	return ret;
+}
+
 static const struct iommu_domain_ops intel_nested_domain_ops = {
 	.attach_dev		= intel_nested_attach_dev,
 	.free			= intel_nested_domain_free,
+	.cache_invalidate_user	= intel_nested_cache_invalidate_user,
 };
 
 struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
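
For readers following the arithmetic in the hunk above, here is a minimal
userspace sketch of the per-entry validation and of the address-mask
computation. It assumes 4 KiB VT-d pages (VTD_PAGE_SHIFT of 12); check_entry()
and npages_to_mask() are hypothetical names introduced for illustration, and
the loop is a plain-C stand-in for ilog2(__roundup_pow_of_two(npages)), not
the kernel helpers themselves.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_4K	12			/* assumed VTD_PAGE_SHIFT */
#define PAGE_SIZE_4K	(UINT64_C(1) << PAGE_SHIFT_4K)

/*
 * Mirror the two checks the callback applies to each entry:
 * the address must be page aligned, and npages == U64_MAX
 * ("everything") is accepted only together with addr == 0.
 * Returns 0 if the entry is acceptable, -1 otherwise.
 */
static int check_entry(uint64_t addr, uint64_t npages)
{
	if (addr & (PAGE_SIZE_4K - 1))
		return -1;
	if (npages == UINT64_MAX && addr != 0)
		return -1;
	return 0;
}

/*
 * Smallest order such that (1 << order) >= npages, i.e. the range is
 * rounded up to a power-of-two number of pages. npages is expected to
 * be nonzero. The U64_MAX case maps to the full 64-bit address space.
 */
static unsigned int npages_to_mask(uint64_t npages)
{
	unsigned int order = 0;

	if (npages == UINT64_MAX)
		return 64 - PAGE_SHIFT_4K;

	while (order < 63 && (UINT64_C(1) << order) < npages)
		order++;
	return order;
}
```

For example, a request for 3 pages yields a mask of 2, so the device-TLB
invalidation covers 4 pages; the rounding can therefore invalidate more than
the caller asked for, never less.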
