
[v1,07/17] dma-mapping: Implement link/unlink ranges API

Message ID f8c7f160c9ae97fef4ccd355f9979727552c7374.1730298502.git.leon@kernel.org (mailing list archive)
State Superseded
Series: Provide a new two step DMA mapping API

Commit Message

Leon Romanovsky Oct. 30, 2024, 3:12 p.m. UTC
From: Leon Romanovsky <leonro@nvidia.com>

Introduce new DMA APIs to perform DMA linkage of buffers
in layers higher than the DMA API.

With the proposed API, callers perform the following steps
(a rough C sketch follows the pseudocode below).
In map path:
	if (dma_can_use_iova(...))
	    dma_iova_alloc()
	    for (page in range)
	       dma_iova_link_next(...)
	    dma_iova_sync(...)
	else
	     /* Fallback to legacy map pages */
	     for (all pages)
	       dma_map_page(...)

In unmap path:
	if (dma_can_use_iova(...))
	     dma_iova_destroy()
	else
	     for (all pages)
		dma_unmap_page(...)
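
A rough sketch in C of how a caller might use this (illustrative only:
example_map_pages()/example_unmap_pages() and the "legacy" dma_addr_t
array are made up for this sketch; the dma_iova_*() and
dma_map_page()/dma_unmap_page() calls use the signatures declared by
this series):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

static int example_map_pages(struct device *dev, struct dma_iova_state *state,
		struct page **pages, unsigned int npages, dma_addr_t *legacy,
		enum dma_data_direction dir)
{
	size_t mapped = 0;
	unsigned int i;
	int ret;

	if (dma_iova_try_alloc(dev, state, page_to_phys(pages[0]),
			       (size_t)npages * PAGE_SIZE)) {
		/* IOVA path: link each page, then sync the IOTLB once */
		for (i = 0; i < npages; i++) {
			ret = dma_iova_link(dev, state, page_to_phys(pages[i]),
					    mapped, PAGE_SIZE, dir, 0);
			if (ret)
				goto err_destroy;
			mapped += PAGE_SIZE;
		}
		ret = dma_iova_sync(dev, state, 0, mapped);
		if (ret)
			goto err_destroy;
		return 0;
err_destroy:
		dma_iova_destroy(dev, state, mapped, dir, 0);
		return ret;
	}

	/* Fallback to legacy per-page mapping */
	for (i = 0; i < npages; i++) {
		legacy[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE, dir);
		if (dma_mapping_error(dev, legacy[i])) {
			while (i--)
				dma_unmap_page(dev, legacy[i], PAGE_SIZE, dir);
			return -ENOMEM;
		}
	}
	return 0;
}

static void example_unmap_pages(struct device *dev,
		struct dma_iova_state *state, dma_addr_t *legacy,
		unsigned int npages, enum dma_data_direction dir)
{
	unsigned int i;

	if (dma_use_iova(state)) {
		dma_iova_destroy(dev, state, (size_t)npages * PAGE_SIZE,
				 dir, 0);
		return;
	}
	for (i = 0; i < npages; i++)
		dma_unmap_page(dev, legacy[i], PAGE_SIZE, dir);
}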

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/iommu/dma-iommu.c   | 259 ++++++++++++++++++++++++++++++++++++
 include/linux/dma-mapping.h |  32 +++++
 2 files changed, 291 insertions(+)

Comments

Robin Murphy Oct. 31, 2024, 9:18 p.m. UTC | #1
On 30/10/2024 3:12 pm, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Introduce new DMA APIs to perform DMA linkage of buffers
> in layers higher than the DMA API.
> 
> With the proposed API, callers perform the following steps.
> In map path:
> 	if (dma_can_use_iova(...))
> 	    dma_iova_alloc()
> 	    for (page in range)
> 	       dma_iova_link_next(...)
> 	    dma_iova_sync(...)
> 	else
> 	     /* Fallback to legacy map pages */
>               for (all pages)
> 	       dma_map_page(...)
> 
> In unmap path:
> 	if (dma_can_use_iova(...))
> 	     dma_iova_destroy()
> 	else
> 	     for (all pages)
> 		dma_unmap_page(...)
> 
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>   drivers/iommu/dma-iommu.c   | 259 ++++++++++++++++++++++++++++++++++++
>   include/linux/dma-mapping.h |  32 +++++
>   2 files changed, 291 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index e1eaad500d27..4a504a879cc0 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1834,6 +1834,265 @@ void dma_iova_free(struct device *dev, struct dma_iova_state *state)
>   }
>   EXPORT_SYMBOL_GPL(dma_iova_free);
>   
> +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
> +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	bool coherent = dev_is_dma_coherent(dev);
> +
> +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))

If you really imagine this can support non-coherent operation and 
DMA_ATTR_SKIP_CPU_SYNC, where are the corresponding explicit sync 
operations? dma_sync_single_*() sure as heck aren't going to work...

In fact, same goes for SWIOTLB bouncing even in the coherent case.

> +		arch_sync_dma_for_device(phys, size, dir);

Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
through here as well, how can we be sure this isn't going to explode?

> +
> +	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
> +			dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
> +}
> +
> +static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
> +		phys_addr_t phys, size_t bounce_len,
> +		enum dma_data_direction dir, unsigned long attrs,
> +		size_t iova_start_pad)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iova_domain *iovad = &domain->iova_cookie->iovad;
> +	phys_addr_t bounce_phys;
> +	int error;
> +
> +	bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
> +	if (bounce_phys == DMA_MAPPING_ERROR)
> +		return -ENOMEM;
> +
> +	error = __dma_iova_link(dev, addr - iova_start_pad,
> +			bounce_phys - iova_start_pad,
> +			iova_align(iovad, bounce_len), dir, attrs);
> +	if (error)
> +		swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
> +				attrs);
> +	return error;
> +}
> +
> +static int iommu_dma_iova_link_swiotlb(struct device *dev,
> +		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, phys);
> +	size_t iova_end_pad = iova_offset(iovad, phys + size);

I thought the code below was wrong until I double-checked and realised 
that this is not what its name implies it to be...

> +	dma_addr_t addr = state->addr + offset;
> +	size_t mapped = 0;
> +	int error;
> +
> +	if (iova_start_pad) {
> +		size_t bounce_len = min(size, iovad->granule - iova_start_pad);
> +
> +		error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
> +				bounce_len, dir, attrs, iova_start_pad);
> +		if (error)
> +			return error;
> +		state->__size |= DMA_IOVA_USE_SWIOTLB;
> +
> +		mapped += bounce_len;
> +		size -= bounce_len;
> +		if (!size)
> +			return 0;
> +	}
> +
> +	size -= iova_end_pad;
> +	error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
> +			attrs);
> +	if (error)
> +		goto out_unmap;
> +	mapped += size;
> +
> +	if (iova_end_pad) {
> +		error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
> +				phys + mapped, iova_end_pad, dir, attrs, 0);
> +		if (error)
> +			goto out_unmap;
> +		state->__size |= DMA_IOVA_USE_SWIOTLB;
> +	}
> +
> +	return 0;
> +
> +out_unmap:
> +	dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
> +	return error;
> +}
> +
> +/**
> + * dma_iova_link - Link a range of IOVA space
> + * @dev: DMA device
> + * @state: IOVA state
> + * @phys: physical address to link
> + * @offset: offset into the IOVA state to map into
> + * @size: size of the buffer
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Link a range of IOVA space for the given IOVA state without IOTLB sync.
> + * This function is used to link multiple physical addresses in contiguous
> + * IOVA space without performing a costly IOTLB sync.
> + *
> + * The caller is responsible for calling dma_iova_sync() to sync the IOTLB at
> + * the end of linkage.
> + */
> +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, phys);
> +
> +	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
> +		return -EIO;
> +
> +	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
> +		return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
> +				size, dir, attrs);
> +
> +	return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
> +			phys - iova_start_pad,
> +			iova_align(iovad, size + iova_start_pad), dir, attrs);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_link);
> +
> +/**
> + * dma_iova_sync - Sync IOTLB
> + * @dev: DMA device
> + * @state: IOVA state
> + * @offset: offset into the IOVA state to sync
> + * @size: size of the buffer
> + *
> + * Sync IOTLB for the given IOVA state. This function should be called on
> + * the IOVA-contiguous range created by one or more dma_iova_link() calls
> + * to sync the IOTLB.
> + */
> +int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	dma_addr_t addr = state->addr + offset;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +
> +	return iommu_sync_map(domain, addr - iova_start_pad,
> +		      iova_align(iovad, size + iova_start_pad));
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_sync);
> +
> +static void iommu_dma_iova_unlink_range_slow(struct device *dev,
> +		dma_addr_t addr, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +	dma_addr_t end = addr + size;
> +
> +	do {
> +		phys_addr_t phys;
> +		size_t len;
> +
> +		phys = iommu_iova_to_phys(domain, addr);
> +		if (WARN_ON(!phys))
> +			continue;
> +		len = min_t(size_t,
> +			end - addr, iovad->granule - iova_start_pad);
> +
> +		if (!dev_is_dma_coherent(dev) &&
> +		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> +			arch_sync_dma_for_cpu(phys, len, dir);
> +
> +		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);

How do you know that "phys" and "len" match what was originally 
allocated and bounced in, and this isn't going to try to bounce out too 
much, free the wrong slot, or anything else nasty? If it's not supposed 
to be intentional that a sub-granule buffer can be linked to any offset 
in the middle of the IOVA range as long as its original physical address 
is aligned to the IOVA granule size(?), why try to bounce anywhere other 
than the ends of the range at all?

> +
> +		addr += len;
> +		iova_start_pad = 0;
> +	} while (addr < end);
> +}
> +
> +static void __iommu_dma_iova_unlink(struct device *dev,
> +		struct dma_iova_state *state, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs,
> +		bool free_iova)
> +{
> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
> +	struct iova_domain *iovad = &cookie->iovad;
> +	dma_addr_t addr = state->addr + offset;
> +	size_t iova_start_pad = iova_offset(iovad, addr);
> +	struct iommu_iotlb_gather iotlb_gather;
> +	size_t unmapped;
> +
> +	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
> +	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
> +		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
> +
> +	iommu_iotlb_gather_init(&iotlb_gather);
> +	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);

Is it really worth the bother?

> +	size = iova_align(iovad, size + iova_start_pad);
> +	addr -= iova_start_pad;
> +	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
> +	WARN_ON(unmapped != size);
> +
> +	if (!iotlb_gather.queued)
> +		iommu_iotlb_sync(domain, &iotlb_gather);
> +	if (free_iova)
> +		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);

There's no guarantee that "size" is the correct value here, so this has 
every chance of corrupting the IOVA domain.

> +}
> +
> +/**
> + * dma_iova_unlink - Unlink a range of IOVA space
> + * @dev: DMA device
> + * @state: IOVA state
> + * @offset: offset into the IOVA state to unlink
> + * @size: size of the buffer
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Unlink a range of IOVA space for the given IOVA state.

If I initially link a large range in one go, then unlink a small part of 
it, what behaviour can I expect?

Thanks,
Robin.

> + */
> +void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	 __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_unlink);
> +
> +/**
> + * dma_iova_destroy - Finish a DMA mapping transaction
> + * @dev: DMA device
> + * @state: IOVA state
> + * @mapped_len: number of bytes to unmap
> + * @dir: DMA direction
> + * @attrs: attributes of mapping properties
> + *
> + * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
> + * range of IOVA from dma_addr to @mapped_len must all be linked, and be the
> + * only linked IOVA in state.
> + */
> +void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
> +		size_t mapped_len, enum dma_data_direction dir,
> +		unsigned long attrs)
> +{
> +	if (mapped_len)
> +		__iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
> +				true);
> +	else
> +		/*
> +		 * We can be here if first call to dma_iova_link() failed and
> +		 * there is nothing to unlink, so let's be more clear.
> +		 */
> +		dma_iova_free(dev, state);
> +}
> +EXPORT_SYMBOL_GPL(dma_iova_destroy);
> +
>   void iommu_setup_dma_ops(struct device *dev)
>   {
>   	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index 817f11bce7bc..8074a3b5c807 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -313,6 +313,17 @@ static inline bool dma_use_iova(struct dma_iova_state *state)
>   bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
>   		phys_addr_t phys, size_t size);
>   void dma_iova_free(struct device *dev, struct dma_iova_state *state);
> +void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
> +		size_t mapped_len, enum dma_data_direction dir,
> +		unsigned long attrs);
> +int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size);
> +int dma_iova_link(struct device *dev, struct dma_iova_state *state,
> +		phys_addr_t phys, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs);
> +void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size, enum dma_data_direction dir,
> +		unsigned long attrs);
>   #else /* CONFIG_IOMMU_DMA */
>   static inline bool dma_use_iova(struct dma_iova_state *state)
>   {
> @@ -327,6 +338,27 @@ static inline void dma_iova_free(struct device *dev,
>   		struct dma_iova_state *state)
>   {
>   }
> +static inline void dma_iova_destroy(struct device *dev,
> +		struct dma_iova_state *state, size_t mapped_len,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +}
> +static inline int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
> +		size_t offset, size_t size)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int dma_iova_link(struct device *dev,
> +		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
> +		size_t size, enum dma_data_direction dir, unsigned long attrs)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void dma_iova_unlink(struct device *dev,
> +		struct dma_iova_state *state, size_t offset, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs)
> +{
> +}
>   #endif /* CONFIG_IOMMU_DMA */
>   
>   #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
Christoph Hellwig Nov. 4, 2024, 9:10 a.m. UTC | #2
On Thu, Oct 31, 2024 at 09:18:07PM +0000, Robin Murphy wrote:
>>   +static int __dma_iova_link(struct device *dev, dma_addr_t addr,
>> +		phys_addr_t phys, size_t size, enum dma_data_direction dir,
>> +		unsigned long attrs)
>> +{
>> +	bool coherent = dev_is_dma_coherent(dev);
>> +
>> +	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>
> If you really imagine this can support non-coherent operation and 
> DMA_ATTR_SKIP_CPU_SYNC, where are the corresponding explicit sync 
> operations? dma_sync_single_*() sure as heck aren't going to work...
>
> In fact, same goes for SWIOTLB bouncing even in the coherent case.

Not with explicit sync operations.  But plain map/unmap works, I've
actually verified that with nvme.  And that's a pretty large use
case.

>> +		arch_sync_dma_for_device(phys, size, dir);
>
> Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
> through here as well, how can we be sure this isn't going to explode?

That's a good point.  Only mapped through host bridge P2P can even
end up here, so the address is a perfectly valid physical address
in the host.  But I'm not sure if all arch_sync_dma_for_device
implementations handle IOMMU memory fine.

>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>> +	struct iova_domain *iovad = &cookie->iovad;
>> +	size_t iova_start_pad = iova_offset(iovad, phys);
>> +	size_t iova_end_pad = iova_offset(iovad, phys + size);
>
> I thought the code below was wrong until I double-checked and realised that 
> this is not what its name implies it to be...

Which variable does this refer to, and what would be a better name?

>> +		phys = iommu_iova_to_phys(domain, addr);
>> +		if (WARN_ON(!phys))
>> +			continue;
>> +		len = min_t(size_t,
>> +			end - addr, iovad->granule - iova_start_pad);
>> +
>> +		if (!dev_is_dma_coherent(dev) &&
>> +		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
>> +			arch_sync_dma_for_cpu(phys, len, dir);
>> +
>> +		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
>
> How do you know that "phys" and "len" match what was originally allocated 
> and bounced in, and this isn't going to try to bounce out too much, free 
> the wrong slot, or anything else nasty? If it's not supposed to be 
> intentional that a sub-granule buffer can be linked to any offset in the 
> middle of the IOVA range as long as its original physical address is 
> aligned to the IOVA granule size(?), why try to bounce anywhere other than 
> the ends of the range at all?

Mostly because the code is simpler and unless misused it just works.
But it might be worth adding explicit checks for the start and end.

>> +static void __iommu_dma_iova_unlink(struct device *dev,
>> +		struct dma_iova_state *state, size_t offset, size_t size,
>> +		enum dma_data_direction dir, unsigned long attrs,
>> +		bool free_iova)
>> +{
>> +	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>> +	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>> +	struct iova_domain *iovad = &cookie->iovad;
>> +	dma_addr_t addr = state->addr + offset;
>> +	size_t iova_start_pad = iova_offset(iovad, addr);
>> +	struct iommu_iotlb_gather iotlb_gather;
>> +	size_t unmapped;
>> +
>> +	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
>> +	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
>> +		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
>> +
>> +	iommu_iotlb_gather_init(&iotlb_gather);
>> +	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
>
> Is it really worth the bother?

Worth what?

>> +	size = iova_align(iovad, size + iova_start_pad);
>> +	addr -= iova_start_pad;
>> +	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
>> +	WARN_ON(unmapped != size);
>> +
>> +	if (!iotlb_gather.queued)
>> +		iommu_iotlb_sync(domain, &iotlb_gather);
>> +	if (free_iova)
>> +		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);
>
> There's no guarantee that "size" is the correct value here, so this has 
> every chance of corrupting the IOVA domain.

Yes, but the same is true for every user of the iommu_* API as well.

>> +/**
>> + * dma_iova_unlink - Unlink a range of IOVA space
>> + * @dev: DMA device
>> + * @state: IOVA state
>> + * @offset: offset into the IOVA state to unlink
>> + * @size: size of the buffer
>> + * @dir: DMA direction
>> + * @attrs: attributes of mapping properties
>> + *
>> + * Unlink a range of IOVA space for the given IOVA state.
>
> If I initially link a large range in one go, then unlink a small part of 
> it, what behaviour can I expect?

As in map say 128k and then unmap 4k?  It will just work, even if that
is not the intended use case, which is either to map everything up front
and unmap everything together, or the HMM pattern of constant random
mapping and unmapping at page-size granularity.
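
For concreteness, something like this hypothetical snippet (assuming
state came from a successful dma_iova_try_alloc() and phys is granule
aligned; the function name is made up):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/* Link and sync 128k at offset 0, then later drop only the first 4k. */
static int example_partial_unlink(struct device *dev,
		struct dma_iova_state *state, phys_addr_t phys)
{
	int ret;

	ret = dma_iova_link(dev, state, phys, 0, SZ_128K, DMA_TO_DEVICE, 0);
	if (ret)
		return ret;
	ret = dma_iova_sync(dev, state, 0, SZ_128K);
	if (ret)
		return ret;

	/* The remaining [4k, 128k) range stays linked and usable. */
	dma_iova_unlink(dev, state, 0, SZ_4K, DMA_TO_DEVICE, 0);
	return 0;
}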
Jason Gunthorpe Nov. 4, 2024, 12:19 p.m. UTC | #3
On Mon, Nov 04, 2024 at 10:10:48AM +0100, Christoph Hellwig wrote:
> >> +		arch_sync_dma_for_device(phys, size, dir);
> >
> > Plus if the aim is to pass P2P and whatever arbitrary physical addresses 
> > through here as well, how can we be sure this isn't going to explode?
> 
> That's a good point.  Only mapped through host bridge P2P can even
> end up here, so the address is a perfectly valid physical address
> in the host.  But I'm not sure if all arch_sync_dma_for_device
> implementations handle IOMMU memory fine.

I was told on x86 if you do a cache flush operation on MMIO there is a
chance it will MCE. Recently had some similar discussions about ARM
where it was asserted some platforms may have similar.

It would be safest to only call arch flushing calls on memory that is
mapped cachable. We can assume that a P2P target is never CPU
mapped cachable, regardless of how the DMA is routed.

Jason
Christoph Hellwig Nov. 4, 2024, 12:53 p.m. UTC | #4
On Mon, Nov 04, 2024 at 08:19:24AM -0400, Jason Gunthorpe wrote:
> > That's a good point.  Only mapped through host bridge P2P can even
> > end up here, so the address is a perfectly valid physical address
> > in the host.  But I'm not sure if all arch_sync_dma_for_device
> > implementations handle IOMMU memory fine.
> 
> I was told on x86 if you do a cache flush operation on MMIO there is a
> chance it will MCE. Recently had some similar discussions about ARM
> where it was asserted some platforms may have similar.

On x86 we never flush caches for DMA operations anyway, so x86 isn't
really the concern here, but architectures that do cache incoherent DMA
to PCIe devices.  Which isn't a whole lot as most SOCs try to avoid that
for PCIe even if they lack DMA coherent for lesser peripherals, but I bet
there are some on arm/arm64 and maybe riscv or mips.

> It would be safest to only call arch flushing calls on memory that is
> mapped cachable. We can assume that a P2P target is never CPU
> mapped cachable, regardless of how the DMA is routed.

Yes.  I.e. force DMA_ATTR_SKIP_CPU_SYNC for P2P.
Leon Romanovsky Nov. 7, 2024, 2:50 p.m. UTC | #5
On Mon, Nov 04, 2024 at 01:53:02PM +0100, Christoph Hellwig wrote:
> On Mon, Nov 04, 2024 at 08:19:24AM -0400, Jason Gunthorpe wrote:
> > > That's a good point.  Only mapped through host bridge P2P can even
> > > end up here, so the address is a perfectly valid physical address
> > > in the host.  But I'm not sure if all arch_sync_dma_for_device
> > > implementations handle IOMMU memory fine.
> > 
> > I was told on x86 if you do a cache flush operation on MMIO there is a
> > chance it will MCE. Recently had some similar discussions about ARM
> > where it was asserted some platforms may have similar.
> 
> On x86 we never flush caches for DMA operations anyway, so x86 isn't
> really the concern here, but architectures that do cache incoherent DMA
> to PCIe devices.  Which isn't a whole lot as most SOCs try to avoid that
> for PCIe even if they lack DMA coherent for lesser peripherals, but I bet
> there are some on arm/arm64 and maybe riscv or mips.
> 
> > It would be safest to only call arch flushing calls on memory that is
> > mapped cachable. We can assume that a P2P target is never CPU
> > mapped cachable, regardless of how the DMA is routed.
> 
> Yes.  I.e. force DMA_ATTR_SKIP_CPU_SYNC for P2P.

What do you think?

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 38bcb3ecceeb..065bdace3344 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -559,14 +559,19 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 {
 	enum dma_data_direction dir = rq_dma_dir(req);
 	unsigned int mapped = 0;
+	unsigned long attrs = 0;
 	int error = 0;
 
 	iter->addr = state->addr;
 	iter->len = dma_iova_size(state);
+	if (req->cmd_flags & REQ_P2PDMA) {
+		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		req->cmd_flags &= ~REQ_P2PDMA;
+	}
 
 	do {
 		error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
-				vec->len, dir, 0);
+				vec->len, dir, attrs);
 		if (error)
 			goto error_unmap;
 		mapped += vec->len;
@@ -578,7 +583,7 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 
 	return true;
 error_unmap:
-	dma_iova_destroy(dma_dev, state, mapped, rq_dma_dir(req), 0);
+	dma_iova_destroy(dma_dev, state, mapped, rq_dma_dir(req), attrs);
 	iter->status = errno_to_blk_status(error);
 	return false;
 }
@@ -633,7 +638,6 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
 			 * P2P transfers through the host bridge are treated the
 			 * same as non-P2P transfers below and during unmap.
 			 */
-			req->cmd_flags &= ~REQ_P2PDMA;
 			break;
 		default:
 			iter->status = BLK_STS_INVAL;
@@ -644,6 +648,8 @@ bool blk_rq_dma_map_iter_start(struct request *req, struct device *dma_dev,
 	if (blk_can_dma_map_iova(req, dma_dev) &&
 	    dma_iova_try_alloc(dma_dev, state, vec.paddr, total_len))
 		return blk_rq_dma_map_iova(req, dma_dev, state, iter, &vec);
+
+	req->cmd_flags &= ~REQ_P2PDMA;
 	return blk_dma_map_direct(req, dma_dev, iter, &vec);
 }
 EXPORT_SYMBOL_GPL(blk_rq_dma_map_iter_start);
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 62980ca8f3c5..5fe30fbc42b0 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
  * HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
  * HMM_PFN_ERROR - accessing the pfn is impossible and the device should
  *                 fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_P2PDMA - P2P page, not bus mapped
  * HMM_PFN_P2PDMA_BUS - Bus mapped P2P transfer
  * HMM_PFN_DMA_MAPPED - Flag preserved on input-to-output transformation
  *                      to mark that page is already DMA mapped
@@ -41,6 +42,7 @@ enum hmm_pfn_flags {
 	HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
 
 	/* Sticky flag, carried from Input to Output */
+	HMM_PFN_P2PDMA     = 1UL << (BITS_PER_LONG - 5),
 	HMM_PFN_P2PDMA_BUS = 1UL << (BITS_PER_LONG - 6),
 	HMM_PFN_DMA_MAPPED = 1UL << (BITS_PER_LONG - 7),
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 4ef2b3815212..b2ec199c2ea8 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -710,6 +710,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 	struct page *page = hmm_pfn_to_page(pfns[idx]);
 	phys_addr_t paddr = hmm_pfn_to_phys(pfns[idx]);
 	size_t offset = idx * map->dma_entry_size;
+	unsigned long attrs = 0;
 	dma_addr_t dma_addr;
 	int ret;
 
@@ -740,6 +741,9 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 
 	switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		attrs |= DMA_ATTR_SKIP_CPU_SYNC;
+		pfns[idx] |= HMM_PFN_P2PDMA;
+		fallthrough;
 	case PCI_P2PDMA_MAP_NONE:
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
@@ -752,7 +756,8 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 
 	if (dma_use_iova(state)) {
 		ret = dma_iova_link(dev, state, paddr, offset,
-				    map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
+				    map->dma_entry_size, DMA_BIDIRECTIONAL,
+				    attrs);
 		if (ret)
 			return DMA_MAPPING_ERROR;
 
@@ -793,6 +798,7 @@ bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 	struct dma_iova_state *state = &map->state;
 	dma_addr_t *dma_addrs = map->dma_list;
 	unsigned long *pfns = map->pfn_list;
+	unsigned long attrs = 0;
 
 #define HMM_PFN_VALID_DMA (HMM_PFN_VALID | HMM_PFN_DMA_MAPPED)
 	if ((pfns[idx] & HMM_PFN_VALID_DMA) != HMM_PFN_VALID_DMA)
@@ -801,14 +807,16 @@ bool hmm_dma_unmap_pfn(struct device *dev, struct hmm_dma_map *map, size_t idx)
 
 	if (pfns[idx] & HMM_PFN_P2PDMA_BUS)
 		; /* no need to unmap bus address P2P mappings */
-	else if (dma_use_iova(state))
+	else if (dma_use_iova(state)) {
+		if (pfns[idx] & HMM_PFN_P2PDMA)
+			attrs |= DMA_ATTR_SKIP_CPU_SYNC;
 		dma_iova_unlink(dev, state, idx * map->dma_entry_size,
-				map->dma_entry_size, DMA_BIDIRECTIONAL, 0);
-	else if (dma_need_unmap(dev))
+				map->dma_entry_size, DMA_BIDIRECTIONAL, attrs);
+	} else if (dma_need_unmap(dev))
 		dma_unmap_page(dev, dma_addrs[idx], map->dma_entry_size,
 			       DMA_BIDIRECTIONAL);
 
-	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA_BUS);
+	pfns[idx] &= ~(HMM_PFN_DMA_MAPPED | HMM_PFN_P2PDMA | HMM_PFN_P2PDMA_BUS);
 	return true;
 }
 EXPORT_SYMBOL_GPL(hmm_dma_unmap_pfn);

Patch

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index e1eaad500d27..4a504a879cc0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1834,6 +1834,265 @@  void dma_iova_free(struct device *dev, struct dma_iova_state *state)
 }
 EXPORT_SYMBOL_GPL(dma_iova_free);
 
+static int __dma_iova_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	bool coherent = dev_is_dma_coherent(dev);
+
+	if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+		arch_sync_dma_for_device(phys, size, dir);
+
+	return iommu_map_nosync(iommu_get_dma_domain(dev), addr, phys, size,
+			dma_info_to_prot(dir, coherent, attrs), GFP_ATOMIC);
+}
+
+static int iommu_dma_iova_bounce_and_link(struct device *dev, dma_addr_t addr,
+		phys_addr_t phys, size_t bounce_len,
+		enum dma_data_direction dir, unsigned long attrs,
+		size_t iova_start_pad)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iova_domain *iovad = &domain->iova_cookie->iovad;
+	phys_addr_t bounce_phys;
+	int error;
+
+	bounce_phys = iommu_dma_map_swiotlb(dev, phys, bounce_len, dir, attrs);
+	if (bounce_phys == DMA_MAPPING_ERROR)
+		return -ENOMEM;
+
+	error = __dma_iova_link(dev, addr - iova_start_pad,
+			bounce_phys - iova_start_pad,
+			iova_align(iovad, bounce_len), dir, attrs);
+	if (error)
+		swiotlb_tbl_unmap_single(dev, bounce_phys, bounce_len, dir,
+				attrs);
+	return error;
+}
+
+static int iommu_dma_iova_link_swiotlb(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+	size_t iova_end_pad = iova_offset(iovad, phys + size);
+	dma_addr_t addr = state->addr + offset;
+	size_t mapped = 0;
+	int error;
+
+	if (iova_start_pad) {
+		size_t bounce_len = min(size, iovad->granule - iova_start_pad);
+
+		error = iommu_dma_iova_bounce_and_link(dev, addr, phys,
+				bounce_len, dir, attrs, iova_start_pad);
+		if (error)
+			return error;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+
+		mapped += bounce_len;
+		size -= bounce_len;
+		if (!size)
+			return 0;
+	}
+
+	size -= iova_end_pad;
+	error = __dma_iova_link(dev, addr + mapped, phys + mapped, size, dir,
+			attrs);
+	if (error)
+		goto out_unmap;
+	mapped += size;
+
+	if (iova_end_pad) {
+		error = iommu_dma_iova_bounce_and_link(dev, addr + mapped,
+				phys + mapped, iova_end_pad, dir, attrs, 0);
+		if (error)
+			goto out_unmap;
+		state->__size |= DMA_IOVA_USE_SWIOTLB;
+	}
+
+	return 0;
+
+out_unmap:
+	dma_iova_unlink(dev, state, 0, mapped, dir, attrs);
+	return error;
+}
+
+/**
+ * dma_iova_link - Link a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @phys: physical address to link
+ * @offset: offset into the IOVA state to map into
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Link a range of IOVA space for the given IOVA state without IOTLB sync.
+ * This function is used to link multiple physical addresses in contiguous
+ * IOVA space without performing a costly IOTLB sync.
+ *
+ * The caller is responsible for calling dma_iova_sync() to sync the IOTLB at
+ * the end of linkage.
+ */
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, phys);
+
+	if (WARN_ON_ONCE(iova_start_pad && offset > 0))
+		return -EIO;
+
+	if (dev_use_swiotlb(dev, size, dir) && iova_offset(iovad, phys | size))
+		return iommu_dma_iova_link_swiotlb(dev, state, phys, offset,
+				size, dir, attrs);
+
+	return __dma_iova_link(dev, state->addr + offset - iova_start_pad,
+			phys - iova_start_pad,
+			iova_align(iovad, size + iova_start_pad), dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_iova_link);
+
+/**
+ * dma_iova_sync - Sync IOTLB
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to sync
+ * @size: size of the buffer
+ *
+ * Sync IOTLB for the given IOVA state. This function should be called on
+ * the IOVA-contiguous range created by one or more dma_iova_link() calls
+ * to sync the IOTLB.
+ */
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+
+	return iommu_sync_map(domain, addr - iova_start_pad,
+		      iova_align(iovad, size + iova_start_pad));
+}
+EXPORT_SYMBOL_GPL(dma_iova_sync);
+
+static void iommu_dma_iova_unlink_range_slow(struct device *dev,
+		dma_addr_t addr, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	dma_addr_t end = addr + size;
+
+	do {
+		phys_addr_t phys;
+		size_t len;
+
+		phys = iommu_iova_to_phys(domain, addr);
+		if (WARN_ON(!phys))
+			continue;
+		len = min_t(size_t,
+			end - addr, iovad->granule - iova_start_pad);
+
+		if (!dev_is_dma_coherent(dev) &&
+		    !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+			arch_sync_dma_for_cpu(phys, len, dir);
+
+		swiotlb_tbl_unmap_single(dev, phys, len, dir, attrs);
+
+		addr += len;
+		iova_start_pad = 0;
+	} while (addr < end);
+}
+
+static void __iommu_dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		bool free_iova)
+{
+	struct iommu_domain *domain = iommu_get_dma_domain(dev);
+	struct iommu_dma_cookie *cookie = domain->iova_cookie;
+	struct iova_domain *iovad = &cookie->iovad;
+	dma_addr_t addr = state->addr + offset;
+	size_t iova_start_pad = iova_offset(iovad, addr);
+	struct iommu_iotlb_gather iotlb_gather;
+	size_t unmapped;
+
+	if ((state->__size & DMA_IOVA_USE_SWIOTLB) ||
+	    (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)))
+		iommu_dma_iova_unlink_range_slow(dev, addr, size, dir, attrs);
+
+	iommu_iotlb_gather_init(&iotlb_gather);
+	iotlb_gather.queued = free_iova && READ_ONCE(cookie->fq_domain);
+
+	size = iova_align(iovad, size + iova_start_pad);
+	addr -= iova_start_pad;
+	unmapped = iommu_unmap_fast(domain, addr, size, &iotlb_gather);
+	WARN_ON(unmapped != size);
+
+	if (!iotlb_gather.queued)
+		iommu_iotlb_sync(domain, &iotlb_gather);
+	if (free_iova)
+		iommu_dma_free_iova(cookie, addr, size, &iotlb_gather);
+}
+
+/**
+ * dma_iova_unlink - Unlink a range of IOVA space
+ * @dev: DMA device
+ * @state: IOVA state
+ * @offset: offset into the IOVA state to unlink
+ * @size: size of the buffer
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink a range of IOVA space for the given IOVA state.
+ */
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	 __iommu_dma_iova_unlink(dev, state, offset, size, dir, attrs, false);
+}
+EXPORT_SYMBOL_GPL(dma_iova_unlink);
+
+/**
+ * dma_iova_destroy - Finish a DMA mapping transaction
+ * @dev: DMA device
+ * @state: IOVA state
+ * @mapped_len: number of bytes to unmap
+ * @dir: DMA direction
+ * @attrs: attributes of mapping properties
+ *
+ * Unlink the IOVA range up to @mapped_len and free the entire IOVA space. The
+ * range of IOVA from dma_addr to @mapped_len must all be linked, and be the
+ * only linked IOVA in state.
+ */
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs)
+{
+	if (mapped_len)
+		__iommu_dma_iova_unlink(dev, state, 0, mapped_len, dir, attrs,
+				true);
+	else
+		/*
+		 * We can be here if first call to dma_iova_link() failed and
+		 * there is nothing to unlink, so let's be more clear.
+		 */
+		dma_iova_free(dev, state);
+}
+EXPORT_SYMBOL_GPL(dma_iova_destroy);
+
 void iommu_setup_dma_ops(struct device *dev)
 {
 	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 817f11bce7bc..8074a3b5c807 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -313,6 +313,17 @@  static inline bool dma_use_iova(struct dma_iova_state *state)
 bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
 		phys_addr_t phys, size_t size);
 void dma_iova_free(struct device *dev, struct dma_iova_state *state);
+void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
+		size_t mapped_len, enum dma_data_direction dir,
+		unsigned long attrs);
+int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size);
+int dma_iova_link(struct device *dev, struct dma_iova_state *state,
+		phys_addr_t phys, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs);
+void dma_iova_unlink(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size, enum dma_data_direction dir,
+		unsigned long attrs);
 #else /* CONFIG_IOMMU_DMA */
 static inline bool dma_use_iova(struct dma_iova_state *state)
 {
@@ -327,6 +338,27 @@  static inline void dma_iova_free(struct device *dev,
 		struct dma_iova_state *state)
 {
 }
+static inline void dma_iova_destroy(struct device *dev,
+		struct dma_iova_state *state, size_t mapped_len,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
+static inline int dma_iova_sync(struct device *dev, struct dma_iova_state *state,
+		size_t offset, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+static inline int dma_iova_link(struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t offset,
+		size_t size, enum dma_data_direction dir, unsigned long attrs)
+{
+	return -EOPNOTSUPP;
+}
+static inline void dma_iova_unlink(struct device *dev,
+		struct dma_iova_state *state, size_t offset, size_t size,
+		enum dma_data_direction dir, unsigned long attrs)
+{
+}
 #endif /* CONFIG_IOMMU_DMA */
 
 #if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)