| Message ID | 20240822183718.1234-2-mhklinux@outlook.com (mailing list archive) |
|---|---|
| State | Not Applicable |
| Series | Introduce swiotlb throttling |
On Thu, 22 Aug 2024 11:37:12 -0700 mhkelley58@gmail.com wrote: > From: Michael Kelley <mhklinux@outlook.com> > > Implement throttling of swiotlb map requests. Because throttling requires > temporarily pending some requests, throttling can only be used by map > requests made in contexts that can block. Detecting such contexts at > runtime is infeasible, so device driver code must be updated to add > DMA_ATTR_MAY_BLOCK on map requests done in a context that can block. > Even if a map request is throttled, the corresponding unmap request will > never block, so unmap has no context restrictions, just like current code. > If a swiotlb map request does *not* have DMA_ATTR_MAY_BLOCK, no throttling > is done and there is no functional change. > > The goal of throttling is to reduce peak usage of swiotlb memory, > particularly in environments like CoCo VMs which must use bounce buffering > for all DMA I/O. These VMs currently allocate up to 1 GiB for swiotlb > memory to ensure that it isn't exhausted. But for many workloads, this > memory is effectively wasted because it can't be used for other purposes. > Throttling can lower the swiotlb memory requirements without unduly raising > the risk of exhaustion, thus making several hundred MiBs of additional > memory available for general usage. > > The high-level implementation is as follows: > > 1. Each struct io_tlb_mem has a semaphore that is initialized to 1. A > semaphore is used instead of a mutex because the semaphore likely won't > be released by the same thread that obtained it. > > 2. Each struct io_tlb_mem has a swiotlb space usage level above which > throttling is done. This usage level is initialized to 70% of the total > size of that io_tlb_mem, and is tunable at runtime via /sys if > CONFIG_DEBUG_FS is set. > > 3. When swiotlb_tbl_map_single() is invoked with throttling allowed, if > the current usage of that io_tlb_mem is above the throttle level, the > semaphore must be obtained before proceeding. 
The semaphore is then > released by the corresponding swiotlb unmap call. If the semaphore is > already held when swiotlb_tbl_map_single() must obtain it, the calling > thread blocks until the semaphore is available. Once the thread obtains > the semaphore, it proceeds to allocate swiotlb space in the usual way. > The swiotlb map call saves throttling information in the io_tlb_slot, and > then swiotlb unmap uses that information to determine if the semaphore > is held. If so, it releases the semaphore, potentially allowing a > queued request to proceed. Overall, the semaphore queues multiple waiters > and wakes them up in the order in which they waited. Effectively, the > semaphore single threads map/unmap pairs to reduce peak usage. > > 4. A "low throttle" level is also implemented and initialized to 65% of > the total size of the io_tlb_mem. If the current usage is between the > throttle level and the low throttle level, AND the semaphore is held, the > requestor must obtain the semaphore. Consider if throttling occurs, so > that one map request holds the semaphore, and three others are queued > waiting for the semaphore. If swiotlb usage then drops because of > unrelated unmap's, a new incoming map request may not get throttled, and > bypass the three requests waiting in the semaphore queue. There's not > a forward progress issue because the requests in the queue will complete > as long as the underlying I/Os make forward progress. But to somewhat > address the fairness issue, the low throttle level provides hysteresis > in that new incoming requests continue to queue on the semaphore as long > as used swiotlb memory is above that lower limit. > > 5. SGLs are handled in a subsequent patch. > > In #3 above the check for being above the throttle level is an > instantaneous check with no locking and no reservation of space, to avoid > atomic operations. Consequently, multiple threads could all make the check > and decide they are under the throttle level. 
They can all proceed without > obtaining the semaphore, and potentially generate a peak in usage. > Furthermore, other DMA map requests that don't have throttling enabled > proceed without even checking, and hence can also push usage toward a peak. > So throttling can blunt and reduce peaks in swiotlb memory usage, but > it does not guarantee to prevent exhaustion. > > Signed-off-by: Michael Kelley <mhklinux@outlook.com> > --- > include/linux/dma-mapping.h | 8 +++ > include/linux/swiotlb.h | 15 ++++- > kernel/dma/Kconfig | 13 ++++ > kernel/dma/swiotlb.c | 114 ++++++++++++++++++++++++++++++++---- > 4 files changed, 136 insertions(+), 14 deletions(-) > > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > index f693aafe221f..7b78294813be 100644 > --- a/include/linux/dma-mapping.h > +++ b/include/linux/dma-mapping.h > @@ -62,6 +62,14 @@ > */ > #define DMA_ATTR_PRIVILEGED (1UL << 9) > > +/* > + * DMA_ATTR_MAY_BLOCK: Indication by a driver that the DMA map request is > + * allowed to block. This flag must only be used on DMA map requests made in > + * contexts that allow blocking. The corresponding unmap request will not > + * block. > + */ > +#define DMA_ATTR_MAY_BLOCK (1UL << 10) > + > /* > * A dma_addr_t can hold any valid DMA or bus address for the platform. It can > * be given to a device to use as a DMA source or target. It is specific to a > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h > index 3dae0f592063..10d07d0ee00c 100644 > --- a/include/linux/swiotlb.h > +++ b/include/linux/swiotlb.h > @@ -89,6 +89,10 @@ struct io_tlb_pool { > * @defpool: Default (initial) IO TLB memory pool descriptor. > * @pool: IO TLB memory pool descriptor (if not dynamic). > * @nslabs: Total number of IO TLB slabs in all pools. > + * @high_throttle: Slab count above which requests are throttled. > + * @low_throttle: Slab count above which requests are throttled when > + * throttle_sem is already held. 
> + * @throttle_sem: Semaphore that throttled requests must obtain. > * @debugfs: The dentry to debugfs. > * @force_bounce: %true if swiotlb bouncing is forced > * @for_alloc: %true if the pool is used for memory allocation > @@ -104,10 +108,17 @@ struct io_tlb_pool { > * in debugfs. > * @transient_nslabs: The total number of slots in all transient pools that > * are currently used across all areas. > + * @high_throttle_count: Count of requests throttled because high_throttle > + * was exceeded. > + * @low_throttle_count: Count of requests throttled because low_throttle was > + * exceeded and throttle_sem was already held. > */ > struct io_tlb_mem { > struct io_tlb_pool defpool; > unsigned long nslabs; > + unsigned long high_throttle; > + unsigned long low_throttle; > + struct semaphore throttle_sem; Are these struct members needed if CONFIG_SWIOTLB_THROTTLE is not set? > struct dentry *debugfs; > bool force_bounce; > bool for_alloc; > @@ -118,11 +129,11 @@ struct io_tlb_mem { > struct list_head pools; > struct work_struct dyn_alloc; > #endif > -#ifdef CONFIG_DEBUG_FS I think this should not be removed but changed to: #if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) > atomic_long_t total_used; > atomic_long_t used_hiwater; > atomic_long_t transient_nslabs; > -#endif > + unsigned long high_throttle_count; > + unsigned long low_throttle_count; And these two should be guarded by #ifdef CONFIG_SWIOTLB_THROTTLE. > }; > > struct io_tlb_pool *__swiotlb_find_pool(struct device *dev, phys_addr_t paddr); > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > index c06e56be0ca1..d45ba62f58c8 100644 > --- a/kernel/dma/Kconfig > +++ b/kernel/dma/Kconfig > @@ -103,6 +103,19 @@ config SWIOTLB_DYNAMIC > > If unsure, say N. 
> > +config SWIOTLB_THROTTLE > + bool "Throttle DMA map requests from enabled drivers" > + default n > + depends on SWIOTLB > + help > + Enable throttling of DMA map requests to help avoid exhausting > + bounce buffer space, causing request failures. Throttling > + applies only where the calling driver has enabled blocking in > + DMA map requests. This option is most useful in CoCo VMs where > + all DMA operations must go through bounce buffers. If I didn't know anything about the concept, this description would confuse me... The short description should be something like: "Throttle the use of DMA bounce buffers." Do not mention "enabled drivers" here; it's sufficient to mention the limitations in the help text. In addition, the help text should make it clear that this throttling does not apply if bounce buffers are not needed; except for CoCo VMs, this is the most common case. I mean, your description does mention CoCo VMs, but e.g. distributions may wonder what the impact would be if they enable this option and the kernel then runs on bare metal. > + > + If unsure, say N. > + > config DMA_BOUNCE_UNALIGNED_KMALLOC > bool > depends on SWIOTLB > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index df68d29740a0..940b95cf02b7 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -34,6 +34,7 @@ > #include <linux/init.h> > #include <linux/memblock.h> > #include <linux/mm.h> > +#include <linux/semaphore.h> > #include <linux/pfn.h> > #include <linux/rculist.h> > #include <linux/scatterlist.h> > @@ -71,12 +72,15 @@ > * from each index. > * @pad_slots: Number of preceding padding slots. Valid only in the first > * allocated non-padding slot. > + * @throttled: Boolean indicating the slot is used by a request that was > + * throttled. Valid only in the first allocated non-padding slot. 
> */ > struct io_tlb_slot { > phys_addr_t orig_addr; > size_t alloc_size; > unsigned short list; > - unsigned short pad_slots; > + u8 pad_slots; > + u8 throttled; I'm not sure this flag is needed for each slot. SWIOTLB mappings should be throttled when the total SWIOTLB usage is above a threshold. Conversely, it can be unthrottled when the total usage goes below a threshold, and it should not matter if that happens due to an unmap of the exact buffer which previously pushed the usage over the edge, or due to an unmap of any other unrelated buffer. I had a few more comments to the rest of this patch, but they're moot if this base logic gets redone. Petr T > }; > > static bool swiotlb_force_bounce; > @@ -249,6 +253,31 @@ static inline unsigned long nr_slots(u64 val) > return DIV_ROUND_UP(val, IO_TLB_SIZE); > } > > +#ifdef CONFIG_SWIOTLB_THROTTLE > +static void init_throttling(struct io_tlb_mem *mem) > +{ > + sema_init(&mem->throttle_sem, 1); > + > + /* > + * The default thresholds are somewhat arbitrary. They are > + * conservative to allow space for devices that can't throttle and > + * because the determination of whether to throttle is done without > + * any atomicity. The low throttle exists to provide a modest amount > + * of hysteresis so that the system doesn't flip rapidly between > + * throttling and not throttling when usage fluctuates near the high > + * throttle level. > + */ > + mem->high_throttle = (mem->nslabs * 70) / 100; > + mem->low_throttle = (mem->nslabs * 65) / 100; > +} > +#else > +static void init_throttling(struct io_tlb_mem *mem) > +{ > + mem->high_throttle = 0; > + mem->low_throttle = 0; > +} > +#endif > + > /* > * Early SWIOTLB allocation may be too early to allow an architecture to > * perform the desired operations. 
This function allows the architecture to > @@ -415,6 +444,8 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags, > > if (flags & SWIOTLB_VERBOSE) > swiotlb_print_info(); > + > + init_throttling(&io_tlb_default_mem); > } > > void __init swiotlb_init(bool addressing_limit, unsigned int flags) > @@ -511,6 +542,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask, > swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true, > nareas); > add_mem_pool(&io_tlb_default_mem, mem); > + init_throttling(&io_tlb_default_mem); > > swiotlb_print_info(); > return 0; > @@ -947,7 +979,7 @@ static unsigned int wrap_area_index(struct io_tlb_pool *mem, unsigned int index) > * function gives imprecise results because there's no locking across > * multiple areas. > */ > -#ifdef CONFIG_DEBUG_FS > +#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) > static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots) > { > unsigned long old_hiwater, new_used; > @@ -966,14 +998,14 @@ static void dec_used(struct io_tlb_mem *mem, unsigned int nslots) > atomic_long_sub(nslots, &mem->total_used); > } > > -#else /* !CONFIG_DEBUG_FS */ > +#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE*/ > static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots) > { > } > static void dec_used(struct io_tlb_mem *mem, unsigned int nslots) > { > } > -#endif /* CONFIG_DEBUG_FS */ > +#endif /* CONFIG_DEBUG_FS || CONFIG_SWIOTLB_THROTTLE */ > > #ifdef CONFIG_SWIOTLB_DYNAMIC > #ifdef CONFIG_DEBUG_FS > @@ -1277,7 +1309,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr, > > #endif /* CONFIG_SWIOTLB_DYNAMIC */ > > -#ifdef CONFIG_DEBUG_FS > +#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) > > /** > * mem_used() - get number of used slots in an allocator > @@ -1293,7 +1325,7 @@ static unsigned long mem_used(struct io_tlb_mem *mem) > return atomic_long_read(&mem->total_used); > } > > -#else /* 
!CONFIG_DEBUG_FS */ > +#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE */ > > /** > * mem_pool_used() - get number of used slots in a memory pool > @@ -1373,6 +1405,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > struct io_tlb_mem *mem = dev->dma_io_tlb_mem; > unsigned int offset; > struct io_tlb_pool *pool; > + bool throttle = false; > unsigned int i; > size_t size; > int index; > @@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK, > "Alloc alignment may prevent fulfilling requests with max mapping_size\n"); > > + if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & DMA_ATTR_MAY_BLOCK) { > + unsigned long used = atomic_long_read(&mem->total_used); > + > + /* > + * Determining whether to throttle is intentionally done without > + * atomicity. For example, multiple requests could proceed in > + * parallel when usage is just under the threshold, putting > + * usage above the threshold by the aggregate size of the > + * parallel requests. The thresholds must already be set > + * conservatively because of drivers that can't enable > + * throttling, so this slop in the accounting shouldn't be a > + * problem. It's better than the potential bottleneck of a > + * globally synchronized reservation mechanism. 
> + */ > + if (used > mem->high_throttle) { > + throttle = true; > + mem->high_throttle_count++; > + } else if ((used > mem->low_throttle) && > + (mem->throttle_sem.count <= 0)) { > + throttle = true; > + mem->low_throttle_count++; > + } > + if (throttle) > + down(&mem->throttle_sem); > + } > + > offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr); > size = ALIGN(mapping_size + offset, alloc_align_mask + 1); > index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool); > @@ -1406,6 +1465,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > dev_warn_ratelimited(dev, > "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n", > size, mem->nslabs, mem_used(mem)); > + if (throttle) > + up(&mem->throttle_sem); > return (phys_addr_t)DMA_MAPPING_ERROR; > } > > @@ -1424,6 +1485,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > offset &= (IO_TLB_SIZE - 1); > index += pad_slots; > pool->slots[index].pad_slots = pad_slots; > + pool->slots[index].throttled = throttle; > for (i = 0; i < (nr_slots(size) - pad_slots); i++) > pool->slots[index + i].orig_addr = slot_addr(orig_addr, i); > tlb_addr = slot_addr(pool->start, index) + offset; > @@ -1440,7 +1502,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > return tlb_addr; > } > > -static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > +static bool swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > struct io_tlb_pool *mem) > { > unsigned long flags; > @@ -1448,8 +1510,10 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > int index, nslots, aindex; > struct io_tlb_area *area; > int count, i; > + bool throttled; > > index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT; > + throttled = mem->slots[index].throttled; > index -= mem->slots[index].pad_slots; > nslots = nr_slots(mem->slots[index].alloc_size + offset); > 
aindex = index / mem->area_nslabs; > @@ -1478,6 +1542,7 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > mem->slots[i].orig_addr = INVALID_PHYS_ADDR; > mem->slots[i].alloc_size = 0; > mem->slots[i].pad_slots = 0; > + mem->slots[i].throttled = 0; > } > > /* > @@ -1492,6 +1557,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > spin_unlock_irqrestore(&area->lock, flags); > > dec_used(dev->dma_io_tlb_mem, nslots); > + > + return throttled; > } > > #ifdef CONFIG_SWIOTLB_DYNAMIC > @@ -1501,6 +1568,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > * @dev: Device which mapped the buffer. > * @tlb_addr: Physical address within a bounce buffer. > * @pool: Pointer to the transient memory pool to be checked and deleted. > + * @throttled: If the function returns %true, return boolean indicating > + * if the transient allocation was throttled. Not set if the > + * function returns %false. > * > * Check whether the address belongs to a transient SWIOTLB memory pool. > * If yes, then delete the pool. > @@ -1508,11 +1578,18 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > * Return: %true if @tlb_addr belonged to a transient pool that was released. 
> */ > static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr, > - struct io_tlb_pool *pool) > + struct io_tlb_pool *pool, bool *throttled) > { > + unsigned int offset; > + int index; > + > if (!pool->transient) > return false; > > + offset = swiotlb_align_offset(dev, 0, tlb_addr); > + index = (tlb_addr - offset - pool->start) >> IO_TLB_SHIFT; > + *throttled = pool->slots[index].throttled; > + > dec_used(dev->dma_io_tlb_mem, pool->nslabs); > swiotlb_del_pool(dev, pool); > dec_transient_used(dev->dma_io_tlb_mem, pool->nslabs); > @@ -1522,7 +1599,7 @@ static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr, > #else /* !CONFIG_SWIOTLB_DYNAMIC */ > > static inline bool swiotlb_del_transient(struct device *dev, > - phys_addr_t tlb_addr, struct io_tlb_pool *pool) > + phys_addr_t tlb_addr, struct io_tlb_pool *pool, bool *throttled) > { > return false; > } > @@ -1536,6 +1613,8 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr, > size_t mapping_size, enum dma_data_direction dir, > unsigned long attrs, struct io_tlb_pool *pool) > { > + bool throttled; > + > /* > * First, sync the memory before unmapping the entry > */ > @@ -1544,9 +1623,11 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr, > swiotlb_bounce(dev, tlb_addr, mapping_size, > DMA_FROM_DEVICE, pool); > > - if (swiotlb_del_transient(dev, tlb_addr, pool)) > - return; > - swiotlb_release_slots(dev, tlb_addr, pool); > + if (!swiotlb_del_transient(dev, tlb_addr, pool, &throttled)) > + throttled = swiotlb_release_slots(dev, tlb_addr, pool); > + > + if (throttled) > + up(&dev->dma_io_tlb_mem->throttle_sem); > } > > void __swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr, > @@ -1719,6 +1800,14 @@ static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem, > return; > > debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs); > + debugfs_create_ulong("high_throttle", 0600, mem->debugfs, > + 
&mem->high_throttle); > + debugfs_create_ulong("low_throttle", 0600, mem->debugfs, > + &mem->low_throttle); > + debugfs_create_ulong("high_throttle_count", 0600, mem->debugfs, > + &mem->high_throttle_count); > + debugfs_create_ulong("low_throttle_count", 0600, mem->debugfs, > + &mem->low_throttle_count); > debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem, > &fops_io_tlb_used); > debugfs_create_file("io_tlb_used_hiwater", 0600, mem->debugfs, mem, > @@ -1841,6 +1930,7 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem, > INIT_LIST_HEAD_RCU(&mem->pools); > #endif > add_mem_pool(mem, pool); > + init_throttling(mem); > > rmem->priv = mem; >
From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM > > On Thu, 22 Aug 2024 11:37:12 -0700 > mhkelley58@gmail.com wrote: > > > From: Michael Kelley <mhklinux@outlook.com> > > > > Implement throttling of swiotlb map requests. Because throttling requires > > temporarily pending some requests, throttling can only be used by map > > requests made in contexts that can block. Detecting such contexts at > > runtime is infeasible, so device driver code must be updated to add > > DMA_ATTR_MAY_BLOCK on map requests done in a context that can block. > > Even if a map request is throttled, the corresponding unmap request will > > never block, so unmap has no context restrictions, just like current code. > > If a swiotlb map request does *not* have DMA_ATTR_MAY_BLOCK, no throttling > > is done and there is no functional change. > > > > The goal of throttling is to reduce peak usage of swiotlb memory, > > particularly in environments like CoCo VMs which must use bounce buffering > > for all DMA I/O. These VMs currently allocate up to 1 GiB for swiotlb > > memory to ensure that it isn't exhausted. But for many workloads, this > > memory is effectively wasted because it can't be used for other purposes. > > Throttling can lower the swiotlb memory requirements without unduly raising > > the risk of exhaustion, thus making several hundred MiBs of additional > > memory available for general usage. > > > > The high-level implementation is as follows: > > > > 1. Each struct io_tlb_mem has a semaphore that is initialized to 1. A > > semaphore is used instead of a mutex because the semaphore likely won't > > be released by the same thread that obtained it. > > > > 2. Each struct io_tlb_mem has a swiotlb space usage level above which > > throttling is done. This usage level is initialized to 70% of the total > > size of that io_tlb_mem, and is tunable at runtime via /sys if > > CONFIG_DEBUG_FS is set. > > > > 3. 
When swiotlb_tbl_map_single() is invoked with throttling allowed, if > > the current usage of that io_tlb_mem is above the throttle level, the > > semaphore must be obtained before proceeding. The semaphore is then > > released by the corresponding swiotlb unmap call. If the semaphore is > > already held when swiotlb_tbl_map_single() must obtain it, the calling > > thread blocks until the semaphore is available. Once the thread obtains > > the semaphore, it proceeds to allocate swiotlb space in the usual way. > > The swiotlb map call saves throttling information in the io_tlb_slot, and > > then swiotlb unmap uses that information to determine if the semaphore > > is held. If so, it releases the semaphore, potentially allowing a > > queued request to proceed. Overall, the semaphore queues multiple waiters > > and wakes them up in the order in which they waited. Effectively, the > > semaphore single threads map/unmap pairs to reduce peak usage. > > > > 4. A "low throttle" level is also implemented and initialized to 65% of > > the total size of the io_tlb_mem. If the current usage is between the > > throttle level and the low throttle level, AND the semaphore is held, the > > requestor must obtain the semaphore. Consider if throttling occurs, so > > that one map request holds the semaphore, and three others are queued > > waiting for the semaphore. If swiotlb usage then drops because of > > unrelated unmap's, a new incoming map request may not get throttled, and > > bypass the three requests waiting in the semaphore queue. There's not > > a forward progress issue because the requests in the queue will complete > > as long as the underlying I/Os make forward progress. But to somewhat > > address the fairness issue, the low throttle level provides hysteresis > > in that new incoming requests continue to queue on the semaphore as long > > as used swiotlb memory is above that lower limit. > > > > 5. SGLs are handled in a subsequent patch. 
> > > > In #3 above the check for being above the throttle level is an > > instantaneous check with no locking and no reservation of space, to avoid > > atomic operations. Consequently, multiple threads could all make the check > > and decide they are under the throttle level. They can all proceed without > > obtaining the semaphore, and potentially generate a peak in usage. > > Furthermore, other DMA map requests that don't have throttling enabled > > proceed without even checking, and hence can also push usage toward a peak. > > So throttling can blunt and reduce peaks in swiotlb memory usage, but > > it does not guarantee to prevent exhaustion. > > > > Signed-off-by: Michael Kelley <mhklinux@outlook.com> > > --- > > include/linux/dma-mapping.h | 8 +++ > > include/linux/swiotlb.h | 15 ++++- > > kernel/dma/Kconfig | 13 ++++ > > kernel/dma/swiotlb.c | 114 ++++++++++++++++++++++++++++++++---- > > 4 files changed, 136 insertions(+), 14 deletions(-) > > > > diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h > > index f693aafe221f..7b78294813be 100644 > > --- a/include/linux/dma-mapping.h > > +++ b/include/linux/dma-mapping.h > > @@ -62,6 +62,14 @@ > > */ > > #define DMA_ATTR_PRIVILEGED (1UL << 9) > > > > +/* > > + * DMA_ATTR_MAY_BLOCK: Indication by a driver that the DMA map request is > > + * allowed to block. This flag must only be used on DMA map requests made in > > + * contexts that allow blocking. The corresponding unmap request will not > > + * block. > > + */ > > +#define DMA_ATTR_MAY_BLOCK (1UL << 10) > > + > > /* > > * A dma_addr_t can hold any valid DMA or bus address for the platform. It can > > * be given to a device to use as a DMA source or target. 
It is specific to a > > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h > > index 3dae0f592063..10d07d0ee00c 100644 > > --- a/include/linux/swiotlb.h > > +++ b/include/linux/swiotlb.h > > @@ -89,6 +89,10 @@ struct io_tlb_pool { > > * @defpool: Default (initial) IO TLB memory pool descriptor. > > * @pool: IO TLB memory pool descriptor (if not dynamic). > > * @nslabs: Total number of IO TLB slabs in all pools. > > + * @high_throttle: Slab count above which requests are throttled. > > + * @low_throttle: Slab count above which requests are throttled when > > + * throttle_sem is already held. > > + * @throttle_sem: Semaphore that throttled requests must obtain. > > * @debugfs: The dentry to debugfs. > > * @force_bounce: %true if swiotlb bouncing is forced > > * @for_alloc: %true if the pool is used for memory allocation > > @@ -104,10 +108,17 @@ struct io_tlb_pool { > > * in debugfs. > > * @transient_nslabs: The total number of slots in all transient pools that > > * are currently used across all areas. > > + * @high_throttle_count: Count of requests throttled because high_throttle > > + * was exceeded. > > + * @low_throttle_count: Count of requests throttled because low_throttle was > > + * exceeded and throttle_sem was already held. > > */ > > struct io_tlb_mem { > > struct io_tlb_pool defpool; > > unsigned long nslabs; > > + unsigned long high_throttle; > > + unsigned long low_throttle; > > + struct semaphore throttle_sem; > > Are these struct members needed if CONFIG_SWIOTLB_THROTTLE is not set? They are not needed. But I specifically left them unguarded because the #ifdef just clutters things here (and in the code as needed to make things compile) without adding any real value. The amount of memory saved is minuscule as there's rarely more than one instance of io_tlb_mem. 
> > > struct dentry *debugfs; > > bool force_bounce; > > bool for_alloc; > > @@ -118,11 +129,11 @@ struct io_tlb_mem { > > struct list_head pools; > > struct work_struct dyn_alloc; > > #endif > > -#ifdef CONFIG_DEBUG_FS > > I think this should not be removed but changed to: > #if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) Same thought here. > > > atomic_long_t total_used; > > atomic_long_t used_hiwater; > > atomic_long_t transient_nslabs; > > -#endif > > + unsigned long high_throttle_count; > > + unsigned long low_throttle_count; > > And these two should be guarded by #ifdef CONFIG_SWIOTLB_THROTTLE. And here. > > > }; > > > > struct io_tlb_pool *__swiotlb_find_pool(struct device *dev, phys_addr_t paddr); > > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig > > index c06e56be0ca1..d45ba62f58c8 100644 > > --- a/kernel/dma/Kconfig > > +++ b/kernel/dma/Kconfig > > @@ -103,6 +103,19 @@ config SWIOTLB_DYNAMIC > > > > If unsure, say N. > > > > +config SWIOTLB_THROTTLE > > + bool "Throttle DMA map requests from enabled drivers" > > + default n > > + depends on SWIOTLB > > + help > > + Enable throttling of DMA map requests to help avoid exhausting > > + bounce buffer space, causing request failures. Throttling > > + applies only where the calling driver has enabled blocking in > > + DMA map requests. This option is most useful in CoCo VMs where > > + all DMA operations must go through bounce buffers. > > > If I didn't know anything about the concept, this description would > confuse me... The short description should be something like: "Throttle > the use of DMA bounce buffers." Do not mention "enabled drivers" here; > it's sufficient to mention the limitations in the help text. > > In addition, the help text should make it clear that this throttling > does not apply if bounce buffers are not needed; except for CoCo VMs, > this is the most common case. I mean, your description does mention CoCo > VMs, but e.g. 
distributions may wonder what the impact would be if they > enable this option and the kernel then runs on bare metal. OK. I'll work on the text per your comments. > > > + > > + If unsure, say N. > > + > > config DMA_BOUNCE_UNALIGNED_KMALLOC > > bool > > depends on SWIOTLB > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > > index df68d29740a0..940b95cf02b7 100644 > > --- a/kernel/dma/swiotlb.c > > +++ b/kernel/dma/swiotlb.c > > @@ -34,6 +34,7 @@ > > #include <linux/init.h> > > #include <linux/memblock.h> > > #include <linux/mm.h> > > +#include <linux/semaphore.h> > > #include <linux/pfn.h> > > #include <linux/rculist.h> > > #include <linux/scatterlist.h> > > @@ -71,12 +72,15 @@ > > * from each index. > > * @pad_slots: Number of preceding padding slots. Valid only in the first > > * allocated non-padding slot. > > + * @throttled: Boolean indicating the slot is used by a request that was > > + * throttled. Valid only in the first allocated non-padding slot. > > */ > > struct io_tlb_slot { > > phys_addr_t orig_addr; > > size_t alloc_size; > > unsigned short list; > > - unsigned short pad_slots; > > + u8 pad_slots; > > + u8 throttled; > > I'm not sure this flag is needed for each slot. > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is > above a threshold. Conversely, it can be unthrottled when the total > usage goes below a threshold, and it should not matter if that happens > due to an unmap of the exact buffer which previously pushed the usage > over the edge, or due to an unmap of any other unrelated buffer. I think I understand what you are proposing. But I don't see a way to make it work without adding global synchronization beyond the current atomic counter for the number of used slabs. At a minimum we would need a global spin lock instead of the atomic counter. The spin lock would protect the (non-atomic) slab count along with some other accounting, and that's more global references. 
As described in the cover letter, I was trying to avoid doing that. If you can see how to do what you propose with just the current atomic counter, please describe. Michael > > I had a few more comments to the rest of this patch, but they're moot > if this base logic gets redone. > > Petr T > > > }; > > > > static bool swiotlb_force_bounce; > > @@ -249,6 +253,31 @@ static inline unsigned long nr_slots(u64 val) > > return DIV_ROUND_UP(val, IO_TLB_SIZE); > > } > > > > +#ifdef CONFIG_SWIOTLB_THROTTLE > > +static void init_throttling(struct io_tlb_mem *mem) > > +{ > > + sema_init(&mem->throttle_sem, 1); > > + > > + /* > > + * The default thresholds are somewhat arbitrary. They are > > + * conservative to allow space for devices that can't throttle and > > + * because the determination of whether to throttle is done without > > + * any atomicity. The low throttle exists to provide a modest amount > > + * of hysteresis so that the system doesn't flip rapidly between > > + * throttling and not throttling when usage fluctuates near the high > > + * throttle level. > > + */ > > + mem->high_throttle = (mem->nslabs * 70) / 100; > > + mem->low_throttle = (mem->nslabs * 65) / 100; > > +} > > +#else > > +static void init_throttling(struct io_tlb_mem *mem) > > +{ > > + mem->high_throttle = 0; > > + mem->low_throttle = 0; > > +} > > +#endif > > + > > /* > > * Early SWIOTLB allocation may be too early to allow an architecture to > > perform the desired operations.
This function allows the architecture to > > @@ -415,6 +444,8 @@ void __init swiotlb_init_remap(bool addressing_limit, > unsigned int flags, > > > > if (flags & SWIOTLB_VERBOSE) > > swiotlb_print_info(); > > + > > + init_throttling(&io_tlb_default_mem); > > } > > > > void __init swiotlb_init(bool addressing_limit, unsigned int flags) > > @@ -511,6 +542,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask, > > swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true, > > nareas); > > add_mem_pool(&io_tlb_default_mem, mem); > > + init_throttling(&io_tlb_default_mem); > > > > swiotlb_print_info(); > > return 0; > > @@ -947,7 +979,7 @@ static unsigned int wrap_area_index(struct io_tlb_pool > *mem, unsigned int index) > > * function gives imprecise results because there's no locking across > > * multiple areas. > > */ > > -#ifdef CONFIG_DEBUG_FS > > +#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) > > static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots) > > { > > unsigned long old_hiwater, new_used; > > @@ -966,14 +998,14 @@ static void dec_used(struct io_tlb_mem *mem, unsigned > int nslots) > > atomic_long_sub(nslots, &mem->total_used); > > } > > > > -#else /* !CONFIG_DEBUG_FS */ > > +#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE*/ > > static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots) > > { > > } > > static void dec_used(struct io_tlb_mem *mem, unsigned int nslots) > > { > > } > > -#endif /* CONFIG_DEBUG_FS */ > > +#endif /* CONFIG_DEBUG_FS || CONFIG_SWIOTLB_THROTTLE */ > > > > #ifdef CONFIG_SWIOTLB_DYNAMIC > > #ifdef CONFIG_DEBUG_FS > > @@ -1277,7 +1309,7 @@ static int swiotlb_find_slots(struct device *dev, > phys_addr_t orig_addr, > > > > #endif /* CONFIG_SWIOTLB_DYNAMIC */ > > > > -#ifdef CONFIG_DEBUG_FS > > +#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE) > > > > /** > > * mem_used() - get number of used slots in an allocator > > @@ -1293,7 +1325,7 @@ static 
unsigned long mem_used(struct io_tlb_mem *mem) > > return atomic_long_read(&mem->total_used); > > } > > > > -#else /* !CONFIG_DEBUG_FS */ > > +#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE */ > > > > /** > > * mem_pool_used() - get number of used slots in a memory pool > > @@ -1373,6 +1405,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, > phys_addr_t orig_addr, > > struct io_tlb_mem *mem = dev->dma_io_tlb_mem; > > unsigned int offset; > > struct io_tlb_pool *pool; > > + bool throttle = false; > > unsigned int i; > > size_t size; > > int index; > > @@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, > phys_addr_t orig_addr, > > dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK, > > "Alloc alignment may prevent fulfilling requests with max > mapping_size\n"); > > > > + if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & > DMA_ATTR_MAY_BLOCK) { > > + unsigned long used = atomic_long_read(&mem->total_used); > > + > > + /* > > + * Determining whether to throttle is intentionally done without > > + * atomicity. For example, multiple requests could proceed in > > + * parallel when usage is just under the threshold, putting > > + * usage above the threshold by the aggregate size of the > > + * parallel requests. The thresholds must already be set > > + * conservatively because of drivers that can't enable > > + * throttling, so this slop in the accounting shouldn't be a > > + * problem. It's better than the potential bottleneck of a > > + * globally synchronized reservation mechanism.
> > + */ > > + if (used > mem->high_throttle) { > > + throttle = true; > > + mem->high_throttle_count++; > > + } else if ((used > mem->low_throttle) && > > + (mem->throttle_sem.count <= 0)) { > > + throttle = true; > > + mem->low_throttle_count++; > > + } > > + if (throttle) > > + down(&mem->throttle_sem); > > + } > > + > > offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr); > > size = ALIGN(mapping_size + offset, alloc_align_mask + 1); > > index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool); > > @@ -1406,6 +1465,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, > phys_addr_t orig_addr, > > dev_warn_ratelimited(dev, > > "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n", > > size, mem->nslabs, mem_used(mem)); > > + if (throttle) > > + up(&mem->throttle_sem); > > return (phys_addr_t)DMA_MAPPING_ERROR; > > } > > > > @@ -1424,6 +1485,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, > phys_addr_t orig_addr, > > offset &= (IO_TLB_SIZE - 1); > > index += pad_slots; > > pool->slots[index].pad_slots = pad_slots; > > + pool->slots[index].throttled = throttle; > > for (i = 0; i < (nr_slots(size) - pad_slots); i++) > > pool->slots[index + i].orig_addr = slot_addr(orig_addr, i); > > tlb_addr = slot_addr(pool->start, index) + offset; > > @@ -1440,7 +1502,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, > phys_addr_t orig_addr, > > return tlb_addr; > > } > > > > -static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > > +static bool swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr, > > struct io_tlb_pool *mem) > > { > > unsigned long flags; > > @@ -1448,8 +1510,10 @@ static void swiotlb_release_slots(struct device *dev, > phys_addr_t tlb_addr, > > int index, nslots, aindex; > > struct io_tlb_area *area; > > int count, i; > > + bool throttled; > > > > index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT; > > + throttled = 
mem->slots[index].throttled; > > index -= mem->slots[index].pad_slots; > > nslots = nr_slots(mem->slots[index].alloc_size + offset); > > aindex = index / mem->area_nslabs; > > @@ -1478,6 +1542,7 @@ static void swiotlb_release_slots(struct device *dev, > phys_addr_t tlb_addr, > > mem->slots[i].orig_addr = INVALID_PHYS_ADDR; > > mem->slots[i].alloc_size = 0; > > mem->slots[i].pad_slots = 0; > > + mem->slots[i].throttled = 0; > > } > > > > /* > > @@ -1492,6 +1557,8 @@ static void swiotlb_release_slots(struct device *dev, > phys_addr_t tlb_addr, > > spin_unlock_irqrestore(&area->lock, flags); > > > > dec_used(dev->dma_io_tlb_mem, nslots); > > + > > + return throttled; > > } > > > > #ifdef CONFIG_SWIOTLB_DYNAMIC > > @@ -1501,6 +1568,9 @@ static void swiotlb_release_slots(struct device *dev, > phys_addr_t tlb_addr, > > * @dev: Device which mapped the buffer. > > * @tlb_addr: Physical address within a bounce buffer. > > * @pool: Pointer to the transient memory pool to be checked and deleted. > > + * @throttled: If the function returns %true, return boolean indicating > > + * if the transient allocation was throttled. Not set if the > > + * function returns %false. > > * > > * Check whether the address belongs to a transient SWIOTLB memory pool. > > * If yes, then delete the pool. > > @@ -1508,11 +1578,18 @@ static void swiotlb_release_slots(struct device *dev, > phys_addr_t tlb_addr, > > * Return: %true if @tlb_addr belonged to a transient pool that was released. 
> > */ > > static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr, > > - struct io_tlb_pool *pool) > > + struct io_tlb_pool *pool, bool *throttled) > > { > > + unsigned int offset; > > + int index; > > + > > if (!pool->transient) > > return false; > > > > + offset = swiotlb_align_offset(dev, 0, tlb_addr); > > + index = (tlb_addr - offset - pool->start) >> IO_TLB_SHIFT; > > + *throttled = pool->slots[index].throttled; > > + > > dec_used(dev->dma_io_tlb_mem, pool->nslabs); > > swiotlb_del_pool(dev, pool); > > dec_transient_used(dev->dma_io_tlb_mem, pool->nslabs); > > @@ -1522,7 +1599,7 @@ static bool swiotlb_del_transient(struct device *dev, > phys_addr_t tlb_addr, > > #else /* !CONFIG_SWIOTLB_DYNAMIC */ > > > > static inline bool swiotlb_del_transient(struct device *dev, > > - phys_addr_t tlb_addr, struct io_tlb_pool *pool) > > + phys_addr_t tlb_addr, struct io_tlb_pool *pool, bool *throttled) > > { > > return false; > > } > > @@ -1536,6 +1613,8 @@ void __swiotlb_tbl_unmap_single(struct device *dev, > phys_addr_t tlb_addr, > > size_t mapping_size, enum dma_data_direction dir, > > unsigned long attrs, struct io_tlb_pool *pool) > > { > > + bool throttled; > > + > > /* > > * First, sync the memory before unmapping the entry > > */ > > @@ -1544,9 +1623,11 @@ void __swiotlb_tbl_unmap_single(struct device *dev, > phys_addr_t tlb_addr, > > swiotlb_bounce(dev, tlb_addr, mapping_size, > > DMA_FROM_DEVICE, pool); > > > > - if (swiotlb_del_transient(dev, tlb_addr, pool)) > > - return; > > - swiotlb_release_slots(dev, tlb_addr, pool); > > + if (!swiotlb_del_transient(dev, tlb_addr, pool, &throttled)) > > + throttled = swiotlb_release_slots(dev, tlb_addr, pool); > > + > > + if (throttled) > > + up(&dev->dma_io_tlb_mem->throttle_sem); > > } > > > > void __swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr, > > @@ -1719,6 +1800,14 @@ static void swiotlb_create_debugfs_files(struct > io_tlb_mem *mem, > > return; > > > > 
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs); > > + debugfs_create_ulong("high_throttle", 0600, mem->debugfs, > > + &mem->high_throttle); > > + debugfs_create_ulong("low_throttle", 0600, mem->debugfs, > > + &mem->low_throttle); > > + debugfs_create_ulong("high_throttle_count", 0600, mem->debugfs, > > + &mem->high_throttle_count); > > + debugfs_create_ulong("low_throttle_count", 0600, mem->debugfs, > > + &mem->low_throttle_count); > > debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem, > > &fops_io_tlb_used); > > debugfs_create_file("io_tlb_used_hiwater", 0600, mem->debugfs, mem, > > @@ -1841,6 +1930,7 @@ static int rmem_swiotlb_device_init(struct reserved_mem > *rmem, > > INIT_LIST_HEAD_RCU(&mem->pools); > > #endif > > add_mem_pool(mem, pool); > > + init_throttling(mem); > > > > rmem->priv = mem; > >
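The two-level hysteresis being discussed above can be sketched as a small userspace model (illustrative names only; this is just the decision logic from the patch, not the kernel code). Requests throttle above the high threshold; between low and high they throttle only while the semaphore is already contended, which is what keeps the system from flipping rapidly near the high threshold:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the patch's two-level hysteresis (illustrative
 * names; not the kernel code).
 */
struct tlb_model {
	unsigned long nslabs;
	unsigned long high_throttle;	/* 70% of nslabs, as in the patch */
	unsigned long low_throttle;	/* 65% of nslabs */
};

static void init_throttle_model(struct tlb_model *m, unsigned long nslabs)
{
	m->nslabs = nslabs;
	m->high_throttle = (nslabs * 70) / 100;
	m->low_throttle = (nslabs * 65) / 100;
}

/* Decide whether a DMA_ATTR_MAY_BLOCK mapping throttles at this usage
 * level; sem_count mirrors throttle_sem.count at the time of the check,
 * with <= 0 meaning an earlier request already holds or waits on it. */
static bool should_throttle(const struct tlb_model *m,
			    unsigned long used, int sem_count)
{
	if (used > m->high_throttle)
		return true;
	if (used > m->low_throttle && sem_count <= 0)
		return true;
	return false;
}
```

Note how a request in the 65-70% band sails through when the semaphore is free but throttles when it is contended; that is the "modest amount of hysteresis" the patch comment describes.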
On Fri, 23 Aug 2024 20:41:15 +0000 Michael Kelley <mhklinux@outlook.com> wrote: > From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM > > > > On Thu, 22 Aug 2024 11:37:12 -0700 > > mhkelley58@gmail.com wrote: >[...] > > > @@ -71,12 +72,15 @@ > > > * from each index. > > > * @pad_slots: Number of preceding padding slots. Valid only in the first > > > * allocated non-padding slot. > > > + * @throttled: Boolean indicating the slot is used by a request that was > > > + * throttled. Valid only in the first allocated non-padding slot. > > > */ > > > struct io_tlb_slot { > > > phys_addr_t orig_addr; > > > size_t alloc_size; > > > unsigned short list; > > > - unsigned short pad_slots; > > > + u8 pad_slots; > > > + u8 throttled; > > > > I'm not sure this flag is needed for each slot. > > > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is > > above a threshold. Conversely, it can be unthrottled when the total > > usage goes below a threshold, and it should not matter if that happens > > due to an unmap of the exact buffer which previously pushed the usage > > over the edge, or due to an unmap of any other unrelated buffer. > > I think I understand what you are proposing. But I don't see a way > to make it work without adding global synchronization beyond > the current atomic counter for the number of used slabs. At a minimum > we would need a global spin lock instead of the atomic counter. The spin > lock would protect the (non-atomic) slab count along with some other > accounting, and that's more global references.
Another task may increase (or decrease) the counter between atomic_long_read(&mem->total_used) and a subsequent down(&mem->throttle_sem). I consider it a feature, not a flaw, because the real important checks happen later while searching for free slots, and those are protected with a spinlock. > If you can see how to do what you propose with just the current > atomic counter, please describe. I think I'm certainly missing something obvious, but let me open the discussion to improve my understanding of the matter. Suppose we don't protect the slab count with anything. What is the worst possible outcome? IIUC the worst scenario is that multiple tasks unmap swiotlb buffers simultaneously and all of them believe that their action made the total usage go below the low threshold, so all of them try to release the semaphore. That's obviously not good, but AFAICS all that's needed is a test_and_clear_bit() on a per-io_tlb_mem throttled flag just before calling up(). Since up() would acquire the semaphore's spinlock, and there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't look like too much overhead to me, especially if it ends up in the same cache line as the semaphore. Besides, this all happens only in the slow path, i.e. only after the current utilization has just dropped below the low threshold, not if the utilization was already below the threshold before freeing up some slots. I have briefly considered subtracting the low threshold as initial bias from mem->total_used and using atomic_long_add_negative() to avoid the need for an extra throttled flag, but at this point I'm not sure it can be implemented without any races. We can try to figure out the details if it sounds like a good idea. Petr T
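The test_and_clear_bit() idea proposed above can be sketched in userspace C (an illustrative model, not kernel code; the struct, names, and the map-side flag setting are assumptions). The point it demonstrates: when several racing unmaps all see usage drop below the low threshold, the atomic exchange lets at most one of them perform the single up():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the alternative: one per-io_tlb_mem "throttled" flag
 * instead of a per-slot flag. atomic_exchange() plays the role of
 * test_and_clear_bit(). Illustrative names throughout.
 */
struct mem_model {
	atomic_long total_used;
	atomic_bool throttled;
	long low_throttle;
};

static void mem_model_init(struct mem_model *m, long low_throttle)
{
	atomic_init(&m->total_used, 0);
	atomic_init(&m->throttled, false);
	m->low_throttle = low_throttle;
}

/* Map path: account the slots and mark the io_tlb_mem throttled before
 * the (elided) down(&mem->throttle_sem). */
static void map_throttled(struct mem_model *m, long nslots)
{
	atomic_fetch_add(&m->total_used, nslots);
	atomic_store(&m->throttled, true);
}

/* Unmap path: returns true when this caller should do the single up().
 * Only the first unmap that drops usage below the low threshold wins
 * the test-and-clear; later racers see the flag already cleared. */
static bool unmap_should_up(struct mem_model *m, long nslots)
{
	long used = atomic_fetch_sub(&m->total_used, nslots) - nslots;

	if (used >= m->low_throttle)
		return false;
	return atomic_exchange(&m->throttled, false);
}
```

As the follow-up messages in this thread point out, a single flag like this can still lose wakeups when waiters are queued in order on the semaphore; the sketch only captures the duplicate-up() protection.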
From: Petr Tesařík <petr@tesarici.cz> Sent: Tuesday, August 27, 2024 8:56 AM > > On Fri, 23 Aug 2024 20:41:15 +0000 > Michael Kelley <mhklinux@outlook.com> wrote: > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM > > > > > > On Thu, 22 Aug 2024 11:37:12 -0700 > > > mhkelley58@gmail.com wrote: > >[...] > > > > @@ -71,12 +72,15 @@ > > > > * from each index. > > > > * @pad_slots: Number of preceding padding slots. Valid only in the first > > > > * allocated non-padding slot. > > > > + * @throttled: Boolean indicating the slot is used by a request that was > > > > + * throttled. Valid only in the first allocated non-padding slot. > > > > */ > > > > struct io_tlb_slot { > > > > phys_addr_t orig_addr; > > > > size_t alloc_size; > > > > unsigned short list; > > > > - unsigned short pad_slots; > > > > + u8 pad_slots; > > > > + u8 throttled; > > > > > > I'm not sure this flag is needed for each slot. > > > > > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is > > > above a threshold. Conversely, it can be unthrottled when the total > > > usage goes below a threshold, and it should not matter if that happens > > > due to an unmap of the exact buffer which previously pushed the usage > > > over the edge, or due to an unmap of any other unrelated buffer. > > > > I think I understand what you are proposing. But I don't see a way > > to make it work without adding global synchronization beyond > > the current atomic counter for the number of used slabs. At a minimum > > we would need a global spin lock instead of the atomic counter. The spin > > lock would protect the (non-atomic) slab count along with some other > > accounting, and that's more global references. As described in the > > cover letter, I was trying to avoid doing that. > > I have thought about this for a few days. And I'm still not convinced.
> You have made it clear in multiple places that the threshold is a soft > limit, and there are many ways the SWIOTLB utilization may exceed the > threshold. In fact I'm not even 100% sure that an atomic counter is > needed, because the check is racy anyway. Atomic operations are expensive at the memory bus level, particularly in high CPU count systems with NUMA topologies. However, maintaining an imprecise global count doesn't work because the divergence from reality can become unbounded over time. The alternative is to sum up all the per-area counters each time a reasonably good global value is needed, and that can be expensive itself with high area counts. A hybrid might be to maintain an imprecise global count, but periodically update it by summing up all the per-area counters so that the divergence from reality isn't unbounded. > Another task may increase > (or decrease) the counter between atomic_long_read(&mem->total_used) > and a subsequent down(&mem->throttle_sem). > > I consider it a feature, not a flaw, because the real important checks > happen later while searching for free slots, and those are protected > with a spinlock. > > > If you can see how to do what you propose with just the current > > atomic counter, please describe. > > I think I'm certainly missing something obvious, but let me open the > discussion to improve my understanding of the matter. > > Suppose we don't protect the slab count with anything. What is the > worst possible outcome? IIUC the worst scenario is that multiple tasks > unmap swiotlb buffers simultaneously and all of them believe that their > action made the total usage go below the low threshold, so all of them > try to release the semaphore. > > That's obviously not good, but AFAICS all that's needed is a > test_and_clear_bit() on a per-io_tlb_mem throttled flag just before > calling up(). 
Since up() would acquire the semaphore's spinlock, and > there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't > look like too much overhead to me, especially if it ends up in the same > cache line as the semaphore. Yes, the semaphore management is the problem. Presumably we want each throttled request to wait on the semaphore, forming an ordered queue of waiters. Each up() on the semaphore releases one of those waiters. We don’t want to release all the waiters when the slab count transitions from "above throttle" to "below throttle" because that creates a thundering herd problem. So consider this example scenario: 1) Two waiters ("A" and "B") are queued on the semaphore, each wanting 2 slabs. 2) An unrelated swiotlb unmap frees 10 slabs, taking the slab count from 2 above threshold to 8 below threshold. This does up() on the semaphore and awakens "A". 3) "A" does his request for 2 slabs, and the slab count is now 6 below threshold. 4) "A" does swiotlb unmap. The slab count goes from 6 below threshold back to 8 below threshold, so no semaphore operation is done. "B" is still waiting. 5) System-wide, swiotlb requests decline, and the slab count never goes above the threshold again. At this point, "B" is still waiting and never gets awakened. An ordered queue of waiters is incompatible with wakeups determined solely on whether the slab count is below the threshold after swiotlb unmap. You would have to wake up all waiters and let them re-contend for the slots that are available below the threshold, with most probably losing out and going back on the semaphore wait queue (i.e., a thundering herd). Separately, what does a swiotlb unmap do if it takes the slab count from above threshold to below threshold, and there are no waiters? It should not do up() in that case, but how can it make that decision in a way that doesn't race with a swiotlb map operation running at the same time? Michael > > Besides, this all happens only in the slow path, i.e.
only after the > current utilization has just dropped below the low threshold, not if > the utilization was already below the threshold before freeing up some > slots. > > I have briefly considered subtracting the low threshold as initial bias > from mem->total_used and using atomic_long_add_negative() to avoid the > need for an extra throttled flag, but at this point I'm not sure it can > be implemented without any races. We can try to figure out the details > if it sounds like a good idea. > > Petr T
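The "hybrid" counting idea Michael floats above (an imprecise global count periodically resynchronized from per-area counters, so its drift stays bounded) can be sketched as follows. All names, the area count, and the reconcile period are assumptions for illustration:

```c
#include <assert.h>
#include <stdatomic.h>

#define NUM_AREAS 4
#define RECONCILE_INTERVAL 64	/* illustrative reconcile period */

/* Hot path touches only a per-area counter; an approximate global
 * count is periodically rebuilt by summing the areas. Sketch only.
 */
struct hybrid_count {
	atomic_long area_used[NUM_AREAS];
	atomic_long approx_total;	/* may drift between reconciles */
	atomic_long updates;		/* drives the reconcile cadence */
};

static void hybrid_init(struct hybrid_count *h)
{
	for (int i = 0; i < NUM_AREAS; i++)
		atomic_init(&h->area_used[i], 0);
	atomic_init(&h->approx_total, 0);
	atomic_init(&h->updates, 0);
}

/* Sum the per-area counters into the approximate global count. */
static long hybrid_reconcile(struct hybrid_count *h)
{
	long sum = 0;

	for (int i = 0; i < NUM_AREAS; i++)
		sum += atomic_load(&h->area_used[i]);
	atomic_store(&h->approx_total, sum);
	return sum;
}

/* Hot path: update only this area's counter; every Nth update one
 * caller pays for a full reconcile, bounding the drift. */
static void hybrid_add(struct hybrid_count *h, int area, long nslots)
{
	atomic_fetch_add(&h->area_used[area], nslots);
	if (atomic_fetch_add(&h->updates, 1) % RECONCILE_INTERVAL == 0)
		hybrid_reconcile(h);
}

static long hybrid_read(struct hybrid_count *h)
{
	return atomic_load(&h->approx_total);
}
```

Between reconciles, `hybrid_read()` can lag reality by at most what accumulated since the last sum, which is the trade-off being weighed against per-update atomics on a single shared counter.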
On Tue, 27 Aug 2024 17:30:59 +0000 Michael Kelley <mhklinux@outlook.com> wrote: > From: Petr Tesařík <petr@tesarici.cz> Sent: Tuesday, August 27, 2024 8:56 AM > > > > On Fri, 23 Aug 2024 20:41:15 +0000 > > Michael Kelley <mhklinux@outlook.com> wrote: > > > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM > > > > > > > > On Thu, 22 Aug 2024 11:37:12 -0700 > > > > mhkelley58@gmail.com wrote: > > >[...] > > > > > @@ -71,12 +72,15 @@ > > > > > * from each index. > > > > > * @pad_slots: Number of preceding padding slots. Valid only in the first > > > > > * allocated non-padding slot. > > > > > + * @throttled: Boolean indicating the slot is used by a request that was > > > > > + * throttled. Valid only in the first allocated non-padding slot. > > > > > */ > > > > > struct io_tlb_slot { > > > > > phys_addr_t orig_addr; > > > > > size_t alloc_size; > > > > > unsigned short list; > > > > > - unsigned short pad_slots; > > > > > + u8 pad_slots; > > > > > + u8 throttled; > > > > > > > > I'm not sure this flag is needed for each slot. > > > > > > > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is > > > > above a threshold. Conversely, it can be unthrottled when the total > > > > usage goes below a threshold, and it should not matter if that happens > > > > due to an unmap of the exact buffer which previously pushed the usage > > > > over the edge, or due to an unmap of any other unrelated buffer. > > > > > > I think I understand what you are proposing. But I don't see a way > > > to make it work without adding global synchronization beyond > > > the current atomic counter for the number of used slabs. At a minimum > > > we would need a global spin lock instead of the atomic counter. The spin > > > lock would protect the (non-atomic) slab count along with some other > > > accounting, and that's more global references. As described in the > > > cover letter, I was trying to avoid doing that.
> > > > I have thought about this for a few days. And I'm still not convinced. > > You have made it clear in multiple places that the threshold is a soft > > limit, and there are many ways the SWIOTLB utilization may exceed the > > threshold. In fact I'm not even 100% sure that an atomic counter is > > needed, because the check is racy anyway. > > Atomic operations are expensive at the memory bus level, particularly > in high CPU count systems with NUMA topologies. However, Sure, the CPU must ensure exclusive access to the underlying memory and cache coherency across all CPUs. I know how these things work... > maintaining an imprecise global count doesn't work because the > divergence from reality can become unbounded over time. The > alternative is to sum up all the per-area counters each time a > reasonably good global value is needed, and that can be expensive itself > with high area counts. A hybrid might be to maintain an imprecise global > count, but periodically update it by summing up all the per-area counters > so that the divergence from reality isn't unbounded. Yes, this is what I had in mind, but I'm not sure which option is worse. Let me run a micro-benchmark on a 192-core AmpereOne system. > > Another task may increase > > (or decrease) the counter between atomic_long_read(&mem->total_used) > > and a subsequent down(&mem->throttle_sem). > > > > I consider it a feature, not a flaw, because the real important checks > > happen later while searching for free slots, and those are protected > > with a spinlock. > > > > > If you can see how to do what you propose with just the current > > > atomic counter, please describe. > > > > I think I'm certainly missing something obvious, but let me open the > > discussion to improve my understanding of the matter. > > > > Suppose we don't protect the slab count with anything. What is the > > worst possible outcome? 
IIUC the worst scenario is that multiple tasks > > unmap swiotlb buffers simultaneously and all of them believe that their > > action made the total usage go below the low threshold, so all of them > > try to release the semaphore. > > > > That's obviously not good, but AFAICS all that's needed is a > > test_and_clear_bit() on a per-io_tlb_mem throttled flag just before > > calling up(). Since up() would acquire the semaphore's spinlock, and > > there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't > > look like too much overhead to me, especially if it ends up in the same > > cache line as the semaphore. > > Yes, the semaphore management is the problem. Presumably we want > each throttled request to wait on the semaphore, forming an ordered > queue of waiters. Each up() on the semaphore releases one of those > waiters. We don’t want to release all the waiters when the slab count > transitions from "above throttle" to "below throttle" because that > creates a thundering herd problem. > > So consider this example scenario: > 1) Two waiters ("A" and "B") are queued on the semaphore, each wanting 2 slabs. > 2) An unrelated swiotlb unmap frees 10 slabs, taking the slab count > from 2 above threshold to 8 below threshold. This does up() on > the semaphore and awakens "A". > 3) "A" does his request for 2 slabs, and the slab count is now 6 below > threshold. > 4) "A" does swiotlb unmap. The slab count goes from 6 below threshold back > to 8 below threshold, so no semaphore operation is done. "B" is still waiting. > 5) System-wide, swiotlb requests decline, and the slab count never goes above > the threshold again. At this point, "B" is still waiting and never gets awakened. > > An ordered queue of waiters is incompatible with wakeups determined solely > on whether the slab count is below the threshold after swiotlb unmap.
You > would have to wake up all waiters and let them re-contend for the slots that > are available below the threshold, with most probably losing out and going > back on the semaphore wait queue (i.e., a thundering herd). Ah, right, the semaphore must be released as many times as it is acquired. Thank you for your patience. > Separately, what does a swiotlb unmap do if it takes the slab count from above > threshold to below threshold, and there are no waiters? It should not do up() > in that case, but how can it make that decision in a way that doesn't race > with a swiotlb map operation running at the same time? Hm, this confirms my gut feeling that the atomic counter alone would not be sufficient. I think I can follow your reasoning now: 1. Kernels which enable CONFIG_SWIOTLB_THROTTLE are likely to have CONFIG_DEBUG_FS as well, so the price for an atomic operation on total_used is already paid. 2. There are no pre-existing per-io_tlb_mem ordering constraints on unmap, except the used counter, which is insufficient. 3. Slot data is already protected by its area spinlock, so adding something there does not increase the price. I don't have an immediate idea, but I still believe we can do better. For one thing, your scheme is susceptible to excessive throttling in degenerate cases, e.g.: 1. A spike in network traffic temporarily increases swiotlb usage above the threshold, but it is not throttled because the network driver does not use DMA_ATTR_MAY_BLOCK. 2. A slow disk "Snail" maps a buffer and acquires the semaphore. 3. A fast disk "Cheetah" tries to map a buffer and goes on the semaphore wait queue. 4. Network buffers are unmapped, dropping usage below the threshold, but since the throttle flag was not set, the semaphore is not touched. 5. "Cheetah" is unnecessarily waiting for "Snail" to finish. You may have never hit this scenario in your testing, because you presumably had only fast virtual block devices.
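The "Snail"/"Cheetah" scenario just described can be walked through deterministically with a toy, single-threaded model of the per-slot-flag scheme (illustrative names; a sketch, not the kernel code). A negative sem_count means tasks are queued on the semaphore:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: usage counter plus a counting-semaphore value, with the
 * patch's rule that only requests throttled at map time do up().
 */
struct toy {
	long used;
	long high, low;
	int sem_count;		/* starts at 1, as in the patch */
};

static void toy_init(struct toy *t, long high, long low)
{
	t->used = 0;
	t->high = high;
	t->low = low;
	t->sem_count = 1;
}

/* Map without DMA_ATTR_MAY_BLOCK (e.g. a network driver): never throttles. */
static void toy_map_nonblocking(struct toy *t, long n)
{
	t->used += n;
}

/* Map with DMA_ATTR_MAY_BLOCK: returns true if the mapping completed,
 * false if the task is left queued on the semaphore. */
static bool toy_map_blocking(struct toy *t, long n)
{
	bool throttle = t->used > t->high ||
			(t->used > t->low && t->sem_count <= 0);

	if (throttle) {
		t->sem_count--;		/* down() */
		if (t->sem_count < 0)
			return false;	/* queued: mapping not done yet */
	}
	t->used += n;
	return true;
}

/* Unmap: only a request that throttled at map time does up(). */
static void toy_unmap(struct toy *t, long n, bool was_throttled)
{
	t->used -= n;
	if (was_throttled)
		t->sem_count++;		/* up() */
}
```

Running the scenario through this model shows the degenerate behavior: after the unthrottled network buffers are unmapped, usage has collapsed, yet the queued fast disk remains blocked until the slow disk's own unmap does up().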
I'm currently thinking along the lines of waking up the semaphore on unmap whenever current usage is above the threshold and there is a waiter. As a side note, I get your concerns about the thundering herd effect, but keep in mind that bounce buffers are not necessarily equal. If four devices are blocked on mapping a single slot, you can actually wake up all of them after you release four slots. For SG lists, you even add explicit logic to trigger the wakeup only on the last segment... BTW as we talk about the semaphore queue, it reminds me of an issue I had with your proposed patch: > @@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, > dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK, > "Alloc alignment may prevent fulfilling requests with max mapping_size\n"); > > + if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & DMA_ATTR_MAY_BLOCK) { > + unsigned long used = atomic_long_read(&mem->total_used); > + > + /* > + * Determining whether to throttle is intentionally done without > + * atomicity. For example, multiple requests could proceed in > + * parallel when usage is just under the threshold, putting > + * usage above the threshold by the aggregate size of the > + * parallel requests. The thresholds must already be set > + * conservatively because of drivers that can't enable > + * throttling, so this slop in the accounting shouldn't be a > + * problem. It's better than the potential bottleneck of a > + * globally synchronized reservation mechanism. > + */ > + if (used > mem->high_throttle) { > + throttle = true; > + mem->high_throttle_count++; > + } else if ((used > mem->low_throttle) && > + (mem->throttle_sem.count <= 0)) { ^^^^^^^^^^^^^^^^^^ Is it safe to access the semaphore count like this without taking the semaphore spinlock?
If it is, then it deserves a comment to explain why you can ignore this comment in include/linux/semaphore.h: /* Please don't access any members of this structure directly */ Petr T > + throttle = true; > + mem->low_throttle_count++; > + } > + if (throttle) > + down(&mem->throttle_sem); > + } > + > offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr); > size = ALIGN(mapping_size + offset, alloc_align_mask + 1); > index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool);
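One way to sidestep the question of reading sem.count directly (which the comment in include/linux/semaphore.h warns against) is for the throttling code to keep its own atomic shadow of the count next to the semaphore. This is purely an illustrative userspace sketch of that idea, not something from the patch; the commented-out down()/up() mark where the real semaphore operations would sit:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shadow counter maintained by the throttling code itself, so the
 * low-throttle check never touches semaphore internals. Sketch only.
 */
struct throttle_sem {
	/* struct semaphore sem; -- the real semaphore would live here */
	atomic_int shadow;	/* mirrors sem.count; <= 0 means contended */
};

static void throttle_sem_init(struct throttle_sem *s)
{
	atomic_init(&s->shadow, 1);
}

static void throttle_down(struct throttle_sem *s)
{
	atomic_fetch_sub(&s->shadow, 1);
	/* down(&s->sem); */
}

static void throttle_up(struct throttle_sem *s)
{
	/* up(&s->sem); */
	atomic_fetch_add(&s->shadow, 1);
}

/* Racy but safe hint, usable in place of reading sem.count directly
 * in the "(used > low_throttle) && contended" check. */
static bool throttle_contended(struct throttle_sem *s)
{
	return atomic_load(&s->shadow) <= 0;
}
```

The hint is just as racy as reading sem.count, which the patch already tolerates by design; the difference is that it depends only on fields the throttling code owns.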
From: Petr Tesařík <petr@tesarici.cz> Sent: Tuesday, August 27, 2024 10:16 PM > > On Tue, 27 Aug 2024 17:30:59 +0000 > Michael Kelley <mhklinux@outlook.com> wrote: > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Tuesday, August 27, 2024 8:56 AM > > > > > > On Fri, 23 Aug 2024 20:41:15 +0000 > > > Michael Kelley <mhklinux@outlook.com> wrote: > > > > > > > From: Petr Tesařík <petr@tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM > > > > > > > > > > On Thu, 22 Aug 2024 11:37:12 -0700 > > > > > mhkelley58@gmail.com wrote: > > > >[...] > > > > > > @@ -71,12 +72,15 @@ > > > > > > * from each index. > > > > > > * @pad_slots: Number of preceding padding slots. Valid only in the first > > > > > > * allocated non-padding slot. > > > > > > + * @throttled: Boolean indicating the slot is used by a request that was > > > > > > + * throttled. Valid only in the first allocated non-padding slot. > > > > > > */ > > > > > > struct io_tlb_slot { > > > > > > phys_addr_t orig_addr; > > > > > > size_t alloc_size; > > > > > > unsigned short list; > > > > > > - unsigned short pad_slots; > > > > > > + u8 pad_slots; > > > > > > + u8 throttled; > > > > > > > > > > I'm not sure this flag is needed for each slot. > > > > > > > > > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is > > > > > above a threshold. Conversely, it can be unthrottled when the total > > > > > usage goes below a threshold, and it should not matter if that happens > > > > > due to an unmap of the exact buffer which previously pushed the usage > > > > > over the edge, or due to an unmap of any other unrelated buffer. > > > > > > > > I think I understand what you are proposing. But I don't see a way > > > > to make it work without adding global synchronization beyond > > > > the current atomic counter for the number of used slabs. At a minimum > > > > we would need a global spin lock instead of the atomic counter.
The spin > > > > lock would protect the (non-atomic) slab count along with some other > > > > accounting, and that's more global references. As described in the > > > > cover letter, I was trying to avoid doing that. > > > > > > I have thought about this for a few days. And I'm still not convinced. > > > You have made it clear in multiple places that the threshold is a soft > > > limit, and there are many ways the SWIOTLB utilization may exceed the > > > threshold. In fact I'm not even 100% sure that an atomic counter is > > > needed, because the check is racy anyway. > > > > Atomic operations are expensive at the memory bus level, particularly > > in high CPU count systems with NUMA topologies. However, > > Sure, the CPU must ensure exclusive access to the underlying memory and > cache coherency across all CPUs. I know how these things work... > > > maintaining an imprecise global count doesn't work because the > > divergence from reality can become unbounded over time. The > > alternative is to sum up all the per-area counters each time a > > reasonably good global value is needed, and that can be expensive itself > > with high area counts. A hybrid might be to maintain an imprecise global > > count, but periodically update it by summing up all the per-area counters > > so that the divergence from reality isn't unbounded. > > Yes, this is what I had in mind, but I'm not sure which option is > worse. Let me run a micro-benchmark on a 192-core AmpereOne system. > > > > Another task may increase > > > (or decrease) the counter between atomic_long_read(&mem->total_used) > > > and a subsequent down(&mem->throttle_sem). > > > > > > I consider it a feature, not a flaw, because the real important checks > > > happen later while searching for free slots, and those are protected > > > with a spinlock. > > > > > > > If you can see how to do what you propose with just the current > > > > atomic counter, please describe. 
> > >
> > > I think I'm certainly missing something obvious, but let me open the
> > > discussion to improve my understanding of the matter.
> > >
> > > Suppose we don't protect the slab count with anything. What is the
> > > worst possible outcome? IIUC the worst scenario is that multiple tasks
> > > unmap swiotlb buffers simultaneously and all of them believe that their
> > > action made the total usage go below the low threshold, so all of them
> > > try to release the semaphore.
> > >
> > > That's obviously not good, but AFAICS all that's needed is a
> > > test_and_clear_bit() on a per-io_tlb_mem throttled flag just before
> > > calling up(). Since up() would acquire the semaphore's spinlock, and
> > > there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't
> > > look like too much overhead to me, especially if it ends up in the same
> > > cache line as the semaphore.
> >
> > Yes, the semaphore management is the problem. Presumably we want
> > each throttled request to wait on the semaphore, forming an ordered
> > queue of waiters. Each up() on the semaphore releases one of those
> > waiters. We don't want to release all the waiters when the slab count
> > transitions from "above throttle" to "below throttle" because that
> > creates a thundering herd problem.
> >
> > So consider this example scenario:
> > 1) Two waiters ("A" and "B") are queued on the semaphore, each wanting 2 slabs.
> > 2) An unrelated swiotlb unmap frees 10 slabs, taking the slab count
> >    from 2 above threshold to 8 below threshold. This does up() on
> >    the semaphore and awakens "A".
> > 3) "A" does his request for 2 slabs, and the slab count is now 6 below
> >    threshold.
> > 4) "A" does swiotlb unmap. The slab count goes from 6 below threshold back
> >    to 8 below threshold, so no semaphore operation is done. "B" is still waiting.
> > 5) System-wide, swiotlb requests decline, and the slab count never goes above
> >    the threshold again. At this point, "B" is still waiting and never gets
> >    awakened.
> >
> > An ordered queue of waiters is incompatible with wakeups determined solely
> > on whether the slab count is below the threshold after swiotlb unmap. You
> > would have to wake up all waiters and let them re-contend for the slots that
> > are available below the threshold, with most probably losing out and going
> > back on the semaphore wait queue (i.e., a thundering herd).
> 
> Ah, right, the semaphore must be released as many times as it is
> acquired. Thank you for your patience.
> 
> > Separately, what does a swiotlb unmap do if it takes the slab count from above
> > threshold to below threshold, and there are no waiters? It should not do up()
> > in that case, but how can it make that decision in a way that doesn't race
> > with a swiotlb map operation running at the same time?
> 
> Hm, this confirms my gut feeling that the atomic counter alone would
> not be sufficient.
> 
> I think I can follow your reasoning now:
> 
> 1. Kernels which enable CONFIG_SWIOTLB_THROTTLE are likely to have
>    CONFIG_DEBUG_FS as well, so the price for an atomic operation on
>    total_used is already paid.

I'm unsure if that is true. But my thinking that the atomic total_used
is needed by throttling may have been faulty. Certainly, if
CONFIG_DEBUG_FS is set, then the cost is already paid. But if not,
CONFIG_SWIOTLB_THROTTLE in my current code adds the atomic total_used
cost for *all* swiotlb map and unmap requests. But the cost of a
computed-on-the-fly value (by summing across all areas) would be paid
only by MAY_BLOCK map requests (and not on unmap), so that decreases the
overall cost. And I had not thought of the hybrid approach until I wrote
my previous response to you. Both seem worth further
thinking/investigation.

> 2. There are no pre-existing per-io_tlb_mem ordering constraints on
>    unmap, except the used counter, which is insufficient.

Agreed.

> 3. Slot data is already protected by its area spinlock, so adding
>    something there does not increase the price.

Agreed.

> 
> I don't have an immediate idea, but I still believe we can do better.
> For one thing, your scheme is susceptible to excessive throttling in
> degenerate cases, e.g.:
> 
> 1. A spike in network traffic temporarily increases swiotlb usage above
>    the threshold, but it is not throttled because the network driver
>    does not use DMA_ATTR_MAY_BLOCK.
> 2. A slow disk "Snail" maps a buffer and acquires the semaphore.
> 3. A fast disk "Cheetah" tries to map a buffer and goes on the
>    semaphore wait queue.
> 4. Network buffers are unmapped, dropping usage below the threshold,
>    but since the throttle flag was not set, the semaphore is not
>    touched.
> 5. "Cheetah" is unnecessarily waiting for "Snail" to finish.
> 
> You may have never hit this scenario in your testing, because you
> presumably had only fast virtual block devices.

My approach was to explicitly not worry about this scenario. :-) I
stated in the patch set cover letter that throttled requests are
serialized (though maybe not clearly enough). And if a workload
regularly runs above the threshold, the size of the swiotlb memory
should probably be increased. I'm open to an approach that does better
than serialization of throttled requests if it doesn't get too
complicated, but I think it's of secondary importance.

> 
> I'm currently thinking along the lines of waking up the semaphore
> on unmap whenever current usage is above the threshold and there is a
> waiter.
> 
> As a side note, I get your concerns about the thundering herd effect,
> but keep in mind that bounce buffers are not necessarily equal. If four
> devices are blocked on mapping a single slot, you can actually wake up
> all of them after you release four slots.

Agreed. But the accounting to do that correctly probably requires a spin
lock, and I didn't want to go there.
> For SG lists, you even add
> explicit logic to trigger the wakeup only on the last segment...

Yes. In my thinking, that's just part of the serialization of throttled
requests. Throttled request "A", which used an SGL, shouldn't release
the semaphore and hand off ownership to request "B" until all the
swiotlb memory allocated by "A"s SGL has been released.

> 
> BTW as we talk about the semaphore queue, it reminds me of an issue I
> had with your proposed patch:
> 
> > @@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> >  	dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK,
> >  		"Alloc alignment may prevent fulfilling requests with max mapping_size\n");
> >  
> > +	if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & DMA_ATTR_MAY_BLOCK) {
> > +		unsigned long used = atomic_long_read(&mem->total_used);
> > +
> > +		/*
> > +		 * Determining whether to throttle is intentionally done without
> > +		 * atomicity. For example, multiple requests could proceed in
> > +		 * parallel when usage is just under the threshold, putting
> > +		 * usage above the threshold by the aggregate size of the
> > +		 * parallel requests. The thresholds must already be set
> > +		 * conservatively because of drivers that can't enable
> > +		 * throttling, so this slop in the accounting shouldn't be a
> > +		 * problem. It's better than the potential bottleneck of a
> > +		 * globally synchronized reservation mechanism.
> > +		 */
> > +		if (used > mem->high_throttle) {
> > +			throttle = true;
> > +			mem->high_throttle_count++;
> > +		} else if ((used > mem->low_throttle) &&
> > +				(mem->throttle_sem.count <= 0)) {
>                                  ^^^^^^^^^^^^^^^^^^
> 
> Is it safe to access the semaphore count like this without taking the
> semaphore spinlock? If it is, then it deserves a comment to explain why
> you can ignore this comment in include/linux/semaphore.h:
> 
> 	/* Please don't access any members of this structure directly */
> 
> Petr T

Yes, this is a bit of a hack for the RFC patch set. The semaphore code
doesn't offer an API to find out if a semaphore is held. In my mind, the
right solution is to add a semaphore API to get the current "count" of
the semaphore (or maybe just a boolean indicating if it is held), and
then use that API. I would add the API as this patch set goes from RFC
to PATCH status. (Mutexes have such an API.) The API would provide only
an instantaneous value, and in the absence of any higher-level
synchronization, the value could change immediately after it is read.
But that's OK in the swiotlb throttling use case because the throttling
tolerates "errors" due to such a change. The implementation of the API
doesn't need to obtain the semaphore spin lock as long as the read of
the count field is atomic (i.e., doesn't tear), which it should be.

Michael

> 
> > +			throttle = true;
> > +			mem->low_throttle_count++;
> > +		}
> > +		if (throttle)
> > +			down(&mem->throttle_sem);
> > +	}
> > +
> >  	offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr);
> >  	size = ALIGN(mapping_size + offset, alloc_align_mask + 1);
> >  	index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index f693aafe221f..7b78294813be 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -62,6 +62,14 @@
  */
 #define DMA_ATTR_PRIVILEGED		(1UL << 9)
 
+/*
+ * DMA_ATTR_MAY_BLOCK: Indication by a driver that the DMA map request is
+ * allowed to block. This flag must only be used on DMA map requests made in
+ * contexts that allow blocking. The corresponding unmap request will not
+ * block.
+ */
+#define DMA_ATTR_MAY_BLOCK		(1UL << 10)
+
 /*
  * A dma_addr_t can hold any valid DMA or bus address for the platform. It can
  * be given to a device to use as a DMA source or target. It is specific to a
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 3dae0f592063..10d07d0ee00c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -89,6 +89,10 @@ struct io_tlb_pool {
  * @defpool:	Default (initial) IO TLB memory pool descriptor.
  * @pool:	IO TLB memory pool descriptor (if not dynamic).
  * @nslabs:	Total number of IO TLB slabs in all pools.
+ * @high_throttle: Slab count above which requests are throttled.
+ * @low_throttle: Slab count above which requests are throttled when
+ *		throttle_sem is already held.
+ * @throttle_sem: Semaphore that throttled requests must obtain.
  * @debugfs:	The dentry to debugfs.
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:	%true if the pool is used for memory allocation
@@ -104,10 +108,17 @@ struct io_tlb_pool {
  *		in debugfs.
  * @transient_nslabs: The total number of slots in all transient pools that
  *		are currently used across all areas.
+ * @high_throttle_count: Count of requests throttled because high_throttle
+ *		was exceeded.
+ * @low_throttle_count: Count of requests throttled because low_throttle was
+ *		exceeded and throttle_sem was already held.
  */
 struct io_tlb_mem {
 	struct io_tlb_pool defpool;
 	unsigned long nslabs;
+	unsigned long high_throttle;
+	unsigned long low_throttle;
+	struct semaphore throttle_sem;
 	struct dentry *debugfs;
 	bool force_bounce;
 	bool for_alloc;
@@ -118,11 +129,11 @@ struct io_tlb_mem {
 	struct list_head pools;
 	struct work_struct dyn_alloc;
 #endif
-#ifdef CONFIG_DEBUG_FS
 	atomic_long_t total_used;
 	atomic_long_t used_hiwater;
 	atomic_long_t transient_nslabs;
-#endif
+	unsigned long high_throttle_count;
+	unsigned long low_throttle_count;
 };
 
 struct io_tlb_pool *__swiotlb_find_pool(struct device *dev, phys_addr_t paddr);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c06e56be0ca1..d45ba62f58c8 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -103,6 +103,19 @@ config SWIOTLB_DYNAMIC
 
 	  If unsure, say N.
 
+config SWIOTLB_THROTTLE
+	bool "Throttle DMA map requests from enabled drivers"
+	default n
+	depends on SWIOTLB
+	help
+	  Enable throttling of DMA map requests to help avoid exhausting
+	  bounce buffer space, causing request failures. Throttling
+	  applies only where the calling driver has enabled blocking in
+	  DMA map requests. This option is most useful in CoCo VMs where
+	  all DMA operations must go through bounce buffers.
+
+	  If unsure, say N.
+
 config DMA_BOUNCE_UNALIGNED_KMALLOC
 	bool
 	depends on SWIOTLB
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index df68d29740a0..940b95cf02b7 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -34,6 +34,7 @@
 #include <linux/init.h>
 #include <linux/memblock.h>
 #include <linux/mm.h>
+#include <linux/semaphore.h>
 #include <linux/pfn.h>
 #include <linux/rculist.h>
 #include <linux/scatterlist.h>
@@ -71,12 +72,15 @@
  *		from each index.
  * @pad_slots:	Number of preceding padding slots. Valid only in the first
  *		allocated non-padding slot.
+ * @throttled:	Boolean indicating the slot is used by a request that was
+ *		throttled. Valid only in the first allocated non-padding slot.
  */
 struct io_tlb_slot {
 	phys_addr_t orig_addr;
 	size_t alloc_size;
 	unsigned short list;
-	unsigned short pad_slots;
+	u8 pad_slots;
+	u8 throttled;
 };
 
 static bool swiotlb_force_bounce;
@@ -249,6 +253,31 @@ static inline unsigned long nr_slots(u64 val)
 	return DIV_ROUND_UP(val, IO_TLB_SIZE);
 }
 
+#ifdef CONFIG_SWIOTLB_THROTTLE
+static void init_throttling(struct io_tlb_mem *mem)
+{
+	sema_init(&mem->throttle_sem, 1);
+
+	/*
+	 * The default thresholds are somewhat arbitrary. They are
+	 * conservative to allow space for devices that can't throttle and
+	 * because the determination of whether to throttle is done without
+	 * any atomicity. The low throttle exists to provide a modest amount
+	 * of hysteresis so that the system doesn't flip rapidly between
+	 * throttling and not throttling when usage fluctuates near the high
+	 * throttle level.
+	 */
+	mem->high_throttle = (mem->nslabs * 70) / 100;
+	mem->low_throttle = (mem->nslabs * 65) / 100;
+}
+#else
+static void init_throttling(struct io_tlb_mem *mem)
+{
+	mem->high_throttle = 0;
+	mem->low_throttle = 0;
+}
+#endif
+
 /*
  * Early SWIOTLB allocation may be too early to allow an architecture to
  * perform the desired operations. This function allows the architecture to
@@ -415,6 +444,8 @@ void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags,
 
 	if (flags & SWIOTLB_VERBOSE)
 		swiotlb_print_info();
+
+	init_throttling(&io_tlb_default_mem);
 }
 
 void __init swiotlb_init(bool addressing_limit, unsigned int flags)
@@ -511,6 +542,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 	swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
 				 nareas);
 	add_mem_pool(&io_tlb_default_mem, mem);
+	init_throttling(&io_tlb_default_mem);
 
 	swiotlb_print_info();
 	return 0;
@@ -947,7 +979,7 @@ static unsigned int wrap_area_index(struct io_tlb_pool *mem, unsigned int index)
  * function gives imprecise results because there's no locking across
  * multiple areas.
  */
-#ifdef CONFIG_DEBUG_FS
+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE)
 static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots)
 {
 	unsigned long old_hiwater, new_used;
@@ -966,14 +998,14 @@ static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
 	atomic_long_sub(nslots, &mem->total_used);
 }
 
-#else /* !CONFIG_DEBUG_FS */
+#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE */
 static void inc_used_and_hiwater(struct io_tlb_mem *mem, unsigned int nslots)
 {
 }
 static void dec_used(struct io_tlb_mem *mem, unsigned int nslots)
 {
 }
-#endif /* CONFIG_DEBUG_FS */
+#endif /* CONFIG_DEBUG_FS || CONFIG_SWIOTLB_THROTTLE */
 
 #ifdef CONFIG_SWIOTLB_DYNAMIC
 #ifdef CONFIG_DEBUG_FS
@@ -1277,7 +1309,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
 
 #endif /* CONFIG_SWIOTLB_DYNAMIC */
 
-#ifdef CONFIG_DEBUG_FS
+#if defined(CONFIG_DEBUG_FS) || defined(CONFIG_SWIOTLB_THROTTLE)
 
 /**
  * mem_used() - get number of used slots in an allocator
@@ -1293,7 +1325,7 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
 	return atomic_long_read(&mem->total_used);
 }
 
-#else /* !CONFIG_DEBUG_FS */
+#else /* !CONFIG_DEBUG_FS && !CONFIG_SWIOTLB_THROTTLE */
 
 /**
  * mem_pool_used() - get number of used slots in a memory pool
@@ -1373,6 +1405,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 	unsigned int offset;
 	struct io_tlb_pool *pool;
+	bool throttle = false;
 	unsigned int i;
 	size_t size;
 	int index;
@@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK,
 		"Alloc alignment may prevent fulfilling requests with max mapping_size\n");
 
+	if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & DMA_ATTR_MAY_BLOCK) {
+		unsigned long used = atomic_long_read(&mem->total_used);
+
+		/*
+		 * Determining whether to throttle is intentionally done without
+		 * atomicity. For example, multiple requests could proceed in
+		 * parallel when usage is just under the threshold, putting
+		 * usage above the threshold by the aggregate size of the
+		 * parallel requests. The thresholds must already be set
+		 * conservatively because of drivers that can't enable
+		 * throttling, so this slop in the accounting shouldn't be a
+		 * problem. It's better than the potential bottleneck of a
+		 * globally synchronized reservation mechanism.
+		 */
+		if (used > mem->high_throttle) {
+			throttle = true;
+			mem->high_throttle_count++;
+		} else if ((used > mem->low_throttle) &&
+				(mem->throttle_sem.count <= 0)) {
+			throttle = true;
+			mem->low_throttle_count++;
+		}
+		if (throttle)
+			down(&mem->throttle_sem);
+	}
+
 	offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr);
 	size = ALIGN(mapping_size + offset, alloc_align_mask + 1);
 	index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool);
@@ -1406,6 +1465,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 			dev_warn_ratelimited(dev,
 	"swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
 				 size, mem->nslabs, mem_used(mem));
+		if (throttle)
+			up(&mem->throttle_sem);
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
@@ -1424,6 +1485,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	offset &= (IO_TLB_SIZE - 1);
 	index += pad_slots;
 	pool->slots[index].pad_slots = pad_slots;
+	pool->slots[index].throttled = throttle;
 	for (i = 0; i < (nr_slots(size) - pad_slots); i++)
 		pool->slots[index + i].orig_addr = slot_addr(orig_addr, i);
 	tlb_addr = slot_addr(pool->start, index) + offset;
@@ -1440,7 +1502,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	return tlb_addr;
 }
 
-static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
+static bool swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
 		struct io_tlb_pool *mem)
 {
 	unsigned long flags;
@@ -1448,8 +1510,10 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
 	int index, nslots, aindex;
 	struct io_tlb_area *area;
 	int count, i;
+	bool throttled;
 
 	index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
+	throttled = mem->slots[index].throttled;
 	index -= mem->slots[index].pad_slots;
 	nslots = nr_slots(mem->slots[index].alloc_size + offset);
 	aindex = index / mem->area_nslabs;
@@ -1478,6 +1542,7 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
 		mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
 		mem->slots[i].alloc_size = 0;
 		mem->slots[i].pad_slots = 0;
+		mem->slots[i].throttled = 0;
 	}
 
 	/*
@@ -1492,6 +1557,8 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
 	spin_unlock_irqrestore(&area->lock, flags);
 
 	dec_used(dev->dma_io_tlb_mem, nslots);
+
+	return throttled;
 }
 
 #ifdef CONFIG_SWIOTLB_DYNAMIC
@@ -1501,6 +1568,9 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
  * @dev:	Device which mapped the buffer.
  * @tlb_addr:	Physical address within a bounce buffer.
  * @pool:	Pointer to the transient memory pool to be checked and deleted.
+ * @throttled:	If the function returns %true, set to a boolean indicating
+ *		whether the transient allocation was throttled. Not set if the
+ *		function returns %false.
  *
  * Check whether the address belongs to a transient SWIOTLB memory pool.
  * If yes, then delete the pool.
@@ -1508,11 +1578,18 @@ static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr,
  * Return: %true if @tlb_addr belonged to a transient pool that was released.
  */
 static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr,
-		struct io_tlb_pool *pool)
+		struct io_tlb_pool *pool, bool *throttled)
 {
+	unsigned int offset;
+	int index;
+
 	if (!pool->transient)
 		return false;
 
+	offset = swiotlb_align_offset(dev, 0, tlb_addr);
+	index = (tlb_addr - offset - pool->start) >> IO_TLB_SHIFT;
+	*throttled = pool->slots[index].throttled;
+
 	dec_used(dev->dma_io_tlb_mem, pool->nslabs);
 	swiotlb_del_pool(dev, pool);
 	dec_transient_used(dev->dma_io_tlb_mem, pool->nslabs);
@@ -1522,7 +1599,7 @@ static bool swiotlb_del_transient(struct device *dev, phys_addr_t tlb_addr,
 #else  /* !CONFIG_SWIOTLB_DYNAMIC */
 
 static inline bool swiotlb_del_transient(struct device *dev,
-		phys_addr_t tlb_addr, struct io_tlb_pool *pool)
+		phys_addr_t tlb_addr, struct io_tlb_pool *pool, bool *throttled)
 {
 	return false;
 }
@@ -1536,6 +1613,8 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 		size_t mapping_size, enum dma_data_direction dir,
 		unsigned long attrs, struct io_tlb_pool *pool)
 {
+	bool throttled;
+
 	/*
 	 * First, sync the memory before unmapping the entry
 	 */
@@ -1544,9 +1623,11 @@ void __swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
 		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE, pool);
 
-	if (swiotlb_del_transient(dev, tlb_addr, pool))
-		return;
-	swiotlb_release_slots(dev, tlb_addr, pool);
+	if (!swiotlb_del_transient(dev, tlb_addr, pool, &throttled))
+		throttled = swiotlb_release_slots(dev, tlb_addr, pool);
+
+	if (throttled)
+		up(&dev->dma_io_tlb_mem->throttle_sem);
 }
 
 void __swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
@@ -1719,6 +1800,14 @@ static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
 		return;
 
 	debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
+	debugfs_create_ulong("high_throttle", 0600, mem->debugfs,
+			&mem->high_throttle);
+	debugfs_create_ulong("low_throttle", 0600, mem->debugfs,
+			&mem->low_throttle);
+	debugfs_create_ulong("high_throttle_count", 0600, mem->debugfs,
+			&mem->high_throttle_count);
+	debugfs_create_ulong("low_throttle_count", 0600, mem->debugfs,
+			&mem->low_throttle_count);
 	debugfs_create_file("io_tlb_used", 0400, mem->debugfs, mem,
 			&fops_io_tlb_used);
 	debugfs_create_file("io_tlb_used_hiwater", 0600, mem->debugfs, mem,
@@ -1841,6 +1930,7 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
 	INIT_LIST_HEAD_RCU(&mem->pools);
 #endif
 	add_mem_pool(mem, pool);
+	init_throttling(mem);
 
 	rmem->priv = mem;