[RFC,2/2] dma-fence: Use kernel's sort for merging fences

Message ID	20241113171947.57446-2-tursulin@igalia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Tvrtko Ursulin <tursulin@igalia.com> To: dri-devel@lists.freedesktop.org Cc: kernel-dev@igalia.com, Tvrtko Ursulin <tvrtko.ursulin@igalia.com>, =?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>, Friedrich Vock <friedrich.vock@gmx.de> Subject: [RFC 2/2] dma-fence: Use kernel's sort for merging fences Date: Wed, 13 Nov 2024 17:19:47 +0000 Message-ID: <20241113171947.57446-2-tursulin@igalia.com> In-Reply-To: <20241113171947.57446-1-tursulin@igalia.com> References: <20241113171947.57446-1-tursulin@igalia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[1/2] dma-fence: Add a single fence fast path for fence merging \| expand [1/2] dma-fence: Add a single fence fast path for fence merging [RFC,2/2] dma-fence: Use kernel's sort for merging fences

Message ID

20241113171947.57446-2-tursulin@igalia.com (mailing list archive)

State

New, archived

Headers

From: Tvrtko Ursulin <tursulin@igalia.com>
To: dri-devel@lists.freedesktop.org
Cc: kernel-dev@igalia.com, Tvrtko Ursulin <tvrtko.ursulin@igalia.com>,
	=?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>,
 Friedrich Vock <friedrich.vock@gmx.de>
Subject: [RFC 2/2] dma-fence: Use kernel's sort for merging fences
Date: Wed, 13 Nov 2024 17:19:47 +0000
Message-ID: <20241113171947.57446-2-tursulin@igalia.com>
In-Reply-To: <20241113171947.57446-1-tursulin@igalia.com>
References: <20241113171947.57446-1-tursulin@igalia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

[1/2] dma-fence: Add a single fence fast path for fence merging | expand

Commit Message

Tvrtko Ursulin Nov. 13, 2024, 5:19 p.m. UTC

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

One alternative to the fix Christian proposed in
https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
is to replace the rather complex open coded sorting loops with the kernel
standard sort followed by a context squashing pass.

Proposed advantage of this would be readability but one concern Christian
raised was that there could be many fences, that they are typically mostly
sorted, and so the kernel's heap sort would be much worse by the proposed
algorithm.

I had a look running some games and vkcube to see what are the typical
number of input fences. Tested scenarios:

1) Hogwarts Legacy under Gamescope

450 calls per second to __dma_fence_unwrap_merge.

Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:

   N       Before      After
   0       0.91%
   1      69.40%
  2-3     28.72%       9.4%  (90.6% resolved to one fence)
  4-5      0.93%
  6-9      0.03%
  10+

2) Cyberpunk 2077 under Gamescope

1050 calls per second, amounting to 0.01% CPU time according to perf top.

   N       Before      After
   0       1.13%
   1      52.30%
  2-3     40.34%       55.57%
  4-5      1.46%        0.50%
  6-9      2.44%
  10+      2.34%

3) vkcube under Plasma

90 calls per second.

   N       Before      After
   0
   1
  2-3      100%         0%   (Ie. all resolved to a single fence)
  4-5
  6-9
  10+

In the case of vkcube all invocations in the 2-3 bucket were actually
just two input fences.

From these numbers it looks like the heap sort should not be a
disadvantage, given how the dominant case is <= 2 input fences which heap
sort solves with just one compare and swap. (And for the case of one input
fence we have a fast path in the previous patch.)

A complementary possibility is to implement a different sorting algorithm
under the same API as the kernel's sort() and so keep the simplicity,
potentially moving the new sort under lib/ if it would be found more
widely useful.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
---
 drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++++-------------
 1 file changed, 73 insertions(+), 56 deletions(-)

Comments

Tvrtko Ursulin Nov. 14, 2024, 8:28 a.m. UTC | #1

On 13/11/2024 17:19, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> One alternative to the fix Christian proposed in
> https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
> is to replace the rather complex open coded sorting loops with the kernel
> standard sort followed by a context squashing pass.
> 
> Proposed advantage of this would be readability but one concern Christian
> raised was that there could be many fences, that they are typically mostly
> sorted, and so the kernel's heap sort would be much worse by the proposed
> algorithm.
> 
> I had a look running some games and vkcube to see what are the typical
> number of input fences. Tested scenarios:
> 
> 1) Hogwarts Legacy under Gamescope
> 
> 450 calls per second to __dma_fence_unwrap_merge.
> 
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
> 
>     N       Before      After
>     0       0.91%
>     1      69.40%
>    2-3     28.72%       9.4%  (90.6% resolved to one fence)
>    4-5      0.93%
>    6-9      0.03%
>    10+
> 
> 2) Cyberpunk 2077 under Gamescope
> 
> 1050 calls per second, amounting to 0.01% CPU time according to perf top.

This was undercounted, 1050/s is actually only for when two or more 
fences are in the input array. Therefore the number of "raw" calls is 
much higher, around 2x more according to these percentages, or even more 
for Hogwarths. Which I guess strengthens the argument for the single 
fence fast path from the previous patch.

>     N       Before      After
>     0       1.13%
>     1      52.30%
>    2-3     40.34%       55.57%
>    4-5      1.46%        0.50%
>    6-9      2.44%
>    10+      2.34%
> 
> 3) vkcube under Plasma
> 
> 90 calls per second.
> 
>     N       Before      After
>     0
>     1
>    2-3      100%         0%   (Ie. all resolved to a single fence)
>    4-5
>    6-9
>    10+
> 
> In the case of vkcube all invocations in the 2-3 bucket were actually
> just two input fences.
> 
>  From these numbers it looks like the heap sort should not be a
> disadvantage, given how the dominant case is <= 2 input fences which heap
> sort solves with just one compare and swap. (And for the case of one input
> fence we have a fast path in the previous patch.)
> 
> A complementary possibility is to implement a different sorting algorithm
> under the same API as the kernel's sort() and so keep the simplicity,
> potentially moving the new sort under lib/ if it would be found more
> widely useful.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/3617

It should at least.

Regards,

Tvrtko

> Cc: Christian König <christian.koenig@amd.com>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++++-------------
>   1 file changed, 73 insertions(+), 56 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 75c3e37fd617..750dc20a9e9d 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -12,6 +12,7 @@
>   #include <linux/dma-fence-chain.h>
>   #include <linux/dma-fence-unwrap.h>
>   #include <linux/slab.h>
> +#include <linux/sort.h>
>   
>   /* Internal helper to start new array iteration, don't use directly */
>   static struct dma_fence *
> @@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct dma_fence_unwrap *cursor)
>   }
>   EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>   
> +
> +static int fence_cmp(const void *_a, const void *_b)
> +{
> +	struct dma_fence *a = *(struct dma_fence **)_a;
> +	struct dma_fence *b = *(struct dma_fence **)_b;
> +
> +	if (a->context < b->context)
> +		return -1;
> +	else if (a->context > b->context)
> +		return 1;
> +
> +	if (dma_fence_is_later(b, a))
> +		return -1;
> +	else if (dma_fence_is_later(a, b))
> +		return 1;
> +
> +	return 0;
> +}
> +
>   /* Implementation for the dma_fence_merge() marco, don't use directly */
>   struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   					   struct dma_fence **fences,
> @@ -67,9 +87,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	struct dma_fence *tmp, *signaled, **array;
>   	struct dma_fence_array *result;
>   	ktime_t timestamp;
> -	unsigned int i;
> -	size_t count;
> +	int i, j, count;
>   
> +	/*
> +	 * Count number of unwrapped fences and fince the latest signaled
> +	 * timestamp.
> +	 */
>   	count = 0;
>   	timestamp = ns_to_ktime(0);
>   	for (i = 0; i < num_fences; ++i) {
> @@ -98,74 +121,68 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	else if (count == 1)
>   		return dma_fence_get(signaled);
>   
> +	/*
> +	 * Allocate and populate the array.
> +	 */
>   	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>   	if (!array)
>   		return NULL;
>   
> -	/*
> -	 * This trashes the input fence array and uses it as position for the
> -	 * following merge loop. This works because the dma_fence_merge()
> -	 * wrapper macro is creating this temporary array on the stack together
> -	 * with the iterators.
> -	 */
> -	for (i = 0; i < num_fences; ++i)
> -		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
> -
>   	count = 0;
> -	do {
> -		unsigned int sel;
> -
> -restart:
> -		tmp = NULL;
> -		for (i = 0; i < num_fences; ++i) {
> -			struct dma_fence *next;
> -
> -			while (fences[i] && dma_fence_is_signaled(fences[i]))
> -				fences[i] = dma_fence_unwrap_next(&iter[i]);
> -
> -			next = fences[i];
> -			if (!next)
> -				continue;
> -
> -			/*
> -			 * We can't guarantee that inpute fences are ordered by
> -			 * context, but it is still quite likely when this
> -			 * function is used multiple times. So attempt to order
> -			 * the fences by context as we pass over them and merge
> -			 * fences with the same context.
> -			 */
> -			if (!tmp || tmp->context > next->context) {
> -				tmp = next;
> -				sel = i;
> -
> -			} else if (tmp->context < next->context) {
> -				continue;
> -
> -			} else if (dma_fence_is_later(tmp, next)) {
> -				fences[i] = dma_fence_unwrap_next(&iter[i]);
> -				goto restart;
> -			} else {
> -				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> -				goto restart;
> -			}
> +	for (i = 0; i < num_fences; ++i) {
> +		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
> +			if (!dma_fence_is_signaled(tmp))
> +				array[count++] = tmp;
>   		}
> -
> -		if (tmp) {
> -			array[count++] = dma_fence_get(tmp);
> -			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +	}
> +
> +	/*
> +	 * Equal fast-path as the above one, in case some fences got signalled
> +	 * in the meantime.
> +	 */
> +	if (count == 0) {
> +		tmp = dma_fence_allocate_private_stub(timestamp);
> +		goto return_tmp;
> +	} else if (count == 1) {
> +		tmp = dma_fence_get(array[0]);
> +		goto return_tmp;
> +	}
> +
> +	/*
> +	 * Sort in context and seqno order.
> +	 */
> +	sort(array, count, sizeof(*array), fence_cmp, NULL);
> +
> +	/*
> +	 * Only keep the most recent fence for each context.
> +	 */
> +	j = 0;
> +	tmp = array[0];
> +	for (i = 1; i < count; i++) {
> +		if (array[i]->context != tmp->context) {
> +			array[j++] = dma_fence_get(tmp);
>   		}
> -	} while (tmp);
> -
> +		tmp = array[i];
> +	}
> +	if (j == 0 || tmp->context != array[j - 1]->context) {
> +		array[j++] = dma_fence_get(tmp);
> +	}
> +	count = j;
> +
> +	/*
> +	 * And another fast-path as the earlier ones.
> +	 */
>   	if (count == 0) {
>   		tmp = dma_fence_allocate_private_stub(ktime_get());
>   		goto return_tmp;
> -	}
> -
> -	if (count == 1) {
> +	} else if (count == 1) {
>   		tmp = array[0];
>   		goto return_tmp;
>   	}
>   
> +	/*
> +	 * Finnaly create the output fence array.
> +	 */
>   	result = dma_fence_array_create(count, array,
>   					dma_fence_context_alloc(1),
>   					1, false);

Christian König Nov. 14, 2024, 9:05 a.m. UTC | #2

Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> One alternative to the fix Christian proposed in
> https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
> is to replace the rather complex open coded sorting loops with the kernel
> standard sort followed by a context squashing pass.
>
> Proposed advantage of this would be readability but one concern Christian
> raised was that there could be many fences, that they are typically mostly
> sorted, and so the kernel's heap sort would be much worse by the proposed
> algorithm.
>
> I had a look running some games and vkcube to see what are the typical
> number of input fences. Tested scenarios:
>
> 1) Hogwarts Legacy under Gamescope
>
> 450 calls per second to __dma_fence_unwrap_merge.
>
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
>
>     N       Before      After
>     0       0.91%
>     1      69.40%
>    2-3     28.72%       9.4%  (90.6% resolved to one fence)
>    4-5      0.93%
>    6-9      0.03%
>    10+
>
> 2) Cyberpunk 2077 under Gamescope
>
> 1050 calls per second, amounting to 0.01% CPU time according to perf top.
>
>     N       Before      After
>     0       1.13%
>     1      52.30%
>    2-3     40.34%       55.57%
>    4-5      1.46%        0.50%
>    6-9      2.44%
>    10+      2.34%
>
> 3) vkcube under Plasma
>
> 90 calls per second.
>
>     N       Before      After
>     0
>     1
>    2-3      100%         0%   (Ie. all resolved to a single fence)
>    4-5
>    6-9
>    10+
>
> In the case of vkcube all invocations in the 2-3 bucket were actually
> just two input fences.
>
>  From these numbers it looks like the heap sort should not be a
> disadvantage, given how the dominant case is <= 2 input fences which heap
> sort solves with just one compare and swap. (And for the case of one input
> fence we have a fast path in the previous patch.)
>
> A complementary possibility is to implement a different sorting algorithm
> under the same API as the kernel's sort() and so keep the simplicity,
> potentially moving the new sort under lib/ if it would be found more
> widely useful.

Well the API would need to be different from sort() since a merge sort 
always works with multiple inputs and a single output.

But from the number you gathered I really don't think we are going to 
need that.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++++-------------
>   1 file changed, 73 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 75c3e37fd617..750dc20a9e9d 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -12,6 +12,7 @@
>   #include <linux/dma-fence-chain.h>
>   #include <linux/dma-fence-unwrap.h>
>   #include <linux/slab.h>
> +#include <linux/sort.h>
>   
>   /* Internal helper to start new array iteration, don't use directly */
>   static struct dma_fence *
> @@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct dma_fence_unwrap *cursor)
>   }
>   EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>   
> +
> +static int fence_cmp(const void *_a, const void *_b)
> +{
> +	struct dma_fence *a = *(struct dma_fence **)_a;
> +	struct dma_fence *b = *(struct dma_fence **)_b;
> +
> +	if (a->context < b->context)
> +		return -1;
> +	else if (a->context > b->context)
> +		return 1;
> +
> +	if (dma_fence_is_later(b, a))
> +		return -1;
> +	else if (dma_fence_is_later(a, b))
> +		return 1;
> +
> +	return 0;
> +}
> +
>   /* Implementation for the dma_fence_merge() marco, don't use directly */
>   struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   					   struct dma_fence **fences,
> @@ -67,9 +87,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	struct dma_fence *tmp, *signaled, **array;
>   	struct dma_fence_array *result;
>   	ktime_t timestamp;
> -	unsigned int i;
> -	size_t count;
> +	int i, j, count;
>   
> +	/*
> +	 * Count number of unwrapped fences and fince the latest signaled
> +	 * timestamp.
> +	 */

What is done should be obvious from the code, only why something is done 
needs code comment and explanation.

>   	count = 0;
>   	timestamp = ns_to_ktime(0);
>   	for (i = 0; i < num_fences; ++i) {
> @@ -98,74 +121,68 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	else if (count == 1)
>   		return dma_fence_get(signaled);
>   
> +	/*
> +	 * Allocate and populate the array.
> +	 */
>   	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>   	if (!array)
>   		return NULL;
>   
> -	/*
> -	 * This trashes the input fence array and uses it as position for the
> -	 * following merge loop. This works because the dma_fence_merge()
> -	 * wrapper macro is creating this temporary array on the stack together
> -	 * with the iterators.
> -	 */
> -	for (i = 0; i < num_fences; ++i)
> -		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
> -
>   	count = 0;
> -	do {
> -		unsigned int sel;
> -
> -restart:
> -		tmp = NULL;
> -		for (i = 0; i < num_fences; ++i) {
> -			struct dma_fence *next;
> -
> -			while (fences[i] && dma_fence_is_signaled(fences[i]))
> -				fences[i] = dma_fence_unwrap_next(&iter[i]);
> -
> -			next = fences[i];
> -			if (!next)
> -				continue;
> -
> -			/*
> -			 * We can't guarantee that inpute fences are ordered by
> -			 * context, but it is still quite likely when this
> -			 * function is used multiple times. So attempt to order
> -			 * the fences by context as we pass over them and merge
> -			 * fences with the same context.
> -			 */
> -			if (!tmp || tmp->context > next->context) {
> -				tmp = next;
> -				sel = i;
> -
> -			} else if (tmp->context < next->context) {
> -				continue;
> -
> -			} else if (dma_fence_is_later(tmp, next)) {
> -				fences[i] = dma_fence_unwrap_next(&iter[i]);
> -				goto restart;
> -			} else {
> -				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> -				goto restart;
> -			}
> +	for (i = 0; i < num_fences; ++i) {
> +		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
> +			if (!dma_fence_is_signaled(tmp))
> +				array[count++] = tmp;

Same problem as in patch #1, you need to grab a reference to tmp here.

Apart from that the patch looks good to me, but I would reduce the comments.

When we need to explain what code does then the code need to be improved 
and not documented.

Regards,
Christian

>   		}
> -
> -		if (tmp) {
> -			array[count++] = dma_fence_get(tmp);
> -			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +	}
> +
> +	/*
> +	 * Equal fast-path as the above one, in case some fences got signalled
> +	 * in the meantime.
> +	 */
> +	if (count == 0) {
> +		tmp = dma_fence_allocate_private_stub(timestamp);
> +		goto return_tmp;
> +	} else if (count == 1) {
> +		tmp = dma_fence_get(array[0]);
> +		goto return_tmp;
> +	}
> +
> +	/*
> +	 * Sort in context and seqno order.
> +	 */
> +	sort(array, count, sizeof(*array), fence_cmp, NULL);
> +
> +	/*
> +	 * Only keep the most recent fence for each context.
> +	 */
> +	j = 0;
> +	tmp = array[0];
> +	for (i = 1; i < count; i++) {
> +		if (array[i]->context != tmp->context) {
> +			array[j++] = dma_fence_get(tmp);
>   		}
> -	} while (tmp);
> -
> +		tmp = array[i];
> +	}
> +	if (j == 0 || tmp->context != array[j - 1]->context) {
> +		array[j++] = dma_fence_get(tmp);
> +	}
> +	count = j;
> +
> +	/*
> +	 * And another fast-path as the earlier ones.
> +	 */
>   	if (count == 0) {
>   		tmp = dma_fence_allocate_private_stub(ktime_get());
>   		goto return_tmp;
> -	}
> -
> -	if (count == 1) {
> +	} else if (count == 1) {
>   		tmp = array[0];
>   		goto return_tmp;
>   	}
>   
> +	/*
> +	 * Finnaly create the output fence array.
> +	 */
>   	result = dma_fence_array_create(count, array,
>   					dma_fence_context_alloc(1),
>   					1, false);

Tvrtko Ursulin Nov. 14, 2024, 11:12 a.m. UTC | #3

On 14/11/2024 09:05, Christian König wrote:
> Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> One alternative to the fix Christian proposed in
>> https://lore.kernel.org/dri-devel/20241024124159.4519-3-christian.koenig@amd.com/
>> is to replace the rather complex open coded sorting loops with the kernel
>> standard sort followed by a context squashing pass.
>>
>> Proposed advantage of this would be readability but one concern Christian
>> raised was that there could be many fences, that they are typically 
>> mostly
>> sorted, and so the kernel's heap sort would be much worse by the proposed
>> algorithm.
>>
>> I had a look running some games and vkcube to see what are the typical
>> number of input fences. Tested scenarios:
>>
>> 1) Hogwarts Legacy under Gamescope
>>
>> 450 calls per second to __dma_fence_unwrap_merge.
>>
>> Percentages per number of fences buckets, before and after checking for
>> signalled status, sorting and flattening:
>>
>>     N       Before      After
>>     0       0.91%
>>     1      69.40%
>>    2-3     28.72%       9.4%  (90.6% resolved to one fence)
>>    4-5      0.93%
>>    6-9      0.03%
>>    10+
>>
>> 2) Cyberpunk 2077 under Gamescope
>>
>> 1050 calls per second, amounting to 0.01% CPU time according to perf top.
>>
>>     N       Before      After
>>     0       1.13%
>>     1      52.30%
>>    2-3     40.34%       55.57%
>>    4-5      1.46%        0.50%
>>    6-9      2.44%
>>    10+      2.34%
>>
>> 3) vkcube under Plasma
>>
>> 90 calls per second.
>>
>>     N       Before      After
>>     0
>>     1
>>    2-3      100%         0%   (Ie. all resolved to a single fence)
>>    4-5
>>    6-9
>>    10+
>>
>> In the case of vkcube all invocations in the 2-3 bucket were actually
>> just two input fences.
>>
>>  From these numbers it looks like the heap sort should not be a
>> disadvantage, given how the dominant case is <= 2 input fences which heap
>> sort solves with just one compare and swap. (And for the case of one 
>> input
>> fence we have a fast path in the previous patch.)
>>
>> A complementary possibility is to implement a different sorting algorithm
>> under the same API as the kernel's sort() and so keep the simplicity,
>> potentially moving the new sort under lib/ if it would be found more
>> widely useful.
> 
> Well the API would need to be different from sort() since a merge sort 
> always works with multiple inputs and a single output.

I was thinking insert sort could be good for small arrays if they are 
mostly already sorted. Reference I found was 
https://dl.acm.org/doi/pdf/10.1145/359024.359026, and although they do 
not look at lists below 50 elements, I think advantage over heap would 
hold even better.

> 
> But from the number you gathered I really don't think we are going to 
> need that.
> 
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> ---
>>   drivers/dma-buf/dma-fence-unwrap.c | 129 ++++++++++++++++-------------
>>   1 file changed, 73 insertions(+), 56 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 75c3e37fd617..750dc20a9e9d 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/dma-fence-chain.h>
>>   #include <linux/dma-fence-unwrap.h>
>>   #include <linux/slab.h>
>> +#include <linux/sort.h>
>>   /* Internal helper to start new array iteration, don't use directly */
>>   static struct dma_fence *
>> @@ -59,6 +60,25 @@ struct dma_fence *dma_fence_unwrap_next(struct 
>> dma_fence_unwrap *cursor)
>>   }
>>   EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>> +
>> +static int fence_cmp(const void *_a, const void *_b)
>> +{
>> +    struct dma_fence *a = *(struct dma_fence **)_a;
>> +    struct dma_fence *b = *(struct dma_fence **)_b;
>> +
>> +    if (a->context < b->context)
>> +        return -1;
>> +    else if (a->context > b->context)
>> +        return 1;
>> +
>> +    if (dma_fence_is_later(b, a))
>> +        return -1;
>> +    else if (dma_fence_is_later(a, b))
>> +        return 1;
>> +
>> +    return 0;
>> +}
>> +
>>   /* Implementation for the dma_fence_merge() marco, don't use 
>> directly */
>>   struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>                          struct dma_fence **fences,
>> @@ -67,9 +87,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned 
>> int num_fences,
>>       struct dma_fence *tmp, *signaled, **array;
>>       struct dma_fence_array *result;
>>       ktime_t timestamp;
>> -    unsigned int i;
>> -    size_t count;
>> +    int i, j, count;
>> +    /*
>> +     * Count number of unwrapped fences and fince the latest signaled
>> +     * timestamp.
>> +     */
> 
> What is done should be obvious from the code, only why something is done 
> needs code comment and explanation.

I was going for completing the narrative so each logical block in the 
function has a comment, versus just some. IMO it makes it easier to 
follow by making the steps nicely visually separated. But I don't feel 
strongly about this so have removed it in v2.

> 
>>       count = 0;
>>       timestamp = ns_to_ktime(0);
>>       for (i = 0; i < num_fences; ++i) {
>> @@ -98,74 +121,68 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>       else if (count == 1)
>>           return dma_fence_get(signaled);
>> +    /*
>> +     * Allocate and populate the array.
>> +     */
>>       array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>>       if (!array)
>>           return NULL;
>> -    /*
>> -     * This trashes the input fence array and uses it as position for 
>> the
>> -     * following merge loop. This works because the dma_fence_merge()
>> -     * wrapper macro is creating this temporary array on the stack 
>> together
>> -     * with the iterators.
>> -     */
>> -    for (i = 0; i < num_fences; ++i)
>> -        fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>> -
>>       count = 0;
>> -    do {
>> -        unsigned int sel;
>> -
>> -restart:
>> -        tmp = NULL;
>> -        for (i = 0; i < num_fences; ++i) {
>> -            struct dma_fence *next;
>> -
>> -            while (fences[i] && dma_fence_is_signaled(fences[i]))
>> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -
>> -            next = fences[i];
>> -            if (!next)
>> -                continue;
>> -
>> -            /*
>> -             * We can't guarantee that inpute fences are ordered by
>> -             * context, but it is still quite likely when this
>> -             * function is used multiple times. So attempt to order
>> -             * the fences by context as we pass over them and merge
>> -             * fences with the same context.
>> -             */
>> -            if (!tmp || tmp->context > next->context) {
>> -                tmp = next;
>> -                sel = i;
>> -
>> -            } else if (tmp->context < next->context) {
>> -                continue;
>> -
>> -            } else if (dma_fence_is_later(tmp, next)) {
>> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -                goto restart;
>> -            } else {
>> -                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> -                goto restart;
>> -            }
>> +    for (i = 0; i < num_fences; ++i) {
>> +        dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>> +            if (!dma_fence_is_signaled(tmp))
>> +                array[count++] = tmp;
> 
> Same problem as in patch #1, you need to grab a reference to tmp here.

Yep. v2 will appear on the list shortly and given my sloppy quality 
track record when jumping between the topics I would appreciate if 
someone else would smoke test it too.

Regards,

Tvrtko

> 
> Apart from that the patch looks good to me, but I would reduce the 
> comments.
> 
> When we need to explain what code does then the code need to be improved 
> and not documented.
> 
> Regards,
> Christian
> 
>>           }
>> -
>> -        if (tmp) {
>> -            array[count++] = dma_fence_get(tmp);
>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +    }
>> +
>> +    /*
>> +     * Equal fast-path as the above one, in case some fences got 
>> signalled
>> +     * in the meantime.
>> +     */
>> +    if (count == 0) {
>> +        tmp = dma_fence_allocate_private_stub(timestamp);
>> +        goto return_tmp;
>> +    } else if (count == 1) {
>> +        tmp = dma_fence_get(array[0]);
>> +        goto return_tmp;
>> +    }
>> +
>> +    /*
>> +     * Sort in context and seqno order.
>> +     */
>> +    sort(array, count, sizeof(*array), fence_cmp, NULL);
>> +
>> +    /*
>> +     * Only keep the most recent fence for each context.
>> +     */
>> +    j = 0;
>> +    tmp = array[0];
>> +    for (i = 1; i < count; i++) {
>> +        if (array[i]->context != tmp->context) {
>> +            array[j++] = dma_fence_get(tmp);
>>           }
>> -    } while (tmp);
>> -
>> +        tmp = array[i];
>> +    }
>> +    if (j == 0 || tmp->context != array[j - 1]->context) {
>> +        array[j++] = dma_fence_get(tmp);
>> +    }
>> +    count = j;
>> +
>> +    /*
>> +     * And another fast-path as the earlier ones.
>> +     */
>>       if (count == 0) {
>>           tmp = dma_fence_allocate_private_stub(ktime_get());
>>           goto return_tmp;
>> -    }
>> -
>> -    if (count == 1) {
>> +    } else if (count == 1) {
>>           tmp = array[0];
>>           goto return_tmp;
>>       }
>> +    /*
>> +     * Finnaly create the output fence array.
>> +     */
>>       result = dma_fence_array_create(count, array,
>>                       dma_fence_context_alloc(1),
>>                       1, false);
>

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 75c3e37fd617..750dc20a9e9d 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -12,6 +12,7 @@ 
 #include <linux/dma-fence-chain.h>
 #include <linux/dma-fence-unwrap.h>
 #include <linux/slab.h>
+#include <linux/sort.h>
 
 /* Internal helper to start new array iteration, don't use directly */
 static struct dma_fence *
@@ -59,6 +60,25 @@  struct dma_fence *dma_fence_unwrap_next(struct dma_fence_unwrap *cursor)
 }
 EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
 
+
+static int fence_cmp(const void *_a, const void *_b)
+{
+	struct dma_fence *a = *(struct dma_fence **)_a;
+	struct dma_fence *b = *(struct dma_fence **)_b;
+
+	if (a->context < b->context)
+		return -1;
+	else if (a->context > b->context)
+		return 1;
+
+	if (dma_fence_is_later(b, a))
+		return -1;
+	else if (dma_fence_is_later(a, b))
+		return 1;
+
+	return 0;
+}
+
 /* Implementation for the dma_fence_merge() marco, don't use directly */
 struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 					   struct dma_fence **fences,
@@ -67,9 +87,12 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	struct dma_fence *tmp, *signaled, **array;
 	struct dma_fence_array *result;
 	ktime_t timestamp;
-	unsigned int i;
-	size_t count;
+	int i, j, count;
 
+	/*
+	 * Count number of unwrapped fences and fince the latest signaled
+	 * timestamp.
+	 */
 	count = 0;
 	timestamp = ns_to_ktime(0);
 	for (i = 0; i < num_fences; ++i) {
@@ -98,74 +121,68 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	else if (count == 1)
 		return dma_fence_get(signaled);
 
+	/*
+	 * Allocate and populate the array.
+	 */
 	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
 	if (!array)
 		return NULL;
 
-	/*
-	 * This trashes the input fence array and uses it as position for the
-	 * following merge loop. This works because the dma_fence_merge()
-	 * wrapper macro is creating this temporary array on the stack together
-	 * with the iterators.
-	 */
-	for (i = 0; i < num_fences; ++i)
-		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
 	count = 0;
-	do {
-		unsigned int sel;
-
-restart:
-		tmp = NULL;
-		for (i = 0; i < num_fences; ++i) {
-			struct dma_fence *next;
-
-			while (fences[i] && dma_fence_is_signaled(fences[i]))
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-
-			next = fences[i];
-			if (!next)
-				continue;
-
-			/*
-			 * We can't guarantee that inpute fences are ordered by
-			 * context, but it is still quite likely when this
-			 * function is used multiple times. So attempt to order
-			 * the fences by context as we pass over them and merge
-			 * fences with the same context.
-			 */
-			if (!tmp || tmp->context > next->context) {
-				tmp = next;
-				sel = i;
-
-			} else if (tmp->context < next->context) {
-				continue;
-
-			} else if (dma_fence_is_later(tmp, next)) {
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-				goto restart;
-			} else {
-				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
-				goto restart;
-			}
+	for (i = 0; i < num_fences; ++i) {
+		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+			if (!dma_fence_is_signaled(tmp))
+				array[count++] = tmp;
 		}
-
-		if (tmp) {
-			array[count++] = dma_fence_get(tmp);
-			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
+	}
+
+	/*
+	 * Equal fast-path as the above one, in case some fences got signalled
+	 * in the meantime.
+	 */
+	if (count == 0) {
+		tmp = dma_fence_allocate_private_stub(timestamp);
+		goto return_tmp;
+	} else if (count == 1) {
+		tmp = dma_fence_get(array[0]);
+		goto return_tmp;
+	}
+
+	/*
+	 * Sort in context and seqno order.
+	 */
+	sort(array, count, sizeof(*array), fence_cmp, NULL);
+
+	/*
+	 * Only keep the most recent fence for each context.
+	 */
+	j = 0;
+	tmp = array[0];
+	for (i = 1; i < count; i++) {
+		if (array[i]->context != tmp->context) {
+			array[j++] = dma_fence_get(tmp);
 		}
-	} while (tmp);
-
+		tmp = array[i];
+	}
+	if (j == 0 || tmp->context != array[j - 1]->context) {
+		array[j++] = dma_fence_get(tmp);
+	}
+	count = j;
+
+	/*
+	 * And another fast-path as the earlier ones.
+	 */
 	if (count == 0) {
 		tmp = dma_fence_allocate_private_stub(ktime_get());
 		goto return_tmp;
-	}
-
-	if (count == 1) {
+	} else if (count == 1) {
 		tmp = array[0];
 		goto return_tmp;
 	}
 
+	/*
+	 * Finnaly create the output fence array.
+	 */
 	result = dma_fence_array_create(count, array,
 					dma_fence_context_alloc(1),
 					1, false);

[RFC,2/2] dma-fence: Use kernel's sort for merging fences

Commit Message

Comments

Patch