diff mbox series

[1/2] dma-fence: Add a single fence fast path for fence merging

Message ID 20241113171947.57446-1-tursulin@igalia.com (mailing list archive)
State New, archived
Headers show
Series [1/2] dma-fence: Add a single fence fast path for fence merging | expand

Commit Message

Tvrtko Ursulin Nov. 13, 2024, 5:19 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Testing some workloads in two different scenarios, such as games running
under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
that in a significant portion of calls the dma_fence_unwrap_merge helper
is called with just a single unsignalled fence.

Therefore it is worthile to add a fast path for that case and so bypass
the memory allocation and insertion sort attempts.

Tested scenarios:

1) Hogwarts Legacy under Gamescope

450 calls per second to __dma_fence_unwrap_merge.

Percentages per number of fences buckets, before and after checking for
signalled status, sorting and flattening:

   N       Before      After
   0       0.85%
   1      69.80%        ->   The new fast path.
  2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 fence)
 10-19
 20-40
  50+

2) Cyberpunk 2077 under Gamescope

1050 calls per second.

   N       Before      After
   0       0.71%
   1      52.53%        ->    The new fast path.
  2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
 10-19     2.34%
 20-40     0.06%
  50+

3) vkcube under Plasma

90 calls per second.

   N       Before      After
   0
   1
  2-9      100%         0%   (Ie. all resolved to a single fence)
 10-19
 20-40
  50+

In the case of vkcube all invocations in the 2-9 bucket were actually
just two input fences.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Friedrich Vock <friedrich.vock@gmx.de>
---
 drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Christian König Nov. 14, 2024, 8:53 a.m. UTC | #1
Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Testing some workloads in two different scenarios, such as games running
> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
> that in a significant portion of calls the dma_fence_unwrap_merge helper
> is called with just a single unsignalled fence.
>
> Therefore it is worthile to add a fast path for that case and so bypass
> the memory allocation and insertion sort attempts.

You should probably re-order the patches since we need to backport the 
second as bug fix while the first is just an improvement.

There is also a bug in this patch which needs to be fixed.

> Tested scenarios:
>
> 1) Hogwarts Legacy under Gamescope
>
> 450 calls per second to __dma_fence_unwrap_merge.
>
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
>
>     N       Before      After
>     0       0.85%
>     1      69.80%        ->   The new fast path.
>    2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 fence)
>   10-19
>   20-40
>    50+
>
> 2) Cyberpunk 2077 under Gamescope
>
> 1050 calls per second.
>
>     N       Before      After
>     0       0.71%
>     1      52.53%        ->    The new fast path.
>    2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
>   10-19     2.34%
>   20-40     0.06%
>    50+
>
> 3) vkcube under Plasma
>
> 90 calls per second.
>
>     N       Before      After
>     0
>     1
>    2-9      100%         0%   (Ie. all resolved to a single fence)
>   10-19
>   20-40
>    50+
>
> In the case of vkcube all invocations in the 2-9 bucket were actually
> just two input fences.

Nice to have some numbers at hand.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..75c3e37fd617 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   					   struct dma_fence **fences,
>   					   struct dma_fence_unwrap *iter)
>   {
> +	struct dma_fence *tmp, *signaled, **array;

I would name that unsignaled instead.

>   	struct dma_fence_array *result;
> -	struct dma_fence *tmp, **array;
>   	ktime_t timestamp;
>   	unsigned int i;
>   	size_t count;
> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	for (i = 0; i < num_fences; ++i) {
>   		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>   			if (!dma_fence_is_signaled(tmp)) {
> +				signaled = tmp;

You need to grab a reference to tmp here if you want to keep it.

It's perfectly possible that tmp is garbage collected as soon as you go 
to the next iteration or leave the loop.

Regards,
Christian.

>   				++count;
>   			} else {
>   				ktime_t t = dma_fence_timestamp(tmp);
> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   	/*
>   	 * If we couldn't find a pending fence just return a private signaled
>   	 * fence with the timestamp of the last signaled one.
> +	 *
> +	 * Or if there was a single unsignaled fence left we can return it
> +	 * directly and early since that is a major path on many workloads.
>   	 */
>   	if (count == 0)
>   		return dma_fence_allocate_private_stub(timestamp);
> +	else if (count == 1)
> +		return dma_fence_get(signaled);
>   
>   	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>   	if (!array)
Tvrtko Ursulin Nov. 14, 2024, 9:05 a.m. UTC | #2
On 14/11/2024 08:53, Christian König wrote:
> Am 13.11.24 um 18:19 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Testing some workloads in two different scenarios, such as games running
>> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
>> that in a significant portion of calls the dma_fence_unwrap_merge helper
>> is called with just a single unsignalled fence.
>>
>> Therefore it is worthile to add a fast path for that case and so bypass
>> the memory allocation and insertion sort attempts.
> 
> You should probably re-order the patches since we need to backport the 
> second as bug fix while the first is just an improvement.

Ok.

> There is also a bug in this patch which needs to be fixed.
> 
>> Tested scenarios:
>>
>> 1) Hogwarts Legacy under Gamescope
>>
>> 450 calls per second to __dma_fence_unwrap_merge.
>>
>> Percentages per number of fences buckets, before and after checking for
>> signalled status, sorting and flattening:
>>
>>     N       Before      After
>>     0       0.85%
>>     1      69.80%        ->   The new fast path.
>>    2-9     29.36%        9%   (Ie. 91% of this bucket flattened to 1 
>> fence)
>>   10-19
>>   20-40
>>    50+
>>
>> 2) Cyberpunk 2077 under Gamescope
>>
>> 1050 calls per second.
>>
>>     N       Before      After
>>     0       0.71%
>>     1      52.53%        ->    The new fast path.
>>    2-9     44.38%      50.60%  (Ie. half resolved to a single fence)
>>   10-19     2.34%
>>   20-40     0.06%
>>    50+
>>
>> 3) vkcube under Plasma
>>
>> 90 calls per second.
>>
>>     N       Before      After
>>     0
>>     1
>>    2-9      100%         0%   (Ie. all resolved to a single fence)
>>   10-19
>>   20-40
>>    50+
>>
>> In the case of vkcube all invocations in the 2-9 bucket were actually
>> just two input fences.
> 
> Nice to have some numbers at hand.
> 
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> ---
>>   drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..75c3e37fd617 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned 
>> int num_fences,
>>                          struct dma_fence **fences,
>>                          struct dma_fence_unwrap *iter)
>>   {
>> +    struct dma_fence *tmp, *signaled, **array;
> 
> I would name that unsignaled instead.

Indeed. And a polite way of pointing out my brain was completely 
reversed. :)

>>       struct dma_fence_array *result;
>> -    struct dma_fence *tmp, **array;
>>       ktime_t timestamp;
>>       unsigned int i;
>>       size_t count;
>> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned 
>> int num_fences,
>>       for (i = 0; i < num_fences; ++i) {
>>           dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>>               if (!dma_fence_is_signaled(tmp)) {
>> +                signaled = tmp;
> 
> You need to grab a reference to tmp here if you want to keep it.
> 
> It's perfectly possible that tmp is garbage collected as soon as you go 
> to the next iteration or leave the loop.

Yep.. same bug in the second patch.

Regards,

Tvrtko

>>                   ++count;
>>               } else {
>>                   ktime_t t = dma_fence_timestamp(tmp);
>> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned 
>> int num_fences,
>>       /*
>>        * If we couldn't find a pending fence just return a private 
>> signaled
>>        * fence with the timestamp of the last signaled one.
>> +     *
>> +     * Or if there was a single unsignaled fence left we can return it
>> +     * directly and early since that is a major path on many workloads.
>>        */
>>       if (count == 0)
>>           return dma_fence_allocate_private_stub(timestamp);
>> +    else if (count == 1)
>> +        return dma_fence_get(signaled);
>>       array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>>       if (!array)
>
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..75c3e37fd617 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -64,8 +64,8 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 					   struct dma_fence **fences,
 					   struct dma_fence_unwrap *iter)
 {
+	struct dma_fence *tmp, *signaled, **array;
 	struct dma_fence_array *result;
-	struct dma_fence *tmp, **array;
 	ktime_t timestamp;
 	unsigned int i;
 	size_t count;
@@ -75,6 +75,7 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	for (i = 0; i < num_fences; ++i) {
 		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
 			if (!dma_fence_is_signaled(tmp)) {
+				signaled = tmp;
 				++count;
 			} else {
 				ktime_t t = dma_fence_timestamp(tmp);
@@ -88,9 +89,14 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	/*
 	 * If we couldn't find a pending fence just return a private signaled
 	 * fence with the timestamp of the last signaled one.
+	 *
+	 * Or if there was a single unsignaled fence left we can return it
+	 * directly and early since that is a major path on many workloads.
 	 */
 	if (count == 0)
 		return dma_fence_allocate_private_stub(timestamp);
+	else if (count == 1)
+		return dma_fence_get(signaled);
 
 	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
 	if (!array)