diff mbox series

[2/3] dma-buf: sort fences in dma_fence_unwrap_merge

Message ID 20241024124159.4519-3-christian.koenig@amd.com (mailing list archive)
State New
Headers show
Series [1/3] dma-buf/dma-fence_array: use kvzalloc | expand

Commit Message

Christian König Oct. 24, 2024, 12:41 p.m. UTC
The merge function initially handled only individual fences and
arrays which in turn were created by the merge function. This allowed
to create the new array by a simple merge sort based on the fence
context number.

The problem is now that since the addition of timeline sync objects
userspace can create chain containers in basically any fence context
order.

If those are merged together it can happen that we create really
large arrays since the merge sort algorithm doesn't work any more.

So put an insert sort behind the merge sort which kicks in when the
input fences are not in the expected order. This isn't as efficient
as a heap sort, but has better properties for the most common use
case.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 5 deletions(-)

Comments

Friedrich Vock Oct. 30, 2024, 6:10 p.m. UTC | #1
On 24.10.24 14:41, Christian König wrote:
> The merge function initially handled only individual fences and
> arrays which in turn were created by the merge function. This allowed
> to create the new array by a simple merge sort based on the fence
> context number.
>
> The problem is now that since the addition of timeline sync objects
> userspace can create chain containers in basically any fence context
> order.
>
> If those are merged together it can happen that we create really
> large arrays since the merge sort algorithm doesn't work any more.
>
> So put an insert sort behind the merge sort which kicks in when the
> input fences are not in the expected order. This isn't as efficient
> as a heap sort, but has better properties for the most common use
> case.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>   1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..d9aa280d9ff6 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -106,7 +106,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>
>   	count = 0;
> -	do {
> +	while (true) {
>   		unsigned int sel;
>
>   restart:
> @@ -144,11 +144,40 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   			}
>   		}
>
> -		if (tmp) {
> -			array[count++] = dma_fence_get(tmp);
> -			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +		if (!tmp)
> +			break;
> +
> +		/*
> +		 * We could use a binary search here, but since the assumption
> +		 * is that the main input are already sorted dma_fence_arrays
> +		 * just looking from end has a higher chance of finding the
> +		 * right location on the first try
> +		 */
> +
> +		for (i = count; i--;) {

This is broken. The first iteration of this loop will always index out
of bounds. What you probably want here is:

+		for (i = count - 1; count && i--;) {

This intentionally overflows for count == 0, but the ++i after the loop
undoes that. Maybe it would be worth a comment to point out that's
intentional.

> +			if (likely(array[i]->context < tmp->context))
> +				break;
> +
> +			if (array[i]->context == tmp->context) {
> +				if (dma_fence_is_later(tmp, array[i])) {
> +					dma_fence_put(array[i]);
> +					array[i] = dma_fence_get(tmp);
> +				}
> +				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +				goto restart;
> +			}
>   		}
> -	} while (tmp);
> +
> +		++i;
> +		/*
> +		 * Make room for the fence, this should be a nop most of the
> +		 * time.
> +		 */
> +		memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));

Need memmove here, src and dst alias.

I took it for a spin with these things fixed and it seemed to resolve
the issue as well. How do you want to proceed? I guess I would be
comfortable putting a Reviewed-by and/or Tested-by on a version with
these things fixed (with the usual caveat that I'm not a maintainer - I
guess the process requires (at least one) reviewer to be?).

By the way, I guess you might've had some internal branches where this
fix needed to go into quick or something? Usually I'm happy to make a v2
for my patches myself, too ;)

Regards,
Friedrich

> +		array[i] = dma_fence_get(tmp);
> +		fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +		count++;
> +	};
>
>   	if (count == 0) {
>   		tmp = dma_fence_allocate_private_stub(ktime_get());
Christian König Nov. 6, 2024, 11:59 a.m. UTC | #2
Am 30.10.24 um 19:10 schrieb Friedrich Vock:
> On 24.10.24 14:41, Christian König wrote:
>> The merge function initially handled only individual fences and
>> arrays which in turn were created by the merge function. This allowed
>> to create the new array by a simple merge sort based on the fence
>> context number.
>>
>> The problem is now that since the addition of timeline sync objects
>> userspace can create chain containers in basically any fence context
>> order.
>>
>> If those are merged together it can happen that we create really
>> large arrays since the merge sort algorithm doesn't work any more.
>>
>> So put an insert sort behind the merge sort which kicks in when the
>> input fences are not in the expected order. This isn't as efficient
>> as a heap sort, but has better properties for the most common use
>> case.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..d9aa280d9ff6 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -106,7 +106,7 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>           fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>
>>       count = 0;
>> -    do {
>> +    while (true) {
>>           unsigned int sel;
>>
>>   restart:
>> @@ -144,11 +144,40 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>               }
>>           }
>>
>> -        if (tmp) {
>> -            array[count++] = dma_fence_get(tmp);
>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +        if (!tmp)
>> +            break;
>> +
>> +        /*
>> +         * We could use a binary search here, but since the assumption
>> +         * is that the main input are already sorted dma_fence_arrays
>> +         * just looking from end has a higher chance of finding the
>> +         * right location on the first try
>> +         */
>> +
>> +        for (i = count; i--;) {
>
> This is broken. The first iteration of this loop will always index out
> of bounds.

Nope, that is correct. The condition is evaluated before the loop, so 
the i-- reduces the index to the last element in the array.

Regards,
Christian.


> What you probably want here is:
>
> +        for (i = count - 1; count && i--;) {
>
> This intentionally overflows for count == 0, but the ++i after the loop
> undoes that. Maybe it would be worth a comment to point out that's
> intentional.
>
>> +            if (likely(array[i]->context < tmp->context))
>> +                break;
>> +
>> +            if (array[i]->context == tmp->context) {
>> +                if (dma_fence_is_later(tmp, array[i])) {
>> +                    dma_fence_put(array[i]);
>> +                    array[i] = dma_fence_get(tmp);
>> +                }
>> +                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +                goto restart;
>> +            }
>>           }
>> -    } while (tmp);
>> +
>> +        ++i;
>> +        /*
>> +         * Make room for the fence, this should be a nop most of the
>> +         * time.
>> +         */
>> +        memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
>
> Need memmove here, src and dst alias.
>
> I took it for a spin with these things fixed and it seemed to resolve
> the issue as well. How do you want to proceed? I guess I would be
> comfortable putting a Reviewed-by and/or Tested-by on a version with
> these things fixed (with the usual caveat that I'm not a maintainer - I
> guess the process requires (at least one) reviewer to be?).
>
> By the way, I guess you might've had some internal branches where this
> fix needed to go into quick or something? Usually I'm happy to make a v2
> for my patches myself, too ;)
>
> Regards,
> Friedrich
>
>> +        array[i] = dma_fence_get(tmp);
>> +        fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +        count++;
>> +    };
>>
>>       if (count == 0) {
>>           tmp = dma_fence_allocate_private_stub(ktime_get());
>
Tvrtko Ursulin Nov. 7, 2024, 4 p.m. UTC | #3
On 24/10/2024 13:41, Christian König wrote:
> The merge function initially handled only individual fences and
> arrays which in turn were created by the merge function. This allowed
> to create the new array by a simple merge sort based on the fence
> context number.
> 
> The problem is now that since the addition of timeline sync objects
> userspace can create chain containers in basically any fence context
> order.
> 
> If those are merged together it can happen that we create really
> large arrays since the merge sort algorithm doesn't work any more.
> 
> So put an insert sort behind the merge sort which kicks in when the
> input fences are not in the expected order. This isn't as efficient
> as a heap sort, but has better properties for the most common use
> case.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>   1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..d9aa280d9ff6 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -106,7 +106,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>   
>   	count = 0;
> -	do {
> +	while (true) {
>   		unsigned int sel;
>   
>   restart:
> @@ -144,11 +144,40 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>   			}
>   		}
>   
> -		if (tmp) {
> -			array[count++] = dma_fence_get(tmp);
> -			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +		if (!tmp)
> +			break;
> +
> +		/*
> +		 * We could use a binary search here, but since the assumption
> +		 * is that the main input are already sorted dma_fence_arrays
> +		 * just looking from end has a higher chance of finding the
> +		 * right location on the first try
> +		 */
> +
> +		for (i = count; i--;) {
> +			if (likely(array[i]->context < tmp->context))
> +				break;
> +
> +			if (array[i]->context == tmp->context) {
> +				if (dma_fence_is_later(tmp, array[i])) {
> +					dma_fence_put(array[i]);
> +					array[i] = dma_fence_get(tmp);
> +				}
> +				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +				goto restart;
> +			}
>   		}
> -	} while (tmp);
> +
> +		++i;
> +		/*
> +		 * Make room for the fence, this should be a nop most of the
> +		 * time.
> +		 */
> +		memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
> +		array[i] = dma_fence_get(tmp);
> +		fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +		count++;

Having ventured into this function for the first time, I can say that 
this is some smart code which is not easy to grasp. It could definitely 
benefit from a high level comment before the do-while loop to explain 
what it is going to do.

Next and tmp local variable names I also wonder if could be renamed to 
something more descriptive.

And the algorithmic complexity of the end result, given the multiple 
loops and gotos, I have no idea what it could be.

Has a dumb solution been considered like a two-pass with a 
pessimistically allocated fence array been considered? Like:

1) Populate array with all unsignalled unwrapped fences. (O(count))

2) Bog standard include/linux/sort.h by context and seqno. (O(count*log 
(count)))

3) Walk array and squash same context to latest fence. (Before this 
patch that wasn't there, right?). (O(count)) (Overwrite in place, no 
memcpy needed.)

Algorithmic complexity of that would be obvious and code much simpler.

Regards,

Tvrtko

> +	};
>   
>   	if (count == 0) {
>   		tmp = dma_fence_allocate_private_stub(ktime_get());
Tvrtko Ursulin Nov. 8, 2024, 11:22 a.m. UTC | #4
On 07/11/2024 16:00, Tvrtko Ursulin wrote:
> 
> On 24/10/2024 13:41, Christian König wrote:
>> The merge function initially handled only individual fences and
>> arrays which in turn were created by the merge function. This allowed
>> to create the new array by a simple merge sort based on the fence
>> context number.
>>
>> The problem is now that since the addition of timeline sync objects
>> userspace can create chain containers in basically any fence context
>> order.
>>
>> If those are merged together it can happen that we create really
>> large arrays since the merge sort algorithm doesn't work any more.
>>
>> So put an insert sort behind the merge sort which kicks in when the
>> input fences are not in the expected order. This isn't as efficient
>> as a heap sort, but has better properties for the most common use
>> case.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..d9aa280d9ff6 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -106,7 +106,7 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>           fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>       count = 0;
>> -    do {
>> +    while (true) {
>>           unsigned int sel;
>>   restart:
>> @@ -144,11 +144,40 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>               }
>>           }
>> -        if (tmp) {
>> -            array[count++] = dma_fence_get(tmp);
>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +        if (!tmp)
>> +            break;
>> +
>> +        /*
>> +         * We could use a binary search here, but since the assumption
>> +         * is that the main input are already sorted dma_fence_arrays
>> +         * just looking from end has a higher chance of finding the
>> +         * right location on the first try
>> +         */
>> +
>> +        for (i = count; i--;) {
>> +            if (likely(array[i]->context < tmp->context))
>> +                break;
>> +
>> +            if (array[i]->context == tmp->context) {
>> +                if (dma_fence_is_later(tmp, array[i])) {
>> +                    dma_fence_put(array[i]);
>> +                    array[i] = dma_fence_get(tmp);
>> +                }
>> +                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +                goto restart;
>> +            }
>>           }
>> -    } while (tmp);
>> +
>> +        ++i;
>> +        /*
>> +         * Make room for the fence, this should be a nop most of the
>> +         * time.
>> +         */
>> +        memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
>> +        array[i] = dma_fence_get(tmp);
>> +        fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +        count++;
> 
> Having ventured into this function for the first time, I can say that 
> this is some smart code which is not easy to grasp. It could definitely 
> benefit from a high level comment before the do-while loop to explain 
> what it is going to do.
> 
> Next and tmp local variable names I also wonder if could be renamed to 
> something more descriptive.
> 
> And the algorithmic complexity of the end result, given the multiple 
> loops and gotos, I have no idea what it could be.
> 
> Has a dumb solution been considered like a two-pass with a 
> pessimistically allocated fence array been considered? Like:
> 
> 1) Populate array with all unsignalled unwrapped fences. (O(count))
> 
> 2) Bog standard include/linux/sort.h by context and seqno. (O(count*log 
> (count)))
> 
> 3) Walk array and squash same context to latest fence. (Before this 
> patch that wasn't there, right?). (O(count)) (Overwrite in place, no 
> memcpy needed.)
> 
> Algorithmic complexity of that would be obvious and code much simpler.

FWIW something like the below passes selftests. How does it look to you? 
Do you think more or less efficient and more or less readable?

commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9
Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Date:   Fri Nov 8 10:14:15 2024 +0000

     __dma_fence_unwrap_merge

     Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..47d67e482e96 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -12,6 +12,7 @@
  #include <linux/dma-fence-chain.h>
  #include <linux/dma-fence-unwrap.h>
  #include <linux/slab.h>
+#include <linux/sort.h>

  /* Internal helper to start new array iteration, don't use directly */
  static struct dma_fence *
@@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct 
dma_fence_unwrap *cursor)
  }
  EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);

+
+static int fence_cmp(const void *_a, const void *_b)
+{
+	const struct dma_fence *a = *(const struct dma_fence **)_a;
+	const struct dma_fence *b = *(const struct dma_fence **)_b;
+
+	if (a->context < b->context)
+		return -1;
+	else if (a->context > b->context)
+		return 1;
+
+	if (a->seqno < b->seqno)
+		return -1;
+	else if (a->seqno > b->seqno)
+		return 1;
+
+	return 0;
+}
+
  /* Implementation for the dma_fence_merge() marco, don't use directly */
  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
  					   struct dma_fence **fences,
  					   struct dma_fence_unwrap *iter)
  {
-	struct dma_fence_array *result;
  	struct dma_fence *tmp, **array;
+	struct dma_fence_array *result;
  	ktime_t timestamp;
-	unsigned int i;
-	size_t count;
+	int i, j, count;

+	/*
+	 * Count number of unwrapped fences and fince the latest signaled
+	 * timestamp.
+	 */
  	count = 0;
  	timestamp = ns_to_ktime(0);
  	for (i = 0; i < num_fences; ++i) {
@@ -92,63 +115,41 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned 
int num_fences,
  	if (count == 0)
  		return dma_fence_allocate_private_stub(timestamp);

+	/*
+	 * Allocate and populate the array.
+	 */
  	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
  	if (!array)
  		return NULL;

-	/*
-	 * This trashes the input fence array and uses it as position for the
-	 * following merge loop. This works because the dma_fence_merge()
-	 * wrapper macro is creating this temporary array on the stack together
-	 * with the iterators.
-	 */
-	for (i = 0; i < num_fences; ++i)
-		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
-
  	count = 0;
-	do {
-		unsigned int sel;
-
-restart:
-		tmp = NULL;
-		for (i = 0; i < num_fences; ++i) {
-			struct dma_fence *next;
-
-			while (fences[i] && dma_fence_is_signaled(fences[i]))
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-
-			next = fences[i];
-			if (!next)
-				continue;
-
-			/*
-			 * We can't guarantee that inpute fences are ordered by
-			 * context, but it is still quite likely when this
-			 * function is used multiple times. So attempt to order
-			 * the fences by context as we pass over them and merge
-			 * fences with the same context.
-			 */
-			if (!tmp || tmp->context > next->context) {
-				tmp = next;
-				sel = i;
-
-			} else if (tmp->context < next->context) {
-				continue;
-
-			} else if (dma_fence_is_later(tmp, next)) {
-				fences[i] = dma_fence_unwrap_next(&iter[i]);
-				goto restart;
-			} else {
-				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
-				goto restart;
-			}
+	for (i = 0; i < num_fences; ++i) {
+		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
+			if (!dma_fence_is_signaled(tmp))
+				array[count++] = tmp;
  		}
+	}
+
+	/*
+	 * Sort in context and seqno order.
+	 */
+	sort(array, count, sizeof(*array), fence_cmp, NULL);

-		if (tmp) {
-			array[count++] = dma_fence_get(tmp);
-			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
+	/*
+	 * Only keep the most recent fence for each context.
+	 */
+	j = 0;
+	tmp = array[0];
+	for (i = 1; i < count; i++) {
+		if (array[i]->context != tmp->context) {
+			array[j++] = dma_fence_get(tmp);
  		}
-	} while (tmp);
+		tmp = array[i];
+	}
+	if (tmp->context != array[j - 1]->context) {
+		array[j++] = dma_fence_get(tmp);
+	}
+	count = j;

  	if (count == 0) {
  		tmp = dma_fence_allocate_private_stub(ktime_get());


Regards,

Tvrtko


> 
> Regards,
> 
> Tvrtko
> 
>> +    };
>>       if (count == 0) {
>>           tmp = dma_fence_allocate_private_stub(ktime_get());
Christian König Nov. 8, 2024, 2:58 p.m. UTC | #5
Am 08.11.24 um 12:22 schrieb Tvrtko Ursulin:
> On 07/11/2024 16:00, Tvrtko Ursulin wrote:
>> On 24/10/2024 13:41, Christian König wrote:
>>> The merge function initially handled only individual fences and
>>> arrays which in turn were created by the merge function. This allowed
>>> to create the new array by a simple merge sort based on the fence
>>> context number.
>>>
>>> The problem is now that since the addition of timeline sync objects
>>> userspace can create chain containers in basically any fence context
>>> order.
>>>
>>> If those are merged together it can happen that we create really
>>> large arrays since the merge sort algorithm doesn't work any more.
>>>
>>> So put an insert sort behind the merge sort which kicks in when the
>>> input fences are not in the expected order. This isn't as efficient
>>> as a heap sort, but has better properties for the most common use
>>> case.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/dma-buf/dma-fence-unwrap.c | 39 
>>> ++++++++++++++++++++++++++----
>>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>>> b/drivers/dma-buf/dma-fence-unwrap.c
>>> index 628af51c81af..d9aa280d9ff6 100644
>>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>>> @@ -106,7 +106,7 @@ struct dma_fence 
>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>           fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>>       count = 0;
>>> -    do {
>>> +    while (true) {
>>>           unsigned int sel;
>>>   restart:
>>> @@ -144,11 +144,40 @@ struct dma_fence 
>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>               }
>>>           }
>>> -        if (tmp) {
>>> -            array[count++] = dma_fence_get(tmp);
>>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> +        if (!tmp)
>>> +            break;
>>> +
>>> +        /*
>>> +         * We could use a binary search here, but since the assumption
>>> +         * is that the main input are already sorted dma_fence_arrays
>>> +         * just looking from end has a higher chance of finding the
>>> +         * right location on the first try
>>> +         */
>>> +
>>> +        for (i = count; i--;) {
>>> +            if (likely(array[i]->context < tmp->context))
>>> +                break;
>>> +
>>> +            if (array[i]->context == tmp->context) {
>>> +                if (dma_fence_is_later(tmp, array[i])) {
>>> +                    dma_fence_put(array[i]);
>>> +                    array[i] = dma_fence_get(tmp);
>>> +                }
>>> +                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> +                goto restart;
>>> +            }
>>>           }
>>> -    } while (tmp);
>>> +
>>> +        ++i;
>>> +        /*
>>> +         * Make room for the fence, this should be a nop most of the
>>> +         * time.
>>> +         */
>>> +        memcpy(&array[i + 1], &array[i], (count - i) * 
>>> sizeof(*array));
>>> +        array[i] = dma_fence_get(tmp);
>>> +        fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>> +        count++;
>>
>> Having ventured into this function for the first time, I can say that 
>> this is some smart code which is not easy to grasp. It could 
>> definitely benefit from a high level comment before the do-while loop 
>> to explain what it is going to do.
>>
>> Next and tmp local variable names I also wonder if could be renamed 
>> to something more descriptive.
>>
>> And the algorithmic complexity of the end result, given the multiple 
>> loops and gotos, I have no idea what it could be.
>>
>> Has a dumb solution been considered like a two-pass with a 
>> pessimistically allocated fence array been considered? Like:
>>
>> 1) Populate array with all unsignalled unwrapped fences. (O(count))
>>
>> 2) Bog standard include/linux/sort.h by context and seqno. 
>> (O(count*log (count)))
>>
>> 3) Walk array and squash same context to latest fence. (Before this 
>> patch that wasn't there, right?). (O(count)) (Overwrite in place, no 
>> memcpy needed.)
>>
>> Algorithmic complexity of that would be obvious and code much simpler.
>
> FWIW something like the below passes selftests. How does it look to 
> you? Do you think more or less efficient and more or less readable?

Yeah I was considering the exact same thing.

What hold me back was the fact that the heap sort() implementation is 
really inefficient for the most common use case of this. In other words 
two arrays with fences already sorted is basically just O(count).

And I'm also not sure how many fences we see in those arrays in 
practice. With Vulkan basically trying to feed multiple contexts to keep 
all CPUs busy we might have quite a number here.

Regards,
Christian.

>
> commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9
> Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Date:   Fri Nov 8 10:14:15 2024 +0000
>
>     __dma_fence_unwrap_merge
>
>     Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
> b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..47d67e482e96 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -12,6 +12,7 @@
>  #include <linux/dma-fence-chain.h>
>  #include <linux/dma-fence-unwrap.h>
>  #include <linux/slab.h>
> +#include <linux/sort.h>
>
>  /* Internal helper to start new array iteration, don't use directly */
>  static struct dma_fence *
> @@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct 
> dma_fence_unwrap *cursor)
>  }
>  EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>
> +
> +static int fence_cmp(const void *_a, const void *_b)
> +{
> +    const struct dma_fence *a = *(const struct dma_fence **)_a;
> +    const struct dma_fence *b = *(const struct dma_fence **)_b;
> +
> +    if (a->context < b->context)
> +        return -1;
> +    else if (a->context > b->context)
> +        return 1;
> +
> +    if (a->seqno < b->seqno)
> +        return -1;
> +    else if (a->seqno > b->seqno)
> +        return 1;
> +
> +    return 0;
> +}
> +
>  /* Implementation for the dma_fence_merge() marco, don't use directly */
>  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>                         struct dma_fence **fences,
>                         struct dma_fence_unwrap *iter)
>  {
> -    struct dma_fence_array *result;
>      struct dma_fence *tmp, **array;
> +    struct dma_fence_array *result;
>      ktime_t timestamp;
> -    unsigned int i;
> -    size_t count;
> +    int i, j, count;
>
> +    /*
> +     * Count number of unwrapped fences and fince the latest signaled
> +     * timestamp.
> +     */
>      count = 0;
>      timestamp = ns_to_ktime(0);
>      for (i = 0; i < num_fences; ++i) {
> @@ -92,63 +115,41 @@ struct dma_fence 
> *__dma_fence_unwrap_merge(unsigned int num_fences,
>      if (count == 0)
>          return dma_fence_allocate_private_stub(timestamp);
>
> +    /*
> +     * Allocate and populate the array.
> +     */
>      array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>      if (!array)
>          return NULL;
>
> -    /*
> -     * This trashes the input fence array and uses it as position for 
> the
> -     * following merge loop. This works because the dma_fence_merge()
> -     * wrapper macro is creating this temporary array on the stack 
> together
> -     * with the iterators.
> -     */
> -    for (i = 0; i < num_fences; ++i)
> -        fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
> -
>      count = 0;
> -    do {
> -        unsigned int sel;
> -
> -restart:
> -        tmp = NULL;
> -        for (i = 0; i < num_fences; ++i) {
> -            struct dma_fence *next;
> -
> -            while (fences[i] && dma_fence_is_signaled(fences[i]))
> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
> -
> -            next = fences[i];
> -            if (!next)
> -                continue;
> -
> -            /*
> -             * We can't guarantee that inpute fences are ordered by
> -             * context, but it is still quite likely when this
> -             * function is used multiple times. So attempt to order
> -             * the fences by context as we pass over them and merge
> -             * fences with the same context.
> -             */
> -            if (!tmp || tmp->context > next->context) {
> -                tmp = next;
> -                sel = i;
> -
> -            } else if (tmp->context < next->context) {
> -                continue;
> -
> -            } else if (dma_fence_is_later(tmp, next)) {
> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
> -                goto restart;
> -            } else {
> -                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> -                goto restart;
> -            }
> +    for (i = 0; i < num_fences; ++i) {
> +        dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
> +            if (!dma_fence_is_signaled(tmp))
> +                array[count++] = tmp;
>          }
> +    }
> +
> +    /*
> +     * Sort in context and seqno order.
> +     */
> +    sort(array, count, sizeof(*array), fence_cmp, NULL);
>
> -        if (tmp) {
> -            array[count++] = dma_fence_get(tmp);
> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
> +    /*
> +     * Only keep the most recent fence for each context.
> +     */
> +    j = 0;
> +    tmp = array[0];
> +    for (i = 1; i < count; i++) {
> +        if (array[i]->context != tmp->context) {
> +            array[j++] = dma_fence_get(tmp);
>          }
> -    } while (tmp);
> +        tmp = array[i];
> +    }
> +    if (tmp->context != array[j - 1]->context) {
> +        array[j++] = dma_fence_get(tmp);
> +    }
> +    count = j;
>
>      if (count == 0) {
>          tmp = dma_fence_allocate_private_stub(ktime_get());
>
>
> Regards,
>
> Tvrtko
>
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +    };
>>>       if (count == 0) {
>>>           tmp = dma_fence_allocate_private_stub(ktime_get());
Tvrtko Ursulin Nov. 8, 2024, 4:12 p.m. UTC | #6
On 08/11/2024 14:58, Christian König wrote:
> Am 08.11.24 um 12:22 schrieb Tvrtko Ursulin:
>> On 07/11/2024 16:00, Tvrtko Ursulin wrote:
>>> On 24/10/2024 13:41, Christian König wrote:
>>>> The merge function initially handled only individual fences and
>>>> arrays which in turn were created by the merge function. This allowed
>>>> to create the new array by a simple merge sort based on the fence
>>>> context number.
>>>>
>>>> The problem is now that since the addition of timeline sync objects
>>>> userspace can create chain containers in basically any fence context
>>>> order.
>>>>
>>>> If those are merged together it can happen that we create really
>>>> large arrays since the merge sort algorithm doesn't work any more.
>>>>
>>>> So put an insert sort behind the merge sort which kicks in when the
>>>> input fences are not in the expected order. This isn't as efficient
>>>> as a heap sort, but has better properties for the most common use
>>>> case.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>> ---
>>>>   drivers/dma-buf/dma-fence-unwrap.c | 39 
>>>> ++++++++++++++++++++++++++----
>>>>   1 file changed, 34 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>>>> b/drivers/dma-buf/dma-fence-unwrap.c
>>>> index 628af51c81af..d9aa280d9ff6 100644
>>>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>>>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>>>> @@ -106,7 +106,7 @@ struct dma_fence 
>>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>>           fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>>>       count = 0;
>>>> -    do {
>>>> +    while (true) {
>>>>           unsigned int sel;
>>>>   restart:
>>>> @@ -144,11 +144,40 @@ struct dma_fence 
>>>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>>>               }
>>>>           }
>>>> -        if (tmp) {
>>>> -            array[count++] = dma_fence_get(tmp);
>>>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> +        if (!tmp)
>>>> +            break;
>>>> +
>>>> +        /*
>>>> +         * We could use a binary search here, but since the assumption
>>>> +         * is that the main input are already sorted dma_fence_arrays
>>>> +         * just looking from end has a higher chance of finding the
>>>> +         * right location on the first try
>>>> +         */
>>>> +
>>>> +        for (i = count; i--;) {
>>>> +            if (likely(array[i]->context < tmp->context))
>>>> +                break;
>>>> +
>>>> +            if (array[i]->context == tmp->context) {
>>>> +                if (dma_fence_is_later(tmp, array[i])) {
>>>> +                    dma_fence_put(array[i]);
>>>> +                    array[i] = dma_fence_get(tmp);
>>>> +                }
>>>> +                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> +                goto restart;
>>>> +            }
>>>>           }
>>>> -    } while (tmp);
>>>> +
>>>> +        ++i;
>>>> +        /*
>>>> +         * Make room for the fence, this should be a nop most of the
>>>> +         * time.
>>>> +         */
>>>> +        memcpy(&array[i + 1], &array[i], (count - i) * 
>>>> sizeof(*array));
>>>> +        array[i] = dma_fence_get(tmp);
>>>> +        fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>>>> +        count++;
>>>
>>> Having ventured into this function for the first time, I can say that 
>>> this is some smart code which is not easy to grasp. It could 
>>> definitely benefit from a high level comment before the do-while loop 
>>> to explain what it is going to do.
>>>
>>> Next and tmp local variable names I also wonder if could be renamed 
>>> to something more descriptive.
>>>
>>> And the algorithmic complexity of the end result, given the multiple 
>>> loops and gotos, I have no idea what it could be.
>>>
>>> Has a dumb solution been considered like a two-pass with a 
>>> pessimistically allocated fence array been considered? Like:
>>>
>>> 1) Populate array with all unsignalled unwrapped fences. (O(count))
>>>
>>> 2) Bog standard include/linux/sort.h by context and seqno. 
>>> (O(count*log (count)))
>>>
>>> 3) Walk array and squash same context to latest fence. (Before this 
>>> patch that wasn't there, right?). (O(count)) (Overwrite in place, no 
>>> memcpy needed.)
>>>
>>> Algorithmic complexity of that would be obvious and code much simpler.
>>
>> FWIW something like the below passes selftests. How does it look to 
>> you? Do you think more or less efficient and more or less readable?
> 
> Yeah I was considering the exact same thing.
> 
> What hold me back was the fact that the heap sort() implementation is 
> really inefficient for the most common use case of this. In other words 
> two arrays with fences already sorted is basically just O(count).
> 
> And I'm also not sure how many fences we see in those arrays in 
> practice. With Vulkan basically trying to feed multiple contexts to keep 
> all CPUs busy we might have quite a number here.

I can add some instrumentation and run some games next week.

Another option is adding the sort algorithm you want with the same API 
as kernel's sort. Even if a local implementation that may already 
increase readability of the merging process.

Regards,

Tvrtko

> 
> Regards,
> Christian.
> 
>>
>> commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9
>> Author: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Date:   Fri Nov 8 10:14:15 2024 +0000
>>
>>     __dma_fence_unwrap_merge
>>
>>     Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c 
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..47d67e482e96 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -12,6 +12,7 @@
>>  #include <linux/dma-fence-chain.h>
>>  #include <linux/dma-fence-unwrap.h>
>>  #include <linux/slab.h>
>> +#include <linux/sort.h>
>>
>>  /* Internal helper to start new array iteration, don't use directly */
>>  static struct dma_fence *
>> @@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct 
>> dma_fence_unwrap *cursor)
>>  }
>>  EXPORT_SYMBOL_GPL(dma_fence_unwrap_next);
>>
>> +
>> +static int fence_cmp(const void *_a, const void *_b)
>> +{
>> +    const struct dma_fence *a = *(const struct dma_fence **)_a;
>> +    const struct dma_fence *b = *(const struct dma_fence **)_b;
>> +
>> +    if (a->context < b->context)
>> +        return -1;
>> +    else if (a->context > b->context)
>> +        return 1;
>> +
>> +    if (a->seqno < b->seqno)
>> +        return -1;
>> +    else if (a->seqno > b->seqno)
>> +        return 1;
>> +
>> +    return 0;
>> +}
>> +
>>  /* Implementation for the dma_fence_merge() marco, don't use directly */
>>  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>                         struct dma_fence **fences,
>>                         struct dma_fence_unwrap *iter)
>>  {
>> -    struct dma_fence_array *result;
>>      struct dma_fence *tmp, **array;
>> +    struct dma_fence_array *result;
>>      ktime_t timestamp;
>> -    unsigned int i;
>> -    size_t count;
>> +    int i, j, count;
>>
>> +    /*
>> +     * Count number of unwrapped fences and fince the latest signaled
>> +     * timestamp.
>> +     */
>>      count = 0;
>>      timestamp = ns_to_ktime(0);
>>      for (i = 0; i < num_fences; ++i) {
>> @@ -92,63 +115,41 @@ struct dma_fence 
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>>      if (count == 0)
>>          return dma_fence_allocate_private_stub(timestamp);
>>
>> +    /*
>> +     * Allocate and populate the array.
>> +     */
>>      array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>>      if (!array)
>>          return NULL;
>>
>> -    /*
>> -     * This trashes the input fence array and uses it as position for 
>> the
>> -     * following merge loop. This works because the dma_fence_merge()
>> -     * wrapper macro is creating this temporary array on the stack 
>> together
>> -     * with the iterators.
>> -     */
>> -    for (i = 0; i < num_fences; ++i)
>> -        fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>> -
>>      count = 0;
>> -    do {
>> -        unsigned int sel;
>> -
>> -restart:
>> -        tmp = NULL;
>> -        for (i = 0; i < num_fences; ++i) {
>> -            struct dma_fence *next;
>> -
>> -            while (fences[i] && dma_fence_is_signaled(fences[i]))
>> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -
>> -            next = fences[i];
>> -            if (!next)
>> -                continue;
>> -
>> -            /*
>> -             * We can't guarantee that inpute fences are ordered by
>> -             * context, but it is still quite likely when this
>> -             * function is used multiple times. So attempt to order
>> -             * the fences by context as we pass over them and merge
>> -             * fences with the same context.
>> -             */
>> -            if (!tmp || tmp->context > next->context) {
>> -                tmp = next;
>> -                sel = i;
>> -
>> -            } else if (tmp->context < next->context) {
>> -                continue;
>> -
>> -            } else if (dma_fence_is_later(tmp, next)) {
>> -                fences[i] = dma_fence_unwrap_next(&iter[i]);
>> -                goto restart;
>> -            } else {
>> -                fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> -                goto restart;
>> -            }
>> +    for (i = 0; i < num_fences; ++i) {
>> +        dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>> +            if (!dma_fence_is_signaled(tmp))
>> +                array[count++] = tmp;
>>          }
>> +    }
>> +
>> +    /*
>> +     * Sort in context and seqno order.
>> +     */
>> +    sort(array, count, sizeof(*array), fence_cmp, NULL);
>>
>> -        if (tmp) {
>> -            array[count++] = dma_fence_get(tmp);
>> -            fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +    /*
>> +     * Only keep the most recent fence for each context.
>> +     */
>> +    j = 0;
>> +    tmp = array[0];
>> +    for (i = 1; i < count; i++) {
>> +        if (array[i]->context != tmp->context) {
>> +            array[j++] = dma_fence_get(tmp);
>>          }
>> -    } while (tmp);
>> +        tmp = array[i];
>> +    }
>> +    if (tmp->context != array[j - 1]->context) {
>> +        array[j++] = dma_fence_get(tmp);
>> +    }
>> +    count = j;
>>
>>      if (count == 0) {
>>          tmp = dma_fence_allocate_private_stub(ktime_get());
>>
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> +    };
>>>>       if (count == 0) {
>>>>           tmp = dma_fence_allocate_private_stub(ktime_get());
>
diff mbox series

Patch

diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..d9aa280d9ff6 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -106,7 +106,7 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
 
 	count = 0;
-	do {
+	while (true) {
 		unsigned int sel;
 
 restart:
@@ -144,11 +144,40 @@  struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 			}
 		}
 
-		if (tmp) {
-			array[count++] = dma_fence_get(tmp);
-			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
+		if (!tmp)
+			break;
+
+		/*
+		 * We could use a binary search here, but since the assumption
+		 * is that the main input are already sorted dma_fence_arrays
+		 * just looking from end has a higher chance of finding the
+		 * right location on the first try
+		 */
+
+		for (i = count; i--;) {
+			if (likely(array[i]->context < tmp->context))
+				break;
+
+			if (array[i]->context == tmp->context) {
+				if (dma_fence_is_later(tmp, array[i])) {
+					dma_fence_put(array[i]);
+					array[i] = dma_fence_get(tmp);
+				}
+				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
+				goto restart;
+			}
 		}
-	} while (tmp);
+
+		++i;
+		/*
+		 * Make room for the fence, this should be a nop most of the
+		 * time.
+		 */
+		memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
+		array[i] = dma_fence_get(tmp);
+		fences[sel] = dma_fence_unwrap_next(&iter[sel]);
+		count++;
+	};
 
 	if (count == 0) {
 		tmp = dma_fence_allocate_private_stub(ktime_get());