Message ID | 20241113171947.57446-1-tursulin@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/2] dma-fence: Add a single fence fast path for fence merging |
On 13.11.24 at 18:19, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Testing some workloads in two different scenarios, such as games running
> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
> that in a significant portion of calls the dma_fence_unwrap_merge helper
> is called with just a single unsignalled fence.
>
> Therefore it is worthwhile to add a fast path for that case and so bypass
> the memory allocation and insertion sort attempts.

You should probably re-order the patches since we need to backport the
second as a bug fix while the first is just an improvement.

There is also a bug in this patch which needs to be fixed.

> Tested scenarios:
>
> 1) Hogwarts Legacy under Gamescope
>
> 450 calls per second to __dma_fence_unwrap_merge.
>
> Percentages per number of fences buckets, before and after checking for
> signalled status, sorting and flattening:
>
>    N      Before   After
>    0       0.85%
>    1      69.80%   -> The new fast path.
>    2-9    29.36%    9% (Ie. 91% of this bucket flattened to 1 fence)
>    10-19
>    20-40
>    50+
>
> 2) Cyberpunk 2077 under Gamescope
>
> 1050 calls per second.
>
>    N      Before   After
>    0       0.71%
>    1      52.53%   -> The new fast path.
>    2-9    44.38%   50.60% (Ie. half resolved to a single fence)
>    10-19   2.34%
>    20-40   0.06%
>    50+
>
> 3) vkcube under Plasma
>
> 90 calls per second.
>
>    N      Before   After
>    0
>    1
>    2-9    100%      0% (Ie. all resolved to a single fence)
>    10-19
>    20-40
>    50+
>
> In the case of vkcube all invocations in the 2-9 bucket were actually
> just two input fences.

Nice to have some numbers at hand.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Friedrich Vock <friedrich.vock@gmx.de>
> ---
>  drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
> index 628af51c81af..75c3e37fd617 100644
> --- a/drivers/dma-buf/dma-fence-unwrap.c
> +++ b/drivers/dma-buf/dma-fence-unwrap.c
> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>  					   struct dma_fence **fences,
>  					   struct dma_fence_unwrap *iter)
>  {
> +	struct dma_fence *tmp, *signaled, **array;

I would name that unsignaled instead.

>  	struct dma_fence_array *result;
> -	struct dma_fence *tmp, **array;
>  	ktime_t timestamp;
>  	unsigned int i;
>  	size_t count;
> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>  	for (i = 0; i < num_fences; ++i) {
>  		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>  			if (!dma_fence_is_signaled(tmp)) {
> +				signaled = tmp;

You need to grab a reference to tmp here if you want to keep it.

It's perfectly possible that tmp is garbage collected as soon as you go
to the next iteration or leave the loop.

Regards,
Christian.

>  				++count;
>  			} else {
>  				ktime_t t = dma_fence_timestamp(tmp);
> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>  	/*
>  	 * If we couldn't find a pending fence just return a private signaled
>  	 * fence with the timestamp of the last signaled one.
> +	 *
> +	 * Or if there was a single unsignaled fence left we can return it
> +	 * directly and early since that is a major path on many workloads.
>  	 */
>  	if (count == 0)
>  		return dma_fence_allocate_private_stub(timestamp);
> +	else if (count == 1)
> +		return dma_fence_get(signaled);
>
>  	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>  	if (!array)
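For context on the reference-counting point raised above: a fence visited through dma_fence_unwrap_for_each is only borrowed for the duration of that iteration, so any pointer kept past the loop must own its own reference, taken with dma_fence_get() and later released with dma_fence_put(). The sketch below illustrates that pattern only; the helper keep_last_unsignaled() is hypothetical and not part of the posted patch.

```c
#include <linux/dma-fence.h>
#include <linux/dma-fence-unwrap.h>

/*
 * Illustrative sketch only: keep the last unsignaled fence seen while
 * unwrapping a (possibly nested) fence container. The iterator only
 * borrows each fence, so the pointer we keep must own a reference.
 */
static struct dma_fence *keep_last_unsignaled(struct dma_fence *head)
{
	struct dma_fence_unwrap iter;
	struct dma_fence *tmp, *kept = NULL;

	dma_fence_unwrap_for_each(tmp, &iter, head) {
		if (dma_fence_is_signaled(tmp))
			continue;

		/* Drop the fence kept on an earlier iteration, if any. */
		dma_fence_put(kept);

		/* Take our own reference before the iterator moves on. */
		kept = dma_fence_get(tmp);
	}

	/* The caller now owns the returned reference (or gets NULL). */
	return kept;
}
```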
On 14/11/2024 08:53, Christian König wrote:
> On 13.11.24 at 18:19, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Testing some workloads in two different scenarios, such as games running
>> under Gamescope on a Steam Deck, or vkcube under a Plasma desktop, shows
>> that in a significant portion of calls the dma_fence_unwrap_merge helper
>> is called with just a single unsignalled fence.
>>
>> Therefore it is worthwhile to add a fast path for that case and so bypass
>> the memory allocation and insertion sort attempts.
>
> You should probably re-order the patches since we need to backport the
> second as a bug fix while the first is just an improvement.

Ok.

> There is also a bug in this patch which needs to be fixed.
>
>> Tested scenarios:
>>
>> 1) Hogwarts Legacy under Gamescope
>>
>> 450 calls per second to __dma_fence_unwrap_merge.
>>
>> Percentages per number of fences buckets, before and after checking for
>> signalled status, sorting and flattening:
>>
>>    N      Before   After
>>    0       0.85%
>>    1      69.80%   -> The new fast path.
>>    2-9    29.36%    9% (Ie. 91% of this bucket flattened to 1 fence)
>>    10-19
>>    20-40
>>    50+
>>
>> 2) Cyberpunk 2077 under Gamescope
>>
>> 1050 calls per second.
>>
>>    N      Before   After
>>    0       0.71%
>>    1      52.53%   -> The new fast path.
>>    2-9    44.38%   50.60% (Ie. half resolved to a single fence)
>>    10-19   2.34%
>>    20-40   0.06%
>>    50+
>>
>> 3) vkcube under Plasma
>>
>> 90 calls per second.
>>
>>    N      Before   After
>>    0
>>    1
>>    2-9    100%      0% (Ie. all resolved to a single fence)
>>    10-19
>>    20-40
>>    50+
>>
>> In the case of vkcube all invocations in the 2-9 bucket were actually
>> just two input fences.
>
> Nice to have some numbers at hand.
>
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Friedrich Vock <friedrich.vock@gmx.de>
>> ---
>>  drivers/dma-buf/dma-fence-unwrap.c | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..75c3e37fd617 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>  					   struct dma_fence **fences,
>>  					   struct dma_fence_unwrap *iter)
>>  {
>> +	struct dma_fence *tmp, *signaled, **array;
>
> I would name that unsignaled instead.

Indeed. And a polite way of pointing out my brain was completely
reversed. :)

>>  	struct dma_fence_array *result;
>> -	struct dma_fence *tmp, **array;
>>  	ktime_t timestamp;
>>  	unsigned int i;
>>  	size_t count;
>> @@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>  	for (i = 0; i < num_fences; ++i) {
>>  		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
>>  			if (!dma_fence_is_signaled(tmp)) {
>> +				signaled = tmp;
>
> You need to grab a reference to tmp here if you want to keep it.
>
> It's perfectly possible that tmp is garbage collected as soon as you go
> to the next iteration or leave the loop.

Yep.. same bug in the second patch.

Regards,

Tvrtko

>>  				++count;
>>  			} else {
>>  				ktime_t t = dma_fence_timestamp(tmp);
>> @@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
>>  	/*
>>  	 * If we couldn't find a pending fence just return a private signaled
>>  	 * fence with the timestamp of the last signaled one.
>> +	 *
>> +	 * Or if there was a single unsignaled fence left we can return it
>> +	 * directly and early since that is a major path on many workloads.
>>  	 */
>>  	if (count == 0)
>>  		return dma_fence_allocate_private_stub(timestamp);
>> +	else if (count == 1)
>> +		return dma_fence_get(signaled);
>>
>>  	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
>>  	if (!array)
>
diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c
index 628af51c81af..75c3e37fd617 100644
--- a/drivers/dma-buf/dma-fence-unwrap.c
+++ b/drivers/dma-buf/dma-fence-unwrap.c
@@ -64,8 +64,8 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 					   struct dma_fence **fences,
 					   struct dma_fence_unwrap *iter)
 {
+	struct dma_fence *tmp, *signaled, **array;
 	struct dma_fence_array *result;
-	struct dma_fence *tmp, **array;
 	ktime_t timestamp;
 	unsigned int i;
 	size_t count;
@@ -75,6 +75,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	for (i = 0; i < num_fences; ++i) {
 		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
 			if (!dma_fence_is_signaled(tmp)) {
+				signaled = tmp;
 				++count;
 			} else {
 				ktime_t t = dma_fence_timestamp(tmp);
@@ -88,9 +89,14 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences,
 	/*
 	 * If we couldn't find a pending fence just return a private signaled
 	 * fence with the timestamp of the last signaled one.
+	 *
+	 * Or if there was a single unsignaled fence left we can return it
+	 * directly and early since that is a major path on many workloads.
 	 */
 	if (count == 0)
 		return dma_fence_allocate_private_stub(timestamp);
+	else if (count == 1)
+		return dma_fence_get(signaled);
 
 	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
 	if (!array)
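For illustration, below is one way the two review comments could be folded into the hunk above: the cached pointer renamed to unsignaled, a reference taken while it is cached, and that reference either handed straight to the caller on the single-fence fast path or dropped again before the slow path. This is a sketch under those assumptions, not necessarily the revision that was eventually posted or merged.

```c
	/*
	 * Sketch only. Assumes the rest of __dma_fence_unwrap_merge() from
	 * drivers/dma-buf/dma-fence-unwrap.c stays as in the posted patch,
	 * with the declaration changed to:
	 *
	 *	struct dma_fence *tmp, *unsignaled = NULL, **array;
	 */
	count = 0;
	for (i = 0; i < num_fences; ++i) {
		dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) {
			if (!dma_fence_is_signaled(tmp)) {
				/*
				 * The iterator only borrows tmp, so take a
				 * reference before caching it and drop any
				 * fence cached on an earlier iteration.
				 */
				dma_fence_put(unsignaled);
				unsignaled = dma_fence_get(tmp);
				++count;
			} else {
				ktime_t t = dma_fence_timestamp(tmp);

				if (ktime_after(t, timestamp))
					timestamp = t;
			}
		}
	}

	/*
	 * If we couldn't find a pending fence just return a private signaled
	 * fence with the timestamp of the last signaled one.
	 *
	 * Or if there was a single unsignaled fence left we can return it
	 * directly and early since that is a major path on many workloads.
	 */
	if (count == 0)
		return dma_fence_allocate_private_stub(timestamp);
	else if (count == 1)
		return unsignaled;	/* transfer our reference to the caller */

	/* Slow path: the cached fence is not needed, drop its reference. */
	dma_fence_put(unsignaled);

	array = kmalloc_array(count, sizeof(*array), GFP_KERNEL);
```

One trade-off of this sketch is that it takes and drops a reference for every pending fence even when more than one is found; a real revision might stop caching once count exceeds one, at the cost of a slightly more involved loop body.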