Message ID | 20241018054608.6478-1-friedrich.vock@gmx.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | dma-buf: Eliminate all duplicate fences in dma_fence_unwrap_merge | expand |
Am 18.10.24 um 07:46 schrieb Friedrich Vock: > When dma_fence_unwrap_merge is called on fence chains where the fences > aren't ordered by context, the merging logic breaks down and we end up > inserting fences twice. Doing this repeatedly leads to the number of > fences going up exponentially, and in some gaming workloads we'll end up > running out of memory to store the resulting array altogether, leading > to a warning such as: Ah! I was searching for that one for quite a while now. I own you a beer should you ever be near Cologne. Please also see my patch on the mailing list to use kvzalloc() to mitigate this. > > vkd3d_queue: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 > CPU: 2 PID: 5287 Comm: vkd3d_queue Tainted: G S 6.10.7-200.fsync.fc40.x86_64 #1 > Hardware name: Dell Inc. G5 5505/0NCW8W, BIOS 1.11.0 03/22/2022 > Call Trace: > <TASK> > dump_stack_lvl+0x5d/0x80 > warn_alloc+0x164/0x190 > ? srso_return_thunk+0x5/0x5f > ? __alloc_pages_direct_compact+0x1d9/0x220 > __alloc_pages_slowpath.constprop.2+0xd14/0xd80 > __alloc_pages_noprof+0x32b/0x350 > ? dma_fence_array_create+0x48/0x110 > __kmalloc_large_node+0x6f/0x130 > __kmalloc_noprof+0x2dd/0x4a0 > ? dma_fence_array_create+0x48/0x110 > dma_fence_array_create+0x48/0x110 > __dma_fence_unwrap_merge+0x481/0x5b0 > sync_file_merge.constprop.0+0xf8/0x180 > sync_file_ioctl+0x476/0x590 > ? srso_return_thunk+0x5/0x5f > ? __seccomp_filter+0xe8/0x5a0 > __x64_sys_ioctl+0x97/0xd0 > do_syscall_64+0x82/0x160 > ? srso_return_thunk+0x5/0x5f > ? drm_syncobj_destroy_ioctl+0x8b/0xb0 > ? srso_return_thunk+0x5/0x5f > ? srso_return_thunk+0x5/0x5f > ? __check_object_size+0x58/0x230 > ? srso_return_thunk+0x5/0x5f > ? srso_return_thunk+0x5/0x5f > ? drm_ioctl+0x2ba/0x530 > ? __pfx_drm_syncobj_destroy_ioctl+0x10/0x10 > ? srso_return_thunk+0x5/0x5f > ? ktime_get_mono_fast_ns+0x3b/0xd0 > ? srso_return_thunk+0x5/0x5f > ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu] > ? srso_return_thunk+0x5/0x5f > ? syscall_exit_to_user_mode+0x72/0x200 > ? srso_return_thunk+0x5/0x5f > ? do_syscall_64+0x8e/0x160 > ? syscall_exit_to_user_mode+0x72/0x200 > ? srso_return_thunk+0x5/0x5f > ? do_syscall_64+0x8e/0x160 > ? srso_return_thunk+0x5/0x5f > ? syscall_exit_to_user_mode+0x72/0x200 > ? srso_return_thunk+0x5/0x5f > ? do_syscall_64+0x8e/0x160 > ? do_syscall_64+0x8e/0x160 > ? srso_return_thunk+0x5/0x5f > entry_SYSCALL_64_after_hwframe+0x76/0x7e > > It's a bit unfortunate that we end up with quadratic complexity w.r.t. > the number of merged fences in all cases, but I'd argue in practice > there shouldn't be more than a handful of in-flight fences to merge. > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3617 > Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> > --- > drivers/dma-buf/dma-fence-unwrap.c | 13 +++++++++++-- > 1 file changed, 11 insertions(+), 2 deletions(-) > > diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c > index 628af51c81af..46277cef0bc6 100644 > --- a/drivers/dma-buf/dma-fence-unwrap.c > +++ b/drivers/dma-buf/dma-fence-unwrap.c > @@ -68,7 +68,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, > struct dma_fence *tmp, **array; > ktime_t timestamp; > unsigned int i; > - size_t count; > + size_t count, j; > > count = 0; > timestamp = ns_to_ktime(0); > @@ -127,6 +127,10 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, > * function is used multiple times. So attempt to order > * the fences by context as we pass over them and merge > * fences with the same context. > + * > + * We will remove any remaining duplicate fences down > + * below, but doing this here saves us from having to > + * iterate over the array to detect the duplicate. > */ > if (!tmp || tmp->context > next->context) { > tmp = next; > @@ -145,7 +149,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, > } > > if (tmp) { > - array[count++] = dma_fence_get(tmp); > + for (j = 0; j < count; ++j) { > + if (array[count] == tmp) > + break; > + } > + if (j == count) > + array[count++] = dma_fence_get(tmp); That is clearly not the right solution. Since comparing the context should have already removed all duplicates. Going to double check the code. Thanks, Christian. > fences[sel] = dma_fence_unwrap_next(&iter[sel]); > } > } while (tmp); > -- > 2.47.0 >
Hi, On 18.10.24 10:56, Christian König wrote: > Am 18.10.24 um 07:46 schrieb Friedrich Vock: >> When dma_fence_unwrap_merge is called on fence chains where the fences >> aren't ordered by context, the merging logic breaks down and we end up >> inserting fences twice. Doing this repeatedly leads to the number of >> fences going up exponentially, and in some gaming workloads we'll end up >> running out of memory to store the resulting array altogether, leading >> to a warning such as: > > Ah! I was searching for that one for quite a while now. > > I own you a beer should you ever be near Cologne. I visit the area somewhat regularly, I'll let you know the next time around ;-) > > Please also see my patch on the mailing list to use kvzalloc() to > mitigate this. > >> >> vkd3d_queue: page allocation failure: order:7, >> mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), >> nodemask=(null),cpuset=/,mems_allowed=0 >> CPU: 2 PID: 5287 Comm: vkd3d_queue Tainted: G S >> 6.10.7-200.fsync.fc40.x86_64 #1 >> Hardware name: Dell Inc. G5 5505/0NCW8W, BIOS 1.11.0 03/22/2022 >> Call Trace: >> <TASK> >> dump_stack_lvl+0x5d/0x80 >> warn_alloc+0x164/0x190 >> ? srso_return_thunk+0x5/0x5f >> ? __alloc_pages_direct_compact+0x1d9/0x220 >> __alloc_pages_slowpath.constprop.2+0xd14/0xd80 >> __alloc_pages_noprof+0x32b/0x350 >> ? dma_fence_array_create+0x48/0x110 >> __kmalloc_large_node+0x6f/0x130 >> __kmalloc_noprof+0x2dd/0x4a0 >> ? dma_fence_array_create+0x48/0x110 >> dma_fence_array_create+0x48/0x110 >> __dma_fence_unwrap_merge+0x481/0x5b0 >> sync_file_merge.constprop.0+0xf8/0x180 >> sync_file_ioctl+0x476/0x590 >> ? srso_return_thunk+0x5/0x5f >> ? __seccomp_filter+0xe8/0x5a0 >> __x64_sys_ioctl+0x97/0xd0 >> do_syscall_64+0x82/0x160 >> ? srso_return_thunk+0x5/0x5f >> ? drm_syncobj_destroy_ioctl+0x8b/0xb0 >> ? srso_return_thunk+0x5/0x5f >> ? srso_return_thunk+0x5/0x5f >> ? __check_object_size+0x58/0x230 >> ? srso_return_thunk+0x5/0x5f >> ? srso_return_thunk+0x5/0x5f >> ? drm_ioctl+0x2ba/0x530 >> ? __pfx_drm_syncobj_destroy_ioctl+0x10/0x10 >> ? srso_return_thunk+0x5/0x5f >> ? ktime_get_mono_fast_ns+0x3b/0xd0 >> ? srso_return_thunk+0x5/0x5f >> ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu] >> ? srso_return_thunk+0x5/0x5f >> ? syscall_exit_to_user_mode+0x72/0x200 >> ? srso_return_thunk+0x5/0x5f >> ? do_syscall_64+0x8e/0x160 >> ? syscall_exit_to_user_mode+0x72/0x200 >> ? srso_return_thunk+0x5/0x5f >> ? do_syscall_64+0x8e/0x160 >> ? srso_return_thunk+0x5/0x5f >> ? syscall_exit_to_user_mode+0x72/0x200 >> ? srso_return_thunk+0x5/0x5f >> ? do_syscall_64+0x8e/0x160 >> ? do_syscall_64+0x8e/0x160 >> ? srso_return_thunk+0x5/0x5f >> entry_SYSCALL_64_after_hwframe+0x76/0x7e >> >> It's a bit unfortunate that we end up with quadratic complexity w.r.t. >> the number of merged fences in all cases, but I'd argue in practice >> there shouldn't be more than a handful of in-flight fences to merge. >> >> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3617 >> Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> >> --- >> drivers/dma-buf/dma-fence-unwrap.c | 13 +++++++++++-- >> 1 file changed, 11 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma- >> fence-unwrap.c >> index 628af51c81af..46277cef0bc6 100644 >> --- a/drivers/dma-buf/dma-fence-unwrap.c >> +++ b/drivers/dma-buf/dma-fence-unwrap.c >> @@ -68,7 +68,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned >> int num_fences, >> struct dma_fence *tmp, **array; >> ktime_t timestamp; >> unsigned int i; >> - size_t count; >> + size_t count, j; >> >> count = 0; >> timestamp = ns_to_ktime(0); >> @@ -127,6 +127,10 @@ struct dma_fence >> *__dma_fence_unwrap_merge(unsigned int num_fences, >> * function is used multiple times. So attempt to order >> * the fences by context as we pass over them and merge >> * fences with the same context. >> + * >> + * We will remove any remaining duplicate fences down >> + * below, but doing this here saves us from having to >> + * iterate over the array to detect the duplicate. >> */ >> if (!tmp || tmp->context > next->context) { >> tmp = next; >> @@ -145,7 +149,12 @@ struct dma_fence >> *__dma_fence_unwrap_merge(unsigned int num_fences, >> } >> >> if (tmp) { >> - array[count++] = dma_fence_get(tmp); >> + for (j = 0; j < count; ++j) { >> + if (array[count] == tmp) >> + break; >> + } >> + if (j == count) >> + array[count++] = dma_fence_get(tmp); > > That is clearly not the right solution. Since comparing the context > should have already removed all duplicates. Sadly, not. This is true as long as the fences are ordered by context, but this is not a given. The error manifests precisely when they are not ordered. Imagine we try to merge two chains/arrays that contain the same 4 fences (I'll call them fences 1-4), but the second one has another fence with a higher context (fence 5) in front of it. I'll try to visualize the chains/arrays as ASCII art - we start out in a state like this (the ^ marker referring to the current iterator position): Result: <empty> 1. 1 -> 2 -> 3 -> 4 ^ 2. 5 -> 1 -> 2 -> 3 -> 4 ^ Since fences 1-4 have a lower context than fence 5, we'll insert fences 1-4 first: Result: 1 -> 2 -> 3 -> 4 1. 1 -> 2 -> 3 -> 4 ^ 2. 5 -> 1 -> 2 -> 3 -> 4 ^ Only then will we insert fence 5: Result: 1 -> 2 -> 3 -> 4 -> 5 1. 1 -> 2 -> 3 -> 4 ^ 2. 5 -> 1 -> 2 -> 3 -> 4 ^ Now we're at fence 1 in the second chain/array, but we've already inserted fences 1-4 from the first chain/array, so we fail to merge them. Instead, they get reinserted: Result: 1 -> 2 -> 3 -> 4 -> 5 -> 1 -> 2 -> 3 -> 4 1. 1 -> 2 -> 3 -> 4 ^ 2. 5 -> 1 -> 2 -> 3 -> 4 ^ We can't assume the fences are in any sort of order w.r.t their context, so if we want to check for duplicates exhaustively we'll always end up with some kind of O(n^2) algorithm. I see only a handful of ways we can go: a) Don't check exhaustively (current behavior). Obviously, this doesn't work that well in practice. b) Eat the O(n^2) cost (this patch). I've kept the current merging code since it's an easy way to reduce the amount of times we have to do the expensive duplicate check, but other than that I'm not sure we can do much to reduce cost. c) Enforce order w.r.t. context. I don't think we can require that fence chains order their fences by context, they should be ordered by timeline point (maybe it would work for arrays, but whatever). That leaves us with having to sort the fences by context just before merging. That would reduce complexity to some O(n log n) at worst, but in practice I fear it might not be worth it compared to just iterating over the result array a few times, especially given that once this bug is fixed, we should be back to only a few fences to merge :) Regards, Friedrich > > Going to double check the code. > > Thanks, > Christian. > >> fences[sel] = dma_fence_unwrap_next(&iter[sel]); >> } >> } while (tmp); >> -- >> 2.47.0 >> >
Am 18.10.24 um 21:17 schrieb Friedrich Vock: > [SNIP] >>> if (tmp) { >>> - array[count++] = dma_fence_get(tmp); >>> + for (j = 0; j < count; ++j) { >>> + if (array[count] == tmp) >>> + break; >>> + } >>> + if (j == count) >>> + array[count++] = dma_fence_get(tmp); >> >> That is clearly not the right solution. Since comparing the context >> should have already removed all duplicates. > > Sadly, not. This is true as long as the fences are ordered by context, > but this is not a given. The error manifests precisely when they are not > ordered. > > Imagine we try to merge two chains/arrays that contain the same 4 fences > (I'll call them fences 1-4), but the second one has another fence with a > higher context (fence 5) in front of it. Yeah, the problem here is that the code originally assumed that we only merge arrays. And most of those arrays were previously created by merging other fences, so they are supposed to be ordered. > > [SNIP] > > We can't assume the fences are in any sort of order w.r.t their context, > so if we want to check for duplicates exhaustively we'll always end up > with some kind of O(n^2) algorithm. I see only a handful of ways we > can go: > > a) Don't check exhaustively (current behavior). Obviously, this doesn't > work that well in practice. > > b) Eat the O(n^2) cost (this patch). I've kept the current merging code > since it's an easy way to reduce the amount of times we have to do the > expensive duplicate check, but other than that I'm not sure we can do > much to reduce cost. > > c) Enforce order w.r.t. context. I don't think we can require that fence > chains order their fences by context, they should be ordered by timeline > point (maybe it would work for arrays, but whatever). That leaves us > with having to sort the fences by context just before merging. That > would reduce complexity to some O(n log n) at worst, but in practice I > fear it might not be worth it compared to just iterating over the result > array a few times, especially given that once this bug is fixed, we > should be back to only a few fences to merge :) Yeah, how that happens is rather obvious and we indeed can't require fence chains to be ordered. But I don't think we always need this O(n^2) cost either, we just need to check from the end of the array if we really have the highest context at hand. That's still O(n^2) in the worst case, but for the case where we merge two sorted arrays it becomes O(1). Give me a moment to hack that together. Thanks, Christian. > > Regards, > Friedrich > >> >> Going to double check the code. >> >> Thanks, >> Christian. >> >>> fences[sel] = dma_fence_unwrap_next(&iter[sel]); >>> } >>> } while (tmp); >>> -- >>> 2.47.0 >>> >> >
diff --git a/drivers/dma-buf/dma-fence-unwrap.c b/drivers/dma-buf/dma-fence-unwrap.c index 628af51c81af..46277cef0bc6 100644 --- a/drivers/dma-buf/dma-fence-unwrap.c +++ b/drivers/dma-buf/dma-fence-unwrap.c @@ -68,7 +68,7 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, struct dma_fence *tmp, **array; ktime_t timestamp; unsigned int i; - size_t count; + size_t count, j; count = 0; timestamp = ns_to_ktime(0); @@ -127,6 +127,10 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, * function is used multiple times. So attempt to order * the fences by context as we pass over them and merge * fences with the same context. + * + * We will remove any remaining duplicate fences down + * below, but doing this here saves us from having to + * iterate over the array to detect the duplicate. */ if (!tmp || tmp->context > next->context) { tmp = next; @@ -145,7 +149,12 @@ struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, } if (tmp) { - array[count++] = dma_fence_get(tmp); + for (j = 0; j < count; ++j) { + if (array[count] == tmp) + break; + } + if (j == count) + array[count++] = dma_fence_get(tmp); fences[sel] = dma_fence_unwrap_next(&iter[sel]); } } while (tmp);
When dma_fence_unwrap_merge is called on fence chains where the fences aren't ordered by context, the merging logic breaks down and we end up inserting fences twice. Doing this repeatedly leads to the number of fences going up exponentially, and in some gaming workloads we'll end up running out of memory to store the resulting array altogether, leading to a warning such as: vkd3d_queue: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0 CPU: 2 PID: 5287 Comm: vkd3d_queue Tainted: G S 6.10.7-200.fsync.fc40.x86_64 #1 Hardware name: Dell Inc. G5 5505/0NCW8W, BIOS 1.11.0 03/22/2022 Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 warn_alloc+0x164/0x190 ? srso_return_thunk+0x5/0x5f ? __alloc_pages_direct_compact+0x1d9/0x220 __alloc_pages_slowpath.constprop.2+0xd14/0xd80 __alloc_pages_noprof+0x32b/0x350 ? dma_fence_array_create+0x48/0x110 __kmalloc_large_node+0x6f/0x130 __kmalloc_noprof+0x2dd/0x4a0 ? dma_fence_array_create+0x48/0x110 dma_fence_array_create+0x48/0x110 __dma_fence_unwrap_merge+0x481/0x5b0 sync_file_merge.constprop.0+0xf8/0x180 sync_file_ioctl+0x476/0x590 ? srso_return_thunk+0x5/0x5f ? __seccomp_filter+0xe8/0x5a0 __x64_sys_ioctl+0x97/0xd0 do_syscall_64+0x82/0x160 ? srso_return_thunk+0x5/0x5f ? drm_syncobj_destroy_ioctl+0x8b/0xb0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? __check_object_size+0x58/0x230 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? drm_ioctl+0x2ba/0x530 ? __pfx_drm_syncobj_destroy_ioctl+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? ktime_get_mono_fast_ns+0x3b/0xd0 ? srso_return_thunk+0x5/0x5f ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu] ? srso_return_thunk+0x5/0x5f ? syscall_exit_to_user_mode+0x72/0x200 ? srso_return_thunk+0x5/0x5f ? do_syscall_64+0x8e/0x160 ? syscall_exit_to_user_mode+0x72/0x200 ? srso_return_thunk+0x5/0x5f ? do_syscall_64+0x8e/0x160 ? srso_return_thunk+0x5/0x5f ? syscall_exit_to_user_mode+0x72/0x200 ? srso_return_thunk+0x5/0x5f ? do_syscall_64+0x8e/0x160 ? do_syscall_64+0x8e/0x160 ? srso_return_thunk+0x5/0x5f entry_SYSCALL_64_after_hwframe+0x76/0x7e It's a bit unfortunate that we end up with quadratic complexity w.r.t. the number of merged fences in all cases, but I'd argue in practice there shouldn't be more than a handful of in-flight fences to merge. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3617 Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> --- drivers/dma-buf/dma-fence-unwrap.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) -- 2.47.0