Message ID | 151076492143.19172.10013343108476631614@mail.alporthouse.com (mailing list archive)
---|---
State | New, archived
Am 15.11.2017 um 17:55 schrieb Chris Wilson:
> Quoting Chris Wilson (2017-11-14 14:34:05)
>> Quoting Christian König (2017-11-14 14:24:44)
>>> Am 06.11.2017 um 17:22 schrieb Chris Wilson:
>>>> Quoting Christian König (2017-10-30 14:59:04)
>>>>> @@ -126,17 +127,28 @@ reservation_object_add_shared_inplace(struct reservation_object *obj,
>>>>>                         dma_fence_put(old_fence);
>>>>>                         return;
>>>>>                 }
>>>>> +
>>>>> +               if (!signaled && dma_fence_is_signaled(old_fence)) {
>>>>> +                       signaled = old_fence;
>>>>> +                       signaled_idx = i;
>>>>> +               }
>>>> How much do we care about only keeping one fence per-ctx here? You could
>>>> rearrange this to break on old_fence->context == fence->context ||
>>>> dma_fence_is_signaled(old_fence) then everyone can use the final block.
>>> Yeah, that is what David Zhou suggested as well.
>>>
>>> I've rejected this approach for now cause I think we still have cases
>>> where we rely on one fence per ctx (but I'm not 100% sure).
>>>
>>> I changed patch #1 in this series as you suggest and going to send that
>>> out once more in a minute.
>>>
>>> Can we get this upstream as is for now? I won't have much more time
>>> working on this.
>> Sure, we are only discussing how we might make it look tidier, pure
>> micro-optimisation with the caveat of losing the one-fence-per-ctx
>> guarantee.
> Ah, one thing to note is that extra checking pushed one of our corner
> case tests over its time limit.
>
> If we can completely forgo the one-fence-per-ctx here, what works really
> well in testing is
>
> diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
> index 5319ac478918..5755e95fab1b 100644
> --- a/drivers/dma-buf/reservation.c
> +++ b/drivers/dma-buf/reservation.c
> @@ -104,39 +104,19 @@ reservation_object_add_shared_inplace(struct reservation_object *obj,
>                                        struct reservation_object_list *fobj,
>                                        struct dma_fence *fence)
>  {
> -       struct dma_fence *replace = NULL;
> -       u32 ctx = fence->context;
> -       u32 i;
> -
>         dma_fence_get(fence);
>
>         preempt_disable();
>         write_seqcount_begin(&obj->seq);
>
> -       for (i = 0; i < fobj->shared_count; ++i) {
> -               struct dma_fence *check;
> -
> -               check = rcu_dereference_protected(fobj->shared[i],
> -                                                 reservation_object_held(obj));
> -
> -               if (check->context == ctx || dma_fence_is_signaled(check)) {
> -                       replace = old_fence;
> -                       break;
> -               }
> -       }
> -
>         /*
>          * memory barrier is added by write_seqcount_begin,
>          * fobj->shared_count is protected by this lock too
>          */
> -       RCU_INIT_POINTER(fobj->shared[i], fence);
> -       if (!replace)
> -               fobj->shared_count++;
> +       RCU_INIT_POINTER(fobj->shared[fobj->shared_count++], fence);
>
>         write_seqcount_end(&obj->seq);
>         preempt_enable();
> -
> -       dma_fence_put(replace);
>  }
>
>  static void
>
> i.e. don't check when not replacing the shared[], on creating the new
> buffer we then discard all the old fences.
>
> It should work for amdgpu as well since you do a ht to coalesce
> redundant fences before queuing.

That won't work for all cases. This way the reservation object would
keep growing without a chance to ever shrink.

Christian.

> -Chris
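To make the trade-off concrete, here is a minimal userspace model (not the kernel code) of the two strategies under discussion: the existing scan of the shared array versus the proposed blind append. The names fence_model, shared_list and both add functions are invented for this sketch, and the struct dma_fence handling, reservation lock and write seqcount of the real code are all elided.

/*
 * Illustrative userspace model of the two in-place add strategies.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CAPACITY 1024	/* reallocation is out of scope for this sketch */

struct fence_model {
	unsigned int context;	/* stands in for fence->context */
	bool signaled;		/* stands in for dma_fence_is_signaled() */
};

struct shared_list {
	struct fence_model *entries;
	unsigned int count;
};

/* Current behaviour: reuse the slot of a fence from the same context
 * (or, with the patch under review, of an already-signaled fence). */
static void add_inplace_replace(struct shared_list *list, struct fence_model f)
{
	unsigned int i;

	for (i = 0; i < list->count; i++) {
		if (list->entries[i].context == f.context ||
		    list->entries[i].signaled) {
			list->entries[i] = f;
			return;
		}
	}
	list->entries[list->count++] = f;
}

/* Proposed behaviour: never scan, just consume the next slot; stale
 * fences are only dropped when the array has to be reallocated. */
static void add_inplace_append(struct shared_list *list, struct fence_model f)
{
	list->entries[list->count++] = f;
}

int main(void)
{
	struct shared_list a = { calloc(CAPACITY, sizeof(struct fence_model)), 0 };
	struct shared_list b = { calloc(CAPACITY, sizeof(struct fence_model)), 0 };
	unsigned int i;

	if (!a.entries || !b.entries)
		return 1;

	/* 1000 submissions spread over 4 contexts, none signaled yet */
	for (i = 0; i < 1000; i++) {
		struct fence_model f = { .context = i % 4, .signaled = false };

		add_inplace_replace(&a, f);
		add_inplace_append(&b, f);
	}
	printf("scan-and-replace keeps %u entries, blind append keeps %u\n",
	       a.count, b.count);
	free(a.entries);
	free(b.entries);
	return 0;
}

Run as-is, the scan-and-replace variant ends up with one entry per context (4 here), while the blind append grows with every submission until a reallocation reaps it, which is the growth Christian is objecting to.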
Quoting Christian König (2017-11-15 17:34:07)
> That won't work for all cases. This way the reservation object would
> keep growing without a chance to ever shrink.
We only keep the active fences when it grows, which is effective enough
to keep it in check on the workloads I can find in the hour since
noticing the failure in CI ;)

And on the workloads where it is being flooded with live fences from
many contexts, the order of magnitude throughput improvement is not
easy to ignore.

-Chris
Am 15.11.2017 um 18:43 schrieb Chris Wilson:
> Quoting Christian König (2017-11-15 17:34:07)
>> That won't work for all cases. This way the reservation object would
>> keep growing without a chance to ever shrink.
> We only keep the active fences when it grows, which is effective enough
> to keep it in check on the workloads I can find in the hour since
> noticing the failure in CI ;)

Not sure what tests you run, but this change means that we basically
just add the fence to an ever growing list of fences on every command
submission which is only reaped when we double the list size.

That is a serious no-go as far as I can see. What we could do is improve
reservation_object_reserve_shared() as well to figure out how many
signaled fences are actually in the array when we reallocate it.

> And on the workloads where it is being
> flooded with live fences from many contexts, the order of magnitude
> throughput improvement is not easy to ignore.

Well not sure about the Intel driver, but since we started to mostly use
a single reservation object per process, the time spent in those
functions on amdgpu is completely negligible.

Regards,
Christian.

> -Chris
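A sketch of that suggestion, again as a self-contained userspace model rather than the real reservation_object_reserve_shared(): when the array is full, count the signaled fences, drop them while copying, and only double the storage if the live fences alone would still fill it. The names and the signaling model are invented for the sketch; the RCU grace period and locking of the real code are omitted.

#include <stdbool.h>
#include <stdlib.h>

struct fence_model {
	unsigned int context;
	bool signaled;		/* stands in for dma_fence_is_signaled() */
};

struct shared_list {
	struct fence_model *entries;
	unsigned int count;
	unsigned int max;
};

/* Make sure one more fence fits, reaping signaled fences while at it. */
static int reserve_shared(struct shared_list *list)
{
	struct fence_model *old = list->entries;
	unsigned int old_count = list->count;
	unsigned int live = 0, new_max, i;

	if (old_count < list->max)
		return 0;	/* a free slot is already available */

	/* Figure out how many fences actually have to be kept. */
	for (i = 0; i < old_count; i++)
		if (!old[i].signaled)
			live++;

	/* Only double the storage if the live fences alone still fill it. */
	new_max = (live + 1 > list->max) ? list->max * 2 : list->max;

	list->entries = malloc(new_max * sizeof(*list->entries));
	if (!list->entries) {
		list->entries = old;
		return -1;
	}

	/* Copy over the live fences, silently dropping the signaled ones. */
	list->count = 0;
	for (i = 0; i < old_count; i++)
		if (!old[i].signaled)
			list->entries[list->count++] = old[i];

	list->max = new_max;
	free(old);	/* the real code would defer this via RCU */
	return 0;
}

int main(void)
{
	struct shared_list list = { malloc(4 * sizeof(struct fence_model)), 0, 4 };
	unsigned int i;

	if (!list.entries)
		return 1;

	for (i = 0; i < 64; i++) {
		if (reserve_shared(&list))
			return 1;
		/* every fence but the newest is treated as already signaled */
		if (list.count)
			list.entries[list.count - 1].signaled = true;
		list.entries[list.count++] = (struct fence_model){ i, false };
	}
	free(list.entries);
	/* with reaping at reallocation time the array never had to double */
	return list.max == 4 ? 0 : 1;
}

In this toy run the array never needs to grow past its initial four slots, because stale fences are reclaimed every time a new slot is needed.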
Quoting Christian König (2017-11-15 18:56:43)
> Am 15.11.2017 um 18:43 schrieb Chris Wilson:
> > We only keep the active fences when it grows, which is effective enough
> > to keep it in check on the workloads I can find in the hour since
> > noticing the failure in CI ;)
>
> Not sure what tests you run, but this change means that we basically
> just add the fence to an ever growing list of fences on every command
> submission which is only reaped when we double the list size.

Sure, just the frequency of doubling is also halved every time. Then the
entire array of shared fences is reaped when idle.

Just throwing the issue out there: something that was already slow just
got noticeably slower, and if we just ignore it and only reap on
reallocate, it vanishes. (It's just one case that happened to be caught
by CI because it triggered a timeout; what CI isn't telling us is the
dramatic improvement in other cases from avoiding an array of 1,000,000
stale fences on every object.)

> That is a serious no-go as far as I can see. What we could do is improve
> reservation_object_reserve_shared() as well to figure out how many
> signaled fences are actually in the array when we reallocate it.

Just amortizing all the cost to the reallocate and then avoiding the
growth should be enough. In the past I sent a conversion of the shared
fences over to a compressed idr to avoid the cost of searching the
shared array. It doesn't have the built-in reaping of stale fences, but
we can do the discard on idle.

> > And on the workloads where it is being
> > flooded with live fences from many contexts, the order of magnitude
> > throughput improvement is not easy to ignore.
>
> Well not sure about the Intel driver, but since we started to mostly use
> a single reservation object per process, the time spent in those
> functions on amdgpu is completely negligible.

Implicit fence tracking is at object granularity: those fences are
everywhere and very costly to track, until we can change the entirety of
userspace (and dmabuf) to switch over to only using explicit fencing.
Then it becomes somebody else's problem and the kernel need only track
HW access, which is a much smaller (and predefined) problem.

-Chris
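A back-of-the-envelope model of that amortization argument. The LIFETIME after which a fence is assumed to signal is entirely made up for the sketch and is not taken from any real workload; the point is only to show what blind append plus reap-at-reallocation costs over many submissions.

#include <stdio.h>
#include <stdlib.h>

#define LIFETIME    32		/* submissions until a fence signals (assumed) */
#define SUBMISSIONS 100000

int main(void)
{
	unsigned long *added = malloc(sizeof(*added) * 2 * LIFETIME);
	unsigned long count = 0, max = 2 * LIFETIME;
	unsigned long copies = 0, peak = 0, now;

	if (!added)
		return 1;

	for (now = 0; now < SUBMISSIONS; now++) {
		if (count == max) {
			/* reap: keep only the fences that have not signaled */
			unsigned long live = 0, i;

			for (i = 0; i < count; i++)
				if (added[i] + LIFETIME > now)
					added[live++] = added[i];
			copies += count;	/* cost of the scan/copy */

			if (live == count) {	/* nothing reaped, must grow */
				max *= 2;
				added = realloc(added, sizeof(*added) * max);
				if (!added)
					return 1;
			}
			count = live;
		}
		added[count++] = now;	/* blind append, no scan */
		if (count > peak)
			peak = count;
	}

	printf("%d submissions: peak array %lu, total copy work %lu (%.2f per add)\n",
	       SUBMISSIONS, peak, copies, (double)copies / SUBMISSIONS);
	free(added);
	return 0;
}

With these parameters the array stays bounded at roughly twice the number of live fences and the reaping works out to about two fence copies per submission, which illustrates the kind of amortized cost being argued for here.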
diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index 5319ac478918..5755e95fab1b 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -104,39 +104,19 @@ reservation_object_add_shared_inplace(struct reservation_object *obj,
                                       struct reservation_object_list *fobj,
                                       struct dma_fence *fence)
 {
-       struct dma_fence *replace = NULL;
-       u32 ctx = fence->context;
-       u32 i;
-
        dma_fence_get(fence);

        preempt_disable();
        write_seqcount_begin(&obj->seq);

-       for (i = 0; i < fobj->shared_count; ++i) {
-               struct dma_fence *check;
-
-               check = rcu_dereference_protected(fobj->shared[i],
-                                                 reservation_object_held(obj));
-
-               if (check->context == ctx || dma_fence_is_signaled(check)) {
-                       replace = old_fence;
-                       break;
-               }
-       }
-
        /*
         * memory barrier is added by write_seqcount_begin,
         * fobj->shared_count is protected by this lock too
         */
-       RCU_INIT_POINTER(fobj->shared[i], fence);
-       if (!replace)
-               fobj->shared_count++;
+       RCU_INIT_POINTER(fobj->shared[fobj->shared_count++], fence);

        write_seqcount_end(&obj->seq);
        preempt_enable();
-
-       dma_fence_put(replace);
 }

 static void