Message ID | 1434966909-4113-6-git-send-email-sourab.gupta@intel.com (mailing list archive)
---|---
State | New, archived
On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> From: Sourab Gupta <sourab.gupta@intel.com>
>
> To collect timestamps around any GPU workload, we need to insert
> commands to capture them into the ringbuffer. Therefore, during the stop
> event call, we need to wait for the GPU to complete processing the last
> request for which these commands were inserted. We need to ensure this
> processing is done before the event_destroy callback, which deallocates
> the buffer for holding the data.

There's no reason for this to be synchronous. Just that you need an
active reference on the output buffer to be only released after the
final request.
-Chris
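To make the active-reference idea concrete: GEM already keeps objects alive
while the GPU may write them, by putting their vma on the ring's active list
when a request is emitted. A minimal sketch, assuming the mid-2015
request/active-list helpers (whose signatures were in flux around this time);
insert_timestamp_cmds() is a hypothetical stand-in for the code that emits the
MI_RPC/timestamp commands:

/*
 * Sketch: tie the output buffer's lifetime to the last request that
 * writes it, instead of waiting synchronously at event stop.
 *
 * Moving the vma onto the ring's active list gives the object an
 * active reference that is only dropped when the request retires, so
 * the final unreference at event teardown cannot free the pages while
 * the GPU may still write them.
 */
static int emit_timestamp_capture(struct drm_i915_private *dev_priv,
				  struct intel_engine_cs *ring)
{
	struct drm_i915_gem_object *dest = dev_priv->gen_pmu.buffer.obj;
	struct i915_vma *vma = i915_gem_obj_to_ggtt(dest);	/* pinned, so bound */
	int ret;

	ret = insert_timestamp_cmds(ring, dest);	/* hypothetical */
	if (ret)
		return ret;

	/* dest now holds an active reference until this request retires */
	i915_vma_move_to_active(vma, ring);
	return 0;
}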
On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > From: Sourab Gupta <sourab.gupta@intel.com>
> >
> > To collect timestamps around any GPU workload, we need to insert
> > commands to capture them into the ringbuffer. Therefore, during the stop
> > event call, we need to wait for the GPU to complete processing the last
> > request for which these commands were inserted. We need to ensure this
> > processing is done before the event_destroy callback, which deallocates
> > the buffer for holding the data.
>
> There's no reason for this to be synchronous. Just that you need an
> active reference on the output buffer to be only released after the
> final request.

Yeah I think the interaction between OA sampling and GEM is the critical
piece here for both patch series. Step one is to have a per-pmu lock to
keep track of data private to OA and mmio based sampling as a starting
point. Then we need to figure out how to structure the connection
without OA/PMU and gem trampling over each other's feet.

Wrt requests, those are refcounted already and just waiting on them
doesn't need dev->struct_mutex. That should be all you need, assuming
you do correctly refcount them ...
-Daniel
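A sketch of the refcounted wait Daniel describes, assuming the mid-2015
request API (i915_gem_request_reference(), __i915_wait_request() being
callable without holding dev->struct_mutex, and the __unlocked unreference
variant); gen_pmu.last_request is a hypothetical field standing in for
whatever bookkeeping tracks the final timestamp request:

/*
 * Sketch: wait for the last timestamp request without taking
 * dev->struct_mutex. The request is refcounted, so it stays valid
 * even if it retires while we sleep.
 */
static int gen_pmu_wait_last_request(struct drm_i915_private *dev_priv)
{
	struct drm_i915_gem_request *req;
	int ret;

	spin_lock(&dev_priv->gen_pmu.lock);
	req = dev_priv->gen_pmu.last_request;	/* hypothetical field */
	if (req)
		i915_gem_request_reference(req);
	spin_unlock(&dev_priv->gen_pmu.lock);

	if (!req)
		return 0;

	ret = __i915_wait_request(req,
				  atomic_read(&dev_priv->gpu_error.reset_counter),
				  true /* interruptible */, NULL, NULL);
	i915_gem_request_unreference__unlocked(req);
	return ret;
}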
On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > From: Sourab Gupta <sourab.gupta@intel.com>
> > >
> > > To collect timestamps around any GPU workload, we need to insert
> > > commands to capture them into the ringbuffer. Therefore, during the stop
> > > event call, we need to wait for the GPU to complete processing the last
> > > request for which these commands were inserted. We need to ensure this
> > > processing is done before the event_destroy callback, which deallocates
> > > the buffer for holding the data.
> >
> > There's no reason for this to be synchronous. Just that you need an
> > active reference on the output buffer to be only released after the
> > final request.
>
> Yeah I think the interaction between OA sampling and GEM is the critical
> piece here for both patch series. Step one is to have a per-pmu lock to
> keep track of data private to OA and mmio based sampling as a starting
> point. Then we need to figure out how to structure the connection
> without OA/PMU and gem trampling over each other's feet.
>
> Wrt requests, those are refcounted already and just waiting on them
> doesn't need dev->struct_mutex. That should be all you need, assuming
> you do correctly refcount them ...
> -Daniel

Hi Daniel/Chris,

Thanks for your feedback. I acknowledge the issue with increasing the
coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
track of data private to OA in my next version of the patches.

W.r.t. having the synchronous wait during event stop, I thought through
the method of having an active reference on the output buffer. This will
let us have a delayed buffer destroy (controlled by decrementing the
output buffer refcount as and when gpu requests complete). This will be
decoupled from the event stop here. But that is not the only thing at
play here.

The event stop is expected to disable the OA unit (which it is doing
right now). In my usecase, I can't disable the OA unit until the GPU
processes the already-scheduled MI_RPC requests, which otherwise may
lead to hangs.
So, I'm waiting synchronously for the requests to complete before
disabling the OA unit.

Also, the event_destroy callback should be destroying the buffers
associated with the event. (Right now, in the current design, the
event_destroy callback waits for the above operations to be completed
before proceeding to destroy the buffer.)

If I try to create a function which destroys the output buffer once all
references have been released, all these operations will have to be
performed together in that function, in this serial order (so that when
the refcount drops to 0, i.e. the requests have completed, the OA unit
will be disabled, and then the buffer destroyed). This would lead to
these operations being decoupled from the event_stop and event_destroy
functions. This may be unintended behaviour w.r.t. these callbacks from
a userspace perspective.

Robert, any thoughts?
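For what it's worth, the serial order Sourab describes (requests done, then
OA off, then buffer freed) is exactly what a kref release callback would
give. A sketch only, with hypothetical names, since the open question is
whether decoupling this from event_stop/event_destroy is acceptable from the
userspace side:

#include <linux/kref.h>

/*
 * Sketch: each in-flight MI_RPC request holds a reference; the last
 * put (from the retire path of the final request) disables the OA
 * unit and then frees the destination buffer, preserving the
 * "requests done -> OA off -> buffer freed" ordering.
 */
struct gen_pmu_dest_buf {
	struct kref ref;
	struct drm_i915_private *dev_priv;
	void *addr;
};

static void gen_pmu_dest_buf_release(struct kref *kref)
{
	struct gen_pmu_dest_buf *buf =
		container_of(kref, struct gen_pmu_dest_buf, ref);

	/* All requests writing this buffer have retired by now */
	gen_pmu_disable_oa_unit(buf->dev_priv);	/* hypothetical */
	vunmap(buf->addr);
	kfree(buf);
}

/* Called as each tracked request retires */
static void gen_pmu_req_retired(struct gen_pmu_dest_buf *buf)
{
	kref_put(&buf->ref, gen_pmu_dest_buf_release);
}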
On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > >
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop
> > > > event call, we need to wait for the GPU to complete processing the last
> > > > request for which these commands were inserted. We need to ensure this
> > > > processing is done before the event_destroy callback, which deallocates
> > > > the buffer for holding the data.
> > >
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> >
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection
> > without OA/PMU and gem trampling over each other's feet.
> >
> > Wrt requests, those are refcounted already and just waiting on them
> > doesn't need dev->struct_mutex. That should be all you need, assuming
> > you do correctly refcount them ...
> > -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing the
> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> track of data private to OA in my next version of the patches.

If you create new locking schemes please coordinate with Rob about them.
Better if we come up with a consistent locking scheme for all of OA.
That ofc doesn't exclude that we'll have per-pmu locking, just that the
locking design should match if it makes sense. The example I'm thinking
of is drrs & psr, which both use the frontbuffer tracking code and both
have their private mutex. But the overall approach to locking and
cleaning up the async work is the same.

> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having an active reference on the output buffer. This will
> let us have a delayed buffer destroy (controlled by decrementing the
> output buffer refcount as and when gpu requests complete). This will be
> decoupled from the event stop here. But that is not the only thing at
> play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable the OA unit until the GPU
> processes the already-scheduled MI_RPC requests, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for the requests to complete before
> disabling the OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with the event. (Right now, in the current design, the
> event_destroy callback waits for the above operations to be completed
> before proceeding to destroy the buffer.)
>
> If I try to create a function which destroys the output buffer once all
> references have been released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> the refcount drops to 0, i.e. the requests have completed, the OA unit
> will be disabled, and then the buffer destroyed). This would lead to
> these operations being decoupled from the event_stop and event_destroy
> functions. This may be unintended behaviour w.r.t. these callbacks from
> a userspace perspective.
>
> Robert, any thoughts?

Yeah, no idea whether those perf callbacks are allowed to stall for a
longer time. If not, we could simply push the cleanup/OA disabling into
an async work.
-Daniel
On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > >
> > > > To collect timestamps around any GPU workload, we need to insert
> > > > commands to capture them into the ringbuffer. Therefore, during the stop
> > > > event call, we need to wait for the GPU to complete processing the last
> > > > request for which these commands were inserted. We need to ensure this
> > > > processing is done before the event_destroy callback, which deallocates
> > > > the buffer for holding the data.
> > >
> > > There's no reason for this to be synchronous. Just that you need an
> > > active reference on the output buffer to be only released after the
> > > final request.
> >
> > Yeah I think the interaction between OA sampling and GEM is the critical
> > piece here for both patch series. Step one is to have a per-pmu lock to
> > keep track of data private to OA and mmio based sampling as a starting
> > point. Then we need to figure out how to structure the connection
> > without OA/PMU and gem trampling over each other's feet.
> >
> > Wrt requests, those are refcounted already and just waiting on them
> > doesn't need dev->struct_mutex. That should be all you need, assuming
> > you do correctly refcount them ...
> > -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing the
> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> track of data private to OA in my next version of the patches.
>
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having an active reference on the output buffer. This will
> let us have a delayed buffer destroy (controlled by decrementing the
> output buffer refcount as and when gpu requests complete). This will be
> decoupled from the event stop here. But that is not the only thing at
> play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable the OA unit until the GPU
> processes the already-scheduled MI_RPC requests, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for the requests to complete before
> disabling the OA unit.

There's no issue here. Just hook into the retire-notification callback
for the last active request of the oa buffer. If you are using LRI to
disable the OA context writing to the buffer, you can simply create a
request for the LRI and use the active reference as the retire
notification. (Otoh, if the oa buffer is inactive you can simply do the
decoupling immediately.) You shouldn't need an object-free-notification
callback aiui.

> Also, the event_destroy callback should be destroying the buffers
> associated with the event. (Right now, in the current design, the
> event_destroy callback waits for the above operations to be completed
> before proceeding to destroy the buffer.)
>
> If I try to create a function which destroys the output buffer once all
> references have been released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> the refcount drops to 0, i.e. the requests have completed, the OA unit
> will be disabled, and then the buffer destroyed). This would lead to
> these operations being decoupled from the event_stop and event_destroy
> functions. This may be unintended behaviour w.r.t. these callbacks from
> a userspace perspective.

When the perf event is stopped, you stop generating perf events. But the
buffer may still be alive, and may be resurrected if a new event is
started, but you will want to ensure that pending oa writes are ignored.
-Chris
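A rough sketch of the LRI idea, assuming the mid-2015 ring emission helpers
(intel_ring_begin()/intel_ring_emit()/intel_ring_advance() and
i915_add_request()) and that OACONTROL, as named in i915_reg.h of that era,
is the register being cleared; both are assumptions here:

/*
 * Sketch: emit the OACONTROL disable as an LRI from the ring, so the
 * disable itself becomes a request. Making the OA buffer active in
 * that request means its retirement is effectively our notification
 * that no further OA writes can land.
 */
static int gen_pmu_disable_oa_via_lri(struct intel_engine_cs *ring,
				      struct i915_vma *dest_vma)
{
	int ret;

	ret = intel_ring_begin(ring, 4);
	if (ret)
		return ret;

	intel_ring_emit(ring, MI_LOAD_REGISTER_IMM(1));
	intel_ring_emit(ring, OACONTROL);
	intel_ring_emit(ring, 0);		/* disable */
	intel_ring_emit(ring, MI_NOOP);
	intel_ring_advance(ring);

	/* dest buffer stays alive until this request retires */
	i915_vma_move_to_active(dest_vma, ring);

	return i915_add_request(ring);
}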
On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
> > > > > From: Sourab Gupta <sourab.gupta@intel.com>
> > > > >
> > > > > To collect timestamps around any GPU workload, we need to insert
> > > > > commands to capture them into the ringbuffer. Therefore, during the stop
> > > > > event call, we need to wait for the GPU to complete processing the last
> > > > > request for which these commands were inserted. We need to ensure this
> > > > > processing is done before the event_destroy callback, which deallocates
> > > > > the buffer for holding the data.
> > > >
> > > > There's no reason for this to be synchronous. Just that you need an
> > > > active reference on the output buffer to be only released after the
> > > > final request.
> > >
> > > Yeah I think the interaction between OA sampling and GEM is the critical
> > > piece here for both patch series. Step one is to have a per-pmu lock to
> > > keep track of data private to OA and mmio based sampling as a starting
> > > point. Then we need to figure out how to structure the connection
> > > without OA/PMU and gem trampling over each other's feet.
> > >
> > > Wrt requests, those are refcounted already and just waiting on them
> > > doesn't need dev->struct_mutex. That should be all you need, assuming
> > > you do correctly refcount them ...
> > > -Daniel
> >
> > Hi Daniel/Chris,
> >
> > Thanks for your feedback. I acknowledge the issue with increasing the
> > coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> > track of data private to OA in my next version of the patches.
>
> If you create new locking schemes please coordinate with Rob about them.
> Better if we come up with a consistent locking scheme for all of OA.
> That ofc doesn't exclude that we'll have per-pmu locking, just that the
> locking design should match if it makes sense. The example I'm thinking
> of is drrs & psr, which both use the frontbuffer tracking code and both
> have their private mutex. But the overall approach to locking and
> cleaning up the async work is the same.
>
Ok, I'll coordinate with Robert about a consistent locking scheme here.

> > W.r.t. having the synchronous wait during event stop, I thought through
> > the method of having an active reference on the output buffer. This will
> > let us have a delayed buffer destroy (controlled by decrementing the
> > output buffer refcount as and when gpu requests complete). This will be
> > decoupled from the event stop here. But that is not the only thing at
> > play here.
> >
> > The event stop is expected to disable the OA unit (which it is doing
> > right now). In my usecase, I can't disable the OA unit until the GPU
> > processes the already-scheduled MI_RPC requests, which otherwise may
> > lead to hangs.
> > So, I'm waiting synchronously for the requests to complete before
> > disabling the OA unit.
> >
> > Also, the event_destroy callback should be destroying the buffers
> > associated with the event. (Right now, in the current design, the
> > event_destroy callback waits for the above operations to be completed
> > before proceeding to destroy the buffer.)
> >
> > If I try to create a function which destroys the output buffer once all
> > references have been released, all these operations will have to be
> > performed together in that function, in this serial order (so that when
> > the refcount drops to 0, i.e. the requests have completed, the OA unit
> > will be disabled, and then the buffer destroyed). This would lead to
> > these operations being decoupled from the event_stop and event_destroy
> > functions. This may be unintended behaviour w.r.t. these callbacks from
> > a userspace perspective.
> >
> > Robert, any thoughts?
>
> Yeah, no idea whether those perf callbacks are allowed to stall for a
> longer time. If not, we could simply push the cleanup/OA disabling into
> an async work.
> -Daniel

Hi Daniel,
Actually right now, the stop_event callback is not stalled. Instead I'm
scheduling an async work fn to do the job. The event destroy then waits
for the work fn to finish processing before proceeding to destroy the
buffer.
I'm also thinking through Chris' suggestions here and will try to come
up with a solution based on them.

Thanks,
Sourab
On Thu, Jun 25, 2015 at 9:27 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Thu, 2015-06-25 at 07:42 +0000, Daniel Vetter wrote:
>> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> > On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > > >
>> > > > > To collect timestamps around any GPU workload, we need to insert
>> > > > > commands to capture them into the ringbuffer. Therefore, during the stop
>> > > > > event call, we need to wait for the GPU to complete processing the last
>> > > > > request for which these commands were inserted. We need to ensure this
>> > > > > processing is done before the event_destroy callback, which deallocates
>> > > > > the buffer for holding the data.
>> > > >
>> > > > There's no reason for this to be synchronous. Just that you need an
>> > > > active reference on the output buffer to be only released after the
>> > > > final request.
>> > >
>> > > Yeah I think the interaction between OA sampling and GEM is the critical
>> > > piece here for both patch series. Step one is to have a per-pmu lock to
>> > > keep track of data private to OA and mmio based sampling as a starting
>> > > point. Then we need to figure out how to structure the connection
>> > > without OA/PMU and gem trampling over each other's feet.
>> > >
>> > > Wrt requests, those are refcounted already and just waiting on them
>> > > doesn't need dev->struct_mutex. That should be all you need, assuming
>> > > you do correctly refcount them ...
>> > > -Daniel
>> >
>> > Hi Daniel/Chris,
>> >
>> > Thanks for your feedback. I acknowledge the issue with increasing the
>> > coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
>> > track of data private to OA in my next version of the patches.
>>
>> If you create new locking schemes please coordinate with Rob about them.
>> Better if we come up with a consistent locking scheme for all of OA.
>> That ofc doesn't exclude that we'll have per-pmu locking, just that the
>> locking design should match if it makes sense. The example I'm thinking
>> of is drrs & psr, which both use the frontbuffer tracking code and both
>> have their private mutex. But the overall approach to locking and
>> cleaning up the async work is the same.
>>
> Ok, I'll coordinate with Robert about a consistent locking scheme here.
>
>> > W.r.t. having the synchronous wait during event stop, I thought through
>> > the method of having an active reference on the output buffer. This will
>> > let us have a delayed buffer destroy (controlled by decrementing the
>> > output buffer refcount as and when gpu requests complete). This will be
>> > decoupled from the event stop here. But that is not the only thing at
>> > play here.
>> >
>> > The event stop is expected to disable the OA unit (which it is doing
>> > right now). In my usecase, I can't disable the OA unit until the GPU
>> > processes the already-scheduled MI_RPC requests, which otherwise may
>> > lead to hangs.
>> > So, I'm waiting synchronously for the requests to complete before
>> > disabling the OA unit.
>> >
>> > Also, the event_destroy callback should be destroying the buffers
>> > associated with the event. (Right now, in the current design, the
>> > event_destroy callback waits for the above operations to be completed
>> > before proceeding to destroy the buffer.)
>> >
>> > If I try to create a function which destroys the output buffer once all
>> > references have been released, all these operations will have to be
>> > performed together in that function, in this serial order (so that when
>> > the refcount drops to 0, i.e. the requests have completed, the OA unit
>> > will be disabled, and then the buffer destroyed). This would lead to
>> > these operations being decoupled from the event_stop and event_destroy
>> > functions. This may be unintended behaviour w.r.t. these callbacks from
>> > a userspace perspective.
>> >
>> > Robert, any thoughts?
>>
>> Yeah, no idea whether those perf callbacks are allowed to stall for a
>> longer time. If not, we could simply push the cleanup/OA disabling into
>> an async work.
>> -Daniel
>
> Hi Daniel,
> Actually right now, the stop_event callback is not stalled. Instead I'm
> scheduling an async work fn to do the job. The event destroy then waits
> for the work fn to finish processing before proceeding to destroy the
> buffer.
> I'm also thinking through Chris' suggestions here and will try to come
> up with a solution based on them.

A notable detail here is that most pmu methods are called in atomic
context by events/core.c via inter-processor interrupts (as a somewhat
unintended side effect of userspace opening i915_oa events as
specific-cpu events, perf will make sure all methods are invoked on
that specific cpu). This means waiting in pmu->event_stop() isn't even
an option for us, though as noted it's only the destroy currently that
waits for the completion of the stop work fn. event_init() and
event->destroy() are the two exceptions that are called in process
context.

That said though, it does seem worth considering whether we can avoid
waiting in event_destroy() if it's not strictly necessary, and Chris's
suggestion to follow a refcounting scheme sounds workable.

Regards,
- Robert

>
> Thanks,
> Sourab
>
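The split Robert describes is visible in the patch at the bottom of this
thread; schematically it looks like the following (a sketch of the
constraint rather than the series' exact code, reusing its gen_pmu field
names):

/* pmu->stop(): reached via IPI, atomic context: must not sleep */
static void i915_gen_event_stop(struct perf_event *event, int flags)
{
	struct drm_i915_private *dev_priv =
		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);

	hrtimer_cancel(&dev_priv->gen_pmu.timer);
	/* defer anything that can block to process context */
	schedule_work(&dev_priv->gen_pmu.work_event_stop);
}

/* event->destroy(): process context, allowed to block */
static void i915_gen_event_destroy(struct perf_event *event)
{
	struct drm_i915_private *dev_priv =
		container_of(event->pmu, typeof(*dev_priv), gen_pmu.pmu);

	wait_for_completion(&dev_priv->gen_pmu.complete);
	gen_buffer_destroy(dev_priv);
}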
On Thu, Jun 25, 2015 at 7:02 AM, Gupta, Sourab <sourab.gupta@intel.com> wrote:
> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > >
>> > > To collect timestamps around any GPU workload, we need to insert
>> > > commands to capture them into the ringbuffer. Therefore, during the stop
>> > > event call, we need to wait for the GPU to complete processing the last
>> > > request for which these commands were inserted. We need to ensure this
>> > > processing is done before the event_destroy callback, which deallocates
>> > > the buffer for holding the data.
>> >
>> > There's no reason for this to be synchronous. Just that you need an
>> > active reference on the output buffer to be only released after the
>> > final request.
>>
>> Yeah I think the interaction between OA sampling and GEM is the critical
>> piece here for both patch series. Step one is to have a per-pmu lock to
>> keep track of data private to OA and mmio based sampling as a starting
>> point. Then we need to figure out how to structure the connection
>> without OA/PMU and gem trampling over each other's feet.

From the initial i915_oa code I think the notable locks are:

- the struct perf_event_context::lock spin lock, taken by events/core.c
  before calling any of our pmu methods (with the exception of
  event_init and event->destroy()), so we at least don't need to worry
  about pmu methods trampling each other.

- struct_mutex: I think we should only take this to meet the existing
  assumptions of gem code we're using. Since most pmu methods are in
  interrupt context, our calling into gem is generally constrained to
  event_init or event->destroy() without having to use workers.

- we have a pmu spin lock to protect OA register state, notably because
  besides the pmu methods we also have hooks into gem context switching
  or pinning/unpinning that may trigger updates to the OACONTROL state,
  vs e.g. disabling OACONTROL in pmu->stop(), which may run concurrently
  on different cpus.

- we have an oa_buffer.flush_lock spin lock, since in addition to a
  hrtimer periodically forwarding samples from the OABUFFER to perf,
  userspace may also explicitly request a flush via fsync() or we might
  flush at pmu->stop(). As a further detail here, if the hrtimer
  contends with userspace flushing, the hrtimer won't ever actually
  wait on flush_lock, on the assumption that it's not necessary to have
  a back-to-back flush when the buffer will probably be empty.

I have to admit I haven't really scrutinized the locking details of
Sourab's work yet, but it may make sense to introduce some additional
exclusion scheme for managing access to the fifo of pending RPC
commands.

I need to double check with Sourab, but I think we'd already discussed
some changes to how forwarding of these RPC based reports will be
managed, where I thought we might be able to avoid the need for a
worker and struct_mutex locking while forwarding. I think Sourab may
have just preferred to send something out before this refactor to try
and solicit broader, high level feedback on his work sooner. I imagine
if we do refactor to avoid the worker for forwarding though, that will
affect the locking details regarding the fifo structure.

Regards,
- Robert

>>
>> Wrt requests, those are refcounted already and just waiting on them
>> doesn't need dev->struct_mutex. That should be all you need, assuming
>> you do correctly refcount them ...
>> -Daniel
>
> Hi Daniel/Chris,
>
> Thanks for your feedback. I acknowledge the issue with increasing the
> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
> track of data private to OA in my next version of the patches.
>
> W.r.t. having the synchronous wait during event stop, I thought through
> the method of having an active reference on the output buffer. This will
> let us have a delayed buffer destroy (controlled by decrementing the
> output buffer refcount as and when gpu requests complete). This will be
> decoupled from the event stop here. But that is not the only thing at
> play here.
>
> The event stop is expected to disable the OA unit (which it is doing
> right now). In my usecase, I can't disable the OA unit until the GPU
> processes the already-scheduled MI_RPC requests, which otherwise may
> lead to hangs.
> So, I'm waiting synchronously for the requests to complete before
> disabling the OA unit.
>
> Also, the event_destroy callback should be destroying the buffers
> associated with the event. (Right now, in the current design, the
> event_destroy callback waits for the above operations to be completed
> before proceeding to destroy the buffer.)
>
> If I try to create a function which destroys the output buffer once all
> references have been released, all these operations will have to be
> performed together in that function, in this serial order (so that when
> the refcount drops to 0, i.e. the requests have completed, the OA unit
> will be disabled, and then the buffer destroyed). This would lead to
> these operations being decoupled from the event_stop and event_destroy
> functions. This may be unintended behaviour w.r.t. these callbacks from
> a userspace perspective.
>
> Robert, any thoughts?
>

I think a few pertinent details here that inform how we need to handle
this are:

1) Almost all the pmu methods are called in atomic context (except
event_init) as they are invoked via events/core.c via an
inter-processor interrupt, so waiting for a completion in event_destroy
isn't really an option for us.
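The flush_lock behaviour described in the last bullet above maps naturally
onto spin_trylock(): the hrtimer skips a tick rather than waiting when a
flush is already in progress. A sketch paraphrasing that design (the
function and field names follow the i915_oa series loosely and shouldn't be
read as its exact code; the period is illustrative):

#define OA_FLUSH_PERIOD_NS 5000000	/* 5ms, illustrative */

static enum hrtimer_restart hrtimer_sample(struct hrtimer *hrtimer)
{
	struct drm_i915_private *dev_priv =
		container_of(hrtimer, typeof(*dev_priv), oa_pmu.timer);

	/*
	 * If userspace (fsync) or pmu->stop() is already flushing, skip
	 * this tick: straight after a back-to-back flush the buffer
	 * would most likely be empty anyway.
	 */
	if (spin_trylock(&dev_priv->oa_pmu.oa_buffer.flush_lock)) {
		forward_oa_snapshots(dev_priv);		/* hypothetical */
		spin_unlock(&dev_priv->oa_pmu.oa_buffer.flush_lock);
	}

	hrtimer_forward_now(hrtimer, ns_to_ktime(OA_FLUSH_PERIOD_NS));
	return HRTIMER_RESTART;
}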
>> Robert, any thoughts?
>>
>
> I think a few pertinent details here that inform how we need to handle
> this are:
>
> 1) Almost all the pmu methods are called in atomic context (except
> event_init) as they are invoked via events/core.c via an
> inter-processor interrupt, so waiting for a completion in event_destroy
> isn't really an option for us.

Oops, meant to delete this incomplete comment. Before double-checking
events/core.c I was worried that event->destroy() might be called in
interrupt context, but that's not the case.

- Robert
On Thu, Jun 25, 2015 at 9:02 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> On Thu, Jun 25, 2015 at 06:02:35AM +0000, Gupta, Sourab wrote:
>> On Mon, 2015-06-22 at 16:09 +0000, Daniel Vetter wrote:
>> > On Mon, Jun 22, 2015 at 02:22:54PM +0100, Chris Wilson wrote:
>> > > On Mon, Jun 22, 2015 at 03:25:07PM +0530, sourab.gupta@intel.com wrote:
>> > > > From: Sourab Gupta <sourab.gupta@intel.com>
>> > > >
>> > > > To collect timestamps around any GPU workload, we need to insert
>> > > > commands to capture them into the ringbuffer. Therefore, during the stop
>> > > > event call, we need to wait for the GPU to complete processing the last
>> > > > request for which these commands were inserted. We need to ensure this
>> > > > processing is done before the event_destroy callback, which deallocates
>> > > > the buffer for holding the data.
>> > >
>> > > There's no reason for this to be synchronous. Just that you need an
>> > > active reference on the output buffer to be only released after the
>> > > final request.
>> >
>> > Yeah I think the interaction between OA sampling and GEM is the critical
>> > piece here for both patch series. Step one is to have a per-pmu lock to
>> > keep track of data private to OA and mmio based sampling as a starting
>> > point. Then we need to figure out how to structure the connection
>> > without OA/PMU and gem trampling over each other's feet.
>> >
>> > Wrt requests, those are refcounted already and just waiting on them
>> > doesn't need dev->struct_mutex. That should be all you need, assuming
>> > you do correctly refcount them ...
>> > -Daniel
>>
>> Hi Daniel/Chris,
>>
>> Thanks for your feedback. I acknowledge the issue with increasing the
>> coverage scope of the i915 dev mutex. I'll have a per-pmu lock to keep
>> track of data private to OA in my next version of the patches.
>>
>> W.r.t. having the synchronous wait during event stop, I thought through
>> the method of having an active reference on the output buffer. This will
>> let us have a delayed buffer destroy (controlled by decrementing the
>> output buffer refcount as and when gpu requests complete). This will be
>> decoupled from the event stop here. But that is not the only thing at
>> play here.
>>
>> The event stop is expected to disable the OA unit (which it is doing
>> right now). In my usecase, I can't disable the OA unit until the GPU
>> processes the already-scheduled MI_RPC requests, which otherwise may
>> lead to hangs.
>> So, I'm waiting synchronously for the requests to complete before
>> disabling the OA unit.
>
> There's no issue here. Just hook into the retire-notification callback
> for the last active request of the oa buffer. If you are using LRI to
> disable the OA context writing to the buffer, you can simply create a
> request for the LRI and use the active reference as the retire
> notification. (Otoh, if the oa buffer is inactive you can simply do the
> decoupling immediately.) You shouldn't need an object-free-notification
> callback aiui.
>
>> Also, the event_destroy callback should be destroying the buffers
>> associated with the event. (Right now, in the current design, the
>> event_destroy callback waits for the above operations to be completed
>> before proceeding to destroy the buffer.)
>>
>> If I try to create a function which destroys the output buffer once all
>> references have been released, all these operations will have to be
>> performed together in that function, in this serial order (so that when
>> the refcount drops to 0, i.e. the requests have completed, the OA unit
>> will be disabled, and then the buffer destroyed). This would lead to
>> these operations being decoupled from the event_stop and event_destroy
>> functions. This may be unintended behaviour w.r.t. these callbacks from
>> a userspace perspective.
>
> When the perf event is stopped, you stop generating perf events. But the
> buffer may still be alive, and may be resurrected if a new event is
> started, but you will want to ensure that pending oa writes are ignored.
> -Chris

Just to summarise some things I know Sourab discussed with Chris on IRC
and I then also discussed with Sourab offline.

It does look like we can avoid any waiting in event->destroy(),
decoupling the tracking of outstanding RPC commands + oacontrol disable
from the event active status. There's no strict reason OACONTROL needs
to be disabled when stopping or destroying an event if there are still
pending RPC commands; we just need to stop forwarding samples while
disabled and discard reports that land after an event is destroyed.

We can also update the hw.state and active state before OACONTROL is
disabled, whereas currently the code is updating this state outside of
core perf's context lock, which isn't good, and we might also end up
continuing to forward periodic samples from the oabuffer to perf when
the user expects the event to be disabled.

There was a discussion of updating oacontrol via LRIs, but my concern
here is with wanting to update oacontrol state for periodic sampling
use cases, where it would seem awkward to handle that via LRIs.
Enabling/disabling periodic sampling can be considered orthogonal to
fully disabling oacontrol, and it will be useful to use periodic
sampling in conjunction with RPC sampling. When we stop an event we
should be able to immediately stop periodic sampling even if we must
leave OA enabled for a pending RPC command. If the event is started
again we can quickly re-enable periodic sampling, but it might be
awkward to consider an in-flight LRI we know is going to disable
oacontrol soon.

Related to using LRIs, though, was the idea of using the LRI request to
help manage the lifetime of the destination buffer, by adding the vmas
of the dest buffer to the request's active list. It seems like we
should be able to do this kind of thing with the requests associated
with the MI_RPC commands instead.

Thinking about the details of waiting for the last RPC command before
destroying the dest buffer and disabling OACONTROL, these are the
requirements I see:

- we want to free the dest buffer in finite time, since it's large
  (i.e. we don't want to assume it's ok to keep it around if allocated
  once)
- we must wait for the last RPC commands to complete before disabling
  oacontrol (and can free the dest buffer at the same point)
- in any subsequent event_init() we need to be able to verify we don't
  /still/ have pending RPC commands, because an incompatible oacontrol
  requirement at this point is a corner case where we really do have to
  block and wait for those RPC commands to complete, or say return
  -EBUSY.

Without the retire-notification api that Chris has been experimenting
with (which could provide us a convenient callback when our RPC
commands retire), the current plan is to still schedule some work in
event->destroy() to wait for the last outstanding RPC command to retire
before disabling oacontrol as well as freeing the RPC destination
buffer, but notably there's no need to block on its completion. I
imagine this could later easily be adapted to work with a
retire-notification api, and maybe Sourab could even experiment with
Chris's wip code for this to compare.

In the end, I think the only time we'll really need to block waiting
for outstanding requests is in event_init(), in cases where there are
still pending RPC commands and the previous oacontrol report-format is
smaller than the newly requested format (which should basically be
never, I guess, since in practice most users will be requesting the
largest report format... and technically I suppose there are even lots
of cases where you could safely allow the report size to increase if
you know there's adequate space in the old dest buffer; though probably
not worth fussing about).

Besides avoiding blocking in event->destroy(), I'd been thinking we
wouldn't need the worker for forwarding the RPC reports to perf and
could instead just use a hrtimer like we do for periodic samples.
Talking this through with Sourab, though, he tried this, and it looks
like it would be awkward due to needing to drop the references on
request structs, which may in turn need to take struct_mutex to finally
free.

In terms of locking while forwarding samples and in the insert_cmd
hooks, for the fifo structures tracking pending commands we should stop
using struct_mutex (except if necessary for calling into gem) and we
should also avoid locking around all of the forwarding work with any
mutex that (if anything) just needs to protect updates to the fifo
itself. I guess there's some reasonable lock free way to handle adding
and removing from these fifos, but I haven't considered that carefully
and don't have a strong opinion on particular details here.

Ok, I think that pretty much summarises what was discussed offline.

Regards,
- Robert

>
> --
> Chris Wilson, Intel Open Source Technology Centre
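On the closing lock-free fifo point: if commands are only inserted from one
path and only removed by the forwarding path, a single-producer/
single-consumer ring with acquire/release ordering is one reasonable shape.
A sketch only; the series' actual layout (a ts_node array with head/tail
indices in a queue header) differs in detail:

/*
 * Sketch: SPSC ring of pending-RPC records. The producer only writes
 * `head`, the consumer only writes `tail`; the release/acquire pairs
 * make the node contents visible before the index update.
 */
struct pending_rpc {
	struct drm_i915_gem_request *req;
	u32 offset;	/* report offset in the dest buffer */
};

#define RPC_RING_SIZE 1024	/* power of two, illustrative */

struct rpc_ring {
	struct pending_rpc nodes[RPC_RING_SIZE];
	u32 head;	/* written only by producer (insert_cmd hook) */
	u32 tail;	/* written only by consumer (forwarding) */
};

static bool rpc_ring_put(struct rpc_ring *r, const struct pending_rpc *node)
{
	u32 head = r->head;

	if (head - smp_load_acquire(&r->tail) == RPC_RING_SIZE)
		return false;	/* full */

	r->nodes[head % RPC_RING_SIZE] = *node;
	smp_store_release(&r->head, head + 1);	/* publish */
	return true;
}

static bool rpc_ring_get(struct rpc_ring *r, struct pending_rpc *node)
{
	u32 tail = r->tail;

	if (tail == smp_load_acquire(&r->head))
		return false;	/* empty */

	*node = r->nodes[tail % RPC_RING_SIZE];
	smp_store_release(&r->tail, tail + 1);
	return true;
}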
On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> Thinking about the details of waiting for the last RPC command before
> destroying the dest buffer and disabling OACONTROL, these are the
> requirements I see:
>
> - we want to free the dest buffer in finite time, since it's large
>   (i.e. we don't want to assume it's ok to keep it around if allocated
>   once)

Skipping to this point, no you don't. If you mark the buffer as
purgeable after you stop tracking it (though there may still be pending
writes from the GPU), the system will preferentially reuse the buffer's
backing storage when memory is tight (including waiting for the GPU).
Similarly, you need to unpin it as soon as possible so that the address
space is available for reuse asap.
-Chris
On Thu, Jun 25, 2015 at 06:37:18PM +0100, Chris Wilson wrote:
> On Thu, Jun 25, 2015 at 06:31:52PM +0100, Robert Bragg wrote:
> > Thinking about the details of waiting for the last RPC command before
> > destroying the dest buffer and disabling OACONTROL, these are the
> > requirements I see:
> >
> > - we want to free the dest buffer in finite time, since it's large
> >   (i.e. we don't want to assume it's ok to keep it around if
> >   allocated once)
>
> Skipping to this point, no you don't. If you mark the buffer as
> purgeable after you stop tracking it (though there may still be pending
> writes from the GPU), the system will preferentially reuse the buffer's
> backing storage when memory is tight (including waiting for the GPU).
> Similarly, you need to unpin it as soon as possible so that the address
> space is available for reuse asap.

Or perhaps more to the point, you cannot free the buffer before the GPU
is finished. It doesn't matter if you want to free it synchronously; it
cannot happen before it would have been freed by dropping the active
reference anyway.
-Chris
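For reference, what Chris describes corresponds to the kernel-internal
madvise state: setting obj->madv to I915_MADV_DONTNEED under struct_mutex
(the batch pool code of this era does the same) lets the shrinker reclaim
the backing pages when memory is tight, while the active reference keeps the
object itself alive until the GPU is done. A sketch under those assumptions;
buffer.obj is a hypothetical field for the gem object behind
gen_pmu.buffer.addr:

/*
 * Sketch: stop tracking the dest buffer without waiting. The object
 * keeps an active reference from its last request, so dropping our
 * reference is safe; marking it purgeable lets the shrinker reclaim
 * the backing pages (waiting on the GPU if needed) when memory is
 * tight, and unpinning frees the GGTT address space immediately.
 */
static void gen_pmu_release_dest_buffer(struct drm_i915_private *dev_priv)
{
	struct drm_i915_gem_object *obj = dev_priv->gen_pmu.buffer.obj;

	mutex_lock(&dev_priv->dev->struct_mutex);

	i915_gem_object_ggtt_unpin(obj);
	obj->madv = I915_MADV_DONTNEED;
	drm_gem_object_unreference(&obj->base);

	mutex_unlock(&dev_priv->dev->struct_mutex);
}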
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 25c0938..a0e1d17 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2022,6 +2022,8 @@ struct drm_i915_private {
 			u32 tail;
 		} buffer;
 		struct work_struct work_timer;
+		struct work_struct work_event_stop;
+		struct completion complete;
 	} gen_pmu;
 
 	struct list_head profile_cmd;
diff --git a/drivers/gpu/drm/i915/i915_oa_perf.c b/drivers/gpu/drm/i915/i915_oa_perf.c
index e3e867f..574b6d3 100644
--- a/drivers/gpu/drm/i915/i915_oa_perf.c
+++ b/drivers/gpu/drm/i915/i915_oa_perf.c
@@ -306,6 +306,9 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	int head, tail, num_nodes, ret;
 	struct drm_i915_gem_request *req;
 
+	if (dev_priv->gen_pmu.event_active == false)
+		return;
+
 	first_node = (struct drm_i915_ts_node *)
 			((char *)hdr + hdr->data_offset);
 	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
@@ -335,6 +338,50 @@ void forward_gen_pmu_snapshots_work(struct work_struct *__work)
 	mutex_unlock(&dev_priv->dev->struct_mutex);
 }
 
+void i915_gen_pmu_stop_work_fn(struct work_struct *__work)
+{
+	struct drm_i915_private *dev_priv =
+		container_of(__work, typeof(*dev_priv),
+			gen_pmu.work_event_stop);
+	struct perf_event *event = dev_priv->gen_pmu.exclusive_event;
+	struct drm_i915_ts_queue_header *hdr =
+		(struct drm_i915_ts_queue_header *)
+		dev_priv->gen_pmu.buffer.addr;
+	struct drm_i915_ts_node *first_node, *node;
+	int head, tail, num_nodes, ret;
+	struct drm_i915_gem_request *req;
+
+	first_node = (struct drm_i915_ts_node *)
+			((char *)hdr + hdr->data_offset);
+	num_nodes = (hdr->size_in_bytes - hdr->data_offset) /
+			sizeof(*node);
+
+
+	ret = i915_mutex_lock_interruptible(dev_priv->dev);
+	if (ret)
+		return;
+
+	i915_gen_pmu_wait_gpu(dev_priv);
+
+	/* Ensure that all requests are completed */
+	tail = hdr->node_count;
+	head = dev_priv->gen_pmu.buffer.head;
+	while ((head % num_nodes) != (tail % num_nodes)) {
+		node = &first_node[head % num_nodes];
+		req = node->node_info.req;
+		if (req && !i915_gem_request_completed(req, true))
+			WARN_ON(1);
+		head++;
+	}
+
+	event->hw.state = PERF_HES_STOPPED;
+	dev_priv->gen_pmu.buffer.tail = 0;
+	dev_priv->gen_pmu.buffer.head = 0;
+
+	mutex_unlock(&dev_priv->dev->struct_mutex);
+	complete(&dev_priv->gen_pmu.complete);
+}
+
 static void gen_pmu_flush_snapshots(struct drm_i915_private *dev_priv)
 {
 	WARN_ON(!dev_priv->gen_pmu.buffer.addr);
@@ -562,6 +609,7 @@ static void i915_oa_event_destroy(struct perf_event *event)
 
 static void gen_buffer_destroy(struct drm_i915_private *i915)
 {
+	wait_for_completion(&i915->gen_pmu.complete);
 	mutex_lock(&i915->dev->struct_mutex);
 
 	vunmap(i915->gen_pmu.buffer.addr);
@@ -1409,7 +1457,7 @@ static void i915_gen_event_stop(struct perf_event *event, int flags)
 	hrtimer_cancel(&dev_priv->gen_pmu.timer);
 	gen_pmu_flush_snapshots(dev_priv);
 
-	event->hw.state = PERF_HES_STOPPED;
+	schedule_work(&dev_priv->gen_pmu.work_event_stop);
 }
 
 static int i915_gen_event_add(struct perf_event *event, int flags)
@@ -1595,6 +1643,9 @@ void i915_gen_pmu_register(struct drm_device *dev)
 	i915->gen_pmu.timer.function = hrtimer_sample_gen;
 
 	INIT_WORK(&i915->gen_pmu.work_timer, forward_gen_pmu_snapshots_work);
+	INIT_WORK(&i915->gen_pmu.work_event_stop, i915_gen_pmu_stop_work_fn);
+	init_completion(&i915->gen_pmu.complete);
+
 	spin_lock_init(&i915->gen_pmu.lock);
 
 	i915->gen_pmu.pmu.capabilities = PERF_PMU_CAP_IS_DEVICE;
@@ -1625,6 +1676,7 @@ void i915_gen_pmu_unregister(struct drm_device *dev)
 		return;
 
 	cancel_work_sync(&i915->gen_pmu.work_timer);
+	cancel_work_sync(&i915->gen_pmu.work_event_stop);
 
 	perf_pmu_unregister(&i915->gen_pmu.pmu);
 	i915->gen_pmu.pmu.event_init = NULL;