
[BUG,4.17] etnaviv-gpu f1840000.gpu: recover hung GPU!

Message ID 1529411326.7211.19.camel@pengutronix.de (mailing list archive)
State New, archived

Commit Message

Lucas Stach June 19, 2018, 12:28 p.m. UTC
On Tuesday, 19.06.2018 at 12:42 +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 19, 2018 at 01:11:29PM +0200, Lucas Stach wrote:
> > On Tuesday, 19.06.2018 at 12:00 +0100, Russell King - ARM Linux wrote:
> > > No, it's not "a really big job" - it's just that the Dove GC600 is not
> > > fast enough to complete _two_ 1080p sized GPU operations within 500ms.
> > > The preceding job contained two blits - one of them a non-alphablend
> > > copy of:
> > > 
> > >                 00180000 04200780  0,24,1920,1056 -> 0,24,1920,1056
> > > 
> > > and one an alpha blended copy of:
> > > 
> > >                 00000000 04380780  0,0,1920,1080 -> 0,0,1920,1080
> > > 
> > > This is (iirc) something I already fixed with the addition of the
> > > progress detection back before etnaviv was merged into the mainline
> > > kernel.
> > 
> > I hadn't expected it to be this slow. I see that we might need to bring
> > back the progress detection to fix the userspace regression, but I'm
> > not fond of this, as it might lead to really bad QoS.
> 
> Well, the choices are that or worse overall performance through having
> to ignore the GPU entirely.
> 
> > I would prefer userspace tracking the size of the blits and flushing
> > the cmdstream at an appropriate time, so we don't end up with really
> > long running jobs, but I'm not sure if this would be acceptable to
> > you...
> 
> The question becomes how to split up two operations.  Yes, we could
> submit them individually, but if they're together taking in excess of
> 500ms, then it's likely that individually, each operation will take in
> excess of 250ms which is still a long time.
> 
> In any case, I think we need to fix this for 4.17-stable and then try
> to work out (a) which operations are taking a long time, and (b) how to
> solve this issue.

Agreed. I'll look into bringing back the progress detection for 4.17
stable.

I'm still curious why the GC600 on the Dove is that slow. With
performance like this moving a big(ish) window on the screen must be a
horrible user experience.

> Do we have any way to track how long each submitted job has actually
> taken on the GPU?  (Eg, by recording the times that we receive the
> events?)  It wouldn't be very accurate for small jobs, but given this
> operation is taking so long, it would give an indication of how long
> this operation is actually taking.  etnaviv doesn't appear to have
> any tracepoints, which would've been ideal for that.  Maybe this is
> a reason to add some? ;)

See attached patch (which I apparently forgot to send out). The DRM GPU
scheduler has some tracepoints, which might be helpful. The attached
patch adds a drm_sched_job_run tracepoint when a job is queued in the
hardware ring. Together with the existing drm_sched_process_job, this
should get you an idea how long a job takes to process. Note that at
any time up to 4 jobs are allowed in the hardware queue, so you need to
match up the end times.
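
As a rough sketch (not part of the attached patch, and assuming
"perf script" output in the format these tracepoints print), the two
events could be paired up by their fence pointer along these lines:

    #!/usr/bin/env python3
    # Sketch: pair gpu_scheduler:drm_sched_job_run (or the existing
    # drm_sched_job event, which prints the same fields) with
    # gpu_scheduler:drm_sched_process_job by fence pointer and report
    # the time from submission to completion.
    import re
    import sys

    start = {}  # fence pointer -> timestamp of the submission event

    for line in sys.stdin:
        m = re.search(r'(\d+\.\d+):\s+gpu_scheduler:(\w+):.*fence=(0x[0-9a-f]+)', line)
        if not m:
            continue
        ts, event, fence = float(m.group(1)), m.group(2), m.group(3)
        if event in ('drm_sched_job_run', 'drm_sched_job'):
            start[fence] = ts
        elif event == 'drm_sched_process_job' and fence in start:
            # With up to 4 jobs in the hardware queue, this is the time
            # from queuing to completion, not pure execution time.
            print('fence %s: %.3fs' % (fence, ts - start.pop(fence)))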

Regards,
Lucas

Comments

Russell King (Oracle) June 19, 2018, 3:56 p.m. UTC | #1
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> On Tuesday, 19.06.2018 at 12:42 +0100, Russell King - ARM Linux wrote:
> > On Tue, Jun 19, 2018 at 01:11:29PM +0200, Lucas Stach wrote:
> > > On Tuesday, 19.06.2018 at 12:00 +0100, Russell King - ARM Linux wrote:
> > > > No, it's not "a really big job" - it's just that the Dove GC600 is not
> > > > fast enough to complete _two_ 1080p sized GPU operations within 500ms.
> > > > The preceding job contained two blits - one of them a non-alphablend
> > > > copy of:
> > > > 
> > > >                 00180000 04200780  0,24,1920,1056 -> 0,24,1920,1056
> > > > 
> > > > and one an alpha blended copy of:
> > > > 
> > > >                 00000000 04380780  0,0,1920,1080 -> 0,0,1920,1080
> > > > 
> > > > This is (iirc) something I already fixed with the addition of the
> > > > progress detection back before etnaviv was merged into the mainline
> > > > kernel.
> > > 
> > > I hadn't expected it to be this slow. I see that we might need to bring
> > > back the progress detection to fix the userspace regression, but I'm
> > > not fond of this, as it might lead to really bad QoS.
> > 
> > Well, the choices are that or worse overall performance through having
> > to ignore the GPU entirely.
> > 
> > > I would prefer userspace tracking the size of the blits and flushing
> > > the cmdstream at an appropriate time, so we don't end up with really
> > > long running jobs, but I'm not sure if this would be acceptable to
> > > you...
> > 
> > The question becomes how to split up two operations.  Yes, we could
> > submit them individually, but if they're together taking in excess of
> > 500ms, then it's likely that individually, each operation will take in
> > excess of 250ms which is still a long time.
> > 
> > In any case, I think we need to fix this for 4.17-stable and then try
> > to work out (a) which operations are taking a long time, and (b) how to
> > solve this issue.
> 
> Agreed. I'll look into bringing back the progress detection for 4.17
> stable.
> 
> I'm still curious why the GC600 on the Dove is that slow. With
> performance like this moving a big(ish) window on the screen must be a
> horrible user experience.

I _think_ it's down to the blend being slow on GC600 - one of the
problems of running modern "desktops" on the Dove is that with
Xorg and a compositing window manager (eg, modern metacity),
dragging windows around is very slow because of the multiple
GPU operations required - even dragging a small window results in
almost the entire screen being re-blended.

I don't think that's fair to blame on the Dove though - that's just
total inefficiency on the Xorg/compositing side to basically redraw
the _entire_ screen for small changes.

The compositing window manager brings with it other issues as well,
in particular with the colour-keyed overlay and detecting whether
anything obscures the overlay.  For example, if, as a memory bandwidth
optimisation, you detect in the Xvideo extension that the overlay
window is unobscured and disable the primary plane and colourkeying,
this works fine with non-compositing managers.  However, with a
compositing manager, you can end up with graphics blended _on top_ of
the Xvideo window that the Xvideo backend knows nothing about... which
results in those graphics not being displayed.

The blending also has a detrimental effect on the colourkeying when
the graphics is displayed - because of the blending, the colourkey is
no longer the expected RGB value around objects, so you get the
colourkey bleeding through around (eg) a menu.

I've now disabled compositing in metacity which makes things a whole
lot nicer (I've actually been meaning to track down the "slow window
dragging" problem for a good few months now) and solves the overlay
issue too.

> > Do we have any way to track how long each submitted job has actually
> > taken on the GPU?  (Eg, by recording the times that we receive the
> > events?)  It wouldn't be very accurate for small jobs, but given this
> > operation is taking so long, it would give an indication of how long
> > this operation is actually taking.  etnaviv doesn't appear to have
> > any tracepoints, which would've been ideal for that.  Maybe this is
> > a reason to add some? ;)
> 
> See attached patch (which I apparently forgot to send out). The DRM GPU
> scheduler has some tracepoints, which might be helpful. The attached
> patch adds a drm_sched_job_run tracepoint when a job is queued in the
> hardware ring. Together with the existing drm_sched_process_job, this
> should get you an idea how long a job takes to process. Note that at
> any time up to 4 jobs are allowed in the hardware queue, so you need to
> match up the end times.

Thanks, I'll try to get some data in the next week or so.
Russell King (Oracle) June 26, 2018, 8:17 a.m. UTC | #2
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> Agreed. I'll look into bringing back the progress detection for 4.17
> stable.
> 
> I'm still curious why the GC600 on the Dove is that slow. With
> performance like this moving a big(ish) window on the screen must be a
> horrible user experience.

This doesn't seem to be the cause - it seems that there's something
going on with 4.17 that really is causing the Dove GC600 to get stuck.
Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16
state while keeping everything else the same results in no hangs,
whereas increasing the timeout with 4.17 still results in hangs.

I'll try to find some time to bisect.
Russell King (Oracle) June 26, 2018, 3:36 p.m. UTC | #3
On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> > Agreed. I'll look into bringing back the progress detection for 4.17
> > stable.
> > 
> > I'm still curious why the GC600 on the Dove is that slow. With
> > performance like this moving a big(ish) window on the screen must be a
> > horrible user experience.
> 
> This doesn't seem to be the cause - it seems that there's something
> going on with 4.17 that really is causing the Dove GC600 to get stuck.
> Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16
> state while keeping everything else the same results in no hangs,
> whereas increasing the timeout with 4.17 still results in hangs.
> 
> I'll try to find some time to bisect.

Sorry, it seems that my attempts to change what was running on the
system were ineffective (due to the etnaviv module loaded from the
initramfs, not from the fs copy I was updating.)  Extending the
timeout to 5 seconds does indeed stop the issue.

More importantly, it stops some memory corruption I've observed as
well, caused by etnaviv freeing buffers when it thinks the GPU has
timed out while the GPU is still writing to them.
Lucas Stach June 26, 2018, 3:48 p.m. UTC | #4
Hi Russell,

On Tuesday, 26.06.2018 at 16:36 +0100, Russell King - ARM Linux wrote:
> On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux
> wrote:
> > On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> > > Agreed. I'll look into bringing back the progress detection for
> > > 4.17
> > > stable.
> > > 
> > > I'm still curious why the GC600 on the Dove is that slow. With
> > > performance like this moving a big(ish) window on the screen must
> > > be a
> > > horrible user experience.
> > 
> > This doesn't seem to be the cause - it seems that there's something
> > going on with 4.17 that really is causing the Dove GC600 to get
> > stuck.
> > Reverting all the drivers/gpu/drm/etnaviv changes back to the 4.16
> > state while keeping everything else the same results in no hangs,
> > whereas increasing the timeout with 4.17 still results in hangs.
> > 
> > I'll try to find some time to bisect.
> 
> Sorry, it seems that my attempts to change what was running on the
> system were ineffective (due to the etnaviv module loaded from the
> initramfs, not from the fs copy I was updating.)  Extending the
> timeout to 5 seconds does indeed stop the issue.

Thanks for confirming that the issue is caused by the removed progress
check.

> More importantly, it stops some memory corruption I've observed as
> well, caused by etnaviv freeing buffers when it thinks the GPU has
> timed out while the GPU is still writing to them.

Urgh, that is really bad. I'll get a fix out of the door tomorrow.

Regards,
Lucas
Lucas Stach June 27, 2018, 2:38 p.m. UTC | #5
Hi Russell,

On Tuesday, 26.06.2018 at 17:48 +0200, Lucas Stach wrote:
> Hi Russell,
> 
> On Tuesday, 26.06.2018 at 16:36 +0100, Russell King - ARM Linux wrote:
> > On Tue, Jun 26, 2018 at 09:17:26AM +0100, Russell King - ARM Linux
> > wrote:
> > > On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> > > > Agreed. I'll look into bringing back the progress detection for
> > > > 4.17
> > > > stable.
> > > > 
> > > > I'm still curious why the GC600 on the Dove is that slow. With
> > > > performance like this moving a big(ish) window on the screen
> > > > must
> > > > be a
> > > > horrible user experience.
> > > 
> > > This doesn't seem to be the cause - it seems that there's
> > > something
> > > going on with 4.17 that really is causing the Dove GC600 to get
> > > stuck.
> > > Reverting all the drivers/gpu/drm/etnaviv changes back to the
> > > 4.16
> > > state while keeping everything else the same results in no hangs,
> > > whereas increasing the timeout with 4.17 still results in hangs.
> > > 
> > > I'll try to find some time to bisect.
> > 
> > Sorry, it seems that my attempts to change what was running on the
> > system were ineffective (due to the etnaviv module loaded from the
> > initramfs, not from the fs copy I was updating.)  Extending the
> > timeout to 5 seconds does indeed stop the issue.
> 
> Thanks for confirming that the issue is caused by the removed
> progress
> check.
> 
> > More importantly, it stops some memory corruption I've observed as
> > well, caused by etnaviv freeing buffers when it thinks the GPU has
> > timed out while the GPU is still writing to them.
> 
> Urgh, that is really bad. I'll get a fix out of the door tomorrow.

I've just sent a patch to do this. Can you please confirm that this
fixes your issue on GC600?

Thanks,
Lucas
Russell King (Oracle) July 11, 2018, 8:13 p.m. UTC | #6
On Tue, Jun 19, 2018 at 02:28:46PM +0200, Lucas Stach wrote:
> See attached patch (which I apparently forgot to send out). The DRM GPU
> scheduler has some tracepoints, which might be helpful. The attached
> patch adds a drm_sched_job_run tracepoint when a job is queued in the
> hardware ring. Together with the existing drm_sched_process_job, this
> should get you an idea how long a job takes to process. Note that at
> any time up to 4 jobs are allowed in the hardware queue, so you need to
> match up the end times.

Here's the trace:

            Xorg  1009 [000]   276.593018:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=144, fence=0xeb849580, ring=ef16bd00, job count:0, hw job count:0
  gnome-software  1384 [000]   276.673634: gpu_scheduler:drm_sched_process_job: fence=0xeb849580 signaled
            Xorg  1009 [000]   278.756538:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=145, fence=0xecaaef40, ring=ef16bd00, job count:0, hw job count:0
            Xorg  1009 [000]   279.130196: gpu_scheduler:drm_sched_process_job: fence=0xecaaef40 signaled
            Xorg  1009 [000]   279.856129:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=146, fence=0xecbdd7c0, ring=ef16bd00, job count:0, hw job count:0
            Xorg  1009 [000]   279.856542:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=147, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
         kswapd0    26 [000]   281.532808: gpu_scheduler:drm_sched_process_job: fence=0xecbdd7c0 signaled
  gnome-software  1384 [000]   281.560666: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
            Xorg  1009 [000]   282.554920:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=148, fence=0xecad3a00, ring=ef16bd00, job count:0, hw job count:0
            Xorg  1009 [000]   283.956216:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=149, fence=0xeb849700, ring=ef16bd00, job count:0, hw job count:1
            Xorg  1009 [000]   284.149557: gpu_scheduler:drm_sched_process_job: fence=0xecad3a00 signaled
            Xorg  1009 [000]   285.465863: gpu_scheduler:drm_sched_process_job: fence=0xeb849700 signaled
            Xorg  1009 [000]   286.796692:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=150, fence=0xecbdd340, ring=ef16bd00, job count:0, hw job count:0
            Xorg  1009 [000]   288.549507:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=151, fence=0xecbdd280, ring=ef16bd00, job count:0, hw job count:1
            Xorg  1009 [000]   288.856277:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=152, fence=0xecaaee80, ring=ef16bd00, job count:0, hw job count:2
            Xorg  1009 [000]   288.856713:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=153, fence=0xecad3040, ring=ef16bd00, job count:0, hw job count:3
  gnome-software  1384 [000]   288.958764: gpu_scheduler:drm_sched_process_job: fence=0xecbdd340 signaled
            Xorg  1009 [000]   290.052842:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=154, fence=0xecbdde80, ring=ef16bd00, job count:0, hw job count:3
            Xorg  1009 [000]   291.682222:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=155, fence=0xecaaeb80, ring=ef16bd00, job count:0, hw job count:4
  gnome-software  1384 [000]   291.760295: gpu_scheduler:drm_sched_process_job: fence=0xecbdd280 signaled
  gnome-software  1384 [000]   292.018437: gpu_scheduler:drm_sched_process_job: fence=0xecaaee80 signaled
            Xorg  1009 [000]   292.046721: gpu_scheduler:drm_sched_process_job: fence=0xecad3040 signaled
            Xorg  1009 [000]   292.819475: gpu_scheduler:drm_sched_process_job: fence=0xecbdde80 signaled
            Xorg  1009 [000]   293.149829:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=156, fence=0xecad3f40, ring=ef16bd00, job count:0, hw job count:1
  gnome-software  1384 [000]   293.190607: gpu_scheduler:drm_sched_process_job: fence=0xecaaeb80 signaled
  gnome-software  1384 [000]   293.190720: gpu_scheduler:drm_sched_process_job: fence=0xecad3f40 signaled
            Xorg  1009 [000]   293.463414:         gpu_scheduler:drm_sched_job: entity=0xee9c3e00, id=157, fence=0xecbddd00, ring=ef16bd00, job count:0, hw job count:0
            Xorg  1009 [000]   293.463629: gpu_scheduler:drm_sched_process_job: fence=0xecbddd00 signaled

So:
Job	Start		End		Time on queue:
144	276.593018	276.673634	80ms
145	278.756538	279.130196	374ms
146	279.856129	281.532808	1.7s
147	279.856542	281.560666	1.7s (overlaps 146)
148	282.554920	284.149557	1.6s
149	283.956216	285.465863	1.5s (overlaps 148)
150	286.796692	288.958764	2.2s
151	288.549507	291.760295	3.2s (overlaps 150)
152	288.856277	292.018437	3.2s (overlaps 150-151)
153	288.856713	292.046721	3.2s (overlaps 150-152)
154	290.052842	292.819475	2.8s (overlaps 151-153)
155	291.682222	293.190607	1.5s (overlaps 151-154)
156	293.149829	293.190720	40ms (overlaps 155)
157	293.463414	293.463629	215us

The times for jobs 147, 149 and 151-155 are misleading because they are
delayed by the preceding job.  Correcting for that:

					Time executing:
147					28ms
149					1.3s
151					2.8s
152					250ms
153					28ms
154					780ms
155					370ms
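
As a worked example of that correction: job 151 was queued at
288.549507, but the GPU could only start it once job 150 completed at
288.958764, so its approximate execution time is 291.760295 - 288.958764,
about 2.8s, rather than the 3.2s it spent on the queue.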

So, sorted by time apparently executing:
151					2.8s
150	286.796692	288.958764	2.2s
146	279.856129	281.532808	1.7s
148	282.554920	284.149557	1.6s
149					1.3s
154					780ms
145	278.756538	279.130196	374ms
155					370ms
152					250ms
144	276.593018	276.673634	80ms
156	293.149829	293.190720	40ms
153					28ms
147					28ms
157	293.463414	293.463629	215us

Of course, from the perf dump, we don't know what the GPU was being
asked to do.

I've applied your forward-progress patch, and with the 5s timeout
reverted, it seems to "behave" without causing a timeout.

Patch

From a9ec48d1eecddcc95018ad37ebdf154ffa7ce9a4 Mon Sep 17 00:00:00 2001
From: Lucas Stach <l.stach@pengutronix.de>
Date: Fri, 8 Dec 2017 18:35:43 +0100
Subject: [PATCH] drm/sched: add tracepoint for job run

When tracing GPU execution it is very interesting to know when the job gets
dequeued from the software queue and added to the hardware ring. Add a
tracepoint to allow easy access to this information.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
---
 drivers/gpu/drm/scheduler/gpu_scheduler.c |  1 +
 include/drm/gpu_scheduler_trace.h         | 27 +++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/gpu_scheduler.c b/drivers/gpu/drm/scheduler/gpu_scheduler.c
index 0d95888ccc3e..ceecaef67801 100644
--- a/drivers/gpu/drm/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/scheduler/gpu_scheduler.c
@@ -666,6 +666,7 @@  static int drm_sched_main(void *param)
 		drm_sched_job_begin(sched_job);
 
 		fence = sched->ops->run_job(sched_job);
+		trace_drm_sched_job_run(sched_job, entity);
 		drm_sched_fence_scheduled(s_fence);
 
 		if (fence) {
diff --git a/include/drm/gpu_scheduler_trace.h b/include/drm/gpu_scheduler_trace.h
index 0789e8d0a0e1..c4d83857ae00 100644
--- a/include/drm/gpu_scheduler_trace.h
+++ b/include/drm/gpu_scheduler_trace.h
@@ -61,6 +61,33 @@  TRACE_EVENT(drm_sched_job,
 		      __entry->job_count, __entry->hw_job_count)
 );
 
+TRACE_EVENT(drm_sched_job_run,
+	    TP_PROTO(struct drm_sched_job *sched_job, struct drm_sched_entity *entity),
+	    TP_ARGS(sched_job, entity),
+	    TP_STRUCT__entry(
+			     __field(struct drm_sched_entity *, entity)
+			     __field(struct dma_fence *, fence)
+			     __field(const char *, name)
+			     __field(uint64_t, id)
+			     __field(u32, job_count)
+			     __field(int, hw_job_count)
+			     ),
+
+	    TP_fast_assign(
+			   __entry->entity = entity;
+			   __entry->id = sched_job->id;
+			   __entry->fence = &sched_job->s_fence->finished;
+			   __entry->name = sched_job->sched->name;
+			   __entry->job_count = spsc_queue_count(&entity->job_queue);
+			   __entry->hw_job_count = atomic_read(
+				   &sched_job->sched->hw_rq_count);
+			   ),
+	    TP_printk("entity=%p, id=%llu, fence=%p, ring=%s, job count:%u, hw job count:%d",
+		      __entry->entity, __entry->id,
+		      __entry->fence, __entry->name,
+		      __entry->job_count, __entry->hw_job_count)
+);
+
 TRACE_EVENT(drm_sched_process_job,
 	    TP_PROTO(struct drm_sched_fence *fence),
 	    TP_ARGS(fence),
-- 
2.17.1