Message ID | 20180703170515.6298-1-eric@anholt.net (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Jul 3, 2018 at 1:05 PM, Eric Anholt <eric@anholt.net> wrote: > GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 > seconds at maximum resolution, but we still want to reset quickly if a > job is really hung. Sample the CL's current address and the return > address (since we call into tile lists repeatedly) and if either has > changed then assume we've made progress. > > Signed-off-by: Eric Anholt <eric@anholt.net> > Cc: Lucas Stach <l.stach@pengutronix.de> Series is: Reviewed-by: Alex Deucher <alexander.deucher@amd.com> > --- > drivers/gpu/drm/v3d/v3d_drv.h | 2 ++ > drivers/gpu/drm/v3d/v3d_regs.h | 1 + > drivers/gpu/drm/v3d/v3d_sched.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h > index f546e0ab9562..a5d96d823416 100644 > --- a/drivers/gpu/drm/v3d/v3d_drv.h > +++ b/drivers/gpu/drm/v3d/v3d_drv.h > @@ -189,6 +189,8 @@ struct v3d_job { > > /* GPU virtual addresses of the start/end of the CL job. */ > u32 start, end; > + > + u32 timedout_ctca, timedout_ctra; > }; > > struct v3d_exec_info { > diff --git a/drivers/gpu/drm/v3d/v3d_regs.h b/drivers/gpu/drm/v3d/v3d_regs.h > index fc13282dfc2f..854046565989 100644 > --- a/drivers/gpu/drm/v3d/v3d_regs.h > +++ b/drivers/gpu/drm/v3d/v3d_regs.h > @@ -222,6 +222,7 @@ > #define V3D_CLE_CTNCA(n) (V3D_CLE_CT0CA + 4 * n) > #define V3D_CLE_CT0RA 0x00118 > #define V3D_CLE_CT1RA 0x0011c > +#define V3D_CLE_CTNRA(n) (V3D_CLE_CT0RA + 4 * n) > #define V3D_CLE_CT0LC 0x00120 > #define V3D_CLE_CT1LC 0x00124 > #define V3D_CLE_CT0PC 0x00128 > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c > index 808bc901f567..00667c733dca 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -153,7 +153,25 @@ v3d_job_timedout(struct drm_sched_job *sched_job) > struct v3d_job *job = to_v3d_job(sched_job); > struct v3d_exec_info *exec = job->exec; > struct v3d_dev *v3d = exec->v3d; > + enum v3d_queue job_q = job == &exec->bin ? V3D_BIN : V3D_RENDER; > enum v3d_queue q; > + u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(job_q)); > + u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(job_q)); > + > + /* If the current address or return address have changed, then > + * the GPU has probably made progress and we should delay the > + * reset. This could fail if the GPU got in an infinite loop > + * in the CL, but that is pretty unlikely outside of an i-g-t > + * testcase. > + */ > + if (job->timedout_ctca != ctca || job->timedout_ctra != ctra) { > + job->timedout_ctca = ctca; > + job->timedout_ctra = ctra; > + > + schedule_delayed_work(&job->base.work_tdr, > + job->base.sched->timeout); > + return; > + } > > mutex_lock(&v3d->reset_lock); > > -- > 2.18.0 > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Jul 03, 2018 at 10:05:12AM -0700, Eric Anholt wrote: > GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 > seconds at maximum resolution, but we still want to reset quickly if a > job is really hung. Sample the CL's current address and the return > address (since we call into tile lists repeatedly) and if either has > changed then assume we've made progress. > > Signed-off-by: Eric Anholt <eric@anholt.net> > Cc: Lucas Stach <l.stach@pengutronix.de> > --- > drivers/gpu/drm/v3d/v3d_drv.h | 2 ++ > drivers/gpu/drm/v3d/v3d_regs.h | 1 + > drivers/gpu/drm/v3d/v3d_sched.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h > index f546e0ab9562..a5d96d823416 100644 > --- a/drivers/gpu/drm/v3d/v3d_drv.h > +++ b/drivers/gpu/drm/v3d/v3d_drv.h > @@ -189,6 +189,8 @@ struct v3d_job { > > /* GPU virtual addresses of the start/end of the CL job. */ > u32 start, end; > + > + u32 timedout_ctca, timedout_ctra; > }; > > struct v3d_exec_info { > diff --git a/drivers/gpu/drm/v3d/v3d_regs.h b/drivers/gpu/drm/v3d/v3d_regs.h > index fc13282dfc2f..854046565989 100644 > --- a/drivers/gpu/drm/v3d/v3d_regs.h > +++ b/drivers/gpu/drm/v3d/v3d_regs.h > @@ -222,6 +222,7 @@ > #define V3D_CLE_CTNCA(n) (V3D_CLE_CT0CA + 4 * n) > #define V3D_CLE_CT0RA 0x00118 > #define V3D_CLE_CT1RA 0x0011c > +#define V3D_CLE_CTNRA(n) (V3D_CLE_CT0RA + 4 * n) > #define V3D_CLE_CT0LC 0x00120 > #define V3D_CLE_CT1LC 0x00124 > #define V3D_CLE_CT0PC 0x00128 > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c > index 808bc901f567..00667c733dca 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -153,7 +153,25 @@ v3d_job_timedout(struct drm_sched_job *sched_job) > struct v3d_job *job = to_v3d_job(sched_job); > struct v3d_exec_info *exec = job->exec; > struct v3d_dev *v3d = exec->v3d; > + enum v3d_queue job_q = job == &exec->bin ? V3D_BIN : V3D_RENDER; > enum v3d_queue q; > + u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(job_q)); > + u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(job_q)); > + > + /* If the current address or return address have changed, then > + * the GPU has probably made progress and we should delay the > + * reset. This could fail if the GPU got in an infinite loop > + * in the CL, but that is pretty unlikely outside of an i-g-t > + * testcase. > + */ In i915 we have a credit system that allows only a limited amount of postponing the timeout. That would block the fairly easy DOS. Otoh I have no idea how robust v3d is against DOS attacks at the ioctl level and whether that's worth it. From your description I'm assuming that a loop in the shaders won't trick this, so ARB_robustness/WebGL should still be happy. Anyway, Ack on the entire pile. -Daniel > + if (job->timedout_ctca != ctca || job->timedout_ctra != ctra) { > + job->timedout_ctca = ctca; > + job->timedout_ctra = ctra; > + > + schedule_delayed_work(&job->base.work_tdr, > + job->base.sched->timeout); > + return; > + } > > mutex_lock(&v3d->reset_lock); > > -- > 2.18.0 > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
Am Dienstag, den 03.07.2018, 10:05 -0700 schrieb Eric Anholt: > GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 > seconds at maximum resolution, but we still want to reset quickly if a > job is really hung. Sample the CL's current address and the return > address (since we call into tile lists repeatedly) and if either has > changed then assume we've made progress. So this means you are doubling your timeout? AFAICS for the first time you hit the timeout handler the cached ctca and ctra values will probably always differ from the current values. Maybe this warrants a mention in the commit message, as it's changing the behavior of the scheduler timeout. Also how easy is it for userspace to construct such an infinite loop in the CL? Thinking about a rogue client DoSing the GPU while exploiting this check in the timeout handler to stay under the radar... Regards, Lucas > Signed-off-by: Eric Anholt <eric@anholt.net> > > Cc: Lucas Stach <l.stach@pengutronix.de> > --- > drivers/gpu/drm/v3d/v3d_drv.h | 2 ++ > drivers/gpu/drm/v3d/v3d_regs.h | 1 + > drivers/gpu/drm/v3d/v3d_sched.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h > index f546e0ab9562..a5d96d823416 100644 > --- a/drivers/gpu/drm/v3d/v3d_drv.h > +++ b/drivers/gpu/drm/v3d/v3d_drv.h > @@ -189,6 +189,8 @@ struct v3d_job { > > > /* GPU virtual addresses of the start/end of the CL job. */ > > u32 start, end; > + > > + u32 timedout_ctca, timedout_ctra; > }; > > struct v3d_exec_info { > diff --git a/drivers/gpu/drm/v3d/v3d_regs.h b/drivers/gpu/drm/v3d/v3d_regs.h > index fc13282dfc2f..854046565989 100644 > --- a/drivers/gpu/drm/v3d/v3d_regs.h > +++ b/drivers/gpu/drm/v3d/v3d_regs.h > @@ -222,6 +222,7 @@ > #define V3D_CLE_CTNCA(n) (V3D_CLE_CT0CA + 4 * n) > #define V3D_CLE_CT0RA 0x00118 > #define V3D_CLE_CT1RA 0x0011c > +#define V3D_CLE_CTNRA(n) (V3D_CLE_CT0RA + 4 * n) > #define V3D_CLE_CT0LC 0x00120 > #define V3D_CLE_CT1LC 0x00124 > #define V3D_CLE_CT0PC 0x00128 > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c > index 808bc901f567..00667c733dca 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -153,7 +153,25 @@ v3d_job_timedout(struct drm_sched_job *sched_job) > > struct v3d_job *job = to_v3d_job(sched_job); > > struct v3d_exec_info *exec = job->exec; > > struct v3d_dev *v3d = exec->v3d; > > + enum v3d_queue job_q = job == &exec->bin ? V3D_BIN : V3D_RENDER; > > enum v3d_queue q; > > + u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(job_q)); > > + u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(job_q)); > + > > + /* If the current address or return address have changed, then > > + * the GPU has probably made progress and we should delay the > > + * reset. This could fail if the GPU got in an infinite loop > > + * in the CL, but that is pretty unlikely outside of an i-g-t > > + * testcase. > > + */ > > + if (job->timedout_ctca != ctca || job->timedout_ctra != ctra) { > > + job->timedout_ctca = ctca; > > + job->timedout_ctra = ctra; > + > > + schedule_delayed_work(&job->base.work_tdr, > > + job->base.sched->timeout); > > + return; > > + } > > > mutex_lock(&v3d->reset_lock); >
Lucas Stach <l.stach@pengutronix.de> writes: > Am Dienstag, den 03.07.2018, 10:05 -0700 schrieb Eric Anholt: >> GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 >> seconds at maximum resolution, but we still want to reset quickly if a >> job is really hung. Sample the CL's current address and the return >> address (since we call into tile lists repeatedly) and if either has >> changed then assume we've made progress. > > So this means you are doubling your timeout? AFAICS for the first time > you hit the timeout handler the cached ctca and ctra values will > probably always differ from the current values. Maybe this warrants a > mention in the commit message, as it's changing the behavior of the > scheduler timeout. I supposes that doubles the minimum timeout, but I don't think there's any principled choice behind that value. > Also how easy is it for userspace to construct such an infinite loop in > the CL? Thinking about a rogue client DoSing the GPU while exploiting > this check in the timeout handler to stay under the radar... You'd need to have a big enough CL that you don't sample the same location twice in a row, but otherwise it's trivial and equivalent to a V3D33 igt case I wrote. I don't think we as the kernel particularly cares to protect from that case, though -- it's mainly "does a broken WebGL shader take down your desktop?" which we will still be protecting from. If you wanted to protect from a general userspace attacker, you could have a maximum 1 minute timeout or something, but I'm not sure your life is actually much better when you let an arbitrary number of clients submit many jobs to round-robin through each of which has a long timeout like that.
Am Dienstag, den 03.07.2018, 10:05 -0700 schrieb Eric Anholt: > GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 > seconds at maximum resolution, but we still want to reset quickly if a > job is really hung. Sample the CL's current address and the return > address (since we call into tile lists repeatedly) and if either has > changed then assume we've made progress. > > > Signed-off-by: Eric Anholt <eric@anholt.net> > Cc: Lucas Stach <l.stach@pengutronix.de> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> > --- > drivers/gpu/drm/v3d/v3d_drv.h | 2 ++ > drivers/gpu/drm/v3d/v3d_regs.h | 1 + > drivers/gpu/drm/v3d/v3d_sched.c | 18 ++++++++++++++++++ > 3 files changed, 21 insertions(+) > > diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h > index f546e0ab9562..a5d96d823416 100644 > --- a/drivers/gpu/drm/v3d/v3d_drv.h > +++ b/drivers/gpu/drm/v3d/v3d_drv.h > @@ -189,6 +189,8 @@ struct v3d_job { > > > /* GPU virtual addresses of the start/end of the CL job. */ > > u32 start, end; > + > > + u32 timedout_ctca, timedout_ctra; > }; > > struct v3d_exec_info { > diff --git a/drivers/gpu/drm/v3d/v3d_regs.h b/drivers/gpu/drm/v3d/v3d_regs.h > index fc13282dfc2f..854046565989 100644 > --- a/drivers/gpu/drm/v3d/v3d_regs.h > +++ b/drivers/gpu/drm/v3d/v3d_regs.h > @@ -222,6 +222,7 @@ > #define V3D_CLE_CTNCA(n) (V3D_CLE_CT0CA + 4 * n) > #define V3D_CLE_CT0RA 0x00118 > #define V3D_CLE_CT1RA 0x0011c > +#define V3D_CLE_CTNRA(n) (V3D_CLE_CT0RA + 4 * n) > #define V3D_CLE_CT0LC 0x00120 > #define V3D_CLE_CT1LC 0x00124 > #define V3D_CLE_CT0PC 0x00128 > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c > index 808bc901f567..00667c733dca 100644 > --- a/drivers/gpu/drm/v3d/v3d_sched.c > +++ b/drivers/gpu/drm/v3d/v3d_sched.c > @@ -153,7 +153,25 @@ v3d_job_timedout(struct drm_sched_job *sched_job) > > struct v3d_job *job = to_v3d_job(sched_job); > > struct v3d_exec_info *exec = job->exec; > > struct v3d_dev *v3d = exec->v3d; > > + enum v3d_queue job_q = job == &exec->bin ? V3D_BIN : V3D_RENDER; > > enum v3d_queue q; > > + u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(job_q)); > > + u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(job_q)); > + > > + /* If the current address or return address have changed, then > > + * the GPU has probably made progress and we should delay the > > + * reset. This could fail if the GPU got in an infinite loop > > + * in the CL, but that is pretty unlikely outside of an i-g-t > > + * testcase. > > + */ > > + if (job->timedout_ctca != ctca || job->timedout_ctra != ctra) { > > + job->timedout_ctca = ctca; > > + job->timedout_ctra = ctra; > + > > + schedule_delayed_work(&job->base.work_tdr, > > + job->base.sched->timeout); > > + return; > > + } > > > mutex_lock(&v3d->reset_lock); >
diff --git a/drivers/gpu/drm/v3d/v3d_drv.h b/drivers/gpu/drm/v3d/v3d_drv.h index f546e0ab9562..a5d96d823416 100644 --- a/drivers/gpu/drm/v3d/v3d_drv.h +++ b/drivers/gpu/drm/v3d/v3d_drv.h @@ -189,6 +189,8 @@ struct v3d_job { /* GPU virtual addresses of the start/end of the CL job. */ u32 start, end; + + u32 timedout_ctca, timedout_ctra; }; struct v3d_exec_info { diff --git a/drivers/gpu/drm/v3d/v3d_regs.h b/drivers/gpu/drm/v3d/v3d_regs.h index fc13282dfc2f..854046565989 100644 --- a/drivers/gpu/drm/v3d/v3d_regs.h +++ b/drivers/gpu/drm/v3d/v3d_regs.h @@ -222,6 +222,7 @@ #define V3D_CLE_CTNCA(n) (V3D_CLE_CT0CA + 4 * n) #define V3D_CLE_CT0RA 0x00118 #define V3D_CLE_CT1RA 0x0011c +#define V3D_CLE_CTNRA(n) (V3D_CLE_CT0RA + 4 * n) #define V3D_CLE_CT0LC 0x00120 #define V3D_CLE_CT1LC 0x00124 #define V3D_CLE_CT0PC 0x00128 diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c index 808bc901f567..00667c733dca 100644 --- a/drivers/gpu/drm/v3d/v3d_sched.c +++ b/drivers/gpu/drm/v3d/v3d_sched.c @@ -153,7 +153,25 @@ v3d_job_timedout(struct drm_sched_job *sched_job) struct v3d_job *job = to_v3d_job(sched_job); struct v3d_exec_info *exec = job->exec; struct v3d_dev *v3d = exec->v3d; + enum v3d_queue job_q = job == &exec->bin ? V3D_BIN : V3D_RENDER; enum v3d_queue q; + u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(job_q)); + u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(job_q)); + + /* If the current address or return address have changed, then + * the GPU has probably made progress and we should delay the + * reset. This could fail if the GPU got in an infinite loop + * in the CL, but that is pretty unlikely outside of an i-g-t + * testcase. + */ + if (job->timedout_ctca != ctca || job->timedout_ctra != ctra) { + job->timedout_ctca = ctca; + job->timedout_ctra = ctra; + + schedule_delayed_work(&job->base.work_tdr, + job->base.sched->timeout); + return; + } mutex_lock(&v3d->reset_lock);
GTF-GLES2.gtf.GL.acos.acos_float_vert_xvary submits jobs that take 4 seconds at maximum resolution, but we still want to reset quickly if a job is really hung. Sample the CL's current address and the return address (since we call into tile lists repeatedly) and if either has changed then assume we've made progress. Signed-off-by: Eric Anholt <eric@anholt.net> Cc: Lucas Stach <l.stach@pengutronix.de> --- drivers/gpu/drm/v3d/v3d_drv.h | 2 ++ drivers/gpu/drm/v3d/v3d_regs.h | 1 + drivers/gpu/drm/v3d/v3d_sched.c | 18 ++++++++++++++++++ 3 files changed, 21 insertions(+)