Message ID | 20230711213501.526237-3-andrealmeid@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/amdgpu: Add new reset option and rework coredump |
On 11.07.23 at 23:34, André Almeida wrote:
> If a DRM fence is set to -ENODATA, that means that this context was a
> cause of a soft reset, but is never marked as guilty. Flag it as guilty
> and log to user that this context won't accept more submissions.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 0dc9c655c4fb..fe8e47d063da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
>  	ctx_entity = &ctx->entities[hw_ip][ring]->entity;
>  	r = drm_sched_entity_error(ctx_entity);
>  	if (r) {
> +		if (r == -ENODATA) {
> +			DRM_ERROR("%s (%d) context caused a reset,"
> +				  "marking it guilty and refusing new submissions.\n",
> +				  current->comm, current->pid);
> +			atomic_set(&ctx->guilty, 1);
> +		}

I'm going back and forth on that as well. Michel has a very good point that it is often sufficient to cancel just one rogue shader to keep going.

But Marek has a very good point as well that when that happens multiple times we probably want to block the application from making further submissions.

Christian.

>  		DRM_DEBUG("error entity %p\n", ctx_entity);
>  		return r;
>  	}
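The middle ground between the two positions in the reply (tolerate a single cancelled shader, block an application that keeps causing resets) could be modeled roughly as below. This is only a userspace sketch and not code from the patch or the thread: `struct ctx_model`, `ctx_model_init()`, `ctx_model_check()`, the `SOFT_RESET_LIMIT` value, and the use of `-ECANCELED` as the rejection code are all hypothetical names and assumptions chosen for illustration. C11 atomics stand in for the kernel's `atomic_t` helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical limit: how many soft resets a context may cause before
 * it is blocked. Not a value from the patch or the thread. */
#define SOFT_RESET_LIMIT 3

/* Userspace stand-in for the per-context state involved (assumption). */
struct ctx_model {
	atomic_int soft_resets;	/* soft resets attributed to this context */
	atomic_int guilty;	/* nonzero once submissions are refused */
};

static void ctx_model_init(struct ctx_model *ctx)
{
	assert(ctx != NULL);
	atomic_init(&ctx->soft_resets, 0);
	atomic_init(&ctx->guilty, 0);
}

/*
 * entity_error models the value the patch reads from
 * drm_sched_entity_error(): 0, -ENODATA after a soft reset, or some
 * other negative errno. Returns 0 if the submission may proceed.
 */
static int ctx_model_check(struct ctx_model *ctx, int entity_error)
{
	assert(ctx != NULL);

	if (atomic_load(&ctx->guilty))
		return -ECANCELED;	/* assumed rejection code */

	if (entity_error != -ENODATA)
		return entity_error;	/* other results pass through unchanged */

	/* atomic_fetch_add returns the old value, so add 1 for the new count. */
	if (atomic_fetch_add(&ctx->soft_resets, 1) + 1 < SOFT_RESET_LIMIT)
		return 0;		/* tolerate an occasional rogue shader */

	atomic_store(&ctx->guilty, 1);
	return -ECANCELED;
}
```

With the limit set to 3 in this model, the first two `-ENODATA` results are tolerated, the third marks the context guilty, and every later call is refused regardless of the entity error.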
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 0dc9c655c4fb..fe8e47d063da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
 	ctx_entity = &ctx->entities[hw_ip][ring]->entity;
 	r = drm_sched_entity_error(ctx_entity);
 	if (r) {
+		if (r == -ENODATA) {
+			DRM_ERROR("%s (%d) context caused a reset,"
+				  "marking it guilty and refusing new submissions.\n",
+				  current->comm, current->pid);
+			atomic_set(&ctx->guilty, 1);
+		}
 		DRM_DEBUG("error entity %p\n", ctx_entity);
 		return r;
 	}
If a DRM fence is set to -ENODATA, that means that this context was a
cause of a soft reset, but is never marked as guilty. Flag it as guilty
and log to user that this context won't accept more submissions.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
 1 file changed, 6 insertions(+)