[5/6] drm/i915: Implement GPU reset

Message ID	1252853462-9236-6-git-send-email-bgamari.foss@gmail.com (mailing list archive)
State	Superseded
Headers	Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) by demeter.kernel.org (8.14.2/8.14.2) with ESMTP id n8DExCia001354 for <patchwork-intel-gfx@patchwork.kernel.org>; Sun, 13 Sep 2009 14:59:12 GMT
From: Ben Gamari <bgamari.foss@gmail.com>
To: Jesse Barnes <jbarnes@virtuousgeek.org>, Chris Wilson <chris@chris-wilson.co.uk>, Eric Anholt <eric@anholt.net>
Date: Sun, 13 Sep 2009 10:51:01 -0400
Message-Id: <1252853462-9236-6-git-send-email-bgamari.foss@gmail.com>
In-Reply-To: <1252853462-9236-5-git-send-email-bgamari.foss@gmail.com>
References: <1252853462-9236-1-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-2-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-3-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-4-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-5-git-send-email-bgamari.foss@gmail.com>
Cc: intel-gfx@lists.freedesktop.org
Subject: [Intel-gfx] [PATCH 5/6] drm/i915: Implement GPU reset
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: intel-gfx-bounces@lists.freedesktop.org
Errors-To: intel-gfx-bounces@lists.freedesktop.org

Ben Gamari Sept. 13, 2009, 2:51 p.m. UTC

This patch puts in place the machinery to attempt to reset the GPU. This
will be used when attempting to recover from a GPU hang.

Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Ben Gamari <bgamari.foss@gmail.com>
---
 drivers/gpu/drm/i915/i915_dma.c |    8 ++
 drivers/gpu/drm/i915/i915_drv.c |  141 +++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h |    1 +
 drivers/gpu/drm/i915/i915_irq.c |    8 ++
 drivers/gpu/drm/i915/i915_reg.h |    4 +
 5 files changed, 162 insertions(+), 0 deletions(-)

Daniel J Blueman Sept. 13, 2009, 5:27 p.m. UTC | #1

A couple of things just caught my eye while looking through the patch,
perhaps to consider tweaking?

On Sun, Sep 13, 2009 at 3:51 PM, Ben Gamari <bgamari.foss@gmail.com> wrote:
> This patch puts in place the machinery to attempt to reset the GPU. This
> will be used when attempting to recover from a GPU hang.
>
> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
> Signed-off-by: Ben Gamari <bgamari.foss@gmail.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c |    8 ++
>  drivers/gpu/drm/i915/i915_drv.c |  141 +++++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.h |    1 +
>  drivers/gpu/drm/i915/i915_irq.c |    8 ++
>  drivers/gpu/drm/i915/i915_reg.h |    4 +
>  5 files changed, 162 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index aabe41b..25f8e75 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1173,6 +1173,9 @@ static int i915_load_modeset_init(struct drm_device *dev,
>         drm_mm_init(&dev_priv->vram, 0, prealloc_size);
>         DRM_INFO("set up %ldM of stolen space\n", prealloc_size / (1024*1024));
>
> +       /* We're off and running w/KMS */
> +       dev_priv->mm.suspended = 0;
> +
>         /* Let GEM Manage from end of prealloc space to end of aperture.
>          *
>          * However, leave one page at the end still bound to the scratch page.
> @@ -1184,7 +1187,9 @@ static int i915_load_modeset_init(struct drm_device *dev,
>          */
>         i915_gem_do_init(dev, prealloc_size, agp_size - 4096);
>
> +       mutex_lock(&dev->struct_mutex);
>         ret = i915_gem_init_ringbuffer(dev);
> +       mutex_unlock(&dev->struct_mutex);
>         if (ret)
>                 goto out;
>
> @@ -1433,6 +1438,9 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
>                 return ret;
>         }
>
> +       /* Start out suspended */
> +       dev_priv->mm.suspended = 1;
> +
>         if (drm_core_check_feature(dev, DRIVER_MODESET)) {
>                 ret = i915_load_modeset_init(dev, prealloc_start,
>                                              prealloc_size, agp_size);
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index dbe568c..0c03a2a 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -127,6 +127,147 @@ static int i915_resume(struct drm_device *dev)
>         return ret;
>  }
>
> +/**
> + * i965_reset - reset chip after a hang
> + * @dev: drm device to reset
> + * @flags: reset domains
> + *
> + * Reset the chip.  Useful if a hang is detected.
> + *
> + * Procedure is fairly simple:
> + *   - reset the chip using the reset reg
> + *   - re-init context state
> + *   - re-init hardware status page
> + *   - re-init ring buffer
> + *   - re-init interrupt state
> + *   - re-init display
> + */
> +void i965_reset(struct drm_device *dev, u8 flags)
> +{
> +       drm_i915_private_t *dev_priv = dev->dev_private;
> +       unsigned long timeout;
> +       u8 gdrst;
> +       bool need_display = true; //!(flags & (GDRST_RENDER | GDRST_MEDIA));

As the kernel's coding style isn't C99, is it worth using /* foo */
here to be compliant?

> +
> +#if defined(CONFIG_SMP)
> +       timeout = jiffies + msecs_to_jiffies(500);
> +       do {
> +               udelay(100);
> +       } while (mutex_is_locked(&dev->struct_mutex) && time_after(timeout, jiffies));
> +
> +       if (mutex_is_locked(&dev->struct_mutex)) {
> +#if 1
> +               DRM_ERROR("i915 struct_mutex lock is still held by %s. Giving on up reset.\n", dev->struct_mutex.owner->task->comm);
> +               return;
> +#else
> +               struct task_struct *task = dev->struct_mutex.owner->task;
> +               DRM_ERROR("Killing process %d (%s) for holding i915 device mutex\n",
> +                               task->pid, task->comm);
> +               force_sig(SIGILL, task);

Should be SIGKILL here?

> +#endif
> +       }
> +#else
> +       BUG_ON(mutex_is_locked(&dev->struct_mutex));
> +#endif
> +
> +       debug_show_all_locks();
> +       mutex_lock(&dev->struct_mutex);
> +
> +       /*
> +        * Clear request list
> +        */
> +       i915_gem_retire_requests(dev);
> +
> +       if (need_display)
> +               i915_save_display(dev);
> +
> +       if (IS_I965G(dev) || IS_G4X(dev)) {
> +               /*
> +                * Set the domains we want to reset, then the reset bit (bit 0).
> +                * Clear the reset bit after a while and wait for hardware status
> +                * bit (bit 1) to be set
> +                */
> +               pci_read_config_byte(dev->pdev, GDRST, &gdrst);
> +               //TODO: Set domains

Consider using kernel coding style perhaps?

> +               pci_write_config_byte(dev->pdev, GDRST, gdrst | flags | ((flags == GDRST_FULL) ? 0x1 : 0x0));
> +               udelay(50);
> +               pci_write_config_byte(dev->pdev, GDRST, gdrst & 0xfe);
> +
> +               /* ...we don't want to loop forever though, 500ms should be plenty */
> +               timeout = jiffies + msecs_to_jiffies(500);
> +               do {
> +                       udelay(100);
> +                       pci_read_config_byte(dev->pdev, GDRST, &gdrst);
> +               } while ((gdrst & 0x1) && time_after(timeout, jiffies));
> +
> +               if (gdrst & 0x1) {
> +                       DRM_ERROR("Failed to reset chip. You're screwed.");

Perhaps drop the second sentence, to keep things looking neat.

> +                       mutex_unlock(&dev->struct_mutex);
> +                       return;
> +               }
> +       } else {
> +               DRM_ERROR("Error occurred. Don't know how to reset this chip.\n");
> +               return;
> +       }
> +
> +       /* Ok, now get things going again... */
> +
> +       /*
> +        * Everything depends on having the GTT running, so we need to start
> +        * there.  Fortunately we don't need to do this unless we reset the
> +        * chip at a PCI level.
> +        *
> +        * Next we need to restore the context, but we don't use those
> +        * yet either...
> +        *
> +        * Ring buffer needs to be re-initialized in the KMS case, or if X
> +        * was running at the time of the reset (i.e. we weren't VT
> +        * switched away).
> +        */
> +       if (drm_core_check_feature(dev, DRIVER_MODESET) ||
> +           !dev_priv->mm.suspended) {
> +               drm_i915_ring_buffer_t *ring = &dev_priv->ring;
> +               struct drm_gem_object *obj = ring->ring_obj;
> +               struct drm_i915_gem_object *obj_priv = obj->driver_private;
> +               dev_priv->mm.suspended = 0;
> +
> +               /* Stop the ring if it's running. */
> +               I915_WRITE(PRB0_CTL, 0);
> +               I915_WRITE(PRB0_TAIL, 0);
> +               I915_WRITE(PRB0_HEAD, 0);
> +
> +               /* Initialize the ring. */
> +               I915_WRITE(PRB0_START, obj_priv->gtt_offset);
> +               I915_WRITE(PRB0_CTL,
> +                          ((obj->size - 4096) & RING_NR_PAGES) |
> +                          RING_NO_REPORT |
> +                          RING_VALID);
> +               if (!drm_core_check_feature(dev, DRIVER_MODESET))
> +                       i915_kernel_lost_context(dev);
> +               else {
> +                       ring->head = I915_READ(PRB0_HEAD) & HEAD_ADDR;
> +                       ring->tail = I915_READ(PRB0_TAIL) & TAIL_ADDR;
> +                       ring->space = ring->head - (ring->tail + 8);
> +                       if (ring->space < 0)
> +                               ring->space += ring->Size;
> +               }
> +
> +               mutex_unlock(&dev->struct_mutex);
> +               drm_irq_uninstall(dev);
> +               drm_irq_install(dev);
> +               mutex_lock(&dev->struct_mutex);
> +       }
> +
> +       /*
> +        * Display needs restore too...
> +        */
> +       if (need_display)
> +               i915_restore_display(dev);
> +
> +       mutex_unlock(&dev->struct_mutex);
> +}
> +
> +
>  static int __devinit
>  i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  {
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index afbcaa9..8797777 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -624,6 +624,7 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd,
>  extern int i915_emit_box(struct drm_device *dev,
>                           struct drm_clip_rect *boxes,
>                           int i, int DR1, int DR4);
> +extern void i965_reset(struct drm_device *dev, u8 flags);
>
>  /* i915_irq.c */
>  void i915_hangcheck_elapsed(unsigned long data);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 126696a..dbfcf0a 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -482,6 +482,14 @@ static void i915_handle_error(struct drm_device *dev)
>                 I915_WRITE(IIR, I915_RENDER_COMMAND_PARSER_ERROR_INTERRUPT);
>         }
>
> +       if (dev_priv->mm.wedged) {
> +               /*
> +                * Wakeup waiting processes so they don't hang
> +                */
> +               printk("i915: Waking up sleeping processes\n");
> +               DRM_WAKEUP(&dev_priv->irq_queue);
> +       }
> +
>         queue_work(dev_priv->wq, &dev_priv->error_work);
>  }
>
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index f7c2de8..99981a0 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -85,6 +85,10 @@
>  #define   I915_GC_RENDER_CLOCK_200_MHZ (1 << 0)
>  #define   I915_GC_RENDER_CLOCK_333_MHZ (4 << 0)
>  #define LBB    0xf4
> +#define GDRST 0xc0
> +#define  GDRST_FULL    (0<<2)
> +#define  GDRST_RENDER  (1<<2)
> +#define  GDRST_MEDIA   (3<<2)
>
>  /* VGA stuff */
>

Jesse Barnes Sept. 13, 2009, 6:21 p.m. UTC | #2

On Sun, 13 Sep 2009 18:27:48 +0100
Daniel J Blueman <daniel.blueman@gmail.com> wrote:

> A couple of things just caught my eye while looking through the patch,
> perhaps to consider tweaking?

Thanks for looking.

> > +void i965_reset(struct drm_device *dev, u8 flags)
> > +{
> > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > +       unsigned long timeout;
> > +       u8 gdrst;
> > +       bool need_display = true; //!(flags & (GDRST_RENDER | GDRST_MEDIA));
> 
> As the kernel's coding style isn't C99, is it worth using /* foo */
> here to be compliant?

Or just drop it and add a comment about the other two.  We'd need to
experiment some more before trying to use them.

> > +#if 1
> > +               DRM_ERROR("i915 struct_mutex lock is still held by %s. Giving on up reset.\n", dev->struct_mutex.owner->task->comm);
> > +               return;
> > +#else
> > +               struct task_struct *task = dev->struct_mutex.owner->task;
> > +               DRM_ERROR("Killing process %d (%s) for holding i915 device mutex\n",
> > +                               task->pid, task->comm);
> > +               force_sig(SIGILL, task);
> 
> Should be SIGKILL here?

We should just decide whether we want the warning here or the kill.
Ben, I assume you tested the warning mostly?  Killing a waiter will
probably just kill the unlucky task who happened to be waiting for the
hung batch; might be friendlier to let it continue, but it also makes
the reset less reliable.

> > +               pci_read_config_byte(dev->pdev, GDRST, &gdrst);
> > +               //TODO: Set domains
> 
> Consider using kernel coding style perhaps?

Yeah, or just drop since domains are the render, media etc mentioned
above.
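
For readers following the GDRST discussion, here is a minimal userspace sketch of the byte values the patch writes to the reset register. The `GDRST_*` domain values are copied from the `i915_reg.h` hunk; `GDRST_RESET_BIT` and the two helper names are hypothetical and exist only for this illustration. Under that reading, bit 0 is only set when `flags` is `GDRST_FULL`, which seems consistent with the render and media domains not being usable yet.

```c
#include <stdint.h>

/* Domain values copied from the i915_reg.h hunk in this patch. */
#define GDRST_FULL      (0 << 2)
#define GDRST_RENDER    (1 << 2)
#define GDRST_MEDIA     (3 << 2)

/* Hypothetical name for bit 0; the patch uses the literal 0x1. */
#define GDRST_RESET_BIT 0x1

/*
 * Value of the first config-space write in the patch: the current
 * register contents, OR'd with the domain bits, plus bit 0 only when
 * a full reset was requested.
 */
static uint8_t gdrst_trigger_value(uint8_t current, uint8_t flags)
{
	return (uint8_t)(current | flags |
			 ((flags == GDRST_FULL) ? GDRST_RESET_BIT : 0));
}

/* Value of the second write, about 50us later: bit 0 cleared again. */
static uint8_t gdrst_clear_value(uint8_t current)
{
	return (uint8_t)(current & 0xfe);
}
```

With a register that reads back as zero, `gdrst_trigger_value(0, GDRST_FULL)` is 0x01, while `GDRST_RENDER` and `GDRST_MEDIA` give 0x04 and 0x0c with bit 0 left clear.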

> 
> > +               pci_write_config_byte(dev->pdev, GDRST, gdrst | flags | ((flags == GDRST_FULL) ? 0x1 : 0x0));
> > +               udelay(50);
> > +               pci_write_config_byte(dev->pdev, GDRST, gdrst & 0xfe);
> > +
> > +               /* ...we don't want to loop forever though, 500ms should be plenty */
> > +               timeout = jiffies + msecs_to_jiffies(500);
> > +               do {
> > +                       udelay(100);
> > +                       pci_read_config_byte(dev->pdev, GDRST, &gdrst);
> > +               } while ((gdrst & 0x1) && time_after(timeout, jiffies));
> > +
> > +               if (gdrst & 0x1) {
> > +                       DRM_ERROR("Failed to reset chip. You're screwed.");
> 
> Perhaps drop the second sentence, to keep things looking neat.

We might want to panic in that case?

Ben Gamari Sept. 13, 2009, 7:24 p.m. UTC | #3

On Sun, Sep 13, 2009 at 11:21:35AM -0700, Jesse Barnes wrote:
> On Sun, 13 Sep 2009 18:27:48 +0100
> Daniel J Blueman <daniel.blueman@gmail.com> wrote:
> 
> > A couple of things just caught my eye while looking through the patch,
> > perhaps to consider tweaking?
> 
> Thanks for looking.
> 
> > > > +void i965_reset(struct drm_device *dev, u8 flags)
> > > > +{
> > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > +       unsigned long timeout;
> > > > +       u8 gdrst;
> > > > +       bool need_display = true; //!(flags & (GDRST_RENDER | GDRST_MEDIA));
> > 
> > As the kernel's coding style isn't C99, is it worth using /* foo */
> > here to be compliant?
> 
> Or just drop it and add a comment about the other two.  We'd need to
> experiment some more before trying to use them.

Yeah, I think this is the right thing to do.

> 
> > > > +#if 1
> > > > +               DRM_ERROR("i915 struct_mutex lock is still held by %s. Giving on up reset.\n", dev->struct_mutex.owner->task->comm);
> > > > +               return;
> > > > +#else
> > > > +               struct task_struct *task = dev->struct_mutex.owner->task;
> > > > +               DRM_ERROR("Killing process %d (%s) for holding i915 device mutex\n",
> > > > +                               task->pid, task->comm);
> > > > +               force_sig(SIGILL, task);
> > 
> > Should be SIGKILL here?
> 
> We should just decide whether we want the warning here or the kill.
> Ben, I assume you tested the warning mostly?  Killing a waiter will
> probably just kill the unlucky task who happened to be waiting for the
> hung batch; might be friendlier to let it continue, but it also makes
> the reset less reliable.

We do need to decide what to do with the task that gave us the bad
batch. This particular code, however, doesn't handle that case. By the
time we get here we have already woken up the waiters with the wedged
flag set so in theory no one should be holding the lock. In the event
that someone is, we really have three options,

1) Try waking it up again and hope it doesn't take it this time
2) Wait for it to release it (probably will never happen)
3) Kill the process

I opted to kill the process.
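
Before any of these options comes into play, the patch polls the mutex for up to 500ms. As a rough userspace model of that bounded wait: `tick_after()` stands in for the kernel's `time_after()` and uses the same wraparound-safe signed comparison, while `struct fake_lock`, `fake_lock_held()` and `wait_for_unlock()` are hypothetical names invented for this sketch. Each loop iteration advances a simulated tick counter instead of calling `udelay()`.

```c
#include <stdbool.h>

/*
 * Userspace stand-in for the kernel's time_after(a, b): true when
 * tick count a is later than b, even if the counter has wrapped.
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Hypothetical lock model: considered held until tick release_at. */
struct fake_lock {
	unsigned long release_at;
};

static bool fake_lock_held(const struct fake_lock *l, unsigned long now)
{
	return tick_after(l->release_at, now);
}

/*
 * Same loop shape as the patch: delay, then keep going while the lock
 * is held and the deadline has not passed. *now advances by one tick
 * per iteration. Returns true if the lock is free when the loop ends.
 */
static bool wait_for_unlock(const struct fake_lock *l, unsigned long *now,
			    unsigned long budget)
{
	unsigned long timeout = *now + budget;

	do {
		(*now)++;
	} while (fake_lock_held(l, *now) && tick_after(timeout, *now));

	return !fake_lock_held(l, *now);
}
```

A `false` return is the point at which the three options listed above apply; the sketch deliberately does not model which one is chosen.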

> 
> > > + Â  Â  Â  Â  Â  Â  Â  pci_read_config_byte(dev->pdev, GDRST, &gdrst);
> > > + Â  Â  Â  Â  Â  Â  Â  //TODO: Set domains
> > 
> > Consider using kernel coding style perhaps?
> 
> Yeah, or just drop since domains are the render, media etc mentioned
> above.

Yep, this definitely slipped through.

> 
> > 
> > > + Â  Â  Â  Â  Â  Â  Â  pci_write_config_byte(dev->pdev, GDRST, gdrst |
> > > flags | ((flags == GDRST_FULL) ? 0x1 : 0x0));
> > > + Â  Â  Â  Â  Â  Â  Â  udelay(50);
> > > + Â  Â  Â  Â  Â  Â  Â  pci_write_config_byte(dev->pdev, GDRST, gdrst &
> > > 0xfe); +
> > > + Â  Â  Â  Â  Â  Â  Â  /* ...we don't want to loop forever though, 500ms
> > > should be plenty */
> > > + Â  Â  Â  Â  Â  Â  Â timeout = jiffies + msecs_to_jiffies(500);
> > > + Â  Â  Â  Â  Â  Â  Â  do {
> > > + Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  udelay(100);
> > > + Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  pci_read_config_byte(dev->pdev, GDRST,
> > > &gdrst);
> > > + Â  Â  Â  Â  Â  Â  Â  } while ((gdrst & 0x1) && time_after(timeout,
> > > jiffies)); +
> > > + Â  Â  Â  Â  Â  Â  Â  if (gdrst & 0x1) {
> > > + Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  Â  DRM_ERROR("Failed to reset chip. You're
> > > screwed.");
> > 
> > Perhaps drop the second sentence, to keep things looking neat.
> 
> We might want to panic in that case?

Meh, this seems a little much. Perhaps just WARN?

Thanks for looking this over.

-  Ben

Owain Ainsworth Sept. 13, 2009, 8:40 p.m. UTC | #4

On Sun, Sep 13, 2009 at 03:24:03PM -0400, Ben Gamari wrote:
> On Sun, Sep 13, 2009 at 11:21:35AM -0700, Jesse Barnes wrote:
> > On Sun, 13 Sep 2009 18:27:48 +0100
> > Daniel J Blueman <daniel.blueman@gmail.com> wrote:
> > 
> > > A couple of things just caught my eye while looking through the patch,
> > > perhaps to consider tweaking?
> > 
> > Thanks for looking.
> > 
> > > > > +void i965_reset(struct drm_device *dev, u8 flags)
> > > > > +{
> > > > > +       drm_i915_private_t *dev_priv = dev->dev_private;
> > > > > +       unsigned long timeout;
> > > > > +       u8 gdrst;
> > > > > +       bool need_display = true; //!(flags & (GDRST_RENDER | GDRST_MEDIA));
> > > 
> > > As the kernel's coding style isn't C99, is it worth using /* foo */
> > > here to be compliant?
> > 
> > Or just drop it and add a comment about the other two.  We'd need to
> > experiment some more before trying to use them.
> 
> Yeah, I think this is the right thing to do.
> 
> > 
> > > > > +#if 1
> > > > > +               DRM_ERROR("i915 struct_mutex lock is still held by %s. Giving on up reset.\n", dev->struct_mutex.owner->task->comm);
> > > > > +               return;
> > > > > +#else
> > > > > +               struct task_struct *task = dev->struct_mutex.owner->task;
> > > > > +               DRM_ERROR("Killing process %d (%s) for holding i915 device mutex\n",
> > > > > +                               task->pid, task->comm);
> > > > > +               force_sig(SIGILL, task);
> > > 
> > > Should be SIGKILL here?
> > 
> > We should just decide whether we want the warning here or the kill.
> > Ben, I assume you tested the warning mostly?  Killing a waiter will
> > probably just kill the unlucky task who happened to be waiting for the
> > hung batch; might be friendlier to let it continue, but it also makes
> > the reset less reliable.
> 
> We do need to decide what to do with the task that gave us the bad
> batch. This particular code, however, doesn't handle that case. By the
> time we get here we have already woken up the waiters with the wedged
> flag set so in theory no one should be holding the lock. In the event
> that someone is, we really have three options,
> 
> 1) Try waking it up again and hope it doesn't take it this time
> 2) Wait for it to release it (probably will never happen)
> 3) Kill the process
> 
> I opted to kill the process.

my version of the code doesn't actually do this and it managed fine.

where we'll be locked (and blocking) is essentially whenever we hit
i915_wait_request. now, that function in its sleeping loop does check
whether wedged is set, if so it EIO's out of there. So we set wedged (we
MUST do that first), then we wakeup all sleepers. Now, anyone sleeping
(and since they always sleep with the lock) will check the condition and
return EIO, where they'll release the lock and dump that return back to
userland. Anything else means that something else is buggy.
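
The ordering described here (set wedged first, then wake the sleepers, who see the flag and bail out with EIO) can be sketched as a single-threaded userspace model. This is not the driver's i915_wait_request; `struct fake_dev`, `check_request()`, `wait_request()` and the two event callbacks are hypothetical names, and the "sleep" is just a callback that simulates whatever happens before the next wakeup.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device model, for illustration only. */
struct fake_dev {
	bool wedged;           /* set by the hang handler before waking sleepers */
	unsigned int hw_seqno; /* last sequence number the GPU completed */
	bool locked;           /* models the waiter holding struct_mutex */
};

/* Check done on every wakeup: -EIO if wedged, 0 if done, 1 to sleep on. */
static int check_request(const struct fake_dev *dev, unsigned int seqno)
{
	if (dev->wedged)
		return -EIO;
	if ((int)(dev->hw_seqno - seqno) >= 0)
		return 0;
	return 1;
}

/* Stands in for "sleep until someone wakes the queue". */
typedef void (*sleep_fn)(struct fake_dev *dev);

/*
 * The waiter sleeps with the lock held, rechecks the condition on every
 * wakeup, and drops the lock on the way out whether it succeeded or not.
 */
static int wait_request(struct fake_dev *dev, unsigned int seqno,
			sleep_fn sleep_until_woken)
{
	int ret;

	dev->locked = true;
	while ((ret = check_request(dev, seqno)) > 0)
		sleep_until_woken(dev);
	dev->locked = false;

	return ret;
}

/* Simulated events that end a sleep. */
static void gpu_hang_detected(struct fake_dev *dev)
{
	dev->wedged = true; /* set the flag first, then wake the waiter */
}

static void gpu_makes_progress(struct fake_dev *dev)
{
	dev->hw_seqno++;
}
```

If the flag were set after the wakeup instead, the waiter would recheck, see neither completion nor wedged, and go back to sleep still holding the lock, which is the case the reset path is guarding against.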

such contention problems on openbsd i've not seen. The only other option
I can see is in wait_ring, where we should really also check for wedged
and if so cede the lock and try again, but that's a different problem,
really and has been a problem for $LONG_TIME

-0-

Ben Gamari Sept. 13, 2009, 8:56 p.m. UTC | #5

On Sun, Sep 13, 2009 at 09:40:11PM +0100, Owain Ainsworth wrote:
> 
> my version of the code doesn't actually do this and it managed fine.
> 
> where we'll be locked (and blocking) is essentially whenever we hit
> i915_wait_request. now, that function in its sleeping loop does check
> whether wedged is set, if so it EIO's out of there. So we set wedged (we
> MUST do that first), then we wakeup all sleepers. Now, anyone sleeping
> (and since they always sleep with the lock) will check the condition and
> return EIO, where they'll release the lock and dump that return back to
> userland. Anything else means that something else is buggy.
> 
> such contention problems on openbsd i've not seen. The only other option
> I can see is in wait_ring, where we should really also check for wedged
> and if so cede the lock and try again, but that's a different problem,
> really and has been a problem for $LONG_TIME

Certainly. This code should be redundant in almost all cases. While
debugging the patch I saw contention issues but these were probably
triggered by bugs in my code. I kept the code, however, for cases such
as the one you mention (and because I didn't think to remove it). While
theoretically the locking semantics should preclude these cases, one
never knows. If we can agree to clean up all of the locking paths to
honor .wedged, then I agree that this code serves no purpose.

- Ben
