[09/19] drm/radeon: handle lockup in delayed work, v2

Message ID	20140731153342.15061.54264.stgit@patser (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> Subject: [PATCH 09/19] drm/radeon: handle lockup in delayed work, v2 To: airlied@linux.ie From: Maarten Lankhorst <maarten.lankhorst@canonical.com> Date: Thu, 31 Jul 2014 17:33:42 +0200 Message-ID: <20140731153342.15061.54264.stgit@patser> In-Reply-To: <20140731153245.15061.63023.stgit@patser> References: <20140731153245.15061.63023.stgit@patser> User-Agent: StGit/0.15 MIME-Version: 1.0 Cc: thellstrom@vmware.com, nouveau@lists.freedesktop.org, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, bskeggs@redhat.com, alexander.deucher@amd.com, christian.koenig@amd.com Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Maarten Lankhorst July 31, 2014, 3:33 p.m. UTC

Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
---
V1 had a nasty bug breaking gpu lockup recovery. The fix is not
allowing radeon_fence_driver_check_lockup to take exclusive_lock,
and kill it during lockup recovery instead.
---
 drivers/gpu/drm/radeon/radeon.h        |    3 +
 drivers/gpu/drm/radeon/radeon_device.c |    5 +
 drivers/gpu/drm/radeon/radeon_fence.c  |  124 ++++++++++++++++++--------------
 drivers/gpu/drm/radeon/radeon_ring.c   |    1 
 4 files changed, 77 insertions(+), 56 deletions(-)

Christian König Aug. 1, 2014, 4:35 p.m. UTC | #1

Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
> ---
> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
> and kill it during lockup recovery instead.

That looks like the delayed work starts running as soon as we submit a 
fence, and not when it's needed for waiting.

Since it's a backup for failing IRQs I would rather put it into 
radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.

Christian.

> ---
>   drivers/gpu/drm/radeon/radeon.h        |    3 +
>   drivers/gpu/drm/radeon/radeon_device.c |    5 +
>   drivers/gpu/drm/radeon/radeon_fence.c  |  124 ++++++++++++++++++--------------
>   drivers/gpu/drm/radeon/radeon_ring.c   |    1
>   4 files changed, 77 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index 60c47f829122..b01d88fc10cb 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -347,6 +347,8 @@ struct radeon_fence_driver {
>   	uint64_t			sync_seq[RADEON_NUM_RINGS];
>   	atomic64_t			last_seq;
>   	bool				initialized;
> +	struct delayed_work	fence_check_work;
> +	struct radeon_device	*rdev;
>   };
>   
>   struct radeon_fence {
> @@ -360,6 +362,7 @@ struct radeon_fence {
>   
>   int radeon_fence_driver_start_ring(struct radeon_device *rdev, int ring);
>   int radeon_fence_driver_init(struct radeon_device *rdev);
> +void radeon_fence_cancel_delayed_check(struct radeon_device *rdev, int ring);
>   void radeon_fence_driver_fini(struct radeon_device *rdev);
>   void radeon_fence_driver_force_completion(struct radeon_device *rdev);
>   int radeon_fence_emit(struct radeon_device *rdev, struct radeon_fence **fence, int ring);
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
> index 697add2cd4e3..21efd32d07ee 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1637,6 +1637,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
>   	radeon_save_bios_scratch_regs(rdev);
>   	/* block TTM */
>   	resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
> +
> +	/* kill all the hangcheck tests too */
> +	for (i = 0; i < RADEON_NUM_RINGS; ++i)
> +		radeon_fence_cancel_delayed_check(rdev, i);
> +
>   	radeon_pm_suspend(rdev);
>   	radeon_suspend(rdev);
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_fence.c b/drivers/gpu/drm/radeon/radeon_fence.c
> index 913787085dfa..c055acc3a271 100644
> --- a/drivers/gpu/drm/radeon/radeon_fence.c
> +++ b/drivers/gpu/drm/radeon/radeon_fence.c
> @@ -125,16 +125,7 @@ int radeon_fence_emit(struct radeon_device *rdev,
>   	return 0;
>   }
>   
> -/**
> - * radeon_fence_process - process a fence
> - *
> - * @rdev: radeon_device pointer
> - * @ring: ring index the fence is associated with
> - *
> - * Checks the current fence value and wakes the fence queue
> - * if the sequence number has increased (all asics).
> - */
> -void radeon_fence_process(struct radeon_device *rdev, int ring)
> +static bool __radeon_fence_process(struct radeon_device *rdev, int ring)
>   {
>   	uint64_t seq, last_seq, last_emitted;
>   	unsigned count_loop = 0;
> @@ -190,7 +181,51 @@ void radeon_fence_process(struct radeon_device *rdev, int ring)
>   		}
>   	} while (atomic64_xchg(&rdev->fence_drv[ring].last_seq, seq) > seq);
>   
> -	if (wake)
> +	if (seq < last_emitted)
> +		mod_delayed_work(system_power_efficient_wq,
> +				 &rdev->fence_drv[ring].fence_check_work,
> +				 RADEON_FENCE_JIFFIES_TIMEOUT);
> +
> +	return wake;
> +}
> +
> +static void radeon_fence_driver_check_lockup(struct work_struct *work)
> +{
> +	struct radeon_fence_driver *fence_drv;
> +	struct radeon_device *rdev;
> +	unsigned long iring;
> +
> +	fence_drv = container_of(work, struct radeon_fence_driver, fence_check_work.work);
> +	rdev = fence_drv->rdev;
> +	iring = fence_drv - &rdev->fence_drv[0];
> +
> +	if (__radeon_fence_process(rdev, iring))
> +		wake_up_all(&rdev->fence_queue);
> +	else if (radeon_ring_is_lockup(rdev, iring, &rdev->ring[iring])) {
> +		/* good news we believe it's a lockup */
> +		dev_warn(rdev->dev, "GPU lockup (current fence id "
> +			 "0x%016llx last fence id 0x%016llx on ring %ld)\n",
> +			 (uint64_t)atomic64_read(&fence_drv->last_seq),
> +			 fence_drv->sync_seq[iring], iring);
> +
> +		/* remember that we need an reset */
> +		rdev->needs_reset = true;
> +		wake_up_all(&rdev->fence_queue);
> +	}
> +}
> +
> +/**
> + * radeon_fence_process - process a fence
> + *
> + * @rdev: radeon_device pointer
> + * @ring: ring index the fence is associated with
> + *
> + * Checks the current fence value and wakes the fence queue
> + * if the sequence number has increased (all asics).
> + */
> +void radeon_fence_process(struct radeon_device *rdev, int ring)
> +{
> +	if (__radeon_fence_process(rdev, ring))
>   		wake_up_all(&rdev->fence_queue);
>   }
>   
> @@ -302,9 +337,10 @@ static int radeon_fence_wait_seq(struct radeon_device *rdev, u64 *target_seq,
>   {
>   	uint64_t last_seq[RADEON_NUM_RINGS];
>   	bool signaled;
> -	int i, r;
> +	int i;
>   
>   	while (!radeon_fence_any_seq_signaled(rdev, target_seq)) {
> +		long r;
>   
>   		/* Save current sequence values, used to check for GPU lockups */
>   		for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> @@ -319,11 +355,11 @@ static int radeon_fence_wait_seq(struct radeon_device *rdev, u64 *target_seq,
>   		if (intr) {
>   			r = wait_event_interruptible_timeout(rdev->fence_queue, (
>   				(signaled = radeon_fence_any_seq_signaled(rdev, target_seq))
> -				 || rdev->needs_reset), RADEON_FENCE_JIFFIES_TIMEOUT);
> +				 || rdev->needs_reset), MAX_SCHEDULE_TIMEOUT);
>   		} else {
>   			r = wait_event_timeout(rdev->fence_queue, (
>   				(signaled = radeon_fence_any_seq_signaled(rdev, target_seq))
> -				 || rdev->needs_reset), RADEON_FENCE_JIFFIES_TIMEOUT);
> +				 || rdev->needs_reset), MAX_SCHEDULE_TIMEOUT);
>   		}
>   
>   		for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> @@ -334,50 +370,11 @@ static int radeon_fence_wait_seq(struct radeon_device *rdev, u64 *target_seq,
>   			trace_radeon_fence_wait_end(rdev->ddev, i, target_seq[i]);
>   		}
>   
> -		if (unlikely(r < 0))
> +		if (r < 0)
>   			return r;
>   
> -		if (unlikely(!signaled)) {
> -			if (rdev->needs_reset)
> -				return -EDEADLK;
> -
> -			/* we were interrupted for some reason and fence
> -			 * isn't signaled yet, resume waiting */
> -			if (r)
> -				continue;
> -
> -			for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> -				if (!target_seq[i])
> -					continue;
> -
> -				if (last_seq[i] != atomic64_read(&rdev->fence_drv[i].last_seq))
> -					break;
> -			}
> -
> -			if (i != RADEON_NUM_RINGS)
> -				continue;
> -
> -			for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> -				if (!target_seq[i])
> -					continue;
> -
> -				if (radeon_ring_is_lockup(rdev, i, &rdev->ring[i]))
> -					break;
> -			}
> -
> -			if (i < RADEON_NUM_RINGS) {
> -				/* good news we believe it's a lockup */
> -				dev_warn(rdev->dev, "GPU lockup (waiting for "
> -					 "0x%016llx last fence id 0x%016llx on"
> -					 " ring %d)\n",
> -					 target_seq[i], last_seq[i], i);
> -
> -				/* remember that we need an reset */
> -				rdev->needs_reset = true;
> -				wake_up_all(&rdev->fence_queue);
> -				return -EDEADLK;
> -			}
> -		}
> +		if (rdev->needs_reset)
> +			return -EDEADLK;
>   	}
>   	return 0;
>   }
> @@ -711,6 +708,8 @@ static void radeon_fence_driver_init_ring(struct radeon_device *rdev, int ring)
>   		rdev->fence_drv[ring].sync_seq[i] = 0;
>   	atomic64_set(&rdev->fence_drv[ring].last_seq, 0);
>   	rdev->fence_drv[ring].initialized = false;
> +	INIT_DELAYED_WORK(&rdev->fence_drv[ring].fence_check_work, radeon_fence_driver_check_lockup);
> +	rdev->fence_drv[ring].rdev = rdev;
>   }
>   
>   /**
> @@ -740,6 +739,17 @@ int radeon_fence_driver_init(struct radeon_device *rdev)
>   }
>   
>   /**
> + * radeon_fence_cancel_delayed_check - cancel all delayed checks on a ring during lockup
> + *
> + * This prevents the lockup check from being done while suspend is running
> + * during a recovery from a lockup.
> + */
> +void radeon_fence_cancel_delayed_check(struct radeon_device *rdev, int ring)
> +{
> +	cancel_delayed_work_sync(&rdev->fence_drv[ring].fence_check_work);
> +}
> +
> +/**
>    * radeon_fence_driver_fini - tear down the fence driver
>    * for all possible rings.
>    *
> @@ -755,6 +765,8 @@ void radeon_fence_driver_fini(struct radeon_device *rdev)
>   	for (ring = 0; ring < RADEON_NUM_RINGS; ring++) {
>   		if (!rdev->fence_drv[ring].initialized)
>   			continue;
> +
> +		cancel_delayed_work_sync(&rdev->fence_drv[ring].fence_check_work);
>   		r = radeon_fence_wait_empty(rdev, ring);
>   		if (r) {
>   			/* no need to trigger GPU reset as we are unloading */
> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
> index f8050f5429e2..594d871a6fce 100644
> --- a/drivers/gpu/drm/radeon/radeon_ring.c
> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
> @@ -261,6 +261,7 @@ int radeon_ib_ring_tests(struct radeon_device *rdev)
>   
>   		r = radeon_ib_test(rdev, i, ring);
>   		if (r) {
> +			radeon_fence_cancel_delayed_check(rdev, i);
>   			ring->ready = false;
>   			rdev->needs_reset = false;
>   
>

Maarten Lankhorst Aug. 1, 2014, 5:46 p.m. UTC | #2

On 01-08-14 18:35, Christian König wrote:
> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>> ---
>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>> and kill it during lockup recovery instead.
> 
> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
> 
> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.

The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.

Specifically what happened was this scenario:

- lock up occurs
- write lock taken by gpu_reset
- delayed work runs, tries to acquire read lock, blocks
- gpu_reset tries to cancel delayed work synchronously
- has to wait for delayed work to finish -> deadlock

~Maarten

Christian König Aug. 4, 2014, 8:36 a.m. UTC | #3

Hi Maarten,

Sorry for the delay. I've got way to much todo recently.

Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>
> On 01-08-14 18:35, Christian König wrote:
>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>>> ---
>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>> and kill it during lockup recovery instead.
>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
>>
>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.

The idea was turning the delayed work on and off when we turn the irq on 
and off as well, processing of the delayed work handler can still happen 
in radeon_fence.c

>
> Specifically what happened was this scenario:
>
> - lock up occurs
> - write lock taken by gpu_reset
> - delayed work runs, tries to acquire read lock, blocks
> - gpu_reset tries to cancel delayed work synchronously
> - has to wait for delayed work to finish -> deadlock

Why do you want to disable the work item from the lockup handler in the 
first place?

Just take the exclusive lock in the work item, when it concurrently runs 
with the lockup handler it will just block for the lockup handler to 
complete.

Regards,
Christian.

> ~Maarten

Maarten Lankhorst Aug. 4, 2014, 8:55 a.m. UTC | #4

op 04-08-14 10:36, Christian König schreef:
> Hi Maarten,
>
> Sorry for the delay. I've got way to much todo recently.
>
> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>
>> On 01-08-14 18:35, Christian König wrote:
>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>>>> ---
>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>>> and kill it during lockup recovery instead.
>>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
>>>
>>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
>> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.
>
> The idea was turning the delayed work on and off when we turn the irq on and off as well, processing of the delayed work handler can still happen in radeon_fence.c
>
>>
>> Specifically what happened was this scenario:
>>
>> - lock up occurs
>> - write lock taken by gpu_reset
>> - delayed work runs, tries to acquire read lock, blocks
>> - gpu_reset tries to cancel delayed work synchronously
>> - has to wait for delayed work to finish -> deadlock
>
> Why do you want to disable the work item from the lockup handler in the first place?
>
> Just take the exclusive lock in the work item, when it concurrently runs with the lockup handler it will just block for the lockup handler to complete.
With the delayed work radeon_fence_wait no longer handles unreliable interrupts itself, so it has to run from the lockup handler.
But an alternative solution could be adding a radeon_fence_wait_timeout, ignore the timeout and check if fence is signaled on timeout.
This would probably be a cleaner solution.

~Maarten

Christian König Aug. 4, 2014, 11:57 a.m. UTC | #5

Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
> op 04-08-14 10:36, Christian König schreef:
>> Hi Maarten,
>>
>> Sorry for the delay. I've got way to much todo recently.
>>
>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>> On 01-08-14 18:35, Christian König wrote:
>>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>>>>> ---
>>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>>>> and kill it during lockup recovery instead.
>>>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
>>>>
>>>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
>>> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.
>> The idea was turning the delayed work on and off when we turn the irq on and off as well, processing of the delayed work handler can still happen in radeon_fence.c
>>
>>> Specifically what happened was this scenario:
>>>
>>> - lock up occurs
>>> - write lock taken by gpu_reset
>>> - delayed work runs, tries to acquire read lock, blocks
>>> - gpu_reset tries to cancel delayed work synchronously
>>> - has to wait for delayed work to finish -> deadlock
>> Why do you want to disable the work item from the lockup handler in the first place?
>>
>> Just take the exclusive lock in the work item, when it concurrently runs with the lockup handler it will just block for the lockup handler to complete.
> With the delayed work radeon_fence_wait no longer handles unreliable interrupts itself, so it has to run from the lockup handler.
> But an alternative solution could be adding a radeon_fence_wait_timeout, ignore the timeout and check if fence is signaled on timeout.
> This would probably be a cleaner solution.

Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
handling sounds like the best alternative to me.

Christian.

>
> ~Maarten
>

Maarten Lankhorst Aug. 4, 2014, 1:34 p.m. UTC | #6

Hey,

op 04-08-14 13:57, Christian König schreef:
> Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
>> op 04-08-14 10:36, Christian König schreef:
>>> Hi Maarten,
>>>
>>> Sorry for the delay. I've got way to much todo recently.
>>>
>>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>>> On 01-08-14 18:35, Christian König wrote:
>>>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>>>>>> ---
>>>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>>>>> and kill it during lockup recovery instead.
>>>>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
>>>>>
>>>>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
>>>> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.
>>> The idea was turning the delayed work on and off when we turn the irq on and off as well, processing of the delayed work handler can still happen in radeon_fence.c
>>>
>>>> Specifically what happened was this scenario:
>>>>
>>>> - lock up occurs
>>>> - write lock taken by gpu_reset
>>>> - delayed work runs, tries to acquire read lock, blocks
>>>> - gpu_reset tries to cancel delayed work synchronously
>>>> - has to wait for delayed work to finish -> deadlock
>>> Why do you want to disable the work item from the lockup handler in the first place?
>>>
>>> Just take the exclusive lock in the work item, when it concurrently runs with the lockup handler it will just block for the lockup handler to complete.
>> With the delayed work radeon_fence_wait no longer handles unreliable interrupts itself, so it has to run from the lockup handler.
>> But an alternative solution could be adding a radeon_fence_wait_timeout, ignore the timeout and check if fence is signaled on timeout.
>> This would probably be a cleaner solution.
>
> Yeah, agree. Manually specifying a timeout in the fence wait on lockup handling sounds like the best alternative to me.
It'a pain to deal with gpu reset.

I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.

~Maarten

Christian König Aug. 4, 2014, 2:37 p.m. UTC | #7

> It'a pain to deal with gpu reset.

Yeah, well that's nothing new.

> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't 
seem to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

Christian.

Am 04.08.2014 um 15:34 schrieb Maarten Lankhorst:
> Hey,
>
> op 04-08-14 13:57, Christian König schreef:
>> Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
>>> op 04-08-14 10:36, Christian König schreef:
>>>> Hi Maarten,
>>>>
>>>> Sorry for the delay. I've got way to much todo recently.
>>>>
>>>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>>>> On 01-08-14 18:35, Christian König wrote:
>>>>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>>>>>>> ---
>>>>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>>>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>>>>>> and kill it during lockup recovery instead.
>>>>>> That looks like the delayed work starts running as soon as we submit a fence, and not when it's needed for waiting.
>>>>>>
>>>>>> Since it's a backup for failing IRQs I would rather put it into radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
>>>>> The delayed work is not just for failing irq's, it's also the handler that's used to detect lockups, which is why I trigger after processing fences, and reset the timer after processing.
>>>> The idea was turning the delayed work on and off when we turn the irq on and off as well, processing of the delayed work handler can still happen in radeon_fence.c
>>>>
>>>>> Specifically what happened was this scenario:
>>>>>
>>>>> - lock up occurs
>>>>> - write lock taken by gpu_reset
>>>>> - delayed work runs, tries to acquire read lock, blocks
>>>>> - gpu_reset tries to cancel delayed work synchronously
>>>>> - has to wait for delayed work to finish -> deadlock
>>>> Why do you want to disable the work item from the lockup handler in the first place?
>>>>
>>>> Just take the exclusive lock in the work item, when it concurrently runs with the lockup handler it will just block for the lockup handler to complete.
>>> With the delayed work radeon_fence_wait no longer handles unreliable interrupts itself, so it has to run from the lockup handler.
>>> But an alternative solution could be adding a radeon_fence_wait_timeout, ignore the timeout and check if fence is signaled on timeout.
>>> This would probably be a cleaner solution.
>> Yeah, agree. Manually specifying a timeout in the fence wait on lockup handling sounds like the best alternative to me.
> It'a pain to deal with gpu reset.
>
> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>
> ~Maarten
>

Maarten Lankhorst Aug. 4, 2014, 2:40 p.m. UTC | #8

op 04-08-14 16:37, Christian König schreef:
>> It'a pain to deal with gpu reset.
>
> Yeah, well that's nothing new.
>
>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>
> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>
> How about moving the fence waiting out of the reset code?
What cases did I miss then?

I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.

Christian König Aug. 4, 2014, 2:45 p.m. UTC | #9

Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
> op 04-08-14 16:37, Christian König schreef:
>>> It'a pain to deal with gpu reset.
>> Yeah, well that's nothing new.
>>
>>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>>
>> How about moving the fence waiting out of the reset code?
> What cases did I miss then?
>
> I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.

The IB test itself doesn't needs to be protected by the exclusive lock. 
Only everything between radeon_save_bios_scratch_regs and 
radeon_ring_restore.

Christian.

Maarten Lankhorst Aug. 4, 2014, 2:58 p.m. UTC | #10

op 04-08-14 16:45, Christian König schreef:
> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>> op 04-08-14 16:37, Christian König schreef:
>>>> It'a pain to deal with gpu reset.
>>> Yeah, well that's nothing new.
>>>
>>>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>>>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>>> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>>>
>>> How about moving the fence waiting out of the reset code?
>> What cases did I miss then?
>>
>> I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.
>
> The IB test itself doesn't needs to be protected by the exclusive lock. Only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
I'm not sure about that, what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?

~Maarten

Christian König Aug. 4, 2014, 3:04 p.m. UTC | #11

Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
> op 04-08-14 16:45, Christian König schreef:
>> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>>> op 04-08-14 16:37, Christian König schreef:
>>>>> It'a pain to deal with gpu reset.
>>>> Yeah, well that's nothing new.
>>>>
>>>>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>>>>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>>>> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>>>>
>>>> How about moving the fence waiting out of the reset code?
>>> What cases did I miss then?
>>>
>>> I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.
>> The IB test itself doesn't needs to be protected by the exclusive lock. Only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
> I'm not sure about that, what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?

Just set need_reset again and return -EAGAIN, that should have mostly 
the same effect as what we are doing right now.

Christian.

>
> ~Maarten
>

Maarten Lankhorst Aug. 4, 2014, 3:09 p.m. UTC | #12

op 04-08-14 17:04, Christian König schreef:
> Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
>> op 04-08-14 16:45, Christian König schreef:
>>> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>>>> op 04-08-14 16:37, Christian König schreef:
>>>>>> It'a pain to deal with gpu reset.
>>>>> Yeah, well that's nothing new.
>>>>>
>>>>>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>>>>>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>>>>> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>>>>>
>>>>> How about moving the fence waiting out of the reset code?
>>>> What cases did I miss then?
>>>>
>>>> I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.
>>> The IB test itself doesn't needs to be protected by the exclusive lock. Only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
>> I'm not sure about that, what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?
>
> Just set need_reset again and return -EAGAIN, that should have mostly the same effect as what we are doing right now.
Yeah, except for the locking the ttm delayed workqueue, but that bool should be easy to save/restore.
I think this could work.

~Maarten

Christian König Aug. 4, 2014, 5:04 p.m. UTC | #13

Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:
> op 04-08-14 17:04, Christian König schreef:
>> Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
>>> op 04-08-14 16:45, Christian König schreef:
>>>> Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
>>>>> op 04-08-14 16:37, Christian König schreef:
>>>>>>> It'a pain to deal with gpu reset.
>>>>>> Yeah, well that's nothing new.
>>>>>>
>>>>>>> I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
>>>>>>> But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
>>>>>> The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
>>>>>>
>>>>>> How about moving the fence waiting out of the reset code?
>>>>> What cases did I miss then?
>>>>>
>>>>> I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.
>>>> The IB test itself doesn't needs to be protected by the exclusive lock. Only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
>>> I'm not sure about that, what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?
>> Just set need_reset again and return -EAGAIN, that should have mostly the same effect as what we are doing right now.
> Yeah, except for the locking the ttm delayed workqueue, but that bool should be easy to save/restore.
> I think this could work.

Actually you could activate the delayed workqueue much earlier as well.

Thinking more about it that sounds like a bug in the current code, 
because we probably want the workqueue activated before waiting for the 
fence.

Christian.

>
> ~Maarten
>

Daniel Vetter Aug. 5, 2014, 8:16 a.m. UTC | #14

On Mon, Aug 04, 2014 at 07:04:46PM +0200, Christian König wrote:
> Am 04.08.2014 um 17:09 schrieb Maarten Lankhorst:
> >op 04-08-14 17:04, Christian König schreef:
> >>Am 04.08.2014 um 16:58 schrieb Maarten Lankhorst:
> >>>op 04-08-14 16:45, Christian König schreef:
> >>>>Am 04.08.2014 um 16:40 schrieb Maarten Lankhorst:
> >>>>>op 04-08-14 16:37, Christian König schreef:
> >>>>>>>It'a pain to deal with gpu reset.
> >>>>>>Yeah, well that's nothing new.
> >>>>>>
> >>>>>>>I've now tried other solutions but that would mean reverting to the old style during gpu lockup recovery, and only running the delayed work when !lockup.
> >>>>>>>But this meant that the timeout was useless to add. I think the cleanest is keeping the v2 patch, because potentially any waiting code can be called during lockup recovery.
> >>>>>>The lockup code itself should never call any waiting code and V2 doesn't seem to handle a couple of cases correctly either.
> >>>>>>
> >>>>>>How about moving the fence waiting out of the reset code?
> >>>>>What cases did I miss then?
> >>>>>
> >>>>>I'm curious how you want to move the fence waiting out of reset, when there are so many places that could potentially wait, like radeon_ib_get can call radeon_sa_bo_new which can do a wait, or radeon_ring_alloc that can wait on radeon_fence_wait_next, etc.
> >>>>The IB test itself doesn't needs to be protected by the exclusive lock. Only everything between radeon_save_bios_scratch_regs and radeon_ring_restore.
> >>>I'm not sure about that, what do you want to do if the ring tests fail? Do you have to retake the exclusive lock?
> >>Just set need_reset again and return -EAGAIN, that should have mostly the same effect as what we are doing right now.
> >Yeah, except for the locking the ttm delayed workqueue, but that bool should be easy to save/restore.
> >I think this could work.
> 
> Actually you could activate the delayed workqueue much earlier as well.
> 
> Thinking more about it that sounds like a bug in the current code, because
> we probably want the workqueue activated before waiting for the fence.

We've actually had a similar issue on i915 where when userspace never
waited for rendering (some shitty userspace drivers did that way back) we
never noticed that the gpu died. So launching the hangcheck/stuck wait
worker (we have both too) right away is what we do now.
-Daniel

[09/19] drm/radeon: handle lockup in delayed work, v2

Commit Message

Comments

Patch