
[v3,1/3] drm/radeon: take exclusive_lock in read mode during ring tests, v3

Message ID 53F4A08B.80402@canonical.com (mailing list archive)
State New, archived

Commit Message

Maarten Lankhorst Aug. 20, 2014, 1:20 p.m. UTC
Hey,

I found another bug related to gpu reset, one that seems to trigger more reliably with the delayed work changes.

On 19-08-14 at 15:06, Christian König wrote:
> On 19.08.2014 at 14:22, Maarten Lankhorst wrote:
>> This is needed for the next commit, because the lockup detection
>> will need the read lock to run.
>>
>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
>> ---
>> radeon_pm_compute_clocks already checks if dpm is enabled, so no need to check a second time.
>>
>> Because of locking and waiting stuff the radeon_pm_compute_clocks and resume_force_mode calls
>> have to be done with read lock held.
>>
>> Seems to survive on my radeon when catting /sys/kernel/debug/dri/0/radeon_gpu_reset although
>> uvd fails to reset, and that ring gets disabled as a result.
>
> Depending on what hardware you have it's normal that UVD doesn't reset properly. I still haven't figured out the correct sequence in which I need to disable/enable the different UVD blocks on all hardware generations.
>
> It seems to work fine on my Cayman, but doesn't for example on Turks (which at least theoretically should have the same UVD block). It should be fine as long as the engines get properly disabled when the IB test fails after a reset.
>
> Another common source of reset instability is DPM; while it now seems to be stable on NI and BTC, I can't get a single reset to work once I use it.
>
> Regarding the patch it looks good now, but I still want to test it a bit,
> Christian. 

It seems that if UVD fails, a second lockup recovery could in theory be
attempted, because the UVD ring stays marked as locked up and the delayed
work doesn't check whether the ring is still ready or not.

The changes in "drm/radeon: handle lockup in delayed work, v3" expose
this bug reliably, because they force lockups to always be checked.

This results in lockup recovery occurring repeatedly.
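
As an illustration of the missing check, here is a minimal sketch of what
the delayed work handler could look like with such a guard. It assumes the
work item lives in struct radeon_fence_driver, as in the delayed-work
series; the exact names are illustrative, not taken from the posted patches:

static void radeon_fence_check_lockup(struct work_struct *work)
{
	/* Illustrative sketch only; field and function names are
	 * assumptions modelled on the radeon fence code. */
	struct radeon_fence_driver *fence_drv =
		container_of(work, struct radeon_fence_driver,
			     lockup_work.work);
	struct radeon_device *rdev = fence_drv->rdev;
	int ring = fence_drv - &rdev->fence_drv[0];

	/* The missing check: a ring that was disabled by a failed IB
	 * test cannot be "locked up", so don't trigger yet another
	 * recovery for it. */
	if (!rdev->ring[ring].ready)
		return;

	if (radeon_ring_test_lockup(rdev, &rdev->ring[ring])) {
		rdev->needs_reset = true;
		wake_up_all(&rdev->fence_queue);
	}
}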

Does this alternate patch look better?

I've also attached a patch to this mail that tests for lockups on all rings;
if any failed ring has saved ring data, the radeon_gpu_reset function will
be retried.
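
The idea is that radeon_gpu_reset returns -EAGAIN whenever a failed ring
still has saved data to replay, so the caller can simply run the reset
again. A hypothetical sketch of such a consumer (the helper name and the
retry cap are my own, not part of the series):

int radeon_gpu_reset_retry(struct radeon_device *rdev)
{
	/* Hypothetical wrapper: retry the reset as long as
	 * radeon_gpu_reset reports that a failed ring still had
	 * saved data worth replaying. The cap of three attempts is
	 * an illustrative safeguard against endless reset loops. */
	int tries, r;

	for (tries = 0; tries < 3; ++tries) {
		r = radeon_gpu_reset(rdev);
		if (r != -EAGAIN)
			return r;
	}
	return r;
}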

Patch

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index e13e408832e5..9253e1253780 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1735,8 +1735,38 @@  int radeon_gpu_reset(struct radeon_device *rdev)
 		radeon_pm_compute_clocks(rdev);
 
 	if (!r) {
-		r = radeon_ib_ring_tests(rdev);
-		if (r && ring_data[RADEON_RING_TYPE_GFX_INDEX])
+		bool retry = false;
+
+		for (i = 0; i < RADEON_NUM_RINGS; ++i) {
+			struct radeon_ring *ring = &rdev->ring[i];
+
+			if (!ring->ready)
+				continue;
+
+			r = radeon_ib_test(rdev, i, ring);
+			if (!r)
+				continue;
+
+			ring->ready = false;
+			dev_err(rdev->dev, "failed testing IB on ring %d (%d)\n", i, r);
+
+			if (ring_data[i]) {
+				retry = true;
+				continue;
+			} else {
+				radeon_fence_driver_force_completion(rdev, i);
+				rdev->needs_reset = false;
+				r = 0;
+			}
+
+			if (i == RADEON_RING_TYPE_GFX_INDEX) {
+				dev_err(rdev->dev, "disabling acceleration\n");
+				rdev->accel_working = false;
+				break;
+			}
+		}
+
+		if (retry)
 			r = -EAGAIN;
 	}