radeon 0000:02:00.0: GPU lockup CP stall for more than 10000msec

Message ID 20121223105527.GA6230@liondog.tnic (mailing list archive)
State New, archived

Commit Message

Borislav Petkov Dec. 23, 2012, 10:55 a.m. UTC
On Sat, Dec 22, 2012 at 07:42:16PM -0500, Alex Deucher wrote:
> Does booting with radeon.wb=0 help?

Right, this param specification somehow didn't work here:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1 root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.wb=0
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1 root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.wb=0

[ … ]

[    6.910104] radeon: `0' invalid for parameter `wb'

[ … ]

[   28.191072] radeon: `0' invalid for parameter `wb'

although the whole driver blubber didn't appear on the console afterwards,
so something got turned off all right.

Then, I went and tried "radeon.no_wb" where the driver blubber appeared
but AGP writeback was still enabled:

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1 root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.no_wb
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1 root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.no_wb

[ … ]

[    6.382636] [drm] radeon kernel modesetting enabled.
[    6.384915] radeon 0000:02:00.0: VRAM: 512M 0x0000000000000000 - 0x000000001FFFFFFF (512M used)
[    6.384981] radeon 0000:02:00.0: GTT: 512M 0x0000000020000000 - 0x000000003FFFFFFF
[    6.388137] [drm] radeon: 512M of VRAM memory ready
[    6.388181] [drm] radeon: 512M of GTT memory ready.
[    6.388509] radeon 0000:02:00.0: irq 42 for MSI/MSI-X
[    6.388570] radeon 0000:02:00.0: radeon: using MSI.
[    6.388705] [drm] radeon: irq initialized.
[    6.567811] radeon 0000:02:00.0: WB enabled
				    ^^^^^^^^^^

[    6.567856] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff8802243e5c00
[    6.567922] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffff8802243e5c0c
[    6.601247] [drm] Radeon Display Connectors
[    6.602427] [drm] radeon: power management initialized
[    6.722544] fbcon: radeondrmfb (fb0) is primary device
[    6.945065] radeon 0000:02:00.0: fb0: radeondrmfb frame buffer device
[    6.945100] radeon 0000:02:00.0: registered panic notifier
[    6.945159] [drm] Initialized radeon 2.27.0 20080528 for 0000:02:00.0 on minor 0

At this point, I got tired of this experimenting and went and took the
big hammer :-):



[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1+ root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.no_wb no_wb
[    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.8.0-rc1+ root=/dev/sda1 ro vga=0 log_bug_len=10M resume=/dev/sda2 no_console_suspend ignore_loglevel hpet=force radeon.no_wb no_wb

[    6.562905] [drm] radeon kernel modesetting enabled.
[    6.565106] radeon 0000:02:00.0: VRAM: 512M 0x0000000000000000 - 0x000000001FFFFFFF (512M used)
[    6.565172] radeon 0000:02:00.0: GTT: 512M 0x0000000020000000 - 0x000000003FFFFFFF
[    6.567696] [drm] radeon: 512M of VRAM memory ready
[    6.567742] [drm] radeon: 512M of GTT memory ready.
[    6.568068] radeon 0000:02:00.0: irq 42 for MSI/MSI-X
[    6.568130] radeon 0000:02:00.0: radeon: using MSI.
[    6.568269] [drm] radeon: irq initialized.
[    6.684920] radeon_wb_init: disable the goddam WB: radeon_no_wb: 0
[    6.684967] radeon 0000:02:00.0: WB disabled
				    ^^^^^^^^^^^

[    6.685011] radeon 0000:02:00.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff880221ea3c00
[    6.685077] radeon 0000:02:00.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffff880221ea3c0c
[    6.722367] [drm] Radeon Display Connectors
[    6.723548] [drm] radeon: power management initialized
[    6.843185] fbcon: radeondrmfb (fb0) is primary device
[    7.066368] radeon 0000:02:00.0: fb0: radeondrmfb frame buffer device
[    7.066402] radeon 0000:02:00.0: registered panic notifier
[    7.066462] [drm] Initialized radeon 2.27.0 20080528 for 0000:02:00.0 on minor 0

Ok, I hope I turned off the proper WB thing (I'm assuming you meant the
radeon_no_wb parameter).

And I'm running with it now, will report what happens.

Btw, I'm no GPU guy but why does radeon_wb_init() do all that memory
allocation and cleaning if wb can be disabled with a parameter?
Shouldn't it check the parameter, ->family, etc. first and only do the
allocations when rdev->wb.enabled remains true?
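
A minimal userspace sketch of the "decide first, allocate afterwards"
ordering asked about here. Every name in it (sketch_wb, sketch_wb_wanted,
SKETCH_WB_SIZE, ...) is hypothetical and is not the radeon driver's code;
whether the real driver can actually skip the allocation is not settled
in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define SKETCH_WB_SIZE 4096	/* hypothetical buffer size */

struct sketch_wb {
	void *buf;
	bool enabled;
};

/* Hypothetical policy check: true only when writeback should be used. */
static bool sketch_wb_wanted(int no_wb_param, bool family_supports_wb)
{
	return no_wb_param != 1 && family_supports_wb;
}

/* Evaluate the policy first; allocate only if writeback stays enabled. */
static int sketch_wb_init(struct sketch_wb *wb, int no_wb_param,
			  bool family_supports_wb)
{
	assert(wb != NULL);
	wb->buf = NULL;
	wb->enabled = sketch_wb_wanted(no_wb_param, family_supports_wb);
	if (!wb->enabled)
		return 0;		/* disabled: nothing was allocated */

	wb->buf = calloc(1, SKETCH_WB_SIZE);
	if (wb->buf == NULL) {
		wb->enabled = false;
		return -1;
	}
	return 0;
}

/* Release the buffer, if any, and mark writeback as disabled. */
static void sketch_wb_fini(struct sketch_wb *wb)
{
	assert(wb != NULL);
	free(wb->buf);
	wb->buf = NULL;
	wb->enabled = false;
}
```

With this ordering a disabled writeback never touches the allocator, so
there is nothing to clean up on that path.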

Thanks.

Comments

Andy Furniss Dec. 23, 2012, 11:01 a.m. UTC | #1
Borislav Petkov wrote:

> [   28.191072] radeon: `0' invalid for parameter `wb'
>
> although the whole driver blubber didn't appear on the console afterwards,
> so something got turned off all right.
>
> Then, I went and tried "radeon.no_wb" where the driver blubber appeared
> but AGP writeback was still enabled:

no_wb=1 should work.
Borislav Petkov Dec. 23, 2012, 11:07 a.m. UTC | #2
On Sun, Dec 23, 2012 at 11:01:37AM +0000, Andy Furniss wrote:
> no_wb=1 should work.

Yeah, maybe the syntax of all those radeon and other GPU module
parameters should be documented somewhere - in
Documentation/kernel-parameters.txt for example, or a GPU-specific file,
whatever - so that we can all save ourselves the time and confusion.
Provided this hasn't happened yet, of course.

Thanks.
Andy Furniss Dec. 23, 2012, 11:19 a.m. UTC | #3
Borislav Petkov wrote:
> On Sun, Dec 23, 2012 at 11:01:37AM +0000, Andy Furniss wrote:
>> no_wb=1 should work.
>
> Yeah, maybe all those radeon and other GPU module parameters' syntax
> should be documented somewhere - Documentation/kernel-parameters.txt for
> example, or a GPU-specific file, whatever - so that we can save us all
> the time and confusion. Provided this hasn't happened yet, of course.

modinfo radeon

will give a list assuming you use modules, I think all of them need =<num>.
Borislav Petkov Dec. 23, 2012, 11:31 a.m. UTC | #4
On Sun, Dec 23, 2012 at 11:19:00AM +0000, Andy Furniss wrote:
> modinfo radeon
> 
> will give a list assuming you use modules, I think all of them need =<num>.

Yep, that is one way of getting that info, thanks. I always go and look
at Documentation/kernel-parameters.txt and forget about modinfo.

As you say, 'radeon' needs to be a module, but since this is the case
with the distros, the majority of Linux installations out there have it
this way, so we're fine.

Btw, there's a typo in the param list, if anyone wants to write a patch
for it :-):

parm:           lockup_timeout:GPU lockup timeout in ms (defaul 10000 = 10 seconds, 0 = disable) (int)
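
Going only by the parameter description above ("0 = disable"), the check
behind the "CP stall for more than 10000msec" message can be pictured
roughly like this. It is a hedged userspace sketch; the helper name
sketch_lockup_detected is hypothetical and not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: a timeout of 0 disables lockup detection,
 * otherwise a stall of at least timeout_ms counts as a lockup.
 */
static bool sketch_lockup_detected(unsigned long elapsed_ms, int timeout_ms)
{
	assert(timeout_ms >= 0);
	if (timeout_ms == 0)
		return false;
	return elapsed_ms >= (unsigned long)timeout_ms;
}
```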

Thanks.
Markus Trippelsdorf Dec. 23, 2012, 11:51 a.m. UTC | #5
On 2012.12.23 at 12:31 +0100, Borislav Petkov wrote:
> On Sun, Dec 23, 2012 at 11:19:00AM +0000, Andy Furniss wrote:
> > modinfo radeon
> > 
> > will give a list assuming you use modules, I think all of them need =<num>.
> 
> Yep, that is one way of getting that info, thanks. I always go and look
> at Documentation/kernel-parameters.txt and forget about modinfo.
> 
> As you say 'radeon' needs to be module but since this is the case with
> the distros, the majority of Linux installations out there have it this
> way so we're fine.

(If you don't use modules:
 git grep MODULE_PARM_DESC -- drivers/gpu/drm/radeon/
)

You may have hit the same issue as I did, see:
http://thread.gmane.org/gmane.comp.video.dri.devel/78328

Reverting commit 2d6cc729 fixes the problem for me, setting
radeon.no_wb=1 doesn't help.
Joe Perches Dec. 23, 2012, 11:52 a.m. UTC | #6
On Sun, 2012-12-23 at 11:01 +0000, Andy Furniss wrote:
> Borislav Petkov wrote:
> 
> > [   28.191072] radeon: `0' invalid for parameter `wb'
> >
> > although the whole driver blubber didn't appear on the console afterwards,
> > so something got turned off all right.
> >
> > Then, I went and tried "radeon.no_wb" where the driver blubber appeared
> > but AGP writeback was still enabled:
> 
> no_wb=1 should work.

Perhaps some of the module_param_named(,,int,) should be bool
Also, are all of the various permissions appropriate?

$ git grep module_param_named drivers/gpu/drm/radeon/
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(no_wb, radeon_no_wb, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(modeset, radeon_modeset, int, 0400);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(dynclks, radeon_dynclks, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(r4xx_atom, radeon_r4xx_atom, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(vramlimit, radeon_vram_limit, int, 0600);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(agpmode, radeon_agpmode, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(gartsize, radeon_gart_size, int, 0600);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(benchmark, radeon_benchmarking, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(test, radeon_testing, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(connector_table, radeon_connector_table, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(tv, radeon_tv, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(audio, radeon_audio, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(disp_priority, radeon_disp_priority, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(hw_i2c, radeon_hw_i2c, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(pcie_gen2, radeon_pcie_gen2, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(msi, radeon_msi, int, 0444);
drivers/gpu/drm/radeon/radeon_drv.c:module_param_named(lockup_timeout, radeon_lockup_timeout, int, 0444);
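
To illustrate why a bare "radeon.no_wb" does not behave like
"radeon.no_wb=1" for an int-typed parameter, and what a bool type would
change, here is a simplified userspace model. It is an assumption-laden
sketch, not the kernel's parameter code: model_set_int and model_set_bool
are hypothetical names, and the behaviour shown (an int needs a value, a
bare bool name means "on") is my understanding of the general convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of an int-typed parameter: "=<num>" is mandatory. */
static int model_set_int(const char *val, int *out)
{
	char *end;
	long v;

	assert(out != NULL);
	if (val == NULL)		/* bare "no_wb", nothing after the name */
		return -EINVAL;
	errno = 0;
	v = strtol(val, &end, 0);
	if (errno != 0 || end == val || *end != '\0')
		return -EINVAL;
	*out = (int)v;
	return 0;
}

/* Hypothetical model of a bool-typed parameter: a bare name means "on". */
static int model_set_bool(const char *val, bool *out)
{
	assert(out != NULL);
	if (val == NULL)
		val = "1";
	switch (val[0]) {
	case 'y': case 'Y': case '1':
		*out = true;
		return 0;
	case 'n': case 'N': case '0':
		*out = false;
		return 0;
	default:
		return -EINVAL;
	}
}
```

In this model, passing NULL (the bare name) to model_set_int leaves the
variable untouched and reports an error, while model_set_bool turns the
flag on.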
Borislav Petkov Dec. 23, 2012, 12:22 p.m. UTC | #7
On Sun, Dec 23, 2012 at 12:51:33PM +0100, Markus Trippelsdorf wrote:
> (If you don't use modules: git grep MODULE_PARM_DESC --
> drivers/gpu/drm/radeon/ )

Yeah.

> You may have hit the same issue as I, see:
> http://thread.gmane.org/gmane.comp.video.dri.devel/78328

Yes, it very much looks like it. That same page kills the machine here
too. Maybe the GPU gets scared from the graphic nature of those images
and gives up. :-D

Although the box is not completely dead - I can login to a console and
save dmesg before rebooting.

And this bug becomes funnier and funnier - Alex, you might want to add
that webpage to your testsuite :-).

> Reverting commit 2d6cc729 fixes the problem for me, setting
> radeon.no_wb=1 doesn't help.

Right, let me try that and report back.

Good job Markus, thanks!
Borislav Petkov Dec. 23, 2012, 1:31 p.m. UTC | #8
On Sun, Dec 23, 2012 at 01:22:12PM +0100, Borislav Petkov wrote:
> Right, let me try that and report back.

Yep, looks like reverting the above commit fixes it - the boston.com
website loads just fine.

Thanks.
Shuah Khan Dec. 25, 2012, 4:50 a.m. UTC | #9
On Sun, Dec 23, 2012 at 6:31 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sun, Dec 23, 2012 at 01:22:12PM +0100, Borislav Petkov wrote:
>> Right, let me try that and report back.
>
> Yep, looks like reverting the above commit fixes it - the boston.com
> website loads just fine.
>
> Thanks.
>
> --
> Regards/Gruss,
>     Boris.

Saw the same error and after reading this thread, reverted the

Commit 2d6cc7296d4ee128ab0fa3b715f0afde511f49c2.

drm/radeon: use async dma for ttm buffer moves on 6xx-SI

and the problem is gone. In my case, it is a solid hang right after the
system switches to VGA. I was able to log in on the console once or twice.
But dmesg showed the same message reported in this thread:

[   35.812085] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
[   35.812091] radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000002 last fence id 0x0000000000000001)


My system has:
01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI RV620 [Mobility Radeon HD 3400 Series]

-- Shuah
Borislav Petkov Dec. 25, 2012, 10:54 a.m. UTC | #10
On Mon, Dec 24, 2012 at 09:50:11PM -0700, Shuah Khan wrote:
> Saw the same error and after reading this thread, reverted the
>
> Commit 2d6cc7296d4ee128ab0fa3b715f0afde511f49c2.
>
> drm/radeon: use async dma for ttm buffer moves on 6xx-SI
>
> and the problem is gone. In my case, it is a solid hang right after
> system switches to vga. I was able to login on console once or twice.
> But dmesg showed the same message reported in this thread:
>
> [ 35.812085] radeon 0000:01:00.0: GPU lockup CP stall for more than
> 10000msec [ 35.812091] radeon 0000:01:00.0: GPU lockup (waiting for
> 0x0000000000000002 last fence id 0x0000000000000001)
>
>
> My system has: 01:00.0 VGA compatible controller: Advanced Micro
> Devices [AMD] nee ATI RV620 [Mobility Radeon HD 3400 Series]

You can apply http://marc.info/?l=dri-devel&m=135628734704029 instead,
in the meantime. It partially reverts the above commit and if RV620 is
an R600 ASIC (it should be) then it would fix your problem too.

Thanks.
Antti Palosaari Jan. 2, 2013, 1:42 a.m. UTC | #11
On 12/25/2012 06:50 AM, Shuah Khan wrote:
> On Sun, Dec 23, 2012 at 6:31 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Sun, Dec 23, 2012 at 01:22:12PM +0100, Borislav Petkov wrote:
>>> Right, let me try that and report back.
>>
>> Yep, looks like reverting the above commit fixes it - the boston.com
>> website loads just fine.
>>
>> Thanks.
>>
>> --
>> Regards/Gruss,
>>      Boris.
>
> Saw the same error and after reading this thread, reverted the
>
> Commit 2d6cc7296d4ee128ab0fa3b715f0afde511f49c2.
>
> drm/radeon: use async dma for ttm buffer moves on 6xx-SI
>
> and the problem is gone. In my case, it is a solid hang right after
> system switches to vga. I was able to login on console once or twice.
> But dmesg showed the same message reported in this thread:
>
> [   35.812085] radeon 0000:01:00.0: GPU lockup CP stall for more than 10000msec
> [   35.812091] radeon 0000:01:00.0: GPU lockup (waiting for
> 0x0000000000000002 last fence id 0x0000000000000001)
>
>
> My system has:
> 01:00.0 VGA compatible controller: Advanced Micro Devices [AMD] nee
> ATI RV620 [Mobility Radeon HD 3400 Series]

I also ended up at that same commit after bisecting from current 3.8 master.

01:05.0 VGA compatible controller: ATI Technologies Inc 760G [Radeon 3000]
It is ASUS M5A78L-M/USB3 with integrated GPU.

I cannot even boot unless graphical boot is removed from the Fedora 17
boot options (rhgb quiet). Even then, random GPU crashes still occur.

regards
Antti
Borislav Petkov Jan. 2, 2013, 12:02 p.m. UTC | #12
On Wed, Jan 02, 2013 at 03:42:20AM +0200, Antti Palosaari wrote:
> I ended up also that same commit after bisecting from current 3.8
> master.
>
> 01:05.0 VGA compatible controller: ATI Technologies Inc 760G [Radeon
> 3000] It is ASUS M5A78L-M/USB3 with integrated GPU.
>
> I cannot even boot unless graphical boot is removed from Fedora 17
> boot options (rhgb quiet). Random GPU crashes still.

You could try the temporary R600-only fix although I can't see whether
your GPU is also an R600 ASIC or something different by staring at the
model string above:

http://marc.info/?l=dri-devel&m=135628734704029

HTH.
Jerome Glisse Jan. 2, 2013, 5:19 p.m. UTC | #13
On Wed, Jan 2, 2013 at 7:02 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, Jan 02, 2013 at 03:42:20AM +0200, Antti Palosaari wrote:
>> I ended up also that same commit after bisecting from current 3.8
>> master.
>>
>> 01:05.0 VGA compatible controller: ATI Technologies Inc 760G [Radeon
>> 3000] It is ASUS M5A78L-M/USB3 with integrated GPU.
>>
>> I cannot even boot unless graphical boot is removed from Fedora 17
>> boot options (rhgb quiet). Random GPU crashes still.
>
> You could try the temporary R600-only fix although I can't see whether
> your GPU is also an R600 ASIC or something different by staring at the
> model string above:
>
> http://marc.info/?l=dri-devel&m=135628734704029
>
> HTH.
>
> --
> Regards/Gruss,
> Boris.

How do you trigger the issue? Does it happen right away on boot?

Cheers,
Jerome
Antti Palosaari Jan. 2, 2013, 5:58 p.m. UTC | #14
On 01/02/2013 07:19 PM, Jerome Glisse wrote:
> On Wed, Jan 2, 2013 at 7:02 AM, Borislav Petkov <bp@alien8.de> wrote:
>> On Wed, Jan 02, 2013 at 03:42:20AM +0200, Antti Palosaari wrote:
>>> I ended up also that same commit after bisecting from current 3.8
>>> master.
>>>
>>> 01:05.0 VGA compatible controller: ATI Technologies Inc 760G [Radeon
>>> 3000] It is ASUS M5A78L-M/USB3 with integrated GPU.
>>>
>>> I cannot even boot unless graphical boot is removed from Fedora 17
>>> boot options (rhgb quiet). Random GPU crashes still.
>>
>> You could try the temporary R600-only fix although I can't see whether
>> your GPU is also an R600 ASIC or something different by staring at the
>> model string above:
>>
>> http://marc.info/?l=dri-devel&m=135628734704029
>>
>> HTH.
>>
>> --
>> Regards/Gruss,
>> Boris.
>
> How do you trigger the issue ? Does it happens right away on boot ?
>
> Cheers,
> Jerome

Sorry for the utterly unclear description. I meant that the desktop
crashes randomly when I get it booting by removing the graphical boot
option. In that case the Cinnamon desktop "freezes": I was able to move
the mouse cursor, but clicking buttons, moving windows etc. didn't work
at all. Only the mouse cursor moved. It was possible to switch to a
console with ctrl+alt+fN.

When the Fedora graphical boot was enabled (options rhgb quiet) and I
selected the kernel from grub, I just got a blank screen, and after
10 sec or so the monitor switched off saying "no signal". Nothing
happened after that; a reboot had to be forced. I use dm-crypt, and
normally about the first thing shown is the graphical lock screen
asking for the passphrase.

I did some grepping from the syslog and that same message is seen:

Jan  2 03:35:34 localhost kernel: [ 1164.433117] radeon 0000:01:05.0: GPU lockup CP stall for more than 10000msec
Jan  2 03:35:34 localhost kernel: [ 1164.433121] radeon 0000:01:05.0: GPU lockup (waiting for 0x0000000000000003 last fence id 0x0000000000000002)

After I reverted the bisected patch it has been working just fine. It
has been running the whole day without problems.

regards
Antti
Jerome Glisse Jan. 2, 2013, 10:31 p.m. UTC | #15
Could affected people please test whether this patch fixes the issue:
http://people.freedesktop.org/~glisse/0003-drm-radeon-fix-dma-copy-on-r6xx-r7xx-evergen-ni-si-g.patch

You need to make sure you don't have the patch that disables DMA on
r6xx applied, i.e. that lines 977-978 and 1061-1062 in radeon_asic.c
read:
 .copy = &r600_copy_dma,
 .copy_ring_index = R600_RING_TYPE_DMA_INDEX,

Cheers,
Jerome
Markus Trippelsdorf Jan. 2, 2013, 10:38 p.m. UTC | #16
On 2013.01.02 at 17:31 -0500, Jerome Glisse wrote:
> Please affected people can you test if patch :
> http://people.freedesktop.org/~glisse/0003-drm-radeon-fix-dma-copy-on-r6xx-r7xx-evergen-ni-si-g.patch
> 
> Fix the issue, you need to make sure you don't have the patch that
> disable dma on r6xx ie that line 977-978 & 1061-1062  in radeon_asic.c
> is :
>  .copy = &r600_copy_dma,
>  .copy_ring_index = R600_RING_TYPE_DMA_INDEX,

It fixes the issue for me. Thanks.

Patch

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 49b06590001e..00214312db23 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -307,6 +307,11 @@  int radeon_wb_init(struct radeon_device *rdev)
                rdev->wb.use_event = true;
        }
 
+       if (rdev->wb.enabled) {
+               pr_err("%s: disable the goddam WB: radeon_no_wb: %d\n", __func__, radeon_no_wb);
+               rdev->wb.enabled = false;
+       }
+
        dev_info(rdev->dev, "WB %sabled\n", rdev->wb.enabled ? "en" : "dis");
 
        return 0;