CIK hangs with kernel 3.15, bisected

Message ID	CAAxE2A6cV3M+MhWGnDKEJtDQ2FcqiX0Kp6vKu95OXx76JuFY9Q@mail.gmail.com (mailing list archive)
State	New, archived
From: Marek Olšák <maraeo@gmail.com>
To: Christian König <deathsimple@vodafone.de>
Cc: dri-devel <dri-devel@lists.freedesktop.org>
Date: Tue, 27 May 2014 23:55:21 +0200
Subject: Re: CIK hangs with kernel 3.15, bisected

Marek Olšák May 27, 2014, 9:55 p.m. UTC

Hi Christian,

I test on Bonaire (ChipID = 0x665c). Unfortunately, the hangs are not
fixed yet. They are very rare and very random. Therefore, I have come
up with a patch which evicts page tables between IBs. See the
attachment. With that patch applied, the system starts fine, compiz
and glxgears work, but once I start playing openarena, it locks up
pretty quickly.

In theory the patch shouldn't change anything, because the page tables
are moved back to VRAM immediately after the eviction. However, the
VRAM address of the page tables may end up being different from before,
which might be the root cause.
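
To make the suspicion concrete, here is a toy model of the "flush
whenever the address changes" rule. It is only a sketch, not radeon
driver code: the class, its methods, and the addresses are all
hypothetical and exist purely for illustration.

```python
# Toy model only: NOT radeon driver code. All names and addresses are hypothetical.
class ToyVM:
    """Tracks where a page directory lives and what the hardware was last told."""

    def __init__(self, pd_addr):
        self.pd_addr = pd_addr      # current VRAM address of the page directory
        self.flushed_addr = None    # address the hardware last saw at flush time

    def move(self, new_addr):
        """Simulate an eviction followed by a move back to VRAM at new_addr."""
        self.pd_addr = new_addr

    def needs_flush(self):
        """The hardware's view is stale whenever the address has changed."""
        return self.flushed_addr != self.pd_addr

    def flush(self):
        """Record that the hardware now knows the current address."""
        self.flushed_addr = self.pd_addr


vm = ToyVM(pd_addr=0x100000)
vm.flush()
print(vm.needs_flush())   # False: nothing has moved yet
vm.move(0x200000)         # evicted, then placed at a different VRAM address
print(vm.needs_flush())   # True: a flush is required before the next IB
```

If a flush were skipped after such a move, the hardware would keep
walking the old location, which is one way a hang like this could arise.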

Marek

On Wed, May 14, 2014 at 2:11 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Crap, any chance you can narrow it down a bit more?
>
> I've just tried a piglit quick test on my Bonaire and it seems to work
> perfectly fine.
>
> What hw do you test on?
>
> Regards,
> Christian.
>
> On 13.05.2014 23:21, Marek Olšák wrote:
>
>> Hi Christian,
>>
>> Even though some regressions are fixed by these patches:
>>
>> drm/radeon: fix page directory update size estimation
>> drm/radeon: fix buffer placement under memory pressure v2
>>
>> and indeed, the texelFetch tests no longer hang, there is one more
>> hang which needs to be fixed. :( All I know is the exact same commit
>> causes it and it can only be reproduced by running whole piglit with
>> concurrency enabled.
>>
>> My kernel git log:
>>
>> * 2ba22c8 - drm/radeon: fix buffer placement under memory pressure v2
>> (10 hours ago) <Christian König>
>> * 3af91e5 - drm/radeon: fix page directory update size estimation (21
>> hours ago) <Christian König>
>> * 6d2f294 - drm/radeon: use normal BOs for the page tables v4 (2
>> months ago) <Christian König>
>> * fa68834 - drm/radeon: further cleanup vm flushing & fencing (2
>> months ago) <Christian König>
>>
>> fa68834 doesn't hang, but 2ba22c8 hangs, which means 6d2f294 or either
>> of the two fixes is the first bad commit.
>>
>> Marek
>>
>> On Fri, May 9, 2014 at 8:03 PM, Marek Olšák <maraeo@gmail.com> wrote:
>>>
>>> Hi Christian,
>>>
>>> This commit which first appeared in 3.15-rc1 causes hangs on Bonaire:
>>>
>>> commit 6d2f2944e95e504a7d33385eeeb9bb7fcca72592
>>> Author: Christian König <christian.koenig@amd.com>
>>> Date:   Thu Feb 20 13:42:17 2014 +0100
>>>
>>>      drm/radeon: use normal BOs for the page tables v4
>>>
>>>      No need to make it more complicated than necessary,
>>>      just allocate the page tables as normal BO and
>>>      flush whenever the address change.
>>>
>>>      v2: update comments and function name
>>>      v3: squash bug fixes, page directory and tables patch
>>>      v4: rebased on Mareks changes
>>>
>>>      Signed-off-by: Christian König <christian.koenig@amd.com>
>>>
>>>
>>> Reverting the commit gives me a lot of merge conflicts.
>>>
>>> The simplest way to reproduce the hangs is to run piglit with these
>>> parameters:
>>> -t texelFetch.fs
>>>
>>> Some of the tests allocate a lot of MSAA textures and the tests also
>>> run in parallel, which creates a lot of memory pressure and probably
>>> causes buffer evictions.
>>>
>>> Any idea what is wrong with it?
>>>
>>> Thanks,
>>>
>>> Marek
>
>
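
The narrowing step in the quoted git log (fa68834 is good, 2ba22c8 is
bad) can be sketched as follows. This is a minimal illustration of the
reasoning, not a replacement for git bisect; the function name is made
up, and the commit order is taken from the log above, oldest first.

```python
def first_bad_candidates(history, good, bad):
    """Return the commits that could be the first bad one.

    history is ordered oldest to newest. Every commit after the known-good
    one, up to and including the known-bad one, remains a candidate.
    """
    g = history.index(good)
    b = history.index(bad)
    if g >= b:
        raise ValueError("the good commit must be older than the bad commit")
    return history[g + 1 : b + 1]


# Oldest to newest, as listed in the quoted git log.
history = ["fa68834", "6d2f294", "3af91e5", "2ba22c8"]
print(first_bad_candidates(history, good="fa68834", bad="2ba22c8"))
# prints ['6d2f294', '3af91e5', '2ba22c8']
```

The three remaining candidates are the page table commit and the two
fixes, which matches the conclusion drawn in the quoted message.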

Christian König May 28, 2014, 10:38 a.m. UTC | #1

I already tried a similar patch as well, without seeing any additional
crashes. But I'm going to give this another round with your patch and
openarena.

Thanks,
Christian.

Christian König May 29, 2014, 4:30 p.m. UTC | #2

Hi Marek & Alex,

I've found the reason why forcibly evicting page tables sometimes
crashes the box.

This is a hexdump of a typical page table before it is moved to GART:
000117f000  02914061 00000000
000117f008  02915061 00000000
000117f010  02916061 00000000
000117f018  02917061 00000000
000117f020  02918061 00000000

And it looks like this when it comes back:
0006102000  00000000 00000000
*

Ideas? I don't really have an explanation for this. Moving buffers 
around otherwise seems to work perfectly fine.
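
For anyone reading the dump, the entries can be decoded roughly like
this. It is a sketch under stated assumptions: each entry is taken to be
64 bits wide with the low dword printed first, the low 12 bits are taken
to be flags, and the flag positions are assumed to follow the radeon
driver's R600_PTE_* definitions. The helper name is hypothetical.

```python
# Assumed flag layout (R600_PTE_* style); treat these bit positions as an assumption.
PTE_FLAGS = [
    (1 << 0, "VALID"),
    (1 << 1, "SYSTEM"),
    (1 << 2, "SNOOPED"),
    (1 << 5, "READABLE"),
    (1 << 6, "WRITEABLE"),
]


def decode_pte(lo, hi):
    """Split a 64-bit entry (given as two dwords) into page address and flag names."""
    pte = (hi << 32) | lo
    addr = pte & ~0xFFF  # assumes 4 KiB pages, so the low 12 bits are flags
    flags = [name for bit, name in PTE_FLAGS if pte & bit]
    return addr, flags


# First entry from the dump taken before the move to GART.
addr, flags = decode_pte(0x02914061, 0x00000000)
print(hex(addr), flags)   # 0x2914000 ['VALID', 'READABLE', 'WRITEABLE']

# The entry as it looks after coming back.
print(decode_pte(0x00000000, 0x00000000))   # (0, [])
```

Under these assumptions the original entries are valid, readable and
writeable mappings of consecutive pages, while the zeroed entries have
no valid bit set at all.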

Thanks,
Christian.

Marek Olšák May 29, 2014, 4:51 p.m. UTC | #3

Can we disable evictions for page tables, e.g. by removing them from the
LRU list?

Marek

Alex Deucher May 29, 2014, 4:52 p.m. UTC | #4

On Thu, May 29, 2014 at 12:30 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Hi Marek & Alex,
>
> I've found the issue why forcefully evicting page tables sometimes crashes
> the box.
>
> Well this is a typical hexdump page table before it is moved to GART:
> 000117f000  02914061 00000000
> 000117f008  02915061 00000000
> 000117f010  02916061 00000000
> 000117f018  02917061 00000000
> 000117f020  02918061 00000000
>
> And it looks like this when it comes back:
> 0006102000  00000000 00000000
> *
>
> Ideas? I don't really have an explanation for this. Moving buffers around
> otherwise seems to work perfectly fine.

Nothing I can think of off hand.  Might be worth trying CP DMA rather
than SDMA for BO moves to see if we can narrow it down a bit more.
Might also try the other SDMA ring.

Alex

Christian König May 29, 2014, 4:59 p.m. UTC | #5

Yeah, that will work around it for now.

But the general problem is that we have a memory corruption here. We
just didn't notice it earlier, because overwriting a texture or vectors
with zeros only results in random misrendering.

Only when the corruption hits a shader or, as in this case, a page table
does it really manifest as a bad crash.
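
One generic way to catch this kind of silent corruption is to compare a
digest of the buffer before and after a move. The sketch below only
illustrates that idea in user space: the two move functions are
hypothetical stand-ins for a copy engine, not anything from the driver.

```python
import hashlib


def survives_move(buf, move):
    """Return True if the buffer contents are identical after move(buf)."""
    before = hashlib.sha256(buf).digest()
    after = hashlib.sha256(move(buf)).digest()
    return before == after


def faithful_move(buf):
    """Stand-in for a working copy: returns the same bytes."""
    return bytes(buf)


def zeroing_move(buf):
    """Stand-in for the observed failure: the destination reads back as zeros."""
    return bytes(len(buf))


page = bytes(range(256)) * 16   # 4 KiB of non-zero test data
print(survives_move(page, faithful_move))   # True
print(survives_move(page, zeroing_move))    # False
```

A check like this flags the problem regardless of what the buffer holds,
instead of waiting for it to hit a shader or a page table.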

Going to dig deeper into this,
Christian.

On 29.05.2014 18:51, Marek Olšák wrote:
> Can disable evictions for page tables, e.g. by removing them from the LRU list?
>
> Marek
Christian König May 30, 2014, 3:57 p.m. UTC | #6

Well, the good news is that when I use the CP DMA instead of the SDMA,
everything seems to work fine.

Unfortunately, using the CP DMA results in completely different timing
(because of the additional sync needed), so I'm not sure whether the
problem is really fixed or just masked.

Christian.

On 29.05.2014 18:52, Alex Deucher wrote:
> On Thu, May 29, 2014 at 12:30 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Hi Marek & Alex,
>>
>> I've found the issue why forcefully evicting page tables sometimes crashes
>> the box.
>>
>> Well this is a typical hexdump page table before it is moved to GART:
>> 000117f000  02914061 00000000
>> 000117f008  02915061 00000000
>> 000117f010  02916061 00000000
>> 000117f018  02917061 00000000
>> 000117f020  02918061 00000000
>>
>> And it looks like this when it comes back:
>> 0006102000  00000000 00000000
>> *
>>
>> Ideas? I don't really have an explanation for this. Moving buffers around
>> otherwise seems to work perfectly fine.
> Nothing I can think of off hand.  Might be worth trying CP DMA rather
> than SDMA for BO moves to see if we can narrow it down a bit more.
> Might also try the other SDMA ring.
>
> Alex
