Message ID | CAAxE2A6cV3M+MhWGnDKEJtDQ2FcqiX0Kp6vKu95OXx76JuFY9Q@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
I already tried a similar patch as well, without any more noticeable crashes. But I'm going to give this another round with your patch and openarena.

Thanks,
Christian.

On 27.05.2014 23:55, Marek Olšák wrote:
> Hi Christian,
>
> I test on Bonaire (ChipID = 0x665c). Unfortunately, the hangs are not
> fixed yet. They are very rare and very random. Therefore, I have come
> up with a patch which evicts page tables between IBs. See the
> attachment. With that patch applied, the system starts fine, compiz
> and glxgears work, but once I start playing openarena, it locks up
> pretty quickly.
>
> The patch shouldn't do anything in theory, because pages are moved
> back to VRAM immediately after that. However, the VRAM address of page
> tables may end up being different from before, which might be the root
> cause.
>
> Marek
>
> On Wed, May 14, 2014 at 2:11 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Crap, any chance you can narrow it down a bit more?
>>
>> I've just tried a piglit quick test on my Bonaire and it seems to work
>> perfectly fine.
>>
>> What hw do you test on?
>>
>> Regards,
>> Christian.
>>
>> On 13.05.2014 23:21, Marek Olšák wrote:
>>
>>> Hi Christian,
>>>
>>> Even though some regressions are fixed by these patches:
>>>
>>> drm/radeon: fix page directory update size estimation
>>> drm/radeon: fix buffer placement under memory pressure v2
>>>
>>> and indeed, the texelFetch tests no longer hang, there is one more
>>> hang which needs to be fixed. :( All I know is the exact same commit
>>> causes it and it can only be reproduced by running whole piglit with
>>> concurrency enabled.
>>>
>>> My kernel git log:
>>>
>>> * 2ba22c8 - drm/radeon: fix buffer placement under memory pressure v2
>>> (10 hours ago) <Christian König>
>>> * 3af91e5 - drm/radeon: fix page directory update size estimation (21
>>> hours ago) <Christian König>
>>> * 6d2f294 - drm/radeon: use normal BOs for the page tables v4 (2
>>> months ago) <Christian König>
>>> * fa68834 - drm/radeon: further cleanup vm flushing & fencing (2
>>> months ago) <Christian König>
>>>
>>> fa68834 doesn't hang, but 2ba22c8 hangs, which means 6d2f294 or either
>>> of the two fixes is the first bad commit.
>>>
>>> Marek
>>>
>>> On Fri, May 9, 2014 at 8:03 PM, Marek Olšák <maraeo@gmail.com> wrote:
>>>> Hi Christian,
>>>>
>>>> This commit which first appeared in 3.15-rc1 causes hangs on Bonaire:
>>>>
>>>> commit 6d2f2944e95e504a7d33385eeeb9bb7fcca72592
>>>> Author: Christian König <christian.koenig@amd.com>
>>>> Date: Thu Feb 20 13:42:17 2014 +0100
>>>>
>>>> drm/radeon: use normal BOs for the page tables v4
>>>>
>>>> No need to make it more complicated than necessary,
>>>> just allocate the page tables as normal BO and
>>>> flush whenever the address change.
>>>>
>>>> v2: update comments and function name
>>>> v3: squash bug fixes, page directory and tables patch
>>>> v4: rebased on Mareks changes
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>
>>>>
>>>> Reverting the commit gives me a lot of merge conflicts.
>>>>
>>>> The simplest way to reproduce the hangs is to run piglit with these
>>>> parameters:
>>>> -t texelFetch.fs
>>>>
>>>> Some of the tests allocate a lot of MSAA textures and the tests also
>>>> run in parallel, which creates a lot of memory pressure and probably
>>>> causes buffer evictions.
>>>>
>>>> Any idea what is wrong with it?
>>>>
>>>> Thanks,
>>>>
>>>> Marek
>>
Hi Marek & Alex,

I've found the issue why forcefully evicting page tables sometimes crashes the box.

Well, this is a typical hexdump of a page table before it is moved to GART:
000117f000 02914061 00000000
000117f008 02915061 00000000
000117f010 02916061 00000000
000117f018 02917061 00000000
000117f020 02918061 00000000

And it looks like this when it comes back:
0006102000 00000000 00000000
*

Ideas? I don't really have an explanation for this. Moving buffers around otherwise seems to work perfectly fine.

Thanks,
Christian.

On 28.05.2014 12:38, Christian König wrote:
> I already tried a similar patch as well, without any more noticeable
> crashes. But I'm going to give this another round with your patch and
> openarena.
>
> Thanks,
> Christian.
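For readers following the dump in the message above, here is a minimal sketch of how such entries could be decoded and how a wiped table could be detected. It assumes each dump line is an offset followed by the low and high 32-bit words of one 64-bit PTE, that the low 12 bits carry flags, and that the flag bits match the `R600_PTE_*` definitions in the radeon driver; the `PTE_*` names, `struct pte_info`, and all helper functions are hypothetical and exist only for illustration. Under those assumptions, `0x02914061` would read as address `0x02914000` with the valid, readable, and writeable bits set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed flag bits, modeled on the R600_PTE_* definitions in the radeon
 * driver. Check them against radeon.h before relying on them. */
#define PTE_VALID      (1u << 0)
#define PTE_SYSTEM     (1u << 1)
#define PTE_SNOOPED    (1u << 2)
#define PTE_READABLE   (1u << 5)
#define PTE_WRITEABLE  (1u << 6)

/* Assumption: low 12 bits are flags, the rest is a 4 KiB-aligned address. */
#define PTE_FLAGS_MASK 0xfffull

/* Hypothetical decoded view of one page table entry. */
struct pte_info {
    uint64_t addr;
    uint32_t flags;
};

/* Combine the two 32-bit words of a dump line (low word first) into a PTE. */
static uint64_t pte_from_words(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Split a raw PTE into its address and flag parts. */
static struct pte_info pte_decode(uint64_t pte)
{
    struct pte_info info;

    info.addr = pte & ~PTE_FLAGS_MASK;
    info.flags = (uint32_t)(pte & PTE_FLAGS_MASK);
    return info;
}

/* Return true when every entry is zero, i.e. the table looks wiped. */
static bool pte_table_is_zeroed(const uint64_t *entries, size_t count)
{
    assert(entries != NULL);

    for (size_t i = 0; i < count; i++) {
        if (entries[i] != 0)
            return false;
    }
    return true;
}

/* Print one decoded entry, for eyeballing a dump. */
static void pte_print(size_t index, uint64_t pte)
{
    struct pte_info info = pte_decode(pte);

    printf("pte[%zu]: addr=0x%010llx valid=%d system=%d readable=%d writeable=%d\n",
           index, (unsigned long long)info.addr,
           !!(info.flags & PTE_VALID), !!(info.flags & PTE_SYSTEM),
           !!(info.flags & PTE_READABLE), !!(info.flags & PTE_WRITEABLE));
}
```

Running `pte_table_is_zeroed()` on a copy of the table taken right after the move back to VRAM would flag the "comes back as all zeros" case without having to wait for a GPU hang.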
Can we disable evictions for page tables, e.g. by removing them from the LRU list?

Marek

On Thu, May 29, 2014 at 6:30 PM, Christian König <deathsimple@vodafone.de> wrote:
> Hi Marek & Alex,
>
> I've found the issue why forcefully evicting page tables sometimes crashes
> the box.
>
> Well, this is a typical hexdump of a page table before it is moved to GART:
> 000117f000 02914061 00000000
> 000117f008 02915061 00000000
> 000117f010 02916061 00000000
> 000117f018 02917061 00000000
> 000117f020 02918061 00000000
>
> And it looks like this when it comes back:
> 0006102000 00000000 00000000
> *
>
> Ideas? I don't really have an explanation for this. Moving buffers around
> otherwise seems to work perfectly fine.
>
> Thanks,
> Christian.
On Thu, May 29, 2014 at 12:30 PM, Christian König <deathsimple@vodafone.de> wrote:
> Hi Marek & Alex,
>
> I've found the issue why forcefully evicting page tables sometimes crashes
> the box.
>
> Well, this is a typical hexdump of a page table before it is moved to GART:
> 000117f000 02914061 00000000
> 000117f008 02915061 00000000
> 000117f010 02916061 00000000
> 000117f018 02917061 00000000
> 000117f020 02918061 00000000
>
> And it looks like this when it comes back:
> 0006102000 00000000 00000000
> *
>
> Ideas? I don't really have an explanation for this. Moving buffers around
> otherwise seems to work perfectly fine.

Nothing I can think of offhand. Might be worth trying CP DMA rather than SDMA for BO moves to see if we can narrow it down a bit more. Might also try the other SDMA ring.

Alex

>
> Thanks,
> Christian.
Yeah, that will work around it for now.

But the general problem is that we have memory corruption here; we just didn't notice it earlier because clearing a texture or vectors with zeros only results in random misrendering. Only when you hit a shader, or in this case a page table, does it really manifest as a bad crash.

Going to dig deeper into this,
Christian.

On 29.05.2014 18:51, Marek Olšák wrote:
> Can we disable evictions for page tables, e.g. by removing them from the LRU list?
>
> Marek
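The point above, that a faulty move goes unnoticed unless the moved data is actually checked, can be illustrated with a small user-space sketch. This is not driver code: `move_fn`, `move_roundtrip_ok()`, `good_move()`, and `zeroing_move()` are hypothetical names, and plain `memcpy`/`memset` stand in for whichever copy engine performs the real buffer move.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback standing in for the engine that performs a copy. */
typedef void (*move_fn)(void *dst, const void *src, size_t len);

/* Move src -> bounce -> back and check that the round trip preserved
 * every byte. Returns false on mismatch or on allocation failure. */
static bool move_roundtrip_ok(const void *src, size_t len, move_fn move)
{
    bool ok = false;
    void *bounce;
    void *back;

    assert(src != NULL && move != NULL);

    bounce = malloc(len);
    back = malloc(len);
    if (bounce && back) {
        move(bounce, src, len);  /* the "eviction" */
        move(back, bounce, len); /* the move back */
        ok = memcmp(src, back, len) == 0;
    }
    free(bounce);
    free(back);
    return ok;
}

/* A well-behaved mover: copies the data unchanged. */
static void good_move(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* A faulty mover that loses the data and leaves zeros behind, mimicking
 * the zeroed page table seen in the dump. */
static void zeroing_move(void *dst, const void *src, size_t len)
{
    (void)src;
    memset(dst, 0, len);
}
```

With a table such as `{ 0x02914061, 0x02915061 }`, `move_roundtrip_ok()` is expected to succeed with `good_move` and fail with `zeroing_move`. A texture that was all zeros to begin with would pass either way, which is the masking effect described above.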
Well, the good news is that when I use the CP DMA instead of the SDMA everything seems to work fine.

Unfortunately, using the CP DMA has completely different timing (because of the additional sync needed), so I'm not sure if it's really fixed or just masked.

Christian.

On 29.05.2014 18:52, Alex Deucher wrote:
> On Thu, May 29, 2014 at 12:30 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Hi Marek & Alex,
>>
>> I've found the issue why forcefully evicting page tables sometimes crashes
>> the box.
>>
>> Well, this is a typical hexdump of a page table before it is moved to GART:
>> 000117f000 02914061 00000000
>> 000117f008 02915061 00000000
>> 000117f010 02916061 00000000
>> 000117f018 02917061 00000000
>> 000117f020 02918061 00000000
>>
>> And it looks like this when it comes back:
>> 0006102000 00000000 00000000
>> *
>>
>> Ideas? I don't really have an explanation for this. Moving buffers around
>> otherwise seems to work perfectly fine.
> Nothing I can think of offhand. Might be worth trying CP DMA rather
> than SDMA for BO moves to see if we can narrow it down a bit more.
> Might also try the other SDMA ring.
>
> Alex
diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
index d9ab99f..365e36f 100644
--- a/drivers/gpu/drm/radeon/radeon_vm.c
+++ b/drivers/gpu/drm/radeon/radeon_vm.c
@@ -116,6 +116,19 @@ void radeon_vm_manager_fini(struct radeon_device *rdev)
 	rdev->vm_manager.enabled = false;
 }
 
+static void force_gtt(struct radeon_bo *bo)
+{
+	if (radeon_bo_reserve(bo, false))
+		return;
+
+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_GTT);
+
+	if (ttm_bo_validate(&bo->tbo, &bo->placement, true, false)) {
+		DRM_ERROR("failed to force a GTT placement\n");
+	}
+	radeon_bo_unreserve(bo);
+}
+
 /**
  * radeon_vm_get_bos - add the vm BOs to a validation list
  *
@@ -147,6 +160,8 @@ struct radeon_cs_reloc *radeon_vm_get_bos(struct radeon_device *rdev,
 	list[0].handle = 0;
 	list_add(&list[0].tv.head, head);
 
+	force_gtt(vm->page_directory);
+
 	for (i = 0, idx = 1; i <= vm->max_pde_used; i++) {
 		if (!vm->page_tables[i].bo)
 			continue;
@@ -159,6 +174,8 @@ struct radeon_cs_reloc *radeon_vm_get_bos(struct radeon_device *rdev,
 		list[idx].tiling_flags = 0;
 		list[idx].handle = 0;
 		list_add(&list[idx++].tv.head, head);
+
+		force_gtt(vm->page_tables[i].bo);
 	}
 
 	return list;