Message ID | CAAxE2A4HDf3pzfdz4pA7m=etD=RDYf77AobWSZOKiEnVQV8nOw@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Hi Marek,

ah, yes! Piglit in combination with that patch can indeed crash the box.

Going to investigate now that I can reproduce it.

Thanks,
Christian.

On 13.06.2014 15:19, Marek Olšák wrote:
> Hi,
>
> With my "force_gtt" patch, Cape Verde is unstable too, so all GCN
> chips are affected.
>
> I recommend applying that patch, because it will reproduce the problem
> faster. Without it, the hangs are very rare and it may take a while
> before they occur.
>
> Marek
>
> On Thu, Jun 12, 2014 at 1:23 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Please do so, and you might want to try 3.15.0 as well.
>>
>> I've tested multiple piglit runs over night with my Bonaire and 3.15.0 and
>> that seemed to work perfectly fine.
>>
>> Going to test Alex drm-next-3.16 a bit more as well.
>>
>> Christian.
>>
>> On 11.06.2014 12:56, Marek Olšák wrote:
>>
>>> I only tested Bonaire. I can test Cape Verde if needed.
>>>
>>> Marek
>>>
>>> On Wed, Jun 11, 2014 at 11:29 AM, Christian König
>>> <deathsimple@vodafone.de> wrote:
>>>> Crap, I already wanted to check back with you if that really fixes your
>>>> problems.
>>>>
>>>> Thanks for the info, this crash also only happens on CIK doesn't it?
>>>>
>>>> Christian.
>>>>
>>>> On 11.06.2014 01:30, Marek Olšák wrote:
>>>>
>>>>> Sorry to tell you the bad news. This patch doesn't fix the hangs on my
>>>>> machine.
>>>>>
>>>>> I tested drm-next-3.16 from Alex's tree. I also switched copying from
>>>>> SDMA to CP DMA, which hung too.
>>>>>
>>>>> I also tried this:
>>>>>
>>>>> git checkout (the problematic commit):
>>>>> 6d2f294 - drm/radeon: use normal BOs for the page tables v4
>>>>>
>>>>> git cherry-pick (fixes):
>>>>> 0e97703c - drm/radeon: add define for flags used in R600+ GTT
>>>>> 0986c1a5 - drm/radeon: stop poisoning the GART TLB
>>>>> 4906f689 - drm/radeon: fix page directory update size estimation
>>>>> 4b095566 - drm/radeon: fix buffer placement under memory pressure v2
>>>>>
>>>>> Then I tested both SDMA and CP DMA copying. Both were unstable.
>>>>>
>>>>> Testing was done with piglit / quick.tests.
>>>>>
>>>>> Marek
>>>>>
>>>>>
>>>>> On Wed, Jun 4, 2014 at 3:29 PM, Christian König
>>>>> <deathsimple@vodafone.de> wrote:
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>>
>>>>>> When we set the valid bit on invalid GART entries they are
>>>>>> loaded into the TLB when an adjacent entry is loaded. This
>>>>>> poisons the TLB with invalid entries which are sometimes
>>>>>> not correctly removed on TLB flush.
>>>>>>
>>>>>> For stable inclusion the patch probably needs to be modified a bit.
>>>>>>
>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> ---
>>>>>>  drivers/gpu/drm/radeon/rs600.c | 5 ++++-
>>>>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c
>>>>>> index 0a8be63..e0465b2 100644
>>>>>> --- a/drivers/gpu/drm/radeon/rs600.c
>>>>>> +++ b/drivers/gpu/drm/radeon/rs600.c
>>>>>> @@ -634,7 +634,10 @@ int rs600_gart_set_page(struct radeon_device *rdev, int i, uint64_t addr)
>>>>>>  		return -EINVAL;
>>>>>>  	}
>>>>>>  	addr = addr & 0xFFFFFFFFFFFFF000ULL;
>>>>>> -	addr |= R600_PTE_GART;
>>>>>> +	if (addr == rdev->dummy_page.addr)
>>>>>> +		addr |= R600_PTE_SYSTEM | R600_PTE_SNOOPED;
>>>>>> +	else
>>>>>> +		addr |= R600_PTE_GART;
>>>>>>  	writeq(addr, ptr + (i * 8));
>>>>>>  	return 0;
>>>>>>  }
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
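The flag selection in the quoted rs600.c hunk can be tried outside the kernel. What follows is a minimal userspace sketch in C, assuming illustrative bit positions for the PTE flags: the `PTE_*` names and values and the `make_gart_pte` helper are hypothetical stand-ins, not the driver's definitions. It only shows the behaviour the commit message relies on, namely that an entry pointing at the dummy page keeps its valid bit clear while every other entry gets the full GART flag set.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit layout, assumed for this sketch (not taken from the thread). */
#define PTE_VALID     (1ULL << 0)
#define PTE_SYSTEM    (1ULL << 1)
#define PTE_SNOOPED   (1ULL << 2)
#define PTE_READABLE  (1ULL << 5)
#define PTE_WRITEABLE (1ULL << 6)
#define PTE_GART      (PTE_VALID | PTE_SYSTEM | PTE_SNOOPED | \
                       PTE_READABLE | PTE_WRITEABLE)

/*
 * Hypothetical stand-in for the flag selection in rs600_gart_set_page().
 * Returns the 64-bit entry that would be written into the GART table.
 */
static uint64_t make_gart_pte(uint64_t addr, uint64_t dummy_page_addr)
{
	/* The comparison below only works if the dummy page is page aligned. */
	assert((dummy_page_addr & 0xFFFULL) == 0);

	addr &= 0xFFFFFFFFFFFFF000ULL;	/* keep only the page-aligned part */

	if (addr == dummy_page_addr)
		addr |= PTE_SYSTEM | PTE_SNOOPED;	/* unbound slot: valid bit stays clear */
	else
		addr |= PTE_GART;			/* bound page: valid, readable, writeable */

	return addr;
}
```

Calling `make_gart_pte(dummy, dummy)` yields an entry without `PTE_VALID`, while any other address gets the full `PTE_GART` set. Whether that is sufficient on real hardware is what the rest of the thread debates.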
On Fri, Jun 13, 2014 at 11:45 AM, Christian König
<deathsimple@vodafone.de> wrote:
> Hi Marek,
>
> ah, yes! Piglit in combination with that patch can indeed crash the box.
>
> Going to investigate now that I can reproduce it.

I wonder if it's a clockgating issue with the MC or BIF? You might
try adjusting the rdev->cg_flags (try setting it to 0) in
radeon_asic.c or disabling dpm.

Alex

>
> Thanks,
> Christian.
On 13.06.2014 23:31, Alex Deucher wrote:
> On Fri, Jun 13, 2014 at 11:45 AM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Hi Marek,
>>
>> ah, yes! Piglit in combination with that patch can indeed crash the box.
>>
>> Going to investigate now that I can reproduce it.
> I wonder if it's a clockgating issue with the MC or BIF? You might
> try adjusting the rdev->cg_flags (try setting it to 0) in
> radeon_asic.c or disabling dpm.

Unfortunately that was just a false alarm.

I was on a branch which didn't have the "stop poisoning the GART TLB"
patch; after applying this patch I can again let piglit run for the
whole night without a lockup.

No idea what goes wrong when Marek runs piglit, but 3.15.0 + "stop
poisoning the GART TLB" + "force_gtt" is rock solid here.

Christian.

>
> Alex
On 15.06.2014 21:48, Christian König wrote:
> On 13.06.2014 23:31, Alex Deucher wrote:
>> On Fri, Jun 13, 2014 at 11:45 AM, Christian König
>> <deathsimple@vodafone.de> wrote:
>>> Hi Marek,
>>>
>>> ah, yes! Piglit in combination with that patch can indeed crash the box.
>>>
>>> Going to investigate now that I can reproduce it.
>> I wonder if it's a clockgating issue with the MC or BIF? You might
>> try adjusting the rdev->cg_flags (try setting it to 0) in
>> radeon_asic.c or disabling dpm.
>
> Unfortunately that was just a false alarm.
>
> I was on a branch which didn't have the "stop poisoning the GART
> TLB" patch; after applying this patch I can again let piglit run for the
> whole night without a lockup.
>
> No idea what goes wrong when Marek runs piglit, but 3.15.0+"stop
> poisoning the GART TLB"+"force_gtt" is rock solid here.

FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is
fine. 3.15 seems stable on Kaveri though, but I haven't tried the
force_gtt patch on that yet.

There have also been a number of bug reports about stability regressions
in 3.15 on various SI and CIK cards. It seems likely that at least some
of those are related to this issue as well.

If we can't figure out the problem soon, we probably need to revert the
'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
Hi Michel,

3.15 doesn't contain Christian's fix yet, so it should be always
broken for everybody. The fix is currently only in 3.16.

Alternatively, you can cherry-pick the fix to 3.15, but it doesn't
apply cleanly.

There is a workaround in 3.15 which disables sDMA and uses CP DMA for
copying buffers. It seems to help Christian's machine, but not mine.

When I said the kernel driver was broken, I meant that
it was broken *with* the fix applied regardless of which engine was
used for the copying.

Marek

On Thu, Jun 19, 2014 at 3:48 AM, Michel Dänzer <michel@daenzer.net> wrote:
> FWIW, 3.15 doesn't survive piglit on my Bonaire either, but 3.14 is
> fine. 3.15 seems stable on Kaveri though, but I haven't tried the
> force_gtt patch on that yet.
>
> There have also been a number of bug reports about stability regressions
> in 3.15 on various SI and CIK cards. It seems likely that at least some
> of those are related to this issue as well.
>
> If we can't figure out the problem soon, we probably need to revert the
> 'Use normal BOs for page tables' and dependent changes at least for 3.15.y?
>
>
> --
> Earthling Michel Dänzer            |                  http://www.amd.com
> Libre software enthusiast          |                Mesa and X developer
Hi Marek,

> There is a workaround in 3.15 which disables sDMA and uses CP DMA for
> copying buffers. It seems to help Christian's machine, but not mine.

By stressing the box with piglit I was able to bring my machine down
with CP DMA as well; only cherry-picking the "stop poisoning the GART
TLB" patch really fixed that issue.

But I'm pretty sure that even with "stop poisoning the GART TLB"
back-ported we still have at least one stability issue I can't reproduce.

Christian.

On 19.06.2014 12:20, Marek Olšák wrote:
> Hi Michel,
>
> 3.15 doesn't contain Christian's fix yet, so it should be always
> broken for everybody. The fix is currently only in 3.16.
>
> Alternatively, you can cherry-pick the fix to 3.15, but it doesn't
> apply cleanly.
>
> There is a workaround in 3.15 which disables sDMA and uses CP DMA for
> copying buffers. It seems to help Christian's machine, but not mine.
>
> When I said the kernel driver was broken, I meant that
> it was broken *with* the fix applied regardless of which engine was
> used for the copying.
>
> Marek
On 19.06.2014 19:20, Marek Olšák wrote:
> Hi Michel,
>
> 3.15 doesn't contain Christian's fix yet, so it should be always
> broken for everybody. The fix is currently only in 3.16.
>
> Alternatively, you can cherry-pick the fix to 3.15, but it doesn't
> apply cleanly.

That's a good point. Sorry, I should have mentioned I've been testing
with the GART poisoning fix backported to 3.15.

> There is a workaround in 3.15 which disables sDMA and uses CP DMA for
> copying buffers. It seems to help Christian's machine, but not mine.

I've been testing with CP DMA on Bonaire FWIW.
From 504c27c21131f0a2b472e8531ed4630454fe1471 Mon Sep 17 00:00:00 2001
From: Marek Olšák <marek.olsak@amd.com>
Date: Fri, 13 Jun 2014 15:17:26 +0200
Subject: [PATCH] force_gtt

---
 drivers/gpu/drm/radeon/radeon_vm.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_vm.c b/drivers/gpu/drm/radeon/radeon_vm.c
index c11b71d..67f7658 100644
--- a/drivers/gpu/drm/radeon/radeon_vm.c
+++ b/drivers/gpu/drm/radeon/radeon_vm.c
@@ -116,6 +116,19 @@ void radeon_vm_manager_fini(struct radeon_device *rdev)
 	rdev->vm_manager.enabled = false;
 }
 
+static void force_gtt(struct radeon_bo *bo)
+{
+	if (radeon_bo_reserve(bo, false))
+		return;
+
+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_GTT);
+
+	if (ttm_bo_validate(&bo->tbo, &bo->placement, true, false)) {
+		DRM_ERROR("failed to force a GTT placement\n");
+	}
+	radeon_bo_unreserve(bo);
+}
+
 /**
  * radeon_vm_get_bos - add the vm BOs to a validation list
  *
@@ -147,6 +160,8 @@ struct radeon_cs_reloc *radeon_vm_get_bos(struct radeon_device *rdev,
 	list[0].handle = 0;
 	list_add(&list[0].tv.head, head);
 
+	force_gtt(vm->page_directory);
+
 	for (i = 0, idx = 1; i <= vm->max_pde_used; i++) {
 		if (!vm->page_tables[i].bo)
 			continue;
@@ -159,6 +174,8 @@ struct radeon_cs_reloc *radeon_vm_get_bos(struct radeon_device *rdev,
 		list[idx].tiling_flags = 0;
 		list[idx].handle = 0;
 		list_add(&list[idx++].tv.head, head);
+
+		force_gtt(vm->page_tables[i].bo);
 	}
 
 	return list;
-- 
1.9.1