Message ID | 20110415131152.GJ18463@8bytes.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
* Joerg Roedel <joro@8bytes.org> wrote: > On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote: > > we definitely want to also understand the reason for things not > > working, even if we do revert.. > > Okay, here it is. > > After experimenting with different configurations for the north-bridge > it turned out that a GART related MCE fires at the time the machine > reboots. BIOSes configure the machine to sync-flood in that case which > causes a reboot. > > After decoding the MCE it turned out to be a GART TBL Wlk Error. Such > errors can happen if devices (speculativly) access GART ranges mapped > invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors > at all. But unfortunatly some BIOSes (including the one on my laptop) > forget to do this. > > Below is a patch which disables these errors if the BIOS didn't do it. > It fixes the problem on my site. Ok, but how did the allocation changes start triggering this error in v2.6.39-rc1? There must still be some layout specific thing here, right? Do we understand the details of that as well? Thanks, Ingo
On Fri, Apr 15, 2011 at 03:11:52PM +0200, Joerg Roedel wrote: > On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote: > > we definitely want to also understand the reason for things not > > working, even if we do revert.. > > Okay, here it is. > > After experimenting with different configurations for the north-bridge > it turned out that a GART related MCE fires at the time the machine > reboots. BIOSes configure the machine to sync-flood in that case which > causes a reboot. > > After decoding the MCE it turned out to be a GART TBL Wlk Error. Such > errors can happen if devices (speculativly) access GART ranges mapped > invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors > at all. But unfortunatly some BIOSes (including the one on my laptop) > forget to do this. > > Below is a patch which disables these errors if the BIOS didn't do it. > It fixes the problem on my site. > > Alexandre, can you try this patch on your machine too, please? > > Regards, > > Joerg > > From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001 > From: Joerg Roedel <joerg.roedel@amd.com> > Date: Fri, 15 Apr 2011 14:47:40 +0200 > Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it > > This patch disables GartTlbWlk errors on AMD Fam10h CPUs if > the BIOS forgets to do is (or is just too old). Letting > these errors enabled can cause a sync-flood on the CPU > causing a reboot. > > This patch is the fix for > > https://bugzilla.kernel.org/show_bug.cgi?id=33012 > > on my machine. > > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Joerg, What about tagging this patch for stable/longterm releases? Potentially there are other cases where certain combinations of hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If the BIOS doesn't follow the BKDG recommendation to mask these errors, the system will hang/reboot. Thus I think having this quirk in .32 and .38 (at least) is useful. Andreas
On 11-04-15 09:11 AM, Joerg Roedel wrote: > On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote: >> we definitely want to also understand the reason for things not >> working, even if we do revert.. > Okay, here it is. > > After experimenting with different configurations for the north-bridge > it turned out that a GART related MCE fires at the time the machine > reboots. BIOSes configure the machine to sync-flood in that case which > causes a reboot. > > After decoding the MCE it turned out to be a GART TBL Wlk Error. Such > errors can happen if devices (speculativly) access GART ranges mapped > invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors > at all. But unfortunatly some BIOSes (including the one on my laptop) > forget to do this. > > Below is a patch which disables these errors if the BIOS didn't do it. > It fixes the problem on my site. > > Alexandre, can you try this patch on your machine too, please? > > Regards, > > Joerg > > From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001 > From: Joerg Roedel <joerg.roedel@amd.com> > Date: Fri, 15 Apr 2011 14:47:40 +0200 > Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it > > This patch disables GartTlbWlk errors on AMD Fam10h CPUs if > the BIOS forgets to do is (or is just too old). Letting > these errors enabled can cause a sync-flood on the CPU > causing a reboot. > > This patch is the fix for > > https://bugzilla.kernel.org/show_bug.cgi?id=33012 > > on my machine. > > Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> > --- > arch/x86/include/asm/msr-index.h | 4 ++++ > arch/x86/kernel/cpu/amd.c | 19 +++++++++++++++++++ > 2 files changed, 23 insertions(+), 0 deletions(-) > > diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h > index fd5a1f3..3cce714 100644 > --- a/arch/x86/include/asm/msr-index.h > +++ b/arch/x86/include/asm/msr-index.h > @@ -96,11 +96,15 @@ > #define MSR_IA32_MC0_ADDR 0x00000402 > #define MSR_IA32_MC0_MISC 0x00000403 > > +#define MSR_AMD64_MC0_MASK 0xc0010044 > + > #define MSR_IA32_MCx_CTL(x) (MSR_IA32_MC0_CTL + 4*(x)) > #define MSR_IA32_MCx_STATUS(x) (MSR_IA32_MC0_STATUS + 4*(x)) > #define MSR_IA32_MCx_ADDR(x) (MSR_IA32_MC0_ADDR + 4*(x)) > #define MSR_IA32_MCx_MISC(x) (MSR_IA32_MC0_MISC + 4*(x)) > > +#define MSR_AMD64_MCx_MASK(x) (MSR_AMD64_MC0_MASK + (x)) > + > /* These are consecutive and not in the normal 4er MCE bank block */ > #define MSR_IA32_MC0_CTL2 0x00000280 > #define MSR_IA32_MCx_CTL2(x) (MSR_IA32_MC0_CTL2 + (x)) > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c > index 3ecece0..3532d3b 100644 > --- a/arch/x86/kernel/cpu/amd.c > +++ b/arch/x86/kernel/cpu/amd.c > @@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) > /* As a rule processors have APIC timer running in deep C states */ > if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400)) > set_cpu_cap(c, X86_FEATURE_ARAT); > + > + /* > + * Disable GART TLB Walk Errors on Fam10h. We do this here > + * because this is always needed when GART is enabled, even in a > + * kernel which has no MCE support built in. > + */ > + if (c->x86 == 0x10) { > + /* > + * BIOS should disable GartTlbWlk Errors themself. If > + * it doesn't do it here as suggested by the BKDG. > + * > + * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012 > + */ > + u64 mask; > + > + rdmsrl(MSR_AMD64_MCx_MASK(4), mask); > + mask |= (1 << 10); > + wrmsrl(MSR_AMD64_MCx_MASK(4), mask); > + } > } > > #ifdef CONFIG_X86_32 Ok, I'll test it today. Should I apply it on a clean rc3 without any of the other patches? BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? This could be useful in the future. I'll keep you up to date.
On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > Ok, I'll test it today. Should I apply it on a clean rc3 without any of > the other patches? Yes, apply it just on -rc3 without any other patch. > > BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > This could be useful in the future. Cool, thanks Joerg
On Fri, Apr 15, 2011 at 04:04:45PM +0200, Andreas Herrmann wrote: > What about tagging this patch for stable/longterm releases? > > Potentially there are other cases where certain combinations of > hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If > the BIOS doesn't follow the BKDG recommendation to mask these errors, > the system will hang/reboot. Thus I think having this quirk in .32 and > .38 (at least) is useful. Right, thats certainly a good idea. The problem is not specific to GPUs, any other device can trigger this too. Joerg
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote: > Ok, but how did the allocation changes start triggering this error in > v2.6.39-rc1? There must still be some layout specific thing here, right? > Do we understand the details of that as well? No, I must admit that I lack enough knowledge about the GPU hardware to make an guess how this tanslation-request happened. All I can tell is the address that was reported in the MCE, it is 0xa0001000 (==the second page of the GART aperture). Maybe Alex can help here. Alex, may it be possible that the GPU generates DMA requests in the GTT area before the GTT is activated (or the activation is completed)? Or can you imagine any other reason? Joerg
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote: > Ok, but how did the allocation changes start triggering this error in > v2.6.39-rc1? There must still be some layout specific thing here, right? > Do we understand the details of that as well? Well, thinking again about this, the GPU likely generated this DMA request before too (which has an address in the range configured for the GTT on the card), but nobody noticed because they just hit main memory. And with the allocation changes in 39-rc1 the GART aperture started to be on the same address as the GTT (in their respective address spaces) so that the DMA request hit the GART. This caused the MCE and the sync-flood. The open question is why the GPU generates a DMA request with an address that is configured as the GTT base (+1 page) on the card. Joerg
On Fri, Apr 15, 2011 at 10:33 AM, Joerg Roedel <joro@8bytes.org> wrote: > On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote: >> Ok, but how did the allocation changes start triggering this error in >> v2.6.39-rc1? There must still be some layout specific thing here, right? >> Do we understand the details of that as well? > > No, I must admit that I lack enough knowledge about the GPU hardware to > make an guess how this tanslation-request happened. All I can tell is > the address that was reported in the MCE, it is 0xa0001000 (==the second > page of the GART aperture). > > Maybe Alex can help here. Alex, may it be possible that the GPU > generates DMA requests in the GTT area before the GTT is activated (or > the activation is completed)? Or can you imagine any other reason? It shouldn't. The driver binds a dummy page to all entries in the table at init time and whenever the actual pages are unbound. Alex
On Fri, Apr 15, 2011 at 11:46 AM, Joerg Roedel <joro@8bytes.org> wrote: > On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote: >> Ok, but how did the allocation changes start triggering this error in >> v2.6.39-rc1? There must still be some layout specific thing here, right? >> Do we understand the details of that as well? > > Well, thinking again about this, the GPU likely generated this DMA > request before too (which has an address in the range configured for the > GTT on the card), but nobody noticed because they just hit main memory. > And with the allocation changes in 39-rc1 the GART aperture started to > be on the same address as the GTT (in their respective address spaces) > so that the DMA request hit the GART. This caused the MCE and the > sync-flood. > The open question is why the GPU generates a DMA request with an address > that is configured as the GTT base (+1 page) on the card. > > Joerg > Do you also got the write if you load radeon with radeon.no_wb=1 ? I think at this address it's the wb page, or maybe the cp as wb likely take only one page Cheers, Jerome
On 11-04-15 10:27 AM, Joerg Roedel wrote: > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of >> the other patches? > Yes, apply it just on -rc3 without any other patch. > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? >> This could be useful in the future. > Cool, thanks > > > Joerg The patch was applied and tested. It looks fine, I'm able to boot without problem.
* Alexandre Demers <alexandre.f.demers@gmail.com> wrote: > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > >> the other patches? > > Yes, apply it just on -rc3 without any other patch. > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > >> This could be useful in the future. > > Cool, thanks > > > > > > Joerg > The patch was applied and tested. It looks fine, I'm able to boot > without problem. Joerg, mind submitting it with a changelog that includes everything we learned about this bug and all the Tested-by's in place? Is anyone of the opinion that we should try to revert the allocation order/alignment changes in addition to this fix? Thanks, Ingo
On 04/15/2011 12:06 PM, Ingo Molnar wrote: > > Joerg, mind submitting it with a changelog that includes everything we learned > about this bug and all the Tested-by's in place? > > Is anyone of the opinion that we should try to revert the allocation > order/alignment changes in addition to this fix? We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup. Joerg, can you insert some dump code in the drm/radon code to find out which function cause the problem? Thanks Yinghai
On 04/15/2011 12:18 PM, Yinghai Lu wrote: > On 04/15/2011 12:06 PM, Ingo Molnar wrote: > >> >> Joerg, mind submitting it with a changelog that includes everything we learned >> about this bug and all the Tested-by's in place? >> >> Is anyone of the opinion that we should try to revert the allocation >> order/alignment changes in addition to this fix? > > We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup. > > Joerg, > can you insert some dump code in the drm/radon code to find out which function cause the problem? > Yes, I would like to make sure we don't just paper over a real bug (again). I think we still should talk Joerg's patch since it seems to be the right thing to do anyway, but I do want to make sure we don't have a memory-overwrite bug in the kernel that we're papering over. -hpa
On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote: > > * Alexandre Demers <alexandre.f.demers@gmail.com> wrote: > > > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > > >> the other patches? > > > Yes, apply it just on -rc3 without any other patch. > > > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > > >> This could be useful in the future. > > > Cool, thanks > > > > > > > > > Joerg > > The patch was applied and tested. It looks fine, I'm able to boot > > without problem. > > Joerg, mind submitting it with a changelog that includes everything we learned > about this bug and all the Tested-by's in place? Looks like I am too late, it is already applied. But the changelog contains a link to the korg-bugzilla which has all information too. So the information is not lost. Joerg
On Fri, Apr 15, 2011 at 12:18:02PM -0700, Yinghai Lu wrote: > On 04/15/2011 12:06 PM, Ingo Molnar wrote: > > > > > Joerg, mind submitting it with a changelog that includes everything we learned > > about this bug and all the Tested-by's in place? > > > > Is anyone of the opinion that we should try to revert the allocation > > order/alignment changes in addition to this fix? > > We should figure out what is written to 0xa0001000 (main memory) by GPU before internal GART is setup. > > Joerg, > can you insert some dump code in the drm/radon code to find out which > function cause the problem? I am not a GPU expert, but I will see what I can find out. Joerg
* Joerg Roedel <joro@8bytes.org> wrote: > On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote: > > > > * Alexandre Demers <alexandre.f.demers@gmail.com> wrote: > > > > > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > > > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > > > >> the other patches? > > > > Yes, apply it just on -rc3 without any other patch. > > > > > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > > > >> This could be useful in the future. > > > > Cool, thanks > > > > > > > > > > > > Joerg > > > The patch was applied and tested. It looks fine, I'm able to boot > > > without problem. > > > > Joerg, mind submitting it with a changelog that includes everything we learned > > about this bug and all the Tested-by's in place? > > Looks like I am too late, it is already applied. But the changelog > contains a link to the korg-bugzilla which has all information too. So > the information is not lost. Yeah. In this case getting the fix into -rc4 in a timely manner looked more important than waiting for an updated changelog :-) Thanks, Ingo
On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote: > Do you also got the write if you load radeon with radeon.no_wb=1 ? > I think at this address it's the wb page, or maybe the cp as wb likely > take only one page radeon.no_wb=1 makes no difference. The box still reboots. Joerg
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index fd5a1f3..3cce714 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -96,11 +96,15 @@ #define MSR_IA32_MC0_ADDR 0x00000402 #define MSR_IA32_MC0_MISC 0x00000403 +#define MSR_AMD64_MC0_MASK 0xc0010044 + #define MSR_IA32_MCx_CTL(x) (MSR_IA32_MC0_CTL + 4*(x)) #define MSR_IA32_MCx_STATUS(x) (MSR_IA32_MC0_STATUS + 4*(x)) #define MSR_IA32_MCx_ADDR(x) (MSR_IA32_MC0_ADDR + 4*(x)) #define MSR_IA32_MCx_MISC(x) (MSR_IA32_MC0_MISC + 4*(x)) +#define MSR_AMD64_MCx_MASK(x) (MSR_AMD64_MC0_MASK + (x)) + /* These are consecutive and not in the normal 4er MCE bank block */ #define MSR_IA32_MC0_CTL2 0x00000280 #define MSR_IA32_MCx_CTL2(x) (MSR_IA32_MC0_CTL2 + (x)) diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 3ecece0..3532d3b 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c) /* As a rule processors have APIC timer running in deep C states */ if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400)) set_cpu_cap(c, X86_FEATURE_ARAT); + + /* + * Disable GART TLB Walk Errors on Fam10h. We do this here + * because this is always needed when GART is enabled, even in a + * kernel which has no MCE support built in. + */ + if (c->x86 == 0x10) { + /* + * BIOS should disable GartTlbWlk Errors themself. If + * it doesn't do it here as suggested by the BKDG. + * + * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012 + */ + u64 mask; + + rdmsrl(MSR_AMD64_MCx_MASK(4), mask); + mask |= (1 << 10); + wrmsrl(MSR_AMD64_MCx_MASK(4), mask); + } } #ifdef CONFIG_X86_32