Linux 2.6.39-rc3

Message ID 20110414085624.GC18463@8bytes.org (mailing list archive)
State New, archived

Commit Message

Joerg Roedel April 14, 2011, 8:56 a.m. UTC
On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
> > 
> > so it looks like the BIOS programs the wrong address into the radeon card?
> > 
> 
> Okay, staring at this, it definitely seems toxic to overlay the GART
> over memory areas reserved by the BIOS.  If I were to guess, I would say
> that the problem here seems to be that the kernel thinks it is
> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
> 
> Alex D., could you comment on the "num cpu pages" bit?

Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):


And this makes a difference: with this change on top of -rc3 the box boots
fine. So there seems to be some dependency between the GART base and the GTT
base even when they are in different address spaces.

Alex, can you comment on this?

Regards,

	Joerg
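
The size arithmetic in hpa's guess can be checked with a small sketch.
It assumes 4 KiB CPU pages; the helper names aperture_pages() and
aperture_end() are made up for illustration and are not kernel
functions. The 0xa0000000 base used in the usage note is the address
reported later in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define MIB           (1024ULL * 1024ULL)
#define CPU_PAGE_SIZE 4096ULL   /* assumption: 4 KiB x86 pages */

/* Number of CPU pages needed to cover an aperture of size_bytes. */
static uint64_t aperture_pages(uint64_t size_bytes)
{
	/* Only whole pages make sense for an aperture. */
	assert(size_bytes % CPU_PAGE_SIZE == 0);
	return size_bytes / CPU_PAGE_SIZE;
}

/* First address past the end of an aperture starting at base. */
static uint64_t aperture_end(uint64_t base, uint64_t size_bytes)
{
	return base + size_bytes;
}
```

With these helpers, aperture_pages(512 * MIB) gives 131072 and
aperture_pages(64 * MIB) gives 16384. An aperture at 0xa0000000 ends at
0xa4000000 if it is 64 MiB, but at 0xc0000000 if it is 512 MiB, which is
the size mismatch hpa describes.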

Comments

Dave Airlie April 14, 2011, 9:07 a.m. UTC | #1
On Thu, Apr 14, 2011 at 6:56 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
>> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
>> >
>> > so it looks like the BIOS programs the wrong address into the radeon card?
>> >
>>
>> Okay, staring at this, it definitely seems toxic to overlay the GART
>> over memory areas reserved by the BIOS.  If I were to guess, I would say
>> that the problem here seems to be that the kernel thinks it is
>> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
>> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>>
>> Alex D., could you comment on the "num cpu pages" bit?
>
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
>
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>                        mc->gtt_size = size_bf;
>                }
>                mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +               if (mc->gtt_start == 0xa0000000)
> +                       mc->gtt_start = 0x80000000;
>        } else {
>                if (mc->gtt_size > size_af) {
>                        dev_warn(rdev->dev, "limiting GTT\n");
>
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
>
> Alex, can you comment on this?

Weird. Either it's a hw bug, or some access to the GTT is leaking out before
things are set up properly.

I think the RS780/880 docs are on the website, but generally the
address spaces are completely separate so anything getting through is
very unusual.

Dave.
Ingo Molnar April 14, 2011, 9:11 a.m. UTC | #2
* Joerg Roedel <joro@8bytes.org> wrote:

> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
> > On 04/13/2011 12:14 PM, Yinghai Lu wrote:
> > > 
> > > so it looks like the BIOS programs the wrong address into the radeon card?
> > > 
> > 
> > Okay, staring at this, it definitely seems toxic to overlay the GART
> > over memory areas reserved by the BIOS.  If I were to guess, I would say
> > that the problem here seems to be that the kernel thinks it is
> > overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
> > size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
> > 
> > Alex D., could you comment on the "num cpu pages" bit?
> 
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
> 
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>                         mc->gtt_size = size_bf;
>                 }
>                 mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +               if (mc->gtt_start == 0xa0000000)
> +                       mc->gtt_start = 0x80000000;
>         } else {
>                 if (mc->gtt_size > size_af) {
>                         dev_warn(rdev->dev, "limiting GTT\n");
> 
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
> 
> Alex, can you comment on this?

I'd strongly suggest we revert back to the old and proven allocation order, as 
long as it results in valid layouts. Even if we figure out this particular 
GART/GTT assumption there might be a dozen others in other types of hardware.

Thanks,

	Ingo
Alex Deucher April 14, 2011, 2:28 p.m. UTC | #3
On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
>> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
>> >
>> > so it looks like the BIOS programs the wrong address into the radeon card?
>> >
>>
>> Okay, staring at this, it definitely seems toxic to overlay the GART
>> over memory areas reserved by the BIOS.  If I were to guess, I would say
>> that the problem here seems to be that the kernel thinks it is
>> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
>> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>>
>> Alex D., could you comment on the "num cpu pages" bit?
>
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
>
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>                        mc->gtt_size = size_bf;
>                }
>                mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +               if (mc->gtt_start == 0xa0000000)
> +                       mc->gtt_start = 0x80000000;
>        } else {
>                if (mc->gtt_size > size_af) {
>                        dev_warn(rdev->dev, "limiting GTT\n");
>
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
>
> Alex, can you comment on this?

As Dave said, they are completely different address spaces.  You
could put the GPU aperture at 0 if you wanted (in fact we do on some
chips).  Perhaps there's some strange interaction with the nb gart
since the nb gart on that chipset was designed to be used for graphics
and the rs780/880 can be configured to use an agp aperture.
Unfortunately, I'm not that familiar with the nb gart.

Alex

>
> Regards,
>
>        Joerg
>
>
H. Peter Anvin April 14, 2011, 2:31 p.m. UTC | #4
On 04/14/2011 02:11 AM, Ingo Molnar wrote:
> 
> I'd strongly suggest we revert back to the old and proven allocation order, as 
> long as it results in valid layouts. Even if we figure out this particular 
> GART/GTT assumption there might be a dozen others in other types of hardware.
> 

Yes, but we might also be hiding a real bug which bites other hardware.
We have found real and very serious bugs in the kernel this way before
-- things where drivers scribble over random memory and allocation order
exposed the failure in a predictable way, as opposed to random crashes.

	-hpa
Joerg Roedel April 14, 2011, 9:09 p.m. UTC | #5
On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > And this makes a difference, with this change on-top of -rc3 the box boots
> > fine. So there seems to be some dependency between the GART base and the GTT
> > base even when they are in different address spaces.
> >
> > Alex, can you comment on this?
> 
> As Dave said, they are completely different address spaces.  You
> could put the GPU aperture at 0 if you wanted (in fact we do on some
> chips).  Perhaps there's some strange interaction with the nb gart
> since the nb gart on that chipset was designed to be used for graphics
> and the rs780/880 can be configured to use an agp aperture.
> Unfortunately, I'm not that familiar with the nb gart.

Actually, the nb gart is part of the cpu. It is part of the cpu north
bridge and can translate io and cpu accesses. In fact, it is a remapper
of physical memory addresses.

The problem seems to be related to specific gpu chips. On another
notebook with an hd3000 card gtt and the nb gart aperture are both on
0xa0000000 too but the box works fine. I haven't tested with an hd5000
yet. The failing notebook has an hd4200 mobility.

Btw. what happens if the gpu accesses an unmapped address in the gtt
range?

Regards,

	Joerg
Alex Deucher April 14, 2011, 9:34 p.m. UTC | #6
On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel <joro@8bytes.org> wrote:
> On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
>> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> > And this makes a difference, with this change on-top of -rc3 the box boots
>> > fine. So there seems to be some dependency between the GART base and the GTT
>> > base even when they are in different address spaces.
>> >
>> > Alex, can you comment on this?
>>
>> As Dave said, they are completely different address spaces.  You
>> could put the GPU aperture at 0 if you wanted (in fact we do on some
>> chips).  Perhaps there's some strange interaction with the nb gart
>> since the nb gart on that chipset was designed to be used for graphics
>> and the rs780/880 can be configured to use an agp aperture.
>> Unfortunately, I'm not that familiar with the nb gart.
>
> Actually, the nb gart is part of the cpu. It is part of the cpu north
> bridge and can translate io and cpu accesses. In fact, it is a remapper
> of physical memory addresses.

I know what it's for.  On an IGP, the graphics chip is also part of the
north bridge, but it may not be related at all.

>
> The problem seems to be related to specific gpu chips. On another
> notebook with an hd3000 card gtt and the nb gart aperture are both on
> 0xa0000000 too but the box works fine. I haven't tested with an hd5000
> yet. The failing notebook has an hd4200 mobility.

What exact model is the hd3000?   Is it an IGP or a discrete GPU?  If
it's an IGP, it's identical to the hd4200 programming-wise.

>
> Btw. what happens if the gpu accesses an unmapped address in the gtt
> range?

It's redirected to a dummy page.

Alex
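
The dummy-page behaviour described above can be pictured with a toy
page table. This is only a sketch under stated assumptions: 4 KiB GPU
pages, an 8-entry table, and hypothetical names (struct gtt_table,
gtt_init, gtt_bind, gtt_unbind, gtt_translate). It models the idea that
every unbound entry points at one shared dummy page, and it is not the
radeon driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

#define GTT_PAGE_SHIFT 12u                      /* assumption: 4 KiB GPU pages */
#define GTT_PAGE_SIZE  (1ULL << GTT_PAGE_SHIFT)
#define GTT_NUM_PAGES  8u                       /* tiny table, illustration only */

struct gtt_table {                              /* hypothetical structure */
	uint64_t start;                         /* GPU address where the range begins */
	uint64_t dummy_page;                    /* bus address of the shared dummy page */
	uint64_t entries[GTT_NUM_PAGES];        /* bus address backing each GPU page */
};

/* Point every entry at the dummy page so nothing is ever left unmapped. */
static void gtt_init(struct gtt_table *t, uint64_t start, uint64_t dummy_page)
{
	t->start = start;
	t->dummy_page = dummy_page;
	for (size_t i = 0; i < GTT_NUM_PAGES; i++)
		t->entries[i] = dummy_page;
}

/* Back one GPU page with a real system page. */
static void gtt_bind(struct gtt_table *t, size_t index, uint64_t page)
{
	if (index < GTT_NUM_PAGES)
		t->entries[index] = page;
}

/* Unbinding restores the dummy page instead of leaving a hole. */
static void gtt_unbind(struct gtt_table *t, size_t index)
{
	if (index < GTT_NUM_PAGES)
		t->entries[index] = t->dummy_page;
}

/* Translate a GPU address; returns 0 on success, -1 if outside the range. */
static int gtt_translate(const struct gtt_table *t, uint64_t gpu_addr,
			 uint64_t *out)
{
	if (gpu_addr < t->start)
		return -1;

	uint64_t offset = gpu_addr - t->start;
	uint64_t index = offset >> GTT_PAGE_SHIFT;

	if (index >= GTT_NUM_PAGES)
		return -1;

	*out = t->entries[index] + (offset & (GTT_PAGE_SIZE - 1));
	return 0;
}
```

In this model an access to an unbound page inside the range lands
harmlessly in the dummy page, which is why a stray GPU access within the
GTT range should not by itself explain the failure.
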
Joerg Roedel April 15, 2011, 6:50 a.m. UTC | #7
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel <joro@8bytes.org> wrote:

> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
> 
> I know what it's for.  On an IGP, the graphics chip is also part of the
> north bridge, but it may not be related at all.

Okay, just wanted to make clear that it is part of the CPU and not of
the chipset :)

> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa0000000 too but the box works fine. I haven't tested with an hd5000
> > yet. The failing notebook has an hd4200 mobility.
> 
> What exact model is the hd3000?   Is it an IGP or a discrete GPU?  If
> it's an IGP, it's identical to the hd4200 programming-wise.

It is an IGP card, an 

	"ATI Technologies Inc RS780M/RS780MN [Radeon HD 3200 Graphics]"

according to lspci.

> > Btw. what happens if the gpu accesses an unmapped address in the gtt
> > range?
> 
> It's redirected to a dummy page.

So there should be no issue there either; this is a very weird bug.

	Joerg
Michel Dänzer April 15, 2011, 8:26 a.m. UTC | #8
On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > > And this makes a difference, with this change on-top of -rc3 the box boots
> > > fine. So there seems to be some dependency between the GART base and the GTT
> > > base even when they are in different address spaces.
> > >
> > > Alex, can you comment on this?
> > 
> > As Dave said, they are completely different address spaces.  You
> > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > chips).  Perhaps there's some strange interaction with the nb gart
> > since the nb gart on that chipset was designed to be used for graphics
> > and the rs780/880 can be configured to use an agp aperture.
> > Unfortunately, I'm not that familiar with the nb gart.
> 
> Actually, the nb gart is part of the cpu. It is part of the cpu north
> bridge and can translate io and cpu accesses. In fact, it is a remapper
> of physical memory addresses.
> 
> The problem seems to be related to specific gpu chips. On another
> notebook with an hd3000 card gtt and the nb gart aperture are both on
> 0xa0000000 too but the box works fine.

Wasn't the working theory that the problem occurs if those two values
aren't the same?
Joerg Roedel April 15, 2011, 8:55 a.m. UTC | #9
On Fri, Apr 15, 2011 at 10:26:34AM +0200, Michel Dänzer wrote:
> On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> > On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> > > > And this makes a difference, with this change on-top of -rc3 the box boots
> > > > fine. So there seems to be some dependency between the GART base and the GTT
> > > > base even when they are in different address spaces.
> > > >
> > > > Alex, can you comment on this?
> > > 
> > > As Dave said, they are completely different address spaces.  You
> > > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > > chips).  Perhaps there's some strange interaction with the nb gart
> > > since the nb gart on that chipset was designed to be used for graphics
> > > and the rs780/880 can be configured to use an agp aperture.
> > > Unfortunately, I'm not that familiar with the nb gart.
> > 
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
> > 
> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa0000000 too but the box works fine.
> 
> Wasn't the working theory that the problem occurs if those two values
> aren't the same?

Yes it is, but this doesn't seem to be problematic on all radeon GPU
chips.

	Joerg
Andreas Herrmann April 15, 2011, 2:49 p.m. UTC | #10
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel <joro@8bytes.org> wrote:
> > On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> >> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >> > And this makes a difference, with this change on-top of -rc3 the box boots
> >> > fine. So there seems to be some dependency between the GART base and the GTT
> >> > base even when they are in different address spaces.
> >> >
> >> > Alex, can you comment on this?
> >>
> >> As Dave said, they are completely different address spaces.  You
> >> could put the GPU aperture at 0 if you wanted (in fact we do on some
> >> chips).  Perhaps there's some strange interaction with the nb gart
> >> since the nb gart on that chipset was designed to be used for graphics
> >> and the rs780/880 can be configured to use an agp aperture.
> >> Unfortunately, I'm not that familiar with the nb gart.
> >
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
> 
> I know what it's for.  On an IGP, the graphics chip is also part of the
> north bridge, but it may not be related at all.
> 
> >
> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa0000000 too but the box works fine. I haven't tested with an hd5000
> > yet. The failing notebook has an hd4200 mobility.
> 
> What exact model is the hd3000?   Is it an IGP or a discrete GPU?  If
> it's an IGP, it's identical to the hd4200 programming-wise.

BTW, first of all the other notebook had a different CPU (it's family
0fh and Joerg's is family 10h). So different CPUs, different GARTs,
different issues ;-)

(Furthermore for CPU family 0fh reporting of GartTblWalk errors is
already switched off in arch/x86/kernel/cpu/mcheck/mce.c.)


Andreas
Patch

--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -325,6 +325,8 @@  void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
                        mc->gtt_size = size_bf;
                }
                mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
+               if (mc->gtt_start == 0xa0000000)
+                       mc->gtt_start = 0x80000000;
        } else {
                if (mc->gtt_size > size_af) {
                        dev_warn(rdev->dev, "limiting GTT\n");
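
To see how the hunk above changes the result, here is a reduced,
standalone sketch of the placement expression. struct mc_layout, the
function place_gtt_before_vram() and the apply_debug_override flag are
hypothetical stand-ins, not the kernel's definitions; only the
expression and the two constants come from the patch. gtt_base_align is
treated as a mask because that is how the quoted expression uses it.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for the fields of struct radeon_mc
 * that the hunk touches. */
struct mc_layout {
	uint64_t vram_start;      /* GPU address where VRAM begins */
	uint64_t gtt_size;        /* size of the GTT range in bytes */
	uint64_t gtt_base_align;  /* used as a mask in the expression below */
	uint64_t gtt_start;       /* computed GPU address of the GTT range */
};

/* Place the GTT directly below VRAM, as in the "before VRAM" branch,
 * optionally applying the debug hack from the patch. */
static void place_gtt_before_vram(struct mc_layout *mc, int apply_debug_override)
{
	mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;

	/* Debug hack from the patch: move the GTT away from 0xa0000000. */
	if (apply_debug_override && mc->gtt_start == 0xa0000000ULL)
		mc->gtt_start = 0x80000000ULL;
}
```

As an assumed example, vram_start = 0xc0000000 with a 512 MiB GTT
(0x20000000) and a zero alignment mask yields gtt_start = 0xa0000000
without the override and 0x80000000 with it. The override is a
diagnostic aid for this one machine, not a proposed fix.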