[REGRESSION] nouveau: Memory corruption using nva3 engine for 0xaf

Message ID 1341469873-1582-1-git-send-email-rydberg@euromail.se (mailing list archive)
State New, archived

Commit Message

Henrik Rydberg July 5, 2012, 6:31 a.m. UTC
Hi Ben, Dave,

Since 3.5-rc0, I have been experiencing occasional screen corruption
on my MacBookAir3,1, using a GeForce 320M (nv50, 0xaf). The X driver
version is xf86-video-nouveau-1.0.1-1 (arch).

I do not know what the root problem is, but I have been able to
isolate the symptoms to the usage of nva3_copy.c. The patch below is
the least intrusive way I could find which kills the symptoms.

Hopefully this will shed some light on the true problem, such that a
fix can be found for 3.5.

Thanks,
Henrik

The nva3 copy engine exhibits random memory corruption in at least one
case, the GeForce 320M (nv50, 0xaf) in the MacBookAir3,1.  This patch
omits creating the engine for the specific chipset, falling back to
M2MF, which kills the symptoms.
---
 drivers/gpu/drm/nouveau/nouveau_state.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Ben Skeggs July 5, 2012, 6:40 a.m. UTC | #1
On Thu, Jul 05, 2012 at 08:31:13AM +0200, Henrik Rydberg wrote:
> Hi Ben, Dave,
Hey Henrik,

> 
> Since 3.5-rc0, I have been experiencing occasional screen corruption
> on my MacBookAir3,1, using a GeForce 320M (nv50, 0xaf). The X driver
> version is xf86-video-nouveau-1.0.1-1 (arch).
> 
> I do not know what the root problem is, but I have been able to
> isolate the symptoms to the usage of nva3_copy.c. The patch below is
> the least intrusive way I could find which kills the symptoms.
> 
> Hopefully this will shed some light on the true problem, such that a
> fix can be found for 3.5.
Thanks for tracking down the source of this corruption.  I don't have
any such hardware, so until someone can figure it out, I think we
should apply this patch.

Cheers,
Ben.

> 
> Thanks,
> Henrik
> 
> The nva3 copy engine exhibits random memory corruption in at least one
> case, the GeForce 320M (nv50, 0xaf) in the MacBookAir3,1.  This patch
> omits creating the engine for the specific chipset, falling back to
> M2MF, which kills the symptoms.
> ---
Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

>  drivers/gpu/drm/nouveau/nouveau_state.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_state.c b/drivers/gpu/drm/nouveau/nouveau_state.c
> index 19706f0..b466937 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_state.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_state.c
> @@ -731,7 +731,6 @@ nouveau_card_init(struct drm_device *dev)
>  			case 0xa3:
>  			case 0xa5:
>  			case 0xa8:
> -			case 0xaf:
>  				nva3_copy_create(dev);
>  				break;
>  			}
Henrik Rydberg July 5, 2012, 6:54 a.m. UTC | #2
> Thanks for tracking down the source of this corruption.  I don't have
> any such hardware, so until someone can figure it out, I think we
> should apply this patch.

In that case, I would have to massage the patch a bit first; it
creates a problem with suspend/resume. Might be something with
nva3_pm.c, who knows. I am really stabbing in the dark here. :-)

Thanks,
Henrik
Henrik Rydberg July 5, 2012, 8:34 a.m. UTC | #3
On Thu, Jul 05, 2012 at 08:54:46AM +0200, Henrik Rydberg wrote:
> > Thanks for tracking down the source of this corruption.  I don't have
> > any such hardware, so until someone can figure it out, I think we
> > should apply this patch.
> 
> In that case, I would have to massage the patch a bit first; it
> creates a problem with suspend/resume. Might be something with
> nva3_pm.c, who knows. I am really stabbing in the dark here. :-)

It seems the suspend/resume problem is unrelated (bad systemd update),
so I am fine with applying this as is. Obviously not the best
solution, and if I have time I will continue to look for problems in
the nva3 copy code, but for now,

    Signed-off-by: Henrik Rydberg <rydberg@euromail.se>

Thanks,
Henrik
Henrik Rydberg July 9, 2012, 1:13 p.m. UTC | #4
On Thu, Jul 05, 2012 at 10:34:10AM +0200, Henrik Rydberg wrote:
> On Thu, Jul 05, 2012 at 08:54:46AM +0200, Henrik Rydberg wrote:
> > > Thanks for tracking down the source of this corruption.  I don't have
> > > any such hardware, so until someone can figure it out, I think we
> > > should apply this patch.
> > 
> > In that case, I would have to massage the patch a bit first; it
> > creates a problem with suspend/resume. Might be something with
> > nva3_pm.c, who knows. I am really stabbing in the dark here. :-)
> 
> It seems the suspend/resume problem is unrelated (bad systemd update),
> so I am fine with applying this as is. Obviously not the best
> solution, and if I have time I will continue to look for problems in
> the nva3 copy code, but for now,
> 
>     Signed-off-by: Henrik Rydberg <rydberg@euromail.se>

I have not encountered the problem in a long while, and I do not have
the patch applied. It is entirely possible that this was fixed by
something else. Unless you have already applied the patch, I would
suggest holding on to it to see if the problem reappears.

Sorry for the churn.

Thanks,
Henrik
Henrik Rydberg July 9, 2012, 6:27 p.m. UTC | #5
On Mon, Jul 09, 2012 at 03:13:25PM +0200, Henrik Rydberg wrote:
> On Thu, Jul 05, 2012 at 10:34:10AM +0200, Henrik Rydberg wrote:
> > On Thu, Jul 05, 2012 at 08:54:46AM +0200, Henrik Rydberg wrote:
> > > > Thanks for tracking down the source of this corruption.  I don't have
> > > > any such hardware, so until someone can figure it out, I think we
> > > > should apply this patch.
> > > 
> > > In that case, I would have to massage the patch a bit first; it
> > > creates a problem with suspend/resume. Might be something with
> > > nva3_pm.c, who knows. I am really stabbing in the dark here. :-)
> > 
> > It seems the suspend/resume problem is unrelated (bad systemd update),
> > so I am fine with applying this as is. Obviously not the best
> > solution, and if I have time I will continue to look for problems in
> > the nva3 copy code, but for now,
> > 
> >     Signed-off-by: Henrik Rydberg <rydberg@euromail.se>
> 
> I have not encountered the problem in a long while, and I do not have
> the patch applied. It is entirely possible that this was fixed by
> something else. Unless you have already applied the patch, I would
> suggest holding on to it to see if the problem reappears.
> 
> Sorry for the churn.

... and there it was again, hours after giving up on it. Oh well.

What makes this bug particularly difficult is that as soon as the
patch is applied, the problem disappears and does not show itself
again - with or without the patch applied. Sounds very much like the
problem is a failure state that does not get reset by current
mainline, but somehow gets reset with the patch applied.

I also learnt that the problem is not in the nva3_copy code itself; I
reverted nva3_copy.c and nva3_pm.c back to v3.4, but the problem persisted.

A DMA problem elsewhere, in the drm code or in the pci layer, seems
more likely than this particular hardware having problems with this
particular copy engine. As it stands, though, applying the patch is
the only thing known to work.

Thanks,
Henrik
Henrik Rydberg June 4, 2013, 8:48 p.m. UTC | #6
Hi Ben,

The new mutexes in nvc0/nv50 (fadb17190/b509656) break resume on my
MBA3,1. A deadlock somewhere, perhaps? Reverting fixes the problem.

Thanks,
Henrik
Ilia Mirkin June 4, 2013, 9:16 p.m. UTC | #7
On Tue, Jun 4, 2013 at 4:48 PM, Henrik Rydberg <rydberg@euromail.se> wrote:
> Hi Ben,
>
> The new mutexes in nvc0/nv50 (fadb17190/b509656) break resume on my
> MBA3,1. A deadlock somewhere, perhaps? Reverting fixes the problem.

A bunch of people saw it earlier. Fixed for nv50 (which is what I
assume you have) in
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=e9de89adcecb7a1296f5bc4d0052f58e18edd0a8

I assume it's on its way to mainline.

  -ilia
Patch

diff --git a/drivers/gpu/drm/nouveau/nouveau_state.c b/drivers/gpu/drm/nouveau/nouveau_state.c
index 19706f0..b466937 100644
--- a/drivers/gpu/drm/nouveau/nouveau_state.c
+++ b/drivers/gpu/drm/nouveau/nouveau_state.c
@@ -731,7 +731,6 @@  nouveau_card_init(struct drm_device *dev)
 			case 0xa3:
 			case 0xa5:
 			case 0xa8:
-			case 0xaf:
 				nva3_copy_create(dev);
 				break;
 			}
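
As a rough illustration of what the one-line deletion does, the chipset
dispatch can be modeled as a standalone C function. This is only a sketch:
pick_copy_backend() and the copy_backend enum are hypothetical names that do
not exist in the driver, and only the case labels visible in the hunk above
are modeled. The real switch calls nva3_copy_create(dev) instead of
returning a value.

```c
/* Hypothetical names for illustration only; the driver has no such enum. */
enum copy_backend {
	COPY_BACKEND_M2MF,      /* fallback path named in the commit message */
	COPY_BACKEND_NVA3_COPY  /* engine set up by nva3_copy_create() */
};

/*
 * Simplified model of the switch in nouveau_card_init() with the patch
 * applied: 0xaf no longer has a case label of its own.
 */
enum copy_backend pick_copy_backend(unsigned int chipset)
{
	switch (chipset) {
	case 0xa3:
	case 0xa5:
	case 0xa8:
		return COPY_BACKEND_NVA3_COPY;
	default:
		/* 0xaf now ends up here, so no nva3 copy engine is created. */
		return COPY_BACKEND_M2MF;
	}
}
```

In this model, pick_copy_backend(0xaf) yields COPY_BACKEND_M2MF, while
pick_copy_backend(0xa8) still yields COPY_BACKEND_NVA3_COPY, so the other
chipsets in the list are unaffected by the change.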