Message ID | CA+55aFz94M0TVdf+jbKs-AHNJq5mZW8oo4pUOAqHS8qb_Cz9sA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Sun, Mar 01, 2015 at 05:59:53PM -0800, Linus Torvalds wrote: > On Sun, Mar 1, 2015 at 1:00 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Back to the drawing board. > > Ok, many hours later, but I found it. > > The bisection was a disaster, having to work around other bugs in this > area, but it ended up getting "close enough" that I figured out what > went wrong. Sorry for all the bisect work. Ville dug as far as the load detect being unhappy due to atomic last week. But since I general don't real mail on w/e I've only seen this thread now. > The "intel_plane_duplicate_state()" is horribly horribly buggy. It > looks at the state->fb pointer, but it may have been free'd already. > > This workaround "works for me", but it's really still very > questionable, because while the "kref_get_unless_zero()" works > correctly when the last reference has been dropped, I'm not sure that > there is any guarantee that the whole allocation even exists any more, > so I think the *correct* thing to do would be to clear state->fb when > dropping the kref. But this was the smallest working patch I could > come up with. Somebody who actually knows the code should start > looking at the places that do drm_framebuffer_unreference(), and > actually clear that pointer instead. > > Added Matt Roper and Ander Conselvan de Oliveira to the discussion, > since they are the ones git says are involved with the original broken > intel_plane_duplicate_state(). > > Anyway, attached is > > (a) the patch with a big comment > > (b) the warnings I get on that machine that show where this problem > triggers (and another warning earlier). > > Comments? I'm sure this probably only triggers with *old* X servers > that don't do all the modern dri stuff. It's not old X servers but old machines which don't have hotplug interrupts (or still have tv-out, which doesn't have hpd support anywhere). I'll dig into this now just fyi the rules about how plane->state should be handled: - you need plane->lock - it's invariant once assigned, updates should only be done with a duplicate_state(), update state you want to change and the going through the atomic commit machinery - drm_plane_state->fb should always be a full reference But switching to atomic amounts to a full rewrite of the drm core and drivers (semantics for modeset updates are totally different). So there's lots of glue and ducttape going on to keep up the illusion for both old code and partially transitioned driver. It's all a bit wobbly atm and don't loook too closely at some of the hacks we have. And can you please attach a bactrace of the WARN in your patch, just to double-check you blow up at the same spot? Thanks, Daniel > Linus > From c182b15c3abee75cdc9d9564b6ab826403690f4e Mon Sep 17 00:00:00 2001 > From: Linus Torvalds <torvalds@localhost.localdomain> > Date: Sat, 28 Feb 2015 21:44:48 -0800 > Subject: [PATCH] Workaround for drm bug > > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> > > --- > drivers/gpu/drm/i915/intel_atomic_plane.c | 19 +++++++++++++++++-- > 1 file changed, 17 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/intel_atomic_plane.c b/drivers/gpu/drm/i915/intel_atomic_plane.c > index 9e6f727..72714d3 100644 > --- a/drivers/gpu/drm/i915/intel_atomic_plane.c > +++ b/drivers/gpu/drm/i915/intel_atomic_plane.c > @@ -85,8 +85,23 @@ intel_plane_duplicate_state(struct drm_plane *plane) > return NULL; > > state = &intel_state->base; > - if (state->fb) > - drm_framebuffer_reference(state->fb); > + > + /* > + * We cannot do drm_framebuffer_reference(), because the reference > + * may already have been dropped. > + * > + * So we do what drm_framebuffer_lookup() does, namely do a > + * kref_get_unless_zero(). Even that is somewhat questionable, > + * in that maybe the 'fb' already got free'd. So warn loudly > + * about this. > + * > + * Maybe the base.fb should be cleared by whatever drops the > + * reference? > + */ > + if (state->fb && !kref_get_unless_zero(&state->fb->refcount)) { > + state->fb = NULL; > + WARN_ONCE(1, "intel_plane_duplicate_state got plane with dead frame buffer"); > + } > > return state; > } > -- > 2.3.1.167.g7f4ba4b > > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > http://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Mar 2, 2015 at 1:04 AM, Daniel Vetter <daniel@ffwll.ch> wrote: > > And can you please attach a bactrace of the WARN in your patch, just to > double-check you blow up at the same spot? So the dmesg I attached had a backtrace for the new WARN_ONCE() (in addition to an unrelated(?) one from i915_gem_free_object()). Or did you mean a backtrace of the oops when things go wrong, when my patch is *not* applied? My first email had that with the kref.h warning from drm_framebuffer_reference, which is otherwise the same thing. And after things go wrong, and the plane handling thing uses a framebuffer that has already been freed, then the resulting oopses end up being kind of random. Which was partly why it ended up beign so hard to finally bisect, and I actually eventually gave up on it - because sometimes things would just happen to work, sometimes things would blow up and oops when X started (resulting in just a dead machine), sometimes things would oops at bootup. The most common oops was something going wrong when the framebuffer was free'd for the second time, and it would cause a NULL pointer dereference in drm_framebuffer_free() or one of the routines it called (so often a NULL pointer dereference in "mutex_lock(&dev->mode_config.fb_lock)" because "dev" was NULL, or the call to "fb->funcs->destroy(fb)" would oops because "fb->funcs" was NULL. Linus
On Mon, Mar 2, 2015 at 5:53 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Mar 2, 2015 at 1:04 AM, Daniel Vetter <daniel@ffwll.ch> wrote: >> And can you please attach a bactrace of the WARN in your patch, just to >> double-check you blow up at the same spot? > > So the dmesg I attached had a backtrace for the new WARN_ONCE() (in > addition to an unrelated(?) one from i915_gem_free_object()). > > Or did you mean a backtrace of the oops when things go wrong, when my > patch is *not* applied? My first email had that with the kref.h > warning from drm_framebuffer_reference, which is otherwise the same > thing. I've mixed things up with the other reporter which was full of the subsequent oopses. But after I've sorted out why drm-intel-next doesn't blow up the same way I see the bug now. Still baffled that we underrun the refcount apparently since the same pile of legacy code + atomic glue is used for the old modeset ioctl. But obviously something is different, so still digging. The gem_free_object backtrace is a completely unrelated issue. Fix for that is in drm-intel-fixes and on the way to you: commit 62e537f8d568347bbe4e00d7803a838750cdc618 Author: Rodrigo Vivi <rodrigo.vivi@intel.com> Date: Tue Feb 24 13:37:54 2015 -0800 drm/i915: Fix frontbuffer false positve. If that one doesn't help please scream ;-) -Daniel
From c182b15c3abee75cdc9d9564b6ab826403690f4e Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@localhost.localdomain> Date: Sat, 28 Feb 2015 21:44:48 -0800 Subject: [PATCH] Workaround for drm bug Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> --- drivers/gpu/drm/i915/intel_atomic_plane.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_atomic_plane.c b/drivers/gpu/drm/i915/intel_atomic_plane.c index 9e6f727..72714d3 100644 --- a/drivers/gpu/drm/i915/intel_atomic_plane.c +++ b/drivers/gpu/drm/i915/intel_atomic_plane.c @@ -85,8 +85,23 @@ intel_plane_duplicate_state(struct drm_plane *plane) return NULL; state = &intel_state->base; - if (state->fb) - drm_framebuffer_reference(state->fb); + + /* + * We cannot do drm_framebuffer_reference(), because the reference + * may already have been dropped. + * + * So we do what drm_framebuffer_lookup() does, namely do a + * kref_get_unless_zero(). Even that is somewhat questionable, + * in that maybe the 'fb' already got free'd. So warn loudly + * about this. + * + * Maybe the base.fb should be cleared by whatever drops the + * reference? + */ + if (state->fb && !kref_get_unless_zero(&state->fb->refcount)) { + state->fb = NULL; + WARN_ONCE(1, "intel_plane_duplicate_state got plane with dead frame buffer"); + } return state; } -- 2.3.1.167.g7f4ba4b