diff mbox series

[v2] gpu/drm/msm: fix shutdown hook in case GPU components failed to bind

Message ID 20210318200544.2244007-1-dmitry.baryshkov@linaro.org (mailing list archive)
State New, archived
Headers show
Series [v2] gpu/drm/msm: fix shutdown hook in case GPU components failed to bind | expand

Commit Message

Dmitry Baryshkov March 18, 2021, 8:05 p.m. UTC
if GPU components have failed to bind, shutdown callback would fail with
the following backtrace. Add safeguard check to stop that oops from
happening and allow the board to reboot.

[   66.617046] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[   66.626066] Mem abort info:
[   66.628939]   ESR = 0x96000006
[   66.632088]   EC = 0x25: DABT (current EL), IL = 32 bits
[   66.637542]   SET = 0, FnV = 0
[   66.640688]   EA = 0, S1PTW = 0
[   66.643924] Data abort info:
[   66.646889]   ISV = 0, ISS = 0x00000006
[   66.650832]   CM = 0, WnR = 0
[   66.653890] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000107f81000
[   66.660505] [0000000000000000] pgd=0000000100bb2003, p4d=0000000100bb2003, pud=0000000100897003, pmd=0000000000000000
[   66.671398] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[   66.677115] Modules linked in:
[   66.680261] CPU: 6 PID: 352 Comm: reboot Not tainted 5.11.0-rc2-00309-g79e3faa756b2 #38
[   66.688473] Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
[   66.695347] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[   66.701507] pc : msm_atomic_commit_tail+0x78/0x4e0
[   66.706437] lr : commit_tail+0xa4/0x184
[   66.710381] sp : ffff8000108f3af0
[   66.713791] x29: ffff8000108f3af0 x28: ffff418c44337000
[   66.719242] x27: 0000000000000000 x26: ffff418c40a24490
[   66.724693] x25: ffffd3a842a4f1a0 x24: 0000000000000008
[   66.730146] x23: ffffd3a84313f030 x22: ffff418c444ce000
[   66.735598] x21: ffff418c408a4980 x20: 0000000000000000
[   66.741049] x19: 0000000000000000 x18: ffff800010710fbc
[   66.746500] x17: 000000000000000c x16: 0000000000000001
[   66.751954] x15: 0000000000010008 x14: 0000000000000068
[   66.757405] x13: 0000000000000001 x12: 0000000000000000
[   66.762855] x11: 0000000000000001 x10: 00000000000009b0
[   66.768306] x9 : ffffd3a843192000 x8 : ffff418c44337000
[   66.773757] x7 : 0000000000000000 x6 : 00000000a401b34e
[   66.779210] x5 : 00ffffffffffffff x4 : 0000000000000000
[   66.784660] x3 : 0000000000000000 x2 : ffff418c444ce000
[   66.790111] x1 : ffffd3a841dce530 x0 : ffff418c444cf000
[   66.795563] Call trace:
[   66.798075]  msm_atomic_commit_tail+0x78/0x4e0
[   66.802633]  commit_tail+0xa4/0x184
[   66.806217]  drm_atomic_helper_commit+0x160/0x390
[   66.811051]  drm_atomic_commit+0x4c/0x60
[   66.815082]  drm_atomic_helper_disable_all+0x1f4/0x210
[   66.820355]  drm_atomic_helper_shutdown+0x80/0x130
[   66.825276]  msm_pdev_shutdown+0x14/0x20
[   66.829303]  platform_shutdown+0x28/0x40
[   66.833330]  device_shutdown+0x158/0x330
[   66.837357]  kernel_restart+0x40/0xa0
[   66.841122]  __do_sys_reboot+0x228/0x250
[   66.845148]  __arm64_sys_reboot+0x28/0x34
[   66.849264]  el0_svc_common.constprop.0+0x74/0x190
[   66.854187]  do_el0_svc+0x24/0x90
[   66.857595]  el0_svc+0x14/0x20
[   66.860739]  el0_sync_handler+0x1a4/0x1b0
[   66.864858]  el0_sync+0x174/0x180
[   66.868269] Code: 1ac020a0 2a000273 eb02007f 54ffff01 (f9400285)
[   66.874525] ---[ end trace 20dedb2a3229fec8 ]---

Fixes: 9d5cbf5fe46e ("drm/msm: add shutdown support for display platform_driver")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
---
 drivers/gpu/drm/msm/msm_drv.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Stephen Boyd March 25, 2021, 1:49 a.m. UTC | #1
Quoting Dmitry Baryshkov (2021-03-18 13:05:44)
> if GPU components have failed to bind, shutdown callback would fail with
> the following backtrace. Add safeguard check to stop that oops from
> happening and allow the board to reboot.
[...]
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index 94525ac76d4e..fd2ac54caf9f 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -1311,6 +1311,10 @@ static int msm_pdev_remove(struct platform_device *pdev)
>  static void msm_pdev_shutdown(struct platform_device *pdev)
>  {
>         struct drm_device *drm = platform_get_drvdata(pdev);
> +       struct msm_drm_private *priv = drm ? drm->dev_private : NULL;
> +
> +       if (!priv || !priv->kms)
> +               return;
>  

I see a problem where if I don't get a backlight probing then my
graphics card doesn't appear but this driver is still bound. I was
hoping this patch would fix it but it doesn't. I have slab poisoning
enabled so sometimes the 'priv' pointer is 0x6b6b6b6b6b6b6b6b meaning it
got all freed.

I found that the 'drm' pointer here is pointing at junk. The
msm_drm_init() function calls drm_dev_put() on the error path and that
will destroy the drm pointer but it doesn't update this platform drivers
drvdata. Do we need another patch that sets the drvdata to NULL on
msm_drm_init() failing? One last note, I'm seeing this on 5.4 so maybe I
missed something and the drvdata has been set to NULL somewhere else
upstream. I sort of doubt it though.

---8<----
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index c842a270806d..895d74aa8834 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -577,6 +577,7 @@ static int msm_drm_init(struct device *dev, struct drm_driver *drv)
 	kfree(priv);
 err_put_drm_dev:
 	drm_dev_put(ddev);
+	platform_set_drvdata(pdev, NULL);
 	return ret;
 }
Rob Clark March 25, 2021, 3:09 a.m. UTC | #2
On Wed, Mar 24, 2021 at 6:49 PM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Dmitry Baryshkov (2021-03-18 13:05:44)
> > if GPU components have failed to bind, shutdown callback would fail with
> > the following backtrace. Add safeguard check to stop that oops from
> > happening and allow the board to reboot.
> [...]
> > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > index 94525ac76d4e..fd2ac54caf9f 100644
> > --- a/drivers/gpu/drm/msm/msm_drv.c
> > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > @@ -1311,6 +1311,10 @@ static int msm_pdev_remove(struct platform_device *pdev)
> >  static void msm_pdev_shutdown(struct platform_device *pdev)
> >  {
> >         struct drm_device *drm = platform_get_drvdata(pdev);
> > +       struct msm_drm_private *priv = drm ? drm->dev_private : NULL;
> > +
> > +       if (!priv || !priv->kms)
> > +               return;
> >
>
> I see a problem where if I don't get a backlight probing then my
> graphics card doesn't appear but this driver is still bound. I was
> hoping this patch would fix it but it doesn't. I have slab poisoning
> enabled so sometimes the 'priv' pointer is 0x6b6b6b6b6b6b6b6b meaning it
> got all freed.
>
> I found that the 'drm' pointer here is pointing at junk. The
> msm_drm_init() function calls drm_dev_put() on the error path and that
> will destroy the drm pointer but it doesn't update this platform drivers
> drvdata. Do we need another patch that sets the drvdata to NULL on
> msm_drm_init() failing? One last note, I'm seeing this on 5.4 so maybe I
> missed something and the drvdata has been set to NULL somewhere else
> upstream. I sort of doubt it though.

the hw that I guess you are running on should work pretty well w/
upstream kernel.. but I don't think there is any important delta
between upstream and the 5.4 based kernel that you are running that
would fix this..

so *probably* you are right..

BR,
-R

>
> ---8<----
> diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> index c842a270806d..895d74aa8834 100644
> --- a/drivers/gpu/drm/msm/msm_drv.c
> +++ b/drivers/gpu/drm/msm/msm_drv.c
> @@ -577,6 +577,7 @@ static int msm_drm_init(struct device *dev, struct drm_driver *drv)
>         kfree(priv);
>  err_put_drm_dev:
>         drm_dev_put(ddev);
> +       platform_set_drvdata(pdev, NULL);
>         return ret;
>  }
Stephen Boyd March 25, 2021, 4:39 a.m. UTC | #3
Quoting Rob Clark (2021-03-24 20:09:37)
> On Wed, Mar 24, 2021 at 6:49 PM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> > Quoting Dmitry Baryshkov (2021-03-18 13:05:44)
> > > if GPU components have failed to bind, shutdown callback would fail with
> > > the following backtrace. Add safeguard check to stop that oops from
> > > happening and allow the board to reboot.
> > [...]
> > > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
> > > index 94525ac76d4e..fd2ac54caf9f 100644
> > > --- a/drivers/gpu/drm/msm/msm_drv.c
> > > +++ b/drivers/gpu/drm/msm/msm_drv.c
> > > @@ -1311,6 +1311,10 @@ static int msm_pdev_remove(struct platform_device *pdev)
> > >  static void msm_pdev_shutdown(struct platform_device *pdev)
> > >  {
> > >         struct drm_device *drm = platform_get_drvdata(pdev);
> > > +       struct msm_drm_private *priv = drm ? drm->dev_private : NULL;
> > > +
> > > +       if (!priv || !priv->kms)
> > > +               return;
> > >
> >
> > I see a problem where if I don't get a backlight probing then my
> > graphics card doesn't appear but this driver is still bound. I was
> > hoping this patch would fix it but it doesn't. I have slab poisoning
> > enabled so sometimes the 'priv' pointer is 0x6b6b6b6b6b6b6b6b meaning it
> > got all freed.
> >
> > I found that the 'drm' pointer here is pointing at junk. The
> > msm_drm_init() function calls drm_dev_put() on the error path and that
> > will destroy the drm pointer but it doesn't update this platform drivers
> > drvdata. Do we need another patch that sets the drvdata to NULL on
> > msm_drm_init() failing? One last note, I'm seeing this on 5.4 so maybe I
> > missed something and the drvdata has been set to NULL somewhere else
> > upstream. I sort of doubt it though.
> 
> the hw that I guess you are running on should work pretty well w/
> upstream kernel.. but I don't think there is any important delta
> between upstream and the 5.4 based kernel that you are running that
> would fix this..
> 
> so *probably* you are right..

linux-next is failing like this today for me on Lazor right after the
screen turns on. I'll have to figure out what's wrong before checking
upstream.

[   10.734752] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000080
[   10.744482] Mem abort info:
[   10.747462]   ESR = 0x96000006
[   10.750644]   EC = 0x25: DABT (current EL), IL = 32 bits
[   10.756125]   SET = 0, FnV = 0
[   10.759290]   EA = 0, S1PTW = 0
[   10.762543] Data abort info:
[   10.765519]   ISV = 0, ISS = 0x00000006
[   10.769485]   CM = 0, WnR = 0
[   10.772553] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000123474000
[   10.779212] [0000000000000080] pgd=0800000123475003, p4d=0800000123475003, pud=0800000123475003, pmd=0000000000000000
[   10.790128] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[   10.795856] Modules linked in: ath10k_snoc qmi_helpers ath10k_core ath mac80211 cfg80211 r8152 mii joydev
[   10.805705] CPU: 5 PID: 1576 Comm: DrmThread Not tainted 5.12.0-rc4-next-20210324+ #13
[   10.813832] Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
[   10.820535] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[   10.826703] pc : dpu_plane_atomic_update+0x80/0xcb8
[   10.831730] lr : dpu_plane_restore+0x5c/0x88
[   10.836117] sp : ffffffc012963920
[   10.839521] x29: ffffffc0129639c0 x28: ffffffed5c9ad000 
[   10.844979] x27: ffffffed5c736000 x26: ffffffed5ca3f000 
[   10.850443] x25: ffffffed5c736000 x24: 0000000000000000 
[   10.855903] x23: 0000000000000000 x22: ffffff80ad007400 
[   10.861361] x21: ffffff8085193808 x20: 0000000000000000 
[   10.866818] x19: ffffff8085193800 x18: 0000000000000008 
[   10.872274] x17: 0000000000800000 x16: 0000000020000000 
[   10.877738] x15: 0000000000000001 x14: 0000000000000000 
[   10.883201] x13: ffffff80852324a8 x12: 0000000000000008 
[   10.888657] x11: ffffffed5c3b7890 x10: 0000000000000000 
[   10.894112] x9 : 0000000000000000 x8 : 0000000000000000 
[   10.899570] x7 : 0000000000004000 x6 : 0000000000010000 
[   10.905026] x5 : 0000000000040000 x4 : 0000000000000800 
[   10.910482] x3 : 0000000000000000 x2 : 0000000000020041 
[   10.915946] x1 : ffffff80ad2e2600 x0 : ffffff8085193800 
[   10.921402] Call trace:
[   10.923923]  dpu_plane_atomic_update+0x80/0xcb8
[   10.928585]  dpu_plane_restore+0x5c/0x88
[   10.932620]  dpu_crtc_atomic_flush+0xd4/0x1a0
[   10.937105]  drm_atomic_helper_commit_planes+0x1b4/0x1e0
[   10.942565]  msm_atomic_commit_tail+0x2d4/0x670
[   10.947223]  commit_tail+0xac/0x148
[   10.950814]  drm_atomic_helper_commit+0x104/0x10c
[   10.955653]  drm_atomic_commit+0x58/0x68
[   10.959686]  drm_mode_atomic_ioctl+0x438/0x51c
[   10.964261]  drm_ioctl_kernel+0xa8/0x124
[   10.968295]  drm_ioctl+0x24c/0x3ec
[   10.971800]  drm_compat_ioctl+0xe0/0xf4
[   10.975745]  __arm64_compat_sys_ioctl+0xcc/0x104
[   10.980499]  el0_svc_common+0xa4/0x128
[   10.984358]  do_el0_svc_compat+0x2c/0x38
[   10.988395]  el0_svc_compat+0x20/0x30
[   10.992164]  el0_sync_compat_handler+0xc0/0xf0
[   10.996734]  el0_sync_compat+0x174/0x180
[   11.000774] Code: d0003d61 91204821 52800020 97fe8c65 (39420288)
diff mbox series

Patch

diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c
index 94525ac76d4e..fd2ac54caf9f 100644
--- a/drivers/gpu/drm/msm/msm_drv.c
+++ b/drivers/gpu/drm/msm/msm_drv.c
@@ -1311,6 +1311,10 @@  static int msm_pdev_remove(struct platform_device *pdev)
 static void msm_pdev_shutdown(struct platform_device *pdev)
 {
 	struct drm_device *drm = platform_get_drvdata(pdev);
+	struct msm_drm_private *priv = drm ? drm->dev_private : NULL;
+
+	if (!priv || !priv->kms)
+		return;
 
 	drm_atomic_helper_shutdown(drm);
 }