Message ID | 20230113041051.4189063-1-quic_bjorande@quicinc.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | drm/msm: Initialize mode_config earlier | expand |
On 13/01/2023 06:10, Bjorn Andersson wrote: > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling > drm_kms_helper_hotplug_event(), which assumes that the associated > drm_device's mode_config.funcs is a valid pointer. > > But in the MSM DisplayPort driver the HPD enablement happens at bind > time and mode_config.funcs is initialized late in msm_drm_init(). This > means that there's a window for hot plug events to dereference a NULL > mode_config.funcs. > > Move the assignment of mode_config.funcs before the bind, to avoid this > scenario. Cam we make DP driver not to report HPD events until the enable_hpd() was called? I think this is what was fixed by your internal_hpd patchset. > > Reported-by: Johan Hovold <johan@kernel.org> > Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> > --- > drivers/gpu/drm/msm/msm_drv.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c > index ee2f60b6f09b..7ac670f3e6ab 100644 > --- a/drivers/gpu/drm/msm/msm_drv.c > +++ b/drivers/gpu/drm/msm/msm_drv.c > @@ -438,6 +438,9 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) > > drm_mode_config_init(ddev); > > + ddev->mode_config.funcs = &mode_config_funcs; > + ddev->mode_config.helper_private = &mode_config_helper_funcs; > + > ret = msm_init_vram(ddev); > if (ret) > return ret; > @@ -479,9 +482,6 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) > > drm_helper_move_panel_connectors_to_head(ddev); > > - ddev->mode_config.funcs = &mode_config_funcs; > - ddev->mode_config.helper_private = &mode_config_helper_funcs; > - > for (i = 0; i < priv->num_crtcs; i++) { > /* initialize event thread */ > priv->event_thread[i].crtc_id = priv->crtcs[i]->base.id;
On 13/01/2023 06:23, Dmitry Baryshkov wrote: > On 13/01/2023 06:10, Bjorn Andersson wrote: >> Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled >> bridge_connector ends up in drm_bridge_connector_hpd_cb() calling >> drm_kms_helper_hotplug_event(), which assumes that the associated >> drm_device's mode_config.funcs is a valid pointer. >> >> But in the MSM DisplayPort driver the HPD enablement happens at bind >> time and mode_config.funcs is initialized late in msm_drm_init(). This >> means that there's a window for hot plug events to dereference a NULL >> mode_config.funcs. >> >> Move the assignment of mode_config.funcs before the bind, to avoid this >> scenario. > > Cam we make DP driver not to report HPD events until the enable_hpd() > was called? I think this is what was fixed by your internal_hpd patchset. Or to express this in another words: I thought that internal_hpd already deferred enabling hpd event reporting till the time when we need it, didn't it?
On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote: > On 13/01/2023 06:23, Dmitry Baryshkov wrote: > > On 13/01/2023 06:10, Bjorn Andersson wrote: > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling > > > drm_kms_helper_hotplug_event(), which assumes that the associated > > > drm_device's mode_config.funcs is a valid pointer. > > > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind > > > time and mode_config.funcs is initialized late in msm_drm_init(). This > > > means that there's a window for hot plug events to dereference a NULL > > > mode_config.funcs. > > > > > > Move the assignment of mode_config.funcs before the bind, to avoid this > > > scenario. > > > > Cam we make DP driver not to report HPD events until the enable_hpd() > > was called? I think this is what was fixed by your internal_hpd > > patchset. > > Or to express this in another words: I thought that internal_hpd already > deferred enabling hpd event reporting till the time when we need it, didn't > it? > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when this window of "opportunity" opens up, and here's the callstack: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm] ... Call trace: drm_bridge_hpd_enable+0x48/0x94 [drm] drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper] drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper] drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper] msm_drm_bind+0x370/0x628 [msm] try_to_bring_up_aggregate_device+0x170/0x1bc __component_add+0xb0/0x168 component_add+0x20/0x2c dp_display_probe+0x40c/0x468 [msm] platform_probe+0xb4/0xdc really_probe+0x13c/0x300 __driver_probe_device+0xc0/0xec driver_probe_device+0x48/0x204 __device_attach_driver+0x124/0x14c bus_for_each_drv+0x90/0xdc __device_attach+0xdc/0x1a8 device_initial_probe+0x20/0x2c bus_probe_device+0x40/0xa4 deferred_probe_work_func+0x94/0xd0 process_one_work+0x1a8/0x3c0 worker_thread+0x254/0x47c kthread+0xf8/0x1b8 ret_from_fork+0x10/0x20 ---[ end trace 0000000000000000 ]--- As drm_kms_helper_poll_init() is the last thing being called in msm_drm_init() shifting around the mode_config.func assignment would not have any impact. Perhaps we have shuffled other things around to avoid this bug? Either way, let's this on hold until further proof that it's still reproducible. Sorry for the noise, Bjorn
On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote: > > On 13/01/2023 06:23, Dmitry Baryshkov wrote: > > > On 13/01/2023 06:10, Bjorn Andersson wrote: > > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled > > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling > > > > drm_kms_helper_hotplug_event(), which assumes that the associated > > > > drm_device's mode_config.funcs is a valid pointer. > > > > > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind > > > > time and mode_config.funcs is initialized late in msm_drm_init(). This > > > > means that there's a window for hot plug events to dereference a NULL > > > > mode_config.funcs. > > > > > > > > Move the assignment of mode_config.funcs before the bind, to avoid this > > > > scenario. > > > > > > Cam we make DP driver not to report HPD events until the enable_hpd() > > > was called? I think this is what was fixed by your internal_hpd > > > patchset. > > > > Or to express this in another words: I thought that internal_hpd already > > deferred enabling hpd event reporting till the time when we need it, didn't > > it? > > > > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when > this window of "opportunity" opens up, and here's the callstack: > > ------------[ cut here ]------------ > WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm] > ... > Call trace: > drm_bridge_hpd_enable+0x48/0x94 [drm] > drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper] > drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper] > drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper] > msm_drm_bind+0x370/0x628 [msm] > try_to_bring_up_aggregate_device+0x170/0x1bc > __component_add+0xb0/0x168 > component_add+0x20/0x2c > dp_display_probe+0x40c/0x468 [msm] > platform_probe+0xb4/0xdc > really_probe+0x13c/0x300 > __driver_probe_device+0xc0/0xec > driver_probe_device+0x48/0x204 > __device_attach_driver+0x124/0x14c > bus_for_each_drv+0x90/0xdc > __device_attach+0xdc/0x1a8 > device_initial_probe+0x20/0x2c > bus_probe_device+0x40/0xa4 > deferred_probe_work_func+0x94/0xd0 > process_one_work+0x1a8/0x3c0 > worker_thread+0x254/0x47c > kthread+0xf8/0x1b8 > ret_from_fork+0x10/0x20 > ---[ end trace 0000000000000000 ]--- > > As drm_kms_helper_poll_init() is the last thing being called in > msm_drm_init() shifting around the mode_config.func assignment would not > have any impact. > > Perhaps we have shuffled other things around to avoid this bug? Either > way, let's this on hold until further proof that it's still > reproducible. As I've mentioned off list, I haven't hit the apparent race I reported here: https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it could very well be that something has changes that fixes (or hides) the issue since. Johan
On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote: > > > On 13/01/2023 06:23, Dmitry Baryshkov wrote: > > > > On 13/01/2023 06:10, Bjorn Andersson wrote: > > > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled > > > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling > > > > > drm_kms_helper_hotplug_event(), which assumes that the associated > > > > > drm_device's mode_config.funcs is a valid pointer. > > > > > > > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind > > > > > time and mode_config.funcs is initialized late in msm_drm_init(). This > > > > > means that there's a window for hot plug events to dereference a NULL > > > > > mode_config.funcs. > > > > > > > > > > Move the assignment of mode_config.funcs before the bind, to avoid this > > > > > scenario. > > > > > > > > Cam we make DP driver not to report HPD events until the enable_hpd() > > > > was called? I think this is what was fixed by your internal_hpd > > > > patchset. > > > > > > Or to express this in another words: I thought that internal_hpd already > > > deferred enabling hpd event reporting till the time when we need it, didn't > > > it? > > > > > > > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when > > this window of "opportunity" opens up, and here's the callstack: > > > > ------------[ cut here ]------------ > > WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm] > > ... > > Call trace: > > drm_bridge_hpd_enable+0x48/0x94 [drm] > > drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper] > > drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper] > > drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper] > > msm_drm_bind+0x370/0x628 [msm] > > try_to_bring_up_aggregate_device+0x170/0x1bc > > __component_add+0xb0/0x168 > > component_add+0x20/0x2c > > dp_display_probe+0x40c/0x468 [msm] > > platform_probe+0xb4/0xdc > > really_probe+0x13c/0x300 > > __driver_probe_device+0xc0/0xec > > driver_probe_device+0x48/0x204 > > __device_attach_driver+0x124/0x14c > > bus_for_each_drv+0x90/0xdc > > __device_attach+0xdc/0x1a8 > > device_initial_probe+0x20/0x2c > > bus_probe_device+0x40/0xa4 > > deferred_probe_work_func+0x94/0xd0 > > process_one_work+0x1a8/0x3c0 > > worker_thread+0x254/0x47c > > kthread+0xf8/0x1b8 > > ret_from_fork+0x10/0x20 > > ---[ end trace 0000000000000000 ]--- > > > > As drm_kms_helper_poll_init() is the last thing being called in > > msm_drm_init() shifting around the mode_config.func assignment would not > > have any impact. > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > way, let's this on hold until further proof that it's still > > reproducible. > > As I've mentioned off list, I haven't hit the apparent race I reported > here: > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > could very well be that something has changes that fixes (or hides) the > issue since. For unrelated reasons, I tried enabling async probing, and apart from apparently causing the panel driver to probe defer indefinitely, I also again hit the WARN_ON() I had added to catch this: [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 0 [drm_kms_helper] ... [ 13.679429] CPU: 0 PID: 125 Comm: kworker/0:3 Not tainted 6.2.0-rc4 #110 [ 13.687159] Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 01/10/2022 [ 13.696947] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode] [ 13.705044] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 13.713054] pc : drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper] [ 13.720812] lr : drm_kms_helper_hotplug_event+0x18/0x70 [drm_kms_helper] [ 13.728557] sp : ffff800009e33c90 [ 13.732779] x29: ffff800009e33c90 x28: ffffad90862eb000 x27: ffff62d2362ee305 [ 13.740956] x26: ffffad90862f1ea0 x25: ffffad9086309b50 x24: 0000000000000000 [ 13.749125] x23: 0000000000000003 x22: ffff62d0c5dad000 x21: 0000000000000002 [ 13.757291] x20: ffff62d0c6d24000 x19: ffff62d0c5dad000 x18: 0000000000000038 [ 13.765443] x17: 0000000000000004 x16: 000000000000d323 x15: 0000000000000004 [ 13.773585] x14: ffffad9086594208 x13: ffffad90865f50e8 x12: 0000000000000000 [ 13.781723] x11: 00000000000400d7 x10: 0000000000000008 x9 : 0000000000000002 [ 13.789867] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000097d00 [ 13.798013] x5 : ffff62d0c3395000 x4 : ffff62d2362ed750 x3 : 0000000000097e00 [ 13.806161] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 [ 13.814304] Call trace: [ 13.817604] drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper] [ 13.824959] drm_bridge_connector_hpd_cb+0xa0/0xc0 [drm_kms_helper] [ 13.832227] drm_bridge_hpd_notify+0x40/0x60 [drm] [ 13.837993] pmic_glink_altmode_worker+0xc0/0x150 [pmic_glink_altmode] [ 13.845505] process_one_work+0x288/0x6c0 [ 13.850404] worker_thread+0x74/0x450 [ 13.854948] kthread+0x118/0x120 [ 13.859032] ret_from_fork+0x10/0x20 [ 13.863473] irq event stamp: 7440 [ 13.867631] hardirqs last enabled at (7439): [<ffffad9085b00450>] _raw_spin_unlock_irqrestore+0x80/0x90 [ 13.878157] hardirqs last disabled at (7440): [<ffffad9085af33e4>] el1_dbg+0x24/0x90 [ 13.886885] softirqs last enabled at (7308): [<ffffad908514046c>] _stext+0x46c/0x5d8 [ 13.895697] softirqs last disabled at (7303): [<ffffad90851467b0>] ____do_softirq+0x10/0x20 [ 13.905038] ---[ end trace 0000000000000000 ]--- So the bug still appears to be there (and the MSM DRM driver is fragile and broken, but we knew that). Johan
On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote: > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > > On Fri, Jan 13, 2023 at 10:57:18AM +0200, Dmitry Baryshkov wrote: > > > > On 13/01/2023 06:23, Dmitry Baryshkov wrote: > > > > > On 13/01/2023 06:10, Bjorn Andersson wrote: > > > > > > Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled > > > > > > bridge_connector ends up in drm_bridge_connector_hpd_cb() calling > > > > > > drm_kms_helper_hotplug_event(), which assumes that the associated > > > > > > drm_device's mode_config.funcs is a valid pointer. > > > > > > > > > > > > But in the MSM DisplayPort driver the HPD enablement happens at bind > > > > > > time and mode_config.funcs is initialized late in msm_drm_init(). This > > > > > > means that there's a window for hot plug events to dereference a NULL > > > > > > mode_config.funcs. > > > > > > > > > > > > Move the assignment of mode_config.funcs before the bind, to avoid this > > > > > > scenario. > > > > > > > > > > Cam we make DP driver not to report HPD events until the enable_hpd() > > > > > was called? I think this is what was fixed by your internal_hpd > > > > > patchset. > > > > > > > > Or to express this in another words: I thought that internal_hpd already > > > > deferred enabling hpd event reporting till the time when we need it, didn't > > > > it? > > > > > > > > > > I added a WARN_ON(1) in drm_bridge_hpd_enable() to get a sense of when > > > this window of "opportunity" opens up, and here's the callstack: > > > > > > ------------[ cut here ]------------ > > > WARNING: CPU: 6 PID: 99 at drivers/gpu/drm/drm_bridge.c:1260 drm_bridge_hpd_enable+0x48/0x94 [drm] > > > ... > > > Call trace: > > > drm_bridge_hpd_enable+0x48/0x94 [drm] > > > drm_bridge_connector_enable_hpd+0x30/0x3c [drm_kms_helper] > > > drm_kms_helper_poll_enable+0xa4/0x114 [drm_kms_helper] > > > drm_kms_helper_poll_init+0x6c/0x7c [drm_kms_helper] > > > msm_drm_bind+0x370/0x628 [msm] > > > try_to_bring_up_aggregate_device+0x170/0x1bc > > > __component_add+0xb0/0x168 > > > component_add+0x20/0x2c > > > dp_display_probe+0x40c/0x468 [msm] > > > platform_probe+0xb4/0xdc > > > really_probe+0x13c/0x300 > > > __driver_probe_device+0xc0/0xec > > > driver_probe_device+0x48/0x204 > > > __device_attach_driver+0x124/0x14c > > > bus_for_each_drv+0x90/0xdc > > > __device_attach+0xdc/0x1a8 > > > device_initial_probe+0x20/0x2c > > > bus_probe_device+0x40/0xa4 > > > deferred_probe_work_func+0x94/0xd0 > > > process_one_work+0x1a8/0x3c0 > > > worker_thread+0x254/0x47c > > > kthread+0xf8/0x1b8 > > > ret_from_fork+0x10/0x20 > > > ---[ end trace 0000000000000000 ]--- > > > > > > As drm_kms_helper_poll_init() is the last thing being called in > > > msm_drm_init() shifting around the mode_config.func assignment would not > > > have any impact. > > > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > > way, let's this on hold until further proof that it's still > > > reproducible. > > > > As I've mentioned off list, I haven't hit the apparent race I reported > > here: > > > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > > could very well be that something has changes that fixes (or hides) the > > issue since. > > For unrelated reasons, I tried enabling async probing, and apart from > apparently causing the panel driver to probe defer indefinitely, I also > again hit the WARN_ON() I had added to catch this: > > [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 > 0 [drm_kms_helper] > ... > [ 13.679429] CPU: 0 PID: 125 Comm: kworker/0:3 Not tainted 6.2.0-rc4 #110 > [ 13.687159] Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 01/10/2022 > [ 13.696947] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode] > [ 13.705044] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 13.713054] pc : drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper] > [ 13.720812] lr : drm_kms_helper_hotplug_event+0x18/0x70 [drm_kms_helper] > [ 13.728557] sp : ffff800009e33c90 > [ 13.732779] x29: ffff800009e33c90 x28: ffffad90862eb000 x27: ffff62d2362ee305 > [ 13.740956] x26: ffffad90862f1ea0 x25: ffffad9086309b50 x24: 0000000000000000 > [ 13.749125] x23: 0000000000000003 x22: ffff62d0c5dad000 x21: 0000000000000002 > [ 13.757291] x20: ffff62d0c6d24000 x19: ffff62d0c5dad000 x18: 0000000000000038 > [ 13.765443] x17: 0000000000000004 x16: 000000000000d323 x15: 0000000000000004 > [ 13.773585] x14: ffffad9086594208 x13: ffffad90865f50e8 x12: 0000000000000000 > [ 13.781723] x11: 00000000000400d7 x10: 0000000000000008 x9 : 0000000000000002 > [ 13.789867] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000097d00 > [ 13.798013] x5 : ffff62d0c3395000 x4 : ffff62d2362ed750 x3 : 0000000000097e00 > [ 13.806161] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 > [ 13.814304] Call trace: > [ 13.817604] drm_kms_helper_hotplug_event+0x48/0x70 [drm_kms_helper] > [ 13.824959] drm_bridge_connector_hpd_cb+0xa0/0xc0 [drm_kms_helper] > [ 13.832227] drm_bridge_hpd_notify+0x40/0x60 [drm] > [ 13.837993] pmic_glink_altmode_worker+0xc0/0x150 [pmic_glink_altmode] > [ 13.845505] process_one_work+0x288/0x6c0 > [ 13.850404] worker_thread+0x74/0x450 > [ 13.854948] kthread+0x118/0x120 > [ 13.859032] ret_from_fork+0x10/0x20 > [ 13.863473] irq event stamp: 7440 > [ 13.867631] hardirqs last enabled at (7439): [<ffffad9085b00450>] _raw_spin_unlock_irqrestore+0x80/0x90 > [ 13.878157] hardirqs last disabled at (7440): [<ffffad9085af33e4>] el1_dbg+0x24/0x90 > [ 13.886885] softirqs last enabled at (7308): [<ffffad908514046c>] _stext+0x46c/0x5d8 > [ 13.895697] softirqs last disabled at (7303): [<ffffad90851467b0>] ____do_softirq+0x10/0x20 > [ 13.905038] ---[ end trace 0000000000000000 ]--- > > So the bug still appears to be there (and the MSM DRM driver is fragile > and broken, but we knew that). > But the ordering between mode_config.funcs = !NULL and drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear. And my testing shows that drm_kms_helper_poll_init() is the cause for getting bridge->hpd_cb != NULL. So the ordering seems legit, unless there's something else causing the assignment of bridge->hpd_cb to happen earlier in this scenario. Regards, Bjorn
On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote: > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote: > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > > > way, let's this on hold until further proof that it's still > > > > reproducible. > > > > > > As I've mentioned off list, I haven't hit the apparent race I reported > > > here: > > > > > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > > > could very well be that something has changes that fixes (or hides) the > > > issue since. > > > > For unrelated reasons, I tried enabling async probing, and apart from > > apparently causing the panel driver to probe defer indefinitely, I also > > again hit the WARN_ON() I had added to catch this: > > > > [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 > > 0 [drm_kms_helper] > > So the bug still appears to be there (and the MSM DRM driver is fragile > > and broken, but we knew that). > > > > But the ordering between mode_config.funcs = !NULL and > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear. > > And my testing shows that drm_kms_helper_poll_init() is the cause for > getting bridge->hpd_cb != NULL. > > So the ordering seems legit, unless there's something else causing the > assignment of bridge->hpd_cb to happen earlier in this scenario. I'm not saying that this patch is correct (indeed it doesn't seem to be), but only that the bug I reported still appears to be present in 6.2. Now that I actually looked at this again, I realise that the reason that haven't seen it with 6.2 is more likely due to the fact that I'm now making sure to load the panel driver before the drm driver to avoid that unnecessary probe deferral. With async probing, I get the probe deferral again, and boom, I hit the same old NULL deref. I see there's a call to drm_kms_helper_poll_fini() in msm_drm_uninit() which should stop the polling, but perhaps there's still some corner case due to the unexpected probe (or rather component bind) deferral which we're hitting. Johan
On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote: > On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote: > > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote: > > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > > > > way, let's this on hold until further proof that it's still > > > > > reproducible. > > > > > > > > As I've mentioned off list, I haven't hit the apparent race I reported > > > > here: > > > > > > > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > > > > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > > > > could very well be that something has changes that fixes (or hides) the > > > > issue since. > > > > > > For unrelated reasons, I tried enabling async probing, and apart from > > > apparently causing the panel driver to probe defer indefinitely, I also > > > again hit the WARN_ON() I had added to catch this: > > > > > > [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 > > > 0 [drm_kms_helper] > > > > So the bug still appears to be there (and the MSM DRM driver is fragile > > > and broken, but we knew that). > > > > > > > But the ordering between mode_config.funcs = !NULL and > > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear. > > > > And my testing shows that drm_kms_helper_poll_init() is the cause for > > getting bridge->hpd_cb != NULL. > > > > So the ordering seems legit, unless there's something else causing the > > assignment of bridge->hpd_cb to happen earlier in this scenario. > > I'm not saying that this patch is correct (indeed it doesn't seem to > be), but only that the bug I reported still appears to be present in > 6.2. > > Now that I actually looked at this again, I realise that the reason that > haven't seen it with 6.2 is more likely due to the fact that I'm now > making sure to load the panel driver before the drm driver to avoid that > unnecessary probe deferral. > > With async probing, I get the probe deferral again, and boom, I hit the > same old NULL deref. > > I see there's a call to drm_kms_helper_poll_fini() in msm_drm_uninit() > which should stop the polling, but perhaps there's still some corner > case due to the unexpected probe (or rather component bind) deferral > which we're hitting. I guess the drm_kms_helper_poll_fini() bit is irrelevant here as the call comes from the pmic_glink_altmode_worker() and drm_bridge_hpd_notify(). Perhaps the pmic_glink altmode driver simply isn't notified that the drm device is gone again due to the late "probe" deferral or similar? Johan
On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote: > On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote: > > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote: > > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > > > > way, let's this on hold until further proof that it's still > > > > > reproducible. > > > > > > > > As I've mentioned off list, I haven't hit the apparent race I reported > > > > here: > > > > > > > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > > > > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > > > > could very well be that something has changes that fixes (or hides) the > > > > issue since. > > > > > > For unrelated reasons, I tried enabling async probing, and apart from > > > apparently causing the panel driver to probe defer indefinitely, I also > > > again hit the WARN_ON() I had added to catch this: > > > > > > [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 > > > 0 [drm_kms_helper] > > > > So the bug still appears to be there (and the MSM DRM driver is fragile > > > and broken, but we knew that). > > > > > > > But the ordering between mode_config.funcs = !NULL and > > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear. > > > > And my testing shows that drm_kms_helper_poll_init() is the cause for > > getting bridge->hpd_cb != NULL. > > > > So the ordering seems legit, unless there's something else causing the > > assignment of bridge->hpd_cb to happen earlier in this scenario. > > I'm not saying that this patch is correct (indeed it doesn't seem to > be), but only that the bug I reported still appears to be present in > 6.2. So after debugging this issue a third time, I can conclude that it is still very much present in 6.2. It appears you looked at the linux-next tree when you concluded that this patch was not needed. In 6.2 the bridge->hpd_cb callback is set before mode_config.funcs is initialised as part of kms->funcs->hw_init(kms). The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer dereference by moving the bridge->hpd_cb initialisation to drm_kms_helper_poll_init() as you mention above. The PMIC GLINK altmode driver still happily forwards notifications regardless of the DRM driver state though, which can lead to missed hotplug events. It seems you need to implement the hpd_enable()/disable() callbacks and either cache or not enable events in fw until the DRM driver is ready. Johan
On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote: > On Tue, Jan 24, 2023 at 09:09:02AM +0100, Johan Hovold wrote: > > On Mon, Jan 23, 2023 at 09:17:49AM -0800, Bjorn Andersson wrote: > > > On Mon, Jan 23, 2023 at 05:01:45PM +0100, Johan Hovold wrote: > > > > On Tue, Jan 17, 2023 at 09:04:39AM +0100, Johan Hovold wrote: > > > > > On Mon, Jan 16, 2023 at 08:51:22PM -0600, Bjorn Andersson wrote: > > > > > > > > Perhaps we have shuffled other things around to avoid this bug? Either > > > > > > way, let's this on hold until further proof that it's still > > > > > > reproducible. > > > > > > > > > > As I've mentioned off list, I haven't hit the apparent race I reported > > > > > here: > > > > > > > > > > https://lore.kernel.org/all/Y1efJh11B5UQZ0Tz@hovoldconsulting.com/ > > > > > > > > > > since moving to 6.2. I did hit it with both 6.0 and 6.1-rc2, but it > > > > > could very well be that something has changes that fixes (or hides) the > > > > > issue since. > > > > > > > > For unrelated reasons, I tried enabling async probing, and apart from > > > > apparently causing the panel driver to probe defer indefinitely, I also > > > > again hit the WARN_ON() I had added to catch this: > > > > > > > > [ 13.593235] WARNING: CPU: 0 PID: 125 at drivers/gpu/drm/drm_probe_helper.c:664 drm_kms_helper_hotplug_event+0x48/0x7 > > > > 0 [drm_kms_helper] > > > > > > So the bug still appears to be there (and the MSM DRM driver is fragile > > > > and broken, but we knew that). > > > > > > > > > > But the ordering between mode_config.funcs = !NULL and > > > drm_kms_helper_poll_init() in msm_drm_init() seems pretty clear. > > > > > > And my testing shows that drm_kms_helper_poll_init() is the cause for > > > getting bridge->hpd_cb != NULL. > > > > > > So the ordering seems legit, unless there's something else causing the > > > assignment of bridge->hpd_cb to happen earlier in this scenario. > > > > I'm not saying that this patch is correct (indeed it doesn't seem to > > be), but only that the bug I reported still appears to be present in > > 6.2. > > So after debugging this issue a third time, I can conclude that it is > still very much present in 6.2. > > It appears you looked at the linux-next tree when you concluded that > this patch was not needed. In 6.2 the bridge->hpd_cb callback is set > before mode_config.funcs is initialised as part of > kms->funcs->hw_init(kms). > > The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer > dereference by moving the bridge->hpd_cb initialisation to > drm_kms_helper_poll_init() as you mention above. > > The PMIC GLINK altmode driver still happily forwards notifications > regardless of the DRM driver state though, which can lead to missed > hotplug events. It seems you need to implement the > hpd_enable()/disable() callbacks and either cache or not enable events > in fw until the DRM driver is ready. > It's not clear to me what the expectation from the DRM framework is on this point. We register a drm_bridge which is only capable of signaling HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT). Does this imply that any such bridge must ensure that hpd events are re-delivered once hpd_enable() has been invoked (we can't invoke it from hpd_enable...)? Is it reasonable to do this retriggering in the altmode driver? Or is it the job of the TCPM (it seems reasonable to not send the PAN_EN message until we get hpd_enable()...)? Regards, Bjorn
On Thu, Mar 02, 2023 at 03:17:04PM -0800, Bjorn Andersson wrote: > On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote: > > So after debugging this issue a third time, I can conclude that it is > > still very much present in 6.2. > > > > It appears you looked at the linux-next tree when you concluded that > > this patch was not needed. In 6.2 the bridge->hpd_cb callback is set > > before mode_config.funcs is initialised as part of > > kms->funcs->hw_init(kms). > > > > The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer > > dereference by moving the bridge->hpd_cb initialisation to > > drm_kms_helper_poll_init() as you mention above. I can confirm that as expected my reproducer no longer triggers with 6.3-rc1. > > The PMIC GLINK altmode driver still happily forwards notifications > > regardless of the DRM driver state though, which can lead to missed > > hotplug events. It seems you need to implement the > > hpd_enable()/disable() callbacks and either cache or not enable events > > in fw until the DRM driver is ready. > > > > It's not clear to me what the expectation from the DRM framework is on > this point. We register a drm_bridge which is only capable of signaling > HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT). I think the assumption is that any bridge that can generate hotplug events also has a way of detecting whether it is connected (i.e. DRM_BRIDGE_OP_HPD => DRM_BRIDGE_OP_DETECT). The pmic_glink_altmode driver appears to be the only driver that sets DRM_BRIDGE_OP_HPD but not DRM_BRIDGE_OP_DETECT. > Does this imply that any such bridge must ensure that hpd events are > re-delivered once hpd_enable() has been invoked (we can't invoke it from > hpd_enable...)? > > Is it reasonable to do this retriggering in the altmode driver? Or is it > the job of the TCPM (it seems reasonable to not send the PAN_EN message > until we get hpd_enable()...)? Are you sure there is no way to query the firmware about the connected state? Otherwise, enabling the notification messages when hpd_enable() is called looks like it should work as the fw currently appears to always send a disconnected event followed by a connect event if connected. But that's not going to be enough unless you can also disable events in fw on hpd_disable() so that the state can again be updated on the next hpd_enable(). If that's not possible, it seems you need to cache the state in the driver and hope you get a notification after a suspend cycle if the state has changed. But in any case, the DRM documentation is pretty clear on that a bridge driver should not be calling drm_bridge_hpd_notify() until hpd_enable() is called (and also not after hpd_disable()) as the pmic_glink_altmode driver currently do. hpd_enable Enable hot plug detection. From now on the bridge shall call drm_bridge_hpd_notify() each time a change is detected in the output connection status, until hot plug detection gets disabled with hpd_disable. This callback is optional and shall only be implemented by bridges that support hot-plug notification without polling. Bridges that implement it shall also implement the hpd_disable callback and set the DRM_BRIDGE_OP_HPD flag in their drm_bridge->ops. https://dri.freedesktop.org/docs/drm/gpu/drm-kms-helpers.html#c.drm_bridge_funcs Johan
diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index ee2f60b6f09b..7ac670f3e6ab 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -438,6 +438,9 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) drm_mode_config_init(ddev); + ddev->mode_config.funcs = &mode_config_funcs; + ddev->mode_config.helper_private = &mode_config_helper_funcs; + ret = msm_init_vram(ddev); if (ret) return ret; @@ -479,9 +482,6 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) drm_helper_move_panel_connectors_to_head(ddev); - ddev->mode_config.funcs = &mode_config_funcs; - ddev->mode_config.helper_private = &mode_config_helper_funcs; - for (i = 0; i < priv->num_crtcs; i++) { /* initialize event thread */ priv->event_thread[i].crtc_id = priv->crtcs[i]->base.id;
Invoking drm_bridge_hpd_notify() on a drm_bridge with a HPD-enabled bridge_connector ends up in drm_bridge_connector_hpd_cb() calling drm_kms_helper_hotplug_event(), which assumes that the associated drm_device's mode_config.funcs is a valid pointer. But in the MSM DisplayPort driver the HPD enablement happens at bind time and mode_config.funcs is initialized late in msm_drm_init(). This means that there's a window for hot plug events to dereference a NULL mode_config.funcs. Move the assignment of mode_config.funcs before the bind, to avoid this scenario. Reported-by: Johan Hovold <johan@kernel.org> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com> --- drivers/gpu/drm/msm/msm_drv.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)