Message ID | 51e6d7fb-ac9e-a59f-ea63-ad06219b429d@collabora.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker <guillaume.tucker@collabora.com> wrote: > Hi Daniel, > > Please see below, I've had several bisection results pointing at > that commit over the week-end on mainline but also on linux-next > and net-next. While the peach-pi is a bit flaky at the moment > and is likely to have more than one issue, it does seem like this > commit is causing some well reproducible kernel hang. > > Here's a re-run with v4.15-rc3 showing the issue: > > https://lava.collabora.co.uk/scheduler/job/1018478 > > and here's another one with the change mentioned below reverted: > > https://lava.collabora.co.uk/scheduler/job/1018479 > > They both show a warning about "unbalanced disables for lcd_vdd", > I don't know if this is related as I haven't investigated any > further. It does appear to reliably hang with v4.15-rc3 and > boot most of the time with the commit reverted though. > > The automated kernelci.org bisection is still an experimental > tool and it may well be a false positive, so please take this > result with a pinch of salt... The patch just very minimal moves the connector cleanup around (so timing change), but except when you unload a driver (or maybe that funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have more info than "seems to hang a bit more" I have no idea what's wrong. The patch itself should work, at least it survived quite some serious testing we do on everything. -Daniel > Hope this helps! > > Best wishes, > Guillaume > > > -------- Forwarded Message -------- > Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging > Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC) > From: kernelci.org bot <bot@kernelci.org> > To: guillaume.tucker@collabora.com > > Bisection result for mainline/master (v4.15-rc3) on peach-pi > > Good known revision: > > c6b3e96 Merge branch 'for-linus' of > git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux > > Bad known revision: > > 50c4c4e Linux 4.15-rc3 > > Extra parameters: > > Tree: mainline > Branch: master > Target: peach-pi > Lab: lab-collabora > Defconfig: exynos_defconfig > Plan: boot > > > Breaking commit found: > > ------------------------------------------------------------------------------- > commit a703c55004e1c5076d57e43771b3e11117796ea0 > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Mon Dec 4 21:48:18 2017 +0100 > > drm: safely free connectors from connector_iter > In > commit 613051dac40da1751ab269572766d3348d45a197 > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Wed Dec 14 00:08:06 2016 +0100 > drm: locking&new iterators for connector_list > we've went to extreme lengths to make sure connector iterations > works > in any context, without introducing any additional locking context. > This worked, except for a small fumble in the implementation: > When we actually race with a concurrent connector unplug event, and > our temporary connector reference turns out to be the final one, then > everything breaks: We call the connector release function from > whatever context we happen to be in, which can be an irq/atomic > context. And connector freeing grabs all kinds of locks and stuff. > Fix this by creating a specially safe put function for > connetor_iter, > which (in this rare case) punts the cleanup to a worker. > Reported-by: Ben Widawsky <ben@bwidawsk.net> > Cc: Ben Widawsky <ben@bwidawsk.net> > Fixes: 613051dac40d ("drm: locking&new iterators for connector_list") > Cc: Dave Airlie <airlied@gmail.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Sean Paul <seanpaul@chromium.org> > Cc: <stable@vger.kernel.org> # v4.11+ > Reviewed-by: Dave Airlie <airlied@gmail.com> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > Link: > https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch > > diff --git a/drivers/gpu/drm/drm_connector.c > b/drivers/gpu/drm/drm_connector.c > index 25f4b2e..4820141 100644 > --- a/drivers/gpu/drm/drm_connector.c > +++ b/drivers/gpu/drm/drm_connector.c > @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) > connector->funcs->destroy(connector); > } > +static void drm_connector_free_work_fn(struct work_struct *work) > +{ > + struct drm_connector *connector = > + container_of(work, struct drm_connector, free_work); > + struct drm_device *dev = connector->dev; > + > + drm_mode_object_unregister(dev, &connector->base); > + connector->funcs->destroy(connector); > +} > + > /** > * drm_connector_init - Init a preallocated connector > * @dev: DRM device > @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, > if (ret) > return ret; > + INIT_WORK(&connector->free_work, drm_connector_free_work_fn); > + > connector->base.properties = &connector->properties; > connector->dev = dev; > connector->funcs = funcs; > @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device > *dev, > } > EXPORT_SYMBOL(drm_connector_list_iter_begin); > +/* > + * Extra-safe connector put function that works in any context. Should only > be > + * used from the connector_iter functions, where we never really expect to > + * actually release the connector when dropping our final reference. > + */ > +static void > +drm_connector_put_safe(struct drm_connector *conn) > +{ > + if (refcount_dec_and_test(&conn->base.refcount.refcount)) > + schedule_work(&conn->free_work); > +} > + > /** > * drm_connector_list_iter_next - return next connector > * @iter: connectr_list iterator > @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct > drm_connector_list_iter *iter) > spin_unlock_irqrestore(&config->connector_list_lock, flags); > if (old_conn) > - drm_connector_put(old_conn); > + drm_connector_put_safe(old_conn); > return iter->conn; > } > @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct > drm_connector_list_iter *iter) > { > iter->dev = NULL; > if (iter->conn) > - drm_connector_put(iter->conn); > + drm_connector_put_safe(iter->conn); > lock_release(&connector_list_iter_dep_map, 0, _RET_IP_); > } > EXPORT_SYMBOL(drm_connector_list_iter_end); > diff --git a/drivers/gpu/drm/drm_mode_config.c > b/drivers/gpu/drm/drm_mode_config.c > index cda8bfa..cc78b3d 100644 > --- a/drivers/gpu/drm/drm_mode_config.c > +++ b/drivers/gpu/drm/drm_mode_config.c > @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) > drm_connector_put(connector); > } > drm_connector_list_iter_end(&conn_iter); > + /* connector_iter drops references in a work item. */ > + flush_scheduled_work(); > if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { > drm_connector_list_iter_begin(dev, &conn_iter); > drm_for_each_connector_iter(connector, &conn_iter) > diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h > index df9807a..a4649c5 100644 > --- a/include/drm/drm_connector.h > +++ b/include/drm/drm_connector.h > @@ -916,6 +916,14 @@ struct drm_connector { > uint8_t num_h_tile, num_v_tile; > uint8_t tile_h_loc, tile_v_loc; > uint16_t tile_h_size, tile_v_size; > + > + /** > + * @free_work: > + * > + * Work used only by &drm_connector_iter to be able to clean up a > + * connector from any context. > + */ > + struct work_struct free_work; > }; > #define obj_to_connector(x) container_of(x, struct drm_connector, base) > ------------------------------------------------------------------------------- > > > Git bisection log: > > ------------------------------------------------------------------------------- > git bisect start > # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' > of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux > git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94 > # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3 > git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36 > # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net > git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d > # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' > of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media > git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d > # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag > 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux > git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12 > # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag > 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel > into drm-fixes > git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992 > # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop > NONCONTIG flag for buffers allocated without IOMMU > git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80 > # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the > shrink request to prevent skip huge pool > git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6 > # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch > 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes > git bisect good db8f884ca7fe6af64d443d1510464efe23826131 > # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag > 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc > into drm-fixes > git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828 > # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free > connectors from connector_iter > git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0 > # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely > free connectors from connector_iter > -------------------------------------------------------------------------------
[adding Marek and Shuah to cc list] On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote: > On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker > <guillaume.tucker@collabora.com> wrote: >> Hi Daniel, >> >> Please see below, I've had several bisection results pointing at >> that commit over the week-end on mainline but also on linux-next >> and net-next. While the peach-pi is a bit flaky at the moment >> and is likely to have more than one issue, it does seem like this >> commit is causing some well reproducible kernel hang. >> >> Here's a re-run with v4.15-rc3 showing the issue: >> >> https://lava.collabora.co.uk/scheduler/job/1018478 >> >> and here's another one with the change mentioned below reverted: >> >> https://lava.collabora.co.uk/scheduler/job/1018479 >> >> They both show a warning about "unbalanced disables for lcd_vdd", >> I don't know if this is related as I haven't investigated any >> further. It does appear to reliably hang with v4.15-rc3 and >> boot most of the time with the commit reverted though. >> >> The automated kernelci.org bisection is still an experimental >> tool and it may well be a false positive, so please take this >> result with a pinch of salt... > > The patch just very minimal moves the connector cleanup around (so > timing change), but except when you unload a driver (or maybe that > funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have > more info than "seems to hang a bit more" I have no idea what's wrong. > The patch itself should work, at least it survived quite some serious > testing we do on everything. > -Daniel > Marek was pointing to a different culprit [0] in this [1] thread. I see that both commits made it to v4.15-rc3, which is the first version where boot fails. So maybe is a combination of both? Or rather reverting one patch masks the error in the other. I've access to the machine but unfortunately not a lot of time to dig on this, I could try to do it in the weekend though. [0]: https://patchwork.kernel.org/patch/10067711/ [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html Best regards, Javier >> Hope this helps! >> >> Best wishes, >> Guillaume >> >> >> -------- Forwarded Message -------- >> Subject: mainline/master boot bisection: v4.15-rc3 on peach-pi #3228-staging >> Date: Mon, 11 Dec 2017 08:25:55 +0000 (UTC) >> From: kernelci.org bot <bot@kernelci.org> >> To: guillaume.tucker@collabora.com >> >> Bisection result for mainline/master (v4.15-rc3) on peach-pi >> >> Good known revision: >> >> c6b3e96 Merge branch 'for-linus' of >> git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux >> >> Bad known revision: >> >> 50c4c4e Linux 4.15-rc3 >> >> Extra parameters: >> >> Tree: mainline >> Branch: master >> Target: peach-pi >> Lab: lab-collabora >> Defconfig: exynos_defconfig >> Plan: boot >> >> >> Breaking commit found: >> >> ------------------------------------------------------------------------------- >> commit a703c55004e1c5076d57e43771b3e11117796ea0 >> Author: Daniel Vetter <daniel.vetter@ffwll.ch> >> Date: Mon Dec 4 21:48:18 2017 +0100 >> >> drm: safely free connectors from connector_iter >> In >> commit 613051dac40da1751ab269572766d3348d45a197 >> Author: Daniel Vetter <daniel.vetter@ffwll.ch> >> Date: Wed Dec 14 00:08:06 2016 +0100 >> drm: locking&new iterators for connector_list >> we've went to extreme lengths to make sure connector iterations >> works >> in any context, without introducing any additional locking context. >> This worked, except for a small fumble in the implementation: >> When we actually race with a concurrent connector unplug event, and >> our temporary connector reference turns out to be the final one, then >> everything breaks: We call the connector release function from >> whatever context we happen to be in, which can be an irq/atomic >> context. And connector freeing grabs all kinds of locks and stuff. >> Fix this by creating a specially safe put function for >> connetor_iter, >> which (in this rare case) punts the cleanup to a worker. >> Reported-by: Ben Widawsky <ben@bwidawsk.net> >> Cc: Ben Widawsky <ben@bwidawsk.net> >> Fixes: 613051dac40d ("drm: locking&new iterators for connector_list") >> Cc: Dave Airlie <airlied@gmail.com> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Cc: Sean Paul <seanpaul@chromium.org> >> Cc: <stable@vger.kernel.org> # v4.11+ >> Reviewed-by: Dave Airlie <airlied@gmail.com> >> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> >> Link: >> https://patchwork.freedesktop.org/patch/msgid/20171204204818.24745-1-daniel.vetter@ffwll.ch >> >> diff --git a/drivers/gpu/drm/drm_connector.c >> b/drivers/gpu/drm/drm_connector.c >> index 25f4b2e..4820141 100644 >> --- a/drivers/gpu/drm/drm_connector.c >> +++ b/drivers/gpu/drm/drm_connector.c >> @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) >> connector->funcs->destroy(connector); >> } >> +static void drm_connector_free_work_fn(struct work_struct *work) >> +{ >> + struct drm_connector *connector = >> + container_of(work, struct drm_connector, free_work); >> + struct drm_device *dev = connector->dev; >> + >> + drm_mode_object_unregister(dev, &connector->base); >> + connector->funcs->destroy(connector); >> +} >> + >> /** >> * drm_connector_init - Init a preallocated connector >> * @dev: DRM device >> @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, >> if (ret) >> return ret; >> + INIT_WORK(&connector->free_work, drm_connector_free_work_fn); >> + >> connector->base.properties = &connector->properties; >> connector->dev = dev; >> connector->funcs = funcs; >> @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device >> *dev, >> } >> EXPORT_SYMBOL(drm_connector_list_iter_begin); >> +/* >> + * Extra-safe connector put function that works in any context. Should only >> be >> + * used from the connector_iter functions, where we never really expect to >> + * actually release the connector when dropping our final reference. >> + */ >> +static void >> +drm_connector_put_safe(struct drm_connector *conn) >> +{ >> + if (refcount_dec_and_test(&conn->base.refcount.refcount)) >> + schedule_work(&conn->free_work); >> +} >> + >> /** >> * drm_connector_list_iter_next - return next connector >> * @iter: connectr_list iterator >> @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct >> drm_connector_list_iter *iter) >> spin_unlock_irqrestore(&config->connector_list_lock, flags); >> if (old_conn) >> - drm_connector_put(old_conn); >> + drm_connector_put_safe(old_conn); >> return iter->conn; >> } >> @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct >> drm_connector_list_iter *iter) >> { >> iter->dev = NULL; >> if (iter->conn) >> - drm_connector_put(iter->conn); >> + drm_connector_put_safe(iter->conn); >> lock_release(&connector_list_iter_dep_map, 0, _RET_IP_); >> } >> EXPORT_SYMBOL(drm_connector_list_iter_end); >> diff --git a/drivers/gpu/drm/drm_mode_config.c >> b/drivers/gpu/drm/drm_mode_config.c >> index cda8bfa..cc78b3d 100644 >> --- a/drivers/gpu/drm/drm_mode_config.c >> +++ b/drivers/gpu/drm/drm_mode_config.c >> @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) >> drm_connector_put(connector); >> } >> drm_connector_list_iter_end(&conn_iter); >> + /* connector_iter drops references in a work item. */ >> + flush_scheduled_work(); >> if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { >> drm_connector_list_iter_begin(dev, &conn_iter); >> drm_for_each_connector_iter(connector, &conn_iter) >> diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h >> index df9807a..a4649c5 100644 >> --- a/include/drm/drm_connector.h >> +++ b/include/drm/drm_connector.h >> @@ -916,6 +916,14 @@ struct drm_connector { >> uint8_t num_h_tile, num_v_tile; >> uint8_t tile_h_loc, tile_v_loc; >> uint16_t tile_h_size, tile_v_size; >> + >> + /** >> + * @free_work: >> + * >> + * Work used only by &drm_connector_iter to be able to clean up a >> + * connector from any context. >> + */ >> + struct work_struct free_work; >> }; >> #define obj_to_connector(x) container_of(x, struct drm_connector, base) >> ------------------------------------------------------------------------------- >> >> >> Git bisection log: >> >> ------------------------------------------------------------------------------- >> git bisect start >> # good: [c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94] Merge branch 'for-linus' >> of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux >> git bisect good c6b3e9693f8a32ba3b07e2f2723886ea2aff4e94 >> # bad: [50c4c4e268a2d7a3e58ebb698ac74da0de40ae36] Linux 4.15-rc3 >> git bisect bad 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36 >> # bad: [e9ef1fe312b533592e39cddc1327463c30b0ed8d] Merge >> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net >> git bisect bad e9ef1fe312b533592e39cddc1327463c30b0ed8d >> # bad: [77071bc6c472bb0b36818f3e9595114cdf98c86d] Merge tag 'media/v4.15-2' >> of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media >> git bisect bad 77071bc6c472bb0b36818f3e9595114cdf98c86d >> # bad: [4066aa72f9f2886105c6f747d7f9bd4f14f53c12] Merge tag >> 'drm-fixes-for-v4.15-rc3' of git://people.freedesktop.org/~airlied/linux >> git bisect bad 4066aa72f9f2886105c6f747d7f9bd4f14f53c12 >> # bad: [96980844bb4b74d2e7ce93d907670658e39a3992] Merge tag >> 'drm-intel-fixes-2017-12-07' of git://anongit.freedesktop.org/drm/drm-intel >> into drm-fixes >> git bisect bad 96980844bb4b74d2e7ce93d907670658e39a3992 >> # bad: [120a264f9c2782682027d931d83dcbd22e01da80] drm/exynos: gem: Drop >> NONCONTIG flag for buffers allocated without IOMMU >> git bisect bad 120a264f9c2782682027d931d83dcbd22e01da80 >> # good: [2bf257d662509553ae226239e7dc1c3d00636ca6] drm/ttm: roundup the >> shrink request to prevent skip huge pool >> git bisect good 2bf257d662509553ae226239e7dc1c3d00636ca6 >> # good: [db8f884ca7fe6af64d443d1510464efe23826131] Merge branch >> 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes >> git bisect good db8f884ca7fe6af64d443d1510464efe23826131 >> # bad: [bd3a3a2e92624942a143e485c83e641b2492d828] Merge tag >> 'drm-misc-fixes-2017-12-06' of git://anongit.freedesktop.org/drm/drm-misc >> into drm-fixes >> git bisect bad bd3a3a2e92624942a143e485c83e641b2492d828 >> # bad: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely free >> connectors from connector_iter >> git bisect bad a703c55004e1c5076d57e43771b3e11117796ea0 >> # first bad commit: [a703c55004e1c5076d57e43771b3e11117796ea0] drm: safely >> free connectors from connector_iter >> ------------------------------------------------------------------------------- > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > +41 (0) 79 365 57 48 - http://blog.ffwll.ch > -- > To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 11, 2017 at 11:28 PM, Javier Martinez Canillas <javier@dowhile0.org> wrote: > [adding Marek and Shuah to cc list] [snip] >>> >>> Please see below, I've had several bisection results pointing at >>> that commit over the week-end on mainline but also on linux-next >>> and net-next. While the peach-pi is a bit flaky at the moment >>> and is likely to have more than one issue, it does seem like this >>> commit is causing some well reproducible kernel hang. >>> >>> Here's a re-run with v4.15-rc3 showing the issue: >>> >>> https://lava.collabora.co.uk/scheduler/job/1018478 >>> >>> and here's another one with the change mentioned below reverted: >>> >>> https://lava.collabora.co.uk/scheduler/job/1018479 >>> >>> They both show a warning about "unbalanced disables for lcd_vdd", >>> I don't know if this is related as I haven't investigated any >>> further. It does appear to reliably hang with v4.15-rc3 and >>> boot most of the time with the commit reverted though. >>> >>> The automated kernelci.org bisection is still an experimental >>> tool and it may well be a false positive, so please take this >>> result with a pinch of salt... >> >> The patch just very minimal moves the connector cleanup around (so >> timing change), but except when you unload a driver (or maybe that >> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have >> more info than "seems to hang a bit more" I have no idea what's wrong. >> The patch itself should work, at least it survived quite some serious >> testing we do on everything. >> -Daniel >> > > Marek was pointing to a different culprit [0] in this [1] thread. I > see that both commits made it to v4.15-rc3, which is the first version > where boot fails. So maybe is a combination of both? Or rather > reverting one patch masks the error in the other. > > I've access to the machine but unfortunately not a lot of time to dig > on this, I could try to do it in the weekend though. > > [0]: https://patchwork.kernel.org/patch/10067711/ > [1]: https://www.spinics.net/lists/arm-kernel/msg622152.html > So I gave a quick look to this, and at the very least there's a bug in the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: exynos: Add status property to Exynos 542x Mixer nodes"). I've posted a fix for that: https://patchwork.kernel.org/patch/10105921/ I believe this could be also be the cause for the boot failure, since I see in the boot log that things start to go wrong after exynos-drm fails to bind the HDMI component: [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops 0xc1398690): -1 Anyway, I don't have access to the machine now, but it would be nice if someone test. Or I would do in a few days. Best regards, Javier -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote: > So I gave a quick look to this, and at the very least there's a bug in > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: > exynos: Add status property to Exynos 542x Mixer nodes"). > > I've posted a fix for that: > > https://patchwork.kernel.org/patch/10105921/ > > I believe this could be also be the cause for the boot failure, since > I see in the boot log that things start to go wrong after exynos-drm > fails to bind the HDMI component: > > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops > 0xc1398690): -1 Umm, -1 ? Looking that error code up in include/uapi/asm-generic/errno-base.h says it's -EPERM. I suspect that's someone just returning -1 because they're lazy... which is real bad form and needs fixing.
On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: > On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote: > > So I gave a quick look to this, and at the very least there's a bug in > > the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: > > exynos: Add status property to Exynos 542x Mixer nodes"). > > > > I've posted a fix for that: > > > > https://patchwork.kernel.org/patch/10105921/ > > > > I believe this could be also be the cause for the boot failure, since > > I see in the boot log that things start to go wrong after exynos-drm > > fails to bind the HDMI component: > > > > [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops > > 0xc1398690): -1 > > Umm, -1 ? Looking that error code up in > include/uapi/asm-generic/errno-base.h says it's -EPERM. > > I suspect that's someone just returning -1 because they're lazy... > which is real bad form and needs fixing. Oh, it really is -EPERM: struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, enum exynos_drm_output_type out_type) { struct drm_crtc *crtc; drm_for_each_crtc(crtc, drm_dev) if (to_exynos_crtc(crtc)->type == out_type) return to_exynos_crtc(crtc); return ERR_PTR(-EPERM); } Does "Operation not permitted" really convey the error here? It doesn't look like a permission error to me. Can we please avoid abusing errno codes?
On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: > On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote: >>> So I gave a quick look to this, and at the very least there's a bug in >>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>> exynos: Add status property to Exynos 542x Mixer nodes"). >>> >>> I've posted a fix for that: >>> >>> https://patchwork.kernel.org/patch/10105921/ >>> >>> I believe this could be also be the cause for the boot failure, since >>> I see in the boot log that things start to go wrong after exynos-drm >>> fails to bind the HDMI component: >>> >>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>> 0xc1398690): -1 >> >> Umm, -1 ? Looking that error code up in >> include/uapi/asm-generic/errno-base.h says it's -EPERM. >> >> I suspect that's someone just returning -1 because they're lazy... >> which is real bad form and needs fixing. > > Oh, it really is -EPERM: > > struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, > enum exynos_drm_output_type out_type) > { > struct drm_crtc *crtc; > > drm_for_each_crtc(crtc, drm_dev) > if (to_exynos_crtc(crtc)->type == out_type) > return to_exynos_crtc(crtc); > > return ERR_PTR(-EPERM); > } > > Does "Operation not permitted" really convey the error here? It doesn't > look like a permission error to me. > > Can we please avoid abusing errno codes? I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ with top commit g968edbd worked just fine for me last Friday. I ran several tests and everything checked out except the exynos-gsc lockdep issue I sent a 4.14 patch for. However, with 4.15-rc3, dmesg is gets filled with [ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. I will start bisect and try to isolate the problem. I suspect this is related to dts changes perhaps? I used to this problem a while back and it has been fixed. thanks, -- Shuah -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Shuah, On 2017-12-12 00:25, Shuah Khan wrote: > On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: >> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas wrote: >>>> So I gave a quick look to this, and at the very least there's a bug in >>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>>> exynos: Add status property to Exynos 542x Mixer nodes"). >>>> >>>> I've posted a fix for that: >>>> >>>> https://patchwork.kernel.org/patch/10105921/ >>>> >>>> I believe this could be also be the cause for the boot failure, since >>>> I see in the boot log that things start to go wrong after exynos-drm >>>> fails to bind the HDMI component: >>>> >>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>>> 0xc1398690): -1 >>> Umm, -1 ? Looking that error code up in >>> include/uapi/asm-generic/errno-base.h says it's -EPERM. >>> >>> I suspect that's someone just returning -1 because they're lazy... >>> which is real bad form and needs fixing. >> Oh, it really is -EPERM: >> >> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device *drm_dev, >> enum exynos_drm_output_type out_type) >> { >> struct drm_crtc *crtc; >> >> drm_for_each_crtc(crtc, drm_dev) >> if (to_exynos_crtc(crtc)->type == out_type) >> return to_exynos_crtc(crtc); >> >> return ERR_PTR(-EPERM); >> } >> >> Does "Operation not permitted" really convey the error here? It doesn't >> look like a permission error to me. >> >> Can we please avoid abusing errno codes? > I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ > with top commit g968edbd worked just fine for me last Friday. I ran several > tests and everything checked out except the exynos-gsc lockdep issue I sent > a 4.14 patch for. > > However, with 4.15-rc3, dmesg is gets filled with > > [ 342.337181] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 342.337470] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 342.337851] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.382346] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.396682] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.399244] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.399496] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.399848] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.400163] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.400495] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.401294] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > [ 402.401595] [drm] Non-contiguous allocation is not supported without IOMMU, falling back to contiguous buffer > > Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. > > I will start bisect and try to isolate the problem. I suspect this is related to dts > changes perhaps? I used to this problem a while back and it has been fixed. This warning has been added intentionally, see following discussions: https://patchwork.kernel.org/patch/10034919/ https://patchwork.kernel.org/patch/10070475/ This means that your test apps should be updated or you should enable Exynos IOMMU support in your config. Maybe it is a good time to finally enable it in exynos_defconfig. Best regards
Hello Marek, On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski <m.szyprowski@samsung.com> wrote: > Hi Shuah, > > > On 2017-12-12 00:25, Shuah Khan wrote: >> >> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: >>> >>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >>>> >>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas >>>> wrote: >>>>> >>>>> So I gave a quick look to this, and at the very least there's a bug in >>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>>>> exynos: Add status property to Exynos 542x Mixer nodes"). >>>>> >>>>> I've posted a fix for that: >>>>> >>>>> https://patchwork.kernel.org/patch/10105921/ >>>>> >>>>> I believe this could be also be the cause for the boot failure, since >>>>> I see in the boot log that things start to go wrong after exynos-drm >>>>> fails to bind the HDMI component: >>>>> >>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>>>> 0xc1398690): -1 >>>> >>>> Umm, -1 ? Looking that error code up in >>>> include/uapi/asm-generic/errno-base.h says it's -EPERM. >>>> >>>> I suspect that's someone just returning -1 because they're lazy... >>>> which is real bad form and needs fixing. >>> >>> Oh, it really is -EPERM: >>> >>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device >>> *drm_dev, >>> enum exynos_drm_output_type >>> out_type) >>> { >>> struct drm_crtc *crtc; >>> >>> drm_for_each_crtc(crtc, drm_dev) >>> if (to_exynos_crtc(crtc)->type == out_type) >>> return to_exynos_crtc(crtc); >>> >>> return ERR_PTR(-EPERM); >>> } >>> >>> Does "Operation not permitted" really convey the error here? It doesn't >>> look like a permission error to me. >>> >>> Can we please avoid abusing errno codes? >> >> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ >> with top commit g968edbd worked just fine for me last Friday. I ran >> several >> tests and everything checked out except the exynos-gsc lockdep issue I >> sent >> a 4.14 patch for. >> >> However, with 4.15-rc3, dmesg is gets filled with >> >> [ 342.337181] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 342.337470] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 342.337851] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.382346] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.396682] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.399244] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.399496] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.399848] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.400163] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.400495] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.401294] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> [ 402.401595] [drm] Non-contiguous allocation is not supported without >> IOMMU, falling back to contiguous buffer >> >> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. >> >> I will start bisect and try to isolate the problem. I suspect this is >> related to dts >> changes perhaps? I used to this problem a while back and it has been >> fixed. > > > This warning has been added intentionally, see following discussions: > https://patchwork.kernel.org/patch/10034919/ > https://patchwork.kernel.org/patch/10070475/ > > This means that your test apps should be updated or you should enable Exynos > IOMMU support in your config. Maybe it is a good time to finally enable it > in exynos_defconfig. > Has the issue that the boot-loader keeps the display controller enabled and scanning pages on the Exynos Chromebooks resolved? I think that's that preventing to enable it by default in exynos_defconfig since it caused boot failures when enabled on these machines. I don't follow exynos development too closely nowadays so maybe there's a fix in place now. > Best regards > -- > Marek Szyprowski, PhD > Samsung R&D Institute Poland Best regards, Javier -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Javier, On 2017-12-12 09:00, Javier Martinez Canillas wrote: > On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski > <m.szyprowski@samsung.com> wrote: >> On 2017-12-12 00:25, Shuah Khan wrote: >>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: >>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas >>>>> wrote: >>>>>> So I gave a quick look to this, and at the very least there's a bug in >>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>>>>> exynos: Add status property to Exynos 542x Mixer nodes"). >>>>>> >>>>>> I've posted a fix for that: >>>>>> >>>>>> https://patchwork.kernel.org/patch/10105921/ >>>>>> >>>>>> I believe this could be also be the cause for the boot failure, since >>>>>> I see in the boot log that things start to go wrong after exynos-drm >>>>>> fails to bind the HDMI component: >>>>>> >>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>>>>> 0xc1398690): -1 >>>>> Umm, -1 ? Looking that error code up in >>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM. >>>>> >>>>> I suspect that's someone just returning -1 because they're lazy... >>>>> which is real bad form and needs fixing. >>>> Oh, it really is -EPERM: >>>> >>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device >>>> *drm_dev, >>>> enum exynos_drm_output_type >>>> out_type) >>>> { >>>> struct drm_crtc *crtc; >>>> >>>> drm_for_each_crtc(crtc, drm_dev) >>>> if (to_exynos_crtc(crtc)->type == out_type) >>>> return to_exynos_crtc(crtc); >>>> >>>> return ERR_PTR(-EPERM); >>>> } >>>> >>>> Does "Operation not permitted" really convey the error here? It doesn't >>>> look like a permission error to me. >>>> >>>> Can we please avoid abusing errno codes? >>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ >>> with top commit g968edbd worked just fine for me last Friday. I ran >>> several >>> tests and everything checked out except the exynos-gsc lockdep issue I >>> sent >>> a 4.14 patch for. >>> >>> However, with 4.15-rc3, dmesg is gets filled with >>> >>> [ 342.337181] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 342.337470] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 342.337851] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.382346] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.396682] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.399244] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.399496] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.399848] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.400163] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.400495] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.401294] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> [ 402.401595] [drm] Non-contiguous allocation is not supported without >>> IOMMU, falling back to contiguous buffer >>> >>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. >>> >>> I will start bisect and try to isolate the problem. I suspect this is >>> related to dts >>> changes perhaps? I used to this problem a while back and it has been >>> fixed. >> >> This warning has been added intentionally, see following discussions: >> https://patchwork.kernel.org/patch/10034919/ >> https://patchwork.kernel.org/patch/10070475/ >> >> This means that your test apps should be updated or you should enable Exynos >> IOMMU support in your config. Maybe it is a good time to finally enable it >> in exynos_defconfig. >> > Has the issue that the boot-loader keeps the display controller > enabled and scanning pages on the Exynos Chromebooks resolved? > > I think that's that preventing to enable it by default in > exynos_defconfig since it caused boot failures when enabled on these > machines. I don't follow exynos development too closely nowadays so > maybe there's a fix in place now. Not directly. I still didn't find time to properly add support for devices, which were left in-working state (with active DMA transactions) by bootloader, but due to some other changes in the order of operations during boot process, power domains are initialized very early and due to temporary lack of devices (which are not yet added to the system), are turned off. This practically stops FIMD for scanning framebuffer and "solves" this issue. I've checked now and Exynos Snow Chromebook boots fine with IOMMU support enabled, both with v4.15-rc3 and linux-next. Best regards
On Tue, Dec 12, 2017 at 9:47 AM, Marek Szyprowski <m.szyprowski@samsung.com> wrote: > Hi Javier, > > On 2017-12-12 09:00, Javier Martinez Canillas wrote: >> >> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski >> <m.szyprowski@samsung.com> wrote: (...) >>> This warning has been added intentionally, see following discussions: >>> https://patchwork.kernel.org/patch/10034919/ >>> https://patchwork.kernel.org/patch/10070475/ >>> >>> This means that your test apps should be updated or you should enable >>> Exynos >>> IOMMU support in your config. Maybe it is a good time to finally enable >>> it >>> in exynos_defconfig. >>> >> Has the issue that the boot-loader keeps the display controller >> enabled and scanning pages on the Exynos Chromebooks resolved? >> >> I think that's that preventing to enable it by default in >> exynos_defconfig since it caused boot failures when enabled on these >> machines. I don't follow exynos development too closely nowadays so >> maybe there's a fix in place now. > > > Not directly. I still didn't find time to properly add support for > devices, which were left in-working state (with active DMA > transactions) by bootloader, but due to some other changes in the > order of operations during boot process, power domains are > initialized very early and due to temporary lack of devices (which > are not yet added to the system), are turned off. This practically > stops FIMD for scanning framebuffer and "solves" this issue. > > I've checked now and Exynos Snow Chromebook boots fine with IOMMU > support enabled, both with v4.15-rc3 and linux-next. Then it looks like we could give EXYNOS_IOMMU a try. At least only on exynos_defconfig which would leave multi_v7 as a platform to compare. Best regards, Krzysztof -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi All, On 2017-12-11 23:28, Javier Martinez Canillas wrote: > [adding Marek and Shuah to cc list] > > On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote: >> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker >> <guillaume.tucker@collabora.com> wrote: >>> Hi Daniel, >>> >>> Please see below, I've had several bisection results pointing at >>> that commit over the week-end on mainline but also on linux-next >>> and net-next. While the peach-pi is a bit flaky at the moment >>> and is likely to have more than one issue, it does seem like this >>> commit is causing some well reproducible kernel hang. >>> >>> Here's a re-run with v4.15-rc3 showing the issue: >>> >>> https://lava.collabora.co.uk/scheduler/job/1018478 >>> >>> and here's another one with the change mentioned below reverted: >>> >>> https://lava.collabora.co.uk/scheduler/job/1018479 >>> >>> They both show a warning about "unbalanced disables for lcd_vdd", >>> I don't know if this is related as I haven't investigated any >>> further. It does appear to reliably hang with v4.15-rc3 and >>> boot most of the time with the commit reverted though. >>> >>> The automated kernelci.org bisection is still an experimental >>> tool and it may well be a false positive, so please take this >>> result with a pinch of salt... >> The patch just very minimal moves the connector cleanup around (so >> timing change), but except when you unload a driver (or maybe that >> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have >> more info than "seems to hang a bit more" I have no idea what's wrong. >> The patch itself should work, at least it survived quite some serious >> testing we do on everything. >> -Daniel >> > Marek was pointing to a different culprit [0] in this [1] thread. I > see that both commits made it to v4.15-rc3, which is the first version > where boot fails. So maybe is a combination of both? Or rather > reverting one patch masks the error in the other. > > I've access to the machine but unfortunately not a lot of time to dig > on this, I could try to do it in the weekend though. After a recent discussion on the Javier's patch: https://patchwork.kernel.org/patch/10106417/ I've managed to reproduce this issue also on Exynos5250 based Samsung Snow Chromebook and investigate a bit. It is caused by a deadlock in the main kernel workqueue. Here are details: 1. Exynos DRM fails to initialize due to missing regulators and gets moved to deferred probe device list 2. Deferred probe is triggered and kernel "events" workqueue calls deferred_probe_work_func() 3. exynos_drm_bind() is called, component_bind_all() fails due to missing Exynos Mixer device 4. error handling path is executed in exynos_drm_bind(), which calls drm_mode_config_cleanup() 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes deadlock. Do You have idea how to fix this issue properly? Taking a look at git blame, this indeed shows that the issue has been introduced by the commit a703c55004e1 ("drm: safely free connectors from connector_ite"), which added a call to flush_scheduled_work() in drm_mode_config_cleanup(). drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if called from the workqueue, but I don't have idea how to check that. The other way of fixing it would be to resurrect separate workqueue for DRM related events. Best regards
Hi Marek, On 12/12/2017 04:38 AM, Marek Szyprowski wrote: > Hi All, > > On 2017-12-11 23:28, Javier Martinez Canillas wrote: >> [adding Marek and Shuah to cc list] >> >> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote: >>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker >>> <guillaume.tucker@collabora.com> wrote: >>>> Hi Daniel, >>>> >>>> Please see below, I've had several bisection results pointing at >>>> that commit over the week-end on mainline but also on linux-next >>>> and net-next. While the peach-pi is a bit flaky at the moment >>>> and is likely to have more than one issue, it does seem like this >>>> commit is causing some well reproducible kernel hang. >>>> >>>> Here's a re-run with v4.15-rc3 showing the issue: >>>> >>>> https://lava.collabora.co.uk/scheduler/job/1018478 >>>> >>>> and here's another one with the change mentioned below reverted: >>>> >>>> https://lava.collabora.co.uk/scheduler/job/1018479 >>>> >>>> They both show a warning about "unbalanced disables for lcd_vdd", >>>> I don't know if this is related as I haven't investigated any >>>> further. It does appear to reliably hang with v4.15-rc3 and >>>> boot most of the time with the commit reverted though. >>>> >>>> The automated kernelci.org bisection is still an experimental >>>> tool and it may well be a false positive, so please take this >>>> result with a pinch of salt... >>> The patch just very minimal moves the connector cleanup around (so >>> timing change), but except when you unload a driver (or maybe that >>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have >>> more info than "seems to hang a bit more" I have no idea what's wrong. >>> The patch itself should work, at least it survived quite some serious >>> testing we do on everything. >>> -Daniel >>> >> Marek was pointing to a different culprit [0] in this [1] thread. I >> see that both commits made it to v4.15-rc3, which is the first version >> where boot fails. So maybe is a combination of both? Or rather >> reverting one patch masks the error in the other. >> >> I've access to the machine but unfortunately not a lot of time to dig >> on this, I could try to do it in the weekend though. > > After a recent discussion on the Javier's patch: > https://patchwork.kernel.org/patch/10106417/ > I've managed to reproduce this issue also on Exynos5250 based Samsung > Snow Chromebook and investigate a bit. > > It is caused by a deadlock in the main kernel workqueue. Here are details: > > 1. Exynos DRM fails to initialize due to missing regulators and gets moved > to deferred probe device list > > 2. Deferred probe is triggered and kernel "events" workqueue calls > deferred_probe_work_func() > > 3. exynos_drm_bind() is called, component_bind_all() fails due to missing > Exynos Mixer device > > 4. error handling path is executed in exynos_drm_bind(), which calls > drm_mode_config_cleanup() > > 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes > deadlock. > > Do You have idea how to fix this issue properly? > > Taking a look at git blame, this indeed shows that the issue has been > introduced by the commit a703c55004e1 ("drm: safely free connectors from > connector_ite"), which added a call to flush_scheduled_work() in > drm_mode_config_cleanup(). This commit is making its way into stable releases. It has been added to 4.14-6 stable. If this patch poses problems, maybe somebody should comment on the stable release thread. > > drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if > called from the workqueue, but I don't have idea how to check that. The > other way of fixing it would be to resurrect separate workqueue for DRM > related events. > Especially since there is no solution :) thanks, -- Shuah -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/12/2017 01:47 AM, Marek Szyprowski wrote: > Hi Javier, > > On 2017-12-12 09:00, Javier Martinez Canillas wrote: >> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski >> <m.szyprowski@samsung.com> wrote: >>> On 2017-12-12 00:25, Shuah Khan wrote: >>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: >>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas >>>>>> wrote: >>>>>>> So I gave a quick look to this, and at the very least there's a bug in >>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>>>>>> exynos: Add status property to Exynos 542x Mixer nodes"). >>>>>>> >>>>>>> I've posted a fix for that: >>>>>>> >>>>>>> https://patchwork.kernel.org/patch/10105921/ >>>>>>> >>>>>>> I believe this could be also be the cause for the boot failure, since >>>>>>> I see in the boot log that things start to go wrong after exynos-drm >>>>>>> fails to bind the HDMI component: >>>>>>> >>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>>>>>> 0xc1398690): -1 >>>>>> Umm, -1 ? Looking that error code up in >>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM. >>>>>> >>>>>> I suspect that's someone just returning -1 because they're lazy... >>>>>> which is real bad form and needs fixing. >>>>> Oh, it really is -EPERM: >>>>> >>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device >>>>> *drm_dev, >>>>> enum exynos_drm_output_type >>>>> out_type) >>>>> { >>>>> struct drm_crtc *crtc; >>>>> >>>>> drm_for_each_crtc(crtc, drm_dev) >>>>> if (to_exynos_crtc(crtc)->type == out_type) >>>>> return to_exynos_crtc(crtc); >>>>> >>>>> return ERR_PTR(-EPERM); >>>>> } >>>>> >>>>> Does "Operation not permitted" really convey the error here? It doesn't >>>>> look like a permission error to me. >>>>> >>>>> Can we please avoid abusing errno codes? >>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ >>>> with top commit g968edbd worked just fine for me last Friday. I ran >>>> several >>>> tests and everything checked out except the exynos-gsc lockdep issue I >>>> sent >>>> a 4.14 patch for. >>>> >>>> However, with 4.15-rc3, dmesg is gets filled with >>>> >>>> [ 342.337181] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 342.337470] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 342.337851] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.382346] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.396682] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.399244] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.399496] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.399848] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.400163] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.400495] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.401294] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> [ 402.401595] [drm] Non-contiguous allocation is not supported without >>>> IOMMU, falling back to contiguous buffer >>>> >>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. >>>> >>>> I will start bisect and try to isolate the problem. I suspect this is >>>> related to dts >>>> changes perhaps? I used to this problem a while back and it has been >>>> fixed. >>> >>> This warning has been added intentionally, see following discussions: >>> https://patchwork.kernel.org/patch/10034919/ >>> https://patchwork.kernel.org/patch/10070475/ >>> >>> This means that your test apps should be updated or you should enable Exynos >>> IOMMU support in your config. Maybe it is a good time to finally enable it >>> in exynos_defconfig. >>> >> Has the issue that the boot-loader keeps the display controller >> enabled and scanning pages on the Exynos Chromebooks resolved? >> >> I think that's that preventing to enable it by default in >> exynos_defconfig since it caused boot failures when enabled on these >> machines. I don't follow exynos development too closely nowadays so >> maybe there's a fix in place now. > > Not directly. I still didn't find time to properly add support for > devices, which were left in-working state (with active DMA > transactions) by bootloader, but due to some other changes in the > order of operations during boot process, power domains are > initialized very early and due to temporary lack of devices (which > are not yet added to the system), are turned off. This practically > stops FIMD for scanning framebuffer and "solves" this issue. > > I've checked now and Exynos Snow Chromebook boots fine with IOMMU > support enabled, both with v4.15-rc3 and linux-next. > Good to know it doesn't break Exynos Snow. This is why I test without IOMMU and then enable IOMMU on odroid-xu4 for test with IOMMU thanks, -- Shuah -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski <m.szyprowski@samsung.com> wrote: > Hi All, > > > On 2017-12-11 23:28, Javier Martinez Canillas wrote: >> >> [adding Marek and Shuah to cc list] >> >> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> >> wrote: >>> >>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker >>> <guillaume.tucker@collabora.com> wrote: >>>> >>>> Hi Daniel, >>>> >>>> Please see below, I've had several bisection results pointing at >>>> that commit over the week-end on mainline but also on linux-next >>>> and net-next. While the peach-pi is a bit flaky at the moment >>>> and is likely to have more than one issue, it does seem like this >>>> commit is causing some well reproducible kernel hang. >>>> >>>> Here's a re-run with v4.15-rc3 showing the issue: >>>> >>>> https://lava.collabora.co.uk/scheduler/job/1018478 >>>> >>>> and here's another one with the change mentioned below reverted: >>>> >>>> https://lava.collabora.co.uk/scheduler/job/1018479 >>>> >>>> They both show a warning about "unbalanced disables for lcd_vdd", >>>> I don't know if this is related as I haven't investigated any >>>> further. It does appear to reliably hang with v4.15-rc3 and >>>> boot most of the time with the commit reverted though. >>>> >>>> The automated kernelci.org bisection is still an experimental >>>> tool and it may well be a false positive, so please take this >>>> result with a pinch of salt... >>> >>> The patch just very minimal moves the connector cleanup around (so >>> timing change), but except when you unload a driver (or maybe that >>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have >>> more info than "seems to hang a bit more" I have no idea what's wrong. >>> The patch itself should work, at least it survived quite some serious >>> testing we do on everything. >>> -Daniel >>> >> Marek was pointing to a different culprit [0] in this [1] thread. I >> see that both commits made it to v4.15-rc3, which is the first version >> where boot fails. So maybe is a combination of both? Or rather >> reverting one patch masks the error in the other. >> >> I've access to the machine but unfortunately not a lot of time to dig >> on this, I could try to do it in the weekend though. > > > After a recent discussion on the Javier's patch: > https://patchwork.kernel.org/patch/10106417/ > I've managed to reproduce this issue also on Exynos5250 based Samsung > Snow Chromebook and investigate a bit. > > It is caused by a deadlock in the main kernel workqueue. Here are details: > > 1. Exynos DRM fails to initialize due to missing regulators and gets moved > to deferred probe device list > > 2. Deferred probe is triggered and kernel "events" workqueue calls > deferred_probe_work_func() > > 3. exynos_drm_bind() is called, component_bind_all() fails due to missing > Exynos Mixer device > > 4. error handling path is executed in exynos_drm_bind(), which calls > drm_mode_config_cleanup() > > 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes > deadlock. > > Do You have idea how to fix this issue properly? > > Taking a look at git blame, this indeed shows that the issue has been > introduced by the commit a703c55004e1 ("drm: safely free connectors from > connector_ite"), which added a call to flush_scheduled_work() in > drm_mode_config_cleanup(). > > drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if > called from the workqueue, but I don't have idea how to check that. The > other way of fixing it would be to resurrect separate workqueue for DRM > related events. We need to flush the work there, or things will go wrong on unload. I think the real fix is to make sure that the connector cleanup work isn't run on the same work stuff as any driver bind stuff, which yes means new separate workqueue just for this. I guess the simple fix is to do that in the drm, but tbh I'm surprised that driver bind/deferred probe hasn't run into this problem anywhere else yet. -Daniel
On 12/12/2017 07:47 AM, Shuah Khan wrote: > On 12/12/2017 01:47 AM, Marek Szyprowski wrote: >> Hi Javier, >> >> On 2017-12-12 09:00, Javier Martinez Canillas wrote: >>> On Tue, Dec 12, 2017 at 8:54 AM, Marek Szyprowski >>> <m.szyprowski@samsung.com> wrote: >>>> On 2017-12-12 00:25, Shuah Khan wrote: >>>>> On 12/11/2017 04:02 PM, Russell King - ARM Linux wrote: >>>>>> On Mon, Dec 11, 2017 at 10:58:29PM +0000, Russell King - ARM Linux wrote: >>>>>>> On Mon, Dec 11, 2017 at 11:54:48PM +0100, Javier Martinez Canillas >>>>>>> wrote: >>>>>>>> So I gave a quick look to this, and at the very least there's a bug in >>>>>>>> the Exynos5800 Peach Pi DTS caused by commit 1cb686c08d12 ("ARM: dts: >>>>>>>> exynos: Add status property to Exynos 542x Mixer nodes"). >>>>>>>> >>>>>>>> I've posted a fix for that: >>>>>>>> >>>>>>>> https://patchwork.kernel.org/patch/10105921/ >>>>>>>> >>>>>>>> I believe this could be also be the cause for the boot failure, since >>>>>>>> I see in the boot log that things start to go wrong after exynos-drm >>>>>>>> fails to bind the HDMI component: >>>>>>>> >>>>>>>> [ 2.916347] exynos-drm exynos-drm: failed to bind 14530000.hdmi (ops >>>>>>>> 0xc1398690): -1 >>>>>>> Umm, -1 ? Looking that error code up in >>>>>>> include/uapi/asm-generic/errno-base.h says it's -EPERM. >>>>>>> >>>>>>> I suspect that's someone just returning -1 because they're lazy... >>>>>>> which is real bad form and needs fixing. >>>>>> Oh, it really is -EPERM: >>>>>> >>>>>> struct exynos_drm_crtc *exynos_drm_crtc_get_by_type(struct drm_device >>>>>> *drm_dev, >>>>>> enum exynos_drm_output_type >>>>>> out_type) >>>>>> { >>>>>> struct drm_crtc *crtc; >>>>>> >>>>>> drm_for_each_crtc(crtc, drm_dev) >>>>>> if (to_exynos_crtc(crtc)->type == out_type) >>>>>> return to_exynos_crtc(crtc); >>>>>> >>>>>> return ERR_PTR(-EPERM); >>>>>> } >>>>>> >>>>>> Does "Operation not permitted" really convey the error here? It doesn't >>>>>> look like a permission error to me. >>>>>> >>>>>> Can we please avoid abusing errno codes? >>>>> I tried 4.15-rc3 on odroid-xu4 after seeing drm issues reported. 4.15-rc2+ >>>>> with top commit g968edbd worked just fine for me last Friday. I ran >>>>> several >>>>> tests and everything checked out except the exynos-gsc lockdep issue I >>>>> sent >>>>> a 4.14 patch for. >>>>> >>>>> However, with 4.15-rc3, dmesg is gets filled with >>>>> >>>>> [ 342.337181] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 342.337470] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 342.337851] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.382346] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.396682] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.399244] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.399496] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.399848] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.400163] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.400495] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.401294] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> [ 402.401595] [drm] Non-contiguous allocation is not supported without >>>>> IOMMU, falling back to contiguous buffer >>>>> >>>>> Something broke in 4.15-rc3 on odroix-xu4 badly with exynos_defconfig. >>>>> >>>>> I will start bisect and try to isolate the problem. I suspect this is >>>>> related to dts >>>>> changes perhaps? I used to this problem a while back and it has been >>>>> fixed. >>>> >>>> This warning has been added intentionally, see following discussions: >>>> https://patchwork.kernel.org/patch/10034919/ >>>> https://patchwork.kernel.org/patch/10070475/ >>>> >>>> This means that your test apps should be updated or you should enable Exynos >>>> IOMMU support in your config. Maybe it is a good time to finally enable it >>>> in exynos_defconfig. >>>> >>> Has the issue that the boot-loader keeps the display controller >>> enabled and scanning pages on the Exynos Chromebooks resolved? >>> >>> I think that's that preventing to enable it by default in >>> exynos_defconfig since it caused boot failures when enabled on these >>> machines. I don't follow exynos development too closely nowadays so >>> maybe there's a fix in place now. >> >> Not directly. I still didn't find time to properly add support for >> devices, which were left in-working state (with active DMA >> transactions) by bootloader, but due to some other changes in the >> order of operations during boot process, power domains are >> initialized very early and due to temporary lack of devices (which >> are not yet added to the system), are turned off. This practically >> stops FIMD for scanning framebuffer and "solves" this issue. >> >> I've checked now and Exynos Snow Chromebook boots fine with IOMMU >> support enabled, both with v4.15-rc3 and linux-next. Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent a patch to do that a while back. The decision at the time to not pull that patch is was based on systems not booting with it enabled. Is it time to revisit that or the recommendation is for IOMMU to be enabled in configs manually on systems it is safe to do so? thanks, -- Shuah -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Shuah, On Tue, Dec 12, 2017 at 7:26 PM, Shuah Khan <shuahkh@osg.samsung.com> wrote: [snip] >>> >>> Not directly. I still didn't find time to properly add support for >>> devices, which were left in-working state (with active DMA >>> transactions) by bootloader, but due to some other changes in the >>> order of operations during boot process, power domains are >>> initialized very early and due to temporary lack of devices (which >>> are not yet added to the system), are turned off. This practically >>> stops FIMD for scanning framebuffer and "solves" this issue. >>> >>> I've checked now and Exynos Snow Chromebook boots fine with IOMMU >>> support enabled, both with v4.15-rc3 and linux-next. > > Would it make sense to enable EXYNOS_IOMMU in exynos_defconfig. I sent > a patch to do that a while back. The decision at the time to not pull > that patch is was based on systems not booting with it enabled. > > Is it time to revisit that or the recommendation is for IOMMU to be > enabled in configs manually on systems it is safe to do so? > Yes, I think it would be good to have it enabled by default if that doesn't cause boot issues anymore. Could you please resend your patch and cc Marek and me so we can test on Snow and Peach Chromebooks? > thanks, > -- Shuah Best regards, Javier -- To unsubscribe from this list: send the line "unsubscribe linux-samsung-soc" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Daniel, On 2017-12-12 19:14, Daniel Vetter wrote: > On Tue, Dec 12, 2017 at 12:38 PM, Marek Szyprowski > <m.szyprowski@samsung.com> wrote: >> Hi All, >> >> >> On 2017-12-11 23:28, Javier Martinez Canillas wrote: >>> [adding Marek and Shuah to cc list] >>> >>> On Mon, Dec 11, 2017 at 6:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> >>> wrote: >>>> On Mon, Dec 11, 2017 at 11:30 AM, Guillaume Tucker >>>> <guillaume.tucker@collabora.com> wrote: >>>>> Hi Daniel, >>>>> >>>>> Please see below, I've had several bisection results pointing at >>>>> that commit over the week-end on mainline but also on linux-next >>>>> and net-next. While the peach-pi is a bit flaky at the moment >>>>> and is likely to have more than one issue, it does seem like this >>>>> commit is causing some well reproducible kernel hang. >>>>> >>>>> Here's a re-run with v4.15-rc3 showing the issue: >>>>> >>>>> https://lava.collabora.co.uk/scheduler/job/1018478 >>>>> >>>>> and here's another one with the change mentioned below reverted: >>>>> >>>>> https://lava.collabora.co.uk/scheduler/job/1018479 >>>>> >>>>> They both show a warning about "unbalanced disables for lcd_vdd", >>>>> I don't know if this is related as I haven't investigated any >>>>> further. It does appear to reliably hang with v4.15-rc3 and >>>>> boot most of the time with the commit reverted though. >>>>> >>>>> The automated kernelci.org bisection is still an experimental >>>>> tool and it may well be a false positive, so please take this >>>>> result with a pinch of salt... >>>> The patch just very minimal moves the connector cleanup around (so >>>> timing change), but except when you unload a driver (or maybe that >>>> funny EPROBE_DEFER stuff) it shouldn't matter. So if you don't have >>>> more info than "seems to hang a bit more" I have no idea what's wrong. >>>> The patch itself should work, at least it survived quite some serious >>>> testing we do on everything. >>>> -Daniel >>>> >>> Marek was pointing to a different culprit [0] in this [1] thread. I >>> see that both commits made it to v4.15-rc3, which is the first version >>> where boot fails. So maybe is a combination of both? Or rather >>> reverting one patch masks the error in the other. >>> >>> I've access to the machine but unfortunately not a lot of time to dig >>> on this, I could try to do it in the weekend though. >> >> After a recent discussion on the Javier's patch: >> https://patchwork.kernel.org/patch/10106417/ >> I've managed to reproduce this issue also on Exynos5250 based Samsung >> Snow Chromebook and investigate a bit. >> >> It is caused by a deadlock in the main kernel workqueue. Here are details: >> >> 1. Exynos DRM fails to initialize due to missing regulators and gets moved >> to deferred probe device list >> >> 2. Deferred probe is triggered and kernel "events" workqueue calls >> deferred_probe_work_func() >> >> 3. exynos_drm_bind() is called, component_bind_all() fails due to missing >> Exynos Mixer device >> >> 4. error handling path is executed in exynos_drm_bind(), which calls >> drm_mode_config_cleanup() >> >> 5. drm_mode_config_cleanup() calls flush_scheduled_work(), what causes >> deadlock. >> >> Do You have idea how to fix this issue properly? >> >> Taking a look at git blame, this indeed shows that the issue has been >> introduced by the commit a703c55004e1 ("drm: safely free connectors from >> connector_ite"), which added a call to flush_scheduled_work() in >> drm_mode_config_cleanup(). >> >> drm_mode_config_cleanup() should avoid calling flush_scheduled_work() if >> called from the workqueue, but I don't have idea how to check that. The >> other way of fixing it would be to resurrect separate workqueue for DRM >> related events. > We need to flush the work there, or things will go wrong on unload. I > think the real fix is to make sure that the connector cleanup work > isn't run on the same work stuff as any driver bind stuff, which yes > means new separate workqueue just for this. > > I guess the simple fix is to do that in the drm, but tbh I'm surprised > that driver bind/deferred probe hasn't run into this problem anywhere > else yet. Well, this means that no-one tested the error paths in deferred probe case. It's not that surprising, if we assume that typically platform devices are deferred only once. Second probe() call (which is done from workqueue) is successful in that case. I've managed to fix this deadlock without additional workqueue: https://patchwork.freedesktop.org/patch/193069/ Best regards
diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c index 25f4b2e..4820141 100644 --- a/drivers/gpu/drm/drm_connector.c +++ b/drivers/gpu/drm/drm_connector.c @@ -152,6 +152,16 @@ static void drm_connector_free(struct kref *kref) connector->funcs->destroy(connector); } +static void drm_connector_free_work_fn(struct work_struct *work) +{ + struct drm_connector *connector = + container_of(work, struct drm_connector, free_work); + struct drm_device *dev = connector->dev; + + drm_mode_object_unregister(dev, &connector->base); + connector->funcs->destroy(connector); +} + /** * drm_connector_init - Init a preallocated connector * @dev: DRM device @@ -181,6 +191,8 @@ int drm_connector_init(struct drm_device *dev, if (ret) return ret; + INIT_WORK(&connector->free_work, drm_connector_free_work_fn); + connector->base.properties = &connector->properties; connector->dev = dev; connector->funcs = funcs; @@ -529,6 +541,18 @@ void drm_connector_list_iter_begin(struct drm_device *dev, } EXPORT_SYMBOL(drm_connector_list_iter_begin); +/* + * Extra-safe connector put function that works in any context. Should only be + * used from the connector_iter functions, where we never really expect to + * actually release the connector when dropping our final reference. + */ +static void +drm_connector_put_safe(struct drm_connector *conn) +{ + if (refcount_dec_and_test(&conn->base.refcount.refcount)) + schedule_work(&conn->free_work); +} + /** * drm_connector_list_iter_next - return next connector * @iter: connectr_list iterator @@ -561,7 +585,7 @@ drm_connector_list_iter_next(struct drm_connector_list_iter *iter) spin_unlock_irqrestore(&config->connector_list_lock, flags); if (old_conn) - drm_connector_put(old_conn); + drm_connector_put_safe(old_conn); return iter->conn; } @@ -580,7 +604,7 @@ void drm_connector_list_iter_end(struct drm_connector_list_iter *iter) { iter->dev = NULL; if (iter->conn) - drm_connector_put(iter->conn); + drm_connector_put_safe(iter->conn); lock_release(&connector_list_iter_dep_map, 0, _RET_IP_); } EXPORT_SYMBOL(drm_connector_list_iter_end); diff --git a/drivers/gpu/drm/drm_mode_config.c b/drivers/gpu/drm/drm_mode_config.c index cda8bfa..cc78b3d 100644 --- a/drivers/gpu/drm/drm_mode_config.c +++ b/drivers/gpu/drm/drm_mode_config.c @@ -431,6 +431,8 @@ void drm_mode_config_cleanup(struct drm_device *dev) drm_connector_put(connector); } drm_connector_list_iter_end(&conn_iter); + /* connector_iter drops references in a work item. */ + flush_scheduled_work(); if (WARN_ON(!list_empty(&dev->mode_config.connector_list))) { drm_connector_list_iter_begin(dev, &conn_iter); drm_for_each_connector_iter(connector, &conn_iter) diff --git a/include/drm/drm_connector.h b/include/drm/drm_connector.h index df9807a..a4649c5 100644 --- a/include/drm/drm_connector.h +++ b/include/drm/drm_connector.h @@ -916,6 +916,14 @@ struct drm_connector { uint8_t num_h_tile, num_v_tile; uint8_t tile_h_loc, tile_v_loc; uint16_t tile_h_size, tile_v_size; + + /** + * @free_work: + * + * Work used only by &drm_connector_iter to be able to clean up a + * connector from any context. + */ + struct work_struct free_work; }; #define obj_to_connector(x) container_of(x, struct drm_connector, base)