diff mbox series

regression due to soc_device_match not handling defer (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)

Message ID YGGZJjAxA1IO+/VU@atmark-techno.com (mailing list archive)
State New, archived
Headers show
Series regression due to soc_device_match not handling defer (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver) | expand

Commit Message

Dominique MARTINET March 29, 2021, 9:08 a.m. UTC
Hi,

First thanks for the patch, it fixes the kexec hang I was looking at...

Unfortunately it means the soc is now init much later and other piece of
drivers that depend on querying the soc will fail, I am getting a BUG in
the caam crypto driver from 7d981405d0fd ("soc: imx8m: change to use
platform driver") + ce58459d8c7f ("arm64: dts: imx8m: add SoC ID
compatible") on the imx8mp evk as follow:

[    2.575505] caam 30900000.crypto: caam pkc algorithms registered in /proc/crypto
[    2.588986] caam 30900000.crypto: registering rng-caam
[    2.594168] caam_jr 30901000.jr: job ring error: irqstate: 00000103
[    2.600492] ------------[ cut here ]------------
[    2.605109] kernel BUG at drivers/crypto/caam/jr.c:187!
[    2.610338] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[    2.615829] Modules linked in:
[    2.618895] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc5 #8
[    2.625168] Hardware name: NXP i.MX8MPlus EVK board (DT)
[    2.630482] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[    2.636493] pc : caam_jr_interrupt+0x150/0x154
[    2.640944] lr : caam_jr_interrupt+0x150/0x154
[    2.645396] sp : ffff800010003e90
[    2.648709] x29: ffff800010003e90 x28: ffff800011f82e80
[    2.654035] x27: ffff800011cd2000 x26: ffff0000c1988400
[    2.659353] x25: ffff800011cd2000 x24: ffff800011f789e0
[    2.664674] x23: 000000000000002e x22: ffff800012261000
[    2.669994] x21: ffff0000c1979410 x20: ffff0000c1988a80
[    2.675315] x19: 0000000000000103 x18: 0000000000000000
[    2.680637] x17: 0000000000000007 x16: 000000000000000e
[    2.685958] x15: 0000000000000030 x14: ffffffffffffffff
[    2.691279] x13: ffff800090003aa7 x12: ffff800010003aaf
[    2.696601] x11: 0000000000000003 x10: 0101010101010101
[    2.701921] x9 : ffff8000100039d0 x8 : ffff800011fa3830
[    2.707242] x7 : ffff800011ffb830 x6 : ffff800011ffb830
[    2.712560] x5 : 0000000000000000 x4 : 0000000000000000
[    2.717881] x3 : 0000000000000000 x2 : 0000000000000000
[    2.723203] x1 : 0000000000000000 x0 : ffff800011f82e80
[    2.728528] Call trace:
[    2.730976]  caam_jr_interrupt+0x150/0x154
[    2.735080]  __handle_irq_event_percpu+0x6c/0x280
[    2.739791]  handle_irq_event+0x70/0x160
[    2.743716]  handle_fasteoi_irq+0xb0/0x200
[    2.747822]  __handle_domain_irq+0x8c/0xf0
[    2.751924]  gic_handle_irq+0xd8/0x160
[    2.755683]  el1_irq+0xb4/0x180
[    2.758829]  arch_cpu_idle+0x18/0x30
[    2.762412]  default_idle_call+0x4c/0x1d0
[    2.766431]  do_idle+0x238/0x2b0
[    2.769664]  cpu_startup_entry+0x30/0x7c
[    2.773595]  rest_init+0xe4/0xf4
[    2.776832]  arch_call_rest_init+0x1c/0x28
[    2.780937]  start_kernel+0x570/0x5a8
[    2.784602]  0x0
[    2.786452] Code: aa1503e0 90005c41 91108021 940da95d (d4210000)
[    2.792560] ---[ end trace 968b8515172abc2e ]---
[    2.797181] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[    2.804580] SMP: stopping secondary CPUs
[    2.808508] Kernel Offset: disabled
[    2.811998] CPU features: 0x00240002,2000200c
[    2.816360] Memory Limit: none
[    2.819419] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---


This particular crash can be fixed by making the caam driver delay as
well if the device isn't inited yet, e.g. this works:
--------
-------

But it obviously doesn't cover the ~50 (!) other soc_device_match users
in the code base which would all need to start handling potential error
return code.

(I also had problems with other drivers when trying to backport the
patch to the 5.4.70_2.3.0 imx kernel but I just gave up on that one)


I think having all drivers handle potential EPROBE_DEFER errors is the
best way forward, how do you propose to move on?
I can do some but have no way of testing most of these so am a bit
reluctant to...

Thanks,

Comments

Alice Guo (OSS) March 30, 2021, 2:41 a.m. UTC | #1
Hi,

Thanks for reporting this issue, I'll check and add a fix to handle defer probe.

Best regards,
Alice Guo

> -----Original Message-----
> From: Dominique MARTINET <dominique.martinet@atmark-techno.com>
> Sent: 2021年3月29日 17:09
> To: Alice Guo <alice.guo@nxp.com>; Shawn Guo <shawnguo@kernel.org>;
> Krzysztof Kozlowski <krzk@kernel.org>
> Cc: robh+dt@kernel.org; devicetree@vger.kernel.org; Peng Fan
> <peng.fan@nxp.com>; dl-linux-imx <linux-imx@nxp.com>;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: [EXT] regression due to soc_device_match not handling defer (Was:
> [PATCH v4 4/4] soc: imx8m: change to use platform driver)
> 
> Caution: EXT Email
> 
> Hi,
> 
> First thanks for the patch, it fixes the kexec hang I was looking at...
> 
> Unfortunately it means the soc is now init much later and other piece of drivers
> that depend on querying the soc will fail, I am getting a BUG in the caam crypto
> driver from 7d981405d0fd ("soc: imx8m: change to use platform driver") +
> ce58459d8c7f ("arm64: dts: imx8m: add SoC ID
> compatible") on the imx8mp evk as follow:
> 
> [    2.575505] caam 30900000.crypto: caam pkc algorithms registered in
> /proc/crypto
> [    2.588986] caam 30900000.crypto: registering rng-caam
> [    2.594168] caam_jr 30901000.jr: job ring error: irqstate: 00000103
> [    2.600492] ------------[ cut here ]------------
> [    2.605109] kernel BUG at drivers/crypto/caam/jr.c:187!
> [    2.610338] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [    2.615829] Modules linked in:
> [    2.618895] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc5 #8
> [    2.625168] Hardware name: NXP i.MX8MPlus EVK board (DT)
> [    2.630482] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> [    2.636493] pc : caam_jr_interrupt+0x150/0x154
> [    2.640944] lr : caam_jr_interrupt+0x150/0x154
> [    2.645396] sp : ffff800010003e90
> [    2.648709] x29: ffff800010003e90 x28: ffff800011f82e80
> [    2.654035] x27: ffff800011cd2000 x26: ffff0000c1988400
> [    2.659353] x25: ffff800011cd2000 x24: ffff800011f789e0
> [    2.664674] x23: 000000000000002e x22: ffff800012261000
> [    2.669994] x21: ffff0000c1979410 x20: ffff0000c1988a80
> [    2.675315] x19: 0000000000000103 x18: 0000000000000000
> [    2.680637] x17: 0000000000000007 x16: 000000000000000e
> [    2.685958] x15: 0000000000000030 x14: ffffffffffffffff
> [    2.691279] x13: ffff800090003aa7 x12: ffff800010003aaf
> [    2.696601] x11: 0000000000000003 x10: 0101010101010101
> [    2.701921] x9 : ffff8000100039d0 x8 : ffff800011fa3830
> [    2.707242] x7 : ffff800011ffb830 x6 : ffff800011ffb830
> [    2.712560] x5 : 0000000000000000 x4 : 0000000000000000
> [    2.717881] x3 : 0000000000000000 x2 : 0000000000000000
> [    2.723203] x1 : 0000000000000000 x0 : ffff800011f82e80
> [    2.728528] Call trace:
> [    2.730976]  caam_jr_interrupt+0x150/0x154
> [    2.735080]  __handle_irq_event_percpu+0x6c/0x280
> [    2.739791]  handle_irq_event+0x70/0x160
> [    2.743716]  handle_fasteoi_irq+0xb0/0x200
> [    2.747822]  __handle_domain_irq+0x8c/0xf0
> [    2.751924]  gic_handle_irq+0xd8/0x160
> [    2.755683]  el1_irq+0xb4/0x180
> [    2.758829]  arch_cpu_idle+0x18/0x30
> [    2.762412]  default_idle_call+0x4c/0x1d0
> [    2.766431]  do_idle+0x238/0x2b0
> [    2.769664]  cpu_startup_entry+0x30/0x7c
> [    2.773595]  rest_init+0xe4/0xf4
> [    2.776832]  arch_call_rest_init+0x1c/0x28
> [    2.780937]  start_kernel+0x570/0x5a8
> [    2.784602]  0x0
> [    2.786452] Code: aa1503e0 90005c41 91108021 940da95d (d4210000)
> [    2.792560] ---[ end trace 968b8515172abc2e ]---
> [    2.797181] Kernel panic - not syncing: Oops - BUG: Fatal exception in
> interrupt
> [    2.804580] SMP: stopping secondary CPUs
> [    2.808508] Kernel Offset: disabled
> [    2.811998] CPU features: 0x00240002,2000200c
> [    2.816360] Memory Limit: none
> [    2.819419] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception
> in interrupt ]---
> 
> 
> This particular crash can be fixed by making the caam driver delay as well if the
> device isn't inited yet, e.g. this works:
> --------
> diff --git a/drivers/base/soc.c b/drivers/base/soc.c index
> 0af5363a582c..f179f9e55b49 100644
> --- a/drivers/base/soc.c
> +++ b/drivers/base/soc.c
> @@ -36,6 +36,7 @@ static DEVICE_ATTR(family,            0444,
> soc_info_show,  NULL);
>  static DEVICE_ATTR(serial_number,      0444, soc_info_show,  NULL);
>  static DEVICE_ATTR(soc_id,             0444, soc_info_show,  NULL);
>  static DEVICE_ATTR(revision,           0444, soc_info_show,  NULL);
> +static int init_done;
> 
>  struct device *soc_device_to_device(struct soc_device *soc_dev)  { @@
> -157,6 +158,7 @@ struct soc_device *soc_device_register(struct
> soc_device_attribute *soc_dev_attr
>                 return ERR_PTR(ret);
>         }
> 
> +       init_done = true;
>         return soc_dev;
> 
>  out3:
> @@ -243,6 +245,9 @@ const struct soc_device_attribute
> *soc_device_match(  {
>         int ret = 0;
> 
> +       if (!init_done)
> +               return ERR_PTR(-EPROBE_DEFER);
> +
>         if (!matches)
>                 return NULL;
> 
> diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c index
> ca0361b2dbb0..d08f8cc4131f 100644
> --- a/drivers/crypto/caam/ctrl.c
> +++ b/drivers/crypto/caam/ctrl.c
> @@ -635,6 +635,9 @@ static int caam_probe(struct platform_device *pdev)
>         nprop = pdev->dev.of_node;
> 
>         imx_soc_match = soc_device_match(caam_imx_soc_table);
> +       if (IS_ERR(imx_soc_match))
> +               return PTR_ERR(imx_soc_match);
> +
>         caam_imx = (bool)imx_soc_match;
> 
>         if (imx_soc_match) {
> -------
> 
> But it obviously doesn't cover the ~50 (!) other soc_device_match users in the
> code base which would all need to start handling potential error return code.
> 
> (I also had problems with other drivers when trying to backport the patch to the
> 5.4.70_2.3.0 imx kernel but I just gave up on that one)
> 
> 
> I think having all drivers handle potential EPROBE_DEFER errors is the best way
> forward, how do you propose to move on?
> I can do some but have no way of testing most of these so am a bit reluctant
> to...
> 
> Thanks,
> --
> Dominique
Dominique MARTINET April 15, 2021, 1:27 a.m. UTC | #2
Alice Guo (OSS) wrote on Tue, Mar 30, 2021 at 02:41:23AM +0000:
> Thanks for reporting this issue, I'll check and add a fix to handle defer probe.

I haven't seen any follow up on this, have you had a chance to take a
look?
If this won't make it for 5.12 (in a couple of week probably?) would it
make sense to revert 7d981405d0fd ("soc: imx8m: change to use platform
driver") for now?



While looking at the code earlier I also have an unrelated, late-review
on the patch itself:

> +static u32 __init imx8mq_soc_revision(struct device *dev)
> [...]
>  @@ -191,8 +223,16 @@ static int __init imx8_soc_init(void)
>         data = id->data;
>         if (data) {
>                 soc_dev_attr->soc_id = data->name;
> -               if (data->soc_revision)
> -                       soc_rev = data->soc_revision();
> +               if (data->soc_revision) {
> +                       if (pdev) {
> +                               soc_rev = data->soc_revision(&pdev->dev);
> +                               ret = soc_rev;
> +                               if (ret < 0)

I appreciate current soc_revision are "small enough" (looking at
include/soc/imx/revision.h we're talking < 256) so this actually works,
but would it make sense to either make soc_rev signed, or to have
soc_revision() return a s64, or have the revision filled in another *u32
argument to make sure the error is an error and not just a large rev?

This is most definitely fine for now but that kind of code patterns can
lead to weird errors down the road.

Thanks,
Peng Fan April 15, 2021, 1:33 a.m. UTC | #3
> (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)
> 
> Alice Guo (OSS) wrote on Tue, Mar 30, 2021 at 02:41:23AM +0000:
> > Thanks for reporting this issue, I'll check and add a fix to handle defer probe.
> 
> I haven't seen any follow up on this, have you had a chance to take a look?

We are trying to find a proper solution for this.

The proper method might be make soc_device_match return probe defer,
and take early soc attr into consideration, but I am not sure this would win
maintainer's vote.

> If this won't make it for 5.12 (in a couple of week probably?) would it make
> sense to revert 7d981405d0fd ("soc: imx8m: change to use platform
> driver") for now?

Please no. We are targeting android GKI, make driver as modules.
And reverting to original method will also break kexec.

I am on IRC #linux-imx, we could take more if you would like to.

Thanks,
Peng.

> 
> 
> 
> While looking at the code earlier I also have an unrelated, late-review on the
> patch itself:
> 
> > +static u32 __init imx8mq_soc_revision(struct device *dev)
> > [...]
> >  @@ -191,8 +223,16 @@ static int __init imx8_soc_init(void)
> >         data = id->data;
> >         if (data) {
> >                 soc_dev_attr->soc_id = data->name;
> > -               if (data->soc_revision)
> > -                       soc_rev = data->soc_revision();
> > +               if (data->soc_revision) {
> > +                       if (pdev) {
> > +                               soc_rev =
> data->soc_revision(&pdev->dev);
> > +                               ret = soc_rev;
> > +                               if (ret < 0)
> 
> I appreciate current soc_revision are "small enough" (looking at
> include/soc/imx/revision.h we're talking < 256) so this actually works, but
> would it make sense to either make soc_rev signed, or to have
> soc_revision() return a s64, or have the revision filled in another *u32
> argument to make sure the error is an error and not just a large rev?
> 
> This is most definitely fine for now but that kind of code patterns can lead to
> weird errors down the road.
> 
> Thanks,
> --
> Dominique
Lucas Stach June 24, 2021, 10:36 a.m. UTC | #4
Hi all,

Am Donnerstag, dem 15.04.2021 um 01:33 +0000 schrieb Peng Fan:
> > (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)
> > 
> > Alice Guo (OSS) wrote on Tue, Mar 30, 2021 at 02:41:23AM +0000:
> > > Thanks for reporting this issue, I'll check and add a fix to handle defer probe.
> > 
> > I haven't seen any follow up on this, have you had a chance to take a look?
> 
> We are trying to find a proper solution for this.
> 
> The proper method might be make soc_device_match return probe defer,
> and take early soc attr into consideration, but I am not sure this would win
> maintainer's vote.
> 
> > If this won't make it for 5.12 (in a couple of week probably?) would it make
> > sense to revert 7d981405d0fd ("soc: imx8m: change to use platform
> > driver") for now?
> 
> Please no. We are targeting android GKI, make driver as modules.
> And reverting to original method will also break kexec.
> 
> I am on IRC #linux-imx, we could take more if you would like to.

It seems this stalled. This regression totally breaks the kernel boot
on all i.MX8M devices including the CAAM. 5.13 is about to be released,
as the second upstream kernel release after 5.12 without a fix for this
issue. What's the plan here?

If there is no good solution small enough to be ported to the stable
kernels in sight, I think the only sensible option here is to revert
this change.

Regards,
Lucas
Peng Fan (OSS) June 29, 2021, 2:39 a.m. UTC | #5
Hi Lucas,

On 2021/6/24 18:36, Lucas Stach wrote:
> Hi all,
> 
> Am Donnerstag, dem 15.04.2021 um 01:33 +0000 schrieb Peng Fan:
>>> (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)
>>>
>>> Alice Guo (OSS) wrote on Tue, Mar 30, 2021 at 02:41:23AM +0000:
>>>> Thanks for reporting this issue, I'll check and add a fix to handle defer probe.
>>>
>>> I haven't seen any follow up on this, have you had a chance to take a look?
>>
>> We are trying to find a proper solution for this.
>>
>> The proper method might be make soc_device_match return probe defer,
>> and take early soc attr into consideration, but I am not sure this would win
>> maintainer's vote.
>>
>>> If this won't make it for 5.12 (in a couple of week probably?) would it make
>>> sense to revert 7d981405d0fd ("soc: imx8m: change to use platform
>>> driver") for now?
>>
>> Please no. We are targeting android GKI, make driver as modules.
>> And reverting to original method will also break kexec.
>>
>> I am on IRC #linux-imx, we could take more if you would like to.
> 
> It seems this stalled. This regression totally breaks the kernel boot
> on all i.MX8M devices including the CAAM. 5.13 is about to be released,
> as the second upstream kernel release after 5.12 without a fix for this
> issue. What's the plan here?
> 
> If there is no good solution small enough to be ported to the stable
> kernels in sight, I think the only sensible option here is to revert
> this change.

I not have time to touch this, maybe you could submit a revert for this.

Thanks,
Peng.

> 
> Regards,
> Lucas
>
diff mbox series

Patch

diff --git a/drivers/base/soc.c b/drivers/base/soc.c
index 0af5363a582c..f179f9e55b49 100644
--- a/drivers/base/soc.c
+++ b/drivers/base/soc.c
@@ -36,6 +36,7 @@  static DEVICE_ATTR(family,		0444, soc_info_show,  NULL);
 static DEVICE_ATTR(serial_number,	0444, soc_info_show,  NULL);
 static DEVICE_ATTR(soc_id,		0444, soc_info_show,  NULL);
 static DEVICE_ATTR(revision,		0444, soc_info_show,  NULL);
+static int init_done;
 
 struct device *soc_device_to_device(struct soc_device *soc_dev)
 {
@@ -157,6 +158,7 @@  struct soc_device *soc_device_register(struct soc_device_attribute *soc_dev_attr
 		return ERR_PTR(ret);
 	}
 
+	init_done = true;
 	return soc_dev;
 
 out3:
@@ -243,6 +245,9 @@  const struct soc_device_attribute *soc_device_match(
 {
 	int ret = 0;
 
+	if (!init_done)
+		return ERR_PTR(-EPROBE_DEFER);
+
 	if (!matches)
 		return NULL;
 
diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c
index ca0361b2dbb0..d08f8cc4131f 100644
--- a/drivers/crypto/caam/ctrl.c
+++ b/drivers/crypto/caam/ctrl.c
@@ -635,6 +635,9 @@  static int caam_probe(struct platform_device *pdev)
 	nprop = pdev->dev.of_node;
 
 	imx_soc_match = soc_device_match(caam_imx_soc_table);
+	if (IS_ERR(imx_soc_match))
+		return PTR_ERR(imx_soc_match);
+
 	caam_imx = (bool)imx_soc_match;
 
 	if (imx_soc_match) {