diff mbox series

[v1] mtd: parsers: ofpart: Fix parsing when size-cells is 0

Message ID 20221202071900.1143950-1-francesco@dolcini.it (mailing list archive)
State New, archived
Headers show
Series [v1] mtd: parsers: ofpart: Fix parsing when size-cells is 0 | expand

Commit Message

Francesco Dolcini Dec. 2, 2022, 7:19 a.m. UTC
From: Francesco Dolcini <francesco.dolcini@toradex.com>

Add a fallback mechanism to handle the case in which #size-cells is set
to <0>. According to the DT binding the nand controller node should have
set it to 0 and this is not compatible with the legacy way of
specifying partitions directly as child nodes of the nand-controller node.

This fixes a boot failure on colibri-imx7 and potentially other boards.

Cc: stable@vger.kernel.org
Fixes: 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
Link: https://lore.kernel.org/all/Y4dgBTGNWpM6SQXI@francesco-nb.int.toradex.com/
Signed-off-by: Francesco Dolcini <francesco.dolcini@toradex.com>
---
 drivers/mtd/parsers/ofpart_core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

Miquel Raynal Dec. 2, 2022, 9:14 a.m. UTC | #1
Hi Francesco,

francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:

> From: Francesco Dolcini <francesco.dolcini@toradex.com>
> 
> Add a fallback mechanism to handle the case in which #size-cells is set
> to <0>. According to the DT binding the nand controller node should have
> set it to 0 and this is not compatible with the legacy way of
> specifying partitions directly as child nodes of the nand-controller node.

I understand the problem, I understand the fix, but I have to say, I
strongly dislike it :) Touching an mtd core driver to fix a single
broken use case like that is... problematic, for the least.

I am sorry but if a 6.0 kernel breaks because:
- a legacy scheme is used in the description
- u-boot still does not conform to the DT standard
- a patch tries to make a tool happy
Then the solution is clearly not an 'mtd core' mainline patch.

If you really want to workaround U-Boot, either you revert that patch
or you just fix the DT description instead. The parent/child/partitions
scheme has been enforced for maybe 5 years now and for a good reason: a
NAND controller with partitions does not make _any_ sense. There are
plenty of examples out there, imx7-colibri.dtsi has received many
updates since its introduction (for the best), so why not this one?

Cheers,
Miquèl

> This fixes a boot failure on colibri-imx7 and potentially other boards.
> 
> Cc: stable@vger.kernel.org
> Fixes: 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> Link: https://lore.kernel.org/all/Y4dgBTGNWpM6SQXI@francesco-nb.int.toradex.com/
> Signed-off-by: Francesco Dolcini <francesco.dolcini@toradex.com>
> ---
>  drivers/mtd/parsers/ofpart_core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/mtd/parsers/ofpart_core.c b/drivers/mtd/parsers/ofpart_core.c
> index 192190c42fc8..aa3b7fa61e50 100644
> --- a/drivers/mtd/parsers/ofpart_core.c
> +++ b/drivers/mtd/parsers/ofpart_core.c
> @@ -122,6 +122,17 @@ static int parse_fixed_partitions(struct mtd_info *master,
>  
>  		a_cells = of_n_addr_cells(pp);
>  		s_cells = of_n_size_cells(pp);
> +		if (s_cells == 0) {
> +			/*
> +			 * Use #size-cells = <1> for backward compatibility
> +			 * in case #size-cells is set to <0> and firmware adds
> +			 * OF partitions without setting it.
> +			 */
> +			pr_warn_once("%s: ofpart partition %pOF (%pOF) #size-cells is <0>, using <1> for backward compatibility.\n",
> +				     master->name, pp,
> +				     mtd_node);
> +			s_cells = 1;
> +		}
>  		if (len / 4 != a_cells + s_cells) {
>  			pr_debug("%s: ofpart partition %pOF (%pOF) error parsing reg property.\n",
>  				 master->name, pp,
Francesco Dolcini Dec. 2, 2022, 10:12 a.m. UTC | #2
Hello Miquel,

On Fri, Dec 02, 2022 at 10:14:18AM +0100, Miquel Raynal wrote:
> francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:
> > From: Francesco Dolcini <francesco.dolcini@toradex.com>
> > 
> > Add a fallback mechanism to handle the case in which #size-cells is set
> > to <0>. According to the DT binding the nand controller node should have
> > set it to 0 and this is not compatible with the legacy way of
> > specifying partitions directly as child nodes of the nand-controller node.
> 
> I understand the problem, I understand the fix, but I have to say, I
> strongly dislike it :) Touching an mtd core driver to fix a single
> broken use case like that is... problematic, for the least.
I just noticed it 2 days after this patch was backported to a stable
kernel, I am just the first one to notice, we are not talking about a single
use case.

> I am sorry but if a 6.0 kernel breaks because:
Not only kernel 6.0 is currently broken. This patch is going to be
backported to any stable kernel given the fixes tag it has.

> If you really want to workaround U-Boot, either you revert that patch
> or you just fix the DT description instead. The parent/child/partitions
> scheme has been enforced for maybe 5 years now and for a good reason: a
> NAND controller with partitions does not make _any_ sense. There are
> plenty of examples out there, imx7-colibri.dtsi has received many
> updates since its introduction (for the best), so why not this one?

I can and I will update imx7-colibri.dtsi (patch coming), but is this
good enough given the kind of boot failure regression this introduce? We
are going to have old u-boot around that will not work with it, and the
reality is that there are tons of reasons why people do update the linux
kernel and dts everyday, but never ever update the bootloader.

We cannot tell
  "All users of the XXX kernel series must upgrade."
and at the same time introduce a regression that break the boot and
ignore it.

Francesco
Francesco Dolcini Dec. 2, 2022, 10:24 a.m. UTC | #3
On Fri, Dec 02, 2022 at 11:12:43AM +0100, Francesco Dolcini wrote:
> Hello Miquel,
> 
> On Fri, Dec 02, 2022 at 10:14:18AM +0100, Miquel Raynal wrote:
> > francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:
> > > From: Francesco Dolcini <francesco.dolcini@toradex.com>
> > > 
> > > Add a fallback mechanism to handle the case in which #size-cells is set
> > > to <0>. According to the DT binding the nand controller node should have
> > > set it to 0 and this is not compatible with the legacy way of
> > > specifying partitions directly as child nodes of the nand-controller node.
> > 
> > I understand the problem, I understand the fix, but I have to say, I
> > strongly dislike it :) Touching an mtd core driver to fix a single
> > broken use case like that is... problematic, for the least.
> I just noticed it 2 days after this patch was backported to a stable
> kernel, I am just the first one to notice, we are not talking about a single
> use case.
> 
> > I am sorry but if a 6.0 kernel breaks because:
> Not only kernel 6.0 is currently broken. This patch is going to be
> backported to any stable kernel given the fixes tag it has.
> 
> > If you really want to workaround U-Boot, either you revert that patch
> > or you just fix the DT description instead. The parent/child/partitions
> > scheme has been enforced for maybe 5 years now and for a good reason: a
> > NAND controller with partitions does not make _any_ sense. There are
> > plenty of examples out there, imx7-colibri.dtsi has received many
> > updates since its introduction (for the best), so why not this one?
> 
> I can and I will update imx7-colibri.dtsi (patch coming), but is this
> good enough given the kind of boot failure regression this introduce? We
> are going to have old u-boot around that will not work with it, and the

Just another piece of information, support for the partitions node in
U-Boot was added in version v2022.04 [1], we are not talking about ancient
old legacy stuff.

If I add the partitions node as a child of my nand controller, as I was
planning to do and I wrote 10 lines above, I will create a new flavor of
non-booting system with U-Boot older than v2022.04 :-/

U-Boot older than v2022.04 will update the nand controller node never
the less, the partition node will still be there and Linux will use it,
but it will be empty since nobody populate it.

Francesco

[1] commit 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")
Miquel Raynal Dec. 2, 2022, 10:53 a.m. UTC | #4
Hi Francesco,

francesco@dolcini.it wrote on Fri, 2 Dec 2022 11:24:29 +0100:

> On Fri, Dec 02, 2022 at 11:12:43AM +0100, Francesco Dolcini wrote:
> > Hello Miquel,
> > 
> > On Fri, Dec 02, 2022 at 10:14:18AM +0100, Miquel Raynal wrote:  
> > > francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:  
> > > > From: Francesco Dolcini <francesco.dolcini@toradex.com>
> > > > 
> > > > Add a fallback mechanism to handle the case in which #size-cells is set
> > > > to <0>. According to the DT binding the nand controller node should have
> > > > set it to 0 and this is not compatible with the legacy way of
> > > > specifying partitions directly as child nodes of the nand-controller node.  
> > > 
> > > I understand the problem, I understand the fix, but I have to say, I
> > > strongly dislike it :) Touching an mtd core driver to fix a single
> > > broken use case like that is... problematic, for the least.  
> > I just noticed it 2 days after this patch was backported to a stable
> > kernel, I am just the first one to notice, we are not talking about a single
> > use case.
> >   
> > > I am sorry but if a 6.0 kernel breaks because:  
> > Not only kernel 6.0 is currently broken. This patch is going to be
> > backported to any stable kernel given the fixes tag it has.
> >   
> > > If you really want to workaround U-Boot, either you revert that patch
> > > or you just fix the DT description instead. The parent/child/partitions
> > > scheme has been enforced for maybe 5 years now and for a good reason: a
> > > NAND controller with partitions does not make _any_ sense. There are
> > > plenty of examples out there, imx7-colibri.dtsi has received many
> > > updates since its introduction (for the best), so why not this one?  
> > 
> > I can and I will update imx7-colibri.dtsi (patch coming),

:thumb_up:

> > but is this
> > good enough given the kind of boot failure regression this introduce? We
> > are going to have old u-boot around that will not work with it, and the  
> 
> Just another piece of information, support for the partitions node in
> U-Boot was added in version v2022.04 [1], we are not talking about ancient
> old legacy stuff.

If it is so recent, then this is what needs to be fixed, and it should
not bother "many" people because 2022.04 is not so old.

So I am a bit lost, IIUC what is currently broken is:
- U-Boot > 2022.04 and any version of Linux with the backport?

> If I add the partitions node as a child of my nand controller, as I was
> planning to do and I wrote 10 lines above, I will create a new flavor of
> non-booting system with U-Boot older than v2022.04 :-/

I think there is a little confusion here. You are referring to the NAND
controller node, the commit refers to the NAND chip node. What this
commit does looks fine because it just tries to use the partitions {}
node rather than the NAND chip node and if the partitions {} node
already exist, I expect #address-cells and #size-cells to be defined
and be != 0 already.

Here is a proper description:

nand-controller {
	#address-cells = <1>;
	#size-cells = <0>;
	nand-chip {
		partitions {
			#address-cells = <1 or 2>;
			#size-cells = <1 or 2>;
			partition@x { };
			partition@y { };
		};
	};

	/* Here you can very well have another nand-chip node with
	 * another reg property which represents its own CS and another
	 * set of partitions.
	 */
};

> U-Boot older than v2022.04 will update the nand controller node never
> the less, the partition node will still be there and Linux will use it,
> but it will be empty since nobody populate it.
> 
> Francesco
> 
> [1] commit 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")


Thanks,
Miquèl
Francesco Dolcini Dec. 2, 2022, 11:23 a.m. UTC | #5
+ u-boot list

On Fri, Dec 02, 2022 at 11:53:27AM +0100, Miquel Raynal wrote:
> francesco@dolcini.it wrote on Fri, 2 Dec 2022 11:24:29 +0100:
> > On Fri, Dec 02, 2022 at 11:12:43AM +0100, Francesco Dolcini wrote:
> > > On Fri, Dec 02, 2022 at 10:14:18AM +0100, Miquel Raynal wrote:  
> > > > francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:  
> > > > > From: Francesco Dolcini <francesco.dolcini@toradex.com>
> > > > > 
> > > > > Add a fallback mechanism to handle the case in which #size-cells is set
> > > > > to <0>. According to the DT binding the nand controller node should have
> > > > > set it to 0 and this is not compatible with the legacy way of
> > > > > specifying partitions directly as child nodes of the nand-controller node.  
> > > > 
> > > > I understand the problem, I understand the fix, but I have to say, I
> > > > strongly dislike it :) Touching an mtd core driver to fix a single
> > > > broken use case like that is... problematic, for the least.  
> > > I just noticed it 2 days after this patch was backported to a stable
> > > kernel, I am just the first one to notice, we are not talking about a single
> > > use case.
> > >   
> > > > I am sorry but if a 6.0 kernel breaks because:  
> > > Not only kernel 6.0 is currently broken. This patch is going to be
> > > backported to any stable kernel given the fixes tag it has.
> > >   
> > > > If you really want to workaround U-Boot, either you revert that patch
> > > > or you just fix the DT description instead. The parent/child/partitions
> > > > scheme has been enforced for maybe 5 years now and for a good reason: a
> > > > NAND controller with partitions does not make _any_ sense. There are
> > > > plenty of examples out there, imx7-colibri.dtsi has received many
> > > > updates since its introduction (for the best), so why not this one?  
> > > 
> > > I can and I will update imx7-colibri.dtsi (patch coming),
> 
> :thumb_up:
> 
> > > but is this
> > > good enough given the kind of boot failure regression this introduce? We
> > > are going to have old u-boot around that will not work with it, and the  
> > 
> > Just another piece of information, support for the partitions node in
> > U-Boot was added in version v2022.04 [1], we are not talking about ancient
> > old legacy stuff.
> 
> If it is so recent, then this is what needs to be fixed, and it should
> not bother "many" people because 2022.04 is not so old.
> 
> So I am a bit lost, IIUC what is currently broken is:
> - U-Boot > 2022.04 and any version of Linux with the backport?
> 
> > If I add the partitions node as a child of my nand controller, as I was
> > planning to do and I wrote 10 lines above, I will create a new flavor of
> > non-booting system with U-Boot older than v2022.04 :-/
> 
> I think there is a little confusion here. You are referring to the NAND
I guess I have not explained myself well enough :-)

U-Boot is creating the partitions in the dtb, they are not defined in
the source dts file (this is common practice with multiple boards).

Before v2022.04 it was always updating the nand-controller node,
starting from v2022.04 if there is a dedicated `partitions` node it uses
it. This is just the reverse of what ofpart_core.c is doing (if the
partitions node is there it assumes the partitions should go into it,
otherwise it proceeds with the legacy way).

Let's have a concrete example with colibri-imx7.

Current status:
 - The nand-controller node does not include any partitions child, any
   U-Boot version will just add the partition directly as child of the
   nand controller. This is where I am hitting this boot regression now.

Potential change I envisioned here:
 - I add the partitions node to the nand-controller, e.g.

--- a/arch/arm/boot/dts/imx7-colibri.dtsi
+++ b/arch/arm/boot/dts/imx7-colibri.dtsi
@@ -380,6 +380,12 @@ &gpmi {
        nand-on-flash-bbt;
        pinctrl-names = "default";
        pinctrl-0 = <&pinctrl_gpmi_nand>;
+
+       partitions {
+               compatible = "fixed-partitions";
+               #address-cells = <1>;
+               #size-cells = <1>;
+       };
 };

 - U-Boot >= v2022.04 will just work fine creating the partitions as
   currently described in the bindings.
 - U-Boot < v2022.04 will still create the partitions as child of the
   nand-controller node. Linux will see that a `partitions` node exists
   but it will be empty, leading to a boot failure in case mtd is used
   as boot device.


> controller node, the commit refers to the NAND chip node. What this
> commit does looks fine because it just tries to use the partitions {}
> node rather than the NAND chip node and if the partitions {} node
> already exist, I expect #address-cells and #size-cells to be defined
> and be != 0 already.
yes, this commit is perfectly fine I agree.

The reality is that people is using newer kernel with older U-Boot, and
I do not think that deliberately breaking this use case is what the
Linux kernel should do.

I do not think that I can push a change in the DTS that will break
booting any board using an older U-Boot.

Francesco
Greg KH Dec. 2, 2022, 12:43 p.m. UTC | #6
On Fri, Dec 02, 2022 at 08:19:00AM +0100, Francesco Dolcini wrote:
> From: Francesco Dolcini <francesco.dolcini@toradex.com>
> 
> Add a fallback mechanism to handle the case in which #size-cells is set
> to <0>. According to the DT binding the nand controller node should have
> set it to 0 and this is not compatible with the legacy way of
> specifying partitions directly as child nodes of the nand-controller node.
> 
> This fixes a boot failure on colibri-imx7 and potentially other boards.
> 
> Cc: stable@vger.kernel.org
> Fixes: 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> Link: https://lore.kernel.org/all/Y4dgBTGNWpM6SQXI@francesco-nb.int.toradex.com/
> Signed-off-by: Francesco Dolcini <francesco.dolcini@toradex.com>
> ---
>  drivers/mtd/parsers/ofpart_core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Miquel Raynal Dec. 2, 2022, 2:05 p.m. UTC | #7
Hi Francesco,

francesco@dolcini.it wrote on Fri, 2 Dec 2022 12:23:37 +0100:

> + u-boot list
> 
> On Fri, Dec 02, 2022 at 11:53:27AM +0100, Miquel Raynal wrote:
> > francesco@dolcini.it wrote on Fri, 2 Dec 2022 11:24:29 +0100:  
> > > On Fri, Dec 02, 2022 at 11:12:43AM +0100, Francesco Dolcini wrote:  
> > > > On Fri, Dec 02, 2022 at 10:14:18AM +0100, Miquel Raynal wrote:    
> > > > > francesco@dolcini.it wrote on Fri,  2 Dec 2022 08:19:00 +0100:    
> > > > > > From: Francesco Dolcini <francesco.dolcini@toradex.com>
> > > > > > 
> > > > > > Add a fallback mechanism to handle the case in which #size-cells is set
> > > > > > to <0>. According to the DT binding the nand controller node should have
> > > > > > set it to 0 and this is not compatible with the legacy way of
> > > > > > specifying partitions directly as child nodes of the nand-controller node.    
> > > > > 
> > > > > I understand the problem, I understand the fix, but I have to say, I
> > > > > strongly dislike it :) Touching an mtd core driver to fix a single
> > > > > broken use case like that is... problematic, for the least.    
> > > > I just noticed it 2 days after this patch was backported to a stable
> > > > kernel, I am just the first one to notice, we are not talking about a single
> > > > use case.
> > > >     
> > > > > I am sorry but if a 6.0 kernel breaks because:    
> > > > Not only kernel 6.0 is currently broken. This patch is going to be
> > > > backported to any stable kernel given the fixes tag it has.
> > > >     
> > > > > If you really want to workaround U-Boot, either you revert that patch
> > > > > or you just fix the DT description instead. The parent/child/partitions
> > > > > scheme has been enforced for maybe 5 years now and for a good reason: a
> > > > > NAND controller with partitions does not make _any_ sense. There are
> > > > > plenty of examples out there, imx7-colibri.dtsi has received many
> > > > > updates since its introduction (for the best), so why not this one?    
> > > > 
> > > > I can and I will update imx7-colibri.dtsi (patch coming),  
> > 
> > :thumb_up:
> >   
> > > > but is this
> > > > good enough given the kind of boot failure regression this introduce? We
> > > > are going to have old u-boot around that will not work with it, and the    
> > > 
> > > Just another piece of information, support for the partitions node in
> > > U-Boot was added in version v2022.04 [1], we are not talking about ancient
> > > old legacy stuff.  
> > 
> > If it is so recent, then this is what needs to be fixed, and it should
> > not bother "many" people because 2022.04 is not so old.
> > 
> > So I am a bit lost, IIUC what is currently broken is:
> > - U-Boot > 2022.04 and any version of Linux with the backport?
> >   
> > > If I add the partitions node as a child of my nand controller, as I was
> > > planning to do and I wrote 10 lines above, I will create a new flavor of
> > > non-booting system with U-Boot older than v2022.04 :-/  
> > 
> > I think there is a little confusion here. You are referring to the NAND  
> I guess I have not explained myself well enough :-)

Ok, there is still a confusion. Even though I think your logic still
applies, I want to emphasis on how wrong it is to define partitions in
the NAND _controller_ node rather than the NAND _chip_ node. And I
think this might have an impact on our final choice.

> U-Boot is creating the partitions in the dtb, they are not defined in
> the source dts file (this is common practice with multiple boards).

That fdt_fixup_mtdparts() thing is a mistake. The original idea is:

1. Define wrong nodes in your DT
2. Fix your DT at run time in U-Boot
3. Provide the "fixed" DT to Linux

Now step #2 now produces wrong FDT. So what, we should darken even
more the of partition driver in Linux to workaround it? At most what we
can do is warn the user so that people don't loose time understanding
what happens, but I am against supporting this, ever.

> Before v2022.04 it was always updating the nand-controller node,
> starting from v2022.04 if there is a dedicated `partitions` node it uses
> it.

Sounds reasonable.

> This is just the reverse of what ofpart_core.c is doing (if the
> partitions node is there it assumes the partitions should go into it,
> otherwise it proceeds with the legacy way).

Yes, that's how we handle legacy bindings.

> Let's have a concrete example with colibri-imx7.
> 
> Current status:
>  - The nand-controller node does not include any partitions child, any
>    U-Boot version will just add the partition directly as child of the
>    nand controller. This is where I am hitting this boot regression now.

Not exactly. It worked until now because your original DT already
included #size-cells = <1> I believe. It does not do that anymore and
that is why you get your boot regression: because the DT was modified.

The reason why the DT got modified however is interesting. The commit
log says the goal is to comply with modern bindings, which is great.
But if you break how your board boots, then you should probably not do
that. And if we really care about complying with the bindings, there
is something much more interesting than fixing a single property:
distinguishing the NAND controller vs. the NAND chip(s), which has been
enforced since 2016 (which probably predates the imx7-colibri.dtsi, but
whatever):
2d472aba15ff ("mtd: nand: document the NAND controller/NAND chip DT representation")

> Potential change I envisioned here:
>  - I add the partitions node to the nand-controller, e.g.
> 
> --- a/arch/arm/boot/dts/imx7-colibri.dtsi
> +++ b/arch/arm/boot/dts/imx7-colibri.dtsi
> @@ -380,6 +380,12 @@ &gpmi {
>         nand-on-flash-bbt;
>         pinctrl-names = "default";
>         pinctrl-0 = <&pinctrl_gpmi_nand>;
> +
> +       partitions {
> +               compatible = "fixed-partitions";
> +               #address-cells = <1>;
> +               #size-cells = <1>;
> +       };
>  };
> 
>  - U-Boot >= v2022.04 will just work fine creating the partitions as
>    currently described in the bindings.
>  - U-Boot < v2022.04 will still create the partitions as child of the
>    nand-controller node. Linux will see that a `partitions` node exists
>    but it will be empty, leading to a boot failure in case mtd is used
>    as boot device.
> 
> 
> > controller node, the commit refers to the NAND chip node. What this
> > commit does looks fine because it just tries to use the partitions {}
> > node rather than the NAND chip node and if the partitions {} node
> > already exist, I expect #address-cells and #size-cells to be defined
> > and be != 0 already.  
> yes, this commit is perfectly fine I agree.
> 
> The reality is that people is using newer kernel with older U-Boot, and
> I do not think that deliberately breaking this use case is what the
> Linux kernel should do.

Agreed.

> I do not think that I can push a change in the DTS that will break
> booting any board using an older U-Boot.

That's however the initial cause of this discovery. A DT change broke
your boot flow. I'm saying "your" boot flow because I am not sure it
affects "any" board.

For now it only affects the imx7 colibri boards because of:
753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")

But all these boards could be affected in the same way because of some
machine code playing with fdt_fixup_mtdparts():
* arch/arm/mach-uniphier/fdt-fixup.c
* board/compulab/cm_fx6/cm_fx6.c
* board/gateworks/gw_ventana/gw_ventana.c
* board/isee/igep003x/board.c
* board/isee/igep00x0/igep00x0.c
* board/phytec/phycore_am335x_r2/board.c
* board/st/stm32mp1/stm32mp1.c
* board/toradex/colibri-imx6ull/colibri-imx6ull.c
* board/toradex/colibri_imx7/colibri_imx7.c
* board/toradex/colibri_vf/colibri_vf.c
That's of course way too much possible failures.

I still strongly disagree with the initial proposal but what I think we
can do is:

1. To prevent future breakages: 
  Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
  kernel should work.

2. To help tracking down situations like that:
  Keep the warning in ofpart.c but continue to fail.

3. To fix the current situation:
   Immediately revert commit (and prevent it from being backported):
   753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
   This way your own boot flow is fixed in the short term.

4. There is no reason to partially fix a DT like what the above did
   besides trying to avoid warnings emitted by the DT check tools. If
   complying with modern bindings is a goal (and I think it should
   be), then we can modernize this DT without breaking the boot flow:
   Instead of only setting #size-cell = <0>, you can as well define
   in your DT a subnode to define the NAND chip. NAND chips are not
   supposed to have #size-cells properties, but in the past they did,
   which means #address-cells and #size-cells are allowed (and marked
   deprecated in the schema). So in practice, the dt-schema will not
   warn you if they are there, which means you can still set
   #size-cell = <1>.

   Please mind, the tools have been updated very recently to match
   what I am describing above, so they will likely still report
   errors until v6.2-rc1, see:
   https://lore.kernel.org/linux-mtd/20221114090315.848208-1-miquel.raynal@bootlin.com/

Does this sound reasonable?

Thanks,
Miquèl
Marek Vasut Dec. 2, 2022, 2:31 p.m. UTC | #8
On 12/2/22 15:05, Miquel Raynal wrote:
> Hi Francesco,

Hi,

[...]

> I still strongly disagree with the initial proposal but what I think we
> can do is:
> 
> 1. To prevent future breakages:
>    Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
>    kernel should work.
> 
> 2. To help tracking down situations like that:
>    Keep the warning in ofpart.c but continue to fail.
> 
> 3. To fix the current situation:
>     Immediately revert commit (and prevent it from being backported):
>     753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>     This way your own boot flow is fixed in the short term.

Here I disagree, the fix is correct and I think we shouldn't proliferate 
incorrect DTs which don't match the binding document. Rather, if a 
bootloader generates incorrect (new) DT entries, I believe the driver 
should implement a fixup and warn user about this. PC does that as well 
with broken ACPI tables as far as I can tell.

I'm not convinced making a DT non-compliant with bindings again, only to 
work around a problem induced by bootloader, is the right approach here.

This would be setting a dangerous example, where anyone could request a 
DT fix to be reverted because their random bootloader does the wrong 
thing and with valid DT clean up, something broke.

> 4. There is no reason to partially fix a DT like what the above did
>     besides trying to avoid warnings emitted by the DT check tools.

Note that the 3. does not partially fix the DT, it fixes the node fully.

>     If
>     complying with modern bindings is a goal (and I think it should
>     be), then we can modernize this DT without breaking the boot flow:
>     Instead of only setting #size-cell = <0>, you can as well define
>     in your DT a subnode to define the NAND chip. NAND chips are not
>     supposed to have #size-cells properties, but in the past they did,
>     which means #address-cells and #size-cells are allowed (and marked
>     deprecated in the schema). So in practice, the dt-schema will not
>     warn you if they are there, which means you can still set
>     #size-cell = <1>.

I am really not convinced we should hack around this on the DT end and 
try to push some sort of convoluted workaround there, instead of fixing 
it on the driver side (and bootloader side, in the long run).

>     Please mind, the tools have been updated very recently to match
>     what I am describing above, so they will likely still report
>     errors until v6.2-rc1, see:
>     https://lore.kernel.org/linux-mtd/20221114090315.848208-1-miquel.raynal@bootlin.com/
> 
> Does this sound reasonable?

[...]
Miquel Raynal Dec. 2, 2022, 3 p.m. UTC | #9
Hi Marek,

marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:

> On 12/2/22 15:05, Miquel Raynal wrote:
> > Hi Francesco,  
> 
> Hi,
> 
> [...]
> 
> > I still strongly disagree with the initial proposal but what I think we
> > can do is:
> > 
> > 1. To prevent future breakages:
> >    Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
> >    kernel should work.
> > 
> > 2. To help tracking down situations like that:
> >    Keep the warning in ofpart.c but continue to fail.
> > 
> > 3. To fix the current situation:
> >     Immediately revert commit (and prevent it from being backported):
> >     753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> >     This way your own boot flow is fixed in the short term.  
> 
> Here I disagree, the fix is correct and I think we shouldn't
> proliferate incorrect DTs which don't match the binding document.

I agree we should not proliferate incorrect DTs, so let's use a modern
description then, with a controller and a child node which defines the
chip.

> Rather, if a bootloader generates incorrect (new) DT entries, I
> believe the driver should implement a fixup and warn user about this.
> PC does that as well with broken ACPI tables as far as I can tell.
> 
> I'm not convinced making a DT non-compliant with bindings again,

I am sorry to say so, but while warnings reported by the tools
should be fixed, it's not because the tool does not scream at you that
the description is valid. We are actively working on enhancing the
schema so that "all" improper descriptions get warnings (see the series
pointed earlier), but in no way this change makes the node compliant
with modern bindings.

I'm not saying the fix is wrong, but let's be pragmatic, it currently
leads to boot failures.

> only to work around a problem induced by bootloader, is the right approach
> here.

When a patch breaks a board and there is no straight fix, you revert
it, then you think harder. That's what I am saying. This is a temporary
solution.

> This would be setting a dangerous example, where anyone could request a DT fix to be reverted because their random bootloader does the wrong thing and with valid DT clean up, something broke.

Please, you know this is not valid DT clean up. We've been decoupling
controller and chip description since 2016. What I am proposing is a
valid DT cleanup, not to the latest standard, but way closer than the
current solution.

> > 4. There is no reason to partially fix a DT like what the above did
> >     besides trying to avoid warnings emitted by the DT check tools.  
> 
> Note that the 3. does not partially fix the DT, it fixes the node fully.
> 
> >     If
> >     complying with modern bindings is a goal (and I think it should
> >     be), then we can modernize this DT without breaking the boot flow:
> >     Instead of only setting #size-cell = <0>, you can as well define
> >     in your DT a subnode to define the NAND chip. NAND chips are not
> >     supposed to have #size-cells properties, but in the past they did,
> >     which means #address-cells and #size-cells are allowed (and marked
> >     deprecated in the schema). So in practice, the dt-schema will not
> >     warn you if they are there, which means you can still set
> >     #size-cell = <1>.  
> 
> I am really not convinced we should hack around this on the DT end and try to push some sort of convoluted workaround there,

"convoluted workaround" :-)

> instead of fixing it on the driver side (and bootloader side, in the
> long run).
>
> >     Please mind, the tools have been updated very recently to match

s/tools/schema/

> >     what I am describing above, so they will likely still report
> >     errors until v6.2-rc1, see:
> >     https://lore.kernel.org/linux-mtd/20221114090315.848208-1-miquel.raynal@bootlin.com/
> > 
> > Does this sound reasonable?  
> 
> [...]


Thanks,
Miquèl
Marek Vasut Dec. 2, 2022, 3:23 p.m. UTC | #10
On 12/2/22 16:00, Miquel Raynal wrote:
> Hi Marek,

Hi,

> marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:
> 
>> On 12/2/22 15:05, Miquel Raynal wrote:
>>> Hi Francesco,
>>
>> Hi,
>>
>> [...]
>>
>>> I still strongly disagree with the initial proposal but what I think we
>>> can do is:
>>>
>>> 1. To prevent future breakages:
>>>     Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
>>>     kernel should work.
>>>
>>> 2. To help tracking down situations like that:
>>>     Keep the warning in ofpart.c but continue to fail.
>>>
>>> 3. To fix the current situation:
>>>      Immediately revert commit (and prevent it from being backported):
>>>      753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>>      This way your own boot flow is fixed in the short term.
>>
>> Here I disagree, the fix is correct and I think we shouldn't
>> proliferate incorrect DTs which don't match the binding document.
> 
> I agree we should not proliferate incorrect DTs, so let's use a modern
> description then

Yes please !

> , with a controller and a child node which defines the
> chip.

But what if there is no chip connected to the controller node ?

If I understand the proposal here right (please correct me if I'm 
wrong), then:

1) This is the original, old, wrong binding:
&gpmi {
   #size-cells = <1>;
   ...
   partition@N { ... };
};


2) This is the newer, but still wrong binding:
&gpmi {
   #size-cells = <0>;
   ...
   partitions {
     partition@N { ... };
   };
};

3) This is the newest binding, what we want:
&gpmi {
   #size-cells = <0>;
   ...
   nand-chip {
     partitions {
       partition@N { ... };
     };
   };
};

But if there is no physical nand chip connected to the controller, would 
we end up with empty nand-chip node in DT, like this?
&gpmi {
   #size-cells = <X>;
   ...
   nand-chip { /* empty */ };
};
What would be the gpmi controller size cells (X) in that case, still 0, 
right ? So how does that help solve this problem, wouldn't U-Boot still 
populate the partitions directly under the gpmi node or into partitions 
sub-node ?

>> Rather, if a bootloader generates incorrect (new) DT entries, I
>> believe the driver should implement a fixup and warn user about this.
>> PC does that as well with broken ACPI tables as far as I can tell.
>>
>> I'm not convinced making a DT non-compliant with bindings again,
> 
> I am sorry to say so, but while warnings reported by the tools
> should be fixed, it's not because the tool does not scream at you that
> the description is valid. We are actively working on enhancing the
> schema so that "all" improper descriptions get warnings (see the series
> pointed earlier), but in no way this change makes the node compliant
> with modern bindings.
> 
> I'm not saying the fix is wrong, but let's be pragmatic, it currently
> leads to boot failures.

I fully agree that we do have a problem, and that it trickled into 
stable makes it even worse. Maybe I don't fully understand the thing 
with nand-chip proposal, see my question above, esp. the last part.

>> only to work around a problem induced by bootloader, is the right approach
>> here.
> 
> When a patch breaks a board and there is no straight fix, you revert
> it, then you think harder. That's what I am saying. This is a temporary
> solution.

Isn't this patch the straight fix, at least until the bootloader can be 
updated to generate the nand-chip node correctly ?

>> This would be setting a dangerous example, where anyone could request a DT fix to be reverted because their random bootloader does the wrong thing and with valid DT clean up, something broke.
> 
> Please, you know this is not valid DT clean up. We've been decoupling
> controller and chip description since 2016. What I am proposing is a
> valid DT cleanup, not to the latest standard, but way closer than the
> current solution.

I think I really need one more explanation of the nand-chip part above.

[...]
Miquel Raynal Dec. 2, 2022, 3:49 p.m. UTC | #11
Hi Marek,

marex@denx.de wrote on Fri, 2 Dec 2022 16:23:29 +0100:

> On 12/2/22 16:00, Miquel Raynal wrote:
> > Hi Marek,  
> 
> Hi,
> 
> > marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:
> >   
> >> On 12/2/22 15:05, Miquel Raynal wrote:  
> >>> Hi Francesco,  
> >>
> >> Hi,
> >>
> >> [...]
> >>  
> >>> I still strongly disagree with the initial proposal but what I think we
> >>> can do is:
> >>>
> >>> 1. To prevent future breakages:
> >>>     Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
> >>>     kernel should work.
> >>>
> >>> 2. To help tracking down situations like that:
> >>>     Keep the warning in ofpart.c but continue to fail.
> >>>
> >>> 3. To fix the current situation:
> >>>      Immediately revert commit (and prevent it from being backported):
> >>>      753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> >>>      This way your own boot flow is fixed in the short term.  
> >>
> >> Here I disagree, the fix is correct and I think we shouldn't
> >> proliferate incorrect DTs which don't match the binding document.  
> > 
> > I agree we should not proliferate incorrect DTs, so let's use a modern
> > description then  
> 
> Yes please !
> 
> > , with a controller and a child node which defines the
> > chip.  
> 
> But what if there is no chip connected to the controller node ?
> 
> If I understand the proposal here right (please correct me if I'm wrong), then:

Good idea to summarize.

> 
> 1) This is the original, old, wrong binding:
> &gpmi {
>    #size-cells = <1>;
>    ...
>    partition@N { ... };
> };

Yes.

> 
> 
> 2) This is the newer, but still wrong binding:
> &gpmi {
>    #size-cells = <0>;
>    ...
>    partitions {
>      partition@N { ... };
>    };
> };

Well, this is wrong description, but it would work (for compat reasons,
even though I don't think this is considered valid DT by the schemas).

> 
> 3) This is the newest binding, what we want:
> &gpmi {
>    #size-cells = <0>;
>    ...
>    nand-chip {
>      partitions {
>        partition@N { ... };
>      };
>    };
> };

Yes

> 
> But if there is no physical nand chip connected to the controller, would we end up with empty nand-chip node in DT, like this?
> &gpmi {
>    #size-cells = <X>;
>    ...
>    nand-chip { /* empty */ };
> };

Is this really a concern? If there is no NAND chip, the controller
should be disabled, no? I guess technically you could even use the
status property in the nand-chip node...

However, it should not be empty, at the very least a reg property
should indicate on which CS it is wired, as expected there:
https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next

But, as nand-chip.yaml references mtd.yaml, you can as well use
whatever is described here:
https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next

> What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?

The commit that was pointed in the original fix clearly stated that the
NAND chip node was targeted, not the NAND controller node. I hope this
is correctly supported in U-Boot though. So if there is a NAND chip
subnode, I suppose U-Boot would try to create the partitions that are
inside, or even in the sub "partitions" container.

> >> Rather, if a bootloader generates incorrect (new) DT entries, I
> >> believe the driver should implement a fixup and warn user about this.
> >> PC does that as well with broken ACPI tables as far as I can tell.
> >>
> >> I'm not convinced making a DT non-compliant with bindings again,  
> > 
> > I am sorry to say so, but while warnings reported by the tools
> > should be fixed, it's not because the tool does not scream at you that
> > the description is valid. We are actively working on enhancing the
> > schema so that "all" improper descriptions get warnings (see the series
> > pointed earlier), but in no way this change makes the node compliant
> > with modern bindings.
> > 
> > I'm not saying the fix is wrong, but let's be pragmatic, it currently
> > leads to boot failures.  
> 
> I fully agree that we do have a problem, and that it trickled into stable makes it even worse. Maybe I don't fully understand the thing with nand-chip proposal, see my question above, esp. the last part.
> 
> >> only to work around a problem induced by bootloader, is the right approach
> >> here.  
> > 
> > When a patch breaks a board and there is no straight fix, you revert
> > it, then you think harder. That's what I am saying. This is a temporary
> > solution.  
> 
> Isn't this patch the straight fix, at least until the bootloader can be updated to generate the nand-chip node correctly ?
> 
> >> This would be setting a dangerous example, where anyone could request a DT fix to be reverted because their random bootloader does the wrong thing and with valid DT clean up, something broke.  
> > 
> > Please, you know this is not valid DT clean up. We've been decoupling
> > controller and chip description since 2016. What I am proposing is a
> > valid DT cleanup, not to the latest standard, but way closer than the
> > current solution.  
> 
> I think I really need one more explanation of the nand-chip part above.

I hope things are clearer now.

Thanks,
Miquèl
Thorsten Leemhuis Dec. 2, 2022, 3:56 p.m. UTC | #12
On 02.12.22 15:31, Marek Vasut wrote:
> On 12/2/22 15:05, Miquel Raynal wrote:
> [...]
>> 3. To fix the current situation:
>>     Immediately revert commit (and prevent it from being backported):
>>     753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>     This way your own boot flow is fixed in the short term.
> 
> Here I disagree, the fix is correct and I think we shouldn't proliferate
> incorrect DTs which don't match the binding document. Rather, if a
> bootloader generates incorrect (new) DT entries, I believe the driver
> should implement a fixup and warn user about this. PC does that as well
> with broken ACPI tables as far as I can tell.

Well, that might be the right solution in the long run, that's up for
others to decide, but we need to fix this *quickly*. For two reasons
actually: the 6.1 release is near and the change was backported to
stable already.

For details wrt to the "quickly", see "Prioritize work on fixing
regressions" here:
https://docs.kernel.org/process/handling-regressions.html

IOW: Ideally it should be fixed by Sunday.

I'll hence likely soon will point Linus to this and suggest to revert
this, unless there are strong reasons against that or some sort of
agreement on a better solution.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.
Miquel Raynal Dec. 2, 2022, 4:01 p.m. UTC | #13
miquel.raynal@bootlin.com wrote on Fri, 2 Dec 2022 16:49:04 +0100:

> Hi Marek,
> 
> marex@denx.de wrote on Fri, 2 Dec 2022 16:23:29 +0100:
> 
> > On 12/2/22 16:00, Miquel Raynal wrote:  
> > > Hi Marek,    
> > 
> > Hi,
> >   
> > > marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:
> > >     
> > >> On 12/2/22 15:05, Miquel Raynal wrote:    
> > >>> Hi Francesco,    
> > >>
> > >> Hi,
> > >>
> > >> [...]
> > >>    
> > >>> I still strongly disagree with the initial proposal but what I think we
> > >>> can do is:
> > >>>
> > >>> 1. To prevent future breakages:
> > >>>     Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
> > >>>     kernel should work.
> > >>>
> > >>> 2. To help tracking down situations like that:
> > >>>     Keep the warning in ofpart.c but continue to fail.
> > >>>
> > >>> 3. To fix the current situation:
> > >>>      Immediately revert commit (and prevent it from being backported):
> > >>>      753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> > >>>      This way your own boot flow is fixed in the short term.    
> > >>
> > >> Here I disagree, the fix is correct and I think we shouldn't
> > >> proliferate incorrect DTs which don't match the binding document.    
> > > 
> > > I agree we should not proliferate incorrect DTs, so let's use a modern
> > > description then    
> > 
> > Yes please !
> >   
> > > , with a controller and a child node which defines the
> > > chip.    
> > 
> > But what if there is no chip connected to the controller node ?
> > 
> > If I understand the proposal here right (please correct me if I'm wrong), then:  
> 
> Good idea to summarize.
> 
> > 
> > 1) This is the original, old, wrong binding:
> > &gpmi {
> >    #size-cells = <1>;
> >    ...
> >    partition@N { ... };
> > };  
> 
> Yes.
> 
> > 
> > 
> > 2) This is the newer, but still wrong binding:
> > &gpmi {
> >    #size-cells = <0>;
> >    ...
> >    partitions {
> >      partition@N { ... };
> >    };
> > };  
> 
> Well, this is wrong description, but it would work (for compat reasons,
> even though I don't think this is considered valid DT by the schemas).
> 
> > 
> > 3) This is the newest binding, what we want:
> > &gpmi {
> >    #size-cells = <0>;
> >    ...
> >    nand-chip {
> >      partitions {
> >        partition@N { ... };
> >      };
> >    };
> > };  
> 
> Yes

Perhaps I should also mention that #size-cells expected to be 0 has
nothing to do with the "partitions" container (otherwise #address-cells
would be 0 as well). This value is however asking for an address-only
reg property describing which NAND chip should be addressed and how,
basically the NAND controller CS because you can wire your NAND to
any CS.

> > But if there is no physical nand chip connected to the controller, would we end up with empty nand-chip node in DT, like this?
> > &gpmi {
> >    #size-cells = <X>;
> >    ...
> >    nand-chip { /* empty */ };
> > };  
> 
> Is this really a concern? If there is no NAND chip, the controller
> should be disabled, no? I guess technically you could even use the
> status property in the nand-chip node...
> 
> However, it should not be empty, at the very least a reg property
> should indicate on which CS it is wired, as expected there:
> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next
> 
> But, as nand-chip.yaml references mtd.yaml, you can as well use
> whatever is described here:
> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next
> 
> > What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?  
> 
> The commit that was pointed in the original fix clearly stated that the
> NAND chip node was targeted, not the NAND controller node. I hope this
> is correctly supported in U-Boot though. So if there is a NAND chip
> subnode, I suppose U-Boot would try to create the partitions that are
> inside, or even in the sub "partitions" container.
> 
> > >> Rather, if a bootloader generates incorrect (new) DT entries, I
> > >> believe the driver should implement a fixup and warn user about this.
> > >> PC does that as well with broken ACPI tables as far as I can tell.
> > >>
> > >> I'm not convinced making a DT non-compliant with bindings again,    
> > > 
> > > I am sorry to say so, but while warnings reported by the tools
> > > should be fixed, it's not because the tool does not scream at you that
> > > the description is valid. We are actively working on enhancing the
> > > schema so that "all" improper descriptions get warnings (see the series
> > > pointed earlier), but in no way this change makes the node compliant
> > > with modern bindings.
> > > 
> > > I'm not saying the fix is wrong, but let's be pragmatic, it currently
> > > leads to boot failures.    
> > 
> > I fully agree that we do have a problem, and that it trickled into stable makes it even worse. Maybe I don't fully understand the thing with nand-chip proposal, see my question above, esp. the last part.
> >   
> > >> only to work around a problem induced by bootloader, is the right approach
> > >> here.    
> > > 
> > > When a patch breaks a board and there is no straight fix, you revert
> > > it, then you think harder. That's what I am saying. This is a temporary
> > > solution.    
> > 
> > Isn't this patch the straight fix, at least until the bootloader can be updated to generate the nand-chip node correctly ?
> >   
> > >> This would be setting a dangerous example, where anyone could request a DT fix to be reverted because their random bootloader does the wrong thing and with valid DT clean up, something broke.    
> > > 
> > > Please, you know this is not valid DT clean up. We've been decoupling
> > > controller and chip description since 2016. What I am proposing is a
> > > valid DT cleanup, not to the latest standard, but way closer than the
> > > current solution.    
> > 
> > I think I really need one more explanation of the nand-chip part above.  
> 
> I hope things are clearer now.
> 
> Thanks,
> Miquèl
Marek Vasut Dec. 2, 2022, 4:17 p.m. UTC | #14
On 12/2/22 16:49, Miquel Raynal wrote:
> Hi Marek,

Hi,

>> On 12/2/22 16:00, Miquel Raynal wrote:
>>> Hi Marek,
>>
>> Hi,
>>
>>> marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:
>>>    
>>>> On 12/2/22 15:05, Miquel Raynal wrote:
>>>>> Hi Francesco,
>>>>
>>>> Hi,
>>>>
>>>> [...]
>>>>   
>>>>> I still strongly disagree with the initial proposal but what I think we
>>>>> can do is:
>>>>>
>>>>> 1. To prevent future breakages:
>>>>>      Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
>>>>>      kernel should work.
>>>>>
>>>>> 2. To help tracking down situations like that:
>>>>>      Keep the warning in ofpart.c but continue to fail.
>>>>>
>>>>> 3. To fix the current situation:
>>>>>       Immediately revert commit (and prevent it from being backported):
>>>>>       753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>>>>       This way your own boot flow is fixed in the short term.
>>>>
>>>> Here I disagree, the fix is correct and I think we shouldn't
>>>> proliferate incorrect DTs which don't match the binding document.
>>>
>>> I agree we should not proliferate incorrect DTs, so let's use a modern
>>> description then
>>
>> Yes please !
>>
>>> , with a controller and a child node which defines the
>>> chip.
>>
>> But what if there is no chip connected to the controller node ?
>>
>> If I understand the proposal here right (please correct me if I'm wrong), then:
> 
> Good idea to summarize.
> 
>>
>> 1) This is the original, old, wrong binding:
>> &gpmi {
>>     #size-cells = <1>;
>>     ...
>>     partition@N { ... };
>> };
> 
> Yes.
> 
>>
>>
>> 2) This is the newer, but still wrong binding:
>> &gpmi {
>>     #size-cells = <0>;
>>     ...
>>     partitions {
>>       partition@N { ... };
>>     };
>> };
> 
> Well, this is wrong description, but it would work (for compat reasons,
> even though I don't think this is considered valid DT by the schemas).
> 
>>
>> 3) This is the newest binding, what we want:
>> &gpmi {
>>     #size-cells = <0>;
>>     ...
>>     nand-chip {
>>       partitions {
>>         partition@N { ... };
>>       };
>>     };
>> };
> 
> Yes
> 
>>
>> But if there is no physical nand chip connected to the controller, would we end up with empty nand-chip node in DT, like this?
>> &gpmi {
>>     #size-cells = <X>;
>>     ...
>>     nand-chip { /* empty */ };
>> };
> 
> Is this really a concern?

I don't know, maybe it is not.

> If there is no NAND chip, the controller
> should be disabled, no? I guess technically you could even use the
> status property in the nand-chip node...

Sure.

> However, it should not be empty, at the very least a reg property
> should indicate on which CS it is wired, as expected there:
> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next

OK, I see your point. So basically this?

&gpmi {
   #size-cells = <1>;
   ...
   nand-chip@0 {
     reg = <0>;
   };
};

btw. the GPMI NAND controller supports only one chipselect, so the reg 
in nand-chip node makes little sense.

> But, as nand-chip.yaml references mtd.yaml, you can as well use
> whatever is described here:
> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next
> 
>> What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?
> 
> The commit that was pointed in the original fix clearly stated that the
> NAND chip node was targeted

I think this is another miscommunication here. The commit

753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")

modifies the size-cells of the NAND controller. The nand-chip is not 
involved in this at all . In the examples above, it's the "&gpmi" node 
size-cells that is modified.

> , not the NAND controller node. I hope this
> is correctly supported in U-Boot though. So if there is a NAND chip
> subnode, I suppose U-Boot would try to create the partitions that are
> inside, or even in the sub "partitions" container.

My understanding is that U-Boot checks the nand-controller node 
size-cells, not the nand-chip{} or partitions{} subnode size-cells .

Francesco, can you please share the DT, including the U-Boot generated 
partitions, which is passed to Linux on Colibri MX7 ? I think that 
should make all confusion go away.

(or am I the only one who's still confused here?)
Miquel Raynal Dec. 2, 2022, 4:42 p.m. UTC | #15
Hi Marek,

marex@denx.de wrote on Fri, 2 Dec 2022 17:17:59 +0100:

> On 12/2/22 16:49, Miquel Raynal wrote:
> > Hi Marek,  
> 
> Hi,
> 
> >> On 12/2/22 16:00, Miquel Raynal wrote:  
> >>> Hi Marek,  
> >>
> >> Hi,
> >>  
> >>> marex@denx.de wrote on Fri, 2 Dec 2022 15:31:40 +0100:  
> >>>    >>>> On 12/2/22 15:05, Miquel Raynal wrote:  
> >>>>> Hi Francesco,  
> >>>>
> >>>> Hi,
> >>>>
> >>>> [...]  
> >>>>   >>>>> I still strongly disagree with the initial proposal but what I think we  
> >>>>> can do is:
> >>>>>
> >>>>> 1. To prevent future breakages:
> >>>>>      Fix fdt_fixup_mtdparts() in u-boot. This way newer U-Boot + any
> >>>>>      kernel should work.
> >>>>>
> >>>>> 2. To help tracking down situations like that:
> >>>>>      Keep the warning in ofpart.c but continue to fail.
> >>>>>
> >>>>> 3. To fix the current situation:
> >>>>>       Immediately revert commit (and prevent it from being backported):
> >>>>>       753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> >>>>>       This way your own boot flow is fixed in the short term.  
> >>>>
> >>>> Here I disagree, the fix is correct and I think we shouldn't
> >>>> proliferate incorrect DTs which don't match the binding document.  
> >>>
> >>> I agree we should not proliferate incorrect DTs, so let's use a modern
> >>> description then  
> >>
> >> Yes please !
> >>  
> >>> , with a controller and a child node which defines the
> >>> chip.  
> >>
> >> But what if there is no chip connected to the controller node ?
> >>
> >> If I understand the proposal here right (please correct me if I'm wrong), then:  
> > 
> > Good idea to summarize.
> >   
> >>
> >> 1) This is the original, old, wrong binding:
> >> &gpmi {
> >>     #size-cells = <1>;
> >>     ...
> >>     partition@N { ... };
> >> };  
> > 
> > Yes.
> >   
> >>
> >>
> >> 2) This is the newer, but still wrong binding:
> >> &gpmi {
> >>     #size-cells = <0>;
> >>     ...
> >>     partitions {
> >>       partition@N { ... };
> >>     };
> >> };  
> > 
> > Well, this is wrong description, but it would work (for compat reasons,
> > even though I don't think this is considered valid DT by the schemas).
> >   
> >>
> >> 3) This is the newest binding, what we want:
> >> &gpmi {
> >>     #size-cells = <0>;
> >>     ...
> >>     nand-chip {
> >>       partitions {
> >>         partition@N { ... };
> >>       };
> >>     };
> >> };  
> > 
> > Yes
> >   
> >>
> >> But if there is no physical nand chip connected to the controller, would we end up with empty nand-chip node in DT, like this?
> >> &gpmi {
> >>     #size-cells = <X>;
> >>     ...
> >>     nand-chip { /* empty */ };
> >> };  
> > 
> > Is this really a concern?  
> 
> I don't know, maybe it is not.
> 
> > If there is no NAND chip, the controller
> > should be disabled, no? I guess technically you could even use the
> > status property in the nand-chip node...  
> 
> Sure.
> 
> > However, it should not be empty, at the very least a reg property
> > should indicate on which CS it is wired, as expected there:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next  
> 
> OK, I see your point. So basically this?
> 
> &gpmi {
>    #size-cells = <1>;
>    ...
>    nand-chip@0 {
>      reg = <0>;
>    };
> };
> 
> btw. the GPMI NAND controller supports only one chipselect, so the reg in nand-chip node makes little sense.

I randomly opened a reference manual (IMX6DQL.pdf), they say:

	"Up to four NAND devices, supported by four chip-selects and one
	 ganged ready/ busy."

Anyway, the NAND controller generic bindings which require this reg
property, what the controller or the driver actually supports, or even
how it is used on current designs is not relevant here.

> > But, as nand-chip.yaml references mtd.yaml, you can as well use
> > whatever is described here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next
> >   
> >> What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?  
> > 
> > The commit that was pointed in the original fix clearly stated that the
> > NAND chip node was targeted  
> 
> I think this is another miscommunication here. The commit
> 
> 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> 
> modifies the size-cells of the NAND controller. The nand-chip is not involved in this at all . In the examples above, it's the "&gpmi" node size-cells that is modified.

Yes I know. I was referring to this commit, sorry:
36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")

The log says:

	Listing MTD partitions directly in the flash mode has been
	deprecated for a while for kernel Device Trees. Look for a node "partitions" in the
	found flash nodes and use it instead of the flash node itself for the
	partition list when it exists, so Device Trees following the current
	best practices can be fixed up.

Which (I hope) means U-boot will equivalently try to play with the
partitions container, either in the controller node or in the chip node.

> > , not the NAND controller node. I hope this
> > is correctly supported in U-Boot though. So if there is a NAND chip
> > subnode, I suppose U-Boot would try to create the partitions that are
> > inside, or even in the sub "partitions" container.  
> 
> My understanding is that U-Boot checks the nand-controller node size-cells, not the nand-chip{} or partitions{} subnode size-cells .

I don't think U-Boot cares.

> Francesco, can you please share the DT, including the U-Boot generated partitions, which is passed to Linux on Colibri MX7 ? I think that should make all confusion go away.

Please also do it with the NAND chip described. If, when the NAND chip
is described U-Boot tries to create partitions in the controller node,
then the situation is even worse than I thought. But I believe
describing the node like a suggest in the DT should prevent the boot
failure while still allowing a rather good description of the hardware.

BTW I still think the relevant action right now is to revert the DT
patch.

Thanks,
Miquèl
Francesco Dolcini Dec. 2, 2022, 4:45 p.m. UTC | #16
On Fri, Dec 02, 2022 at 05:17:59PM +0100, Marek Vasut wrote:
> On 12/2/22 16:49, Miquel Raynal wrote:
> > , not the NAND controller node. I hope this
> > is correctly supported in U-Boot though. So if there is a NAND chip
> > subnode, I suppose U-Boot would try to create the partitions that are
> > inside, or even in the sub "partitions" container.
> 
> My understanding is that U-Boot checks the nand-controller node size-cells,
> not the nand-chip{} or partitions{} subnode size-cells .
Not 100% correct.

 - U-Boot before v2022.04 updates the nand-controller{} node, no matter what.
 - U-Boot starting from v2022.04 looks for `partitions{}` into the
   nand-controller{} node, and creates the partition into it if found.
   If not found it behaves the same way as the previous versions.
   See commit 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")

I'd like to stress once more the fact that we cannot expect old U-Boot
to be updated in the field, and they will keep generating the partitions
as child of the nand-controller node whatever we do with the dts file.

I think that this should be treated the same way as any other fixup we
might have for broken firmware, especially considering that this used to
"work" (yes, I can agree that it horrible, but I cannot change the past)
without even a warning since the imx7 support was first introduced in
the linux kernel years ago.

> Francesco, can you please share the DT, including the U-Boot generated
> partitions, which is passed to Linux on Colibri MX7 ? I think that should
> make all confusion go away.

The device tree part is easy, just
arch/arm/boot/dts/imx7d-colibri-eval-v3.dts.

and the nand-controller node is coming from

#include "imx7d.dtsi"

plus

&gpmi {
	fsl,use-minimum-ecc;
	nand-ecc-mode = "hw";
	nand-on-flash-bbt;
	pinctrl-names = "default";
	pinctrl-0 = <&pinctrl_gpmi_nand>;
};

The partitions nodes are generated 100% by U-Boot, nothing is present in
the dts source files.

With this DTS file as input, whatever U-Boot version is used I have the
following generated:

root@colibri-imx7-02844233:/# ls /proc/device-tree/soc/nand-controller@33002000/
#address-cells          dma-names               nand-on-flash-bbt       pinctrl-0
#size-cells             dmas                    partition@0             pinctrl-names
assigned-clock-parents  fsl,use-minimum-ecc     partition@200000        reg
assigned-clocks         interrupt-names         partition@380000        reg-names
clock-names             interrupts              partition@400000        status
clocks                  name                    partition@80000
compatible              nand-ecc-mode           phandle

root@colibri-imx7-02844233:/# ls /proc/device-tree/soc/nand-controller@33002000/partition@*
/proc/device-tree/soc/nand-controller@33002000/partition@0:
label  name   reg

/proc/device-tree/soc/nand-controller@33002000/partition@200000:
label      name       read_only  reg

/proc/device-tree/soc/nand-controller@33002000/partition@380000:
label  name   reg

/proc/device-tree/soc/nand-controller@33002000/partition@400000:
label  name   reg

/proc/device-tree/soc/nand-controller@33002000/partition@80000:
label      name       read_only  reg
Marek Vasut Dec. 2, 2022, 4:52 p.m. UTC | #17
On 12/2/22 17:42, Miquel Raynal wrote:
> Hi Marek,

Hi,

[...]

>>> However, it should not be empty, at the very least a reg property
>>> should indicate on which CS it is wired, as expected there:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next
>>
>> OK, I see your point. So basically this?
>>
>> &gpmi {
>>     #size-cells = <1>;
>>     ...
>>     nand-chip@0 {
>>       reg = <0>;
>>     };
>> };
>>
>> btw. the GPMI NAND controller supports only one chipselect, so the reg in nand-chip node makes little sense.
> 
> I randomly opened a reference manual (IMX6DQL.pdf), they say:
> 
> 	"Up to four NAND devices, supported by four chip-selects and one
> 	 ganged ready/ busy."

Doh, and MX7D has the same controller, so size-cells = <1>; makes sense 
with nand-chip@N {} .

> Anyway, the NAND controller generic bindings which require this reg
> property, what the controller or the driver actually supports, or even
> how it is used on current designs is not relevant here.
> 
>>> But, as nand-chip.yaml references mtd.yaml, you can as well use
>>> whatever is described here:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next
>>>    
>>>> What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?
>>>
>>> The commit that was pointed in the original fix clearly stated that the
>>> NAND chip node was targeted
>>
>> I think this is another miscommunication here. The commit
>>
>> 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>
>> modifies the size-cells of the NAND controller. The nand-chip is not involved in this at all . In the examples above, it's the "&gpmi" node size-cells that is modified.
> 
> Yes I know. I was referring to this commit, sorry:
> 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")
> 
> The log says:
> 
> 	Listing MTD partitions directly in the flash mode has been
> 	deprecated for a while for kernel Device Trees. Look for a node "partitions" in the
> 	found flash nodes and use it instead of the flash node itself for the
> 	partition list when it exists, so Device Trees following the current
> 	best practices can be fixed up.
> 
> Which (I hope) means U-boot will equivalently try to play with the
> partitions container, either in the controller node or in the chip node.
> 
>>> , not the NAND controller node. I hope this
>>> is correctly supported in U-Boot though. So if there is a NAND chip
>>> subnode, I suppose U-Boot would try to create the partitions that are
>>> inside, or even in the sub "partitions" container.
>>
>> My understanding is that U-Boot checks the nand-controller node size-cells, not the nand-chip{} or partitions{} subnode size-cells .
> 
> I don't think U-Boot cares.
> 
>> Francesco, can you please share the DT, including the U-Boot generated partitions, which is passed to Linux on Colibri MX7 ? I think that should make all confusion go away.
> 
> Please also do it with the NAND chip described. If, when the NAND chip
> is described U-Boot tries to create partitions in the controller node,
> then the situation is even worse than I thought. But I believe
> describing the node like a suggest in the DT should prevent the boot
> failure while still allowing a rather good description of the hardware.
> 
> BTW I still think the relevant action right now is to revert the DT
> patch.

I am starting to bank toward that variant as well (thanks for clarifying 
the rationale in the discussion, that helped a lot).

But then, the follow up fix would be what exactly, update the binding 
document to require #size-cells = <1>; ?
Miquel Raynal Dec. 2, 2022, 4:57 p.m. UTC | #18
Hi Marek,

marex@denx.de wrote on Fri, 2 Dec 2022 17:52:05 +0100:

> On 12/2/22 17:42, Miquel Raynal wrote:
> > Hi Marek,  
> 
> Hi,
> 
> [...]
> 
> >>> However, it should not be empty, at the very least a reg property
> >>> should indicate on which CS it is wired, as expected there:
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next  
> >>
> >> OK, I see your point. So basically this?
> >>
> >> &gpmi {
> >>     #size-cells = <1>;
> >>     ...
> >>     nand-chip@0 {
> >>       reg = <0>;
> >>     };
> >> };
> >>
> >> btw. the GPMI NAND controller supports only one chipselect, so the reg in nand-chip node makes little sense.  
> > 
> > I randomly opened a reference manual (IMX6DQL.pdf), they say:
> > 
> > 	"Up to four NAND devices, supported by four chip-selects and one
> > 	 ganged ready/ busy."  
> 
> Doh, and MX7D has the same controller, so size-cells = <1>; makes sense with nand-chip@N {} .

Actually #address-cells is here for that. You need to point at one CS,
so in most cases this is:

controller {
	#address-cells = <1>;
	#size-cells = <0>;
	chip@N {
		reg = <N>;
	};
};

> 
> > Anyway, the NAND controller generic bindings which require this reg
> > property, what the controller or the driver actually supports, or even
> > how it is used on current designs is not relevant here.
> >   
> >>> But, as nand-chip.yaml references mtd.yaml, you can as well use
> >>> whatever is described here:
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/mtd.yaml?h=mtd/next  
> >>>    >>>> What would be the gpmi controller size cells (X) in that case, still 0, right ? So how does that help solve this problem, wouldn't U-Boot still populate the partitions directly under the gpmi node or into partitions sub-node ?  
> >>>
> >>> The commit that was pointed in the original fix clearly stated that the
> >>> NAND chip node was targeted  
> >>
> >> I think this is another miscommunication here. The commit
> >>
> >> 753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
> >>
> >> modifies the size-cells of the NAND controller. The nand-chip is not involved in this at all . In the examples above, it's the "&gpmi" node size-cells that is modified.  
> > 
> > Yes I know. I was referring to this commit, sorry:
> > 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")
> > 
> > The log says:
> > 
> > 	Listing MTD partitions directly in the flash mode has been
> > 	deprecated for a while for kernel Device Trees. Look for a node "partitions" in the
> > 	found flash nodes and use it instead of the flash node itself for the
> > 	partition list when it exists, so Device Trees following the current
> > 	best practices can be fixed up.
> > 
> > Which (I hope) means U-boot will equivalently try to play with the
> > partitions container, either in the controller node or in the chip node.
> >   
> >>> , not the NAND controller node. I hope this
> >>> is correctly supported in U-Boot though. So if there is a NAND chip
> >>> subnode, I suppose U-Boot would try to create the partitions that are
> >>> inside, or even in the sub "partitions" container.  
> >>
> >> My understanding is that U-Boot checks the nand-controller node size-cells, not the nand-chip{} or partitions{} subnode size-cells .  
> > 
> > I don't think U-Boot cares.
> >   
> >> Francesco, can you please share the DT, including the U-Boot generated partitions, which is passed to Linux on Colibri MX7 ? I think that should make all confusion go away.  
> > 
> > Please also do it with the NAND chip described. If, when the NAND chip
> > is described U-Boot tries to create partitions in the controller node,
> > then the situation is even worse than I thought. But I believe
> > describing the node like a suggest in the DT should prevent the boot
> > failure while still allowing a rather good description of the hardware.
> > 
> > BTW I still think the relevant action right now is to revert the DT
> > patch.  
> 
> I am starting to bank toward that variant as well (thanks for clarifying the rationale in the discussion, that helped a lot).
> 
> But then, the follow up fix would be what exactly, update the binding document to require #size-cells = <1>; ?


Thanks,
Miquèl
Miquel Raynal Dec. 2, 2022, 5:05 p.m. UTC | #19
Hi Francesco,

francesco@dolcini.it wrote on Fri, 2 Dec 2022 17:45:37 +0100:

> On Fri, Dec 02, 2022 at 05:17:59PM +0100, Marek Vasut wrote:
> > On 12/2/22 16:49, Miquel Raynal wrote:  
> > > , not the NAND controller node. I hope this
> > > is correctly supported in U-Boot though. So if there is a NAND chip
> > > subnode, I suppose U-Boot would try to create the partitions that are
> > > inside, or even in the sub "partitions" container.  
> > 
> > My understanding is that U-Boot checks the nand-controller node size-cells,
> > not the nand-chip{} or partitions{} subnode size-cells .  
> Not 100% correct.
> 
>  - U-Boot before v2022.04 updates the nand-controller{} node, no matter what.
>  - U-Boot starting from v2022.04 looks for `partitions{}` into the
>    nand-controller{} node, and creates the partition into it if found.
>    If not found it behaves the same way as the previous versions.
>    See commit 36fee2f7621e ("common: fdt_support: add support for "partitions" subnode to fdt_fixup_mtdparts()")
> 
> I'd like to stress once more the fact that we cannot expect old U-Boot
> to be updated in the field, and they will keep generating the partitions
> as child of the nand-controller node whatever we do with the dts file.
> 
> I think that this should be treated the same way as any other fixup we
> might have for broken firmware, especially considering that this used to
> "work" (yes, I can agree that it horrible, but I cannot change the past)
> without even a warning since the imx7 support was first introduced in
> the linux kernel years ago.
> 
> > Francesco, can you please share the DT, including the U-Boot generated
> > partitions, which is passed to Linux on Colibri MX7 ? I think that should
> > make all confusion go away.  
> 
> The device tree part is easy, just
> arch/arm/boot/dts/imx7d-colibri-eval-v3.dts.
> 
> and the nand-controller node is coming from
> 
> #include "imx7d.dtsi"
> 
> plus
> 
> &gpmi {
> 	fsl,use-minimum-ecc;
> 	nand-ecc-mode = "hw";
> 	nand-on-flash-bbt;
> 	pinctrl-names = "default";
> 	pinctrl-0 = <&pinctrl_gpmi_nand>;
> };
> 
> The partitions nodes are generated 100% by U-Boot, nothing is present in
> the dts source files.

I hope if you provide a NAND chip child node, the partitions are created
at the right location, otherwise this is so, so wrong...

> 
> With this DTS file as input, whatever U-Boot version is used I have the
> following generated:
> 
> root@colibri-imx7-02844233:/# ls /proc/device-tree/soc/nand-controller@33002000/
> #address-cells          dma-names               nand-on-flash-bbt       pinctrl-0
> #size-cells             dmas                    partition@0             pinctrl-names
> assigned-clock-parents  fsl,use-minimum-ecc     partition@200000        reg
> assigned-clocks         interrupt-names         partition@380000        reg-names
> clock-names             interrupts              partition@400000        status
> clocks                  name                    partition@80000
> compatible              nand-ecc-mode           phandle
> 
> root@colibri-imx7-02844233:/# ls /proc/device-tree/soc/nand-controller@33002000/partition@*
> /proc/device-tree/soc/nand-controller@33002000/partition@0:
> label  name   reg
> 
> /proc/device-tree/soc/nand-controller@33002000/partition@200000:
> label      name       read_only  reg
> 
> /proc/device-tree/soc/nand-controller@33002000/partition@380000:
> label  name   reg
> 
> /proc/device-tree/soc/nand-controller@33002000/partition@400000:
> label  name   reg
> 
> /proc/device-tree/soc/nand-controller@33002000/partition@80000:
> label      name       read_only  reg
> 
> 


Thanks,
Miquèl
Marek Vasut Dec. 2, 2022, 5:08 p.m. UTC | #20
On 12/2/22 17:57, Miquel Raynal wrote:
> Hi Marek,

Hi,

>> On 12/2/22 17:42, Miquel Raynal wrote:
>>> Hi Marek,
>>
>> Hi,
>>
>> [...]
>>
>>>>> However, it should not be empty, at the very least a reg property
>>>>> should indicate on which CS it is wired, as expected there:
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/tree/Documentation/devicetree/bindings/mtd/nand-chip.yaml?h=mtd/next
>>>>
>>>> OK, I see your point. So basically this?
>>>>
>>>> &gpmi {
>>>>      #size-cells = <1>;
>>>>      ...
>>>>      nand-chip@0 {
>>>>        reg = <0>;
>>>>      };
>>>> };
>>>>
>>>> btw. the GPMI NAND controller supports only one chipselect, so the reg in nand-chip node makes little sense.
>>>
>>> I randomly opened a reference manual (IMX6DQL.pdf), they say:
>>>
>>> 	"Up to four NAND devices, supported by four chip-selects and one
>>> 	 ganged ready/ busy."
>>
>> Doh, and MX7D has the same controller, so size-cells = <1>; makes sense with nand-chip@N {} .
> 
> Actually #address-cells is here for that. You need to point at one CS,
> so in most cases this is:
> 
> controller {
> 	#address-cells = <1>;
> 	#size-cells = <0>;
> 	chip@N {
> 		reg = <N>;
> 	};
> };

Right ... thanks for spotting this.

But then the size-cells in the controller node should be zero. And 
753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells") is 
therefore correct ?

But that correction in 753395ea1e45 breaks setup where old U-Boot 
injects partition@N directly into the nand-controller node, without 
updating the nand-controller node size-cells, i.e. this:
nand-controller {
	#address-cells = <1>;
	#size-cells = <0>;

+       partition@0 { ... };
};
Because the size-cells is 0 in that case, and U-Boot does not update the 
size-cells .

It used to work before because Linux DT contained size-cells=<1> in the 
nand-controller node before 753395ea1e45 ("ARM: dts: imx7: Fix NAND 
controller size-cells").

But here I would say this is a firmware bug and it might have to be 
handled like a firmware bug, i.e. with fixup in the partition parser. I 
seem to be changing my opinion here again.
Francesco Dolcini Dec. 2, 2022, 5:20 p.m. UTC | #21
On Fri, Dec 02, 2022 at 05:42:55PM +0100, Miquel Raynal wrote:
> Please also do it with the NAND chip described. If, when the NAND chip
> is described U-Boot tries to create partitions in the controller node,
> then the situation is even worse than I thought. But I believe

It's like that for U-Boot older than v2022.04 ... and IMO we cannot
ignore it.

Said that from the code U-Boot looks into a `partition{}` node only as a
direct child of the nand-controller, if there is a nand-chip in between
the nand-controller{} and the partitions{} it will just ignore it.

I could try to see what it is doing exactly, but I would need a little
bit more time, I just tried changing the DTS as wrote I got a non
bootable system.

Francesco
Marek Vasut Dec. 4, 2022, 12:50 p.m. UTC | #22
On 12/2/22 16:56, Thorsten Leemhuis wrote:
> On 02.12.22 15:31, Marek Vasut wrote:
>> On 12/2/22 15:05, Miquel Raynal wrote:
>> [...]
>>> 3. To fix the current situation:
>>>      Immediately revert commit (and prevent it from being backported):
>>>      753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>>      This way your own boot flow is fixed in the short term.
>>
>> Here I disagree, the fix is correct and I think we shouldn't proliferate
>> incorrect DTs which don't match the binding document. Rather, if a
>> bootloader generates incorrect (new) DT entries, I believe the driver
>> should implement a fixup and warn user about this. PC does that as well
>> with broken ACPI tables as far as I can tell.
> 
> Well, that might be the right solution in the long run, that's up for
> others to decide, but we need to fix this *quickly*. For two reasons
> actually: the 6.1 release is near and the change was backported to
> stable already.
> 
> For details wrt to the "quickly", see "Prioritize work on fixing
> regressions" here:
> https://docs.kernel.org/process/handling-regressions.html
> 
> IOW: Ideally it should be fixed by Sunday.
> 
> I'll hence likely soon will point Linus to this and suggest to revert
> this, unless there are strong reasons against that or some sort of
> agreement on a better solution.

You might want to wait until everyone is back on Monday, the discussion 
is still ongoing, but it seems to be getting to a conclusion.
Thorsten Leemhuis Dec. 4, 2022, 12:59 p.m. UTC | #23
On 04.12.22 13:50, Marek Vasut wrote:
> On 12/2/22 16:56, Thorsten Leemhuis wrote:
>> On 02.12.22 15:31, Marek Vasut wrote:
>>> On 12/2/22 15:05, Miquel Raynal wrote:
>>> [...]
>>>> 3. To fix the current situation:
>>>>      Immediately revert commit (and prevent it from being backported):
>>>>      753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>>>      This way your own boot flow is fixed in the short term.
>>>
>>> Here I disagree, the fix is correct and I think we shouldn't proliferate
>>> incorrect DTs which don't match the binding document. Rather, if a
>>> bootloader generates incorrect (new) DT entries, I believe the driver
>>> should implement a fixup and warn user about this. PC does that as well
>>> with broken ACPI tables as far as I can tell.
>>
>> Well, that might be the right solution in the long run, that's up for
>> others to decide, but we need to fix this *quickly*. For two reasons
>> actually: the 6.1 release is near and the change was backported to
>> stable already.
>>
>> For details wrt to the "quickly", see "Prioritize work on fixing
>> regressions" here:
>> https://docs.kernel.org/process/handling-regressions.html
>>
>> IOW: Ideally it should be fixed by Sunday.
>>
>> I'll hence likely soon will point Linus to this and suggest to revert
>> this, unless there are strong reasons against that or some sort of
>> agreement on a better solution.
> 
> You might want to wait until everyone is back on Monday, the discussion
> is still ongoing, but it seems to be getting to a conclusion.

Yeah, came to a similar conclusion, but want to mentioned it
nevertheless and already have this prepared (together will appropriate
links to the discussion):

```
A regression causing boot failures on iMX7 (due to a backport this is
also affecting 6.0.y) could be fixed with a quick revert as well. But
looks like there is no need for it, after some back and forth the
developers that care are close to come to an agreement how to fix the
problem properly soonish:
```

Ciao, Thorsten
Marek Vasut Dec. 4, 2022, 3:50 p.m. UTC | #24
On 12/4/22 13:59, Thorsten Leemhuis wrote:
> On 04.12.22 13:50, Marek Vasut wrote:
>> On 12/2/22 16:56, Thorsten Leemhuis wrote:
>>> On 02.12.22 15:31, Marek Vasut wrote:
>>>> On 12/2/22 15:05, Miquel Raynal wrote:
>>>> [...]
>>>>> 3. To fix the current situation:
>>>>>       Immediately revert commit (and prevent it from being backported):
>>>>>       753395ea1e45 ("ARM: dts: imx7: Fix NAND controller size-cells")
>>>>>       This way your own boot flow is fixed in the short term.
>>>>
>>>> Here I disagree, the fix is correct and I think we shouldn't proliferate
>>>> incorrect DTs which don't match the binding document. Rather, if a
>>>> bootloader generates incorrect (new) DT entries, I believe the driver
>>>> should implement a fixup and warn user about this. PC does that as well
>>>> with broken ACPI tables as far as I can tell.
>>>
>>> Well, that might be the right solution in the long run, that's up for
>>> others to decide, but we need to fix this *quickly*. For two reasons
>>> actually: the 6.1 release is near and the change was backported to
>>> stable already.
>>>
>>> For details wrt to the "quickly", see "Prioritize work on fixing
>>> regressions" here:
>>> https://docs.kernel.org/process/handling-regressions.html
>>>
>>> IOW: Ideally it should be fixed by Sunday.
>>>
>>> I'll hence likely soon will point Linus to this and suggest to revert
>>> this, unless there are strong reasons against that or some sort of
>>> agreement on a better solution.
>>
>> You might want to wait until everyone is back on Monday, the discussion
>> is still ongoing, but it seems to be getting to a conclusion.
> 
> Yeah, came to a similar conclusion, but want to mentioned it
> nevertheless and already have this prepared (together will appropriate
> links to the discussion):
> 
> ```
> A regression causing boot failures on iMX7 (due to a backport this is
> also affecting 6.0.y) could be fixed with a quick revert as well. But
> looks like there is no need for it, after some back and forth the
> developers that care are close to come to an agreement how to fix the
> problem properly soonish:
> ```

ACK, thanks
Francesco Dolcini Dec. 5, 2022, 11:26 a.m. UTC | #25
On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:
> But here I would say this is a firmware bug and it might have to be handled
> like a firmware bug, i.e. with fixup in the partition parser. I seem to be
> changing my opinion here again.

I was thinking at this over the weekend, and I came to the following
ideas:

 - we need some improvement on the fixup we already have in the
   partition parser. We cannot ignore the fdt produced by U-Boot - as
   bad as it is.
 - the proposed fixup is fine for the immediate need, but it is
   not going to be enough to cover the general issue with the U-Boot
   generated partitions. U-Boot might keep generating partitions as direct
   child of the nand controller even when a partitions{} node is
   available. In this case the current parser just fails since it looks
   only into it and it will find it empty.
 - the current U-Boot only handle partitions{} as a direct child of the
   nand-controller, the nand-chip is ignored. This is not the way it is
   supposed to work. U-Boot code would need to be improved.

With all of that said I think that Miquel is right

> When a patch breaks a board and there is no straight fix, you revert
> it, then you think harder. That's what I am saying. This is a temporary
> solution.

?

Francesco
Francesco Dolcini Dec. 5, 2022, 11:30 a.m. UTC | #26
Hello Miquel

On Fri, Dec 02, 2022 at 06:20:02PM +0100, Francesco Dolcini wrote:
> On Fri, Dec 02, 2022 at 05:42:55PM +0100, Miquel Raynal wrote:
> > Please also do it with the NAND chip described. If, when the NAND chip
> > is described U-Boot tries to create partitions in the controller node,
> > then the situation is even worse than I thought. But I believe
> 
> It's like that for U-Boot older than v2022.04 ... and IMO we cannot
> ignore it.
> 
> Said that from the code U-Boot looks into a `partition{}` node only as a
> direct child of the nand-controller, if there is a nand-chip in between
> the nand-controller{} and the partitions{} it will just ignore it.
> 
> I could try to see what it is doing exactly, but I would need a little
> bit more time, I just tried changing the DTS as wrote I got a non
> bootable system.

If I have a nand-chip { partitions {} } described in the dts U-Boot
(even the latest one) ignores it and generates the partition as child of
the nand controller, the linux parser however see that partitions{}
exists, even if empty, and ignore the partition directly defined as
child of the nand controller.

TL;DR: parser fails and boot fails according to that.

Francesco
Miquel Raynal Dec. 5, 2022, 1:49 p.m. UTC | #27
Hi Francesco,

francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:26:44 +0100:

> On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:
> > But here I would say this is a firmware bug and it might have to be handled
> > like a firmware bug, i.e. with fixup in the partition parser. I seem to be
> > changing my opinion here again.  
> 
> I was thinking at this over the weekend, and I came to the following
> ideas:
> 
>  - we need some improvement on the fixup we already have in the
>    partition parser. We cannot ignore the fdt produced by U-Boot - as
>    bad as it is.
>  - the proposed fixup is fine for the immediate need, but it is
>    not going to be enough to cover the general issue with the U-Boot
>    generated partitions. U-Boot might keep generating partitions as direct
>    child of the nand controller even when a partitions{} node is
>    available. In this case the current parser just fails since it looks
>    only into it and it will find it empty.
>  - the current U-Boot only handle partitions{} as a direct child of the
>    nand-controller, the nand-chip is ignored. This is not the way it is
>    supposed to work. U-Boot code would need to be improved.

I've been thinking about it this weekend as well and the current fix
which "just set" s_cell to 1 seems risky for me, it is typically the
type of quick & dirty fix that might even break other board (nobody
knew that U-Boot current logic expected #size-cells to be set in the
DT, what if another "broken" DT expects the opposite...), not
mentioning potential issues with big storages (> 4GiB).

All in all, I really think we should revert the DT change now, reverting
as little to no drawbacks besides a dt_binding_check warning and gives
us time to deal with it properly (both in U-Boot and Linux).

> With all of that said I think that Miquel is right
> 
> > When a patch breaks a board and there is no straight fix, you revert
> > it, then you think harder. That's what I am saying. This is a temporary
> > solution.  
> 
> ?
> 
> Francesco
> 
> 

Thanks,
Miquèl
Miquel Raynal Dec. 5, 2022, 3:28 p.m. UTC | #28
Hi Francesco,

francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:30:16 +0100:

> Hello Miquel
> 
> On Fri, Dec 02, 2022 at 06:20:02PM +0100, Francesco Dolcini wrote:
> > On Fri, Dec 02, 2022 at 05:42:55PM +0100, Miquel Raynal wrote:  
> > > Please also do it with the NAND chip described. If, when the NAND chip
> > > is described U-Boot tries to create partitions in the controller node,
> > > then the situation is even worse than I thought. But I believe  
> > 
> > It's like that for U-Boot older than v2022.04 ... and IMO we cannot
> > ignore it.
> > 
> > Said that from the code U-Boot looks into a `partition{}` node only as a
> > direct child of the nand-controller, if there is a nand-chip in between
> > the nand-controller{} and the partitions{} it will just ignore it.
> > 
> > I could try to see what it is doing exactly, but I would need a little
> > bit more time, I just tried changing the DTS as wrote I got a non
> > bootable system.  
> 
> If I have a nand-chip { partitions {} } described in the dts U-Boot
> (even the latest one) ignores it and generates the partition as child of
> the nand controller, the linux parser however see that partitions{}
> exists, even if empty, and ignore the partition directly defined as
> child of the nand controller.
> 
> TL;DR: parser fails and boot fails according to that.

Yeah I get that. For me the longterm goal should be to just kill that
function. We have proper DT support today, Linux knows how to read the
mtdparts cmdline variable, so there is no need for anything else. I
guess in U-Boot we should just:
- warn users of this function that the function is deprecated and they
  should update their machine support
- just migrate to another solution on the colibri board

What do you think?

Thanks,
Miquèl
Marek Vasut Dec. 5, 2022, 4:25 p.m. UTC | #29
On 12/5/22 14:49, Miquel Raynal wrote:
> Hi Francesco,

Hi,

> francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:26:44 +0100:
> 
>> On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:
>>> But here I would say this is a firmware bug and it might have to be handled
>>> like a firmware bug, i.e. with fixup in the partition parser. I seem to be
>>> changing my opinion here again.
>>
>> I was thinking at this over the weekend, and I came to the following
>> ideas:
>>
>>   - we need some improvement on the fixup we already have in the
>>     partition parser. We cannot ignore the fdt produced by U-Boot - as
>>     bad as it is.
>>   - the proposed fixup is fine for the immediate need, but it is
>>     not going to be enough to cover the general issue with the U-Boot
>>     generated partitions. U-Boot might keep generating partitions as direct
>>     child of the nand controller even when a partitions{} node is
>>     available. In this case the current parser just fails since it looks
>>     only into it and it will find it empty.
>>   - the current U-Boot only handle partitions{} as a direct child of the
>>     nand-controller, the nand-chip is ignored. This is not the way it is
>>     supposed to work. U-Boot code would need to be improved.
> 
> I've been thinking about it this weekend as well and the current fix
> which "just set" s_cell to 1 seems risky for me, it is typically the
> type of quick & dirty fix that might even break other board (nobody
> knew that U-Boot current logic expected #size-cells to be set in the
> DT, what if another "broken" DT expects the opposite...)

Then with the current configuration, such broken DT would not work, 
since current DT does set #size-cells=<1> (wrongly).

> , not
> mentioning potential issues with big storages (> 4GiB).
> 
> All in all, I really think we should revert the DT change now, reverting
> as little to no drawbacks besides a dt_binding_check warning and gives
> us time to deal with it properly (both in U-Boot and Linux).

I am really not happy with this, but if that's marked as intermediate 
fix, go for it.

How do we deal with this in the long run however? Parser-side fix like 
this one, maybe with better heuristics ?
Miquel Raynal Dec. 15, 2022, 7:16 a.m. UTC | #30
Hi Marek & Francesco,

marex@denx.de wrote on Mon, 5 Dec 2022 17:25:11 +0100:

> On 12/5/22 14:49, Miquel Raynal wrote:
> > Hi Francesco,  
> 
> Hi,
> 
> > francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:26:44 +0100:
> >   
> >> On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:  
> >>> But here I would say this is a firmware bug and it might have to be handled
> >>> like a firmware bug, i.e. with fixup in the partition parser. I seem to be
> >>> changing my opinion here again.  
> >>
> >> I was thinking at this over the weekend, and I came to the following
> >> ideas:
> >>
> >>   - we need some improvement on the fixup we already have in the
> >>     partition parser. We cannot ignore the fdt produced by U-Boot - as
> >>     bad as it is.
> >>   - the proposed fixup is fine for the immediate need, but it is
> >>     not going to be enough to cover the general issue with the U-Boot
> >>     generated partitions. U-Boot might keep generating partitions as direct
> >>     child of the nand controller even when a partitions{} node is
> >>     available. In this case the current parser just fails since it looks
> >>     only into it and it will find it empty.
> >>   - the current U-Boot only handle partitions{} as a direct child of the
> >>     nand-controller, the nand-chip is ignored. This is not the way it is
> >>     supposed to work. U-Boot code would need to be improved.  
> > 
> > I've been thinking about it this weekend as well and the current fix
> > which "just set" s_cell to 1 seems risky for me, it is typically the
> > type of quick & dirty fix that might even break other board (nobody
> > knew that U-Boot current logic expected #size-cells to be set in the
> > DT, what if another "broken" DT expects the opposite...)  
> 
> Then with the current configuration, such broken DT would not work, since current DT does set #size-cells=<1> (wrongly).
> 
> > , not
> > mentioning potential issues with big storages (> 4GiB).
> > 
> > All in all, I really think we should revert the DT change now, reverting
> > as little to no drawbacks besides a dt_binding_check warning and gives
> > us time to deal with it properly (both in U-Boot and Linux).  
> 
> I am really not happy with this, but if that's marked as intermediate fix, go for it.
> 
> How do we deal with this in the long run however? Parser-side fix like this one, maybe with better heuristics ?

Yesterday while talking about an ACPI mis-description which needed
fixing, I realized fixing up what the firmware provides to Linux should
preferably be handled as early as possible. So my first first idea was
to avoid using the broken "fixup mtdparts" function in U-Boot and I am
still convinced this is what we should do in priority. However, as
rightly pointed in this thread, we need to take care about the case
where someone would use a newer DT (let's say, with the reverted changed
reverted again) with an old U-Boot. I am still against piggy hacks in
the generic ofpart.c driver, but what we could do however is a DT
fixup in the init_machine (or the dt_fixup) hook for imx7 Colibri, very
much like this:
https://elixir.bootlin.com/linux/latest/source/arch/arm/mach-mvebu/board-v7.c#L111
Plus a warning there saying "your dt is broken, update your firmware".

So next time someone stumbles upon this issue, we can tell them "fix
your bootloader", and apply the same hack in their board family (there
are three or four IIRC which might be concerned some day).

That would fix all cases and only have an impact on the affected boards.

Thanks,
Miquèl
Marek Vasut Dec. 15, 2022, 7:45 a.m. UTC | #31
On 12/15/22 08:16, Miquel Raynal wrote:
> Hi Marek & Francesco,

Hi,

> marex@denx.de wrote on Mon, 5 Dec 2022 17:25:11 +0100:
> 
>> On 12/5/22 14:49, Miquel Raynal wrote:
>>> Hi Francesco,
>>
>> Hi,
>>
>>> francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:26:44 +0100:
>>>    
>>>> On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:
>>>>> But here I would say this is a firmware bug and it might have to be handled
>>>>> like a firmware bug, i.e. with fixup in the partition parser. I seem to be
>>>>> changing my opinion here again.
>>>>
>>>> I was thinking at this over the weekend, and I came to the following
>>>> ideas:
>>>>
>>>>    - we need some improvement on the fixup we already have in the
>>>>      partition parser. We cannot ignore the fdt produced by U-Boot - as
>>>>      bad as it is.
>>>>    - the proposed fixup is fine for the immediate need, but it is
>>>>      not going to be enough to cover the general issue with the U-Boot
>>>>      generated partitions. U-Boot might keep generating partitions as direct
>>>>      child of the nand controller even when a partitions{} node is
>>>>      available. In this case the current parser just fails since it looks
>>>>      only into it and it will find it empty.
>>>>    - the current U-Boot only handle partitions{} as a direct child of the
>>>>      nand-controller, the nand-chip is ignored. This is not the way it is
>>>>      supposed to work. U-Boot code would need to be improved.
>>>
>>> I've been thinking about it this weekend as well and the current fix
>>> which "just set" s_cell to 1 seems risky for me, it is typically the
>>> type of quick & dirty fix that might even break other board (nobody
>>> knew that U-Boot current logic expected #size-cells to be set in the
>>> DT, what if another "broken" DT expects the opposite...)
>>
>> Then with the current configuration, such broken DT would not work, since current DT does set #size-cells=<1> (wrongly).
>>
>>> , not
>>> mentioning potential issues with big storages (> 4GiB).
>>>
>>> All in all, I really think we should revert the DT change now, reverting
>>> as little to no drawbacks besides a dt_binding_check warning and gives
>>> us time to deal with it properly (both in U-Boot and Linux).
>>
>> I am really not happy with this, but if that's marked as intermediate fix, go for it.
>>
>> How do we deal with this in the long run however? Parser-side fix like this one, maybe with better heuristics ?
> 
> Yesterday while talking about an ACPI mis-description which needed
> fixing, I realized fixing up what the firmware provides to Linux should
> preferably be handled as early as possible. So my first first idea was
> to avoid using the broken "fixup mtdparts" function in U-Boot and I am
> still convinced this is what we should do in priority. However, as
> rightly pointed in this thread, we need to take care about the case
> where someone would use a newer DT (let's say, with the reverted changed
> reverted again) with an old U-Boot. I am still against piggy hacks in
> the generic ofpart.c driver, but what we could do however is a DT
> fixup in the init_machine (or the dt_fixup) hook for imx7 Colibri, very
> much like this:
> https://elixir.bootlin.com/linux/latest/source/arch/arm/mach-mvebu/board-v7.c#L111
> Plus a warning there saying "your dt is broken, update your firmware".

This does not work, because the old U-Boot fixup_mtdparts() may be 
applied on any machine, it is not colibri mx7 specific. Also, new 
arch-side workaround are really not welcome by the architecture 
maintainers as far as I can tell.

> So next time someone stumbles upon this issue, we can tell them "fix
> your bootloader", and apply the same hack in their board family (there
> are three or four IIRC which might be concerned some day).

There are also those machines we do not even know about which might be 
generating bogus DT using old U-Boot and fixup_mtdparts(), so, unless 
there is some all-arch fixup implementation, we wouldn't be able to fix 
them all on arch side. I think the all-arch fixup implementation would 
be the driver one, i.e. this patch as it is (or maybe with some 
improvement).

> That would fix all cases and only have an impact on the affected boards.

Sadly, it does only fix the known cases, not the unknown cases like 
downstream forks which never get any bootloader updates ever, and which 
you can't find in upstream U-Boot, and which you therefore cannot easily 
catch in the arch side fixup.

[...]
Miquel Raynal Dec. 15, 2022, 8:04 a.m. UTC | #32
Hi Marek,

marex@denx.de wrote on Thu, 15 Dec 2022 08:45:33 +0100:

> On 12/15/22 08:16, Miquel Raynal wrote:
> > Hi Marek & Francesco,  
> 
> Hi,
> 
> > marex@denx.de wrote on Mon, 5 Dec 2022 17:25:11 +0100:
> >   
> >> On 12/5/22 14:49, Miquel Raynal wrote:  
> >>> Hi Francesco,  
> >>
> >> Hi,
> >>  
> >>> francesco@dolcini.it wrote on Mon, 5 Dec 2022 12:26:44 +0100:  
> >>>    >>>> On Fri, Dec 02, 2022 at 06:08:22PM +0100, Marek Vasut wrote:  
> >>>>> But here I would say this is a firmware bug and it might have to be handled
> >>>>> like a firmware bug, i.e. with fixup in the partition parser. I seem to be
> >>>>> changing my opinion here again.  
> >>>>
> >>>> I was thinking at this over the weekend, and I came to the following
> >>>> ideas:
> >>>>
> >>>>    - we need some improvement on the fixup we already have in the
> >>>>      partition parser. We cannot ignore the fdt produced by U-Boot - as
> >>>>      bad as it is.
> >>>>    - the proposed fixup is fine for the immediate need, but it is
> >>>>      not going to be enough to cover the general issue with the U-Boot
> >>>>      generated partitions. U-Boot might keep generating partitions as direct
> >>>>      child of the nand controller even when a partitions{} node is
> >>>>      available. In this case the current parser just fails since it looks
> >>>>      only into it and it will find it empty.
> >>>>    - the current U-Boot only handle partitions{} as a direct child of the
> >>>>      nand-controller, the nand-chip is ignored. This is not the way it is
> >>>>      supposed to work. U-Boot code would need to be improved.  
> >>>
> >>> I've been thinking about it this weekend as well and the current fix
> >>> which "just set" s_cell to 1 seems risky for me, it is typically the
> >>> type of quick & dirty fix that might even break other board (nobody
> >>> knew that U-Boot current logic expected #size-cells to be set in the
> >>> DT, what if another "broken" DT expects the opposite...)  
> >>
> >> Then with the current configuration, such broken DT would not work, since current DT does set #size-cells=<1> (wrongly).
> >>  
> >>> , not
> >>> mentioning potential issues with big storages (> 4GiB).
> >>>
> >>> All in all, I really think we should revert the DT change now, reverting
> >>> as little to no drawbacks besides a dt_binding_check warning and gives
> >>> us time to deal with it properly (both in U-Boot and Linux).  
> >>
> >> I am really not happy with this, but if that's marked as intermediate fix, go for it.
> >>
> >> How do we deal with this in the long run however? Parser-side fix like this one, maybe with better heuristics ?  
> > 
> > Yesterday while talking about an ACPI mis-description which needed
> > fixing, I realized fixing up what the firmware provides to Linux should
> > preferably be handled as early as possible. So my first first idea was
> > to avoid using the broken "fixup mtdparts" function in U-Boot and I am
> > still convinced this is what we should do in priority. However, as
> > rightly pointed in this thread, we need to take care about the case
> > where someone would use a newer DT (let's say, with the reverted changed
> > reverted again) with an old U-Boot. I am still against piggy hacks in
> > the generic ofpart.c driver, but what we could do however is a DT
> > fixup in the init_machine (or the dt_fixup) hook for imx7 Colibri, very
> > much like this:
> > https://elixir.bootlin.com/linux/latest/source/arch/arm/mach-mvebu/board-v7.c#L111
> > Plus a warning there saying "your dt is broken, update your firmware".  
> 
> This does not work, because the old U-Boot fixup_mtdparts() may be applied on any machine,

No: https://elixir.bootlin.com/u-boot/latest/A/ident/fdt_fixup_mtdparts
And we should make our best so its use does not proliferate.
It's not like there is half a dozen of good ways to describe and forward
partitions today.

> it is not colibri mx7 specific. Also, new arch-side workaround are
> really not welcome by the architecture maintainers as far as I can
> tell.

So what? Let's propose the change and see what the maintainers have to
say. I am open to discussion.

As I said, it is not colibri mx7 specific, there are a few boards which
might be affected, they are all clearly identifiable with a compatible.
It's not the entire planet either.

> > So next time someone stumbles upon this issue, we can tell them "fix
> > your bootloader", and apply the same hack in their board family (there
> > are three or four IIRC which might be concerned some day).  
> 
> There are also those machines we do not even know about which might be generating bogus DT using old U-Boot and fixup_mtdparts(), so, unless there is some all-arch fixup implementation, we wouldn't be able to fix them all on arch side. I think the all-arch fixup implementation would be the driver one, i.e. this patch as it is (or maybe with some improvement).

If we don't know about them, as you say, I don't feel concerned.

If something is buggy, people will report it, we will point them in the
right direction so they can fix their firmware and propose a similar
fix in their case which will involve adding a new machine compatible to
the list of boards that should tweak the #size-cell property.

> > That would fix all cases and only have an impact on the affected boards.  
> 
> Sadly, it does only fix the known cases, not the unknown cases like downstream forks which never get any bootloader updates ever, and which you can't find in upstream U-Boot, and which you therefore cannot easily catch in the arch side fixup.

And ?

Thanks,
Miquèl
Marek Vasut Dec. 16, 2022, 12:36 a.m. UTC | #33
On 12/15/22 09:04, Miquel Raynal wrote:
> Hi Marek,

Hi,

[...]

>>> Yesterday while talking about an ACPI mis-description which needed
>>> fixing, I realized fixing up what the firmware provides to Linux should
>>> preferably be handled as early as possible. So my first first idea was
>>> to avoid using the broken "fixup mtdparts" function in U-Boot and I am
>>> still convinced this is what we should do in priority. However, as
>>> rightly pointed in this thread, we need to take care about the case
>>> where someone would use a newer DT (let's say, with the reverted changed
>>> reverted again) with an old U-Boot. I am still against piggy hacks in
>>> the generic ofpart.c driver, but what we could do however is a DT
>>> fixup in the init_machine (or the dt_fixup) hook for imx7 Colibri, very
>>> much like this:
>>> https://elixir.bootlin.com/linux/latest/source/arch/arm/mach-mvebu/board-v7.c#L111
>>> Plus a warning there saying "your dt is broken, update your firmware".
>>
>> This does not work, because the old U-Boot fixup_mtdparts() may be applied on any machine,
> 
> No: https://elixir.bootlin.com/u-boot/latest/A/ident/fdt_fixup_mtdparts

These are the boards from vendors who upstreamed their properly.

This does NOT take into account either boards which are using downstream 
U-Boot, or older U-Boot e.g. because they can not easily update.

> And we should make our best so its use does not proliferate.

I am not disputing that, but that's a separate U-Boot side fix which I 
hope Francesco would submit soon, AND, more importantly, the code is 
already in at least two U-Boot releases, so it's done, it's not going 
away any time soon.

> It's not like there is half a dozen of good ways to describe and forward
> partitions today.

That's really not what I am disputing here, the approach to describing 
partitions is crystal clear as far as I can tell.

>> it is not colibri mx7 specific. Also, new arch-side workaround are
>> really not welcome by the architecture maintainers as far as I can
>> tell.
> 
> So what? Let's propose the change and see what the maintainers have to
> say. I am open to discussion.

Why is there such strong opposition toward generic fix in the OF 
partition parser ?

> As I said, it is not colibri mx7 specific, there are a few boards which
> might be affected,

... that you know about ...

> they are all clearly identifiable with a compatible.
> It's not the entire planet either.

Neither of us can make this statement with certainty, because neither of 
us knows what hardware is running the affected version of U-Boot.

>>> So next time someone stumbles upon this issue, we can tell them "fix
>>> your bootloader", and apply the same hack in their board family (there
>>> are three or four IIRC which might be concerned some day).
>>
>> There are also those machines we do not even know about which might be generating bogus DT using old U-Boot and fixup_mtdparts(), so, unless there is some all-arch fixup implementation, we wouldn't be able to fix them all on arch side. I think the all-arch fixup implementation would be the driver one, i.e. this patch as it is (or maybe with some improvement).
> 
> If we don't know about them, as you say, I don't feel concerned.
> 
> If something is buggy, people will report it, we will point them in the
> right direction so they can fix their firmware and propose a similar
> fix in their case which will involve adding a new machine compatible to
> the list of boards that should tweak the #size-cell property.

Why is a potentially lengthy list of compatible strings in arch code, 
which every single user has to find _after_ their system completely 
fails to boot and forces them to perform potentially difficult recovery, 
potentially after an update to new linux-stable release no less -- over 
-- 4 liner generic fix in OF partition parser, which covers all the 
systems, does not cause systems to fail to boot completely, does not 
force users to suffer through recovery, does not require a list of 
compatibles in arch code, and rather only gracefully prints a warning ?

I very much prefer the second solution over the first.

And one more thing, the list of compatibles in arch code does not really 
work anyway, since once user updates their bootloader, the compatible 
won't change and the arch-side workaround would still be applied, which 
is not desired at that point anymore.

>>> That would fix all cases and only have an impact on the affected boards.
>>
>> Sadly, it does only fix the known cases, not the unknown cases like downstream forks which never get any bootloader updates ever, and which you can't find in upstream U-Boot, and which you therefore cannot easily catch in the arch side fixup.
> 
> And ?

I was under the impression Linux was supposed to deliver the best 
possible experience to its users even on not-perfect hardware, and if 
there are any quirks, the kernel should try to fix them up or work 
around them as best as it can, not dismiss them as broken hardware and 
fail to boot outright.
Francesco Dolcini Dec. 16, 2022, 7:52 a.m. UTC | #34
On Fri, Dec 16, 2022 at 01:36:03AM +0100, Marek Vasut wrote:
> On 12/15/22 09:04, Miquel Raynal wrote:
> > > > That would fix all cases and only have an impact on the affected
> > > > boards.
> > > 
> > > Sadly, it does only fix the known cases, not the unknown cases
> > > like downstream forks which never get any bootloader updates ever,
> > > and which you can't find in upstream U-Boot, and which you
> > > therefore cannot easily catch in the arch side fixup.
> > 
> > And ?
> 
> I was under the impression Linux was supposed to deliver the best possible
> experience to its users even on not-perfect hardware, and if there are any
> quirks, the kernel should try to fix them up or work around them as best as
> it can, not dismiss them as broken hardware and fail to boot outright.

I would say something more on this.

We are not talking about Linux not working well on some hardware, we are
talking about breaking hardware that was working fine since ever.
I believe that the Linux has a quite strong point of view on such kind
of regression.

Quoting Linus
> If the kernel used to work for you, the rule is that it continues to work for you.

Francesco
diff mbox series

Patch

diff --git a/drivers/mtd/parsers/ofpart_core.c b/drivers/mtd/parsers/ofpart_core.c
index 192190c42fc8..aa3b7fa61e50 100644
--- a/drivers/mtd/parsers/ofpart_core.c
+++ b/drivers/mtd/parsers/ofpart_core.c
@@ -122,6 +122,17 @@  static int parse_fixed_partitions(struct mtd_info *master,
 
 		a_cells = of_n_addr_cells(pp);
 		s_cells = of_n_size_cells(pp);
+		if (s_cells == 0) {
+			/*
+			 * Use #size-cells = <1> for backward compatibility
+			 * in case #size-cells is set to <0> and firmware adds
+			 * OF partitions without setting it.
+			 */
+			pr_warn_once("%s: ofpart partition %pOF (%pOF) #size-cells is <0>, using <1> for backward compatibility.\n",
+				     master->name, pp,
+				     mtd_node);
+			s_cells = 1;
+		}
 		if (len / 4 != a_cells + s_cells) {
 			pr_debug("%s: ofpart partition %pOF (%pOF) error parsing reg property.\n",
 				 master->name, pp,