mbox series

[RFC,devicetree,00/10] Do something about ls-extirq interrupt-map breakage

Message ID 20211214013800.2703568-1-vladimir.oltean@nxp.com (mailing list archive)
Headers show
Series Do something about ls-extirq interrupt-map breakage | expand

Message

Vladimir Oltean Dec. 14, 2021, 1:37 a.m. UTC
Currently the ls-extirq driver's use of the "interrupt-map" property is
double-broken:
- once by Rob Herring's commit 869f0ec048dc ("arm64: dts: freescale: Fix
  'interrupt-map' parent address cells")
- twice by Marc Zyngier's commit 041284181226 ("of/irq: Allow matching
  of an interrupt-map local to an interrupt controller"), later revised,
  not very elegantly, through commit de4adddcbcc2 ("of/irq: Add a quirk
  for controllers with their own definition of interrupt-map"). So this
  part works but we're on an offender list.

Mark suggests that the problem may lie with the ls-extirq driver, and
its interpretation of the "interrupt-map" property, to be exact.

This set of changes attempts to make the problem smaller by using a
vendor-specific name for the property, and reverts Rob's patch because
similarity with "interrupt-map" isn't actually a desirable feature after
all, it seems.

Vladimir Oltean (10):
  irqchip/ls-extirq: rename "interrupt-map" OF property to
    "fsl,extirq-map"
  Revert "arm64: dts: freescale: Fix 'interrupt-map' parent address
    cells"
  dt-bindings: ls-extirq: replace "interrupt-map" documentation with
    "fsl,extirq-map"
  arm64: dts: ls1043a: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  arm64: dts: ls1046a: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  arm64: dts: ls1088a: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  arm64: dts: ls208xa: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  arm64: dts: lx2160a: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  ARM: dts: ls1021a: rename the "interrupt-map" of the extirq node to
    "fsl,extirq-map"
  dt-bindings: ls-extirq: add a YAML schema for the validator

 .../interrupt-controller/fsl,ls-extirq.txt    |  53 ---------
 .../interrupt-controller/fsl,ls-extirq.yaml   | 110 ++++++++++++++++++
 arch/arm/boot/dts/ls1021a.dtsi                |   3 +-
 .../arm64/boot/dts/freescale/fsl-ls1043a.dtsi |   3 +-
 .../arm64/boot/dts/freescale/fsl-ls1046a.dtsi |   3 +-
 .../arm64/boot/dts/freescale/fsl-ls1088a.dtsi |  27 +++--
 .../arm64/boot/dts/freescale/fsl-ls208xa.dtsi |  27 +++--
 .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi |  27 +++--
 drivers/irqchip/irq-ls-extirq.c               |  12 +-
 9 files changed, 161 insertions(+), 104 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.txt
 create mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml

Comments

Marc Zyngier Dec. 14, 2021, 8:51 a.m. UTC | #1
On Tue, 14 Dec 2021 01:37:50 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> Currently the ls-extirq driver's use of the "interrupt-map" property is
> double-broken:
> - once by Rob Herring's commit 869f0ec048dc ("arm64: dts: freescale: Fix
>   'interrupt-map' parent address cells")
> - twice by Marc Zyngier's commit 041284181226 ("of/irq: Allow matching
>   of an interrupt-map local to an interrupt controller"), later revised,
>   not very elegantly, through commit de4adddcbcc2 ("of/irq: Add a quirk

Elegance is, I'm afraid to say, bloody overrated when dealing with
this sort of crap.


>   for controllers with their own definition of interrupt-map"). So this
>   part works but we're on an offender list.

Define 'part works'. Either it does, or it doesn't. There is no middle
ground here.

> 
> Mark suggests that the problem may lie with the ls-extirq driver, and
> its interpretation of the "interrupt-map" property, to be exact.

s/Mark/Marc/, unless you are talking about someone else (who?).

> 
> This set of changes attempts to make the problem smaller by using a
> vendor-specific name for the property, and reverts Rob's patch because
> similarity with "interrupt-map" isn't actually a desirable feature after
> all, it seems.
> 
> Vladimir Oltean (10):
>   irqchip/ls-extirq: rename "interrupt-map" OF property to
>     "fsl,extirq-map"
>   Revert "arm64: dts: freescale: Fix 'interrupt-map' parent address
>     cells"
>   dt-bindings: ls-extirq: replace "interrupt-map" documentation with
>     "fsl,extirq-map"
>   arm64: dts: ls1043a: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   arm64: dts: ls1046a: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   arm64: dts: ls1088a: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   arm64: dts: ls208xa: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   arm64: dts: lx2160a: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   ARM: dts: ls1021a: rename the "interrupt-map" of the extirq node to
>     "fsl,extirq-map"
>   dt-bindings: ls-extirq: add a YAML schema for the validator
> 
>  .../interrupt-controller/fsl,ls-extirq.txt    |  53 ---------
>  .../interrupt-controller/fsl,ls-extirq.yaml   | 110 ++++++++++++++++++
>  arch/arm/boot/dts/ls1021a.dtsi                |   3 +-
>  .../arm64/boot/dts/freescale/fsl-ls1043a.dtsi |   3 +-
>  .../arm64/boot/dts/freescale/fsl-ls1046a.dtsi |   3 +-
>  .../arm64/boot/dts/freescale/fsl-ls1088a.dtsi |  27 +++--
>  .../arm64/boot/dts/freescale/fsl-ls208xa.dtsi |  27 +++--
>  .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi |  27 +++--
>  drivers/irqchip/irq-ls-extirq.c               |  12 +-
>  9 files changed, 161 insertions(+), 104 deletions(-)
>  delete mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.txt
>  create mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml

This is totally pointless. These machines have been in the wild for
years, and existing DTs will be there *forever*. The very notion of
'backporting' DT changes is totally ludicrous when it is some firmware
(ATF, u-boot, or something else *that isn't under your control*) that
provides the DT. It also breaks backward compatibility (old kernel
with new DT), which is just as important. Why do you think I went the
elegance-deprived route and added a quirk?

So no, I'm not taking the irqchip changes, as most of this churn
serves no purpose. The revert of 869f0ec048dc is the only thing that
makes some sense.

	M.
Vladimir Oltean Dec. 14, 2021, 9:58 a.m. UTC | #2
Hi Marc (with a c),

On Tue, Dec 14, 2021 at 08:51:47AM +0000, Marc Zyngier wrote:
> On Tue, 14 Dec 2021 01:37:50 +0000,
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > 
> > Currently the ls-extirq driver's use of the "interrupt-map" property is
> > double-broken:
> > - once by Rob Herring's commit 869f0ec048dc ("arm64: dts: freescale: Fix
> >   'interrupt-map' parent address cells")
> > - twice by Marc Zyngier's commit 041284181226 ("of/irq: Allow matching
> >   of an interrupt-map local to an interrupt controller"), later revised,
> >   not very elegantly, through commit de4adddcbcc2 ("of/irq: Add a quirk
> 
> Elegance is, I'm afraid to say, bloody overrated when dealing with
> this sort of crap.
> 
> 
> >   for controllers with their own definition of interrupt-map"). So this
> >   part works but we're on an offender list.
> 
> Define 'part works'. Either it does, or it doesn't. There is no middle
> ground here.

"Part" is the subject, "works" is the predicate. It is a part of a
larger set of changes that now works after some breakage.

> > 
> > Mark suggests that the problem may lie with the ls-extirq driver, and
> > its interpretation of the "interrupt-map" property, to be exact.
> 
> s/Mark/Marc/, unless you are talking about someone else (who?).

Twas a typo, my hand must have slipped. This will not happen again.

> > This set of changes attempts to make the problem smaller by using a
> > vendor-specific name for the property, and reverts Rob's patch because
> > similarity with "interrupt-map" isn't actually a desirable feature after
> > all, it seems.
> > 
> > Vladimir Oltean (10):
> >   irqchip/ls-extirq: rename "interrupt-map" OF property to
> >     "fsl,extirq-map"
> >   Revert "arm64: dts: freescale: Fix 'interrupt-map' parent address
> >     cells"
> >   dt-bindings: ls-extirq: replace "interrupt-map" documentation with
> >     "fsl,extirq-map"
> >   arm64: dts: ls1043a: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   arm64: dts: ls1046a: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   arm64: dts: ls1088a: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   arm64: dts: ls208xa: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   arm64: dts: lx2160a: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   ARM: dts: ls1021a: rename the "interrupt-map" of the extirq node to
> >     "fsl,extirq-map"
> >   dt-bindings: ls-extirq: add a YAML schema for the validator
> > 
> >  .../interrupt-controller/fsl,ls-extirq.txt    |  53 ---------
> >  .../interrupt-controller/fsl,ls-extirq.yaml   | 110 ++++++++++++++++++
> >  arch/arm/boot/dts/ls1021a.dtsi                |   3 +-
> >  .../arm64/boot/dts/freescale/fsl-ls1043a.dtsi |   3 +-
> >  .../arm64/boot/dts/freescale/fsl-ls1046a.dtsi |   3 +-
> >  .../arm64/boot/dts/freescale/fsl-ls1088a.dtsi |  27 +++--
> >  .../arm64/boot/dts/freescale/fsl-ls208xa.dtsi |  27 +++--
> >  .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi |  27 +++--
> >  drivers/irqchip/irq-ls-extirq.c               |  12 +-
> >  9 files changed, 161 insertions(+), 104 deletions(-)
> >  delete mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.txt
> >  create mode 100644 Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml
> 
> This is totally pointless. These machines have been in the wild for
> years, and existing DTs will be there *forever*. The very notion of
> 'backporting' DT changes is totally ludicrous when it is some firmware
> (ATF, u-boot, or something else *that isn't under your control*) that
> provides the DT. It also breaks backward compatibility (old kernel
> with new DT), which is just as important. Why do you think I went the
> elegance-deprived route and added a quirk?

I wish the firmware for these SoCs was smart enough to be compatible
with the bindings that are in the kernel and provide a blob that the
kernel could actually use. Some work has been started there and this is
work in progress. True, I don't know what other OF-based firmware some
other customers may use, but I trust it isn't a lot more advanced than
what U-Boot currently has :)

Also, the machines may have been in the wild for years, but the
ls-extirq driver was added in November 2019. So not with the
introduction of the SoC device trees themselves. That isn't so long ago.

As for compatibility between old kernel and new DT: I guess you'll hear
various opinions on this one.
https://www.spinics.net/lists/linux-mips/msg07778.html

| > Are we okay with the new device tree blobs breaking the old kernel?
|
| From my point of view, newer device trees are not required to work on
| older kernel, this would impose an unreasonable limitation and the use
| case is very limited.

Sadly we're at a point where we cannot have software any longer that
works with all device trees in circulation, after Rob's change, because
the ls-extirq driver won't know what is the expected correct length of
an "interrupt-map"/"fsl,extirq-map"/whatever-you-want-to-call-it specifier.

And even with the revert, the argument you've brought to the table still
holds: any kernel running on a device tree with Rob's change will stay
broken no matter what we do. I'd like to take a more constructive
approach and see what we can do going forward (in the direction of the
arrow of time, in case that's not clear), at least.

> So no, I'm not taking the irqchip changes, as most of this churn
> serves no purpose. The revert of 869f0ec048dc is the only thing that
> makes some sense.

The devicetree changes were for Shawn to pick up anyway. But I posted
the entire series as RFC to gather comments. Other opinions still welcome.
Marc Zyngier Dec. 14, 2021, 10:20 a.m. UTC | #3
On Tue, 14 Dec 2021 09:58:54 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> Hi Marc (with a c),
> 
> I wish the firmware for these SoCs was smart enough to be compatible
> with the bindings that are in the kernel and provide a blob that the
> kernel could actually use. Some work has been started there and this is
> work in progress. True, I don't know what other OF-based firmware some
> other customers may use, but I trust it isn't a lot more advanced than
> what U-Boot currently has :)
> 
> Also, the machines may have been in the wild for years, but the
> ls-extirq driver was added in November 2019. So not with the
> introduction of the SoC device trees themselves. That isn't so long ago.
> 
> As for compatibility between old kernel and new DT: I guess you'll hear
> various opinions on this one.
> https://www.spinics.net/lists/linux-mips/msg07778.html
> 
> | > Are we okay with the new device tree blobs breaking the old kernel?
> |
> | From my point of view, newer device trees are not required to work on
> | older kernel, this would impose an unreasonable limitation and the use
> | case is very limited.

My views are on the opposite side. DT is an ABI, full stop. If you
change something, you *must* guarantee forward *and* backward
compatibility. That's because:

- you don't control how updatable the firmware is

- people may need to revert to other versions of the kernel because
  the new one is broken

- there are plenty of DT users beyond Linux, and we are not creating
  bindings for Linux only.

You may disagree with this, but for the subsystems I maintain, this is
the rule I intent to stick to.

	M.
Vladimir Oltean Dec. 14, 2021, 10:30 a.m. UTC | #4
On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> On Tue, 14 Dec 2021 09:58:54 +0000,
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > 
> > Hi Marc (with a c),
> > 
> > I wish the firmware for these SoCs was smart enough to be compatible
> > with the bindings that are in the kernel and provide a blob that the
> > kernel could actually use. Some work has been started there and this is
> > work in progress. True, I don't know what other OF-based firmware some
> > other customers may use, but I trust it isn't a lot more advanced than
> > what U-Boot currently has :)
> > 
> > Also, the machines may have been in the wild for years, but the
> > ls-extirq driver was added in November 2019. So not with the
> > introduction of the SoC device trees themselves. That isn't so long ago.
> > 
> > As for compatibility between old kernel and new DT: I guess you'll hear
> > various opinions on this one.
> > https://www.spinics.net/lists/linux-mips/msg07778.html
> > 
> > | > Are we okay with the new device tree blobs breaking the old kernel?
> > |
> > | From my point of view, newer device trees are not required to work on
> > | older kernel, this would impose an unreasonable limitation and the use
> > | case is very limited.
> 
> My views are on the opposite side. DT is an ABI, full stop. If you
> change something, you *must* guarantee forward *and* backward
> compatibility. That's because:
> 
> - you don't control how updatable the firmware is
> 
> - people may need to revert to other versions of the kernel because
>   the new one is broken
> 
> - there are plenty of DT users beyond Linux, and we are not creating
>   bindings for Linux only.
> 
> You may disagree with this, but for the subsystems I maintain, this is
> the rule I intent to stick to.

That's an honorable set of guiding principles, but how do you apply them
here? Reverting Rob's change won't fix the past, and updating the code
to account for one format will break the other. As for trying one
format, and if there's an error try the other, there may be situations
in which you accept invalid input as valid.
Marc Zyngier Dec. 14, 2021, 10:39 a.m. UTC | #5
On Tue, 14 Dec 2021 10:30:26 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > On Tue, 14 Dec 2021 09:58:54 +0000,
> > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > 
> > > Hi Marc (with a c),
> > > 
> > > I wish the firmware for these SoCs was smart enough to be compatible
> > > with the bindings that are in the kernel and provide a blob that the
> > > kernel could actually use. Some work has been started there and this is
> > > work in progress. True, I don't know what other OF-based firmware some
> > > other customers may use, but I trust it isn't a lot more advanced than
> > > what U-Boot currently has :)
> > > 
> > > Also, the machines may have been in the wild for years, but the
> > > ls-extirq driver was added in November 2019. So not with the
> > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > 
> > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > various opinions on this one.
> > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > 
> > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > |
> > > | From my point of view, newer device trees are not required to work on
> > > | older kernel, this would impose an unreasonable limitation and the use
> > > | case is very limited.
> > 
> > My views are on the opposite side. DT is an ABI, full stop. If you
> > change something, you *must* guarantee forward *and* backward
> > compatibility. That's because:
> > 
> > - you don't control how updatable the firmware is
> > 
> > - people may need to revert to other versions of the kernel because
> >   the new one is broken
> > 
> > - there are plenty of DT users beyond Linux, and we are not creating
> >   bindings for Linux only.
> > 
> > You may disagree with this, but for the subsystems I maintain, this is
> > the rule I intent to stick to.
> 
> That's an honorable set of guiding principles, but how do you apply them
> here? Reverting Rob's change won't fix the past, and updating the code
> to account for one format will break the other. As for trying one
> format, and if there's an error try the other, there may be situations
> in which you accept invalid input as valid.

maz@hot-poop:~/arm-platforms$ git describe --contains 869f0ec048dc --match=v\*
v5.16-rc1~125^2~19^2~16

This patch landed in -rc1, and isn't part of any release. Just revert
it, and no damage is done.

	M.
Vladimir Oltean Dec. 14, 2021, 10:53 a.m. UTC | #6
On Tue, Dec 14, 2021 at 10:39:35AM +0000, Marc Zyngier wrote:
> On Tue, 14 Dec 2021 10:30:26 +0000,
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > 
> > On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > > On Tue, 14 Dec 2021 09:58:54 +0000,
> > > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > > 
> > > > Hi Marc (with a c),
> > > > 
> > > > I wish the firmware for these SoCs was smart enough to be compatible
> > > > with the bindings that are in the kernel and provide a blob that the
> > > > kernel could actually use. Some work has been started there and this is
> > > > work in progress. True, I don't know what other OF-based firmware some
> > > > other customers may use, but I trust it isn't a lot more advanced than
> > > > what U-Boot currently has :)
> > > > 
> > > > Also, the machines may have been in the wild for years, but the
> > > > ls-extirq driver was added in November 2019. So not with the
> > > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > > 
> > > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > > various opinions on this one.
> > > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > > 
> > > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > > |
> > > > | From my point of view, newer device trees are not required to work on
> > > > | older kernel, this would impose an unreasonable limitation and the use
> > > > | case is very limited.
> > > 
> > > My views are on the opposite side. DT is an ABI, full stop. If you
> > > change something, you *must* guarantee forward *and* backward
> > > compatibility. That's because:
> > > 
> > > - you don't control how updatable the firmware is
> > > 
> > > - people may need to revert to other versions of the kernel because
> > >   the new one is broken
> > > 
> > > - there are plenty of DT users beyond Linux, and we are not creating
> > >   bindings for Linux only.
> > > 
> > > You may disagree with this, but for the subsystems I maintain, this is
> > > the rule I intent to stick to.
> > 
> > That's an honorable set of guiding principles, but how do you apply them
> > here? Reverting Rob's change won't fix the past, and updating the code
> > to account for one format will break the other. As for trying one
> > format, and if there's an error try the other, there may be situations
> > in which you accept invalid input as valid.
> 
> maz@hot-poop:~/arm-platforms$ git describe --contains 869f0ec048dc --match=v\*
> v5.16-rc1~125^2~19^2~16
> 
> This patch landed in -rc1, and isn't part of any release. Just revert
> it, and no damage is done.

The revert is one of the patches posted here. It will fix the problem
short-term but it may not be enough long-term. I think Rob is working on
some sort of validation for "interrupt-map" and this is how the apparently
non-conformant property was brought to his attention. It will trigger
validation warnings that I'm afraid will be tempting for many to "fix".
Thus the rest of the patches. Maybe it's just me, but between having to
play a whack-a-mole game and snapping compatibility of old kernels with
new DT blobs, I think more time is lost with the latter.
Marc Zyngier Dec. 14, 2021, 11:11 a.m. UTC | #7
On Tue, 14 Dec 2021 10:53:16 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> On Tue, Dec 14, 2021 at 10:39:35AM +0000, Marc Zyngier wrote:
> > On Tue, 14 Dec 2021 10:30:26 +0000,
> > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > 
> > > On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > > > On Tue, 14 Dec 2021 09:58:54 +0000,
> > > > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > > > 
> > > > > Hi Marc (with a c),
> > > > > 
> > > > > I wish the firmware for these SoCs was smart enough to be compatible
> > > > > with the bindings that are in the kernel and provide a blob that the
> > > > > kernel could actually use. Some work has been started there and this is
> > > > > work in progress. True, I don't know what other OF-based firmware some
> > > > > other customers may use, but I trust it isn't a lot more advanced than
> > > > > what U-Boot currently has :)
> > > > > 
> > > > > Also, the machines may have been in the wild for years, but the
> > > > > ls-extirq driver was added in November 2019. So not with the
> > > > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > > > 
> > > > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > > > various opinions on this one.
> > > > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > > > 
> > > > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > > > |
> > > > > | From my point of view, newer device trees are not required to work on
> > > > > | older kernel, this would impose an unreasonable limitation and the use
> > > > > | case is very limited.
> > > > 
> > > > My views are on the opposite side. DT is an ABI, full stop. If you
> > > > change something, you *must* guarantee forward *and* backward
> > > > compatibility. That's because:
> > > > 
> > > > - you don't control how updatable the firmware is
> > > > 
> > > > - people may need to revert to other versions of the kernel because
> > > >   the new one is broken
> > > > 
> > > > - there are plenty of DT users beyond Linux, and we are not creating
> > > >   bindings for Linux only.
> > > > 
> > > > You may disagree with this, but for the subsystems I maintain, this is
> > > > the rule I intent to stick to.
> > > 
> > > That's an honorable set of guiding principles, but how do you apply them
> > > here? Reverting Rob's change won't fix the past, and updating the code
> > > to account for one format will break the other. As for trying one
> > > format, and if there's an error try the other, there may be situations
> > > in which you accept invalid input as valid.
> > 
> > maz@hot-poop:~/arm-platforms$ git describe --contains 869f0ec048dc --match=v\*
> > v5.16-rc1~125^2~19^2~16
> > 
> > This patch landed in -rc1, and isn't part of any release. Just revert
> > it, and no damage is done.
> 
> The revert is one of the patches posted here. It will fix the problem
> short-term but it may not be enough long-term. I think Rob is working on
> some sort of validation for "interrupt-map" and this is how the apparently
> non-conformant property was brought to his attention. It will trigger
> validation warnings that I'm afraid will be tempting for many to "fix".

Then build an annotation mechanism for the warning not to fire for
quirked systems.

> Thus the rest of the patches. Maybe it's just me, but between having to
> play a whack-a-mole game and snapping compatibility of old kernels with
> new DT blobs, I think more time is lost with the latter.

I said what I had to say on the subject, and when it comes to wasted
time, that's more than enough.

	M.
Vladimir Oltean March 24, 2022, 5:10 p.m. UTC | #8
Hello Marc,

On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> On Tue, 14 Dec 2021 09:58:54 +0000,
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > 
> > Hi Marc (with a c),
> > 
> > I wish the firmware for these SoCs was smart enough to be compatible
> > with the bindings that are in the kernel and provide a blob that the
> > kernel could actually use. Some work has been started there and this is
> > work in progress. True, I don't know what other OF-based firmware some
> > other customers may use, but I trust it isn't a lot more advanced than
> > what U-Boot currently has :)
> > 
> > Also, the machines may have been in the wild for years, but the
> > ls-extirq driver was added in November 2019. So not with the
> > introduction of the SoC device trees themselves. That isn't so long ago.
> > 
> > As for compatibility between old kernel and new DT: I guess you'll hear
> > various opinions on this one.
> > https://www.spinics.net/lists/linux-mips/msg07778.html
> > 
> > | > Are we okay with the new device tree blobs breaking the old kernel?
> > |
> > | From my point of view, newer device trees are not required to work on
> > | older kernel, this would impose an unreasonable limitation and the use
> > | case is very limited.
> 
> My views are on the opposite side. DT is an ABI, full stop. If you
> change something, you *must* guarantee forward *and* backward
> compatibility. That's because:
> 
> - you don't control how updatable the firmware is
> 
> - people may need to revert to other versions of the kernel because
>   the new one is broken
> 
> - there are plenty of DT users beyond Linux, and we are not creating
>   bindings for Linux only.
> 
> You may disagree with this, but for the subsystems I maintain, this is
> the rule I intent to stick to.
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.

I was just debugging an interesting issue with an old kernel not working
with a new DT blob, and after figuring out what the problem was (is),
I remembered this message and I'm curious what you have to say about it.

I have this DT layout:

	ethernet-phy@1 {
		reg = <0x1>;
		interrupts-extended = <&extirq 2 IRQ_TYPE_LEVEL_LOW>;
	};

	extirq: interrupt-controller@1ac {
		compatible = "fsl,ls1021a-extirq";
		<bla bla>
	};

I booted the new DT blob (which has "interrupts-extended") on a kernel
where the ls-extirq driver did not exist. This had the result of
of_mdiobus_phy_device_register() -> of_irq_get() returning -EPROBE_DEFER
forever and ever. So the PHY driver in turn never probed, and Ethernet
was broken. So I had to delete the interrupts OF property to let the PHY
at least work in poll mode.

What went wrong here in your opinion?
Marc Zyngier March 24, 2022, 5:21 p.m. UTC | #9
On Thu, 24 Mar 2022 17:10:42 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> Hello Marc,
> 
> On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > On Tue, 14 Dec 2021 09:58:54 +0000,
> > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > 
> > > Hi Marc (with a c),
> > > 
> > > I wish the firmware for these SoCs was smart enough to be compatible
> > > with the bindings that are in the kernel and provide a blob that the
> > > kernel could actually use. Some work has been started there and this is
> > > work in progress. True, I don't know what other OF-based firmware some
> > > other customers may use, but I trust it isn't a lot more advanced than
> > > what U-Boot currently has :)
> > > 
> > > Also, the machines may have been in the wild for years, but the
> > > ls-extirq driver was added in November 2019. So not with the
> > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > 
> > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > various opinions on this one.
> > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > 
> > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > |
> > > | From my point of view, newer device trees are not required to work on
> > > | older kernel, this would impose an unreasonable limitation and the use
> > > | case is very limited.
> > 
> > My views are on the opposite side. DT is an ABI, full stop. If you
> > change something, you *must* guarantee forward *and* backward
> > compatibility. That's because:
> > 
> > - you don't control how updatable the firmware is
> > 
> > - people may need to revert to other versions of the kernel because
> >   the new one is broken
> > 
> > - there are plenty of DT users beyond Linux, and we are not creating
> >   bindings for Linux only.
> > 
> > You may disagree with this, but for the subsystems I maintain, this is
> > the rule I intent to stick to.
> > 
> > 	M.
> > 
> > -- 
> > Without deviation from the norm, progress is not possible.
> 
> I was just debugging an interesting issue with an old kernel not working
> with a new DT blob, and after figuring out what the problem was (is),
> I remembered this message and I'm curious what you have to say about it.
> 
> I have this DT layout:
> 
> 	ethernet-phy@1 {
> 		reg = <0x1>;
> 		interrupts-extended = <&extirq 2 IRQ_TYPE_LEVEL_LOW>;
> 	};
> 
> 	extirq: interrupt-controller@1ac {
> 		compatible = "fsl,ls1021a-extirq";
> 		<bla bla>
> 	};
> 
> I booted the new DT blob (which has "interrupts-extended") on a kernel
> where the ls-extirq driver did not exist. This had the result of
> of_mdiobus_phy_device_register() -> of_irq_get() returning -EPROBE_DEFER
> forever and ever. So the PHY driver in turn never probed, and Ethernet
> was broken. So I had to delete the interrupts OF property to let the PHY
> at least work in poll mode.
> 
> What went wrong here in your opinion?

I'm not sure what you expect me to say here. You have a device that
references an interrupt. The DT seems sound (I don't get why you think
"interrupt-extended" is a problem here, but hey...).

If your kernel doesn't have a driver for the interrupt controller
referenced here, what do you expect, other than things not working?

	M.
Vladimir Oltean March 24, 2022, 5:34 p.m. UTC | #10
On Thu, Mar 24, 2022 at 05:21:50PM +0000, Marc Zyngier wrote:
> On Thu, 24 Mar 2022 17:10:42 +0000,
> Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > 
> > Hello Marc,
> > 
> > On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > > On Tue, 14 Dec 2021 09:58:54 +0000,
> > > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > > 
> > > > Hi Marc (with a c),
> > > > 
> > > > I wish the firmware for these SoCs was smart enough to be compatible
> > > > with the bindings that are in the kernel and provide a blob that the
> > > > kernel could actually use. Some work has been started there and this is
> > > > work in progress. True, I don't know what other OF-based firmware some
> > > > other customers may use, but I trust it isn't a lot more advanced than
> > > > what U-Boot currently has :)
> > > > 
> > > > Also, the machines may have been in the wild for years, but the
> > > > ls-extirq driver was added in November 2019. So not with the
> > > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > > 
> > > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > > various opinions on this one.
> > > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > > 
> > > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > > |
> > > > | From my point of view, newer device trees are not required to work on
> > > > | older kernel, this would impose an unreasonable limitation and the use
> > > > | case is very limited.
> > > 
> > > My views are on the opposite side. DT is an ABI, full stop. If you
> > > change something, you *must* guarantee forward *and* backward
> > > compatibility. That's because:
> > > 
> > > - you don't control how updatable the firmware is
> > > 
> > > - people may need to revert to other versions of the kernel because
> > >   the new one is broken
> > > 
> > > - there are plenty of DT users beyond Linux, and we are not creating
> > >   bindings for Linux only.
> > > 
> > > You may disagree with this, but for the subsystems I maintain, this is
> > > the rule I intent to stick to.
> > > 
> > > 	M.
> > > 
> > > -- 
> > > Without deviation from the norm, progress is not possible.
> > 
> > I was just debugging an interesting issue with an old kernel not working
> > with a new DT blob, and after figuring out what the problem was (is),
> > I remembered this message and I'm curious what you have to say about it.
> > 
> > I have this DT layout:
> > 
> > 	ethernet-phy@1 {
> > 		reg = <0x1>;
> > 		interrupts-extended = <&extirq 2 IRQ_TYPE_LEVEL_LOW>;
> > 	};
> > 
> > 	extirq: interrupt-controller@1ac {
> > 		compatible = "fsl,ls1021a-extirq";
> > 		<bla bla>
> > 	};
> > 
> > I booted the new DT blob (which has "interrupts-extended") on a kernel
> > where the ls-extirq driver did not exist. This had the result of
> > of_mdiobus_phy_device_register() -> of_irq_get() returning -EPROBE_DEFER
> > forever and ever. So the PHY driver in turn never probed, and Ethernet
> > was broken. So I had to delete the interrupts OF property to let the PHY
> > at least work in poll mode.
> > 
> > What went wrong here in your opinion?
> 
> I'm not sure what you expect me to say here. You have a device that
> references an interrupt. The DT seems sound (I don't get why you think
> "interrupt-extended" is a problem here, but hey...).
> 
> If your kernel doesn't have a driver for the interrupt controller
> referenced here, what do you expect, other than things not working?
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.

I was just raising this as what I thought would be a simple and
non-controversial counter example to your remark "If you change something,
you *must* guarantee forward *and* backward compatibility."

Practically speaking, what has happened is that the board DT appeared in
kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
enable PHY interrupts in kernel N+2. That DT update practically broke
kernel N from running correctly on DTs taken from kernel N+2 onwards.
This is the observable behavior, we can find as many justifications for
it as we wish.

(as to what I expect, Ethernet PHYs work without an interrupt too, but
 of_mdiobus_phy_device_register() treats -EPROBE_DEFER from of_irq_get()
 as special, because it assumes the IRQ domain will eventually come up.
 The IRQ is optional, as evidenced by the fact that kernel N worked)
Marc Zyngier March 24, 2022, 6:06 p.m. UTC | #11
On Thu, 24 Mar 2022 17:34:06 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> On Thu, Mar 24, 2022 at 05:21:50PM +0000, Marc Zyngier wrote:
> > On Thu, 24 Mar 2022 17:10:42 +0000,
> > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > 
> > > Hello Marc,
> > > 
> > > On Tue, Dec 14, 2021 at 10:20:36AM +0000, Marc Zyngier wrote:
> > > > On Tue, 14 Dec 2021 09:58:54 +0000,
> > > > Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > > > > 
> > > > > Hi Marc (with a c),
> > > > > 
> > > > > I wish the firmware for these SoCs was smart enough to be compatible
> > > > > with the bindings that are in the kernel and provide a blob that the
> > > > > kernel could actually use. Some work has been started there and this is
> > > > > work in progress. True, I don't know what other OF-based firmware some
> > > > > other customers may use, but I trust it isn't a lot more advanced than
> > > > > what U-Boot currently has :)
> > > > > 
> > > > > Also, the machines may have been in the wild for years, but the
> > > > > ls-extirq driver was added in November 2019. So not with the
> > > > > introduction of the SoC device trees themselves. That isn't so long ago.
> > > > > 
> > > > > As for compatibility between old kernel and new DT: I guess you'll hear
> > > > > various opinions on this one.
> > > > > https://www.spinics.net/lists/linux-mips/msg07778.html
> > > > > 
> > > > > | > Are we okay with the new device tree blobs breaking the old kernel?
> > > > > |
> > > > > | From my point of view, newer device trees are not required to work on
> > > > > | older kernel, this would impose an unreasonable limitation and the use
> > > > > | case is very limited.
> > > > 
> > > > My views are on the opposite side. DT is an ABI, full stop. If you
> > > > change something, you *must* guarantee forward *and* backward
> > > > compatibility. That's because:
> > > > 
> > > > - you don't control how updatable the firmware is
> > > > 
> > > > - people may need to revert to other versions of the kernel because
> > > >   the new one is broken
> > > > 
> > > > - there are plenty of DT users beyond Linux, and we are not creating
> > > >   bindings for Linux only.
> > > > 
> > > > You may disagree with this, but for the subsystems I maintain, this is
> > > > the rule I intent to stick to.
> > > > 
> > > > 	M.
> > > > 
> > > > -- 
> > > > Without deviation from the norm, progress is not possible.
> > > 
> > > I was just debugging an interesting issue with an old kernel not working
> > > with a new DT blob, and after figuring out what the problem was (is),
> > > I remembered this message and I'm curious what you have to say about it.
> > > 
> > > I have this DT layout:
> > > 
> > > 	ethernet-phy@1 {
> > > 		reg = <0x1>;
> > > 		interrupts-extended = <&extirq 2 IRQ_TYPE_LEVEL_LOW>;
> > > 	};
> > > 
> > > 	extirq: interrupt-controller@1ac {
> > > 		compatible = "fsl,ls1021a-extirq";
> > > 		<bla bla>
> > > 	};
> > > 
> > > I booted the new DT blob (which has "interrupts-extended") on a kernel
> > > where the ls-extirq driver did not exist. This had the result of
> > > of_mdiobus_phy_device_register() -> of_irq_get() returning -EPROBE_DEFER
> > > forever and ever. So the PHY driver in turn never probed, and Ethernet
> > > was broken. So I had to delete the interrupts OF property to let the PHY
> > > at least work in poll mode.
> > > 
> > > What went wrong here in your opinion?
> > 
> > I'm not sure what you expect me to say here. You have a device that
> > references an interrupt. The DT seems sound (I don't get why you think
> > "interrupt-extended" is a problem here, but hey...).
> > 
> > If your kernel doesn't have a driver for the interrupt controller
> > referenced here, what do you expect, other than things not working?
> > 
> > 	M.
> > 
> > -- 
> > Without deviation from the norm, progress is not possible.
> 
> I was just raising this as what I thought would be a simple and
> non-controversial counter example to your remark "If you change something,
> you *must* guarantee forward *and* backward compatibility."

If you change something *in the binding*, which was implicit in the
context, and makes no sense out of context.

> 
> Practically speaking, what has happened is that the board DT appeared in
> kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
> enable PHY interrupts in kernel N+2. That DT update practically broke
> kernel N from running correctly on DTs taken from kernel N+2 onwards.
> This is the observable behavior, we can find as many justifications for
> it as we wish.

Well, you can also argue that the DT was broken at N and N+1 for not
describing the HW correctly and completely. No binding has changed
here. Your DT was incomplete, and someone fixed it for you.

We can argue this things forever and a half. I've laid down the ground
rules for the stuff I maintain. If you're not happy with this, you can
fix it by either removing the NXP hardware from the tree, or taking
over from me as the irqchip maintainer. I'd be perfectly happy with
any (and even more, with both) of these outcomes.

	M.
Vladimir Oltean March 24, 2022, 7:09 p.m. UTC | #12
On Thu, Mar 24, 2022 at 06:06:51PM +0000, Marc Zyngier wrote:
> > I was just raising this as what I thought would be a simple and
> > non-controversial counter example to your remark "If you change something,
> > you *must* guarantee forward *and* backward compatibility."
> 
> If you change something *in the binding*, which was implicit in the
> context, and makes no sense out of context.
> 
> > Practically speaking, what has happened is that the board DT appeared in
> > kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
> > enable PHY interrupts in kernel N+2. That DT update practically broke
> > kernel N from running correctly on DTs taken from kernel N+2 onwards.
> > This is the observable behavior, we can find as many justifications for
> > it as we wish.
> 
> Well, you can also argue that the DT was broken at N and N+1 for not
> describing the HW correctly and completely. No binding has changed
> here. Your DT was incomplete, and someone fixed it for you.
> 
> We can argue this things forever and a half. I've laid down the ground
> rules for the stuff I maintain. If you're not happy with this, you can
> fix it by either removing the NXP hardware from the tree, or taking
> over from me as the irqchip maintainer. I'd be perfectly happy with
> any (and even more, with both) of these outcomes.

Ok, my intention wasn't to inflame you even though the way in which I
presented the problem might have suggested otherwise.

With my developer hat I still don't agree with you even with the
additional clarification you've made that you were referring only to
bindings and not to any and all DT changes. The reason being that the DT
blob is a whole, and it doesn't matter if there's a regression because
of a binding change or something else, you still need to be prepared to
update it, sometimes in lockstep with the kernel, like it or not.

But as a user, I just wanted to get an opinion from you what can we do
to deal better with this situation: optional interrupt provided by
device with missing driver, which of_irq_get() doesn't seem to understand.
There are more angles to this than just "new DT with old kernel". It can
also be new kernel, but ls-extirq driver disabled, and I still see that
as a kernel <-> DT compatibility concern.
Marc Zyngier March 24, 2022, 8:14 p.m. UTC | #13
On Thu, 24 Mar 2022 19:09:05 +0000,
Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> 
> On Thu, Mar 24, 2022 at 06:06:51PM +0000, Marc Zyngier wrote:
> > > I was just raising this as what I thought would be a simple and
> > > non-controversial counter example to your remark "If you change something,
> > > you *must* guarantee forward *and* backward compatibility."
> > 
> > If you change something *in the binding*, which was implicit in the
> > context, and makes no sense out of context.
> > 
> > > Practically speaking, what has happened is that the board DT appeared in
> > > kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
> > > enable PHY interrupts in kernel N+2. That DT update practically broke
> > > kernel N from running correctly on DTs taken from kernel N+2 onwards.
> > > This is the observable behavior, we can find as many justifications for
> > > it as we wish.
> > 
> > Well, you can also argue that the DT was broken at N and N+1 for not
> > describing the HW correctly and completely. No binding has changed
> > here. Your DT was incomplete, and someone fixed it for you.
> > 
> > We can argue this things forever and a half. I've laid down the ground
> > rules for the stuff I maintain. If you're not happy with this, you can
> > fix it by either removing the NXP hardware from the tree, or taking
> > over from me as the irqchip maintainer. I'd be perfectly happy with
> > any (and even more, with both) of these outcomes.
> 
> Ok, my intention wasn't to inflame you even though the way in which I
> presented the problem might have suggested otherwise.
> 
> With my developer hat I still don't agree with you even with the
> additional clarification you've made that you were referring only to
> bindings and not to any and all DT changes. The reason being that the DT
> blob is a whole, and it doesn't matter if there's a regression because
> of a binding change or something else, you still need to be prepared to
> update it, sometimes in lockstep with the kernel, like it or not.

<rant>
No. This doesn't happen on systems that ship the DT as part of their
firmware, and this doesn't happen with ACPI either. This only happens
on platform that are maintained like the NXP, Marvell, and other
similar platforms that are being used as a job security tool by doing
piecemeal enablement.

Properly maintained systems have had the same DT for years. Features
have been added over time, yet without breaking compatibility in
either direction. Yes, it requires some effort and planning. And even
quirks at times. Yet they don't break.

Amusingly, some of these better supported platforms do not have their
DT in the kernel tree. Synquacer, for example. Keeping the DTs in the
kernel tree has been one of the worse decision we have ever made. It
has simply moved the board files of old to a different place, under
the guise of separating description and code.  In practice, it
abstracted nothing at all, only made it more complicated because
people are treating DT as an integral part of the kernel code base,
which it really shouldn't be.
</rant>

> But as a user, I just wanted to get an opinion from you what can we do
> to deal better with this situation: optional interrupt provided by
> device with missing driver, which of_irq_get() doesn't seem to understand.
> There are more angles to this than just "new DT with old kernel". It can
> also be new kernel, but ls-extirq driver disabled, and I still see that
> as a kernel <-> DT compatibility concern.

If you're missing a driver, that's a user error. Or rather, a platform
maintainer error for not establishing the correct dependencies. This
has nothing to do with the DT. As for optional interrupts, that has
nothing to do with the DT either, but with the kernel code that
requests it. If you think the kernel should do better, you can always
post a patch.

And I'm done on that subject.

	M.
Robin Murphy March 25, 2022, 10:34 a.m. UTC | #14
On 2022-03-24 19:09, Vladimir Oltean wrote:
> On Thu, Mar 24, 2022 at 06:06:51PM +0000, Marc Zyngier wrote:
>>> I was just raising this as what I thought would be a simple and
>>> non-controversial counter example to your remark "If you change something,
>>> you *must* guarantee forward *and* backward compatibility."
>>
>> If you change something *in the binding*, which was implicit in the
>> context, and makes no sense out of context.
>>
>>> Practically speaking, what has happened is that the board DT appeared in
>>> kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
>>> enable PHY interrupts in kernel N+2. That DT update practically broke
>>> kernel N from running correctly on DTs taken from kernel N+2 onwards.
>>> This is the observable behavior, we can find as many justifications for
>>> it as we wish.
>>
>> Well, you can also argue that the DT was broken at N and N+1 for not
>> describing the HW correctly and completely. No binding has changed
>> here. Your DT was incomplete, and someone fixed it for you.
>>
>> We can argue this things forever and a half. I've laid down the ground
>> rules for the stuff I maintain. If you're not happy with this, you can
>> fix it by either removing the NXP hardware from the tree, or taking
>> over from me as the irqchip maintainer. I'd be perfectly happy with
>> any (and even more, with both) of these outcomes.
> 
> Ok, my intention wasn't to inflame you even though the way in which I
> presented the problem might have suggested otherwise.
> 
> With my developer hat I still don't agree with you even with the
> additional clarification you've made that you were referring only to
> bindings and not to any and all DT changes. The reason being that the DT
> blob is a whole, and it doesn't matter if there's a regression because
> of a binding change or something else, you still need to be prepared to
> update it, sometimes in lockstep with the kernel, like it or not.
> 
> But as a user, I just wanted to get an opinion from you what can we do
> to deal better with this situation: optional interrupt provided by
> device with missing driver, which of_irq_get() doesn't seem to understand.

FWIW, of_irq_get() absolutely understands how to handle a missing IRQ 
provider driver; it returns -EPROBE_DEFER. If a caller considers the IRQ 
optional, then it's up to that caller to decide how long to keep waiting 
for the provider to appear until giving up and carrying on without it. 
If your phy driver is making the dumb decision to wait for ever for 
something which isn't critical, then you're free to fix it, or perhaps 
even propose for of_irq_get() to opt in to the 
driver_deferred_probe_check_state() mechanism if you believe it's a 
sufficiently general case.

If a new DT with an additional new property (either on an existing 
machine, or on a completely new machine which has the property from the 
start) exposes a bug in a driver, that's unfortunate, but it is entirely 
irrelevant to the ABI implications of changing the interpretation of an 
existing property.

Robin.
Vladimir Oltean March 25, 2022, 5:54 p.m. UTC | #15
On Fri, Mar 25, 2022 at 10:34:13AM +0000, Robin Murphy wrote:
> On 2022-03-24 19:09, Vladimir Oltean wrote:
> > On Thu, Mar 24, 2022 at 06:06:51PM +0000, Marc Zyngier wrote:
> > > > I was just raising this as what I thought would be a simple and
> > > > non-controversial counter example to your remark "If you change something,
> > > > you *must* guarantee forward *and* backward compatibility."
> > > 
> > > If you change something *in the binding*, which was implicit in the
> > > context, and makes no sense out of context.
> > > 
> > > > Practically speaking, what has happened is that the board DT appeared in
> > > > kernel N, the ls-extirq driver in kernel N+1, and the DT was updated to
> > > > enable PHY interrupts in kernel N+2. That DT update practically broke
> > > > kernel N from running correctly on DTs taken from kernel N+2 onwards.
> > > > This is the observable behavior, we can find as many justifications for
> > > > it as we wish.
> > > 
> > > Well, you can also argue that the DT was broken at N and N+1 for not
> > > describing the HW correctly and completely. No binding has changed
> > > here. Your DT was incomplete, and someone fixed it for you.
> > > 
> > > We can argue this things forever and a half. I've laid down the ground
> > > rules for the stuff I maintain. If you're not happy with this, you can
> > > fix it by either removing the NXP hardware from the tree, or taking
> > > over from me as the irqchip maintainer. I'd be perfectly happy with
> > > any (and even more, with both) of these outcomes.
> > 
> > Ok, my intention wasn't to inflame you even though the way in which I
> > presented the problem might have suggested otherwise.
> > 
> > With my developer hat I still don't agree with you even with the
> > additional clarification you've made that you were referring only to
> > bindings and not to any and all DT changes. The reason being that the DT
> > blob is a whole, and it doesn't matter if there's a regression because
> > of a binding change or something else, you still need to be prepared to
> > update it, sometimes in lockstep with the kernel, like it or not.
> > 
> > But as a user, I just wanted to get an opinion from you what can we do
> > to deal better with this situation: optional interrupt provided by
> > device with missing driver, which of_irq_get() doesn't seem to understand.
> 
> FWIW, of_irq_get() absolutely understands how to handle a missing IRQ
> provider driver; it returns -EPROBE_DEFER. If a caller considers the IRQ
> optional, then it's up to that caller to decide how long to keep waiting for
> the provider to appear until giving up and carrying on without it. If your
> phy driver is making the dumb decision to wait for ever for something which
> isn't critical, then you're free to fix it, or perhaps even propose for
> of_irq_get() to opt in to the driver_deferred_probe_check_state() mechanism
> if you believe it's a sufficiently general case.

Thanks, I really needed that suggestion, at the moment I made this
change which seems to do what I want when I force-disable the ls-extirq
driver (which in turn simulates an unwritten driver from an old kernel):

diff --git a/drivers/net/mdio/fwnode_mdio.c b/drivers/net/mdio/fwnode_mdio.c
index 1becb1a731f6..1c1584fca632 100644
--- a/drivers/net/mdio/fwnode_mdio.c
+++ b/drivers/net/mdio/fwnode_mdio.c
@@ -43,6 +43,11 @@ int fwnode_mdiobus_phy_device_register(struct mii_bus *mdio,
 	int rc;
 
 	rc = fwnode_irq_get(child, 0);
+	/* Don't wait forever if the IRQ provider doesn't become available,
+	 * just fall back to poll mode
+	 */
+	if (rc == -EPROBE_DEFER)
+		rc = driver_deferred_probe_check_state(&phy->mdio.dev);
 	if (rc == -EPROBE_DEFER)
 		return rc;
 

The trickier part seems to be to adjust this change for older kernels
where the MDIO bus code calls of_irq_get() directly, and not get
regressions in. That, plus I need to understand what changes in behavior
when the irqchip driver is built as module and the MDIO bus driver is
built-in, and the "driver_deferred_probe_timeout" kernel boot parameter
has the default value of 0.

> If a new DT with an additional new property (either on an existing machine,
> or on a completely new machine which has the property from the start)
> exposes a bug in a driver, that's unfortunate, but it is entirely irrelevant
> to the ABI implications of changing the interpretation of an existing
> property.
> 
> Robin.

Agree, but I'd like to at least be shot down for a point I _am_ trying
to make, not for one I am not.

When I resumed this discussion I wasn't really focused on the patch set
that proposed to change some DT bindings. That was a bad idea, I abandoned
it, issue was solved (more or less) through other means, end of story.

Instead I focused on one of the arguments that Marc brought, that being
able to roll back kernel independently of firmware is important.
As a realist, I can't help but remark that this is effectively a non-goal.
There is always a risk that old kernels are caught off guard by DT
changes which they aren't prepared to handle, and even if I was aware of
the fact that I'm making a breaking change for old kernels when I'm
adding the 'interrupts-extended' property to the PHY (which I wasn't),
I'd still do it 10 out of 10 times.
I guess I'll always treat the 'old kernel works with new DT blob' case
as a pleasant surprise rather than the inviolable norm that Marc is
trying to make it sound like. Mentality difference between NXP/Marvell
vs Socionext aside, I really don't see how you can systematically avoid
these issues, so it's just a losing game to try to claim that every
firmware blob should work with every kernel, for the simple reason that
you can't change the past.

As to whether this has any implication on the point you and Marc are
trying to make, that the firmware ABI contract shouldn't change,
maybe not, probably not, especially if there are alternatives. But bring
a more solid argument to the table. Changing DT bindings is not off the
table _because_ old kernels will stop working with new DT blobs.