diff mbox series

[v1,1/2] driver core: fw_devlink: Add support for FWNODE_FLAG_BROKEN_PARENT

Message ID 20210826074526.825517-2-saravanak@google.com (mailing list archive)
State Not Applicable, archived
Headers show
Series Fix rtl8366rb issues with fw_devlink=on | expand

Commit Message

Saravana Kannan Aug. 26, 2021, 7:45 a.m. UTC
If a parent device is also a supplier to a child device, fw_devlink=on
(correctly) delays the probe() of the child device until the probe() of
the parent finishes successfully.

However, some drivers of such parent devices (where parent is also a
supplier) incorrectly expect the child device to finish probing
successfully as soon as they are added using device_add() and before the
probe() of the parent device has completed successfully. While this
might have worked before, this is not guaranteed by driver core.
fw_devlink=on catches/breaks such drivers. One example of such a case is
discussed in the link mentioned below.

Add a flag to make fw_devlink=on not enforce these supplier-consumer
relationships, so these drivers can continue working. The flag is
intentionally called BROKEN_PARENT so it's clear that this flag
shouldn't be used in the normal case and that there's a problem with the
driver.

Link: https://lore.kernel.org/netdev/CAGETcx_uj0V4DChME-gy5HGKTYnxLBX=TH2rag29f_p=UcG+Tg@mail.gmail.com/
Signed-off-by: Saravana Kannan <saravanak@google.com>
---
 drivers/base/core.c    | 22 ++++++++++++++++++++++
 include/linux/fwnode.h |  3 +++
 2 files changed, 25 insertions(+)

Comments

Andrew Lunn Aug. 26, 2021, 1:13 p.m. UTC | #1
On Thu, Aug 26, 2021 at 12:45:24AM -0700, Saravana Kannan wrote:
> If a parent device is also a supplier to a child device, fw_devlink=on
> (correctly) delays the probe() of the child device until the probe() of
> the parent finishes successfully.
> 
> However, some drivers of such parent devices (where parent is also a
> supplier) incorrectly expect the child device to finish probing
> successfully as soon as they are added using device_add() and before the
> probe() of the parent device has completed successfully.

Please can you point at the code making this assumption. It sounds
like we are missing some EPROBE_DEFER handling in the driver, or maybe
the DSA framework.

     Andrew
Saravana Kannan Aug. 26, 2021, 7:56 p.m. UTC | #2
Greg, Florian, Vladimir, Alvin,

Let's continue the rest of the discussion here.

On Thu, Aug 26, 2021 at 6:13 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Thu, Aug 26, 2021 at 12:45:24AM -0700, Saravana Kannan wrote:
> > If a parent device is also a supplier to a child device, fw_devlink=on
> > (correctly) delays the probe() of the child device until the probe() of
> > the parent finishes successfully.
> >
> > However, some drivers of such parent devices (where parent is also a
> > supplier) incorrectly expect the child device to finish probing
> > successfully as soon as they are added using device_add() and before the
> > probe() of the parent device has completed successfully.
>
> Please can you point at the code making this assumption. It sounds
> like we are missing some EPROBE_DEFER handling in the driver, or maybe
> the DSA framework.

For context, this was discussed and explained in [1] and subsequent
replies. But let me summarize it here.

Alvin reported an issue that with fw_devlink=on, his downstream
hardware which is very similar to [2] doesn't have its PHYs probed
correctly. Instead of the PHYs being probed by the specific driver, it
gets probed by the "Generic PHY" driver. For those who aren't very
familiar with PHYs/networking (this is based on what Andrew explained
to me earlier), Ethernet PHYs follow a specific standard and can have
some extended functionality. The specific driver would give the full
functionality, but if it's not available when the PHY needs to be
used/connected, the generic PHY driver is force bound to the PHY and
it gives the basic functionality.

So upon digging into this, this is what I found and where I think we
have some bad assumptions about the driver core are present:

The  DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is:
realtek_smi_probe()
  -> dsa_register_switch()
    -> dsa_switch_probe()
      -> dsa_tree_setup()
        -> dsa_tree_setup_switches()
          -> dsa_switch_setup()
            -> ds->ops->setup(ds)
              -> rtl8366rb_setup()
                -> realtek_smi_setup_mdio()
                  -> of_mdiobus_register()
                     This scans the MDIO bus/DT and device_add()s the PHYs
          -> dsa_port_setup()
            -> dsa_port_link_register_of()
              -> dsa_port_phylink_register()
                -> phylink_of_phy_connect()
                  -> phylink_fwnode_phy_connect()
                    -> phy_attach_direct()
                       This checks if PHY device has already probed (by
                       checking for dev->driver). If not, it forces the
                       probe of the PHY using one of the generic PHY
                       drivers.

So within dsa_register_switch() the PHY device is added and then
expected to have probed in the same thread/calling context. As stated
earlier, this is not guaranteed by the driver core. And this is what
needs fixing. This works as long as the PHYs don't have dependencies
on any other devices/suppliers and never defer probe. In the issue
Alvin reported, the PHYs have a dependency and things fall apart. I
don't have a strong opinion on whether this is a framework level fix
or fixes in a few drivers.

In the specific instance of [2] (providing snippet below to make it
easier to follow), the "phy0" device [4] depends on the "switch"
device [2] since "switch_intc" (the interrupt provider for phy0) is
initialized by the "switch" driver. And fw_devlink=on delays the probe
of phy0 until switch[2] finishes probing successfully (i.e. after
dsa_register_switch() <- realtek_smi_probe() returns) -- this is the
whole point of fw_devlink=on this is what reduces the useless deferred
probes/probe attempts of consumers before the suppliers finish probing
successfully.

Since dsa_register_switch() assumes the PHYs would have been probed as
soon as they are added, but they aren't probed in this case, the PHY
is force bound to the generic PHY driver. Which is the original issue
Alvin reported. Hope this clears things up for everyone.

-Saravana

switch {
        compatible = "realtek,rtl8366rb";
...

        switch_intc: interrupt-controller {
...
        };

        ports {
...
                port@0 {
                        phy-handle = <&phy0>;
                };
                port@1 {
                };
...
        };

        mdio {
                compatible = "realtek,smi-mdio";
...
                phy0: phy@0 {
...
                        interrupt-parent = <&switch_intc>;
                        interrupts = <0>;
                };
...
        };
};

[1] - https://lore.kernel.org/netdev/CAGETcx_uj0V4DChME-gy5HGKTYnxLBX=TH2rag29f_p=UcG+Tg@mail.gmail.com/
[2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/gemini-dlink-dir-685.dts?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n190
[3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/dsa/realtek-smi-core.c?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n386
[4] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/gemini-dlink-dir-685.dts?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n255
Andrew Lunn Aug. 26, 2021, 8:53 p.m. UTC | #3
> The  DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is:
> realtek_smi_probe()
>   -> dsa_register_switch()
>     -> dsa_switch_probe()
>       -> dsa_tree_setup()
>         -> dsa_tree_setup_switches()
>           -> dsa_switch_setup()
>             -> ds->ops->setup(ds)
>               -> rtl8366rb_setup()
>                 -> realtek_smi_setup_mdio()
>                   -> of_mdiobus_register()
>                      This scans the MDIO bus/DT and device_add()s the PHYs
>           -> dsa_port_setup()
>             -> dsa_port_link_register_of()
>               -> dsa_port_phylink_register()
>                 -> phylink_of_phy_connect()
>                   -> phylink_fwnode_phy_connect()
>                     -> phy_attach_direct()
>                        This checks if PHY device has already probed (by
>                        checking for dev->driver). If not, it forces the
>                        probe of the PHY using one of the generic PHY
>                        drivers.
> 
> So within dsa_register_switch() the PHY device is added and then
> expected to have probed in the same thread/calling context. As stated
> earlier, this is not guaranteed by the driver core.

Have you looked at:

commit 16983507742cbcaa5592af530872a82e82fb9c51
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Fri Mar 27 01:00:22 2020 +0100

    net: phy: probe PHY drivers synchronously

See the full commit message, but the code change is:

iff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3b8f6b0b47b5..d543df282365 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2577,6 +2577,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
        new_driver->mdiodrv.driver.probe = phy_probe;
        new_driver->mdiodrv.driver.remove = phy_remove;
        new_driver->mdiodrv.driver.owner = owner;
+       new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
 
        retval = driver_register(&new_driver->mdiodrv.driver);
        if (retval) {

How does this add to the overall picture?

    Andrew
Saravana Kannan Aug. 26, 2021, 11:44 p.m. UTC | #4
On Thu, Aug 26, 2021 at 1:53 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > The  DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is:
> > realtek_smi_probe()
> >   -> dsa_register_switch()
> >     -> dsa_switch_probe()
> >       -> dsa_tree_setup()
> >         -> dsa_tree_setup_switches()
> >           -> dsa_switch_setup()
> >             -> ds->ops->setup(ds)
> >               -> rtl8366rb_setup()
> >                 -> realtek_smi_setup_mdio()
> >                   -> of_mdiobus_register()
> >                      This scans the MDIO bus/DT and device_add()s the PHYs
> >           -> dsa_port_setup()
> >             -> dsa_port_link_register_of()
> >               -> dsa_port_phylink_register()
> >                 -> phylink_of_phy_connect()
> >                   -> phylink_fwnode_phy_connect()
> >                     -> phy_attach_direct()
> >                        This checks if PHY device has already probed (by
> >                        checking for dev->driver). If not, it forces the
> >                        probe of the PHY using one of the generic PHY
> >                        drivers.
> >
> > So within dsa_register_switch() the PHY device is added and then
> > expected to have probed in the same thread/calling context. As stated
> > earlier, this is not guaranteed by the driver core.
>
> Have you looked at:
>
> commit 16983507742cbcaa5592af530872a82e82fb9c51
> Author: Heiner Kallweit <hkallweit1@gmail.com>
> Date:   Fri Mar 27 01:00:22 2020 +0100
>
>     net: phy: probe PHY drivers synchronously
>
> See the full commit message, but the code change is:
>
> iff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 3b8f6b0b47b5..d543df282365 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -2577,6 +2577,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner)
>         new_driver->mdiodrv.driver.probe = phy_probe;
>         new_driver->mdiodrv.driver.remove = phy_remove;
>         new_driver->mdiodrv.driver.owner = owner;
> +       new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS;
>
>         retval = driver_register(&new_driver->mdiodrv.driver);
>         if (retval) {
>
> How does this add to the overall picture?

Doesn't add much to the discussion. In the example I gave, the driver
already does synchronous probing. If the device can't probe
successfully because a supplier isn't ready, it doesn't matter if it's
a synchronous probe. The probe would still be deferred and we'll hit
the same issue. Even in the situation the commit [5] describes, if
parallelized probing is done and the PHY depended on something (say a
clock), you'd still end up not probing the PHY even if the driver is
present and the generic PHY would end up force probing it.

[5] - https://lore.kernel.org/netdev/612b81d5-c4c1-5e20-a667-893eeeef0bf5@gmail.com/

-Saravana
Andrew Lunn Aug. 27, 2021, 1:23 a.m. UTC | #5
> Doesn't add much to the discussion. In the example I gave, the driver
> already does synchronous probing. If the device can't probe
> successfully because a supplier isn't ready, it doesn't matter if it's
> a synchronous probe. The probe would still be deferred and we'll hit
> the same issue. Even in the situation the commit [5] describes, if
> parallelized probing is done and the PHY depended on something (say a
> clock), you'd still end up not probing the PHY even if the driver is
> present and the generic PHY would end up force probing it.


genphy is meant to be used when there is no other driver available.
It is a best effort, better than nothing, might work. And quite a few
boards rely on it. However, it should not be used when there is a
specific driver.

So if the PHY device has been probed, and -EPROBE_DEFER was returned,
we also need to return -EPROBE_DEFER here when deciding if genphy
should be used. It should then all unwind and try again later.

I don't know the device core, but it looks like dev->can_match tells
us what we need to know. If true, we know there is a driver for this
device. But i'm hesitant to make use of this outside of driver/base.

	Andrew
Saravana Kannan Aug. 27, 2021, 4:02 a.m. UTC | #6
On Thu, Aug 26, 2021 at 6:23 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > Doesn't add much to the discussion. In the example I gave, the driver
> > already does synchronous probing. If the device can't probe
> > successfully because a supplier isn't ready, it doesn't matter if it's
> > a synchronous probe. The probe would still be deferred and we'll hit
> > the same issue. Even in the situation the commit [5] describes, if
> > parallelized probing is done and the PHY depended on something (say a
> > clock), you'd still end up not probing the PHY even if the driver is
> > present and the generic PHY would end up force probing it.
>
>
> genphy is meant to be used when there is no other driver available.
> It is a best effort, better than nothing, might work. And quite a few
> boards rely on it. However, it should not be used when there is a
> specific driver.

Agreed, that's what we are trying to ensure.

> So if the PHY device has been probed, and -EPROBE_DEFER was returned,
> we also need to return -EPROBE_DEFER here when deciding if genphy
> should be used. It should then all unwind and try again later.

Yes, I think dsa_register_switch() returning -EPROBE_DEFER if the PHYs
returned -EPROBE_DEFER might be okay (I think we should do it), but
that doesn't solve the problem for this driver.

fw_devlink=on/device links short circuits the probe() call of a
consumer (in this case the PHY) and returns -EPROBE_DEFER if the
supplier's (in this case switch) probe hasn't finished without an
error. fw_devlink/device links effectively does the probe in graph
topological order and there's a ton of good reasons to do it that way
-- what's why fw_devlink=on was implemented.

In this specific case though, since the PHY depends on the parent
device, if we fail the parent's probe realtek_smi_probe() because the
PHYs failed to probe, we'll get into a catch-22/chicken-n-egg
situation and the switch/PHYs will never probe.

I think a clean way to fix this at the driver level is to do what I
said in [6]. Copy pasting it here and expanding it a bit:

1. The IRQ registration and mdio bus registration should get moved to
realtek_smi_probe() which probes "realtek,rtl8366rb". So
realtek_smi_probe() succeeding doesn't depend on its child devices
probing successfully (which makes sense for any parent device).
2. realtek_smi_probe() should also create/register a
component-master/aggregate device that's "made up of"
realtek,rtl8366rb and all the PHYs. So the component-master will wait
for all of them to finish probing before it's initialized.
3. PHYs will probe successfully now because realtek,rtl8366rb probe()
which is the supplier's probe has finished probing without problems.
4. The component device's init (the .bind op) would call
dsa_register_switch() which kinda makes sense because the rtl8366rb
and all the PHYs combined together is what makes up the logical DSA
switch. The dsa_register_switch() will succeed and will be using the
right/specific PHY driver.

The same applies for any switch that has the PHYs as it's child device
AND (this is the key part) the PHYs depend on the switch as a supplier
(remember, if we didn't have the interrupt dependency, this would not
be an issue).

> I don't know the device core, but it looks like dev->can_match tells
> us what we need to know. If true, we know there is a driver for this
> device. But i'm hesitant to make use of this outside of driver/base.

can_match is never cleared once it's set and it's meant as an
optimization/preserving some probe order stuff. I wouldn't depend on
it for this case. We can just have a phy_has_driver()  function that
searches all the currently registered PHY drivers to check if there's
a matching driver. And dsa_register_switch() or phy_attach_direct()
can return -EPROBE_DEFER if there is a driver but it isn't bound yet.
Again, this is orthogonal to the realtek driver fix though because of
the catch-22 situation above.

-Saravana

[6] - https://lore.kernel.org/netdev/CAGETcx8_vxxPxF8WrXqk=PZYfEggsozP+z9KyOu5C2bEW0VW8g@mail.gmail.com/
Andrew Lunn Aug. 27, 2021, 1:44 p.m. UTC | #7
> fw_devlink=on/device links short circuits the probe() call of a
> consumer (in this case the PHY) and returns -EPROBE_DEFER if the
> supplier's (in this case switch) probe hasn't finished without an
> error. fw_devlink/device links effectively does the probe in graph
> topological order and there's a ton of good reasons to do it that way
> -- what's why fw_devlink=on was implemented.
> 
> In this specific case though, since the PHY depends on the parent
> device, if we fail the parent's probe realtek_smi_probe() because the
> PHYs failed to probe, we'll get into a catch-22/chicken-n-egg
> situation and the switch/PHYs will never probe.

So lets look at:

arch/arm/boot/dts/vf610-zii-dev-rev-b.dts

       mdio-mux {
                compatible = "mdio-mux-gpio";
                pinctrl-0 = <&pinctrl_mdio_mux>;
                pinctrl-names = "default";
                gpios = <&gpio0 8  GPIO_ACTIVE_HIGH
                         &gpio0 9  GPIO_ACTIVE_HIGH
                         &gpio0 24 GPIO_ACTIVE_HIGH
                         &gpio0 25 GPIO_ACTIVE_HIGH>;
                mdio-parent-bus = <&mdio1>;
                #address-cells = <1>;
                #size-cells = <0>;


We have an MDIO multiplexor


                mdio_mux_1: mdio@1 {
                        reg = <1>;
                        #address-cells = <1>;
                        #size-cells = <0>;

                        switch0: switch@0 {
                                compatible = "marvell,mv88e6085";
                                pinctrl-0 = <&pinctrl_gpio_switch0>;
                                pinctrl-names = "default";
                                reg = <0>;
                                dsa,member = <0 0>;
                                interrupt-parent = <&gpio0>;
                                interrupts = <27 IRQ_TYPE_LEVEL_LOW>;

On the first bus, we have a Ethernet switch.

                                interrupt-controller;
                                #interrupt-cells = <2>;
                                eeprom-length = <512>;

                                ports {
                                        #address-cells = <1>;
                                        #size-cells = <0>;

                                        port@0 {
                                                reg = <0>;
                                                label = "lan0";
                                                phy-handle = <&switch0phy0>;
                                        };

The first port of that switch has a pointer to a PHY.

                               mdio {
                                        #address-cells = <1>;
                                        #size-cells = <0>;

That Ethernet switch also has an MDIO bus,

                                        switch0phy0: switch0phy0@0 {
                                                reg = <0>;

On that bus is the PHY.

                                                interrupt-parent = <&switch0>;
                                                interrupts = <0 IRQ_TYPE_LEVEL_HIGH>;

And that PHY has an interrupt. And that interrupt is provided by the switch.

Given your description, it sounds like this is also go to break.

vf610-zii-dev-rev-c.dts is the same pattern, and there are more
examples for mv88e6xxx.

It is a common pattern, e.g. the mips ar9331.dtsi follows it.

I've not yet looked at plain Ethernet drivers. This pattern could also
exist there. And i wonder about other complex structures, i2c bus
multiplexors, you can have interrupt controllers as i2c devices,
etc. So the general case could exist in other places.

I don't think we should be playing whack-a-mole by changing drivers as
we find they regress and break. We need a generic fix. I think the
solution is pretty clear. As you said the device depends on its
parent. DT is a tree, so it is easy to walk up the tree to detect this
relationship, and not fail the probe.

   Andrew
Saravana Kannan Aug. 27, 2021, 6:06 p.m. UTC | #8
On Fri, Aug 27, 2021 at 6:44 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > fw_devlink=on/device links short circuits the probe() call of a
> > consumer (in this case the PHY) and returns -EPROBE_DEFER if the
> > supplier's (in this case switch) probe hasn't finished without an
> > error. fw_devlink/device links effectively does the probe in graph
> > topological order and there's a ton of good reasons to do it that way
> > -- what's why fw_devlink=on was implemented.
> >
> > In this specific case though, since the PHY depends on the parent
> > device, if we fail the parent's probe realtek_smi_probe() because the
> > PHYs failed to probe, we'll get into a catch-22/chicken-n-egg
> > situation and the switch/PHYs will never probe.
>
> So lets look at:
>
> arch/arm/boot/dts/vf610-zii-dev-rev-b.dts
>
>        mdio-mux {
>                 compatible = "mdio-mux-gpio";
>                 pinctrl-0 = <&pinctrl_mdio_mux>;
>                 pinctrl-names = "default";
>                 gpios = <&gpio0 8  GPIO_ACTIVE_HIGH
>                          &gpio0 9  GPIO_ACTIVE_HIGH
>                          &gpio0 24 GPIO_ACTIVE_HIGH
>                          &gpio0 25 GPIO_ACTIVE_HIGH>;
>                 mdio-parent-bus = <&mdio1>;
>                 #address-cells = <1>;
>                 #size-cells = <0>;
>
>
> We have an MDIO multiplexor
>
>
>                 mdio_mux_1: mdio@1 {
>                         reg = <1>;
>                         #address-cells = <1>;
>                         #size-cells = <0>;
>
>                         switch0: switch@0 {
>                                 compatible = "marvell,mv88e6085";
>                                 pinctrl-0 = <&pinctrl_gpio_switch0>;
>                                 pinctrl-names = "default";
>                                 reg = <0>;
>                                 dsa,member = <0 0>;
>                                 interrupt-parent = <&gpio0>;
>                                 interrupts = <27 IRQ_TYPE_LEVEL_LOW>;
>
> On the first bus, we have a Ethernet switch.
>
>                                 interrupt-controller;
>                                 #interrupt-cells = <2>;
>                                 eeprom-length = <512>;
>
>                                 ports {
>                                         #address-cells = <1>;
>                                         #size-cells = <0>;
>
>                                         port@0 {
>                                                 reg = <0>;
>                                                 label = "lan0";
>                                                 phy-handle = <&switch0phy0>;
>                                         };
>
> The first port of that switch has a pointer to a PHY.
>
>                                mdio {
>                                         #address-cells = <1>;
>                                         #size-cells = <0>;
>
> That Ethernet switch also has an MDIO bus,
>
>                                         switch0phy0: switch0phy0@0 {
>                                                 reg = <0>;
>
> On that bus is the PHY.
>
>                                                 interrupt-parent = <&switch0>;
>                                                 interrupts = <0 IRQ_TYPE_LEVEL_HIGH>;
>
> And that PHY has an interrupt. And that interrupt is provided by the switch.
>
> Given your description, it sounds like this is also go to break.

Based on what you pasted here (I didn't look any closer), I think it
will break too.

>
> vf610-zii-dev-rev-c.dts is the same pattern, and there are more
> examples for mv88e6xxx.
>
> It is a common pattern, e.g. the mips ar9331.dtsi follows it.

Then I think this should be solved at the DSA framework level. Make a
component-master/aggregate device made up of the switches and
ports/PHYs. Then wait for all of them to not -EPROBE_DEFER and then
initialize the DSA?

> I've not yet looked at plain Ethernet drivers. This pattern could also
> exist there. And i wonder about other complex structures, i2c bus
> multiplexors, you can have interrupt controllers as i2c devices,
> etc. So the general case could exist in other places.

I haven't seen any generic issues like this reported so far. It's only
after adding phy-handle that we are hitting these issues with DSA
switches.

> I don't think we should be playing whack-a-mole by changing drivers as
> we find they regress and break. We need a generic fix. I think the
> solution is pretty clear. As you said the device depends on its
> parent. DT is a tree, so it is easy to walk up the tree to detect this
> relationship, and not fail the probe.

It's easy to do, but it is the wrong behavior for fw_devlink=on. There
are plenty of cases where it's better to delay the child device's
probe until the parent finishes. You even gave an example[7] where it
would help avoid unnecessary deferred probes. There are plenty of
other cases like this too -- there's actually a USB driver that had an
infinite deferred probe loop that fw_devlink=on fixes. Also, the whole
point of fw_devlink=on is to enforce ordering like this -- so just
blanket ignoring dependencies on parent devices doesn't make sense.

But a parent device's probe depending on a child device's probe to
succeed as soon as it's added is never right though. So I think that's
what needs to be addresses.

So we have a couple of options:
1. Use a component driver model to initialize switches. I think it
could be doable at the DSA framework level.
2. Ask fw_devlink=on to ignore it for all switch devices -- it might
be possible to move my "quick fix" to the DSA framework.
3. Remove fw_devlink support for phy-handle.

I honestly think (1) is the best option and makes sense logically too.
Not saying it's a trivial work or a one liner, but it actually makes
sense. (2) might not be possible -- I need to take a closer look. I'd
prefer not doing (3), but I'd take that over breaking the whole point
of fw_devlink=on.

-Saravana

[7] - https://lore.kernel.org/netdev/YR5eMeKzcuYtB6Tk@lunn.ch/
Andrew Lunn Aug. 27, 2021, 8:11 p.m. UTC | #9
> > I've not yet looked at plain Ethernet drivers. This pattern could also
> > exist there. And i wonder about other complex structures, i2c bus
> > multiplexors, you can have interrupt controllers as i2c devices,
> > etc. So the general case could exist in other places.
> 
> I haven't seen any generic issues like this reported so far. It's only
> after adding phy-handle that we are hitting these issues with DSA
> switches.

Can you run your parser over the 2250 DTB blobs and see how many
children have dependencies on a parent? That could give us an idea how
many moles need whacking. And maybe, where in the tree they are
hiding?

    Andrew
Saravana Kannan Aug. 27, 2021, 9:33 p.m. UTC | #10
On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > I've not yet looked at plain Ethernet drivers. This pattern could also
> > > exist there. And i wonder about other complex structures, i2c bus
> > > multiplexors, you can have interrupt controllers as i2c devices,
> > > etc. So the general case could exist in other places.
> >
> > I haven't seen any generic issues like this reported so far. It's only
> > after adding phy-handle that we are hitting these issues with DSA
> > switches.
>
> Can you run your parser over the 2250 DTB blobs and see how many
> children have dependencies on a parent? That could give us an idea how
> many moles need whacking. And maybe, where in the tree they are
> hiding?

You are only responding to part of my email. As I said in my previous
email: "There are plenty of cases where it's better to delay the child
device's probe until the parent finishes. You even gave an example[7]
where it would help avoid unnecessary deferred probes." Can you please
give your thoughts on the rest of the points I made too?

-Saravana
Andrew Lunn Aug. 28, 2021, 5:01 p.m. UTC | #11
On Fri, Aug 27, 2021 at 02:33:02PM -0700, Saravana Kannan wrote:
> On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > I've not yet looked at plain Ethernet drivers. This pattern could also
> > > > exist there. And i wonder about other complex structures, i2c bus
> > > > multiplexors, you can have interrupt controllers as i2c devices,
> > > > etc. So the general case could exist in other places.
> > >
> > > I haven't seen any generic issues like this reported so far. It's only
> > > after adding phy-handle that we are hitting these issues with DSA
> > > switches.
> >
> > Can you run your parser over the 2250 DTB blobs and see how many
> > children have dependencies on a parent? That could give us an idea how
> > many moles need whacking. And maybe, where in the tree they are
> > hiding?
> 
> You are only responding to part of my email. As I said in my previous
> email: "There are plenty of cases where it's better to delay the child
> device's probe until the parent finishes. You even gave an example[7]
> where it would help avoid unnecessary deferred probes." Can you please
> give your thoughts on the rest of the points I made too?

I must admit, my main problem at the moment is -rc1 in two weeks
time. It seems like a number of board with Ethernet switches will be
broken, that worked before. phy-handle is not limited to switch
drivers, it is also used for Ethernet drivers. So it could be, a
number of Ethernet drivers are also going to be broken in -rc1?

But the issues sounds not to be specific to phy-handle, but any
phandle that points back to a parent. So it could be drivers outside
of networking are also going to be broken with -rc1?

You have been very focused on one or two drivers. I would much rather
see you getting an idea of how wide spread this problem is, and what
should we do for -rc1?

Even if modifying DSA drivers to component drivers is possible, while
not breaking backwards compatibility with DT, it is not going to
happen over night. That is something for the next merge window, not
this merge window.

So reverting the phy-handle seems like part of the fix for -rc1. But
until you look at those 2250 DTB blobs, we have no idea if that is
sufficient for -rc1.

    Andrew
Saravana Kannan Aug. 30, 2021, 7:03 p.m. UTC | #12
On Sat, Aug 28, 2021 at 10:01 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Aug 27, 2021 at 02:33:02PM -0700, Saravana Kannan wrote:
> > On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > > > I've not yet looked at plain Ethernet drivers. This pattern could also
> > > > > exist there. And i wonder about other complex structures, i2c bus
> > > > > multiplexors, you can have interrupt controllers as i2c devices,
> > > > > etc. So the general case could exist in other places.
> > > >
> > > > I haven't seen any generic issues like this reported so far. It's only
> > > > after adding phy-handle that we are hitting these issues with DSA
> > > > switches.
> > >
> > > Can you run your parser over the 2250 DTB blobs and see how many
> > > children have dependencies on a parent? That could give us an idea how
> > > many moles need whacking. And maybe, where in the tree they are
> > > hiding?
> >
> > You are only responding to part of my email. As I said in my previous
> > email: "There are plenty of cases where it's better to delay the child
> > device's probe until the parent finishes. You even gave an example[7]
> > where it would help avoid unnecessary deferred probes." Can you please
> > give your thoughts on the rest of the points I made too?
>
> I must admit, my main problem at the moment is -rc1 in two weeks
> time. It seems like a number of board with Ethernet switches will be
> broken, that worked before. phy-handle is not limited to switch
> drivers, it is also used for Ethernet drivers. So it could be, a
> number of Ethernet drivers are also going to be broken in -rc1?

Again, in those cases, based on your FEC example, fw_devlink=on
actually improves things.

> But the issues sounds not to be specific to phy-handle, but any
> phandle that points back to a parent.

I feel like I'm going in circles here. This statement is not true.
Please read my previous explanations.

> So it could be drivers outside
> of networking are also going to be broken with -rc1?
> You have been very focused on one or two drivers. I would much rather
> see you getting an idea of how wide spread this problem is, and what
> should we do for -rc1?

Again, it's not a widespread problem as I explained before.
fw_devlink=on has been the default for 2 kernel versions now. With no
unfixed reported issues.

> Even if modifying DSA drivers to component drivers is possible, while
> not breaking backwards compatibility with DT,

It should be possible without needing any DT changes.

> it is not going to
> happen over night. That is something for the next merge window, not
> this merge window.

Right, I wasn't suggesting the component driver option be implemented
right away. We were talking about what the longer term proper fix
would be for DSA (and Ethernet if we actually find issues there) and
who would do it. That's what I hope this discussion could be.

Also, if we replace Patch 2/2 in this series with the patch below, it
will work as a generic quick fix for DSA that we could use for -rc1.
And if we still have issues reported on the phy-handle patch by -rc5
or so, we could revert the phy-handle patch then so that v5.15 isn't
broken.

-Saravana

--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct
dsa_switch *ds, struct device_node *dn)
 {
        int err;

+       /* A lot of switch devices have their PHYs as child devices and have
+        * the PHYs depend on the switch as a supplier (Eg: interrupt
+        * controller). With fw_devlink=on, that means the PHYs will defer
+        * probe until the probe() of the switch completes. However, the way
+        * the DSA framework is designed, the PHYs are expected to be probed
+        * successfully before the probe() of the switch completes.
+        *
+        * So, mark the switch devices as a "broken parent" so that fw_devlink
+        * knows not to create device links between PHYs and the parent switch.
+        */
+       np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
        err = dsa_switch_parse_member_of(ds, dn);
        if (err)
                return err;
Andrew Lunn Aug. 31, 2021, 1:16 p.m. UTC | #13
> > I must admit, my main problem at the moment is -rc1 in two weeks
> > time. It seems like a number of board with Ethernet switches will be
> > broken, that worked before. phy-handle is not limited to switch
> > drivers, it is also used for Ethernet drivers. So it could be, a
> > number of Ethernet drivers are also going to be broken in -rc1?
> 
> Again, in those cases, based on your FEC example, fw_devlink=on
> actually improves things.

Debatable. I did some testing. As expected some boards with Ethernet
switches are now broken. Without fw_devlink=on, some boards are not
optimal, but they actually work. With it, they are broken.

I did a bisect, and they have been broken since:

ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit
commit ea718c699055c8566eb64432388a04974c43b2ea
Author: Saravana Kannan <saravanak@google.com>
Date:   Tue Mar 2 13:11:32 2021 -0800

    Revert "Revert "driver core: Set fw_devlink=on by default""
    
    This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df.
    
    Since all reported issues due to fw_devlink=on should be addressed by
    this series, revert the revert. fw_devlink=on Take II.
    
    Signed-off-by: Saravana Kannan <saravanak@google.com>
    Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 drivers/base/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

So however it is fixed, it needs to go into stable, not just -rc1.

> Again, it's not a widespread problem as I explained before.
> fw_devlink=on has been the default for 2 kernel versions now. With no
> unfixed reported issues.

Given that some Ethernet switches have been broken all that time, i
wonder what else has been broken? Normally, the kernel which is
release in December becomes the next LTS. It then gets picked up by
the distros and more wide spread tested. So it could be, you get a
flood of reports in January and February about things which are
broken. This is why i don't think you should be relying on bug
reports, you should be taking a more proactive stance and trying to
analyse the DTB blobs.

I will spend some time trying out your proposed fix. See if they are a
quick fix for stable.

      Andrew
Saravana Kannan Aug. 31, 2021, 7:26 p.m. UTC | #14
On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > I must admit, my main problem at the moment is -rc1 in two weeks
> > > time. It seems like a number of board with Ethernet switches will be
> > > broken, that worked before. phy-handle is not limited to switch
> > > drivers, it is also used for Ethernet drivers. So it could be, a
> > > number of Ethernet drivers are also going to be broken in -rc1?
> >
> > Again, in those cases, based on your FEC example, fw_devlink=on
> > actually improves things.
>
> Debatable. I did some testing. As expected some boards with Ethernet
> switches are now broken.

To clarify myself, I'm saying the patch to parse "ethernet" [8] will
make things better with fw_devlink=on for FEC. Not sure if you tested
that patch or not.

And yes, "phy-handle" will make things worse for switches because it
has two issues that need to be fixed. One is this deferred probe thing
for which I gave a patch in the previous email and the other is what
Marek reported (fix in the works). So can you revert "phy-handle"
support and test with [8] and you should see things be better with
fw_devlink=on.

[8] - https://lore.kernel.org/netdev/CAGETcx9=AyEfjX_-adgRuX=8a0MkLnj8sy2KJGhxpNCinJu4yA@mail.gmail.com/

> Without fw_devlink=on, some boards are not
> optimal, but they actually work. With it, they are broken.
>
> I did a bisect, and they have been broken since:
>
> ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit
> commit ea718c699055c8566eb64432388a04974c43b2ea
> Author: Saravana Kannan <saravanak@google.com>
> Date:   Tue Mar 2 13:11:32 2021 -0800
>
>     Revert "Revert "driver core: Set fw_devlink=on by default""
>
>     This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df.
>
>     Since all reported issues due to fw_devlink=on should be addressed by
>     this series, revert the revert. fw_devlink=on Take II.
>
>     Signed-off-by: Saravana Kannan <saravanak@google.com>
>     Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
>  drivers/base/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> So however it is fixed, it needs to go into stable, not just -rc1.

Not sure what was the tip of the tree with which you bisected. For
example, if phy-handle broke things, bisect could still point at this
patch above depending on whether the first bisect is good/bad. Because
reverting this patch effectively disabled phy-handle parsing too.

To be clear, I'm not saying things aren't broken. I'm just pointing
out some nuances with the bisect that we need to be aware of.

> > Again, it's not a widespread problem as I explained before.
> > fw_devlink=on has been the default for 2 kernel versions now. With no
> > unfixed reported issues.
>
> Given that some Ethernet switches have been broken all that time, i
> wonder what else has been broken? Normally, the kernel which is
> release in December becomes the next LTS. It then gets picked up by
> the distros and more wide spread tested. So it could be, you get a
> flood of reports in January and February about things which are
> broken. This is why i don't think you should be relying on bug
> reports, you should be taking a more proactive stance and trying to
> analyse the DTB blobs.

As I mentioned earlier, just looking at DTB doesn't tell me much
because the driver could still be fine depending on how it's written.
Also, I don't have an easy way to do this. If I find a way, I'll do
it.

Btw, most bug reports that have been raised have been fixed with
generic fixes that address general DT patterns. For example,
fw_devlink has cycle detection built in, has support for devices that
never get probed, etc. Enabling fw_devlink=on and handling bug reports
with generic fixes has worked well so far to get fw_devlink to where
it is today. I've tried to be very quick in responding to fw_devlink
issues -- and that has worked out so far and hopefully it'll continue
to work out.

> I will spend some time trying out your proposed fix. See if they are a
> quick fix for stable.

Thanks for testing it out. And worst case, we'll revert phy-handle support.


-Saravana
Andrew Lunn Aug. 31, 2021, 10:05 p.m. UTC | #15
On Tue, Aug 31, 2021 at 12:26:30PM -0700, Saravana Kannan wrote:
> On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote:
> >
> > > > I must admit, my main problem at the moment is -rc1 in two weeks
> > > > time. It seems like a number of board with Ethernet switches will be
> > > > broken, that worked before. phy-handle is not limited to switch
> > > > drivers, it is also used for Ethernet drivers. So it could be, a
> > > > number of Ethernet drivers are also going to be broken in -rc1?
> > >
> > > Again, in those cases, based on your FEC example, fw_devlink=on
> > > actually improves things.
> >
> > Debatable. I did some testing. As expected some boards with Ethernet
> > switches are now broken.
> 
> To clarify myself, I'm saying the patch to parse "ethernet" [8] will
> make things better with fw_devlink=on for FEC. Not sure if you tested
> that patch or not.

Yes and no. I was tested with the FEC, but because of fw_devlink=on,
the switches don't probe correctly. So it is not possible to see if it
helped or not, since its plain broken.

> Not sure what was the tip of the tree with which you bisected.

I manually tested linux-next, v5.14, v5.13 and v5.12 and then
bisected:

$ git bisect log
git bisect start
# good: [9f4ad9e425a1d3b6a34617b8ea226d56a119a717] Linux 5.12
git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
# bad: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13

So well away from linux-next which contains the phy-handle parser.

I will try to give the two patches a try today or tomorrow.

> Thanks for testing it out. And worst case, we'll revert phy-handle
> support.

Which is not enough to fix these Ethernet switches.

      Andrew
Saravana Kannan Aug. 31, 2021, 10:18 p.m. UTC | #16
On Tue, Aug 31, 2021 at 3:06 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Tue, Aug 31, 2021 at 12:26:30PM -0700, Saravana Kannan wrote:
> > On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > >
> > > > > I must admit, my main problem at the moment is -rc1 in two weeks
> > > > > time. It seems like a number of board with Ethernet switches will be
> > > > > broken, that worked before. phy-handle is not limited to switch
> > > > > drivers, it is also used for Ethernet drivers. So it could be, a
> > > > > number of Ethernet drivers are also going to be broken in -rc1?
> > > >
> > > > Again, in those cases, based on your FEC example, fw_devlink=on
> > > > actually improves things.
> > >
> > > Debatable. I did some testing. As expected some boards with Ethernet
> > > switches are now broken.
> >
> > To clarify myself, I'm saying the patch to parse "ethernet" [8] will
> > make things better with fw_devlink=on for FEC. Not sure if you tested
> > that patch or not.
>
> Yes and no. I was tested with the FEC, but because of fw_devlink=on,
> the switches don't probe correctly. So it is not possible to see if it
> helped or not, since its plain broken.
>
> > Not sure what was the tip of the tree with which you bisected.
>
> I manually tested linux-next, v5.14, v5.13 and v5.12 and then
> bisected:
>
> $ git bisect log
> git bisect start
> # good: [9f4ad9e425a1d3b6a34617b8ea226d56a119a717] Linux 5.12
> git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
> # bad: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13
>
> So well away from linux-next which contains the phy-handle parser.
>
> I will try to give the two patches a try today or tomorrow.

If the switches are broken without the phy-handle or ethernet change,
I'm not sure if the "BROKEN_PARENT" patch would help.

>
> > Thanks for testing it out. And worst case, we'll revert phy-handle
> > support.
>
> Which is not enough to fix these Ethernet switches.

Ok, if you can give more specifics on this, I'll look into it. I'd need:
1) The DTS file that you see the issue on.
2) The devices that don't probe when you have the issue (you can find
these in /sys/kernel/debug/devices_deferred)
3) And if possible boot logs with dev_dbg changed to dev_info in
device_link_add() and device_links_check_suppliers()

Thanks,
Saravana
Andrew Lunn Aug. 31, 2021, 11:02 p.m. UTC | #17
> If the switches are broken without the phy-handle or ethernet change,
> I'm not sure if the "BROKEN_PARENT" patch would help.

> > Which is not enough to fix these Ethernet switches.
> 
> Ok, if you can give more specifics on this, I'll look into it.

The switches probe, but get the wrong PHY driver, genphy, not the
Marvell PHY driver. And genphy is not sufficient for this hardware.

I'd need:
> 1) The DTS file that you see the issue on.

I did the bisect on arch/arm/boot/dts/vf610-zii-dev-rev-c.dts but i
also tested arch/arm/boot/dts/vf610-zii-dev-rev-b.dts.

Rev B is interesting because switch0 and switch1 got genphy, while
switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
interrupt properties, so don't loop back to their parent device.

Here is Rev B. I trimmed out other devices probing in parallel:

[    1.029100] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
[    1.034735] fec 400d1000.ethernet: Using random MAC address: 42:f2:14:33:78:f5
[    1.042272] libphy: fec_enet_mii_bus: probed
[    1.455932] libphy: mdio_mux: probed
[    1.459432] mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1
[    1.494076] libphy: mdio: probed
[    1.518958] libphy: mdio_mux: probed
[    1.522553] mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1
[    1.537295] libphy: mdio: probed
[    1.556571] libphy: mdio_mux: probed
[    1.559719] mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2
[    1.574614] libphy: mdio: probed
[    1.733104] mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
[    1.750737] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    1.768273] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    1.806561] mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
[    1.824033] mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    1.841496] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    1.943535] mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL)
[    2.003529] mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL)
[    2.063535] mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL)
[    2.084768] DSA: tree 0 setup
[    2.087791] libphy: mdio_mux: probed
[    2.265477] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=POLL)

root@zii-devel-b:~# cat /sys/kernel/debug/devices_deferred
root@zii-devel-b:~# 

For Rev C we see:

[    1.244417] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
[    1.250081] fec 400d1000.ethernet: Using random MAC address: c6:42:89:ed:5f:dd
[    1.257507] libphy: fec_enet_mii_bus: probed
[    1.570725] libphy: mdio_mux: probed
[    1.574208] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
[    1.590272] libphy: mdio: probed
[    1.627721] libphy: mdio_mux: probed
[    1.631222] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
[    1.659643] libphy: mdio: probed
[    1.811665] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    1.829230] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    1.845884] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
[    1.863237] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
[    1.884078] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    1.901630] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    1.918287] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
[    1.933721] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
[    1.948722] DSA: tree 0 setup
[    1.951599] libphy: mdio_mux: probed

[   21.565550] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=48)

I have Rev B using NFS root, so the interfaces are configured up by
the kernel during boot. Rev C has a local root filesystem, so user
space brings the interfaces up, and it is only when the FEC is opened
does it attach to the Micrel PHY. That explains the difference between
2.265 and 21.565 seconds for the last line.

Again, nothing deferred.

       Andrew
Andrew Lunn Aug. 31, 2021, 11:18 p.m. UTC | #18
> 3) And if possible boot logs with dev_dbg changed to dev_info in
> device_link_add() and device_links_check_suppliers()

Rev C.

Here is everything:

[    0.000000] Booting Linux on physical CPU 0x0
[    0.000000] Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, G
NU ld (GNU Binutils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021
[    0.000000] CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d
[    0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[    0.000000] OF: fdt: Machine model: ZII VF610 Development Board, Rev C
[    0.000000] printk: bootconsole [earlycon0] enabled
[    0.000000] Memory policy: Data cache writeback
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000080000000-0x000000009fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x000000009fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
[    0.000000] On node 0 totalpages: 131072
[    0.000000]   Normal zone: 1024 pages used for memmap
[    0.000000]   Normal zone: 0 pages reserved
[    0.000000]   Normal zone: 131072 pages, LIFO batch:31
[    0.000000] CPU: All CPU(s) started in SVC mode.
[    0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
[    0.000000] pcpu-alloc: [0] 0 
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: root=/dev/mmcblk0p2 rootfstype=ext4 rw rootwait earlyprintk
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 508208K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16080K reserved, 0K cma-r
eserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[    0.000000] ftrace: allocating 25455 entries in 50 pages
[    0.000000] ftrace: allocated 50 pages with 3 groups
[    0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
[    0.000000] L2C-310 erratum 769419 enabled
[    0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[    0.000000] L2C-310 cache controller enabled, 8 ways, 512 kB
[    0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000
[    0.000000] random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0
[    0.000009] sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns
[    0.015010] clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns
[    0.025738] Switching to timer-based delay loop, resolution 5ns
[    0.031679] Console: colour dummy device 80x30
[    0.035135] printk: console [tty0] enabled
[    0.038268] printk: bootconsole [earlycon0] disabled
[    0.042336] Calibrating delay loop (skipped), value calculated using timer frequency.. 333.47 BogoMIPS (lpj=1667368)
[    0.042403] pid_max: default: 32768 minimum: 301
[    0.042758] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.042819] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.044134] CPU: Testing write buffer coherency: ok
[    0.045371] Setting up static identity map for 0x80100000 - 0x8010003c
[    0.046513] devtmpfs: initialized
[    0.058285] VFP support v0.3: implementor 41 architecture 2 part 30 variant 5 rev 1
[    0.058654] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    0.058725] futex hash table entries: 256 (order: -1, 3072 bytes, linear)
[    0.058923] pinctrl core: initialized pinctrl subsystem
[    0.060776] NET: Registered protocol family 16
[    0.062307] DMA: preallocated 256 KiB pool for atomic coherent allocations
[    0.075858] platform 40044000.spi: Linked as a consumer to 40048000.iomuxc
[    0.076094] platform 4003b000.adc: Linked as a consumer to 40048000.iomuxc
[    0.076279] platform 4002c000.spi: Linked as a sync state only consumer to 40048000.iomuxc
[    0.076327] platform 4002c000.spi: Linked as a consumer to 40048000.iomuxc
[    0.076514] platform 4002a000.serial: Linked as a consumer to 40048000.iomuxc
[    0.076702] platform 40029000.serial: Linked as a consumer to 40048000.iomuxc
[    0.076915] platform 40028000.serial: Linked as a consumer to 40048000.iomuxc
[    0.077102] platform 40027000.serial: Linked as a consumer to 40048000.iomuxc
[    0.077794] platform 4002c000.spi: Linked as a sync state only consumer to 40049000.gpio
[    0.079223] platform 4002c000.spi: Linked as a sync state only consumer to 4004c000.gpio
[    0.082051] platform 40066000.i2c: Linked as a consumer to 4004a000.gpio
[    0.082252] platform 40066000.i2c: Linked as a consumer to 40048000.iomuxc
[    0.082494] platform 40066000.i2c: Linked as a sync state only consumer to 40049000.gpio
[    0.083140] platform 40067000.i2c: Linked as a consumer to 40048000.iomuxc
[    0.085386] platform 40080000.bus: Linked as a sync state only consumer to 40048000.iomuxc
[    0.085622] platform 40080000.bus: Linked as a sync state only consumer to 4004c000.gpio
[    0.087821] platform 400b2000.esdhc: Linked as a consumer to 40048000.iomuxc
[    0.089182] platform 400d0000.ethernet: Linked as a consumer to 40048000.iomuxc
[    0.089428] platform 400d0000.ethernet: Linked as a sync state only consumer to 4004c000.gpio
[    0.090067] platform 400d1000.ethernet: Linked as a consumer to 40048000.iomuxc
[    0.090730] platform 400e6000.i2c: Linked as a consumer to 40048000.iomuxc
[    0.090978] platform 400e6000.i2c: Linked as a sync state only consumer to 4004c000.gpio
[    0.092164] platform gpio-leds: Linked as a consumer to 40048000.iomuxc
[    0.092411] platform gpio-leds: Linked as a sync state only consumer to 4004b000.gpio
[    0.092885] platform 4003b000.adc: Linked as a consumer to regulator-vcc-3v3-mcu
[    0.093441] platform 40034000.usb: Linked as a consumer to regulator-usb0-vbus
[    0.093649] platform regulator-usb0-vbus: Linked as a consumer to 40049000.gpio
[    0.093850] platform regulator-usb0-vbus: Linked as a consumer to 40048000.iomuxc
[    0.095808] platform mdio-mux: Linked as a consumer to 40049000.gpio
[    0.096022] platform mdio-mux: Linked as a consumer to 40048000.iomuxc
[    0.097352] vf610-pinctrl 40048000.iomuxc: initialized IMX pinctrl driver
[    0.108955] Kprobes globally optimized
[    0.215992] platform regulator-usb0-vbus: probe deferral - supplier 40049000.gpio not ready
[    0.216878] SCSI subsystem initialized
[    0.217522] usbcore: registered new interface driver usbfs
[    0.217683] usbcore: registered new interface driver hub
[    0.217824] usbcore: registered new device driver usb
[    0.218474] platform 40066000.i2c: probe deferral - supplier 4004a000.gpio not ready
[    0.219685] i2c i2c-0: IMX I2C adapter registered
[    0.219852] i2c i2c-0: using dma0chan0 (tx) and dma0chan1 (rx) for DMA transfers
[    0.221038] i2c 1-0070: Linked as a consumer to 4004c000.gpio
[    0.221289] i2c 1-0070: Linked as a consumer to 40048000.iomuxc
[    0.221492] i2c i2c-1: IMX I2C adapter registered
[    0.221625] i2c i2c-1: using dma0chan16 (tx) and dma0chan17 (rx) for DMA transfers
[    0.221698] imx-i2c 400e6000.i2c: Dropping the link to 4004c000.gpio
[    0.222145] pps_core: LinuxPPS API ver. 1 registered
[    0.222188] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.222290] PTP clock support registered
[    0.222905] Advanced Linux Sound Architecture Driver Initialized.
[    0.268543] clocksource: Switched to clocksource arm_global_timer
[    0.634728] NET: Registered protocol family 2
[    0.635842] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.635943] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.636068] TCP bind hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.636190] TCP: Hash tables configured (established 4096 bind 4096)
[    0.636470] UDP hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.636551] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.636879] NET: Registered protocol family 1
[    0.639687] RPC: Registered named UNIX socket transport module.
[    0.639749] RPC: Registered udp transport module.
[    0.639780] RPC: Registered tcp transport module.
[    0.639808] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.642391] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.657373] io scheduler mq-deadline registered
[    0.657455] io scheduler kyber registered
[    0.658814] gpio-23 (sx1503-irq): hogged as input
[    0.664005] gpio-98 (eth0-intrp): hogged as input
[    0.677195] 40027000.serial: ttyLP0 at MMIO 0x40027000 (irq = 19, base_baud = 5210526) is a FSL_LPUART
[    1.291559] printk: console [ttyLP0] enabled
[    1.295642] 40028000.serial: ttyLP1 at MMIO 0x40028000 (irq = 20, base_baud = 5210526) is a FSL_LPUART
[    1.305537] 40029000.serial: ttyLP2 at MMIO 0x40029000 (irq = 21, base_baud = 5210526) is a FSL_LPUART
[    1.315290] 4002a000.serial: ttyLP3 at MMIO 0x4002a000 (irq = 22, base_baud = 5210526) is a FSL_LPUART
[    1.360010] brd: module loaded
[    1.362289] at24 0-0050: supply vcc not found, using dummy regulator
[    1.367686] at24 0-0050: Linked as a consumer to regulator.0
[    1.374685] at24 0-0050: 256 byte 24c02 EEPROM, read-only
[    1.385241] spi-nor spi0.0: m25p128 (16384 Kbytes)
[    1.392931] fsl-dspi 4002c000.spi: Dropping the link to 40049000.gpio
[    1.398313] fsl-dspi 4002c000.spi: Dropping the link to 4004c000.gpio
[    1.411512] spi-nor spi1.0: n25q00 (131072 Kbytes)
[    1.421718] spi-nor spi1.2: n25q00 (131072 Kbytes)
[    1.430163] libphy: Fixed MDIO Bus: probed
[    1.439542] mdio_bus 400d0000.ethernet-1: Linked as a sync state only consumer to 4004c000.gpio
[    1.447185] mdio_bus 400d0000.ethernet-1: Linked as a sync state only consumer to 40048000.iomuxc
[    1.454900] libphy: fec_enet_mii_bus: probed
[    1.478153] mdio_bus 400d0000.ethernet-1:00: Linked as a consumer to 4004c000.gpio
[    1.484753] mdio_bus 400d0000.ethernet-1:00: Linked as a consumer to 40048000.iomuxc
[    1.492867] fec 400d0000.ethernet: Dropping the link to 4004c000.gpio
[    1.504139] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
[    1.509763] fec 400d1000.ethernet: Using random MAC address: 86:cc:e2:6b:14:52
[    1.517154] libphy: fec_enet_mii_bus: probed
[    1.521523] usbcore: registered new interface driver asix
[    1.525784] usbcore: registered new interface driver ax88179_178a
[    1.530746] usbcore: registered new interface driver cdc_ether
[    1.535388] usbcore: registered new interface driver net1080
[    1.539902] usbcore: registered new interface driver cdc_subset
[    1.544650] usbcore: registered new interface driver zaurus
[    1.549100] usbcore: registered new interface driver cdc_ncm
[    1.553487] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.558763] ehci-platform: EHCI generic platform driver
[    1.563182] usbcore: registered new interface driver usb-storage
[    1.568721] platform 40034000.usb: probe deferral - supplier regulator-usb0-vbus not ready
[    1.587769] snvs_rtc 400a7000.snvs:snvs-rtc-lp: registered as rtc0
[    1.592824] snvs_rtc 400a7000.snvs:snvs-rtc-lp: setting system clock to 2021-08-31T23:07:25 UTC (1630451245)
[    1.601721] i2c /dev entries driver
[    1.609055] i2c i2c-1: Added multiplexed i2c bus 2
[    1.614256] i2c i2c-1: Added multiplexed i2c bus 3
[    1.619194] i2c i2c-1: Added multiplexed i2c bus 4
[    1.624001] i2c i2c-1: Added multiplexed i2c bus 5
[    1.628764] i2c i2c-1: Added multiplexed i2c bus 6
[    1.633465] i2c i2c-1: Added multiplexed i2c bus 7
[    1.638180] i2c i2c-1: Added multiplexed i2c bus 8
[    1.643056] i2c i2c-1: Added multiplexed i2c bus 9
[    1.646603] pca954x 1-0070: registered 8 multiplexed busses for I2C switch pca9548
[    1.657429] sdhci: Secure Digital Host Controller Interface driver
[    1.662415] sdhci: Copyright(c) Pierre Ossman
[    1.665495] sdhci-pltfm: SDHCI platform and OF driver helper
[    1.671279] leds-gpio gpio-leds: Dropping the link to 4004b000.gpio
[    1.677287] usbcore: registered new interface driver usbhid
[    1.681644] usbhid: USB HID core driver
[    1.684988] vf610-adc 4003b000.adc: Linked as a consumer to regulator.1
[    1.705844] NET: Registered protocol family 17
[    1.747018] mmc0: SDHCI controller on 400b2000.esdhc [400b2000.esdhc] using ADMA
[    1.757323] lm75 10-0048: supply vs not found, using dummy regulator
[    1.762801] lm75 10-0048: Linked as a consumer to regulator.0
[    1.771806] lm75 10-0048: hwmon0: sensor 'lm75'
[    1.775772] at24 10-0050: supply vcc not found, using dummy regulator
[    1.781313] at24 10-0050: Linked as a consumer to regulator.0
[    1.789929] at24 10-0050: 512 byte 24c04 EEPROM, writable, 1 bytes/write
[    1.796044] at24 10-0052: supply vcc not found, using dummy regulator
[    1.801527] at24 10-0052: Linked as a consumer to regulator.0
[    1.810012] at24 10-0052: 512 byte 24c04 EEPROM, writable, 1 bytes/write
[    1.816481] pca953x 10-0018: supply vcc not found, using dummy regulator
[    1.822251] pca953x 10-0018: Linked as a consumer to regulator.0
[    1.827009] pca953x 10-0018: using no AI
[    1.834047] platform sff3: Linked as a consumer to 10-0020
[    1.838453] platform sff2: Linked as a consumer to 10-0020
[    1.842908] i2c 10-0020: Linked as a consumer to 40049000.gpio
[    1.847643] i2c 10-0020: Linked as a consumer to 40048000.iomuxc
[    1.860624] pca953x 10-0022: supply vcc not found, using dummy regulator
[    1.866334] pca953x 10-0022: Linked as a consumer to regulator.0
[    1.871160] pca953x 10-0022: using no AI
[    1.878025] i2c i2c-10: IMX I2C adapter registered
[    1.881721] i2c i2c-10: using dma0chan2 (tx) and dma0chan3 (rx) for DMA transfers
[    1.887989] imx-i2c 40066000.i2c: Dropping the link to 40049000.gpio
[    1.894950] mdio_bus 0.1: Linked as a sync state only consumer to 40049000.gpio
[    1.901214] mdio_bus 0.1: Linked as a sync state only consumer to 40048000.iomuxc
[    1.907514] libphy: mdio_mux: probed
[    1.910276] mdio_bus 0.1:00: Linked as a consumer to 40049000.gpio
[    1.915348] mdio_bus 0.1:00: Linked as a consumer to 40048000.iomuxc
[    1.920693] mdio_bus 0.1:00: Linked as a sync state only consumer to 0.1:00
[    1.926760] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
[    1.942194] mdio_bus !mdio-mux!mdio@1!switch@0!mdio: Linked as a sync state only consumer to 0.1:00
[    1.950156] libphy: mdio: probed
[    1.952481] mmc0: host does not support reading read-only switch, assuming write-enable
[    1.963210] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: Linked as a consumer to 0.1:00
[    1.970021] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: probe deferral - supplier 0.1:00 not ready
[    1.978320] mmc0: new high speed SDHC card at address aaaa
[    1.985170] mmcblk0: mmc0:aaaa SU04G 3.69 GiB 
[    1.992443] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: Linked as a consumer to 0.1:00
[    1.999277] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: probe deferral - supplier 0.1:00 not ready
[    2.012242] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:03: Linked as a consumer to 0.1:00
[    2.019085] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:03: probe deferral - supplier 0.1:00 not ready
[    2.026995]  mmcblk0: p1 p2 p3 < p5 >
[    2.030498] mmcblk0: p5 size 440320 extends beyond EOD, truncated
[    2.043234] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:04: Linked as a consumer to 0.1:00
[    2.050044] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:04: probe deferral - supplier 0.1:00 not ready
[    2.059198] mv88e6085 0.1:00: Dropping the link to 0.1:00
[    2.064608] mdio_bus 0.2: Linked as a sync state only consumer to 40049000.gpio
[    2.070898] mdio_bus 0.2: Linked as a sync state only consumer to 40048000.iomuxc
[    2.077196] libphy: mdio_mux: probed
[    2.079960] mdio_bus 0.2:00: Linked as a consumer to 40049000.gpio
[    2.085043] mdio_bus 0.2:00: Linked as a consumer to 40048000.iomuxc
[    2.090387] mdio_bus 0.2:00: Linked as a sync state only consumer to 0.2:00
[    2.096466] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
[    2.111052] mdio_bus !mdio-mux!mdio@2!switch@0!mdio: Linked as a sync state only consumer to 0.2:00
[    2.119003] libphy: mdio: probed
[    2.126170] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: Linked as a consumer to 0.2:00
[    2.132949] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: probe deferral - supplier 0.2:00 not ready
[    2.145815] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: Linked as a consumer to 0.2:00
[    2.152565] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: probe deferral - supplier 0.2:00 not ready
[    2.165415] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:03: Linked as a consumer to 0.2:00
[    2.172164] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:03: probe deferral - supplier 0.2:00 not ready
[    2.185026] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:04: Linked as a consumer to 0.2:00
[    2.191777] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:04: probe deferral - supplier 0.2:00 not ready
[    2.309663] mv88e6085 0.1:00: configuring for fixed/ link mode
[    2.319255] mv88e6085 0.1:00: Link is Up - 100Mbps/Full - flow control off
[    2.329241] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    2.346853] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    2.364287] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
[    2.380887] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
[    2.397258] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:01: Dropping the link to 0.2:00
[    2.408469] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
[    2.423691] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:02: Dropping the link to 0.2:00
[    2.432750] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
[    2.447887] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:03: Dropping the link to 0.2:00
[    2.456943] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
[    2.472080] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:04: Dropping the link to 0.2:00
[    2.481149] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
[    2.495310] mv88e6085 0.1:00: Linked as a consumer to 400d1000.ethernet
[    2.500779] DSA: tree 0 setup
[    2.502536] mv88e6085 0.2:00: Dropping the link to 0.2:00
[    2.507780] libphy: mdio_mux: probed

This is based on:

Commit ea718c699055c8566eb64432388a04974c43b2ea (HEAD, refs/bisect/bad)
Author: Saravana Kannan <saravanak@google.com>
Date:   Tue Mar 2 13:11:32 2021 -0800

    Revert "Revert "driver core: Set fw_devlink=on by default""
    
    This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df.
    
    Since all reported issues due to fw_devlink=on should be addressed by
    this series, revert the revert. fw_devlink=on Take II.
    
    Signed-off-by: Saravana Kannan <saravanak@google.com>
    Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

     Andrew
Vladimir Oltean Aug. 31, 2021, 11:18 p.m. UTC | #19
On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> Rev B is interesting because switch0 and switch1 got genphy, while
> switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> interrupt properties, so don't loop back to their parent device.

This is interesting and not what I really expected to happen. It goes to
show that we really need more time to understand all the subtleties of
device dependencies before jumping on patching stuff.

In case the DSA tree contains more than one switch, different things
will happen in dsa_register_switch().
The tree itself is only initialized when the last switch calls
dsa_register_switch(). All the other switches just mark themselves as
present and exit probing early. See this piece of code in dsa_tree_setup:

	complete = dsa_tree_setup_routing_table(dst);
	if (!complete)
		return 0;

So it should be a general property of cross-chip DSA trees that all
switches except the last one will have the specific PHY driver probed
properly, and not the genphy.

Because all (N - 1) switches of a tree exit early in dsa_register_switch,
they have successfully probed by the time the last switch brings up the
tree, and brings up the PHYs on behalf of every other switch.

The last switch can connect to the PHY on behalf of the other switches
past their probe ending, and those PHYs should not defer probing because
their supplier is now probed. It is only that the last switch cannot
connect to the PHYs of its own ports.

So if this does not work (you say that there are 2 switches that use
genphy) I suspect there are also other bugs involved.
Andrew Lunn Aug. 31, 2021, 11:27 p.m. UTC | #20
On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > Rev B is interesting because switch0 and switch1 got genphy, while
> > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > interrupt properties, so don't loop back to their parent device.
> 
> This is interesting and not what I really expected to happen. It goes to
> show that we really need more time to understand all the subtleties of
> device dependencies before jumping on patching stuff.

Here is the log on Rev B with the extra debug prints

Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binu
tils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021
CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt: Machine model: ZII VF610 Development Board, Rev B
printk: bootconsole [earlycon0] enabled
Memory policy: Data cache writeback
Zone ranges:
  Normal   [mem 0x0000000080000000-0x000000009fffffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000080000000-0x000000009fffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
CPU: All CPU(s) started in SVC mode.
Built 1 zonelists, mobility grouping on.  Total pages: 130048
Kernel command line: root=/dev/nfs nfsroot=10.0.0.1:/srv/nfsroot/zii_devel_b,nfsvers=3,tcp ip=dhcp rw earlyprintk
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 508204K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16084K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
ftrace: allocating 25455 entries in 50 pages
ftrace: allocated 50 pages with 3 groups
NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
L2C-310 erratum 769419 enabled
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 8 ways, 512 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000
random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0
sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns
clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns
Switching to timer-based delay loop, resolution 5ns
Console: colour dummy device 80x30
printk: console [tty0] enabled
printk: bootconsole [earlycon0] disabled
Booting Linux on physical CPU 0x0
Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binu
tils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021
CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d
CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
OF: fdt: Machine model: ZII VF610 Development Board, Rev B
printk: bootconsole [earlycon0] enabled
Memory policy: Data cache writeback
Zone ranges:
  Normal   [mem 0x0000000080000000-0x000000009fffffff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x0000000080000000-0x000000009fffffff]
Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff]
CPU: All CPU(s) started in SVC mode.
Built 1 zonelists, mobility grouping on.  Total pages: 130048
Kernel command line: root=/dev/nfs nfsroot=10.0.0.1:/srv/nfsroot/zii_devel_b,nfsvers=3,tcp ip=dhcp rw earlyprintk
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
mem auto-init: stack:off, heap alloc:off, heap free:off
Memory: 508204K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16084K reserved, 0K cma-reserved)
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
ftrace: allocating 25455 entries in 50 pages
ftrace: allocated 50 pages with 3 groups
NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16
L2C-310 erratum 769419 enabled
L2C-310 dynamic clock gating enabled, standby mode enabled
L2C-310 cache controller enabled, 8 ways, 512 kB
L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000
random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0
sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns
clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns
Switching to timer-based delay loop, resolution 5ns
Console: colour dummy device 80x30
printk: console [tty0] enabled
printk: bootconsole [earlycon0] disabled
Calibrating delay loop (skipped), value calculated using timer frequency.. 333.47 BogoMIPS (lpj=1667368)
pid_max: default: 32768 minimum: 301
Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
CPU: Testing write buffer coherency: ok
Setting up static identity map for 0x80100000 - 0x8010003c
devtmpfs: initialized
VFP support v0.3: implementor 41 architecture 2 part 30 variant 5 rev 1
clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
futex hash table entries: 256 (order: -1, 3072 bytes, linear)
pinctrl core: initialized pinctrl subsystem
NET: Registered protocol family 16
DMA: preallocated 256 KiB pool for atomic coherent allocations
platform 40044000.spi: Linked as a consumer to 40048000.iomuxc
platform 4003b000.adc: Linked as a consumer to 40048000.iomuxc
platform 40029000.serial: Linked as a consumer to 40048000.iomuxc
platform 40028000.serial: Linked as a consumer to 40048000.iomuxc
platform 40027000.serial: Linked as a consumer to 40048000.iomuxc
platform 40066000.i2c: Linked as a consumer to 4004a000.gpio
platform 40066000.i2c: Linked as a consumer to 40048000.iomuxc
platform 40066000.i2c: Linked as a sync state only consumer to 4004c000.gpio
platform 40067000.i2c: Linked as a consumer to 40048000.iomuxc
platform 40080000.bus: Linked as a sync state only consumer to 40048000.iomuxc
platform 40080000.bus: Linked as a sync state only consumer to 4004c000.gpio
platform 400b2000.esdhc: Linked as a consumer to 40048000.iomuxc
platform 400d0000.ethernet: Linked as a consumer to 40048000.iomuxc
platform 400d1000.ethernet: Linked as a consumer to 40048000.iomuxc
platform 400e6000.i2c: Linked as a consumer to 40048000.iomuxc
platform 400e6000.i2c: Linked as a sync state only consumer to 4004c000.gpio
platform gpio-leds: Linked as a consumer to 40048000.iomuxc
platform gpio-leds: Linked as a sync state only consumer to 4004b000.gpio
platform 4003b000.adc: Linked as a consumer to regulator-vcc-3v3-mcu
platform 40034000.usb: Linked as a consumer to regulator-usb0-vbus
platform regulator-usb0-vbus: Linked as a consumer to 40049000.gpio
platform regulator-usb0-vbus: Linked as a consumer to 40048000.iomuxc
platform mdio-mux: Linked as a consumer to 40049000.gpio
platform mdio-mux: Linked as a consumer to 40048000.iomuxc
platform spi0: Linked as a consumer to 4004a000.gpio
platform spi0: Linked as a consumer to 40048000.iomuxc
platform spi0: Linked as a sync state only consumer to 4004d000.gpio
vf610-pinctrl 40048000.iomuxc: initialized IMX pinctrl driver
Kprobes globally optimized
platform regulator-usb0-vbus: probe deferral - supplier 40049000.gpio not ready
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
platform 40066000.i2c: probe deferral - supplier 4004a000.gpio not ready
i2c i2c-0: IMX I2C adapter registered
i2c i2c-0: using dma0chan0 (tx) and dma0chan1 (rx) for DMA transfers
i2c 1-0070: Linked as a consumer to 4004c000.gpio
i2c 1-0070: Linked as a consumer to 40048000.iomuxc
i2c i2c-1: IMX I2C adapter registered
i2c i2c-1: using dma0chan16 (tx) and dma0chan17 (rx) for DMA transfers
imx-i2c 400e6000.i2c: Dropping the link to 4004c000.gpio
pps_core: LinuxPPS API ver. 1 registered
pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
PTP clock support registered
Advanced Linux Sound Architecture Driver Initialized.
clocksource: Switched to clocksource arm_global_timer
NET: Registered protocol family 2
tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
TCP bind hash table entries: 4096 (order: 2, 16384 bytes, linear)
TCP: Hash tables configured (established 4096 bind 4096)
UDP hash table entries: 256 (order: 0, 4096 bytes, linear)
UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear)
NET: Registered protocol family 1
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
workingset: timestamp_bits=30 max_order=17 bucket_order=0
io scheduler mq-deadline registered
io scheduler kyber registered
40027000.serial: ttyLP0 at MMIO 0x40027000 (irq = 19, base_baud = 5210526) is a FSL_LPUART
printk: console [ttyLP0] enabled
40028000.serial: ttyLP1 at MMIO 0x40028000 (irq = 20, base_baud = 5210526) is a FSL_LPUART
40029000.serial: ttyLP2 at MMIO 0x40029000 (irq = 21, base_baud = 5210526) is a FSL_LPUART
brd: module loaded
spi-nor spi0.0: mt25ql02g (262144 Kbytes)
spi-nor spi0.2: mt25ql02g (262144 Kbytes)
libphy: Fixed MDIO Bus: probed
libphy: fec_enet_mii_bus: probed
fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
fec 400d1000.ethernet: Using random MAC address: 1a:09:76:4a:0d:49
libphy: fec_enet_mii_bus: probed
usbcore: registered new interface driver asix
usbcore: registered new interface driver ax88179_178a
usbcore: registered new interface driver cdc_ether
usbcore: registered new interface driver net1080
usbcore: registered new interface driver cdc_subset
usbcore: registered new interface driver zaurus
usbcore: registered new interface driver cdc_ncm
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ehci-platform: EHCI generic platform driver
usbcore: registered new interface driver usb-storage
platform 40034000.usb: probe deferral - supplier regulator-usb0-vbus not ready
snvs_rtc 400a7000.snvs:snvs-rtc-lp: registered as rtc0
snvs_rtc 400a7000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01T00:00:00 UTC (0)
i2c /dev entries driver
at24 2-0050: supply vcc not found, using dummy regulator
at24 2-0050: Linked as a consumer to regulator.0
at24 2-0050: Dropping the link to regulator.0
i2c i2c-1: Added multiplexed i2c bus 2
at24 3-0050: supply vcc not found, using dummy regulator
at24 3-0050: Linked as a consumer to regulator.0
at24 3-0050: Dropping the link to regulator.0
i2c i2c-1: Added multiplexed i2c bus 3
at24 4-0050: supply vcc not found, using dummy regulator
at24 4-0050: Linked as a consumer to regulator.0
at24 4-0050: 256 byte 24c02 EEPROM, writable, 1 bytes/write
i2c i2c-1: Added multiplexed i2c bus 4
at24 5-0050: supply vcc not found, using dummy regulator
at24 5-0050: Linked as a consumer to regulator.0
at24 5-0050: 256 byte 24c02 EEPROM, writable, 1 bytes/write
i2c i2c-1: Added multiplexed i2c bus 5
i2c i2c-1: Added multiplexed i2c bus 6
i2c i2c-1: Added multiplexed i2c bus 7
i2c i2c-1: Added multiplexed i2c bus 8
i2c i2c-1: Added multiplexed i2c bus 9
pca954x 1-0070: registered 8 multiplexed busses for I2C switch pca9548
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
sdhci-pltfm: SDHCI platform and OF driver helper
leds-gpio gpio-leds: Dropping the link to 4004b000.gpio
usbcore: registered new interface driver usbhid
usbhid: USB HID core driver
vf610-adc 4003b000.adc: Linked as a consumer to regulator.1
NET: Registered protocol family 17
mmc0: SDHCI controller on 400b2000.esdhc [400b2000.esdhc] using ADMA
lm75 10-0048: supply vs not found, using dummy regulator
lm75 10-0048: Linked as a consumer to regulator.0
lm75 10-0048: hwmon0: sensor 'lm75'
at24 10-0050: supply vcc not found, using dummy regulator
at24 10-0050: Linked as a consumer to regulator.0
at24 10-0050: 512 byte 24c04 EEPROM, writable, 1 bytes/write
at24 10-0052: supply vcc not found, using dummy regulator
at24 10-0052: Linked as a consumer to regulator.0
at24 10-0052: 512 byte 24c04 EEPROM, writable, 1 bytes/write
pca953x 10-0020: supply vcc not found, using dummy regulator
pca953x 10-0020: Linked as a consumer to regulator.0
pca953x 10-0020: using no AI
platform mdio-mux: Linked as a sync state only consumer to 10-0022
i2c 10-0022: Linked as a consumer to 4004c000.gpio
i2c 10-0022: Linked as a consumer to 40048000.iomuxc
pca953x 10-0022: supply vcc not found, using dummy regulator
pca953x 10-0022: Linked as a consumer to regulator.0
pca953x 10-0022: using no AI
pca953x 10-0022: interrupt support not compiled in
i2c i2c-10: IMX I2C adapter registered
i2c i2c-10: using dma0chan2 (tx) and dma0chan3 (rx) for DMA transfers
imx-i2c 40066000.i2c: Dropping the link to 4004c000.gpio
mdio_bus 0.1: Linked as a sync state only consumer to 40049000.gpio
mdio_bus 0.1: Linked as a sync state only consumer to 40048000.iomuxc
libphy: mdio_mux: probed
mdio_bus 0.1:00: Linked as a consumer to 40049000.gpio
mdio_bus 0.1:00: Linked as a consumer to 40048000.iomuxc
mdio_bus 0.1:00: Linked as a sync state only consumer to 0.1:00
mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1
mdio_bus !mdio-mux!mdio@1!switch@0!mdio: Linked as a sync state only consumer to 0.1:00
libphy: mdio: probed
mmc0: host does not support reading read-only switch, assuming write-enable
mmc0: new high speed SDHC card at address 0007
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:00: Linked as a consumer to 0.1:00
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:00: probe deferral - supplier 0.1:00 not ready
mmcblk0: mmc0:0007 SD4GB 3.71 GiB 
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: Linked as a consumer to 0.1:00
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: probe deferral - supplier 0.1:00 not ready
 mmcblk0: p1 p2 p3 < p5 >
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: Linked as a consumer to 0.1:00
mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: probe deferral - supplier 0.1:00 not ready
mv88e6085 0.1:00: Dropping the link to 0.1:00
mdio_bus 0.2: Linked as a sync state only consumer to 40049000.gpio
mdio_bus 0.2: Linked as a sync state only consumer to 40048000.iomuxc
libphy: mdio_mux: probed
mdio_bus 0.2:00: Linked as a consumer to 40049000.gpio
mdio_bus 0.2:00: Linked as a consumer to 40048000.iomuxc
mdio_bus 0.2:00: Linked as a sync state only consumer to 0.2:00
mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1
mdio_bus !mdio-mux!mdio@2!switch@0!mdio: Linked as a sync state only consumer to 0.2:00
libphy: mdio: probed
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:00: Linked as a consumer to 0.2:00
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:00: probe deferral - supplier 0.2:00 not ready
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: Linked as a consumer to 0.2:00
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: probe deferral - supplier 0.2:00 not ready
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: Linked as a consumer to 0.2:00
mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: probe deferral - supplier 0.2:00 not ready
mv88e6085 0.2:00: Dropping the link to 0.2:00
mdio_bus 0.4: Linked as a sync state only consumer to 10-0022
libphy: mdio_mux: probed
mdio_bus 0.4:00: Linked as a sync state only consumer to 10-0022
mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2
libphy: mdio: probed
mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
mv88e6085 0.1:00: configuring for fixed/rgmii-txid link mode
mv88e6085 0.1:00: configuring for fixed/ link mode
mv88e6085 0.1:00: Link is Up - 1Gbps/Full - flow control off
mv88e6085 0.1:00: Link is Up - 100Mbps/Full - flow control off
mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
mv88e6085 0.2:00: configuring for fixed/rgmii-txid link mode
mv88e6085 0.2:00: configuring for fixed/rgmii-txid link mode
mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL)
mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL)
mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL)
mv88e6085 0.4:00: configuring for fixed/rgmii-txid link mode
mv88e6085 0.1:00: Linked as a consumer to 400d1000.ethernet
DSA: tree 0 setup
mv88e6085 0.4:00: Dropping the link to 10-0022
libphy: mdio_mux: probed
mdio-mux-gpio mdio-mux: Dropping the link to 10-0022

      Andrew
Vladimir Oltean Sept. 1, 2021, 1:28 a.m. UTC | #21
On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > Rev B is interesting because switch0 and switch1 got genphy, while
> > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > interrupt properties, so don't loop back to their parent device.
> 
> This is interesting and not what I really expected to happen. It goes to
> show that we really need more time to understand all the subtleties of
> device dependencies before jumping on patching stuff.

There is an even more interesting variation which I would like to point
out. It seems like a very odd loophole in the device links.

Take the example of the mv88e6xxx DSA driver. On my board
(arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I
had to declare the switches as interrupt controller and add interrupts
to their internal PHYs, I still need considerable force to 'break' this
board in the way discussed in this thread. The correct PHY driver insists
to probe, and not genphy. Let me explain.

The automatic device links between the switch (supplier, as interrupt-controller)
and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle.

Important note: fwnode_link_add does not link devices, it links OF nodes.

Even more important node, in the form of a comment:

 * The driver core will use the fwnode link to create a device link between the
 * two device objects corresponding to @con and @sup when they are created. The
 * driver core will automatically delete the fwnode link between @con and @sup
 * after doing that.

Okay?!

What seems to be omitted is that the DSA switch driver's probing itself
can be deferred. For example:

dsa_register_switch
-> dsa_switch_probe
   -> dsa_switch_parse_of
      -> dsa_switch_parse_ports_of
         -> dsa_port_parse_of
            -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0));
            -> not found => return -EPROBE_DEFER

When dsa_register_switch() returns -EPROBE_DEFER, it is effectively
an error path. So the reverse of initialization is performed.

The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_
dsa_register_switch. So when dsa_register_switch returns error code,
mv88e6xxx_mdios_unregister() will be called.

When mv88e6xxx_mdios_unregister() is called, the MDIO buses with
internal PHYs are destroyed. So the PHY devices themselves are destroyed
too. And the device links between the DSA switch and the internal PHYs,
those created based on the firmware node links created by fwnode_link_add,
are dropped too.

Now remember the comment that the device links created based on
fwnode_link_add are not restored.

So probing of the DSA switch finally resumes, and this time
device_links_check_suppliers() is effectively bypassed, the PHYs no
longer request probe deferral due to their supplier not being ready,
because the device link no longer exists.

Isn't this self-sabotaging?!

Now generally, DSA drivers defer probing because they probe in parallel
with the DSA master. This is typical if the switch is on a SPI bus, or
I2C, or on an MDIO bus provided by a _standalone_ MDIO controller.

If the MDIO controller is not standalone, but is provided by Ethernet
controller that is the DSA master itself, then things change a lot,
because probing can never be parallel. The DSA master probes,
initializes its MDIO bus, and this triggers the probing of the MDIO
devices on that bus, one of which is the DSA switch. So DSA can no
longer defer the probe due to that reason.

Secondly, in DSA we even have variation between drivers as to where they
register their internal MDIO buses. The mv88e6xxx driver does this in
mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb
driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this
is important. DSA provides drivers with a .setup() callback, which is
guaranteed to take place after nothing can defer the switch's probe
anymore.

So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register
from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this
setup, because the device links framework isn't sabotaging itself anymore.

Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register()
from the probe method and not the setup method, the entire design issue
with interrupts on internal DSA switch ports would have went absolutely
unnoticed for a few more years.

I have not tested this, but it also seems plausible that DSA can
trivially and reliably bypass any fw_devlink=on restrictions by simply
moving all of_mdiobus_register() driver calls from the .setup() method
to their respective probe methods (prior to calling dsa_register_switch),
then effectively fabricate an -EPROBE_DEFER during the first probe attempt.
I mean, who will know whether that probe deferral request was justified
or not?

Anyway, I'm not sure everyone agrees with this type of "solution" (even
though it's worth pointing it out as a fw_devlink limitation). In any
case, we need some sort of lightweight "fix" to the chicken-and-egg
problem, which will give me enough time to think of something better.
I hope it is at least clearer now that there are subtleties and nuances,
and we cannot just assess how many boards are broken by looking at the
device trees. By design, all are, sure, but they might still work, and
that's better than nothing...
Vladimir Oltean Sept. 1, 2021, 1:38 a.m. UTC | #22
On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote:
> On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > interrupt properties, so don't loop back to their parent device.
> > 
> > This is interesting and not what I really expected to happen. It goes to
> > show that we really need more time to understand all the subtleties of
> > device dependencies before jumping on patching stuff.
> 
> There is an even more interesting variation which I would like to point
> out. It seems like a very odd loophole in the device links.
> 
> Take the example of the mv88e6xxx DSA driver. On my board
> (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I
> had to declare the switches as interrupt controller and add interrupts
> to their internal PHYs, I still need considerable force to 'break' this
> board in the way discussed in this thread. The correct PHY driver insists
> to probe, and not genphy. Let me explain.
> 
> The automatic device links between the switch (supplier, as interrupt-controller)
> and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle.
> 
> Important note: fwnode_link_add does not link devices, it links OF nodes.
> 
> Even more important node, in the form of a comment:
> 
>  * The driver core will use the fwnode link to create a device link between the
>  * two device objects corresponding to @con and @sup when they are created. The
>  * driver core will automatically delete the fwnode link between @con and @sup
>  * after doing that.
> 
> Okay?!
> 
> What seems to be omitted is that the DSA switch driver's probing itself
> can be deferred. For example:
> 
> dsa_register_switch
> -> dsa_switch_probe
>    -> dsa_switch_parse_of
>       -> dsa_switch_parse_ports_of
>          -> dsa_port_parse_of
>             -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0));
>             -> not found => return -EPROBE_DEFER
> 
> When dsa_register_switch() returns -EPROBE_DEFER, it is effectively
> an error path. So the reverse of initialization is performed.
> 
> The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_
> dsa_register_switch. So when dsa_register_switch returns error code,
> mv88e6xxx_mdios_unregister() will be called.
> 
> When mv88e6xxx_mdios_unregister() is called, the MDIO buses with
> internal PHYs are destroyed. So the PHY devices themselves are destroyed
> too. And the device links between the DSA switch and the internal PHYs,
> those created based on the firmware node links created by fwnode_link_add,
> are dropped too.
> 
> Now remember the comment that the device links created based on
> fwnode_link_add are not restored.
> 
> So probing of the DSA switch finally resumes, and this time
> device_links_check_suppliers() is effectively bypassed, the PHYs no
> longer request probe deferral due to their supplier not being ready,
> because the device link no longer exists.
> 
> Isn't this self-sabotaging?!
> 
> Now generally, DSA drivers defer probing because they probe in parallel
> with the DSA master. This is typical if the switch is on a SPI bus, or
> I2C, or on an MDIO bus provided by a _standalone_ MDIO controller.
> 
> If the MDIO controller is not standalone, but is provided by Ethernet
> controller that is the DSA master itself, then things change a lot,
> because probing can never be parallel. The DSA master probes,
> initializes its MDIO bus, and this triggers the probing of the MDIO
> devices on that bus, one of which is the DSA switch. So DSA can no
> longer defer the probe due to that reason.
> 
> Secondly, in DSA we even have variation between drivers as to where they
> register their internal MDIO buses. The mv88e6xxx driver does this in
> mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb
> driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this
> is important. DSA provides drivers with a .setup() callback, which is
> guaranteed to take place after nothing can defer the switch's probe
> anymore.
> 
> So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register
> from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this
> setup, because the device links framework isn't sabotaging itself anymore.
> 
> Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register()
> from the probe method and not the setup method, the entire design issue
> with interrupts on internal DSA switch ports would have went absolutely
> unnoticed for a few more years.
> 
> I have not tested this, but it also seems plausible that DSA can
> trivially and reliably bypass any fw_devlink=on restrictions by simply
> moving all of_mdiobus_register() driver calls from the .setup() method
> to their respective probe methods (prior to calling dsa_register_switch),
> then effectively fabricate an -EPROBE_DEFER during the first probe attempt.
> I mean, who will know whether that probe deferral request was justified
> or not?

Pushing the thought even further, it is not even necessary to move the
of_mdiobus_register() call to the probe function. Where it is (in .setup)
is already good enough. It is sufficient to return -EOPNOTSUPP once
(the first time) immediately _after_ the call to of_mdiobus_register
(and have a proper error path, i.e. call mdiobus_unregister too).

> Anyway, I'm not sure everyone agrees with this type of "solution" (even
> though it's worth pointing it out as a fw_devlink limitation). In any
> case, we need some sort of lightweight "fix" to the chicken-and-egg
> problem, which will give me enough time to think of something better.
> I hope it is at least clearer now that there are subtleties and nuances,
> and we cannot just assess how many boards are broken by looking at the
> device trees. By design, all are, sure, but they might still work, and
> that's better than nothing...
Saravana Kannan Sept. 1, 2021, 1:46 a.m. UTC | #23
On Tue, Aug 31, 2021 at 4:02 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > If the switches are broken without the phy-handle or ethernet change,
> > I'm not sure if the "BROKEN_PARENT" patch would help.
>
> > > Which is not enough to fix these Ethernet switches.
> >
> > Ok, if you can give more specifics on this, I'll look into it.
>
> The switches probe, but get the wrong PHY driver, genphy, not the
> Marvell PHY driver. And genphy is not sufficient for this hardware.
>
> I'd need:
> > 1) The DTS file that you see the issue on.
>
> I did the bisect on arch/arm/boot/dts/vf610-zii-dev-rev-c.dts but i
> also tested arch/arm/boot/dts/vf610-zii-dev-rev-b.dts.

Thanks for the detailed info Andrew. I looked at the DT files. So
yeah, this is similar to the realtek issue and my generic fix for DSA
should work for all of these unless I'm forgetting something. Please
let me know if it doesn't.

-Saravana

> Rev B is interesting because switch0 and switch1 got genphy, while
> switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> interrupt properties, so don't loop back to their parent device.
>
> Here is Rev B. I trimmed out other devices probing in parallel:
>
> [    1.029100] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
> [    1.034735] fec 400d1000.ethernet: Using random MAC address: 42:f2:14:33:78:f5
> [    1.042272] libphy: fec_enet_mii_bus: probed
> [    1.455932] libphy: mdio_mux: probed
> [    1.459432] mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1
> [    1.494076] libphy: mdio: probed
> [    1.518958] libphy: mdio_mux: probed
> [    1.522553] mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1
> [    1.537295] libphy: mdio: probed
> [    1.556571] libphy: mdio_mux: probed
> [    1.559719] mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2
> [    1.574614] libphy: mdio: probed
> [    1.733104] mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
> [    1.750737] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
> [    1.768273] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
> [    1.806561] mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL)
> [    1.824033] mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
> [    1.841496] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
> [    1.943535] mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL)
> [    2.003529] mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL)
> [    2.063535] mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL)
> [    2.084768] DSA: tree 0 setup
> [    2.087791] libphy: mdio_mux: probed
> [    2.265477] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=POLL)
>
> root@zii-devel-b:~# cat /sys/kernel/debug/devices_deferred
> root@zii-devel-b:~#
>
> For Rev C we see:
>
> [    1.244417] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00
> [    1.250081] fec 400d1000.ethernet: Using random MAC address: c6:42:89:ed:5f:dd
> [    1.257507] libphy: fec_enet_mii_bus: probed
> [    1.570725] libphy: mdio_mux: probed
> [    1.574208] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
> [    1.590272] libphy: mdio: probed
> [    1.627721] libphy: mdio_mux: probed
> [    1.631222] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1
> [    1.659643] libphy: mdio: probed
> [    1.811665] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
> [    1.829230] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
> [    1.845884] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
> [    1.863237] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
> [    1.884078] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL)
> [    1.901630] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL)
> [    1.918287] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL)
> [    1.933721] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL)
> [    1.948722] DSA: tree 0 setup
> [    1.951599] libphy: mdio_mux: probed
>
> [   21.565550] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=48)
>
> I have Rev B using NFS root, so the interfaces are configured up by
> the kernel during boot. Rev C has a local root filesystem, so user
> space brings the interfaces up, and it is only when the FEC is opened
> does it attach to the Micrel PHY. That explains the difference between
> 2.265 and 21.565 seconds for the last line.
>
> Again, nothing deferred.
>
>        Andrew
Saravana Kannan Sept. 1, 2021, 2 a.m. UTC | #24
On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > Rev B is interesting because switch0 and switch1 got genphy, while
> > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > interrupt properties, so don't loop back to their parent device.
>
> This is interesting and not what I really expected to happen. It goes to
> show that we really need more time to understand all the subtleties of
> device dependencies before jumping on patching stuff.
>
> In case the DSA tree contains more than one switch, different things
> will happen in dsa_register_switch().
> The tree itself is only initialized when the last switch calls
> dsa_register_switch(). All the other switches just mark themselves as
> present and exit probing early. See this piece of code in dsa_tree_setup:
>
>         complete = dsa_tree_setup_routing_table(dst);
>         if (!complete)
>                 return 0;
>
> So it should be a general property of cross-chip DSA trees that all
> switches except the last one will have the specific PHY driver probed
> properly, and not the genphy.
>
> Because all (N - 1) switches of a tree exit early in dsa_register_switch,
> they have successfully probed by the time the last switch brings up the
> tree, and brings up the PHYs on behalf of every other switch.
>
> The last switch can connect to the PHY on behalf of the other switches
> past their probe ending, and those PHYs should not defer probing because
> their supplier is now probed. It is only that the last switch cannot
> connect to the PHYs of its own ports.

I'm not saying this with any intention of making things easier for me
(I'm not even sure it does). But your description about how multiple
switches are handled by DSA has me even more convinced than before
that DSA needs to use a component device model. This is like the
textbook example for component devices.


-Saravana
Saravana Kannan Sept. 1, 2021, 2:19 a.m. UTC | #25
On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote:
> > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > > interrupt properties, so don't loop back to their parent device.
> > >
> > > This is interesting and not what I really expected to happen. It goes to
> > > show that we really need more time to understand all the subtleties of
> > > device dependencies before jumping on patching stuff.
> >
> > There is an even more interesting variation which I would like to point
> > out. It seems like a very odd loophole in the device links.
> >
> > Take the example of the mv88e6xxx DSA driver. On my board
> > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I
> > had to declare the switches as interrupt controller and add interrupts
> > to their internal PHYs, I still need considerable force to 'break' this
> > board in the way discussed in this thread. The correct PHY driver insists
> > to probe, and not genphy. Let me explain.
> >
> > The automatic device links between the switch (supplier, as interrupt-controller)
> > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle.
> >
> > Important note: fwnode_link_add does not link devices, it links OF nodes.
> >
> > Even more important node, in the form of a comment:
> >
> >  * The driver core will use the fwnode link to create a device link between the
> >  * two device objects corresponding to @con and @sup when they are created. The
> >  * driver core will automatically delete the fwnode link between @con and @sup
> >  * after doing that.
> >
> > Okay?!
> >
> > What seems to be omitted is that the DSA switch driver's probing itself
> > can be deferred. For example:
> >
> > dsa_register_switch
> > -> dsa_switch_probe
> >    -> dsa_switch_parse_of
> >       -> dsa_switch_parse_ports_of
> >          -> dsa_port_parse_of
> >             -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0));
> >             -> not found => return -EPROBE_DEFER
> >
> > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively
> > an error path. So the reverse of initialization is performed.
> >
> > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_
> > dsa_register_switch. So when dsa_register_switch returns error code,
> > mv88e6xxx_mdios_unregister() will be called.
> >
> > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with
> > internal PHYs are destroyed. So the PHY devices themselves are destroyed
> > too. And the device links between the DSA switch and the internal PHYs,
> > those created based on the firmware node links created by fwnode_link_add,
> > are dropped too.
> >
> > Now remember the comment that the device links created based on
> > fwnode_link_add are not restored.
> >
> > So probing of the DSA switch finally resumes, and this time
> > device_links_check_suppliers() is effectively bypassed, the PHYs no
> > longer request probe deferral due to their supplier not being ready,
> > because the device link no longer exists.
> >
> > Isn't this self-sabotaging?!

Yeah, this is a known "issue". I'm saying "issue" because at worst
it'd allow a few unnecessary deferred probes. And if you want to break
or get fw_devlink to ignore your child devices or your consumers,
there are simpler APIs to do it without having to intentionally defer
a probe.  Fixing this "issue" would just use up more memory and
increase boot time for no meaningful benefit.

> >
> > Now generally, DSA drivers defer probing because they probe in parallel
> > with the DSA master. This is typical if the switch is on a SPI bus, or
> > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller.
> >
> > If the MDIO controller is not standalone, but is provided by Ethernet
> > controller that is the DSA master itself, then things change a lot,
> > because probing can never be parallel. The DSA master probes,
> > initializes its MDIO bus, and this triggers the probing of the MDIO
> > devices on that bus, one of which is the DSA switch. So DSA can no
> > longer defer the probe due to that reason.
> >
> > Secondly, in DSA we even have variation between drivers as to where they
> > register their internal MDIO buses. The mv88e6xxx driver does this in
> > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb
> > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this
> > is important. DSA provides drivers with a .setup() callback, which is
> > guaranteed to take place after nothing can defer the switch's probe
> > anymore.
> >
> > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register
> > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this
> > setup, because the device links framework isn't sabotaging itself anymore.
> >
> > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register()
> > from the probe method and not the setup method, the entire design issue
> > with interrupts on internal DSA switch ports would have went absolutely
> > unnoticed for a few more years.
> >
> > I have not tested this, but it also seems plausible that DSA can
> > trivially and reliably bypass any fw_devlink=on restrictions by simply
> > moving all of_mdiobus_register() driver calls from the .setup() method
> > to their respective probe methods (prior to calling dsa_register_switch),
> > then effectively fabricate an -EPROBE_DEFER during the first probe attempt.
> > I mean, who will know whether that probe deferral request was justified
> > or not?
>
> Pushing the thought even further, it is not even necessary to move the
> of_mdiobus_register() call to the probe function. Where it is (in .setup)
> is already good enough. It is sufficient to return -EOPNOTSUPP once
> (the first time) immediately _after_ the call to of_mdiobus_register
> (and have a proper error path, i.e. call mdiobus_unregister too).

Right, there are plenty of ways to intentionally break fw_devlink. I
hope that's not the point :) And I don't think -EOPNOTSUPP would work
because your device wouldn't be probed again.

>
> > Anyway, I'm not sure everyone agrees with this type of "solution" (even
> > though it's worth pointing it out as a fw_devlink limitation). In any
> > case, we need some sort of lightweight "fix" to the chicken-and-egg
> > problem, which will give me enough time to think of something better.

I think the generic DSA patch I gave would be the lightweight fix to
address this chicken-and-egg issue.

As for the long term fix, I'd really suggest looking into using the
component device model. I'd even be happy to help make any driver
core/component device improvements you might need.

I'm also interested in looking into improving the PHY probing so that
the genphy never probes a device that has a driver that could probe
it. Even outside of all this fw_devlink thing, they way PHY is handled
now, if any of the supplier really isn't ready yet (say a clock), then
the genphy gets used -- which isn't good.

-Saravana

> > I hope it is at least clearer now that there are subtleties and nuances,
> > and we cannot just assess how many boards are broken by looking at the
> > device trees. By design, all are, sure, but they might still work, and
> > that's better than nothing...
Vladimir Oltean Sept. 1, 2021, 8:46 a.m. UTC | #26
On Tue, Aug 31, 2021 at 07:00:58PM -0700, Saravana Kannan wrote:
> On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > interrupt properties, so don't loop back to their parent device.
> >
> > This is interesting and not what I really expected to happen. It goes to
> > show that we really need more time to understand all the subtleties of
> > device dependencies before jumping on patching stuff.
> >
> > In case the DSA tree contains more than one switch, different things
> > will happen in dsa_register_switch().
> > The tree itself is only initialized when the last switch calls
> > dsa_register_switch(). All the other switches just mark themselves as
> > present and exit probing early. See this piece of code in dsa_tree_setup:
> >
> >         complete = dsa_tree_setup_routing_table(dst);
> >         if (!complete)
> >                 return 0;
> >
> > So it should be a general property of cross-chip DSA trees that all
> > switches except the last one will have the specific PHY driver probed
> > properly, and not the genphy.
> >
> > Because all (N - 1) switches of a tree exit early in dsa_register_switch,
> > they have successfully probed by the time the last switch brings up the
> > tree, and brings up the PHYs on behalf of every other switch.
> >
> > The last switch can connect to the PHY on behalf of the other switches
> > past their probe ending, and those PHYs should not defer probing because
> > their supplier is now probed. It is only that the last switch cannot
> > connect to the PHYs of its own ports.
> 
> I'm not saying this with any intention of making things easier for me
> (I'm not even sure it does). But your description about how multiple
> switches are handled by DSA has me even more convinced than before
> that DSA needs to use a component device model. This is like the
> textbook example for component devices.

In this example, I guess the component master would be the "struct dsa_switch_tree",
but there is no struct device associated with it.

How many "struct dsa_switch_tree" instances there are in a system
depends on whether OF is used or not.

If we use OF, the device tree needs to be parsed, and every unique first
cell (tree-id) of:
	dsa,member = <tree-id switch-id>;
constitutes a different "struct dsa_switch_tree".

If we do not use OF, the number of switch trees in a system is one, see dsa_switch_parse.

It seems to me like the compare function for component_match (where each
component is a "struct dsa_switch" should look at dev->of_node and parse
the "dsa,member" property again, and match on the same tree-id as the
component master itself?

There's also the question of how to do the component_match in a way that
also works for the pdata/non-OF based DSA systems (of which I have none to test).

All of this to move dsa_tree_setup() outside of the probe calling
context of any individual struct dsa_switch, and into the "bind" calling
context of the component master associated with the struct dsa_switch_tree.
This would allow the phy_connect()/phy_attach_direct() calls to find the
PHY device already bound to the specific driver, which would avoid
binding genphy as a last resort?

Two questions:

- Why would it now be more guaranteed that the PHY drivers are bound to
  the internal PHY devices exactly during the time span between events
  (a) Switch driver (a component of the switch tree) finishes probing
  (b) Switch tree (the component master) starts binding
  I mean in lack of any guarantee, we can still end up in a situation
  where the specific PHY driver still is not bound early enough to the
  internal PHY to be available by the time we call phylink_of_phy_connect,
  and we have all those component device goodies but they don't help.
  I'm sure I'm misunderstanding something but I don't know what.

- What if the internal PHY has other suppliers beyond the interrupt-parent?
  What if, say, it has a reset-gpios = <&gpio1>, where gpio1 is provided
  by some other thing on some other slow bus, which is equally slow (or
  slower) to probe to the DSA switch itself. So the temporary absence of
  this other supplier is causing the specific PHY driver to defer probing,
  just enough for a concurrent call to phylink_of_phy_connect -> phy_attach_direct
  to say "ok, I've waited enough and there is no driver, genphy it is".
  How would this be avoided? Or are you thinking of some kind of two-level
  component driver system:
  - the DSA switch is a component master, with components being its
    sub-devices such as internal PHYs etc
  - the DSA switch is also a component of the DSA switch tree
  But in that case, why is it even relevant to model the DSA switch tree
  probing as a component master, and why does it help? I don't care if
  it's a "textbook example" or not if it doesn't help.
Vladimir Oltean Sept. 1, 2021, 9:02 a.m. UTC | #27
On Tue, Aug 31, 2021 at 07:19:40PM -0700, Saravana Kannan wrote:
> On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote:
> > > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > > > interrupt properties, so don't loop back to their parent device.
> > > >
> > > > This is interesting and not what I really expected to happen. It goes to
> > > > show that we really need more time to understand all the subtleties of
> > > > device dependencies before jumping on patching stuff.
> > >
> > > There is an even more interesting variation which I would like to point
> > > out. It seems like a very odd loophole in the device links.
> > >
> > > Take the example of the mv88e6xxx DSA driver. On my board
> > > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I
> > > had to declare the switches as interrupt controller and add interrupts
> > > to their internal PHYs, I still need considerable force to 'break' this
> > > board in the way discussed in this thread. The correct PHY driver insists
> > > to probe, and not genphy. Let me explain.
> > >
> > > The automatic device links between the switch (supplier, as interrupt-controller)
> > > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle.
> > >
> > > Important note: fwnode_link_add does not link devices, it links OF nodes.
> > >
> > > Even more important node, in the form of a comment:
> > >
> > >  * The driver core will use the fwnode link to create a device link between the
> > >  * two device objects corresponding to @con and @sup when they are created. The
> > >  * driver core will automatically delete the fwnode link between @con and @sup
> > >  * after doing that.
> > >
> > > Okay?!
> > >
> > > What seems to be omitted is that the DSA switch driver's probing itself
> > > can be deferred. For example:
> > >
> > > dsa_register_switch
> > > -> dsa_switch_probe
> > >    -> dsa_switch_parse_of
> > >       -> dsa_switch_parse_ports_of
> > >          -> dsa_port_parse_of
> > >             -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0));
> > >             -> not found => return -EPROBE_DEFER
> > >
> > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively
> > > an error path. So the reverse of initialization is performed.
> > >
> > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_
> > > dsa_register_switch. So when dsa_register_switch returns error code,
> > > mv88e6xxx_mdios_unregister() will be called.
> > >
> > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with
> > > internal PHYs are destroyed. So the PHY devices themselves are destroyed
> > > too. And the device links between the DSA switch and the internal PHYs,
> > > those created based on the firmware node links created by fwnode_link_add,
> > > are dropped too.
> > >
> > > Now remember the comment that the device links created based on
> > > fwnode_link_add are not restored.
> > >
> > > So probing of the DSA switch finally resumes, and this time
> > > device_links_check_suppliers() is effectively bypassed, the PHYs no
> > > longer request probe deferral due to their supplier not being ready,
> > > because the device link no longer exists.
> > >
> > > Isn't this self-sabotaging?!
> 
> Yeah, this is a known "issue". I'm saying "issue" because at worst
> it'd allow a few unnecessary deferred probes. And if you want to break
> or get fw_devlink to ignore your child devices or your consumers,
> there are simpler APIs to do it without having to intentionally defer
> a probe.  Fixing this "issue" would just use up more memory and
> increase boot time for no meaningful benefit.

But I mean, if the goal of fw_devlink is to infer a probing order based
on phandles, and it is faced with a long chain of devices, then any
-EPROBE_DEFER of a device on top of the chain will break the probing
order for all devices beneath it. It is self-defeating, it is already
memory used for nothing.

> > >
> > > Now generally, DSA drivers defer probing because they probe in parallel
> > > with the DSA master. This is typical if the switch is on a SPI bus, or
> > > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller.
> > >
> > > If the MDIO controller is not standalone, but is provided by Ethernet
> > > controller that is the DSA master itself, then things change a lot,
> > > because probing can never be parallel. The DSA master probes,
> > > initializes its MDIO bus, and this triggers the probing of the MDIO
> > > devices on that bus, one of which is the DSA switch. So DSA can no
> > > longer defer the probe due to that reason.
> > >
> > > Secondly, in DSA we even have variation between drivers as to where they
> > > register their internal MDIO buses. The mv88e6xxx driver does this in
> > > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb
> > > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this
> > > is important. DSA provides drivers with a .setup() callback, which is
> > > guaranteed to take place after nothing can defer the switch's probe
> > > anymore.
> > >
> > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register
> > > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this
> > > setup, because the device links framework isn't sabotaging itself anymore.
> > >
> > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register()
> > > from the probe method and not the setup method, the entire design issue
> > > with interrupts on internal DSA switch ports would have went absolutely
> > > unnoticed for a few more years.
> > >
> > > I have not tested this, but it also seems plausible that DSA can
> > > trivially and reliably bypass any fw_devlink=on restrictions by simply
> > > moving all of_mdiobus_register() driver calls from the .setup() method
> > > to their respective probe methods (prior to calling dsa_register_switch),
> > > then effectively fabricate an -EPROBE_DEFER during the first probe attempt.
> > > I mean, who will know whether that probe deferral request was justified
> > > or not?
> >
> > Pushing the thought even further, it is not even necessary to move the
> > of_mdiobus_register() call to the probe function. Where it is (in .setup)
> > is already good enough. It is sufficient to return -EOPNOTSUPP once
> > (the first time) immediately _after_ the call to of_mdiobus_register
> > (and have a proper error path, i.e. call mdiobus_unregister too).
> 
> Right, there are plenty of ways to intentionally break fw_devlink. I
> hope that's not the point :) And I don't think -EOPNOTSUPP would work
> because your device wouldn't be probed again.

Yes, -EPROBE_DEFER is what I meant.

> >
> > > Anyway, I'm not sure everyone agrees with this type of "solution" (even
> > > though it's worth pointing it out as a fw_devlink limitation). In any
> > > case, we need some sort of lightweight "fix" to the chicken-and-egg
> > > problem, which will give me enough time to think of something better.
> 
> I think the generic DSA patch I gave would be the lightweight fix to
> address this chicken-and-egg issue.
> 
> As for the long term fix, I'd really suggest looking into using the
> component device model. I'd even be happy to help make any driver
> core/component device improvements you might need.
> 
> I'm also interested in looking into improving the PHY probing so that
> the genphy never probes a device that has a driver that could probe
> it. Even outside of all this fw_devlink thing, they way PHY is handled
> now, if any of the supplier really isn't ready yet (say a clock), then
> the genphy gets used -- which isn't good.

I think this is the real problem which needs to be addressed. The
trouble is, I don't know if phy_attach_direct can find out the reason
for which d->driver is NULL, i.e. that there was a driver which matched
and attempted the probe, but returned -EPROBE_DEFER.

> -Saravana
> 
> > > I hope it is at least clearer now that there are subtleties and nuances,
> > > and we cannot just assess how many boards are broken by looking at the
> > > device trees. By design, all are, sure, but they might still work, and
> > > that's better than nothing...
Saravana Kannan Sept. 1, 2021, 10:53 p.m. UTC | #28
On Wed, Sep 1, 2021 at 1:46 AM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Tue, Aug 31, 2021 at 07:00:58PM -0700, Saravana Kannan wrote:
> > On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > >
> > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > > interrupt properties, so don't loop back to their parent device.
> > >
> > > This is interesting and not what I really expected to happen. It goes to
> > > show that we really need more time to understand all the subtleties of
> > > device dependencies before jumping on patching stuff.
> > >
> > > In case the DSA tree contains more than one switch, different things
> > > will happen in dsa_register_switch().
> > > The tree itself is only initialized when the last switch calls
> > > dsa_register_switch(). All the other switches just mark themselves as
> > > present and exit probing early. See this piece of code in dsa_tree_setup:
> > >
> > >         complete = dsa_tree_setup_routing_table(dst);
> > >         if (!complete)
> > >                 return 0;
> > >
> > > So it should be a general property of cross-chip DSA trees that all
> > > switches except the last one will have the specific PHY driver probed
> > > properly, and not the genphy.
> > >
> > > Because all (N - 1) switches of a tree exit early in dsa_register_switch,
> > > they have successfully probed by the time the last switch brings up the
> > > tree, and brings up the PHYs on behalf of every other switch.
> > >
> > > The last switch can connect to the PHY on behalf of the other switches
> > > past their probe ending, and those PHYs should not defer probing because
> > > their supplier is now probed. It is only that the last switch cannot
> > > connect to the PHYs of its own ports.
> >
> > I'm not saying this with any intention of making things easier for me
> > (I'm not even sure it does). But your description about how multiple
> > switches are handled by DSA has me even more convinced than before
> > that DSA needs to use a component device model. This is like the
> > textbook example for component devices.
>
> In this example, I guess the component master would be the "struct dsa_switch_tree",

Right.

> but there is no struct device associated with it.

We can create one? I don't think it needs to have a DT node. And if it
does, this is where my "I'm willing to help improve component device"
offer comes in to help make it a bit more generic.

>
> How many "struct dsa_switch_tree" instances there are in a system
> depends on whether OF is used or not.
>
> If we use OF, the device tree needs to be parsed, and every unique first
> cell (tree-id) of:
>         dsa,member = <tree-id switch-id>;
> constitutes a different "struct dsa_switch_tree".
>
> If we do not use OF, the number of switch trees in a system is one, see dsa_switch_parse.
>
> It seems to me like the compare function for component_match (where each
> component is a "struct dsa_switch" should look at dev->of_node and parse
> the "dsa,member" property again, and match on the same tree-id as the
> component master itself?

I don't know enough about DSA to give a useful answer here. But I guess so?

>
> There's also the question of how to do the component_match in a way that
> also works for the pdata/non-OF based DSA systems (of which I have none to test).

You could always just short circuit it and not create a component
device if it's just pdata/non-OF. That's one option.

> All of this to move dsa_tree_setup() outside of the probe calling
> context of any individual struct dsa_switch, and into the "bind" calling
> context of the component master associated with the struct dsa_switch_tree.

Right.

> This would allow the phy_connect()/phy_attach_direct() calls to find the
> PHY device already bound to the specific driver, which would avoid
> binding genphy as a last resort?

Short answer, yes. Long answer: this would fix multiple things:
1) Remove the parent's probe from depending on the child's probe().
This is not guaranteed at all, so we'd fix this bad assumption in the
code.
2) It would allow the PHYs to probe with fw_devlink because the switch
would have completed probing.
3) It'd avoid the bad design of the last switch's probe doing all the
PHY handling for the previous N-1 switches. What if something fails
there? Shouldn't it be one of the previous switches that should report
the failure (either in probe or switch registration or whatever?)? The
component device model would allow each switch to do all it's own work
and then the component master can do the "tying up" of all these
switches and PHYs.

>
> Two questions:
>
> - Why would it now be more guaranteed that the PHY drivers are bound to
>   the internal PHY devices exactly during the time span between events
>   (a) Switch driver (a component of the switch tree) finishes probing
>   (b) Switch tree (the component master) starts binding

Firstly, PHYs won't defer probe due to fw_devlink enforcing their
dependency on the switch and will actually have their probe() called
(and possibly succeed -- see more below).

>   I mean in lack of any guarantee, we can still end up in a situation
>   where the specific PHY driver still is not bound early enough to the
>   internal PHY to be available by the time we call phylink_of_phy_connect,
>   and we have all those component device goodies but they don't help.
>   I'm sure I'm misunderstanding something but I don't know what.
>
> - What if the internal PHY has other suppliers beyond the interrupt-parent?
>   What if, say, it has a reset-gpios = <&gpio1>, where gpio1 is provided
>   by some other thing on some other slow bus, which is equally slow (or
>   slower) to probe to the DSA switch itself. So the temporary absence of
>   this other supplier is causing the specific PHY driver to defer probing,
>   just enough for a concurrent call to phylink_of_phy_connect -> phy_attach_direct
>   to say "ok, I've waited enough and there is no driver, genphy it is".
>   How would this be avoided?

Good question and this is another reason for me suggesting the use of
component model.

> Or are you thinking of some kind of two-level
>   component driver system:
>   - the DSA switch is a component master, with components being its
>     sub-devices such as internal PHYs etc
>   - the DSA switch is also a component of the DSA switch tree

I was thinking of one component master with all the devices that make
up the DSA switch tree. I don't think there's any requirement that all
the component devices need to be of the same type. That way, the DSA
switch tree won't even be attempted until all the devices are ready.

One thing that's not clear to me wrt using specific driver vs the
genphy -- at what point is it okay to give up waiting for a specific
driver? This is more of a question to the maintainers than what
happens today. What if the specific driver is a module that's loaded
after the switch's driver? There's no time bound to this event. Are we
going to put the restriction that all the PHY's drivers need to be
registered/loaded before the switch's driver? If that's the decision,
that's okay by me. But I just want to understand the requirements.

Also see my reply to your other email.

-Saravana
Saravana Kannan Sept. 1, 2021, 10:57 p.m. UTC | #29
On Wed, Sep 1, 2021 at 2:02 AM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Tue, Aug 31, 2021 at 07:19:40PM -0700, Saravana Kannan wrote:
> > On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> > >
> > > On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote:
> > > > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote:
> > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > > > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > > > > interrupt properties, so don't loop back to their parent device.
> > > > >
> > > > > This is interesting and not what I really expected to happen. It goes to
> > > > > show that we really need more time to understand all the subtleties of
> > > > > device dependencies before jumping on patching stuff.
> > > >
> > > > There is an even more interesting variation which I would like to point
> > > > out. It seems like a very odd loophole in the device links.
> > > >
> > > > Take the example of the mv88e6xxx DSA driver. On my board
> > > > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I
> > > > had to declare the switches as interrupt controller and add interrupts
> > > > to their internal PHYs, I still need considerable force to 'break' this
> > > > board in the way discussed in this thread. The correct PHY driver insists
> > > > to probe, and not genphy. Let me explain.
> > > >
> > > > The automatic device links between the switch (supplier, as interrupt-controller)
> > > > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle.
> > > >
> > > > Important note: fwnode_link_add does not link devices, it links OF nodes.
> > > >
> > > > Even more important node, in the form of a comment:
> > > >
> > > >  * The driver core will use the fwnode link to create a device link between the
> > > >  * two device objects corresponding to @con and @sup when they are created. The
> > > >  * driver core will automatically delete the fwnode link between @con and @sup
> > > >  * after doing that.
> > > >
> > > > Okay?!
> > > >
> > > > What seems to be omitted is that the DSA switch driver's probing itself
> > > > can be deferred. For example:
> > > >
> > > > dsa_register_switch
> > > > -> dsa_switch_probe
> > > >    -> dsa_switch_parse_of
> > > >       -> dsa_switch_parse_ports_of
> > > >          -> dsa_port_parse_of
> > > >             -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0));
> > > >             -> not found => return -EPROBE_DEFER
> > > >
> > > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively
> > > > an error path. So the reverse of initialization is performed.
> > > >
> > > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_
> > > > dsa_register_switch. So when dsa_register_switch returns error code,
> > > > mv88e6xxx_mdios_unregister() will be called.
> > > >
> > > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with
> > > > internal PHYs are destroyed. So the PHY devices themselves are destroyed
> > > > too. And the device links between the DSA switch and the internal PHYs,
> > > > those created based on the firmware node links created by fwnode_link_add,
> > > > are dropped too.
> > > >
> > > > Now remember the comment that the device links created based on
> > > > fwnode_link_add are not restored.
> > > >
> > > > So probing of the DSA switch finally resumes, and this time
> > > > device_links_check_suppliers() is effectively bypassed, the PHYs no
> > > > longer request probe deferral due to their supplier not being ready,
> > > > because the device link no longer exists.
> > > >
> > > > Isn't this self-sabotaging?!
> >
> > Yeah, this is a known "issue". I'm saying "issue" because at worst
> > it'd allow a few unnecessary deferred probes. And if you want to break
> > or get fw_devlink to ignore your child devices or your consumers,
> > there are simpler APIs to do it without having to intentionally defer
> > a probe.  Fixing this "issue" would just use up more memory and
> > increase boot time for no meaningful benefit.
>
> But I mean, if the goal of fw_devlink is to infer a probing order based
> on phandles, and it is faced with a long chain of devices, then any
> -EPROBE_DEFER of a device on top of the chain will break the probing
> order for all devices beneath it. It is self-defeating, it is already
> memory used for nothing.
>
> > > >
> > > > Now generally, DSA drivers defer probing because they probe in parallel
> > > > with the DSA master. This is typical if the switch is on a SPI bus, or
> > > > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller.
> > > >
> > > > If the MDIO controller is not standalone, but is provided by Ethernet
> > > > controller that is the DSA master itself, then things change a lot,
> > > > because probing can never be parallel. The DSA master probes,
> > > > initializes its MDIO bus, and this triggers the probing of the MDIO
> > > > devices on that bus, one of which is the DSA switch. So DSA can no
> > > > longer defer the probe due to that reason.
> > > >
> > > > Secondly, in DSA we even have variation between drivers as to where they
> > > > register their internal MDIO buses. The mv88e6xxx driver does this in
> > > > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb
> > > > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this
> > > > is important. DSA provides drivers with a .setup() callback, which is
> > > > guaranteed to take place after nothing can defer the switch's probe
> > > > anymore.
> > > >
> > > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register
> > > > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this
> > > > setup, because the device links framework isn't sabotaging itself anymore.
> > > >
> > > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register()
> > > > from the probe method and not the setup method, the entire design issue
> > > > with interrupts on internal DSA switch ports would have went absolutely
> > > > unnoticed for a few more years.
> > > >
> > > > I have not tested this, but it also seems plausible that DSA can
> > > > trivially and reliably bypass any fw_devlink=on restrictions by simply
> > > > moving all of_mdiobus_register() driver calls from the .setup() method
> > > > to their respective probe methods (prior to calling dsa_register_switch),
> > > > then effectively fabricate an -EPROBE_DEFER during the first probe attempt.
> > > > I mean, who will know whether that probe deferral request was justified
> > > > or not?
> > >
> > > Pushing the thought even further, it is not even necessary to move the
> > > of_mdiobus_register() call to the probe function. Where it is (in .setup)
> > > is already good enough. It is sufficient to return -EOPNOTSUPP once
> > > (the first time) immediately _after_ the call to of_mdiobus_register
> > > (and have a proper error path, i.e. call mdiobus_unregister too).
> >
> > Right, there are plenty of ways to intentionally break fw_devlink. I
> > hope that's not the point :) And I don't think -EOPNOTSUPP would work
> > because your device wouldn't be probed again.
>
> Yes, -EPROBE_DEFER is what I meant.
>
> > >
> > > > Anyway, I'm not sure everyone agrees with this type of "solution" (even
> > > > though it's worth pointing it out as a fw_devlink limitation). In any
> > > > case, we need some sort of lightweight "fix" to the chicken-and-egg
> > > > problem, which will give me enough time to think of something better.
> >
> > I think the generic DSA patch I gave would be the lightweight fix to
> > address this chicken-and-egg issue.
> >
> > As for the long term fix, I'd really suggest looking into using the
> > component device model. I'd even be happy to help make any driver
> > core/component device improvements you might need.
> >
> > I'm also interested in looking into improving the PHY probing so that
> > the genphy never probes a device that has a driver that could probe
> > it. Even outside of all this fw_devlink thing, they way PHY is handled
> > now, if any of the supplier really isn't ready yet (say a clock), then
> > the genphy gets used -- which isn't good.
>
> I think this is the real problem which needs to be addressed.

This is one of the real problems which need to be addresses.

> The
> trouble is, I don't know if phy_attach_direct can find out the reason
> for which d->driver is NULL, i.e. that there was a driver which matched
> and attempted the probe, but returned -EPROBE_DEFER.

I think if we can set the requirement that the PHY's driver needs to
be loaded/registered before the switch's driver, this should be
possible to figure out. Either using dev->can_match or with some
additional minor changes to driver-core.

-Saravana
Andrew Lunn Sept. 2, 2021, 5:41 p.m. UTC | #30
>   How would this be avoided? Or are you thinking of some kind of two-level
>   component driver system:
>   - the DSA switch is a component master, with components being its
>     sub-devices such as internal PHYs etc
>   - the DSA switch is also a component of the DSA switch tree

I think you might be missing a level. Think about the automotive
reference design system you posted the DT for a couple of days
ago. Don't you have cascaded switches, which are not members of the
same DSA tree. You might need a component for that whole group of
switches, above what you suggest here.

Can you nest components? How deep can you nest them?

    Andrew
Saravana Kannan Sept. 2, 2021, 5:58 p.m. UTC | #31
On Thu, Sep 2, 2021 at 10:41 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> >   How would this be avoided? Or are you thinking of some kind of two-level
> >   component driver system:
> >   - the DSA switch is a component master, with components being its
> >     sub-devices such as internal PHYs etc
> >   - the DSA switch is also a component of the DSA switch tree
>
> I think you might be missing a level. Think about the automotive
> reference design system you posted the DT for a couple of days
> ago. Don't you have cascaded switches, which are not members of the
> same DSA tree. You might need a component for that whole group of
> switches, above what you suggest here.
>
> Can you nest components? How deep can you nest them?

As far as I know you can nest components.

Also, technically you can make your own lightweight component model
like behaviour using stateful device links or fwnode links (probably
just a simple for loop). Just create a new "dsa_switch_tree" device
and create device links between that and whatever other devices that
need to probe first. And then you'll just have a common
"dsa_switch_tree" driver that probes these types of devices.

I'm waiting for [1] to land before I jump in and clean up the
component model to be more flexible and cleaner by using device links.
The current implementation does a lot of stuff that device links will
take care of for free.

[1] - https://lore.kernel.org/lkml/CAGETcx-mRrqC_sGiBk+wx8RtwjJjXf0KJo+ejU6SweEBiATaLw@mail.gmail.com/

-Saravana
Saravana Kannan Sept. 8, 2021, 6:35 p.m. UTC | #32
On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > > I must admit, my main problem at the moment is -rc1 in two weeks
> > > time. It seems like a number of board with Ethernet switches will be
> > > broken, that worked before. phy-handle is not limited to switch
> > > drivers, it is also used for Ethernet drivers. So it could be, a
> > > number of Ethernet drivers are also going to be broken in -rc1?
> >
> > Again, in those cases, based on your FEC example, fw_devlink=on
> > actually improves things.
>
> Debatable. I did some testing. As expected some boards with Ethernet
> switches are now broken. Without fw_devlink=on, some boards are not
> optimal, but they actually work. With it, they are broken.
>
> I did a bisect, and they have been broken since:
>
> ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit
> commit ea718c699055c8566eb64432388a04974c43b2ea
> Author: Saravana Kannan <saravanak@google.com>
> Date:   Tue Mar 2 13:11:32 2021 -0800
>
>     Revert "Revert "driver core: Set fw_devlink=on by default""
>
>     This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df.
>
>     Since all reported issues due to fw_devlink=on should be addressed by
>     this series, revert the revert. fw_devlink=on Take II.
>
>     Signed-off-by: Saravana Kannan <saravanak@google.com>
>     Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>
>  drivers/base/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> So however it is fixed, it needs to go into stable, not just -rc1.
>
> > Again, it's not a widespread problem as I explained before.
> > fw_devlink=on has been the default for 2 kernel versions now. With no
> > unfixed reported issues.
>
> Given that some Ethernet switches have been broken all that time, i
> wonder what else has been broken? Normally, the kernel which is
> release in December becomes the next LTS. It then gets picked up by
> the distros and more wide spread tested. So it could be, you get a
> flood of reports in January and February about things which are
> broken. This is why i don't think you should be relying on bug
> reports, you should be taking a more proactive stance and trying to
> analyse the DTB blobs.
>
> I will spend some time trying out your proposed fix. See if they are a
> quick fix for stable.

Hi Andrew,

Did you have a chance to try it out? I can fix up the commit text and
send out vN+1 of the patch if it works for you.

-Saravana
Andrew Lunn Sept. 9, 2021, 1:39 a.m. UTC | #33
> --- a/net/dsa/dsa2.c
> +++ b/net/dsa/dsa2.c
> @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct
> dsa_switch *ds, struct device_node *dn)
>  {
>         int err;
> 
> +       /* A lot of switch devices have their PHYs as child devices and have
> +        * the PHYs depend on the switch as a supplier (Eg: interrupt
> +        * controller). With fw_devlink=on, that means the PHYs will defer
> +        * probe until the probe() of the switch completes. However, the way
> +        * the DSA framework is designed, the PHYs are expected to be probed
> +        * successfully before the probe() of the switch completes.
> +        *
> +        * So, mark the switch devices as a "broken parent" so that fw_devlink
> +        * knows not to create device links between PHYs and the parent switch.
> +        */
> +       np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
>         err = dsa_switch_parse_member_of(ds, dn);
>         if (err)
>                 return err;

This does not work. First off, its dn, not np. But with that fixed, it
still does not work. This is too late, the mdio busses have already
been registered and probed, the PHYs have been found on the busses,
and the PHYs would of been probed, if not for fw_devlink.

What did work was:

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c45ca2473743..45d67d50e35f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
        if (!np && !pdata)
                return -EINVAL;
 
-       if (np)
+       if (np) {
                compat_info = of_device_get_match_data(dev);
+               np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
+       }
 
        if (pdata) {
                compat_info = pdata_device_get_match_data(dev);

This will fix it for mv88e6xxx. But if the same problem occurs in any
of the other DSA drivers, they will still be broken:

~/linux/drivers/net/dsa$ grep -r mdiobus_register *
bcm_sf2.c:	err = mdiobus_register(priv->slave_mii_bus);
dsa_loop_bdinfo.c:	return mdiobus_register_board_info(&bdinfo, 1);
lantiq_gswip.c:	return of_mdiobus_register(ds->slave_mii_bus, mdio_np);
mt7530.c:	ret = mdiobus_register(bus);
mv88e6xxx/chip.c:	err = of_mdiobus_register(bus, np);
grep: mv88e6xxx/chip.o: binary file matches
ocelot/seville_vsc9953.c:	rc = mdiobus_register(bus);
ocelot/felix_vsc9959.c:	rc = mdiobus_register(bus);
qca/ar9331.c:	ret = of_mdiobus_register(mbus, mnp);
qca8k.c:	return devm_of_mdiobus_register(priv->dev, bus, mdio);
realtek-smi-core.c:	ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np);
sja1105/sja1105_mdio.c:	rc = of_mdiobus_register(bus, np);
sja1105/sja1105_mdio.c:	rc = of_mdiobus_register(bus, np);
sja1105/sja1105_mdio.c:	rc = mdiobus_register(bus);
sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds)
sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds);
sja1105/sja1105_main.c:	rc = sja1105_mdiobus_register(ds);

If you are happy to use a big hammer:

diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 53f034fc2ef7..7ecd910f7fb8 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
            NULL == bus->read || NULL == bus->write)
                return -EINVAL;
 
+       if (bus->parent && bus->parent->of_node)
+               bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
+
        BUG_ON(bus->state != MDIOBUS_ALLOCATED &&
               bus->state != MDIOBUS_UNREGISTERED);
 
So basically saying all MDIO busses potentially have a problem.

I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are
not broken, they work fine, if fw_devlink gets out of the way and
allows them to do their job.

You also asked about why the component framework is not used. DSA has
been around for a while, the first commit dates back to October
2008. Russell Kings first commit for the component framework is
January 2014. The plain driver model has worked for the last 13 years,
so there has not been any need to change.

   Andrew
Saravana Kannan Sept. 9, 2021, 3:21 a.m. UTC | #34
On Wed, Sep 8, 2021 at 6:39 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > --- a/net/dsa/dsa2.c
> > +++ b/net/dsa/dsa2.c
> > @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct
> > dsa_switch *ds, struct device_node *dn)
> >  {
> >         int err;
> >
> > +       /* A lot of switch devices have their PHYs as child devices and have
> > +        * the PHYs depend on the switch as a supplier (Eg: interrupt
> > +        * controller). With fw_devlink=on, that means the PHYs will defer
> > +        * probe until the probe() of the switch completes. However, the way
> > +        * the DSA framework is designed, the PHYs are expected to be probed
> > +        * successfully before the probe() of the switch completes.
> > +        *
> > +        * So, mark the switch devices as a "broken parent" so that fw_devlink
> > +        * knows not to create device links between PHYs and the parent switch.
> > +        */
> > +       np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> >         err = dsa_switch_parse_member_of(ds, dn);
> >         if (err)
> >                 return err;
>
> This does not work. First off, its dn, not np.

My bad. Copy paste error.

> But with that fixed, it
> still does not work. This is too late, the mdio busses have already
> been registered and probed, the PHYs have been found on the busses,
> and the PHYs would of been probed, if not for fw_devlink.

Sigh... looks like some drivers register their mdio bus in their
dsa_switch_ops->setup while others do it in their actual probe
function (which actually makes more sense to me).

>
> What did work was:
>
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index c45ca2473743..45d67d50e35f 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
>         if (!np && !pdata)
>                 return -EINVAL;
>
> -       if (np)
> +       if (np) {
>                 compat_info = of_device_get_match_data(dev);
> +               np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> +       }
>
>         if (pdata) {
>                 compat_info = pdata_device_get_match_data(dev);
>
> This will fix it for mv88e6xxx. But if the same problem occurs in any
> of the other DSA drivers, they will still be broken:
>
> ~/linux/drivers/net/dsa$ grep -r mdiobus_register *
> bcm_sf2.c:      err = mdiobus_register(priv->slave_mii_bus);
> dsa_loop_bdinfo.c:      return mdiobus_register_board_info(&bdinfo, 1);
> lantiq_gswip.c: return of_mdiobus_register(ds->slave_mii_bus, mdio_np);
> mt7530.c:       ret = mdiobus_register(bus);
> mv88e6xxx/chip.c:       err = of_mdiobus_register(bus, np);
> grep: mv88e6xxx/chip.o: binary file matches
> ocelot/seville_vsc9953.c:       rc = mdiobus_register(bus);
> ocelot/felix_vsc9959.c: rc = mdiobus_register(bus);
> qca/ar9331.c:   ret = of_mdiobus_register(mbus, mnp);
> qca8k.c:        return devm_of_mdiobus_register(priv->dev, bus, mdio);
> realtek-smi-core.c:     ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np);

This one would have worked because it registers it in the ->setup()
ops. So it's not a simple grep for of_mdiobus_register(). But your
point stands nonetheless.

> sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np);
> sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np);
> sja1105/sja1105_mdio.c: rc = mdiobus_register(bus);
> sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds)
> sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds);
> sja1105/sja1105_main.c: rc = sja1105_mdiobus_register(ds);
>
> If you are happy to use a big hammer:

I'm okay with this big hammer for now while we figure out something better.

>
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 53f034fc2ef7..7ecd910f7fb8 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
>             NULL == bus->read || NULL == bus->write)
>                 return -EINVAL;
>
> +       if (bus->parent && bus->parent->of_node)
> +               bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> +
>         BUG_ON(bus->state != MDIOBUS_ALLOCATED &&
>                bus->state != MDIOBUS_UNREGISTERED);
>
> So basically saying all MDIO busses potentially have a problem.
>
> I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are
> not broken, they work fine, if fw_devlink gets out of the way and
> allows them to do their job.

The parent assuming the child will be probed as soon as it's added is
a broken expectation/assumption. fw_devlink is just catching them
immediately.

Having said that, this is not the hill either of us should choose to
die on. So, how about something like:
FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD

If that works, I can clean up the series with this and the MDIO fix
you mentioned.

> You also asked about why the component framework is not used. DSA has
> been around for a while, the first commit dates back to October
> 2008. Russell Kings first commit for the component framework is
> January 2014. The plain driver model has worked for the last 13 years,
> so there has not been any need to change.

Thanks for the history on why it couldn't have been used earlier.

In the long run, I'd still like to fix this so that the
dsa_tree_setup() doesn't need the flag above. I have some ideas using
device links that'll be much simpler to understand and maintain than
using the component framework. I'll send out patches for that (not
meant for 5.15) later and we can go with the MDIO bus hammer for 5.15.

-Saravana
Vladimir Oltean Sept. 9, 2021, 10:38 a.m. UTC | #35
On Wed, Sep 08, 2021 at 08:21:05PM -0700, Saravana Kannan wrote:
> > But with that fixed, it
> > still does not work. This is too late, the mdio busses have already
> > been registered and probed, the PHYs have been found on the busses,
> > and the PHYs would of been probed, if not for fw_devlink.
> 
> Sigh... looks like some drivers register their mdio bus in their
> dsa_switch_ops->setup while others do it in their actual probe
> function (which actually makes more sense to me).

If it makes more sense to you for of_mdiobus_register to be placed in
the probe function and not in ->setup, then please be aware of my
previous email pointing out that DSA defers probe due to other reasons
too before calling ->setup, like of_find_net_device_by_node not finding
the DSA master. If the MDIO bus was registered by then, it will be
destroyed by the unwind path and the device links will not be created a
second time, effectively defeating fw_devlink.

> > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> > index 53f034fc2ef7..7ecd910f7fb8 100644
> > --- a/drivers/net/phy/mdio_bus.c
> > +++ b/drivers/net/phy/mdio_bus.c
> > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
> >             NULL == bus->read || NULL == bus->write)
> >                 return -EINVAL;
> >
> > +       if (bus->parent && bus->parent->of_node)
> > +               bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> > +
> >         BUG_ON(bus->state != MDIOBUS_ALLOCATED &&
> >                bus->state != MDIOBUS_UNREGISTERED);
> >
> > So basically saying all MDIO busses potentially have a problem.
> >
> > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are
> > not broken, they work fine, if fw_devlink gets out of the way and
> > allows them to do their job.
> 
> The parent assuming the child will be probed as soon as it's added is
> a broken expectation/assumption. fw_devlink is just catching them
> immediately.

It's not really a broken expectation when you come to think of the fact
that synchronous probing is requested, and this is the internal PHY of
the switch we are talking about, not just any PHY off the street, with
random dependencies. It is known beforehand, and so are the dependencies.
All dependencies that the internal PHY has should be provided by the
switch driver by the time it registers the MDIO bus.

The system is not prepared to handle an -EPROBE_DEFER simply because
there is no good reason why it should happen.

I see fw_devlink as injecting a fault in this case. Maybe we should
treat it, but in any case it is adding a pointless -EPROBE_DEFER when
the PHY driver could have probed immediately. This will slow down the
boot time when we treat it properly eventually.

> Having said that, this is not the hill either of us should choose to
> die on. So, how about something like:
> FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD
> 
> If that works, I can clean up the series with this and the MDIO fix
> you mentioned.

I'm okay with the "needs child bound on add" name.

> > You also asked about why the component framework is not used. DSA has
> > been around for a while, the first commit dates back to October
> > 2008. Russell Kings first commit for the component framework is
> > January 2014. The plain driver model has worked for the last 13 years,
> > so there has not been any need to change.
> 
> Thanks for the history on why it couldn't have been used earlier.
> 
> In the long run, I'd still like to fix this so that the
> dsa_tree_setup() doesn't need the flag above. I have some ideas using
> device links that'll be much simpler to understand and maintain than
> using the component framework. I'll send out patches for that (not
> meant for 5.15) later and we can go with the MDIO bus hammer for 5.15.

Details?

I am not too fond of using the component framework because I am not
convinced we should be fabricating struct devices we do not need, just
to comply with an API that solves a fabricated problem.
Andrew Lunn Sept. 9, 2021, 3 p.m. UTC | #36
> Sigh... looks like some drivers register their mdio bus in their
> dsa_switch_ops->setup while others do it in their actual probe
> function (which actually makes more sense to me).
 
Drivers are free to do whatever they want. The driver model allows it.

> I'm okay with this big hammer for now while we figure out something
> better.


> 
> >
> > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> > index 53f034fc2ef7..7ecd910f7fb8 100644
> > --- a/drivers/net/phy/mdio_bus.c
> > +++ b/drivers/net/phy/mdio_bus.c
> > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
> >             NULL == bus->read || NULL == bus->write)
> >                 return -EINVAL;
> >
> > +       if (bus->parent && bus->parent->of_node)
> > +               bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> > +
> >         BUG_ON(bus->state != MDIOBUS_ALLOCATED &&
> >                bus->state != MDIOBUS_UNREGISTERED);
> >
> > So basically saying all MDIO busses potentially have a problem.
> >
> > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are
> > not broken, they work fine, if fw_devlink gets out of the way and
> > allows them to do their job.
> 
> The parent assuming the child will be probed as soon as it's added is
> a broken expectation/assumption. fw_devlink is just catching them
> immediately.

Why is it broken? As i said in the history, DSA has worked since
2008. This behaviour is not that old, but it has been used and worked
for a number of years.

I actual think your model of the driver model is too simple and needs
to accept that a driver probe is not atomic. Resources a driver
registers with other parts of the kernel can be used before the probe
completes. And we have some corners of the kernel that depend on that.

This is particularly true for network drivers. As soon as you register
a network interface to the stack, it will start using it, before the
probe function has completed. It does not wait around for the driver
core to say it has completed probing. And i doubt this is unique to
networking. Maybe when a frame buffer driver registers a frame buffer
with the core, the core starts to draw the splash screen, before the
probe finishes? Maybe when a block driver registers a block device,
the core starts reading the partition table, before the probe
finishes? These are all examples of using a resource before the probe
completes.

> Having said that, this is not the hill either of us should choose to
> die on. So, how about something like:
> FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD
> 
> If that works, I can clean up the series with this and the MDIO fix
> you mentioned.

That is O.K. for me as a fix. I can test patches next week.

     Andrew
Florian Fainelli Sept. 9, 2021, 4:56 p.m. UTC | #37
On 9/8/2021 6:39 PM, Andrew Lunn wrote:
>> --- a/net/dsa/dsa2.c
>> +++ b/net/dsa/dsa2.c
>> @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct
>> dsa_switch *ds, struct device_node *dn)
>>   {
>>          int err;
>>
>> +       /* A lot of switch devices have their PHYs as child devices and have
>> +        * the PHYs depend on the switch as a supplier (Eg: interrupt
>> +        * controller). With fw_devlink=on, that means the PHYs will defer
>> +        * probe until the probe() of the switch completes. However, the way
>> +        * the DSA framework is designed, the PHYs are expected to be probed
>> +        * successfully before the probe() of the switch completes.
>> +        *
>> +        * So, mark the switch devices as a "broken parent" so that fw_devlink
>> +        * knows not to create device links between PHYs and the parent switch.
>> +        */
>> +       np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
>>          err = dsa_switch_parse_member_of(ds, dn);
>>          if (err)
>>                  return err;
> 
> This does not work. First off, its dn, not np. But with that fixed, it
> still does not work. This is too late, the mdio busses have already
> been registered and probed, the PHYs have been found on the busses,
> and the PHYs would of been probed, if not for fw_devlink.
> 
> What did work was:
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index c45ca2473743..45d67d50e35f 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev)
>          if (!np && !pdata)
>                  return -EINVAL;
>   
> -       if (np)
> +       if (np) {
>                  compat_info = of_device_get_match_data(dev);
> +               np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> +       }
>   
>          if (pdata) {
>                  compat_info = pdata_device_get_match_data(dev);
> 
> This will fix it for mv88e6xxx. But if the same problem occurs in any
> of the other DSA drivers, they will still be broken:
> 
> ~/linux/drivers/net/dsa$ grep -r mdiobus_register *
> bcm_sf2.c:	err = mdiobus_register(priv->slave_mii_bus);
> dsa_loop_bdinfo.c:	return mdiobus_register_board_info(&bdinfo, 1);
> lantiq_gswip.c:	return of_mdiobus_register(ds->slave_mii_bus, mdio_np);
> mt7530.c:	ret = mdiobus_register(bus);
> mv88e6xxx/chip.c:	err = of_mdiobus_register(bus, np);
> grep: mv88e6xxx/chip.o: binary file matches
> ocelot/seville_vsc9953.c:	rc = mdiobus_register(bus);
> ocelot/felix_vsc9959.c:	rc = mdiobus_register(bus);
> qca/ar9331.c:	ret = of_mdiobus_register(mbus, mnp);
> qca8k.c:	return devm_of_mdiobus_register(priv->dev, bus, mdio);
> realtek-smi-core.c:	ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np);
> sja1105/sja1105_mdio.c:	rc = of_mdiobus_register(bus, np);
> sja1105/sja1105_mdio.c:	rc = of_mdiobus_register(bus, np);
> sja1105/sja1105_mdio.c:	rc = mdiobus_register(bus);
> sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds)
> sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds);
> sja1105/sja1105_main.c:	rc = sja1105_mdiobus_register(ds);
> 
> If you are happy to use a big hammer:
> 
> diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
> index 53f034fc2ef7..7ecd910f7fb8 100644
> --- a/drivers/net/phy/mdio_bus.c
> +++ b/drivers/net/phy/mdio_bus.c
> @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner)
>              NULL == bus->read || NULL == bus->write)
>                  return -EINVAL;
>   
> +       if (bus->parent && bus->parent->of_node)
> +               bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT;
> +
>          BUG_ON(bus->state != MDIOBUS_ALLOCATED &&
>                 bus->state != MDIOBUS_UNREGISTERED);
>   
> So basically saying all MDIO busses potentially have a problem.
> 
> I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are
> not broken, they work fine, if fw_devlink gets out of the way and
> allows them to do their job.
> 
> You also asked about why the component framework is not used. DSA has
> been around for a while, the first commit dates back to October
> 2008. Russell Kings first commit for the component framework is
> January 2014. The plain driver model has worked for the last 13 years,
> so there has not been any need to change.

That part of the story is more complicated than that, DSA did not get 
any development from 2008 till 2014 when I picked it up to add support 
for Broadcom switches. In 2016, in order to support more switches, 
especially those that were "pure" MDIO devices, Andrew came up with the 
mdio_device structure and also created "dsa2" which allowed registering 
switches from their actual bus. This allowed us to depart from all of 
the limitations of the unique "dsa" platform device which was just 
horrible to work with.

I recall very clearly that one of your prototypes that I tested was 
using the component framework, although I do not remember why we did not 
pursue that route and instead the DSA switch tree got reference counted 
and got its current form. So Andrew, you did evaluate the component 
framework but ended up not using it, do you remember why?

There is nothing wrong with the current approach of allowing switches to 
come up and do the final tree setup when the tree is fully resolved. If 
there are driver level changes that we can make to ease the pain on the 
device link framework, we should certainly entertain them, keep in mind 
that DSA for better or worse shows a lot of cargo cult programming.
Saravana Kannan Sept. 30, 2021, 5:33 a.m. UTC | #38
On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > Rev B is interesting because switch0 and switch1 got genphy, while
> > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > interrupt properties, so don't loop back to their parent device.
>
> This is interesting and not what I really expected to happen. It goes to
> show that we really need more time to understand all the subtleties of
> device dependencies before jumping on patching stuff.
>
> In case the DSA tree contains more than one switch, different things
> will happen in dsa_register_switch().
> The tree itself is only initialized when the last switch calls
> dsa_register_switch(). All the other switches just mark themselves as
> present and exit probing early. See this piece of code in dsa_tree_setup:
>
>         complete = dsa_tree_setup_routing_table(dst);
>         if (!complete)
>                 return 0;

Hi Vladimir,

Can you point me to an example dts file that has a DSA tree with more
than one switch and also point me to the switches that form the tree?

I'm working on a RFC series that tries to improve some stuff and
having an example DTS to look at would help.

Thanks,
Saravana

>
> So it should be a general property of cross-chip DSA trees that all
> switches except the last one will have the specific PHY driver probed
> properly, and not the genphy.
>
> Because all (N - 1) switches of a tree exit early in dsa_register_switch,
> they have successfully probed by the time the last switch brings up the
> tree, and brings up the PHYs on behalf of every other switch.
>
> The last switch can connect to the PHY on behalf of the other switches
> past their probe ending, and those PHYs should not defer probing because
> their supplier is now probed. It is only that the last switch cannot
> connect to the PHYs of its own ports.
>
> So if this does not work (you say that there are 2 switches that use
> genphy) I suspect there are also other bugs involved.
Andrew Lunn Sept. 30, 2021, 1:15 p.m. UTC | #39
On Wed, Sep 29, 2021 at 10:33:16PM -0700, Saravana Kannan wrote:
> On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > interrupt properties, so don't loop back to their parent device.
> >
> > This is interesting and not what I really expected to happen. It goes to
> > show that we really need more time to understand all the subtleties of
> > device dependencies before jumping on patching stuff.
> >
> > In case the DSA tree contains more than one switch, different things
> > will happen in dsa_register_switch().
> > The tree itself is only initialized when the last switch calls
> > dsa_register_switch(). All the other switches just mark themselves as
> > present and exit probing early. See this piece of code in dsa_tree_setup:
> >
> >         complete = dsa_tree_setup_routing_table(dst);
> >         if (!complete)
> >                 return 0;
> 
> Hi Vladimir,
> 
> Can you point me to an example dts file that has a DSA tree with more
> than one switch and also point me to the switches that form the tree?
> 
> I'm working on a RFC series that tries to improve some stuff and
> having an example DTS to look at would help.

Some of the Zodiac boards have multiple switches. They are all Marvell
switches, using the mv88e6xxx driver.

arch/arm/boot/dts/vf610-zii-dev-rev-b.dts
arch/arm/boot/dts/vf610-zii-dev-rev-c.dts
arch/arm/boot/dts/vf610-zii-scu4-aib.dts

	Andrew
Vladimir Oltean Sept. 30, 2021, 1:43 p.m. UTC | #40
Hi Saravana,

On Wed, Sep 29, 2021 at 10:33:16PM -0700, Saravana Kannan wrote:
> On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote:
> >
> > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote:
> > > Rev B is interesting because switch0 and switch1 got genphy, while
> > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have
> > > interrupt properties, so don't loop back to their parent device.
> >
> > This is interesting and not what I really expected to happen. It goes to
> > show that we really need more time to understand all the subtleties of
> > device dependencies before jumping on patching stuff.
> >
> > In case the DSA tree contains more than one switch, different things
> > will happen in dsa_register_switch().
> > The tree itself is only initialized when the last switch calls
> > dsa_register_switch(). All the other switches just mark themselves as
> > present and exit probing early. See this piece of code in dsa_tree_setup:
> >
> >         complete = dsa_tree_setup_routing_table(dst);
> >         if (!complete)
> >                 return 0;
> 
> Hi Vladimir,
> 
> Can you point me to an example dts file that has a DSA tree with more
> than one switch and also point me to the switches that form the tree?
> 
> I'm working on a RFC series that tries to improve some stuff and
> having an example DTS to look at would help.
> 
> Thanks,
> Saravana

Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts.

Graphically it looks like this:

 +-----------------------------+
 |          VF610 SoC          |
 |          +--------+         |
 |          |  fec1  |         |
 +----------+--------+---------+
                | DSA master
                |
                | ethernet = <&fec1>;
 +--------+----------+---------------------------+
 |        |  port@6  |                           |
 |        +----------+                           |
 |        | CPU port |     dsa,member = <0 0>;   |
 |        +----------+      -> tree 0, switch 0  |
 |        |   cpu    |                           |
 |        +----------+                           |
 |                                               |
 |            switch0                            |
 |                                               |
 +-----------+-----------+-----------+-----------+
 |   port@0  |   port@1  |   port@2  |   port@5  |
 +-----------+-----------+-----------+-----------+
 |switch0phy0|switch0phy1|switch0phy2|   no PHY  |
 +-----------+-----------+-----------+-----------+
 | user port | user port | user port | DSA port  |
 +-----------+-----------+-----------+-----------+
 |    lan0   |    lan1   |    lan2   |    dsa    |
 +-----------+-----------+-----------+-----------+
                                           | link = <&switch1port6 &switch2port9>;
                                           |
                                           |
                                           |
                                           | link = <&switch0port5>;
                           +----------+----------+-------------------------+
                           |          |  port@6  |                         |
                           |          +----------+                         |
                           |          | DSA port |    dsa,member = <0 1>;  |
                           |          +----------+     -> tree 0, switch 1 |
                           |          |   dsa    |                         |
                           |          +----------+                         |
                           |                                               |
                           |            switch1                            |
                           |                                               |
                           +-----------+-----------+-----------+-----------+
                           |   port@0  |   port@1  |   port@2  |   port@5  |
                           +-----------+-----------+-----------+-----------+
                           |switch1phy0|switch1phy1|switch2phy2|   no PHY  |
                           +-----------+-----------+-----------+-----------+
                           | user port | user port | user port | DSA port  |
                           +-----------+-----------+-----------+-----------+
                           |    lan3   |    lan4   |    lan5   |   dsa     |
                           +-----------+-----------+-----------+-----------+
                                                                    | link = <&switch2port9>;
                                                                    |
                                                                    |
                                                                    |
                                                                    | link = <&switch1port5 &switch0port5>;
                                                    +----------+----------+-------------------------------------+
                                                    |          |  port@9  |                                     |
                                                    |          +----------+                                     |
                                                    |          | DSA port |      dsa,member = <0 2>;            |
                                                    |          +----------+       -> tree 0, switch 2           |
                                                    |          |   dsa    |                                     |
                                                    |          +----------+                                     |
                                                    |                                                           |
                                                    |            switch2                                        |
                                                    |                                                           |
                                                    +-----------+-----------+-----------+-----------+-----------+
                                                    |   port@0  |   port@1  |   port@2  |   port@3  |   port@4  |
                                                    +-----------+-----------+-----------+-----------+-----------+
                                                    |switch2phy0|switch2phy1|switch2phy2|   no PHY  |   no PHY  |
                                                    +-----------+-----------+-----------+-----------+-----------+
                                                    | user port | user port | user port | user port | user port |
                                                    +-----------+-----------+-----------+-----------+-----------+
                                                    |    lan6   |    lan7   |   lan8    |  optical3 |  optical4 |
                                                    +-----------+-----------+-----------+-----------+-----------+
Andrew Lunn Sept. 30, 2021, 2 p.m. UTC | #41
> Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts.
> 
> Graphically it looks like this:

Nice ASCII art :-)

This shows the flow of Ethernet frames thought the switch
cluster. What is missing, and causing fw_devlink problems is the MDIO
bus master for the PHYs, and the interrupt control where PHY
interrupts are stored, and the linking from the PHY to the interrupt
controller. Physically all these parts are inside the Ethernet switch
package. But Linux models them as separate blocks. This is because in
the general case, they are all discrete blocks. You have a MAC chip,
and a PHY chip, and the PHY interrupt output it connected to a SoC
GPIO.

> 
>  +-----------------------------+
>  |          VF610 SoC          |
>  |          +--------+         |
>  |          |  fec1  |         |
>  +----------+--------+---------+
>                 | DSA master
>                 |
>                 | ethernet = <&fec1>;
>  +--------+----------+---------------------------+
>  |        |  port@6  |                           |
>  |        +----------+                           |
>  |        | CPU port |     dsa,member = <0 0>;   |
>  |        +----------+      -> tree 0, switch 0  |
>  |        |   cpu    |                           |
>  |        +----------+                           |
>  |                                               |
>  |            switch0                            |
>  |                                               |
>  +-----------+-----------+-----------+-----------+

Inside the block above, is the interrupt controller and the MDIO bus
master.


>  |   port@0  |   port@1  |   port@2  |   port@5  |
>  +-----------+-----------+-----------+-----------+
>  |switch0phy0|switch0phy1|switch0phy2|   no PHY  |
>  +-----------+-----------+-----------+-----------+

The control path for these PHYs is over the MDIO bus. They are probed
via the control path bus. These PHYs also have an interrupt output,
which is wired to the interrupt controller above.


>  | user port | user port | user port | DSA port  |
>  +-----------+-----------+-----------+-----------+
>  |    lan0   |    lan1   |    lan2   |    dsa    |
>  +-----------+-----------+-----------+-----------+

   Andrew
Saravana Kannan Sept. 30, 2021, 5:31 p.m. UTC | #42
On Thu, Sep 30, 2021 at 7:00 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts.
> >
> > Graphically it looks like this:
>
> Nice ASCII art :-)

Thanks for the awesome diagram Vladimir!

>
> This shows the flow of Ethernet frames thought the switch
> cluster. What is missing, and causing fw_devlink problems is the MDIO
> bus master for the PHYs, and the interrupt control where PHY
> interrupts are stored, and the linking from the PHY to the interrupt
> controller. Physically all these parts are inside the Ethernet switch
> package. But Linux models them as separate blocks. This is because in
> the general case, they are all discrete blocks. You have a MAC chip,
> and a PHY chip, and the PHY interrupt output it connected to a SoC
> GPIO.
>
> >
> >  +-----------------------------+
> >  |          VF610 SoC          |
> >  |          +--------+         |
> >  |          |  fec1  |         |
> >  +----------+--------+---------+
> >                 | DSA master
> >                 |
> >                 | ethernet = <&fec1>;
> >  +--------+----------+---------------------------+
> >  |        |  port@6  |                           |
> >  |        +----------+                           |
> >  |        | CPU port |     dsa,member = <0 0>;   |
> >  |        +----------+      -> tree 0, switch 0  |
> >  |        |   cpu    |                           |
> >  |        +----------+                           |
> >  |                                               |
> >  |            switch0                            |
> >  |                                               |
> >  +-----------+-----------+-----------+-----------+
>
> Inside the block above, is the interrupt controller and the MDIO bus
> master.
>
>
> >  |   port@0  |   port@1  |   port@2  |   port@5  |
> >  +-----------+-----------+-----------+-----------+
> >  |switch0phy0|switch0phy1|switch0phy2|   no PHY  |
> >  +-----------+-----------+-----------+-----------+
>
> The control path for these PHYs is over the MDIO bus. They are probed
> via the control path bus. These PHYs also have an interrupt output,
> which is wired to the interrupt controller above.
>
>
> >  | user port | user port | user port | DSA port  |
> >  +-----------+-----------+-----------+-----------+
> >  |    lan0   |    lan1   |    lan2   |    dsa    |
> >  +-----------+-----------+-----------+-----------+
>

Thanks for the dts paths and the additional details Andrew.

I think this gives me enough info for now to make sure whatever I'm
coding isn't completely stupid. I'm trying to make the generic PHY
driver less greedy (taking it a bit further than what Vladimir was
attempting) and also delay the use of generic PHY driver as late as
possible (so that we give as much time as possible for the specific
driver to be registered/loaded before we give up and use generic PHY
driver). This would also need some changes to the DSA code and hence
these questions.

Btw, do we have non-DSA networking devices where fw_devlink=on
delaying PHY probes is causing an issue?

-Saravana
Andrew Lunn Sept. 30, 2021, 7:38 p.m. UTC | #43
> Btw, do we have non-DSA networking devices where fw_devlink=on
> delaying PHY probes is causing an issue?

I don't know if issues have been reported, but the realtek driver has
had problems in the past when the generic driver is used. Take a look
at r8169_mdio_register(), it does something similar to DSA.

What is going to make things interesting is that phy_attach_direct()
is called in two different contexts. During the MAC drivers probe, it
is O.K. to return EPROBE_DEFER, and let the MAC driver try again
later, if we know there is a specific PHY driver for it. But when
called during the MAC drivers open() op, -EPROBE_DEFER is not
allowed. What to do then is an interesting question.

     Andrew
Saravana Kannan Sept. 30, 2021, 7:48 p.m. UTC | #44
On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote:
>
> > Btw, do we have non-DSA networking devices where fw_devlink=on
> > delaying PHY probes is causing an issue?
>
> I don't know if issues have been reported, but the realtek driver has
> had problems in the past when the generic driver is used. Take a look
> at r8169_mdio_register(), it does something similar to DSA.

Does it have the issue of having the PHY as its child too and then
depending on it to bind to a driver? I can't tell because I didn't
know how to find that info for a PCI device.

>
> What is going to make things interesting is that phy_attach_direct()
> is called in two different contexts. During the MAC drivers probe, it
> is O.K. to return EPROBE_DEFER, and let the MAC driver try again
> later, if we know there is a specific PHY driver for it. But when
> called during the MAC drivers open() op, -EPROBE_DEFER is not
> allowed. What to do then is an interesting question.

Yeah, basically before doing an open() it'll have to call an API to
say "just bind with whatever you got". Or something along those lines.
I already know how to get that to work. I'll send some RFC soonish (I
hope).

-Saravana
Florian Fainelli Sept. 30, 2021, 8:06 p.m. UTC | #45
On 9/30/21 12:48 PM, Saravana Kannan wrote:
> On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote:
>>
>>> Btw, do we have non-DSA networking devices where fw_devlink=on
>>> delaying PHY probes is causing an issue?
>>
>> I don't know if issues have been reported, but the realtek driver has
>> had problems in the past when the generic driver is used. Take a look
>> at r8169_mdio_register(), it does something similar to DSA.
> 
> Does it have the issue of having the PHY as its child too and then
> depending on it to bind to a driver? I can't tell because I didn't
> know how to find that info for a PCI device.

Yes, r8169 includes a MDIO bus controller, and the PHY is internal to
the Ethernet MAC. These are AFAIR the relevant changes to this discussion:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa


> 
>>
>> What is going to make things interesting is that phy_attach_direct()
>> is called in two different contexts. During the MAC drivers probe, it
>> is O.K. to return EPROBE_DEFER, and let the MAC driver try again
>> later, if we know there is a specific PHY driver for it. But when
>> called during the MAC drivers open() op, -EPROBE_DEFER is not
>> allowed. What to do then is an interesting question.
> 
> Yeah, basically before doing an open() it'll have to call an API to
> say "just bind with whatever you got". Or something along those lines.
> I already know how to get that to work. I'll send some RFC soonish (I
> hope).

I don't think this is going to scale, we have dozens and dozens of
drivers that connect to the PHY during ndo_open(). It is not realistic
to audit them all, just like the opposite case where the drivers do
probe MDIO/PHY during their .probe() call is not realistic either.
Saravana Kannan Sept. 30, 2021, 8:14 p.m. UTC | #46
On Thu, Sep 30, 2021 at 1:06 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 9/30/21 12:48 PM, Saravana Kannan wrote:
> > On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote:
> >>
> >>> Btw, do we have non-DSA networking devices where fw_devlink=on
> >>> delaying PHY probes is causing an issue?
> >>
> >> I don't know if issues have been reported, but the realtek driver has
> >> had problems in the past when the generic driver is used. Take a look
> >> at r8169_mdio_register(), it does something similar to DSA.
> >
> > Does it have the issue of having the PHY as its child too and then
> > depending on it to bind to a driver? I can't tell because I didn't
> > know how to find that info for a PCI device.
>
> Yes, r8169 includes a MDIO bus controller, and the PHY is internal to
> the Ethernet MAC. These are AFAIR the relevant changes to this discussion:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa
>
>
> >
> >>
> >> What is going to make things interesting is that phy_attach_direct()
> >> is called in two different contexts. During the MAC drivers probe, it
> >> is O.K. to return EPROBE_DEFER, and let the MAC driver try again
> >> later, if we know there is a specific PHY driver for it. But when
> >> called during the MAC drivers open() op, -EPROBE_DEFER is not
> >> allowed. What to do then is an interesting question.
> >
> > Yeah, basically before doing an open() it'll have to call an API to
> > say "just bind with whatever you got". Or something along those lines.
> > I already know how to get that to work. I'll send some RFC soonish (I
> > hope).
>
> I don't think this is going to scale, we have dozens and dozens of
> drivers that connect to the PHY during ndo_open().

Whichever code calls ->ndo_open() can't that mark all the PHYs that'll
be used as "needs to be ready now"? In any case, if we can have an API
that allows a less greedy Generic PHY binding, we could slowly
transition drivers over or at least move them over as they hit issues
with Gen PHY. Anyway, I'll think discussing it over code would be
easier. I'll also have more context as I try to make changes. So,
let's continue this on my future RFC.

-Saravana

> It is not realistic
> to audit them all, just like the opposite case where the drivers do
> probe MDIO/PHY during their .probe() call is not realistic either.
> --
> Florian
Andrew Lunn Sept. 30, 2021, 8:22 p.m. UTC | #47
> I don't think this is going to scale, we have dozens and dozens of
> drivers that connect to the PHY during ndo_open(). It is not realistic
> to audit them all, just like the opposite case where the drivers do
> probe MDIO/PHY during their .probe() call is not realistic either.

I was wondering if Coccinelle could help use here. But a quick scan of
the documents don't suggest it can follow call stacks. Ideally we
would what something to goes and finds the struct net_device_ops, and
gets the function used for .ndo_open. Then look into that function,
and all functions it calls within the driver, and see if any of them
connect the PHY to the MAC. We could then add an additional parameter
to indicate we are in ndo_open context.

But it looks like that is wishful thinking.

    Andrew
Florian Fainelli Sept. 30, 2021, 9:16 p.m. UTC | #48
On 9/30/21 1:14 PM, Saravana Kannan wrote:
> On Thu, Sep 30, 2021 at 1:06 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>> On 9/30/21 12:48 PM, Saravana Kannan wrote:
>>> On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote:
>>>>
>>>>> Btw, do we have non-DSA networking devices where fw_devlink=on
>>>>> delaying PHY probes is causing an issue?
>>>>
>>>> I don't know if issues have been reported, but the realtek driver has
>>>> had problems in the past when the generic driver is used. Take a look
>>>> at r8169_mdio_register(), it does something similar to DSA.
>>>
>>> Does it have the issue of having the PHY as its child too and then
>>> depending on it to bind to a driver? I can't tell because I didn't
>>> know how to find that info for a PCI device.
>>
>> Yes, r8169 includes a MDIO bus controller, and the PHY is internal to
>> the Ethernet MAC. These are AFAIR the relevant changes to this discussion:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa
>>
>>
>>>
>>>>
>>>> What is going to make things interesting is that phy_attach_direct()
>>>> is called in two different contexts. During the MAC drivers probe, it
>>>> is O.K. to return EPROBE_DEFER, and let the MAC driver try again
>>>> later, if we know there is a specific PHY driver for it. But when
>>>> called during the MAC drivers open() op, -EPROBE_DEFER is not
>>>> allowed. What to do then is an interesting question.
>>>
>>> Yeah, basically before doing an open() it'll have to call an API to
>>> say "just bind with whatever you got". Or something along those lines.
>>> I already know how to get that to work. I'll send some RFC soonish (I
>>> hope).
>>
>> I don't think this is going to scale, we have dozens and dozens of
>> drivers that connect to the PHY during ndo_open().
> 
> Whichever code calls ->ndo_open() can't that mark all the PHYs that'll
> be used as "needs to be ready now"? In any case, if we can have an API
> that allows a less greedy Generic PHY binding, we could slowly
> transition drivers over or at least move them over as they hit issues
> with Gen PHY. Anyway, I'll think discussing it over code would be
> easier. I'll also have more context as I try to make changes. So,
> let's continue this on my future RFC.

It is the same API that is being used whether you connect to the PHY at
ndo_open() time or whether you do that during the parent's ->probe()
fortunately or unfortunately. Now we could set a flag in either case,
and hope that it addresses both situations?

Being able to be selective about the Ethernet PHY driver is being used
is actually a good idea, there are plenty of systems out there whereby
using the Generic PHY driver will not lead to a functional Ethernet
link, if we could say "I want my dedicated driver, and not Generic PHY"
that would actually help some cases, too.
diff mbox series

Patch

diff --git a/drivers/base/core.c b/drivers/base/core.c
index f6360490a4a3..2cc34f8ff051 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1719,6 +1719,28 @@  static int fw_devlink_create_devlink(struct device *con,
 	struct device *sup_dev;
 	int ret = 0;
 
+	/*
+	 * In some cases, a device P might also be a supplier to its child node
+	 * C. However, this would defer the probe of C until the probe of P
+	 * completes successfully. This is perfectly fine in the device driver
+	 * model. device_add() doesn't guarantee probe completion of the device
+	 * by the time it returns.
+	 *
+	 * However, there are a few drivers that assume C will finish probing
+	 * as soon as it's added and before P finishes probing. While this is a
+	 * broken assumption that needs the driver to be fixed, we don't want
+	 * to block fw_devlink improvements because of these drivers.
+	 *
+	 * So, we provide a flag to let fw_devlink know not to delay the probe
+	 * of C until the probe of P completes successfully.
+	 *
+	 * When such a flag is set, we can't create device links with P as the
+	 * supplier of C as that would delay the probe of C.
+	 */
+	if (sup_handle->flags & FWNODE_FLAG_BROKEN_PARENT &&
+	    fwnode_is_ancestor_of(sup_handle, con->fwnode))
+		return -EINVAL;
+
 	sup_dev = get_dev_from_fwnode(sup_handle);
 	if (sup_dev) {
 		/*
diff --git a/include/linux/fwnode.h b/include/linux/fwnode.h
index 59828516ebaf..9382065e6ff8 100644
--- a/include/linux/fwnode.h
+++ b/include/linux/fwnode.h
@@ -22,10 +22,13 @@  struct device;
  * LINKS_ADDED:	The fwnode has already be parsed to add fwnode links.
  * NOT_DEVICE:	The fwnode will never be populated as a struct device.
  * INITIALIZED: The hardware corresponding to fwnode has been initialized.
+ * BROKEN_PARENT: The driver of this fwnode/device expects the child devices to
+ *		  probe as soon as they are added.
  */
 #define FWNODE_FLAG_LINKS_ADDED		BIT(0)
 #define FWNODE_FLAG_NOT_DEVICE		BIT(1)
 #define FWNODE_FLAG_INITIALIZED		BIT(2)
+#define FWNODE_FLAG_BROKEN_PARENT	BIT(3)
 
 struct fwnode_handle {
 	struct fwnode_handle *secondary;