Message ID | 20210826074526.825517-2-saravanak@google.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | Fix rtl8366rb issues with fw_devlink=on | expand |
On Thu, Aug 26, 2021 at 12:45:24AM -0700, Saravana Kannan wrote: > If a parent device is also a supplier to a child device, fw_devlink=on > (correctly) delays the probe() of the child device until the probe() of > the parent finishes successfully. > > However, some drivers of such parent devices (where parent is also a > supplier) incorrectly expect the child device to finish probing > successfully as soon as they are added using device_add() and before the > probe() of the parent device has completed successfully. Please can you point at the code making this assumption. It sounds like we are missing some EPROBE_DEFER handling in the driver, or maybe the DSA framework. Andrew
Greg, Florian, Vladimir, Alvin, Let's continue the rest of the discussion here. On Thu, Aug 26, 2021 at 6:13 AM Andrew Lunn <andrew@lunn.ch> wrote: > > On Thu, Aug 26, 2021 at 12:45:24AM -0700, Saravana Kannan wrote: > > If a parent device is also a supplier to a child device, fw_devlink=on > > (correctly) delays the probe() of the child device until the probe() of > > the parent finishes successfully. > > > > However, some drivers of such parent devices (where parent is also a > > supplier) incorrectly expect the child device to finish probing > > successfully as soon as they are added using device_add() and before the > > probe() of the parent device has completed successfully. > > Please can you point at the code making this assumption. It sounds > like we are missing some EPROBE_DEFER handling in the driver, or maybe > the DSA framework. For context, this was discussed and explained in [1] and subsequent replies. But let me summarize it here. Alvin reported an issue that with fw_devlink=on, his downstream hardware which is very similar to [2] doesn't have its PHYs probed correctly. Instead of the PHYs being probed by the specific driver, it gets probed by the "Generic PHY" driver. For those who aren't very familiar with PHYs/networking (this is based on what Andrew explained to me earlier), Ethernet PHYs follow a specific standard and can have some extended functionality. The specific driver would give the full functionality, but if it's not available when the PHY needs to be used/connected, the generic PHY driver is force bound to the PHY and it gives the basic functionality. So upon digging into this, this is what I found and where I think we have some bad assumptions about the driver core are present: The DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is: realtek_smi_probe() -> dsa_register_switch() -> dsa_switch_probe() -> dsa_tree_setup() -> dsa_tree_setup_switches() -> dsa_switch_setup() -> ds->ops->setup(ds) -> rtl8366rb_setup() -> realtek_smi_setup_mdio() -> of_mdiobus_register() This scans the MDIO bus/DT and device_add()s the PHYs -> dsa_port_setup() -> dsa_port_link_register_of() -> dsa_port_phylink_register() -> phylink_of_phy_connect() -> phylink_fwnode_phy_connect() -> phy_attach_direct() This checks if PHY device has already probed (by checking for dev->driver). If not, it forces the probe of the PHY using one of the generic PHY drivers. So within dsa_register_switch() the PHY device is added and then expected to have probed in the same thread/calling context. As stated earlier, this is not guaranteed by the driver core. And this is what needs fixing. This works as long as the PHYs don't have dependencies on any other devices/suppliers and never defer probe. In the issue Alvin reported, the PHYs have a dependency and things fall apart. I don't have a strong opinion on whether this is a framework level fix or fixes in a few drivers. In the specific instance of [2] (providing snippet below to make it easier to follow), the "phy0" device [4] depends on the "switch" device [2] since "switch_intc" (the interrupt provider for phy0) is initialized by the "switch" driver. And fw_devlink=on delays the probe of phy0 until switch[2] finishes probing successfully (i.e. after dsa_register_switch() <- realtek_smi_probe() returns) -- this is the whole point of fw_devlink=on this is what reduces the useless deferred probes/probe attempts of consumers before the suppliers finish probing successfully. Since dsa_register_switch() assumes the PHYs would have been probed as soon as they are added, but they aren't probed in this case, the PHY is force bound to the generic PHY driver. Which is the original issue Alvin reported. Hope this clears things up for everyone. -Saravana switch { compatible = "realtek,rtl8366rb"; ... switch_intc: interrupt-controller { ... }; ports { ... port@0 { phy-handle = <&phy0>; }; port@1 { }; ... }; mdio { compatible = "realtek,smi-mdio"; ... phy0: phy@0 { ... interrupt-parent = <&switch_intc>; interrupts = <0>; }; ... }; }; [1] - https://lore.kernel.org/netdev/CAGETcx_uj0V4DChME-gy5HGKTYnxLBX=TH2rag29f_p=UcG+Tg@mail.gmail.com/ [2] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/gemini-dlink-dir-685.dts?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n190 [3] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/dsa/realtek-smi-core.c?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n386 [4] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/gemini-dlink-dir-685.dts?id=73f3af7b4611d77bdaea303fb639333eb28e37d7#n255
> The DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is: > realtek_smi_probe() > -> dsa_register_switch() > -> dsa_switch_probe() > -> dsa_tree_setup() > -> dsa_tree_setup_switches() > -> dsa_switch_setup() > -> ds->ops->setup(ds) > -> rtl8366rb_setup() > -> realtek_smi_setup_mdio() > -> of_mdiobus_register() > This scans the MDIO bus/DT and device_add()s the PHYs > -> dsa_port_setup() > -> dsa_port_link_register_of() > -> dsa_port_phylink_register() > -> phylink_of_phy_connect() > -> phylink_fwnode_phy_connect() > -> phy_attach_direct() > This checks if PHY device has already probed (by > checking for dev->driver). If not, it forces the > probe of the PHY using one of the generic PHY > drivers. > > So within dsa_register_switch() the PHY device is added and then > expected to have probed in the same thread/calling context. As stated > earlier, this is not guaranteed by the driver core. Have you looked at: commit 16983507742cbcaa5592af530872a82e82fb9c51 Author: Heiner Kallweit <hkallweit1@gmail.com> Date: Fri Mar 27 01:00:22 2020 +0100 net: phy: probe PHY drivers synchronously See the full commit message, but the code change is: iff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c index 3b8f6b0b47b5..d543df282365 100644 --- a/drivers/net/phy/phy_device.c +++ b/drivers/net/phy/phy_device.c @@ -2577,6 +2577,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner) new_driver->mdiodrv.driver.probe = phy_probe; new_driver->mdiodrv.driver.remove = phy_remove; new_driver->mdiodrv.driver.owner = owner; + new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS; retval = driver_register(&new_driver->mdiodrv.driver); if (retval) { How does this add to the overall picture? Andrew
On Thu, Aug 26, 2021 at 1:53 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > The DT node in [2] is probed by realtek_smi_probe() [3]. The call flow is: > > realtek_smi_probe() > > -> dsa_register_switch() > > -> dsa_switch_probe() > > -> dsa_tree_setup() > > -> dsa_tree_setup_switches() > > -> dsa_switch_setup() > > -> ds->ops->setup(ds) > > -> rtl8366rb_setup() > > -> realtek_smi_setup_mdio() > > -> of_mdiobus_register() > > This scans the MDIO bus/DT and device_add()s the PHYs > > -> dsa_port_setup() > > -> dsa_port_link_register_of() > > -> dsa_port_phylink_register() > > -> phylink_of_phy_connect() > > -> phylink_fwnode_phy_connect() > > -> phy_attach_direct() > > This checks if PHY device has already probed (by > > checking for dev->driver). If not, it forces the > > probe of the PHY using one of the generic PHY > > drivers. > > > > So within dsa_register_switch() the PHY device is added and then > > expected to have probed in the same thread/calling context. As stated > > earlier, this is not guaranteed by the driver core. > > Have you looked at: > > commit 16983507742cbcaa5592af530872a82e82fb9c51 > Author: Heiner Kallweit <hkallweit1@gmail.com> > Date: Fri Mar 27 01:00:22 2020 +0100 > > net: phy: probe PHY drivers synchronously > > See the full commit message, but the code change is: > > iff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c > index 3b8f6b0b47b5..d543df282365 100644 > --- a/drivers/net/phy/phy_device.c > +++ b/drivers/net/phy/phy_device.c > @@ -2577,6 +2577,7 @@ int phy_driver_register(struct phy_driver *new_driver, struct module *owner) > new_driver->mdiodrv.driver.probe = phy_probe; > new_driver->mdiodrv.driver.remove = phy_remove; > new_driver->mdiodrv.driver.owner = owner; > + new_driver->mdiodrv.driver.probe_type = PROBE_FORCE_SYNCHRONOUS; > > retval = driver_register(&new_driver->mdiodrv.driver); > if (retval) { > > How does this add to the overall picture? Doesn't add much to the discussion. In the example I gave, the driver already does synchronous probing. If the device can't probe successfully because a supplier isn't ready, it doesn't matter if it's a synchronous probe. The probe would still be deferred and we'll hit the same issue. Even in the situation the commit [5] describes, if parallelized probing is done and the PHY depended on something (say a clock), you'd still end up not probing the PHY even if the driver is present and the generic PHY would end up force probing it. [5] - https://lore.kernel.org/netdev/612b81d5-c4c1-5e20-a667-893eeeef0bf5@gmail.com/ -Saravana
> Doesn't add much to the discussion. In the example I gave, the driver > already does synchronous probing. If the device can't probe > successfully because a supplier isn't ready, it doesn't matter if it's > a synchronous probe. The probe would still be deferred and we'll hit > the same issue. Even in the situation the commit [5] describes, if > parallelized probing is done and the PHY depended on something (say a > clock), you'd still end up not probing the PHY even if the driver is > present and the generic PHY would end up force probing it. genphy is meant to be used when there is no other driver available. It is a best effort, better than nothing, might work. And quite a few boards rely on it. However, it should not be used when there is a specific driver. So if the PHY device has been probed, and -EPROBE_DEFER was returned, we also need to return -EPROBE_DEFER here when deciding if genphy should be used. It should then all unwind and try again later. I don't know the device core, but it looks like dev->can_match tells us what we need to know. If true, we know there is a driver for this device. But i'm hesitant to make use of this outside of driver/base. Andrew
On Thu, Aug 26, 2021 at 6:23 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > Doesn't add much to the discussion. In the example I gave, the driver > > already does synchronous probing. If the device can't probe > > successfully because a supplier isn't ready, it doesn't matter if it's > > a synchronous probe. The probe would still be deferred and we'll hit > > the same issue. Even in the situation the commit [5] describes, if > > parallelized probing is done and the PHY depended on something (say a > > clock), you'd still end up not probing the PHY even if the driver is > > present and the generic PHY would end up force probing it. > > > genphy is meant to be used when there is no other driver available. > It is a best effort, better than nothing, might work. And quite a few > boards rely on it. However, it should not be used when there is a > specific driver. Agreed, that's what we are trying to ensure. > So if the PHY device has been probed, and -EPROBE_DEFER was returned, > we also need to return -EPROBE_DEFER here when deciding if genphy > should be used. It should then all unwind and try again later. Yes, I think dsa_register_switch() returning -EPROBE_DEFER if the PHYs returned -EPROBE_DEFER might be okay (I think we should do it), but that doesn't solve the problem for this driver. fw_devlink=on/device links short circuits the probe() call of a consumer (in this case the PHY) and returns -EPROBE_DEFER if the supplier's (in this case switch) probe hasn't finished without an error. fw_devlink/device links effectively does the probe in graph topological order and there's a ton of good reasons to do it that way -- what's why fw_devlink=on was implemented. In this specific case though, since the PHY depends on the parent device, if we fail the parent's probe realtek_smi_probe() because the PHYs failed to probe, we'll get into a catch-22/chicken-n-egg situation and the switch/PHYs will never probe. I think a clean way to fix this at the driver level is to do what I said in [6]. Copy pasting it here and expanding it a bit: 1. The IRQ registration and mdio bus registration should get moved to realtek_smi_probe() which probes "realtek,rtl8366rb". So realtek_smi_probe() succeeding doesn't depend on its child devices probing successfully (which makes sense for any parent device). 2. realtek_smi_probe() should also create/register a component-master/aggregate device that's "made up of" realtek,rtl8366rb and all the PHYs. So the component-master will wait for all of them to finish probing before it's initialized. 3. PHYs will probe successfully now because realtek,rtl8366rb probe() which is the supplier's probe has finished probing without problems. 4. The component device's init (the .bind op) would call dsa_register_switch() which kinda makes sense because the rtl8366rb and all the PHYs combined together is what makes up the logical DSA switch. The dsa_register_switch() will succeed and will be using the right/specific PHY driver. The same applies for any switch that has the PHYs as it's child device AND (this is the key part) the PHYs depend on the switch as a supplier (remember, if we didn't have the interrupt dependency, this would not be an issue). > I don't know the device core, but it looks like dev->can_match tells > us what we need to know. If true, we know there is a driver for this > device. But i'm hesitant to make use of this outside of driver/base. can_match is never cleared once it's set and it's meant as an optimization/preserving some probe order stuff. I wouldn't depend on it for this case. We can just have a phy_has_driver() function that searches all the currently registered PHY drivers to check if there's a matching driver. And dsa_register_switch() or phy_attach_direct() can return -EPROBE_DEFER if there is a driver but it isn't bound yet. Again, this is orthogonal to the realtek driver fix though because of the catch-22 situation above. -Saravana [6] - https://lore.kernel.org/netdev/CAGETcx8_vxxPxF8WrXqk=PZYfEggsozP+z9KyOu5C2bEW0VW8g@mail.gmail.com/
> fw_devlink=on/device links short circuits the probe() call of a > consumer (in this case the PHY) and returns -EPROBE_DEFER if the > supplier's (in this case switch) probe hasn't finished without an > error. fw_devlink/device links effectively does the probe in graph > topological order and there's a ton of good reasons to do it that way > -- what's why fw_devlink=on was implemented. > > In this specific case though, since the PHY depends on the parent > device, if we fail the parent's probe realtek_smi_probe() because the > PHYs failed to probe, we'll get into a catch-22/chicken-n-egg > situation and the switch/PHYs will never probe. So lets look at: arch/arm/boot/dts/vf610-zii-dev-rev-b.dts mdio-mux { compatible = "mdio-mux-gpio"; pinctrl-0 = <&pinctrl_mdio_mux>; pinctrl-names = "default"; gpios = <&gpio0 8 GPIO_ACTIVE_HIGH &gpio0 9 GPIO_ACTIVE_HIGH &gpio0 24 GPIO_ACTIVE_HIGH &gpio0 25 GPIO_ACTIVE_HIGH>; mdio-parent-bus = <&mdio1>; #address-cells = <1>; #size-cells = <0>; We have an MDIO multiplexor mdio_mux_1: mdio@1 { reg = <1>; #address-cells = <1>; #size-cells = <0>; switch0: switch@0 { compatible = "marvell,mv88e6085"; pinctrl-0 = <&pinctrl_gpio_switch0>; pinctrl-names = "default"; reg = <0>; dsa,member = <0 0>; interrupt-parent = <&gpio0>; interrupts = <27 IRQ_TYPE_LEVEL_LOW>; On the first bus, we have a Ethernet switch. interrupt-controller; #interrupt-cells = <2>; eeprom-length = <512>; ports { #address-cells = <1>; #size-cells = <0>; port@0 { reg = <0>; label = "lan0"; phy-handle = <&switch0phy0>; }; The first port of that switch has a pointer to a PHY. mdio { #address-cells = <1>; #size-cells = <0>; That Ethernet switch also has an MDIO bus, switch0phy0: switch0phy0@0 { reg = <0>; On that bus is the PHY. interrupt-parent = <&switch0>; interrupts = <0 IRQ_TYPE_LEVEL_HIGH>; And that PHY has an interrupt. And that interrupt is provided by the switch. Given your description, it sounds like this is also go to break. vf610-zii-dev-rev-c.dts is the same pattern, and there are more examples for mv88e6xxx. It is a common pattern, e.g. the mips ar9331.dtsi follows it. I've not yet looked at plain Ethernet drivers. This pattern could also exist there. And i wonder about other complex structures, i2c bus multiplexors, you can have interrupt controllers as i2c devices, etc. So the general case could exist in other places. I don't think we should be playing whack-a-mole by changing drivers as we find they regress and break. We need a generic fix. I think the solution is pretty clear. As you said the device depends on its parent. DT is a tree, so it is easy to walk up the tree to detect this relationship, and not fail the probe. Andrew
On Fri, Aug 27, 2021 at 6:44 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > fw_devlink=on/device links short circuits the probe() call of a > > consumer (in this case the PHY) and returns -EPROBE_DEFER if the > > supplier's (in this case switch) probe hasn't finished without an > > error. fw_devlink/device links effectively does the probe in graph > > topological order and there's a ton of good reasons to do it that way > > -- what's why fw_devlink=on was implemented. > > > > In this specific case though, since the PHY depends on the parent > > device, if we fail the parent's probe realtek_smi_probe() because the > > PHYs failed to probe, we'll get into a catch-22/chicken-n-egg > > situation and the switch/PHYs will never probe. > > So lets look at: > > arch/arm/boot/dts/vf610-zii-dev-rev-b.dts > > mdio-mux { > compatible = "mdio-mux-gpio"; > pinctrl-0 = <&pinctrl_mdio_mux>; > pinctrl-names = "default"; > gpios = <&gpio0 8 GPIO_ACTIVE_HIGH > &gpio0 9 GPIO_ACTIVE_HIGH > &gpio0 24 GPIO_ACTIVE_HIGH > &gpio0 25 GPIO_ACTIVE_HIGH>; > mdio-parent-bus = <&mdio1>; > #address-cells = <1>; > #size-cells = <0>; > > > We have an MDIO multiplexor > > > mdio_mux_1: mdio@1 { > reg = <1>; > #address-cells = <1>; > #size-cells = <0>; > > switch0: switch@0 { > compatible = "marvell,mv88e6085"; > pinctrl-0 = <&pinctrl_gpio_switch0>; > pinctrl-names = "default"; > reg = <0>; > dsa,member = <0 0>; > interrupt-parent = <&gpio0>; > interrupts = <27 IRQ_TYPE_LEVEL_LOW>; > > On the first bus, we have a Ethernet switch. > > interrupt-controller; > #interrupt-cells = <2>; > eeprom-length = <512>; > > ports { > #address-cells = <1>; > #size-cells = <0>; > > port@0 { > reg = <0>; > label = "lan0"; > phy-handle = <&switch0phy0>; > }; > > The first port of that switch has a pointer to a PHY. > > mdio { > #address-cells = <1>; > #size-cells = <0>; > > That Ethernet switch also has an MDIO bus, > > switch0phy0: switch0phy0@0 { > reg = <0>; > > On that bus is the PHY. > > interrupt-parent = <&switch0>; > interrupts = <0 IRQ_TYPE_LEVEL_HIGH>; > > And that PHY has an interrupt. And that interrupt is provided by the switch. > > Given your description, it sounds like this is also go to break. Based on what you pasted here (I didn't look any closer), I think it will break too. > > vf610-zii-dev-rev-c.dts is the same pattern, and there are more > examples for mv88e6xxx. > > It is a common pattern, e.g. the mips ar9331.dtsi follows it. Then I think this should be solved at the DSA framework level. Make a component-master/aggregate device made up of the switches and ports/PHYs. Then wait for all of them to not -EPROBE_DEFER and then initialize the DSA? > I've not yet looked at plain Ethernet drivers. This pattern could also > exist there. And i wonder about other complex structures, i2c bus > multiplexors, you can have interrupt controllers as i2c devices, > etc. So the general case could exist in other places. I haven't seen any generic issues like this reported so far. It's only after adding phy-handle that we are hitting these issues with DSA switches. > I don't think we should be playing whack-a-mole by changing drivers as > we find they regress and break. We need a generic fix. I think the > solution is pretty clear. As you said the device depends on its > parent. DT is a tree, so it is easy to walk up the tree to detect this > relationship, and not fail the probe. It's easy to do, but it is the wrong behavior for fw_devlink=on. There are plenty of cases where it's better to delay the child device's probe until the parent finishes. You even gave an example[7] where it would help avoid unnecessary deferred probes. There are plenty of other cases like this too -- there's actually a USB driver that had an infinite deferred probe loop that fw_devlink=on fixes. Also, the whole point of fw_devlink=on is to enforce ordering like this -- so just blanket ignoring dependencies on parent devices doesn't make sense. But a parent device's probe depending on a child device's probe to succeed as soon as it's added is never right though. So I think that's what needs to be addresses. So we have a couple of options: 1. Use a component driver model to initialize switches. I think it could be doable at the DSA framework level. 2. Ask fw_devlink=on to ignore it for all switch devices -- it might be possible to move my "quick fix" to the DSA framework. 3. Remove fw_devlink support for phy-handle. I honestly think (1) is the best option and makes sense logically too. Not saying it's a trivial work or a one liner, but it actually makes sense. (2) might not be possible -- I need to take a closer look. I'd prefer not doing (3), but I'd take that over breaking the whole point of fw_devlink=on. -Saravana [7] - https://lore.kernel.org/netdev/YR5eMeKzcuYtB6Tk@lunn.ch/
> > I've not yet looked at plain Ethernet drivers. This pattern could also > > exist there. And i wonder about other complex structures, i2c bus > > multiplexors, you can have interrupt controllers as i2c devices, > > etc. So the general case could exist in other places. > > I haven't seen any generic issues like this reported so far. It's only > after adding phy-handle that we are hitting these issues with DSA > switches. Can you run your parser over the 2250 DTB blobs and see how many children have dependencies on a parent? That could give us an idea how many moles need whacking. And maybe, where in the tree they are hiding? Andrew
On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > I've not yet looked at plain Ethernet drivers. This pattern could also > > > exist there. And i wonder about other complex structures, i2c bus > > > multiplexors, you can have interrupt controllers as i2c devices, > > > etc. So the general case could exist in other places. > > > > I haven't seen any generic issues like this reported so far. It's only > > after adding phy-handle that we are hitting these issues with DSA > > switches. > > Can you run your parser over the 2250 DTB blobs and see how many > children have dependencies on a parent? That could give us an idea how > many moles need whacking. And maybe, where in the tree they are > hiding? You are only responding to part of my email. As I said in my previous email: "There are plenty of cases where it's better to delay the child device's probe until the parent finishes. You even gave an example[7] where it would help avoid unnecessary deferred probes." Can you please give your thoughts on the rest of the points I made too? -Saravana
On Fri, Aug 27, 2021 at 02:33:02PM -0700, Saravana Kannan wrote: > On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > I've not yet looked at plain Ethernet drivers. This pattern could also > > > > exist there. And i wonder about other complex structures, i2c bus > > > > multiplexors, you can have interrupt controllers as i2c devices, > > > > etc. So the general case could exist in other places. > > > > > > I haven't seen any generic issues like this reported so far. It's only > > > after adding phy-handle that we are hitting these issues with DSA > > > switches. > > > > Can you run your parser over the 2250 DTB blobs and see how many > > children have dependencies on a parent? That could give us an idea how > > many moles need whacking. And maybe, where in the tree they are > > hiding? > > You are only responding to part of my email. As I said in my previous > email: "There are plenty of cases where it's better to delay the child > device's probe until the parent finishes. You even gave an example[7] > where it would help avoid unnecessary deferred probes." Can you please > give your thoughts on the rest of the points I made too? I must admit, my main problem at the moment is -rc1 in two weeks time. It seems like a number of board with Ethernet switches will be broken, that worked before. phy-handle is not limited to switch drivers, it is also used for Ethernet drivers. So it could be, a number of Ethernet drivers are also going to be broken in -rc1? But the issues sounds not to be specific to phy-handle, but any phandle that points back to a parent. So it could be drivers outside of networking are also going to be broken with -rc1? You have been very focused on one or two drivers. I would much rather see you getting an idea of how wide spread this problem is, and what should we do for -rc1? Even if modifying DSA drivers to component drivers is possible, while not breaking backwards compatibility with DT, it is not going to happen over night. That is something for the next merge window, not this merge window. So reverting the phy-handle seems like part of the fix for -rc1. But until you look at those 2250 DTB blobs, we have no idea if that is sufficient for -rc1. Andrew
On Sat, Aug 28, 2021 at 10:01 AM Andrew Lunn <andrew@lunn.ch> wrote: > > On Fri, Aug 27, 2021 at 02:33:02PM -0700, Saravana Kannan wrote: > > On Fri, Aug 27, 2021 at 1:11 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > > > I've not yet looked at plain Ethernet drivers. This pattern could also > > > > > exist there. And i wonder about other complex structures, i2c bus > > > > > multiplexors, you can have interrupt controllers as i2c devices, > > > > > etc. So the general case could exist in other places. > > > > > > > > I haven't seen any generic issues like this reported so far. It's only > > > > after adding phy-handle that we are hitting these issues with DSA > > > > switches. > > > > > > Can you run your parser over the 2250 DTB blobs and see how many > > > children have dependencies on a parent? That could give us an idea how > > > many moles need whacking. And maybe, where in the tree they are > > > hiding? > > > > You are only responding to part of my email. As I said in my previous > > email: "There are plenty of cases where it's better to delay the child > > device's probe until the parent finishes. You even gave an example[7] > > where it would help avoid unnecessary deferred probes." Can you please > > give your thoughts on the rest of the points I made too? > > I must admit, my main problem at the moment is -rc1 in two weeks > time. It seems like a number of board with Ethernet switches will be > broken, that worked before. phy-handle is not limited to switch > drivers, it is also used for Ethernet drivers. So it could be, a > number of Ethernet drivers are also going to be broken in -rc1? Again, in those cases, based on your FEC example, fw_devlink=on actually improves things. > But the issues sounds not to be specific to phy-handle, but any > phandle that points back to a parent. I feel like I'm going in circles here. This statement is not true. Please read my previous explanations. > So it could be drivers outside > of networking are also going to be broken with -rc1? > You have been very focused on one or two drivers. I would much rather > see you getting an idea of how wide spread this problem is, and what > should we do for -rc1? Again, it's not a widespread problem as I explained before. fw_devlink=on has been the default for 2 kernel versions now. With no unfixed reported issues. > Even if modifying DSA drivers to component drivers is possible, while > not breaking backwards compatibility with DT, It should be possible without needing any DT changes. > it is not going to > happen over night. That is something for the next merge window, not > this merge window. Right, I wasn't suggesting the component driver option be implemented right away. We were talking about what the longer term proper fix would be for DSA (and Ethernet if we actually find issues there) and who would do it. That's what I hope this discussion could be. Also, if we replace Patch 2/2 in this series with the patch below, it will work as a generic quick fix for DSA that we could use for -rc1. And if we still have issues reported on the phy-handle patch by -rc5 or so, we could revert the phy-handle patch then so that v5.15 isn't broken. -Saravana --- a/net/dsa/dsa2.c +++ b/net/dsa/dsa2.c @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct dsa_switch *ds, struct device_node *dn) { int err; + /* A lot of switch devices have their PHYs as child devices and have + * the PHYs depend on the switch as a supplier (Eg: interrupt + * controller). With fw_devlink=on, that means the PHYs will defer + * probe until the probe() of the switch completes. However, the way + * the DSA framework is designed, the PHYs are expected to be probed + * successfully before the probe() of the switch completes. + * + * So, mark the switch devices as a "broken parent" so that fw_devlink + * knows not to create device links between PHYs and the parent switch. + */ + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; err = dsa_switch_parse_member_of(ds, dn); if (err) return err;
> > I must admit, my main problem at the moment is -rc1 in two weeks > > time. It seems like a number of board with Ethernet switches will be > > broken, that worked before. phy-handle is not limited to switch > > drivers, it is also used for Ethernet drivers. So it could be, a > > number of Ethernet drivers are also going to be broken in -rc1? > > Again, in those cases, based on your FEC example, fw_devlink=on > actually improves things. Debatable. I did some testing. As expected some boards with Ethernet switches are now broken. Without fw_devlink=on, some boards are not optimal, but they actually work. With it, they are broken. I did a bisect, and they have been broken since: ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit commit ea718c699055c8566eb64432388a04974c43b2ea Author: Saravana Kannan <saravanak@google.com> Date: Tue Mar 2 13:11:32 2021 -0800 Revert "Revert "driver core: Set fw_devlink=on by default"" This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df. Since all reported issues due to fw_devlink=on should be addressed by this series, revert the revert. fw_devlink=on Take II. Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> drivers/base/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) So however it is fixed, it needs to go into stable, not just -rc1. > Again, it's not a widespread problem as I explained before. > fw_devlink=on has been the default for 2 kernel versions now. With no > unfixed reported issues. Given that some Ethernet switches have been broken all that time, i wonder what else has been broken? Normally, the kernel which is release in December becomes the next LTS. It then gets picked up by the distros and more wide spread tested. So it could be, you get a flood of reports in January and February about things which are broken. This is why i don't think you should be relying on bug reports, you should be taking a more proactive stance and trying to analyse the DTB blobs. I will spend some time trying out your proposed fix. See if they are a quick fix for stable. Andrew
On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > I must admit, my main problem at the moment is -rc1 in two weeks > > > time. It seems like a number of board with Ethernet switches will be > > > broken, that worked before. phy-handle is not limited to switch > > > drivers, it is also used for Ethernet drivers. So it could be, a > > > number of Ethernet drivers are also going to be broken in -rc1? > > > > Again, in those cases, based on your FEC example, fw_devlink=on > > actually improves things. > > Debatable. I did some testing. As expected some boards with Ethernet > switches are now broken. To clarify myself, I'm saying the patch to parse "ethernet" [8] will make things better with fw_devlink=on for FEC. Not sure if you tested that patch or not. And yes, "phy-handle" will make things worse for switches because it has two issues that need to be fixed. One is this deferred probe thing for which I gave a patch in the previous email and the other is what Marek reported (fix in the works). So can you revert "phy-handle" support and test with [8] and you should see things be better with fw_devlink=on. [8] - https://lore.kernel.org/netdev/CAGETcx9=AyEfjX_-adgRuX=8a0MkLnj8sy2KJGhxpNCinJu4yA@mail.gmail.com/ > Without fw_devlink=on, some boards are not > optimal, but they actually work. With it, they are broken. > > I did a bisect, and they have been broken since: > > ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit > commit ea718c699055c8566eb64432388a04974c43b2ea > Author: Saravana Kannan <saravanak@google.com> > Date: Tue Mar 2 13:11:32 2021 -0800 > > Revert "Revert "driver core: Set fw_devlink=on by default"" > > This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df. > > Since all reported issues due to fw_devlink=on should be addressed by > this series, revert the revert. fw_devlink=on Take II. > > Signed-off-by: Saravana Kannan <saravanak@google.com> > Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > drivers/base/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > So however it is fixed, it needs to go into stable, not just -rc1. Not sure what was the tip of the tree with which you bisected. For example, if phy-handle broke things, bisect could still point at this patch above depending on whether the first bisect is good/bad. Because reverting this patch effectively disabled phy-handle parsing too. To be clear, I'm not saying things aren't broken. I'm just pointing out some nuances with the bisect that we need to be aware of. > > Again, it's not a widespread problem as I explained before. > > fw_devlink=on has been the default for 2 kernel versions now. With no > > unfixed reported issues. > > Given that some Ethernet switches have been broken all that time, i > wonder what else has been broken? Normally, the kernel which is > release in December becomes the next LTS. It then gets picked up by > the distros and more wide spread tested. So it could be, you get a > flood of reports in January and February about things which are > broken. This is why i don't think you should be relying on bug > reports, you should be taking a more proactive stance and trying to > analyse the DTB blobs. As I mentioned earlier, just looking at DTB doesn't tell me much because the driver could still be fine depending on how it's written. Also, I don't have an easy way to do this. If I find a way, I'll do it. Btw, most bug reports that have been raised have been fixed with generic fixes that address general DT patterns. For example, fw_devlink has cycle detection built in, has support for devices that never get probed, etc. Enabling fw_devlink=on and handling bug reports with generic fixes has worked well so far to get fw_devlink to where it is today. I've tried to be very quick in responding to fw_devlink issues -- and that has worked out so far and hopefully it'll continue to work out. > I will spend some time trying out your proposed fix. See if they are a > quick fix for stable. Thanks for testing it out. And worst case, we'll revert phy-handle support. -Saravana
On Tue, Aug 31, 2021 at 12:26:30PM -0700, Saravana Kannan wrote: > On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > I must admit, my main problem at the moment is -rc1 in two weeks > > > > time. It seems like a number of board with Ethernet switches will be > > > > broken, that worked before. phy-handle is not limited to switch > > > > drivers, it is also used for Ethernet drivers. So it could be, a > > > > number of Ethernet drivers are also going to be broken in -rc1? > > > > > > Again, in those cases, based on your FEC example, fw_devlink=on > > > actually improves things. > > > > Debatable. I did some testing. As expected some boards with Ethernet > > switches are now broken. > > To clarify myself, I'm saying the patch to parse "ethernet" [8] will > make things better with fw_devlink=on for FEC. Not sure if you tested > that patch or not. Yes and no. I was tested with the FEC, but because of fw_devlink=on, the switches don't probe correctly. So it is not possible to see if it helped or not, since its plain broken. > Not sure what was the tip of the tree with which you bisected. I manually tested linux-next, v5.14, v5.13 and v5.12 and then bisected: $ git bisect log git bisect start # good: [9f4ad9e425a1d3b6a34617b8ea226d56a119a717] Linux 5.12 git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717 # bad: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13 So well away from linux-next which contains the phy-handle parser. I will try to give the two patches a try today or tomorrow. > Thanks for testing it out. And worst case, we'll revert phy-handle > support. Which is not enough to fix these Ethernet switches. Andrew
On Tue, Aug 31, 2021 at 3:06 PM Andrew Lunn <andrew@lunn.ch> wrote: > > On Tue, Aug 31, 2021 at 12:26:30PM -0700, Saravana Kannan wrote: > > On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > > > > > I must admit, my main problem at the moment is -rc1 in two weeks > > > > > time. It seems like a number of board with Ethernet switches will be > > > > > broken, that worked before. phy-handle is not limited to switch > > > > > drivers, it is also used for Ethernet drivers. So it could be, a > > > > > number of Ethernet drivers are also going to be broken in -rc1? > > > > > > > > Again, in those cases, based on your FEC example, fw_devlink=on > > > > actually improves things. > > > > > > Debatable. I did some testing. As expected some boards with Ethernet > > > switches are now broken. > > > > To clarify myself, I'm saying the patch to parse "ethernet" [8] will > > make things better with fw_devlink=on for FEC. Not sure if you tested > > that patch or not. > > Yes and no. I was tested with the FEC, but because of fw_devlink=on, > the switches don't probe correctly. So it is not possible to see if it > helped or not, since its plain broken. > > > Not sure what was the tip of the tree with which you bisected. > > I manually tested linux-next, v5.14, v5.13 and v5.12 and then > bisected: > > $ git bisect log > git bisect start > # good: [9f4ad9e425a1d3b6a34617b8ea226d56a119a717] Linux 5.12 > git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717 > # bad: [62fb9874f5da54fdb243003b386128037319b219] Linux 5.13 > > So well away from linux-next which contains the phy-handle parser. > > I will try to give the two patches a try today or tomorrow. If the switches are broken without the phy-handle or ethernet change, I'm not sure if the "BROKEN_PARENT" patch would help. > > > Thanks for testing it out. And worst case, we'll revert phy-handle > > support. > > Which is not enough to fix these Ethernet switches. Ok, if you can give more specifics on this, I'll look into it. I'd need: 1) The DTS file that you see the issue on. 2) The devices that don't probe when you have the issue (you can find these in /sys/kernel/debug/devices_deferred) 3) And if possible boot logs with dev_dbg changed to dev_info in device_link_add() and device_links_check_suppliers() Thanks, Saravana
> If the switches are broken without the phy-handle or ethernet change, > I'm not sure if the "BROKEN_PARENT" patch would help. > > Which is not enough to fix these Ethernet switches. > > Ok, if you can give more specifics on this, I'll look into it. The switches probe, but get the wrong PHY driver, genphy, not the Marvell PHY driver. And genphy is not sufficient for this hardware. I'd need: > 1) The DTS file that you see the issue on. I did the bisect on arch/arm/boot/dts/vf610-zii-dev-rev-c.dts but i also tested arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. Rev B is interesting because switch0 and switch1 got genphy, while switch2 got the correct Marvell PHY driver. switch2 PHYs don't have interrupt properties, so don't loop back to their parent device. Here is Rev B. I trimmed out other devices probing in parallel: [ 1.029100] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 [ 1.034735] fec 400d1000.ethernet: Using random MAC address: 42:f2:14:33:78:f5 [ 1.042272] libphy: fec_enet_mii_bus: probed [ 1.455932] libphy: mdio_mux: probed [ 1.459432] mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1 [ 1.494076] libphy: mdio: probed [ 1.518958] libphy: mdio_mux: probed [ 1.522553] mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1 [ 1.537295] libphy: mdio: probed [ 1.556571] libphy: mdio_mux: probed [ 1.559719] mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2 [ 1.574614] libphy: mdio: probed [ 1.733104] mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) [ 1.750737] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 1.768273] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 1.806561] mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) [ 1.824033] mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 1.841496] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 1.943535] mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL) [ 2.003529] mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL) [ 2.063535] mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL) [ 2.084768] DSA: tree 0 setup [ 2.087791] libphy: mdio_mux: probed [ 2.265477] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=POLL) root@zii-devel-b:~# cat /sys/kernel/debug/devices_deferred root@zii-devel-b:~# For Rev C we see: [ 1.244417] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 [ 1.250081] fec 400d1000.ethernet: Using random MAC address: c6:42:89:ed:5f:dd [ 1.257507] libphy: fec_enet_mii_bus: probed [ 1.570725] libphy: mdio_mux: probed [ 1.574208] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 [ 1.590272] libphy: mdio: probed [ 1.627721] libphy: mdio_mux: probed [ 1.631222] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 [ 1.659643] libphy: mdio: probed [ 1.811665] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 1.829230] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 1.845884] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) [ 1.863237] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) [ 1.884078] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 1.901630] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 1.918287] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) [ 1.933721] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) [ 1.948722] DSA: tree 0 setup [ 1.951599] libphy: mdio_mux: probed [ 21.565550] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=48) I have Rev B using NFS root, so the interfaces are configured up by the kernel during boot. Rev C has a local root filesystem, so user space brings the interfaces up, and it is only when the FEC is opened does it attach to the Micrel PHY. That explains the difference between 2.265 and 21.565 seconds for the last line. Again, nothing deferred. Andrew
> 3) And if possible boot logs with dev_dbg changed to dev_info in > device_link_add() and device_links_check_suppliers() Rev C. Here is everything: [ 0.000000] Booting Linux on physical CPU 0x0 [ 0.000000] Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, G NU ld (GNU Binutils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021 [ 0.000000] CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d [ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache [ 0.000000] OF: fdt: Machine model: ZII VF610 Development Board, Rev C [ 0.000000] printk: bootconsole [earlycon0] enabled [ 0.000000] Memory policy: Data cache writeback [ 0.000000] Zone ranges: [ 0.000000] Normal [mem 0x0000000080000000-0x000000009fffffff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x0000000080000000-0x000000009fffffff] [ 0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] [ 0.000000] On node 0 totalpages: 131072 [ 0.000000] Normal zone: 1024 pages used for memmap [ 0.000000] Normal zone: 0 pages reserved [ 0.000000] Normal zone: 131072 pages, LIFO batch:31 [ 0.000000] CPU: All CPU(s) started in SVC mode. [ 0.000000] pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768 [ 0.000000] pcpu-alloc: [0] 0 [ 0.000000] Built 1 zonelists, mobility grouping on. Total pages: 130048 [ 0.000000] Kernel command line: root=/dev/mmcblk0p2 rootfstype=ext4 rw rootwait earlyprintk [ 0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear) [ 0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear) [ 0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off [ 0.000000] Memory: 508208K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16080K reserved, 0K cma-r eserved) [ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 [ 0.000000] ftrace: allocating 25455 entries in 50 pages [ 0.000000] ftrace: allocated 50 pages with 3 groups [ 0.000000] NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16 [ 0.000000] L2C-310 erratum 769419 enabled [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled [ 0.000000] L2C-310 cache controller enabled, 8 ways, 512 kB [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000 [ 0.000000] random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0 [ 0.000009] sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns [ 0.015010] clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns [ 0.025738] Switching to timer-based delay loop, resolution 5ns [ 0.031679] Console: colour dummy device 80x30 [ 0.035135] printk: console [tty0] enabled [ 0.038268] printk: bootconsole [earlycon0] disabled [ 0.042336] Calibrating delay loop (skipped), value calculated using timer frequency.. 333.47 BogoMIPS (lpj=1667368) [ 0.042403] pid_max: default: 32768 minimum: 301 [ 0.042758] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.042819] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) [ 0.044134] CPU: Testing write buffer coherency: ok [ 0.045371] Setting up static identity map for 0x80100000 - 0x8010003c [ 0.046513] devtmpfs: initialized [ 0.058285] VFP support v0.3: implementor 41 architecture 2 part 30 variant 5 rev 1 [ 0.058654] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns [ 0.058725] futex hash table entries: 256 (order: -1, 3072 bytes, linear) [ 0.058923] pinctrl core: initialized pinctrl subsystem [ 0.060776] NET: Registered protocol family 16 [ 0.062307] DMA: preallocated 256 KiB pool for atomic coherent allocations [ 0.075858] platform 40044000.spi: Linked as a consumer to 40048000.iomuxc [ 0.076094] platform 4003b000.adc: Linked as a consumer to 40048000.iomuxc [ 0.076279] platform 4002c000.spi: Linked as a sync state only consumer to 40048000.iomuxc [ 0.076327] platform 4002c000.spi: Linked as a consumer to 40048000.iomuxc [ 0.076514] platform 4002a000.serial: Linked as a consumer to 40048000.iomuxc [ 0.076702] platform 40029000.serial: Linked as a consumer to 40048000.iomuxc [ 0.076915] platform 40028000.serial: Linked as a consumer to 40048000.iomuxc [ 0.077102] platform 40027000.serial: Linked as a consumer to 40048000.iomuxc [ 0.077794] platform 4002c000.spi: Linked as a sync state only consumer to 40049000.gpio [ 0.079223] platform 4002c000.spi: Linked as a sync state only consumer to 4004c000.gpio [ 0.082051] platform 40066000.i2c: Linked as a consumer to 4004a000.gpio [ 0.082252] platform 40066000.i2c: Linked as a consumer to 40048000.iomuxc [ 0.082494] platform 40066000.i2c: Linked as a sync state only consumer to 40049000.gpio [ 0.083140] platform 40067000.i2c: Linked as a consumer to 40048000.iomuxc [ 0.085386] platform 40080000.bus: Linked as a sync state only consumer to 40048000.iomuxc [ 0.085622] platform 40080000.bus: Linked as a sync state only consumer to 4004c000.gpio [ 0.087821] platform 400b2000.esdhc: Linked as a consumer to 40048000.iomuxc [ 0.089182] platform 400d0000.ethernet: Linked as a consumer to 40048000.iomuxc [ 0.089428] platform 400d0000.ethernet: Linked as a sync state only consumer to 4004c000.gpio [ 0.090067] platform 400d1000.ethernet: Linked as a consumer to 40048000.iomuxc [ 0.090730] platform 400e6000.i2c: Linked as a consumer to 40048000.iomuxc [ 0.090978] platform 400e6000.i2c: Linked as a sync state only consumer to 4004c000.gpio [ 0.092164] platform gpio-leds: Linked as a consumer to 40048000.iomuxc [ 0.092411] platform gpio-leds: Linked as a sync state only consumer to 4004b000.gpio [ 0.092885] platform 4003b000.adc: Linked as a consumer to regulator-vcc-3v3-mcu [ 0.093441] platform 40034000.usb: Linked as a consumer to regulator-usb0-vbus [ 0.093649] platform regulator-usb0-vbus: Linked as a consumer to 40049000.gpio [ 0.093850] platform regulator-usb0-vbus: Linked as a consumer to 40048000.iomuxc [ 0.095808] platform mdio-mux: Linked as a consumer to 40049000.gpio [ 0.096022] platform mdio-mux: Linked as a consumer to 40048000.iomuxc [ 0.097352] vf610-pinctrl 40048000.iomuxc: initialized IMX pinctrl driver [ 0.108955] Kprobes globally optimized [ 0.215992] platform regulator-usb0-vbus: probe deferral - supplier 40049000.gpio not ready [ 0.216878] SCSI subsystem initialized [ 0.217522] usbcore: registered new interface driver usbfs [ 0.217683] usbcore: registered new interface driver hub [ 0.217824] usbcore: registered new device driver usb [ 0.218474] platform 40066000.i2c: probe deferral - supplier 4004a000.gpio not ready [ 0.219685] i2c i2c-0: IMX I2C adapter registered [ 0.219852] i2c i2c-0: using dma0chan0 (tx) and dma0chan1 (rx) for DMA transfers [ 0.221038] i2c 1-0070: Linked as a consumer to 4004c000.gpio [ 0.221289] i2c 1-0070: Linked as a consumer to 40048000.iomuxc [ 0.221492] i2c i2c-1: IMX I2C adapter registered [ 0.221625] i2c i2c-1: using dma0chan16 (tx) and dma0chan17 (rx) for DMA transfers [ 0.221698] imx-i2c 400e6000.i2c: Dropping the link to 4004c000.gpio [ 0.222145] pps_core: LinuxPPS API ver. 1 registered [ 0.222188] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it> [ 0.222290] PTP clock support registered [ 0.222905] Advanced Linux Sound Architecture Driver Initialized. [ 0.268543] clocksource: Switched to clocksource arm_global_timer [ 0.634728] NET: Registered protocol family 2 [ 0.635842] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear) [ 0.635943] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear) [ 0.636068] TCP bind hash table entries: 4096 (order: 2, 16384 bytes, linear) [ 0.636190] TCP: Hash tables configured (established 4096 bind 4096) [ 0.636470] UDP hash table entries: 256 (order: 0, 4096 bytes, linear) [ 0.636551] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear) [ 0.636879] NET: Registered protocol family 1 [ 0.639687] RPC: Registered named UNIX socket transport module. [ 0.639749] RPC: Registered udp transport module. [ 0.639780] RPC: Registered tcp transport module. [ 0.639808] RPC: Registered tcp NFSv4.1 backchannel transport module. [ 0.642391] workingset: timestamp_bits=30 max_order=17 bucket_order=0 [ 0.657373] io scheduler mq-deadline registered [ 0.657455] io scheduler kyber registered [ 0.658814] gpio-23 (sx1503-irq): hogged as input [ 0.664005] gpio-98 (eth0-intrp): hogged as input [ 0.677195] 40027000.serial: ttyLP0 at MMIO 0x40027000 (irq = 19, base_baud = 5210526) is a FSL_LPUART [ 1.291559] printk: console [ttyLP0] enabled [ 1.295642] 40028000.serial: ttyLP1 at MMIO 0x40028000 (irq = 20, base_baud = 5210526) is a FSL_LPUART [ 1.305537] 40029000.serial: ttyLP2 at MMIO 0x40029000 (irq = 21, base_baud = 5210526) is a FSL_LPUART [ 1.315290] 4002a000.serial: ttyLP3 at MMIO 0x4002a000 (irq = 22, base_baud = 5210526) is a FSL_LPUART [ 1.360010] brd: module loaded [ 1.362289] at24 0-0050: supply vcc not found, using dummy regulator [ 1.367686] at24 0-0050: Linked as a consumer to regulator.0 [ 1.374685] at24 0-0050: 256 byte 24c02 EEPROM, read-only [ 1.385241] spi-nor spi0.0: m25p128 (16384 Kbytes) [ 1.392931] fsl-dspi 4002c000.spi: Dropping the link to 40049000.gpio [ 1.398313] fsl-dspi 4002c000.spi: Dropping the link to 4004c000.gpio [ 1.411512] spi-nor spi1.0: n25q00 (131072 Kbytes) [ 1.421718] spi-nor spi1.2: n25q00 (131072 Kbytes) [ 1.430163] libphy: Fixed MDIO Bus: probed [ 1.439542] mdio_bus 400d0000.ethernet-1: Linked as a sync state only consumer to 4004c000.gpio [ 1.447185] mdio_bus 400d0000.ethernet-1: Linked as a sync state only consumer to 40048000.iomuxc [ 1.454900] libphy: fec_enet_mii_bus: probed [ 1.478153] mdio_bus 400d0000.ethernet-1:00: Linked as a consumer to 4004c000.gpio [ 1.484753] mdio_bus 400d0000.ethernet-1:00: Linked as a consumer to 40048000.iomuxc [ 1.492867] fec 400d0000.ethernet: Dropping the link to 4004c000.gpio [ 1.504139] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 [ 1.509763] fec 400d1000.ethernet: Using random MAC address: 86:cc:e2:6b:14:52 [ 1.517154] libphy: fec_enet_mii_bus: probed [ 1.521523] usbcore: registered new interface driver asix [ 1.525784] usbcore: registered new interface driver ax88179_178a [ 1.530746] usbcore: registered new interface driver cdc_ether [ 1.535388] usbcore: registered new interface driver net1080 [ 1.539902] usbcore: registered new interface driver cdc_subset [ 1.544650] usbcore: registered new interface driver zaurus [ 1.549100] usbcore: registered new interface driver cdc_ncm [ 1.553487] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver [ 1.558763] ehci-platform: EHCI generic platform driver [ 1.563182] usbcore: registered new interface driver usb-storage [ 1.568721] platform 40034000.usb: probe deferral - supplier regulator-usb0-vbus not ready [ 1.587769] snvs_rtc 400a7000.snvs:snvs-rtc-lp: registered as rtc0 [ 1.592824] snvs_rtc 400a7000.snvs:snvs-rtc-lp: setting system clock to 2021-08-31T23:07:25 UTC (1630451245) [ 1.601721] i2c /dev entries driver [ 1.609055] i2c i2c-1: Added multiplexed i2c bus 2 [ 1.614256] i2c i2c-1: Added multiplexed i2c bus 3 [ 1.619194] i2c i2c-1: Added multiplexed i2c bus 4 [ 1.624001] i2c i2c-1: Added multiplexed i2c bus 5 [ 1.628764] i2c i2c-1: Added multiplexed i2c bus 6 [ 1.633465] i2c i2c-1: Added multiplexed i2c bus 7 [ 1.638180] i2c i2c-1: Added multiplexed i2c bus 8 [ 1.643056] i2c i2c-1: Added multiplexed i2c bus 9 [ 1.646603] pca954x 1-0070: registered 8 multiplexed busses for I2C switch pca9548 [ 1.657429] sdhci: Secure Digital Host Controller Interface driver [ 1.662415] sdhci: Copyright(c) Pierre Ossman [ 1.665495] sdhci-pltfm: SDHCI platform and OF driver helper [ 1.671279] leds-gpio gpio-leds: Dropping the link to 4004b000.gpio [ 1.677287] usbcore: registered new interface driver usbhid [ 1.681644] usbhid: USB HID core driver [ 1.684988] vf610-adc 4003b000.adc: Linked as a consumer to regulator.1 [ 1.705844] NET: Registered protocol family 17 [ 1.747018] mmc0: SDHCI controller on 400b2000.esdhc [400b2000.esdhc] using ADMA [ 1.757323] lm75 10-0048: supply vs not found, using dummy regulator [ 1.762801] lm75 10-0048: Linked as a consumer to regulator.0 [ 1.771806] lm75 10-0048: hwmon0: sensor 'lm75' [ 1.775772] at24 10-0050: supply vcc not found, using dummy regulator [ 1.781313] at24 10-0050: Linked as a consumer to regulator.0 [ 1.789929] at24 10-0050: 512 byte 24c04 EEPROM, writable, 1 bytes/write [ 1.796044] at24 10-0052: supply vcc not found, using dummy regulator [ 1.801527] at24 10-0052: Linked as a consumer to regulator.0 [ 1.810012] at24 10-0052: 512 byte 24c04 EEPROM, writable, 1 bytes/write [ 1.816481] pca953x 10-0018: supply vcc not found, using dummy regulator [ 1.822251] pca953x 10-0018: Linked as a consumer to regulator.0 [ 1.827009] pca953x 10-0018: using no AI [ 1.834047] platform sff3: Linked as a consumer to 10-0020 [ 1.838453] platform sff2: Linked as a consumer to 10-0020 [ 1.842908] i2c 10-0020: Linked as a consumer to 40049000.gpio [ 1.847643] i2c 10-0020: Linked as a consumer to 40048000.iomuxc [ 1.860624] pca953x 10-0022: supply vcc not found, using dummy regulator [ 1.866334] pca953x 10-0022: Linked as a consumer to regulator.0 [ 1.871160] pca953x 10-0022: using no AI [ 1.878025] i2c i2c-10: IMX I2C adapter registered [ 1.881721] i2c i2c-10: using dma0chan2 (tx) and dma0chan3 (rx) for DMA transfers [ 1.887989] imx-i2c 40066000.i2c: Dropping the link to 40049000.gpio [ 1.894950] mdio_bus 0.1: Linked as a sync state only consumer to 40049000.gpio [ 1.901214] mdio_bus 0.1: Linked as a sync state only consumer to 40048000.iomuxc [ 1.907514] libphy: mdio_mux: probed [ 1.910276] mdio_bus 0.1:00: Linked as a consumer to 40049000.gpio [ 1.915348] mdio_bus 0.1:00: Linked as a consumer to 40048000.iomuxc [ 1.920693] mdio_bus 0.1:00: Linked as a sync state only consumer to 0.1:00 [ 1.926760] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 [ 1.942194] mdio_bus !mdio-mux!mdio@1!switch@0!mdio: Linked as a sync state only consumer to 0.1:00 [ 1.950156] libphy: mdio: probed [ 1.952481] mmc0: host does not support reading read-only switch, assuming write-enable [ 1.963210] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: Linked as a consumer to 0.1:00 [ 1.970021] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: probe deferral - supplier 0.1:00 not ready [ 1.978320] mmc0: new high speed SDHC card at address aaaa [ 1.985170] mmcblk0: mmc0:aaaa SU04G 3.69 GiB [ 1.992443] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: Linked as a consumer to 0.1:00 [ 1.999277] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: probe deferral - supplier 0.1:00 not ready [ 2.012242] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:03: Linked as a consumer to 0.1:00 [ 2.019085] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:03: probe deferral - supplier 0.1:00 not ready [ 2.026995] mmcblk0: p1 p2 p3 < p5 > [ 2.030498] mmcblk0: p5 size 440320 extends beyond EOD, truncated [ 2.043234] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:04: Linked as a consumer to 0.1:00 [ 2.050044] mdio_bus !mdio-mux!mdio@1!switch@0!mdio:04: probe deferral - supplier 0.1:00 not ready [ 2.059198] mv88e6085 0.1:00: Dropping the link to 0.1:00 [ 2.064608] mdio_bus 0.2: Linked as a sync state only consumer to 40049000.gpio [ 2.070898] mdio_bus 0.2: Linked as a sync state only consumer to 40048000.iomuxc [ 2.077196] libphy: mdio_mux: probed [ 2.079960] mdio_bus 0.2:00: Linked as a consumer to 40049000.gpio [ 2.085043] mdio_bus 0.2:00: Linked as a consumer to 40048000.iomuxc [ 2.090387] mdio_bus 0.2:00: Linked as a sync state only consumer to 0.2:00 [ 2.096466] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 [ 2.111052] mdio_bus !mdio-mux!mdio@2!switch@0!mdio: Linked as a sync state only consumer to 0.2:00 [ 2.119003] libphy: mdio: probed [ 2.126170] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: Linked as a consumer to 0.2:00 [ 2.132949] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: probe deferral - supplier 0.2:00 not ready [ 2.145815] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: Linked as a consumer to 0.2:00 [ 2.152565] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: probe deferral - supplier 0.2:00 not ready [ 2.165415] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:03: Linked as a consumer to 0.2:00 [ 2.172164] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:03: probe deferral - supplier 0.2:00 not ready [ 2.185026] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:04: Linked as a consumer to 0.2:00 [ 2.191777] mdio_bus !mdio-mux!mdio@2!switch@0!mdio:04: probe deferral - supplier 0.2:00 not ready [ 2.309663] mv88e6085 0.1:00: configuring for fixed/ link mode [ 2.319255] mv88e6085 0.1:00: Link is Up - 100Mbps/Full - flow control off [ 2.329241] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 2.346853] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 2.364287] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) [ 2.380887] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) [ 2.397258] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:01: Dropping the link to 0.2:00 [ 2.408469] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) [ 2.423691] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:02: Dropping the link to 0.2:00 [ 2.432750] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) [ 2.447887] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:03: Dropping the link to 0.2:00 [ 2.456943] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) [ 2.472080] Generic PHY !mdio-mux!mdio@2!switch@0!mdio:04: Dropping the link to 0.2:00 [ 2.481149] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) [ 2.495310] mv88e6085 0.1:00: Linked as a consumer to 400d1000.ethernet [ 2.500779] DSA: tree 0 setup [ 2.502536] mv88e6085 0.2:00: Dropping the link to 0.2:00 [ 2.507780] libphy: mdio_mux: probed This is based on: Commit ea718c699055c8566eb64432388a04974c43b2ea (HEAD, refs/bisect/bad) Author: Saravana Kannan <saravanak@google.com> Date: Tue Mar 2 13:11:32 2021 -0800 Revert "Revert "driver core: Set fw_devlink=on by default"" This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df. Since all reported issues due to fw_devlink=on should be addressed by this series, revert the revert. fw_devlink=on Take II. Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Andrew
On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > Rev B is interesting because switch0 and switch1 got genphy, while > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > interrupt properties, so don't loop back to their parent device. This is interesting and not what I really expected to happen. It goes to show that we really need more time to understand all the subtleties of device dependencies before jumping on patching stuff. In case the DSA tree contains more than one switch, different things will happen in dsa_register_switch(). The tree itself is only initialized when the last switch calls dsa_register_switch(). All the other switches just mark themselves as present and exit probing early. See this piece of code in dsa_tree_setup: complete = dsa_tree_setup_routing_table(dst); if (!complete) return 0; So it should be a general property of cross-chip DSA trees that all switches except the last one will have the specific PHY driver probed properly, and not the genphy. Because all (N - 1) switches of a tree exit early in dsa_register_switch, they have successfully probed by the time the last switch brings up the tree, and brings up the PHYs on behalf of every other switch. The last switch can connect to the PHY on behalf of the other switches past their probe ending, and those PHYs should not defer probing because their supplier is now probed. It is only that the last switch cannot connect to the PHYs of its own ports. So if this does not work (you say that there are 2 switches that use genphy) I suspect there are also other bugs involved.
On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > Rev B is interesting because switch0 and switch1 got genphy, while > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > interrupt properties, so don't loop back to their parent device. > > This is interesting and not what I really expected to happen. It goes to > show that we really need more time to understand all the subtleties of > device dependencies before jumping on patching stuff. Here is the log on Rev B with the extra debug prints Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binu tils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021 CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache OF: fdt: Machine model: ZII VF610 Development Board, Rev B printk: bootconsole [earlycon0] enabled Memory policy: Data cache writeback Zone ranges: Normal [mem 0x0000000080000000-0x000000009fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] CPU: All CPU(s) started in SVC mode. Built 1 zonelists, mobility grouping on. Total pages: 130048 Kernel command line: root=/dev/nfs nfsroot=10.0.0.1:/srv/nfsroot/zii_devel_b,nfsvers=3,tcp ip=dhcp rw earlyprintk Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear) mem auto-init: stack:off, heap alloc:off, heap free:off Memory: 508204K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16084K reserved, 0K cma-reserved) SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 ftrace: allocating 25455 entries in 50 pages ftrace: allocated 50 pages with 3 groups NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16 L2C-310 erratum 769419 enabled L2C-310 dynamic clock gating enabled, standby mode enabled L2C-310 cache controller enabled, 8 ways, 512 kB L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000 random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0 sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns Switching to timer-based delay loop, resolution 5ns Console: colour dummy device 80x30 printk: console [tty0] enabled printk: bootconsole [earlycon0] disabled Booting Linux on physical CPU 0x0 Linux version 5.12.0-rc4-00011-gea718c699055-dirty (andrew@lenovo) (arm-linux-gnueabihf-gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binu tils for Debian) 2.37) #20 Tue Aug 31 18:06:09 CDT 2021 CPU: ARMv7 Processor [410fc051] revision 1 (ARMv7), cr=10c53c7d CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache OF: fdt: Machine model: ZII VF610 Development Board, Rev B printk: bootconsole [earlycon0] enabled Memory policy: Data cache writeback Zone ranges: Normal [mem 0x0000000080000000-0x000000009fffffff] Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] CPU: All CPU(s) started in SVC mode. Built 1 zonelists, mobility grouping on. Total pages: 130048 Kernel command line: root=/dev/nfs nfsroot=10.0.0.1:/srv/nfsroot/zii_devel_b,nfsvers=3,tcp ip=dhcp rw earlyprintk Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear) Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear) mem auto-init: stack:off, heap alloc:off, heap free:off Memory: 508204K/524288K available (7168K kernel code, 931K rwdata, 1644K rodata, 1024K init, 250K bss, 16084K reserved, 0K cma-reserved) SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1 ftrace: allocating 25455 entries in 50 pages ftrace: allocated 50 pages with 3 groups NR_IRQS: 16, nr_irqs: 16, preallocated irqs: 16 L2C-310 erratum 769419 enabled L2C-310 dynamic clock gating enabled, standby mode enabled L2C-310 cache controller enabled, 8 ways, 512 kB L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x06060000 random: get_random_bytes called from start_kernel+0x32c/0x470 with crng_init=0 sched_clock: 64 bits at 166MHz, resolution 5ns, wraps every 4398046511102ns clocksource: arm_global_timer: mask: 0xffffffffffffffff max_cycles: 0x2674622ffc, max_idle_ns: 440795203810 ns Switching to timer-based delay loop, resolution 5ns Console: colour dummy device 80x30 printk: console [tty0] enabled printk: bootconsole [earlycon0] disabled Calibrating delay loop (skipped), value calculated using timer frequency.. 333.47 BogoMIPS (lpj=1667368) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear) CPU: Testing write buffer coherency: ok Setting up static identity map for 0x80100000 - 0x8010003c devtmpfs: initialized VFP support v0.3: implementor 41 architecture 2 part 30 variant 5 rev 1 clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns futex hash table entries: 256 (order: -1, 3072 bytes, linear) pinctrl core: initialized pinctrl subsystem NET: Registered protocol family 16 DMA: preallocated 256 KiB pool for atomic coherent allocations platform 40044000.spi: Linked as a consumer to 40048000.iomuxc platform 4003b000.adc: Linked as a consumer to 40048000.iomuxc platform 40029000.serial: Linked as a consumer to 40048000.iomuxc platform 40028000.serial: Linked as a consumer to 40048000.iomuxc platform 40027000.serial: Linked as a consumer to 40048000.iomuxc platform 40066000.i2c: Linked as a consumer to 4004a000.gpio platform 40066000.i2c: Linked as a consumer to 40048000.iomuxc platform 40066000.i2c: Linked as a sync state only consumer to 4004c000.gpio platform 40067000.i2c: Linked as a consumer to 40048000.iomuxc platform 40080000.bus: Linked as a sync state only consumer to 40048000.iomuxc platform 40080000.bus: Linked as a sync state only consumer to 4004c000.gpio platform 400b2000.esdhc: Linked as a consumer to 40048000.iomuxc platform 400d0000.ethernet: Linked as a consumer to 40048000.iomuxc platform 400d1000.ethernet: Linked as a consumer to 40048000.iomuxc platform 400e6000.i2c: Linked as a consumer to 40048000.iomuxc platform 400e6000.i2c: Linked as a sync state only consumer to 4004c000.gpio platform gpio-leds: Linked as a consumer to 40048000.iomuxc platform gpio-leds: Linked as a sync state only consumer to 4004b000.gpio platform 4003b000.adc: Linked as a consumer to regulator-vcc-3v3-mcu platform 40034000.usb: Linked as a consumer to regulator-usb0-vbus platform regulator-usb0-vbus: Linked as a consumer to 40049000.gpio platform regulator-usb0-vbus: Linked as a consumer to 40048000.iomuxc platform mdio-mux: Linked as a consumer to 40049000.gpio platform mdio-mux: Linked as a consumer to 40048000.iomuxc platform spi0: Linked as a consumer to 4004a000.gpio platform spi0: Linked as a consumer to 40048000.iomuxc platform spi0: Linked as a sync state only consumer to 4004d000.gpio vf610-pinctrl 40048000.iomuxc: initialized IMX pinctrl driver Kprobes globally optimized platform regulator-usb0-vbus: probe deferral - supplier 40049000.gpio not ready SCSI subsystem initialized usbcore: registered new interface driver usbfs usbcore: registered new interface driver hub usbcore: registered new device driver usb platform 40066000.i2c: probe deferral - supplier 4004a000.gpio not ready i2c i2c-0: IMX I2C adapter registered i2c i2c-0: using dma0chan0 (tx) and dma0chan1 (rx) for DMA transfers i2c 1-0070: Linked as a consumer to 4004c000.gpio i2c 1-0070: Linked as a consumer to 40048000.iomuxc i2c i2c-1: IMX I2C adapter registered i2c i2c-1: using dma0chan16 (tx) and dma0chan17 (rx) for DMA transfers imx-i2c 400e6000.i2c: Dropping the link to 4004c000.gpio pps_core: LinuxPPS API ver. 1 registered pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it> PTP clock support registered Advanced Linux Sound Architecture Driver Initialized. clocksource: Switched to clocksource arm_global_timer NET: Registered protocol family 2 tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear) TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear) TCP bind hash table entries: 4096 (order: 2, 16384 bytes, linear) TCP: Hash tables configured (established 4096 bind 4096) UDP hash table entries: 256 (order: 0, 4096 bytes, linear) UDP-Lite hash table entries: 256 (order: 0, 4096 bytes, linear) NET: Registered protocol family 1 RPC: Registered named UNIX socket transport module. RPC: Registered udp transport module. RPC: Registered tcp transport module. RPC: Registered tcp NFSv4.1 backchannel transport module. workingset: timestamp_bits=30 max_order=17 bucket_order=0 io scheduler mq-deadline registered io scheduler kyber registered 40027000.serial: ttyLP0 at MMIO 0x40027000 (irq = 19, base_baud = 5210526) is a FSL_LPUART printk: console [ttyLP0] enabled 40028000.serial: ttyLP1 at MMIO 0x40028000 (irq = 20, base_baud = 5210526) is a FSL_LPUART 40029000.serial: ttyLP2 at MMIO 0x40029000 (irq = 21, base_baud = 5210526) is a FSL_LPUART brd: module loaded spi-nor spi0.0: mt25ql02g (262144 Kbytes) spi-nor spi0.2: mt25ql02g (262144 Kbytes) libphy: Fixed MDIO Bus: probed libphy: fec_enet_mii_bus: probed fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 fec 400d1000.ethernet: Using random MAC address: 1a:09:76:4a:0d:49 libphy: fec_enet_mii_bus: probed usbcore: registered new interface driver asix usbcore: registered new interface driver ax88179_178a usbcore: registered new interface driver cdc_ether usbcore: registered new interface driver net1080 usbcore: registered new interface driver cdc_subset usbcore: registered new interface driver zaurus usbcore: registered new interface driver cdc_ncm ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver ehci-platform: EHCI generic platform driver usbcore: registered new interface driver usb-storage platform 40034000.usb: probe deferral - supplier regulator-usb0-vbus not ready snvs_rtc 400a7000.snvs:snvs-rtc-lp: registered as rtc0 snvs_rtc 400a7000.snvs:snvs-rtc-lp: setting system clock to 1970-01-01T00:00:00 UTC (0) i2c /dev entries driver at24 2-0050: supply vcc not found, using dummy regulator at24 2-0050: Linked as a consumer to regulator.0 at24 2-0050: Dropping the link to regulator.0 i2c i2c-1: Added multiplexed i2c bus 2 at24 3-0050: supply vcc not found, using dummy regulator at24 3-0050: Linked as a consumer to regulator.0 at24 3-0050: Dropping the link to regulator.0 i2c i2c-1: Added multiplexed i2c bus 3 at24 4-0050: supply vcc not found, using dummy regulator at24 4-0050: Linked as a consumer to regulator.0 at24 4-0050: 256 byte 24c02 EEPROM, writable, 1 bytes/write i2c i2c-1: Added multiplexed i2c bus 4 at24 5-0050: supply vcc not found, using dummy regulator at24 5-0050: Linked as a consumer to regulator.0 at24 5-0050: 256 byte 24c02 EEPROM, writable, 1 bytes/write i2c i2c-1: Added multiplexed i2c bus 5 i2c i2c-1: Added multiplexed i2c bus 6 i2c i2c-1: Added multiplexed i2c bus 7 i2c i2c-1: Added multiplexed i2c bus 8 i2c i2c-1: Added multiplexed i2c bus 9 pca954x 1-0070: registered 8 multiplexed busses for I2C switch pca9548 sdhci: Secure Digital Host Controller Interface driver sdhci: Copyright(c) Pierre Ossman sdhci-pltfm: SDHCI platform and OF driver helper leds-gpio gpio-leds: Dropping the link to 4004b000.gpio usbcore: registered new interface driver usbhid usbhid: USB HID core driver vf610-adc 4003b000.adc: Linked as a consumer to regulator.1 NET: Registered protocol family 17 mmc0: SDHCI controller on 400b2000.esdhc [400b2000.esdhc] using ADMA lm75 10-0048: supply vs not found, using dummy regulator lm75 10-0048: Linked as a consumer to regulator.0 lm75 10-0048: hwmon0: sensor 'lm75' at24 10-0050: supply vcc not found, using dummy regulator at24 10-0050: Linked as a consumer to regulator.0 at24 10-0050: 512 byte 24c04 EEPROM, writable, 1 bytes/write at24 10-0052: supply vcc not found, using dummy regulator at24 10-0052: Linked as a consumer to regulator.0 at24 10-0052: 512 byte 24c04 EEPROM, writable, 1 bytes/write pca953x 10-0020: supply vcc not found, using dummy regulator pca953x 10-0020: Linked as a consumer to regulator.0 pca953x 10-0020: using no AI platform mdio-mux: Linked as a sync state only consumer to 10-0022 i2c 10-0022: Linked as a consumer to 4004c000.gpio i2c 10-0022: Linked as a consumer to 40048000.iomuxc pca953x 10-0022: supply vcc not found, using dummy regulator pca953x 10-0022: Linked as a consumer to regulator.0 pca953x 10-0022: using no AI pca953x 10-0022: interrupt support not compiled in i2c i2c-10: IMX I2C adapter registered i2c i2c-10: using dma0chan2 (tx) and dma0chan3 (rx) for DMA transfers imx-i2c 40066000.i2c: Dropping the link to 4004c000.gpio mdio_bus 0.1: Linked as a sync state only consumer to 40049000.gpio mdio_bus 0.1: Linked as a sync state only consumer to 40048000.iomuxc libphy: mdio_mux: probed mdio_bus 0.1:00: Linked as a consumer to 40049000.gpio mdio_bus 0.1:00: Linked as a consumer to 40048000.iomuxc mdio_bus 0.1:00: Linked as a sync state only consumer to 0.1:00 mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1 mdio_bus !mdio-mux!mdio@1!switch@0!mdio: Linked as a sync state only consumer to 0.1:00 libphy: mdio: probed mmc0: host does not support reading read-only switch, assuming write-enable mmc0: new high speed SDHC card at address 0007 mdio_bus !mdio-mux!mdio@1!switch@0!mdio:00: Linked as a consumer to 0.1:00 mdio_bus !mdio-mux!mdio@1!switch@0!mdio:00: probe deferral - supplier 0.1:00 not ready mmcblk0: mmc0:0007 SD4GB 3.71 GiB mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: Linked as a consumer to 0.1:00 mdio_bus !mdio-mux!mdio@1!switch@0!mdio:01: probe deferral - supplier 0.1:00 not ready mmcblk0: p1 p2 p3 < p5 > mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: Linked as a consumer to 0.1:00 mdio_bus !mdio-mux!mdio@1!switch@0!mdio:02: probe deferral - supplier 0.1:00 not ready mv88e6085 0.1:00: Dropping the link to 0.1:00 mdio_bus 0.2: Linked as a sync state only consumer to 40049000.gpio mdio_bus 0.2: Linked as a sync state only consumer to 40048000.iomuxc libphy: mdio_mux: probed mdio_bus 0.2:00: Linked as a consumer to 40049000.gpio mdio_bus 0.2:00: Linked as a consumer to 40048000.iomuxc mdio_bus 0.2:00: Linked as a sync state only consumer to 0.2:00 mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1 mdio_bus !mdio-mux!mdio@2!switch@0!mdio: Linked as a sync state only consumer to 0.2:00 libphy: mdio: probed mdio_bus !mdio-mux!mdio@2!switch@0!mdio:00: Linked as a consumer to 0.2:00 mdio_bus !mdio-mux!mdio@2!switch@0!mdio:00: probe deferral - supplier 0.2:00 not ready mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: Linked as a consumer to 0.2:00 mdio_bus !mdio-mux!mdio@2!switch@0!mdio:01: probe deferral - supplier 0.2:00 not ready mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: Linked as a consumer to 0.2:00 mdio_bus !mdio-mux!mdio@2!switch@0!mdio:02: probe deferral - supplier 0.2:00 not ready mv88e6085 0.2:00: Dropping the link to 0.2:00 mdio_bus 0.4: Linked as a sync state only consumer to 10-0022 libphy: mdio_mux: probed mdio_bus 0.4:00: Linked as a sync state only consumer to 10-0022 mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2 libphy: mdio: probed mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) mv88e6085 0.1:00: configuring for fixed/rgmii-txid link mode mv88e6085 0.1:00: configuring for fixed/ link mode mv88e6085 0.1:00: Link is Up - 1Gbps/Full - flow control off mv88e6085 0.1:00: Link is Up - 100Mbps/Full - flow control off mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) mv88e6085 0.2:00: configuring for fixed/rgmii-txid link mode mv88e6085 0.2:00: configuring for fixed/rgmii-txid link mode mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL) mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL) mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL) mv88e6085 0.4:00: configuring for fixed/rgmii-txid link mode mv88e6085 0.1:00: Linked as a consumer to 400d1000.ethernet DSA: tree 0 setup mv88e6085 0.4:00: Dropping the link to 10-0022 libphy: mdio_mux: probed mdio-mux-gpio mdio-mux: Dropping the link to 10-0022 Andrew
On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > Rev B is interesting because switch0 and switch1 got genphy, while > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > interrupt properties, so don't loop back to their parent device. > > This is interesting and not what I really expected to happen. It goes to > show that we really need more time to understand all the subtleties of > device dependencies before jumping on patching stuff. There is an even more interesting variation which I would like to point out. It seems like a very odd loophole in the device links. Take the example of the mv88e6xxx DSA driver. On my board (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I had to declare the switches as interrupt controller and add interrupts to their internal PHYs, I still need considerable force to 'break' this board in the way discussed in this thread. The correct PHY driver insists to probe, and not genphy. Let me explain. The automatic device links between the switch (supplier, as interrupt-controller) and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle. Important note: fwnode_link_add does not link devices, it links OF nodes. Even more important node, in the form of a comment: * The driver core will use the fwnode link to create a device link between the * two device objects corresponding to @con and @sup when they are created. The * driver core will automatically delete the fwnode link between @con and @sup * after doing that. Okay?! What seems to be omitted is that the DSA switch driver's probing itself can be deferred. For example: dsa_register_switch -> dsa_switch_probe -> dsa_switch_parse_of -> dsa_switch_parse_ports_of -> dsa_port_parse_of -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0)); -> not found => return -EPROBE_DEFER When dsa_register_switch() returns -EPROBE_DEFER, it is effectively an error path. So the reverse of initialization is performed. The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_ dsa_register_switch. So when dsa_register_switch returns error code, mv88e6xxx_mdios_unregister() will be called. When mv88e6xxx_mdios_unregister() is called, the MDIO buses with internal PHYs are destroyed. So the PHY devices themselves are destroyed too. And the device links between the DSA switch and the internal PHYs, those created based on the firmware node links created by fwnode_link_add, are dropped too. Now remember the comment that the device links created based on fwnode_link_add are not restored. So probing of the DSA switch finally resumes, and this time device_links_check_suppliers() is effectively bypassed, the PHYs no longer request probe deferral due to their supplier not being ready, because the device link no longer exists. Isn't this self-sabotaging?! Now generally, DSA drivers defer probing because they probe in parallel with the DSA master. This is typical if the switch is on a SPI bus, or I2C, or on an MDIO bus provided by a _standalone_ MDIO controller. If the MDIO controller is not standalone, but is provided by Ethernet controller that is the DSA master itself, then things change a lot, because probing can never be parallel. The DSA master probes, initializes its MDIO bus, and this triggers the probing of the MDIO devices on that bus, one of which is the DSA switch. So DSA can no longer defer the probe due to that reason. Secondly, in DSA we even have variation between drivers as to where they register their internal MDIO buses. The mv88e6xxx driver does this in mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this is important. DSA provides drivers with a .setup() callback, which is guaranteed to take place after nothing can defer the switch's probe anymore. So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this setup, because the device links framework isn't sabotaging itself anymore. Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register() from the probe method and not the setup method, the entire design issue with interrupts on internal DSA switch ports would have went absolutely unnoticed for a few more years. I have not tested this, but it also seems plausible that DSA can trivially and reliably bypass any fw_devlink=on restrictions by simply moving all of_mdiobus_register() driver calls from the .setup() method to their respective probe methods (prior to calling dsa_register_switch), then effectively fabricate an -EPROBE_DEFER during the first probe attempt. I mean, who will know whether that probe deferral request was justified or not? Anyway, I'm not sure everyone agrees with this type of "solution" (even though it's worth pointing it out as a fw_devlink limitation). In any case, we need some sort of lightweight "fix" to the chicken-and-egg problem, which will give me enough time to think of something better. I hope it is at least clearer now that there are subtleties and nuances, and we cannot just assess how many boards are broken by looking at the device trees. By design, all are, sure, but they might still work, and that's better than nothing...
On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote: > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > interrupt properties, so don't loop back to their parent device. > > > > This is interesting and not what I really expected to happen. It goes to > > show that we really need more time to understand all the subtleties of > > device dependencies before jumping on patching stuff. > > There is an even more interesting variation which I would like to point > out. It seems like a very odd loophole in the device links. > > Take the example of the mv88e6xxx DSA driver. On my board > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I > had to declare the switches as interrupt controller and add interrupts > to their internal PHYs, I still need considerable force to 'break' this > board in the way discussed in this thread. The correct PHY driver insists > to probe, and not genphy. Let me explain. > > The automatic device links between the switch (supplier, as interrupt-controller) > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle. > > Important note: fwnode_link_add does not link devices, it links OF nodes. > > Even more important node, in the form of a comment: > > * The driver core will use the fwnode link to create a device link between the > * two device objects corresponding to @con and @sup when they are created. The > * driver core will automatically delete the fwnode link between @con and @sup > * after doing that. > > Okay?! > > What seems to be omitted is that the DSA switch driver's probing itself > can be deferred. For example: > > dsa_register_switch > -> dsa_switch_probe > -> dsa_switch_parse_of > -> dsa_switch_parse_ports_of > -> dsa_port_parse_of > -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0)); > -> not found => return -EPROBE_DEFER > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively > an error path. So the reverse of initialization is performed. > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_ > dsa_register_switch. So when dsa_register_switch returns error code, > mv88e6xxx_mdios_unregister() will be called. > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with > internal PHYs are destroyed. So the PHY devices themselves are destroyed > too. And the device links between the DSA switch and the internal PHYs, > those created based on the firmware node links created by fwnode_link_add, > are dropped too. > > Now remember the comment that the device links created based on > fwnode_link_add are not restored. > > So probing of the DSA switch finally resumes, and this time > device_links_check_suppliers() is effectively bypassed, the PHYs no > longer request probe deferral due to their supplier not being ready, > because the device link no longer exists. > > Isn't this self-sabotaging?! > > Now generally, DSA drivers defer probing because they probe in parallel > with the DSA master. This is typical if the switch is on a SPI bus, or > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller. > > If the MDIO controller is not standalone, but is provided by Ethernet > controller that is the DSA master itself, then things change a lot, > because probing can never be parallel. The DSA master probes, > initializes its MDIO bus, and this triggers the probing of the MDIO > devices on that bus, one of which is the DSA switch. So DSA can no > longer defer the probe due to that reason. > > Secondly, in DSA we even have variation between drivers as to where they > register their internal MDIO buses. The mv88e6xxx driver does this in > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this > is important. DSA provides drivers with a .setup() callback, which is > guaranteed to take place after nothing can defer the switch's probe > anymore. > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this > setup, because the device links framework isn't sabotaging itself anymore. > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register() > from the probe method and not the setup method, the entire design issue > with interrupts on internal DSA switch ports would have went absolutely > unnoticed for a few more years. > > I have not tested this, but it also seems plausible that DSA can > trivially and reliably bypass any fw_devlink=on restrictions by simply > moving all of_mdiobus_register() driver calls from the .setup() method > to their respective probe methods (prior to calling dsa_register_switch), > then effectively fabricate an -EPROBE_DEFER during the first probe attempt. > I mean, who will know whether that probe deferral request was justified > or not? Pushing the thought even further, it is not even necessary to move the of_mdiobus_register() call to the probe function. Where it is (in .setup) is already good enough. It is sufficient to return -EOPNOTSUPP once (the first time) immediately _after_ the call to of_mdiobus_register (and have a proper error path, i.e. call mdiobus_unregister too). > Anyway, I'm not sure everyone agrees with this type of "solution" (even > though it's worth pointing it out as a fw_devlink limitation). In any > case, we need some sort of lightweight "fix" to the chicken-and-egg > problem, which will give me enough time to think of something better. > I hope it is at least clearer now that there are subtleties and nuances, > and we cannot just assess how many boards are broken by looking at the > device trees. By design, all are, sure, but they might still work, and > that's better than nothing...
On Tue, Aug 31, 2021 at 4:02 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > If the switches are broken without the phy-handle or ethernet change, > > I'm not sure if the "BROKEN_PARENT" patch would help. > > > > Which is not enough to fix these Ethernet switches. > > > > Ok, if you can give more specifics on this, I'll look into it. > > The switches probe, but get the wrong PHY driver, genphy, not the > Marvell PHY driver. And genphy is not sufficient for this hardware. > > I'd need: > > 1) The DTS file that you see the issue on. > > I did the bisect on arch/arm/boot/dts/vf610-zii-dev-rev-c.dts but i > also tested arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. Thanks for the detailed info Andrew. I looked at the DT files. So yeah, this is similar to the realtek issue and my generic fix for DSA should work for all of these unless I'm forgetting something. Please let me know if it doesn't. -Saravana > Rev B is interesting because switch0 and switch1 got genphy, while > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > interrupt properties, so don't loop back to their parent device. > > Here is Rev B. I trimmed out other devices probing in parallel: > > [ 1.029100] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 > [ 1.034735] fec 400d1000.ethernet: Using random MAC address: 42:f2:14:33:78:f5 > [ 1.042272] libphy: fec_enet_mii_bus: probed > [ 1.455932] libphy: mdio_mux: probed > [ 1.459432] mv88e6085 0.1:00: switch 0x3520 detected: Marvell 88E6352, revision 1 > [ 1.494076] libphy: mdio: probed > [ 1.518958] libphy: mdio_mux: probed > [ 1.522553] mv88e6085 0.2:00: switch 0x3520 detected: Marvell 88E6352, revision 1 > [ 1.537295] libphy: mdio: probed > [ 1.556571] libphy: mdio_mux: probed > [ 1.559719] mv88e6085 0.4:00: switch 0x1a70 detected: Marvell 88E6185, revision 2 > [ 1.574614] libphy: mdio: probed > [ 1.733104] mv88e6085 0.1:00 lan0 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) > [ 1.750737] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) > [ 1.768273] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) > [ 1.806561] mv88e6085 0.2:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:00] driver [Generic PHY] (irq=POLL) > [ 1.824033] mv88e6085 0.2:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) > [ 1.841496] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) > [ 1.943535] mv88e6085 0.4:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:00] driver [Marvell 88E1545] (irq=POLL) > [ 2.003529] mv88e6085 0.4:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:01] driver [Marvell 88E1545] (irq=POLL) > [ 2.063535] mv88e6085 0.4:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@4!switch@0!mdio:02] driver [Marvell 88E1545] (irq=POLL) > [ 2.084768] DSA: tree 0 setup > [ 2.087791] libphy: mdio_mux: probed > [ 2.265477] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=POLL) > > root@zii-devel-b:~# cat /sys/kernel/debug/devices_deferred > root@zii-devel-b:~# > > For Rev C we see: > > [ 1.244417] fec 400d1000.ethernet: Invalid MAC address: 00:00:00:00:00:00 > [ 1.250081] fec 400d1000.ethernet: Using random MAC address: c6:42:89:ed:5f:dd > [ 1.257507] libphy: fec_enet_mii_bus: probed > [ 1.570725] libphy: mdio_mux: probed > [ 1.574208] mv88e6085 0.1:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 > [ 1.590272] libphy: mdio: probed > [ 1.627721] libphy: mdio_mux: probed > [ 1.631222] mv88e6085 0.2:00: switch 0xa10 detected: Marvell 88E6390X, revision 1 > [ 1.659643] libphy: mdio: probed > [ 1.811665] mv88e6085 0.1:00 lan1 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) > [ 1.829230] mv88e6085 0.1:00 lan2 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) > [ 1.845884] mv88e6085 0.1:00 lan3 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) > [ 1.863237] mv88e6085 0.1:00 lan4 (uninitialized): PHY [!mdio-mux!mdio@1!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) > [ 1.884078] mv88e6085 0.2:00 lan5 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:01] driver [Generic PHY] (irq=POLL) > [ 1.901630] mv88e6085 0.2:00 lan6 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:02] driver [Generic PHY] (irq=POLL) > [ 1.918287] mv88e6085 0.2:00 lan7 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:03] driver [Generic PHY] (irq=POLL) > [ 1.933721] mv88e6085 0.2:00 lan8 (uninitialized): PHY [!mdio-mux!mdio@2!switch@0!mdio:04] driver [Generic PHY] (irq=POLL) > [ 1.948722] DSA: tree 0 setup > [ 1.951599] libphy: mdio_mux: probed > > [ 21.565550] Micrel KSZ8041 400d0000.ethernet-1:00: attached PHY driver (mii_bus:phy_addr=400d0000.ethernet-1:00, irq=48) > > I have Rev B using NFS root, so the interfaces are configured up by > the kernel during boot. Rev C has a local root filesystem, so user > space brings the interfaces up, and it is only when the FEC is opened > does it attach to the Micrel PHY. That explains the difference between > 2.265 and 21.565 seconds for the last line. > > Again, nothing deferred. > > Andrew
On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > Rev B is interesting because switch0 and switch1 got genphy, while > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > interrupt properties, so don't loop back to their parent device. > > This is interesting and not what I really expected to happen. It goes to > show that we really need more time to understand all the subtleties of > device dependencies before jumping on patching stuff. > > In case the DSA tree contains more than one switch, different things > will happen in dsa_register_switch(). > The tree itself is only initialized when the last switch calls > dsa_register_switch(). All the other switches just mark themselves as > present and exit probing early. See this piece of code in dsa_tree_setup: > > complete = dsa_tree_setup_routing_table(dst); > if (!complete) > return 0; > > So it should be a general property of cross-chip DSA trees that all > switches except the last one will have the specific PHY driver probed > properly, and not the genphy. > > Because all (N - 1) switches of a tree exit early in dsa_register_switch, > they have successfully probed by the time the last switch brings up the > tree, and brings up the PHYs on behalf of every other switch. > > The last switch can connect to the PHY on behalf of the other switches > past their probe ending, and those PHYs should not defer probing because > their supplier is now probed. It is only that the last switch cannot > connect to the PHYs of its own ports. I'm not saying this with any intention of making things easier for me (I'm not even sure it does). But your description about how multiple switches are handled by DSA has me even more convinced than before that DSA needs to use a component device model. This is like the textbook example for component devices. -Saravana
On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote: > > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > > interrupt properties, so don't loop back to their parent device. > > > > > > This is interesting and not what I really expected to happen. It goes to > > > show that we really need more time to understand all the subtleties of > > > device dependencies before jumping on patching stuff. > > > > There is an even more interesting variation which I would like to point > > out. It seems like a very odd loophole in the device links. > > > > Take the example of the mv88e6xxx DSA driver. On my board > > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I > > had to declare the switches as interrupt controller and add interrupts > > to their internal PHYs, I still need considerable force to 'break' this > > board in the way discussed in this thread. The correct PHY driver insists > > to probe, and not genphy. Let me explain. > > > > The automatic device links between the switch (supplier, as interrupt-controller) > > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle. > > > > Important note: fwnode_link_add does not link devices, it links OF nodes. > > > > Even more important node, in the form of a comment: > > > > * The driver core will use the fwnode link to create a device link between the > > * two device objects corresponding to @con and @sup when they are created. The > > * driver core will automatically delete the fwnode link between @con and @sup > > * after doing that. > > > > Okay?! > > > > What seems to be omitted is that the DSA switch driver's probing itself > > can be deferred. For example: > > > > dsa_register_switch > > -> dsa_switch_probe > > -> dsa_switch_parse_of > > -> dsa_switch_parse_ports_of > > -> dsa_port_parse_of > > -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0)); > > -> not found => return -EPROBE_DEFER > > > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively > > an error path. So the reverse of initialization is performed. > > > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_ > > dsa_register_switch. So when dsa_register_switch returns error code, > > mv88e6xxx_mdios_unregister() will be called. > > > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with > > internal PHYs are destroyed. So the PHY devices themselves are destroyed > > too. And the device links between the DSA switch and the internal PHYs, > > those created based on the firmware node links created by fwnode_link_add, > > are dropped too. > > > > Now remember the comment that the device links created based on > > fwnode_link_add are not restored. > > > > So probing of the DSA switch finally resumes, and this time > > device_links_check_suppliers() is effectively bypassed, the PHYs no > > longer request probe deferral due to their supplier not being ready, > > because the device link no longer exists. > > > > Isn't this self-sabotaging?! Yeah, this is a known "issue". I'm saying "issue" because at worst it'd allow a few unnecessary deferred probes. And if you want to break or get fw_devlink to ignore your child devices or your consumers, there are simpler APIs to do it without having to intentionally defer a probe. Fixing this "issue" would just use up more memory and increase boot time for no meaningful benefit. > > > > Now generally, DSA drivers defer probing because they probe in parallel > > with the DSA master. This is typical if the switch is on a SPI bus, or > > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller. > > > > If the MDIO controller is not standalone, but is provided by Ethernet > > controller that is the DSA master itself, then things change a lot, > > because probing can never be parallel. The DSA master probes, > > initializes its MDIO bus, and this triggers the probing of the MDIO > > devices on that bus, one of which is the DSA switch. So DSA can no > > longer defer the probe due to that reason. > > > > Secondly, in DSA we even have variation between drivers as to where they > > register their internal MDIO buses. The mv88e6xxx driver does this in > > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb > > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this > > is important. DSA provides drivers with a .setup() callback, which is > > guaranteed to take place after nothing can defer the switch's probe > > anymore. > > > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register > > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this > > setup, because the device links framework isn't sabotaging itself anymore. > > > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register() > > from the probe method and not the setup method, the entire design issue > > with interrupts on internal DSA switch ports would have went absolutely > > unnoticed for a few more years. > > > > I have not tested this, but it also seems plausible that DSA can > > trivially and reliably bypass any fw_devlink=on restrictions by simply > > moving all of_mdiobus_register() driver calls from the .setup() method > > to their respective probe methods (prior to calling dsa_register_switch), > > then effectively fabricate an -EPROBE_DEFER during the first probe attempt. > > I mean, who will know whether that probe deferral request was justified > > or not? > > Pushing the thought even further, it is not even necessary to move the > of_mdiobus_register() call to the probe function. Where it is (in .setup) > is already good enough. It is sufficient to return -EOPNOTSUPP once > (the first time) immediately _after_ the call to of_mdiobus_register > (and have a proper error path, i.e. call mdiobus_unregister too). Right, there are plenty of ways to intentionally break fw_devlink. I hope that's not the point :) And I don't think -EOPNOTSUPP would work because your device wouldn't be probed again. > > > Anyway, I'm not sure everyone agrees with this type of "solution" (even > > though it's worth pointing it out as a fw_devlink limitation). In any > > case, we need some sort of lightweight "fix" to the chicken-and-egg > > problem, which will give me enough time to think of something better. I think the generic DSA patch I gave would be the lightweight fix to address this chicken-and-egg issue. As for the long term fix, I'd really suggest looking into using the component device model. I'd even be happy to help make any driver core/component device improvements you might need. I'm also interested in looking into improving the PHY probing so that the genphy never probes a device that has a driver that could probe it. Even outside of all this fw_devlink thing, they way PHY is handled now, if any of the supplier really isn't ready yet (say a clock), then the genphy gets used -- which isn't good. -Saravana > > I hope it is at least clearer now that there are subtleties and nuances, > > and we cannot just assess how many boards are broken by looking at the > > device trees. By design, all are, sure, but they might still work, and > > that's better than nothing...
On Tue, Aug 31, 2021 at 07:00:58PM -0700, Saravana Kannan wrote: > On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > interrupt properties, so don't loop back to their parent device. > > > > This is interesting and not what I really expected to happen. It goes to > > show that we really need more time to understand all the subtleties of > > device dependencies before jumping on patching stuff. > > > > In case the DSA tree contains more than one switch, different things > > will happen in dsa_register_switch(). > > The tree itself is only initialized when the last switch calls > > dsa_register_switch(). All the other switches just mark themselves as > > present and exit probing early. See this piece of code in dsa_tree_setup: > > > > complete = dsa_tree_setup_routing_table(dst); > > if (!complete) > > return 0; > > > > So it should be a general property of cross-chip DSA trees that all > > switches except the last one will have the specific PHY driver probed > > properly, and not the genphy. > > > > Because all (N - 1) switches of a tree exit early in dsa_register_switch, > > they have successfully probed by the time the last switch brings up the > > tree, and brings up the PHYs on behalf of every other switch. > > > > The last switch can connect to the PHY on behalf of the other switches > > past their probe ending, and those PHYs should not defer probing because > > their supplier is now probed. It is only that the last switch cannot > > connect to the PHYs of its own ports. > > I'm not saying this with any intention of making things easier for me > (I'm not even sure it does). But your description about how multiple > switches are handled by DSA has me even more convinced than before > that DSA needs to use a component device model. This is like the > textbook example for component devices. In this example, I guess the component master would be the "struct dsa_switch_tree", but there is no struct device associated with it. How many "struct dsa_switch_tree" instances there are in a system depends on whether OF is used or not. If we use OF, the device tree needs to be parsed, and every unique first cell (tree-id) of: dsa,member = <tree-id switch-id>; constitutes a different "struct dsa_switch_tree". If we do not use OF, the number of switch trees in a system is one, see dsa_switch_parse. It seems to me like the compare function for component_match (where each component is a "struct dsa_switch" should look at dev->of_node and parse the "dsa,member" property again, and match on the same tree-id as the component master itself? There's also the question of how to do the component_match in a way that also works for the pdata/non-OF based DSA systems (of which I have none to test). All of this to move dsa_tree_setup() outside of the probe calling context of any individual struct dsa_switch, and into the "bind" calling context of the component master associated with the struct dsa_switch_tree. This would allow the phy_connect()/phy_attach_direct() calls to find the PHY device already bound to the specific driver, which would avoid binding genphy as a last resort? Two questions: - Why would it now be more guaranteed that the PHY drivers are bound to the internal PHY devices exactly during the time span between events (a) Switch driver (a component of the switch tree) finishes probing (b) Switch tree (the component master) starts binding I mean in lack of any guarantee, we can still end up in a situation where the specific PHY driver still is not bound early enough to the internal PHY to be available by the time we call phylink_of_phy_connect, and we have all those component device goodies but they don't help. I'm sure I'm misunderstanding something but I don't know what. - What if the internal PHY has other suppliers beyond the interrupt-parent? What if, say, it has a reset-gpios = <&gpio1>, where gpio1 is provided by some other thing on some other slow bus, which is equally slow (or slower) to probe to the DSA switch itself. So the temporary absence of this other supplier is causing the specific PHY driver to defer probing, just enough for a concurrent call to phylink_of_phy_connect -> phy_attach_direct to say "ok, I've waited enough and there is no driver, genphy it is". How would this be avoided? Or are you thinking of some kind of two-level component driver system: - the DSA switch is a component master, with components being its sub-devices such as internal PHYs etc - the DSA switch is also a component of the DSA switch tree But in that case, why is it even relevant to model the DSA switch tree probing as a component master, and why does it help? I don't care if it's a "textbook example" or not if it doesn't help.
On Tue, Aug 31, 2021 at 07:19:40PM -0700, Saravana Kannan wrote: > On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote: > > > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > > > interrupt properties, so don't loop back to their parent device. > > > > > > > > This is interesting and not what I really expected to happen. It goes to > > > > show that we really need more time to understand all the subtleties of > > > > device dependencies before jumping on patching stuff. > > > > > > There is an even more interesting variation which I would like to point > > > out. It seems like a very odd loophole in the device links. > > > > > > Take the example of the mv88e6xxx DSA driver. On my board > > > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I > > > had to declare the switches as interrupt controller and add interrupts > > > to their internal PHYs, I still need considerable force to 'break' this > > > board in the way discussed in this thread. The correct PHY driver insists > > > to probe, and not genphy. Let me explain. > > > > > > The automatic device links between the switch (supplier, as interrupt-controller) > > > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle. > > > > > > Important note: fwnode_link_add does not link devices, it links OF nodes. > > > > > > Even more important node, in the form of a comment: > > > > > > * The driver core will use the fwnode link to create a device link between the > > > * two device objects corresponding to @con and @sup when they are created. The > > > * driver core will automatically delete the fwnode link between @con and @sup > > > * after doing that. > > > > > > Okay?! > > > > > > What seems to be omitted is that the DSA switch driver's probing itself > > > can be deferred. For example: > > > > > > dsa_register_switch > > > -> dsa_switch_probe > > > -> dsa_switch_parse_of > > > -> dsa_switch_parse_ports_of > > > -> dsa_port_parse_of > > > -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0)); > > > -> not found => return -EPROBE_DEFER > > > > > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively > > > an error path. So the reverse of initialization is performed. > > > > > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_ > > > dsa_register_switch. So when dsa_register_switch returns error code, > > > mv88e6xxx_mdios_unregister() will be called. > > > > > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with > > > internal PHYs are destroyed. So the PHY devices themselves are destroyed > > > too. And the device links between the DSA switch and the internal PHYs, > > > those created based on the firmware node links created by fwnode_link_add, > > > are dropped too. > > > > > > Now remember the comment that the device links created based on > > > fwnode_link_add are not restored. > > > > > > So probing of the DSA switch finally resumes, and this time > > > device_links_check_suppliers() is effectively bypassed, the PHYs no > > > longer request probe deferral due to their supplier not being ready, > > > because the device link no longer exists. > > > > > > Isn't this self-sabotaging?! > > Yeah, this is a known "issue". I'm saying "issue" because at worst > it'd allow a few unnecessary deferred probes. And if you want to break > or get fw_devlink to ignore your child devices or your consumers, > there are simpler APIs to do it without having to intentionally defer > a probe. Fixing this "issue" would just use up more memory and > increase boot time for no meaningful benefit. But I mean, if the goal of fw_devlink is to infer a probing order based on phandles, and it is faced with a long chain of devices, then any -EPROBE_DEFER of a device on top of the chain will break the probing order for all devices beneath it. It is self-defeating, it is already memory used for nothing. > > > > > > Now generally, DSA drivers defer probing because they probe in parallel > > > with the DSA master. This is typical if the switch is on a SPI bus, or > > > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller. > > > > > > If the MDIO controller is not standalone, but is provided by Ethernet > > > controller that is the DSA master itself, then things change a lot, > > > because probing can never be parallel. The DSA master probes, > > > initializes its MDIO bus, and this triggers the probing of the MDIO > > > devices on that bus, one of which is the DSA switch. So DSA can no > > > longer defer the probe due to that reason. > > > > > > Secondly, in DSA we even have variation between drivers as to where they > > > register their internal MDIO buses. The mv88e6xxx driver does this in > > > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb > > > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this > > > is important. DSA provides drivers with a .setup() callback, which is > > > guaranteed to take place after nothing can defer the switch's probe > > > anymore. > > > > > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register > > > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this > > > setup, because the device links framework isn't sabotaging itself anymore. > > > > > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register() > > > from the probe method and not the setup method, the entire design issue > > > with interrupts on internal DSA switch ports would have went absolutely > > > unnoticed for a few more years. > > > > > > I have not tested this, but it also seems plausible that DSA can > > > trivially and reliably bypass any fw_devlink=on restrictions by simply > > > moving all of_mdiobus_register() driver calls from the .setup() method > > > to their respective probe methods (prior to calling dsa_register_switch), > > > then effectively fabricate an -EPROBE_DEFER during the first probe attempt. > > > I mean, who will know whether that probe deferral request was justified > > > or not? > > > > Pushing the thought even further, it is not even necessary to move the > > of_mdiobus_register() call to the probe function. Where it is (in .setup) > > is already good enough. It is sufficient to return -EOPNOTSUPP once > > (the first time) immediately _after_ the call to of_mdiobus_register > > (and have a proper error path, i.e. call mdiobus_unregister too). > > Right, there are plenty of ways to intentionally break fw_devlink. I > hope that's not the point :) And I don't think -EOPNOTSUPP would work > because your device wouldn't be probed again. Yes, -EPROBE_DEFER is what I meant. > > > > > Anyway, I'm not sure everyone agrees with this type of "solution" (even > > > though it's worth pointing it out as a fw_devlink limitation). In any > > > case, we need some sort of lightweight "fix" to the chicken-and-egg > > > problem, which will give me enough time to think of something better. > > I think the generic DSA patch I gave would be the lightweight fix to > address this chicken-and-egg issue. > > As for the long term fix, I'd really suggest looking into using the > component device model. I'd even be happy to help make any driver > core/component device improvements you might need. > > I'm also interested in looking into improving the PHY probing so that > the genphy never probes a device that has a driver that could probe > it. Even outside of all this fw_devlink thing, they way PHY is handled > now, if any of the supplier really isn't ready yet (say a clock), then > the genphy gets used -- which isn't good. I think this is the real problem which needs to be addressed. The trouble is, I don't know if phy_attach_direct can find out the reason for which d->driver is NULL, i.e. that there was a driver which matched and attempted the probe, but returned -EPROBE_DEFER. > -Saravana > > > > I hope it is at least clearer now that there are subtleties and nuances, > > > and we cannot just assess how many boards are broken by looking at the > > > device trees. By design, all are, sure, but they might still work, and > > > that's better than nothing...
On Wed, Sep 1, 2021 at 1:46 AM Vladimir Oltean <olteanv@gmail.com> wrote: > > On Tue, Aug 31, 2021 at 07:00:58PM -0700, Saravana Kannan wrote: > > On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > > interrupt properties, so don't loop back to their parent device. > > > > > > This is interesting and not what I really expected to happen. It goes to > > > show that we really need more time to understand all the subtleties of > > > device dependencies before jumping on patching stuff. > > > > > > In case the DSA tree contains more than one switch, different things > > > will happen in dsa_register_switch(). > > > The tree itself is only initialized when the last switch calls > > > dsa_register_switch(). All the other switches just mark themselves as > > > present and exit probing early. See this piece of code in dsa_tree_setup: > > > > > > complete = dsa_tree_setup_routing_table(dst); > > > if (!complete) > > > return 0; > > > > > > So it should be a general property of cross-chip DSA trees that all > > > switches except the last one will have the specific PHY driver probed > > > properly, and not the genphy. > > > > > > Because all (N - 1) switches of a tree exit early in dsa_register_switch, > > > they have successfully probed by the time the last switch brings up the > > > tree, and brings up the PHYs on behalf of every other switch. > > > > > > The last switch can connect to the PHY on behalf of the other switches > > > past their probe ending, and those PHYs should not defer probing because > > > their supplier is now probed. It is only that the last switch cannot > > > connect to the PHYs of its own ports. > > > > I'm not saying this with any intention of making things easier for me > > (I'm not even sure it does). But your description about how multiple > > switches are handled by DSA has me even more convinced than before > > that DSA needs to use a component device model. This is like the > > textbook example for component devices. > > In this example, I guess the component master would be the "struct dsa_switch_tree", Right. > but there is no struct device associated with it. We can create one? I don't think it needs to have a DT node. And if it does, this is where my "I'm willing to help improve component device" offer comes in to help make it a bit more generic. > > How many "struct dsa_switch_tree" instances there are in a system > depends on whether OF is used or not. > > If we use OF, the device tree needs to be parsed, and every unique first > cell (tree-id) of: > dsa,member = <tree-id switch-id>; > constitutes a different "struct dsa_switch_tree". > > If we do not use OF, the number of switch trees in a system is one, see dsa_switch_parse. > > It seems to me like the compare function for component_match (where each > component is a "struct dsa_switch" should look at dev->of_node and parse > the "dsa,member" property again, and match on the same tree-id as the > component master itself? I don't know enough about DSA to give a useful answer here. But I guess so? > > There's also the question of how to do the component_match in a way that > also works for the pdata/non-OF based DSA systems (of which I have none to test). You could always just short circuit it and not create a component device if it's just pdata/non-OF. That's one option. > All of this to move dsa_tree_setup() outside of the probe calling > context of any individual struct dsa_switch, and into the "bind" calling > context of the component master associated with the struct dsa_switch_tree. Right. > This would allow the phy_connect()/phy_attach_direct() calls to find the > PHY device already bound to the specific driver, which would avoid > binding genphy as a last resort? Short answer, yes. Long answer: this would fix multiple things: 1) Remove the parent's probe from depending on the child's probe(). This is not guaranteed at all, so we'd fix this bad assumption in the code. 2) It would allow the PHYs to probe with fw_devlink because the switch would have completed probing. 3) It'd avoid the bad design of the last switch's probe doing all the PHY handling for the previous N-1 switches. What if something fails there? Shouldn't it be one of the previous switches that should report the failure (either in probe or switch registration or whatever?)? The component device model would allow each switch to do all it's own work and then the component master can do the "tying up" of all these switches and PHYs. > > Two questions: > > - Why would it now be more guaranteed that the PHY drivers are bound to > the internal PHY devices exactly during the time span between events > (a) Switch driver (a component of the switch tree) finishes probing > (b) Switch tree (the component master) starts binding Firstly, PHYs won't defer probe due to fw_devlink enforcing their dependency on the switch and will actually have their probe() called (and possibly succeed -- see more below). > I mean in lack of any guarantee, we can still end up in a situation > where the specific PHY driver still is not bound early enough to the > internal PHY to be available by the time we call phylink_of_phy_connect, > and we have all those component device goodies but they don't help. > I'm sure I'm misunderstanding something but I don't know what. > > - What if the internal PHY has other suppliers beyond the interrupt-parent? > What if, say, it has a reset-gpios = <&gpio1>, where gpio1 is provided > by some other thing on some other slow bus, which is equally slow (or > slower) to probe to the DSA switch itself. So the temporary absence of > this other supplier is causing the specific PHY driver to defer probing, > just enough for a concurrent call to phylink_of_phy_connect -> phy_attach_direct > to say "ok, I've waited enough and there is no driver, genphy it is". > How would this be avoided? Good question and this is another reason for me suggesting the use of component model. > Or are you thinking of some kind of two-level > component driver system: > - the DSA switch is a component master, with components being its > sub-devices such as internal PHYs etc > - the DSA switch is also a component of the DSA switch tree I was thinking of one component master with all the devices that make up the DSA switch tree. I don't think there's any requirement that all the component devices need to be of the same type. That way, the DSA switch tree won't even be attempted until all the devices are ready. One thing that's not clear to me wrt using specific driver vs the genphy -- at what point is it okay to give up waiting for a specific driver? This is more of a question to the maintainers than what happens today. What if the specific driver is a module that's loaded after the switch's driver? There's no time bound to this event. Are we going to put the restriction that all the PHY's drivers need to be registered/loaded before the switch's driver? If that's the decision, that's okay by me. But I just want to understand the requirements. Also see my reply to your other email. -Saravana
On Wed, Sep 1, 2021 at 2:02 AM Vladimir Oltean <olteanv@gmail.com> wrote: > > On Tue, Aug 31, 2021 at 07:19:40PM -0700, Saravana Kannan wrote: > > On Tue, Aug 31, 2021 at 6:38 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > > > On Wed, Sep 01, 2021 at 04:28:26AM +0300, Vladimir Oltean wrote: > > > > On Wed, Sep 01, 2021 at 02:18:04AM +0300, Vladimir Oltean wrote: > > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > > > > interrupt properties, so don't loop back to their parent device. > > > > > > > > > > This is interesting and not what I really expected to happen. It goes to > > > > > show that we really need more time to understand all the subtleties of > > > > > device dependencies before jumping on patching stuff. > > > > > > > > There is an even more interesting variation which I would like to point > > > > out. It seems like a very odd loophole in the device links. > > > > > > > > Take the example of the mv88e6xxx DSA driver. On my board > > > > (arch/arm64/boot/dts/marvell/armada-3720-turris-mox.dts), even after I > > > > had to declare the switches as interrupt controller and add interrupts > > > > to their internal PHYs, I still need considerable force to 'break' this > > > > board in the way discussed in this thread. The correct PHY driver insists > > > > to probe, and not genphy. Let me explain. > > > > > > > > The automatic device links between the switch (supplier, as interrupt-controller) > > > > and PHYs (consumers) are added by fwnode_link_add, called from of_link_to_phandle. > > > > > > > > Important note: fwnode_link_add does not link devices, it links OF nodes. > > > > > > > > Even more important node, in the form of a comment: > > > > > > > > * The driver core will use the fwnode link to create a device link between the > > > > * two device objects corresponding to @con and @sup when they are created. The > > > > * driver core will automatically delete the fwnode link between @con and @sup > > > > * after doing that. > > > > > > > > Okay?! > > > > > > > > What seems to be omitted is that the DSA switch driver's probing itself > > > > can be deferred. For example: > > > > > > > > dsa_register_switch > > > > -> dsa_switch_probe > > > > -> dsa_switch_parse_of > > > > -> dsa_switch_parse_ports_of > > > > -> dsa_port_parse_of > > > > -> of_find_net_device_by_node(of_parse_phandle(dn, "ethernet", 0)); > > > > -> not found => return -EPROBE_DEFER > > > > > > > > When dsa_register_switch() returns -EPROBE_DEFER, it is effectively > > > > an error path. So the reverse of initialization is performed. > > > > > > > > The mv88e6xxx driver calls mv88e6xxx_mdios_register() right _before_ > > > > dsa_register_switch. So when dsa_register_switch returns error code, > > > > mv88e6xxx_mdios_unregister() will be called. > > > > > > > > When mv88e6xxx_mdios_unregister() is called, the MDIO buses with > > > > internal PHYs are destroyed. So the PHY devices themselves are destroyed > > > > too. And the device links between the DSA switch and the internal PHYs, > > > > those created based on the firmware node links created by fwnode_link_add, > > > > are dropped too. > > > > > > > > Now remember the comment that the device links created based on > > > > fwnode_link_add are not restored. > > > > > > > > So probing of the DSA switch finally resumes, and this time > > > > device_links_check_suppliers() is effectively bypassed, the PHYs no > > > > longer request probe deferral due to their supplier not being ready, > > > > because the device link no longer exists. > > > > > > > > Isn't this self-sabotaging?! > > > > Yeah, this is a known "issue". I'm saying "issue" because at worst > > it'd allow a few unnecessary deferred probes. And if you want to break > > or get fw_devlink to ignore your child devices or your consumers, > > there are simpler APIs to do it without having to intentionally defer > > a probe. Fixing this "issue" would just use up more memory and > > increase boot time for no meaningful benefit. > > But I mean, if the goal of fw_devlink is to infer a probing order based > on phandles, and it is faced with a long chain of devices, then any > -EPROBE_DEFER of a device on top of the chain will break the probing > order for all devices beneath it. It is self-defeating, it is already > memory used for nothing. > > > > > > > > > Now generally, DSA drivers defer probing because they probe in parallel > > > > with the DSA master. This is typical if the switch is on a SPI bus, or > > > > I2C, or on an MDIO bus provided by a _standalone_ MDIO controller. > > > > > > > > If the MDIO controller is not standalone, but is provided by Ethernet > > > > controller that is the DSA master itself, then things change a lot, > > > > because probing can never be parallel. The DSA master probes, > > > > initializes its MDIO bus, and this triggers the probing of the MDIO > > > > devices on that bus, one of which is the DSA switch. So DSA can no > > > > longer defer the probe due to that reason. > > > > > > > > Secondly, in DSA we even have variation between drivers as to where they > > > > register their internal MDIO buses. The mv88e6xxx driver does this in > > > > mv88e6xxx_probe (the probe function on the MDIO bus). The rtl8366rb > > > > driver calls realtek_smi_setup_mdio() from rtl8366rb_setup(), and this > > > > is important. DSA provides drivers with a .setup() callback, which is > > > > guaranteed to take place after nothing can defer the switch's probe > > > > anymore. > > > > > > > > So putting two and two together, sure enough, if I move mv88e6xxx_mdios_register > > > > from mv88e6xxx_probe to mv88e6xxx_setup, then I can reliably break this > > > > setup, because the device links framework isn't sabotaging itself anymore. > > > > > > > > Conversely, I am pretty sure that if rtl8366rb was to call of_mdiobus_register() > > > > from the probe method and not the setup method, the entire design issue > > > > with interrupts on internal DSA switch ports would have went absolutely > > > > unnoticed for a few more years. > > > > > > > > I have not tested this, but it also seems plausible that DSA can > > > > trivially and reliably bypass any fw_devlink=on restrictions by simply > > > > moving all of_mdiobus_register() driver calls from the .setup() method > > > > to their respective probe methods (prior to calling dsa_register_switch), > > > > then effectively fabricate an -EPROBE_DEFER during the first probe attempt. > > > > I mean, who will know whether that probe deferral request was justified > > > > or not? > > > > > > Pushing the thought even further, it is not even necessary to move the > > > of_mdiobus_register() call to the probe function. Where it is (in .setup) > > > is already good enough. It is sufficient to return -EOPNOTSUPP once > > > (the first time) immediately _after_ the call to of_mdiobus_register > > > (and have a proper error path, i.e. call mdiobus_unregister too). > > > > Right, there are plenty of ways to intentionally break fw_devlink. I > > hope that's not the point :) And I don't think -EOPNOTSUPP would work > > because your device wouldn't be probed again. > > Yes, -EPROBE_DEFER is what I meant. > > > > > > > > Anyway, I'm not sure everyone agrees with this type of "solution" (even > > > > though it's worth pointing it out as a fw_devlink limitation). In any > > > > case, we need some sort of lightweight "fix" to the chicken-and-egg > > > > problem, which will give me enough time to think of something better. > > > > I think the generic DSA patch I gave would be the lightweight fix to > > address this chicken-and-egg issue. > > > > As for the long term fix, I'd really suggest looking into using the > > component device model. I'd even be happy to help make any driver > > core/component device improvements you might need. > > > > I'm also interested in looking into improving the PHY probing so that > > the genphy never probes a device that has a driver that could probe > > it. Even outside of all this fw_devlink thing, they way PHY is handled > > now, if any of the supplier really isn't ready yet (say a clock), then > > the genphy gets used -- which isn't good. > > I think this is the real problem which needs to be addressed. This is one of the real problems which need to be addresses. > The > trouble is, I don't know if phy_attach_direct can find out the reason > for which d->driver is NULL, i.e. that there was a driver which matched > and attempted the probe, but returned -EPROBE_DEFER. I think if we can set the requirement that the PHY's driver needs to be loaded/registered before the switch's driver, this should be possible to figure out. Either using dev->can_match or with some additional minor changes to driver-core. -Saravana
> How would this be avoided? Or are you thinking of some kind of two-level > component driver system: > - the DSA switch is a component master, with components being its > sub-devices such as internal PHYs etc > - the DSA switch is also a component of the DSA switch tree I think you might be missing a level. Think about the automotive reference design system you posted the DT for a couple of days ago. Don't you have cascaded switches, which are not members of the same DSA tree. You might need a component for that whole group of switches, above what you suggest here. Can you nest components? How deep can you nest them? Andrew
On Thu, Sep 2, 2021 at 10:41 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > How would this be avoided? Or are you thinking of some kind of two-level > > component driver system: > > - the DSA switch is a component master, with components being its > > sub-devices such as internal PHYs etc > > - the DSA switch is also a component of the DSA switch tree > > I think you might be missing a level. Think about the automotive > reference design system you posted the DT for a couple of days > ago. Don't you have cascaded switches, which are not members of the > same DSA tree. You might need a component for that whole group of > switches, above what you suggest here. > > Can you nest components? How deep can you nest them? As far as I know you can nest components. Also, technically you can make your own lightweight component model like behaviour using stateful device links or fwnode links (probably just a simple for loop). Just create a new "dsa_switch_tree" device and create device links between that and whatever other devices that need to probe first. And then you'll just have a common "dsa_switch_tree" driver that probes these types of devices. I'm waiting for [1] to land before I jump in and clean up the component model to be more flexible and cleaner by using device links. The current implementation does a lot of stuff that device links will take care of for free. [1] - https://lore.kernel.org/lkml/CAGETcx-mRrqC_sGiBk+wx8RtwjJjXf0KJo+ejU6SweEBiATaLw@mail.gmail.com/ -Saravana
On Tue, Aug 31, 2021 at 6:16 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > > I must admit, my main problem at the moment is -rc1 in two weeks > > > time. It seems like a number of board with Ethernet switches will be > > > broken, that worked before. phy-handle is not limited to switch > > > drivers, it is also used for Ethernet drivers. So it could be, a > > > number of Ethernet drivers are also going to be broken in -rc1? > > > > Again, in those cases, based on your FEC example, fw_devlink=on > > actually improves things. > > Debatable. I did some testing. As expected some boards with Ethernet > switches are now broken. Without fw_devlink=on, some boards are not > optimal, but they actually work. With it, they are broken. > > I did a bisect, and they have been broken since: > > ea718c699055c8566eb64432388a04974c43b2ea is the first bad commit > commit ea718c699055c8566eb64432388a04974c43b2ea > Author: Saravana Kannan <saravanak@google.com> > Date: Tue Mar 2 13:11:32 2021 -0800 > > Revert "Revert "driver core: Set fw_devlink=on by default"" > > This reverts commit 3e4c982f1ce75faf5314477b8da296d2d00919df. > > Since all reported issues due to fw_devlink=on should be addressed by > this series, revert the revert. fw_devlink=on Take II. > > Signed-off-by: Saravana Kannan <saravanak@google.com> > Link: https://lore.kernel.org/r/20210302211133.2244281-4-saravanak@google.com > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > drivers/base/core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > So however it is fixed, it needs to go into stable, not just -rc1. > > > Again, it's not a widespread problem as I explained before. > > fw_devlink=on has been the default for 2 kernel versions now. With no > > unfixed reported issues. > > Given that some Ethernet switches have been broken all that time, i > wonder what else has been broken? Normally, the kernel which is > release in December becomes the next LTS. It then gets picked up by > the distros and more wide spread tested. So it could be, you get a > flood of reports in January and February about things which are > broken. This is why i don't think you should be relying on bug > reports, you should be taking a more proactive stance and trying to > analyse the DTB blobs. > > I will spend some time trying out your proposed fix. See if they are a > quick fix for stable. Hi Andrew, Did you have a chance to try it out? I can fix up the commit text and send out vN+1 of the patch if it works for you. -Saravana
> --- a/net/dsa/dsa2.c > +++ b/net/dsa/dsa2.c > @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct > dsa_switch *ds, struct device_node *dn) > { > int err; > > + /* A lot of switch devices have their PHYs as child devices and have > + * the PHYs depend on the switch as a supplier (Eg: interrupt > + * controller). With fw_devlink=on, that means the PHYs will defer > + * probe until the probe() of the switch completes. However, the way > + * the DSA framework is designed, the PHYs are expected to be probed > + * successfully before the probe() of the switch completes. > + * > + * So, mark the switch devices as a "broken parent" so that fw_devlink > + * knows not to create device links between PHYs and the parent switch. > + */ > + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > err = dsa_switch_parse_member_of(ds, dn); > if (err) > return err; This does not work. First off, its dn, not np. But with that fixed, it still does not work. This is too late, the mdio busses have already been registered and probed, the PHYs have been found on the busses, and the PHYs would of been probed, if not for fw_devlink. What did work was: diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index c45ca2473743..45d67d50e35f 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev) if (!np && !pdata) return -EINVAL; - if (np) + if (np) { compat_info = of_device_get_match_data(dev); + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; + } if (pdata) { compat_info = pdata_device_get_match_data(dev); This will fix it for mv88e6xxx. But if the same problem occurs in any of the other DSA drivers, they will still be broken: ~/linux/drivers/net/dsa$ grep -r mdiobus_register * bcm_sf2.c: err = mdiobus_register(priv->slave_mii_bus); dsa_loop_bdinfo.c: return mdiobus_register_board_info(&bdinfo, 1); lantiq_gswip.c: return of_mdiobus_register(ds->slave_mii_bus, mdio_np); mt7530.c: ret = mdiobus_register(bus); mv88e6xxx/chip.c: err = of_mdiobus_register(bus, np); grep: mv88e6xxx/chip.o: binary file matches ocelot/seville_vsc9953.c: rc = mdiobus_register(bus); ocelot/felix_vsc9959.c: rc = mdiobus_register(bus); qca/ar9331.c: ret = of_mdiobus_register(mbus, mnp); qca8k.c: return devm_of_mdiobus_register(priv->dev, bus, mdio); realtek-smi-core.c: ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np); sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); sja1105/sja1105_mdio.c: rc = mdiobus_register(bus); sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds) sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds); sja1105/sja1105_main.c: rc = sja1105_mdiobus_register(ds); If you are happy to use a big hammer: diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c index 53f034fc2ef7..7ecd910f7fb8 100644 --- a/drivers/net/phy/mdio_bus.c +++ b/drivers/net/phy/mdio_bus.c @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner) NULL == bus->read || NULL == bus->write) return -EINVAL; + if (bus->parent && bus->parent->of_node) + bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; + BUG_ON(bus->state != MDIOBUS_ALLOCATED && bus->state != MDIOBUS_UNREGISTERED); So basically saying all MDIO busses potentially have a problem. I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are not broken, they work fine, if fw_devlink gets out of the way and allows them to do their job. You also asked about why the component framework is not used. DSA has been around for a while, the first commit dates back to October 2008. Russell Kings first commit for the component framework is January 2014. The plain driver model has worked for the last 13 years, so there has not been any need to change. Andrew
On Wed, Sep 8, 2021 at 6:39 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > --- a/net/dsa/dsa2.c > > +++ b/net/dsa/dsa2.c > > @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct > > dsa_switch *ds, struct device_node *dn) > > { > > int err; > > > > + /* A lot of switch devices have their PHYs as child devices and have > > + * the PHYs depend on the switch as a supplier (Eg: interrupt > > + * controller). With fw_devlink=on, that means the PHYs will defer > > + * probe until the probe() of the switch completes. However, the way > > + * the DSA framework is designed, the PHYs are expected to be probed > > + * successfully before the probe() of the switch completes. > > + * > > + * So, mark the switch devices as a "broken parent" so that fw_devlink > > + * knows not to create device links between PHYs and the parent switch. > > + */ > > + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > > err = dsa_switch_parse_member_of(ds, dn); > > if (err) > > return err; > > This does not work. First off, its dn, not np. My bad. Copy paste error. > But with that fixed, it > still does not work. This is too late, the mdio busses have already > been registered and probed, the PHYs have been found on the busses, > and the PHYs would of been probed, if not for fw_devlink. Sigh... looks like some drivers register their mdio bus in their dsa_switch_ops->setup while others do it in their actual probe function (which actually makes more sense to me). > > What did work was: > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c > index c45ca2473743..45d67d50e35f 100644 > --- a/drivers/net/dsa/mv88e6xxx/chip.c > +++ b/drivers/net/dsa/mv88e6xxx/chip.c > @@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev) > if (!np && !pdata) > return -EINVAL; > > - if (np) > + if (np) { > compat_info = of_device_get_match_data(dev); > + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > + } > > if (pdata) { > compat_info = pdata_device_get_match_data(dev); > > This will fix it for mv88e6xxx. But if the same problem occurs in any > of the other DSA drivers, they will still be broken: > > ~/linux/drivers/net/dsa$ grep -r mdiobus_register * > bcm_sf2.c: err = mdiobus_register(priv->slave_mii_bus); > dsa_loop_bdinfo.c: return mdiobus_register_board_info(&bdinfo, 1); > lantiq_gswip.c: return of_mdiobus_register(ds->slave_mii_bus, mdio_np); > mt7530.c: ret = mdiobus_register(bus); > mv88e6xxx/chip.c: err = of_mdiobus_register(bus, np); > grep: mv88e6xxx/chip.o: binary file matches > ocelot/seville_vsc9953.c: rc = mdiobus_register(bus); > ocelot/felix_vsc9959.c: rc = mdiobus_register(bus); > qca/ar9331.c: ret = of_mdiobus_register(mbus, mnp); > qca8k.c: return devm_of_mdiobus_register(priv->dev, bus, mdio); > realtek-smi-core.c: ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np); This one would have worked because it registers it in the ->setup() ops. So it's not a simple grep for of_mdiobus_register(). But your point stands nonetheless. > sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); > sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); > sja1105/sja1105_mdio.c: rc = mdiobus_register(bus); > sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds) > sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds); > sja1105/sja1105_main.c: rc = sja1105_mdiobus_register(ds); > > If you are happy to use a big hammer: I'm okay with this big hammer for now while we figure out something better. > > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c > index 53f034fc2ef7..7ecd910f7fb8 100644 > --- a/drivers/net/phy/mdio_bus.c > +++ b/drivers/net/phy/mdio_bus.c > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner) > NULL == bus->read || NULL == bus->write) > return -EINVAL; > > + if (bus->parent && bus->parent->of_node) > + bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > + > BUG_ON(bus->state != MDIOBUS_ALLOCATED && > bus->state != MDIOBUS_UNREGISTERED); > > So basically saying all MDIO busses potentially have a problem. > > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are > not broken, they work fine, if fw_devlink gets out of the way and > allows them to do their job. The parent assuming the child will be probed as soon as it's added is a broken expectation/assumption. fw_devlink is just catching them immediately. Having said that, this is not the hill either of us should choose to die on. So, how about something like: FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD If that works, I can clean up the series with this and the MDIO fix you mentioned. > You also asked about why the component framework is not used. DSA has > been around for a while, the first commit dates back to October > 2008. Russell Kings first commit for the component framework is > January 2014. The plain driver model has worked for the last 13 years, > so there has not been any need to change. Thanks for the history on why it couldn't have been used earlier. In the long run, I'd still like to fix this so that the dsa_tree_setup() doesn't need the flag above. I have some ideas using device links that'll be much simpler to understand and maintain than using the component framework. I'll send out patches for that (not meant for 5.15) later and we can go with the MDIO bus hammer for 5.15. -Saravana
On Wed, Sep 08, 2021 at 08:21:05PM -0700, Saravana Kannan wrote: > > But with that fixed, it > > still does not work. This is too late, the mdio busses have already > > been registered and probed, the PHYs have been found on the busses, > > and the PHYs would of been probed, if not for fw_devlink. > > Sigh... looks like some drivers register their mdio bus in their > dsa_switch_ops->setup while others do it in their actual probe > function (which actually makes more sense to me). If it makes more sense to you for of_mdiobus_register to be placed in the probe function and not in ->setup, then please be aware of my previous email pointing out that DSA defers probe due to other reasons too before calling ->setup, like of_find_net_device_by_node not finding the DSA master. If the MDIO bus was registered by then, it will be destroyed by the unwind path and the device links will not be created a second time, effectively defeating fw_devlink. > > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c > > index 53f034fc2ef7..7ecd910f7fb8 100644 > > --- a/drivers/net/phy/mdio_bus.c > > +++ b/drivers/net/phy/mdio_bus.c > > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner) > > NULL == bus->read || NULL == bus->write) > > return -EINVAL; > > > > + if (bus->parent && bus->parent->of_node) > > + bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > > + > > BUG_ON(bus->state != MDIOBUS_ALLOCATED && > > bus->state != MDIOBUS_UNREGISTERED); > > > > So basically saying all MDIO busses potentially have a problem. > > > > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are > > not broken, they work fine, if fw_devlink gets out of the way and > > allows them to do their job. > > The parent assuming the child will be probed as soon as it's added is > a broken expectation/assumption. fw_devlink is just catching them > immediately. It's not really a broken expectation when you come to think of the fact that synchronous probing is requested, and this is the internal PHY of the switch we are talking about, not just any PHY off the street, with random dependencies. It is known beforehand, and so are the dependencies. All dependencies that the internal PHY has should be provided by the switch driver by the time it registers the MDIO bus. The system is not prepared to handle an -EPROBE_DEFER simply because there is no good reason why it should happen. I see fw_devlink as injecting a fault in this case. Maybe we should treat it, but in any case it is adding a pointless -EPROBE_DEFER when the PHY driver could have probed immediately. This will slow down the boot time when we treat it properly eventually. > Having said that, this is not the hill either of us should choose to > die on. So, how about something like: > FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD > > If that works, I can clean up the series with this and the MDIO fix > you mentioned. I'm okay with the "needs child bound on add" name. > > You also asked about why the component framework is not used. DSA has > > been around for a while, the first commit dates back to October > > 2008. Russell Kings first commit for the component framework is > > January 2014. The plain driver model has worked for the last 13 years, > > so there has not been any need to change. > > Thanks for the history on why it couldn't have been used earlier. > > In the long run, I'd still like to fix this so that the > dsa_tree_setup() doesn't need the flag above. I have some ideas using > device links that'll be much simpler to understand and maintain than > using the component framework. I'll send out patches for that (not > meant for 5.15) later and we can go with the MDIO bus hammer for 5.15. Details? I am not too fond of using the component framework because I am not convinced we should be fabricating struct devices we do not need, just to comply with an API that solves a fabricated problem.
> Sigh... looks like some drivers register their mdio bus in their > dsa_switch_ops->setup while others do it in their actual probe > function (which actually makes more sense to me). Drivers are free to do whatever they want. The driver model allows it. > I'm okay with this big hammer for now while we figure out something > better. > > > > > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c > > index 53f034fc2ef7..7ecd910f7fb8 100644 > > --- a/drivers/net/phy/mdio_bus.c > > +++ b/drivers/net/phy/mdio_bus.c > > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner) > > NULL == bus->read || NULL == bus->write) > > return -EINVAL; > > > > + if (bus->parent && bus->parent->of_node) > > + bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > > + > > BUG_ON(bus->state != MDIOBUS_ALLOCATED && > > bus->state != MDIOBUS_UNREGISTERED); > > > > So basically saying all MDIO busses potentially have a problem. > > > > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are > > not broken, they work fine, if fw_devlink gets out of the way and > > allows them to do their job. > > The parent assuming the child will be probed as soon as it's added is > a broken expectation/assumption. fw_devlink is just catching them > immediately. Why is it broken? As i said in the history, DSA has worked since 2008. This behaviour is not that old, but it has been used and worked for a number of years. I actual think your model of the driver model is too simple and needs to accept that a driver probe is not atomic. Resources a driver registers with other parts of the kernel can be used before the probe completes. And we have some corners of the kernel that depend on that. This is particularly true for network drivers. As soon as you register a network interface to the stack, it will start using it, before the probe function has completed. It does not wait around for the driver core to say it has completed probing. And i doubt this is unique to networking. Maybe when a frame buffer driver registers a frame buffer with the core, the core starts to draw the splash screen, before the probe finishes? Maybe when a block driver registers a block device, the core starts reading the partition table, before the probe finishes? These are all examples of using a resource before the probe completes. > Having said that, this is not the hill either of us should choose to > die on. So, how about something like: > FWNODE_FLAG_NEEDS_CHILD_BOUND_ON_ADD > > If that works, I can clean up the series with this and the MDIO fix > you mentioned. That is O.K. for me as a fix. I can test patches next week. Andrew
On 9/8/2021 6:39 PM, Andrew Lunn wrote: >> --- a/net/dsa/dsa2.c >> +++ b/net/dsa/dsa2.c >> @@ -1286,6 +1286,17 @@ static int dsa_switch_parse_of(struct >> dsa_switch *ds, struct device_node *dn) >> { >> int err; >> >> + /* A lot of switch devices have their PHYs as child devices and have >> + * the PHYs depend on the switch as a supplier (Eg: interrupt >> + * controller). With fw_devlink=on, that means the PHYs will defer >> + * probe until the probe() of the switch completes. However, the way >> + * the DSA framework is designed, the PHYs are expected to be probed >> + * successfully before the probe() of the switch completes. >> + * >> + * So, mark the switch devices as a "broken parent" so that fw_devlink >> + * knows not to create device links between PHYs and the parent switch. >> + */ >> + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; >> err = dsa_switch_parse_member_of(ds, dn); >> if (err) >> return err; > > This does not work. First off, its dn, not np. But with that fixed, it > still does not work. This is too late, the mdio busses have already > been registered and probed, the PHYs have been found on the busses, > and the PHYs would of been probed, if not for fw_devlink. > > What did work was: > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c > index c45ca2473743..45d67d50e35f 100644 > --- a/drivers/net/dsa/mv88e6xxx/chip.c > +++ b/drivers/net/dsa/mv88e6xxx/chip.c > @@ -6249,8 +6249,10 @@ static int mv88e6xxx_probe(struct mdio_device *mdiodev) > if (!np && !pdata) > return -EINVAL; > > - if (np) > + if (np) { > compat_info = of_device_get_match_data(dev); > + np->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > + } > > if (pdata) { > compat_info = pdata_device_get_match_data(dev); > > This will fix it for mv88e6xxx. But if the same problem occurs in any > of the other DSA drivers, they will still be broken: > > ~/linux/drivers/net/dsa$ grep -r mdiobus_register * > bcm_sf2.c: err = mdiobus_register(priv->slave_mii_bus); > dsa_loop_bdinfo.c: return mdiobus_register_board_info(&bdinfo, 1); > lantiq_gswip.c: return of_mdiobus_register(ds->slave_mii_bus, mdio_np); > mt7530.c: ret = mdiobus_register(bus); > mv88e6xxx/chip.c: err = of_mdiobus_register(bus, np); > grep: mv88e6xxx/chip.o: binary file matches > ocelot/seville_vsc9953.c: rc = mdiobus_register(bus); > ocelot/felix_vsc9959.c: rc = mdiobus_register(bus); > qca/ar9331.c: ret = of_mdiobus_register(mbus, mnp); > qca8k.c: return devm_of_mdiobus_register(priv->dev, bus, mdio); > realtek-smi-core.c: ret = of_mdiobus_register(smi->slave_mii_bus, mdio_np); > sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); > sja1105/sja1105_mdio.c: rc = of_mdiobus_register(bus, np); > sja1105/sja1105_mdio.c: rc = mdiobus_register(bus); > sja1105/sja1105_mdio.c:int sja1105_mdiobus_register(struct dsa_switch *ds) > sja1105/sja1105.h:int sja1105_mdiobus_register(struct dsa_switch *ds); > sja1105/sja1105_main.c: rc = sja1105_mdiobus_register(ds); > > If you are happy to use a big hammer: > > diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c > index 53f034fc2ef7..7ecd910f7fb8 100644 > --- a/drivers/net/phy/mdio_bus.c > +++ b/drivers/net/phy/mdio_bus.c > @@ -525,6 +525,9 @@ int __mdiobus_register(struct mii_bus *bus, struct module *owner) > NULL == bus->read || NULL == bus->write) > return -EINVAL; > > + if (bus->parent && bus->parent->of_node) > + bus->parent->of_node->fwnode.flags |= FWNODE_FLAG_BROKEN_PARENT; > + > BUG_ON(bus->state != MDIOBUS_ALLOCATED && > bus->state != MDIOBUS_UNREGISTERED); > > So basically saying all MDIO busses potentially have a problem. > > I also don't like the name FWNODE_FLAG_BROKEN_PARENT. The parents are > not broken, they work fine, if fw_devlink gets out of the way and > allows them to do their job. > > You also asked about why the component framework is not used. DSA has > been around for a while, the first commit dates back to October > 2008. Russell Kings first commit for the component framework is > January 2014. The plain driver model has worked for the last 13 years, > so there has not been any need to change. That part of the story is more complicated than that, DSA did not get any development from 2008 till 2014 when I picked it up to add support for Broadcom switches. In 2016, in order to support more switches, especially those that were "pure" MDIO devices, Andrew came up with the mdio_device structure and also created "dsa2" which allowed registering switches from their actual bus. This allowed us to depart from all of the limitations of the unique "dsa" platform device which was just horrible to work with. I recall very clearly that one of your prototypes that I tested was using the component framework, although I do not remember why we did not pursue that route and instead the DSA switch tree got reference counted and got its current form. So Andrew, you did evaluate the component framework but ended up not using it, do you remember why? There is nothing wrong with the current approach of allowing switches to come up and do the final tree setup when the tree is fully resolved. If there are driver level changes that we can make to ease the pain on the device link framework, we should certainly entertain them, keep in mind that DSA for better or worse shows a lot of cargo cult programming.
On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > Rev B is interesting because switch0 and switch1 got genphy, while > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > interrupt properties, so don't loop back to their parent device. > > This is interesting and not what I really expected to happen. It goes to > show that we really need more time to understand all the subtleties of > device dependencies before jumping on patching stuff. > > In case the DSA tree contains more than one switch, different things > will happen in dsa_register_switch(). > The tree itself is only initialized when the last switch calls > dsa_register_switch(). All the other switches just mark themselves as > present and exit probing early. See this piece of code in dsa_tree_setup: > > complete = dsa_tree_setup_routing_table(dst); > if (!complete) > return 0; Hi Vladimir, Can you point me to an example dts file that has a DSA tree with more than one switch and also point me to the switches that form the tree? I'm working on a RFC series that tries to improve some stuff and having an example DTS to look at would help. Thanks, Saravana > > So it should be a general property of cross-chip DSA trees that all > switches except the last one will have the specific PHY driver probed > properly, and not the genphy. > > Because all (N - 1) switches of a tree exit early in dsa_register_switch, > they have successfully probed by the time the last switch brings up the > tree, and brings up the PHYs on behalf of every other switch. > > The last switch can connect to the PHY on behalf of the other switches > past their probe ending, and those PHYs should not defer probing because > their supplier is now probed. It is only that the last switch cannot > connect to the PHYs of its own ports. > > So if this does not work (you say that there are 2 switches that use > genphy) I suspect there are also other bugs involved.
On Wed, Sep 29, 2021 at 10:33:16PM -0700, Saravana Kannan wrote: > On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > interrupt properties, so don't loop back to their parent device. > > > > This is interesting and not what I really expected to happen. It goes to > > show that we really need more time to understand all the subtleties of > > device dependencies before jumping on patching stuff. > > > > In case the DSA tree contains more than one switch, different things > > will happen in dsa_register_switch(). > > The tree itself is only initialized when the last switch calls > > dsa_register_switch(). All the other switches just mark themselves as > > present and exit probing early. See this piece of code in dsa_tree_setup: > > > > complete = dsa_tree_setup_routing_table(dst); > > if (!complete) > > return 0; > > Hi Vladimir, > > Can you point me to an example dts file that has a DSA tree with more > than one switch and also point me to the switches that form the tree? > > I'm working on a RFC series that tries to improve some stuff and > having an example DTS to look at would help. Some of the Zodiac boards have multiple switches. They are all Marvell switches, using the mv88e6xxx driver. arch/arm/boot/dts/vf610-zii-dev-rev-b.dts arch/arm/boot/dts/vf610-zii-dev-rev-c.dts arch/arm/boot/dts/vf610-zii-scu4-aib.dts Andrew
Hi Saravana, On Wed, Sep 29, 2021 at 10:33:16PM -0700, Saravana Kannan wrote: > On Tue, Aug 31, 2021 at 4:18 PM Vladimir Oltean <olteanv@gmail.com> wrote: > > > > On Wed, Sep 01, 2021 at 01:02:09AM +0200, Andrew Lunn wrote: > > > Rev B is interesting because switch0 and switch1 got genphy, while > > > switch2 got the correct Marvell PHY driver. switch2 PHYs don't have > > > interrupt properties, so don't loop back to their parent device. > > > > This is interesting and not what I really expected to happen. It goes to > > show that we really need more time to understand all the subtleties of > > device dependencies before jumping on patching stuff. > > > > In case the DSA tree contains more than one switch, different things > > will happen in dsa_register_switch(). > > The tree itself is only initialized when the last switch calls > > dsa_register_switch(). All the other switches just mark themselves as > > present and exit probing early. See this piece of code in dsa_tree_setup: > > > > complete = dsa_tree_setup_routing_table(dst); > > if (!complete) > > return 0; > > Hi Vladimir, > > Can you point me to an example dts file that has a DSA tree with more > than one switch and also point me to the switches that form the tree? > > I'm working on a RFC series that tries to improve some stuff and > having an example DTS to look at would help. > > Thanks, > Saravana Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. Graphically it looks like this: +-----------------------------+ | VF610 SoC | | +--------+ | | | fec1 | | +----------+--------+---------+ | DSA master | | ethernet = <&fec1>; +--------+----------+---------------------------+ | | port@6 | | | +----------+ | | | CPU port | dsa,member = <0 0>; | | +----------+ -> tree 0, switch 0 | | | cpu | | | +----------+ | | | | switch0 | | | +-----------+-----------+-----------+-----------+ | port@0 | port@1 | port@2 | port@5 | +-----------+-----------+-----------+-----------+ |switch0phy0|switch0phy1|switch0phy2| no PHY | +-----------+-----------+-----------+-----------+ | user port | user port | user port | DSA port | +-----------+-----------+-----------+-----------+ | lan0 | lan1 | lan2 | dsa | +-----------+-----------+-----------+-----------+ | link = <&switch1port6 &switch2port9>; | | | | link = <&switch0port5>; +----------+----------+-------------------------+ | | port@6 | | | +----------+ | | | DSA port | dsa,member = <0 1>; | | +----------+ -> tree 0, switch 1 | | | dsa | | | +----------+ | | | | switch1 | | | +-----------+-----------+-----------+-----------+ | port@0 | port@1 | port@2 | port@5 | +-----------+-----------+-----------+-----------+ |switch1phy0|switch1phy1|switch2phy2| no PHY | +-----------+-----------+-----------+-----------+ | user port | user port | user port | DSA port | +-----------+-----------+-----------+-----------+ | lan3 | lan4 | lan5 | dsa | +-----------+-----------+-----------+-----------+ | link = <&switch2port9>; | | | | link = <&switch1port5 &switch0port5>; +----------+----------+-------------------------------------+ | | port@9 | | | +----------+ | | | DSA port | dsa,member = <0 2>; | | +----------+ -> tree 0, switch 2 | | | dsa | | | +----------+ | | | | switch2 | | | +-----------+-----------+-----------+-----------+-----------+ | port@0 | port@1 | port@2 | port@3 | port@4 | +-----------+-----------+-----------+-----------+-----------+ |switch2phy0|switch2phy1|switch2phy2| no PHY | no PHY | +-----------+-----------+-----------+-----------+-----------+ | user port | user port | user port | user port | user port | +-----------+-----------+-----------+-----------+-----------+ | lan6 | lan7 | lan8 | optical3 | optical4 | +-----------+-----------+-----------+-----------+-----------+
> Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. > > Graphically it looks like this: Nice ASCII art :-) This shows the flow of Ethernet frames thought the switch cluster. What is missing, and causing fw_devlink problems is the MDIO bus master for the PHYs, and the interrupt control where PHY interrupts are stored, and the linking from the PHY to the interrupt controller. Physically all these parts are inside the Ethernet switch package. But Linux models them as separate blocks. This is because in the general case, they are all discrete blocks. You have a MAC chip, and a PHY chip, and the PHY interrupt output it connected to a SoC GPIO. > > +-----------------------------+ > | VF610 SoC | > | +--------+ | > | | fec1 | | > +----------+--------+---------+ > | DSA master > | > | ethernet = <&fec1>; > +--------+----------+---------------------------+ > | | port@6 | | > | +----------+ | > | | CPU port | dsa,member = <0 0>; | > | +----------+ -> tree 0, switch 0 | > | | cpu | | > | +----------+ | > | | > | switch0 | > | | > +-----------+-----------+-----------+-----------+ Inside the block above, is the interrupt controller and the MDIO bus master. > | port@0 | port@1 | port@2 | port@5 | > +-----------+-----------+-----------+-----------+ > |switch0phy0|switch0phy1|switch0phy2| no PHY | > +-----------+-----------+-----------+-----------+ The control path for these PHYs is over the MDIO bus. They are probed via the control path bus. These PHYs also have an interrupt output, which is wired to the interrupt controller above. > | user port | user port | user port | DSA port | > +-----------+-----------+-----------+-----------+ > | lan0 | lan1 | lan2 | dsa | > +-----------+-----------+-----------+-----------+ Andrew
On Thu, Sep 30, 2021 at 7:00 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > Andrew is testing with arch/arm/boot/dts/vf610-zii-dev-rev-b.dts. > > > > Graphically it looks like this: > > Nice ASCII art :-) Thanks for the awesome diagram Vladimir! > > This shows the flow of Ethernet frames thought the switch > cluster. What is missing, and causing fw_devlink problems is the MDIO > bus master for the PHYs, and the interrupt control where PHY > interrupts are stored, and the linking from the PHY to the interrupt > controller. Physically all these parts are inside the Ethernet switch > package. But Linux models them as separate blocks. This is because in > the general case, they are all discrete blocks. You have a MAC chip, > and a PHY chip, and the PHY interrupt output it connected to a SoC > GPIO. > > > > > +-----------------------------+ > > | VF610 SoC | > > | +--------+ | > > | | fec1 | | > > +----------+--------+---------+ > > | DSA master > > | > > | ethernet = <&fec1>; > > +--------+----------+---------------------------+ > > | | port@6 | | > > | +----------+ | > > | | CPU port | dsa,member = <0 0>; | > > | +----------+ -> tree 0, switch 0 | > > | | cpu | | > > | +----------+ | > > | | > > | switch0 | > > | | > > +-----------+-----------+-----------+-----------+ > > Inside the block above, is the interrupt controller and the MDIO bus > master. > > > > | port@0 | port@1 | port@2 | port@5 | > > +-----------+-----------+-----------+-----------+ > > |switch0phy0|switch0phy1|switch0phy2| no PHY | > > +-----------+-----------+-----------+-----------+ > > The control path for these PHYs is over the MDIO bus. They are probed > via the control path bus. These PHYs also have an interrupt output, > which is wired to the interrupt controller above. > > > > | user port | user port | user port | DSA port | > > +-----------+-----------+-----------+-----------+ > > | lan0 | lan1 | lan2 | dsa | > > +-----------+-----------+-----------+-----------+ > Thanks for the dts paths and the additional details Andrew. I think this gives me enough info for now to make sure whatever I'm coding isn't completely stupid. I'm trying to make the generic PHY driver less greedy (taking it a bit further than what Vladimir was attempting) and also delay the use of generic PHY driver as late as possible (so that we give as much time as possible for the specific driver to be registered/loaded before we give up and use generic PHY driver). This would also need some changes to the DSA code and hence these questions. Btw, do we have non-DSA networking devices where fw_devlink=on delaying PHY probes is causing an issue? -Saravana
> Btw, do we have non-DSA networking devices where fw_devlink=on > delaying PHY probes is causing an issue? I don't know if issues have been reported, but the realtek driver has had problems in the past when the generic driver is used. Take a look at r8169_mdio_register(), it does something similar to DSA. What is going to make things interesting is that phy_attach_direct() is called in two different contexts. During the MAC drivers probe, it is O.K. to return EPROBE_DEFER, and let the MAC driver try again later, if we know there is a specific PHY driver for it. But when called during the MAC drivers open() op, -EPROBE_DEFER is not allowed. What to do then is an interesting question. Andrew
On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote: > > > Btw, do we have non-DSA networking devices where fw_devlink=on > > delaying PHY probes is causing an issue? > > I don't know if issues have been reported, but the realtek driver has > had problems in the past when the generic driver is used. Take a look > at r8169_mdio_register(), it does something similar to DSA. Does it have the issue of having the PHY as its child too and then depending on it to bind to a driver? I can't tell because I didn't know how to find that info for a PCI device. > > What is going to make things interesting is that phy_attach_direct() > is called in two different contexts. During the MAC drivers probe, it > is O.K. to return EPROBE_DEFER, and let the MAC driver try again > later, if we know there is a specific PHY driver for it. But when > called during the MAC drivers open() op, -EPROBE_DEFER is not > allowed. What to do then is an interesting question. Yeah, basically before doing an open() it'll have to call an API to say "just bind with whatever you got". Or something along those lines. I already know how to get that to work. I'll send some RFC soonish (I hope). -Saravana
On 9/30/21 12:48 PM, Saravana Kannan wrote: > On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote: >> >>> Btw, do we have non-DSA networking devices where fw_devlink=on >>> delaying PHY probes is causing an issue? >> >> I don't know if issues have been reported, but the realtek driver has >> had problems in the past when the generic driver is used. Take a look >> at r8169_mdio_register(), it does something similar to DSA. > > Does it have the issue of having the PHY as its child too and then > depending on it to bind to a driver? I can't tell because I didn't > know how to find that info for a PCI device. Yes, r8169 includes a MDIO bus controller, and the PHY is internal to the Ethernet MAC. These are AFAIR the relevant changes to this discussion: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa > >> >> What is going to make things interesting is that phy_attach_direct() >> is called in two different contexts. During the MAC drivers probe, it >> is O.K. to return EPROBE_DEFER, and let the MAC driver try again >> later, if we know there is a specific PHY driver for it. But when >> called during the MAC drivers open() op, -EPROBE_DEFER is not >> allowed. What to do then is an interesting question. > > Yeah, basically before doing an open() it'll have to call an API to > say "just bind with whatever you got". Or something along those lines. > I already know how to get that to work. I'll send some RFC soonish (I > hope). I don't think this is going to scale, we have dozens and dozens of drivers that connect to the PHY during ndo_open(). It is not realistic to audit them all, just like the opposite case where the drivers do probe MDIO/PHY during their .probe() call is not realistic either.
On Thu, Sep 30, 2021 at 1:06 PM Florian Fainelli <f.fainelli@gmail.com> wrote: > > On 9/30/21 12:48 PM, Saravana Kannan wrote: > > On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote: > >> > >>> Btw, do we have non-DSA networking devices where fw_devlink=on > >>> delaying PHY probes is causing an issue? > >> > >> I don't know if issues have been reported, but the realtek driver has > >> had problems in the past when the generic driver is used. Take a look > >> at r8169_mdio_register(), it does something similar to DSA. > > > > Does it have the issue of having the PHY as its child too and then > > depending on it to bind to a driver? I can't tell because I didn't > > know how to find that info for a PCI device. > > Yes, r8169 includes a MDIO bus controller, and the PHY is internal to > the Ethernet MAC. These are AFAIR the relevant changes to this discussion: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa > > > > > >> > >> What is going to make things interesting is that phy_attach_direct() > >> is called in two different contexts. During the MAC drivers probe, it > >> is O.K. to return EPROBE_DEFER, and let the MAC driver try again > >> later, if we know there is a specific PHY driver for it. But when > >> called during the MAC drivers open() op, -EPROBE_DEFER is not > >> allowed. What to do then is an interesting question. > > > > Yeah, basically before doing an open() it'll have to call an API to > > say "just bind with whatever you got". Or something along those lines. > > I already know how to get that to work. I'll send some RFC soonish (I > > hope). > > I don't think this is going to scale, we have dozens and dozens of > drivers that connect to the PHY during ndo_open(). Whichever code calls ->ndo_open() can't that mark all the PHYs that'll be used as "needs to be ready now"? In any case, if we can have an API that allows a less greedy Generic PHY binding, we could slowly transition drivers over or at least move them over as they hit issues with Gen PHY. Anyway, I'll think discussing it over code would be easier. I'll also have more context as I try to make changes. So, let's continue this on my future RFC. -Saravana > It is not realistic > to audit them all, just like the opposite case where the drivers do > probe MDIO/PHY during their .probe() call is not realistic either. > -- > Florian
> I don't think this is going to scale, we have dozens and dozens of > drivers that connect to the PHY during ndo_open(). It is not realistic > to audit them all, just like the opposite case where the drivers do > probe MDIO/PHY during their .probe() call is not realistic either. I was wondering if Coccinelle could help use here. But a quick scan of the documents don't suggest it can follow call stacks. Ideally we would what something to goes and finds the struct net_device_ops, and gets the function used for .ndo_open. Then look into that function, and all functions it calls within the driver, and see if any of them connect the PHY to the MAC. We could then add an additional parameter to indicate we are in ndo_open context. But it looks like that is wishful thinking. Andrew
On 9/30/21 1:14 PM, Saravana Kannan wrote: > On Thu, Sep 30, 2021 at 1:06 PM Florian Fainelli <f.fainelli@gmail.com> wrote: >> >> On 9/30/21 12:48 PM, Saravana Kannan wrote: >>> On Thu, Sep 30, 2021 at 12:38 PM Andrew Lunn <andrew@lunn.ch> wrote: >>>> >>>>> Btw, do we have non-DSA networking devices where fw_devlink=on >>>>> delaying PHY probes is causing an issue? >>>> >>>> I don't know if issues have been reported, but the realtek driver has >>>> had problems in the past when the generic driver is used. Take a look >>>> at r8169_mdio_register(), it does something similar to DSA. >>> >>> Does it have the issue of having the PHY as its child too and then >>> depending on it to bind to a driver? I can't tell because I didn't >>> know how to find that info for a PCI device. >> >> Yes, r8169 includes a MDIO bus controller, and the PHY is internal to >> the Ethernet MAC. These are AFAIR the relevant changes to this discussion: >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16983507742cbcaa5592af530872a82e82fb9c51 >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=11287b693d03830010356339e4ceddf47dee34fa >> >> >>> >>>> >>>> What is going to make things interesting is that phy_attach_direct() >>>> is called in two different contexts. During the MAC drivers probe, it >>>> is O.K. to return EPROBE_DEFER, and let the MAC driver try again >>>> later, if we know there is a specific PHY driver for it. But when >>>> called during the MAC drivers open() op, -EPROBE_DEFER is not >>>> allowed. What to do then is an interesting question. >>> >>> Yeah, basically before doing an open() it'll have to call an API to >>> say "just bind with whatever you got". Or something along those lines. >>> I already know how to get that to work. I'll send some RFC soonish (I >>> hope). >> >> I don't think this is going to scale, we have dozens and dozens of >> drivers that connect to the PHY during ndo_open(). > > Whichever code calls ->ndo_open() can't that mark all the PHYs that'll > be used as "needs to be ready now"? In any case, if we can have an API > that allows a less greedy Generic PHY binding, we could slowly > transition drivers over or at least move them over as they hit issues > with Gen PHY. Anyway, I'll think discussing it over code would be > easier. I'll also have more context as I try to make changes. So, > let's continue this on my future RFC. It is the same API that is being used whether you connect to the PHY at ndo_open() time or whether you do that during the parent's ->probe() fortunately or unfortunately. Now we could set a flag in either case, and hope that it addresses both situations? Being able to be selective about the Ethernet PHY driver is being used is actually a good idea, there are plenty of systems out there whereby using the Generic PHY driver will not lead to a functional Ethernet link, if we could say "I want my dedicated driver, and not Generic PHY" that would actually help some cases, too.
diff --git a/drivers/base/core.c b/drivers/base/core.c index f6360490a4a3..2cc34f8ff051 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -1719,6 +1719,28 @@ static int fw_devlink_create_devlink(struct device *con, struct device *sup_dev; int ret = 0; + /* + * In some cases, a device P might also be a supplier to its child node + * C. However, this would defer the probe of C until the probe of P + * completes successfully. This is perfectly fine in the device driver + * model. device_add() doesn't guarantee probe completion of the device + * by the time it returns. + * + * However, there are a few drivers that assume C will finish probing + * as soon as it's added and before P finishes probing. While this is a + * broken assumption that needs the driver to be fixed, we don't want + * to block fw_devlink improvements because of these drivers. + * + * So, we provide a flag to let fw_devlink know not to delay the probe + * of C until the probe of P completes successfully. + * + * When such a flag is set, we can't create device links with P as the + * supplier of C as that would delay the probe of C. + */ + if (sup_handle->flags & FWNODE_FLAG_BROKEN_PARENT && + fwnode_is_ancestor_of(sup_handle, con->fwnode)) + return -EINVAL; + sup_dev = get_dev_from_fwnode(sup_handle); if (sup_dev) { /* diff --git a/include/linux/fwnode.h b/include/linux/fwnode.h index 59828516ebaf..9382065e6ff8 100644 --- a/include/linux/fwnode.h +++ b/include/linux/fwnode.h @@ -22,10 +22,13 @@ struct device; * LINKS_ADDED: The fwnode has already be parsed to add fwnode links. * NOT_DEVICE: The fwnode will never be populated as a struct device. * INITIALIZED: The hardware corresponding to fwnode has been initialized. + * BROKEN_PARENT: The driver of this fwnode/device expects the child devices to + * probe as soon as they are added. */ #define FWNODE_FLAG_LINKS_ADDED BIT(0) #define FWNODE_FLAG_NOT_DEVICE BIT(1) #define FWNODE_FLAG_INITIALIZED BIT(2) +#define FWNODE_FLAG_BROKEN_PARENT BIT(3) struct fwnode_handle { struct fwnode_handle *secondary;
If a parent device is also a supplier to a child device, fw_devlink=on (correctly) delays the probe() of the child device until the probe() of the parent finishes successfully. However, some drivers of such parent devices (where parent is also a supplier) incorrectly expect the child device to finish probing successfully as soon as they are added using device_add() and before the probe() of the parent device has completed successfully. While this might have worked before, this is not guaranteed by driver core. fw_devlink=on catches/breaks such drivers. One example of such a case is discussed in the link mentioned below. Add a flag to make fw_devlink=on not enforce these supplier-consumer relationships, so these drivers can continue working. The flag is intentionally called BROKEN_PARENT so it's clear that this flag shouldn't be used in the normal case and that there's a problem with the driver. Link: https://lore.kernel.org/netdev/CAGETcx_uj0V4DChME-gy5HGKTYnxLBX=TH2rag29f_p=UcG+Tg@mail.gmail.com/ Signed-off-by: Saravana Kannan <saravanak@google.com> --- drivers/base/core.c | 22 ++++++++++++++++++++++ include/linux/fwnode.h | 3 +++ 2 files changed, 25 insertions(+)