mbox series

[RFC,net-next,0/3] Make the PHY library stop being so greedy when binding the generic PHY driver

Message ID 20210901225053.1205571-1-vladimir.oltean@nxp.com (mailing list archive)
Headers show
Series Make the PHY library stop being so greedy when binding the generic PHY driver | expand

Message

Vladimir Oltean Sept. 1, 2021, 10:50 p.m. UTC
This is a continuation of the discussion on patch "[v1,1/2] driver core:
fw_devlink: Add support for FWNODE_FLAG_BROKEN_PARENT" from here:
https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/

Summary: in a complex combination of device dependencies which is not
really relevant to what is being proposed here, DSA ends up calling
phylink_of_phy_connect during a period in which the PHY driver goes
through a series of probe deferral events.

The central point of that discussion is that DSA seems "broken" for
expecting the PHY driver to probe immediately on PHYs belonging to the
internal MDIO buses of switches. A few suggestions were made about what
to do, but some were not satisfactory and some did not solve the problem.

In fact, fw_devlink, the mechanism that causes the PHY driver to defer
probing in this particular case, has some significant "issues" too, but
its "issues" are only in quotes "because at worst it'd allow a few
unnecessary deferred probes":
https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/#24418895

So if that's the criterion by which an issue is an issue, maybe we
should take a step back and look at the bigger picture.

There is nothing about the idea that a PHY might defer probing, or about
the changes proposed here, that has anything with DSA. Furthermore, the
changes done by this series solve the problem in the same way: "they
allow a few unnecessary deferred probes" <- in this case they provoke
this to the caller of phy_attach_direct.

If we look at commit 16983507742c ("net: phy: probe PHY drivers
synchronously"), we see that the PHY library expectation is for the PHY
device to have a PHY driver bound to it as soon as device_add() finishes.

Well, as it turns out, in case the PHY device has any supplier which is
not ready, this is not possible, but that patch still seems to ensure
that the process of binding a driver to the device has at least started.
That process will continue for a while, and will race with
phy_attach_direct calls, so we need to make the latter observe the fact
that a driver is struggling to probe, and wait for it a bit more.

What I've not tested is loading the PHY module at runtime, and seeing
how phy_attach_direct behaves then. I expect that this change set will
not alter the behavior in that case: the genphy will still bind to a
device with no driver, and phy_attach_direct will not return -EPROBE_DEFER
in that case.

I might not be very versed in the device core/internals, but the patches
make sense to me, and worked as intended from the first try on my system
(Turris MOX with mv88e6xxx), which was modified to make the same "sins"
as those called out in the thread above:

- use PHY interrupts provided by the switch itself as an interrupt-controller
- call of_mdiobus_register from setup() and not from probe(), so as to
  not circumvent fw_devlink's limitations, and still get to hit the PHY
  probe deferral conditions.

So feedback and testing on other platforms is very appreciated.

Vladimir Oltean (3):
  net: phy: don't bind genphy in phy_attach_direct if the specific
    driver defers probe
  net: dsa: destroy the phylink instance on any error in
    dsa_slave_phy_setup
  net: dsa: allow the phy_connect() call to return -EPROBE_DEFER

 drivers/base/dd.c            | 21 +++++++++++++++++++--
 drivers/net/phy/phy_device.c |  8 ++++++++
 include/linux/device.h       |  1 +
 net/dsa/dsa2.c               |  2 ++
 net/dsa/slave.c              | 12 +++++-------
 5 files changed, 35 insertions(+), 9 deletions(-)

Comments

Russell King (Oracle) Sept. 2, 2021, 12:19 p.m. UTC | #1
On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> The central point of that discussion is that DSA seems "broken" for
> expecting the PHY driver to probe immediately on PHYs belonging to the
> internal MDIO buses of switches. A few suggestions were made about what
> to do, but some were not satisfactory and some did not solve the problem.

I think you need to describe the mechanism here. Why wouldn't a PHY
belonging to an internal MDIO bus of a switch not probe immediately?
What resources may not be available?

If we have a DSA driver that tries to probe the PHYs before e.g. the
interrupt controller inside the DSA switch has been configured, aren't
we just making completely unnecessary problems for ourselves? Wouldn't
it be saner to ensure that the interrupt controller has been setup
and become available prior to attempting to setup anything that
relies upon that interrupt controller?

From what I see of Marvell switches, the internal PHYs only ever rely
on internal resources of the switch they are embedded in.

External PHYs to the switch are a different matter - these can rely on
external clocks, and in that scenario, it would make sense for a
deferred probe to cause the entire switch to defer, since we don't
have all the resources for the switch to be functional (and, because we
want the PHYs to be present at switch probe time, not when we try to
bring up the interface, I don't see there's much other choice.)

Trying to move that to interface-up time /will/ break userspace - for
example, Debian's interfaces(8) bridge support will become unreliable,
and probably a whole host of other userspace. It will cause regressions
and instability to userspace. So that's a big no.

Maybe I'm missing exactly what the problem is...
Vladimir Oltean Sept. 2, 2021, 12:35 p.m. UTC | #2
On Thu, Sep 02, 2021 at 01:19:27PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> > The central point of that discussion is that DSA seems "broken" for
> > expecting the PHY driver to probe immediately on PHYs belonging to the
> > internal MDIO buses of switches. A few suggestions were made about what
> > to do, but some were not satisfactory and some did not solve the problem.
> 
> I think you need to describe the mechanism here. Why wouldn't a PHY
> belonging to an internal MDIO bus of a switch not probe immediately?
> What resources may not be available?

As you point out below, the interrupt-controller is what is not available.
There is a mechanism called fw_devlink which infers links from one OF
node to another based on phandles. When you have an interrupt-parent,
that OF node becomes a supplier to you. Those OF node links are then
transferred to device links once the devices having those OF nodes are
created.

> If we have a DSA driver that tries to probe the PHYs before e.g. the
> interrupt controller inside the DSA switch has been configured, aren't
> we just making completely unnecessary problems for ourselves?

This is not what happens, if that were the case, of course I would fix
_that_ and not in this way.

> Wouldn't it be saner to ensure that the interrupt controller has been
> setup and become available prior to attempting to setup anything that
> relies upon that interrupt controller?

The interrupt controller _has_ been set up. The trouble is that the
interrupt controller has the same OF node as the switch itself, and the
same OF node. Therefore, fw_devlink waits for the _entire_ switch to
finish probing, it doesn't have insight into the fact that the
dependency is just on the interrupt controller.

> From what I see of Marvell switches, the internal PHYs only ever rely
> on internal resources of the switch they are embedded in.
> 
> External PHYs to the switch are a different matter - these can rely on
> external clocks, and in that scenario, it would make sense for a
> deferred probe to cause the entire switch to defer, since we don't
> have all the resources for the switch to be functional (and, because we
> want the PHYs to be present at switch probe time, not when we try to
> bring up the interface, I don't see there's much other choice.)
> 
> Trying to move that to interface-up time /will/ break userspace - for
> example, Debian's interfaces(8) bridge support will become unreliable,
> and probably a whole host of other userspace. It will cause regressions
> and instability to userspace. So that's a big no.

Why a big no? I expect there to be 2 call paths of phy_attach_direct:
- At probe time. Both the MAC driver and the PHY driver are probing.
  This is what has this patch addresses. There is no issue to return
  -EPROBE_DEFER at that time, since drivers connect to the PHY before
  they register their netdev. So if connecting defers, there is no
  netdev to unregister, and user space knows nothing of this.
- At .ndo_open time. This is where it maybe gets interesting, but not to
  user space. If you open a netdev and it connects to the PHY then, I
  wouldn't expect the PHY to be undergoing a probing process, all of
  that should have been settled by then, should it not? Where it might
  get interesting is with NFS root, and I admit I haven't tested that.
Vladimir Oltean Sept. 2, 2021, 12:59 p.m. UTC | #3
On Thu, Sep 02, 2021 at 03:35:32PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 01:19:27PM +0100, Russell King (Oracle) wrote:
> > On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> > > The central point of that discussion is that DSA seems "broken" for
> > > expecting the PHY driver to probe immediately on PHYs belonging to the
> > > internal MDIO buses of switches. A few suggestions were made about what
> > > to do, but some were not satisfactory and some did not solve the problem.
> > 
> > I think you need to describe the mechanism here. Why wouldn't a PHY
> > belonging to an internal MDIO bus of a switch not probe immediately?
> > What resources may not be available?
> 
> As you point out below, the interrupt-controller is what is not available.
> There is a mechanism called fw_devlink which infers links from one OF
> node to another based on phandles. When you have an interrupt-parent,
> that OF node becomes a supplier to you. Those OF node links are then
> transferred to device links once the devices having those OF nodes are
> created.
> 
> > If we have a DSA driver that tries to probe the PHYs before e.g. the
> > interrupt controller inside the DSA switch has been configured, aren't
> > we just making completely unnecessary problems for ourselves?
> 
> This is not what happens, if that were the case, of course I would fix
> _that_ and not in this way.
> 
> > Wouldn't it be saner to ensure that the interrupt controller has been
> > setup and become available prior to attempting to setup anything that
> > relies upon that interrupt controller?
> 
> The interrupt controller _has_ been set up. The trouble is that the
> interrupt controller has the same OF node as the switch itself, and the
> same OF node. Therefore, fw_devlink waits for the _entire_ switch to

...and the same struct device, not "OF node" repeated twice, silly me.

> finish probing, it doesn't have insight into the fact that the
> dependency is just on the interrupt controller.
> 
> > From what I see of Marvell switches, the internal PHYs only ever rely
> > on internal resources of the switch they are embedded in.
> > 
> > External PHYs to the switch are a different matter - these can rely on
> > external clocks, and in that scenario, it would make sense for a
> > deferred probe to cause the entire switch to defer, since we don't
> > have all the resources for the switch to be functional (and, because we
> > want the PHYs to be present at switch probe time, not when we try to
> > bring up the interface, I don't see there's much other choice.)
> > 
> > Trying to move that to interface-up time /will/ break userspace - for
> > example, Debian's interfaces(8) bridge support will become unreliable,
> > and probably a whole host of other userspace. It will cause regressions
> > and instability to userspace. So that's a big no.
> 
> Why a big no? I expect there to be 2 call paths of phy_attach_direct:
> - At probe time. Both the MAC driver and the PHY driver are probing.
>   This is what has this patch addresses. There is no issue to return
>   -EPROBE_DEFER at that time, since drivers connect to the PHY before
>   they register their netdev. So if connecting defers, there is no
>   netdev to unregister, and user space knows nothing of this.
> - At .ndo_open time. This is where it maybe gets interesting, but not to
>   user space. If you open a netdev and it connects to the PHY then, I
>   wouldn't expect the PHY to be undergoing a probing process, all of
>   that should have been settled by then, should it not? Where it might
>   get interesting is with NFS root, and I admit I haven't tested that.
Russell King (Oracle) Sept. 2, 2021, 1:26 p.m. UTC | #4
On Thu, Sep 02, 2021 at 03:35:32PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 01:19:27PM +0100, Russell King (Oracle) wrote:
> > On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> > > The central point of that discussion is that DSA seems "broken" for
> > > expecting the PHY driver to probe immediately on PHYs belonging to the
> > > internal MDIO buses of switches. A few suggestions were made about what
> > > to do, but some were not satisfactory and some did not solve the problem.
> > 
> > I think you need to describe the mechanism here. Why wouldn't a PHY
> > belonging to an internal MDIO bus of a switch not probe immediately?
> > What resources may not be available?
> 
> As you point out below, the interrupt-controller is what is not available.
> There is a mechanism called fw_devlink which infers links from one OF
> node to another based on phandles. When you have an interrupt-parent,
> that OF node becomes a supplier to you. Those OF node links are then
> transferred to device links once the devices having those OF nodes are
> created.
> 
> > If we have a DSA driver that tries to probe the PHYs before e.g. the
> > interrupt controller inside the DSA switch has been configured, aren't
> > we just making completely unnecessary problems for ourselves?
> 
> This is not what happens, if that were the case, of course I would fix
> _that_ and not in this way.
> 
> > Wouldn't it be saner to ensure that the interrupt controller has been
> > setup and become available prior to attempting to setup anything that
> > relies upon that interrupt controller?
> 
> The interrupt controller _has_ been set up. The trouble is that the
> interrupt controller has the same OF node as the switch itself, and the
> same OF node. Therefore, fw_devlink waits for the _entire_ switch to
> finish probing, it doesn't have insight into the fact that the
> dependency is just on the interrupt controller.
> 
> > From what I see of Marvell switches, the internal PHYs only ever rely
> > on internal resources of the switch they are embedded in.
> > 
> > External PHYs to the switch are a different matter - these can rely on
> > external clocks, and in that scenario, it would make sense for a
> > deferred probe to cause the entire switch to defer, since we don't
> > have all the resources for the switch to be functional (and, because we
> > want the PHYs to be present at switch probe time, not when we try to
> > bring up the interface, I don't see there's much other choice.)
> > 
> > Trying to move that to interface-up time /will/ break userspace - for
> > example, Debian's interfaces(8) bridge support will become unreliable,
> > and probably a whole host of other userspace. It will cause regressions
> > and instability to userspace. So that's a big no.
> 
> Why a big no?

Fundamental rule of kernel programming: we do not break existing
userspace.

Debian has had support for configuring bridges at boot time via
the interfaces file for years. Breaking that is going to upset a
lot of people (me included) resulting in busted networks. It
would be a sure way to make oneself unpopular.

> I expect there to be 2 call paths of phy_attach_direct:
> - At probe time. Both the MAC driver and the PHY driver are probing.
>   This is what has this patch addresses. There is no issue to return
>   -EPROBE_DEFER at that time, since drivers connect to the PHY before
>   they register their netdev. So if connecting defers, there is no
>   netdev to unregister, and user space knows nothing of this.
> - At .ndo_open time. This is where it maybe gets interesting, but not to
>   user space. If you open a netdev and it connects to the PHY then, I
>   wouldn't expect the PHY to be undergoing a probing process, all of
>   that should have been settled by then, should it not? Where it might
>   get interesting is with NFS root, and I admit I haven't tested that.

I don't think you can make that assumption. Consider the case where
systemd is being used, DSA stuff is modular, and we're trying to
setup a bridge device on DSA. DSA could be probing while the bridge
is being setup.

Sadly, this isn't theoretical. I've ended up needing:

	pre-up sleep 1

in my bridge configuration to allow time for DSA to finish probing.
It's not a pleasant solution, nor a particularly reliable one at
that, but it currently works around the problem. We don't need more
cases of this kind of thing leading to boot time unreliability...

Or if we do, then we're turning Linux into Windows, where you can
end up with different behaviours each time the system is boot
depending on the exact order that various stuff comes up.
Vladimir Oltean Sept. 2, 2021, 3:23 p.m. UTC | #5
On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 03:35:32PM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 01:19:27PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> > > > The central point of that discussion is that DSA seems "broken" for
> > > > expecting the PHY driver to probe immediately on PHYs belonging to the
> > > > internal MDIO buses of switches. A few suggestions were made about what
> > > > to do, but some were not satisfactory and some did not solve the problem.
> > >
> > > I think you need to describe the mechanism here. Why wouldn't a PHY
> > > belonging to an internal MDIO bus of a switch not probe immediately?
> > > What resources may not be available?
> >
> > As you point out below, the interrupt-controller is what is not available.
> > There is a mechanism called fw_devlink which infers links from one OF
> > node to another based on phandles. When you have an interrupt-parent,
> > that OF node becomes a supplier to you. Those OF node links are then
> > transferred to device links once the devices having those OF nodes are
> > created.
> >
> > > If we have a DSA driver that tries to probe the PHYs before e.g. the
> > > interrupt controller inside the DSA switch has been configured, aren't
> > > we just making completely unnecessary problems for ourselves?
> >
> > This is not what happens, if that were the case, of course I would fix
> > _that_ and not in this way.
> >
> > > Wouldn't it be saner to ensure that the interrupt controller has been
> > > setup and become available prior to attempting to setup anything that
> > > relies upon that interrupt controller?
> >
> > The interrupt controller _has_ been set up. The trouble is that the
> > interrupt controller has the same OF node as the switch itself, and the
> > same OF node. Therefore, fw_devlink waits for the _entire_ switch to
> > finish probing, it doesn't have insight into the fact that the
> > dependency is just on the interrupt controller.
> >
> > > From what I see of Marvell switches, the internal PHYs only ever rely
> > > on internal resources of the switch they are embedded in.
> > >
> > > External PHYs to the switch are a different matter - these can rely on
> > > external clocks, and in that scenario, it would make sense for a
> > > deferred probe to cause the entire switch to defer, since we don't
> > > have all the resources for the switch to be functional (and, because we
> > > want the PHYs to be present at switch probe time, not when we try to
> > > bring up the interface, I don't see there's much other choice.)
> > >
> > > Trying to move that to interface-up time /will/ break userspace - for
> > > example, Debian's interfaces(8) bridge support will become unreliable,
> > > and probably a whole host of other userspace. It will cause regressions
> > > and instability to userspace. So that's a big no.
> >
> > Why a big no?
>
> Fundamental rule of kernel programming: we do not break existing
> userspace.

Of course, I wasn't asking why we shouldn't be breaking user space, but
about the specifics of why this change would do that.

> Debian has had support for configuring bridges at boot time via
> the interfaces file for years. Breaking that is going to upset a
> lot of people (me included) resulting in busted networks. It
> would be a sure way to make oneself unpopular.
>
> > I expect there to be 2 call paths of phy_attach_direct:
> > - At probe time. Both the MAC driver and the PHY driver are probing.
> >   This is what has this patch addresses. There is no issue to return
> >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> >   they register their netdev. So if connecting defers, there is no
> >   netdev to unregister, and user space knows nothing of this.
> > - At .ndo_open time. This is where it maybe gets interesting, but not to
> >   user space. If you open a netdev and it connects to the PHY then, I
> >   wouldn't expect the PHY to be undergoing a probing process, all of
> >   that should have been settled by then, should it not? Where it might
> >   get interesting is with NFS root, and I admit I haven't tested that.
>
> I don't think you can make that assumption. Consider the case where
> systemd is being used, DSA stuff is modular, and we're trying to
> setup a bridge device on DSA. DSA could be probing while the bridge
> is being setup.
>
> Sadly, this isn't theoretical. I've ended up needing:
>
> 	pre-up sleep 1
>
> in my bridge configuration to allow time for DSA to finish probing.
> It's not a pleasant solution, nor a particularly reliable one at
> that, but it currently works around the problem.

What problem? This is the first time I've heard of this report, and you
should definitely not need that.

I do have a system set up to use systemd-networkd, and I did want to try
this out:

$ for file in /etc/systemd/network/*; do echo ${file}; cat ${file}; done
/etc/systemd/network/br0.netdev
[NetDev]
Name=br0
Kind=bridge

[Bridge]
VLANFiltering=no
DefaultPVID=1
STP=no

[VLAN]
MVRP=no
/etc/systemd/network/br0.network
[Match]
Name=br0

[Network]
DHCP=ipv4
/etc/systemd/network/eth0.network
[Match]
Name=eth0

[Network]
Bridge=br0
/etc/systemd/network/eth1.network
[Match]
Name=eth1

[Network]
Bridge=br0
/etc/systemd/network/eth2.network
[Match]
Name=eth2

[Network]
LinkLocalAddressing=yes
/etc/systemd/network/swp.network
[Match]
Name=swp*

[Network]
BindCarrier=eth2
Bridge=br0
# Before
# bridge link
7: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 4
8: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 master br0 state disabled priority 32 cost 100
# Kick off the probing
$ insmod sja1105.ko
[   34.922908] sja1105 spi0.1: Probed switch chip: SJA1105T
$ insmod tag_sja1105.ko
$ echo spi0.1 > /sys/bus/spi/drivers/sja1105/bind
[   51.345993] sja1105 spi0.1: Probed switch chip: SJA1105T
[   51.378063] sja1105 spi0.1 swp5 (uninitialized): PHY [mdio@2d24000:06] driver [Broadcom BCM5464] (irq=POLL)
[   51.389880] sja1105 spi0.1 swp2 (uninitialized): PHY [mdio@2d24000:03] driver [Broadcom BCM5464] (irq=POLL)
[   51.401806] sja1105 spi0.1 swp3 (uninitialized): PHY [mdio@2d24000:04] driver [Broadcom BCM5464] (irq=POLL)
[   51.413710] sja1105 spi0.1 swp4 (uninitialized): PHY [mdio@2d24000:05] driver [Broadcom BCM5464] (irq=POLL)
[   51.424859] fsl-gianfar soc:ethernet@2d90000 eth2: Link is Up - 1Gbps/Full - flow control off
[   51.453768] sja1105 spi0.1: configuring for fixed/rgmii link mode
[   51.460094] device eth2 entered promiscuous mode
[   51.464856] DSA: tree 0 setup
[   51.477105] br0: port 3(swp2) entered blocking state
[   51.478394] sja1105 spi0.1: Link is Up - 1Gbps/Full - flow control off
[   51.482080] br0: port 3(swp2) entered disabled state
[   51.531585] device swp2 entered promiscuous mode
[   51.550365] sja1105 spi0.1 swp2: configuring for phy/rgmii-id link mode
[   51.559631] br0: port 4(swp5) entered blocking state
[   51.564597] br0: port 4(swp5) entered disabled state
[   51.586224] device swp5 entered promiscuous mode
[   51.647483] sja1105 spi0.1 swp5: configuring for phy/rgmii-id link mode
[   51.665995] br0: port 5(swp4) entered blocking state
[   51.671004] br0: port 5(swp4) entered disabled state
[   51.677991] device swp4 entered promiscuous mode
[   51.685967] br0: port 6(swp3) entered blocking state
[   51.690935] br0: port 6(swp3) entered disabled state
[   51.698246] device swp3 entered promiscuous mode
[   51.746640] sja1105 spi0.1 swp4: configuring for phy/rgmii-id link mode
[   51.754986] sja1105 spi0.1 swp3: configuring for phy/rgmii-id link mode
[   54.716225] sja1105 spi0.1 swp2: Link is Up - 1Gbps/Full - flow control off
[   54.723208] IPv6: ADDRCONF(NETDEV_CHANGE): swp2: link becomes ready
[   54.729620] br0: port 3(swp2) entered blocking state
[   54.734576] br0: port 3(swp2) entered forwarding state
[   54.796136] sja1105 spi0.1 swp5: Link is Up - 1Gbps/Full - flow control off
[   54.803117] IPv6: ADDRCONF(NETDEV_CHANGE): swp5: link becomes ready
[   54.809527] br0: port 4(swp5) entered blocking state
[   54.814484] br0: port 4(swp5) entered forwarding state
[   54.876397] sja1105 spi0.1 swp3: Link is Up - 1Gbps/Full - flow control off
[   54.883378] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
[   54.889790] br0: port 6(swp3) entered blocking state
[   54.894744] br0: port 6(swp3) entered forwarding state
# After
$ bridge link
7: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 4
8: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 master br0 state disabled priority 32 cost 100
12: swp5@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 4
13: swp2@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 4
14: swp3@eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 4
15: swp4@eth2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 master br0 state disabled priority 32 cost 100

The ports are ready to pass traffic, and are doing it.

So what does "wait for DSA to finish probing" mean? What driver, kernel
and systemd-networkd version is this, exactly, and what is it that needs
waiting?
Russell King (Oracle) Sept. 2, 2021, 4:31 p.m. UTC | #6
On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > Debian has had support for configuring bridges at boot time via
> > the interfaces file for years. Breaking that is going to upset a
> > lot of people (me included) resulting in busted networks. It
> > would be a sure way to make oneself unpopular.
> >
> > > I expect there to be 2 call paths of phy_attach_direct:
> > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > >   This is what has this patch addresses. There is no issue to return
> > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > >   they register their netdev. So if connecting defers, there is no
> > >   netdev to unregister, and user space knows nothing of this.
> > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > >   user space. If you open a netdev and it connects to the PHY then, I
> > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > >   that should have been settled by then, should it not? Where it might
> > >   get interesting is with NFS root, and I admit I haven't tested that.
> >
> > I don't think you can make that assumption. Consider the case where
> > systemd is being used, DSA stuff is modular, and we're trying to
> > setup a bridge device on DSA. DSA could be probing while the bridge
> > is being setup.
> >
> > Sadly, this isn't theoretical. I've ended up needing:
> >
> > 	pre-up sleep 1
> >
> > in my bridge configuration to allow time for DSA to finish probing.
> > It's not a pleasant solution, nor a particularly reliable one at
> > that, but it currently works around the problem.
> 
> What problem? This is the first time I've heard of this report, and you
> should definitely not need that.

I found it when upgrading the Clearfog by the DSL modems to v5.13.
When I rebooted it with a previously working kernel (v5.7) it has
never had a problem. With v5.13, it failed to add all the lan ports
into the bridge, because the bridge was still being setup by the
kernel while userspace was trying to configure it. Note that I have
extra debug in my kernels, hence the extra messages:

Aug 30 11:29:52 sw-dsl kernel: [    3.308583] Marvell 88E1540 mv88e6xxx-0:03: probe: irq=78
Aug 30 11:29:52 sw-dsl kernel: [    3.308595] Marvell 88E1540 mv88e6xxx-0:03: probe: irq=78
Aug 30 11:29:52 sw-dsl kernel: [    3.332403] Marvell 88E1540 mv88e6xxx-0:04: probe: irq=79
Aug 30 11:29:52 sw-dsl kernel: [    3.332415] Marvell 88E1540 mv88e6xxx-0:04: probe: irq=79
Aug 30 11:29:52 sw-dsl kernel: [    3.412638] Marvell 88E1545 mv88e6xxx-0:0f: probe: irq=-1
Aug 30 11:29:52 sw-dsl kernel: [    3.412649] Marvell 88E1545 mv88e6xxx-0:0f: probe: irq=-1
Aug 30 11:29:52 sw-dsl kernel: [    3.515888] libphy: mv88e6xxx SMI: probed

Here, userspace starts configuring eno1, the ethernet port connected
to the DSA switch:

Aug 30 11:29:52 sw-dsl kernel: [    3.536090] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
Aug 30 11:29:52 sw-dsl kernel: [    3.536109] mvneta f1030000.ethernet eno1: major config 1000base-x
Aug 30 11:29:52 sw-dsl kernel: [    3.536117] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=0000000,00000200,00002240 pause=04 link=0 an=1
Aug 30 11:29:52 sw-dsl kernel: [    3.536135] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:52 sw-dsl kernel: [    3.536135] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:52 sw-dsl kernel: [    3.536146] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:52 sw-dsl kernel: [    3.536146] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:52 sw-dsl kernel: [    3.572013] mvneta f1030000.ethernet eno1: mac link up
Aug 30 11:29:52 sw-dsl kernel: [    3.572016] mvneta f1030000.ethernet eno1: mac link up
Aug 30 11:29:52 sw-dsl kernel: [    3.572046] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
Aug 30 11:29:52 sw-dsl kernel: [    3.657820] 8021q: 802.1Q VLAN Support v1.8

We get the link to eno1 going down/up due to DSA's actions:

Aug 30 11:29:53 sw-dsl kernel: [    4.291882] mvneta f1030000.ethernet eno1: Link is Down
Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
Aug 30 11:29:53 sw-dsl kernel: [    4.309440] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
Aug 30 11:29:53 sw-dsl kernel: [    4.309447] mvneta f1030000.ethernet eno1: major config 1000base-x
Aug 30 11:29:53 sw-dsl kernel: [    4.309454] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=0000000,00000200,00002240 pause=04 link=0 an=1
Aug 30 11:29:53 sw-dsl kernel: [    4.345013] mvneta f1030000.ethernet eno1: mac link up
Aug 30 11:29:53 sw-dsl kernel: [    4.345014] mvneta f1030000.ethernet eno1: mac link up
Aug 30 11:29:53 sw-dsl kernel: [    4.345036] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx

DSA then starts initialising the ports:

Aug 30 11:29:53 sw-dsl kernel: [    4.397647] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=75)
Aug 30 11:29:53 sw-dsl kernel: [    4.397663] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
0000000,00000000,000022ef
Aug 30 11:29:53 sw-dsl kernel: [    4.493080] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=76)
Aug 30 11:29:53 sw-dsl kernel: [    4.493093] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
0000000,00000000,000022ef
Aug 30 11:29:53 sw-dsl kernel: [    4.577070] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=77)
Aug 30 11:29:53 sw-dsl kernel: [    4.577081] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
0000000,00000000,000022ef

Meanwhile userspace is trying to setup the bridge while this is going
on, and has tried to add the non-existent lan2 at this point, but
lan4 has just been created in time, so Debian's bridge support adds
it to the brdsl bridge:

Aug 30 11:29:53 sw-dsl kernel: [    4.652237] brdsl: port 1(lan4) entered blocking state
Aug 30 11:29:53 sw-dsl kernel: [    4.652250] brdsl: port 1(lan4) entered disabled state

DSA continues setting up the other ports, here lan2, but the bridge
setup scripts have already moved on past lan2.

Aug 30 11:29:53 sw-dsl kernel: [    4.674038] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=78)
Aug 30 11:29:53 sw-dsl kernel: [    4.674052] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
0000000,00000000,000022ef
Aug 30 11:29:53 sw-dsl kernel: [    4.674612] device lan4 entered promiscuous mode
Aug 30 11:29:53 sw-dsl kernel: [    4.785886] device eno1 entered promiscuous mode
Aug 30 11:29:53 sw-dsl kernel: [    4.786971] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
Aug 30 11:29:53 sw-dsl kernel: [    4.786980] mv88e6085 f1072004.mdio-mii:04 lan4: major config gmii
Aug 30 11:29:53 sw-dsl kernel: [    4.786986] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
Aug 30 11:29:53 sw-dsl kernel: [    4.786996] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
Aug 30 11:29:53 sw-dsl kernel: [    4.789977] 8021q: adding VLAN 0 to HW filter
on device lan4
Aug 30 11:29:53 sw-dsl kernel: [    4.836720] brdsl: port 2(eno2) entered blocking state
Aug 30 11:29:53 sw-dsl kernel: [    4.836733] brdsl: port 2(eno2) entered disabled state

Here, the SFP port (on eno2) is added to the bridge.

Aug 30 11:29:53 sw-dsl kernel: [    4.836907] device eno2 entered promiscuous mode
Aug 30 11:29:53 sw-dsl kernel: [    4.837011] brdsl: port 2(eno2) entered blocking state
Aug 30 11:29:53 sw-dsl kernel: [    4.837019] brdsl: port 2(eno2) entered forwarding state
Aug 30 11:29:53 sw-dsl kernel: [    4.837058] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
Aug 30 11:29:53 sw-dsl kernel: [    4.846989] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
Aug 30 11:29:53 sw-dsl kernel: [    4.896264] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=79)
Aug 30 11:29:53 sw-dsl kernel: [    4.896278] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
0000000,00000000,000022ef
Aug 30 11:29:53 sw-dsl kernel: [    4.934514] DSA: tree 0 setup

Here, the DSA tree has finally finished initialising in the kernel.

Aug 30 11:29:53 sw-dsl kernel: [    4.986877] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
Aug 30 11:29:53 sw-dsl kernel: [    4.986890] mv88e6085 f1072004.mdio-mii:04 lan1: major config gmii
Aug 30 11:29:53 sw-dsl kernel: [    4.986896] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
Aug 30 11:29:53 sw-dsl kernel: [    4.986907] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
Aug 30 11:29:53 sw-dsl kernel: [    4.990199] 8021q: adding VLAN 0 to HW filter
on device lan1
Aug 30 11:29:54 sw-dsl kernel: [    5.041313] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
Aug 30 11:29:56 sw-dsl kernel: [    7.630016] mv88e6085 f1072004.mdio-mii:04 lan4: phy link up gmii/1Gbps/Full/off
Aug 30 11:29:56 sw-dsl kernel: [    7.630031] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=00 link=1 an=0
Aug 30 11:29:56 sw-dsl kernel: [    7.630043] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
Aug 30 11:29:56 sw-dsl kernel: [    7.630294] mv88e6085 f1072004.mdio-mii:04 lan4: Link is Up - 1Gbps/Full - flow control off
Aug 30 11:29:56 sw-dsl kernel: [    7.630312] brdsl: port 1(lan4) entered blocking state
Aug 30 11:29:56 sw-dsl kernel: [    7.630321] brdsl: port 1(lan4) entered forwarding state

I then notice that my Internet connection hasn't come back, so I start
poking about with it, first adding it to the bridge:

Aug 30 11:31:13 sw-dsl kernel: [   84.990122] brdsl: port 3(lan2) entered blocking state
Aug 30 11:31:13 sw-dsl kernel: [   84.990134] brdsl: port 3(lan2) entered disabled state
Aug 30 11:31:14 sw-dsl kernel: [   85.063971] device lan2 entered promiscuous mode

And then setting it to up state and configuring its vlan settings:

Aug 30 11:32:45 sw-dsl kernel: [  176.476090] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
Aug 30 11:32:45 sw-dsl kernel: [  176.476103] mv88e6085 f1072004.mdio-mii:04 lan2: major config gmii
Aug 30 11:32:45 sw-dsl kernel: [  176.476109] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
Aug 30 11:32:45 sw-dsl kernel: [  176.476120] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
Aug 30 11:32:45 sw-dsl kernel: [  176.479495] 8021q: adding VLAN 0 to HW filter
on device lan2
Aug 30 11:32:45 sw-dsl kernel: [  176.537796] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
Aug 30 11:32:48 sw-dsl kernel: [  179.280863] mv88e6085 f1072004.mdio-mii:04 lan2: phy link up gmii/1Gbps/Full/rx/tx
Aug 30 11:32:48 sw-dsl kernel: [  179.280877] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=03 link=1 an=0
Aug 30 11:32:48 sw-dsl kernel: [  179.280888] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
Aug 30 11:32:48 sw-dsl kernel: [  179.280894] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_link_up()
Aug 30 11:32:48 sw-dsl kernel: [  179.282958] mv88e6085 f1072004.mdio-mii:04 lan2: Link is Up - 1Gbps/Full - flow control rx/tx

I had:

iface brdsl inet manual
        bridge-ports lan2 lan4
        bridge-maxwait 0
	up brctl addif $IFACE eno2

I now have:
iface brdsl inet manual
        bridge-ports lan2 lan4
        bridge-waitport 10
        bridge-maxwait 0
        pre-up sleep 1
	up brctl addif $IFACE eno2

to ensure that all ports get properly configured.

What can be seen from the above is that there is most definitely a race.
It is possible to start configuring a DSA switch before the DSA switch
driver has finished being probed by the kernel.

Here is the kernel log from v5.7 which has never showed these problems,
because DSA seemed to always setup everything in kernel space prior to
userspace beginning configuration:

Aug 25 23:03:54 sw-dsl kernel: [    5.793137] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
Aug 25 23:03:54 sw-dsl kernel: [    5.793148] mvneta f1030000.ethernet eno1: config interface 1000base-x
Aug 25 23:03:54 sw-dsl kernel: [    5.793157] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
Aug 25 23:03:54 sw-dsl kernel: [    5.793168] mvneta f1030000.ethernet eno1: mac link down
Aug 25 23:03:54 sw-dsl kernel: [    5.793170] mvneta f1030000.ethernet eno1: mac link down
Aug 25 23:03:54 sw-dsl kernel: [    5.819769] mvneta f1030000.ethernet eno1: mac link up
Aug 25 23:03:54 sw-dsl kernel: [    5.819792] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
Aug 25 23:03:54 sw-dsl kernel: [    5.948900] 8021q: 802.1Q VLAN Support v1.8
  6.459779] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 0
Aug 25 23:03:54 sw-dsl kernel: [    6.462890] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=67)
Aug 25 23:03:54 sw-dsl kernel: [    6.462905] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
Aug 25 23:03:54 sw-dsl kernel: [    6.465904] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 1
Aug 25 23:03:54 sw-dsl kernel: [    6.468101] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=68)
Aug 25 23:03:54 sw-dsl kernel: [    6.468109] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
Aug 25 23:03:54 sw-dsl kernel: [    6.472162] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 2
Aug 25 23:03:54 sw-dsl kernel: [    6.474247] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=69)
Aug 25 23:03:54 sw-dsl kernel: [    6.474261] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
Aug 25 23:03:54 sw-dsl kernel: [    6.481824] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 3
Aug 25 23:03:54 sw-dsl kernel: [    6.486354] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=70)
Aug 25 23:03:54 sw-dsl kernel: [    6.486363] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
Aug 25 23:03:54 sw-dsl kernel: [    6.498494] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 4
Aug 25 23:03:54 sw-dsl kernel: [    6.502272] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=71)
Aug 25 23:03:54 sw-dsl kernel: [    6.502279] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
Aug 25 23:03:54 sw-dsl kernel: [    6.532258] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 6
Aug 25 23:03:54 sw-dsl kernel: [    6.535877] mvneta f1030000.ethernet eno1: Link is Down
Aug 25 23:03:54 sw-dsl kernel: [    6.541733] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
Aug 25 23:03:54 sw-dsl kernel: [    6.541741] mvneta f1030000.ethernet eno1: config interface 1000base-x
Aug 25 23:03:54 sw-dsl kernel: [    6.541754] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
Aug 25 23:03:54 sw-dsl kernel: [    6.541771] mvneta f1030000.ethernet eno1: mac link down
Aug 25 23:03:54 sw-dsl kernel: [    6.541779] mvneta f1030000.ethernet eno1: mac link down
Aug 25 23:03:54 sw-dsl kernel: [    6.541907] DSA: tree 0 setup

Here, the kernel DSA switch driver has finished doing its setup
before we even get to configuring the bridge device below.

Aug 25 23:03:54 sw-dsl kernel: [    6.569105] mvneta f1030000.ethernet eno1: mac link up
Aug 25 23:03:54 sw-dsl kernel: [    6.569113] mvneta f1030000.ethernet eno1: mac link up
Aug 25 23:03:54 sw-dsl kernel: [    6.569139] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
Aug 25 23:03:55 sw-dsl kernel: [    6.931763] brdsl: port 1(lan2) entered blocking state
Aug 25 23:03:55 sw-dsl kernel: [    6.931769] brdsl: port 1(lan2) entered disabled state
Aug 25 23:03:55 sw-dsl kernel: [    6.932863] device lan2 entered promiscuous mode
Aug 25 23:03:55 sw-dsl kernel: [    7.032838] device eno1 entered promiscuous mode
Aug 25 23:03:55 sw-dsl kernel: [    7.032902] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
Aug 25 23:03:55 sw-dsl kernel: [    7.032907] mv88e6085 f1072004.mdio-mii:04 lan2: config interface gmii
Aug 25 23:03:55 sw-dsl kernel: [    7.032916] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
Aug 25 23:03:55 sw-dsl kernel: [    7.032920] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
Aug 25 23:03:55 sw-dsl kernel: [    7.037225] 8021q: adding VLAN 0 to HW filter
on device lan2
Aug 25 23:03:55 sw-dsl kernel: [    7.044979] brdsl: port 2(lan4) entered blocking state
Aug 25 23:03:55 sw-dsl kernel: [    7.044985] brdsl: port 2(lan4) entered disabled state
Aug 25 23:03:55 sw-dsl kernel: [    7.056189] device lan4 entered promiscuous mode
Aug 25 23:03:55 sw-dsl kernel: [    7.107067] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
Aug 25 23:03:55 sw-dsl kernel: [    7.107073] mv88e6085 f1072004.mdio-mii:04 lan4: config interface gmii
Aug 25 23:03:55 sw-dsl kernel: [    7.107080] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
Aug 25 23:03:55 sw-dsl kernel: [    7.107084] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
Aug 25 23:03:55 sw-dsl kernel: [    7.118831] 8021q: adding VLAN 0 to HW filter
on device lan4
Aug 25 23:03:55 sw-dsl kernel: [    7.153604] brdsl: port 3(eno2) entered blocking state
Aug 25 23:03:55 sw-dsl kernel: [    7.153610] brdsl: port 3(eno2) entered disabled state
Aug 25 23:03:55 sw-dsl kernel: [    7.153720] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
Aug 25 23:03:55 sw-dsl kernel: [    7.153790] device eno2 entered promiscuous mode
Aug 25 23:03:55 sw-dsl kernel: [    7.153890] brdsl: port 3(eno2) entered blocking state
Aug 25 23:03:55 sw-dsl kernel: [    7.153895] brdsl: port 3(eno2) entered forwarding state
Aug 25 23:03:55 sw-dsl kernel: [    7.153930] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
Aug 25 23:03:55 sw-dsl kernel: [    7.295739] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
Aug 25 23:03:55 sw-dsl kernel: [    7.575615] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
Aug 25 23:03:55 sw-dsl kernel: [    7.575622] mv88e6085 f1072004.mdio-mii:04 lan1: config interface gmii
Aug 25 23:03:55 sw-dsl kernel: [    7.575630] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
Aug 25 23:03:55 sw-dsl kernel: [    7.575634] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
Aug 25 23:03:55 sw-dsl kernel: [    7.579334] 8021q: adding VLAN 0 to HW filter
on device lan1
Aug 25 23:03:55 sw-dsl kernel: [    7.635966] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
Vladimir Oltean Sept. 2, 2021, 5:10 p.m. UTC | #7
On Thu, Sep 02, 2021 at 05:31:44PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > > Debian has had support for configuring bridges at boot time via
> > > the interfaces file for years. Breaking that is going to upset a
> > > lot of people (me included) resulting in busted networks. It
> > > would be a sure way to make oneself unpopular.
> > >
> > > > I expect there to be 2 call paths of phy_attach_direct:
> > > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > > >   This is what has this patch addresses. There is no issue to return
> > > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > > >   they register their netdev. So if connecting defers, there is no
> > > >   netdev to unregister, and user space knows nothing of this.
> > > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > > >   user space. If you open a netdev and it connects to the PHY then, I
> > > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > > >   that should have been settled by then, should it not? Where it might
> > > >   get interesting is with NFS root, and I admit I haven't tested that.
> > >
> > > I don't think you can make that assumption. Consider the case where
> > > systemd is being used, DSA stuff is modular, and we're trying to
> > > setup a bridge device on DSA. DSA could be probing while the bridge
> > > is being setup.
> > >
> > > Sadly, this isn't theoretical. I've ended up needing:
> > >
> > > 	pre-up sleep 1
> > >
> > > in my bridge configuration to allow time for DSA to finish probing.
> > > It's not a pleasant solution, nor a particularly reliable one at
> > > that, but it currently works around the problem.
> > 
> > What problem? This is the first time I've heard of this report, and you
> > should definitely not need that.
> 
> I found it when upgrading the Clearfog by the DSL modems to v5.13.
> When I rebooted it with a previously working kernel (v5.7) it has
> never had a problem. With v5.13, it failed to add all the lan ports
> into the bridge, because the bridge was still being setup by the
> kernel while userspace was trying to configure it. Note that I have
> extra debug in my kernels, hence the extra messages:

Ok, first you talked about the interfaces file, then systemd. If it's
not about systemd's network manager then I don't see how it is relevant.
What package and version is this exactly, ifupdown, ifupdown2,
ifupdown-ng, busybox ifupdown? I think they all use the interfaces file.

> Aug 30 11:29:52 sw-dsl kernel: [    3.308583] Marvell 88E1540 mv88e6xxx-0:03: probe: irq=78
> Aug 30 11:29:52 sw-dsl kernel: [    3.308595] Marvell 88E1540 mv88e6xxx-0:03: probe: irq=78
> Aug 30 11:29:52 sw-dsl kernel: [    3.332403] Marvell 88E1540 mv88e6xxx-0:04: probe: irq=79
> Aug 30 11:29:52 sw-dsl kernel: [    3.332415] Marvell 88E1540 mv88e6xxx-0:04: probe: irq=79
> Aug 30 11:29:52 sw-dsl kernel: [    3.412638] Marvell 88E1545 mv88e6xxx-0:0f: probe: irq=-1
> Aug 30 11:29:52 sw-dsl kernel: [    3.412649] Marvell 88E1545 mv88e6xxx-0:0f: probe: irq=-1
> Aug 30 11:29:52 sw-dsl kernel: [    3.515888] libphy: mv88e6xxx SMI: probed
> 
> Here, userspace starts configuring eno1, the ethernet port connected
> to the DSA switch:
> 
> Aug 30 11:29:52 sw-dsl kernel: [    3.536090] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> Aug 30 11:29:52 sw-dsl kernel: [    3.536109] mvneta f1030000.ethernet eno1: major config 1000base-x
> Aug 30 11:29:52 sw-dsl kernel: [    3.536117] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=0000000,00000200,00002240 pause=04 link=0 an=1
> Aug 30 11:29:52 sw-dsl kernel: [    3.536135] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:52 sw-dsl kernel: [    3.536135] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:52 sw-dsl kernel: [    3.536146] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:52 sw-dsl kernel: [    3.536146] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:52 sw-dsl kernel: [    3.572013] mvneta f1030000.ethernet eno1: mac link up
> Aug 30 11:29:52 sw-dsl kernel: [    3.572016] mvneta f1030000.ethernet eno1: mac link up
> Aug 30 11:29:52 sw-dsl kernel: [    3.572046] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> Aug 30 11:29:52 sw-dsl kernel: [    3.657820] 8021q: 802.1Q VLAN Support v1.8
> 
> We get the link to eno1 going down/up due to DSA's actions:

What "actions"? There were only 2 DSA changes related to the state of
the master interface, but DSA never forces the master to go down. Quite
the opposite, it forces the master up when it needs to, and it goes down
when the master goes down. See:

9d5ef190e561 ("net: dsa: automatically bring up DSA master when opening user port")
c0a8a9c27493 ("net: dsa: automatically bring user ports down when master goes down")

So if eno1 goes down and that causes breakage, DSA did not trigger it.
Also, please note that eno1 goes down in your "working" example too.

> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.291882] mvneta f1030000.ethernet eno1: Link is Down
> Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
> Aug 30 11:29:53 sw-dsl kernel: [    4.309440] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.309447] mvneta f1030000.ethernet eno1: major config 1000base-x
> Aug 30 11:29:53 sw-dsl kernel: [    4.309454] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=0000000,00000200,00002240 pause=04 link=0 an=1
> Aug 30 11:29:53 sw-dsl kernel: [    4.345013] mvneta f1030000.ethernet eno1: mac link up
> Aug 30 11:29:53 sw-dsl kernel: [    4.345014] mvneta f1030000.ethernet eno1: mac link up
> Aug 30 11:29:53 sw-dsl kernel: [    4.345036] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> 
> DSA then starts initialising the ports:
> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.397647] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=75)
> Aug 30 11:29:53 sw-dsl kernel: [    4.397663] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> 0000000,00000000,000022ef
> Aug 30 11:29:53 sw-dsl kernel: [    4.493080] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=76)
> Aug 30 11:29:53 sw-dsl kernel: [    4.493093] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> 0000000,00000000,000022ef
> Aug 30 11:29:53 sw-dsl kernel: [    4.577070] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=77)
> Aug 30 11:29:53 sw-dsl kernel: [    4.577081] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> 0000000,00000000,000022ef
> 
> Meanwhile userspace is trying to setup the bridge while this is going
> on, and has tried to add the non-existent lan2 at this point, but
> lan4 has just been created in time, so Debian's bridge support adds
> it to the brdsl bridge:
> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.652237] brdsl: port 1(lan4) entered blocking state
> Aug 30 11:29:53 sw-dsl kernel: [    4.652250] brdsl: port 1(lan4) entered disabled state
> 
> DSA continues setting up the other ports, here lan2, but the bridge
> setup scripts have already moved on past lan2.

How does this program know that lan2 exists before it starts attempting
to enslave it to a bridge via the brctl program, and what does DSA do to
violate that assumption?

> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.674038] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=78)
> Aug 30 11:29:53 sw-dsl kernel: [    4.674052] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> 0000000,00000000,000022ef
> Aug 30 11:29:53 sw-dsl kernel: [    4.674612] device lan4 entered promiscuous mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.785886] device eno1 entered promiscuous mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.786971] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.786980] mv88e6085 f1072004.mdio-mii:04 lan4: major config gmii
> Aug 30 11:29:53 sw-dsl kernel: [    4.786986] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> Aug 30 11:29:53 sw-dsl kernel: [    4.786996] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> Aug 30 11:29:53 sw-dsl kernel: [    4.789977] 8021q: adding VLAN 0 to HW filter
> on device lan4
> Aug 30 11:29:53 sw-dsl kernel: [    4.836720] brdsl: port 2(eno2) entered blocking state
> Aug 30 11:29:53 sw-dsl kernel: [    4.836733] brdsl: port 2(eno2) entered disabled state
> 
> Here, the SFP port (on eno2) is added to the bridge.
> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.836907] device eno2 entered promiscuous mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.837011] brdsl: port 2(eno2) entered blocking state
> Aug 30 11:29:53 sw-dsl kernel: [    4.837019] brdsl: port 2(eno2) entered forwarding state
> Aug 30 11:29:53 sw-dsl kernel: [    4.837058] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
> Aug 30 11:29:53 sw-dsl kernel: [    4.846989] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
> Aug 30 11:29:53 sw-dsl kernel: [    4.896264] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=79)
> Aug 30 11:29:53 sw-dsl kernel: [    4.896278] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> 0000000,00000000,000022ef
> Aug 30 11:29:53 sw-dsl kernel: [    4.934514] DSA: tree 0 setup
> 
> Here, the DSA tree has finally finished initialising in the kernel.
> 
> Aug 30 11:29:53 sw-dsl kernel: [    4.986877] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
> Aug 30 11:29:53 sw-dsl kernel: [    4.986890] mv88e6085 f1072004.mdio-mii:04 lan1: major config gmii
> Aug 30 11:29:53 sw-dsl kernel: [    4.986896] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> Aug 30 11:29:53 sw-dsl kernel: [    4.986907] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
> Aug 30 11:29:53 sw-dsl kernel: [    4.990199] 8021q: adding VLAN 0 to HW filter
> on device lan1
> Aug 30 11:29:54 sw-dsl kernel: [    5.041313] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
> Aug 30 11:29:56 sw-dsl kernel: [    7.630016] mv88e6085 f1072004.mdio-mii:04 lan4: phy link up gmii/1Gbps/Full/off
> Aug 30 11:29:56 sw-dsl kernel: [    7.630031] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=00 link=1 an=0
> Aug 30 11:29:56 sw-dsl kernel: [    7.630043] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> Aug 30 11:29:56 sw-dsl kernel: [    7.630294] mv88e6085 f1072004.mdio-mii:04 lan4: Link is Up - 1Gbps/Full - flow control off
> Aug 30 11:29:56 sw-dsl kernel: [    7.630312] brdsl: port 1(lan4) entered blocking state
> Aug 30 11:29:56 sw-dsl kernel: [    7.630321] brdsl: port 1(lan4) entered forwarding state
> 
> I then notice that my Internet connection hasn't come back, so I start
> poking about with it, first adding it to the bridge:
> 
> Aug 30 11:31:13 sw-dsl kernel: [   84.990122] brdsl: port 3(lan2) entered blocking state
> Aug 30 11:31:13 sw-dsl kernel: [   84.990134] brdsl: port 3(lan2) entered disabled state
> Aug 30 11:31:14 sw-dsl kernel: [   85.063971] device lan2 entered promiscuous mode
> 
> And then setting it to up state and configuring its vlan settings:
> 
> Aug 30 11:32:45 sw-dsl kernel: [  176.476090] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
> Aug 30 11:32:45 sw-dsl kernel: [  176.476103] mv88e6085 f1072004.mdio-mii:04 lan2: major config gmii
> Aug 30 11:32:45 sw-dsl kernel: [  176.476109] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> Aug 30 11:32:45 sw-dsl kernel: [  176.476120] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> Aug 30 11:32:45 sw-dsl kernel: [  176.479495] 8021q: adding VLAN 0 to HW filter
> on device lan2
> Aug 30 11:32:45 sw-dsl kernel: [  176.537796] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
> Aug 30 11:32:48 sw-dsl kernel: [  179.280863] mv88e6085 f1072004.mdio-mii:04 lan2: phy link up gmii/1Gbps/Full/rx/tx
> Aug 30 11:32:48 sw-dsl kernel: [  179.280877] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=03 link=1 an=0
> Aug 30 11:32:48 sw-dsl kernel: [  179.280888] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> Aug 30 11:32:48 sw-dsl kernel: [  179.280894] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_link_up()
> Aug 30 11:32:48 sw-dsl kernel: [  179.282958] mv88e6085 f1072004.mdio-mii:04 lan2: Link is Up - 1Gbps/Full - flow control rx/tx
> 
> I had:
> 
> iface brdsl inet manual
>         bridge-ports lan2 lan4
>         bridge-maxwait 0
> 	up brctl addif $IFACE eno2
> 
> I now have:
> iface brdsl inet manual
>         bridge-ports lan2 lan4
>         bridge-waitport 10
>         bridge-maxwait 0
>         pre-up sleep 1
> 	up brctl addif $IFACE eno2

I searched google for the "bridge-ports" keyword relative to ifupdown
and could not find the source code of a program which parses this. Could
you let me know what is the source code of the program you are using?

> 
> to ensure that all ports get properly configured.
> 
> What can be seen from the above is that there is most definitely a race.
> It is possible to start configuring a DSA switch before the DSA switch
> driver has finished being probed by the kernel.
> 
> Here is the kernel log from v5.7 which has never showed these problems,
> because DSA seemed to always setup everything in kernel space prior to
> userspace beginning configuration:
> 
> Aug 25 23:03:54 sw-dsl kernel: [    5.793137] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> Aug 25 23:03:54 sw-dsl kernel: [    5.793148] mvneta f1030000.ethernet eno1: config interface 1000base-x
> Aug 25 23:03:54 sw-dsl kernel: [    5.793157] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
> Aug 25 23:03:54 sw-dsl kernel: [    5.793168] mvneta f1030000.ethernet eno1: mac link down
> Aug 25 23:03:54 sw-dsl kernel: [    5.793170] mvneta f1030000.ethernet eno1: mac link down
> Aug 25 23:03:54 sw-dsl kernel: [    5.819769] mvneta f1030000.ethernet eno1: mac link up
> Aug 25 23:03:54 sw-dsl kernel: [    5.819792] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> Aug 25 23:03:54 sw-dsl kernel: [    5.948900] 8021q: 802.1Q VLAN Support v1.8
>   6.459779] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 0
> Aug 25 23:03:54 sw-dsl kernel: [    6.462890] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=67)
> Aug 25 23:03:54 sw-dsl kernel: [    6.462905] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> Aug 25 23:03:54 sw-dsl kernel: [    6.465904] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 1
> Aug 25 23:03:54 sw-dsl kernel: [    6.468101] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=68)
> Aug 25 23:03:54 sw-dsl kernel: [    6.468109] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> Aug 25 23:03:54 sw-dsl kernel: [    6.472162] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 2
> Aug 25 23:03:54 sw-dsl kernel: [    6.474247] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=69)
> Aug 25 23:03:54 sw-dsl kernel: [    6.474261] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> Aug 25 23:03:54 sw-dsl kernel: [    6.481824] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 3
> Aug 25 23:03:54 sw-dsl kernel: [    6.486354] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=70)
> Aug 25 23:03:54 sw-dsl kernel: [    6.486363] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> Aug 25 23:03:54 sw-dsl kernel: [    6.498494] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 4
> Aug 25 23:03:54 sw-dsl kernel: [    6.502272] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=71)
> Aug 25 23:03:54 sw-dsl kernel: [    6.502279] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> Aug 25 23:03:54 sw-dsl kernel: [    6.532258] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 6
> Aug 25 23:03:54 sw-dsl kernel: [    6.535877] mvneta f1030000.ethernet eno1: Link is Down
> Aug 25 23:03:54 sw-dsl kernel: [    6.541733] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> Aug 25 23:03:54 sw-dsl kernel: [    6.541741] mvneta f1030000.ethernet eno1: config interface 1000base-x
> Aug 25 23:03:54 sw-dsl kernel: [    6.541754] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
> Aug 25 23:03:54 sw-dsl kernel: [    6.541771] mvneta f1030000.ethernet eno1: mac link down
> Aug 25 23:03:54 sw-dsl kernel: [    6.541779] mvneta f1030000.ethernet eno1: mac link down
> Aug 25 23:03:54 sw-dsl kernel: [    6.541907] DSA: tree 0 setup
> 
> Here, the kernel DSA switch driver has finished doing its setup
> before we even get to configuring the bridge device below.
> 
> Aug 25 23:03:54 sw-dsl kernel: [    6.569105] mvneta f1030000.ethernet eno1: mac link up
> Aug 25 23:03:54 sw-dsl kernel: [    6.569113] mvneta f1030000.ethernet eno1: mac link up
> Aug 25 23:03:54 sw-dsl kernel: [    6.569139] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> Aug 25 23:03:55 sw-dsl kernel: [    6.931763] brdsl: port 1(lan2) entered blocking state
> Aug 25 23:03:55 sw-dsl kernel: [    6.931769] brdsl: port 1(lan2) entered disabled state
> Aug 25 23:03:55 sw-dsl kernel: [    6.932863] device lan2 entered promiscuous mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.032838] device eno1 entered promiscuous mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.032902] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.032907] mv88e6085 f1072004.mdio-mii:04 lan2: config interface gmii
> Aug 25 23:03:55 sw-dsl kernel: [    7.032916] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> Aug 25 23:03:55 sw-dsl kernel: [    7.032920] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> Aug 25 23:03:55 sw-dsl kernel: [    7.037225] 8021q: adding VLAN 0 to HW filter
> on device lan2
> Aug 25 23:03:55 sw-dsl kernel: [    7.044979] brdsl: port 2(lan4) entered blocking state
> Aug 25 23:03:55 sw-dsl kernel: [    7.044985] brdsl: port 2(lan4) entered disabled state
> Aug 25 23:03:55 sw-dsl kernel: [    7.056189] device lan4 entered promiscuous mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.107067] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.107073] mv88e6085 f1072004.mdio-mii:04 lan4: config interface gmii
> Aug 25 23:03:55 sw-dsl kernel: [    7.107080] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> Aug 25 23:03:55 sw-dsl kernel: [    7.107084] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> Aug 25 23:03:55 sw-dsl kernel: [    7.118831] 8021q: adding VLAN 0 to HW filter
> on device lan4
> Aug 25 23:03:55 sw-dsl kernel: [    7.153604] brdsl: port 3(eno2) entered blocking state
> Aug 25 23:03:55 sw-dsl kernel: [    7.153610] brdsl: port 3(eno2) entered disabled state
> Aug 25 23:03:55 sw-dsl kernel: [    7.153720] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
> Aug 25 23:03:55 sw-dsl kernel: [    7.153790] device eno2 entered promiscuous mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.153890] brdsl: port 3(eno2) entered blocking state
> Aug 25 23:03:55 sw-dsl kernel: [    7.153895] brdsl: port 3(eno2) entered forwarding state
> Aug 25 23:03:55 sw-dsl kernel: [    7.153930] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
> Aug 25 23:03:55 sw-dsl kernel: [    7.295739] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
> Aug 25 23:03:55 sw-dsl kernel: [    7.575615] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
> Aug 25 23:03:55 sw-dsl kernel: [    7.575622] mv88e6085 f1072004.mdio-mii:04 lan1: config interface gmii
> Aug 25 23:03:55 sw-dsl kernel: [    7.575630] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> Aug 25 23:03:55 sw-dsl kernel: [    7.575634] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
> Aug 25 23:03:55 sw-dsl kernel: [    7.579334] 8021q: adding VLAN 0 to HW filter
> on device lan1
> Aug 25 23:03:55 sw-dsl kernel: [    7.635966] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
> 
> -- 
> RMK's Patch system: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.armlinux.org.uk%2Fdeveloper%2Fpatches%2F&amp;data=04%7C01%7Cvladimir.oltean%40nxp.com%7C4226a7652ae7497284df08d96e2f29e4%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C637661971114812881%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6hDf%2FS%2FMnpRhzEYuW14zuaEAcaTgdMsQJPpmR9WA5cI%3D&amp;reserved=0
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Russell King (Oracle) Sept. 2, 2021, 5:50 p.m. UTC | #8
On Thu, Sep 02, 2021 at 05:10:34PM +0000, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 05:31:44PM +0100, Russell King (Oracle) wrote:
> > On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> > > On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > > > Debian has had support for configuring bridges at boot time via
> > > > the interfaces file for years. Breaking that is going to upset a
> > > > lot of people (me included) resulting in busted networks. It
> > > > would be a sure way to make oneself unpopular.
> > > >
> > > > > I expect there to be 2 call paths of phy_attach_direct:
> > > > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > > > >   This is what has this patch addresses. There is no issue to return
> > > > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > > > >   they register their netdev. So if connecting defers, there is no
> > > > >   netdev to unregister, and user space knows nothing of this.
> > > > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > > > >   user space. If you open a netdev and it connects to the PHY then, I
> > > > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > > > >   that should have been settled by then, should it not? Where it might
> > > > >   get interesting is with NFS root, and I admit I haven't tested that.
> > > >
> > > > I don't think you can make that assumption. Consider the case where
> > > > systemd is being used, DSA stuff is modular, and we're trying to
> > > > setup a bridge device on DSA. DSA could be probing while the bridge
> > > > is being setup.
> > > >
> > > > Sadly, this isn't theoretical. I've ended up needing:
> > > >
> > > > 	pre-up sleep 1
> > > >
> > > > in my bridge configuration to allow time for DSA to finish probing.
> > > > It's not a pleasant solution, nor a particularly reliable one at
> > > > that, but it currently works around the problem.
> > > 
> > > What problem? This is the first time I've heard of this report, and you
> > > should definitely not need that.
> > 
> > I found it when upgrading the Clearfog by the DSL modems to v5.13.
> > When I rebooted it with a previously working kernel (v5.7) it has
> > never had a problem. With v5.13, it failed to add all the lan ports
> > into the bridge, because the bridge was still being setup by the
> > kernel while userspace was trying to configure it. Note that I have
> > extra debug in my kernels, hence the extra messages:
> 
> Ok, first you talked about the interfaces file, then systemd. If it's
> not about systemd's network manager then I don't see how it is relevant.

You're reading in stuff to what I write that I did not write... I said:

"Consider the case where systemd is being used, DSA stuff is modular,
and we're trying to setup a bridge device on DSA."

That does not mean I'm using systemd's network manager - which is
something I know little about and have never used.

The reason I mentioned systemd is precisely because with systemd, you
get a hell of a lot happening parallel - and that's significiant in
this case, because it's very clear that modules are being loaded in
parallel with networking being brought up - and that is where the
problems begin. In fact, modules themselves get loaded in paralllel
with systemd.

> What package and version is this exactly, ifupdown, ifupdown2,
> ifupdown-ng, busybox ifupdown? I think they all use the interfaces file.

It's a standard uptodate debian oldstable (buster) install, not yet
upgraded to bullseye:

ifupdown 0.8.35
bridge-utils 1.6-2

> > We get the link to eno1 going down/up due to DSA's actions:
> 
> What "actions"? There were only 2 DSA changes related to the state of
> the master interface, but DSA never forces the master to go down. Quite
> the opposite, it forces the master up when it needs to, and it goes down
> when the master goes down. See:
> 
> 9d5ef190e561 ("net: dsa: automatically bring up DSA master when opening user port")
> c0a8a9c27493 ("net: dsa: automatically bring user ports down when master goes down")

mv88e6xxx will temporarily force the link down while the port is
being configured if one asks it to operate in in-band mode (which
I have.)

> So if eno1 goes down and that causes breakage, DSA did not trigger it.
> Also, please note that eno1 goes down in your "working" example too.

I'm not complaining about this. It's a non-problem. It does however
serve as an indication where we are through the bring-up.

> > Aug 30 11:29:53 sw-dsl kernel: [    4.291882] mvneta f1030000.ethernet eno1: Link is Down
> > Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
> > Aug 30 11:29:53 sw-dsl kernel: [    4.309425] mvneta f1030000.ethernet eno1: mac link down
> > Aug 30 11:29:53 sw-dsl kernel: [    4.309440] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.309447] mvneta f1030000.ethernet eno1: major config 1000base-x
> > Aug 30 11:29:53 sw-dsl kernel: [    4.309454] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=0000000,00000200,00002240 pause=04 link=0 an=1
> > Aug 30 11:29:53 sw-dsl kernel: [    4.345013] mvneta f1030000.ethernet eno1: mac link up
> > Aug 30 11:29:53 sw-dsl kernel: [    4.345014] mvneta f1030000.ethernet eno1: mac link up
> > Aug 30 11:29:53 sw-dsl kernel: [    4.345036] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> > 
> > DSA then starts initialising the ports:
> > 
> > Aug 30 11:29:53 sw-dsl kernel: [    4.397647] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=75)
> > Aug 30 11:29:53 sw-dsl kernel: [    4.397663] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> > 0000000,00000000,000022ef
> > Aug 30 11:29:53 sw-dsl kernel: [    4.493080] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=76)
> > Aug 30 11:29:53 sw-dsl kernel: [    4.493093] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> > 0000000,00000000,000022ef
> > Aug 30 11:29:53 sw-dsl kernel: [    4.577070] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=77)
> > Aug 30 11:29:53 sw-dsl kernel: [    4.577081] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> > 0000000,00000000,000022ef
> > 
> > Meanwhile userspace is trying to setup the bridge while this is going
> > on, and has tried to add the non-existent lan2 at this point, but
> > lan4 has just been created in time, so Debian's bridge support adds
> > it to the brdsl bridge:
> > 
> > Aug 30 11:29:53 sw-dsl kernel: [    4.652237] brdsl: port 1(lan4) entered blocking state
> > Aug 30 11:29:53 sw-dsl kernel: [    4.652250] brdsl: port 1(lan4) entered disabled state
> > 
> > DSA continues setting up the other ports, here lan2, but the bridge
> > setup scripts have already moved on past lan2.
> 
> How does this program know that lan2 exists before it starts attempting
> to enslave it to a bridge via the brctl program, and what does DSA do to
> violate that assumption?

This is the whole point I'm trying to get across to you - these are
_scripts_. They aren't some fancy program that runs in the background.
They assume that the interfaces are already there - as can be seen
from my v5.7 log, they are. With v5.13, they aren't because stuff starts
coming up while DSA is still initialising.

> > Aug 30 11:29:53 sw-dsl kernel: [    4.674038] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=78)
> > Aug 30 11:29:53 sw-dsl kernel: [    4.674052] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> > 0000000,00000000,000022ef
> > Aug 30 11:29:53 sw-dsl kernel: [    4.674612] device lan4 entered promiscuous mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.785886] device eno1 entered promiscuous mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.786971] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.786980] mv88e6085 f1072004.mdio-mii:04 lan4: major config gmii
> > Aug 30 11:29:53 sw-dsl kernel: [    4.786986] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> > Aug 30 11:29:53 sw-dsl kernel: [    4.786996] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> > Aug 30 11:29:53 sw-dsl kernel: [    4.789977] 8021q: adding VLAN 0 to HW filter
> > on device lan4
> > Aug 30 11:29:53 sw-dsl kernel: [    4.836720] brdsl: port 2(eno2) entered blocking state
> > Aug 30 11:29:53 sw-dsl kernel: [    4.836733] brdsl: port 2(eno2) entered disabled state
> > 
> > Here, the SFP port (on eno2) is added to the bridge.
> > 
> > Aug 30 11:29:53 sw-dsl kernel: [    4.836907] device eno2 entered promiscuous mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.837011] brdsl: port 2(eno2) entered blocking state
> > Aug 30 11:29:53 sw-dsl kernel: [    4.837019] brdsl: port 2(eno2) entered forwarding state
> > Aug 30 11:29:53 sw-dsl kernel: [    4.837058] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
> > Aug 30 11:29:53 sw-dsl kernel: [    4.846989] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
> > Aug 30 11:29:53 sw-dsl kernel: [    4.896264] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=79)
> > Aug 30 11:29:53 sw-dsl kernel: [    4.896278] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 0000000,00000000,000022ef advertising
> > 0000000,00000000,000022ef
> > Aug 30 11:29:53 sw-dsl kernel: [    4.934514] DSA: tree 0 setup
> > 
> > Here, the DSA tree has finally finished initialising in the kernel.
> > 
> > Aug 30 11:29:53 sw-dsl kernel: [    4.986877] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
> > Aug 30 11:29:53 sw-dsl kernel: [    4.986890] mv88e6085 f1072004.mdio-mii:04 lan1: major config gmii
> > Aug 30 11:29:53 sw-dsl kernel: [    4.986896] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> > Aug 30 11:29:53 sw-dsl kernel: [    4.986907] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
> > Aug 30 11:29:53 sw-dsl kernel: [    4.990199] 8021q: adding VLAN 0 to HW filter
> > on device lan1
> > Aug 30 11:29:54 sw-dsl kernel: [    5.041313] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630016] mv88e6085 f1072004.mdio-mii:04 lan4: phy link up gmii/1Gbps/Full/off
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630031] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=00 link=1 an=0
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630043] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630294] mv88e6085 f1072004.mdio-mii:04 lan4: Link is Up - 1Gbps/Full - flow control off
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630312] brdsl: port 1(lan4) entered blocking state
> > Aug 30 11:29:56 sw-dsl kernel: [    7.630321] brdsl: port 1(lan4) entered forwarding state
> > 
> > I then notice that my Internet connection hasn't come back, so I start
> > poking about with it, first adding it to the bridge:
> > 
> > Aug 30 11:31:13 sw-dsl kernel: [   84.990122] brdsl: port 3(lan2) entered blocking state
> > Aug 30 11:31:13 sw-dsl kernel: [   84.990134] brdsl: port 3(lan2) entered disabled state
> > Aug 30 11:31:14 sw-dsl kernel: [   85.063971] device lan2 entered promiscuous mode
> > 
> > And then setting it to up state and configuring its vlan settings:
> > 
> > Aug 30 11:32:45 sw-dsl kernel: [  176.476090] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
> > Aug 30 11:32:45 sw-dsl kernel: [  176.476103] mv88e6085 f1072004.mdio-mii:04 lan2: major config gmii
> > Aug 30 11:32:45 sw-dsl kernel: [  176.476109] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=0000000,00000000,00000000 pause=00 link=0 an=0
> > Aug 30 11:32:45 sw-dsl kernel: [  176.476120] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> > Aug 30 11:32:45 sw-dsl kernel: [  176.479495] 8021q: adding VLAN 0 to HW filter
> > on device lan2
> > Aug 30 11:32:45 sw-dsl kernel: [  176.537796] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
> > Aug 30 11:32:48 sw-dsl kernel: [  179.280863] mv88e6085 f1072004.mdio-mii:04 lan2: phy link up gmii/1Gbps/Full/rx/tx
> > Aug 30 11:32:48 sw-dsl kernel: [  179.280877] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/1Gbps/Full adv=0000000,00000000,00000000 pause=03 link=1 an=0
> > Aug 30 11:32:48 sw-dsl kernel: [  179.280888] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> > Aug 30 11:32:48 sw-dsl kernel: [  179.280894] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_link_up()
> > Aug 30 11:32:48 sw-dsl kernel: [  179.282958] mv88e6085 f1072004.mdio-mii:04 lan2: Link is Up - 1Gbps/Full - flow control rx/tx
> > 
> > I had:
> > 
> > iface brdsl inet manual
> >         bridge-ports lan2 lan4
> >         bridge-maxwait 0
> > 	up brctl addif $IFACE eno2
> > 
> > I now have:
> > iface brdsl inet manual
> >         bridge-ports lan2 lan4
> >         bridge-waitport 10
> >         bridge-maxwait 0
> >         pre-up sleep 1
> > 	up brctl addif $IFACE eno2
> 
> I searched google for the "bridge-ports" keyword relative to ifupdown
> and could not find the source code of a program which parses this. Could
> you let me know what is the source code of the program you are using?

It's a script, see the debian bridge-utils package:
/lib/bridge-utils/ifupdown.sh

Also see the ifup man page - ifup converts much of the interfaces
file into environment variables for called hook scripts in
/etc/network/*.d to make use of. So e.g. bridge-ports becomes
$IF_BRIDGE_PORTS etc.

Debian has been using this method since probably shortly after
bridge support was introduced - it's been around for a very long
time.

> > to ensure that all ports get properly configured.
> > 
> > What can be seen from the above is that there is most definitely a race.
> > It is possible to start configuring a DSA switch before the DSA switch
> > driver has finished being probed by the kernel.
> > 
> > Here is the kernel log from v5.7 which has never showed these problems,
> > because DSA seemed to always setup everything in kernel space prior to
> > userspace beginning configuration:
> > 
> > Aug 25 23:03:54 sw-dsl kernel: [    5.793137] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> > Aug 25 23:03:54 sw-dsl kernel: [    5.793148] mvneta f1030000.ethernet eno1: config interface 1000base-x
> > Aug 25 23:03:54 sw-dsl kernel: [    5.793157] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
> > Aug 25 23:03:54 sw-dsl kernel: [    5.793168] mvneta f1030000.ethernet eno1: mac link down
> > Aug 25 23:03:54 sw-dsl kernel: [    5.793170] mvneta f1030000.ethernet eno1: mac link down
> > Aug 25 23:03:54 sw-dsl kernel: [    5.819769] mvneta f1030000.ethernet eno1: mac link up
> > Aug 25 23:03:54 sw-dsl kernel: [    5.819792] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> > Aug 25 23:03:54 sw-dsl kernel: [    5.948900] 8021q: 802.1Q VLAN Support v1.8
> >   6.459779] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 0
> > Aug 25 23:03:54 sw-dsl kernel: [    6.462890] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): PHY [mv88e6xxx-0:00] driver [Marvell 88E1540] (irq=67)
> > Aug 25 23:03:54 sw-dsl kernel: [    6.462905] mv88e6085 f1072004.mdio-mii:04 lan5 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> > Aug 25 23:03:54 sw-dsl kernel: [    6.465904] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 1
> > Aug 25 23:03:54 sw-dsl kernel: [    6.468101] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): PHY [mv88e6xxx-0:01] driver [Marvell 88E1540] (irq=68)
> > Aug 25 23:03:54 sw-dsl kernel: [    6.468109] mv88e6085 f1072004.mdio-mii:04 lan4 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> > Aug 25 23:03:54 sw-dsl kernel: [    6.472162] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 2
> > Aug 25 23:03:54 sw-dsl kernel: [    6.474247] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): PHY [mv88e6xxx-0:02] driver [Marvell 88E1540] (irq=69)
> > Aug 25 23:03:54 sw-dsl kernel: [    6.474261] mv88e6085 f1072004.mdio-mii:04 lan3 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> > Aug 25 23:03:54 sw-dsl kernel: [    6.481824] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 3
> > Aug 25 23:03:54 sw-dsl kernel: [    6.486354] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): PHY [mv88e6xxx-0:03] driver [Marvell 88E1540] (irq=70)
> > Aug 25 23:03:54 sw-dsl kernel: [    6.486363] mv88e6085 f1072004.mdio-mii:04 lan2 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> > Aug 25 23:03:54 sw-dsl kernel: [    6.498494] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 4
> > Aug 25 23:03:54 sw-dsl kernel: [    6.502272] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): PHY [mv88e6xxx-0:04] driver [Marvell 88E1540] (irq=71)
> > Aug 25 23:03:54 sw-dsl kernel: [    6.502279] mv88e6085 f1072004.mdio-mii:04 lan1 (uninitialized): phy: setting supported 000,00000000,000022ef advertising 000,00000000,000022ef
> > Aug 25 23:03:54 sw-dsl kernel: [    6.532258] mv88e6085 f1072004.mdio-mii:04: nonfatal error -95 setting MTU on port 6
> > Aug 25 23:03:54 sw-dsl kernel: [    6.535877] mvneta f1030000.ethernet eno1: Link is Down
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541733] mvneta f1030000.ethernet eno1: configuring for inband/1000base-x link mode
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541741] mvneta f1030000.ethernet eno1: config interface 1000base-x
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541754] mvneta f1030000.ethernet eno1: phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown adv=000,00000200,00002240 pause=04 link=0 an=1
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541771] mvneta f1030000.ethernet eno1: mac link down
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541779] mvneta f1030000.ethernet eno1: mac link down
> > Aug 25 23:03:54 sw-dsl kernel: [    6.541907] DSA: tree 0 setup
> > 
> > Here, the kernel DSA switch driver has finished doing its setup
> > before we even get to configuring the bridge device below.
> > 
> > Aug 25 23:03:54 sw-dsl kernel: [    6.569105] mvneta f1030000.ethernet eno1: mac link up
> > Aug 25 23:03:54 sw-dsl kernel: [    6.569113] mvneta f1030000.ethernet eno1: mac link up
> > Aug 25 23:03:54 sw-dsl kernel: [    6.569139] mvneta f1030000.ethernet eno1: Link is Up - 1Gbps/Full - flow control rx/tx
> > Aug 25 23:03:55 sw-dsl kernel: [    6.931763] brdsl: port 1(lan2) entered blocking state
> > Aug 25 23:03:55 sw-dsl kernel: [    6.931769] brdsl: port 1(lan2) entered disabled state
> > Aug 25 23:03:55 sw-dsl kernel: [    6.932863] device lan2 entered promiscuous mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.032838] device eno1 entered promiscuous mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.032902] mv88e6085 f1072004.mdio-mii:04 lan2: configuring for phy/gmii link mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.032907] mv88e6085 f1072004.mdio-mii:04 lan2: config interface gmii
> > Aug 25 23:03:55 sw-dsl kernel: [    7.032916] mv88e6085 f1072004.mdio-mii:04 lan2: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> > Aug 25 23:03:55 sw-dsl kernel: [    7.032920] mv88e6085 f1072004.mdio-mii:04: p3: dsa_port_phylink_mac_config()
> > Aug 25 23:03:55 sw-dsl kernel: [    7.037225] 8021q: adding VLAN 0 to HW filter
> > on device lan2
> > Aug 25 23:03:55 sw-dsl kernel: [    7.044979] brdsl: port 2(lan4) entered blocking state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.044985] brdsl: port 2(lan4) entered disabled state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.056189] device lan4 entered promiscuous mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.107067] mv88e6085 f1072004.mdio-mii:04 lan4: configuring for phy/gmii link mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.107073] mv88e6085 f1072004.mdio-mii:04 lan4: config interface gmii
> > Aug 25 23:03:55 sw-dsl kernel: [    7.107080] mv88e6085 f1072004.mdio-mii:04 lan4: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> > Aug 25 23:03:55 sw-dsl kernel: [    7.107084] mv88e6085 f1072004.mdio-mii:04: p1: dsa_port_phylink_mac_config()
> > Aug 25 23:03:55 sw-dsl kernel: [    7.118831] 8021q: adding VLAN 0 to HW filter
> > on device lan4
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153604] brdsl: port 3(eno2) entered blocking state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153610] brdsl: port 3(eno2) entered disabled state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153720] mv88e6085 f1072004.mdio-mii:04 lan2: phy link down gmii/Unknown/Unknown/off
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153790] device eno2 entered promiscuous mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153890] brdsl: port 3(eno2) entered blocking state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153895] brdsl: port 3(eno2) entered forwarding state
> > Aug 25 23:03:55 sw-dsl kernel: [    7.153930] IPv6: ADDRCONF(NETDEV_CHANGE): brdsl: link becomes ready
> > Aug 25 23:03:55 sw-dsl kernel: [    7.295739] mv88e6085 f1072004.mdio-mii:04 lan4: phy link down gmii/Unknown/Unknown/off
> > Aug 25 23:03:55 sw-dsl kernel: [    7.575615] mv88e6085 f1072004.mdio-mii:04 lan1: configuring for phy/gmii link mode
> > Aug 25 23:03:55 sw-dsl kernel: [    7.575622] mv88e6085 f1072004.mdio-mii:04 lan1: config interface gmii
> > Aug 25 23:03:55 sw-dsl kernel: [    7.575630] mv88e6085 f1072004.mdio-mii:04 lan1: phylink_mac_config: mode=phy/gmii/Unknown/Unknown adv=000,00000000,00000000 pause=00 link=0 an=0
> > Aug 25 23:03:55 sw-dsl kernel: [    7.575634] mv88e6085 f1072004.mdio-mii:04: p4: dsa_port_phylink_mac_config()
> > Aug 25 23:03:55 sw-dsl kernel: [    7.579334] 8021q: adding VLAN 0 to HW filter
> > on device lan1
> > Aug 25 23:03:55 sw-dsl kernel: [    7.635966] mv88e6085 f1072004.mdio-mii:04 lan1: phy link down gmii/Unknown/Unknown/off
> > 
> > -- 
> > RMK's Patch system: https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.armlinux.org.uk%2Fdeveloper%2Fpatches%2F&amp;data=04%7C01%7Cvladimir.oltean%40nxp.com%7C4226a7652ae7497284df08d96e2f29e4%7C686ea1d3bc2b4c6fa92cd99c5c301635%7C0%7C0%7C637661971114812881%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6hDf%2FS%2FMnpRhzEYuW14zuaEAcaTgdMsQJPpmR9WA5cI%3D&amp;reserved=0
> > FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Vladimir Oltean Sept. 2, 2021, 7:05 p.m. UTC | #9
On Thu, Sep 02, 2021 at 06:50:43PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 05:10:34PM +0000, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 05:31:44PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> > > > On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > > > > Debian has had support for configuring bridges at boot time via
> > > > > the interfaces file for years. Breaking that is going to upset a
> > > > > lot of people (me included) resulting in busted networks. It
> > > > > would be a sure way to make oneself unpopular.
> > > > >
> > > > > > I expect there to be 2 call paths of phy_attach_direct:
> > > > > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > > > > >   This is what has this patch addresses. There is no issue to return
> > > > > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > > > > >   they register their netdev. So if connecting defers, there is no
> > > > > >   netdev to unregister, and user space knows nothing of this.
> > > > > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > > > > >   user space. If you open a netdev and it connects to the PHY then, I
> > > > > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > > > > >   that should have been settled by then, should it not? Where it might
> > > > > >   get interesting is with NFS root, and I admit I haven't tested that.
> > > > >
> > > > > I don't think you can make that assumption. Consider the case where
> > > > > systemd is being used, DSA stuff is modular, and we're trying to
> > > > > setup a bridge device on DSA. DSA could be probing while the bridge
> > > > > is being setup.
> > > > >
> > > > > Sadly, this isn't theoretical. I've ended up needing:
> > > > >
> > > > > 	pre-up sleep 1
> > > > >
> > > > > in my bridge configuration to allow time for DSA to finish probing.
> > > > > It's not a pleasant solution, nor a particularly reliable one at
> > > > > that, but it currently works around the problem.
> > > >
> > > > What problem? This is the first time I've heard of this report, and you
> > > > should definitely not need that.
> > >
> > > I found it when upgrading the Clearfog by the DSL modems to v5.13.
> > > When I rebooted it with a previously working kernel (v5.7) it has
> > > never had a problem. With v5.13, it failed to add all the lan ports
> > > into the bridge, because the bridge was still being setup by the
> > > kernel while userspace was trying to configure it. Note that I have
> > > extra debug in my kernels, hence the extra messages:
> >
> > Ok, first you talked about the interfaces file, then systemd. If it's
> > not about systemd's network manager then I don't see how it is relevant.
>
> You're reading in stuff to what I write that I did not write... I said:
>
> "Consider the case where systemd is being used, DSA stuff is modular,
> and we're trying to setup a bridge device on DSA."
>
> That does not mean I'm using systemd's network manager - which is
> something I know little about and have never used.

You should definitely try it out, it gets a lot of new features added
all the time, it uses the netlink interface, it reacts on udev events.

> The reason I mentioned systemd is precisely because with systemd, you
> get a hell of a lot happening parallel - and that's significiant in
> this case, because it's very clear that modules are being loaded in
> parallel with networking being brought up - and that is where the
> problems begin. In fact, modules themselves get loaded in paralllel
> with systemd.

So that's what I don't understand. You're saying that the ifupdown
service runs in parallel with systemd-modules-load.service, and
networking is a kernel module? Doesn't that mean it behaves as expected,
then? /shrug/
Have you tried adding an 'After=systemd-modules-load.service' dependency
to the ifupdown service? I don't think that DSA is that bad that it
registers its net devices outside of the process context in which the
insmod mv88e6xxx.ko is called. Quite the opposite, I think (but I
haven't actually taken a close look yet) that the component stuff
Saravana is proposing would do exactly that. So you "fix" one issue, you
introduce another.
Russell King (Oracle) Sept. 2, 2021, 8:03 p.m. UTC | #10
On Thu, Sep 02, 2021 at 10:05:07PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 06:50:43PM +0100, Russell King (Oracle) wrote:
> > On Thu, Sep 02, 2021 at 05:10:34PM +0000, Vladimir Oltean wrote:
> > > On Thu, Sep 02, 2021 at 05:31:44PM +0100, Russell King (Oracle) wrote:
> > > > On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> > > > > On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > > > > > Debian has had support for configuring bridges at boot time via
> > > > > > the interfaces file for years. Breaking that is going to upset a
> > > > > > lot of people (me included) resulting in busted networks. It
> > > > > > would be a sure way to make oneself unpopular.
> > > > > >
> > > > > > > I expect there to be 2 call paths of phy_attach_direct:
> > > > > > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > > > > > >   This is what has this patch addresses. There is no issue to return
> > > > > > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > > > > > >   they register their netdev. So if connecting defers, there is no
> > > > > > >   netdev to unregister, and user space knows nothing of this.
> > > > > > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > > > > > >   user space. If you open a netdev and it connects to the PHY then, I
> > > > > > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > > > > > >   that should have been settled by then, should it not? Where it might
> > > > > > >   get interesting is with NFS root, and I admit I haven't tested that.
> > > > > >
> > > > > > I don't think you can make that assumption. Consider the case where
> > > > > > systemd is being used, DSA stuff is modular, and we're trying to
> > > > > > setup a bridge device on DSA. DSA could be probing while the bridge
> > > > > > is being setup.
> > > > > >
> > > > > > Sadly, this isn't theoretical. I've ended up needing:
> > > > > >
> > > > > > 	pre-up sleep 1
> > > > > >
> > > > > > in my bridge configuration to allow time for DSA to finish probing.
> > > > > > It's not a pleasant solution, nor a particularly reliable one at
> > > > > > that, but it currently works around the problem.
> > > > >
> > > > > What problem? This is the first time I've heard of this report, and you
> > > > > should definitely not need that.
> > > >
> > > > I found it when upgrading the Clearfog by the DSL modems to v5.13.
> > > > When I rebooted it with a previously working kernel (v5.7) it has
> > > > never had a problem. With v5.13, it failed to add all the lan ports
> > > > into the bridge, because the bridge was still being setup by the
> > > > kernel while userspace was trying to configure it. Note that I have
> > > > extra debug in my kernels, hence the extra messages:
> > >
> > > Ok, first you talked about the interfaces file, then systemd. If it's
> > > not about systemd's network manager then I don't see how it is relevant.
> >
> > You're reading in stuff to what I write that I did not write... I said:
> >
> > "Consider the case where systemd is being used, DSA stuff is modular,
> > and we're trying to setup a bridge device on DSA."
> >
> > That does not mean I'm using systemd's network manager - which is
> > something I know little about and have never used.
> 
> You should definitely try it out, it gets a lot of new features added
> all the time, it uses the netlink interface, it reacts on udev events.
> 
> > The reason I mentioned systemd is precisely because with systemd, you
> > get a hell of a lot happening parallel - and that's significiant in
> > this case, because it's very clear that modules are being loaded in
> > parallel with networking being brought up - and that is where the
> > problems begin. In fact, modules themselves get loaded in paralllel
> > with systemd.
> 
> So that's what I don't understand. You're saying that the ifupdown
> service runs in parallel with systemd-modules-load.service, and
> networking is a kernel module? Doesn't that mean it behaves as expected,
> then? /shrug/
> Have you tried adding an 'After=systemd-modules-load.service' dependency
> to the ifupdown service? I don't think that DSA is that bad that it
> registers its net devices outside of the process context in which the
> insmod mv88e6xxx.ko is called. Quite the opposite, I think (but I
> haven't actually taken a close look yet) that the component stuff
> Saravana is proposing would do exactly that. So you "fix" one issue, you
> introduce another.

# systemctl list-dependencies networking.service
networking.service
  ├─ifupdown-pre.service
  ├─system.slice
  └─network.target
# systemctl list-dependencies ifupdown-pre.service
ifupdown-pre.service
  ├─system.slice
  └─systemd-udevd.service

Looking in the service files for a better idea:

networking.service:
Requires=ifupdown-pre.service
Wants=network.target
After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service ifupdown-pre.service
Before=network.target shutdown.target network-online.target

ifupdown-pre.service:
Wants=systemd-udevd.service
After=systemd-udev-trigger.service
Before=network.target

So, the dependency you mention is already present. As is a dependency
on udev. The problem is udev does all the automatic module loading
asynchronously and in a multithreaded way.

I don't think there's a way to make systemd wait for all module loads
to complete.
Andrew Lunn Sept. 2, 2021, 8:07 p.m. UTC | #11
> The interrupt controller _has_ been set up. The trouble is that the
> interrupt controller has the same OF node as the switch itself, and the
> same OF node. Therefore, fw_devlink waits for the _entire_ switch to
> finish probing, it doesn't have insight into the fact that the
> dependency is just on the interrupt controller.

That seems to be the problem. fw_devlink appears to think probe is an
atomic operation. A device is not probed, or full probed. Where as the
drivers are making use of it being non atomic.

Maybe fw_devlink needs the third state, probing. And when deciding if
a device can be probed and depends on a device which is currently
probing, it looks deeper, follows the phandle and see if the resource
is actually available?

	 Andrew
Vladimir Oltean Sept. 2, 2021, 8:21 p.m. UTC | #12
On Thu, Sep 02, 2021 at 09:03:01PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 10:05:07PM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 06:50:43PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Sep 02, 2021 at 05:10:34PM +0000, Vladimir Oltean wrote:
> > > > On Thu, Sep 02, 2021 at 05:31:44PM +0100, Russell King (Oracle) wrote:
> > > > > On Thu, Sep 02, 2021 at 06:23:42PM +0300, Vladimir Oltean wrote:
> > > > > > On Thu, Sep 02, 2021 at 02:26:35PM +0100, Russell King (Oracle) wrote:
> > > > > > > Debian has had support for configuring bridges at boot time via
> > > > > > > the interfaces file for years. Breaking that is going to upset a
> > > > > > > lot of people (me included) resulting in busted networks. It
> > > > > > > would be a sure way to make oneself unpopular.
> > > > > > >
> > > > > > > > I expect there to be 2 call paths of phy_attach_direct:
> > > > > > > > - At probe time. Both the MAC driver and the PHY driver are probing.
> > > > > > > >   This is what has this patch addresses. There is no issue to return
> > > > > > > >   -EPROBE_DEFER at that time, since drivers connect to the PHY before
> > > > > > > >   they register their netdev. So if connecting defers, there is no
> > > > > > > >   netdev to unregister, and user space knows nothing of this.
> > > > > > > > - At .ndo_open time. This is where it maybe gets interesting, but not to
> > > > > > > >   user space. If you open a netdev and it connects to the PHY then, I
> > > > > > > >   wouldn't expect the PHY to be undergoing a probing process, all of
> > > > > > > >   that should have been settled by then, should it not? Where it might
> > > > > > > >   get interesting is with NFS root, and I admit I haven't tested that.
> > > > > > >
> > > > > > > I don't think you can make that assumption. Consider the case where
> > > > > > > systemd is being used, DSA stuff is modular, and we're trying to
> > > > > > > setup a bridge device on DSA. DSA could be probing while the bridge
> > > > > > > is being setup.
> > > > > > >
> > > > > > > Sadly, this isn't theoretical. I've ended up needing:
> > > > > > >
> > > > > > > 	pre-up sleep 1
> > > > > > >
> > > > > > > in my bridge configuration to allow time for DSA to finish probing.
> > > > > > > It's not a pleasant solution, nor a particularly reliable one at
> > > > > > > that, but it currently works around the problem.
> > > > > >
> > > > > > What problem? This is the first time I've heard of this report, and you
> > > > > > should definitely not need that.
> > > > >
> > > > > I found it when upgrading the Clearfog by the DSL modems to v5.13.
> > > > > When I rebooted it with a previously working kernel (v5.7) it has
> > > > > never had a problem. With v5.13, it failed to add all the lan ports
> > > > > into the bridge, because the bridge was still being setup by the
> > > > > kernel while userspace was trying to configure it. Note that I have
> > > > > extra debug in my kernels, hence the extra messages:
> > > >
> > > > Ok, first you talked about the interfaces file, then systemd. If it's
> > > > not about systemd's network manager then I don't see how it is relevant.
> > >
> > > You're reading in stuff to what I write that I did not write... I said:
> > >
> > > "Consider the case where systemd is being used, DSA stuff is modular,
> > > and we're trying to setup a bridge device on DSA."
> > >
> > > That does not mean I'm using systemd's network manager - which is
> > > something I know little about and have never used.
> > 
> > You should definitely try it out, it gets a lot of new features added
> > all the time, it uses the netlink interface, it reacts on udev events.
> > 
> > > The reason I mentioned systemd is precisely because with systemd, you
> > > get a hell of a lot happening parallel - and that's significiant in
> > > this case, because it's very clear that modules are being loaded in
> > > parallel with networking being brought up - and that is where the
> > > problems begin. In fact, modules themselves get loaded in paralllel
> > > with systemd.
> > 
> > So that's what I don't understand. You're saying that the ifupdown
> > service runs in parallel with systemd-modules-load.service, and
> > networking is a kernel module? Doesn't that mean it behaves as expected,
> > then? /shrug/
> > Have you tried adding an 'After=systemd-modules-load.service' dependency
> > to the ifupdown service? I don't think that DSA is that bad that it
> > registers its net devices outside of the process context in which the
> > insmod mv88e6xxx.ko is called. Quite the opposite, I think (but I
> > haven't actually taken a close look yet) that the component stuff
> > Saravana is proposing would do exactly that. So you "fix" one issue, you
> > introduce another.
> 
> # systemctl list-dependencies networking.service
> networking.service
>   ├─ifupdown-pre.service
>   ├─system.slice
>   └─network.target
> # systemctl list-dependencies ifupdown-pre.service
> ifupdown-pre.service
>   ├─system.slice
>   └─systemd-udevd.service
> 
> Looking in the service files for a better idea:
> 
> networking.service:
> Requires=ifupdown-pre.service
> Wants=network.target
> After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service ifupdown-pre.service
> Before=network.target shutdown.target network-online.target
> 
> ifupdown-pre.service:
> Wants=systemd-udevd.service
> After=systemd-udev-trigger.service
> Before=network.target
> 
> So, the dependency you mention is already present. As is a dependency
> on udev. The problem is udev does all the automatic module loading
> asynchronously and in a multithreaded way.
> 
> I don't think there's a way to make systemd wait for all module loads
> to complete.

So ifupdown-pre.service has a call to "udevadm settle". This "watches
the udev event queue, and exits if all current events are handled",
according to the man page. But which current events? ifupdown-pre.service
does not have the dependency on systemd-modules-load.service, just
networking.service does. So maybe ifupdown-pre.service does not wait for
DSA to finish initializing, then it tells networking.service that all is ok.
Russell King (Oracle) Sept. 2, 2021, 8:29 p.m. UTC | #13
On Thu, Sep 02, 2021 at 11:21:24PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 09:03:01PM +0100, Russell King (Oracle) wrote:
> > # systemctl list-dependencies networking.service
> > networking.service
> >   ├─ifupdown-pre.service
> >   ├─system.slice
> >   └─network.target
> > # systemctl list-dependencies ifupdown-pre.service
> > ifupdown-pre.service
> >   ├─system.slice
> >   └─systemd-udevd.service
> > 
> > Looking in the service files for a better idea:
> > 
> > networking.service:
> > Requires=ifupdown-pre.service
> > Wants=network.target
> > After=local-fs.target network-pre.target apparmor.service systemd-sysctl.service systemd-modules-load.service ifupdown-pre.service
> > Before=network.target shutdown.target network-online.target
> > 
> > ifupdown-pre.service:
> > Wants=systemd-udevd.service
> > After=systemd-udev-trigger.service
> > Before=network.target
> > 
> > So, the dependency you mention is already present. As is a dependency
> > on udev. The problem is udev does all the automatic module loading
> > asynchronously and in a multithreaded way.
> > 
> > I don't think there's a way to make systemd wait for all module loads
> > to complete.
> 
> So ifupdown-pre.service has a call to "udevadm settle". This "watches
> the udev event queue, and exits if all current events are handled",
> according to the man page. But which current events? ifupdown-pre.service
> does not have the dependency on systemd-modules-load.service, just
> networking.service does. So maybe ifupdown-pre.service does not wait for
> DSA to finish initializing, then it tells networking.service that all is ok.

ifupdown-pre.service does have a call to udevadm settle, and that
does get called from what I can tell.

systemd-modules-load.service is an entire red herring. The only
module listed in the various modules-load.d directories is "tun"
for openvpn (which isn't currently being used.)

As I've already told you (and you seem to have ignored), DSA gets
loaded by udev, not by systemd-modules-load.service.
systemd-modules-load.service is irrelevant to my situation.

I think there's a problem with "and exits if all current events are
handled" - does that mean it's fired off a modprobe process which
is in progress, or does that mean that the modprobe process has
completed.

Given that we can see that ifup is being run while the DSA module is
still in the middle of probing, the latter interpretation can not be
true - unless systemd is ignoring the dependencies. Or just in
general, systemd being systemd (I have very little faith in systemd
behaving as it should.)
Vladimir Oltean Sept. 2, 2021, 8:32 p.m. UTC | #14
On Thu, Sep 02, 2021 at 10:07:49PM +0200, Andrew Lunn wrote:
> > The interrupt controller _has_ been set up. The trouble is that the
> > interrupt controller has the same OF node as the switch itself, and the
> > same OF node. Therefore, fw_devlink waits for the _entire_ switch to
> > finish probing, it doesn't have insight into the fact that the
> > dependency is just on the interrupt controller.
>
> That seems to be the problem. fw_devlink appears to think probe is an
> atomic operation. A device is not probed, or full probed. Where as the
> drivers are making use of it being non atomic.
>
> Maybe fw_devlink needs the third state, probing. And when deciding if
> a device can be probed and depends on a device which is currently
> probing, it looks deeper, follows the phandle and see if the resource
> is actually available?

This is interesting because there already exists a device link state for
when the consumer is "probing", but for the supplier, it's binary:

/**
 * enum device_link_state - Device link states.
 * @DL_STATE_NONE: The presence of the drivers is not being tracked.
 * @DL_STATE_DORMANT: None of the supplier/consumer drivers is present.
 * @DL_STATE_AVAILABLE: The supplier driver is present, but the consumer is not.
 * @DL_STATE_CONSUMER_PROBE: The consumer is probing (supplier driver present).
 * @DL_STATE_ACTIVE: Both the supplier and consumer drivers are present.
 * @DL_STATE_SUPPLIER_UNBIND: The supplier driver is unbinding.
 */

The check that's killing us is in device_links_check_suppliers, and is
for DL_STATE_AVAILABLE:

	list_for_each_entry(link, &dev->links.suppliers, c_node) {
		if (!(link->flags & DL_FLAG_MANAGED))
			continue;

		if (link->status != DL_STATE_AVAILABLE &&
		    !(link->flags & DL_FLAG_SYNC_STATE_ONLY)) {
			device_links_missing_supplier(dev);
			dev_err(dev, "probe deferral - supplier %s not ready\n",
				dev_name(link->supplier));
			ret = -EPROBE_DEFER;
			break;
		}
		WRITE_ONCE(link->status, DL_STATE_CONSUMER_PROBE);
	}

Anyway, I was expecting quite a different reaction from this patch
series, and especially one from Saravana. We are essentially battling to
handle an -EPROBE_DEFER we don't need (the battle might be worth it
though, in the general case, which is one of the reasons I posted them).

But these patches also solve DSA's issue with the circular dependency
between the switch and its internal PHYs, and nobody seems to have asked
the most important question: why?

The PHY should return -EPROBE_DEFER ad infinitum, since its supplier has
never finished probing by the time it calls phy_attach_direct.
Russell King (Oracle) Sept. 2, 2021, 9:39 p.m. UTC | #15
On Thu, Sep 02, 2021 at 11:32:48PM +0300, Vladimir Oltean wrote:
> But these patches also solve DSA's issue with the circular dependency
> between the switch and its internal PHYs, and nobody seems to have asked
> the most important question: why?

Surely you specified that in your cover message and in the patch
that actually fixes the problem, as one always should do.
Vladimir Oltean Sept. 2, 2021, 10:05 p.m. UTC | #16
On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> This is a continuation of the discussion on patch "[v1,1/2] driver core:
> fw_devlink: Add support for FWNODE_FLAG_BROKEN_PARENT" from here:
> https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/
> 
> Summary: in a complex combination of device dependencies which is not
> really relevant to what is being proposed here, DSA ends up calling
> phylink_of_phy_connect during a period in which the PHY driver goes
> through a series of probe deferral events.
> 
> The central point of that discussion is that DSA seems "broken" for
> expecting the PHY driver to probe immediately on PHYs belonging to the
> internal MDIO buses of switches. A few suggestions were made about what
> to do, but some were not satisfactory and some did not solve the problem.
> 
> In fact, fw_devlink, the mechanism that causes the PHY driver to defer
> probing in this particular case, has some significant "issues" too, but
> its "issues" are only in quotes "because at worst it'd allow a few
> unnecessary deferred probes":
> https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/#24418895
> 
> So if that's the criterion by which an issue is an issue, maybe we
> should take a step back and look at the bigger picture.
> 
> There is nothing about the idea that a PHY might defer probing, or about
> the changes proposed here, that has anything with DSA. Furthermore, the
> changes done by this series solve the problem in the same way: "they
> allow a few unnecessary deferred probes" <- in this case they provoke
> this to the caller of phy_attach_direct.
> 
> If we look at commit 16983507742c ("net: phy: probe PHY drivers
> synchronously"), we see that the PHY library expectation is for the PHY
> device to have a PHY driver bound to it as soon as device_add() finishes.
> 
> Well, as it turns out, in case the PHY device has any supplier which is
> not ready, this is not possible, but that patch still seems to ensure
> that the process of binding a driver to the device has at least started.
> That process will continue for a while, and will race with
> phy_attach_direct calls, so we need to make the latter observe the fact
> that a driver is struggling to probe, and wait for it a bit more.
> 
> What I've not tested is loading the PHY module at runtime, and seeing
> how phy_attach_direct behaves then. I expect that this change set will
> not alter the behavior in that case: the genphy will still bind to a
> device with no driver, and phy_attach_direct will not return -EPROBE_DEFER
> in that case.
> 
> I might not be very versed in the device core/internals, but the patches
> make sense to me, and worked as intended from the first try on my system
> (Turris MOX with mv88e6xxx), which was modified to make the same "sins"
> as those called out in the thread above:
> 
> - use PHY interrupts provided by the switch itself as an interrupt-controller
> - call of_mdiobus_register from setup() and not from probe(), so as to
>   not circumvent fw_devlink's limitations, and still get to hit the PHY
>   probe deferral conditions.
> 
> So feedback and testing on other platforms is very appreciated.
> 
> Vladimir Oltean (3):
>   net: phy: don't bind genphy in phy_attach_direct if the specific
>     driver defers probe
>   net: dsa: destroy the phylink instance on any error in
>     dsa_slave_phy_setup
>   net: dsa: allow the phy_connect() call to return -EPROBE_DEFER
> 
>  drivers/base/dd.c            | 21 +++++++++++++++++++--
>  drivers/net/phy/phy_device.c |  8 ++++++++
>  include/linux/device.h       |  1 +
>  net/dsa/dsa2.c               |  2 ++
>  net/dsa/slave.c              | 12 +++++-------
>  5 files changed, 35 insertions(+), 9 deletions(-)
> 
> -- 
> 2.25.1
> 

Ouch, I just realized that Saravana, the person whose reaction I've been
waiting for the most, is not copied....

Saravana, you can find the thread here to sync up with what has been
discussed:
https://patchwork.kernel.org/project/netdevbpf/cover/20210901225053.1205571-1-vladimir.oltean@nxp.com/

Sorry.
Saravana Kannan Sept. 2, 2021, 11:29 p.m. UTC | #17
On Thu, Sep 2, 2021 at 3:05 PM Vladimir Oltean <olteanv@gmail.com> wrote:
>
> On Thu, Sep 02, 2021 at 01:50:50AM +0300, Vladimir Oltean wrote:
> > This is a continuation of the discussion on patch "[v1,1/2] driver core:
> > fw_devlink: Add support for FWNODE_FLAG_BROKEN_PARENT" from here:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/
> >
> > Summary: in a complex combination of device dependencies which is not
> > really relevant to what is being proposed here, DSA ends up calling
> > phylink_of_phy_connect during a period in which the PHY driver goes
> > through a series of probe deferral events.
> >
> > The central point of that discussion is that DSA seems "broken" for
> > expecting the PHY driver to probe immediately on PHYs belonging to the
> > internal MDIO buses of switches. A few suggestions were made about what
> > to do, but some were not satisfactory and some did not solve the problem.
> >
> > In fact, fw_devlink, the mechanism that causes the PHY driver to defer
> > probing in this particular case, has some significant "issues" too, but
> > its "issues" are only in quotes "because at worst it'd allow a few
> > unnecessary deferred probes":
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210826074526.825517-2-saravanak@google.com/#24418895
> >
> > So if that's the criterion by which an issue is an issue, maybe we
> > should take a step back and look at the bigger picture.
> >
> > There is nothing about the idea that a PHY might defer probing, or about
> > the changes proposed here, that has anything with DSA. Furthermore, the
> > changes done by this series solve the problem in the same way: "they
> > allow a few unnecessary deferred probes" <- in this case they provoke
> > this to the caller of phy_attach_direct.
> >
> > If we look at commit 16983507742c ("net: phy: probe PHY drivers
> > synchronously"), we see that the PHY library expectation is for the PHY
> > device to have a PHY driver bound to it as soon as device_add() finishes.
> >
> > Well, as it turns out, in case the PHY device has any supplier which is
> > not ready, this is not possible, but that patch still seems to ensure
> > that the process of binding a driver to the device has at least started.
> > That process will continue for a while, and will race with
> > phy_attach_direct calls, so we need to make the latter observe the fact
> > that a driver is struggling to probe, and wait for it a bit more.
> >
> > What I've not tested is loading the PHY module at runtime, and seeing
> > how phy_attach_direct behaves then. I expect that this change set will
> > not alter the behavior in that case: the genphy will still bind to a
> > device with no driver, and phy_attach_direct will not return -EPROBE_DEFER
> > in that case.
> >
> > I might not be very versed in the device core/internals, but the patches
> > make sense to me, and worked as intended from the first try on my system
> > (Turris MOX with mv88e6xxx), which was modified to make the same "sins"
> > as those called out in the thread above:
> >
> > - use PHY interrupts provided by the switch itself as an interrupt-controller
> > - call of_mdiobus_register from setup() and not from probe(), so as to
> >   not circumvent fw_devlink's limitations, and still get to hit the PHY
> >   probe deferral conditions.
> >
> > So feedback and testing on other platforms is very appreciated.
> >
> > Vladimir Oltean (3):
> >   net: phy: don't bind genphy in phy_attach_direct if the specific
> >     driver defers probe
> >   net: dsa: destroy the phylink instance on any error in
> >     dsa_slave_phy_setup
> >   net: dsa: allow the phy_connect() call to return -EPROBE_DEFER
> >
> >  drivers/base/dd.c            | 21 +++++++++++++++++++--
> >  drivers/net/phy/phy_device.c |  8 ++++++++
> >  include/linux/device.h       |  1 +
> >  net/dsa/dsa2.c               |  2 ++
> >  net/dsa/slave.c              | 12 +++++-------
> >  5 files changed, 35 insertions(+), 9 deletions(-)
> >
> > --
> > 2.25.1
> >
>
> Ouch, I just realized that Saravana, the person whose reaction I've been
> waiting for the most, is not copied....
>
> Saravana, you can find the thread here to sync up with what has been
> discussed:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210901225053.1205571-1-vladimir.oltean@nxp.com/

Woah! The thread blew up.

>
> Sorry.

No worries.

I'll read through the thread later and maybe provide more responses,
but one thing I wanted to say right away:

Don't depend on dev->p->deferred_probe. It can be "empty" for a device
that has returned -EPROBE_DEFER for a bunch of reasons:

1. When the device is in the middle of being reattempted, it would be
empty. You can't hold any lock that'll ensure correctness either
because deferred probe locking is a mess (I'm working on cleaning that
up).

2. I'm working on actually not adding devices to that list if there's
a known supplier that hasn't been probed yet. No point retrying it
again and again for every deferred probe trigger when we know it's
going to fail. And we'll basically get topological probe ordering.

Your closest bet right now is d->can_match. Only caveat is that it's
not cleared if the actual driver gets unregistered.


-Saravana