diff mbox series

Race in PHY subsystem? Attaching to PHY devices before they get probed

Message ID bdffa33c-e3eb-4c3b-adf3-99a02bc7d205@gmail.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Race in PHY subsystem? Attaching to PHY devices before they get probed | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Rafał Miłecki Jan. 22, 2024, 7:09 a.m. UTC
Hi!

I have MT7988 SoC board with following problem:
[   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110

This issue is known to occur when PHY's firmware is not running. After
some debugging I discovered that .config_init() CB gets called while
.probe() CB is still being executed.

It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
gets fully probed and it seems there is nothing in PHY subsystem
verifying that. Please note this PHY takes quite some time to probe as
it involves sending firmware to hardware.

Is that a possible race in PHY subsystem?
Should we have phy_attach_direct() wait for PHY to be fully probed?



I verified this issue with following patch although -EPROBE_DEFER didn't
work automagically and I had to re-do "ifconfig eth2 up" manually.

Comments

Rafał Miłecki Jan. 22, 2024, 9:48 a.m. UTC | #1
On 22.01.2024 08:09, Rafał Miłecki wrote:
> I have MT7988 SoC board with following problem:
> [   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> 
> This issue is known to occur when PHY's firmware is not running. After
> some debugging I discovered that .config_init() CB gets called while
> .probe() CB is still being executed.
> 
> It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> gets fully probed and it seems there is nothing in PHY subsystem
> verifying that. Please note this PHY takes quite some time to probe as
> it involves sending firmware to hardware.
> 
> Is that a possible race in PHY subsystem?
> Should we have phy_attach_direct() wait for PHY to be fully probed?

I don't expect this to be an acceptable solution but it works as a quick
workaround & proof of issue.

[   24.763875] mtk_soc_eth 15100000.ethernet eth2: Waiting for PHY mdio-bus:08 to become ready
[   38.645874] mtk_soc_eth 15100000.ethernet eth2: PHY mdio-bus:08 is ready now

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3611ea64875e..cdb766b0ea22 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1435,8 +1435,21 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  	struct device *d = &phydev->mdio.dev;
  	struct module *ndev_owner = NULL;
  	bool using_genphy = false;
+	unsigned long time_left;
  	int err;

+	if (!try_wait_for_completion(&phydev->probed)) {
+		netdev_info(dev, "Waiting for PHY %s to become ready\n", phydev_name(phydev));
+
+		time_left = wait_for_completion_timeout(&phydev->probed, msecs_to_jiffies(20000));
+		if (!time_left) {
+			netdev_warn(dev, "PHY %s is still not ready!\n", phydev_name(phydev));
+			return -EPROBE_DEFER;
+		}
+
+		netdev_info(dev, "PHY %s is ready now\n", phydev_name(phydev));
+	}
+
  	/* For Ethernet device drivers that register their own MDIO bus, we
  	 * will have bus->owner match ndev_mod, so we do not want to increment
  	 * our own module->refcnt here, otherwise we would not be able to
@@ -1562,6 +1575,8 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  		phydev->devlink = device_link_add(dev->dev.parent, &phydev->mdio.dev,
  						  DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);

+	complete(&phydev->probed);
+
  	return err;

  error:
@@ -3382,6 +3397,9 @@ static int phy_probe(struct device *dev)
  	if (err)
  		phy_device_reset(phydev, 1);

+	if (!err)
+		complete(&phydev->probed);
+
  	return err;
  }

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 684efaeca07c..d95b68dfad59 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -541,6 +541,7 @@ struct macsec_ops;
   * struct phy_device - An instance of a PHY
   *
   * @mdio: MDIO bus this PHY is on
+ * @probed: Completion of PHY probing
   * @drv: Pointer to the driver for this PHY instance
   * @devlink: Create a link between phy dev and mac dev, if the external phy
   *           used by current mac interface is managed by another mac interface.
@@ -636,6 +637,8 @@ struct macsec_ops;
  struct phy_device {
  	struct mdio_device mdio;

+	struct completion probed;
+
  	/* Information about the PHY type */
  	/* And management functions */
  	struct phy_driver *drv;
Andrew Lunn Jan. 22, 2024, 2:12 p.m. UTC | #2
On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> Hi!
> 
> I have MT7988 SoC board with following problem:
> [   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> 
> This issue is known to occur when PHY's firmware is not running. After
> some debugging I discovered that .config_init() CB gets called while
> .probe() CB is still being executed.
> 
> It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> gets fully probed and it seems there is nothing in PHY subsystem
> verifying that. Please note this PHY takes quite some time to probe as
> it involves sending firmware to hardware.
> 
> Is that a possible race in PHY subsystem?

Seems like it.

There is a patch "net: phylib: get rid of unnecessary locking" which
removed locks from probe, which might of helped, but the patch also
says:

    The locking in phy_probe() and phy_remove() does very little to prevent
    any races with e.g. phy_attach_direct(),

suggesting it probably did not help.

I think the traditional way problems like this are avoided is that the
device should not be visible to the rest of the system until probe has
completed.

Maybe look at phy_device_register(). I think it is the call to
mdiobus_register_device() which makes it visible. What i don't know is
the call path which calls probe. Is it part of device_add()? Can
mdiobus_register_device() be moved to the end of this function?

Could you add a WARN_ON(1) in the probe function to get the call
stack. With that, we might be able to figure out where the
mdiobus_register_device() needs to move.

Thanks
	Andrew
Russell King (Oracle) Jan. 22, 2024, 4:56 p.m. UTC | #3
On Mon, Jan 22, 2024 at 03:12:42PM +0100, Andrew Lunn wrote:
> On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> > Hi!
> > 
> > I have MT7988 SoC board with following problem:
> > [   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> > 
> > This issue is known to occur when PHY's firmware is not running. After
> > some debugging I discovered that .config_init() CB gets called while
> > .probe() CB is still being executed.
> > 
> > It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> > gets fully probed and it seems there is nothing in PHY subsystem
> > verifying that. Please note this PHY takes quite some time to probe as
> > it involves sending firmware to hardware.
> > 
> > Is that a possible race in PHY subsystem?
> 
> Seems like it.
> 
> There is a patch "net: phylib: get rid of unnecessary locking" which
> removed locks from probe, which might of helped, but the patch also
> says:
> 
>     The locking in phy_probe() and phy_remove() does very little to prevent
>     any races with e.g. phy_attach_direct(),
> 
> suggesting it probably did not help.

The reason for that statement is because phy_attach_direct() doesn't
take phydev->lock _at all_, so taking the lock in phy_probe() has no
effect on any race with phy_attach_direct().

> I think the traditional way problems like this are avoided is that the
> device should not be visible to the rest of the system until probe has
> completed.

However, we have the problem of the generic driver fallback - which
phy_attach_direct() does.

The probe vs phy_attach_direct() has been racy for quite a long time,
and the poking about that's done in that function such as assigning
struct device's driver member, calling device_bind_driver() etc is
all hellishly racy if the phy_device _could_ be bound simultaneously.

Also note this... we call device_bind_driver() from phy_attach_direct(),
and the documentation for this function states:

 * This function must be called with the device lock held.

which we don't do. So we're already violating the locking requirements
for the driver model.

So, I would suggest that the solution is to make use of device_lock()
which will also only return once a probe has succeeded.

However, that's still not ideal - because the fact we have a race here
means that what could happen is that phy_attach_direct() is called
a little earlier than the probe begins, and the phy device ends up
being bound to the generic PHY driver rather than its specific driver.

I think what this comes down to are the following points:

1) not using the required device model locking
2) the mere existence of the default driver makes for a race between
   the PHY being attached and its driver being probed.

No amount of locking saves us from (2) - the only solutions that I can
see to this are:
1) to put up with there being such a race
2) get rid of the default drivers altogether and insist that we have
   specific PHY drivers for _all_ PHYs
3) have some kind of retry mechanism

A further problem is... we can't simply return -EPROBE_DEFER from
phy_attach_direct() because this function may not be called from
probe functions - it may be called from the .ndo_open method which
has no idea how to handle a probe deferal. Moreover, returning an
error to userspace will just cause it to fail (because all errors
from trying to bring a netdev up are considered to be fatal.)

So, it's a really yucky problem, and I don't see any nice and simple
solution.
Christian Marangi Jan. 24, 2024, 2:58 p.m. UTC | #4
On Mon, Jan 22, 2024 at 04:56:18PM +0000, Russell King (Oracle) wrote:
> On Mon, Jan 22, 2024 at 03:12:42PM +0100, Andrew Lunn wrote:
> > On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> > > Hi!
> > > 
> > > I have MT7988 SoC board with following problem:
> > > [   26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> > > 
> > > This issue is known to occur when PHY's firmware is not running. After
> > > some debugging I discovered that .config_init() CB gets called while
> > > .probe() CB is still being executed.
> > > 
> > > It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> > > gets fully probed and it seems there is nothing in PHY subsystem
> > > verifying that. Please note this PHY takes quite some time to probe as
> > > it involves sending firmware to hardware.
> > > 
> > > Is that a possible race in PHY subsystem?
> > 
> > Seems like it.
> > 
> > There is a patch "net: phylib: get rid of unnecessary locking" which
> > removed locks from probe, which might of helped, but the patch also
> > says:
> > 
> >     The locking in phy_probe() and phy_remove() does very little to prevent
> >     any races with e.g. phy_attach_direct(),
> > 
> > suggesting it probably did not help.
> 
> The reason for that statement is because phy_attach_direct() doesn't
> take phydev->lock _at all_, so taking the lock in phy_probe() has no
> effect on any race with phy_attach_direct().
> 
> > I think the traditional way problems like this are avoided is that the
> > device should not be visible to the rest of the system until probe has
> > completed.
> 
> However, we have the problem of the generic driver fallback - which
> phy_attach_direct() does.
> 
> The probe vs phy_attach_direct() has been racy for quite a long time,
> and the poking about that's done in that function such as assigning
> struct device's driver member, calling device_bind_driver() etc is
> all hellishly racy if the phy_device _could_ be bound simultaneously.
> 
> Also note this... we call device_bind_driver() from phy_attach_direct(),
> and the documentation for this function states:
> 
>  * This function must be called with the device lock held.
> 
> which we don't do. So we're already violating the locking requirements
> for the driver model.
> 
> So, I would suggest that the solution is to make use of device_lock()
> which will also only return once a probe has succeeded.
> 
> However, that's still not ideal - because the fact we have a race here
> means that what could happen is that phy_attach_direct() is called
> a little earlier than the probe begins, and the phy device ends up
> being bound to the generic PHY driver rather than its specific driver.
> 
> I think what this comes down to are the following points:
> 
> 1) not using the required device model locking
> 2) the mere existence of the default driver makes for a race between
>    the PHY being attached and its driver being probed.
> 
> No amount of locking saves us from (2) - the only solutions that I can
> see to this are:
> 1) to put up with there being such a race
> 2) get rid of the default drivers altogether and insist that we have
>    specific PHY drivers for _all_ PHYs
> 3) have some kind of retry mechanism
> 
> A further problem is... we can't simply return -EPROBE_DEFER from
> phy_attach_direct() because this function may not be called from
> probe functions - it may be called from the .ndo_open method which
> has no idea how to handle a probe deferal. Moreover, returning an
> error to userspace will just cause it to fail (because all errors
> from trying to bring a netdev up are considered to be fatal.)
> 
> So, it's a really yucky problem, and I don't see any nice and simple
> solution.
>

Well if we start having more and more PHY that require loading a FW then
this will become a big problem...

I wasted some good time on this and if the MDIO is slow enough loading
the FW can take even 100s resulting in probe still having to finish and
config_init called later.

Since the FW has not been loaded config_init returns bad data and fails
to configure. (and after a while probe is complete)

I don't know if it would be ok as a solution but I think moving the
fw_load call in the config_init seems to "handle" this problem but IMHO
it's still and hack for a fragile implementation.
Andrew Lunn Jan. 24, 2024, 5:52 p.m. UTC | #5
> Well if we start having more and more PHY that require loading a FW then
> this will become a big problem...
> 
> I wasted some good time on this and if the MDIO is slow enough loading
> the FW can take even 100s resulting in probe still having to finish and
> config_init called later.

If its going to take 100s of seconds, i don't think we can have
'ip set link up' stall for that long. It needs to return an error code,
and hopefully a useful error message asking the user to throw the machine
away and get a better one!

> Since the FW has not been loaded config_init returns bad data and fails
> to configure. (and after a while probe is complete)
> 
> I don't know if it would be ok as a solution but I think moving the
> fw_load call in the config_init seems to "handle" this problem but IMHO
> it's still and hack for a fragile implementation.

Just throwing out ideas, but i think we need to split this into two
different use cases:

1) Firmware loading is fast, only 1-2 seconds. We can block operations
on the PHY until it is ready

2) Firmware loading is slow, we return -EBUSY or -EAGAIN, or similar.

Maybe we could add a struct completion to struct phy_device, which has
compete() called on it when probe finishes. phy_attach_direct() does a
wait_for_completion_timeout()?

This is assuming we cannot actually fix phylib to correctly use the
driver model, PHYs are not visible until probe is complete, and the
MAC drivers can handle that.

	Andrew
Russell King (Oracle) Jan. 24, 2024, 7 p.m. UTC | #6
On Wed, Jan 24, 2024 at 06:52:35PM +0100, Andrew Lunn wrote:
> This is assuming we cannot actually fix phylib to correctly use the
> driver model, PHYs are not visible until probe is complete, and the
> MAC drivers can handle that.

The only thing I can think is some kind of kbuild extension that looks
through all the PHY drivers that are enabled in some way, generates a
table of PHY driver match IDs, and use that as an "exclude" list for
the generic PHY drivers.

This would preclude the use of out of tree PHY drivers. Whether we
think that's a good or bad thing depends on ones own point of view.

However, the default fall-back to the generic PHY driver can't work
through the driver model because of the reasons we already know.
(If someone wants them expanded, then please ask, but Andrew and
myself are aware, so as I'm replying to Andrew here...)
diff mbox series

Patch

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 3611ea64875e..3be499d2376b 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1437,6 +1437,11 @@  int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  	bool using_genphy = false;
  	int err;

+	if (!phydev->probed) {
+		phydev_warn(phydev, "PHY is not ready yet!\n");
+		return -EPROBE_DEFER;
+	}
+
  	/* For Ethernet device drivers that register their own MDIO bus, we
  	 * will have bus->owner match ndev_mod, so we do not want to increment
  	 * our own module->refcnt here, otherwise we would not be able to
@@ -1562,6 +1567,8 @@  int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
  		phydev->devlink = device_link_add(dev->dev.parent, &phydev->mdio.dev,
  						  DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);

+	phydev->probed = true;
+
  	return err;

  error:
@@ -3382,6 +3389,9 @@  static int phy_probe(struct device *dev)
  	if (err)
  		phy_device_reset(phydev, 1);

+	if (!err)
+		phydev->probed = true;
+
  	return err;
  }

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 684efaeca07c..29875a22ac89 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -634,6 +634,8 @@  struct macsec_ops;
   * handling, as well as handling shifts in PHY hardware state
   */
  struct phy_device {
+	bool probed;
+
  	struct mdio_device mdio;

  	/* Information about the PHY type */