[RFC,net-next,1/3] net: phy: don't bind genphy in phy_attach_direct if the specific driver defers probe

Message ID 20210901225053.1205571-2-vladimir.oltean@nxp.com (mailing list archive)
State RFC, archived
Series Make the PHY library stop being so greedy when binding the generic PHY driver

Commit Message

Vladimir Oltean Sept. 1, 2021, 10:50 p.m. UTC
There are systems where the PHY driver might get its probe deferred due
to a missing supplier, like an interrupt-parent, gpio, clock or whatever.

If the phy_attach_direct call happens right in between probe attempts,
the PHY library is greedy and assumes that a specific driver will never
appear, so it just binds the generic PHY driver.

In certain cases this is the wrong choice, because some PHYs simply need
the specific driver. The specific PHY driver was going to probe, given
enough time, but this doesn't seem to matter to phy_attach_direct.

To solve this, make phy_attach_direct check whether a specific PHY
driver is pending or not, and if it is, just defer the probing of the
MAC that's connecting to us a bit more too.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/base/dd.c            | 21 +++++++++++++++++++--
 drivers/net/phy/phy_device.c |  8 ++++++++
 include/linux/device.h       |  1 +
 3 files changed, 28 insertions(+), 2 deletions(-)

Comments

Greg KH Sept. 2, 2021, 5:43 a.m. UTC | #1
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> There are systems where the PHY driver might get its probe deferred due
> to a missing supplier, like an interrupt-parent, gpio, clock or whatever.
> 
> If the phy_attach_direct call happens right in between probe attempts,
> the PHY library is greedy and assumes that a specific driver will never
> appear, so it just binds the generic PHY driver.
> 
> In certain cases this is the wrong choice, because some PHYs simply need
> the specific driver. The specific PHY driver was going to probe, given
> enough time, but this doesn't seem to matter to phy_attach_direct.
> 
> To solve this, make phy_attach_direct check whether a specific PHY
> driver is pending or not, and if it is, just defer the probing of the
> MAC that's connecting to us a bit more too.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  drivers/base/dd.c            | 21 +++++++++++++++++++--
>  drivers/net/phy/phy_device.c |  8 ++++++++
>  include/linux/device.h       |  1 +
>  3 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 1c379d20812a..b22073b0acd2 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work)
>  }
>  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
>  
> +static bool __device_pending_probe(struct device *dev)
> +{
> +	return !list_empty(&dev->p->deferred_probe);
> +}
> +
> +bool device_pending_probe(struct device *dev)
> +{
> +	bool pending;
> +
> +	mutex_lock(&deferred_probe_mutex);
> +	pending = __device_pending_probe(dev);
> +	mutex_unlock(&deferred_probe_mutex);
> +
> +	return pending;
> +}
> +EXPORT_SYMBOL_GPL(device_pending_probe);
> +
>  void driver_deferred_probe_add(struct device *dev)
>  {
>  	if (!dev->can_match)
>  		return;
>  
>  	mutex_lock(&deferred_probe_mutex);
> -	if (list_empty(&dev->p->deferred_probe)) {
> +	if (!__device_pending_probe(dev)) {
>  		dev_dbg(dev, "Added to deferred list\n");
>  		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
>  	}
> @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev)
>  void driver_deferred_probe_del(struct device *dev)
>  {
>  	mutex_lock(&deferred_probe_mutex);
> -	if (!list_empty(&dev->p->deferred_probe)) {
> +	if (__device_pending_probe(dev)) {
>  		dev_dbg(dev, "Removed from deferred list\n");
>  		list_del_init(&dev->p->deferred_probe);
>  		__device_set_deferred_probe_reason(dev, NULL);
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 52310df121de..2c22a32f0a1c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>  
>  	/* Assume that if there is no driver, that it doesn't
>  	 * exist, and we should use the genphy driver.
> +	 * The exception is during probing, when the PHY driver might have
> +	 * attempted a probe but has requested deferral. Since there might be
> +	 * MAC drivers which also attach to the PHY during probe time, try
> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> +	 * probing until then.

Wait, no, this should not be a "special" thing, and why would the list
of deferred probe show this?

If a bus wants to have this type of "generic vs. specific" logic, then
it needs to handle it in the bus logic itself as that does NOT fit into
the normal driver model at all.  Don't try to get a "hint" of this by
messing with the probe function list.

thanks,

greg k-h
Vladimir Oltean Sept. 2, 2021, 10:11 a.m. UTC | #2
On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> Wait, no, this should not be a "special" thing, and why would the list
> of deferred probe show this?

Why as in why would it work/do what I want, or as in why would you want to do that?

> If a bus wants to have this type of "generic vs. specific" logic, then
> it needs to handle it in the bus logic itself as that does NOT fit into
> the normal driver model at all.  Don't try to get a "hint" of this by
> messing with the probe function list.

Where and how? Do you have an example?
Greg KH Sept. 2, 2021, 10:37 a.m. UTC | #3
On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> > Wait, no, this should not be a "special" thing, and why would the list
> > of deferred probe show this?
> 
> Why as in why would it work/do what I want, or as in why would you want to do that?

Both!  :)

> > If a bus wants to have this type of "generic vs. specific" logic, then
> > it needs to handle it in the bus logic itself as that does NOT fit into
> > the normal driver model at all.  Don't try to get a "hint" of this by
> > messing with the probe function list.
> 
> Where and how? Do you have an example?

No I do not, sorry, most busses do not do this for obvious ordering /
loading / we are not that crazy reasons.

What is causing this all to suddenly break?  The devlink stuff?

thanks,

greg k-h
Vladimir Oltean Sept. 2, 2021, 11:17 a.m. UTC | #4
On Thu, Sep 02, 2021 at 12:37:34PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Sep 02, 2021 at 01:11:50PM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 07:43:10AM +0200, Greg Kroah-Hartman wrote:
> > > Wait, no, this should not be a "special" thing, and why would the list
> > > of deferred probe show this?
> >
> > Why as in why would it work/do what I want, or as in why would you want to do that?
>
> Both!  :)

So first: why would it work.
You seem to have a misconception that I am "messing with the probe
function list".
I am not, I am just exporting the information whether the device had a
driver which returned -EPROBE_DEFER during probe, or not. For that I am
looking at the presence of this device on the deferred_probe_pending_list.

driver_probe_device
-> if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) driver_deferred_probe_add(dev);
   -> list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);

driver_bound
-> driver_deferred_probe_del
   -> list_del_init(&dev->p->deferred_probe);

So the presence of "dev" inside deferred_probe_pending_list means
precisely that a driver is pending to be bound.

Second: why would I want to do that.
In the case of PHY devices, the driver binding process starts here:

phy_device_register
-> device_add

It begins synchronously, but may not finish due to probe deferral.
So after device_add finishes, phydev->drv might be NULL due to 2 reasons:

1. -EPROBE_DEFER triggered by "somebody", either by the PHY driver probe
   function itself, or by third parties (like device_links_check_suppliers
   happening to notice that before even calling the driver's probe fn).
   Anyway, the distinction between these 2 is pretty much irrelevant.

2. There genuinely was no driver loaded in the system for this PHY. Note
   that the way things are written, the Generic PHY driver will not
   match on any device in phy_bus_match(). It is bound manually, separately.

The PHY library is absolutely happy to work with a headless chicken, a
phydev with a NULL phydev->drv. Just search for "if (!phydev->drv)"
inside drivers/net/phy/phy.c and drivers/net/phy/phy_device.c.

However, the phydev walking with a NULL drv can only last for so long.
An Ethernet port will soon need that PHY device, and will attach to it.
There are many code paths, all ending in phy_attach_direct.
However, the moment when an Ethernet port decides to attach to a PHY
device is completely asynchronous to the lifetime of the PHY device itself.
That moment is when a driver is really needed, and if none is present,
the generic one is force-bound.

My patch only distinguishes between case 1 and 2 for which phydev->drv
might be NULL. It avoids force-binding the generic PHY when a specific
PHY driver was found, but did not finish binding due to probe deferral.
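
To make the distinction between the two cases concrete, here is a rough
userspace model of the check, not kernel code. All model_* names, the enum
and the tiny list helpers are stand-ins invented for illustration; only the
idea (list membership on the deferred list means "a driver is pending") and
the numeric value of EPROBE_DEFER come from the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric value the kernel uses internally for probe deferral. */
#define EPROBE_DEFER 517

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical model types, not struct device / struct phy_device. */
enum model_drv { MODEL_DRV_NONE, MODEL_DRV_GENERIC, MODEL_DRV_SPECIFIC };

struct model_dev {
	enum model_drv drv;        /* MODEL_DRV_NONE while nothing is bound */
	struct list_head deferred; /* linked into pending_list while deferred */
};

static struct list_head pending_list = { &pending_list, &pending_list };

static void model_dev_init(struct model_dev *d)
{
	d->drv = MODEL_DRV_NONE;
	list_init(&d->deferred);
}

/* Models driver_deferred_probe_add(): queue the device once. */
static void model_defer(struct model_dev *d)
{
	if (list_empty(&d->deferred))
		list_add_tail(&d->deferred, &pending_list);
}

/* Models driver_bound(): the specific driver finally bound. */
static void model_bound(struct model_dev *d)
{
	d->drv = MODEL_DRV_SPECIFIC;
	list_del_init(&d->deferred);
}

/* Models the proposed device_pending_probe() (locking omitted). */
static bool model_pending_probe(const struct model_dev *d)
{
	return !list_empty(&d->deferred);
}

/* Models the patched branch in phy_attach_direct(). */
static int model_attach(struct model_dev *d)
{
	assert(d != NULL);

	if (d->drv == MODEL_DRV_NONE) {
		if (model_pending_probe(d))
			return -EPROBE_DEFER;   /* case 1: driver is pending */
		d->drv = MODEL_DRV_GENERIC;     /* case 2: no driver at all */
	}
	return 0;
}
```

A device that was deferred makes model_attach() return -EPROBE_DEFER until
model_bound() runs; a device that never matched anything falls back to the
generic driver right away.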

> > > If a bus wants to have this type of "generic vs. specific" logic, then
> > > it needs to handle it in the bus logic itself as that does NOT fit into
> > > the normal driver model at all.  Don't try to get a "hint" of this by
> > > messing with the probe function list.
> >
> > Where and how? Do you have an example?
>
> No I do not, sorry, most busses do not do this for obvious ordering /
> loading / we are not that crazy reasons.
>
> What is causing this all to suddenly break?  The devlink stuff?

There was a report related to fw_devlink indeed, however strictly
speaking, I wouldn't say it is the cause of all this. It is pretty
uncommon for a PHY device to defer probing I think, hence the bad
assumptions made around it.
Rafael J. Wysocki Sept. 2, 2021, 2:37 p.m. UTC | #5
On Thu, Sep 2, 2021 at 7:43 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > There are systems where the PHY driver might get its probe deferred due
> > to a missing supplier, like an interrupt-parent, gpio, clock or whatever.
> >
> > If the phy_attach_direct call happens right in between probe attempts,
> > the PHY library is greedy and assumes that a specific driver will never
> > appear, so it just binds the generic PHY driver.
> >
> > In certain cases this is the wrong choice, because some PHYs simply need
> > the specific driver. The specific PHY driver was going to probe, given
> > enough time, but this doesn't seem to matter to phy_attach_direct.
> >
> > To solve this, make phy_attach_direct check whether a specific PHY
> > driver is pending or not, and if it is, just defer the probing of the
> > MAC that's connecting to us a bit more too.
> >
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > ---
> >  drivers/base/dd.c            | 21 +++++++++++++++++++--
> >  drivers/net/phy/phy_device.c |  8 ++++++++
> >  include/linux/device.h       |  1 +
> >  3 files changed, 28 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> > index 1c379d20812a..b22073b0acd2 100644
> > --- a/drivers/base/dd.c
> > +++ b/drivers/base/dd.c
> > @@ -128,13 +128,30 @@ static void deferred_probe_work_func(struct work_struct *work)
> >  }
> >  static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
> >
> > +static bool __device_pending_probe(struct device *dev)
> > +{
> > +     return !list_empty(&dev->p->deferred_probe);
> > +}
> > +
> > +bool device_pending_probe(struct device *dev)
> > +{
> > +     bool pending;
> > +
> > +     mutex_lock(&deferred_probe_mutex);
> > +     pending = __device_pending_probe(dev);
> > +     mutex_unlock(&deferred_probe_mutex);
> > +
> > +     return pending;
> > +}
> > +EXPORT_SYMBOL_GPL(device_pending_probe);
> > +
> >  void driver_deferred_probe_add(struct device *dev)
> >  {
> >       if (!dev->can_match)
> >               return;
> >
> >       mutex_lock(&deferred_probe_mutex);
> > -     if (list_empty(&dev->p->deferred_probe)) {
> > +     if (!__device_pending_probe(dev)) {
> >               dev_dbg(dev, "Added to deferred list\n");
> >               list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
> >       }
> > @@ -144,7 +161,7 @@ void driver_deferred_probe_add(struct device *dev)
> >  void driver_deferred_probe_del(struct device *dev)
> >  {
> >       mutex_lock(&deferred_probe_mutex);
> > -     if (!list_empty(&dev->p->deferred_probe)) {
> > +     if (__device_pending_probe(dev)) {
> >               dev_dbg(dev, "Removed from deferred list\n");
> >               list_del_init(&dev->p->deferred_probe);
> >               __device_set_deferred_probe_reason(dev, NULL);
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >
> >       /* Assume that if there is no driver, that it doesn't
> >        * exist, and we should use the genphy driver.
> > +      * The exception is during probing, when the PHY driver might have
> > +      * attempted a probe but has requested deferral. Since there might be
> > +      * MAC drivers which also attach to the PHY during probe time, try
> > +      * harder to bind the specific PHY driver, and defer the MAC driver's
> > +      * probing until then.
>
> Wait, no, this should not be a "special" thing, and why would the list
> of deferred probe show this?
>
> If a bus wants to have this type of "generic vs. specific" logic, then
> it needs to handle it in the bus logic itself as that does NOT fit into
> the normal driver model at all.

Well, I think that this is a general issue and it appears to me to be
present in the driver core too, at least to some extent.

Namely, if there are two drivers matching the same device and the
first one's ->probe() returns -EPROBE_DEFER, that will be converted to
EPROBE_DEFER by really_probe(), so driver_probe_device() will pass it
to __device_attach_driver() which then will return 0.  This
bus_for_each_drv() will call __device_attach_driver()  for the second
matching driver even though the first one may still probe successfully
later.

To me, this really is a variant of "if a driver has failed to probe,
try another one" which phy_attach_direct() appears to be doing and in
both cases the probing of the "alternative" is premature if the
probing of the original driver has been deferred.
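
A simplified userspace model of that walk, under the assumption that any
probe failure (deferral included) lets the iteration move on to the next
matching driver. The model_* names are hypothetical and the sign conversion
done by really_probe() is left out; this only shows why the "alternative"
can win prematurely.

```c
#include <assert.h>
#include <stddef.h>

/* Same numeric value the kernel uses internally for probe deferral. */
#define EPROBE_DEFER 517

/* Hypothetical stand-in for a driver that matches the device. */
struct model_driver {
	const char *name;
	int (*probe)(void);
};

static int model_probe_deferred(void) { return -EPROBE_DEFER; }
static int model_probe_ok(void) { return 0; }

/*
 * Walk the matching drivers in order, the way the bus iteration does.
 * Returns the index of the driver that bound, or -1 if none did.
 * *deferrals counts how many drivers asked to be retried later.
 */
static int model_attach_first_match(const struct model_driver *drivers,
				    size_t count, unsigned int *deferrals)
{
	size_t i;

	assert(drivers != NULL && deferrals != NULL);
	*deferrals = 0;

	for (i = 0; i < count; i++) {
		int ret = drivers[i].probe();

		if (ret == 0)
			return (int)i;
		if (ret == -EPROBE_DEFER)
			(*deferrals)++;
		/* A deferral does not stop the walk: the next driver is tried. */
	}
	return -1;
}
```

With a deferring first driver followed by a second one that probes fine, the
second driver binds even though one deferral was recorded, which is the
premature fallback described above.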

> Don't try to get a "hint" of this by messing with the probe function list.

I agree that this doesn't look particularly clean, but then I'm
wondering how to address this cleanly.
Russell King (Oracle) Sept. 2, 2021, 6:50 p.m. UTC | #6
On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 52310df121de..2c22a32f0a1c 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>  
>  	/* Assume that if there is no driver, that it doesn't
>  	 * exist, and we should use the genphy driver.
> +	 * The exception is during probing, when the PHY driver might have
> +	 * attempted a probe but has requested deferral. Since there might be
> +	 * MAC drivers which also attach to the PHY during probe time, try
> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> +	 * probing until then.
>  	 */
>  	if (!d->driver) {
> +		if (device_pending_probe(d))
> +			return -EPROBE_DEFER;

Something else that concerns me here.

As noted, many network drivers attempt to attach their PHY when the
device is brought up, and not during their probe function.

Taking a driver at random:

drivers/net/ethernet/renesas/sh_eth.c

sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which
ultimately calls phy_attach_direct() and propagates the error code
via an error pointer.

sh_eth_phy_init() propagates the error code to its caller,
sh_eth_phy_start(). This is called from sh_eth_open(), which
propagates the error code. This is called from .ndo_open... and it's
highly likely -EPROBE_DEFER will end up being returned to userspace
through either netlink or netdev ioctls.

Since EPROBE_DEFER is not an error number that we export to
userspace, this should basically never be exposed to userspace, yet
we have a path that it _could_ be exposed if the above condition
is true.

If device_pending_probe() returns true e.g. during initial boot up
while modules are being loaded - maybe the phy driver doesn't have
all the resources it needs because of some other module that hasn't
finished initialising - then we have a window where this will be
exposed to userspace.

So, do we need to fix all the network drivers to do something if
their .ndo_open method encounters this? If so, what? Sleep a bit
and try again? How many times to retry? Convert the error code into
something else, causing userspace to fail where it worked before? If
so which error code?

I think this needs to be thought through a bit better. In this case,
I feel that throwing -EPROBE_DEFER to solve one problem with one
subsystem can result in new problems elsewhere.

We did have an idea at one point about reserving some flag bits in
phydev->dev_flags for phylib use, but I don't think that happened.
If this is the direction we want to go, I think we need to have a
flag in dev_flags so that callers opt-in to the new behaviour whereas
callers such as from .ndo_open keep the old behaviour - because they
just aren't setup to handle an -EPROBE_DEFER return from these
functions.
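
As a sketch of what such an opt-in could look like (a userspace model only;
MODEL_PHY_F_DEFER_OK and the model_* types are hypothetical names, no such
flag exists in phylib as far as this thread shows):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same numeric value the kernel uses internally for probe deferral. */
#define EPROBE_DEFER 517

/* Hypothetical opt-in bit: "this caller can handle -EPROBE_DEFER". */
#define MODEL_PHY_F_DEFER_OK (1u << 0)

enum model_drv { MODEL_DRV_NONE, MODEL_DRV_GENERIC, MODEL_DRV_SPECIFIC };

/* Hypothetical stand-in for a phy_device. */
struct model_phy {
	enum model_drv drv;
	bool probe_pending; /* specific driver deferred its probe */
};

static int model_phy_attach(struct model_phy *phy, unsigned int flags)
{
	assert(phy != NULL);

	if (phy->drv != MODEL_DRV_NONE)
		return 0;

	/* Only callers that opted in ever see the deferral. */
	if (phy->probe_pending && (flags & MODEL_PHY_F_DEFER_OK))
		return -EPROBE_DEFER;

	/* Everyone else keeps the old behaviour: bind the generic driver. */
	phy->drv = MODEL_DRV_GENERIC;
	return 0;
}
```

A caller attaching from its probe function would pass the flag and retry
later; a caller attaching from .ndo_open would pass 0 and never see
-EPROBE_DEFER.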
Vladimir Oltean Sept. 2, 2021, 7:23 p.m. UTC | #7
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >  
> >  	/* Assume that if there is no driver, that it doesn't
> >  	 * exist, and we should use the genphy driver.
> > +	 * The exception is during probing, when the PHY driver might have
> > +	 * attempted a probe but has requested deferral. Since there might be
> > +	 * MAC drivers which also attach to the PHY during probe time, try
> > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > +	 * probing until then.
> >  	 */
> >  	if (!d->driver) {
> > +		if (device_pending_probe(d))
> > +			return -EPROBE_DEFER;
> 
> Something else that concerns me here.
> 
> As noted, many network drivers attempt to attach their PHY when the
> device is brought up, and not during their probe function.
> 
> Taking a driver at random:
> 
> drivers/net/ethernet/renesas/sh_eth.c
> 
> sh_eth_phy_init() calls of_phy_connect() or phy_connect(), which
> ultimately calls phy_attach_direct() and propagates the error code
> via an error pointer.
> 
> sh_eth_phy_init() propagates the error code to its caller,
> sh_eth_phy_start(). This is called from sh_eth_open(), which
> propagates the error code. This is called from .ndo_open... and it's
> highly likely -EPROBE_DEFER will end up being returned to userspace
> through either netlink or netdev ioctls.
> 
> Since EPROBE_DEFER is not an error number that we export to
> userspace, this should basically never be exposed to userspace, yet
> we have a path that it _could_ be exposed if the above condition
> is true.
> 
> If device_pending_probe() returns true e.g. during initial boot up
> while modules are being loaded - maybe the phy driver doesn't have
> all the resources it needs because of some other module that hasn't
> finished initialising - then we have a window where this will be
> exposed to userspace.
> 
> So, do we need to fix all the network drivers to do something if
> their .ndo_open method encounters this? If so, what? Sleep a bit
> and try again? How many times to retry? Convert the error code into
> something else, causing userspace to fail where it worked before? If
> so which error code?

It depends on what outcome you're going for.
If there's a PHY driver pending, I would do something to wait for that
if I could; it would be silly for the PHY driver to be loading but the
PHY to still be bound to genphy.

I feel that connecting to the PHY from the probe path is the overall
cleaner way to go since it deals with this automatically, but due to the
sheer volume of drivers that connect from .ndo_open, modifying them in
bulk is out of the question. Something sensible needs to happen with
them too, and 'genphy is what you get' might be just that, which is
basically what is happening without these patches. On that note, I don't
know whether there is any objective advantage to connecting to the PHY
at .ndo_open time.

> 
> I think this needs to be thought through a bit better. In this case,
> I feel that throwing -EPROBE_DEFER to solve one problem with one
> subsystem can result in new problems elsewhere.
> 
> We did have an idea at one point about reserving some flag bits in
> phydev->dev_flags for phylib use, but I don't think that happened.
> If this is the direction we want to go, I think we need to have a
> flag in dev_flags so that callers opt-in to the new behaviour whereas
> callers such as from .ndo_open keep the old behaviour - because they
> just aren't setup to handle an -EPROBE_DEFER return from these
> functions.

Or that, yes. I hadn't actually thought about using PHY flags, but I
suppose callers which already can cope with EPROBE_DEFER (they connect
from probe) can opt into that.
Andrew Lunn Sept. 2, 2021, 7:51 p.m. UTC | #8
On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > index 52310df121de..2c22a32f0a1c 100644
> > --- a/drivers/net/phy/phy_device.c
> > +++ b/drivers/net/phy/phy_device.c
> > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> >  
> >  	/* Assume that if there is no driver, that it doesn't
> >  	 * exist, and we should use the genphy driver.
> > +	 * The exception is during probing, when the PHY driver might have
> > +	 * attempted a probe but has requested deferral. Since there might be
> > +	 * MAC drivers which also attach to the PHY during probe time, try
> > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > +	 * probing until then.
> >  	 */
> >  	if (!d->driver) {
> > +		if (device_pending_probe(d))
> > +			return -EPROBE_DEFER;
> 
> Something else that concerns me here.
> 
> As noted, many network drivers attempt to attach their PHY when the
> device is brought up, and not during their probe function.

Yes, this is going to be a problem. I agree it is too late to return
-EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
device is still on the deferred list, otherwise use genphy. And maybe
a timeout and return -ENODEV, which is not 100% correct, we know the
device exists, we just cannot drive it.
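
Roughly, as a userspace model (the model_* names, the tick-based "clock" and
the poll limit are all made up for illustration; a real implementation
would need an actual sleep or wait primitive and proper locking):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum model_drv { MODEL_DRV_NONE, MODEL_DRV_GENERIC, MODEL_DRV_SPECIFIC };

/* Hypothetical stand-in for a phy_device. */
struct model_phy {
	enum model_drv drv;
	bool probe_pending;    /* still on the deferred list */
	int ticks_until_bound; /* <= 0: the specific driver never binds */
};

/* Stand-in for sleeping one poll interval; advances the model's clock. */
static void model_wait_tick(struct model_phy *phy)
{
	if (phy->probe_pending && phy->ticks_until_bound > 0 &&
	    --phy->ticks_until_bound == 0) {
		phy->drv = MODEL_DRV_SPECIFIC;
		phy->probe_pending = false;
	}
}

static int model_attach_wait(struct model_phy *phy, int max_ticks)
{
	int i;

	assert(phy != NULL);

	/* Wait around only while a specific driver is still pending. */
	for (i = 0; i < max_ticks && phy->drv == MODEL_DRV_NONE &&
		    phy->probe_pending; i++)
		model_wait_tick(phy);

	if (phy->drv != MODEL_DRV_NONE)
		return 0;

	/* Timed out with the driver still pending: give up. */
	if (phy->probe_pending)
		return -ENODEV;

	/* Nothing pending and nothing bound: fall back to genphy. */
	phy->drv = MODEL_DRV_GENERIC;
	return 0;
}
```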

Can we tell we are in the context of a driver probe? Or do we need to
add a parameter to the various phy_attach API calls to let the core
know if this is probe or open?

This is more likely to be a problem with NFS root, with the kernel
bringing up an interface as soon as it's registered. userspace bringing
up interfaces is generally much later, and udev tends to wait around
until there are no more driver load requests before the boot
continues.

	Andrew
Florian Fainelli Sept. 2, 2021, 8:33 p.m. UTC | #9
On 9/2/2021 12:51 PM, Andrew Lunn wrote:
> On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
>> On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
>>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>>> index 52310df121de..2c22a32f0a1c 100644
>>> --- a/drivers/net/phy/phy_device.c
>>> +++ b/drivers/net/phy/phy_device.c
>>> @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
>>>   
>>>   	/* Assume that if there is no driver, that it doesn't
>>>   	 * exist, and we should use the genphy driver.
>>> +	 * The exception is during probing, when the PHY driver might have
>>> +	 * attempted a probe but has requested deferral. Since there might be
>>> +	 * MAC drivers which also attach to the PHY during probe time, try
>>> +	 * harder to bind the specific PHY driver, and defer the MAC driver's
>>> +	 * probing until then.
>>>   	 */
>>>   	if (!d->driver) {
>>> +		if (device_pending_probe(d))
>>> +			return -EPROBE_DEFER;
>>
>> Something else that concerns me here.
>>
>> As noted, many network drivers attempt to attach their PHY when the
>> device is brought up, and not during their probe function.
> 
> Yes, this is going to be a problem. I agree it is too late to return
> -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
> device is still on the deferred list, otherwise use genphy. And maybe
> a timeout and return -ENODEV, which is not 100% correct, we know the
> device exists, we just cannot drive it.

Is it really going to be a problem though? The two cases where this will
matter are: if we use IP auto-configuration within the kernel, which this
patchset ought to be helping with; and if we are already in user-space and
the PHY is connected at .ndo_open() time. In the latter case, a whole lot of
things did happen prior to getting there, such as udevd using modaliases in
order to load every possible module we might need, so I am debating
whether we will really see a probe deferral at all.

> 
> Can we tell we are in the context of a driver probe? Or do we need to
> add a parameter to the various phy_attach API calls to let the core
> know if this is probe or open?

Actually we do: the RTNL lock will be held during ndo_open and it won't
be during driver probe.

> 
> This is more likely to be a problem with NFS root, with the kernel
> bringing up an interface as soon as it's registered. userspace bringing
> up interfaces is generally much later, and udev tends to wait around
> until there are no more driver load requests before the boot
> continues.

See my point above: with Vladimir's change, we should have fw_devlink do
its job such that by the time the network interface is needed for IP
auto-configuration, all of the resources it depends on should also be
ready, shouldn't they?
Russell King (Oracle) Sept. 2, 2021, 9:33 p.m. UTC | #10
On Thu, Sep 02, 2021 at 01:33:57PM -0700, Florian Fainelli wrote:
> On 9/2/2021 12:51 PM, Andrew Lunn wrote:
> > On Thu, Sep 02, 2021 at 07:50:16PM +0100, Russell King (Oracle) wrote:
> > > On Thu, Sep 02, 2021 at 01:50:51AM +0300, Vladimir Oltean wrote:
> > > > diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> > > > index 52310df121de..2c22a32f0a1c 100644
> > > > --- a/drivers/net/phy/phy_device.c
> > > > +++ b/drivers/net/phy/phy_device.c
> > > > @@ -1386,8 +1386,16 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
> > > >   	/* Assume that if there is no driver, that it doesn't
> > > >   	 * exist, and we should use the genphy driver.
> > > > +	 * The exception is during probing, when the PHY driver might have
> > > > +	 * attempted a probe but has requested deferral. Since there might be
> > > > +	 * MAC drivers which also attach to the PHY during probe time, try
> > > > +	 * harder to bind the specific PHY driver, and defer the MAC driver's
> > > > +	 * probing until then.
> > > >   	 */
> > > >   	if (!d->driver) {
> > > > +		if (device_pending_probe(d))
> > > > +			return -EPROBE_DEFER;
> > > 
> > > Something else that concerns me here.
> > > 
> > > As noted, many network drivers attempt to attach their PHY when the
> > > device is brought up, and not during their probe function.
> > 
> > Yes, this is going to be a problem. I agree it is too late to return
> > -EPROBE_DEFER. Maybe phy_attach_direct() needs to wait around, if the
> > device is still on the deferred list, otherwise use genphy. And maybe
> > a timeout and return -ENODEV, which is not 100% correct, we know the
> > device exists, we just cannot drive it.
> 
> Is it really going to be a problem though? The two cases where this will
> matter is if we use IP auto-configuration within the kernel, which this
> patchset ought to be helping with

There is no handling of EPROBE_DEFER in the IP auto-configuration
code while trying to bring up interfaces:

        for_each_netdev(&init_net, dev) {
                if (ic_is_init_dev(dev)) {
...
                        oflags = dev->flags;
                        if (dev_change_flags(dev, oflags | IFF_UP, NULL) < 0) {
                                pr_err("IP-Config: Failed to open %s\n",
                                       dev->name);
                                continue;
                        }

So, the only way this could be reliable is if we can guarantee that
all deferred probes will have been retried by the time we get here.
Do we have that guarantee?

> if we are already in user-space and the
> PHY is connected at .ndo_open() time, there is a whole lot of things that
> did happen prior to getting there, such as udevd using modaliases in order
> to load every possible module we might, so I am debating whether we will
> really see a probe deferral at all.

As can be seen from my recent posts which show on Debian Buster that
interfaces are attempted to be brought up while e.g. mv88e6xxx is still
probing, we can't make any guarantees that things have "settled" by the
time userspace attempts to bring up the network interfaces.

I may have more on why that is happening... I won't post it here, I'll
post to the other thread.

> > Can we tell we are in the context of a driver probe? Or do we need to
> > add a parameter to the various phy_attach API calls to let the core
> > know if this is probe or open?
> 
> Actually we do: the RTNL lock will be held during ndo_open and it won't
> be during driver probe.

That's probably an unreliable indicator. DPAA2 has weirdness in the
way it can dynamically create and destroy network interfaces, which
does lead to problems with the rtnl lock. I've been carrying a patch
from NXP for this for almost two years now, which NXP still haven't
submitted:

http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4

... and I've no idea why that patch never made mainline. I need it
to avoid the stated deadlock on SolidRun Honeycomb platforms when
creating additional network interfaces for the SFP cages in userspace.
Vladimir Oltean Sept. 2, 2021, 9:39 p.m. UTC | #11
On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> That's probably an unreliable indicator. DPAA2 has weirdness in the
> way it can dynamically create and destroy network interfaces, which
> does lead to problems with the rtnl lock. I've been carrying a patch
> from NXP for this for almost two years now, which NXP still haven't
> submitted:
> 
> http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> 
> ... and I've no idea why that patch never made mainline. I need it
> to avoid the stated deadlock on SolidRun Honeycomb platforms when
> creating additional network interfaces for the SFP cages in userspace.

Ah, nice, I've copied that broken logic for the dpaa2-switch too:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065

So why don't you send the patch? I can send it too if you want to, one
for the switch and one for the DPNI driver.
Russell King (Oracle) Sept. 2, 2021, 10:24 p.m. UTC | #12
On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > way it can dynamically create and destroy network interfaces, which
> > does lead to problems with the rtnl lock. I've been carrying a patch
> > from NXP for this for almost two years now, which NXP still haven't
> > submitted:
> > 
> > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > 
> > ... and I've no idea why that patch never made mainline. I need it
> > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > creating additional network interfaces for the SFP cages in userspace.
> 
> Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> 
> So why don't you send the patch? I can send it too if you want to, one
> for the switch and one for the DPNI driver.

Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
incorrect for the reason I stated when it was sent:

https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/

I did miss the rtnl_lock() around phylink_disconnect_phy() in the
description of the race, which goes some way towards hiding it, but
there is still a race between phylink_destroy() and another thread
calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:

static int
dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
                             struct ethtool_link_ksettings *link_settings)
{
        struct dpaa2_eth_priv *priv = netdev_priv(net_dev);

        if (dpaa2_eth_is_type_phy(priv))
                return phylink_ethtool_ksettings_get(priv->mac->phylink,
                                                     link_settings);

which dereferences priv->mac and priv->mac->phylink, vs:

static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
{
...
        if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
                dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
                dpaa2_eth_update_tx_fqids(priv);

                if (dpaa2_eth_has_mac(priv))
                        dpaa2_eth_disconnect_mac(priv);
                else
                        dpaa2_eth_connect_mac(priv);
        }

static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
{
        if (dpaa2_eth_is_type_phy(priv))
                dpaa2_mac_disconnect(priv->mac);

        if (!dpaa2_eth_has_mac(priv))
                return;

        dpaa2_mac_close(priv->mac);
        kfree(priv->mac);		<== potential use after free bug by
        priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
}

void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
{
        if (!mac->phylink)
                return;

        phylink_disconnect_phy(mac->phylink);
        phylink_destroy(mac->phylink);	<== another use-after-free bug via
					    dpaa2_eth_get_link_ksettings()
        dpaa2_pcs_destroy(mac);
}

Note that phylink_destroy() is documented as:

 * Note: the rtnl lock must not be held when calling this function.

because it calls sfp_bus_del_upstream(), which will take the rtnl lock
itself. An alternative solution would be to remove the rtnl locking
from sfp_bus_del_upstream(), but then force _everyone_ to take the
rtnl lock before calling phylink_destroy() - meaning a larger block of
code ends up executing under the lock than is really necessary.

However, as I stated in my review of the patch "As I've already stated,
the phylink is not designed to be created and destroyed on a published
network device." That still remains true today, and it seems that the
issue has never been fixed in DPAA2 despite having been pointed out.
Vladimir Oltean Sept. 2, 2021, 10:45 p.m. UTC | #13
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote:
> On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > > way it can dynamically create and destroy network interfaces, which
> > > does lead to problems with the rtnl lock. I've been carrying a patch
> > > from NXP for this for almost two years now, which NXP still haven't
> > > submitted:
> > >
> > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > >
> > > ... and I've no idea why that patch never made mainline. I need it
> > > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > > creating additional network interfaces for the SFP cages in userspace.
> >
> > Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> >
> > So why don't you send the patch? I can send it too if you want to, one
> > for the switch and one for the DPNI driver.
>
> Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
> incorrect for the reason I stated when it was sent:
>
> https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/

So why are you carrying it then?

> I did miss the rtnl_lock() around phylink_disconnect_phy() in the
> description of the race, which goes some way towards hiding it, but
> there is still a race between phylink_destroy() and another thread
> calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:
>
> static int
> dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
>                              struct ethtool_link_ksettings *link_settings)
> {
>         struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
>
>         if (dpaa2_eth_is_type_phy(priv))
>                 return phylink_ethtool_ksettings_get(priv->mac->phylink,
>                                                      link_settings);
>
> which dereferences priv->mac and priv->mac->phylink, vs:
>
> static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
> {
> ...
>         if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
>                 dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
>                 dpaa2_eth_update_tx_fqids(priv);
>
>                 if (dpaa2_eth_has_mac(priv))
>                         dpaa2_eth_disconnect_mac(priv);
>                 else
>                         dpaa2_eth_connect_mac(priv);
>         }
>
> static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
> {
>         if (dpaa2_eth_is_type_phy(priv))
>                 dpaa2_mac_disconnect(priv->mac);
>
>         if (!dpaa2_eth_has_mac(priv))
>                 return;
>
>         dpaa2_mac_close(priv->mac);
>         kfree(priv->mac);		<== potential use after free bug by
>         priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
> }

Okay, so this needs to stay under the rtnetlink mutex, to serialize with
dpaa2_eth_get_link_ksettings which is already under the rtnetlink mutex.
So the way in which rtnl_lock is taken right now is actually fine in a way.

>
> void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
> {
>         if (!mac->phylink)
>                 return;
>
>         phylink_disconnect_phy(mac->phylink);
>         phylink_destroy(mac->phylink);	<== another use-after-free bug via
> 					    dpaa2_eth_get_link_ksettings()
>         dpaa2_pcs_destroy(mac);
> }
>
> Note that phylink_destroy() is documented as:
>
>  * Note: the rtnl lock must not be held when calling this function.
>
> because it calls sfp_bus_del_upstream(), which will take the rtnl lock
> itself. An alternative solution would be to remove the rtnl locking
> from sfp_bus_del_upstream(), but then force _everyone_ to take the
> rtnl lock before calling phylink_destroy() - meaning a larger block of
> code ends up executing under the lock than is really necessary.

So phylink_destroy has exactly 20 call sites, it is not that bad?

And as for "larger block than necessary" - doesn't the dpaa2 prolonged
usage count as necessary?

> However, as I stated in my review of the patch "As I've already stated,
> the phylink is not designed to be created and destroyed on a published
> network device." That still remains true today, and it seems that the
> issue has never been fixed in DPAA2 despite having been pointed out.

So what would you do, exactly, to "fix" the issue that a DPNI can
connect and disconnect at runtime from a DPMAC?

Also, "X is not designed to Y" doesn't really say much, given a bit of
will power. Linux was not designed to run on non-i386 either.

Any other issues besides needing to take rtnl_mutex top-level when
calling phylink_destroy? Since phylink_disconnect_phy needs it anyway,
and phylink_destroy ends up calling sfp_bus_del_upstream which takes the
same mutex again, and drivers that connect/disconnect at probe/remove
time end up calling both in a row, I don't think there is much of an
issue to speak of, or that the rework would be that difficult.
Andrew Lunn Sept. 2, 2021, 11:02 p.m. UTC | #14
> > Note that phylink_destroy() is documented as:
> >
> >  * Note: the rtnl lock must not be held when calling this function.
> >

...

> 
> Any other issues besides needing to take rtnl_mutex top-level when
> calling phylink_destroy?

We should try to keep phylink_create and phylink_destroy symmetrical:

/**
 * phylink_create() - create a phylink instance
 * @config: a pointer to the target &struct phylink_config
 * @fwnode: a pointer to a &struct fwnode_handle describing the network
 *      interface
 * @iface: the desired link mode defined by &typedef phy_interface_t
 * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
 *
 * Create a new phylink instance, and parse the link parameters found in @np.
 * This will parse in-band modes, fixed-link or SFP configuration.
 *
 * Note: the rtnl lock must not be held when calling this function.

Having different locking requirements will catch people out.

Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
a macro.

    Andrew
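A rough userspace sketch of what such a negative assertion needs. A plain "is the mutex locked by anyone" test is not enough, because another thread may legitimately hold the lock; the check has to answer "does the calling thread hold it". Everything below (model_rtnl, model_rtnl_depth, MODEL_ASSERT_NO_RTNL, model_destroy) is hypothetical and models the idea with a pthread mutex and a per-thread counter; it is not kernel code and not an existing macro.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rtnl mutex. */
static pthread_mutex_t model_rtnl = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread count of how many times this thread holds model_rtnl. */
static _Thread_local int model_rtnl_depth;

static void model_rtnl_lock(void)
{
	pthread_mutex_lock(&model_rtnl);
	model_rtnl_depth++;
}

static void model_rtnl_unlock(void)
{
	model_rtnl_depth--;
	pthread_mutex_unlock(&model_rtnl);
}

/* True only when the calling thread is the holder. */
static bool model_rtnl_held_by_me(void)
{
	return model_rtnl_depth > 0;
}

/* Hypothetical counterpart to a "must hold the lock" assertion. */
#define MODEL_ASSERT_NO_RTNL() assert(!model_rtnl_held_by_me())

/*
 * Models a teardown function that takes the lock internally, so it
 * must not be entered with the lock held. Returns -EDEADLK instead
 * of deadlocking so that the misuse is observable.
 */
static int model_destroy(void)
{
	if (model_rtnl_held_by_me())
		return -EDEADLK;

	model_rtnl_lock();
	/* ... teardown that needs the lock would go here ... */
	model_rtnl_unlock();
	return 0;
}
```

Calling model_destroy() with the lock held reports the error instead of hanging, which is the kind of early diagnostic the proposed macro would give.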
Vladimir Oltean Sept. 2, 2021, 11:26 p.m. UTC | #15
On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote:
> We should try to keep phylink_create and phylink_destroy symmetrical:
> 
> /**
>  * phylink_create() - create a phylink instance
>  * @config: a pointer to the target &struct phylink_config
>  * @fwnode: a pointer to a &struct fwnode_handle describing the network
>  *      interface
>  * @iface: the desired link mode defined by &typedef phy_interface_t
>  * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
>  *
>  * Create a new phylink instance, and parse the link parameters found in @np.
>  * This will parse in-band modes, fixed-link or SFP configuration.
>  *
>  * Note: the rtnl lock must not be held when calling this function.
> 
> Having different locking requirements will catch people out.
> 
> Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
> a macro.

In this case, the easiest might be to just take a different mutex in
dpaa2 which serializes all places that access the priv->mac references.
I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed
those locks and will see what fails tomorrow, but I don't think dpaa2
has a good enough justification to take the rtnl_mutex just so that it
can connect and disconnect to the MAC freely at runtime.
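To make the "different mutex" idea concrete, here is a minimal userspace sketch, assuming a driver-private lock that covers every access to the mac pointer. The names (model_priv, mac_lock, model_connect_mac, model_disconnect_mac, model_get_link_speed) are made up for illustration and do not exist in dpaa2; the sketch says nothing about whether the sfp-bus side can do without the rtnl lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

struct model_mac {
	int link_speed;
};

struct model_priv {
	pthread_mutex_t mac_lock;	/* hypothetical: protects ->mac */
	struct model_mac *mac;
};

/* Publish a new MAC object; refuses if one is already connected. */
static int model_connect_mac(struct model_priv *priv, int speed)
{
	struct model_mac *mac;
	int err = 0;

	assert(priv != NULL);
	mac = malloc(sizeof(*mac));
	if (!mac)
		return -ENOMEM;
	mac->link_speed = speed;

	pthread_mutex_lock(&priv->mac_lock);
	if (priv->mac)
		err = -EBUSY;
	else
		priv->mac = mac;
	pthread_mutex_unlock(&priv->mac_lock);

	if (err)
		free(mac);
	return err;
}

/* Unpublish under the lock, free only after no reader can see it. */
static void model_disconnect_mac(struct model_priv *priv)
{
	struct model_mac *mac;

	pthread_mutex_lock(&priv->mac_lock);
	mac = priv->mac;
	priv->mac = NULL;
	pthread_mutex_unlock(&priv->mac_lock);

	free(mac);	/* free(NULL) is a no-op */
}

/* Reader side: dereference ->mac only while holding the lock. */
static int model_get_link_speed(struct model_priv *priv)
{
	int speed = -ENODEV;

	pthread_mutex_lock(&priv->mac_lock);
	if (priv->mac)
		speed = priv->mac->link_speed;
	pthread_mutex_unlock(&priv->mac_lock);

	return speed;
}
```

Because the reader holds mac_lock for the whole dereference and the disconnect path clears the pointer under the same lock, the free happens only after the last reader that could see the object has left its critical section.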
Russell King (Oracle) Sept. 3, 2021, 12:04 a.m. UTC | #16
On Fri, Sep 03, 2021 at 02:26:07AM +0300, Vladimir Oltean wrote:
> On Fri, Sep 03, 2021 at 01:02:06AM +0200, Andrew Lunn wrote:
> > We should try to keep phylink_create and phylink_destroy symmetrical:
> > 
> > /**
> >  * phylink_create() - create a phylink instance
> >  * @config: a pointer to the target &struct phylink_config
> >  * @fwnode: a pointer to a &struct fwnode_handle describing the network
> >  *      interface
> >  * @iface: the desired link mode defined by &typedef phy_interface_t
> >  * @mac_ops: a pointer to a &struct phylink_mac_ops for the MAC.
> >  *
> >  * Create a new phylink instance, and parse the link parameters found in @np.
> >  * This will parse in-band modes, fixed-link or SFP configuration.
> >  *
> >  * Note: the rtnl lock must not be held when calling this function.
> > 
> > Having different locking requirements will catch people out.
> > 
> > Interestingly, there is no ASSERT_NO_RTNL(). Maybe we should add such
> > a macro.
> 
> In this case, the easiest might be to just take a different mutex in
> dpaa2 which serializes all places that access the priv->mac references.
> I don't know exactly why the SFP bus needs the rtnl_mutex, I've removed
> those locks and will see what fails tomorrow, but I don't think dpaa2
> has a good enough justification to take the rtnl_mutex just so that it
> can connect and disconnect to the MAC freely at runtime.

It needs it to ensure that the sfp-bus code is safe. sfp-bus code
sits between phylink and the sfp stuff, and will be called from
either side. It can't have its own lock, because that gives lockdep
splats.

Removing a lock and then running the kernel is a downright stupid
way to test to see if a lock is necessary.

That approach is like having built an iron bridge, covered it in paint,
then you remove most of the bolts, and then test to see whether it's safe
for vehicles to travel over it by riding your bicycle across it and
declaring it safe.

Sorry, but if you think "remove lock, run kernel, if it works fine
the lock is unnecessary" is a valid approach, then you've just
disqualified yourself from discussing this topic any further.
Locking is done by knowing the code and code analysis, not by
playing "does the code fail if I remove it" games. I am utterly
shocked that you think that this is a valid approach.
Ioana Ciornei Sept. 3, 2021, 9:27 a.m. UTC | #17
On Thu, Sep 02, 2021 at 11:24:39PM +0100, Russell King (Oracle) wrote:
> On Fri, Sep 03, 2021 at 12:39:49AM +0300, Vladimir Oltean wrote:
> > On Thu, Sep 02, 2021 at 10:33:03PM +0100, Russell King (Oracle) wrote:
> > > That's probably an unreliable indicator. DPAA2 has weirdness in the
> > > way it can dynamically create and destroy network interfaces, which
> > > does lead to problems with the rtnl lock. I've been carrying a patch
> > > from NXP for this for almost two years now, which NXP still haven't
> > > submitted:
> > > 
> > > http://git.armlinux.org.uk/cgit/linux-arm.git/commit/?h=cex7&id=a600f2ee50223e9bcdcf86b65b4c427c0fd425a4
> > > 
> > > ... and I've no idea why that patch never made mainline. I need it
> > > to avoid the stated deadlock on SolidRun Honeycomb platforms when
> > > creating additional network interfaces for the SFP cages in userspace.
> > 
> > Ah, nice, I've copied that broken logic for the dpaa2-switch too:
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d52ef12f7d6c016f3b249db95af33f725e3dd065
> > 
> > So why don't you send the patch? I can send it too if you want to, one
> > for the switch and one for the DPNI driver.
> 
> Sorry, I mis-stated. NXP did submit that exact patch, but it's actually
> incorrect for the reason I stated when it was sent:
> 
> https://patchwork.ozlabs.org/project/netdev/patch/1574363727-5437-2-git-send-email-ioana.ciornei@nxp.com/
> 
> I did miss the rtnl_lock() around phylink_disconnect_phy() in the
> description of the race, which goes some way towards hiding it, but
> there is still a race between phylink_destroy() and another thread
> calling dpaa2_eth_get_link_ksettings(), and priv->mac being freed:
> 
> static int
> dpaa2_eth_get_link_ksettings(struct net_device *net_dev,
>                              struct ethtool_link_ksettings *link_settings)
> {
>         struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
> 
>         if (dpaa2_eth_is_type_phy(priv))
>                 return phylink_ethtool_ksettings_get(priv->mac->phylink,
>                                                      link_settings);
> 
> which dereferences priv->mac and priv->mac->phylink, vs:
> 
> static irqreturn_t dpni_irq0_handler_thread(int irq_num, void *arg)
> {
> ...
>         if (status & DPNI_IRQ_EVENT_ENDPOINT_CHANGED) {
>                 dpaa2_eth_set_mac_addr(netdev_priv(net_dev));
>                 dpaa2_eth_update_tx_fqids(priv);
> 
>                 if (dpaa2_eth_has_mac(priv))
>                         dpaa2_eth_disconnect_mac(priv);
>                 else
>                         dpaa2_eth_connect_mac(priv);
>         }
> 
> static void dpaa2_eth_disconnect_mac(struct dpaa2_eth_priv *priv)
> {
>         if (dpaa2_eth_is_type_phy(priv))
>                 dpaa2_mac_disconnect(priv->mac);
> 
>         if (!dpaa2_eth_has_mac(priv))
>                 return;
> 
>         dpaa2_mac_close(priv->mac);
>         kfree(priv->mac);		<== potential use after free bug by
>         priv->mac = NULL;		<== dpaa2_eth_get_link_ksettings()
> }
> 
> void dpaa2_mac_disconnect(struct dpaa2_mac *mac)
> {
>         if (!mac->phylink)
>                 return;
> 
>         phylink_disconnect_phy(mac->phylink);
>         phylink_destroy(mac->phylink);	<== another use-after-free bug via
> 					    dpaa2_eth_get_link_ksettings()
>         dpaa2_pcs_destroy(mac);
> }
> 
> Note that phylink_destroy() is documented as:
> 
>  * Note: the rtnl lock must not be held when calling this function.
> 
> because it calls sfp_bus_del_upstream(), which will take the rtnl lock
> itself. An alternative solution would be to remove the rtnl locking
> from sfp_bus_del_upstream(), but then force _everyone_ to take the
> rtnl lock before calling phylink_destroy() - meaning a larger block of
> code ends up executing under the lock than is really necessary.
> 
> However, as I stated in my review of the patch "As I've already stated,
> the phylink is not designed to be created and destroyed on a published
> network device." That still remains true today, and it seems that the
> issue has never been fixed in DPAA2 despite having been pointed out.
> 

My attempt to fix this issue was that patch that you just pointed at.
Taking your feedback into account (that phylink is not designed to be
created and destroyed on a published networking device) I really do not
know what other viable solution to send out.

The alternative here would have been to just have a different driver for
the MAC side (probing on dpmac objects) that creates the phylink
instance at probe time and then is just used by the dpaa2-eth driver
when it connects to a dpmac. This way no phylink is created/destroyed
dynamically.

This was the architecture of my initial attempt at supporting phylink in
DPAA2.
https://patchwork.ozlabs.org/project/netdev/patch/1560470153-26155-5-git-send-email-ioana.ciornei@nxp.com/

If you have any suggestion on how I should go about fixing this, please
let me know.

Ioana
Vladimir Oltean Sept. 3, 2021, 8:48 p.m. UTC | #18
On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> Removing a lock and then running the kernel is a downright stupid
> way to test to see if a lock is necessary.
> 
> That approach is like having built an iron bridge, covered it in paint,
> then you remove most of the bolts, and then test to see whether it's safe
> for vehicles to travel over it by riding your bicycle across it and
> declaring it safe.
> 
> Sorry, but if you think "remove lock, run kernel, if it works fine
> the lock is unnecessary" is a valid approach, then you've just
> disqualified yourself from discussing this topic any further.
> Locking is done by knowing the code and code analysis, not by
> playing "does the code fail if I remove it" games. I am utterly
> shocked that you think that this is a valid approach.

... and this is exactly why you will no longer get any attention from me
on this topic. Good luck.
Russell King (Oracle) Sept. 3, 2021, 10:06 p.m. UTC | #19
On Fri, Sep 03, 2021 at 11:48:22PM +0300, Vladimir Oltean wrote:
> On Fri, Sep 03, 2021 at 01:04:19AM +0100, Russell King (Oracle) wrote:
> > Removing a lock and then running the kernel is a downright stupid
> > way to test to see if a lock is necessary.
> > 
> > That approach is like having built an iron bridge, covered it in paint,
> > then you remove most of the bolts, and then test to see whether it's safe
> > for vehicles to travel over it by riding your bicycle across it and
> > declaring it safe.
> > 
> > Sorry, but if you think "remove lock, run kernel, if it works fine
> > the lock is unnecessary" is a valid approach, then you've just
> > disqualified yourself from discussing this topic any further.
> > Locking is done by knowing the code and code analysis, not by
> > playing "does the code fail if I remove it" games. I am utterly
> > shocked that you think that this is a valid approach.
> 
> ... and this is exactly why you will no longer get any attention from me
> on this topic. Good luck.

Good, because your approach to this to me reads as "I don't think you
know what the hell you're doing so I'm going to remove a lock to test
whether it is needed." Effectively, that action is an insult towards
me as the author of that code.

And as I said, if you think that's a valid approach, then quite frankly
I don't want you touching my code, because you clearly don't know what
you're doing as you aren't willing to put the necessary effort in to
understanding the code.

Removing a lock and running the kernel is _never_ a valid way to see
whether the lock is required or not. The only way is via code analysis.

I wonder whether you'd take the same approach with filesystems or
memory management code. Why don't you try removing some locks from
those subsystems and see how long your filesystems last?

You could have asked why the lock was necessary, and I would have
described it. That would have been the civil approach. Maybe even
put forward a hypothesis why you think the lock isn't necessary, but
no, you decide that the best way to go about this is to remove the
lock and see whether the kernel breaks.

It may shock you to know that those of us who have been working on
the kernel for almost 30 years, have seen the evolution of the
kernel from uniprocessor to SMP, and have had to debug race conditions
caused by a lack of locking know very well that you can have what
seems to be a functioning kernel despite missing locks - and such a
kernel can last quite a long time and only show up the race quite
rarely. This is exactly why "let's remove the lock and see if it
breaks" is a completely invalid approach. I'm sorry that you don't
seem to realise just how stupid a suggestion that was.

I can tell you now: removing the locks you proposed will not show an
immediate problem, but by removing those locks you will definitely
open up race conditions between driver binding events on the SFP
side and network usage on the netdev side which will only occur
rarely.

And just because they only happen rarely is not a justification to
remove locks, no matter how inconvenient those locks may be.

Patch

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 1c379d20812a..b22073b0acd2 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -128,13 +128,30 @@  static void deferred_probe_work_func(struct work_struct *work)
 }
 static DECLARE_WORK(deferred_probe_work, deferred_probe_work_func);
 
+static bool __device_pending_probe(struct device *dev)
+{
+	return !list_empty(&dev->p->deferred_probe);
+}
+
+bool device_pending_probe(struct device *dev)
+{
+	bool pending;
+
+	mutex_lock(&deferred_probe_mutex);
+	pending = __device_pending_probe(dev);
+	mutex_unlock(&deferred_probe_mutex);
+
+	return pending;
+}
+EXPORT_SYMBOL_GPL(device_pending_probe);
+
 void driver_deferred_probe_add(struct device *dev)
 {
 	if (!dev->can_match)
 		return;
 
 	mutex_lock(&deferred_probe_mutex);
-	if (list_empty(&dev->p->deferred_probe)) {
+	if (!__device_pending_probe(dev)) {
 		dev_dbg(dev, "Added to deferred list\n");
 		list_add_tail(&dev->p->deferred_probe, &deferred_probe_pending_list);
 	}
@@ -144,7 +161,7 @@  void driver_deferred_probe_add(struct device *dev)
 void driver_deferred_probe_del(struct device *dev)
 {
 	mutex_lock(&deferred_probe_mutex);
-	if (!list_empty(&dev->p->deferred_probe)) {
+	if (__device_pending_probe(dev)) {
 		dev_dbg(dev, "Removed from deferred list\n");
 		list_del_init(&dev->p->deferred_probe);
 		__device_set_deferred_probe_reason(dev, NULL);
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 52310df121de..2c22a32f0a1c 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -1386,8 +1386,16 @@  int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 
 	/* Assume that if there is no driver, that it doesn't
 	 * exist, and we should use the genphy driver.
+	 * The exception is during probing, when the PHY driver might have
+	 * attempted a probe but has requested deferral. Since there might be
+	 * MAC drivers which also attach to the PHY during probe time, try
+	 * harder to bind the specific PHY driver, and defer the MAC driver's
+	 * probing until then.
 	 */
 	if (!d->driver) {
+		if (device_pending_probe(d))
+			return -EPROBE_DEFER;
+
 		if (phydev->is_c45)
 			d->driver = &genphy_c45_driver.mdiodrv.driver;
 		else
diff --git a/include/linux/device.h b/include/linux/device.h
index e270cb740b9e..505e77715789 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -889,6 +889,7 @@  int __must_check driver_attach(struct device_driver *drv);
 void device_initial_probe(struct device *dev);
 int __must_check device_reprobe(struct device *dev);
 
+bool device_pending_probe(struct device *dev);
 bool device_is_bound(struct device *dev);
 
 /*
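The behavioural change in phy_attach_direct() can be modelled in userspace as follows. This is only a sketch of the control flow in the hunk above, assuming a MAC driver that attaches to the PHY from its own probe function; model_phy, model_phy_attach and model_mac_probe are hypothetical names, and the probe_deferred flag stands in for device_pending_probe().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The kernel's EPROBE_DEFER value; it is not part of userspace errno.h. */
#define MODEL_EPROBE_DEFER 517

struct model_phy {
	const char *driver;	/* NULL until a driver is bound */
	bool probe_deferred;	/* stands in for device_pending_probe() */
};

/* Mirrors the patched branch: defer instead of binding the generic driver. */
static int model_phy_attach(struct model_phy *phy)
{
	assert(phy != NULL);

	if (!phy->driver) {
		if (phy->probe_deferred)
			return -MODEL_EPROBE_DEFER;

		/* No specific driver is pending: fall back to the generic one. */
		phy->driver = "generic";
	}

	return 0;
}

/* A MAC probe that attaches at probe time simply propagates the error. */
static int model_mac_probe(struct model_phy *phy)
{
	int err = model_phy_attach(phy);

	if (err)
		return err;	/* the driver core would retry this probe later */

	/* ... rest of the MAC setup would go here ... */
	return 0;
}
```

In this model, a MAC probe that runs between two attempts of the specific PHY driver fails with the deferral code and leaves the PHY unbound, so a later retry can still find the specific driver in place.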