Message ID | 20211109113921.1020311-1-u.kleine-koenig@pengutronix.de (mailing list archive) |
---|---|
State | Deferred |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] net: dsa: Some cleanups in remove code | expand |
Your commit prefix does not reflect the fact that you are touching the vsc73xx driver. Try "net: dsa: vsc73xx: ". On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > vsc73xx_remove() returns zero unconditionally and no caller checks the > returned value. So convert the function to return no value. This I agree with. > For both the platform and the spi variant ..._get_drvdata() will never > return NULL in .remove() because the remove callback is only called after > the probe callback returned successfully and in this case driver data was > set to a non-NULL value. Have you read the commit message of 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown")? To remove the check for dev_get_drvdata == NULL in ->remove, you need to prove that ->remove will never be called after ->shutdown. For platform devices this is pretty easy to prove, for SPI devices not so much. I intentionally kept the code structure the same because code gets copied around a lot, it is easy to copy from the wrong place. > Also setting driver data to NULL is not necessary, this is already done > in the driver core in __device_release_driver(), so drop this from the > remove callback, too. And this was also intentional, for visibility more or less. I would like you to ack that you understand the problems surrounding ->remove/->shutdown ordering for devices on buses, prior to making seemingly trivial cleanups. > Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> > ---
Hello, Cc += gregkh, maybe he has something to say on this matter On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > Your commit prefix does not reflect the fact that you are touching the > vsc73xx driver. Try "net: dsa: vsc73xx: ". Oh, I missed that indeed. > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > returned value. So convert the function to return no value. > > This I agree with. > > > For both the platform and the spi variant ..._get_drvdata() will never > > return NULL in .remove() because the remove callback is only called after > > the probe callback returned successfully and in this case driver data was > > set to a non-NULL value. > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > compatible with masters which unregister on shutdown")? No. But I did now. I consider it very surprising that .shutdown() calls the .remove() callback and would recommend to not do this. The commit log seems to prove this being difficult. > To remove the check for dev_get_drvdata == NULL in ->remove, you need to > prove that ->remove will never be called after ->shutdown. For platform > devices this is pretty easy to prove, for SPI devices not so much. > I intentionally kept the code structure the same because code gets > copied around a lot, it is easy to copy from the wrong place. Alternatively remove spi_set_drvdata(spi, NULL); from vsc73xx_spi_shutdown()? Also I'm not aware how platform devices are different to spi devices that the ordering of .remove and shutdown() is more or less obvious than on the other bus?! > > Also setting driver data to NULL is not necessary, this is already done > > in the driver core in __device_release_driver(), so drop this from the > > remove callback, too. > > And this was also intentional, for visibility more or less. I would like > you to ack that you understand the problems surrounding ->remove/->shutdown > ordering for devices on buses, prior to making seemingly trivial cleanups. I see that the change is not so obviously correct as I thought. I'll have to think about this and will respin if and when I find a sane way forward. Best regards Uwe
On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > Hello, > > Cc += gregkh, maybe he has something to say on this matter > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > Your commit prefix does not reflect the fact that you are touching the > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > Oh, I missed that indeed. > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > returned value. So convert the function to return no value. > > > > This I agree with. > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > return NULL in .remove() because the remove callback is only called after > > > the probe callback returned successfully and in this case driver data was > > > set to a non-NULL value. > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > compatible with masters which unregister on shutdown")? > > No. But I did now. I consider it very surprising that .shutdown() calls > the .remove() callback and would recommend to not do this. The commit > log seems to prove this being difficult. Why do you consider it surprising? Many drivers implement ->shutdown by calling ->remove for the simple reason that ->remove provides for a well-tested code path already, and leaves the hardware in a known state, workable for kexec and others. Many drivers have buses beneath them. Those buses go away when these drivers unregister, and so do their children. ============================================== => some drivers do both => children of these buses should expect to be potentially unregistered after they've been shut down. > > To remove the check for dev_get_drvdata == NULL in ->remove, you need to > > prove that ->remove will never be called after ->shutdown. For platform > > devices this is pretty easy to prove, for SPI devices not so much. > > I intentionally kept the code structure the same because code gets > > copied around a lot, it is easy to copy from the wrong place. > > Alternatively remove spi_set_drvdata(spi, NULL); from > vsc73xx_spi_shutdown()? What is the end goal exactly? > Also I'm not aware how platform devices are > different to spi devices that the ordering of .remove and shutdown() is > more or less obvious than on the other bus?! Not sure what you mean. See the explanation above. For the "platform" bus, there simply isn't any code path that unregisters children on the ->shutdown callback. For other buses, there is. > > > Also setting driver data to NULL is not necessary, this is already done > > > in the driver core in __device_release_driver(), so drop this from the > > > remove callback, too. > > > > And this was also intentional, for visibility more or less. I would like > > you to ack that you understand the problems surrounding ->remove/->shutdown > > ordering for devices on buses, prior to making seemingly trivial cleanups. > > I see that the change is not so obviously correct as I thought. I'll > have to think about this and will respin if and when I find a sane way > forward. A way forward towards what? This is literally a cosmetic patch that would happen to break some stuff, were it to be applied.
Hello Vladimir, On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > Your commit prefix does not reflect the fact that you are touching the > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > Oh, I missed that indeed. > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > returned value. So convert the function to return no value. > > > > > > This I agree with. > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > return NULL in .remove() because the remove callback is only called after > > > > the probe callback returned successfully and in this case driver data was > > > > set to a non-NULL value. > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > compatible with masters which unregister on shutdown")? > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > the .remove() callback and would recommend to not do this. The commit > > log seems to prove this being difficult. > > Why do you consider it surprising? In my book .shutdown should be minimal and just silence the device, such that it e.g. doesn't do any DMA any more. > Many drivers implement ->shutdown by calling ->remove for the simple > reason that ->remove provides for a well-tested code path already, and > leaves the hardware in a known state, workable for kexec and others. > > Many drivers have buses beneath them. Those buses go away when these > drivers unregister, and so do their children. > > ============================================== > > => some drivers do both => children of these buses should expect to be > potentially unregistered after they've been shut down. Do you know this happens, or do you "only" fear it might happen? > > > To remove the check for dev_get_drvdata == NULL in ->remove, you need to > > > prove that ->remove will never be called after ->shutdown. For platform > > > devices this is pretty easy to prove, for SPI devices not so much. > > > I intentionally kept the code structure the same because code gets > > > copied around a lot, it is easy to copy from the wrong place. > > > > Alternatively remove spi_set_drvdata(spi, NULL); from > > vsc73xx_spi_shutdown()? > > What is the end goal exactly? My end goal is: diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h index eb7ac8a1e03c..183cf15fbdd2 100644 --- a/include/linux/spi/spi.h +++ b/include/linux/spi/spi.h @@ -280,7 +280,7 @@ struct spi_message; struct spi_driver { const struct spi_device_id *id_table; int (*probe)(struct spi_device *spi); - int (*remove)(struct spi_device *spi); + void (*remove)(struct spi_device *spi); void (*shutdown)(struct spi_device *spi); struct device_driver driver; }; As (nearly) all spi drivers must be touched in the same commit, the preparing goal is to have these remove callbacks simple, such that I only have to replace their "return 0;" by "return;" (or nothing if it's at the end of the function). Looking at vsc73xx's remove function I didn't stop at this minimal goal and simplified the stuff that I thought to be superflous. > > Also I'm not aware how platform devices are > > different to spi devices that the ordering of .remove and shutdown() is > > more or less obvious than on the other bus?! > > Not sure what you mean. See the explanation above. For the "platform" > bus, there simply isn't any code path that unregisters children on the > ->shutdown callback. For other buses, there is. OK, with your last mail I understood that now, thanks. Best regards Uwe
On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote: > Hello Vladimir, > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > > Your commit prefix does not reflect the fact that you are touching the > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > > > Oh, I missed that indeed. > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > > returned value. So convert the function to return no value. > > > > > > > > This I agree with. > > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > > return NULL in .remove() because the remove callback is only called after > > > > > the probe callback returned successfully and in this case driver data was > > > > > set to a non-NULL value. > > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > > compatible with masters which unregister on shutdown")? > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > > the .remove() callback and would recommend to not do this. The commit > > > log seems to prove this being difficult. > > > > Why do you consider it surprising? > > In my book .shutdown should be minimal and just silence the device, such > that it e.g. doesn't do any DMA any more. To me, the more important thing to consider is that many drivers lack any ->shutdown hook at all, and making their ->shutdown simply call ->remove is often times the least-effort path of doing something reasonable towards quiescing the hardware. Not to mention the lesser evil compared to not having a ->shutdown at all. That's not to say I am not in favor of a minimal shutdown procedure if possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch(). But judging what should go into dsa_switch_shutdown() was definitely not simple and there might still be corner cases that I missed - although it works for now, knock on wood. The reality is that you'll have a very hard time convincing people to write a dedicated code path for shutdown, if you can convince them to write one at all. They wouldn't even know if it does all the right things - it's not like you kexec every day (unless you're using Linux as a bootloader - but then again, if you do that you're kind of asking for trouble - the reason why this is the case is exactly because not having a ->shutdown hook implemented by drivers is an option, and the driver core doesn't e.g. fall back to calling the ->remove method, even with all the insanity that might ensue). > > Many drivers implement ->shutdown by calling ->remove for the simple > > reason that ->remove provides for a well-tested code path already, and > > leaves the hardware in a known state, workable for kexec and others. > > > > Many drivers have buses beneath them. Those buses go away when these > > drivers unregister, and so do their children. > > > > ============================================== > > > > => some drivers do both => children of these buses should expect to be > > potentially unregistered after they've been shut down. > > Do you know this happens, or do you "only" fear it might happen? Are you asking whether there are SPI controllers that implement ->shutdown as ->remove? Just search for "\.shutdown" in drivers/spi. 3 out of 3 implementations call ->remove. If you really have time to waste, here, have fun: Lino Sanfilippo had not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch connected to a Raspberry Pi, both of which were due to other drivers implementing their ->shutdown as ->remove. First driver was the DSA master/host port (bcmgenet), the other was the bcm2835_spi controller. https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/ https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/ https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/ As soon as we implemented ->shutdown in DSA drivers (which we had mostly not done up until that thread) we ran into the surprise that ->remove will get called too. Yay. It wasn't trivial to sort out, but we did it eventually in a more systematic way. Not sure whether there's anything to change at the drivers/base/ level. Since any SPI-controlled DSA switch can fundamentally be connected to mostly any SPI controller, then yes, I have no doubt at all that the same can happen even with the vsc73xx driver you're patching here. > > > > To remove the check for dev_get_drvdata == NULL in ->remove, you need to > > > > prove that ->remove will never be called after ->shutdown. For platform > > > > devices this is pretty easy to prove, for SPI devices not so much. > > > > I intentionally kept the code structure the same because code gets > > > > copied around a lot, it is easy to copy from the wrong place. > > > > > > Alternatively remove spi_set_drvdata(spi, NULL); from > > > vsc73xx_spi_shutdown()? > > > > What is the end goal exactly? > > My end goal is: > > diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h > index eb7ac8a1e03c..183cf15fbdd2 100644 > --- a/include/linux/spi/spi.h > +++ b/include/linux/spi/spi.h > @@ -280,7 +280,7 @@ struct spi_message; > struct spi_driver { > const struct spi_device_id *id_table; > int (*probe)(struct spi_device *spi); > - int (*remove)(struct spi_device *spi); > + void (*remove)(struct spi_device *spi); > void (*shutdown)(struct spi_device *spi); > struct device_driver driver; > }; > > As (nearly) all spi drivers must be touched in the same commit, the > preparing goal is to have these remove callbacks simple, such that I > only have to replace their "return 0;" by "return;" (or nothing if it's > at the end of the function). Looking at vsc73xx's remove function I > didn't stop at this minimal goal and simplified the stuff that I thought > to be superflous. Yeah, well I guess you can stop at the minimal goal. > > > Also I'm not aware how platform devices are > > > different to spi devices that the ordering of .remove and shutdown() is > > > more or less obvious than on the other bus?! > > > > Not sure what you mean. See the explanation above. For the "platform" > > bus, there simply isn't any code path that unregisters children on the > > ->shutdown callback. For other buses, there is. > > OK, with your last mail I understood that now, thanks. > > Best regards > Uwe > > -- > Pengutronix e.K. | Uwe Kleine-König | > Industrial Linux Solutions | https://www.pengutronix.de/ |
Hello Vladimir, On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote: > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote: > > Hello Vladimir, > > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > > > Your commit prefix does not reflect the fact that you are touching the > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > > > > > Oh, I missed that indeed. > > > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > > > returned value. So convert the function to return no value. > > > > > > > > > > This I agree with. > > > > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > > > return NULL in .remove() because the remove callback is only called after > > > > > > the probe callback returned successfully and in this case driver data was > > > > > > set to a non-NULL value. > > > > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > > > compatible with masters which unregister on shutdown")? > > > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > > > the .remove() callback and would recommend to not do this. The commit > > > > log seems to prove this being difficult. > > > > > > Why do you consider it surprising? > > > > In my book .shutdown should be minimal and just silence the device, such > > that it e.g. doesn't do any DMA any more. > > To me, the more important thing to consider is that many drivers lack > any ->shutdown hook at all, and making their ->shutdown simply call > ->remove is often times the least-effort path of doing something > reasonable towards quiescing the hardware. Not to mention the lesser > evil compared to not having a ->shutdown at all. > > That's not to say I am not in favor of a minimal shutdown procedure if > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch(). > But judging what should go into dsa_switch_shutdown() was definitely not > simple and there might still be corner cases that I missed - although it > works for now, knock on wood. > > The reality is that you'll have a very hard time convincing people to > write a dedicated code path for shutdown, if you can convince them to > write one at all. They wouldn't even know if it does all the right > things - it's not like you kexec every day (unless you're using Linux as > a bootloader - but then again, if you do that you're kind of asking for > trouble - the reason why this is the case is exactly because not having > a ->shutdown hook implemented by drivers is an option, and the driver > core doesn't e.g. fall back to calling the ->remove method, even with > all the insanity that might ensue). Maybe I'm missing an important point here, but I thought it to be fine for most drivers not to have a .shutdown hook. > > > Many drivers implement ->shutdown by calling ->remove for the simple > > > reason that ->remove provides for a well-tested code path already, and > > > leaves the hardware in a known state, workable for kexec and others. > > > > > > Many drivers have buses beneath them. Those buses go away when these > > > drivers unregister, and so do their children. > > > > > > ============================================== > > > > > > => some drivers do both => children of these buses should expect to be > > > potentially unregistered after they've been shut down. > > > > Do you know this happens, or do you "only" fear it might happen? > > Are you asking whether there are SPI controllers that implement > ->shutdown as ->remove? No I ask if it happens a lot / sometimes / ever that a driver's remove callback is run for a device that was already shut down. > Just search for "\.shutdown" in drivers/spi. > 3 out of 3 implementations call ->remove. > > If you really have time to waste, here, have fun: Lino Sanfilippo had > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch > connected to a Raspberry Pi, both of which were due to other drivers > implementing their ->shutdown as ->remove. First driver was the DSA > master/host port (bcmgenet), the other was the bcm2835_spi controller. > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/ > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/ > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/ > As soon as we implemented ->shutdown in DSA drivers (which we had mostly > not done up until that thread) we ran into the surprise that ->remove > will get called too. Yay. It wasn't trivial to sort out, but we did it > eventually in a more systematic way. Not sure whether there's anything > to change at the drivers/base/ level. What I wonder is: There are drivers that call .remove from .shutdown. Is the right action to make other parts of the kernel robust with this behaviour, or should the drivers changed to not call .remove from .shutdown? IMHO this is a question of promises of/expectations against the core device layer. It must be known if for a shut down device there is (and should be) a possibility that .remove is called. Depending on that device drivers must be ready for this to happen, or can rely on it not to happen. From a global maintenance POV it would be good if it could not happen, because then the complexity is concentrated to a small place (i.e. the driver core, or maybe generic code in all subsystems) instead of making each and every driver robust to this possible event that a considerable part of the driver writers isn't aware of. Best regards Uwe
On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote: > Hello Vladimir, > > On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote: > > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote: > > > Hello Vladimir, > > > > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > > > > Your commit prefix does not reflect the fact that you are touching the > > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > > > > > > > Oh, I missed that indeed. > > > > > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > > > > returned value. So convert the function to return no value. > > > > > > > > > > > > This I agree with. > > > > > > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > > > > return NULL in .remove() because the remove callback is only called after > > > > > > > the probe callback returned successfully and in this case driver data was > > > > > > > set to a non-NULL value. > > > > > > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > > > > compatible with masters which unregister on shutdown")? > > > > > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > > > > the .remove() callback and would recommend to not do this. The commit > > > > > log seems to prove this being difficult. > > > > > > > > Why do you consider it surprising? > > > > > > In my book .shutdown should be minimal and just silence the device, such > > > that it e.g. doesn't do any DMA any more. > > > > To me, the more important thing to consider is that many drivers lack > > any ->shutdown hook at all, and making their ->shutdown simply call > > ->remove is often times the least-effort path of doing something > > reasonable towards quiescing the hardware. Not to mention the lesser > > evil compared to not having a ->shutdown at all. > > > > That's not to say I am not in favor of a minimal shutdown procedure if > > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch(). > > But judging what should go into dsa_switch_shutdown() was definitely not > > simple and there might still be corner cases that I missed - although it > > works for now, knock on wood. > > > > The reality is that you'll have a very hard time convincing people to > > write a dedicated code path for shutdown, if you can convince them to > > write one at all. They wouldn't even know if it does all the right > > things - it's not like you kexec every day (unless you're using Linux as > > a bootloader - but then again, if you do that you're kind of asking for > > trouble - the reason why this is the case is exactly because not having > > a ->shutdown hook implemented by drivers is an option, and the driver > > core doesn't e.g. fall back to calling the ->remove method, even with > > all the insanity that might ensue). > > Maybe I'm missing an important point here, but I thought it to be fine > for most drivers not to have a .shutdown hook. Depends on what you mean by "most drivers". One other case of definitely problematic things that ->shutdown must take care of are shared interrupts. I don't have a metric at hand, but there are definitely not few drivers which support IRQF_SHARED. Some of those don't implement ->shutdown. What I'm saying is that it would definitely go a long way for the problems caused by these to be solved in one fell swoop by having some logic to fall back to the ->remove path. > > > > Many drivers implement ->shutdown by calling ->remove for the simple > > > > reason that ->remove provides for a well-tested code path already, and > > > > leaves the hardware in a known state, workable for kexec and others. > > > > > > > > Many drivers have buses beneath them. Those buses go away when these > > > > drivers unregister, and so do their children. > > > > > > > > ============================================== > > > > > > > > => some drivers do both => children of these buses should expect to be > > > > potentially unregistered after they've been shut down. > > > > > > Do you know this happens, or do you "only" fear it might happen? > > > > Are you asking whether there are SPI controllers that implement > > ->shutdown as ->remove? > > No I ask if it happens a lot / sometimes / ever that a driver's remove > callback is run for a device that was already shut down. So if a SPI device is connected to one of the 3 SPI controllers mentioned by me below, it happens with 100% reproduction rate. Otherwise it happens with 0% reproduction rate. But you don't write a SPI device driver to work with just one SPI controller, ideally you write it to work with all. > > Just search for "\.shutdown" in drivers/spi. > > 3 out of 3 implementations call ->remove. > > > > If you really have time to waste, here, have fun: Lino Sanfilippo had > > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch > > connected to a Raspberry Pi, both of which were due to other drivers > > implementing their ->shutdown as ->remove. First driver was the DSA > > master/host port (bcmgenet), the other was the bcm2835_spi controller. > > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/ > > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/ > > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/ > > As soon as we implemented ->shutdown in DSA drivers (which we had mostly > > not done up until that thread) we ran into the surprise that ->remove > > will get called too. Yay. It wasn't trivial to sort out, but we did it > > eventually in a more systematic way. Not sure whether there's anything > > to change at the drivers/base/ level. > > What I wonder is: There are drivers that call .remove from .shutdown. Is > the right action to make other parts of the kernel robust with this > behaviour, or should the drivers changed to not call .remove from > .shutdown? > > IMHO this is a question of promises of/expectations against the core > device layer. It must be known if for a shut down device there is (and > should be) a possibility that .remove is called. Depending on that > device drivers must be ready for this to happen, or can rely on it not > to happen. > > From a global maintenance POV it would be good if it could not happen, > because then the complexity is concentrated to a small place (i.e. the > driver core, or maybe generic code in all subsystems) instead of making > each and every driver robust to this possible event that a considerable > part of the driver writers isn't aware of. IMO, if you can not offer a solid promise but merely a fragile one, then it is always better to be robust (which DSA now is). How would you propose that this particular promise could be fulfilled? Simply patch the known offending drivers today and hope more drivers won't do this in the future? Patching the device core to keep track of which devices were shut down, so as to not call into their ->remove method? Mind you, this issue was reported as a bug and had to be dealt with locally, for stable kernels, so changing the driver core was not an option.
On Thu, Nov 11, 2021 at 12:47:01PM +0200, Vladimir Oltean wrote: > On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote: > > Hello Vladimir, > > > > On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote: > > > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote: > > > > Hello Vladimir, > > > > > > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > > > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > > > > > Your commit prefix does not reflect the fact that you are touching the > > > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > > > > > > > > > Oh, I missed that indeed. > > > > > > > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > > > > > returned value. So convert the function to return no value. > > > > > > > > > > > > > > This I agree with. > > > > > > > > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > > > > > return NULL in .remove() because the remove callback is only called after > > > > > > > > the probe callback returned successfully and in this case driver data was > > > > > > > > set to a non-NULL value. > > > > > > > > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > > > > > compatible with masters which unregister on shutdown")? > > > > > > > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > > > > > the .remove() callback and would recommend to not do this. The commit > > > > > > log seems to prove this being difficult. > > > > > > > > > > Why do you consider it surprising? > > > > > > > > In my book .shutdown should be minimal and just silence the device, such > > > > that it e.g. doesn't do any DMA any more. > > > > > > To me, the more important thing to consider is that many drivers lack > > > any ->shutdown hook at all, and making their ->shutdown simply call > > > ->remove is often times the least-effort path of doing something > > > reasonable towards quiescing the hardware. Not to mention the lesser > > > evil compared to not having a ->shutdown at all. > > > > > > That's not to say I am not in favor of a minimal shutdown procedure if > > > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch(). > > > But judging what should go into dsa_switch_shutdown() was definitely not > > > simple and there might still be corner cases that I missed - although it > > > works for now, knock on wood. > > > > > > The reality is that you'll have a very hard time convincing people to > > > write a dedicated code path for shutdown, if you can convince them to > > > write one at all. They wouldn't even know if it does all the right > > > things - it's not like you kexec every day (unless you're using Linux as > > > a bootloader - but then again, if you do that you're kind of asking for > > > trouble - the reason why this is the case is exactly because not having > > > a ->shutdown hook implemented by drivers is an option, and the driver > > > core doesn't e.g. fall back to calling the ->remove method, even with > > > all the insanity that might ensue). > > > > Maybe I'm missing an important point here, but I thought it to be fine > > for most drivers not to have a .shutdown hook. > > Depends on what you mean by "most drivers". One other case of definitely > problematic things that ->shutdown must take care of are shared interrupts. > I don't have a metric at hand, but there are definitely not few drivers > which support IRQF_SHARED. Some of those don't implement ->shutdown. > What I'm saying is that it would definitely go a long way for the > problems caused by these to be solved in one fell swoop by having some > logic to fall back to the ->remove path. > > > > > > Many drivers implement ->shutdown by calling ->remove for the simple > > > > > reason that ->remove provides for a well-tested code path already, and > > > > > leaves the hardware in a known state, workable for kexec and others. > > > > > > > > > > Many drivers have buses beneath them. Those buses go away when these > > > > > drivers unregister, and so do their children. > > > > > > > > > > ============================================== > > > > > > > > > > => some drivers do both => children of these buses should expect to be > > > > > potentially unregistered after they've been shut down. > > > > > > > > Do you know this happens, or do you "only" fear it might happen? > > > > > > Are you asking whether there are SPI controllers that implement > > > ->shutdown as ->remove? > > > > No I ask if it happens a lot / sometimes / ever that a driver's remove > > callback is run for a device that was already shut down. > > So if a SPI device is connected to one of the 3 SPI controllers > mentioned by me below, it happens with 100% reproduction rate. Otherwise > it happens with 0% reproduction rate. But you don't write a SPI device > driver to work with just one SPI controller, ideally you write it to > work with all. > > > > Just search for "\.shutdown" in drivers/spi. > > > 3 out of 3 implementations call ->remove. > > > > > > If you really have time to waste, here, have fun: Lino Sanfilippo had > > > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch > > > connected to a Raspberry Pi, both of which were due to other drivers > > > implementing their ->shutdown as ->remove. First driver was the DSA > > > master/host port (bcmgenet), the other was the bcm2835_spi controller. > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/ > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/ > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/ > > > As soon as we implemented ->shutdown in DSA drivers (which we had mostly > > > not done up until that thread) we ran into the surprise that ->remove > > > will get called too. Yay. It wasn't trivial to sort out, but we did it > > > eventually in a more systematic way. Not sure whether there's anything > > > to change at the drivers/base/ level. > > > > What I wonder is: There are drivers that call .remove from .shutdown. Is > > the right action to make other parts of the kernel robust with this > > behaviour, or should the drivers changed to not call .remove from > > .shutdown? > > > > IMHO this is a question of promises of/expectations against the core > > device layer. It must be known if for a shut down device there is (and > > should be) a possibility that .remove is called. Depending on that > > device drivers must be ready for this to happen, or can rely on it not > > to happen. > > > > From a global maintenance POV it would be good if it could not happen, > > because then the complexity is concentrated to a small place (i.e. the > > driver core, or maybe generic code in all subsystems) instead of making > > each and every driver robust to this possible event that a considerable > > part of the driver writers isn't aware of. > > IMO, if you can not offer a solid promise but merely a fragile one, then > it is always better to be robust (which DSA now is). How would you > propose that this particular promise could be fulfilled? Simply patch > the known offending drivers today and hope more drivers won't do this in > the future? Patching the device core to keep track of which devices > were shut down, so as to not call into their ->remove method? > Mind you, this issue was reported as a bug and had to be dealt with > locally, for stable kernels, so changing the driver core was not an > option. Fix things properly first, in Linus's tree, and then worry about stable kernels. Never the other way around please. thanks, greg k-h
On Thu, Nov 11, 2021 at 12:44:00PM +0100, Greg Kroah-Hartman wrote: > On Thu, Nov 11, 2021 at 12:47:01PM +0200, Vladimir Oltean wrote: > > On Thu, Nov 11, 2021 at 08:57:54AM +0100, Uwe Kleine-König wrote: > > > Hello Vladimir, > > > > > > On Thu, Nov 11, 2021 at 12:56:11AM +0200, Vladimir Oltean wrote: > > > > On Wed, Nov 10, 2021 at 10:03:46PM +0100, Uwe Kleine-König wrote: > > > > > Hello Vladimir, > > > > > > > > > > On Wed, Nov 10, 2021 at 03:15:40PM +0200, Vladimir Oltean wrote: > > > > > > On Tue, Nov 09, 2021 at 06:50:55PM +0100, Uwe Kleine-König wrote: > > > > > > > On Tue, Nov 09, 2021 at 01:54:34PM +0200, Vladimir Oltean wrote: > > > > > > > > Your commit prefix does not reflect the fact that you are touching the > > > > > > > > vsc73xx driver. Try "net: dsa: vsc73xx: ". > > > > > > > > > > > > > > Oh, I missed that indeed. > > > > > > > > > > > > > > > On Tue, Nov 09, 2021 at 12:39:21PM +0100, Uwe Kleine-König wrote: > > > > > > > > > vsc73xx_remove() returns zero unconditionally and no caller checks the > > > > > > > > > returned value. So convert the function to return no value. > > > > > > > > > > > > > > > > This I agree with. > > > > > > > > > > > > > > > > > For both the platform and the spi variant ..._get_drvdata() will never > > > > > > > > > return NULL in .remove() because the remove callback is only called after > > > > > > > > > the probe callback returned successfully and in this case driver data was > > > > > > > > > set to a non-NULL value. > > > > > > > > > > > > > > > > Have you read the commit message of 0650bf52b31f ("net: dsa: be > > > > > > > > compatible with masters which unregister on shutdown")? > > > > > > > > > > > > > > No. But I did now. I consider it very surprising that .shutdown() calls > > > > > > > the .remove() callback and would recommend to not do this. The commit > > > > > > > log seems to prove this being difficult. > > > > > > > > > > > > Why do you consider it surprising? > > > > > > > > > > In my book .shutdown should be minimal and just silence the device, such > > > > > that it e.g. doesn't do any DMA any more. > > > > > > > > To me, the more important thing to consider is that many drivers lack > > > > any ->shutdown hook at all, and making their ->shutdown simply call > > > > ->remove is often times the least-effort path of doing something > > > > reasonable towards quiescing the hardware. Not to mention the lesser > > > > evil compared to not having a ->shutdown at all. > > > > > > > > That's not to say I am not in favor of a minimal shutdown procedure if > > > > possible. Notice how DSA has dsa_switch_shutdown() vs dsa_unregister_switch(). > > > > But judging what should go into dsa_switch_shutdown() was definitely not > > > > simple and there might still be corner cases that I missed - although it > > > > works for now, knock on wood. > > > > > > > > The reality is that you'll have a very hard time convincing people to > > > > write a dedicated code path for shutdown, if you can convince them to > > > > write one at all. They wouldn't even know if it does all the right > > > > things - it's not like you kexec every day (unless you're using Linux as > > > > a bootloader - but then again, if you do that you're kind of asking for > > > > trouble - the reason why this is the case is exactly because not having > > > > a ->shutdown hook implemented by drivers is an option, and the driver > > > > core doesn't e.g. fall back to calling the ->remove method, even with > > > > all the insanity that might ensue). > > > > > > Maybe I'm missing an important point here, but I thought it to be fine > > > for most drivers not to have a .shutdown hook. > > > > Depends on what you mean by "most drivers". One other case of definitely > > problematic things that ->shutdown must take care of are shared interrupts. > > I don't have a metric at hand, but there are definitely not few drivers > > which support IRQF_SHARED. Some of those don't implement ->shutdown. > > What I'm saying is that it would definitely go a long way for the > > problems caused by these to be solved in one fell swoop by having some > > logic to fall back to the ->remove path. > > > > > > > > Many drivers implement ->shutdown by calling ->remove for the simple > > > > > > reason that ->remove provides for a well-tested code path already, and > > > > > > leaves the hardware in a known state, workable for kexec and others. > > > > > > > > > > > > Many drivers have buses beneath them. Those buses go away when these > > > > > > drivers unregister, and so do their children. > > > > > > > > > > > > ============================================== > > > > > > > > > > > > => some drivers do both => children of these buses should expect to be > > > > > > potentially unregistered after they've been shut down. > > > > > > > > > > Do you know this happens, or do you "only" fear it might happen? > > > > > > > > Are you asking whether there are SPI controllers that implement > > > > ->shutdown as ->remove? > > > > > > No I ask if it happens a lot / sometimes / ever that a driver's remove > > > callback is run for a device that was already shut down. > > > > So if a SPI device is connected to one of the 3 SPI controllers > > mentioned by me below, it happens with 100% reproduction rate. Otherwise > > it happens with 0% reproduction rate. But you don't write a SPI device > > driver to work with just one SPI controller, ideally you write it to > > work with all. > > > > > > Just search for "\.shutdown" in drivers/spi. > > > > 3 out of 3 implementations call ->remove. > > > > > > > > If you really have time to waste, here, have fun: Lino Sanfilippo had > > > > not one, but two (!!!) reboot problems with his ksz9897 Ethernet switch > > > > connected to a Raspberry Pi, both of which were due to other drivers > > > > implementing their ->shutdown as ->remove. First driver was the DSA > > > > master/host port (bcmgenet), the other was the bcm2835_spi controller. > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210909095324.12978-1-LinoSanfilippo@gmx.de/ > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210912120932.993440-1-vladimir.oltean@nxp.com/ > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210917133436.553995-1-vladimir.oltean@nxp.com/ > > > > As soon as we implemented ->shutdown in DSA drivers (which we had mostly > > > > not done up until that thread) we ran into the surprise that ->remove > > > > will get called too. Yay. It wasn't trivial to sort out, but we did it > > > > eventually in a more systematic way. Not sure whether there's anything > > > > to change at the drivers/base/ level. > > > > > > What I wonder is: There are drivers that call .remove from .shutdown. Is > > > the right action to make other parts of the kernel robust with this > > > behaviour, or should the drivers changed to not call .remove from > > > .shutdown? > > > > > > IMHO this is a question of promises of/expectations against the core > > > device layer. It must be known if for a shut down device there is (and > > > should be) a possibility that .remove is called. Depending on that > > > device drivers must be ready for this to happen, or can rely on it not > > > to happen. > > > > > > From a global maintenance POV it would be good if it could not happen, > > > because then the complexity is concentrated to a small place (i.e. the > > > driver core, or maybe generic code in all subsystems) instead of making > > > each and every driver robust to this possible event that a considerable > > > part of the driver writers isn't aware of. > > > > IMO, if you can not offer a solid promise but merely a fragile one, then > > it is always better to be robust (which DSA now is). How would you > > propose that this particular promise could be fulfilled? Simply patch > > the known offending drivers today and hope more drivers won't do this in > > the future? Patching the device core to keep track of which devices > > were shut down, so as to not call into their ->remove method? > > Mind you, this issue was reported as a bug and had to be dealt with > > locally, for stable kernels, so changing the driver core was not an > > option. > > Fix things properly first, in Linus's tree, and then worry about stable > kernels. Never the other way around please. Thanks, all clear now. I don't know why I never thought about that.
diff --git a/drivers/net/dsa/vitesse-vsc73xx-core.c b/drivers/net/dsa/vitesse-vsc73xx-core.c index a4b1447ff055..4c18f619ec02 100644 --- a/drivers/net/dsa/vitesse-vsc73xx-core.c +++ b/drivers/net/dsa/vitesse-vsc73xx-core.c @@ -1216,12 +1216,10 @@ int vsc73xx_probe(struct vsc73xx *vsc) } EXPORT_SYMBOL(vsc73xx_probe); -int vsc73xx_remove(struct vsc73xx *vsc) +void vsc73xx_remove(struct vsc73xx *vsc) { dsa_unregister_switch(vsc->ds); gpiod_set_value(vsc->reset, 1); - - return 0; } EXPORT_SYMBOL(vsc73xx_remove); diff --git a/drivers/net/dsa/vitesse-vsc73xx-platform.c b/drivers/net/dsa/vitesse-vsc73xx-platform.c index fe4b154a0a57..f2715bee2173 100644 --- a/drivers/net/dsa/vitesse-vsc73xx-platform.c +++ b/drivers/net/dsa/vitesse-vsc73xx-platform.c @@ -116,13 +116,8 @@ static int vsc73xx_platform_remove(struct platform_device *pdev) { struct vsc73xx_platform *vsc_platform = platform_get_drvdata(pdev); - if (!vsc_platform) - return 0; - vsc73xx_remove(&vsc_platform->vsc); - platform_set_drvdata(pdev, NULL); - return 0; } diff --git a/drivers/net/dsa/vitesse-vsc73xx-spi.c b/drivers/net/dsa/vitesse-vsc73xx-spi.c index 645398901e05..6b33f754982b 100644 --- a/drivers/net/dsa/vitesse-vsc73xx-spi.c +++ b/drivers/net/dsa/vitesse-vsc73xx-spi.c @@ -163,13 +163,8 @@ static int vsc73xx_spi_remove(struct spi_device *spi) { struct vsc73xx_spi *vsc_spi = spi_get_drvdata(spi); - if (!vsc_spi) - return 0; - vsc73xx_remove(&vsc_spi->vsc); - spi_set_drvdata(spi, NULL); - return 0; } diff --git a/drivers/net/dsa/vitesse-vsc73xx.h b/drivers/net/dsa/vitesse-vsc73xx.h index 30b951504e65..30b1f0a36566 100644 --- a/drivers/net/dsa/vitesse-vsc73xx.h +++ b/drivers/net/dsa/vitesse-vsc73xx.h @@ -26,5 +26,5 @@ struct vsc73xx_ops { int vsc73xx_is_addr_valid(u8 block, u8 subblock); int vsc73xx_probe(struct vsc73xx *vsc); -int vsc73xx_remove(struct vsc73xx *vsc); +void vsc73xx_remove(struct vsc73xx *vsc); void vsc73xx_shutdown(struct vsc73xx *vsc);
vsc73xx_remove() returns zero unconditionally and no caller checks the returned value. So convert the function to return no value. For both the platform and the spi variant ..._get_drvdata() will never return NULL in .remove() because the remove callback is only called after the probe callback returned successfully and in this case driver data was set to a non-NULL value. Also setting driver data to NULL is not necessary, this is already done in the driver core in __device_release_driver(), so drop this from the remove callback, too. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> --- drivers/net/dsa/vitesse-vsc73xx-core.c | 4 +--- drivers/net/dsa/vitesse-vsc73xx-platform.c | 5 ----- drivers/net/dsa/vitesse-vsc73xx-spi.c | 5 ----- drivers/net/dsa/vitesse-vsc73xx.h | 2 +- 4 files changed, 2 insertions(+), 14 deletions(-)