Message ID | 20201117094517.5654-1-Sergey.Semin@baikalelectronics.ru (mailing list archive) |
---|---|
State | Accepted |
Commit | 4fae3a58ab59d8a286864d61fe1846283a0316f2 |
Headers | show |
Series | [RFC] spi: Take the SPI IO-mutex in the spi_setup() method | expand |
On Tue, Nov 17, 2020 at 11:45 AM Serge Semin <Sergey.Semin@baikalelectronics.ru> wrote: > > I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw: > Explicitly de-assert CS on SPI transfer completion") a concurrent usage of > the spidev devices with different chip-selects causes the "SPI transfer > timed out" error. I'll read this later... > + mutex_lock(&spi->controller->io_mutex); > + > if (spi->controller->setup) > status = spi->controller->setup(spi); > > if (spi->controller->auto_runtime_pm && spi->controller->set_cs) { > status = pm_runtime_get_sync(spi->controller->dev.parent); I didn't check what this lock is protecting, but have you checked all PM runtime callbacks if they are not taking the lock. When you call PM runtime functions with 'sync' it may include a lot of work, some of which may sleep (not a problem for mutex) and may take arbitrary locks (might be a deadlock in case of trying the same lock). > if (status < 0) { > + mutex_unlock(&spi->controller->io_mutex); > pm_runtime_put_noidle(spi->controller->dev.parent); > dev_err(&spi->controller->dev, "Failed to power device: %d\n", > status); > @@ -3354,6 +3357,8 @@ int spi_setup(struct spi_device *spi) > spi_set_cs(spi, false); > } > > + mutex_unlock(&spi->controller->io_mutex); > +
On Tue, Nov 17, 2020 at 12:56:44PM +0200, Andy Shevchenko wrote: > On Tue, Nov 17, 2020 at 11:45 AM Serge Semin > <Sergey.Semin@baikalelectronics.ru> wrote: > > > > I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw: > > Explicitly de-assert CS on SPI transfer completion") a concurrent usage of > > the spidev devices with different chip-selects causes the "SPI transfer > > timed out" error. > > I'll read this later... > > > + mutex_lock(&spi->controller->io_mutex); > > + > > if (spi->controller->setup) > > status = spi->controller->setup(spi); > > > > if (spi->controller->auto_runtime_pm && spi->controller->set_cs) { > > status = pm_runtime_get_sync(spi->controller->dev.parent); > > I didn't check what this lock is protecting, It is used to protect the SPI io operations. So it's locked only during the SPI memory operations and the SPI-message execution. That's the time when the core toggles the controller chip-selects by calling the spi_set_cs() method and the set_cs callback. > but have you checked all > PM runtime callbacks if they are not taking the lock. When you call PM > runtime functions with 'sync' it may include a lot of work, some of > which may sleep (not a problem for mutex) and may take arbitrary locks > (might be a deadlock in case of trying the same lock). Yeah, I understand that. Simple grepping hasn't showed anyone else but the SPI-core using it. So unless the controllers PM methods also call spi_setup() or request SPI-transfers, there shouldn't be a deadlock. Moreover as I can see from the __spi_pump_messages() method the IO-mutex is locked during the sync-suffixed PM-methods invocation. AFAICS locking io_mutex around the PM-methods here shouldn't cause problems. But of course testing it in various platforms/controllers is always welcome. -Sergey > > > if (status < 0) { > > + mutex_unlock(&spi->controller->io_mutex); > > pm_runtime_put_noidle(spi->controller->dev.parent); > > dev_err(&spi->controller->dev, "Failed to power device: %d\n", > > status); > > @@ -3354,6 +3357,8 @@ int spi_setup(struct spi_device *spi) > > spi_set_cs(spi, false); > > } > > > > + mutex_unlock(&spi->controller->io_mutex); > > + > > > -- > With Best Regards, > Andy Shevchenko
On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote: > method being called at the same time. In particular in calling the > spi_set_cs(false) while there is an SPI-transfer being executed. In my > case due to the commit cited above all CSs get to be switched off by > calling the spi_setup() for /dev/spidev0.1 while there is an concurrent > SPI-transfer execution performed on /dev/spidev0.0. Of course a situation > of the spi_setup() being called while there is an SPI-transfer being > executed for two different SPI peripheral devices of the same controller > may happen not only for the spidev driver, but for instance for MMC SPI + > some another device, or spi_setup() being called from an SPI-peripheral > probe method while some other device has already been probed and is being > used by a corresponding driver... It's documented that a driver's spi_setup() operation is supposed to support being able to be called concurrently with other transfers, see spi-summary.rst. > Of course I could have provided a fix affecting the DW APB SSI driver > only, for instance, by creating a mutual exclusive access to the set_cs > callback and setting/clearing only the bit responsible for the > corresponding chip-select. But after a short research I've discovered that > the problem most likely affects a lot of the other drivers: Yeah, problems with it are very common as the documentation has noted since forever. IIRC there was some problem triggered by trying to force it to be serialised but I can't remember what it was.
On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote: > On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote: > > > method being called at the same time. In particular in calling the > > spi_set_cs(false) while there is an SPI-transfer being executed. In my > > case due to the commit cited above all CSs get to be switched off by > > calling the spi_setup() for /dev/spidev0.1 while there is an concurrent > > SPI-transfer execution performed on /dev/spidev0.0. Of course a situation > > of the spi_setup() being called while there is an SPI-transfer being > > executed for two different SPI peripheral devices of the same controller > > may happen not only for the spidev driver, but for instance for MMC SPI + > > some another device, or spi_setup() being called from an SPI-peripheral > > probe method while some other device has already been probed and is being > > used by a corresponding driver... > > It's documented that a driver's spi_setup() operation is supposed to > support being able to be called concurrently with other transfers, see > spi-summary.rst. > > > Of course I could have provided a fix affecting the DW APB SSI driver > > only, for instance, by creating a mutual exclusive access to the set_cs > > callback and setting/clearing only the bit responsible for the > > corresponding chip-select. But after a short research I've discovered that > > the problem most likely affects a lot of the other drivers: > > Yeah, problems with it are very common as the documentation has noted > since forever. IIRC there was some problem triggered by trying to force > it to be serialised but I can't remember what it was. Does it mean nack for this patch from you? So you suggest to fix the controller driver instead, right? If so the best solution would be to just lock the IO mutex in the set_cs callback of the DW APB SSI driver... -Sergey
On Wed, Nov 18, 2020 at 07:29:31PM +0300, Serge Semin wrote: > On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote: > > Yeah, problems with it are very common as the documentation has noted > > since forever. IIRC there was some problem triggered by trying to force > > it to be serialised but I can't remember what it was. > Does it mean nack for this patch from you? So you suggest to fix the controller > driver instead, right? If so the best solution would be to just lock the > IO mutex in the set_cs callback of the DW APB SSI driver... I'm not 100% clear what the original issue was, given that this is a constant source of errors in drivers it seems like it should be better to change the core but since I don't know why we have this the way it is it's hard to tell what special cases we might have that could explode if we try to do so. I *think* the main issue is things that don't actually have separate per device registers trying to configure the single set of controler registers shared by all devices in which case the locking is fine and helps with this specific case where it's a read/modify/write operation on per device stuff and this makes sense.
Hi Mark, On Thu, Nov 19, 2020 at 7:45 PM Mark Brown <broonie@kernel.org> wrote: > On Wed, Nov 18, 2020 at 07:29:31PM +0300, Serge Semin wrote: > > On Wed, Nov 18, 2020 at 01:16:04PM +0000, Mark Brown wrote: > > > Yeah, problems with it are very common as the documentation has noted > > > since forever. IIRC there was some problem triggered by trying to force > > > it to be serialised but I can't remember what it was. > > > Does it mean nack for this patch from you? So you suggest to fix the controller > > driver instead, right? If so the best solution would be to just lock the > > IO mutex in the set_cs callback of the DW APB SSI driver... > > I'm not 100% clear what the original issue was, given that this is a > constant source of errors in drivers it seems like it should be better > to change the core but since I don't know why we have this the way it is > it's hard to tell what special cases we might have that could explode if > we try to do so. I *think* the main issue is things that don't actually > have separate per device registers trying to configure the single set of > controler registers shared by all devices in which case the locking is > fine and helps with this specific case where it's a read/modify/write > operation on per device stuff and this makes sense. It's also an issue on SPI controllers with a single native chipselect, tricking the driver author into believing that writing to registers during .setup() is not an issue. Until an integrator starts using cs-gpios... Gr{oetje,eeting}s, Geert
On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote: > Of course I could have provided a fix affecting the DW APB SSI driver > only, for instance, by creating a mutual exclusive access to the set_cs > callback and setting/clearing only the bit responsible for the > corresponding chip-select. But after a short research I've discovered that I think the driver needs a fix anyway for the case where there's a mix of devices with standard and inverted chip selects, it assumes they all have the same polarity AFAICT.
On Fri, Nov 20, 2020 at 05:17:48PM +0000, Mark Brown wrote: > On Tue, Nov 17, 2020 at 12:45:17PM +0300, Serge Semin wrote: > > > Of course I could have provided a fix affecting the DW APB SSI driver > > only, for instance, by creating a mutual exclusive access to the set_cs > > callback and setting/clearing only the bit responsible for the > > corresponding chip-select. But after a short research I've discovered that > > I think the driver needs a fix anyway for the case where there's a mix > of devices with standard and inverted chip selects, it assumes they all > have the same polarity AFAICT. No. The polarity inversion isn't supported by the DW APB SSI controller. Native chip-select is always active-low, while a corresponding peripheral is activated by setting a bit in the SER register (Slave Enable Register). So as long as the dw_spi_set_cs() callback isn't called while there is a SPI-message execution going on it shall be save do preserve the current version of the method. -Sergey
On Tue, 17 Nov 2020 12:45:17 +0300, Serge Semin wrote: > I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw: > Explicitly de-assert CS on SPI transfer completion") a concurrent usage of > the spidev devices with different chip-selects causes the "SPI transfer > timed out" error. The root cause of the problem has turned to be in a race > condition of the SPI-transfer execution procedure and the spi_setup() > method being called at the same time. In particular in calling the > spi_set_cs(false) while there is an SPI-transfer being executed. In my > case due to the commit cited above all CSs get to be switched off by > calling the spi_setup() for /dev/spidev0.1 while there is an concurrent > SPI-transfer execution performed on /dev/spidev0.0. Of course a situation > of the spi_setup() being called while there is an SPI-transfer being > executed for two different SPI peripheral devices of the same controller > may happen not only for the spidev driver, but for instance for MMC SPI + > some another device, or spi_setup() being called from an SPI-peripheral > probe method while some other device has already been probed and is being > used by a corresponding driver... > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next Thanks! [1/1] spi: Take the SPI IO-mutex in the spi_setup() method commit: 4fae3a58ab59d8a286864d61fe1846283a0316f2 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c index 0cab239d8e7f..353fa178e39b 100644 --- a/drivers/spi/spi.c +++ b/drivers/spi/spi.c @@ -3327,12 +3327,15 @@ int spi_setup(struct spi_device *spi) if (!spi->max_speed_hz) spi->max_speed_hz = spi->controller->max_speed_hz; + mutex_lock(&spi->controller->io_mutex); + if (spi->controller->setup) status = spi->controller->setup(spi); if (spi->controller->auto_runtime_pm && spi->controller->set_cs) { status = pm_runtime_get_sync(spi->controller->dev.parent); if (status < 0) { + mutex_unlock(&spi->controller->io_mutex); pm_runtime_put_noidle(spi->controller->dev.parent); dev_err(&spi->controller->dev, "Failed to power device: %d\n", status); @@ -3354,6 +3357,8 @@ int spi_setup(struct spi_device *spi) spi_set_cs(spi, false); } + mutex_unlock(&spi->controller->io_mutex); + if (spi->rt && !spi->controller->rt) { spi->controller->rt = true; spi_set_thread_rt(spi->controller);
I've discovered that due to the recent commit 49d7d695ca4b ("spi: dw: Explicitly de-assert CS on SPI transfer completion") a concurrent usage of the spidev devices with different chip-selects causes the "SPI transfer timed out" error. The root cause of the problem has turned to be in a race condition of the SPI-transfer execution procedure and the spi_setup() method being called at the same time. In particular in calling the spi_set_cs(false) while there is an SPI-transfer being executed. In my case due to the commit cited above all CSs get to be switched off by calling the spi_setup() for /dev/spidev0.1 while there is an concurrent SPI-transfer execution performed on /dev/spidev0.0. Of course a situation of the spi_setup() being called while there is an SPI-transfer being executed for two different SPI peripheral devices of the same controller may happen not only for the spidev driver, but for instance for MMC SPI + some another device, or spi_setup() being called from an SPI-peripheral probe method while some other device has already been probed and is being used by a corresponding driver... Of course I could have provided a fix affecting the DW APB SSI driver only, for instance, by creating a mutual exclusive access to the set_cs callback and setting/clearing only the bit responsible for the corresponding chip-select. But after a short research I've discovered that the problem most likely affects a lot of the other drivers: - drivers/spi/spi-sun4i.c - RMW the chip-select register; - drivers/spi/spi-rockchip.c - RMW the chip-select register; - drivers/spi/spi-qup.c - RMW a generic force-CS flag in a CSR. - drivers/spi/spi-sifive.c - set a generic CS-mode flag in a CSR. - drivers/spi/spi-bcm63xx-hsspi.c - uses an internal mutex to serialize the bus config changes, but still isn't protected from the race condition described above; - drivers/spi/spi-geni-qcom.c - RMW a chip-select internal flag and set the CS state in HW; - drivers/spi/spi-orion.c - RMW a chip-select register; - drivers/spi/spi-cadence.c - RMW a chip-select register; - drivers/spi/spi-armada-3700.c - RMW a chip-select register; - drivers/spi/spi-lantiq-ssc.c - overwrites the chip-select register; - drivers/spi/spi-sun6i.c - RMW a chip-select register; - drivers/spi/spi-synquacer.c - RMW a chip-select register; - drivers/spi/spi-altera.c - directly sets the chip-select state; - drivers/spi/spi-omap2-mcspi.c - RMW an internally cached CS state and writes it to HW; - drivers/spi/spi-mt65xx.c - RMW some CSR; - drivers/spi/spi-jcore.c - directly sets the chip-selects state; - drivers/spi/spi-mt7621.c - RMW a chip-select register; I could have missed some drivers, but a scale of the problem is obvious. As you can see most of the drivers perform an unprotected Read-modify-write chip-select register modification in the set_cs callback. Seeing the spi_setup() function is calling the spi_set_cs() and it can be executed concurrently with SPI-transfers exec procedure, which also calls spi_set_cs() in the SPI core spi_transfer_one_message() method, the race condition of the register modification turns to be obvious. To sum up the problem denoted above affects each driver for a controller having more than one chip-select lane and which: 1) performs the RMW to some CS-related register with no serialization; 2) directly disables any CS on spi_set_cs(dev, false). * the later is the case of the DW APB SSI driver. The controllers which equipped with a single CS theoretically can also experience the problem, but in practice will not since normally the spi_setup() isn't called concurrently with the SPI-transfers executed on the same SPI peripheral device. In order to generically fix the denoted bug I'd suggest to serialize an access to the controller IO by taking the IO mutex in the spi_setup() callback. The mutex is held while there is an SPI communication going on on the SPI-bus of the corresponding SPI-controller. So calling the spi_setup() method and disabling/updating the CS state within it would be safe while there is no any SPI-transfers being executed. Also note I suppose it would be safer to protect the spi_controller->setup() callback invocation too, seeing some of the SPI-controller drivers update a HW state in there. Fixes: 49d7d695ca4b ("spi: dw: Explicitly de-assert CS on SPI transfer completion") Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> --- drivers/spi/spi.c | 5 +++++ 1 file changed, 5 insertions(+)