Message ID | 20180713152733.2326-1-alexander.sverdlin@nokia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin <alexander.sverdlin@nokia.com> wrote: > Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may > take seconds, depending on CPU load. In this case vital SPI accesses can > fail because of user-space applications. Some other drivers already do not > have timeouts in polling mode. > > Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Wow what system is this and how does that happen? I guess it is fine, but the timeout is there for a reason still. What about setting the timeout to a minute or something? Yours, Linus Walleij -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/15/2018 12:01 PM, Linus Walleij wrote: > On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin > <alexander.sverdlin@nokia.com> wrote: > >> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may >> take seconds, depending on CPU load. In this case vital SPI accesses can >> fail because of user-space applications. Some other drivers already do not >> have timeouts in polling mode. >> >> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> > > Wow what system is this and how does that happen? > > I guess it is fine, but the timeout is there for a reason still. What about > setting the timeout to a minute or something? How about resetting the timeout if there is progress? E.g. have readwriter() return whether it was able to read or write some data and then reset the timeout. If the timeout is due to CPU contention readwriter() should always be able to push/pull new data to/from the hardware. -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Linus, On 15/07/18 12:01, Linus Walleij wrote: >> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may >> take seconds, depending on CPU load. In this case vital SPI accesses can >> fail because of user-space applications. Some other drivers already do not >> have timeouts in polling mode. >> >> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> > Wow what system is this and how does that happen? we observe this on two different Cortex A15-based SoCs. One has no DMA implemented in PL022 (axxia), another one has DMA but has no drivers for it (keystone2). Therefore both forced to polling mode (PIO mode is much worse because IRQs can stop process context for seconds if one accesses several megabytes on the MTD flash in a bulk operation). > I guess it is fine, but the timeout is there for a reason still. What about > setting the timeout to a minute or something? I think it should be fine with some timeout in this order of magnitude, the only question is, is it enough to make it fixed 60-120 sec or should I introduce a Kconfig option with such a default value?
Hello! On 15/07/18 20:03, Lars-Peter Clausen wrote: >>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may >>> take seconds, depending on CPU load. In this case vital SPI accesses can >>> fail because of user-space applications. Some other drivers already do not >>> have timeouts in polling mode. >>> >>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> >> Wow what system is this and how does that happen? >> >> I guess it is fine, but the timeout is there for a reason still. What about >> setting the timeout to a minute or something? > How about resetting the timeout if there is progress? E.g. have > readwriter() return whether it was able to read or write some data and > then reset the timeout. If the timeout is due to CPU contention > readwriter() should always be able to push/pull new data to/from the > hardware. Well, if it's scheduled at all. In our case a poor task is not scheduled at all for 1-2 seconds (because of other high prio tasks which are here for a reason as well). So from the priority PoV we can allow a task reading from MTD to wait couple of seconds, but it's really strange for it to fail even though the MTD media itself is not corrupted. The risk increases with the number of tasks and hundreds of tasks in startup is not seldom and if one has HZ=100... This 1 second timeout is prone to fail in highly loaded system. The timeout itself is here to catch HW failures or erratas, I suppose it can tolerate some minutes as well...
On 07/16/2018 10:05 AM, Alexander Sverdlin wrote: > Hello! > > On 15/07/18 20:03, Lars-Peter Clausen wrote: >>>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may >>>> take seconds, depending on CPU load. In this case vital SPI accesses can >>>> fail because of user-space applications. Some other drivers already do not >>>> have timeouts in polling mode. >>>> >>>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> >>> Wow what system is this and how does that happen? >>> >>> I guess it is fine, but the timeout is there for a reason still. What about >>> setting the timeout to a minute or something? >> How about resetting the timeout if there is progress? E.g. have >> readwriter() return whether it was able to read or write some data and >> then reset the timeout. If the timeout is due to CPU contention >> readwriter() should always be able to push/pull new data to/from the >> hardware. > > Well, if it's scheduled at all. In our case a poor task is not scheduled at all > for 1-2 seconds (because of other high prio tasks which are here for a reason as > well). > So from the priority PoV we can allow a task reading from MTD to wait couple of > seconds, but it's really strange for it to fail even though the MTD media itself > is not corrupted. > The risk increases with the number of tasks and hundreds of tasks in startup is > not seldom and if one has HZ=100... This 1 second timeout is prone to fail > in highly loaded system. You'd run readwriter() before checking the timeout and reset the timeout if it is able to make progress. Only if there was no progress you'd check the timeout. -- To unsubscribe from this list: send the line "unsubscribe linux-spi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Lars, On 16/07/18 10:21, Lars-Peter Clausen wrote: >> Well, if it's scheduled at all. In our case a poor task is not scheduled at all >> for 1-2 seconds (because of other high prio tasks which are here for a reason as >> well). >> So from the priority PoV we can allow a task reading from MTD to wait couple of >> seconds, but it's really strange for it to fail even though the MTD media itself >> is not corrupted. >> The risk increases with the number of tasks and hundreds of tasks in startup is >> not seldom and if one has HZ=100... This 1 second timeout is prone to fail >> in highly loaded system. > You'd run readwriter() before checking the timeout and reset the timeout if > it is able to make progress. Only if there was no progress you'd check the > timeout. I think your point is valid, but I don't feel only this improvement will be enough. What do you think about both increasing the timeout to 60-120sec and resetting timeout if a progress has been made? Then I can prepare v2.
On Mon, Jul 16, 2018 at 10:44:46AM +0200, Alexander Sverdlin wrote: > On 16/07/18 10:21, Lars-Peter Clausen wrote: > > You'd run readwriter() before checking the timeout and reset the timeout if > > it is able to make progress. Only if there was no progress you'd check the > > timeout. > I think your point is valid, but I don't feel only this improvement will be enough. > What do you think about both increasing the timeout to 60-120sec and resetting > timeout if a progress has been made? > Then I can prepare v2. That's sounding like an extremely high timeout, long enough that people are likely to think that the system has locked up (and it'd probably be triggering the scheduler warnings too). It feels like either whatever is consuming the CPU has a problem that needs fixing or we need some system wide indication that there's something intentionally doing this so other tasks should lift any timeout checks that they have.
diff --git a/drivers/spi/spi-pl022.c b/drivers/spi/spi-pl022.c index 1af8c96b940e..1816774714a4 100644 --- a/drivers/spi/spi-pl022.c +++ b/drivers/spi/spi-pl022.c @@ -294,8 +294,6 @@ #define CLEAR_ALL_INTERRUPTS 0x3 -#define SPI_POLLING_TIMEOUT 1000 - /* * The type of reading going on on this chip */ @@ -1491,7 +1489,6 @@ static void do_polling_transfer(struct pl022 *pl022) struct spi_transfer *transfer = NULL; struct spi_transfer *previous = NULL; struct chip_data *chip; - unsigned long time, timeout; chip = pl022->cur_chip; message = pl022->cur_msg; @@ -1531,16 +1528,8 @@ static void do_polling_transfer(struct pl022 *pl022) dev_dbg(&pl022->adev->dev, "polling transfer ongoing ...\n"); - timeout = jiffies + msecs_to_jiffies(SPI_POLLING_TIMEOUT); while (pl022->tx < pl022->tx_end || pl022->rx < pl022->rx_end) { - time = jiffies; readwriter(pl022); - if (time_after(time, timeout)) { - dev_warn(&pl022->adev->dev, - "%s: timeout!\n", __func__); - message->state = STATE_ERROR; - goto out; - } cpu_relax(); } @@ -1551,7 +1540,7 @@ static void do_polling_transfer(struct pl022 *pl022) /* Move to next transfer */ message->state = next_transfer(pl022); } -out: + /* Handle end of message */ if (message->state == STATE_DONE) message->status = 0; @@ -1559,7 +1548,6 @@ static void do_polling_transfer(struct pl022 *pl022) message->status = -EIO; giveback(pl022); - return; } static int pl022_transfer_one_message(struct spi_master *master,
Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may take seconds, depending on CPU load. In this case vital SPI accesses can fail because of user-space applications. Some other drivers already do not have timeouts in polling mode. Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> --- drivers/spi/spi-pl022.c | 14 +------------- 1 file changed, 1 insertion(+), 13 deletions(-)