diff mbox

spi: pl022: Remove timeout in polling mode operation

Message ID 20180713152733.2326-1-alexander.sverdlin@nokia.com (mailing list archive)
State New, archived
Headers show

Commit Message

Alexander Sverdlin July 13, 2018, 3:27 p.m. UTC
Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
take seconds, depending on CPU load. In this case vital SPI accesses can
fail because of user-space applications. Some other drivers already do not
have timeouts in polling mode.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
---
 drivers/spi/spi-pl022.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

Comments

Linus Walleij July 15, 2018, 10:01 a.m. UTC | #1
On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin
<alexander.sverdlin@nokia.com> wrote:

> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
> take seconds, depending on CPU load. In this case vital SPI accesses can
> fail because of user-space applications. Some other drivers already do not
> have timeouts in polling mode.
>
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>

Wow what system is this and how does that happen?

I guess it is fine, but the timeout is there for a reason still. What about
setting the timeout to a minute or something?

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lars-Peter Clausen July 15, 2018, 6:03 p.m. UTC | #2
On 07/15/2018 12:01 PM, Linus Walleij wrote:
> On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin
> <alexander.sverdlin@nokia.com> wrote:
> 
>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>> take seconds, depending on CPU load. In this case vital SPI accesses can
>> fail because of user-space applications. Some other drivers already do not
>> have timeouts in polling mode.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Wow what system is this and how does that happen?
> 
> I guess it is fine, but the timeout is there for a reason still. What about
> setting the timeout to a minute or something?

How about resetting the timeout if there is progress? E.g. have
readwriter() return whether it was able to read or write some data and
then reset the timeout. If the timeout is due to CPU contention
readwriter() should always be able to push/pull new data to/from the
hardware.
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Sverdlin July 16, 2018, 7:57 a.m. UTC | #3
Hello Linus,

On 15/07/18 12:01, Linus Walleij wrote:
>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>> take seconds, depending on CPU load. In this case vital SPI accesses can
>> fail because of user-space applications. Some other drivers already do not
>> have timeouts in polling mode.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> Wow what system is this and how does that happen?

we observe this on two different Cortex A15-based SoCs.
One has no DMA implemented in PL022 (axxia), another one has DMA but has no drivers
for it (keystone2).
Therefore both forced to polling mode (PIO mode is much worse because IRQs
can stop process context for seconds if one accesses several megabytes on the MTD
flash in a bulk operation).

> I guess it is fine, but the timeout is there for a reason still. What about
> setting the timeout to a minute or something?

I think it should be fine with some timeout in this order of magnitude, the only
question is, is it enough to make it fixed 60-120 sec or should I introduce a 
Kconfig option with such a default value?
Alexander Sverdlin July 16, 2018, 8:05 a.m. UTC | #4
Hello!

On 15/07/18 20:03, Lars-Peter Clausen wrote:
>>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>>> take seconds, depending on CPU load. In this case vital SPI accesses can
>>> fail because of user-space applications. Some other drivers already do not
>>> have timeouts in polling mode.
>>>
>>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>> Wow what system is this and how does that happen?
>>
>> I guess it is fine, but the timeout is there for a reason still. What about
>> setting the timeout to a minute or something?
> How about resetting the timeout if there is progress? E.g. have
> readwriter() return whether it was able to read or write some data and
> then reset the timeout. If the timeout is due to CPU contention
> readwriter() should always be able to push/pull new data to/from the
> hardware.

Well, if it's scheduled at all. In our case a poor task is not scheduled at all
for 1-2 seconds (because of other high prio tasks which are here for a reason as
well).
So from the priority PoV we can allow a task reading from MTD to wait couple of
seconds, but it's really strange for it to fail even though the MTD media itself
is not corrupted.
The risk increases with the number of tasks and hundreds of tasks in startup is
not seldom and if one has HZ=100... This 1 second timeout is prone to fail
in highly loaded system.

The timeout itself is here to catch HW failures or erratas, I suppose it can
tolerate some minutes as well...
Lars-Peter Clausen July 16, 2018, 8:21 a.m. UTC | #5
On 07/16/2018 10:05 AM, Alexander Sverdlin wrote:
> Hello!
> 
> On 15/07/18 20:03, Lars-Peter Clausen wrote:
>>>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>>>> take seconds, depending on CPU load. In this case vital SPI accesses can
>>>> fail because of user-space applications. Some other drivers already do not
>>>> have timeouts in polling mode.
>>>>
>>>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>> Wow what system is this and how does that happen?
>>>
>>> I guess it is fine, but the timeout is there for a reason still. What about
>>> setting the timeout to a minute or something?
>> How about resetting the timeout if there is progress? E.g. have
>> readwriter() return whether it was able to read or write some data and
>> then reset the timeout. If the timeout is due to CPU contention
>> readwriter() should always be able to push/pull new data to/from the
>> hardware.
> 
> Well, if it's scheduled at all. In our case a poor task is not scheduled at all
> for 1-2 seconds (because of other high prio tasks which are here for a reason as
> well).
> So from the priority PoV we can allow a task reading from MTD to wait couple of
> seconds, but it's really strange for it to fail even though the MTD media itself
> is not corrupted.
> The risk increases with the number of tasks and hundreds of tasks in startup is
> not seldom and if one has HZ=100... This 1 second timeout is prone to fail
> in highly loaded system.

You'd run readwriter() before checking the timeout and reset the timeout if
it is able to make progress. Only if there was no progress you'd check the
timeout.
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander Sverdlin July 16, 2018, 8:44 a.m. UTC | #6
Hello Lars,

On 16/07/18 10:21, Lars-Peter Clausen wrote:
>> Well, if it's scheduled at all. In our case a poor task is not scheduled at all
>> for 1-2 seconds (because of other high prio tasks which are here for a reason as
>> well).
>> So from the priority PoV we can allow a task reading from MTD to wait couple of
>> seconds, but it's really strange for it to fail even though the MTD media itself
>> is not corrupted.
>> The risk increases with the number of tasks and hundreds of tasks in startup is
>> not seldom and if one has HZ=100... This 1 second timeout is prone to fail
>> in highly loaded system.
> You'd run readwriter() before checking the timeout and reset the timeout if
> it is able to make progress. Only if there was no progress you'd check the
> timeout.

I think your point is valid, but I don't feel only this improvement will be enough.
What do you think about both increasing the timeout to 60-120sec and resetting
timeout if a progress has been made?
Then I can prepare v2.
Mark Brown July 16, 2018, 11:17 a.m. UTC | #7
On Mon, Jul 16, 2018 at 10:44:46AM +0200, Alexander Sverdlin wrote:
> On 16/07/18 10:21, Lars-Peter Clausen wrote:

> > You'd run readwriter() before checking the timeout and reset the timeout if
> > it is able to make progress. Only if there was no progress you'd check the
> > timeout.

> I think your point is valid, but I don't feel only this improvement will be enough.
> What do you think about both increasing the timeout to 60-120sec and resetting
> timeout if a progress has been made?
> Then I can prepare v2.

That's sounding like an extremely high timeout, long enough that people
are likely to think that the system has locked up (and it'd probably be
triggering the scheduler warnings too).  It feels like either whatever
is consuming the CPU has a problem that needs fixing or we need some
system wide indication that there's something intentionally doing this
so other tasks should lift any timeout checks that they have.
diff mbox

Patch

diff --git a/drivers/spi/spi-pl022.c b/drivers/spi/spi-pl022.c
index 1af8c96b940e..1816774714a4 100644
--- a/drivers/spi/spi-pl022.c
+++ b/drivers/spi/spi-pl022.c
@@ -294,8 +294,6 @@ 
 
 #define CLEAR_ALL_INTERRUPTS  0x3
 
-#define SPI_POLLING_TIMEOUT 1000
-
 /*
  * The type of reading going on on this chip
  */
@@ -1491,7 +1489,6 @@  static void do_polling_transfer(struct pl022 *pl022)
 	struct spi_transfer *transfer = NULL;
 	struct spi_transfer *previous = NULL;
 	struct chip_data *chip;
-	unsigned long time, timeout;
 
 	chip = pl022->cur_chip;
 	message = pl022->cur_msg;
@@ -1531,16 +1528,8 @@  static void do_polling_transfer(struct pl022 *pl022)
 
 		dev_dbg(&pl022->adev->dev, "polling transfer ongoing ...\n");
 
-		timeout = jiffies + msecs_to_jiffies(SPI_POLLING_TIMEOUT);
 		while (pl022->tx < pl022->tx_end || pl022->rx < pl022->rx_end) {
-			time = jiffies;
 			readwriter(pl022);
-			if (time_after(time, timeout)) {
-				dev_warn(&pl022->adev->dev,
-				"%s: timeout!\n", __func__);
-				message->state = STATE_ERROR;
-				goto out;
-			}
 			cpu_relax();
 		}
 
@@ -1551,7 +1540,7 @@  static void do_polling_transfer(struct pl022 *pl022)
 		/* Move to next transfer */
 		message->state = next_transfer(pl022);
 	}
-out:
+
 	/* Handle end of message */
 	if (message->state == STATE_DONE)
 		message->status = 0;
@@ -1559,7 +1548,6 @@  static void do_polling_transfer(struct pl022 *pl022)
 		message->status = -EIO;
 
 	giveback(pl022);
-	return;
 }
 
 static int pl022_transfer_one_message(struct spi_master *master,