spi: pl022: Remove timeout in polling mode operation

Message ID	20180713152733.2326-1-alexander.sverdlin@nokia.com (mailing list archive)
State	New, archived
From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
To: linux-spi@vger.kernel.org
Cc: Alexander Sverdlin <alexander.sverdlin@nokia.com>, Mark Brown <broonie@kernel.org>, Linus Walleij <linus.walleij@linaro.org>, Magnus Templing <magnus.templing@stericsson.com>
Subject: [PATCH] spi: pl022: Remove timeout in polling mode operation
Date: Fri, 13 Jul 2018 17:27:33 +0200

Alexander Sverdlin July 13, 2018, 3:27 p.m. UTC

Some tests show that bulk SPI accesses (255 bytes, the maximum the PL022
can handle) may take seconds, depending on CPU load. In this case vital SPI
accesses can fail because of user-space applications. Some other drivers
already do not have timeouts in polling mode.

Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
---
 drivers/spi/spi-pl022.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

Linus Walleij July 15, 2018, 10:01 a.m. UTC | #1

On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin
<alexander.sverdlin@nokia.com> wrote:

> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
> take seconds, depending on CPU load. In this case vital SPI accesses can
> fail because of user-space applications. Some other drivers already do not
> have timeouts in polling mode.
>
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>

Wow what system is this and how does that happen?

I guess it is fine, but the timeout is there for a reason still. What about
setting the timeout to a minute or something?

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-spi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Lars-Peter Clausen July 15, 2018, 6:03 p.m. UTC | #2

On 07/15/2018 12:01 PM, Linus Walleij wrote:
> On Fri, Jul 13, 2018 at 5:27 PM Alexander Sverdlin
> <alexander.sverdlin@nokia.com> wrote:
> 
>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>> take seconds, depending on CPU load. In this case vital SPI accesses can
>> fail because of user-space applications. Some other drivers already do not
>> have timeouts in polling mode.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> Wow what system is this and how does that happen?
> 
> I guess it is fine, but the timeout is there for a reason still. What about
> setting the timeout to a minute or something?

How about resetting the timeout if there is progress? E.g. have
readwriter() return whether it was able to read or write some data and
then reset the timeout. If the timeout is due to CPU contention
readwriter() should always be able to push/pull new data to/from the
hardware.
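
The idea above can be illustrated with a minimal user-space sketch. It is
not the pl022 driver code: struct fake_xfer, its fields and poll_transfer()
are hypothetical names invented for illustration, and readwriter() here only
borrows the name from the discussion. The assumption is that readwriter()
reports how many bytes it moved, so the timeout bounds the time without
progress rather than the total transfer time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the controller state (not the driver's struct). */
struct fake_xfer {
	size_t remaining;	/* bytes still to transfer */
	unsigned int period;	/* device accepts one byte every `period` calls; 0 = never */
	unsigned int calls;	/* number of readwriter() calls so far */
};

/* Returns the number of bytes moved so the caller can detect progress. */
static size_t readwriter(struct fake_xfer *x)
{
	if (x->period == 0 || x->remaining == 0)
		return 0;
	if (++x->calls % x->period != 0)
		return 0;
	x->remaining--;
	return 1;
}

/*
 * Poll until the transfer is done. Each loop iteration counts as one tick.
 * timeout_ticks limits the number of consecutive ticks without progress.
 * Returns true on completion, false on timeout.
 */
static bool poll_transfer(struct fake_xfer *x, unsigned long timeout_ticks)
{
	unsigned long idle = 0;

	assert(timeout_ticks > 0);
	while (x->remaining > 0) {
		if (readwriter(x) > 0) {
			idle = 0;	/* progress: restart the timeout window */
			continue;
		}
		if (++idle >= timeout_ticks)
			return false;	/* no progress for too long */
	}
	return true;
}
```

With period = 5 and timeout_ticks = 10, a 255-byte transfer takes far more
than 10 ticks in total but still completes, because no gap between bytes
reaches the limit. With period = 0 (a stuck device) the loop gives up after
10 idle ticks.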

Alexander Sverdlin July 16, 2018, 7:57 a.m. UTC | #3

Hello Linus,

On 15/07/18 12:01, Linus Walleij wrote:
>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>> take seconds, depending on CPU load. In this case vital SPI accesses can
>> fail because of user-space applications. Some other drivers already do not
>> have timeouts in polling mode.
>>
>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> Wow what system is this and how does that happen?

We observe this on two different Cortex-A15-based SoCs.
One has no DMA implemented in the PL022 (axxia); the other has DMA but no drivers
for it (keystone2).
Therefore both are forced into polling mode (PIO mode is much worse because IRQs
can stop process context for seconds if one accesses several megabytes on the MTD
flash in a bulk operation).

> I guess it is fine, but the timeout is there for a reason still. What about
> setting the timeout to a minute or something?

I think a timeout of this order of magnitude should be fine. The only
question is whether a fixed 60-120 sec is enough, or whether I should introduce a
Kconfig option with such a default value.

Alexander Sverdlin July 16, 2018, 8:05 a.m. UTC | #4

Hello!

On 15/07/18 20:03, Lars-Peter Clausen wrote:
>>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>>> take seconds, depending on CPU load. In this case vital SPI accesses can
>>> fail because of user-space applications. Some other drivers already do not
>>> have timeouts in polling mode.
>>>
>>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>> Wow what system is this and how does that happen?
>>
>> I guess it is fine, but the timeout is there for a reason still. What about
>> setting the timeout to a minute or something?
> How about resetting the timeout if there is progress? E.g. have
> readwriter() return whether it was able to read or write some data and
> then reset the timeout. If the timeout is due to CPU contention
> readwriter() should always be able to push/pull new data to/from the
> hardware.

Well, if it's scheduled at all. In our case a poor task is not scheduled at all
for 1-2 seconds (because of other high-priority tasks, which are there for a
reason as well).
So from the priority PoV we can allow a task reading from MTD to wait a couple of
seconds, but it's really strange for it to fail even though the MTD media itself
is not corrupted.
The risk increases with the number of tasks; hundreds of tasks at startup are
not rare, and if one has HZ=100... This 1 second timeout is prone to fail
on a highly loaded system.

The timeout itself is there to catch HW failures or errata; I suppose it can
tolerate some minutes as well...

Lars-Peter Clausen July 16, 2018, 8:21 a.m. UTC | #5

On 07/16/2018 10:05 AM, Alexander Sverdlin wrote:
> Hello!
> 
> On 15/07/18 20:03, Lars-Peter Clausen wrote:
>>>> Some tests show, that bulk SPI accesses (255 bytes, maximum PL022 can) may
>>>> take seconds, depending on CPU load. In this case vital SPI accesses can
>>>> fail because of user-space applications. Some other drivers already do not
>>>> have timeouts in polling mode.
>>>>
>>>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>>> Wow what system is this and how does that happen?
>>>
>>> I guess it is fine, but the timeout is there for a reason still. What about
>>> setting the timeout to a minute or something?
>> How about resetting the timeout if there is progress? E.g. have
>> readwriter() return whether it was able to read or write some data and
>> then reset the timeout. If the timeout is due to CPU contention
>> readwriter() should always be able to push/pull new data to/from the
>> hardware.
> 
> Well, if it's scheduled at all. In our case a poor task is not scheduled at all
> for 1-2 seconds (because of other high prio tasks which are here for a reason as
> well).
> So from the priority PoV we can allow a task reading from MTD to wait couple of
> seconds, but it's really strange for it to fail even though the MTD media itself
> is not corrupted.
> The risk increases with the number of tasks and hundreds of tasks in startup is
> not seldom and if one has HZ=100... This 1 second timeout is prone to fail
> in highly loaded system.

You'd run readwriter() before checking the timeout and reset the timeout if
it is able to make progress. Only if there was no progress you'd check the
timeout.
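
A minimal user-space sketch of this ordering, under stated assumptions: the
clock is simulated, struct sim and all function names are hypothetical, and
the 1000 ms value only mirrors the 1 second timeout mentioned in this
thread. The model loses the CPU for stall_ms before every loop iteration.
A fixed deadline for the whole transfer then fails although the hardware is
healthy, while running readwriter() first and checking the timeout only when
no data moved lets the transfer finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TIMEOUT_MS 1000UL	/* mirrors the 1 second timeout from the thread */

/* Hypothetical simulation state; nothing here is driver code. */
struct sim {
	unsigned long now_ms;	/* simulated clock */
	unsigned long stall_ms;	/* time the task is descheduled per iteration */
	size_t remaining;	/* bytes left to transfer */
	bool hw_ok;		/* false simulates a dead controller */
};

/* One loop iteration costs 1 ms plus the time spent off the CPU. */
static void lose_cpu(struct sim *s)
{
	s->now_ms += 1 + s->stall_ms;
}

/* Returns the number of bytes moved (0 or 1 in this model). */
static size_t readwriter(struct sim *s)
{
	if (!s->hw_ok || s->remaining == 0)
		return 0;
	s->remaining--;
	return 1;
}

/* Variant A: one fixed deadline for the whole transfer. */
static bool poll_fixed_deadline(struct sim *s)
{
	unsigned long deadline;

	assert(s != NULL);
	deadline = s->now_ms + TIMEOUT_MS;
	while (s->remaining > 0) {
		lose_cpu(s);
		if (s->now_ms > deadline)
			return false;
		readwriter(s);
	}
	return true;
}

/* Variant B: try to make progress first, check the timeout only if stuck. */
static bool poll_progress_first(struct sim *s)
{
	unsigned long deadline;

	assert(s != NULL);
	deadline = s->now_ms + TIMEOUT_MS;
	while (s->remaining > 0) {
		lose_cpu(s);
		if (readwriter(s) > 0) {
			deadline = s->now_ms + TIMEOUT_MS;	/* progress resets it */
			continue;
		}
		if (s->now_ms > deadline)
			return false;
	}
	return true;
}
```

With stall_ms = 2000 and a healthy controller, variant A gives up on the
first iteration and variant B completes all 255 bytes. With hw_ok = false,
variant B still times out, so a real hardware failure is caught.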

Alexander Sverdlin July 16, 2018, 8:44 a.m. UTC | #6

Hello Lars,

On 16/07/18 10:21, Lars-Peter Clausen wrote:
>> Well, if it's scheduled at all. In our case a poor task is not scheduled at all
>> for 1-2 seconds (because of other high prio tasks which are here for a reason as
>> well).
>> So from the priority PoV we can allow a task reading from MTD to wait couple of
>> seconds, but it's really strange for it to fail even though the MTD media itself
>> is not corrupted.
>> The risk increases with the number of tasks and hundreds of tasks in startup is
>> not seldom and if one has HZ=100... This 1 second timeout is prone to fail
>> in highly loaded system.
> You'd run readwriter() before checking the timeout and reset the timeout if
> it is able to make progress. Only if there was no progress you'd check the
> timeout.

I think your point is valid, but I don't feel this improvement alone will be enough.
What do you think about both increasing the timeout to 60-120 sec and resetting
the timeout if progress has been made?
Then I can prepare v2.

Mark Brown July 16, 2018, 11:17 a.m. UTC | #7

On Mon, Jul 16, 2018 at 10:44:46AM +0200, Alexander Sverdlin wrote:
> On 16/07/18 10:21, Lars-Peter Clausen wrote:

> > You'd run readwriter() before checking the timeout and reset the timeout if
> > it is able to make progress. Only if there was no progress you'd check the
> > timeout.

> I think your point is valid, but I don't feel only this improvement will be enough.
> What do you think about both increasing the timeout to 60-120sec and resetting
> timeout if a progress has been made?
> Then I can prepare v2.

That's sounding like an extremely high timeout, long enough that people
are likely to think that the system has locked up (and it'd probably be
triggering the scheduler warnings too).  It feels like either whatever
is consuming the CPU has a problem that needs fixing or we need some
system wide indication that there's something intentionally doing this
so other tasks should lift any timeout checks that they have.
