[RFT] mmc: tmio: avoid concurrent runs of mmc_request_done()

Message ID: 20240228100354.3285-2-wsa+renesas@sang-engineering.com
State: New, archived

Commit Message

Wolfram Sang Feb. 28, 2024, 10:03 a.m. UTC
With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
outside of the spinlock protected critical section. That leaves a small
race window during execution of 'tmio_mmc_reset()' where the done_work
handler could grab a pointer to the now invalid 'host->mrq'. Both would
use it to call mmc_request_done() causing problems (see Link).

However, 'host->mrq' cannot simply be cleared earlier inside the
critical section. That would allow new mrqs to come in asynchronously
while the actual reset of the controller still needs to be done. So,
like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
coming in but still avoiding concurrency between work handlers.

Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
---

Dirk: could you get this tested on your affected setups? I am somewhat
optimistic that this is already enough. For sure, it is a needed first
step.

 drivers/mmc/host/tmio_mmc_core.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Dirk Behme Feb. 29, 2024, 6:21 a.m. UTC | #1
Hi Wolfram,

On 28.02.2024 11:03, Wolfram Sang wrote:
> With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> outside of the spinlock protected critical section. That leaves a small
> race window during execution of 'tmio_mmc_reset()' where the done_work
> handler could grab a pointer to the now invalid 'host->mrq'. Both would
> use it to call mmc_request_done() causing problems (see Link).
> 
> However, 'host->mrq' cannot simply be cleared earlier inside the
> critical section. That would allow new mrqs to come in asynchronously
> while the actual reset of the controller still needs to be done. So,
> like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> coming in but still avoiding concurrency between work handlers.
> 
> Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
> Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")

Tested-by: Dirk Behme <dirk.behme@de.bosch.com>
Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com>

> ---
> 
> Dirk: could you get this tested on your affected setups? I am somewhat
> optimistic that this is already enough. For sure, it is a needed first
> step.

Testing looks good :) Many thanks!

At least the issues we observed before are no longer seen. As we are
not exactly sure about the root cause, this is of course not 100% proof.
But as the change looks good, does not appear to break anything, and
the system behaves well with it, I would say we are good to go.

I think we could add something like

Cc: stable@vger.kernel.org # 3.0+

?

>   drivers/mmc/host/tmio_mmc_core.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
> index be7f18fd4836..c253d176db69 100644
> --- a/drivers/mmc/host/tmio_mmc_core.c
> +++ b/drivers/mmc/host/tmio_mmc_core.c
> @@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work)
>   	else
>   		mrq->cmd->error = -ETIMEDOUT;
>   
> +	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
> +	host->mrq = ERR_PTR(-EBUSY);
>   	host->cmd = NULL;
>   	host->data = NULL;

Thanks again!

Dirk

Wolfram Sang Feb. 29, 2024, 7:33 a.m. UTC | #2
Hi Dirk,

> > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq'
> > outside of the spinlock protected critical section. That leaves a small
> > race window during execution of 'tmio_mmc_reset()' where the done_work
> > handler could grab a pointer to the now invalid 'host->mrq'. Both would
> > use it to call mmc_request_done() causing problems (see Link).
> > 
> > However, 'host->mrq' cannot simply be cleared earlier inside the
> > critical section. That would allow new mrqs to come in asynchronously
> > while the actual reset of the controller still needs to be done. So,
> > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from
> > coming in but still avoiding concurrency between work handlers.
> > 
> > Reported-by: Dirk Behme <dirk.behme@de.bosch.com>
> > Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/
> > Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
> > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race")
> 
> Tested-by: Dirk Behme <dirk.behme@de.bosch.com>
> Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com>

Awesome! Thanks for the super-fast tags!

> At least the issues we observed before are no longer seen. As we are not
> exactly sure about the root cause, this is of course not 100% proof. But as
> the change looks good, does not appear to break anything, and the system
> behaves well with it, I would say we are good to go.

I agree. We don't know if it is all you need. But there definitely was a
race window and closing it removes some observed anomalies. Let's hope
all of them :) I looked many times at the code and, to the best of my
knowledge, don't see side effects. 'host->mrq' stays non-NULL, so new
mrqs won't be added like before. Changing it to an ERR_PTR will only
affect the check in the done_work handler which is what we want. But, of
course, more eyes are always welcome.
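
To make the reasoning above easier to review, here is a minimal
userspace sketch of the sentinel idea. It is not the driver code: every
model_* name is hypothetical, the ERR_PTR helpers are simplified
stand-ins for the kernel ones, and the spinlock is only indicated in
comments because the scenario runs single-threaded. It assumes the
submit path refuses any non-NULL 'host->mrq' and that the done_work
path skips NULL and error pointers, as described in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MODEL_MAX_ERRNO 4095

static inline void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline bool model_is_err_or_null(const void *p)
{
	return !p || (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_request { int error; int done_calls; };
struct model_host { struct model_request *mrq; };

/* Hypothetical submit path: any non-NULL host->mrq blocks new requests. */
static int model_submit(struct model_host *host, struct model_request *mrq)
{
	if (host->mrq)
		return -EAGAIN;
	host->mrq = mrq;
	return 0;
}

/* Hypothetical done_work: only completes a real request it still owns. */
static bool model_done_work(struct model_host *host)
{
	struct model_request *mrq = host->mrq;	/* read under the lock */

	if (model_is_err_or_null(mrq))
		return false;
	host->mrq = NULL;
	mrq->done_calls++;
	return true;
}

/* reset_work, first half: runs inside the critical section. */
static struct model_request *model_reset_begin(struct model_host *host)
{
	struct model_request *mrq = host->mrq;

	if (model_is_err_or_null(mrq))
		return NULL;
	mrq->error = -ETIMEDOUT;
	/* No new calls yet, but keep done_work away from this request. */
	host->mrq = model_err_ptr(-EBUSY);
	return mrq;
}

/* reset_work, second half: after the controller reset, lock dropped. */
static void model_reset_end(struct model_host *host, struct model_request *mrq)
{
	host->mrq = NULL;	/* ready for new calls */
	mrq->done_calls++;
}

/* Walks through the former race window; returns completions of 'first'. */
static int model_scenario(void)
{
	struct model_request first = { 0 }, second = { 0 };
	struct model_host host = { .mrq = NULL };
	struct model_request *owned;

	assert(model_submit(&host, &first) == 0);
	owned = model_reset_begin(&host);
	assert(owned == &first);

	/* Controller reset in progress: neither path may touch 'first'. */
	assert(!model_done_work(&host));
	assert(model_submit(&host, &second) == -EAGAIN);

	model_reset_end(&host, owned);
	assert(model_submit(&host, &second) == 0);
	return first.done_calls;
}
```

Calling model_scenario() from a test main() shows the point of the
sentinel: while the reset is pending, the done_work model backs off and
new requests are still refused, so the timed-out request is completed
once only.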

> I think we could add something like
> 
> Cc: stable@vger.kernel.org # 3.0+

Yes, we should definitely have that. I would have added it once your
testing got good results. This affects every Renesas SDHI or Uniphier SD
instance since 3.0 (12 years). Wow! So, thanks a ton for your report and
assistance in debugging it. Very much appreciated! And, phew, I am happy
that this solution does not make the locking more complex \o/

All the best,

   Wolfram

Patch

diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c
index be7f18fd4836..c253d176db69 100644
--- a/drivers/mmc/host/tmio_mmc_core.c
+++ b/drivers/mmc/host/tmio_mmc_core.c
@@ -259,6 +259,8 @@  static void tmio_mmc_reset_work(struct work_struct *work)
 	else
 		mrq->cmd->error = -ETIMEDOUT;
 
+	/* No new calls yet, but disallow concurrent tmio_mmc_done_work() */
+	host->mrq = ERR_PTR(-EBUSY);
 	host->cmd = NULL;
 	host->data = NULL;