Message ID | 20240228100354.3285-2-wsa+renesas@sang-engineering.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Geert Uytterhoeven |
Headers | show |
Series | [RFT] mmc: tmio: avoid concurrent runs of mmc_request_done() | expand |
Hi Wolfram, On 28.02.2024 11:03, Wolfram Sang wrote: > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq' > outside of the spinlock protected critical section. That leaves a small > race window during execution of 'tmio_mmc_reset()' where the done_work > handler could grab a pointer to the now invalid 'host->mrq'. Both would > use it to call mmc_request_done() causing problems (see Link). > > However, 'host->mrq' cannot simply be cleared earlier inside the > critical section. That would allow new mrqs to come in asynchronously > while the actual reset of the controller still needs to be done. So, > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from > coming in but still avoiding concurrency between work handlers. > > Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/ > Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race") Tested-by: Dirk Behme <dirk.behme@de.bosch.com> Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com> > --- > > Dirk: could you get this tested on your affected setups? I am somewhat > optimistic that this is already enough. For sure, it is a needed first > step. Testing looks good :) Many thanks! At least the issues we observed before are not seen any more. As we are not exactly sure on the root cause, of course this is not a 100% proof. But as the change looks good, looks like it won't break something and the system behaves good with it I would say we are good to go. I think we could add anything like Cc: stable@vger.kernel.org # 3.0+ ? > drivers/mmc/host/tmio_mmc_core.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c > index be7f18fd4836..c253d176db69 100644 > --- a/drivers/mmc/host/tmio_mmc_core.c > +++ b/drivers/mmc/host/tmio_mmc_core.c > @@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work) > else > mrq->cmd->error = -ETIMEDOUT; > > + /* No new calls yet, but disallow concurrent tmio_mmc_done_work() */ > + host->mrq = ERR_PTR(-EBUSY); > host->cmd = NULL; > host->data = NULL; Thanks again! Dirk
Hi Dirk, > > With the to-be-fixed commit, the reset_work handler cleared 'host->mrq' > > outside of the spinlock protected critical section. That leaves a small > > race window during execution of 'tmio_mmc_reset()' where the done_work > > handler could grab a pointer to the now invalid 'host->mrq'. Both would > > use it to call mmc_request_done() causing problems (see Link). > > > > However, 'host->mrq' cannot simply be cleared earlier inside the > > critical section. That would allow new mrqs to come in asynchronously > > while the actual reset of the controller still needs to be done. So, > > like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from > > coming in but still avoiding concurrency between work handlers. > > > > Reported-by: Dirk Behme <dirk.behme@de.bosch.com> > > Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/ > > Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> > > Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race") > > Tested-by: Dirk Behme <dirk.behme@de.bosch.com> > Reviewed-by: Dirk Behme <dirk.behme@de.bosch.com> Awesome! Thanks for the super-fast tags! > At least the issues we observed before are not seen any more. As we are not > exactly sure on the root cause, of course this is not a 100% proof. But as > the change looks good, looks like it won't break something and the system > behaves good with it I would say we are good to go. I agree. We don't know if it is all you need. But there definitely was a race window and closing it removes some observed anomalies. Let's hope all of them :) I looked many times at the code and, to the best of my knowledge, don't see side effects. 'host->mrq' stays non-NULL, so new mrqs won't be added like before. Changing it to an ERR_PTR will only affect the check in the done_work handler which is what we want. But, of course, more eyes are always welcome. > I think we could add anything like > > Cc: stable@vger.kernel.org # 3.0+ Yes, we should definitely have that. I would have added it once your testing got good results. This affects every Renesas SDHI or Uniphier SD instance since 3.0 (12 years). Wow! So, thanks a ton for your report and assistance in debugging it. Very much appreciated! And, phew, I am happy that this solution does not make the locking more complex \o/ All the best, Wolfram
diff --git a/drivers/mmc/host/tmio_mmc_core.c b/drivers/mmc/host/tmio_mmc_core.c index be7f18fd4836..c253d176db69 100644 --- a/drivers/mmc/host/tmio_mmc_core.c +++ b/drivers/mmc/host/tmio_mmc_core.c @@ -259,6 +259,8 @@ static void tmio_mmc_reset_work(struct work_struct *work) else mrq->cmd->error = -ETIMEDOUT; + /* No new calls yet, but disallow concurrent tmio_mmc_done_work() */ + host->mrq = ERR_PTR(-EBUSY); host->cmd = NULL; host->data = NULL;
With the to-be-fixed commit, the reset_work handler cleared 'host->mrq' outside of the spinlock protected critical section. That leaves a small race window during execution of 'tmio_mmc_reset()' where the done_work handler could grab a pointer to the now invalid 'host->mrq'. Both would use it to call mmc_request_done() causing problems (see Link). However, 'host->mrq' cannot simply be cleared earlier inside the critical section. That would allow new mrqs to come in asynchronously while the actual reset of the controller still needs to be done. So, like 'tmio_mmc_set_ios()', an ERR_PTR is used to prevent new mrqs from coming in but still avoiding concurrency between work handlers. Reported-by: Dirk Behme <dirk.behme@de.bosch.com> Closes: https://lore.kernel.org/all/20240220061356.3001761-1-dirk.behme@de.bosch.com/ Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Fixes: df3ef2d3c92c ("mmc: protect the tmio_mmc driver against a theoretical race") --- Dirk: could you get this tested on your affected setups? I am somewhat optimistic that this is already enough. For sure, it is a needed first step. drivers/mmc/host/tmio_mmc_core.c | 2 ++ 1 file changed, 2 insertions(+)