diff mbox series

[RFC,1/2] mmc: sdhci: Manually check card status after reset

Message ID 20190501175457.195855-1-rrangel@chromium.org (mailing list archive)
State New, archived
Headers show
Series [RFC,1/2] mmc: sdhci: Manually check card status after reset | expand

Commit Message

Raul Rangel May 1, 2019, 5:54 p.m. UTC
There is a race condition between resetting the SDHCI controller and
disconnecting the card.

For example:
0) Card is connected and transferring data
1) mmc_sd_reset is called to reset the controller due to a data error
2) sdhci_set_ios calls sdhci_do_reset
3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
configured.
4) Wait for SOFT_RESET_ALL to clear
5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
   IRQs are not configured a CARD_REMOVED interrupt is never raised.
6) IRQs are enabled again
7) mmc layer never notices the device is disconnected. The SDHCI layer
   will keep returning -ENOMEDIUM. This results in a card that is always
   present and not functional.

Signed-off-by: Raul E Rangel <rrangel@chromium.org>
---
You can see an example of the following two patches here:
https://privatebin.net/?b0f5953716d34ca6#C699bCBQ99NdvspfDW7CMucT8CJG4DgL+yUNPyepDCo=
Line 8213: EILSEQ
Line 8235: SDHC is hard reset
Line 8240: Controller completes reset and card is no longer present
Line 8379: mmc_sd_reset notices card is missing and issues a card_event
           and schedules a detect change.
Line 8402: Don't init the card since it's already gone.
Line 8717: Marks card as removed
Line 8820: mmc_sd_remove removes the block device

I am running into a kernel panic. A task gets stuck for more than 120
seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not
calling something correctly?

Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc=

I sometimes see the following:
[  547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time

I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host
So I'm guessing I'm missing an upstream fix.

Do the patches look correct or am I doing something that would cause a
kernel panic?

I have a DUT setup with a GPIO I can use to toggle the CD pin. I ran a
test where I connect and then randomly, between 0s - 1s disconnect the
card. This got over 20k iterations before the panic. Though when I do it
manually and stop for 2 minutes the panic happens.

Any help would be appreciated.

Thanks,
Raul


 drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Raul Rangel May 3, 2019, 3:12 p.m. UTC | #1
On Wed, May 01, 2019 at 11:54:56AM -0600, Raul E Rangel wrote:
> I am running into a kernel panic. A task gets stuck for more than 120
> seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not
> calling something correctly?
> 
> Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc=
> 
> I sometimes see the following:
> [  547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time
> 
> I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host
> So I'm guessing I'm missing an upstream fix.
> 
So I tried these patches on the 5.1 with the memory leak patch applied:
https://patchwork.kernel.org/patch/10927541/

Everything works as expected:
    mmc0: new high speed SDHC card at address 0001
    mmcblk0: mmc0:0001 00000 7.41 GiB
     mmcblk0: p1 p2
    mmc0: card status changed during reset
    mmc0: card was removed during reset
    mmc0: tried to HW reset card, got error -123
    print_req_error: I/O error, dev mmcblk0, sector 2072 flags 80700
    print_req_error: I/O error, dev mmcblk0, sector 2073 flags 80700
    mmc0: card 0001 removed

    mmc0: new high speed SDHC card at address 0001
    mmcblk0: mmc0:0001 00000 7.41 GiB
     mmcblk0: p1 p2
    mmc0: card status changed during reset
    mmc0: card was removed during reset
    mmc0: tried to HW reset card, got error -123
    print_req_error: I/O error, dev mmcblk0, sector 12584832 flags 80700
    print_req_error: I/O error, dev mmcblk0, sector 12584833 flags 80700
    mmc0: card 0001 removed
    Buffer I/O error on dev mmcblk0p2, logical block 369904, async page read

    mmc0: new high speed SDHC card at address 0001
    mmcblk0: mmc0:0001 00000 7.41 GiB
     mmcblk0: p1 p2
    mmc0: card 0001 removed

I also ran another test. Using the DUT with a GPIO to randomly toggle the
CD pin I was able to get 4k+ iterations without any problems. So I think
the patches are good.

    --- Iteration 4506 ----
    Disconnecting card
    [ 7444.370169] mmc0: card 0001 removed
    Connecting Card and sleeping for 0.31476s
    [ 7444.682177] mmc0: new high speed SDHC card at address 0001
    [ 7444.719038] mmcblk0: mmc0:0001 00000 7.41 GiB
    [ 7444.727375]  mmcblk0: p1 p2
    Card Disconnected
    [ 7444.831896] mmc0: Card removed during transfer!
    [ 7444.831914] mmc0: Resetting controller.
    --- Iteration 4507 ----
    Connecting Card and sleeping for 0.30259s
    [ 7445.033649] mmc0: card 0001 removed
    Card Disconnected
    [ 7445.307008] mmc0: error -123 whilst initialising SD card
    --- Iteration 4508 ----
    Connecting Card and sleeping for 0.24827s
    Card Disconnected
    [ 7445.716033] mmc0: Card removed during transfer!
    [ 7445.716052] mmc0: Resetting controller.
    [ 7445.716489] mmc0: error -123 whilst initialising SD card
    --- Iteration 4509 ----
    Connecting Card and sleeping for 0.14132s
    Card Disconnected
    --- Iteration 4510 ----
    Connecting Card and sleeping for 0.26001s
    Card Disconnected
    [ 7446.436079] mmc0: error -123 whilst initialising SD card
    Connecting card to verify if it died
    [ 7446.906804] mmc0: new high speed SDHC card at address 0001
    [ 7446.933631] mmcblk0: mmc0:0001 00000 7.41 GiB
    [ 7446.949224]  mmcblk0: p1 p2
    SDHC still functional

Should I send the patches out again without the RFC tag?

I'll keep trying to track down the hung task I was seeing on 4.14. But I
don't think that's related to these patches. I might just end up
backporting the blk-mq patches to our 4.14 branch since I suspect that
fixes it.

Thanks,
Raul
Raul Rangel May 8, 2019, 7 p.m. UTC | #2
On Fri, May 03, 2019 at 09:12:24AM -0600, Raul Rangel wrote:
> On Wed, May 01, 2019 at 11:54:56AM -0600, Raul E Rangel wrote:
> > I am running into a kernel panic. A task gets stuck for more than 120
> > seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not
> > calling something correctly?
> > 
> > Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc=
> > 
> > I sometimes see the following:
> > [  547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time
> > 
> > I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host
> > So I'm guessing I'm missing an upstream fix.
> > 
> 
> I'll keep trying to track down the hung task I was seeing on 4.14. But I
> don't think that's related to these patches. I might just end up
> backporting the blk-mq patches to our 4.14 branch since I suspect that
> fixes it.

So I tracked down the hung task in 4.14, it's a resource leak.
mmc_cleanup_queue stops the worker thread. If there were any requests in
the queue they would be holding onto a reference of mmc_blk_data. When
mmc_blk_remove_req calls mmc_blk_put, there are still references to md, so
it never calls blk_cleanup_queue, and the requests stay in the queue
forever.

Fortunately Adrian already has a fix for this: https://lore.kernel.org/patchwork/patch/856512/
I think we should cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de
into v4.14. I've tried it locally and it fixes the kernel panic I was
seeing.

I've also sent out two more patches for v4.14 that need to be applied
with Adrian's patch:
* https://patchwork.kernel.org/patch/10936439/
* https://patchwork.kernel.org/patch/10936441/

As for this patch, are there any comments? I have a test running that is
doing random connect/disconnects, and it's over 6k iterations now.

Thanks,
Raul
Ulf Hansson May 28, 2019, 7:38 a.m. UTC | #3
On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote:
>
> There is a race condition between resetting the SDHCI controller and
> disconnecting the card.
>
> For example:
> 0) Card is connected and transferring data
> 1) mmc_sd_reset is called to reset the controller due to a data error

I assume you refer to mmc_sd_hw_reset()? In that case, I think you
have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's
responsibility is to reset the SD-card and not the host/controller.

Whether there some additional "reset" of the controller needed, that
is assumed by the core, to be managed via the ->set_ios() callback for
the host.

> 2) sdhci_set_ios calls sdhci_do_reset
> 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
> configured.
> 4) Wait for SOFT_RESET_ALL to clear
> 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
>    IRQs are not configured a CARD_REMOVED interrupt is never raised.
> 6) IRQs are enabled again
> 7) mmc layer never notices the device is disconnected. The SDHCI layer
>    will keep returning -ENOMEDIUM. This results in a card that is always
>    present and not functional.

This sounds like host specific problems, which most likely should be
fixed in host driver, solely. Unless I am missing something, of
course.

>
> Signed-off-by: Raul E Rangel <rrangel@chromium.org>
> ---
> You can see an example of the following two patches here:
> https://privatebin.net/?b0f5953716d34ca6#C699bCBQ99NdvspfDW7CMucT8CJG4DgL+yUNPyepDCo=
> Line 8213: EILSEQ
> Line 8235: SDHC is hard reset
> Line 8240: Controller completes reset and card is no longer present
> Line 8379: mmc_sd_reset notices card is missing and issues a card_event
>            and schedules a detect change.
> Line 8402: Don't init the card since it's already gone.
> Line 8717: Marks card as removed
> Line 8820: mmc_sd_remove removes the block device
>
> I am running into a kernel panic. A task gets stuck for more than 120
> seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not
> calling something correctly?
>
> Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc=
>
> I sometimes see the following:
> [  547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time
>
> I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host
> So I'm guessing I'm missing an upstream fix.
>
> Do the patches look correct or am I doing something that would cause a
> kernel panic?
>
> I have a DUT setup with a GPIO I can use to toggle the CD pin. I ran a
> test where I connect and then randomly, between 0s - 1s disconnect the
> card. This got over 20k iterations before the panic. Though when I do it
> manually and stop for 2 minutes the panic happens.
>
> Any help would be appreciated.
>
> Thanks,
> Raul
>
>
>  drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
> index 265e1aeeb9d8..9206c4297d66 100644
> --- a/drivers/mmc/core/sd.c
> +++ b/drivers/mmc/core/sd.c
> @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host)
>
>  static int mmc_sd_hw_reset(struct mmc_host *host)
>  {
> +       int present;
>         mmc_power_cycle(host, host->card->ocr);
> +
> +       present = host->ops->get_cd(host);
> +
> +       /* The card status could have changed while resetting. */
> +       if ((mmc_card_removed(host->card) && present) ||
> +           (!mmc_card_removed(host->card) && !present)) {
> +               pr_info("%s: card status changed during reset\n",
> +                      mmc_hostname(host));
> +               host->ops->card_event(host);
> +               mmc_detect_change(host, 0);
> +       }
> +
> +       /* Don't perform unnecessary transactions if the card is missing. */
> +       if (!present) {
> +               pr_info("%s: card was removed during reset\n",
> +                       mmc_hostname(host));
> +               return -ENOMEDIUM;
> +       }
> +

When doing a  mmc_hw_reset() (which ends up calling mmc_sd_hw_reset()
in case of SD cards), we are making a final attempt to make the card
functional again, via a power cycle and a re-init of it.

In this path, we don't care whether the card is removed, as that
should have been detected already when the block layer calls
mmc_detect_card_removed().

>         return mmc_sd_init_card(host, host->card->ocr, host->card);
>  }
>
> --
> 2.21.0.593.g511ec345e18-goog
>

Kind regards
Uffe
Raul Rangel June 7, 2019, 4:05 p.m. UTC | #4
On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote:
> On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote:

First off, thanks for the review.

> >
> > There is a race condition between resetting the SDHCI controller and
> > disconnecting the card.
> >
> > For example:
> > 0) Card is connected and transferring data
> > 1) mmc_sd_reset is called to reset the controller due to a data error
> 
> I assume you refer to mmc_sd_hw_reset()? In that case, I think you
> have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's
> responsibility is to reset the SD-card and not the host/controller.
You are correct. I was looking at a 4.14 kernel where it's called
mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset.

All I was trying to convey here was that a block error will eventually
call sdhci_set_ios with SOFT_RESET_ALL via:
mmc_blk_reset
mmc_hw_reset
mmc_sd_hw_reset
mmc_power_cycle
mmc_power_off
mmc_set_initial_state
sdhci_set_ios
sdhci_reinit
sdhci_init
sdhci_do_reset(host, SDHCI_RESET_ALL);

> 
> Whether there some additional "reset" of the controller needed, that
> is assumed by the core, to be managed via the ->set_ios() callback for
> the host.
> 
> > 2) sdhci_set_ios calls sdhci_do_reset
> > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
> > configured.
> > 4) Wait for SOFT_RESET_ALL to clear
> > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
> >    IRQs are not configured a CARD_REMOVED interrupt is never raised.
> > 6) IRQs are enabled again
> > 7) mmc layer never notices the device is disconnected. The SDHCI layer
> >    will keep returning -ENOMEDIUM. This results in a card that is always
> >    present and not functional.
> 
> This sounds like host specific problems, which most likely should be
> fixed in host driver, solely. Unless I am missing something, of
> course.

So in sdhci_do_reset we call the reset callback which is typically
sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the
controller to reset. If SDHCI_RESET_ALL was passed as the flag, the
controller will clear the IRQ mask. If during that 100ms the card is
removed there is no notification to the MMC system that the card was
removed. So it seems like the card is always present.

> >  drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
> > index 265e1aeeb9d8..9206c4297d66 100644
> > --- a/drivers/mmc/core/sd.c
> > +++ b/drivers/mmc/core/sd.c
> > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host)
> >
> >  static int mmc_sd_hw_reset(struct mmc_host *host)
> >  {
> > +       int present;
> >         mmc_power_cycle(host, host->card->ocr);
> > +
> > +       present = host->ops->get_cd(host);
> > +
> > +       /* The card status could have changed while resetting. */
> > +       if ((mmc_card_removed(host->card) && present) ||
> > +           (!mmc_card_removed(host->card) && !present)) {
> > +               pr_info("%s: card status changed during reset\n",
> > +                      mmc_hostname(host));
> > +               host->ops->card_event(host);
> > +               mmc_detect_change(host, 0);
> > +       }
> > +
> > +       /* Don't perform unnecessary transactions if the card is missing. */
> > +       if (!present) {
> > +               pr_info("%s: card was removed during reset\n",
> > +                       mmc_hostname(host));
> > +               return -ENOMEDIUM;
> > +       }
> > +
> 
> When doing a  mmc_hw_reset() (which ends up calling mmc_sd_hw_reset()
> in case of SD cards), we are making a final attempt to make the card
> functional again, via a power cycle and a re-init of it.
> 
> In this path, we don't care whether the card is removed, as that
> should have been detected already when the block layer calls
> mmc_detect_card_removed().

mmc_detect_card_removed has the guard `host->detect_change` to
prevent it from checking the card status again. So maybe the fix to the
missing interrupt/race condition is to set `host->detect_change = 1`
from sdhci_do_reset after calling the reset handler. This way there will
always be a single poll after a full reset so the correct card status can
be detected?

> 
> >         return mmc_sd_init_card(host, host->card->ocr, host->card);
> >  }
> >
> > --
> > 2.21.0.593.g511ec345e18-goog
> >
> 
> Kind regards
> Uffe
>
Thanks again for the review!
Ulf Hansson June 10, 2019, 4:17 p.m. UTC | #5
+ Adrian

On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote:
>
> On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote:
> > On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote:
>
> First off, thanks for the review.
>
> > >
> > > There is a race condition between resetting the SDHCI controller and
> > > disconnecting the card.
> > >
> > > For example:
> > > 0) Card is connected and transferring data
> > > 1) mmc_sd_reset is called to reset the controller due to a data error
> >
> > I assume you refer to mmc_sd_hw_reset()? In that case, I think you
> > have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's
> > responsibility is to reset the SD-card and not the host/controller.
> You are correct. I was looking at a 4.14 kernel where it's called
> mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset.
>
> All I was trying to convey here was that a block error will eventually
> call sdhci_set_ios with SOFT_RESET_ALL via:
> mmc_blk_reset
> mmc_hw_reset
> mmc_sd_hw_reset
> mmc_power_cycle
> mmc_power_off
> mmc_set_initial_state
> sdhci_set_ios
> sdhci_reinit
> sdhci_init
> sdhci_do_reset(host, SDHCI_RESET_ALL);
>
> >
> > Whether there some additional "reset" of the controller needed, that
> > is assumed by the core, to be managed via the ->set_ios() callback for
> > the host.
> >
> > > 2) sdhci_set_ios calls sdhci_do_reset
> > > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
> > > configured.
> > > 4) Wait for SOFT_RESET_ALL to clear
> > > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
> > >    IRQs are not configured a CARD_REMOVED interrupt is never raised.
> > > 6) IRQs are enabled again
> > > 7) mmc layer never notices the device is disconnected. The SDHCI layer
> > >    will keep returning -ENOMEDIUM. This results in a card that is always
> > >    present and not functional.
> >
> > This sounds like host specific problems, which most likely should be
> > fixed in host driver, solely. Unless I am missing something, of
> > course.
>
> So in sdhci_do_reset we call the reset callback which is typically
> sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the
> controller to reset. If SDHCI_RESET_ALL was passed as the flag, the
> controller will clear the IRQ mask. If during that 100ms the card is
> removed there is no notification to the MMC system that the card was
> removed. So it seems like the card is always present.

So you are saying that the problem occurs when the card is removed
during this 100ms period?

I assume there a way for sdhci to re-validate whether the card has
been removed during this period, right?

>
> > >  drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
> > >  1 file changed, 20 insertions(+)
> > >
> > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
> > > index 265e1aeeb9d8..9206c4297d66 100644
> > > --- a/drivers/mmc/core/sd.c
> > > +++ b/drivers/mmc/core/sd.c
> > > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host)
> > >
> > >  static int mmc_sd_hw_reset(struct mmc_host *host)
> > >  {
> > > +       int present;
> > >         mmc_power_cycle(host, host->card->ocr);
> > > +
> > > +       present = host->ops->get_cd(host);
> > > +
> > > +       /* The card status could have changed while resetting. */
> > > +       if ((mmc_card_removed(host->card) && present) ||
> > > +           (!mmc_card_removed(host->card) && !present)) {
> > > +               pr_info("%s: card status changed during reset\n",
> > > +                      mmc_hostname(host));
> > > +               host->ops->card_event(host);
> > > +               mmc_detect_change(host, 0);
> > > +       }
> > > +
> > > +       /* Don't perform unnecessary transactions if the card is missing. */
> > > +       if (!present) {
> > > +               pr_info("%s: card was removed during reset\n",
> > > +                       mmc_hostname(host));
> > > +               return -ENOMEDIUM;
> > > +       }
> > > +
> >
> > When doing a  mmc_hw_reset() (which ends up calling mmc_sd_hw_reset()
> > in case of SD cards), we are making a final attempt to make the card
> > functional again, via a power cycle and a re-init of it.
> >
> > In this path, we don't care whether the card is removed, as that
> > should have been detected already when the block layer calls
> > mmc_detect_card_removed().
>
> mmc_detect_card_removed has the guard `host->detect_change` to
> prevent it from checking the card status again. So maybe the fix to the
> missing interrupt/race condition is to set `host->detect_change = 1`
> from sdhci_do_reset after calling the reset handler. This way there will
> always be a single poll after a full reset so the correct card status can
> be detected?

It sounds like you should call mmc_detect_change(), after the reset to
let the core check for cards that may have been removed/inserted.

Would that work?

[...]

Kind regards
Uffe
Raul Rangel June 10, 2019, 4:32 p.m. UTC | #6
On Mon, Jun 10, 2019 at 06:17:31PM +0200, Ulf Hansson wrote:
> + Adrian
> 
> On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote:
> >
> > On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote:
> > > On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote:
> >
> > First off, thanks for the review.
> >
> > > >
> > > > There is a race condition between resetting the SDHCI controller and
> > > > disconnecting the card.
> > > >
> > > > For example:
> > > > 0) Card is connected and transferring data
> > > > 1) mmc_sd_reset is called to reset the controller due to a data error
> > >
> > > I assume you refer to mmc_sd_hw_reset()? In that case, I think you
> > > have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's
> > > responsibility is to reset the SD-card and not the host/controller.
> > You are correct. I was looking at a 4.14 kernel where it's called
> > mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset.
> >
> > All I was trying to convey here was that a block error will eventually
> > call sdhci_set_ios with SOFT_RESET_ALL via:
> > mmc_blk_reset
> > mmc_hw_reset
> > mmc_sd_hw_reset
> > mmc_power_cycle
> > mmc_power_off
> > mmc_set_initial_state
> > sdhci_set_ios
> > sdhci_reinit
> > sdhci_init
> > sdhci_do_reset(host, SDHCI_RESET_ALL);
> >
> > >
> > > Whether there some additional "reset" of the controller needed, that
> > > is assumed by the core, to be managed via the ->set_ios() callback for
> > > the host.
> > >
> > > > 2) sdhci_set_ios calls sdhci_do_reset
> > > > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
> > > > configured.
> > > > 4) Wait for SOFT_RESET_ALL to clear
> > > > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
> > > >    IRQs are not configured a CARD_REMOVED interrupt is never raised.
> > > > 6) IRQs are enabled again
> > > > 7) mmc layer never notices the device is disconnected. The SDHCI layer
> > > >    will keep returning -ENOMEDIUM. This results in a card that is always
> > > >    present and not functional.
> > >
> > > This sounds like host specific problems, which most likely should be
> > > fixed in host driver, solely. Unless I am missing something, of
> > > course.
> >
> > So in sdhci_do_reset we call the reset callback which is typically
> > sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the
> > controller to reset. If SDHCI_RESET_ALL was passed as the flag, the
> > controller will clear the IRQ mask. If during that 100ms the card is
> > removed there is no notification to the MMC system that the card was
> > removed. So it seems like the card is always present.
> 
> So you are saying that the problem occurs when the card is removed
> during this 100ms period?
Exactly. Thought I think most controllers reset fast enough to where
that window is very small. But for this AMD controller we need to do a
full reset which takes a while, so we can see the problem.

> 
> I assume there a way for sdhci to re-validate whether the card has
> been removed during this period, right?
I'll let Adrian chime in here.

> >
> > > >  drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
> > > >  1 file changed, 20 insertions(+)
> > > >
> > > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
> > > > index 265e1aeeb9d8..9206c4297d66 100644
> > > > --- a/drivers/mmc/core/sd.c
> > > > +++ b/drivers/mmc/core/sd.c
> > > > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host)
> > > >
> > > >  static int mmc_sd_hw_reset(struct mmc_host *host)
> > > >  {
> > > > +       int present;
> > > >         mmc_power_cycle(host, host->card->ocr);
> > > > +
> > > > +       present = host->ops->get_cd(host);
> > > > +
> > > > +       /* The card status could have changed while resetting. */
> > > > +       if ((mmc_card_removed(host->card) && present) ||
> > > > +           (!mmc_card_removed(host->card) && !present)) {
> > > > +               pr_info("%s: card status changed during reset\n",
> > > > +                      mmc_hostname(host));
> > > > +               host->ops->card_event(host);
> > > > +               mmc_detect_change(host, 0);
> > > > +       }
> > > > +
> > > > +       /* Don't perform unnecessary transactions if the card is missing. */
> > > > +       if (!present) {
> > > > +               pr_info("%s: card was removed during reset\n",
> > > > +                       mmc_hostname(host));
> > > > +               return -ENOMEDIUM;
> > > > +       }
> > > > +
> > >
> > > When doing a  mmc_hw_reset() (which ends up calling mmc_sd_hw_reset()
> > > in case of SD cards), we are making a final attempt to make the card
> > > functional again, via a power cycle and a re-init of it.
> > >
> > > In this path, we don't care whether the card is removed, as that
> > > should have been detected already when the block layer calls
> > > mmc_detect_card_removed().
> >
> > mmc_detect_card_removed has the guard `host->detect_change` to
> > prevent it from checking the card status again. So maybe the fix to the
> > missing interrupt/race condition is to set `host->detect_change = 1`
> > from sdhci_do_reset after calling the reset handler. This way there will
> > always be a single poll after a full reset so the correct card status can
> > be detected?
> 
> It sounds like you should call mmc_detect_change(), after the reset to
> let the core check for cards that may have been removed/inserted.
> 
> Would that work?
Yeah, I could add `mmc_detect_change()` to the AMD SDHCI patch. I just
thought it would be better to fix the potential race condition at a
higher level. Though I'm happy to move it to the AMD patch. Let me know.

Thanks
> 
> [...]
> 
> Kind regards
> Uffe
Adrian Hunter June 11, 2019, 10:30 a.m. UTC | #7
On 10/06/19 7:32 PM, Raul Rangel wrote:
> On Mon, Jun 10, 2019 at 06:17:31PM +0200, Ulf Hansson wrote:
>> + Adrian
>>
>> On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote:
>>>
>>> On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote:
>>>> On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote:
>>>
>>> First off, thanks for the review.
>>>
>>>>>
>>>>> There is a race condition between resetting the SDHCI controller and
>>>>> disconnecting the card.
>>>>>
>>>>> For example:
>>>>> 0) Card is connected and transferring data
>>>>> 1) mmc_sd_reset is called to reset the controller due to a data error
>>>>
>>>> I assume you refer to mmc_sd_hw_reset()? In that case, I think you
>>>> have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's
>>>> responsibility is to reset the SD-card and not the host/controller.
>>> You are correct. I was looking at a 4.14 kernel where it's called
>>> mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset.
>>>
>>> All I was trying to convey here was that a block error will eventually
>>> call sdhci_set_ios with SOFT_RESET_ALL via:
>>> mmc_blk_reset
>>> mmc_hw_reset
>>> mmc_sd_hw_reset
>>> mmc_power_cycle
>>> mmc_power_off
>>> mmc_set_initial_state
>>> sdhci_set_ios
>>> sdhci_reinit
>>> sdhci_init
>>> sdhci_do_reset(host, SDHCI_RESET_ALL);
>>>
>>>>
>>>> Whether there some additional "reset" of the controller needed, that
>>>> is assumed by the core, to be managed via the ->set_ios() callback for
>>>> the host.
>>>>
>>>>> 2) sdhci_set_ios calls sdhci_do_reset
>>>>> 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has
>>>>> configured.
>>>>> 4) Wait for SOFT_RESET_ALL to clear
>>>>> 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the
>>>>>    IRQs are not configured a CARD_REMOVED interrupt is never raised.
>>>>> 6) IRQs are enabled again
>>>>> 7) mmc layer never notices the device is disconnected. The SDHCI layer
>>>>>    will keep returning -ENOMEDIUM. This results in a card that is always
>>>>>    present and not functional.
>>>>
>>>> This sounds like host specific problems, which most likely should be
>>>> fixed in host driver, solely. Unless I am missing something, of
>>>> course.
>>>
>>> So in sdhci_do_reset we call the reset callback which is typically
>>> sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the
>>> controller to reset. If SDHCI_RESET_ALL was passed as the flag, the
>>> controller will clear the IRQ mask. If during that 100ms the card is
>>> removed there is no notification to the MMC system that the card was
>>> removed. So it seems like the card is always present.
>>
>> So you are saying that the problem occurs when the card is removed
>> during this 100ms period?
> Exactly. Thought I think most controllers reset fast enough to where
> that window is very small. But for this AMD controller we need to do a
> full reset which takes a while, so we can see the problem.
> 
>>
>> I assume there a way for sdhci to re-validate whether the card has
>> been removed during this period, right?
> I'll let Adrian chime in here.
> 
>>>
>>>>>  drivers/mmc/core/sd.c | 20 ++++++++++++++++++++
>>>>>  1 file changed, 20 insertions(+)
>>>>>
>>>>> diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
>>>>> index 265e1aeeb9d8..9206c4297d66 100644
>>>>> --- a/drivers/mmc/core/sd.c
>>>>> +++ b/drivers/mmc/core/sd.c
>>>>> @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host)
>>>>>
>>>>>  static int mmc_sd_hw_reset(struct mmc_host *host)
>>>>>  {
>>>>> +       int present;
>>>>>         mmc_power_cycle(host, host->card->ocr);
>>>>> +
>>>>> +       present = host->ops->get_cd(host);
>>>>> +
>>>>> +       /* The card status could have changed while resetting. */
>>>>> +       if ((mmc_card_removed(host->card) && present) ||
>>>>> +           (!mmc_card_removed(host->card) && !present)) {
>>>>> +               pr_info("%s: card status changed during reset\n",
>>>>> +                      mmc_hostname(host));
>>>>> +               host->ops->card_event(host);
>>>>> +               mmc_detect_change(host, 0);
>>>>> +       }
>>>>> +
>>>>> +       /* Don't perform unnecessary transactions if the card is missing. */
>>>>> +       if (!present) {
>>>>> +               pr_info("%s: card was removed during reset\n",
>>>>> +                       mmc_hostname(host));
>>>>> +               return -ENOMEDIUM;
>>>>> +       }
>>>>> +
>>>>
>>>> When doing a  mmc_hw_reset() (which ends up calling mmc_sd_hw_reset()
>>>> in case of SD cards), we are making a final attempt to make the card
>>>> functional again, via a power cycle and a re-init of it.
>>>>
>>>> In this path, we don't care whether the card is removed, as that
>>>> should have been detected already when the block layer calls
>>>> mmc_detect_card_removed().
>>>
>>> mmc_detect_card_removed has the guard `host->detect_change` to
>>> prevent it from checking the card status again. So maybe the fix to the
>>> missing interrupt/race condition is to set `host->detect_change = 1`
>>> from sdhci_do_reset after calling the reset handler. This way there will
>>> always be a single poll after a full reset so the correct card status can
>>> be detected?
>>
>> It sounds like you should call mmc_detect_change(), after the reset to
>> let the core check for cards that may have been removed/inserted.
>>
>> Would that work?
> Yeah, I could add `mmc_detect_change()` to the AMD SDHCI patch. I just
> thought it would be better to fix the potential race condition at a
> higher level. Though I'm happy to move it to the AMD patch. Let me know.

Does the following work?


diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 0cd5f2ce98df..f672171246b0 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -341,8 +341,19 @@ static void sdhci_init(struct sdhci_host *host, int soft)
 
 static void sdhci_reinit(struct sdhci_host *host)
 {
+	u32 cd = host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT);
+
 	sdhci_init(host, 0);
 	sdhci_enable_card_detection(host);
+
+	/*
+	 * A change to the card detect bits indicates a change in present state,
+	 * refer sdhci_set_card_detection(). A card detect interrupt might have
+	 * been missed while the host controller was being reset, so trigger a
+	 * rescan to check.
+	 */
+	if (cd != (host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT)))
+		mmc_detect_change(host->mmc, msecs_to_jiffies(200));
 }
 
 static void __sdhci_led_activate(struct sdhci_host *host)
Raul Rangel June 19, 2019, 2:56 p.m. UTC | #8
On Tue, Jun 11, 2019 at 01:30:55PM +0300, Adrian Hunter wrote:
> Does the following work?
> 
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index 0cd5f2ce98df..f672171246b0 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -341,8 +341,19 @@ static void sdhci_init(struct sdhci_host *host, int soft)
>  
>  static void sdhci_reinit(struct sdhci_host *host)
>  {
> +	u32 cd = host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT);
> +
>  	sdhci_init(host, 0);
>  	sdhci_enable_card_detection(host);
> +
> +	/*
> +	 * A change to the card detect bits indicates a change in present state,
> +	 * refer sdhci_set_card_detection(). A card detect interrupt might have
> +	 * been missed while the host controller was being reset, so trigger a
> +	 * rescan to check.
> +	 */
> +	if (cd != (host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT)))
> +		mmc_detect_change(host->mmc, msecs_to_jiffies(200));
>  }
>  
>  static void __sdhci_led_activate(struct sdhci_host *host)

Your patch looks good. I tried it out and got over 57k insertion/removal
iterations. Do you want me to send out your patch, or do you want to do
it?

Just to recap, the patch you proposed + the AMD SDHCI specific patch fix
the problem.

Thanks!
Raul Rangel Aug. 1, 2019, 3:16 p.m. UTC | #9
On Wed, Jun 19, 2019 at 08:56:25AM -0600, Raul Rangel wrote:
> Your patch looks good. I tried it out and got over 57k insertion/removal
> iterations. Do you want me to send out your patch, or do you want to do
> it?
> 
> Just to recap, the patch you proposed + the AMD SDHCI specific patch fix
> the problem.
> 

Just pinging the thread again to see what you would prefer. I can send
the patch with you in the Signed-off. Or if you would prefer to not
update the sdhci driver at all, I can add the logic into the AMD
specific patch.

I would like to get this in so it lands by 5.4.

Thanks,
Raul
Adrian Hunter Aug. 2, 2019, 4:58 a.m. UTC | #10
On 1/08/19 6:16 PM, Raul Rangel wrote:
> On Wed, Jun 19, 2019 at 08:56:25AM -0600, Raul Rangel wrote:
>> Your patch looks good. I tried it out and got over 57k insertion/removal
>> iterations. Do you want me to send out your patch, or do you want to do
>> it?
>>
>> Just to recap, the patch you proposed + the AMD SDHCI specific patch fix
>> the problem.
>>
> 
> Just pinging the thread again to see what you would prefer. I can send
> the patch with you in the Signed-off. Or if you would prefer to not
> update the sdhci driver at all, I can add the logic into the AMD
> specific patch.
> 
> I would like to get this in so it lands by 5.4.

You seem not to have answered to my suggestion for a change to sdhci_reinit() here:

	https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/
Raul Rangel Aug. 5, 2019, 4:49 p.m. UTC | #11
On Fri, Aug 02, 2019 at 07:58:20AM +0300, Adrian Hunter wrote:
> 
> You seem not to have answered to my suggestion for a change to sdhci_reinit() here:
> 
> 	https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/
> 
I thought I answered it here: https://lore.kernel.org/lkml/20190619145625.GA50985@google.com/#t

Did I miss something?

Thanks
Adrian Hunter Aug. 6, 2019, 5:51 a.m. UTC | #12
On 5/08/19 7:49 PM, Raul Rangel wrote:
> On Fri, Aug 02, 2019 at 07:58:20AM +0300, Adrian Hunter wrote:
>>
>> You seem not to have answered to my suggestion for a change to sdhci_reinit() here:
>>
>> 	https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/
>>
> I thought I answered it here: https://lore.kernel.org/lkml/20190619145625.GA50985@google.com/#t
> 
> Did I miss something?

I didn't get that mail for some reason, sorry.

Please send out my patch and your second with my ack.
diff mbox series

Patch

diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c
index 265e1aeeb9d8..9206c4297d66 100644
--- a/drivers/mmc/core/sd.c
+++ b/drivers/mmc/core/sd.c
@@ -1242,7 +1242,27 @@  static int mmc_sd_runtime_resume(struct mmc_host *host)
 
 static int mmc_sd_hw_reset(struct mmc_host *host)
 {
+	int present;
 	mmc_power_cycle(host, host->card->ocr);
+
+	present = host->ops->get_cd(host);
+
+	/* The card status could have changed while resetting. */
+	if ((mmc_card_removed(host->card) && present) ||
+	    (!mmc_card_removed(host->card) && !present)) {
+		pr_info("%s: card status changed during reset\n",
+		       mmc_hostname(host));
+		host->ops->card_event(host);
+		mmc_detect_change(host, 0);
+	}
+
+	/* Don't perform unnecessary transactions if the card is missing. */
+	if (!present) {
+		pr_info("%s: card was removed during reset\n",
+			mmc_hostname(host));
+		return -ENOMEDIUM;
+	}
+
 	return mmc_sd_init_card(host, host->card->ocr, host->card);
 }