Message ID | 20190501175457.195855-1-rrangel@chromium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC,1/2] mmc: sdhci: Manually check card status after reset | expand |
On Wed, May 01, 2019 at 11:54:56AM -0600, Raul E Rangel wrote: > I am running into a kernel panic. A task gets stuck for more than 120 > seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not > calling something correctly? > > Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc= > > I sometimes see the following: > [ 547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time > > I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host > So I'm guessing I'm missing an upstream fix. > So I tried these patches on the 5.1 with the memory leak patch applied: https://patchwork.kernel.org/patch/10927541/ Everything works as expected: mmc0: new high speed SDHC card at address 0001 mmcblk0: mmc0:0001 00000 7.41 GiB mmcblk0: p1 p2 mmc0: card status changed during reset mmc0: card was removed during reset mmc0: tried to HW reset card, got error -123 print_req_error: I/O error, dev mmcblk0, sector 2072 flags 80700 print_req_error: I/O error, dev mmcblk0, sector 2073 flags 80700 mmc0: card 0001 removed mmc0: new high speed SDHC card at address 0001 mmcblk0: mmc0:0001 00000 7.41 GiB mmcblk0: p1 p2 mmc0: card status changed during reset mmc0: card was removed during reset mmc0: tried to HW reset card, got error -123 print_req_error: I/O error, dev mmcblk0, sector 12584832 flags 80700 print_req_error: I/O error, dev mmcblk0, sector 12584833 flags 80700 mmc0: card 0001 removed Buffer I/O error on dev mmcblk0p2, logical block 369904, async page read mmc0: new high speed SDHC card at address 0001 mmcblk0: mmc0:0001 00000 7.41 GiB mmcblk0: p1 p2 mmc0: card 0001 removed I also ran another test. Using the DUT with a GPIO to randomly toggle the CD pin I was able to get 4k+ iterations without any problems. So I think the patches are good. --- Iteration 4506 ---- Disconnecting card [ 7444.370169] mmc0: card 0001 removed Connecting Card and sleeping for 0.31476s [ 7444.682177] mmc0: new high speed SDHC card at address 0001 [ 7444.719038] mmcblk0: mmc0:0001 00000 7.41 GiB [ 7444.727375] mmcblk0: p1 p2 Card Disconnected [ 7444.831896] mmc0: Card removed during transfer! [ 7444.831914] mmc0: Resetting controller. --- Iteration 4507 ---- Connecting Card and sleeping for 0.30259s [ 7445.033649] mmc0: card 0001 removed Card Disconnected [ 7445.307008] mmc0: error -123 whilst initialising SD card --- Iteration 4508 ---- Connecting Card and sleeping for 0.24827s Card Disconnected [ 7445.716033] mmc0: Card removed during transfer! [ 7445.716052] mmc0: Resetting controller. [ 7445.716489] mmc0: error -123 whilst initialising SD card --- Iteration 4509 ---- Connecting Card and sleeping for 0.14132s Card Disconnected --- Iteration 4510 ---- Connecting Card and sleeping for 0.26001s Card Disconnected [ 7446.436079] mmc0: error -123 whilst initialising SD card Connecting card to verify if it died [ 7446.906804] mmc0: new high speed SDHC card at address 0001 [ 7446.933631] mmcblk0: mmc0:0001 00000 7.41 GiB [ 7446.949224] mmcblk0: p1 p2 SDHC still functional Should I send the patches out again without the RFC tag? I'll keep trying to track down the hung task I was seeing on 4.14. But I don't think that's related to these patches. I might just end up backporting the blk-mq patches to our 4.14 branch since I suspect that fixes it. Thanks, Raul
On Fri, May 03, 2019 at 09:12:24AM -0600, Raul Rangel wrote: > On Wed, May 01, 2019 at 11:54:56AM -0600, Raul E Rangel wrote: > > I am running into a kernel panic. A task gets stuck for more than 120 > > seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not > > calling something correctly? > > > > Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc= > > > > I sometimes see the following: > > [ 547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time > > > > I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host > > So I'm guessing I'm missing an upstream fix. > > > > I'll keep trying to track down the hung task I was seeing on 4.14. But I > don't think that's related to these patches. I might just end up > backporting the blk-mq patches to our 4.14 branch since I suspect that > fixes it. So I tracked down the hung task in 4.14, it's a resource leak. mmc_cleanup_queue stops the worker thread. If there were any requests in the queue they would be holding onto a reference of mmc_blk_data. When mmc_blk_remove_req calls mmc_blk_put, there are still references to md, so it never calls blk_cleanup_queue, and the requests stay in the queue forever. Fortunately Adrian already has a fix for this: https://lore.kernel.org/patchwork/patch/856512/ I think we should cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de into v4.14. I've tried it locally and it fixes the kernel panic I was seeing. I've also sent out two more patches for v4.14 that need to be applied with Adrian's patch: * https://patchwork.kernel.org/patch/10936439/ * https://patchwork.kernel.org/patch/10936441/ As for this patch, are there any comments? I have a test running that is doing random connect/disconnects, and it's over 6k iterations now. Thanks, Raul
On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote: > > There is a race condition between resetting the SDHCI controller and > disconnecting the card. > > For example: > 0) Card is connected and transferring data > 1) mmc_sd_reset is called to reset the controller due to a data error I assume you refer to mmc_sd_hw_reset()? In that case, I think you have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's responsibility is to reset the SD-card and not the host/controller. Whether there some additional "reset" of the controller needed, that is assumed by the core, to be managed via the ->set_ios() callback for the host. > 2) sdhci_set_ios calls sdhci_do_reset > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has > configured. > 4) Wait for SOFT_RESET_ALL to clear > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the > IRQs are not configured a CARD_REMOVED interrupt is never raised. > 6) IRQs are enabled again > 7) mmc layer never notices the device is disconnected. The SDHCI layer > will keep returning -ENOMEDIUM. This results in a card that is always > present and not functional. This sounds like host specific problems, which most likely should be fixed in host driver, solely. Unless I am missing something, of course. > > Signed-off-by: Raul E Rangel <rrangel@chromium.org> > --- > You can see an example of the following two patches here: > https://privatebin.net/?b0f5953716d34ca6#C699bCBQ99NdvspfDW7CMucT8CJG4DgL+yUNPyepDCo= > Line 8213: EILSEQ > Line 8235: SDHC is hard reset > Line 8240: Controller completes reset and card is no longer present > Line 8379: mmc_sd_reset notices card is missing and issues a card_event > and schedules a detect change. > Line 8402: Don't init the card since it's already gone. > Line 8717: Marks card as removed > Line 8820: mmc_sd_remove removes the block device > > I am running into a kernel panic. A task gets stuck for more than 120 > seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not > calling something correctly? > > Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc= > > I sometimes see the following: > [ 547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time > > I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host > So I'm guessing I'm missing an upstream fix. > > Do the patches look correct or am I doing something that would cause a > kernel panic? > > I have a DUT setup with a GPIO I can use to toggle the CD pin. I ran a > test where I connect and then randomly, between 0s - 1s disconnect the > card. This got over 20k iterations before the panic. Though when I do it > manually and stop for 2 minutes the panic happens. > > Any help would be appreciated. > > Thanks, > Raul > > > drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c > index 265e1aeeb9d8..9206c4297d66 100644 > --- a/drivers/mmc/core/sd.c > +++ b/drivers/mmc/core/sd.c > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) > > static int mmc_sd_hw_reset(struct mmc_host *host) > { > + int present; > mmc_power_cycle(host, host->card->ocr); > + > + present = host->ops->get_cd(host); > + > + /* The card status could have changed while resetting. */ > + if ((mmc_card_removed(host->card) && present) || > + (!mmc_card_removed(host->card) && !present)) { > + pr_info("%s: card status changed during reset\n", > + mmc_hostname(host)); > + host->ops->card_event(host); > + mmc_detect_change(host, 0); > + } > + > + /* Don't perform unnecessary transactions if the card is missing. */ > + if (!present) { > + pr_info("%s: card was removed during reset\n", > + mmc_hostname(host)); > + return -ENOMEDIUM; > + } > + When doing a mmc_hw_reset() (which ends up calling mmc_sd_hw_reset() in case of SD cards), we are making a final attempt to make the card functional again, via a power cycle and a re-init of it. In this path, we don't care whether the card is removed, as that should have been detected already when the block layer calls mmc_detect_card_removed(). > return mmc_sd_init_card(host, host->card->ocr, host->card); > } > > -- > 2.21.0.593.g511ec345e18-goog > Kind regards Uffe
On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote: > On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote: First off, thanks for the review. > > > > There is a race condition between resetting the SDHCI controller and > > disconnecting the card. > > > > For example: > > 0) Card is connected and transferring data > > 1) mmc_sd_reset is called to reset the controller due to a data error > > I assume you refer to mmc_sd_hw_reset()? In that case, I think you > have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's > responsibility is to reset the SD-card and not the host/controller. You are correct. I was looking at a 4.14 kernel where it's called mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset. All I was trying to convey here was that a block error will eventually call sdhci_set_ios with SOFT_RESET_ALL via: mmc_blk_reset mmc_hw_reset mmc_sd_hw_reset mmc_power_cycle mmc_power_off mmc_set_initial_state sdhci_set_ios sdhci_reinit sdhci_init sdhci_do_reset(host, SDHCI_RESET_ALL); > > Whether there some additional "reset" of the controller needed, that > is assumed by the core, to be managed via the ->set_ios() callback for > the host. > > > 2) sdhci_set_ios calls sdhci_do_reset > > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has > > configured. > > 4) Wait for SOFT_RESET_ALL to clear > > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the > > IRQs are not configured a CARD_REMOVED interrupt is never raised. > > 6) IRQs are enabled again > > 7) mmc layer never notices the device is disconnected. The SDHCI layer > > will keep returning -ENOMEDIUM. This results in a card that is always > > present and not functional. > > This sounds like host specific problems, which most likely should be > fixed in host driver, solely. Unless I am missing something, of > course. So in sdhci_do_reset we call the reset callback which is typically sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the controller to reset. If SDHCI_RESET_ALL was passed as the flag, the controller will clear the IRQ mask. If during that 100ms the card is removed there is no notification to the MMC system that the card was removed. So it seems like the card is always present. > > drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ > > 1 file changed, 20 insertions(+) > > > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c > > index 265e1aeeb9d8..9206c4297d66 100644 > > --- a/drivers/mmc/core/sd.c > > +++ b/drivers/mmc/core/sd.c > > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) > > > > static int mmc_sd_hw_reset(struct mmc_host *host) > > { > > + int present; > > mmc_power_cycle(host, host->card->ocr); > > + > > + present = host->ops->get_cd(host); > > + > > + /* The card status could have changed while resetting. */ > > + if ((mmc_card_removed(host->card) && present) || > > + (!mmc_card_removed(host->card) && !present)) { > > + pr_info("%s: card status changed during reset\n", > > + mmc_hostname(host)); > > + host->ops->card_event(host); > > + mmc_detect_change(host, 0); > > + } > > + > > + /* Don't perform unnecessary transactions if the card is missing. */ > > + if (!present) { > > + pr_info("%s: card was removed during reset\n", > > + mmc_hostname(host)); > > + return -ENOMEDIUM; > > + } > > + > > When doing a mmc_hw_reset() (which ends up calling mmc_sd_hw_reset() > in case of SD cards), we are making a final attempt to make the card > functional again, via a power cycle and a re-init of it. > > In this path, we don't care whether the card is removed, as that > should have been detected already when the block layer calls > mmc_detect_card_removed(). mmc_detect_card_removed has the guard `host->detect_change` to prevent it from checking the card status again. So maybe the fix to the missing interrupt/race condition is to set `host->detect_change = 1` from sdhci_do_reset after calling the reset handler. This way there will always be a single poll after a full reset so the correct card status can be detected? > > > return mmc_sd_init_card(host, host->card->ocr, host->card); > > } > > > > -- > > 2.21.0.593.g511ec345e18-goog > > > > Kind regards > Uffe > Thanks again for the review!
+ Adrian On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote: > > On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote: > > On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote: > > First off, thanks for the review. > > > > > > > There is a race condition between resetting the SDHCI controller and > > > disconnecting the card. > > > > > > For example: > > > 0) Card is connected and transferring data > > > 1) mmc_sd_reset is called to reset the controller due to a data error > > > > I assume you refer to mmc_sd_hw_reset()? In that case, I think you > > have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's > > responsibility is to reset the SD-card and not the host/controller. > You are correct. I was looking at a 4.14 kernel where it's called > mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset. > > All I was trying to convey here was that a block error will eventually > call sdhci_set_ios with SOFT_RESET_ALL via: > mmc_blk_reset > mmc_hw_reset > mmc_sd_hw_reset > mmc_power_cycle > mmc_power_off > mmc_set_initial_state > sdhci_set_ios > sdhci_reinit > sdhci_init > sdhci_do_reset(host, SDHCI_RESET_ALL); > > > > > Whether there some additional "reset" of the controller needed, that > > is assumed by the core, to be managed via the ->set_ios() callback for > > the host. > > > > > 2) sdhci_set_ios calls sdhci_do_reset > > > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has > > > configured. > > > 4) Wait for SOFT_RESET_ALL to clear > > > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the > > > IRQs are not configured a CARD_REMOVED interrupt is never raised. > > > 6) IRQs are enabled again > > > 7) mmc layer never notices the device is disconnected. The SDHCI layer > > > will keep returning -ENOMEDIUM. This results in a card that is always > > > present and not functional. > > > > This sounds like host specific problems, which most likely should be > > fixed in host driver, solely. Unless I am missing something, of > > course. > > So in sdhci_do_reset we call the reset callback which is typically > sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the > controller to reset. If SDHCI_RESET_ALL was passed as the flag, the > controller will clear the IRQ mask. If during that 100ms the card is > removed there is no notification to the MMC system that the card was > removed. So it seems like the card is always present. So you are saying that the problem occurs when the card is removed during this 100ms period? I assume there a way for sdhci to re-validate whether the card has been removed during this period, right? > > > > drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ > > > 1 file changed, 20 insertions(+) > > > > > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c > > > index 265e1aeeb9d8..9206c4297d66 100644 > > > --- a/drivers/mmc/core/sd.c > > > +++ b/drivers/mmc/core/sd.c > > > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) > > > > > > static int mmc_sd_hw_reset(struct mmc_host *host) > > > { > > > + int present; > > > mmc_power_cycle(host, host->card->ocr); > > > + > > > + present = host->ops->get_cd(host); > > > + > > > + /* The card status could have changed while resetting. */ > > > + if ((mmc_card_removed(host->card) && present) || > > > + (!mmc_card_removed(host->card) && !present)) { > > > + pr_info("%s: card status changed during reset\n", > > > + mmc_hostname(host)); > > > + host->ops->card_event(host); > > > + mmc_detect_change(host, 0); > > > + } > > > + > > > + /* Don't perform unnecessary transactions if the card is missing. */ > > > + if (!present) { > > > + pr_info("%s: card was removed during reset\n", > > > + mmc_hostname(host)); > > > + return -ENOMEDIUM; > > > + } > > > + > > > > When doing a mmc_hw_reset() (which ends up calling mmc_sd_hw_reset() > > in case of SD cards), we are making a final attempt to make the card > > functional again, via a power cycle and a re-init of it. > > > > In this path, we don't care whether the card is removed, as that > > should have been detected already when the block layer calls > > mmc_detect_card_removed(). > > mmc_detect_card_removed has the guard `host->detect_change` to > prevent it from checking the card status again. So maybe the fix to the > missing interrupt/race condition is to set `host->detect_change = 1` > from sdhci_do_reset after calling the reset handler. This way there will > always be a single poll after a full reset so the correct card status can > be detected? It sounds like you should call mmc_detect_change(), after the reset to let the core check for cards that may have been removed/inserted. Would that work? [...] Kind regards Uffe
On Mon, Jun 10, 2019 at 06:17:31PM +0200, Ulf Hansson wrote: > + Adrian > > On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote: > > > > On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote: > > > On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote: > > > > First off, thanks for the review. > > > > > > > > > > There is a race condition between resetting the SDHCI controller and > > > > disconnecting the card. > > > > > > > > For example: > > > > 0) Card is connected and transferring data > > > > 1) mmc_sd_reset is called to reset the controller due to a data error > > > > > > I assume you refer to mmc_sd_hw_reset()? In that case, I think you > > > have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's > > > responsibility is to reset the SD-card and not the host/controller. > > You are correct. I was looking at a 4.14 kernel where it's called > > mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset. > > > > All I was trying to convey here was that a block error will eventually > > call sdhci_set_ios with SOFT_RESET_ALL via: > > mmc_blk_reset > > mmc_hw_reset > > mmc_sd_hw_reset > > mmc_power_cycle > > mmc_power_off > > mmc_set_initial_state > > sdhci_set_ios > > sdhci_reinit > > sdhci_init > > sdhci_do_reset(host, SDHCI_RESET_ALL); > > > > > > > > Whether there some additional "reset" of the controller needed, that > > > is assumed by the core, to be managed via the ->set_ios() callback for > > > the host. > > > > > > > 2) sdhci_set_ios calls sdhci_do_reset > > > > 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has > > > > configured. > > > > 4) Wait for SOFT_RESET_ALL to clear > > > > 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the > > > > IRQs are not configured a CARD_REMOVED interrupt is never raised. > > > > 6) IRQs are enabled again > > > > 7) mmc layer never notices the device is disconnected. The SDHCI layer > > > > will keep returning -ENOMEDIUM. This results in a card that is always > > > > present and not functional. > > > > > > This sounds like host specific problems, which most likely should be > > > fixed in host driver, solely. Unless I am missing something, of > > > course. > > > > So in sdhci_do_reset we call the reset callback which is typically > > sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the > > controller to reset. If SDHCI_RESET_ALL was passed as the flag, the > > controller will clear the IRQ mask. If during that 100ms the card is > > removed there is no notification to the MMC system that the card was > > removed. So it seems like the card is always present. > > So you are saying that the problem occurs when the card is removed > during this 100ms period? Exactly. Thought I think most controllers reset fast enough to where that window is very small. But for this AMD controller we need to do a full reset which takes a while, so we can see the problem. > > I assume there a way for sdhci to re-validate whether the card has > been removed during this period, right? I'll let Adrian chime in here. > > > > > > drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ > > > > 1 file changed, 20 insertions(+) > > > > > > > > diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c > > > > index 265e1aeeb9d8..9206c4297d66 100644 > > > > --- a/drivers/mmc/core/sd.c > > > > +++ b/drivers/mmc/core/sd.c > > > > @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) > > > > > > > > static int mmc_sd_hw_reset(struct mmc_host *host) > > > > { > > > > + int present; > > > > mmc_power_cycle(host, host->card->ocr); > > > > + > > > > + present = host->ops->get_cd(host); > > > > + > > > > + /* The card status could have changed while resetting. */ > > > > + if ((mmc_card_removed(host->card) && present) || > > > > + (!mmc_card_removed(host->card) && !present)) { > > > > + pr_info("%s: card status changed during reset\n", > > > > + mmc_hostname(host)); > > > > + host->ops->card_event(host); > > > > + mmc_detect_change(host, 0); > > > > + } > > > > + > > > > + /* Don't perform unnecessary transactions if the card is missing. */ > > > > + if (!present) { > > > > + pr_info("%s: card was removed during reset\n", > > > > + mmc_hostname(host)); > > > > + return -ENOMEDIUM; > > > > + } > > > > + > > > > > > When doing a mmc_hw_reset() (which ends up calling mmc_sd_hw_reset() > > > in case of SD cards), we are making a final attempt to make the card > > > functional again, via a power cycle and a re-init of it. > > > > > > In this path, we don't care whether the card is removed, as that > > > should have been detected already when the block layer calls > > > mmc_detect_card_removed(). > > > > mmc_detect_card_removed has the guard `host->detect_change` to > > prevent it from checking the card status again. So maybe the fix to the > > missing interrupt/race condition is to set `host->detect_change = 1` > > from sdhci_do_reset after calling the reset handler. This way there will > > always be a single poll after a full reset so the correct card status can > > be detected? > > It sounds like you should call mmc_detect_change(), after the reset to > let the core check for cards that may have been removed/inserted. > > Would that work? Yeah, I could add `mmc_detect_change()` to the AMD SDHCI patch. I just thought it would be better to fix the potential race condition at a higher level. Though I'm happy to move it to the AMD patch. Let me know. Thanks > > [...] > > Kind regards > Uffe
On 10/06/19 7:32 PM, Raul Rangel wrote: > On Mon, Jun 10, 2019 at 06:17:31PM +0200, Ulf Hansson wrote: >> + Adrian >> >> On Fri, 7 Jun 2019 at 18:05, Raul Rangel <rrangel@chromium.org> wrote: >>> >>> On Tue, May 28, 2019 at 09:38:20AM +0200, Ulf Hansson wrote: >>>> On Wed, 1 May 2019 at 19:55, Raul E Rangel <rrangel@chromium.org> wrote: >>> >>> First off, thanks for the review. >>> >>>>> >>>>> There is a race condition between resetting the SDHCI controller and >>>>> disconnecting the card. >>>>> >>>>> For example: >>>>> 0) Card is connected and transferring data >>>>> 1) mmc_sd_reset is called to reset the controller due to a data error >>>> >>>> I assume you refer to mmc_sd_hw_reset()? In that case, I think you >>>> have interpreted the purpose of mmc_sd_hw_reset() slightly wrong. It's >>>> responsibility is to reset the SD-card and not the host/controller. >>> You are correct. I was looking at a 4.14 kernel where it's called >>> mmc_sd_reset. 4.19 and above call it mmc_sd_hw_reset. >>> >>> All I was trying to convey here was that a block error will eventually >>> call sdhci_set_ios with SOFT_RESET_ALL via: >>> mmc_blk_reset >>> mmc_hw_reset >>> mmc_sd_hw_reset >>> mmc_power_cycle >>> mmc_power_off >>> mmc_set_initial_state >>> sdhci_set_ios >>> sdhci_reinit >>> sdhci_init >>> sdhci_do_reset(host, SDHCI_RESET_ALL); >>> >>>> >>>> Whether there some additional "reset" of the controller needed, that >>>> is assumed by the core, to be managed via the ->set_ios() callback for >>>> the host. >>>> >>>>> 2) sdhci_set_ios calls sdhci_do_reset >>>>> 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has >>>>> configured. >>>>> 4) Wait for SOFT_RESET_ALL to clear >>>>> 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the >>>>> IRQs are not configured a CARD_REMOVED interrupt is never raised. >>>>> 6) IRQs are enabled again >>>>> 7) mmc layer never notices the device is disconnected. The SDHCI layer >>>>> will keep returning -ENOMEDIUM. This results in a card that is always >>>>> present and not functional. >>>> >>>> This sounds like host specific problems, which most likely should be >>>> fixed in host driver, solely. Unless I am missing something, of >>>> course. >>> >>> So in sdhci_do_reset we call the reset callback which is typically >>> sdhci_reset. sdhci_reset can wait for up to 100ms waiting for the >>> controller to reset. If SDHCI_RESET_ALL was passed as the flag, the >>> controller will clear the IRQ mask. If during that 100ms the card is >>> removed there is no notification to the MMC system that the card was >>> removed. So it seems like the card is always present. >> >> So you are saying that the problem occurs when the card is removed >> during this 100ms period? > Exactly. Thought I think most controllers reset fast enough to where > that window is very small. But for this AMD controller we need to do a > full reset which takes a while, so we can see the problem. > >> >> I assume there a way for sdhci to re-validate whether the card has >> been removed during this period, right? > I'll let Adrian chime in here. > >>> >>>>> drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ >>>>> 1 file changed, 20 insertions(+) >>>>> >>>>> diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c >>>>> index 265e1aeeb9d8..9206c4297d66 100644 >>>>> --- a/drivers/mmc/core/sd.c >>>>> +++ b/drivers/mmc/core/sd.c >>>>> @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) >>>>> >>>>> static int mmc_sd_hw_reset(struct mmc_host *host) >>>>> { >>>>> + int present; >>>>> mmc_power_cycle(host, host->card->ocr); >>>>> + >>>>> + present = host->ops->get_cd(host); >>>>> + >>>>> + /* The card status could have changed while resetting. */ >>>>> + if ((mmc_card_removed(host->card) && present) || >>>>> + (!mmc_card_removed(host->card) && !present)) { >>>>> + pr_info("%s: card status changed during reset\n", >>>>> + mmc_hostname(host)); >>>>> + host->ops->card_event(host); >>>>> + mmc_detect_change(host, 0); >>>>> + } >>>>> + >>>>> + /* Don't perform unnecessary transactions if the card is missing. */ >>>>> + if (!present) { >>>>> + pr_info("%s: card was removed during reset\n", >>>>> + mmc_hostname(host)); >>>>> + return -ENOMEDIUM; >>>>> + } >>>>> + >>>> >>>> When doing a mmc_hw_reset() (which ends up calling mmc_sd_hw_reset() >>>> in case of SD cards), we are making a final attempt to make the card >>>> functional again, via a power cycle and a re-init of it. >>>> >>>> In this path, we don't care whether the card is removed, as that >>>> should have been detected already when the block layer calls >>>> mmc_detect_card_removed(). >>> >>> mmc_detect_card_removed has the guard `host->detect_change` to >>> prevent it from checking the card status again. So maybe the fix to the >>> missing interrupt/race condition is to set `host->detect_change = 1` >>> from sdhci_do_reset after calling the reset handler. This way there will >>> always be a single poll after a full reset so the correct card status can >>> be detected? >> >> It sounds like you should call mmc_detect_change(), after the reset to >> let the core check for cards that may have been removed/inserted. >> >> Would that work? > Yeah, I could add `mmc_detect_change()` to the AMD SDHCI patch. I just > thought it would be better to fix the potential race condition at a > higher level. Though I'm happy to move it to the AMD patch. Let me know. Does the following work? diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c index 0cd5f2ce98df..f672171246b0 100644 --- a/drivers/mmc/host/sdhci.c +++ b/drivers/mmc/host/sdhci.c @@ -341,8 +341,19 @@ static void sdhci_init(struct sdhci_host *host, int soft) static void sdhci_reinit(struct sdhci_host *host) { + u32 cd = host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT); + sdhci_init(host, 0); sdhci_enable_card_detection(host); + + /* + * A change to the card detect bits indicates a change in present state, + * refer sdhci_set_card_detection(). A card detect interrupt might have + * been missed while the host controller was being reset, so trigger a + * rescan to check. + */ + if (cd != (host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT))) + mmc_detect_change(host->mmc, msecs_to_jiffies(200)); } static void __sdhci_led_activate(struct sdhci_host *host)
On Tue, Jun 11, 2019 at 01:30:55PM +0300, Adrian Hunter wrote: > Does the following work? > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c > index 0cd5f2ce98df..f672171246b0 100644 > --- a/drivers/mmc/host/sdhci.c > +++ b/drivers/mmc/host/sdhci.c > @@ -341,8 +341,19 @@ static void sdhci_init(struct sdhci_host *host, int soft) > > static void sdhci_reinit(struct sdhci_host *host) > { > + u32 cd = host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT); > + > sdhci_init(host, 0); > sdhci_enable_card_detection(host); > + > + /* > + * A change to the card detect bits indicates a change in present state, > + * refer sdhci_set_card_detection(). A card detect interrupt might have > + * been missed while the host controller was being reset, so trigger a > + * rescan to check. > + */ > + if (cd != (host->ier & (SDHCI_INT_CARD_REMOVE | SDHCI_INT_CARD_INSERT))) > + mmc_detect_change(host->mmc, msecs_to_jiffies(200)); > } > > static void __sdhci_led_activate(struct sdhci_host *host) Your patch looks good. I tried it out and got over 57k insertion/removal iterations. Do you want me to send out your patch, or do you want to do it? Just to recap, the patch you proposed + the AMD SDHCI specific patch fix the problem. Thanks!
On Wed, Jun 19, 2019 at 08:56:25AM -0600, Raul Rangel wrote: > Your patch looks good. I tried it out and got over 57k insertion/removal > iterations. Do you want me to send out your patch, or do you want to do > it? > > Just to recap, the patch you proposed + the AMD SDHCI specific patch fix > the problem. > Just pinging the thread again to see what you would prefer. I can send the patch with you in the Signed-off. Or if you would prefer to not update the sdhci driver at all, I can add the logic into the AMD specific patch. I would like to get this in so it lands by 5.4. Thanks, Raul
On 1/08/19 6:16 PM, Raul Rangel wrote: > On Wed, Jun 19, 2019 at 08:56:25AM -0600, Raul Rangel wrote: >> Your patch looks good. I tried it out and got over 57k insertion/removal >> iterations. Do you want me to send out your patch, or do you want to do >> it? >> >> Just to recap, the patch you proposed + the AMD SDHCI specific patch fix >> the problem. >> > > Just pinging the thread again to see what you would prefer. I can send > the patch with you in the Signed-off. Or if you would prefer to not > update the sdhci driver at all, I can add the logic into the AMD > specific patch. > > I would like to get this in so it lands by 5.4. You seem not to have answered to my suggestion for a change to sdhci_reinit() here: https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/
On Fri, Aug 02, 2019 at 07:58:20AM +0300, Adrian Hunter wrote: > > You seem not to have answered to my suggestion for a change to sdhci_reinit() here: > > https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/ > I thought I answered it here: https://lore.kernel.org/lkml/20190619145625.GA50985@google.com/#t Did I miss something? Thanks
On 5/08/19 7:49 PM, Raul Rangel wrote: > On Fri, Aug 02, 2019 at 07:58:20AM +0300, Adrian Hunter wrote: >> >> You seem not to have answered to my suggestion for a change to sdhci_reinit() here: >> >> https://lore.kernel.org/lkml/fcdf6cc4-2729-abe2-85c8-b0d04901c5ae@intel.com/ >> > I thought I answered it here: https://lore.kernel.org/lkml/20190619145625.GA50985@google.com/#t > > Did I miss something? I didn't get that mail for some reason, sorry. Please send out my patch and your second with my ack.
diff --git a/drivers/mmc/core/sd.c b/drivers/mmc/core/sd.c index 265e1aeeb9d8..9206c4297d66 100644 --- a/drivers/mmc/core/sd.c +++ b/drivers/mmc/core/sd.c @@ -1242,7 +1242,27 @@ static int mmc_sd_runtime_resume(struct mmc_host *host) static int mmc_sd_hw_reset(struct mmc_host *host) { + int present; mmc_power_cycle(host, host->card->ocr); + + present = host->ops->get_cd(host); + + /* The card status could have changed while resetting. */ + if ((mmc_card_removed(host->card) && present) || + (!mmc_card_removed(host->card) && !present)) { + pr_info("%s: card status changed during reset\n", + mmc_hostname(host)); + host->ops->card_event(host); + mmc_detect_change(host, 0); + } + + /* Don't perform unnecessary transactions if the card is missing. */ + if (!present) { + pr_info("%s: card was removed during reset\n", + mmc_hostname(host)); + return -ENOMEDIUM; + } + return mmc_sd_init_card(host, host->card->ocr, host->card); }
There is a race condition between resetting the SDHCI controller and disconnecting the card. For example: 0) Card is connected and transferring data 1) mmc_sd_reset is called to reset the controller due to a data error 2) sdhci_set_ios calls sdhci_do_reset 3) SOFT_RESET_ALL is toggled which clears the IRQs the controller has configured. 4) Wait for SOFT_RESET_ALL to clear 5) CD logic notices card is gone and CARD_PRESENT goes low, but since the IRQs are not configured a CARD_REMOVED interrupt is never raised. 6) IRQs are enabled again 7) mmc layer never notices the device is disconnected. The SDHCI layer will keep returning -ENOMEDIUM. This results in a card that is always present and not functional. Signed-off-by: Raul E Rangel <rrangel@chromium.org> --- You can see an example of the following two patches here: https://privatebin.net/?b0f5953716d34ca6#C699bCBQ99NdvspfDW7CMucT8CJG4DgL+yUNPyepDCo= Line 8213: EILSEQ Line 8235: SDHC is hard reset Line 8240: Controller completes reset and card is no longer present Line 8379: mmc_sd_reset notices card is missing and issues a card_event and schedules a detect change. Line 8402: Don't init the card since it's already gone. Line 8717: Marks card as removed Line 8820: mmc_sd_remove removes the block device I am running into a kernel panic. A task gets stuck for more than 120 seconds. I keep seeing blkdev_close in the stack trace, so maybe I'm not calling something correctly? Here is the panic: https://privatebin.net/?8ec48c1547d19975#dq/h189w5jmTlbMKKAwZjUr4bhm7Q2AgvGdRqc5BxAc= I sometimes see the following: [ 547.943974] udevd[144]: seq 2350 '/devices/pci0000:00/0000:00:14.7/mmc_host/mmc0/mmc0:0001/block/mmcblk0/mmcblk0p1' is taking a long time I was getting the kernel panic on a 4.14 kernel: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/f3dc032faf4d074f20ada437e2d081a28ac699da/drivers/mmc/host So I'm guessing I'm missing an upstream fix. Do the patches look correct or am I doing something that would cause a kernel panic? I have a DUT setup with a GPIO I can use to toggle the CD pin. I ran a test where I connect and then randomly, between 0s - 1s disconnect the card. This got over 20k iterations before the panic. Though when I do it manually and stop for 2 minutes the panic happens. Any help would be appreciated. Thanks, Raul drivers/mmc/core/sd.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)