mmc: dw_mmc: Consider HLE errors to be data and command errors

Message ID 1426002490-2014-1-git-send-email-dianders@chromium.org (mailing list archive)
State New, archived

Commit Message

Doug Anderson March 10, 2015, 3:48 p.m. UTC
The dw_mmc driver enables HLE errors as part of DW_MCI_ERROR_FLAGS but
nothing in the interrupt handler actually handles them and ACKs them.
That means that if we ever get an HLE error we'll just keep getting
interrupts and we'll wedge things.

We really don't expect HLE errors but if we ever get them we shouldn't
silently ignore them.

Note that I have seen HLE errors while constantly ejecting and
inserting cards (ejecting while inserting, etc).

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
Note that this works together with the patch I sent up yesterday (the
CMD 11 timer).  I would have sent the two together except that I had
local printouts (and ACKing of HLE) and didn't realize that this was
also required for a full solution.

 drivers/mmc/host/dw_mmc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
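
To make the effect of the three changed lines concrete, here is a small
standalone sketch of the mask arithmetic. The bit positions and the
OLD_/NEW_ names are assumptions made for illustration, not copied from
the driver header. The point is only that the combined mask stays the
same while HLE moves into the two masks the handler actually tests.

```c
#include <stdint.h>

/* Illustrative bit positions (assumed for this sketch). */
#define BIT(n)              (1u << (n))
#define SDMMC_INT_RESP_ERR  BIT(1)
#define SDMMC_INT_RCRC      BIT(6)
#define SDMMC_INT_DCRC      BIT(7)
#define SDMMC_INT_RTO       BIT(8)
#define SDMMC_INT_DRTO      BIT(9)
#define SDMMC_INT_HTO       BIT(10)
#define SDMMC_INT_HLE       BIT(12)
#define SDMMC_INT_SBE       BIT(13)
#define SDMMC_INT_EBE       BIT(15)

/* Before the patch: HLE appears only in the combined mask. */
#define OLD_DATA_ERROR_FLAGS (SDMMC_INT_DRTO | SDMMC_INT_DCRC | \
                              SDMMC_INT_HTO | SDMMC_INT_SBE | \
                              SDMMC_INT_EBE)
#define OLD_CMD_ERROR_FLAGS  (SDMMC_INT_RTO | SDMMC_INT_RCRC | \
                              SDMMC_INT_RESP_ERR)
#define OLD_ERROR_FLAGS      (OLD_DATA_ERROR_FLAGS | \
                              OLD_CMD_ERROR_FLAGS | SDMMC_INT_HLE)

/* After the patch: HLE is part of both the data and command masks. */
#define NEW_DATA_ERROR_FLAGS (OLD_DATA_ERROR_FLAGS | SDMMC_INT_HLE)
#define NEW_CMD_ERROR_FLAGS  (OLD_CMD_ERROR_FLAGS | SDMMC_INT_HLE)
#define NEW_ERROR_FLAGS      (NEW_DATA_ERROR_FLAGS | NEW_CMD_ERROR_FLAGS)

/* Nonzero if an HLE would be matched by the command or data branch. */
static int hle_is_handled(uint32_t cmd_flags, uint32_t data_flags)
{
	return ((cmd_flags | data_flags) & SDMMC_INT_HLE) != 0;
}
```

With these definitions the set of enabled error interrupts is unchanged;
only the branch that sees HLE differs.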

Comments

Jaehoon Chung March 13, 2015, 11:30 a.m. UTC | #1
Hi, Doug.

On 03/11/2015 12:48 AM, Doug Anderson wrote:
> The dw_mmc driver enables HLE errors as part of DW_MCI_ERROR_FLAGS but
> nothing in the interrupt handler actually handles them and ACKs them.
> That means that if we ever get an HLE error we'll just keep getting
> interrupts and we'll wedge things.
> 
> We really don't expect HLE errors but if we ever get them we shouldn't
> silently ignore them.
> 
> Note that I have seen HLE errors while constantly ejecting and
> inserting cards (ejecting while inserting, etc).

Right, it occurs when a card is being inserted or ejected. (This is the removable-card case.)
Did you test with eMMC? We need to consider how to handle HLE errors.
But I don't think this patch can solve every HLE problem.

Best Regards,
Jaehoon Chung

> 
> Signed-off-by: Doug Anderson <dianders@chromium.org>
> ---
> Note that this works together with the patch I sent up yesterday (the
> CMD 11 timer).  I would have sent the two together except that I had
> local printouts (and ACKing of HLE) and didn't realize that this was
> also required for a full solution.
> 
>  drivers/mmc/host/dw_mmc.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> index 47dfd0e..294edc9c 100644
> --- a/drivers/mmc/host/dw_mmc.c
> +++ b/drivers/mmc/host/dw_mmc.c
> @@ -44,11 +44,11 @@
>  /* Common flag combinations */
>  #define DW_MCI_DATA_ERROR_FLAGS	(SDMMC_INT_DRTO | SDMMC_INT_DCRC | \
>  				 SDMMC_INT_HTO | SDMMC_INT_SBE  | \
> -				 SDMMC_INT_EBE)
> +				 SDMMC_INT_EBE | SDMMC_INT_HLE)
>  #define DW_MCI_CMD_ERROR_FLAGS	(SDMMC_INT_RTO | SDMMC_INT_RCRC | \
> -				 SDMMC_INT_RESP_ERR)
> +				 SDMMC_INT_RESP_ERR | SDMMC_INT_HLE)
>  #define DW_MCI_ERROR_FLAGS	(DW_MCI_DATA_ERROR_FLAGS | \
> -				 DW_MCI_CMD_ERROR_FLAGS  | SDMMC_INT_HLE)
> +				 DW_MCI_CMD_ERROR_FLAGS)
>  #define DW_MCI_SEND_STATUS	1
>  #define DW_MCI_RECV_STATUS	2
>  #define DW_MCI_DMA_THRESHOLD	16
>
Doug Anderson March 13, 2015, 8:27 p.m. UTC | #2
Hi,

On Fri, Mar 13, 2015 at 4:30 AM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
> Hi, Doug.
>
> On 03/11/2015 12:48 AM, Doug Anderson wrote:
>> The dw_mmc driver enables HLE errors as part of DW_MCI_ERROR_FLAGS but
>> nothing in the interrupt handler actually handles them and ACKs them.
>> That means that if we ever get an HLE error we'll just keep getting
>> interrupts and we'll wedge things.
>>
>> We really don't expect HLE errors but if we ever get them we shouldn't
>> silently ignore them.
>>
>> Note that I have seen HLE errors while constantly ejecting and
>> inserting cards (ejecting while inserting, etc).
>
> Right, It is occurred when card inserting/ejecting.(This case is the case of removable card.)
> Did you test with eMMC? We needs to consider how control HLE error.

I'm running it on systems with eMMC, SD Cards, and SDIO WiFi.  HLE
doesn't show up in normal circumstances, only in ejecting the SD card
at the wrong time.  ...since you can't eject eMMC, I didn't see
problems there.

> But I think this patch can't solve all of HLE problem.

Agreed.  HLE means that the controller is pretty wedged and (as I
understand it) means that there's something else we're doing wrong
elsewhere in the dw_mmc driver (like writing more data to an already
busy controller).  We should probably track down and find those cases,
too.

I agree also that this code probably won't fix the controller in all
cases of HLE errors.  ...but I'm not 100% certain of the best way to
really do that, do you know?

...but in any case the absolute worst thing to do is what the driver
is already doing: unmask the HLE interrupt but never handle it
anywhere...  My patch is at least better than that...
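
The "unmasked but never handled" failure mode can be sketched with a toy
model. This is an assumption-laden illustration, not driver code: the
struct, the function names, and the HLE bit position are hypothetical,
and the two fields are only loosely modeled on a raw status register
that is cleared by writing 1 and an interrupt mask register.

```c
#include <stdint.h>

#define HLE_BIT (1u << 12)	/* assumed bit position */

struct fake_host {
	uint32_t rintsts;	/* raw interrupt status, write-1-to-clear */
	uint32_t intmask;	/* enabled (unmasked) interrupt sources */
};

/*
 * One pass of a simplified handler: only bits in handled_mask are ACKed.
 * Returns the enabled bits that are still pending afterwards.
 */
static uint32_t fake_irq_pass(struct fake_host *h, uint32_t handled_mask)
{
	uint32_t pending = h->rintsts & h->intmask;

	h->rintsts &= ~(pending & handled_mask);	/* ACK handled bits */
	return h->rintsts & h->intmask;
}

/*
 * Count how often the handler is re-entered before the interrupt line
 * goes quiet, giving up after 'limit' passes.
 */
static int fake_irq_storm(struct fake_host *h, uint32_t handled_mask,
			  int limit)
{
	int passes = 0;

	while ((h->rintsts & h->intmask) && passes < limit) {
		fake_irq_pass(h, handled_mask);
		passes++;
	}
	return passes;
}
```

If HLE is enabled in intmask but missing from handled_mask, the loop only
stops at the limit, which is the model's version of the system wedging.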

If you have another suggested way to make HLE error handling better
(or avoid them to begin with) I'm happy to test!  :)


-Doug
Jaehoon Chung March 16, 2015, 5:56 a.m. UTC | #3
Hi, Doug.

On 03/14/2015 05:27 AM, Doug Anderson wrote:
> Hi,
> 
> On Fri, Mar 13, 2015 at 4:30 AM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>> Hi, Doug.
>>
>> On 03/11/2015 12:48 AM, Doug Anderson wrote:
>>> The dw_mmc driver enables HLE errors as part of DW_MCI_ERROR_FLAGS but
>>> nothing in the interrupt handler actually handles them and ACKs them.
>>> That means that if we ever get an HLE error we'll just keep getting
>>> interrupts and we'll wedge things.
>>>
>>> We really don't expect HLE errors but if we ever get them we shouldn't
>>> silently ignore them.
>>>
>>> Note that I have seen HLE errors while constantly ejecting and
>>> inserting cards (ejecting while inserting, etc).
>>
>> Right, It is occurred when card inserting/ejecting.(This case is the case of removable card.)
>> Did you test with eMMC? We needs to consider how control HLE error.
> 
> I'm running it on systems with eMMC, SD Cards, and SDIO WiFi.  HLE
> doesn't show up in normal circumstances, only in ejecting the SD card
> at the wrong time.  ...since you can't eject eMMC, I didn't see
> problems there.

HLE often occurs while a card is being inserted or removed, because
(as I understand it) some requests are still in the queue when the card is removed.
It's also related to clock control.

> 
>> But I think this patch can't solve all of HLE problem.
> 
> Agreed.  HLE means that the controller is pretty wedged and (as I
> understand it) means that there's something else we're doing wrong
> elsewhere in the dw_mmc driver (like writing more data to an already
> busy controller).  We should probably track down and find those cases,
> too.
> 
> I agree also that this code probably won't fix the controller in all
> cases of HLE errors.  ...but I'm not 100% certain of the best way to
> really do that, do you know?
> 
> ...but in any case the absolute worst thing to do is what the driver
> is already doing: unmask the HLE interrupt but never handle it
> anywhere...  My patch is at least better than that...

Agreed, your patch is at least better than the current state.
But if the HLE error bit is set in pending,
it will hit both the DW_MCI_DATA_ERROR_FLAGS and DW_MCI_CMD_ERROR_FLAGS cases,
and I think send_stop_abort() can't run then, can it?
(If HLE occurs on a non-removable card, the controller can't do anything.)

If I can reproduce the HLE error, I can check in more detail. (I'm trying to reproduce it.)
I haven't found a complete solution yet, but finding one is my (or our?) job going forward.
Actually, in my local tree I do a ctrl reset when an HLE error occurs.
(That's not a real solution either.)
According to the TRM, "HLE is raised, software then has to reload the command."
We need to consider how to reload the command without losing the previous request.

> 
> If you have another suggested way to make HLE error handling better
> (or avoid them to begin with) I'm happy to test!  :)

I will try to work out HLE error handling. If you have other ideas, please let me know.
I need to hear other opinions; they're a great help to me. :)

Thanks a lot!

Best Regards,
Jaehoon Chung

> 
> 
> -Doug
>
Jaehoon Chung March 30, 2015, 12:55 a.m. UTC | #4
Dear Doug,

I'm considering how to handle HLE errors, so I'm holding this patch for now.
If this patch is absolutely necessary, please let me know.

Best Regards,
Jaehoon Chung

On 03/16/2015 02:56 PM, Jaehoon Chung wrote:
> Hi, Doug.
> 
> On 03/14/2015 05:27 AM, Doug Anderson wrote:
>> Hi,
>>
>> On Fri, Mar 13, 2015 at 4:30 AM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Hi, Doug.
>>>
>>> On 03/11/2015 12:48 AM, Doug Anderson wrote:
>>>> The dw_mmc driver enables HLE errors as part of DW_MCI_ERROR_FLAGS but
>>>> nothing in the interrupt handler actually handles them and ACKs them.
>>>> That means that if we ever get an HLE error we'll just keep getting
>>>> interrupts and we'll wedge things.
>>>>
>>>> We really don't expect HLE errors but if we ever get them we shouldn't
>>>> silently ignore them.
>>>>
>>>> Note that I have seen HLE errors while constantly ejecting and
>>>> inserting cards (ejecting while inserting, etc).
>>>
>>> Right, It is occurred when card inserting/ejecting.(This case is the case of removable card.)
>>> Did you test with eMMC? We needs to consider how control HLE error.
>>
>> I'm running it on systems with eMMC, SD Cards, and SDIO WiFi.  HLE
>> doesn't show up in normal circumstances, only in ejecting the SD card
>> at the wrong time.  ...since you can't eject eMMC, I didn't see
>> problems there.
> 
> When card is inserting/removing, HLE is often occurred.
> Since there is some request into queue when card is removed.(in my understanding.)
> It's also related with controlling clock.
> 
>>
>>> But I think this patch can't solve all of HLE problem.
>>
>> Agreed.  HLE means that the controller is pretty wedged and (as I
>> understand it) means that there's something else we're doing wrong
>> elsewhere in the dw_mmc driver (like writing more data to an already
>> busy controller).  We should probably track down and find those cases,
>> too.
>>
>> I agree also that this code probably won't fix the controller in all
>> cases of HLE errors.  ...but I'm not 100% certain of the best way to
>> really do that, do you know?
>>
>> ...but in any case the absolute worst thing to do is what the driver
>> is already doing: unmask the HLE interrupt but never handle it
>> anywhere...  My patch is at least better than that...
> 
> Agreed, your patch should be at least better than now.
> But if pending is set HLE error bit,
> it should hit the cases of DW_MCI_DATA_ERROR_FLAGS & DW_MCI_CMD_ERROR_FLAGS.
> and i think send_stop_abort() can't run, doesn't?
> (If HLE is occurred at non-removable card, controller can't do anything.)
> 
> If i can reproduce HLE error, i can check more detailedly.(Trying to reproduce it.)
> I don't find fully solution yet. But finding the solution is my or our(?) part/role in future.
> Actually, i'm using the ctrl reset at my local tree, when HLE error is occurred.
> (Also it's not solution..)
> According to TRM, "HLE is raised, software then has to reload the command."
> We needs to consider how reload the command without lost previous request.
> 
>>
>> If you have another suggested way to make HLE error handling better
>> (or avoid them to begin with) I'm happy to test!  :)
> 
> I will try to find HLE error handling..if you also have other opinion, let me know, plz.
> I needs to listen other opinion, it's great helpful to me.. :)
> 
> Thank you a lot!
> 
> Best Regards,
> Jaehoon Chung
> 
>>
>>
>> -Doug
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Doug Anderson March 30, 2015, 3:47 p.m. UTC | #5
Jaehoon,

On Sun, Mar 29, 2015 at 5:55 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
> Dear Doug,
>
> I'm considering to control HLE error..So holding this patch.
> If this is absolutely necessary patch, let me know, plz.
>
> Best Regards,
> Jaehoon Chung

Sounds OK.  I have certainly applied this locally and the driver isn't
robust against insertions / removals without it, but once the card is
inserted things are OK so it's probably not urgent that it be applied
upstream.  Hopefully we can figure out a better solution...

-Doug
Doug Anderson May 18, 2016, 12:47 a.m. UTC | #6
Jaehoon,

On Mon, Mar 30, 2015 at 8:47 AM, Doug Anderson <dianders@chromium.org> wrote:
> Jaehoon,
>
> On Sun, Mar 29, 2015 at 5:55 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>> Dear Doug,
>>
>> I'm considering to control HLE error..So holding this patch.
>> If this is absolutely necessary patch, let me know, plz.
>>
>> Best Regards,
>> Jaehoon Chung
>
> Sounds OK.  I have certainly applied this locally and the driver isn't
> robust against insertions / removals without it, but once the card is
> inserted things are OK so it's probably not urgent that it be applied
> upstream.  Hopefully we can figure out a better solution...

I'm now testing a nice new rebased kernel and I'm hitting this again.

Of course I'll just pick my same patch to my new kernel tree, but
since it's been a year and nobody has done anything better, would you
consider landing my patch?  It is certainly better than nothing.

-Doug
Shawn Lin May 18, 2016, 1:59 a.m. UTC | #7
Hi Doug,

On 2016-5-18 8:47, Doug Anderson wrote:
> Jaehoon,
>
> On Mon, Mar 30, 2015 at 8:47 AM, Doug Anderson <dianders@chromium.org> wrote:
>> Jaehoon,
>>
>> On Sun, Mar 29, 2015 at 5:55 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Dear Doug,
>>>
>>> I'm considering to control HLE error..So holding this patch.
>>> If this is absolutely necessary patch, let me know, plz.
>>>
>>> Best Regards,
>>> Jaehoon Chung
>> Sounds OK.  I have certainly applied this locally and the driver isn't
>> robust against insertions / removals without it, but once the card is
>> inserted things are OK so it's probably not urgent that it be applied
>> upstream.  Hopefully we can figure out a better solution...
> I'm now testing a nice new rebased kernel and I'm hitting this again.
>
> Of course I'll just pick my same patch to my new kernel tree, but
> since it's been a year and nobody has done anything better, would you
> consider landing my patch?  It is certainly better than nothing.

Could you try this patch to see if you can still find HLE?

@@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci *host, u32 status)
  static void dw_mci_handle_cd(struct dw_mci *host)
  {
         int i;
+       int present;

         for (i = 0; i < host->num_slots; i++) {
                 struct dw_mci_slot *slot = host->slot[i];

                 if (!slot)
                         continue;

+               present = !(mci_readl(slot->host, CDETECT) & (1 << slot->id));
+               if (present)
+                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
+               else
+                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);

                 if (slot->mmc->ops->card_event)
                         slot->mmc->ops->card_event(slot->mmc);
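
The presence test in the hunk above can be exercised in isolation. The
sketch below assumes, as the patch does, that a cleared bit in the card
detect register means a card is present. The helper names and the flag
index are hypothetical, and plain bit operations stand in for
set_bit()/clear_bit().

```c
#include <stdint.h>

#define CARD_PRESENT_FLAG 0	/* hypothetical flag index */

/* A cleared bit for this slot in the card detect value means "present". */
static int slot_card_present(uint32_t cdetect, unsigned int slot_id)
{
	return !(cdetect & (1u << slot_id));
}

/* Return 'flags' with the presence flag updated from the register value. */
static unsigned long update_present_flag(unsigned long flags,
					 uint32_t cdetect,
					 unsigned int slot_id)
{
	if (slot_card_present(cdetect, slot_id))
		flags |= 1ul << CARD_PRESENT_FLAG;
	else
		flags &= ~(1ul << CARD_PRESENT_FLAG);
	return flags;
}
```

Note that this only helps when the controller's own card detect line is
wired up; a board using a GPIO for card detect would not see a change in
this register.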


>
> -Doug
>
Jaehoon Chung May 18, 2016, 2:08 a.m. UTC | #8
On 05/18/2016 09:47 AM, Doug Anderson wrote:
> Jaehoon,
> 
> On Mon, Mar 30, 2015 at 8:47 AM, Doug Anderson <dianders@chromium.org> wrote:
>> Jaehoon,
>>
>> On Sun, Mar 29, 2015 at 5:55 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>> Dear Doug,
>>>
>>> I'm considering to control HLE error..So holding this patch.
>>> If this is absolutely necessary patch, let me know, plz.
>>>
>>> Best Regards,
>>> Jaehoon Chung
>>
>> Sounds OK.  I have certainly applied this locally and the driver isn't
>> robust against insertions / removals without it, but once the card is
>> inserted things are OK so it's probably not urgent that it be applied
>> upstream.  Hopefully we can figure out a better solution...
> 
> I'm now testing a nice new rebased kernel and I'm hitting this again.
> 
> Of course I'll just pick my same patch to my new kernel tree, but
> since it's been a year and nobody has done anything better, would you
> consider landing my patch?  It is certainly better than nothing.

Sure, you're right.
I think the main cause of HLE is wait_prvdata_complete. (I'm guessing.)
On the other hand, the dwmmc controller may be handling something incorrectly. (I have seen HLE occur in a similar case.)
Even after we find the main solution, it's not bad to have your patch applied to the dwmmc controller driver.

Ulf has already sent the PR for next, so if we need to apply this, I will apply it on fixes.

Best Regards,
Jaehoon Chung

> 
> -Doug
> 
>
Doug Anderson May 18, 2016, 4:12 a.m. UTC | #9
Hi,

On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
<shawn.lin@kernel-upstream.org> wrote:
> Could you try this patch to see if you can still find HLE?
>
> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
> *host, u32 status)
>  static void dw_mci_handle_cd(struct dw_mci *host)
>  {
>         int i;
> +       int present;
>
>         for (i = 0; i < host->num_slots; i++) {
>                 struct dw_mci_slot *slot = host->slot[i];
>
>                 if (!slot)
>                         continue;
>
> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
> slot->id));
> +               if (present)
> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
> +               else
> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);

No, because we don't use the builtin card detect on veyron.  ;)

We use GPIO card detect because we didn't like the way JTAG and SD
interacted.  Also on rk3288 the builtin card detect line had the wrong
voltage domain (you couldn't detect a card when the IO lines were
powered off).  The builtin card detect line is always driven low on
veyron.


I'm nearly certain that the root cause of my HLE errors is actually
related to the same problem addressed by the commit 7c5209c315ea
("mmc: core: Increase delay for voltage to stabilize from 3.3V to
1.8V").  I think that on minnie we're still on the hairy edge and
sometimes the line doesn't transition fast enough.

It appears that increasing this to 30ms avoids the HLE errors.

I _think_ I can actually fully fix this properly by temporarily
engaging the internal pull-ups while the voltage switch is happening.
This will bleed away the voltage just a little bit faster (since lines
are driven low here).  I'll try to confirm that.


In any case, it seems like we should take this patch since (without
this patch) the failure case when you get HLE errors is that the
interrupt controller fires over and over again (with no printouts) and
your system stalls with no error messages.

-Doug
Doug Anderson May 18, 2016, 4:13 a.m. UTC | #10
Hi,

On Tue, May 17, 2016 at 7:08 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
> On 05/18/2016 09:47 AM, Doug Anderson wrote:
>> Jaehoon,
>>
>> On Mon, Mar 30, 2015 at 8:47 AM, Doug Anderson <dianders@chromium.org> wrote:
>>> Jaehoon,
>>>
>>> On Sun, Mar 29, 2015 at 5:55 PM, Jaehoon Chung <jh80.chung@samsung.com> wrote:
>>>> Dear Doug,
>>>>
>>>> I'm considering to control HLE error..So holding this patch.
>>>> If this is absolutely necessary patch, let me know, plz.
>>>>
>>>> Best Regards,
>>>> Jaehoon Chung
>>>
>>> Sounds OK.  I have certainly applied this locally and the driver isn't
>>> robust against insertions / removals without it, but once the card is
>>> inserted things are OK so it's probably not urgent that it be applied
>>> upstream.  Hopefully we can figure out a better solution...
>>
>> I'm now testing a nice new rebased kernel and I'm hitting this again.
>>
>> Of course I'll just pick my same patch to my new kernel tree, but
>> since it's been a year and nobody has done anything better, would you
>> consider landing my patch?  It is certainly better than nothing.
>
> Sure, it's right.
> I think that main reason of HLE is wait_prvdata_complete. (I'm guessing..)
> On other hands, dwmmc controller is handling something wrong. (I found that HLE is occurred the similar case.)
> After find the main solution, it's not bad that your patch is applied on dwmmc controller.
>
> Ulf have sent PR for next..So if we needs to apply this, i will apply on fix.

It's not new, so I'd say just queue it up for the next version
whenever it's convenient.

-Doug
Shawn Lin May 18, 2016, 9:14 a.m. UTC | #11
Hi

On 2016-5-18 12:12, Doug Anderson wrote:
> Hi,
>
> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
> <shawn.lin@kernel-upstream.org> wrote:
>> Could you try this patch to see if you can still find HLE?
>>
>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>> *host, u32 status)
>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>   {
>>          int i;
>> +       int present;
>>
>>          for (i = 0; i < host->num_slots; i++) {
>>                  struct dw_mci_slot *slot = host->slot[i];
>>
>>                  if (!slot)
>>                          continue;
>>
>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>> slot->id));
>> +               if (present)
>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>> +               else
>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>
> No, because we don't use the builtin card detect on veyron.  ;)
>
> We use GPIO card detect because we didn't like the way JTAG and SD
> interacted.  Also on rk3288 the builtin card detect line had the wrong
> voltage domain (you couldn't detect a card when the IO lines were
> powered off).  The builtin card detect line is always driven low on
> veyron.

Okay, I see.

>
>
> I'm nearly certain that the root cause of my HLE errors is actually
> related to the same problem addressed by the commit 7c5209c315ea
> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
> 1.8V").  I think that on minnie we're still on the hairy edge and
> sometimes the line doesn't transition fast enough.

From your details, I don't think things are that simple.

I did not have SD3.0 support enabled, and I still saw HLE sometimes.
So it seems commit 7c5209c315ea does not contribute to this phenomenon.

The scenario looks like this:
remove sd-card -> mmc_sd_detect -> send status (CMD13) -> power_off ->
set_ios -> setup_bus -> disable clk, and then the HLE irq storm arrives.

From the code of dw_mci_prepare_command:
SDMMC_CMD_PRV_DAT_WAIT is not used for CMD13, so we don't
wait_busy here. Is the cmd then loaded into the dw_mmc queue but
still failing to be sent out because the controller is busy?

With my patch, things go well:
remove sd-card -> clear the DW_MMC_CARD_PRESENT bit -> send
status (CMD13) returns directly -> power_off -> set_ios -> setup_bus ->
disable clk

So why should we allow a card status inquiry if we are sure the card has
been removed? I mean that no further cmds should be delivered.
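
The "return directly" step in that flow amounts to a guard in front of
the command path. A minimal sketch of the idea follows; the struct, the
function name, and the choice of -ENODEV as the error code are all
hypothetical and only illustrate refusing commands once the card is
known to be gone.

```c
#include <errno.h>
#include <stdbool.h>

struct fake_slot {
	bool card_present;	/* stands in for the card-present flag */
	int cmds_sent;		/* commands actually handed to the controller */
};

/*
 * Returns 0 if the command was handed to the controller, or -ENODEV if
 * it was refused because no card is present.
 */
static int fake_send_cmd(struct fake_slot *slot, unsigned int opcode)
{
	(void)opcode;		/* the opcode does not affect the guard */

	if (!slot->card_present)
		return -ENODEV;	/* complete immediately, touch no hardware */

	slot->cmds_sent++;
	return 0;
}
```

The guard is only as good as the presence flag behind it, so it depends
on card removal being detected before the next command is issued.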

And another question: should we wait for busy before sending cmd13?

>
> It appears that increasing this to 30ms avoids the HLE errors.
>
> I _think_ I can actually fully fix this properly by temporarily
> engaging the internal pull-ups while the voltage switch is happening.
> This will bleed away the voltage just a little bit faster (since lines
> are driven low here).  I'll try to confirm that.
>
>
> In any case, it seems like we should take this patch since (without
> this patch) the failure case when you get HLE errors is that the
> interrupt controller fires over and over again (with no printouts) and
> your system stalls with no error messages.

Sure, at least we need to address this irq storm...

>
> -Doug
>
>
>
Doug Anderson May 18, 2016, 5:37 p.m. UTC | #12
Hi,

On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
> Hi
>
>
> On 2016-5-18 12:12, Doug Anderson wrote:
>>
>> Hi,
>>
>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>> <shawn.lin@kernel-upstream.org> wrote:
>>>
>>> Could you try this patch to see if you can still find HLE?
>>>
>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>> *host, u32 status)
>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>   {
>>>          int i;
>>> +       int present;
>>>
>>>          for (i = 0; i < host->num_slots; i++) {
>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>
>>>                  if (!slot)
>>>                          continue;
>>>
>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>> slot->id));
>>> +               if (present)
>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>> +               else
>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>
>>
>> No, because we don't use the builtin card detect on veyron.  ;)
>>
>> We use GPIO card detect because we didn't like the way JTAG and SD
>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>> voltage domain (you couldn't detect a card when the IO lines were
>> powered off).  The builtin card detect line is always driven low on
>> veyron.
>
>
> Okay, I see.
>
>>
>>
>> I'm nearly certain that the root cause of my HLE errors is actually
>> related to the same problem addressed by the commit 7c5209c315ea
>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>> 1.8V").  I think that on minnie we're still on the hairy edge and
>> sometimes the line doesn't transition fast enough.
>
>
> Things are not so simple from your details.
>
> I was not enabling SD3.0 support, then I also found HLE sometimes.
> So it seems commit 7c5209c315ea does not contribute to this phenomenon.

Just to clarify, in my case commit 7c5209c315ea didn't make the
problem worse, but made it better.  Just not better enough.  ;)


> The scenario looks like:
> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>
> From the code of dw_mci_prepare_command:
> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
> wait_busy here, then cmd code is loading into queue of dw_mmc but
> still failing send out because it's in busy?
>
> With my patch, things go well:
> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
> disable clk
>
> So why should we allow inquiry of card status if we sure the card is
> removed? I mean no any further cmds should be delivered.

Quite honestly just dealing with the HLE error (my patch or
equivalent) might be a sane solution for the problem you describe.

dw_mmc needs to be able to work with an external card detect GPIO.
It's been part of the dw_mmc driver for a long time and is (in fact)
in use upstream at least by rk3288-veyron.  Any solution that only
works for internal card detect is not enough.  Just handling the HLE
error to deal with the interrupt storm and then letting Linux remove
the card (because of the card detect interrupt) seems totally OK to
me.

Note: I'd be very curious if your problems get better if you disable
the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
detect and you use the boot default of "grf_force_jtag" then your pins
will be unmuxed behind your back when the card is ejected.  This could
be causing the dw_mmc controller to get confused.


> And another question: should we wait busy for cmd13?

I don't think so.  As I understand it CMD13 uses only the CMD line for
communication and it should be appropriate to send this when the bus
is "busy" (which means that the DATA lines are low).
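
As a rough illustration of that rule, the sketch below sets a "wait for
previous data" flag for data commands but never for CMD13. It is a
simplified stand-in, not the logic of dw_mci_prepare_command; the flag's
bit position and the function name are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMC_SEND_STATUS   13		/* CMD13 */
#define CMD_PRV_DAT_WAIT  (1u << 13)	/* assumed bit position */

/*
 * Build a command word: the opcode in the low bits, plus the
 * wait-for-previous-data flag for data commands other than CMD13.
 */
static uint32_t cmd_flags_for(unsigned int opcode, bool has_data)
{
	uint32_t cmdr = opcode;

	if (opcode != MMC_SEND_STATUS && has_data)
		cmdr |= CMD_PRV_DAT_WAIT;
	return cmdr;
}
```

CMD13 skips the wait because it uses only the CMD line, so a busy DATA
bus should not block it.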

Also: it seems odd that the HLE IRQ storm didn't come right after the
CMD 13 in your description above.  Are you sure it was the CMD 13 that
caused the HLEs, or could it have been something else?


-Doug
Heiko Stübner May 18, 2016, 6:01 p.m. UTC | #13
On Wednesday, 18 May 2016, 10:37:52, Doug Anderson wrote:
> Note: I'd be very curious if your problems get better if you disable
> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
> detect and you use the boot default of "grf_force_jtag" then your pins
> will be unmuxed behind your back when the card is ejected.  This could
> be causing the dw_mmc controller to get confused.

On the rk3288, we saw issues with the jtag/sdmmc function and thus disabled 
that altogether in [0]. Not sure if that is a similar problem for you.

Heiko

[0] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c9b75d51c940c25587a2ad72ec7ec60490abfb6c
Shawn Lin May 19, 2016, 11:31 a.m. UTC | #14
Hi,

On 2016/5/19 1:37, Doug Anderson wrote:
> Hi,
>
> On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>> Hi
>>
>>
>> On 2016-5-18 12:12, Doug Anderson wrote:
>>>
>>> Hi,
>>>
>>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>>> <shawn.lin@kernel-upstream.org> wrote:
>>>>
>>>> Could you try this patch to see if you can still find HLE?
>>>>
>>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>>> *host, u32 status)
>>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>>   {
>>>>          int i;
>>>> +       int present;
>>>>
>>>>          for (i = 0; i < host->num_slots; i++) {
>>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>>
>>>>                  if (!slot)
>>>>                          continue;
>>>>
>>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>>> slot->id));
>>>> +               if (present)
>>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>> +               else
>>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>
>>>
>>> No, because we don't use the builtin card detect on veyron.  ;)
>>>
>>> We use GPIO card detect because we didn't like the way JTAG and SD
>>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>>> voltage domain (you couldn't detect a card when the IO lines were
>>> powered off).  The builtin card detect line is always driven low on
>>> veyron.
>>
>>
>> Okay, I see.
>>
>>>
>>>
>>> I'm nearly certain that the root cause of my HLE errors is actually
>>> related to the same problem addressed by the commit 7c5209c315ea
>>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>>> 1.8V").  I think that on minnie we're still on the hairy edge and
>>> sometimes the line doesn't transition fast enough.
>>
>>
>> Things are not so simple from your details.
>>
>> I was not enabling SD3.0 support, then I also found HLE sometimes.
>> So it seems commit 7c5209c315ea does not contribute to this phenomenon.
>
> Just to clarify, in my case commit 7c5209c315ea didn't make the
> problem worse, but made it better.  Just not better enough.  ;)
>
>
>> The scenario looks like:
>> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
>> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>>
>> From the code of dw_mci_prepare_command:
>> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
>> wait_busy here, then cmd code is loading into queue of dw_mmc but
>> still failing send out because it's in busy?
>>
>> With my patch, things go well:
>> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
>> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
>> disable clk
>>
>> So why should we allow inquiry of card status if we sure the card is
>> removed? I mean no any further cmds should be delivered.
>
> Quite honestly just dealing with the HLE error (my patch or
> equivalent) might be a sane solution for the problem you describe.

Yes, your patch looks good to me, so it should be merged first. :)
Then let's dig a bit further: when HLEs arrive,
something must be wrong (I don't currently see an obvious clue in
the code itself, though I'm inclined to think it is a
software issue).


>
> dw_mmc needs to be able to work with an external card detect GPIO.
> It's been part of the dw_mmc driver for a long time and is (in fact)
> in use upstream at least by rk3288-veyron.  Any solution that only
> works for internal card detect is not enough.  Just handling the HLE
> error to deal with the interrupt storm and then letting Linux remove
> the card (because of the card detect interrupt) seems totally OK to
> me.
>

Sure, some Rockchip SoCs use a GPIO for CD because they don't
have an internal CD, such as RK3036, right?

> Note: I'd be very curious if your problems get better if you disable

Not at all.

> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
> detect and you use the boot default of "grf_force_jtag" then your pins
> will be unmuxed behind your back when the card is ejected.  This could
> be causing the dw_mmc controller to get confused.

Right, grf_force_jtag is also not a friend of mine. :)
So I had disabled this function before I was debugging it.

>
>
>> And another question: should we wait busy for cmd13?
>
> I don't think so.  As I understand it CMD13 uses only the CMD line for
> communication and it should be appropriate to send this when the bus
> is "busy" (which means that the DATA lines are low).

Ahh... I take back my question. I was just considering a weird situation
where the pins are unmuxed in the background (the cmd line as well) while
cmd13 is being delivered...


>
> Also: it seems odd that the HLE IRQ storm didn't come right after the
> CMD 13 in your description above.  Are you sure it was the CMD 13 that
> caused the HLEs, or could it have been something else?

Actually no. I think any cmd issued after the sd card is removed can
trigger HLEs; I saw this when I hacked mmc_sd_detect to send other cmds
instead of cmd13.

From the dw_mmc databook v270a (7.2.3 Clock Programming) we can see:
The DWC_mobile_storage loads each of these registers only when the
start_cmd bit and the Update_clk_regs_only bit in the CMD register are
set. When a command is successfully loaded, the DWC_mobile_storage
clears this bit, unless the DWC_mobile_storage already has another
command in the queue, at which point it gives an HLE (Hardware Locked
Error); for details on HLEs, refer to “Error Handling” on page 233.
Software should look for the start_cmd and the Update_clk_regs_only
bits, and should also set the wait_prvdata_complete bit to ensure that
clock parameters do not change during data transfer.

Maybe the cmd is still trying to load (or something is wrong with the
controller?) when we disable the clk? That may explain my observation
that the HLEs came after disabling the clk.


>
>
> -Doug
>
>
>
Jaehoon Chung May 19, 2016, 1:07 p.m. UTC | #15
On 05/19/2016 08:31 PM, Shawn Lin wrote:
> Hi,
> 
> On 2016/5/19 1:37, Doug Anderson wrote:
>> Hi,
>>
>> On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>>> Hi
>>>
>>>
>>> On 2016-5-18 12:12, Doug Anderson wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>>>> <shawn.lin@kernel-upstream.org> wrote:
>>>>>
>>>>> Could you try this patch to see if you can still find HLE?
>>>>>
>>>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>>>> *host, u32 status)
>>>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>>>   {
>>>>>          int i;
>>>>> +       int present;
>>>>>
>>>>>          for (i = 0; i < host->num_slots; i++) {
>>>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>>>
>>>>>                  if (!slot)
>>>>>                          continue;
>>>>>
>>>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>>>> slot->id));
>>>>> +               if (present)
>>>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>> +               else
>>>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>
>>>>
>>>> No, because we don't use the builtin card detect on veyron.  ;)
>>>>
>>>> We use GPIO card detect because we didn't like the way JTAG and SD
>>>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>>>> voltage domain (you couldn't detect a card when the IO lines were
>>>> powered off).  The builtin card detect line is always driven low on
>>>> veyron.
>>>
>>>
>>> Okay, I see.
>>>
>>>>
>>>>
>>>> I'm nearly certain that the root cause of my HLE errors is actually
>>>> related to the same problem addressed by the commit 7c5209c315ea
>>>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>>>> 1.8V").  I think that on minnie we're still on the hairy edge and
>>>> sometimes the line doesn't transition fast enough.
>>>
>>>
>>> Things are not so simple from your details.
>>>
>>> I was not enabling SD3.0 support, then I also found HLE sometimes.
>>> So it seems commit 7c5209c315ea does not contribute to this phenomenon.
>>
>> Just to clarify, in my case commit 7c5209c315ea didn't make the
>> problem worse, but made it better.  Just not better enough.  ;)
>>
>>
>>> The scenario looks like:
>>> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
>>> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>>>
>>> From the code of dw_mci_prepare_command:
>>> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
>>> wait_busy here, then cmd code is loaded into queue of dw_mmc but
>>> still failing send out because it's in busy?
>>>
>>> With my patch, things go well:
>>> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
>>> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
>>> disable clk
>>>
>>> So why should we allow inquiry of card status if we sure the card is
>>> removed? I mean no any further cmds should be delivered.
>>
>> Quite honestly just dealing with the HLE error (my patch or
>> equivalent) might be a sane solution for the problem you describe.
> 
> Yes, your patch looks good to me, so it should be merged firstly. :)
> Then let's push it a bit further more that when HLEs are coming,
> somethings must be wrong(currently I don't see a obvious clue from
> the code itself although, I'm prone to think it belongs to the
> software issue).

We don't know the main cause of the HLE, but I also think it's related to a SW issue.
Still, we need to consider all possibilities.

> 
> 
>>
>> dw_mmc needs to be able to work with an external card detect GPIO.
>> It's been part of the dw_mmc driver for a long time and is (in fact)
>> in use upstream at least by rk3288-veyron.  Any solution that only
>> works for internal card detect is not enough.  Just handling the HLE
>> error to deal with the interrupt storm and then letting Linux remove
>> the card (because of the card detect interrupt) seems totally OK to
>> me.
>>
> 
> Sure, some of rockchip Socs use gpio for CD because they don't
> have a internal CD, such as RK3036, right?
> 
>> Note: I'd be very curious if your problems get better if you disable
> 
> Not at all.
> 
>> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
>> detect and you use the boot default of "grf_force_jtag" then your pins
>> will be unmuxed behind your back when the card is ejected.  This could
>> be causing the dw_mmc controller to get confused.
> 
> Right, grf_force_jtag is also not a friend of mine. :)
> So I had disabled this function before I was debugging it.
> 
>>
>>
>>> And another question: should we wait busy for cmd13?
>>
>> I don't think so.  As I understand it CMD13 uses only the CMD line for
>> communication and it should be appropriate to send this when the bus
>> is "busy" (which means that the DATA lines are low).
> 
> Ahh... take back my question.. I was just considering a weird situation
> that pins are unmuxed on the background(cmd line as well) when cmd13 is
> delivering....
> 
> 
>>
>> Also: it seems odd that the HLE IRQ storm didn't come right after the
>> CMD 13 in your description above.  Are you sure it was the CMD 13 that
>> caused the HLEs, or could it have been something else?
> 
> Actually no. I think any cmd issued after the sd card is removed can
> trigger HLEs; I saw this when I hacked mmc_sd_detect to send other cmds
> instead of cmd13.
> 
> From dw_mmc databook v270a(7.2.3 Clock Programming) we can see:
> The DWC_mobile_storage loads each of these registers only when the
> start_cmd bit and the Update_clk_regs_only bit in the CMD register are
> set. When a command is successfully loaded, the DWC_mobile_storage
> clears this bit, unless the DWC_mobile_storage already has another
> command in the queue, at which point it gives an HLE (Hardware Locked
> Error); for details on HLEs, refer to “Error Handling” on page 233.
> Software should look for the start_cmd and the Update_clk_regs_only
> bits, and should also set the wait_prvdata_complete bit to ensure that
> clock parameters do not change during data transfer.
> 
> Maybe the cmd is trying to load(or somethings wrong with the
> controller?) when we disable the clk? That may explain my observation
> that HLEs came after disabling clk.

I agree.

To disable the clock, the driver sends a cmd with the update_clk_regs_only and wait_prvdata_complete bits set.
I think that is the problem (waiting for the previous data).

If a data transfer (read/write) is still ongoing, the controller waits for the previous data to complete before disabling the clock.
(But the card has already been removed, so it can't do anything.)
The HLE is difficult to analyze, so let's apply this patch first and then solve the rest step by step.

Best Regards,
Jaehoon Chung

> 
> 
>>
>>
>> -Doug
>>
>>
>>
> 
>
Shawn Lin May 26, 2016, 2:23 a.m. UTC | #16
Hi Jaehoon,

On 2016/5/19 21:07, Jaehoon Chung wrote:
> On 05/19/2016 08:31 PM, Shawn Lin wrote:
>> Hi,
>>
>> On 2016/5/19 1:37, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>>>> Hi
>>>>
>>>>
>>>> On 2016-5-18 12:12, Doug Anderson wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>>>>> <shawn.lin@kernel-upstream.org> wrote:
>>>>>>
>>>>>> Could you try this patch to see if you can still find HLE?
>>>>>>
>>>>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>>>>> *host, u32 status)
>>>>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>>>>   {
>>>>>>          int i;
>>>>>> +       int present;
>>>>>>
>>>>>>          for (i = 0; i < host->num_slots; i++) {
>>>>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>>>>
>>>>>>                  if (!slot)
>>>>>>                          continue;
>>>>>>
>>>>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>>>>> slot->id));
>>>>>> +               if (present)
>>>>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>> +               else
>>>>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>
>>>>>
>>>>> No, because we don't use the builtin card detect on veyron.  ;)
>>>>>
>>>>> We use GPIO card detect because we didn't like the way JTAG and SD
>>>>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>>>>> voltage domain (you couldn't detect a card when the IO lines were
>>>>> powered off).  The builtin card detect line is always driven low on
>>>>> veyron.
>>>>
>>>>
>>>> Okay, I see.
>>>>
>>>>>
>>>>>
>>>>> I'm nearly certain that the root cause of my HLE errors is actually
>>>>> related to the same problem addressed by the commit 7c5209c315ea
>>>>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>>>>> 1.8V").  I think that on minnie we're still on the hairy edge and
>>>>> sometimes the line doesn't transition fast enough.
>>>>
>>>>
>>>> Things are not so simple from your details.
>>>>
>>>> I was not enabling SD3.0 support, then I also found HLE sometimes.
>>>> So it seems commit 7c5209c315ea does not contribute to this phenomenon.
>>>
>>> Just to clarify, in my case commit 7c5209c315ea didn't make the
>>> problem worse, but made it better.  Just not better enough.  ;)
>>>
>>>
>>>> The scenario looks like:
>>>> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
>>>> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>>>>
>>>> From the code of dw_mci_prepare_command:
>>>> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
>>>> wait_busy here, then cmd code is loaded into queue of dw_mmc but
>>>> still failing send out because it's in busy?
>>>>
>>>> With my patch, things go well:
>>>> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
>>>> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
>>>> disable clk
>>>>
>>>> So why should we allow inquiry of card status if we sure the card is
>>>> removed? I mean no any further cmds should be delivered.
>>>
>>> Quite honestly just dealing with the HLE error (my patch or
>>> equivalent) might be a sane solution for the problem you describe.
>>
>> Yes, your patch looks good to me, so it should be merged firstly. :)
>> Then let's push it a bit further more that when HLEs are coming,
>> somethings must be wrong(currently I don't see a obvious clue from
>> the code itself although, I'm prone to think it belongs to the
>> software issue).
>
> We don't know what's main cause for HLE..But i also think it's relevant to SW issue.
> But we need to consider all possibilities..
>
>>
>>
>>>
>>> dw_mmc needs to be able to work with an external card detect GPIO.
>>> It's been part of the dw_mmc driver for a long time and is (in fact)
>>> in use upstream at least by rk3288-veyron.  Any solution that only
>>> works for internal card detect is not enough.  Just handling the HLE
>>> error to deal with the interrupt storm and then letting Linux remove
>>> the card (because of the card detect interrupt) seems totally OK to
>>> me.
>>>
>>
>> Sure, some of rockchip Socs use gpio for CD because they don't
>> have a internal CD, such as RK3036, right?
>>
>>> Note: I'd be very curious if your problems get better if you disable
>>
>> Not at all.
>>
>>> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
>>> detect and you use the boot default of "grf_force_jtag" then your pins
>>> will be unmuxed behind your back when the card is ejected.  This could
>>> be causing the dw_mmc controller to get confused.
>>
>> Right, grf_force_jtag is also not a friend of mine. :)
>> So I had disabled this function before I was debugging it.
>>
>>>
>>>
>>>> And another question: should we wait busy for cmd13?
>>>
>>> I don't think so.  As I understand it CMD13 uses only the CMD line for
>>> communication and it should be appropriate to send this when the bus
>>> is "busy" (which means that the DATA lines are low).
>>
>> Ahh... take back my question.. I was just considering a weird situation
>> that pins are unmuxed on the background(cmd line as well) when cmd13 is
>> delivering....
>>
>>
>>>
>>> Also: it seems odd that the HLE IRQ storm didn't come right after the
>>> CMD 13 in your description above.  Are you sure it was the CMD 13 that
>>> caused the HLEs, or could it have been something else?
>>
>> Actually no. I think any cmd issued after the sd card is removed can
>> trigger HLEs; I saw this when I hacked mmc_sd_detect to send other cmds
>> instead of cmd13.
>>
>> From dw_mmc databook v270a(7.2.3 Clock Programming) we can see:
>> The DWC_mobile_storage loads each of these registers only when the
>> start_cmd bit and the Update_clk_regs_only bit in the CMD register are
>> set. When a command is successfully loaded, the DWC_mobile_storage
>> clears this bit, unless the DWC_mobile_storage already has another
>> command in the queue, at which point it gives an HLE (Hardware Locked
>> Error); for details on HLEs, refer to “Error Handling” on page 233.
>> Software should look for the start_cmd and the Update_clk_regs_only
>> bits, and should also set the wait_prvdata_complete bit to ensure that
>> clock parameters do not change during data transfer.
>>
>> Maybe the cmd is trying to load(or somethings wrong with the
>> controller?) when we disable the clk? That may explain my observation
>> that HLEs came after disabling clk.
>
> I agreed.
>
> To Disable clock, it sends cmd with update_clk_regs_only and wait_prvdata_complete bit.
> I think it's problem..(waiting for prvdata..)
>
> If there are ongoing some data(read/write), then before disabling clock, waiting for completing previous data.
> (But card was already removed, and it couldn't do anything.)
> It's difficult to analyze the HLE..So After applying first, then we can solve this problem, step by step.
>

I saw you sent a PR with v4.7 fixes to Ulf which didn't include this one.
Do you plan to add it to the 4.8 material? :)

> Best Regards,
> Jaehoon Chung
>
>>
>>
>>>
>>>
>>> -Doug
>>>
>>>
>>>
>>
>>
>
>
>
>
Jaehoon Chung May 26, 2016, 3:59 a.m. UTC | #17
On 05/26/2016 11:23 AM, Shawn Lin wrote:
> Hi Jaehoon,
> 
> On 2016/5/19 21:07, Jaehoon Chung wrote:
>> On 05/19/2016 08:31 PM, Shawn Lin wrote:
>>> Hi,
>>>
>>> On 2016/5/19 1:37, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>>>>> Hi
>>>>>
>>>>>
>>>>> On 2016-5-18 12:12, Doug Anderson wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>>>>>> <shawn.lin@kernel-upstream.org> wrote:
>>>>>>>
>>>>>>> Could you try this patch to see if you can still find HLE?
>>>>>>>
>>>>>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>>>>>> *host, u32 status)
>>>>>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>>>>>   {
>>>>>>>          int i;
>>>>>>> +       int present;
>>>>>>>
>>>>>>>          for (i = 0; i < host->num_slots; i++) {
>>>>>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>>>>>
>>>>>>>                  if (!slot)
>>>>>>>                          continue;
>>>>>>>
>>>>>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>>>>>> slot->id));
>>>>>>> +               if (present)
>>>>>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>>> +               else
>>>>>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>>
>>>>>>
>>>>>> No, because we don't use the builtin card detect on veyron.  ;)
>>>>>>
>>>>>> We use GPIO card detect because we didn't like the way JTAG and SD
>>>>>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>>>>>> voltage domain (you couldn't detect a card when the IO lines were
>>>>>> powered off).  The builtin card detect line is always driven low on
>>>>>> veyron.
>>>>>
>>>>>
>>>>> Okay, I see.
>>>>>
>>>>>>
>>>>>>
>>>>>> I'm nearly certain that the root cause of my HLE errors is actually
>>>>>> related to the same problem addressed by the commit 7c5209c315ea
>>>>>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>>>>>> 1.8V").  I think that on minnie we're still on the hairy edge and
>>>>>> sometimes the line doesn't transition fast enough.
>>>>>
>>>>>
>>>>> Things are not so simple from your details.
>>>>>
>>>>> I was not enabling SD3.0 support, then I also found HLE sometimes.
>>>>> So it seems commit 7c5209c315ea does not contribute to this phenomenon.
>>>>
>>>> Just to clarify, in my case commit 7c5209c315ea didn't make the
>>>> problem worse, but made it better.  Just not better enough.  ;)
>>>>
>>>>
>>>>> The scenario looks like:
>>>>> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
>>>>> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>>>>>
>>>>> From the code of dw_mci_prepare_command:
>>>>> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
>>>>> wait_busy here, then cmd code is loaded into queue of dw_mmc but
>>>>> still failing send out because it's in busy?
>>>>>
>>>>> With my patch, things go well:
>>>>> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
>>>>> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
>>>>> disable clk
>>>>>
>>>>> So why should we allow inquiry of card status if we sure the card is
>>>>> removed? I mean no any further cmds should be delivered.
>>>>
>>>> Quite honestly just dealing with the HLE error (my patch or
>>>> equivalent) might be a sane solution for the problem you describe.
>>>
>>> Yes, your patch looks good to me, so it should be merged firstly. :)
>>> Then let's push it a bit further more that when HLEs are coming,
>>> somethings must be wrong(currently I don't see a obvious clue from
>>> the code itself although, I'm prone to think it belongs to the
>>> software issue).
>>
>> We don't know what's main cause for HLE..But i also think it's relevant to SW issue.
>> But we need to consider all possibilities..
>>
>>>
>>>
>>>>
>>>> dw_mmc needs to be able to work with an external card detect GPIO.
>>>> It's been part of the dw_mmc driver for a long time and is (in fact)
>>>> in use upstream at least by rk3288-veyron.  Any solution that only
>>>> works for internal card detect is not enough.  Just handling the HLE
>>>> error to deal with the interrupt storm and then letting Linux remove
>>>> the card (because of the card detect interrupt) seems totally OK to
>>>> me.
>>>>
>>>
>>> Sure, some of rockchip Socs use gpio for CD because they don't
>>> have a internal CD, such as RK3036, right?
>>>
>>>> Note: I'd be very curious if your problems get better if you disable
>>>
>>> Not at all.
>>>
>>>> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
>>>> detect and you use the boot default of "grf_force_jtag" then your pins
>>>> will be unmuxed behind your back when the card is ejected.  This could
>>>> be causing the dw_mmc controller to get confused.
>>>
>>> Right, grf_force_jtag is also not a friend of mine. :)
>>> So I had disabled this function before I was debugging it.
>>>
>>>>
>>>>
>>>>> And another question: should we wait busy for cmd13?
>>>>
>>>> I don't think so.  As I understand it CMD13 uses only the CMD line for
>>>> communication and it should be appropriate to send this when the bus
>>>> is "busy" (which means that the DATA lines are low).
>>>
>>> Ahh... take back my question.. I was just considering a weird situation
>>> that pins are unmuxed on the background(cmd line as well) when cmd13 is
>>> delivering....
>>>
>>>
>>>>
>>>> Also: it seems odd that the HLE IRQ storm didn't come right after the
>>>> CMD 13 in your description above.  Are you sure it was the CMD 13 that
>>>> caused the HLEs, or could it have been something else?
>>>
>>> Actually no. I think any cmd issued after the sd card is removed can
>>> trigger HLEs; I saw this when I hacked mmc_sd_detect to send other cmds
>>> instead of cmd13.
>>>
>>> From dw_mmc databook v270a(7.2.3 Clock Programming) we can see:
>>> The DWC_mobile_storage loads each of these registers only when the
>>> start_cmd bit and the Update_clk_regs_only bit in the CMD register are
>>> set. When a command is successfully loaded, the DWC_mobile_storage
>>> clears this bit, unless the DWC_mobile_storage already has another
>>> command in the queue, at which point it gives an HLE (Hardware Locked
>>> Error); for details on HLEs, refer to “Error Handling” on page 233.
>>> Software should look for the start_cmd and the Update_clk_regs_only
>>> bits, and should also set the wait_prvdata_complete bit to ensure that
>>> clock parameters do not change during data transfer.
>>>
>>> Maybe the cmd is trying to load(or somethings wrong with the
>>> controller?) when we disable the clk? That may explain my observation
>>> that HLEs came after disabling clk.
>>
>> I agreed.
>>
>> To Disable clock, it sends cmd with update_clk_regs_only and wait_prvdata_complete bit.
>> I think it's problem..(waiting for prvdata..)
>>
>> If there are ongoing some data(read/write), then before disabling clock, waiting for completing previous data.
>> (But card was already removed, and it couldn't do anything.)
>> It's difficult to analyze the HLE..So After applying first, then we can solve this problem, step by step.
>>
> 
> I saw you send a PR for v4.7-fix to Ulf which didn't include this one.
> Do you plan to add it into 4.8 materials? :)

Yes, I think it's good to queue this for next. (I will apply it this weekend.)
Do you have any other opinion? :)

If you do, I will take it into account. Thanks!

Best Regards,
Jaehoon Chung

> 
>> Best Regards,
>> Jaehoon Chung
>>
>>>
>>>
>>>>
>>>>
>>>> -Doug
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
> 
>
Shawn Lin May 26, 2016, 4:07 a.m. UTC | #18
On 2016/5/26 11:59, Jaehoon Chung wrote:
> On 05/26/2016 11:23 AM, Shawn Lin wrote:
>> Hi Jaehoon,
>>
>> On 2016/5/19 21:07, Jaehoon Chung wrote:
>>> On 05/19/2016 08:31 PM, Shawn Lin wrote:
>>>> Hi,
>>>>
>>>> On 2016/5/19 1:37, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Wed, May 18, 2016 at 2:14 AM, Shawn Lin <shawn.lin@rock-chips.com> wrote:
>>>>>> Hi
>>>>>>
>>>>>>
>>>>>> On 2016-5-18 12:12, Doug Anderson wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, May 17, 2016 at 6:59 PM, Shawn Lin
>>>>>>> <shawn.lin@kernel-upstream.org> wrote:
>>>>>>>>
>>>>>>>> Could you try this patch to see if you can still find HLE?
>>>>>>>>
>>>>>>>> @@ -2356,12 +2356,22 @@ static void dw_mci_cmd_interrupt(struct dw_mci
>>>>>>>> *host, u32 status)
>>>>>>>>   static void dw_mci_handle_cd(struct dw_mci *host)
>>>>>>>>   {
>>>>>>>>          int i;
>>>>>>>> +       int present;
>>>>>>>>
>>>>>>>>          for (i = 0; i < host->num_slots; i++) {
>>>>>>>>                  struct dw_mci_slot *slot = host->slot[i];
>>>>>>>>
>>>>>>>>                  if (!slot)
>>>>>>>>                          continue;
>>>>>>>>
>>>>>>>> +               present = !(mci_readl(slot->host, CDETECT) & (1 <<
>>>>>>>> slot->id));
>>>>>>>> +               if (present)
>>>>>>>> +                       set_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>>>> +               else
>>>>>>>> +                       clear_bit(DW_MMC_CARD_PRESENT, &slot->flags);
>>>>>>>
>>>>>>>
>>>>>>> No, because we don't use the builtin card detect on veyron.  ;)
>>>>>>>
>>>>>>> We use GPIO card detect because we didn't like the way JTAG and SD
>>>>>>> interacted.  Also on rk3288 the builtin card detect line had the wrong
>>>>>>> voltage domain (you couldn't detect a card when the IO lines were
>>>>>>> powered off).  The builtin card detect line is always driven low on
>>>>>>> veyron.
>>>>>>
>>>>>>
>>>>>> Okay, I see.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'm nearly certain that the root cause of my HLE errors is actually
>>>>>>> related to the same problem addressed by the commit 7c5209c315ea
>>>>>>> ("mmc: core: Increase delay for voltage to stabilize from 3.3V to
>>>>>>> 1.8V").  I think that on minnie we're still on the hairy edge and
>>>>>>> sometimes the line doesn't transition fast enough.
>>>>>>
>>>>>>
>>>>>> Things are not so simple from your details.
>>>>>>
>>>>>> I was not enabling SD3.0 support, then I also found HLE sometimes.
>>>>>> So it seems commit 7c5209c315ea does not contribute to this phenomenon.
>>>>>
>>>>> Just to clarify, in my case commit 7c5209c315ea didn't make the
>>>>> problem worse, but made it better.  Just not better enough.  ;)
>>>>>
>>>>>
>>>>>> The scenario looks like:
>>>>>> remove sd-card -> mmc_sd_detect -> send status(CMD13) ->power_off ->
>>>>>> set_ios -> setup_bus -> disabled clk , then HLE irq storm coming
>>>>>>
>>>>>> From the code of dw_mci_prepare_command:
>>>>>> SDMMC_CMD_PRV_DAT_WAIT will not be used for CMD13, so we don't
>>>>>> wait_busy here, then cmd code is loaded into queue of dw_mmc but
>>>>>> still failing send out because it's in busy?
>>>>>>
>>>>>> With my patch, things go well:
>>>>>> remove sd-card -> clear bit of DW_MMC_CARD_PRESENT  -> send
>>>>>> status(CMD13) return directly -> power_off -> set_ios -> setup_bus ->
>>>>>> disable clk
>>>>>>
>>>>>> So why should we allow inquiry of card status if we sure the card is
>>>>>> removed? I mean no any further cmds should be delivered.
>>>>>
>>>>> Quite honestly just dealing with the HLE error (my patch or
>>>>> equivalent) might be a sane solution for the problem you describe.
>>>>
>>>> Yes, your patch looks good to me, so it should be merged firstly. :)
>>>> Then let's push it a bit further more that when HLEs are coming,
>>>> somethings must be wrong(currently I don't see a obvious clue from
>>>> the code itself although, I'm prone to think it belongs to the
>>>> software issue).
>>>
>>> We don't know what's main cause for HLE..But i also think it's relevant to SW issue.
>>> But we need to consider all possibilities..
>>>
>>>>
>>>>
>>>>>
>>>>> dw_mmc needs to be able to work with an external card detect GPIO.
>>>>> It's been part of the dw_mmc driver for a long time and is (in fact)
>>>>> in use upstream at least by rk3288-veyron.  Any solution that only
>>>>> works for internal card detect is not enough.  Just handling the HLE
>>>>> error to deal with the interrupt storm and then letting Linux remove
>>>>> the card (because of the card detect interrupt) seems totally OK to
>>>>> me.
>>>>>
>>>>
>>>> Sure, some of rockchip Socs use gpio for CD because they don't
>>>> have a internal CD, such as RK3036, right?
>>>>
>>>>> Note: I'd be very curious if your problems get better if you disable
>>>>
>>>> Not at all.
>>>>
>>>>> the "grf_force_jtag" bit in the GRF.  If you're using the builtin card
>>>>> detect and you use the boot default of "grf_force_jtag" then your pins
>>>>> will be unmuxed behind your back when the card is ejected.  This could
>>>>> be causing the dw_mmc controller to get confused.
>>>>
>>>> Right, grf_force_jtag is also not a friend of mine. :)
>>>> So I had disabled this function before I was debugging it.
>>>>
>>>>>
>>>>>
>>>>>> And another question: should we wait busy for cmd13?
>>>>>
>>>>> I don't think so.  As I understand it CMD13 uses only the CMD line for
>>>>> communication and it should be appropriate to send this when the bus
>>>>> is "busy" (which means that the DATA lines are low).
>>>>
>>>> Ahh... take back my question.. I was just considering a weird situation
>>>> that pins are unmuxed on the background(cmd line as well) when cmd13 is
>>>> delivering....
>>>>
>>>>
>>>>>
>>>>> Also: it seems odd that the HLE IRQ storm didn't come right after the
>>>>> CMD 13 in your description above.  Are you sure it was the CMD 13 that
>>>>> caused the HLEs, or could it have been something else?
>>>>
>>>> Actually no. I think any cmd issued after the sd card is removed can
>>>> trigger HLEs; I saw this when I hacked mmc_sd_detect to send other cmds
>>>> instead of cmd13.
>>>>
>>>> From dw_mmc databook v270a(7.2.3 Clock Programming) we can see:
>>>> The DWC_mobile_storage loads each of these registers only when the
>>>> start_cmd bit and the Update_clk_regs_only bit in the CMD register are
>>>> set. When a command is successfully loaded, the DWC_mobile_storage
>>>> clears this bit, unless the DWC_mobile_storage already has another
>>>> command in the queue, at which point it gives an HLE (Hardware Locked
>>>> Error); for details on HLEs, refer to “Error Handling” on page 233.
>>>> Software should look for the start_cmd and the Update_clk_regs_only
>>>> bits, and should also set the wait_prvdata_complete bit to ensure that
>>>> clock parameters do not change during data transfer.
>>>>
>>>> Maybe the cmd is trying to load(or somethings wrong with the
>>>> controller?) when we disable the clk? That may explain my observation
>>>> that HLEs came after disabling clk.
>>>
>>> I agreed.
>>>
>>> To Disable clock, it sends cmd with update_clk_regs_only and wait_prvdata_complete bit.
>>> I think it's problem..(waiting for prvdata..)
>>>
>>> If there are ongoing some data(read/write), then before disabling clock, waiting for completing previous data.
>>> (But card was already removed, and it couldn't do anything.)
>>> It's difficult to analyze the HLE..So After applying first, then we can solve this problem, step by step.
>>>
>>
>> I saw you send a PR for v4.7-fix to Ulf which didn't include this one.
>> Do you plan to add it into 4.8 materials? :)
>
> Yes, I think good that this is prepared for next. (Will apply on this weekend.)
> Do you have other opinion? :)

No, thanks. I just wanna make sure this one will be merged. :)

>
> If you have other opinion, i will reflect yours. Thanks!
>
> Best Regards,
> Jaehoon Chung
>
>>
>>> Best Regards,
>>> Jaehoon Chung
>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> -Doug
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>

Patch

diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
index 47dfd0e..294edc9c 100644
--- a/drivers/mmc/host/dw_mmc.c
+++ b/drivers/mmc/host/dw_mmc.c
@@ -44,11 +44,11 @@ 
 /* Common flag combinations */
 #define DW_MCI_DATA_ERROR_FLAGS	(SDMMC_INT_DRTO | SDMMC_INT_DCRC | \
 				 SDMMC_INT_HTO | SDMMC_INT_SBE  | \
-				 SDMMC_INT_EBE)
+				 SDMMC_INT_EBE | SDMMC_INT_HLE)
 #define DW_MCI_CMD_ERROR_FLAGS	(SDMMC_INT_RTO | SDMMC_INT_RCRC | \
-				 SDMMC_INT_RESP_ERR)
+				 SDMMC_INT_RESP_ERR | SDMMC_INT_HLE)
 #define DW_MCI_ERROR_FLAGS	(DW_MCI_DATA_ERROR_FLAGS | \
-				 DW_MCI_CMD_ERROR_FLAGS  | SDMMC_INT_HLE)
+				 DW_MCI_CMD_ERROR_FLAGS)
 #define DW_MCI_SEND_STATUS	1
 #define DW_MCI_RECV_STATUS	2
 #define DW_MCI_DMA_THRESHOLD	16