diff mbox series

[v1] mmc: sdhci: Prevent stale command and data interrupt handling

Message ID 20241003161007.3485810-1-m.wilczynski@samsung.com (mailing list archive)
State New
Headers show
Series [v1] mmc: sdhci: Prevent stale command and data interrupt handling | expand

Checks

Context Check Description
conchuod/vmtest-for-next-PR success PR summary
conchuod/patch-1-test-1 success .github/scripts/patches/tests/build_rv32_defconfig.sh took 130.95s
conchuod/patch-1-test-2 success .github/scripts/patches/tests/build_rv64_clang_allmodconfig.sh took 1315.83s
conchuod/patch-1-test-3 success .github/scripts/patches/tests/build_rv64_gcc_allmodconfig.sh took 1551.84s
conchuod/patch-1-test-4 success .github/scripts/patches/tests/build_rv64_nommu_k210_defconfig.sh took 19.96s
conchuod/patch-1-test-5 success .github/scripts/patches/tests/build_rv64_nommu_virt_defconfig.sh took 22.10s
conchuod/patch-1-test-6 success .github/scripts/patches/tests/checkpatch.sh took 0.44s
conchuod/patch-1-test-7 success .github/scripts/patches/tests/dtb_warn_rv64.sh took 41.44s
conchuod/patch-1-test-8 success .github/scripts/patches/tests/header_inline.sh took 0.00s
conchuod/patch-1-test-9 success .github/scripts/patches/tests/kdoc.sh took 0.53s
conchuod/patch-1-test-10 success .github/scripts/patches/tests/module_param.sh took 0.03s
conchuod/patch-1-test-11 success .github/scripts/patches/tests/verify_fixes.sh took 0.00s
conchuod/patch-1-test-12 success .github/scripts/patches/tests/verify_signedoff.sh took 0.03s

Commit Message

Michal Wilczynski Oct. 3, 2024, 4:10 p.m. UTC
While working with the T-Head 1520 LicheePi4A SoC, certain conditions
arose that allowed me to reproduce a race issue in the sdhci code.

To reproduce the bug, you need to enable the sdio1 controller in the
device tree file
`arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi` as follows:

&sdio1 {
	bus-width = <4>;
	max-frequency = <100000000>;
	no-sd;
	no-mmc;
	broken-cd;
	cap-sd-highspeed;
	post-power-on-delay-ms = <50>;
	status = "okay";
	wakeup-source;
	keep-power-in-suspend;
};

When resetting the SoC using the reset button, the following messages
appear in the dmesg log:

[    8.164898] mmc2: Got command interrupt 0x00000001 even though no
command operation was in progress.
[    8.174054] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
[    8.180503] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00000005
[    8.186950] mmc2: sdhci: Blk size:  0x00000000 | Blk cnt:  0x00000000
[    8.193395] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
[    8.199841] mmc2: sdhci: Present:   0x03da0000 | Host ctl: 0x00000000
[    8.206287] mmc2: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
[    8.212733] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x0000decf
[    8.219178] mmc2: sdhci: Timeout:   0x00000000 | Int stat: 0x00000000
[    8.225622] mmc2: sdhci: Int enab:  0x00ff1003 | Sig enab: 0x00ff1003
[    8.232068] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[    8.238513] mmc2: sdhci: Caps:      0x3f69c881 | Caps_1:   0x08008177
[    8.244959] mmc2: sdhci: Cmd:       0x00000502 | Max curr: 0x00191919
[    8.254115] mmc2: sdhci: Resp[0]:   0x00001009 | Resp[1]:  0x00000000
[    8.260561] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[    8.267005] mmc2: sdhci: Host ctl2: 0x00001000
[    8.271453] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
0x0000000000000000
[    8.278594] mmc2: sdhci: ============================================

I also enabled some traces to better understand the problem:

     kworker/3:1-62      [003] .....     8.163538: mmc_request_start:
mmc2: start struct mmc_request[000000000d30cc0c]: cmd_opcode=5
cmd_arg=0x0 cmd_flags=0x2e1 cmd_retries=0 stop_opcode=0 stop_arg=0x0
stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
retune_period=0
          <idle>-0       [000] d.h2.     8.164816: sdhci_cmd_irq:
hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x10000
intmask_p=0x18000
     irq/24-mmc2-96      [000] .....     8.164840: sdhci_thread_irq:
msg=
     irq/24-mmc2-96      [000] d.h2.     8.164896: sdhci_cmd_irq:
hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x1
intmask_p=0x1
     irq/24-mmc2-96      [000] .....     8.285142: mmc_request_done:
mmc2: end struct mmc_request[000000000d30cc0c]: cmd_opcode=5
cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0 need_retune=0
hold_retune=1 retune_period=0

Here's what happens: the __mmc_start_request function is called with
opcode 5. Since the power to the Wi-Fi card, which resides on this SDIO
bus, is initially off after the reset, an interrupt SDHCI_INT_TIMEOUT is
triggered. Immediately after that, a second interrupt SDHCI_INT_RESPONSE
is triggered. Depending on the exact timing, these conditions can
trigger the following race problem:

1) The sdhci_cmd_irq top half handles the command as an error. It sets
   host->cmd to NULL and host->pending_reset to true.
2) The sdhci_thread_irq bottom half is scheduled next and executes faster
   than the second interrupt handler for SDHCI_INT_RESPONSE. It clears
   host->pending_reset before the SDHCI_INT_RESPONSE handler runs.
3) The pending interrupt SDHCI_INT_RESPONSE handler gets called, triggering
   a code path that prints: "mmc2: Got command interrupt 0x00000001 even
   though no command operation was in progress."

To solve this issue, we need to clear pending interrupts when resetting
host->pending_reset. This ensures that after sdhci_threaded_irq restores
interrupts, there are no pending stale interrupts.

Signed-off-by: Michal Wilczynski <m.wilczynski@samsung.com>
---
 drivers/mmc/host/sdhci.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Adrian Hunter Oct. 7, 2024, 4:09 p.m. UTC | #1
On 3/10/24 19:10, Michal Wilczynski wrote:
> While working with the T-Head 1520 LicheePi4A SoC, certain conditions
> arose that allowed me to reproduce a race issue in the sdhci code.
> 
> To reproduce the bug, you need to enable the sdio1 controller in the
> device tree file
> `arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi` as follows:
> 
> &sdio1 {
> 	bus-width = <4>;
> 	max-frequency = <100000000>;
> 	no-sd;
> 	no-mmc;
> 	broken-cd;
> 	cap-sd-highspeed;
> 	post-power-on-delay-ms = <50>;
> 	status = "okay";
> 	wakeup-source;
> 	keep-power-in-suspend;
> };
> 
> When resetting the SoC using the reset button, the following messages
> appear in the dmesg log:
> 
> [    8.164898] mmc2: Got command interrupt 0x00000001 even though no
> command operation was in progress.
> [    8.174054] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
> [    8.180503] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00000005
> [    8.186950] mmc2: sdhci: Blk size:  0x00000000 | Blk cnt:  0x00000000
> [    8.193395] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
> [    8.199841] mmc2: sdhci: Present:   0x03da0000 | Host ctl: 0x00000000
> [    8.206287] mmc2: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
> [    8.212733] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x0000decf
> [    8.219178] mmc2: sdhci: Timeout:   0x00000000 | Int stat: 0x00000000
> [    8.225622] mmc2: sdhci: Int enab:  0x00ff1003 | Sig enab: 0x00ff1003
> [    8.232068] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
> [    8.238513] mmc2: sdhci: Caps:      0x3f69c881 | Caps_1:   0x08008177
> [    8.244959] mmc2: sdhci: Cmd:       0x00000502 | Max curr: 0x00191919
> [    8.254115] mmc2: sdhci: Resp[0]:   0x00001009 | Resp[1]:  0x00000000
> [    8.260561] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
> [    8.267005] mmc2: sdhci: Host ctl2: 0x00001000
> [    8.271453] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
> 0x0000000000000000
> [    8.278594] mmc2: sdhci: ============================================
> 
> I also enabled some traces to better understand the problem:
> 
>      kworker/3:1-62      [003] .....     8.163538: mmc_request_start:
> mmc2: start struct mmc_request[000000000d30cc0c]: cmd_opcode=5
> cmd_arg=0x0 cmd_flags=0x2e1 cmd_retries=0 stop_opcode=0 stop_arg=0x0
> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
> retune_period=0
>           <idle>-0       [000] d.h2.     8.164816: sdhci_cmd_irq:
> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x10000
> intmask_p=0x18000
>      irq/24-mmc2-96      [000] .....     8.164840: sdhci_thread_irq:
> msg=
>      irq/24-mmc2-96      [000] d.h2.     8.164896: sdhci_cmd_irq:
> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x1
> intmask_p=0x1
>      irq/24-mmc2-96      [000] .....     8.285142: mmc_request_done:
> mmc2: end struct mmc_request[000000000d30cc0c]: cmd_opcode=5
> cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0 need_retune=0
> hold_retune=1 retune_period=0
> 
> Here's what happens: the __mmc_start_request function is called with
> opcode 5. Since the power to the Wi-Fi card, which resides on this SDIO
> bus, is initially off after the reset, an interrupt SDHCI_INT_TIMEOUT is
> triggered. Immediately after that, a second interrupt SDHCI_INT_RESPONSE
> is triggered. Depending on the exact timing, these conditions can
> trigger the following race problem:
> 
> 1) The sdhci_cmd_irq top half handles the command as an error. It sets
>    host->cmd to NULL and host->pending_reset to true.
> 2) The sdhci_thread_irq bottom half is scheduled next and executes faster
>    than the second interrupt handler for SDHCI_INT_RESPONSE. It clears
>    host->pending_reset before the SDHCI_INT_RESPONSE handler runs.
> 3) The pending interrupt SDHCI_INT_RESPONSE handler gets called, triggering
>    a code path that prints: "mmc2: Got command interrupt 0x00000001 even
>    though no command operation was in progress."
> 
> To solve this issue, we need to clear pending interrupts when resetting
> host->pending_reset. This ensures that after sdhci_threaded_irq restores
> interrupts, there are no pending stale interrupts.
> 
> Signed-off-by: Michal Wilczynski <m.wilczynski@samsung.com>
> ---
>  drivers/mmc/host/sdhci.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index 4b91c9e96635..b91a6076c332 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -3098,6 +3098,10 @@ static bool sdhci_request_done(struct sdhci_host *host)
>  		sdhci_reset_for(host, REQUEST_ERROR);
>  
>  		host->pending_reset = false;
> +
> +		/* Clear any pending interrupts after reset */
> +		sdhci_writel(host, SDHCI_INT_CMD_MASK | SDHCI_INT_DATA_MASK,
> +			     SDHCI_INT_STATUS);

According to SDHCI spec, "Software Reset For CMD Line" clears
"Command Complete" in "Normal Interrupt Status register", so the
interrupt status should not need to be cleared again.

Which SDHCI driver is it?

>  	}
>  
>  	/*
Michal Wilczynski Oct. 7, 2024, 5 p.m. UTC | #2
On 10/7/24 18:09, Adrian Hunter wrote:
> On 3/10/24 19:10, Michal Wilczynski wrote:
>> While working with the T-Head 1520 LicheePi4A SoC, certain conditions
>> arose that allowed me to reproduce a race issue in the sdhci code.
>>
>> To reproduce the bug, you need to enable the sdio1 controller in the
>> device tree file
>> `arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi` as follows:
>>
>> &sdio1 {
>> 	bus-width = <4>;
>> 	max-frequency = <100000000>;
>> 	no-sd;
>> 	no-mmc;
>> 	broken-cd;
>> 	cap-sd-highspeed;
>> 	post-power-on-delay-ms = <50>;
>> 	status = "okay";
>> 	wakeup-source;
>> 	keep-power-in-suspend;
>> };
>>
>> When resetting the SoC using the reset button, the following messages
>> appear in the dmesg log:
>>
>> [    8.164898] mmc2: Got command interrupt 0x00000001 even though no
>> command operation was in progress.
>> [    8.174054] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
>> [    8.180503] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00000005
>> [    8.186950] mmc2: sdhci: Blk size:  0x00000000 | Blk cnt:  0x00000000
>> [    8.193395] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
>> [    8.199841] mmc2: sdhci: Present:   0x03da0000 | Host ctl: 0x00000000
>> [    8.206287] mmc2: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
>> [    8.212733] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x0000decf
>> [    8.219178] mmc2: sdhci: Timeout:   0x00000000 | Int stat: 0x00000000
>> [    8.225622] mmc2: sdhci: Int enab:  0x00ff1003 | Sig enab: 0x00ff1003
>> [    8.232068] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>> [    8.238513] mmc2: sdhci: Caps:      0x3f69c881 | Caps_1:   0x08008177
>> [    8.244959] mmc2: sdhci: Cmd:       0x00000502 | Max curr: 0x00191919
>> [    8.254115] mmc2: sdhci: Resp[0]:   0x00001009 | Resp[1]:  0x00000000
>> [    8.260561] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>> [    8.267005] mmc2: sdhci: Host ctl2: 0x00001000
>> [    8.271453] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
>> 0x0000000000000000
>> [    8.278594] mmc2: sdhci: ============================================
>>
>> I also enabled some traces to better understand the problem:
>>
>>      kworker/3:1-62      [003] .....     8.163538: mmc_request_start:
>> mmc2: start struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>> cmd_arg=0x0 cmd_flags=0x2e1 cmd_retries=0 stop_opcode=0 stop_arg=0x0
>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
>> retune_period=0
>>           <idle>-0       [000] d.h2.     8.164816: sdhci_cmd_irq:
>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x10000
>> intmask_p=0x18000
>>      irq/24-mmc2-96      [000] .....     8.164840: sdhci_thread_irq:
>> msg=
>>      irq/24-mmc2-96      [000] d.h2.     8.164896: sdhci_cmd_irq:
>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x1
>> intmask_p=0x1
>>      irq/24-mmc2-96      [000] .....     8.285142: mmc_request_done:
>> mmc2: end struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>> cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0 need_retune=0
>> hold_retune=1 retune_period=0
>>
>> Here's what happens: the __mmc_start_request function is called with
>> opcode 5. Since the power to the Wi-Fi card, which resides on this SDIO
>> bus, is initially off after the reset, an interrupt SDHCI_INT_TIMEOUT is
>> triggered. Immediately after that, a second interrupt SDHCI_INT_RESPONSE
>> is triggered. Depending on the exact timing, these conditions can
>> trigger the following race problem:
>>
>> 1) The sdhci_cmd_irq top half handles the command as an error. It sets
>>    host->cmd to NULL and host->pending_reset to true.
>> 2) The sdhci_thread_irq bottom half is scheduled next and executes faster
>>    than the second interrupt handler for SDHCI_INT_RESPONSE. It clears
>>    host->pending_reset before the SDHCI_INT_RESPONSE handler runs.
>> 3) The pending interrupt SDHCI_INT_RESPONSE handler gets called, triggering
>>    a code path that prints: "mmc2: Got command interrupt 0x00000001 even
>>    though no command operation was in progress."
>>
>> To solve this issue, we need to clear pending interrupts when resetting
>> host->pending_reset. This ensures that after sdhci_threaded_irq restores
>> interrupts, there are no pending stale interrupts.
>>
>> Signed-off-by: Michal Wilczynski <m.wilczynski@samsung.com>
>> ---
>>  drivers/mmc/host/sdhci.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>> index 4b91c9e96635..b91a6076c332 100644
>> --- a/drivers/mmc/host/sdhci.c
>> +++ b/drivers/mmc/host/sdhci.c
>> @@ -3098,6 +3098,10 @@ static bool sdhci_request_done(struct sdhci_host *host)
>>  		sdhci_reset_for(host, REQUEST_ERROR);
>>  
>>  		host->pending_reset = false;
>> +
>> +		/* Clear any pending interrupts after reset */
>> +		sdhci_writel(host, SDHCI_INT_CMD_MASK | SDHCI_INT_DATA_MASK,
>> +			     SDHCI_INT_STATUS);
> 
> According to SDHCI spec, "Software Reset For CMD Line" clears
> "Command Complete" in "Normal Interrupt Status register", so the
> interrupt status should not need to be cleared again.
> 
> Which SDHCI driver is it?
> 

Thank you for your review.

The driver in use is located at drivers/mmc/host/sdhci-of-dwcmshc.c,
and the .compatible string is thead,th1520-dwcmshc.

Based on the specifications, it appears that the specific hardware used in
the LicheePi4A SoC may not be fully compliant with the SDHCI standard.
Therefore, the appropriate solution might be to add additional clearing
code to the th1520_sdhci_reset() function.

>>  	}
>>  
>>  	/*
> 
>
Adrian Hunter Oct. 7, 2024, 6:58 p.m. UTC | #3
On 7/10/24 20:00, Michal Wilczynski wrote:
> 
> 
> On 10/7/24 18:09, Adrian Hunter wrote:
>> On 3/10/24 19:10, Michal Wilczynski wrote:
>>> While working with the T-Head 1520 LicheePi4A SoC, certain conditions
>>> arose that allowed me to reproduce a race issue in the sdhci code.
>>>
>>> To reproduce the bug, you need to enable the sdio1 controller in the
>>> device tree file
>>> `arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi` as follows:
>>>
>>> &sdio1 {
>>> 	bus-width = <4>;
>>> 	max-frequency = <100000000>;
>>> 	no-sd;
>>> 	no-mmc;
>>> 	broken-cd;
>>> 	cap-sd-highspeed;
>>> 	post-power-on-delay-ms = <50>;
>>> 	status = "okay";
>>> 	wakeup-source;
>>> 	keep-power-in-suspend;
>>> };
>>>
>>> When resetting the SoC using the reset button, the following messages
>>> appear in the dmesg log:
>>>
>>> [    8.164898] mmc2: Got command interrupt 0x00000001 even though no
>>> command operation was in progress.
>>> [    8.174054] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [    8.180503] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00000005
>>> [    8.186950] mmc2: sdhci: Blk size:  0x00000000 | Blk cnt:  0x00000000
>>> [    8.193395] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
>>> [    8.199841] mmc2: sdhci: Present:   0x03da0000 | Host ctl: 0x00000000
>>> [    8.206287] mmc2: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
>>> [    8.212733] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x0000decf
>>> [    8.219178] mmc2: sdhci: Timeout:   0x00000000 | Int stat: 0x00000000
>>> [    8.225622] mmc2: sdhci: Int enab:  0x00ff1003 | Sig enab: 0x00ff1003
>>> [    8.232068] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [    8.238513] mmc2: sdhci: Caps:      0x3f69c881 | Caps_1:   0x08008177
>>> [    8.244959] mmc2: sdhci: Cmd:       0x00000502 | Max curr: 0x00191919
>>> [    8.254115] mmc2: sdhci: Resp[0]:   0x00001009 | Resp[1]:  0x00000000
>>> [    8.260561] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [    8.267005] mmc2: sdhci: Host ctl2: 0x00001000
>>> [    8.271453] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
>>> 0x0000000000000000
>>> [    8.278594] mmc2: sdhci: ============================================
>>>
>>> I also enabled some traces to better understand the problem:
>>>
>>>      kworker/3:1-62      [003] .....     8.163538: mmc_request_start:
>>> mmc2: start struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>>> cmd_arg=0x0 cmd_flags=0x2e1 cmd_retries=0 stop_opcode=0 stop_arg=0x0
>>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
>>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
>>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
>>> retune_period=0
>>>           <idle>-0       [000] d.h2.     8.164816: sdhci_cmd_irq:
>>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x10000
>>> intmask_p=0x18000
>>>      irq/24-mmc2-96      [000] .....     8.164840: sdhci_thread_irq:
>>> msg=
>>>      irq/24-mmc2-96      [000] d.h2.     8.164896: sdhci_cmd_irq:
>>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x1
>>> intmask_p=0x1
>>>      irq/24-mmc2-96      [000] .....     8.285142: mmc_request_done:
>>> mmc2: end struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>>> cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0 need_retune=0
>>> hold_retune=1 retune_period=0
>>>
>>> Here's what happens: the __mmc_start_request function is called with
>>> opcode 5. Since the power to the Wi-Fi card, which resides on this SDIO
>>> bus, is initially off after the reset, an interrupt SDHCI_INT_TIMEOUT is
>>> triggered. Immediately after that, a second interrupt SDHCI_INT_RESPONSE
>>> is triggered. Depending on the exact timing, these conditions can
>>> trigger the following race problem:
>>>
>>> 1) The sdhci_cmd_irq top half handles the command as an error. It sets
>>>    host->cmd to NULL and host->pending_reset to true.
>>> 2) The sdhci_thread_irq bottom half is scheduled next and executes faster
>>>    than the second interrupt handler for SDHCI_INT_RESPONSE. It clears
>>>    host->pending_reset before the SDHCI_INT_RESPONSE handler runs.
>>> 3) The pending interrupt SDHCI_INT_RESPONSE handler gets called, triggering
>>>    a code path that prints: "mmc2: Got command interrupt 0x00000001 even
>>>    though no command operation was in progress."
>>>
>>> To solve this issue, we need to clear pending interrupts when resetting
>>> host->pending_reset. This ensures that after sdhci_threaded_irq restores
>>> interrupts, there are no pending stale interrupts.
>>>
>>> Signed-off-by: Michal Wilczynski <m.wilczynski@samsung.com>
>>> ---
>>>  drivers/mmc/host/sdhci.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>>> index 4b91c9e96635..b91a6076c332 100644
>>> --- a/drivers/mmc/host/sdhci.c
>>> +++ b/drivers/mmc/host/sdhci.c
>>> @@ -3098,6 +3098,10 @@ static bool sdhci_request_done(struct sdhci_host *host)
>>>  		sdhci_reset_for(host, REQUEST_ERROR);
>>>  
>>>  		host->pending_reset = false;
>>> +
>>> +		/* Clear any pending interrupts after reset */
>>> +		sdhci_writel(host, SDHCI_INT_CMD_MASK | SDHCI_INT_DATA_MASK,
>>> +			     SDHCI_INT_STATUS);
>>
>> According to SDHCI spec, "Software Reset For CMD Line" clears
>> "Command Complete" in "Normal Interrupt Status register", so the
>> interrupt status should not need to be cleared again.
>>
>> Which SDHCI driver is it?
>>
> 
> Thank you for your review.
> 
> The driver in use is located at drivers/mmc/host/sdhci-of-dwcmshc.c,
> and the .compatible string is thead,th1520-dwcmshc.
> 
> Based on the specifications, it appears that the specific hardware used in
> the LicheePi4A SoC may not be fully compliant with the SDHCI standard.
> Therefore, the appropriate solution might be to add additional clearing
> code to the th1520_sdhci_reset() function.
> 

Seems reasonable
diff mbox series

Patch

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 4b91c9e96635..b91a6076c332 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -3098,6 +3098,10 @@  static bool sdhci_request_done(struct sdhci_host *host)
 		sdhci_reset_for(host, REQUEST_ERROR);
 
 		host->pending_reset = false;
+
+		/* Clear any pending interrupts after reset */
+		sdhci_writel(host, SDHCI_INT_CMD_MASK | SDHCI_INT_DATA_MASK,
+			     SDHCI_INT_STATUS);
 	}
 
 	/*