diff mbox series

wifi: ath12k: Fix WARN_ON during firmware crash in split-phy

Message ID 20240529034405.2863150-1-quic_aarasahu@quicinc.com (mailing list archive)
State Accepted
Commit 670d4949bc8e7178420761130664f563453085bd
Delegated to: Kalle Valo
Headers show
Series wifi: ath12k: Fix WARN_ON during firmware crash in split-phy | expand

Commit Message

Aaradhana Sahu May 29, 2024, 3:44 a.m. UTC
Whenever firmware is crashed in split-phy below WARN_ON triggered:

? __warn+0x7b/0x1a0
? drv_stop+0x1eb/0x210 [mac80211]
? report_bug+0x10b/0x200
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? drv_stop+0x1eb/0x210 [mac80211]
ieee80211_do_stop+0x5ba/0x850 [mac80211]
ieee80211_stop+0x51/0x180 [mac80211]
__dev_close_many+0xb3/0x130
dev_close_many+0xa3/0x180
? lock_release+0xde/0x420
dev_close.part.147+0x5f/0xa0
cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
ieee80211_restart_work+0xf9/0x130 [mac80211]
process_scheduled_works+0x377/0x6f0

The sequence of WARN_ON is:
Thread 1:
-Firmware crash calls ath12k_core_reset().
-Call ieee80211_restart_hw() inside
 ath12k_core_post_reconfigure_recovery() which schedules worker
 for both hardware.
-Wait for completion of ab->recovery_start.

Thread 2 (worker thread):
-One hardware acquires rtnl_lock() inside ieee80211_restart_hw() and
 calls ath12k_mac_wait_reconfigure() into ath12k_mac_op_start().
-Hardware is waiting for ab->reconfigure_complete but at this time
 recovery_start_count value is 1 because another worker thread
 (local->restart_work) is still waiting for rtnl_lock().
 recovery_start_count is not equal to number of radios
 (2 in split-phy). So ab->recovery_start complete does not set
 due to this, thread 1 is still waiting and not able to perform
 hif power down up and firmware reload.
-Wait timeout happens for ab->reconfigure_complete and comeback
 to caller (ath12k_mac_op_start()) and sends WMI command to
 crashed firmware and gets error.
-This returns error to drv_start() and local->started is set to false.
-Hardware calls cfg80211_shutdown_all_interfaces() after receiving error
 inside ieee80211_restart_work() and goes to drv_stop(), here we trigger
 WARN_ON as local->started is false.

To fix this issue call ieee80211_restart_hw() after firmware has been
reloaded. Now, each hardware can send WMI command to firmware
successfully. With this fix we don't need to wait for
ab->recovery_start completion so remove
ath12k_mac_wait_reconfigure().

Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.1.1-00209-QCAHKSWPL_SILICONZ-1
Tested-on: WCN7850 HW2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3

Signed-off-by: Aaradhana Sahu <quic_aarasahu@quicinc.com>
---
 drivers/net/wireless/ath/ath12k/core.c | 21 ++++++++-------------
 drivers/net/wireless/ath/ath12k/core.h |  3 ---
 drivers/net/wireless/ath/ath12k/mac.c  | 23 -----------------------
 3 files changed, 8 insertions(+), 39 deletions(-)


base-commit: 0442aec67bbd9cbf93bf7f6ee59c9bf5348b9249

Comments

Jeff Johnson May 29, 2024, 4:32 p.m. UTC | #1
On 5/28/2024 8:44 PM, Aaradhana Sahu wrote:
> Whenever firmware is crashed in split-phy below WARN_ON triggered:
> 
> ? __warn+0x7b/0x1a0
> ? drv_stop+0x1eb/0x210 [mac80211]
> ? report_bug+0x10b/0x200
> ? handle_bug+0x3f/0x70
> ? exc_invalid_op+0x13/0x60
> ? asm_exc_invalid_op+0x16/0x20
> ? drv_stop+0x1eb/0x210 [mac80211]
> ieee80211_do_stop+0x5ba/0x850 [mac80211]
> ieee80211_stop+0x51/0x180 [mac80211]
> __dev_close_many+0xb3/0x130
> dev_close_many+0xa3/0x180
> ? lock_release+0xde/0x420
> dev_close.part.147+0x5f/0xa0
> cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
> ieee80211_restart_work+0xf9/0x130 [mac80211]
> process_scheduled_works+0x377/0x6f0
> 
> The sequence of WARN_ON is:
> Thread 1:
> -Firmware crash calls ath12k_core_reset().
> -Call ieee80211_restart_hw() inside
>  ath12k_core_post_reconfigure_recovery() which schedules worker
>  for both hardware.
> -Wait for completion of ab->recovery_start.
> 
> Thread 2 (worker thread):
> -One hardware acquires rtnl_lock() inside ieee80211_restart_hw() and
>  calls ath12k_mac_wait_reconfigure() into ath12k_mac_op_start().
> -Hardware is waiting for ab->reconfigure_complete but at this time
>  recovery_start_count value is 1 because another worker thread
>  (local->restart_work) is still waiting for rtnl_lock().
>  recovery_start_count is not equal to number of radios
>  (2 in split-phy). So ab->recovery_start complete does not set
>  due to this, thread 1 is still waiting and not able to perform
>  hif power down up and firmware reload.
> -Wait timeout happens for ab->reconfigure_complete and comeback
>  to caller (ath12k_mac_op_start()) and sends WMI command to
>  crashed firmware and gets error.
> -This returns error to drv_start() and local->started is set to false.
> -Hardware calls cfg80211_shutdown_all_interfaces() after receiving error
>  inside ieee80211_restart_work() and goes to drv_stop(), here we trigger
>  WARN_ON as local->started is false.
> 
> To fix this issue call ieee80211_restart_hw() after firmware has been
> reloaded. Now, each hardware can send WMI command to firmware
> successfully. With this fix we don't need to wait for
> ab->recovery_start completion so remove
> ath12k_mac_wait_reconfigure().
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.1.1-00209-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 HW2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aaradhana Sahu <quic_aarasahu@quicinc.com>

Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Kalle Valo June 10, 2024, 5:10 p.m. UTC | #2
Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:

> Whenever firmware is crashed in split-phy below WARN_ON triggered:
>
> ? __warn+0x7b/0x1a0
> ? drv_stop+0x1eb/0x210 [mac80211]
> ? report_bug+0x10b/0x200
> ? handle_bug+0x3f/0x70
> ? exc_invalid_op+0x13/0x60
> ? asm_exc_invalid_op+0x16/0x20
> ? drv_stop+0x1eb/0x210 [mac80211]
> ieee80211_do_stop+0x5ba/0x850 [mac80211]
> ieee80211_stop+0x51/0x180 [mac80211]
> __dev_close_many+0xb3/0x130
> dev_close_many+0xa3/0x180
> ? lock_release+0xde/0x420
> dev_close.part.147+0x5f/0xa0
> cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
> ieee80211_restart_work+0xf9/0x130 [mac80211]
> process_scheduled_works+0x377/0x6f0

This is just the stack trace, not the full warning. If you send me the
full warning I can add it to the commit message. Also it would be always
good to identify what warning it is exactly as line numbers can change
etc.
Aaradhana Sahu June 13, 2024, 5:01 a.m. UTC | #3
On 6/10/2024 10:40 PM, Kalle Valo wrote:
> Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:
> 
>> Whenever firmware is crashed in split-phy below WARN_ON triggered:
>>
>> ? __warn+0x7b/0x1a0
>> ? drv_stop+0x1eb/0x210 [mac80211]
>> ? report_bug+0x10b/0x200
>> ? handle_bug+0x3f/0x70
>> ? exc_invalid_op+0x13/0x60
>> ? asm_exc_invalid_op+0x16/0x20
>> ? drv_stop+0x1eb/0x210 [mac80211]
>> ieee80211_do_stop+0x5ba/0x850 [mac80211]
>> ieee80211_stop+0x51/0x180 [mac80211]
>> __dev_close_many+0xb3/0x130
>> dev_close_many+0xa3/0x180
>> ? lock_release+0xde/0x420
>> dev_close.part.147+0x5f/0xa0
>> cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
>> ieee80211_restart_work+0xf9/0x130 [mac80211]
>> process_scheduled_works+0x377/0x6f0
> 
> This is just the stack trace, not the full warning. If you send me the
> full warning I can add it to the commit message. Also it would be always
> good to identify what warning it is exactly as line numbers can change
> etc.
> 

Sure, the full warning is given below:

[  364.713223] WARNING: CPU: 3 PID: 82 at net/mac80211/driver-ops.c:41 drv_stop+0xac/0xbc
[  364.716875] Modules linked in: ath12k qmi_helpers
[  364.724598] CPU: 3 PID: 82 Comm: kworker/3:2 Tainted: G      D W          6.9.0-next-20240520-00113-gd981a3784e15 #39
[  364.729378] Hardware name: Qualcomm Technologies, Inc. IPQ9574/AP-AL02-C9 (DT)
[  364.739965] Workqueue: events_freezable ieee80211_restart_work
[  364.747082] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  364.752897] pc : drv_stop+0xac/0xbc
[  364.759752] lr : ieee80211_stop_device+0x54/0x64
[  364.763226] sp : ffff8000848dbb20
[  364.768085] x29: ffff8000848dbb20 x28: 0000000000000790 x27: ffff000014d78900
[  364.771301] x26: ffff000014d791f8 x25: ffff000007f0d9b0 x24: 0000000000000018
[  364.778419] x23: 0000000000000001 x22: 0000000000000000 x21: ffff000014d78e10
[  364.785537] x20: ffff800081dc0000 x19: ffff000014d78900 x18: ffffffffffffffff
[  364.792655] x17: ffff7fffbca84000 x16: ffff800083fe0000 x15: ffff800081dc0b48
[  364.799774] x14: 0000000000000076 x13: 0000000000000076 x12: 0000000000000001
[  364.806892] x11: 0000000000000000 x10: 0000000000000a60 x9 : ffff8000848db980
[  364.814009] x8 : ffff000000dddfc0 x7 : 0000000000000400 x6 : ffff800083b012d8
[  364.821128] x5 : ffff800083b012d8 x4 : 0000000000000000 x3 : ffff000014d78398
[  364.828246] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000014d78900
[  364.835364] Call trace:
[  364.842478]  drv_stop+0xac/0xbc
[  364.844734]  ieee80211_stop_device+0x54/0x64
[  364.847860]  ieee80211_do_stop+0x5a0/0x790
[  364.852375]  ieee80211_stop+0x4c/0x178
[  364.856280]  __dev_close_many+0xb0/0x150
[  364.860014]  dev_close_many+0x88/0x130
[  364.864092]  dev_close.part.171+0x44/0x74
[  364.867653]  dev_close+0x1c/0x28
[  364.871732]  cfg80211_shutdown_all_interfaces+0x44/0xfc
[  364.875031]  ieee80211_restart_work+0xfc/0x14c
[  364.879979]  process_scheduled_works+0x18c/0x2dc
[  364.884494]  worker_thread+0x13c/0x314
[  364.889266]  kthread+0x118/0x124
[  364.892825]  ret_from_fork+0x10/0x20
[  364.896211] ---[ end trace 0000000000000000 ]---
Kalle Valo June 13, 2024, 6:07 a.m. UTC | #4
Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:

> On 6/10/2024 10:40 PM, Kalle Valo wrote:
>> Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:
>> 
>>> Whenever firmware is crashed in split-phy below WARN_ON triggered:
>>>
>>> ? __warn+0x7b/0x1a0
>>> ? drv_stop+0x1eb/0x210 [mac80211]
>>> ? report_bug+0x10b/0x200
>>> ? handle_bug+0x3f/0x70
>>> ? exc_invalid_op+0x13/0x60
>>> ? asm_exc_invalid_op+0x16/0x20
>>> ? drv_stop+0x1eb/0x210 [mac80211]
>>> ieee80211_do_stop+0x5ba/0x850 [mac80211]
>>> ieee80211_stop+0x51/0x180 [mac80211]
>>> __dev_close_many+0xb3/0x130
>>> dev_close_many+0xa3/0x180
>>> ? lock_release+0xde/0x420
>>> dev_close.part.147+0x5f/0xa0
>>> cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
>>> ieee80211_restart_work+0xf9/0x130 [mac80211]
>>> process_scheduled_works+0x377/0x6f0
>> 
>> This is just the stack trace, not the full warning. If you send me the
>> full warning I can add it to the commit message. Also it would be always
>> good to identify what warning it is exactly as line numbers can change
>> etc.
>> 
>
> Sure, the full warning is given below:
>
> [ 364.713223] WARNING: CPU: 3 PID: 82 at net/mac80211/driver-ops.c:41
> drv_stop+0xac/0xbc
> [  364.716875] Modules linked in: ath12k qmi_helpers
> [ 364.724598] CPU: 3 PID: 82 Comm: kworker/3:2 Tainted: G D W
> 6.9.0-next-20240520-00113-gd981a3784e15 #39
> [  364.729378] Hardware name: Qualcomm Technologies, Inc. IPQ9574/AP-AL02-C9 (DT)
> [  364.739965] Workqueue: events_freezable ieee80211_restart_work
> [  364.747082] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [  364.752897] pc : drv_stop+0xac/0xbc
> [  364.759752] lr : ieee80211_stop_device+0x54/0x64
> [  364.763226] sp : ffff8000848dbb20
> [  364.768085] x29: ffff8000848dbb20 x28: 0000000000000790 x27: ffff000014d78900
> [  364.771301] x26: ffff000014d791f8 x25: ffff000007f0d9b0 x24: 0000000000000018
> [  364.778419] x23: 0000000000000001 x22: 0000000000000000 x21: ffff000014d78e10
> [  364.785537] x20: ffff800081dc0000 x19: ffff000014d78900 x18: ffffffffffffffff
> [  364.792655] x17: ffff7fffbca84000 x16: ffff800083fe0000 x15: ffff800081dc0b48
> [  364.799774] x14: 0000000000000076 x13: 0000000000000076 x12: 0000000000000001
> [  364.806892] x11: 0000000000000000 x10: 0000000000000a60 x9 : ffff8000848db980
> [  364.814009] x8 : ffff000000dddfc0 x7 : 0000000000000400 x6 : ffff800083b012d8
> [  364.821128] x5 : ffff800083b012d8 x4 : 0000000000000000 x3 : ffff000014d78398
> [  364.828246] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000014d78900
> [  364.835364] Call trace:
> [  364.842478]  drv_stop+0xac/0xbc
> [  364.844734]  ieee80211_stop_device+0x54/0x64
> [  364.847860]  ieee80211_do_stop+0x5a0/0x790
> [  364.852375]  ieee80211_stop+0x4c/0x178
> [  364.856280]  __dev_close_many+0xb0/0x150
> [  364.860014]  dev_close_many+0x88/0x130
> [  364.864092]  dev_close.part.171+0x44/0x74
> [  364.867653]  dev_close+0x1c/0x28
> [  364.871732]  cfg80211_shutdown_all_interfaces+0x44/0xfc
> [  364.875031]  ieee80211_restart_work+0xfc/0x14c
> [  364.879979]  process_scheduled_works+0x18c/0x2dc
> [  364.884494]  worker_thread+0x13c/0x314
> [  364.889266]  kthread+0x118/0x124
> [  364.892825]  ret_from_fork+0x10/0x20
> [  364.896211] ---[ end trace 0000000000000000 ]---

Thanks, so I assume it's this check from drv_stop():

	if (WARN_ON(!local->started))
		return;
Aaradhana Sahu June 13, 2024, 6:11 a.m. UTC | #5
On 6/13/2024 11:37 AM, Kalle Valo wrote:
> Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:
> 
>> On 6/10/2024 10:40 PM, Kalle Valo wrote:
>>> Aaradhana Sahu <quic_aarasahu@quicinc.com> writes:
>>>
>>>> Whenever firmware is crashed in split-phy below WARN_ON triggered:
>>>>
>>>> ? __warn+0x7b/0x1a0
>>>> ? drv_stop+0x1eb/0x210 [mac80211]
>>>> ? report_bug+0x10b/0x200
>>>> ? handle_bug+0x3f/0x70
>>>> ? exc_invalid_op+0x13/0x60
>>>> ? asm_exc_invalid_op+0x16/0x20
>>>> ? drv_stop+0x1eb/0x210 [mac80211]
>>>> ieee80211_do_stop+0x5ba/0x850 [mac80211]
>>>> ieee80211_stop+0x51/0x180 [mac80211]
>>>> __dev_close_many+0xb3/0x130
>>>> dev_close_many+0xa3/0x180
>>>> ? lock_release+0xde/0x420
>>>> dev_close.part.147+0x5f/0xa0
>>>> cfg80211_shutdown_all_interfaces+0x44/0xe0 [cfg80211]
>>>> ieee80211_restart_work+0xf9/0x130 [mac80211]
>>>> process_scheduled_works+0x377/0x6f0
>>>
>>> This is just the stack trace, not the full warning. If you send me the
>>> full warning I can add it to the commit message. Also it would be always
>>> good to identify what warning it is exactly as line numbers can change
>>> etc.
>>>
>>
>> Sure, the full warning is given below:
>>
>> [ 364.713223] WARNING: CPU: 3 PID: 82 at net/mac80211/driver-ops.c:41
>> drv_stop+0xac/0xbc
>> [  364.716875] Modules linked in: ath12k qmi_helpers
>> [ 364.724598] CPU: 3 PID: 82 Comm: kworker/3:2 Tainted: G D W
>> 6.9.0-next-20240520-00113-gd981a3784e15 #39
>> [  364.729378] Hardware name: Qualcomm Technologies, Inc. IPQ9574/AP-AL02-C9 (DT)
>> [  364.739965] Workqueue: events_freezable ieee80211_restart_work
>> [  364.747082] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [  364.752897] pc : drv_stop+0xac/0xbc
>> [  364.759752] lr : ieee80211_stop_device+0x54/0x64
>> [  364.763226] sp : ffff8000848dbb20
>> [  364.768085] x29: ffff8000848dbb20 x28: 0000000000000790 x27: ffff000014d78900
>> [  364.771301] x26: ffff000014d791f8 x25: ffff000007f0d9b0 x24: 0000000000000018
>> [  364.778419] x23: 0000000000000001 x22: 0000000000000000 x21: ffff000014d78e10
>> [  364.785537] x20: ffff800081dc0000 x19: ffff000014d78900 x18: ffffffffffffffff
>> [  364.792655] x17: ffff7fffbca84000 x16: ffff800083fe0000 x15: ffff800081dc0b48
>> [  364.799774] x14: 0000000000000076 x13: 0000000000000076 x12: 0000000000000001
>> [  364.806892] x11: 0000000000000000 x10: 0000000000000a60 x9 : ffff8000848db980
>> [  364.814009] x8 : ffff000000dddfc0 x7 : 0000000000000400 x6 : ffff800083b012d8
>> [  364.821128] x5 : ffff800083b012d8 x4 : 0000000000000000 x3 : ffff000014d78398
>> [  364.828246] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000014d78900
>> [  364.835364] Call trace:
>> [  364.842478]  drv_stop+0xac/0xbc
>> [  364.844734]  ieee80211_stop_device+0x54/0x64
>> [  364.847860]  ieee80211_do_stop+0x5a0/0x790
>> [  364.852375]  ieee80211_stop+0x4c/0x178
>> [  364.856280]  __dev_close_many+0xb0/0x150
>> [  364.860014]  dev_close_many+0x88/0x130
>> [  364.864092]  dev_close.part.171+0x44/0x74
>> [  364.867653]  dev_close+0x1c/0x28
>> [  364.871732]  cfg80211_shutdown_all_interfaces+0x44/0xfc
>> [  364.875031]  ieee80211_restart_work+0xfc/0x14c
>> [  364.879979]  process_scheduled_works+0x18c/0x2dc
>> [  364.884494]  worker_thread+0x13c/0x314
>> [  364.889266]  kthread+0x118/0x124
>> [  364.892825]  ret_from_fork+0x10/0x20
>> [  364.896211] ---[ end trace 0000000000000000 ]---
> 
> Thanks, so I assume it's this check from drv_stop():
> 
> 	if (WARN_ON(!local->started))
> 		return;
> 

Yes.
Kalle Valo June 17, 2024, 2:45 p.m. UTC | #6
Aaradhana Sahu <quic_aarasahu@quicinc.com> wrote:

> Whenever firmware is crashed in split-phy below WARN_ON() triggered:
> 
> WARNING: CPU: 3 PID: 82 at net/mac80211/driver-ops.c:41 drv_stop+0xac/0xbc
> Modules linked in: ath12k qmi_helpers
> CPU: 3 PID: 82 Comm: kworker/3:2 Tainted: G      D W          6.9.0-next-20240520-00113-gd981a3784e15 #39
> Hardware name: Qualcomm Technologies, Inc. IPQ9574/AP-AL02-C9 (DT)
> Workqueue: events_freezable ieee80211_restart_work
> pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : drv_stop+0xac/0xbc
> lr : ieee80211_stop_device+0x54/0x64
> sp : ffff8000848dbb20
> x29: ffff8000848dbb20 x28: 0000000000000790 x27: ffff000014d78900
> x26: ffff000014d791f8 x25: ffff000007f0d9b0 x24: 0000000000000018
> x23: 0000000000000001 x22: 0000000000000000 x21: ffff000014d78e10
> x20: ffff800081dc0000 x19: ffff000014d78900 x18: ffffffffffffffff
> x17: ffff7fffbca84000 x16: ffff800083fe0000 x15: ffff800081dc0b48
> x14: 0000000000000076 x13: 0000000000000076 x12: 0000000000000001
> x11: 0000000000000000 x10: 0000000000000a60 x9 : ffff8000848db980
> x8 : ffff000000dddfc0 x7 : 0000000000000400 x6 : ffff800083b012d8
> x5 : ffff800083b012d8 x4 : 0000000000000000 x3 : ffff000014d78398
> x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000014d78900
> Call trace:
>  drv_stop+0xac/0xbc
>  ieee80211_stop_device+0x54/0x64
>  ieee80211_do_stop+0x5a0/0x790
>  ieee80211_stop+0x4c/0x178
>  __dev_close_many+0xb0/0x150
>  dev_close_many+0x88/0x130
>  dev_close.part.171+0x44/0x74
>  dev_close+0x1c/0x28
>  cfg80211_shutdown_all_interfaces+0x44/0xfc
>  ieee80211_restart_work+0xfc/0x14c
>  process_scheduled_works+0x18c/0x2dc
>  worker_thread+0x13c/0x314
>  kthread+0x118/0x124
>  ret_from_fork+0x10/0x20
> ---[ end trace 0000000000000000 ]---
> 
> The warning in question is from drv_stop():
> 
>         if (WARN_ON(!local->started))
>                 return;
> 
> The sequence of WARN_ON() is:
> Thread 1:
> -Firmware crash calls ath12k_core_reset().
> -Call ieee80211_restart_hw() inside
>  ath12k_core_post_reconfigure_recovery() which schedules worker
>  for both hardware.
> -Wait for completion of ab->recovery_start.
> 
> Thread 2 (worker thread):
> -One hardware acquires rtnl_lock() inside ieee80211_restart_hw() and
>  calls ath12k_mac_wait_reconfigure() into ath12k_mac_op_start().
> -Hardware is waiting for ab->reconfigure_complete but at this time
>  recovery_start_count value is 1 because another worker thread
>  (local->restart_work) is still waiting for rtnl_lock().
>  recovery_start_count is not equal to number of radios
>  (2 in split-phy). So ab->recovery_start complete does not set
>  due to this, thread 1 is still waiting and not able to perform
>  hif power down up and firmware reload.
> -Wait timeout happens for ab->reconfigure_complete and comeback
>  to caller (ath12k_mac_op_start()) and sends WMI command to
>  crashed firmware and gets error.
> -This returns error to drv_start() and local->started is set to false.
> -Hardware calls cfg80211_shutdown_all_interfaces() after receiving error
>  inside ieee80211_restart_work() and goes to drv_stop(), here we trigger
>  WARN_ON as local->started is false.
> 
> To fix this issue call ieee80211_restart_hw() after firmware has been
> reloaded. Now, each hardware can send WMI command to firmware
> successfully. With this fix we don't need to wait for
> ab->recovery_start completion so remove
> ath12k_mac_wait_reconfigure().
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.1.1-00209-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 HW2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aaradhana Sahu <quic_aarasahu@quicinc.com>
> Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com>
> Signed-off-by: Kalle Valo <quic_kvalo@quicinc.com>

Patch applied to ath-next branch of ath.git, thanks.

670d4949bc8e wifi: ath12k: Fix WARN_ON during firmware crash in split-phy
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath12k/core.c b/drivers/net/wireless/ath/ath12k/core.c
index e7f628e935e4..9fac8b1b064e 100644
--- a/drivers/net/wireless/ath/ath12k/core.c
+++ b/drivers/net/wireless/ath/ath12k/core.c
@@ -1059,8 +1059,6 @@  static void ath12k_core_post_reconfigure_recovery(struct ath12k_base *ab)
 				mutex_unlock(&ar->conf_mutex);
 			}
 
-			/* Restart after all the link/radio halt */
-			ieee80211_restart_hw(ah->hw);
 			break;
 		case ATH12K_HW_STATE_OFF:
 			ath12k_warn(ab,
@@ -1087,7 +1085,8 @@  static void ath12k_core_post_reconfigure_recovery(struct ath12k_base *ab)
 static void ath12k_core_restart(struct work_struct *work)
 {
 	struct ath12k_base *ab = container_of(work, struct ath12k_base, restart_work);
-	int ret;
+	struct ath12k_hw *ah;
+	int ret, i;
 
 	ret = ath12k_core_reconfigure_on_crash(ab);
 	if (ret) {
@@ -1095,8 +1094,12 @@  static void ath12k_core_restart(struct work_struct *work)
 		return;
 	}
 
-	if (ab->is_reset)
-		complete_all(&ab->reconfigure_complete);
+	if (ab->is_reset) {
+		for (i = 0; i < ab->num_hw; i++) {
+			ah = ab->ah[i];
+			ieee80211_restart_hw(ah->hw);
+		}
+	}
 
 	complete(&ab->restart_completed);
 }
@@ -1150,20 +1153,14 @@  static void ath12k_core_reset(struct work_struct *work)
 	ath12k_dbg(ab, ATH12K_DBG_BOOT, "reset starting\n");
 
 	ab->is_reset = true;
-	atomic_set(&ab->recovery_start_count, 0);
-	reinit_completion(&ab->recovery_start);
 	atomic_set(&ab->recovery_count, 0);
 
 	ath12k_core_pre_reconfigure_recovery(ab);
 
-	reinit_completion(&ab->reconfigure_complete);
 	ath12k_core_post_reconfigure_recovery(ab);
 
 	ath12k_dbg(ab, ATH12K_DBG_BOOT, "waiting recovery start...\n");
 
-	time_left = wait_for_completion_timeout(&ab->recovery_start,
-						ATH12K_RECOVER_START_TIMEOUT_HZ);
-
 	ath12k_hif_irq_disable(ab);
 	ath12k_hif_ce_irq_disable(ab);
 
@@ -1246,8 +1243,6 @@  struct ath12k_base *ath12k_core_alloc(struct device *dev, size_t priv_size,
 	mutex_init(&ab->core_lock);
 	spin_lock_init(&ab->base_lock);
 	init_completion(&ab->reset_complete);
-	init_completion(&ab->reconfigure_complete);
-	init_completion(&ab->recovery_start);
 
 	INIT_LIST_HEAD(&ab->peers);
 	init_waitqueue_head(&ab->peer_mapping_wq);
diff --git a/drivers/net/wireless/ath/ath12k/core.h b/drivers/net/wireless/ath/ath12k/core.h
index 7d20b09c52e6..96fafa0e05dc 100644
--- a/drivers/net/wireless/ath/ath12k/core.h
+++ b/drivers/net/wireless/ath/ath12k/core.h
@@ -846,11 +846,8 @@  struct ath12k_base {
 	struct work_struct reset_work;
 	atomic_t reset_count;
 	atomic_t recovery_count;
-	atomic_t recovery_start_count;
 	bool is_reset;
 	struct completion reset_complete;
-	struct completion reconfigure_complete;
-	struct completion recovery_start;
 	/* continuous recovery fail count */
 	atomic_t fail_cont_count;
 	unsigned long reset_fail_timeout;
diff --git a/drivers/net/wireless/ath/ath12k/mac.c b/drivers/net/wireless/ath/ath12k/mac.c
index 784964ae03ec..33616ab795af 100644
--- a/drivers/net/wireless/ath/ath12k/mac.c
+++ b/drivers/net/wireless/ath/ath12k/mac.c
@@ -5834,28 +5834,6 @@  static int ath12k_mac_config_mon_status_default(struct ath12k *ar, bool enable)
 	/* TODO: Need to support new monitor mode */
 }
 
-static void ath12k_mac_wait_reconfigure(struct ath12k_base *ab)
-{
-	int recovery_start_count;
-
-	if (!ab->is_reset)
-		return;
-
-	recovery_start_count = atomic_inc_return(&ab->recovery_start_count);
-
-	ath12k_dbg(ab, ATH12K_DBG_MAC, "recovery start count %d\n", recovery_start_count);
-
-	if (recovery_start_count == ab->num_radios) {
-		complete(&ab->recovery_start);
-		ath12k_dbg(ab, ATH12K_DBG_MAC, "recovery started success\n");
-	}
-
-	ath12k_dbg(ab, ATH12K_DBG_MAC, "waiting reconfigure...\n");
-
-	wait_for_completion_timeout(&ab->reconfigure_complete,
-				    ATH12K_RECONFIGURE_TIMEOUT_HZ);
-}
-
 static int ath12k_mac_start(struct ath12k *ar)
 {
 	struct ath12k_hw *ah = ar->ah;
@@ -5987,7 +5965,6 @@  static int ath12k_mac_op_start(struct ieee80211_hw *hw)
 		break;
 	case ATH12K_HW_STATE_RESTARTING:
 		ah->state = ATH12K_HW_STATE_RESTARTED;
-		ath12k_mac_wait_reconfigure(ah->ab);
 		break;
 	case ATH12K_HW_STATE_RESTARTED:
 	case ATH12K_HW_STATE_WEDGED: