Message ID | 20200214035555.24762-1-wgong@codeaurora.org (mailing list archive) |
---|---|
State | RFC |
Delegated to: | Kalle Valo |
Headers | show |
Series | [RFC] ath10k: change to do napi_enable and napi_disable when insmod and rmmod for sdio | expand |
Wen Gong <wgong@codeaurora.org> writes: > It happened "Kernel panic - not syncing: hung_task: blocked tasks" when > test simulate crash and ifconfig down/rmmod meanwhile. > > Test steps: > > 1.Test commands > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio > > 2. dmesg: > [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash > [ 5622.655995] ieee80211 phy0: Hardware restart was requested > [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds. > [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds. > [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks > [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G W 4.19.86 #137 > [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT) > [ 5776.359855] Call trace: > [ 5776.359868] dump_backtrace+0x0/0x170 > [ 5776.359881] show_stack+0x20/0x2c > [ 5776.359896] dump_stack+0xd4/0x10c > [ 5776.359916] panic+0x12c/0x29c > [ 5776.359937] hung_task_panic+0x0/0x50 > [ 5776.359953] kthread+0x120/0x130 > [ 5776.359965] ret_from_fork+0x10/0x18 > [ 5776.359986] SMP: stopping secondary CPUs > [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000 > [ 5776.360026] CPU features: 0x0,2188200c > [ 5776.360035] Memory Limit: none > > command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked > callstack of ifconfig: > [<0>] __switch_to+0x120/0x13c > [<0>] msleep+0x28/0x38 > [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio] > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > [<0>] __dev_close_many+0xb8/0x11c > [<0>] __dev_change_flags+0xe0/0x1d0 > [<0>] dev_change_flags+0x30/0x6c > [<0>] devinet_ioctl+0x370/0x564 > [<0>] inet_ioctl+0xdc/0x304 > [<0>] sock_do_ioctl+0x50/0x288 > [<0>] compat_sock_ioctl+0x1b4/0x1aac > [<0>] __se_compat_sys_ioctl+0x100/0x26fc > [<0>] __arm64_compat_sys_ioctl+0x20/0x2c > [<0>] el0_svc_common+0xa4/0x154 > [<0>] el0_svc_compat_handler+0x2c/0x38 > [<0>] el0_svc_compat+0x8/0x18 > [<0>] 0xffffffffffffffff > > callstack of rmmod: > [<0>] __switch_to+0x120/0x13c > [<0>] msleep+0x28/0x38 > [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio] > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > [<0>] __dev_close_many+0xb8/0x11c > [<0>] dev_close_many+0x70/0x100 > [<0>] dev_close+0x4c/0x80 > [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211] > [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211] > [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211] > [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core] > [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core] > [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio] > [<0>] sdio_bus_remove+0x48/0x108 > [<0>] device_release_driver_internal+0x138/0x1ec > [<0>] driver_detach+0x6c/0xa8 > [<0>] bus_remove_driver+0x78/0xa8 > [<0>] driver_unregister+0x30/0x50 > [<0>] sdio_unregister_driver+0x28/0x34 > [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio] > [<0>] __arm64_sys_delete_module+0x1e0/0x22c > [<0>] el0_svc_common+0xa4/0x154 > [<0>] el0_svc_compat_handler+0x2c/0x38 > [<0>] el0_svc_compat+0x8/0x18 > [<0>] 0xffffffffffffffff > > The test command run simulate_fw_crash firstly and it call into > ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable > is called and bit NAPI_STATE_SCHED is set. After that, function > ath10k_sdio_hif_stop is called again from ath10k_stop by command > "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked. > > It is blocked by napi_synchronize, napi_disable will set bit with > NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop > becuase bit NAPI_STATE_SCHED is set by napi_disable. > > function of napi_synchronize > static inline void napi_synchronize(const struct napi_struct *n) > { > if (IS_ENABLED(CONFIG_SMP)) > while (test_bit(NAPI_STATE_SCHED, &n->state)) > msleep(1); > else > barrier(); > } > > function of napi_disable > void napi_disable(struct napi_struct *n) > { > might_sleep(); > set_bit(NAPI_STATE_DISABLE, &n->state); > > while (test_and_set_bit(NAPI_STATE_SCHED, &n->state)) > msleep(1); > while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state)) > msleep(1); > > hrtimer_cancel(&n->timer); > > clear_bit(NAPI_STATE_DISABLE, &n->state); > } > > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042. > > Signed-off-by: Wen Gong <wgong@codeaurora.org> > --- > drivers/net/wireless/ath/ath10k/sdio.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c > index 7b894dcaad2e..b71499b171c6 100644 > --- a/drivers/net/wireless/ath/ath10k/sdio.c > +++ b/drivers/net/wireless/ath/ath10k/sdio.c > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) > struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); > int ret; > > - napi_enable(&ar->napi); > - > /* Sleep 20 ms before HIF interrupts are disabled. > * This will give target plenty of time to process the BMI done > * request before interrupts are disabled. > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) > spin_unlock_bh(&ar_sdio->wr_async_lock); > > napi_synchronize(&ar->napi); > - napi_disable(&ar->napi); > } > > #ifdef CONFIG_PM > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, > > netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, > ATH10K_NAPI_BUDGET); > + napi_enable(&ar->napi); > > ath10k_dbg(ar, ATH10K_DBG_BOOT, > "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) > > ath10k_core_unregister(ar); > > + napi_disable(&ar->napi); > netif_napi_del(&ar->napi); > > ath10k_core_destroy(ar); I'm not really convinced that this is the right fix, but I'm no NAPI expert. Can anyone else help? And even if we did this fix/hack in ath10k we should change all bus types to do the same. SDIO should not behave differently from PCI, AHB and SNOC.
On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> wrote: > > Wen Gong <wgong@codeaurora.org> writes: > > > It happened "Kernel panic - not syncing: hung_task: blocked tasks" when > > test simulate crash and ifconfig down/rmmod meanwhile. > > > > Test steps: > > > > 1.Test commands > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio > > > > 2. dmesg: > > [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash > > [ 5622.655995] ieee80211 phy0: Hardware restart was requested > > [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds. > > [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds. > > [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks > > [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G W 4.19.86 #137 > > [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT) > > [ 5776.359855] Call trace: > > [ 5776.359868] dump_backtrace+0x0/0x170 > > [ 5776.359881] show_stack+0x20/0x2c > > [ 5776.359896] dump_stack+0xd4/0x10c > > [ 5776.359916] panic+0x12c/0x29c > > [ 5776.359937] hung_task_panic+0x0/0x50 > > [ 5776.359953] kthread+0x120/0x130 > > [ 5776.359965] ret_from_fork+0x10/0x18 > > [ 5776.359986] SMP: stopping secondary CPUs > > [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000 > > [ 5776.360026] CPU features: 0x0,2188200c > > [ 5776.360035] Memory Limit: none > > > > command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked > > callstack of ifconfig: > > [<0>] __switch_to+0x120/0x13c > > [<0>] msleep+0x28/0x38 > > [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio] > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > > [<0>] __dev_close_many+0xb8/0x11c > > [<0>] __dev_change_flags+0xe0/0x1d0 > > [<0>] dev_change_flags+0x30/0x6c > > [<0>] devinet_ioctl+0x370/0x564 > > [<0>] inet_ioctl+0xdc/0x304 > > [<0>] sock_do_ioctl+0x50/0x288 > > [<0>] compat_sock_ioctl+0x1b4/0x1aac > > [<0>] __se_compat_sys_ioctl+0x100/0x26fc > > [<0>] __arm64_compat_sys_ioctl+0x20/0x2c > > [<0>] el0_svc_common+0xa4/0x154 > > [<0>] el0_svc_compat_handler+0x2c/0x38 > > [<0>] el0_svc_compat+0x8/0x18 > > [<0>] 0xffffffffffffffff > > > > callstack of rmmod: > > [<0>] __switch_to+0x120/0x13c > > [<0>] msleep+0x28/0x38 > > [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio] > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > > [<0>] __dev_close_many+0xb8/0x11c > > [<0>] dev_close_many+0x70/0x100 > > [<0>] dev_close+0x4c/0x80 > > [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211] > > [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211] > > [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211] > > [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core] > > [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core] > > [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio] > > [<0>] sdio_bus_remove+0x48/0x108 > > [<0>] device_release_driver_internal+0x138/0x1ec > > [<0>] driver_detach+0x6c/0xa8 > > [<0>] bus_remove_driver+0x78/0xa8 > > [<0>] driver_unregister+0x30/0x50 > > [<0>] sdio_unregister_driver+0x28/0x34 > > [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio] > > [<0>] __arm64_sys_delete_module+0x1e0/0x22c > > [<0>] el0_svc_common+0xa4/0x154 > > [<0>] el0_svc_compat_handler+0x2c/0x38 > > [<0>] el0_svc_compat+0x8/0x18 > > [<0>] 0xffffffffffffffff > > > > The test command run simulate_fw_crash firstly and it call into > > ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable > > is called and bit NAPI_STATE_SCHED is set. After that, function > > ath10k_sdio_hif_stop is called again from ath10k_stop by command > > "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked. > > > > It is blocked by napi_synchronize, napi_disable will set bit with > > NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop > > becuase bit NAPI_STATE_SCHED is set by napi_disable. > > > > function of napi_synchronize > > static inline void napi_synchronize(const struct napi_struct *n) > > { > > if (IS_ENABLED(CONFIG_SMP)) > > while (test_bit(NAPI_STATE_SCHED, &n->state)) > > msleep(1); > > else > > barrier(); > > } > > > > function of napi_disable > > void napi_disable(struct napi_struct *n) > > { > > might_sleep(); > > set_bit(NAPI_STATE_DISABLE, &n->state); > > > > while (test_and_set_bit(NAPI_STATE_SCHED, &n->state)) > > msleep(1); > > while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state)) > > msleep(1); > > > > hrtimer_cancel(&n->timer); > > > > clear_bit(NAPI_STATE_DISABLE, &n->state); > > } > > > > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042. > > > > Signed-off-by: Wen Gong <wgong@codeaurora.org> > > --- > > drivers/net/wireless/ath/ath10k/sdio.c | 5 ++--- > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c > > index 7b894dcaad2e..b71499b171c6 100644 > > --- a/drivers/net/wireless/ath/ath10k/sdio.c > > +++ b/drivers/net/wireless/ath/ath10k/sdio.c > > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) > > struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); > > int ret; > > > > - napi_enable(&ar->napi); > > - > > /* Sleep 20 ms before HIF interrupts are disabled. > > * This will give target plenty of time to process the BMI done > > * request before interrupts are disabled. > > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) > > spin_unlock_bh(&ar_sdio->wr_async_lock); > > > > napi_synchronize(&ar->napi); > > - napi_disable(&ar->napi); > > } > > > > #ifdef CONFIG_PM > > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, > > > > netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, > > ATH10K_NAPI_BUDGET); > > + napi_enable(&ar->napi); > > > > ath10k_dbg(ar, ATH10K_DBG_BOOT, > > "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", > > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) > > > > ath10k_core_unregister(ar); > > > > + napi_disable(&ar->napi); > > netif_napi_del(&ar->napi); > > > > ath10k_core_destroy(ar); > > I'm not really convinced that this is the right fix, but I'm no NAPI > expert. Can anyone else help? Calling napi_disable() twice can lead to hangs, but moving NAPI from start/stop to the probe isn't the right approach as the datapath is tied to start/stop. Maybe check the state of NAPI before disable? if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) napi_disable(&ar->napi) or maintain napi_state like this https://patchwork.kernel.org/patch/10249365/ Also, the most common cause for such issues (1st napi_synchronize/napi_disable hang) is that napi_poll is being scheduled, so, you might want to check that napi_schedule isn't called after stop. cd ath10k; git log --grep=napi shows plenty of such issues. the one that matches closest is c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a regression.
On Thu, Aug 20, 2020 at 2:49 PM Krishna Chaitanya <chaitanya.mgit@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> wrote: > > > > Wen Gong <wgong@codeaurora.org> writes: > > > > > It happened "Kernel panic - not syncing: hung_task: blocked tasks" when > > > test simulate crash and ifconfig down/rmmod meanwhile. > > > > > > Test steps: > > > > > > 1.Test commands > > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down > > > echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio > > > > > > 2. dmesg: > > > [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash > > > [ 5622.655995] ieee80211 phy0: Hardware restart was requested > > > [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds. > > > [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds. > > > [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks > > > [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G W 4.19.86 #137 > > > [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT) > > > [ 5776.359855] Call trace: > > > [ 5776.359868] dump_backtrace+0x0/0x170 > > > [ 5776.359881] show_stack+0x20/0x2c > > > [ 5776.359896] dump_stack+0xd4/0x10c > > > [ 5776.359916] panic+0x12c/0x29c > > > [ 5776.359937] hung_task_panic+0x0/0x50 > > > [ 5776.359953] kthread+0x120/0x130 > > > [ 5776.359965] ret_from_fork+0x10/0x18 > > > [ 5776.359986] SMP: stopping secondary CPUs > > > [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000 > > > [ 5776.360026] CPU features: 0x0,2188200c > > > [ 5776.360035] Memory Limit: none > > > > > > command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked > > > callstack of ifconfig: > > > [<0>] __switch_to+0x120/0x13c > > > [<0>] msleep+0x28/0x38 > > > [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio] > > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > > > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > > > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > > > [<0>] __dev_close_many+0xb8/0x11c > > > [<0>] __dev_change_flags+0xe0/0x1d0 > > > [<0>] dev_change_flags+0x30/0x6c > > > [<0>] devinet_ioctl+0x370/0x564 > > > [<0>] inet_ioctl+0xdc/0x304 > > > [<0>] sock_do_ioctl+0x50/0x288 > > > [<0>] compat_sock_ioctl+0x1b4/0x1aac > > > [<0>] __se_compat_sys_ioctl+0x100/0x26fc > > > [<0>] __arm64_compat_sys_ioctl+0x20/0x2c > > > [<0>] el0_svc_common+0xa4/0x154 > > > [<0>] el0_svc_compat_handler+0x2c/0x38 > > > [<0>] el0_svc_compat+0x8/0x18 > > > [<0>] 0xffffffffffffffff > > > > > > callstack of rmmod: > > > [<0>] __switch_to+0x120/0x13c > > > [<0>] msleep+0x28/0x38 > > > [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio] > > > [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] > > > [<0>] ath10k_halt+0x120/0x178 [ath10k_core] > > > [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] > > > [<0>] drv_stop+0xe0/0x1e4 [mac80211] > > > [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] > > > [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] > > > [<0>] ieee80211_stop+0x20/0x30 [mac80211] > > > [<0>] __dev_close_many+0xb8/0x11c > > > [<0>] dev_close_many+0x70/0x100 > > > [<0>] dev_close+0x4c/0x80 > > > [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211] > > > [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211] > > > [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211] > > > [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core] > > > [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core] > > > [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio] > > > [<0>] sdio_bus_remove+0x48/0x108 > > > [<0>] device_release_driver_internal+0x138/0x1ec > > > [<0>] driver_detach+0x6c/0xa8 > > > [<0>] bus_remove_driver+0x78/0xa8 > > > [<0>] driver_unregister+0x30/0x50 > > > [<0>] sdio_unregister_driver+0x28/0x34 > > > [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio] > > > [<0>] __arm64_sys_delete_module+0x1e0/0x22c > > > [<0>] el0_svc_common+0xa4/0x154 > > > [<0>] el0_svc_compat_handler+0x2c/0x38 > > > [<0>] el0_svc_compat+0x8/0x18 > > > [<0>] 0xffffffffffffffff > > > > > > The test command run simulate_fw_crash firstly and it call into > > > ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable > > > is called and bit NAPI_STATE_SCHED is set. After that, function > > > ath10k_sdio_hif_stop is called again from ath10k_stop by command > > > "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked. > > > > > > It is blocked by napi_synchronize, napi_disable will set bit with > > > NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop > > > becuase bit NAPI_STATE_SCHED is set by napi_disable. > > > > > > function of napi_synchronize > > > static inline void napi_synchronize(const struct napi_struct *n) > > > { > > > if (IS_ENABLED(CONFIG_SMP)) > > > while (test_bit(NAPI_STATE_SCHED, &n->state)) > > > msleep(1); > > > else > > > barrier(); > > > } > > > > > > function of napi_disable > > > void napi_disable(struct napi_struct *n) > > > { > > > might_sleep(); > > > set_bit(NAPI_STATE_DISABLE, &n->state); > > > > > > while (test_and_set_bit(NAPI_STATE_SCHED, &n->state)) > > > msleep(1); > > > while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state)) > > > msleep(1); > > > > > > hrtimer_cancel(&n->timer); > > > > > > clear_bit(NAPI_STATE_DISABLE, &n->state); > > > } > > > > > > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042. > > > > > > Signed-off-by: Wen Gong <wgong@codeaurora.org> > > > --- > > > drivers/net/wireless/ath/ath10k/sdio.c | 5 ++--- > > > 1 file changed, 2 insertions(+), 3 deletions(-) > > > > > > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c > > > index 7b894dcaad2e..b71499b171c6 100644 > > > --- a/drivers/net/wireless/ath/ath10k/sdio.c > > > +++ b/drivers/net/wireless/ath/ath10k/sdio.c > > > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) > > > struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); > > > int ret; > > > > > > - napi_enable(&ar->napi); > > > - > > > /* Sleep 20 ms before HIF interrupts are disabled. > > > * This will give target plenty of time to process the BMI done > > > * request before interrupts are disabled. > > > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) > > > spin_unlock_bh(&ar_sdio->wr_async_lock); > > > > > > napi_synchronize(&ar->napi); > > > - napi_disable(&ar->napi); > > > } > > > > > > #ifdef CONFIG_PM > > > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, > > > > > > netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, > > > ATH10K_NAPI_BUDGET); > > > + napi_enable(&ar->napi); > > > > > > ath10k_dbg(ar, ATH10K_DBG_BOOT, > > > "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", > > > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) > > > > > > ath10k_core_unregister(ar); > > > > > > + napi_disable(&ar->napi); > > > netif_napi_del(&ar->napi); > > > > > > ath10k_core_destroy(ar); > > > > I'm not really convinced that this is the right fix, but I'm no NAPI > > expert. Can anyone else help? > Calling napi_disable() twice can lead to hangs, but moving NAPI from > start/stop to > the probe isn't the right approach as the datapath is tied to start/stop. > > Maybe check the state of NAPI before disable? > > if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > napi_disable(&ar->napi) > > or maintain napi_state like this https://patchwork.kernel.org/patch/10249365/ > > Also, the most common cause for such issues (1st > napi_synchronize/napi_disable hang) > is that napi_poll is being scheduled, so, you might want to check that > napi_schedule isn't > called after stop. > > cd ath10k; git log --grep=napi shows plenty of such issues. the one > that matches closest is > c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a regression. Also, I see that napi_schedule() is being called from work_queue async_work_rx so we should cancel that work in hif_stop before calling napi_synchronize.
On 2020-08-20 17:19, Krishna Chaitanya wrote: ... >> > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c >> > index 7b894dcaad2e..b71499b171c6 100644 >> > --- a/drivers/net/wireless/ath/ath10k/sdio.c >> > +++ b/drivers/net/wireless/ath/ath10k/sdio.c >> > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) >> > struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); >> > int ret; >> > >> > - napi_enable(&ar->napi); >> > - >> > /* Sleep 20 ms before HIF interrupts are disabled. >> > * This will give target plenty of time to process the BMI done >> > * request before interrupts are disabled. >> > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) >> > spin_unlock_bh(&ar_sdio->wr_async_lock); >> > >> > napi_synchronize(&ar->napi); >> > - napi_disable(&ar->napi); >> > } >> > >> > #ifdef CONFIG_PM >> > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, >> > >> > netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, >> > ATH10K_NAPI_BUDGET); >> > + napi_enable(&ar->napi); >> > >> > ath10k_dbg(ar, ATH10K_DBG_BOOT, >> > "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", >> > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) >> > >> > ath10k_core_unregister(ar); >> > >> > + napi_disable(&ar->napi); >> > netif_napi_del(&ar->napi); >> > >> > ath10k_core_destroy(ar); >> >> I'm not really convinced that this is the right fix, but I'm no NAPI >> expert. Can anyone else help? > Calling napi_disable() twice can lead to hangs, but moving NAPI from > start/stop to > the probe isn't the right approach as the datapath is tied to > start/stop. > > Maybe check the state of NAPI before disable? > > if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > napi_disable(&ar->napi) > or maintain napi_state like this > https://patchwork.kernel.org/patch/10249365/ it is better to use above link's patch. napi.state is controlled by napi API, it is better ath10k not know it. > > Also, the most common cause for such issues (1st > napi_synchronize/napi_disable hang) > is that napi_poll is being scheduled, so, you might want to check that > napi_schedule isn't > called after stop. > > cd ath10k; git log --grep=napi shows plenty of such issues. the one > that matches closest is > c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a > regression. This above commit's scene is not same with this patch. It is hang for only do 1 simulate crash of the commit, this patch is doing simulate crash and rmmod meanwhile.
On 2020-08-20 17:26, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 2:49 PM Krishna Chaitanya > <chaitanya.mgit@gmail.com> wrote: >> >> On Thu, Aug 20, 2020 at 2:03 PM Kalle Valo <kvalo@codeaurora.org> >> wrote: >> > >> > Wen Gong <wgong@codeaurora.org> writes: >> > ... >> > I'm not really convinced that this is the right fix, but I'm no NAPI >> > expert. Can anyone else help? >> Calling napi_disable() twice can lead to hangs, but moving NAPI from >> start/stop to >> the probe isn't the right approach as the datapath is tied to >> start/stop. >> >> Maybe check the state of NAPI before disable? >> >> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >> napi_disable(&ar->napi) >> >> or maintain napi_state like this >> https://patchwork.kernel.org/patch/10249365/ >> >> Also, the most common cause for such issues (1st >> napi_synchronize/napi_disable hang) >> is that napi_poll is being scheduled, so, you might want to check that >> napi_schedule isn't >> called after stop. >> >> cd ath10k; git log --grep=napi shows plenty of such issues. the one >> that matches closest is >> c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a >> regression. > Also, I see that napi_schedule() is being called from work_queue > async_work_rx > so we should cancel that work in hif_stop before calling > napi_synchronize. Yes, I will do that in a new patch.
On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > > On 2020-08-20 17:19, Krishna Chaitanya wrote: > ... > >> > diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c > >> > index 7b894dcaad2e..b71499b171c6 100644 > >> > --- a/drivers/net/wireless/ath/ath10k/sdio.c > >> > +++ b/drivers/net/wireless/ath/ath10k/sdio.c > >> > @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) > >> > struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); > >> > int ret; > >> > > >> > - napi_enable(&ar->napi); > >> > - > >> > /* Sleep 20 ms before HIF interrupts are disabled. > >> > * This will give target plenty of time to process the BMI done > >> > * request before interrupts are disabled. > >> > @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) > >> > spin_unlock_bh(&ar_sdio->wr_async_lock); > >> > > >> > napi_synchronize(&ar->napi); > >> > - napi_disable(&ar->napi); > >> > } > >> > > >> > #ifdef CONFIG_PM > >> > @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, > >> > > >> > netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, > >> > ATH10K_NAPI_BUDGET); > >> > + napi_enable(&ar->napi); > >> > > >> > ath10k_dbg(ar, ATH10K_DBG_BOOT, > >> > "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", > >> > @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) > >> > > >> > ath10k_core_unregister(ar); > >> > > >> > + napi_disable(&ar->napi); > >> > netif_napi_del(&ar->napi); > >> > > >> > ath10k_core_destroy(ar); > >> > >> I'm not really convinced that this is the right fix, but I'm no NAPI > >> expert. Can anyone else help? > > Calling napi_disable() twice can lead to hangs, but moving NAPI from > > start/stop to > > the probe isn't the right approach as the datapath is tied to > > start/stop. > > > > Maybe check the state of NAPI before disable? > > > > if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > > napi_disable(&ar->napi) > > or maintain napi_state like this > > https://patchwork.kernel.org/patch/10249365/ > it is better to use above link's patch. > napi.state is controlled by napi API, it is better ath10k not know it. Sure, but IMHO just canceling the async rx work should solve the issue. > > Also, the most common cause for such issues (1st > > napi_synchronize/napi_disable hang) > > is that napi_poll is being scheduled, so, you might want to check that > > napi_schedule isn't > > called after stop. > > > > cd ath10k; git log --grep=napi shows plenty of such issues. the one > > that matches closest is > > c2cac2f74ab4bcf0db0dcf3a612f1e5b52d145c8, so, it could just be a > > regression. > This above commit's scene is not same with this patch. > It is hang for only do 1 simulate crash of the commit, this patch is > doing simulate crash and rmmod meanwhile. yes, it is the closest one I could find.
On 2020-08-20 18:52, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: >> >> On 2020-08-20 17:19, Krishna Chaitanya wrote: ... >> >> I'm not really convinced that this is the right fix, but I'm no NAPI >> >> expert. Can anyone else help? >> > Calling napi_disable() twice can lead to hangs, but moving NAPI from >> > start/stop to >> > the probe isn't the right approach as the datapath is tied to >> > start/stop. >> > >> > Maybe check the state of NAPI before disable? >> > >> > if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >> > napi_disable(&ar->napi) >> > or maintain napi_state like this >> > https://patchwork.kernel.org/patch/10249365/ >> it is better to use above link's patch. >> napi.state is controlled by napi API, it is better ath10k not know it. > Sure, but IMHO just canceling the async rx work should solve the issue. Oh no, canceling the async rx work will not solve this issue, rx worker ath10k_rx_indication_async_work call napi_schedule, after napi_complete, the NAPI_STATE_SCHED will clear. The issue of this patch is because 2 thread called to hif_stop and NAPI_STATE_SCHED not clear. ...
On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: > > On 2020-08-20 18:52, Krishna Chaitanya wrote: > > On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > >> > >> On 2020-08-20 17:19, Krishna Chaitanya wrote: > ... > >> >> I'm not really convinced that this is the right fix, but I'm no NAPI > >> >> expert. Can anyone else help? > >> > Calling napi_disable() twice can lead to hangs, but moving NAPI from > >> > start/stop to > >> > the probe isn't the right approach as the datapath is tied to > >> > start/stop. > >> > > >> > Maybe check the state of NAPI before disable? > >> > > >> > if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > >> > napi_disable(&ar->napi) > >> > or maintain napi_state like this > >> > https://patchwork.kernel.org/patch/10249365/ > >> it is better to use above link's patch. > >> napi.state is controlled by napi API, it is better ath10k not know it. > > Sure, but IMHO just canceling the async rx work should solve the issue. > Oh no, canceling the async rx work will not solve this issue, rx worker > ath10k_rx_indication_async_work call napi_schedule, after napi_complete, > the NAPI_STATE_SCHED will clear. > The issue of this patch is because 2 thread called to hif_stop and > NAPI_STATE_SCHED not clear. That fix is still valid and good to have. ndev_stop being called twice is typical scenarios (stop vs rmmod), so just checking the netdev_flags for IFF_UP and returning from hif_Stop should suffice, no?
On 8/20/20 9:08 AM, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: >> >> On 2020-08-20 18:52, Krishna Chaitanya wrote: >>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: >>>> >>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: >> ... >>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI >>>>>> expert. Can anyone else help? >>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from >>>>> start/stop to >>>>> the probe isn't the right approach as the datapath is tied to >>>>> start/stop. >>>>> >>>>> Maybe check the state of NAPI before disable? >>>>> >>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >>>>> napi_disable(&ar->napi) >>>>> or maintain napi_state like this >>>>> https://patchwork.kernel.org/patch/10249365/ >>>> it is better to use above link's patch. >>>> napi.state is controlled by napi API, it is better ath10k not know it. >>> Sure, but IMHO just canceling the async rx work should solve the issue. >> Oh no, canceling the async rx work will not solve this issue, rx worker >> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, >> the NAPI_STATE_SCHED will clear. >> The issue of this patch is because 2 thread called to hif_stop and >> NAPI_STATE_SCHED not clear. > That fix is still valid and good to have. > > ndev_stop being called twice is typical scenarios (stop vs rmmod), so > just checking the netdev_flags for IFF_UP and returning from hif_Stop > should suffice, no? My approach to fix this problem was to add a boolean in ath10k as to whether it had napi enabled or not, and then check that before trying to enable/disable it again. Seems to work fine, and cleaner in my mind than checking internal napi flags. Thanks, Ben
On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: > > On 8/20/20 9:08 AM, Krishna Chaitanya wrote: > > On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: > >> > >> On 2020-08-20 18:52, Krishna Chaitanya wrote: > >>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > >>>> > >>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: > >> ... > >>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI > >>>>>> expert. Can anyone else help? > >>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from > >>>>> start/stop to > >>>>> the probe isn't the right approach as the datapath is tied to > >>>>> start/stop. > >>>>> > >>>>> Maybe check the state of NAPI before disable? > >>>>> > >>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > >>>>> napi_disable(&ar->napi) > >>>>> or maintain napi_state like this > >>>>> https://patchwork.kernel.org/patch/10249365/ > >>>> it is better to use above link's patch. > >>>> napi.state is controlled by napi API, it is better ath10k not know it. > >>> Sure, but IMHO just canceling the async rx work should solve the issue. > >> Oh no, canceling the async rx work will not solve this issue, rx worker > >> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, > >> the NAPI_STATE_SCHED will clear. > >> The issue of this patch is because 2 thread called to hif_stop and > >> NAPI_STATE_SCHED not clear. > > That fix is still valid and good to have. > > > > ndev_stop being called twice is typical scenarios (stop vs rmmod), so > > just checking the netdev_flags for IFF_UP and returning from hif_Stop > > should suffice, no? > > My approach to fix this problem was to add a boolean in ath10k as to whether > it had napi enabled or not, and then check that before trying to enable/disable > it again. Seems to work fine, and cleaner in my mind than checking internal > napi flags. A much simpler approach is just to check for IFF_UP and skip NAPI (and others) in the hif_stop no? (provided proper RTNL locking is done if hif_stop is being called internally as well).
On 8/20/20 10:00 AM, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: >> >> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: >>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: >>>> >>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: >>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: >>>>>> >>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: >>>> ... >>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI >>>>>>>> expert. Can anyone else help? >>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from >>>>>>> start/stop to >>>>>>> the probe isn't the right approach as the datapath is tied to >>>>>>> start/stop. >>>>>>> >>>>>>> Maybe check the state of NAPI before disable? >>>>>>> >>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >>>>>>> napi_disable(&ar->napi) >>>>>>> or maintain napi_state like this >>>>>>> https://patchwork.kernel.org/patch/10249365/ >>>>>> it is better to use above link's patch. >>>>>> napi.state is controlled by napi API, it is better ath10k not know it. >>>>> Sure, but IMHO just canceling the async rx work should solve the issue. >>>> Oh no, canceling the async rx work will not solve this issue, rx worker >>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, >>>> the NAPI_STATE_SCHED will clear. >>>> The issue of this patch is because 2 thread called to hif_stop and >>>> NAPI_STATE_SCHED not clear. >>> That fix is still valid and good to have. >>> >>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so >>> just checking the netdev_flags for IFF_UP and returning from hif_Stop >>> should suffice, no? >> >> My approach to fix this problem was to add a boolean in ath10k as to whether >> it had napi enabled or not, and then check that before trying to enable/disable >> it again. Seems to work fine, and cleaner in my mind than checking internal >> napi flags. > A much simpler approach is just to check for IFF_UP and skip NAPI (and others) > in the hif_stop no? (provided proper RTNL locking is done if hif_stop > is being called > internally as well). > I'm not sure, but I think the driver should be internally consistent and not spend a lot of time trying to guess about interactions with objects higher in the stack. Here is my original patch to fix this, it is not complex. https://patchwork.kernel.org/patch/10249363/ Thanks, Ben
On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote: > > On 8/20/20 10:00 AM, Krishna Chaitanya wrote: > > On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: > >> > >> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: > >>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: > >>>> > >>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: > >>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > >>>>>> > >>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: > >>>> ... > >>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI > >>>>>>>> expert. Can anyone else help? > >>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from > >>>>>>> start/stop to > >>>>>>> the probe isn't the right approach as the datapath is tied to > >>>>>>> start/stop. > >>>>>>> > >>>>>>> Maybe check the state of NAPI before disable? > >>>>>>> > >>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > >>>>>>> napi_disable(&ar->napi) > >>>>>>> or maintain napi_state like this > >>>>>>> https://patchwork.kernel.org/patch/10249365/ > >>>>>> it is better to use above link's patch. > >>>>>> napi.state is controlled by napi API, it is better ath10k not know it. > >>>>> Sure, but IMHO just canceling the async rx work should solve the issue. > >>>> Oh no, canceling the async rx work will not solve this issue, rx worker > >>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, > >>>> the NAPI_STATE_SCHED will clear. > >>>> The issue of this patch is because 2 thread called to hif_stop and > >>>> NAPI_STATE_SCHED not clear. > >>> That fix is still valid and good to have. > >>> > >>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so > >>> just checking the netdev_flags for IFF_UP and returning from hif_Stop > >>> should suffice, no? > >> > >> My approach to fix this problem was to add a boolean in ath10k as to whether > >> it had napi enabled or not, and then check that before trying to enable/disable > >> it again. Seems to work fine, and cleaner in my mind than checking internal > >> napi flags. > > A much simpler approach is just to check for IFF_UP and skip NAPI (and others) > > in the hif_stop no? (provided proper RTNL locking is done if hif_stop > > is being called > > internally as well). > > > > I'm not sure, but I think the driver should be internally consistent and not > spend a lot of time trying to guess about interactions with objects higher > in the stack. Fair enough, the network interface state is a basic thing controlled by the driver, so, should be okay to use. Anyways, the in-driver approach has more control. > > Here is my original patch to fix this, it is not complex. > > https://patchwork.kernel.org/patch/10249363/ Sure, I have shared your patch above :).
On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya <chaitanya.mgit@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote: > > > > On 8/20/20 10:00 AM, Krishna Chaitanya wrote: > > > On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: > > >> > > >> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: > > >>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: > > >>>> > > >>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: > > >>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > > >>>>>> > > >>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: > > >>>> ... > > >>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI > > >>>>>>>> expert. Can anyone else help? > > >>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from > > >>>>>>> start/stop to > > >>>>>>> the probe isn't the right approach as the datapath is tied to > > >>>>>>> start/stop. > > >>>>>>> > > >>>>>>> Maybe check the state of NAPI before disable? > > >>>>>>> > > >>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > > >>>>>>> napi_disable(&ar->napi) > > >>>>>>> or maintain napi_state like this > > >>>>>>> https://patchwork.kernel.org/patch/10249365/ > > >>>>>> it is better to use above link's patch. > > >>>>>> napi.state is controlled by napi API, it is better ath10k not know it. > > >>>>> Sure, but IMHO just canceling the async rx work should solve the issue. > > >>>> Oh no, canceling the async rx work will not solve this issue, rx worker > > >>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, > > >>>> the NAPI_STATE_SCHED will clear. > > >>>> The issue of this patch is because 2 thread called to hif_stop and > > >>>> NAPI_STATE_SCHED not clear. > > >>> That fix is still valid and good to have. > > >>> > > >>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so > > >>> just checking the netdev_flags for IFF_UP and returning from hif_Stop > > >>> should suffice, no? > > >> > > >> My approach to fix this problem was to add a boolean in ath10k as to whether > > >> it had napi enabled or not, and then check that before trying to enable/disable > > >> it again. Seems to work fine, and cleaner in my mind than checking internal > > >> napi flags. > > > A much simpler approach is just to check for IFF_UP and skip NAPI (and others) > > > in the hif_stop no? (provided proper RTNL locking is done if hif_stop > > > is being called > > > internally as well). > > > > > > > I'm not sure, but I think the driver should be internally consistent and not > > spend a lot of time trying to guess about interactions with objects higher > > in the stack. > Fair enough, the network interface state is a basic thing controlled > by the driver, > so, should be okay to use. Anyways, the in-driver approach has more control. > > > > Here is my original patch to fix this, it is not complex. > > > > https://patchwork.kernel.org/patch/10249363/ > Sure, I have shared your patch above :). Sent a bit early, any idea why this wasn't upstreamed earlier?
On 8/20/20 10:42 AM, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya > <chaitanya.mgit@gmail.com> wrote: >> >> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote: >>> >>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote: >>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: >>>>> >>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: >>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: >>>>>>> >>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: >>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: >>>>>>>>> >>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: >>>>>>> ... >>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI >>>>>>>>>>> expert. Can anyone else help? >>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from >>>>>>>>>> start/stop to >>>>>>>>>> the probe isn't the right approach as the datapath is tied to >>>>>>>>>> start/stop. >>>>>>>>>> >>>>>>>>>> Maybe check the state of NAPI before disable? >>>>>>>>>> >>>>>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >>>>>>>>>> napi_disable(&ar->napi) >>>>>>>>>> or maintain napi_state like this >>>>>>>>>> https://patchwork.kernel.org/patch/10249365/ >>>>>>>>> it is better to use above link's patch. >>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it. >>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue. >>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker >>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, >>>>>>> the NAPI_STATE_SCHED will clear. >>>>>>> The issue of this patch is because 2 thread called to hif_stop and >>>>>>> NAPI_STATE_SCHED not clear. >>>>>> That fix is still valid and good to have. >>>>>> >>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so >>>>>> just checking the netdev_flags for IFF_UP and returning from hif_Stop >>>>>> should suffice, no? >>>>> >>>>> My approach to fix this problem was to add a boolean in ath10k as to whether >>>>> it had napi enabled or not, and then check that before trying to enable/disable >>>>> it again. Seems to work fine, and cleaner in my mind than checking internal >>>>> napi flags. >>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others) >>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop >>>> is being called >>>> internally as well). >>>> >>> >>> I'm not sure, but I think the driver should be internally consistent and not >>> spend a lot of time trying to guess about interactions with objects higher >>> in the stack. >> Fair enough, the network interface state is a basic thing controlled >> by the driver, >> so, should be okay to use. Anyways, the in-driver approach has more control. >>> >>> Here is my original patch to fix this, it is not complex. >>> >>> https://patchwork.kernel.org/patch/10249363/ >> Sure, I have shared your patch above :). > Sent a bit early, any idea why this wasn't upstreamed earlier? No, one comment from Michal indicated maybe there were more problems lurking in this area, but he seemed to be OK with the patch over all. After that, it was just ignored. Thanks, Ben
On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> wrote: > > On 8/20/20 10:42 AM, Krishna Chaitanya wrote: > > On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya > > <chaitanya.mgit@gmail.com> wrote: > >> > >> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote: > >>> > >>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote: > >>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: > >>>>> > >>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: > >>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: > >>>>>>> > >>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: > >>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: > >>>>>>>>> > >>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: > >>>>>>> ... > >>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI > >>>>>>>>>>> expert. Can anyone else help? > >>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from > >>>>>>>>>> start/stop to > >>>>>>>>>> the probe isn't the right approach as the datapath is tied to > >>>>>>>>>> start/stop. > >>>>>>>>>> > >>>>>>>>>> Maybe check the state of NAPI before disable? > >>>>>>>>>> > >>>>>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) > >>>>>>>>>> napi_disable(&ar->napi) > >>>>>>>>>> or maintain napi_state like this > >>>>>>>>>> https://patchwork.kernel.org/patch/10249365/ > >>>>>>>>> it is better to use above link's patch. > >>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it. > >>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue. > >>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker > >>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, > >>>>>>> the NAPI_STATE_SCHED will clear. > >>>>>>> The issue of this patch is because 2 thread called to hif_stop and > >>>>>>> NAPI_STATE_SCHED not clear. > >>>>>> That fix is still valid and good to have. > >>>>>> > >>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so > >>>>>> just checking the netdev_flags for IFF_UP and returning from hif_Stop > >>>>>> should suffice, no? > >>>>> > >>>>> My approach to fix this problem was to add a boolean in ath10k as to whether > >>>>> it had napi enabled or not, and then check that before trying to enable/disable > >>>>> it again. Seems to work fine, and cleaner in my mind than checking internal > >>>>> napi flags. > >>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others) > >>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop > >>>> is being called > >>>> internally as well). > >>>> > >>> > >>> I'm not sure, but I think the driver should be internally consistent and not > >>> spend a lot of time trying to guess about interactions with objects higher > >>> in the stack. > >> Fair enough, the network interface state is a basic thing controlled > >> by the driver, > >> so, should be okay to use. Anyways, the in-driver approach has more control. > >>> > >>> Here is my original patch to fix this, it is not complex. > >>> > >>> https://patchwork.kernel.org/patch/10249363/ > >> Sure, I have shared your patch above :). > > Sent a bit early, any idea why this wasn't upstreamed earlier? > > No, one comment from Michal indicated maybe there were more problems lurking > in this area, but he seemed to be OK with the patch over all. After that, > it was just ignored. > Now might be a good time to push for it :)
On 8/20/20 1:15 PM, Krishna Chaitanya wrote: > On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> wrote: >> >> On 8/20/20 10:42 AM, Krishna Chaitanya wrote: >>> On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya >>> <chaitanya.mgit@gmail.com> wrote: >>>> >>>> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear <greearb@candelatech.com> wrote: >>>>> >>>>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote: >>>>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear <greearb@candelatech.com> wrote: >>>>>>> >>>>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: >>>>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> wrote: >>>>>>>>> >>>>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: >>>>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong <wgong@codeaurora.org> wrote: >>>>>>>>>>> >>>>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: >>>>>>>>> ... >>>>>>>>>>>>> I'm not really convinced that this is the right fix, but I'm no NAPI >>>>>>>>>>>>> expert. Can anyone else help? >>>>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving NAPI from >>>>>>>>>>>> start/stop to >>>>>>>>>>>> the probe isn't the right approach as the datapath is tied to >>>>>>>>>>>> start/stop. >>>>>>>>>>>> >>>>>>>>>>>> Maybe check the state of NAPI before disable? >>>>>>>>>>>> >>>>>>>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >>>>>>>>>>>> napi_disable(&ar->napi) >>>>>>>>>>>> or maintain napi_state like this >>>>>>>>>>>> https://patchwork.kernel.org/patch/10249365/ >>>>>>>>>>> it is better to use above link's patch. >>>>>>>>>>> napi.state is controlled by napi API, it is better ath10k not know it. >>>>>>>>>> Sure, but IMHO just canceling the async rx work should solve the issue. >>>>>>>>> Oh no, canceling the async rx work will not solve this issue, rx worker >>>>>>>>> ath10k_rx_indication_async_work call napi_schedule, after napi_complete, >>>>>>>>> the NAPI_STATE_SCHED will clear. >>>>>>>>> The issue of this patch is because 2 thread called to hif_stop and >>>>>>>>> NAPI_STATE_SCHED not clear. >>>>>>>> That fix is still valid and good to have. >>>>>>>> >>>>>>>> ndev_stop being called twice is typical scenarios (stop vs rmmod), so >>>>>>>> just checking the netdev_flags for IFF_UP and returning from hif_Stop >>>>>>>> should suffice, no? >>>>>>> >>>>>>> My approach to fix this problem was to add a boolean in ath10k as to whether >>>>>>> it had napi enabled or not, and then check that before trying to enable/disable >>>>>>> it again. Seems to work fine, and cleaner in my mind than checking internal >>>>>>> napi flags. >>>>>> A much simpler approach is just to check for IFF_UP and skip NAPI (and others) >>>>>> in the hif_stop no? (provided proper RTNL locking is done if hif_stop >>>>>> is being called >>>>>> internally as well). >>>>>> >>>>> >>>>> I'm not sure, but I think the driver should be internally consistent and not >>>>> spend a lot of time trying to guess about interactions with objects higher >>>>> in the stack. >>>> Fair enough, the network interface state is a basic thing controlled >>>> by the driver, >>>> so, should be okay to use. Anyways, the in-driver approach has more control. >>>>> >>>>> Here is my original patch to fix this, it is not complex. >>>>> >>>>> https://patchwork.kernel.org/patch/10249363/ >>>> Sure, I have shared your patch above :). >>> Sent a bit early, any idea why this wasn't upstreamed earlier? >> >> No, one comment from Michal indicated maybe there were more problems lurking >> in this area, but he seemed to be OK with the patch over all. After that, >> it was just ignored. >> > Now might be a good time to push for it :) > It is generally a waste of time in my experience. Kalle is the maintainer and should be seeing any of this he cares to see. If he likes the patch, he can apply it or something similar. If you have a reproducible test case, see if the patch fixes things, that might help it be accepted. Thanks, Ben
On 2020-08-21 04:59, Ben Greear wrote: > On 8/20/20 1:15 PM, Krishna Chaitanya wrote: >> On Thu, Aug 20, 2020 at 11:23 PM Ben Greear <greearb@candelatech.com> >> wrote: >>> >>> On 8/20/20 10:42 AM, Krishna Chaitanya wrote: >>>> On Thu, Aug 20, 2020 at 11:11 PM Krishna Chaitanya >>>> <chaitanya.mgit@gmail.com> wrote: >>>>> >>>>> On Thu, Aug 20, 2020 at 10:38 PM Ben Greear >>>>> <greearb@candelatech.com> wrote: >>>>>> >>>>>> On 8/20/20 10:00 AM, Krishna Chaitanya wrote: >>>>>>> On Thu, Aug 20, 2020 at 10:02 PM Ben Greear >>>>>>> <greearb@candelatech.com> wrote: >>>>>>>> >>>>>>>> On 8/20/20 9:08 AM, Krishna Chaitanya wrote: >>>>>>>>> On Thu, Aug 20, 2020 at 8:07 PM Wen Gong <wgong@codeaurora.org> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> On 2020-08-20 18:52, Krishna Chaitanya wrote: >>>>>>>>>>> On Thu, Aug 20, 2020 at 3:45 PM Wen Gong >>>>>>>>>>> <wgong@codeaurora.org> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 2020-08-20 17:19, Krishna Chaitanya wrote: >>>>>>>>>> ... >>>>>>>>>>>>>> I'm not really convinced that this is the right fix, but >>>>>>>>>>>>>> I'm no NAPI >>>>>>>>>>>>>> expert. Can anyone else help? >>>>>>>>>>>>> Calling napi_disable() twice can lead to hangs, but moving >>>>>>>>>>>>> NAPI from >>>>>>>>>>>>> start/stop to >>>>>>>>>>>>> the probe isn't the right approach as the datapath is tied >>>>>>>>>>>>> to >>>>>>>>>>>>> start/stop. >>>>>>>>>>>>> >>>>>>>>>>>>> Maybe check the state of NAPI before disable? >>>>>>>>>>>>> >>>>>>>>>>>>> if (test_bit(NAPI_STATE_SCHED, &ar->napi.napi.state)) >>>>>>>>>>>>> napi_disable(&ar->napi) >>>>>>>>>>>>> or maintain napi_state like this >>>>>>>>>>>>> https://patchwork.kernel.org/patch/10249365/ >>>>>>>>>>>> it is better to use above link's patch. >>>>>>>>>>>> napi.state is controlled by napi API, it is better ath10k >>>>>>>>>>>> not know it. >>>>>>>>>>> Sure, but IMHO just canceling the async rx work should solve >>>>>>>>>>> the issue. >>>>>>>>>> Oh no, canceling the async rx work will not solve this issue, >>>>>>>>>> rx worker >>>>>>>>>> ath10k_rx_indication_async_work call napi_schedule, after >>>>>>>>>> napi_complete, >>>>>>>>>> the NAPI_STATE_SCHED will clear. >>>>>>>>>> The issue of this patch is because 2 thread called to hif_stop >>>>>>>>>> and >>>>>>>>>> NAPI_STATE_SCHED not clear. >>>>>>>>> That fix is still valid and good to have. >>>>>>>>> >>>>>>>>> ndev_stop being called twice is typical scenarios (stop vs >>>>>>>>> rmmod), so >>>>>>>>> just checking the netdev_flags for IFF_UP and returning >>>>>>>>> from hif_Stop >>>>>>>>> should suffice, no? >>>>>>>> >>>>>>>> My approach to fix this problem was to add a boolean in ath10k >>>>>>>> as to whether >>>>>>>> it had napi enabled or not, and then check that before trying to >>>>>>>> enable/disable >>>>>>>> it again. Seems to work fine, and cleaner in my mind than >>>>>>>> checking internal >>>>>>>> napi flags. >>>>>>> A much simpler approach is just to check for IFF_UP and skip NAPI >>>>>>> (and others) >>>>>>> in the hif_stop no? (provided proper RTNL locking is done if >>>>>>> hif_stop >>>>>>> is being called >>>>>>> internally as well). >>>>>>> >>>>>> >>>>>> I'm not sure, but I think the driver should be internally >>>>>> consistent and not >>>>>> spend a lot of time trying to guess about interactions with >>>>>> objects higher >>>>>> in the stack. >>>>> Fair enough, the network interface state is a basic thing >>>>> controlled >>>>> by the driver, >>>>> so, should be okay to use. Anyways, the in-driver approach has more >>>>> control. >>>>>> >>>>>> Here is my original patch to fix this, it is not complex. >>>>>> >>>>>> https://patchwork.kernel.org/patch/10249363/ >>>>> Sure, I have shared your patch above :). >>>> Sent a bit early, any idea why this wasn't upstreamed earlier? >>> >>> No, one comment from Michal indicated maybe there were more problems >>> lurking >>> in this area, but he seemed to be OK with the patch over all. After >>> that, >>> it was just ignored. >>> >> Now might be a good time to push for it :) >> > > It is generally a waste of time in my experience. Kalle is the > maintainer and should > be seeing any of this he cares to see. If he likes the patch, he can > apply it or > something similar. If you have a reproducible test case, see if the > patch fixes > things, that might help it be accepted. I have 2 cmd, each one can reproduce the hang. echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio and with the my patch, it fix the hang. Change of my patch is similar with your patch(https://patchwork.kernel.org/patch/10249365/), so it should also fix the hang with your patch. > > Thanks, > Ben
On 2020-08-21 10:45, Wen Gong wrote: ... > I have 2 cmd, each one can reproduce the hang. > echo soft > > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep > 0.05;ifconfig wlan0 down > echo soft > > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod > ath10k_sdio > and with the my patch, it fix the hang. Change of my patch is similar > with your > patch(https://patchwork.kernel.org/patch/10249365/), so it should also > fix the hang with your patch. I have sent v2: https://patchwork.kernel.org/patch/11732165/
Ben Greear <greearb@candelatech.com> writes: >>>>>> Here is my original patch to fix this, it is not complex. >>>>>> >>>>>> https://patchwork.kernel.org/patch/10249363/ >>>>> Sure, I have shared your patch above :). >>>> Sent a bit early, any idea why this wasn't upstreamed earlier? >>> >>> No, one comment from Michal indicated maybe there were more problems lurking >>> in this area, but he seemed to be OK with the patch over all. After that, >>> it was just ignored. >>> >> Now might be a good time to push for it :) >> > > It is generally a waste of time in my experience. Kalle is the maintainer and should > be seeing any of this he cares to see. If he likes the patch, he can apply it or > something similar. If you have a reproducible test case, see if the patch fixes > things, that might help it be accepted. The problem with yours (Ben's) patches is that you have your own set of patches for ath10k and your own firmware. So I cannot know at all if your patches work with upstream ath10k and upstream firmware, and would need to test the patches myself. But nowadays I just can't find the time for testing. So if someone else can do the testing and provide a Tested-on tag it would it increase my confidence level for the patches.
On 9/7/20 9:07 AM, Kalle Valo wrote: > Ben Greear <greearb@candelatech.com> writes: > >>>>>>> Here is my original patch to fix this, it is not complex. >>>>>>> >>>>>>> https://patchwork.kernel.org/patch/10249363/ >>>>>> Sure, I have shared your patch above :). >>>>> Sent a bit early, any idea why this wasn't upstreamed earlier? >>>> >>>> No, one comment from Michal indicated maybe there were more problems lurking >>>> in this area, but he seemed to be OK with the patch over all. After that, >>>> it was just ignored. >>>> >>> Now might be a good time to push for it :) >>> >> >> It is generally a waste of time in my experience. Kalle is the maintainer and should >> be seeing any of this he cares to see. If he likes the patch, he can apply it or >> something similar. If you have a reproducible test case, see if the patch fixes >> things, that might help it be accepted. > > The problem with yours (Ben's) patches is that you have your own set of > patches for ath10k and your own firmware. So I cannot know at all if > your patches work with upstream ath10k and upstream firmware, and would > need to test the patches myself. But nowadays I just can't find the time > for testing. So if someone else can do the testing and provide a > Tested-on tag it would it increase my confidence level for the patches. Surely codeaura could get a few entry level engineers to run basic testing against your target platforms on a regular basis? The several years of time this bug was known (to me at least, and to whoever saw my original patch) and the time wasted by codeaura to rediscover and re-fix the bug would have much better been spent just testing and review my patch to begin with. And not just my patches either, this pattern is far and wide in ath10k. Also, my driver is often tested against various upstream QCA firmware and chipsets in openwrt, so while bugs are always possible, there is some test coverage. Thanks, Ben
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c index 7b894dcaad2e..b71499b171c6 100644 --- a/drivers/net/wireless/ath/ath10k/sdio.c +++ b/drivers/net/wireless/ath/ath10k/sdio.c @@ -1756,8 +1756,6 @@ static int ath10k_sdio_hif_start(struct ath10k *ar) struct ath10k_sdio *ar_sdio = ath10k_sdio_priv(ar); int ret; - napi_enable(&ar->napi); - /* Sleep 20 ms before HIF interrupts are disabled. * This will give target plenty of time to process the BMI done * request before interrupts are disabled. @@ -1884,7 +1882,6 @@ static void ath10k_sdio_hif_stop(struct ath10k *ar) spin_unlock_bh(&ar_sdio->wr_async_lock); napi_synchronize(&ar->napi); - napi_disable(&ar->napi); } #ifdef CONFIG_PM @@ -2121,6 +2118,7 @@ static int ath10k_sdio_probe(struct sdio_func *func, netif_napi_add(&ar->napi_dev, &ar->napi, ath10k_sdio_napi_poll, ATH10K_NAPI_BUDGET); + napi_enable(&ar->napi); ath10k_dbg(ar, ATH10K_DBG_BOOT, "sdio new func %d vendor 0x%x device 0x%x block 0x%x/0x%x\n", @@ -2235,6 +2233,7 @@ static void ath10k_sdio_remove(struct sdio_func *func) ath10k_core_unregister(ar); + napi_disable(&ar->napi); netif_napi_del(&ar->napi); ath10k_core_destroy(ar);
It happened "Kernel panic - not syncing: hung_task: blocked tasks" when test simulate crash and ifconfig down/rmmod meanwhile. Test steps: 1.Test commands echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;sleep 0.05;ifconfig wlan0 down echo soft > /sys/kernel/debug/ieee80211/phy0/ath10k/simulate_fw_crash;rmmod ath10k_sdio 2. dmesg: [ 5622.548630] ath10k_sdio mmc1:0001:1: simulating soft firmware crash [ 5622.655995] ieee80211 phy0: Hardware restart was requested [ 5776.355164] INFO: task shill:1572 blocked for more than 122 seconds. [ 5776.355687] INFO: task kworker/1:2:24437 blocked for more than 122 seconds. [ 5776.359812] Kernel panic - not syncing: hung_task: blocked tasks [ 5776.359836] CPU: 1 PID: 55 Comm: khungtaskd Tainted: G W 4.19.86 #137 [ 5776.359846] Hardware name: MediaTek krane sku176 board (DT) [ 5776.359855] Call trace: [ 5776.359868] dump_backtrace+0x0/0x170 [ 5776.359881] show_stack+0x20/0x2c [ 5776.359896] dump_stack+0xd4/0x10c [ 5776.359916] panic+0x12c/0x29c [ 5776.359937] hung_task_panic+0x0/0x50 [ 5776.359953] kthread+0x120/0x130 [ 5776.359965] ret_from_fork+0x10/0x18 [ 5776.359986] SMP: stopping secondary CPUs [ 5776.360012] Kernel Offset: 0x141ea00000 from 0xffffff8008000000 [ 5776.360026] CPU features: 0x0,2188200c [ 5776.360035] Memory Limit: none command "ifconfig wlan0 down" or "rmmod ath10k_sdio" will be blocked callstack of ifconfig: [<0>] __switch_to+0x120/0x13c [<0>] msleep+0x28/0x38 [<0>] ath10k_sdio_hif_stop+0x24c/0x294 [ath10k_sdio] [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] [<0>] ath10k_halt+0x120/0x178 [ath10k_core] [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] [<0>] drv_stop+0xe0/0x1e4 [mac80211] [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] [<0>] ieee80211_stop+0x20/0x30 [mac80211] [<0>] __dev_close_many+0xb8/0x11c [<0>] __dev_change_flags+0xe0/0x1d0 [<0>] dev_change_flags+0x30/0x6c [<0>] devinet_ioctl+0x370/0x564 [<0>] inet_ioctl+0xdc/0x304 [<0>] sock_do_ioctl+0x50/0x288 [<0>] compat_sock_ioctl+0x1b4/0x1aac [<0>] __se_compat_sys_ioctl+0x100/0x26fc [<0>] __arm64_compat_sys_ioctl+0x20/0x2c [<0>] el0_svc_common+0xa4/0x154 [<0>] el0_svc_compat_handler+0x2c/0x38 [<0>] el0_svc_compat+0x8/0x18 [<0>] 0xffffffffffffffff callstack of rmmod: [<0>] __switch_to+0x120/0x13c [<0>] msleep+0x28/0x38 [<0>] ath10k_sdio_hif_stop+0x294/0x31c [ath10k_sdio] [<0>] ath10k_core_stop+0x50/0x78 [ath10k_core] [<0>] ath10k_halt+0x120/0x178 [ath10k_core] [<0>] ath10k_stop+0x4c/0x8c [ath10k_core] [<0>] drv_stop+0xe0/0x1e4 [mac80211] [<0>] ieee80211_stop_device+0x48/0x54 [mac80211] [<0>] ieee80211_do_stop+0x678/0x6f8 [mac80211] [<0>] ieee80211_stop+0x20/0x30 [mac80211] [<0>] __dev_close_many+0xb8/0x11c [<0>] dev_close_many+0x70/0x100 [<0>] dev_close+0x4c/0x80 [<0>] cfg80211_shutdown_all_interfaces+0x50/0xcc [cfg80211] [<0>] ieee80211_remove_interfaces+0x58/0x1a0 [mac80211] [<0>] ieee80211_unregister_hw+0x40/0x100 [mac80211] [<0>] ath10k_mac_unregister+0x1c/0x44 [ath10k_core] [<0>] ath10k_core_unregister+0x38/0x7c [ath10k_core] [<0>] ath10k_sdio_remove+0x8c/0xd0 [ath10k_sdio] [<0>] sdio_bus_remove+0x48/0x108 [<0>] device_release_driver_internal+0x138/0x1ec [<0>] driver_detach+0x6c/0xa8 [<0>] bus_remove_driver+0x78/0xa8 [<0>] driver_unregister+0x30/0x50 [<0>] sdio_unregister_driver+0x28/0x34 [<0>] cleanup_module+0x14/0x6bc [ath10k_sdio] [<0>] __arm64_sys_delete_module+0x1e0/0x22c [<0>] el0_svc_common+0xa4/0x154 [<0>] el0_svc_compat_handler+0x2c/0x38 [<0>] el0_svc_compat+0x8/0x18 [<0>] 0xffffffffffffffff The test command run simulate_fw_crash firstly and it call into ath10k_sdio_hif_stop from ath10k_core_restart, then napi_disable is called and bit NAPI_STATE_SCHED is set. After that, function ath10k_sdio_hif_stop is called again from ath10k_stop by command "ifconfig wlan0 down" or "rmmod ath10k_sdio", then command blocked. It is blocked by napi_synchronize, napi_disable will set bit with NAPI_STATE_SCHED, and then napi_synchronize will enter dead loop becuase bit NAPI_STATE_SCHED is set by napi_disable. function of napi_synchronize static inline void napi_synchronize(const struct napi_struct *n) { if (IS_ENABLED(CONFIG_SMP)) while (test_bit(NAPI_STATE_SCHED, &n->state)) msleep(1); else barrier(); } function of napi_disable void napi_disable(struct napi_struct *n) { might_sleep(); set_bit(NAPI_STATE_DISABLE, &n->state); while (test_and_set_bit(NAPI_STATE_SCHED, &n->state)) msleep(1); while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state)) msleep(1); hrtimer_cancel(&n->timer); clear_bit(NAPI_STATE_DISABLE, &n->state); } Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00042. Signed-off-by: Wen Gong <wgong@codeaurora.org> --- drivers/net/wireless/ath/ath10k/sdio.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)