Message ID | 20200610204017.4531-1-greearb@candelatech.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Johannes Berg |
Headers | show |
Series | mac80211: Fix kernel hang on ax200 firmware crash. | expand |
On 06/10/2020 01:40 PM, greearb@candelatech.com wrote: > From: Ben Greear <greearb@candelatech.com> > > I backported out-of-tree ax200 driver from backport-iwlwifi to my > 5.4 kernel so that I could run ax200 beside other radios (backports > mac80211 otherwise is incompatible and other drivers will crash). > > Always possible that upstream kernel doesn't suffer from exactly this > case, but upstream ax200 is too unstable to even get this far, so... > > The ax200 firmware crash often causes the kernel to deadlock due to the > while (sta->sta_state == IEEE80211_STA_AUTHORIZED) > loop in __sta_info_Destroy_part. If sta_info_move_state does not > make progress, then it will loop forever. In my case, sta_info_move_state > fails due to the sdata-in-driver check. Hello Johannes, Have any comment on this? Thanks, Ben > > Hung process looks like this: > > CPU: 7 PID: 23301 Comm: kworker/7:0 Tainted: G W 5.4.43+ #5 > Hardware name: Default string Default string/SKYBAY, BIOS 5.12 02/19/2019 > Workqueue: events_freezable ieee80211_restart_work [mac80211] > RIP: 0010:memcpy_erms+0x6/0x10 > Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 ce > RSP: 0018:ffffc90006117728 EFLAGS: 00010002 > RAX: ffffffff837ca040 RBX: 0000000000000000 RCX: 0000000000000006 > RDX: 0000000000000046 RSI: ffffffff8380aa84 RDI: ffffffff837ca080 > RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000001697 > R10: 0000000000000007 R11: 0000000000000000 R12: ffffffff837ca040 > R13: 0000000000000046 R14: 0000000000000000 R15: ffffffff8380aa44 > FS: 0000000000000000(0000) GS:ffff88826ddc0000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000562e61e28f18 CR3: 00000002554f6006 CR4: 00000000003606e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > msg_print_text+0x12a/0x1e0 > console_unlock+0x160/0x600 > vprintk_emit+0x146/0x2c0 > printk+0x4d/0x69 > ? lockdep_hardirqs_on+0xf1/0x190 > __sdata_err+0x61/0x150 [mac80211] > drv_sta_state+0x433/0x8f0 [mac80211] > sta_info_move_state+0x28e/0x370 [mac80211] > __sta_info_destroy_part2+0x48/0x1d0 [mac80211] > __sta_info_flush+0xf6/0x180 [mac80211] > ieee80211_set_disassoc+0xc1/0x490 [mac80211] > ieee80211_mgd_deauth+0x291/0x420 [mac80211] > cfg80211_mlme_deauth+0xd2/0x330 [cfg80211] > cfg80211_mlme_down+0x7c/0xc0 [cfg80211] > cfg80211_disconnect+0x2b1/0x320 [cfg80211] > cfg80211_leave+0x23/0x30 [cfg80211] > cfg80211_netdev_notifier_call+0x3a5/0x680 [cfg80211] > ? lockdep_rtnl_is_held+0x11/0x20 > ? addrconf_notify+0xb4/0xbb0 [ipv6] > ? packet_notifier+0xb8/0x2c0 > notifier_call_chain+0x40/0x60 > __dev_close_many+0x68/0x120 > dev_close_many+0x83/0x130 > dev_close.part.96+0x3f/0x70 > cfg80211_shutdown_all_interfaces+0x3e/0xc0 [cfg80211] > ieee80211_reconfig+0x96/0x2180 [mac80211] > ? cond_synchronize_rcu+0x20/0x20 > ieee80211_restart_work+0xb6/0xe0 [mac80211] > process_one_work+0x27c/0x640 > worker_thread+0x47/0x3f0 > ? process_one_work+0x640/0x640 > kthread+0xfc/0x130 > ? kthread_create_worker_on_cpu+0x70/0x70 > ret_from_fork+0x24/0x30 > > With this patch, there is safety code to bail out after 1000 tries of > moving the sta state, and also I check for EIO which is returned by > the sdata-in-driver failure case and treat that as success as far as > changing sta state goes. > > Console logs look like this in the failure case, and aside from the ax200 > radio that went phantom, the rest of the system is usable: > > iwlwifi 0000:12:00.0: 0x0000025B | CNVR_SCU_SD_REGS_SD_REG_ACTIVE_VDIG_MIRROR > iwlwifi 0000:12:00.0: Firmware error during reconfiguration - reprobe! > iwlwifi 0000:12:00.0: Failed to start RT ucode: -5 > wlan2: Failed check-sdata-in-driver check, flags: 0x0 count: 1 > wlan2: Failed check-sdata-in-driver check, flags: 0x0 count: 1 > wlan2: Failed check-sdata-in-driver check, flags: 0x0 count: 1 > iwlwifi 0000:12:00.0: Failed to trigger RX queues sync (-5) > wlan2: Failed check-sdata-in-driver check, flags: 0x0 count: 1 > wlan2: drv_sta_state failed with EIO (sdata not in driver?), state: 4 new-state: 3 > wlan2: drv_sta_state failed with EIO (sdata not in driver?), state: 3 new-state: 2 > wlan2: drv_sta_state failed with EIO (sdata not in driver?), state: 2 new-state: 1 > wlan2: Failed check-sdata-in-driver check, flags: 0x0 count: 1 > iwlwifi 0000:12:00.0: iwl_trans_wait_txq_empty bad state = 0 > iwlwifi 0000:12:00.0: dma_pool_destroy iwlwifi:bc, 00000000d859bd4c busy > > Signed-off-by: Ben Greear <greearb@candelatech.com> > --- > net/mac80211/sta_info.c | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c > index e2a04fc..31a3856 100644 > --- a/net/mac80211/sta_info.c > +++ b/net/mac80211/sta_info.c > @@ -1092,6 +1092,7 @@ static void __sta_info_destroy_part2(struct sta_info *sta) > struct ieee80211_sub_if_data *sdata = sta->sdata; > struct station_info *sinfo; > int ret; > + int count = 0; > > /* > * NOTE: This assumes at least synchronize_net() was done > @@ -1104,6 +1105,13 @@ static void __sta_info_destroy_part2(struct sta_info *sta) > while (sta->sta_state == IEEE80211_STA_AUTHORIZED) { > ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); > WARN_ON_ONCE(ret); > + if (++count > 1000) { > + /* WTF, bail out so that at least we don't hang the system. */ > + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", > + ret, sta->sta_state); > + WARN_ON_ONCE(1); > + break; > + } > } > > /* now keys can no longer be reached */ > @@ -2017,8 +2025,19 @@ int sta_info_move_state(struct sta_info *sta, > if (test_sta_flag(sta, WLAN_STA_INSERTED)) { > int err = drv_sta_state(sta->local, sta->sdata, sta, > sta->sta_state, new_state); > - if (err) > - return err; > + if (err == -EIO) { > + /* Sdata-not-in-driver, we are out of sync, but probably > + * best to carry on instead of bailing here, at least maybe > + * we can clean this up. > + */ > + sdata_err(sta->sdata, "drv_sta_state failed with EIO (sdata not in driver?), state: %d new-state: %d\n", > + sta->sta_state, new_state); > + WARN_ON_ONCE(1); > + } > + else { > + if (err) > + return err; > + } > } > > /* reflect the change in all state variables */ >
On Wed, 2020-06-10 at 13:40 -0700, greearb@candelatech.com wrote: > From: Ben Greear <greearb@candelatech.com> > > I backported out-of-tree ax200 driver from backport-iwlwifi to my > 5.4 kernel so that I could run ax200 beside other radios (backports > mac80211 otherwise is incompatible and other drivers will crash). > > Always possible that upstream kernel doesn't suffer from exactly this > case, but upstream ax200 is too unstable to even get this far, so... > > The ax200 firmware crash often causes the kernel to deadlock due to the > while (sta->sta_state == IEEE80211_STA_AUTHORIZED) > loop in __sta_info_Destroy_part. If sta_info_move_state does not > make progress, then it will loop forever. In my case, sta_info_move_state > fails due to the sdata-in-driver check. Interesting. I don't think I've seen this in our testing before. > iwlwifi 0000:12:00.0: dma_pool_destroy iwlwifi:bc, 00000000d859bd4c busy Ugh, yeah, as an aside - we still leak stuff there... need to dig into that. > Signed-off-by: Ben Greear <greearb@candelatech.com> > --- > net/mac80211/sta_info.c | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c > index e2a04fc..31a3856 100644 > --- a/net/mac80211/sta_info.c > +++ b/net/mac80211/sta_info.c > @@ -1092,6 +1092,7 @@ static void __sta_info_destroy_part2(struct sta_info *sta) > struct ieee80211_sub_if_data *sdata = sta->sdata; > struct station_info *sinfo; > int ret; > + int count = 0; > > /* > * NOTE: This assumes at least synchronize_net() was done > @@ -1104,6 +1105,13 @@ static void __sta_info_destroy_part2(struct sta_info *sta) > while (sta->sta_state == IEEE80211_STA_AUTHORIZED) { > ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); > WARN_ON_ONCE(ret); > + if (++count > 1000) { > + /* WTF, bail out so that at least we don't hang the system. */ > + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", > + ret, sta->sta_state); > + WARN_ON_ONCE(1); > + break; > + } I guess that should be if (WARN_ON_ONCE()) ... etc. > int err = drv_sta_state(sta->local, sta->sdata, sta, > sta->sta_state, new_state); > - if (err) > - return err; > + if (err == -EIO) { > + /* Sdata-not-in-driver, we are out of sync, but probably > + * best to carry on instead of bailing here, at least maybe > + * we can clean this up. > + */ It _could_ be the driver itself returning -EIO, so why not check the sdata-in-driver flag? Anyway, that mostly looks good and would make mac80211 more robust, but like I just said in the other patch I think you need to consider mac80211 changes more from mac80211's POV, not from an arbitrary driver's POV. Really here that mostly applies to the commit log, which should probably say something like mac80211: deadlock due to driver misbehaviour or so, and then go on to explain what it does in *mac80211*, and show the iwlwifi parts only as an *example*. Thanks, johannes
On 7/30/20 5:33 AM, Johannes Berg wrote: > On Wed, 2020-06-10 at 13:40 -0700, greearb@candelatech.com wrote: >> From: Ben Greear <greearb@candelatech.com> >> >> I backported out-of-tree ax200 driver from backport-iwlwifi to my >> 5.4 kernel so that I could run ax200 beside other radios (backports >> mac80211 otherwise is incompatible and other drivers will crash). >> >> Always possible that upstream kernel doesn't suffer from exactly this >> case, but upstream ax200 is too unstable to even get this far, so... >> >> The ax200 firmware crash often causes the kernel to deadlock due to the >> while (sta->sta_state == IEEE80211_STA_AUTHORIZED) >> loop in __sta_info_Destroy_part. If sta_info_move_state does not >> make progress, then it will loop forever. In my case, sta_info_move_state >> fails due to the sdata-in-driver check. > > Interesting. I don't think I've seen this in our testing before. So, put a few ax200 NICs (two is plenty) in a system and run any significant upload. It crashes FW in < 1 minute on stock 5.7, like completely and utterly unstable. The out-of-tree backports ax200 (which is what I was actually using here) is a bit better, closer to 5.2 kernel ax200 stability, but still crashes reliably on upload traffic with 2+ radios in a chassis. It will crash on download too, but less often. If you actually test this and have any trouble reproducing problems, please let me know. The firmware hard crash that causes ax200 phy to go away and cause the sdata-in-driver / EIO busy spin is a bit harder to reproduce, but certainly it happens often enough. > >> iwlwifi 0000:12:00.0: dma_pool_destroy iwlwifi:bc, 00000000d859bd4c busy > > Ugh, yeah, as an aside - we still leak stuff there... need to dig into > that. > >> Signed-off-by: Ben Greear <greearb@candelatech.com> >> --- >> net/mac80211/sta_info.c | 23 +++++++++++++++++++++-- >> 1 file changed, 21 insertions(+), 2 deletions(-) >> >> diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c >> index e2a04fc..31a3856 100644 >> --- a/net/mac80211/sta_info.c >> +++ b/net/mac80211/sta_info.c >> @@ -1092,6 +1092,7 @@ static void __sta_info_destroy_part2(struct sta_info *sta) >> struct ieee80211_sub_if_data *sdata = sta->sdata; >> struct station_info *sinfo; >> int ret; >> + int count = 0; >> >> /* >> * NOTE: This assumes at least synchronize_net() was done >> @@ -1104,6 +1105,13 @@ static void __sta_info_destroy_part2(struct sta_info *sta) >> while (sta->sta_state == IEEE80211_STA_AUTHORIZED) { >> ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); >> WARN_ON_ONCE(ret); >> + if (++count > 1000) { >> + /* WTF, bail out so that at least we don't hang the system. */ >> + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", >> + ret, sta->sta_state); >> + WARN_ON_ONCE(1); >> + break; >> + } > > I guess that should be > > if (WARN_ON_ONCE()) ... If we spin 1000 times, it is worth a second warning. Or do you mean the WARN_ON_ONCE(ret) should have if in front of it? > > > etc. > >> int err = drv_sta_state(sta->local, sta->sdata, sta, >> sta->sta_state, new_state); >> - if (err) >> - return err; >> + if (err == -EIO) { >> + /* Sdata-not-in-driver, we are out of sync, but probably >> + * best to carry on instead of bailing here, at least maybe >> + * we can clean this up. >> + */ > > It _could_ be the driver itself returning -EIO, so why not check the > sdata-in-driver flag? Right, but if driver is complaining here, we need to bail out regardless of sdata-in-driver or not, unless you think a driver could return EIO and then a small bit later start working for the same request? > > > Anyway, that mostly looks good and would make mac80211 more robust, but > like I just said in the other patch I think you need to consider > mac80211 changes more from mac80211's POV, not from an arbitrary > driver's POV. > > Really here that mostly applies to the commit log, which should probably > say something like > > mac80211: deadlock due to driver misbehaviour > > or so, and then go on to explain what it does in *mac80211*, and show > the iwlwifi parts only as an *example*. Its not really driver mis-behaviour per se. The root cause is that the firmware crashes too badly for the driver to recover (ok, so driver might could be better, but I've also seen cases where ath10k NIC falls off the PCI bus, so nothing the driver can do in that case I think). Per my other patches, I've seen this sdata-in-driver crap in the past, so I think I probably hit a similar bug in both ax200 and ath10k, but since ax200 is so easy to crash, it is much more likely to hit this bug than any other driver I'm aware of. I'll try to re-word the commit message though, I don't really care what it says so long as the code gets in. Thanks, Ben > > Thanks, > johannes >
On Thu, 2020-07-30 at 05:52 -0700, Ben Greear wrote: > > > > + if (++count > 1000) { > > > + /* WTF, bail out so that at least we don't hang the system. */ > > > + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", > > > + ret, sta->sta_state); > > > + WARN_ON_ONCE(1); > > > + break; > > > + } > > > > I guess that should be > > > > if (WARN_ON_ONCE()) ... > > If we spin 1000 times, it is worth a second warning. Or do you mean > the WARN_ON_ONCE(ret) should have if in front of it? Ah. I missed the WARN_ON_ONCE(ret) entirely. I just meant that the warning could/should be around the condition. In fact though, even the message probably should: if (WARN_ONCE(++count > 1000, "...", ...)) break; That way the message would be captured inside the warning, which is better for tooling that parses warnings. > > > > etc. > > > > > int err = drv_sta_state(sta->local, sta->sdata, sta, > > > sta->sta_state, new_state); > > > - if (err) > > > - return err; > > > + if (err == -EIO) { > > > + /* Sdata-not-in-driver, we are out of sync, but probably > > > + * best to carry on instead of bailing here, at least maybe > > > + * we can clean this up. > > > + */ > > > > It _could_ be the driver itself returning -EIO, so why not check the > > sdata-in-driver flag? > > Right, but if driver is complaining here, we need to bail out regardless of > sdata-in-driver or not, Yes. But I'm not sure we should WARN on that? > unless you think a driver could return EIO and then > a small bit later start working for the same request? Hah, no. If that's a possibility due to some stupid firmware reasons, let the driver deal with it. > > Really here that mostly applies to the commit log, which should probably > > say something like > > > > mac80211: deadlock due to driver misbehaviour > > > > or so, and then go on to explain what it does in *mac80211*, and show > > the iwlwifi parts only as an *example*. > > Its not really driver mis-behaviour per se. The root cause is that the > firmware crashes too badly for the driver to recover (ok, so driver might > could be better, but I've also seen cases where ath10k NIC falls off the PCI > bus, so nothing the driver can do in that case I think). Fair enough. We actually do have some code in there that tries to unbind/rebind the driver from the device eventually, but that's obviously a very last resort. FWIW, we do have multiple NICs in a single machine, but then we run them from VMs so each VM only has a single NIC. But I don't see why that should be different from the device/firmware point of view. Perhaps your PCIe configuration is different. > Per my other patches, I've seen this sdata-in-driver crap in the past, so > I think I probably hit a similar bug in both ax200 and ath10k, but since > ax200 is so easy to crash, it is much more likely to hit this bug than any > other driver I'm aware of. > > I'll try to re-word the commit message though, I don't really care what it > says so long as the code gets in. :) Thanks, johannes
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index e2a04fc..31a3856 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -1092,6 +1092,7 @@ static void __sta_info_destroy_part2(struct sta_info *sta) struct ieee80211_sub_if_data *sdata = sta->sdata; struct station_info *sinfo; int ret; + int count = 0; /* * NOTE: This assumes at least synchronize_net() was done @@ -1104,6 +1105,13 @@ static void __sta_info_destroy_part2(struct sta_info *sta) while (sta->sta_state == IEEE80211_STA_AUTHORIZED) { ret = sta_info_move_state(sta, IEEE80211_STA_ASSOC); WARN_ON_ONCE(ret); + if (++count > 1000) { + /* WTF, bail out so that at least we don't hang the system. */ + sdata_err(sdata, "Could not move state after 1000 tries, ret: %d state: %d\n", + ret, sta->sta_state); + WARN_ON_ONCE(1); + break; + } } /* now keys can no longer be reached */ @@ -2017,8 +2025,19 @@ int sta_info_move_state(struct sta_info *sta, if (test_sta_flag(sta, WLAN_STA_INSERTED)) { int err = drv_sta_state(sta->local, sta->sdata, sta, sta->sta_state, new_state); - if (err) - return err; + if (err == -EIO) { + /* Sdata-not-in-driver, we are out of sync, but probably + * best to carry on instead of bailing here, at least maybe + * we can clean this up. + */ + sdata_err(sta->sdata, "drv_sta_state failed with EIO (sdata not in driver?), state: %d new-state: %d\n", + sta->sta_state, new_state); + WARN_ON_ONCE(1); + } + else { + if (err) + return err; + } } /* reflect the change in all state variables */