Message ID | CA+BoTQmwGONqJ0MgccrMzzx9UsGuRCZpdy1RSqdgkcNm_te0xQ@mail.gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
On 03/29/2016 01:14 AM, Michal Kazior wrote:
> On 26 March 2016 at 03:27, Ben Greear <greearb@candelatech.com> wrote:
>> I've been seeing this for a while now. When firmware crashes, often the OS
>> at least partially locks up.
>>
>> This is modified 4.4.6 driver/kernel, modified 10.4.3 firmware. I had 35
>> stations associated, and reset one. Flush fails (maybe because nothing stops
>> tx on other vdevs while flushing one?) and I added a fake firmware crash
>> even in case flush fails.
>>
>> Then, I get deadlock. I've seen other similar deadlocks when the firmware
>> crashed due to 'natural' causes when adding vdevs....
>>
>> Looks like the same process is not actually stuck in one place...each time
>> the kernel splats, it is in a different place..spinning and spinning. Maybe
>> it needs a bail-out on firmware crash?
> [...]
>> [ 316.477677] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u8:3:257]
>> [ 316.477720] Modules linked in: nf_conntrack_netlink nf_conntrack
>> nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan
>> wanlink(O) pktgen rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt
>> iTCO_vendor_support coretemp ath9k ath10k_pci hwmon ath9k_common ath10k_core
>> ath9k_hw intel_rapl iosf_mbi ath x86_pkg_temp_thermal intel_powerclamp
>> mac80211 kvm_intel kvm joydev irqbypass pcspkr serio_raw cfg80211
>> snd_hda_codec_hdmi lpc_ich i2c_i801 snd_hda_codec_realtek
>> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
>> snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp soundcore
>> tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic
>> pata_acpi i915 e1000e ptp pps_core i2c_algo_bit drm_kms_helper drm i2c_core
>> fjes video ipv6 [last unloaded: nf_conntrack]
>>
>> [ 316.477721] irq event stamp: 2111179
>> [ 316.477727] hardirqs last enabled at (2111179): [<ffffffff8113c347>] vprintk_emit+0x3ab/0x46a
>> [ 316.477730] hardirqs last disabled at (2111178): [<ffffffff8113bff8>] vprintk_emit+0x5c/0x46a
>> [ 316.477742] softirqs last enabled at (2111014): [<ffffffffa0e30965>] ath10k_set_key+0x136/0x602 [ath10k_core]
>> [ 316.477749] softirqs last disabled at (2111012): [<ffffffffa0e30946>] ath10k_set_key+0x117/0x602 [ath10k_core]
>> [ 316.477751] CPU: 1 PID: 257 Comm: kworker/u8:3 Tainted: G W O 4.4.6+ #21
>> [ 316.477752] Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
>> [ 316.477780] Workqueue: wiphy3 ieee80211_iface_work [mac80211]
>> [ 316.477781] task: ffff880212d225c0 ti: ffff880212d50000 task.ti: ffff880212d50000
>> [ 316.477790] RIP: 0010:[<ffffffffa0e38c1b>] [<ffffffffa0e38c1b>] ath10k_mac_tx_push_pending+0xc1/0x12d [ath10k_core]
>
> Just in case, do you have these applied?
>
> 750eeed89cf3 ath10k: fix pull-push tx threshold handling
> 9d71d47eed20 ath10k: fix tx hang

I have both of these...I'll try your patch below.

I first have to fix the hash-table bugs in mac80211, as they break so many
things that it is hard to test the rest of the system...

Thanks,
Ben

>
> Hmm.. If it still reproduces can you try the following diff?
>
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>                 list_del_init(&artxq->list);
>                 if (ret != -ENOENT)
>                         list_add_tail(&artxq->list, &ar->txqs);
> +               else if (artxq == last)
> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>
>                 ath10k_htt_tx_txq_update(hw, txq);
>
>
> Michał
>
> _______________________________________________
> ath10k mailing list
> ath10k@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/ath10k
>
> Hmm.. If it still reproduces can you try the following diff?
>
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>                 list_del_init(&artxq->list);
>                 if (ret != -ENOENT)
>                         list_add_tail(&artxq->list, &ar->txqs);
> +               else if (artxq == last)
> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>
>                 ath10k_htt_tx_txq_update(hw, txq);

Ok, I added this code, and can still reproduce the problem.

Firmware is crashing multiple times a minute on this machine in its
current configuration. Right before it hung, firmware crashed and
was restarted, and then I get the hang notification.

I don't see any obvious bail-out in the tx_push_pending logic
if the firmware crashes?

Thanks,
Ben
On 31 March 2016 at 00:28, Ben Greear <greearb@candelatech.com> wrote:
>
>> Hmm.. If it still reproduces can you try the following diff?
>>
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>                 list_del_init(&artxq->list);
>>                 if (ret != -ENOENT)
>>                         list_add_tail(&artxq->list, &ar->txqs);
>> +               else if (artxq == last)
>> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>>
>>                 ath10k_htt_tx_txq_update(hw, txq);
>
>
> Ok, I added this code, and can still reproduce the problem.
>
> Firmware is crashing multiple times a minute on this machine in its
> current configuration. Right before it hung, firmware crashed and
> was restarted, and then I get the hang notification.
>
> I don't see any obvious bail-out in the tx_push_pending logic
> if the firmware crashes?

There's no explicit bail-out, yes. It should bail out if
ath10k_mac_tx_push_txq() fails though (except for -ENOENT, which is
treated slightly differently but should eventually result in a bail-out
as well, since ar->txqs will drain until it's empty).

HTT-tx doesn't check for FW crash but it should ultimately be limited
by either the CE ring size or HTT's num-pending-tx (neither should be
replenished, as the FW crashed and interrupts should not come in anymore).
Whichever the case, a <0 retval should result in a bail-out.


Michał
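The termination argument in this reply can be tried out in user space. The following is only a sketch under stated assumptions, not driver code: `mock_txq`, `mock_dev`, `mock_push_txq()`, `mock_push_pending()` and `run_scenario()` are hypothetical names, the `slots` counter stands in for the CE ring / num-pending-tx limit, `-EBUSY` is a placeholder for "no tx slot available", and the `ath10k_mac_tx_can_push()` check is left out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct ath10k_txq, linked like a kernel list. */
struct mock_txq {
	struct mock_txq *prev, *next;
	int frames;		/* frames still queued for this station/TID */
};

/* Hypothetical device state: sentinel list head plus a tx slot budget. */
struct mock_dev {
	struct mock_txq txqs;	/* sentinel; empty when txqs.next == &txqs */
	int slots;		/* free tx slots; never replenished in this model */
};

static void txq_add_tail(struct mock_dev *dev, struct mock_txq *q)
{
	q->prev = dev->txqs.prev;
	q->next = &dev->txqs;
	dev->txqs.prev->next = q;
	dev->txqs.prev = q;
}

static void txq_del(struct mock_txq *q)
{
	q->prev->next = q->next;
	q->next->prev = q->prev;
	q->prev = q->next = q;
}

/* Mock push step: -EBUSY when no slot is left, -ENOENT for an empty queue. */
static int mock_push_txq(struct mock_dev *dev, struct mock_txq *q)
{
	if (dev->slots == 0)
		return -EBUSY;	/* placeholder error code, an assumption */
	if (q->frames == 0)
		return -ENOENT;
	q->frames--;
	dev->slots--;
	return 0;
}

/* Same control flow as the loop under discussion; returns iterations run. */
static int mock_push_pending(struct mock_dev *dev)
{
	struct mock_txq *q, *last = dev->txqs.prev;
	int ret, max, iterations = 0;

	while (dev->txqs.next != &dev->txqs) {
		q = dev->txqs.next;
		iterations++;
		assert(iterations < 1000);	/* sanity bound for the mock only */

		max = 16;	/* per-queue budget, as in the driver loop */
		ret = 0;
		while (max--) {
			ret = mock_push_txq(dev, q);
			if (ret < 0)
				break;
		}

		txq_del(q);
		if (ret != -ENOENT)
			txq_add_tail(dev, q);

		if (q == last || (ret < 0 && ret != -ENOENT))
			break;
	}
	return iterations;
}

/* Three queues with 5, 0 and 40 frames; slots models what the device accepts. */
int run_scenario(int slots)
{
	struct mock_txq q[3] = { { .frames = 5 }, { .frames = 0 }, { .frames = 40 } };
	struct mock_dev dev = { .slots = slots };
	int i;

	dev.txqs.prev = dev.txqs.next = &dev.txqs;
	for (i = 0; i < 3; i++)
		txq_add_tail(&dev, &q[i]);
	return mock_push_pending(&dev);
}
```

In this model `run_scenario(1000)` visits each of the three queues once and stops at `last`, while `run_scenario(0)` (no slot left, as after a crash with nothing replenished) stops after the first iteration because the push step returns a negative value other than `-ENOENT`. The model assumes an intact list; it does not cover list or queue corruption.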
On 03/30/2016 11:32 PM, Michal Kazior wrote:
> On 31 March 2016 at 00:28, Ben Greear <greearb@candelatech.com> wrote:
>>
>>> Hmm.. If it still reproduces can you try the following diff?
>>>
>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>                 list_del_init(&artxq->list);
>>>                 if (ret != -ENOENT)
>>>                         list_add_tail(&artxq->list, &ar->txqs);
>>> +               else if (artxq == last)
>>> +                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
>>>
>>>                 ath10k_htt_tx_txq_update(hw, txq);
>>
>>
>> Ok, I added this code, and can still reproduce the problem.
>>
>> Firmware is crashing multiple times a minute on this machine in its
>> current configuration. Right before it hung, firmware crashed and
>> was restarted, and then I get the hang notification.
>>
>> I don't see any obvious bail-out in the tx_push_pending logic
>> if the firmware crashes?
>
> There's no explicit bail-out, yes. It should bail out if
> ath10k_mac_tx_push_txq() fails though (except -ENOENT, which is
> treated slightly differently but should result in bail-out eventually
> as well as ar->txqs will drain until it's empty).
>
> HTT-tx doesn't check for FW crash but it should be ultimately limited
> by either CE ring size and HTT's num-pending-tx (both should not be
> replenished as FW crashed and interrupts should not come in anymore).
> Whichever the case a <0 retval should result in a bailout.

I tried adding a check for FW crash yesterday, but that did not help.

Today, I added a limit of 2000 loops. I see that limit hit, and then the
kernel crashes. Maybe my patch is wrong.

I've tried to apply (almost) every patch in linux.ath related to ath10k,
including a few from the mailing list that have not been applied yet.

My push-pending method now looks like this:

void ath10k_mac_tx_push_pending(struct ath10k *ar)
{
	struct ieee80211_hw *hw = ar->hw;
	struct ieee80211_txq *txq;
	struct ath10k_txq *artxq;
	struct ath10k_txq *last;
	int ret;
	int max;
	int loop_max = 2000;

	spin_lock_bh(&ar->txqs_lock);
	rcu_read_lock();

	last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
	while (!list_empty(&ar->txqs)) {
		artxq = list_first_entry(&ar->txqs, struct ath10k_txq, list);
		txq = container_of((void *)artxq, struct ieee80211_txq,
				   drv_priv);

		if (--loop_max == 0) {
			ath10k_err(ar, "Looped 2000 times in tx_push_pending, bailing out.\n");
			break;
		}

		/* Prevent aggressive sta/tid taking over tx queue */
		max = 16;
		ret = 0;
		while (ath10k_mac_tx_can_push(hw, txq) && max--) {
			ret = ath10k_mac_tx_push_txq(hw, txq);
			if (ret < 0)
				break;
		}

		list_del_init(&artxq->list);
		if (ret != -ENOENT)
			list_add_tail(&artxq->list, &ar->txqs);
		else if (artxq == last)
			last = list_last_entry(&ar->txqs, struct ath10k_txq, list);

		ath10k_htt_tx_txq_update(hw, txq);

		if (artxq == last || (ret < 0 && ret != -ENOENT))
			break;
	}

	rcu_read_unlock();
	spin_unlock_bh(&ar->txqs_lock);
}

The crash I get is this:

ath10k_pci 0000:05:00.0: firmware crashed!
(uuid 2a118708-977d-43d6-8d40-079ddec99eb3) ath10k_pci 0000:05:00.0: firmware register dump: ath10k_pci 0000:05:00.0: [00]: 0x00000009 0x000015B3 0x0099E4B6 0x00955B31 ath10k_pci 0000:05:00.0: [04]: 0x0099E4B6 0x00060130 0x00000005 0x00000016 ath10k_pci 0000:05:00.0: [08]: 0x00455030 0x004402B0 0x004060F0 0x00000007 ath10k_pci 0000:05:00.0: [12]: 0x00000009 0x00000000 0x009533D0 0x009533DF ath10k_pci 0000:05:00.0: [16]: 0x00953438 0x0A00286E 0x009406B6 0x00000000 ath10k_pci 0000:05:00.0: [20]: 0x4099E4B6 0x00405FEC 0x000000BE 0x00955A00 ath10k_pci 0000:05:00.0: [24]: 0x8099E680 0x0040604C 0x00000000 0xC099E4B6 ath10k_pci 0000:05:00.0: [28]: 0x80986D5F 0x004060AC 0x00423A14 0x004060F0 ath10k_pci 0000:05:00.0: [32]: 0x80984E51 0x004060CC 0x00423A14 0x004060F0 ath10k_pci 0000:05:00.0: [36]: 0x80985CBF 0x004060EC 0x00424654 0x004402B0 ath10k_pci 0000:05:00.0: [40]: 0x809CAE6A 0x0040615C 0x004402B0 0x00424654 ath10k_pci 0000:05:00.0: [44]: 0x80984EBC 0x0040618C 0x004402B0 0x0040623C ath10k_pci 0000:05:00.0: [48]: 0x809CB3CC 0x0040623C 0x004402B0 0x00411988 ath10k_pci 0000:05:00.0: [52]: 0x80984DE0 0x0040626C 0x00424654 0x004402B0 ath10k_pci 0000:05:00.0: [56]: 0x809CCE08 0x0040635C 0x00424654 0x00423234 ath10k_pci 0000:05:00.0: ath10k_pci ATH10K_DBG_BUFFER: ath10k: [0000]: 0001854A 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 0001854E ath10k: [0008]: 07FC4C02 00000004 0001854F 0060581D 0001854F 17FC4C01 0F00851C 0000000A ath10k: [0016]: 06003007 0000FFAA FFFFFFFF 0001854F 17FC4C01 71108880 00000000 00C400BF ath10k: [0024]: 00000000 00000FF0 0001854F 17FC4C01 71108880 00010000 00C400BF 00000000 ath10k: [0032]: FFFFFFFF 0001854F 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF ath10k: [0040]: 0001854F 17FC4C01 71108880 00030000 00C400BF 000000FF FFFFFFFF 0001854F ath10k: [0048]: 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF 0001854F 17FC4C01 ath10k: [0056]: 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018550 0060581D 00018550 ath10k: [0064]: 
0860581B 0000851C 00000000 00018550 0060581D 00018550 07FC4C02 00000004 ath10k: [0072]: 00018551 0060581D 00018551 17FC4C01 0F00851C 0000000A 06003007 0000FFAA ath10k: [0080]: FFFFFFFF 00018551 17FC4C01 71108880 00000000 00C400BF 00000000 00000FF0 ath10k: [0088]: 00018551 17FC4C01 71108880 00010000 00C400BF 00000000 FFFFFFFF 00018551 ath10k: [0096]: 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF 00018551 17FC4C01 ath10k: [0104]: 71108880 00030000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880 ath10k: [0112]: 00040000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880 00050000 ath10k: [0120]: 00C400BF 000000FF FBFFFFFF 00018551 14605853 51100001 000F0DE4 00000400 ath10k: [0128]: 00000056 00440380 00018551 0060581D 00018551 0460581C 00000001 00018551 ath10k: [0136]: 0060581D 00018551 07FC4C02 00000004 00018552 0060581D 00018552 17FC4C01 ath10k: [0144]: 0F00851C 0000000A 06003007 0000FFAA FFFFFFFF 00018553 17FC4C01 71108880 ath10k: [0152]: 00000000 00C400BF 00000000 00000FF0 00018553 17FC4C01 71108880 00010000 ath10k: [0160]: 00C400BF 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00020000 00C400BF ath10k: [0168]: 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00030000 00C400BF 000000FF ath10k: [0176]: FFFFFFFF 00018553 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF ath10k: [0184]: 00018553 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018553 ath10k: [0192]: 07FC4C02 00000001 00018553 07FC4C02 00000001 00018553 0BFC5826 000005E9 ath10k: [0200]: 00000003 00018554 0BFC5822 0000C01D 00000406 00018578 08383812 000F45C4 ath10k: [0208]: 00424654 00018578 10383809 0000143C 00000001 00000000 00000000 0001857B ath10k: [0216]: 14385853 51100001 000F0D9C 000003FC 00000057 004402B0 0001857B 14385853 ath10k: [0224]: 51100001 000F0D54 000003FE 00000058 004402B0 0001857B 07FC5830 00000008 ath10k: [0232]: 0001857B 14385854 51100002 000F0D54 00000061 00000057 004402B0 0001857B ath10k: [0240]: 14385851 91107001 00424654 004402B0 00000008 00000006 0001857B 
17FC5855 ath10k: [0248]: 91108001 00000000 00000000 00000007 000000BE 0001857B 0FFC5855 91108002 ath10k: [0256]: 004402B0 00000010 0001857B 17FC0001 0099E4B6 000015B3 000015B3 00405EDC ath10k: [0264]: 00000009 ath10k_pci 0000:05:00.0: ATH10K_END sta13: drv-set-bitrate-mask had error return: -108 rdev-set-bitrate-mask failed: -108 wlan3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out. sta22: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting sta0: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting sta1: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting sta2: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out. sta3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting BUG: unable to handle kernel paging request at 0000000000001000 IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211] PGD 0 Oops: 0002 [#1] PREEMPT SMP Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan wanlink(O) pktgen rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt iTCO_vendor_support ath9k ath10k_pci coretemp ath9k_common hwmon intel_rapl ath10k_core iosf_mbi ath9k_hw x86_pkg_temp_thermal intel_powerclamp kvm_intel ath joydev kvm mac80211 irqbypass serio_raw pcspkr cfg80211 i2c_i801 lpc_ich snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic i915 pata_acpi i2c_algo_bit drm_kms_helper e1000e ptp pps_core drm i2c_core video fjes ipv6 [last unloaded: nf_conntrack] CPU: 2 PID: 581 Comm: kworker/u8:4 Tainted: G W O 4.4.6+ #21 
Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012 Workqueue: phy2 ieee80211_iface_work [mac80211] task: ffff8800d9c90000 ti: ffff880213fd0000 task.ti: ffff880213fd0000 RIP: 0010:[<ffffffffa08e9810>] [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211] RSP: 0018:ffff88021eb03c28 EFLAGS: 00010296 RAX: ffff8800cbfd7000 RBX: ffff8800cbfd5060 RCX: ffff8800cbfd1000 RDX: 0000000000001000 RSI: 00000000d9c90805 RDI: ffff8800cbfd5000 RBP: ffff88021eb03c28 R08: 0000000000000001 R09: 0000000000000000 R10: ffff88021eb03ba8 R11: ffff8800cbfd5030 R12: ffff8800cbfd5060 R13: ffff880214a34902 R14: ffff8800cbfd5018 R15: ffff88021350e1b0 FS: 0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000001000 CR3: 0000000001c0a000 CR4: 00000000000406e0 Stack: ffff88021eb03c68 ffffffffa08e985a ffff880214a30a60 ffff880214a35600 ffff8800cbfd5060 ffff880214a349e0 ffff880214a35430 ffff88021350e1b0 ffff88021eb03cb8 ffffffffa0ec2bb4 ffff880214a30a60 0000000014a30a60 Call Trace: <IRQ> [<ffffffffa08e985a>] ieee80211_tx_dequeue+0x41/0xfe [mac80211] [<ffffffffa0ec2bb4>] ath10k_mac_tx_push_txq+0x6a/0x13b [ath10k_core] [<ffffffffa0ec2ddb>] ath10k_mac_tx_push_pending+0x156/0x16b [ath10k_core] [<ffffffffa0ed123d>] ath10k_htt_t2h_msg_handler+0x7d9/0x886 [ath10k_core] [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33 [<ffffffffa0fca532>] ? ath10k_pci_hif_send_complete_check+0x5d/0x5d [ath10k_pci] [<ffffffffa0fca557>] ath10k_pci_htt_rx_deliver+0x25/0x2a [ath10k_pci] [<ffffffffa0fcbb51>] ath10k_pci_process_rx_cb+0x191/0x1c9 [ath10k_pci] [<ffffffff810f23ad>] ? __local_bh_enable_ip+0xa4/0xb9 [<ffffffff816f9f9a>] ? 
_raw_spin_unlock_bh+0x30/0x33 [<ffffffffa0fcbbbf>] ath10k_pci_htt_rx_cb+0x24/0x27 [ath10k_pci] [<ffffffffa0fce1be>] ath10k_ce_per_engine_service+0x64/0xa0 [ath10k_pci] [<ffffffffa0fce260>] ath10k_ce_per_engine_service_any+0x66/0x74 [ath10k_pci] [<ffffffffa0fcc4b3>] ath10k_pci_tasklet+0x3a/0x4e [ath10k_pci] [<ffffffff810f29e0>] tasklet_action+0xc0/0xcf [<ffffffff810f1ff6>] __do_softirq+0x1a4/0x407 [<ffffffff810f2462>] irq_exit+0x40/0x94 [<ffffffff810134a2>] do_IRQ+0xd5/0xed [<ffffffff816fb24c>] common_interrupt+0x8c/0x8c <EOI> [<ffffffff81129d49>] ? arch_local_irq_restore+0x6/0xd [<ffffffff816f8a3a>] __mutex_unlock_slowpath+0x120/0x137 [<ffffffff816f8a5a>] mutex_unlock+0x9/0xb [<ffffffffa0ebcc38>] ath10k_conf_tx+0x3a9/0x3bb [ath10k_core] [<ffffffffa08c2b48>] drv_conf_tx+0x140/0x202 [mac80211] [<ffffffffa08f3072>] ieee80211_set_wmm_default+0x1fb/0x24a [mac80211] [<ffffffffa0908bc5>] ieee80211_set_disassoc+0x248/0x31f [mac80211] [<ffffffffa0908ccf>] ieee80211_sta_connection_lost+0x33/0x69 [mac80211] [<ffffffffa090bb8f>] ieee80211_sta_work+0x5fc/0xda9 [mac80211] [<ffffffff8112d30b>] ? mark_held_locks+0x5e/0x74 [<ffffffff8112d490>] ? trace_hardirqs_on_caller+0x16f/0x18b [<ffffffff816fa024>] ? _raw_spin_unlock_irqrestore+0x48/0x5d [<ffffffffa08d54bd>] ieee80211_iface_work+0x335/0x34e [mac80211] [<ffffffff8110471a>] process_one_work+0x260/0x4db [<ffffffff81104e50>] worker_thread+0x1e9/0x29b [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8 [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8 [<ffffffff81109bfb>] kthread+0xcf/0xd7 [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f [<ffffffff816faaef>] ret_from_fork+0x3f/0x70 [<ffffffff81109b2c>] ? 
kthread_parkme+0x1f/0x1f
Code: 55 48 89 e5 48 39 c7 74 27 48 85 c0 74 24 ff 4f 10 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 48 89 51 08 <48> 89 0a eb 02 31 c0 5d c3 55 48 89 e5 41 57 41 56 4c 8d 76 b8
RIP [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
RSP <ffff88021eb03c28>
CR2: 0000000000001000
---[ end trace eb4cdb33d766b5f3 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..

Thanks,
Ben

>
>
> Michał
>
On 31 March 2016 at 21:16, Ben Greear <greearb@candelatech.com> wrote:
[...]
> I tried adding check for FW crash yesterday, but that did not help.
>
> Today, I added a limit of 2000 loops. I see that hit, and then kernel
> crashes. Maybe my patch is wrong.
>
> I've tried to apply (almost) every patch in linux.ath related to ath10k,
> including a few from the mailing list that have not been applied yet.
>
> My push-pending method now looks like this:
>
> void ath10k_mac_tx_push_pending(struct ath10k *ar)
> {
[...]
> }

Looks sane.

> The crash I get is this:
>
>
> ath10k_pci 0000:05:00.0: firmware crashed! (uuid
> 2a118708-977d-43d6-8d40-079ddec99eb3)
[...]
> BUG: unable to handle kernel paging request at 0000000000001000
> IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]

Hmm.. Do you have 2a58d42c1e01 ("mac80211: fix txq queue related
crashes") applied?

You might want to start dumping each iteration of push_pending() after
the FW crashes to see whether all pointers are fine and what retvals are
returned by each push_txq() call.


Michał
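As an illustration of what such a per-iteration dump might record, here is a small user-space sketch. Everything in it is an assumption made for the example: `struct iter_dump`, its field names, `format_iter_dump()` and `dump_has_error()` are hypothetical, and in the driver the same fields would presumably go through a logging call such as the `ath10k_err()` already used in the function quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-iteration snapshot; field names are illustrative only. */
struct iter_dump {
	int iter;		/* outer-loop iteration number */
	void *artxq;		/* queue handled in this iteration */
	void *last;		/* queue that ends the pass */
	int ret;		/* value returned by the push step */
	int list_empty;		/* non-zero if the list was empty afterwards */
};

/* Format one snapshot into buf; returns the snprintf() result. */
int format_iter_dump(char *buf, size_t len, const struct iter_dump *d)
{
	assert(buf != NULL && d != NULL);
	return snprintf(buf, len,
			"push_pending iter=%d artxq=%p last=%p ret=%d empty=%d",
			d->iter, d->artxq, d->last, d->ret, d->list_empty);
}

/* Returns non-zero if a formatted line records a negative return value. */
int dump_has_error(const char *line)
{
	return strstr(line, "ret=-") != NULL;
}
```

Logging both pointers on every pass makes it possible to see whether `artxq` ever equals `last` and whether either pointer changes to something implausible after the firmware crash; the negative-retval filter is just a convenience for scanning long logs.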
On 03/31/2016 10:26 PM, Michal Kazior wrote:
> On 31 March 2016 at 21:16, Ben Greear <greearb@candelatech.com> wrote:
> [...]
>> I tried adding check for FW crash yesterday, but that did not help.
>>
>> Today, I added a limit of 2000 loops. I see that hit, and then kernel
>> crashes. Maybe my patch is wrong.
>>
>> I've tried to apply (almost) every patch in linux.ath related to ath10k,
>> including a few from the mailing list that have not been applied yet.
>>
>> My push-pending method now looks like this:
>>
>> void ath10k_mac_tx_push_pending(struct ath10k *ar)
>> {
> [...]
>> }
>
> Looks sane.
>
>
>> The crash I get is this:
>>
>>
>> ath10k_pci 0000:05:00.0: firmware crashed! (uuid
>> 2a118708-977d-43d6-8d40-079ddec99eb3)
> [...]
>> BUG: unable to handle kernel paging request at 0000000000001000
>> IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
>
> Hmm.. Do you have 2a58d42c1e01 ("mac80211: fix txq queue related
> crashes") applied?

Yes, though it has a different hash in my tree, probably due to merge issues.

See the patches I posted today to fix stale access to peer objects; they
seem to have fixed these problems for me, or at least made them much
harder to hit.

At quitting time, I was still seeing kasan errors in the mac80211 stats
logic, so there are more bugs waiting for tomorrow.

Thanks,
Ben
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
 		list_del_init(&artxq->list);
 		if (ret != -ENOENT)
 			list_add_tail(&artxq->list, &ar->txqs);
+		else if (artxq == last)
+			last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
 
 		ath10k_htt_tx_txq_update(hw, txq);
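To look at what the added `else if` changes in isolation, the rotation can be modelled with a small FIFO of queue ids. This is a sketch under assumptions, not the driver: `struct ring`, `rotate()`, the `frames` array and the `patched` flag are hypothetical, the only error modelled is `-ENOENT` for a drained queue, and the firmware-crash path is not represented at all.

```c
#include <assert.h>
#include <errno.h>

#define RING_CAP 8

/* Hypothetical FIFO of queue ids standing in for the ar->txqs list. */
struct ring {
	int ids[RING_CAP];
	int head;	/* index of the first entry */
	int len;	/* number of entries */
};

static void ring_push(struct ring *r, int id)
{
	assert(r->len < RING_CAP);
	r->ids[(r->head + r->len) % RING_CAP] = id;
	r->len++;
}

static int ring_pop(struct ring *r)
{
	int id = r->ids[r->head];

	r->head = (r->head + 1) % RING_CAP;
	r->len--;
	return id;
}

static int ring_last(const struct ring *r)
{
	return r->ids[(r->head + r->len - 1) % RING_CAP];
}

/*
 * Model of the rotation loop. frames[i] is the backlog of queue i and nq is
 * the number of queues (1..RING_CAP). Returns the outer-loop iteration count.
 */
int rotate(int *frames, int nq, int patched)
{
	struct ring r = { .head = 0, .len = 0 };
	int id, last, ret, max, iterations = 0;

	assert(nq >= 1 && nq <= RING_CAP);
	for (id = 0; id < nq; id++)
		ring_push(&r, id);
	last = ring_last(&r);

	while (r.len > 0) {
		id = ring_pop(&r);
		iterations++;

		max = 16;	/* per-queue budget, as in the driver loop */
		ret = 0;
		while (max--) {
			if (frames[id] == 0) {
				ret = -ENOENT;	/* queue drained */
				break;
			}
			frames[id]--;
		}

		if (ret != -ENOENT)
			ring_push(&r, id);
		else if (patched && id == last)
			/* list_last_entry() expects a non-empty list; -1 here */
			last = r.len > 0 ? ring_last(&r) : -1;

		if (id == last)
			break;
	}
	return iterations;
}
```

With backlogs of 5, 40 and 0 frames, the unpatched model stops after three iterations, as soon as the drained last queue has been removed. The patched model re-targets `last` to the new tail and runs a fourth iteration, serving the remaining queue once more. That is only what this model shows; it does not say whether either behaviour is related to the lockup reported in the thread.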