
Deadlock on (faked) firmware crash, CUS239, modified 10.4.3 firmware.

Message ID CA+BoTQmwGONqJ0MgccrMzzx9UsGuRCZpdy1RSqdgkcNm_te0xQ@mail.gmail.com (mailing list archive)
State Not Applicable

Commit Message

Michal Kazior March 29, 2016, 8:14 a.m. UTC
On 26 March 2016 at 03:27, Ben Greear <greearb@candelatech.com> wrote:
> I've been seeing this for a while now.  When firmware crashes, often the OS
> at least partially locks up.
>
> This is modified 4.4.6 driver/kernel, modified 10.4.3 firmware.  I had 35
> stations associated, and reset one.  Flush fails (maybe because nothing
> stops tx on other vdevs while flushing one?) and I added a fake firmware
> crash event in case flush fails.
>
> Then, I get deadlock.  I've seen other similar deadlocks when the firmware
> crashed due to 'natural' causes when adding vdevs....
>
> Looks like the same process is not actually stuck in one place...each time
> the kernel splats, it is in a different place..spinning and spinning.
> Maybe it needs a bail-out on firmware crash?
[...]
> [  316.477677] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
> [kworker/u8:3:257]
> [  316.477720] Modules linked in: nf_conntrack_netlink nf_conntrack
> nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan
> wanlink(O) pktgen rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt
> iTCO_vendor_support coretemp ath9k ath10k_pci hwmon ath9k_common ath10k_core
> ath9k_hw intel_rapl iosf_mbi ath x86_pkg_temp_thermal intel_powerclamp
> mac80211 kvm_intel kvm joydev irqbypass pcspkr serio_raw cfg80211
> snd_hda_codec_hdmi lpc_ich i2c_i801 snd_hda_codec_realtek
> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
> snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp soundcore
> tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic
> pata_acpi i915 e1000e ptp pps_core i2c_algo_bit drm_kms_helper drm i2c_core
> fjes video ipv6 [last unloaded: nf_conntrack]
>
> [  316.477721] irq event stamp: 2111179
> [  316.477727] hardirqs last  enabled at (2111179): [<ffffffff8113c347>]
> vprintk_emit+0x3ab/0x46a
> [  316.477730] hardirqs last disabled at (2111178): [<ffffffff8113bff8>]
> vprintk_emit+0x5c/0x46a
> [  316.477742] softirqs last  enabled at (2111014): [<ffffffffa0e30965>]
> ath10k_set_key+0x136/0x602 [ath10k_core]
> [  316.477749] softirqs last disabled at (2111012): [<ffffffffa0e30946>]
> ath10k_set_key+0x117/0x602 [ath10k_core]
> [  316.477751] CPU: 1 PID: 257 Comm: kworker/u8:3 Tainted: G        W  O
> 4.4.6+ #21
> [  316.477752] Hardware name: To be filled by O.E.M. To be filled by
> O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
> [  316.477780] Workqueue: wiphy3 ieee80211_iface_work [mac80211]
> [  316.477781] task: ffff880212d225c0 ti: ffff880212d50000 task.ti:
> ffff880212d50000
> [  316.477790] RIP: 0010:[<ffffffffa0e38c1b>]  [<ffffffffa0e38c1b>]
> ath10k_mac_tx_push_pending+0xc1/0x12d [ath10k_core]

Just in case, do you have these applied?

 750eeed89cf3 ath10k: fix pull-push tx threshold handling
 9d71d47eed20 ath10k: fix tx hang

Hmm.. If it still reproduces can you try the following diff?



Michał

Comments

Ben Greear March 29, 2016, 3:46 p.m. UTC | #1
On 03/29/2016 01:14 AM, Michal Kazior wrote:
> On 26 March 2016 at 03:27, Ben Greear <greearb@candelatech.com> wrote:
>> I've been seeing this for a while now.  When firmware crashes, often the OS
>> at least partially locks up.
>>
>> This is modified 4.4.6 driver/kernel, modified 10.4.3 firmware.  I had 35
>> stations associated, and reset one.  Flush fails (maybe because nothing
>> stops tx on other vdevs while flushing one?) and I added a fake firmware
>> crash event in case flush fails.
>>
>> Then, I get deadlock.  I've seen other similar deadlocks when the firmware
>> crashed due to 'natural' causes when adding vdevs....
>>
>> Looks like the same process is not actually stuck in one place...each time
>> the kernel splats, it is in a different place..spinning and spinning.
>> Maybe it needs a bail-out on firmware crash?
> [...]
>> [  316.477677] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
>> [kworker/u8:3:257]
>> [  316.477720] Modules linked in: nf_conntrack_netlink nf_conntrack
>> nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan
>> wanlink(O) pktgen rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt
>> iTCO_vendor_support coretemp ath9k ath10k_pci hwmon ath9k_common ath10k_core
>> ath9k_hw intel_rapl iosf_mbi ath x86_pkg_temp_thermal intel_powerclamp
>> mac80211 kvm_intel kvm joydev irqbypass pcspkr serio_raw cfg80211
>> snd_hda_codec_hdmi lpc_ich i2c_i801 snd_hda_codec_realtek
>> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep
>> snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp soundcore
>> tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic
>> pata_acpi i915 e1000e ptp pps_core i2c_algo_bit drm_kms_helper drm i2c_core
>> fjes video ipv6 [last unloaded: nf_conntrack]
>>
>> [  316.477721] irq event stamp: 2111179
>> [  316.477727] hardirqs last  enabled at (2111179): [<ffffffff8113c347>]
>> vprintk_emit+0x3ab/0x46a
>> [  316.477730] hardirqs last disabled at (2111178): [<ffffffff8113bff8>]
>> vprintk_emit+0x5c/0x46a
>> [  316.477742] softirqs last  enabled at (2111014): [<ffffffffa0e30965>]
>> ath10k_set_key+0x136/0x602 [ath10k_core]
>> [  316.477749] softirqs last disabled at (2111012): [<ffffffffa0e30946>]
>> ath10k_set_key+0x117/0x602 [ath10k_core]
>> [  316.477751] CPU: 1 PID: 257 Comm: kworker/u8:3 Tainted: G        W  O
>> 4.4.6+ #21
>> [  316.477752] Hardware name: To be filled by O.E.M. To be filled by
>> O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
>> [  316.477780] Workqueue: wiphy3 ieee80211_iface_work [mac80211]
>> [  316.477781] task: ffff880212d225c0 ti: ffff880212d50000 task.ti:
>> ffff880212d50000
>> [  316.477790] RIP: 0010:[<ffffffffa0e38c1b>]  [<ffffffffa0e38c1b>]
>> ath10k_mac_tx_push_pending+0xc1/0x12d [ath10k_core]
>
> Just in case, do you have these applied?
>
>   750eeed89cf3 ath10k: fix pull-push tx threshold handling
>   9d71d47eed20 ath10k: fix tx hang

I have both of these...I'll try your patch below.

I first have to fix the hash-table bugs in mac80211, as they break
so many things that it is hard to test the rest of the system...

Thanks,
Ben

>
> Hmm.. If it still reproduces can you try the following diff?
>
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>                  list_del_init(&artxq->list);
>                  if (ret != -ENOENT)
>                          list_add_tail(&artxq->list, &ar->txqs);
> +               else if (artxq == last)
> +                       last = list_last_entry(&ar->txqs, struct
> ath10k_txq, list);
>
>                  ath10k_htt_tx_txq_update(hw, txq);
>
>
> Michał
>
Ben Greear March 30, 2016, 10:28 p.m. UTC | #2
> Hmm.. If it still reproduces can you try the following diff?
>
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>                  list_del_init(&artxq->list);
>                  if (ret != -ENOENT)
>                          list_add_tail(&artxq->list, &ar->txqs);
> +               else if (artxq == last)
> +                       last = list_last_entry(&ar->txqs, struct
> ath10k_txq, list);
>
>                  ath10k_htt_tx_txq_update(hw, txq);

Ok, I added this code, and can still reproduce the problem.

Firmware is crashing multiple times a minute on this machine in its
current configuration.  Right before it hung, firmware crashed and
was restarted, and then I got the hang notification.

I don't see any obvious bail-out in the tx_push_pending logic
if the firmware crashes?

Thanks,
Ben
Michal Kazior March 31, 2016, 6:32 a.m. UTC | #3
On 31 March 2016 at 00:28, Ben Greear <greearb@candelatech.com> wrote:
>
>> Hmm.. If it still reproduces can you try the following diff?
>>
>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>                  list_del_init(&artxq->list);
>>                  if (ret != -ENOENT)
>>                          list_add_tail(&artxq->list, &ar->txqs);
>> +               else if (artxq == last)
>> +                       last = list_last_entry(&ar->txqs, struct
>> ath10k_txq, list);
>>
>>                  ath10k_htt_tx_txq_update(hw, txq);
>
>
> Ok, I added this code, and can still reproduce the problem.
>
> Firmware is crashing multiple times a minute on this machine in its
> current configuration.  Right before it hung, firmware crashed and
> was restarted, and then I got the hang notification.
>
> I don't see any obvious bail-out in the tx_push_pending logic
> if the firmware crashes?

There's no explicit bail-out, yes. It should bail out if
ath10k_mac_tx_push_txq() fails though (except -ENOENT, which is
treated slightly differently but should result in bail-out eventually
as well as ar->txqs will drain until it's empty).
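
As a rough userspace sketch of the termination behaviour described above
(not the driver code itself): the loop below mirrors the structure of the
ath10k_mac_tx_push_pending() variant posted in this thread, with a
hand-rolled circular list and a fake push whose return value is set per
queue. The names struct txq, fake_push(), push_pending() and run_uniform()
are hypothetical, and the can-push check is assumed to always be true.

```c
#include <assert.h>
#include <errno.h>

#define MAX_QUEUES 8

/* Hypothetical stand-in for struct ath10k_txq on a circular list. */
struct txq {
	struct txq *prev, *next;
	int push_ret;	/* what the fake push returns for this queue */
};

static void ql_init(struct txq *head) { head->prev = head->next = head; }
static int ql_empty(const struct txq *head) { return head->next == head; }

static void ql_del(struct txq *q)
{
	q->prev->next = q->next;
	q->next->prev = q->prev;
	q->prev = q->next = q;
}

static void ql_add_tail(struct txq *q, struct txq *head)
{
	q->prev = head->prev;
	q->next = head;
	head->prev->next = q;
	head->prev = q;
}

/* Fake push (assumption): always reports the queue's configured result. */
static int fake_push(const struct txq *q) { return q->push_ret; }

/* Same control flow as the posted loop; returns outer iterations run. */
static int push_pending(struct txq *head)
{
	struct txq *q, *last = head->prev;
	int iterations = 0;

	while (!ql_empty(head)) {
		int max = 16, ret = 0;

		q = head->next;
		iterations++;

		while (max--) {		/* can-push assumed always true */
			ret = fake_push(q);
			if (ret < 0)
				break;
		}

		ql_del(q);
		if (ret != -ENOENT)
			ql_add_tail(q, head);
		else if (q == last)
			last = head->prev;	/* re-pick last after removal */

		if (q == last || (ret < 0 && ret != -ENOENT))
			break;
	}
	return iterations;
}

/* Run n queues that all return `ret`; report how many stay on the list. */
static int run_uniform(int n, int ret, int *left)
{
	struct txq head, q[MAX_QUEUES];
	struct txq *it;
	int i, iterations;

	assert(n > 0 && n <= MAX_QUEUES);
	ql_init(&head);
	for (i = 0; i < n; i++) {
		q[i].push_ret = ret;
		ql_add_tail(&q[i], &head);
	}
	iterations = push_pending(&head);

	*left = 0;
	for (it = head.next; it != &head; it = it->next)
		(*left)++;
	return iterations;
}
```

In this model, three queues that all fail with -ENOSPC stop the loop after
one iteration, three queues that all return -ENOENT drain the list in three
iterations, and three queues that always succeed end after one full round
when the loop reaches `last`.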

HTT-tx doesn't check for FW crash but it should be ultimately limited
by either CE ring size or HTT's num-pending-tx (neither should be
replenished as FW crashed and interrupts should not come in anymore).
Either way, a <0 retval should result in a bailout.
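
To illustrate the argument, here is a minimal userspace model (an
assumption-laden sketch, not the HTT or CE code): a fixed budget of
in-flight frames that is only replenished by a completion handler. If
completions stop arriving, as they would after a FW crash, pushes must
start failing once the budget is used up. struct tx_budget, tx_push(),
tx_complete() and push_until_error() are hypothetical names, and the use
of -EBUSY as the error code is likewise an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a tx path with a fixed budget of in-flight frames. */
struct tx_budget {
	int num_pending;	/* frames handed to firmware, not yet completed */
	int max_pending;	/* hard limit (ring size or pending-tx cap) */
};

/* Queue one frame; fails once the budget is exhausted. */
static int tx_push(struct tx_budget *b)
{
	assert(b->max_pending > 0);
	if (b->num_pending >= b->max_pending)
		return -EBUSY;
	b->num_pending++;
	return 0;
}

/* Completion handler: only runs when firmware signals tx-done. */
static void tx_complete(struct tx_budget *b)
{
	if (b->num_pending > 0)
		b->num_pending--;
}

/* Push until the first failure; returns the number of frames accepted. */
static int push_until_error(struct tx_budget *b, int *err)
{
	int accepted = 0;

	while ((*err = tx_push(b)) == 0)
		accepted++;
	return accepted;
}
```

With a budget of 4 and no completions, push_until_error() accepts four
frames and then reports -EBUSY; every later call accepts nothing until
tx_complete() frees a slot.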


Michał
Ben Greear March 31, 2016, 7:16 p.m. UTC | #4
On 03/30/2016 11:32 PM, Michal Kazior wrote:
> On 31 March 2016 at 00:28, Ben Greear <greearb@candelatech.com> wrote:
>>
>>> Hmm.. If it still reproduces can you try the following diff?
>>>
>>> --- a/drivers/net/wireless/ath/ath10k/mac.c
>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c
>>> @@ -3780,6 +3780,8 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar)
>>>                   list_del_init(&artxq->list);
>>>                   if (ret != -ENOENT)
>>>                           list_add_tail(&artxq->list, &ar->txqs);
>>> +               else if (artxq == last)
>>> +                       last = list_last_entry(&ar->txqs, struct
>>> ath10k_txq, list);
>>>
>>>                   ath10k_htt_tx_txq_update(hw, txq);
>>
>>
>> Ok, I added this code, and can still reproduce the problem.
>>
>> Firmware is crashing multiple times a minute on this machine in its
>> current configuration.  Right before it hung, firmware crashed and
>> was restarted, and then I got the hang notification.
>>
>> I don't see any obvious bail-out in the tx_push_pending logic
>> if the firmware crashes?
>
> There's no explicit bail-out, yes. It should bail out if
> ath10k_mac_tx_push_txq() fails though (except -ENOENT, which is
> treated slightly differently but should result in bail-out eventually
> as well as ar->txqs will drain until it's empty).
>
> HTT-tx doesn't check for FW crash but it should be ultimately limited
> by either CE ring size or HTT's num-pending-tx (neither should be
> replenished as FW crashed and interrupts should not come in anymore).
> Either way, a <0 retval should result in a bailout.

I tried adding a check for FW crash yesterday, but that did not help.

Today, I added a limit of 2000 loops.  I see that hit, and then kernel
crashes.  Maybe my patch is wrong.

I've tried to apply (almost) every patch in linux.ath related to ath10k,
including a few from the mailing list that have not been applied yet.

My push-pending method now looks like this:

void ath10k_mac_tx_push_pending(struct ath10k *ar)
{
	struct ieee80211_hw *hw = ar->hw;
	struct ieee80211_txq *txq;
	struct ath10k_txq *artxq;
	struct ath10k_txq *last;
	int ret;
	int max;
	int loop_max = 2000;

	spin_lock_bh(&ar->txqs_lock);
	rcu_read_lock();

	last = list_last_entry(&ar->txqs, struct ath10k_txq, list);
	while (!list_empty(&ar->txqs)) {
		artxq = list_first_entry(&ar->txqs, struct ath10k_txq, list);
		txq = container_of((void *)artxq, struct ieee80211_txq,
				   drv_priv);

		if (--loop_max == 0) {
			ath10k_err(ar, "Looped 2000 times in tx_push_pending, bailing out.\n");
			break;
		}
		
		/* Prevent aggressive sta/tid taking over tx queue */
		max = 16;
		ret = 0;
		while (ath10k_mac_tx_can_push(hw, txq) && max--) {
			ret = ath10k_mac_tx_push_txq(hw, txq);
			if (ret < 0)
				break;
		}

		list_del_init(&artxq->list);
		if (ret != -ENOENT)
			list_add_tail(&artxq->list, &ar->txqs);
		else if (artxq == last)
			last = list_last_entry(&ar->txqs, struct ath10k_txq, list);

		ath10k_htt_tx_txq_update(hw, txq);

		if (artxq == last || (ret < 0 && ret != -ENOENT))
			break;
	}

	rcu_read_unlock();
	spin_unlock_bh(&ar->txqs_lock);
}

The crash I get is this:


ath10k_pci 0000:05:00.0: firmware crashed! (uuid 2a118708-977d-43d6-8d40-079ddec99eb3)
ath10k_pci 0000:05:00.0: firmware register dump:
ath10k_pci 0000:05:00.0: [00]: 0x00000009 0x000015B3 0x0099E4B6 0x00955B31
ath10k_pci 0000:05:00.0: [04]: 0x0099E4B6 0x00060130 0x00000005 0x00000016
ath10k_pci 0000:05:00.0: [08]: 0x00455030 0x004402B0 0x004060F0 0x00000007
ath10k_pci 0000:05:00.0: [12]: 0x00000009 0x00000000 0x009533D0 0x009533DF
ath10k_pci 0000:05:00.0: [16]: 0x00953438 0x0A00286E 0x009406B6 0x00000000
ath10k_pci 0000:05:00.0: [20]: 0x4099E4B6 0x00405FEC 0x000000BE 0x00955A00
ath10k_pci 0000:05:00.0: [24]: 0x8099E680 0x0040604C 0x00000000 0xC099E4B6
ath10k_pci 0000:05:00.0: [28]: 0x80986D5F 0x004060AC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [32]: 0x80984E51 0x004060CC 0x00423A14 0x004060F0
ath10k_pci 0000:05:00.0: [36]: 0x80985CBF 0x004060EC 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [40]: 0x809CAE6A 0x0040615C 0x004402B0 0x00424654
ath10k_pci 0000:05:00.0: [44]: 0x80984EBC 0x0040618C 0x004402B0 0x0040623C
ath10k_pci 0000:05:00.0: [48]: 0x809CB3CC 0x0040623C 0x004402B0 0x00411988
ath10k_pci 0000:05:00.0: [52]: 0x80984DE0 0x0040626C 0x00424654 0x004402B0
ath10k_pci 0000:05:00.0: [56]: 0x809CCE08 0x0040635C 0x00424654 0x00423234
ath10k_pci 0000:05:00.0: ath10k_pci ATH10K_DBG_BUFFER:
ath10k: [0000]: 0001854A 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 0001854E
ath10k: [0008]: 07FC4C02 00000004 0001854F 0060581D 0001854F 17FC4C01 0F00851C 0000000A
ath10k: [0016]: 06003007 0000FFAA FFFFFFFF 0001854F 17FC4C01 71108880 00000000 00C400BF
ath10k: [0024]: 00000000 00000FF0 0001854F 17FC4C01 71108880 00010000 00C400BF 00000000
ath10k: [0032]: FFFFFFFF 0001854F 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF
ath10k: [0040]: 0001854F 17FC4C01 71108880 00030000 00C400BF 000000FF FFFFFFFF 0001854F
ath10k: [0048]: 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF 0001854F 17FC4C01
ath10k: [0056]: 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018550 0060581D 00018550
ath10k: [0064]: 0860581B 0000851C 00000000 00018550 0060581D 00018550 07FC4C02 00000004
ath10k: [0072]: 00018551 0060581D 00018551 17FC4C01 0F00851C 0000000A 06003007 0000FFAA
ath10k: [0080]: FFFFFFFF 00018551 17FC4C01 71108880 00000000 00C400BF 00000000 00000FF0
ath10k: [0088]: 00018551 17FC4C01 71108880 00010000 00C400BF 00000000 FFFFFFFF 00018551
ath10k: [0096]: 17FC4C01 71108880 00020000 00C400BF 00000000 FFFFFFFF 00018551 17FC4C01
ath10k: [0104]: 71108880 00030000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880
ath10k: [0112]: 00040000 00C400BF 000000FF FFFFFFFF 00018551 17FC4C01 71108880 00050000
ath10k: [0120]: 00C400BF 000000FF FBFFFFFF 00018551 14605853 51100001 000F0DE4 00000400
ath10k: [0128]: 00000056 00440380 00018551 0060581D 00018551 0460581C 00000001 00018551
ath10k: [0136]: 0060581D 00018551 07FC4C02 00000004 00018552 0060581D 00018552 17FC4C01
ath10k: [0144]: 0F00851C 0000000A 06003007 0000FFAA FFFFFFFF 00018553 17FC4C01 71108880
ath10k: [0152]: 00000000 00C400BF 00000000 00000FF0 00018553 17FC4C01 71108880 00010000
ath10k: [0160]: 00C400BF 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00020000 00C400BF
ath10k: [0168]: 00000000 FFFFFFFF 00018553 17FC4C01 71108880 00030000 00C400BF 000000FF
ath10k: [0176]: FFFFFFFF 00018553 17FC4C01 71108880 00040000 00C400BF 000000FF FFFFFFFF
ath10k: [0184]: 00018553 17FC4C01 71108880 00050000 00C400BF 000000FF FBFFFFFF 00018553
ath10k: [0192]: 07FC4C02 00000001 00018553 07FC4C02 00000001 00018553 0BFC5826 000005E9
ath10k: [0200]: 00000003 00018554 0BFC5822 0000C01D 00000406 00018578 08383812 000F45C4
ath10k: [0208]: 00424654 00018578 10383809 0000143C 00000001 00000000 00000000 0001857B
ath10k: [0216]: 14385853 51100001 000F0D9C 000003FC 00000057 004402B0 0001857B 14385853
ath10k: [0224]: 51100001 000F0D54 000003FE 00000058 004402B0 0001857B 07FC5830 00000008
ath10k: [0232]: 0001857B 14385854 51100002 000F0D54 00000061 00000057 004402B0 0001857B
ath10k: [0240]: 14385851 91107001 00424654 004402B0 00000008 00000006 0001857B 17FC5855
ath10k: [0248]: 91108001 00000000 00000000 00000007 000000BE 0001857B 0FFC5855 91108002
ath10k: [0256]: 004402B0 00000010 0001857B 17FC0001 0099E4B6 000015B3 000015B3 00405EDC
ath10k: [0264]: 00000009
ath10k_pci 0000:05:00.0: ATH10K_END
sta13: drv-set-bitrate-mask had error return: -108
rdev-set-bitrate-mask failed: -108
wlan3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta22: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta0: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta1: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
sta2: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
ath10k_pci 0000:05:00.0: Looped 2000 times in tx_push_pending, bailing out.
sta3: Failed to send nullfunc to AP 04:f0:21:f6:85:1c after 1000ms, disconnecting
BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
PGD 0
Oops: 0002 [#1] PREEMPT SMP
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 8021q garp mrp stp llc bnep bluetooth fuse macvlan wanlink(O) pktgen 
rpcsec_gss_krb5 nfsv4 nfs fscache iTCO_wdt iTCO_vendor_support ath9k ath10k_pci coretemp ath9k_common hwmon intel_rapl ath10k_core iosf_mbi ath9k_hw 
x86_pkg_temp_thermal intel_powerclamp kvm_intel ath joydev kvm mac80211 irqbypass serio_raw pcspkr cfg80211 i2c_i801 lpc_ich snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm 8250_fintek snd_timer snd shpchp 
soundcore tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ata_generic i915 pata_acpi i2c_algo_bit drm_kms_helper e1000e ptp pps_core drm i2c_core video 
fjes ipv6 [last unloaded: nf_conntrack]
CPU: 2 PID: 581 Comm: kworker/u8:4 Tainted: G        W  O    4.4.6+ #21
Hardware name: To be filled by O.E.M. To be filled by O.E.M./HURONRIVER, BIOS 4.6.5 05/02/2012
Workqueue: phy2 ieee80211_iface_work [mac80211]
task: ffff8800d9c90000 ti: ffff880213fd0000 task.ti: ffff880213fd0000
RIP: 0010:[<ffffffffa08e9810>]  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
RSP: 0018:ffff88021eb03c28  EFLAGS: 00010296
RAX: ffff8800cbfd7000 RBX: ffff8800cbfd5060 RCX: ffff8800cbfd1000
RDX: 0000000000001000 RSI: 00000000d9c90805 RDI: ffff8800cbfd5000
RBP: ffff88021eb03c28 R08: 0000000000000001 R09: 0000000000000000
R10: ffff88021eb03ba8 R11: ffff8800cbfd5030 R12: ffff8800cbfd5060
R13: ffff880214a34902 R14: ffff8800cbfd5018 R15: ffff88021350e1b0
FS:  0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000001000 CR3: 0000000001c0a000 CR4: 00000000000406e0
Stack:
  ffff88021eb03c68 ffffffffa08e985a ffff880214a30a60 ffff880214a35600
  ffff8800cbfd5060 ffff880214a349e0 ffff880214a35430 ffff88021350e1b0
  ffff88021eb03cb8 ffffffffa0ec2bb4 ffff880214a30a60 0000000014a30a60
Call Trace:
  <IRQ>
  [<ffffffffa08e985a>] ieee80211_tx_dequeue+0x41/0xfe [mac80211]
  [<ffffffffa0ec2bb4>] ath10k_mac_tx_push_txq+0x6a/0x13b [ath10k_core]
  [<ffffffffa0ec2ddb>] ath10k_mac_tx_push_pending+0x156/0x16b [ath10k_core]
  [<ffffffffa0ed123d>] ath10k_htt_t2h_msg_handler+0x7d9/0x886 [ath10k_core]
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fca532>] ? ath10k_pci_hif_send_complete_check+0x5d/0x5d [ath10k_pci]
  [<ffffffffa0fca557>] ath10k_pci_htt_rx_deliver+0x25/0x2a [ath10k_pci]
  [<ffffffffa0fcbb51>] ath10k_pci_process_rx_cb+0x191/0x1c9 [ath10k_pci]
  [<ffffffff810f23ad>] ? __local_bh_enable_ip+0xa4/0xb9
  [<ffffffff816f9f9a>] ? _raw_spin_unlock_bh+0x30/0x33
  [<ffffffffa0fcbbbf>] ath10k_pci_htt_rx_cb+0x24/0x27 [ath10k_pci]
  [<ffffffffa0fce1be>] ath10k_ce_per_engine_service+0x64/0xa0 [ath10k_pci]
  [<ffffffffa0fce260>] ath10k_ce_per_engine_service_any+0x66/0x74 [ath10k_pci]
  [<ffffffffa0fcc4b3>] ath10k_pci_tasklet+0x3a/0x4e [ath10k_pci]
  [<ffffffff810f29e0>] tasklet_action+0xc0/0xcf
  [<ffffffff810f1ff6>] __do_softirq+0x1a4/0x407
  [<ffffffff810f2462>] irq_exit+0x40/0x94
  [<ffffffff810134a2>] do_IRQ+0xd5/0xed
  [<ffffffff816fb24c>] common_interrupt+0x8c/0x8c
  <EOI>
  [<ffffffff81129d49>] ? arch_local_irq_restore+0x6/0xd
  [<ffffffff816f8a3a>] __mutex_unlock_slowpath+0x120/0x137
  [<ffffffff816f8a5a>] mutex_unlock+0x9/0xb
  [<ffffffffa0ebcc38>] ath10k_conf_tx+0x3a9/0x3bb [ath10k_core]
  [<ffffffffa08c2b48>] drv_conf_tx+0x140/0x202 [mac80211]
  [<ffffffffa08f3072>] ieee80211_set_wmm_default+0x1fb/0x24a [mac80211]
  [<ffffffffa0908bc5>] ieee80211_set_disassoc+0x248/0x31f [mac80211]
  [<ffffffffa0908ccf>] ieee80211_sta_connection_lost+0x33/0x69 [mac80211]
  [<ffffffffa090bb8f>] ieee80211_sta_work+0x5fc/0xda9 [mac80211]
  [<ffffffff8112d30b>] ? mark_held_locks+0x5e/0x74
  [<ffffffff8112d490>] ? trace_hardirqs_on_caller+0x16f/0x18b
  [<ffffffff816fa024>] ? _raw_spin_unlock_irqrestore+0x48/0x5d
  [<ffffffffa08d54bd>] ieee80211_iface_work+0x335/0x34e [mac80211]
  [<ffffffff8110471a>] process_one_work+0x260/0x4db
  [<ffffffff81104e50>] worker_thread+0x1e9/0x29b
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81104c67>] ? rescuer_thread+0x2a8/0x2a8
  [<ffffffff81109bfb>] kthread+0xcf/0xd7
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
  [<ffffffff816faaef>] ret_from_fork+0x3f/0x70
  [<ffffffff81109b2c>] ? kthread_parkme+0x1f/0x1f
Code: 55 48 89 e5 48 39 c7 74 27 48 85 c0 74 24 ff 4f 10 48 8b 08 48 8b 50 08 48 c7 00 00 00 00 00 48 c7 40 08 00 00 00 00 48 89 51 08 <48> 89 0a eb 02 31 c0 5d 
c3 55 48 89 e5 41 57 41 56 4c 8d 76 b8
RIP  [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
  RSP <ffff88021eb03c28>
CR2: 0000000000001000
---[ end trace eb4cdb33d766b5f3 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..

Thanks,
Ben

>
>
> Michał
>
Michal Kazior April 1, 2016, 5:26 a.m. UTC | #5
On 31 March 2016 at 21:16, Ben Greear <greearb@candelatech.com> wrote:
[...]
> I tried adding a check for FW crash yesterday, but that did not help.
>
> Today, I added a limit of 2000 loops.  I see that hit, and then kernel
> crashes.  Maybe my patch is wrong.
>
> I've tried to apply (almost) every patch in linux.ath related to ath10k,
> including a few from the mailing list that have not been applied yet.
>
> My push-pending method now looks like this:
>
> void ath10k_mac_tx_push_pending(struct ath10k *ar)
> {
[...]
> }

Looks sane.


> The crash I get is this:
>
>
> ath10k_pci 0000:05:00.0: firmware crashed! (uuid
> 2a118708-977d-43d6-8d40-079ddec99eb3)
[...]
> BUG: unable to handle kernel paging request at 0000000000001000
> IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]

Hmm.. Do you have 2a58d42c1e01 ("mac80211: fix txq queue related
crashes") applied?

You might want to start dumping each iteration of push_pending() after
FW crashes to see if all pointers are fine and what retval each
push_txq() call returns.


Michał
Ben Greear April 1, 2016, 5:33 a.m. UTC | #6
On 03/31/2016 10:26 PM, Michal Kazior wrote:
> On 31 March 2016 at 21:16, Ben Greear <greearb@candelatech.com> wrote:
> [...]
>> I tried adding a check for FW crash yesterday, but that did not help.
>>
>> Today, I added a limit of 2000 loops.  I see that hit, and then kernel
>> crashes.  Maybe my patch is wrong.
>>
>> I've tried to apply (almost) every patch in linux.ath related to ath10k,
>> including a few from the mailing list that have not been applied yet.
>>
>> My push-pending method now looks like this:
>>
>> void ath10k_mac_tx_push_pending(struct ath10k *ar)
>> {
> [...]
>> }
>
> Looks sane.
>
>
>> The crash I get is this:
>>
>>
>> ath10k_pci 0000:05:00.0: firmware crashed! (uuid
>> 2a118708-977d-43d6-8d40-079ddec99eb3)
> [...]
>> BUG: unable to handle kernel paging request at 0000000000001000
>> IP: [<ffffffffa08e9810>] __skb_dequeue+0x2e/0x37 [mac80211]
>
> Hmm.. Do you have 2a58d42c1e01 ("mac80211: fix txq queue related
> crashes") applied?

Yes, though it is a different hash in my tree, probably merge issues.

See the patches I posted today to fix stale access to peer objects;
they seem to have fixed these problems for me, or at least made
them much harder to hit.

At quitting time, I was still seeing kasan errors in mac80211
stats logic, so there are more bugs waiting for tomorrow.

Thanks,
Ben

Patch

--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -3780,6 +3780,8 @@  void ath10k_mac_tx_push_pending(struct ath10k *ar)
                list_del_init(&artxq->list);
                if (ret != -ENOENT)
                        list_add_tail(&artxq->list, &ar->txqs);
+               else if (artxq == last)
+                       last = list_last_entry(&ar->txqs, struct ath10k_txq, list);

                ath10k_htt_tx_txq_update(hw, txq);