Message ID | 1529506583-20204-1-git-send-email-greearb@candelatech.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
> From: Ben Greear <greearb@candelatech.com> > > This is against the 4.16 kernel, likely applies to later kernels > as well. Firmware is beta ath10k-ct firmware for 9984 NIC. The > patch is not firmware or chipset specific. > > When firmware crashes, packets can still be sent from the > mac80211 stack, and that can cause crashes in the ath10k > tx path. Yes, this is what I had observed and explained in the RFC patch which I have sent some time back. In my case the device never comes up and the entire system crashes; I hope you have observed the same behavior. > After adding this patch, I saw cases where the tx path was > called in state ATH10K_STATE_RESTARTED. > > I have not tested the tx_64 path, but assume it has similar > issues, so same patch was added to it. This case should not be dealt in ath10k, rather we should make sure packets does not reach the driver during hardware restart after the firmware crash. Can you please try the RFC patch and see if it fixes the issue? I am working on the patch and probably send it out after doing some testing and cleanup. https://patchwork.kernel.org/patch/10411967/ Thanks, Manikanta
On 06/20/2018 10:37 AM, Manikanta Pubbisetty wrote: > >> From: Ben Greear <greearb@candelatech.com> >> >> This is against the 4.16 kernel, likely applies to later kernels >> as well. Firmware is beta ath10k-ct firmware for 9984 NIC. The >> patch is not firmware or chipset specific. >> >> When firmware crashes, packets can still be sent from the >> mac80211 stack, and that can cause crashes in the ath10k >> tx path. > > Yes, this is what I had observed and explained in the RFC patch which I have sent some time back. In my case the device never comes up and the entire system > crashes; I hope you have observed the same behavior. > >> After adding this patch, I saw cases where the tx path was >> called in state ATH10K_STATE_RESTARTED. >> >> I have not tested the tx_64 path, but assume it has similar >> issues, so same patch was added to it. > > This case should not be dealt in ath10k, rather we should make sure packets does not reach the driver during hardware restart after the firmware crash. > Can you please try the RFC patch and see if it fixes the issue? I am working on the patch and probably send it out after doing some testing and cleanup. > > https://patchwork.kernel.org/patch/10411967/ I did see your patch, but I was not sure it would make it upstream. I think either way my patch might be useful in case bugs creep back in. And yes, I did see full system crash in this case. Thanks, Ben
On 2018-06-20 10:42, Ben Greear wrote: > On 06/20/2018 10:37 AM, Manikanta Pubbisetty wrote: >> [...] >> This case should not be dealt in ath10k, rather we should make sure >> packets does not reach the driver during hardware restart after the >> firmware crash. >> Can you please try the RFC patch and see if it fixes the issue? I am >> working on the patch and probably send it out after doing some testing >> and cleanup. >> >> https://patchwork.kernel.org/patch/10411967/ > > I did see your patch, but I was not sure it would make it upstream. > > I think either way my patch might be useful in case bugs creep back in. > ar->state check can not be in hot path as it is protected by mutex lock. -Rajkumar
On 06/20/2018 11:48 AM, Rajkumar Manoharan wrote: > On 2018-06-20 10:42, Ben Greear wrote: >> On 06/20/2018 10:37 AM, Manikanta Pubbisetty wrote: >>> > [...] >>> This case should not be dealt in ath10k, rather we should make sure packets does not reach the driver during hardware restart after the firmware crash. >>> Can you please try the RFC patch and see if it fixes the issue? I am working on the patch and probably send it out after doing some testing and cleanup. >>> >>> https://patchwork.kernel.org/patch/10411967/ >> >> I did see your patch, but I was not sure it would make it upstream. >> >> I think either way my patch might be useful in case bugs creep back in. >> > > ar->state check can not be in hot path as it is protected by mutex lock. If the tx logic is ever called while state is changing, that would seem to be a bug as well? Thanks, Ben
On 2018-06-20 11:51, Ben Greear wrote: > On 06/20/2018 11:48 AM, Rajkumar Manoharan wrote: >> On 2018-06-20 10:42, Ben Greear wrote: >>> On 06/20/2018 10:37 AM, Manikanta Pubbisetty wrote: >>>> >> [...] >>>> This case should not be dealt in ath10k, rather we should make sure >>>> packets does not reach the driver during hardware restart after the >>>> firmware crash. >>>> Can you please try the RFC patch and see if it fixes the issue? I am >>>> working on the patch and probably send it out after doing some >>>> testing and cleanup. >>>> >>>> https://patchwork.kernel.org/patch/10411967/ >>> >>> I did see your patch, but I was not sure it would make it upstream. >>> >>> I think either way my patch might be useful in case bugs creep back >>> in. >>> >> >> ar->state check can not be in hot path as it is protected by mutex >> lock. > > If the tx logic is ever called while state is changing, that would seem > to be a bug as well? > Thats true. The assumption is that packets should not be given to driver after ieee80211_stop_queues is called. Isn't it? Manikanta is trying to fix some cases where packets are still continued to sent even after queue stop. Worth trying that. -Rajkumar
================================================================== BUG: KASAN: null-ptr-deref in ath10k_htt_tx_32+0x18ba/0x2b00 [ath10k_core] Write of size 64 at addr 0000000000000000 by task kworker/u8:2/5115 ================================================================== BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 IP: memset_erms+0x9/0x10 PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP KASAN PTI Modules linked in: ath10k_pci ath10k_core rpcsec_gss_krb5 nfsv4 nfs fscache nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf 8021q garp mrp stp llc fuse macvlan pktgen lm78 hwmon_vid iTCO_wdt iTCO_vendor_support coretemp intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel ath snd_hda_codec_hdmi kvm snd_hda_intel irqbypass mac80211 snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device cfg80211 snd_pcm i2c_i801 snd_timer snd shpchp soundcore mei_wdt intel_pch_thermal acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel serio_raw e1000e i915 igb hwmon dca i2c_algo_bit drm_kms_helper drm i2c_core video ipv6 crc_ccitt [last unloaded: ath10k_core] CPU: 0 PID: 5115 Comm: kworker/u8:2 Tainted: G B W 4.16.15+ #19 Hardware name: _ _/, BIOS 5.11 08/26/2016 Workqueue: phy3 ieee80211_beacon_connection_loss_work [mac80211] RIP: 0010:memset_erms+0x9/0x10 RSP: 0018:ffff8801489af828 EFLAGS: 00010292 RAX: ffff880146920000 RBX: ffff88014a08d040 RCX: 0000000000000040 RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff8801121f4fa8 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8801489af630 R11: 1ffff10029135e88 R12: ffff8801121f44a0 R13: 0000000000000000 R14: 0000000000000000 R15: 00000000ffec0000 FS: 0000000000000000(0000) GS:ffff88014de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000003a14005 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ath10k_htt_tx_32+0x18ba/0x2b00 [ath10k_core] ? ath10k_htt_tx_free_msdu_id+0xc0/0xc0 [ath10k_core] ? invoke_tx_handlers_late+0x2340/0x2340 [mac80211] ath10k_mac_tx+0xd7c/0x1680 [ath10k_core] ath10k_mac_tx_push_txq+0x1a2/0x3e0 [ath10k_core] ath10k_mac_op_wake_tx_queue+0x2fa/0x4e0 [ath10k_core] ieee80211_queue_skb+0x7cf/0xfa0 [mac80211] ieee80211_tx+0x259/0x330 [mac80211] ? ieee80211_tx_prepare_skb+0x3f0/0x3f0 [mac80211] ? ieee80211_xmit+0x26b/0x520 [mac80211] __ieee80211_tx_skb_tid_band+0x1e6/0x290 [mac80211] ieee80211_send_nullfunc+0x223/0x3f0 [mac80211] ieee80211_mgd_probe_ap_send+0x1af/0x4b0 [mac80211] ieee80211_mgd_probe_ap.part.22+0x28d/0x380 [mac80211] process_one_work+0x5f7/0x14d0 ? pwq_dec_nr_in_flight+0x2b0/0x2b0 ? _raw_spin_unlock_irq+0x24/0x40 worker_thread+0xdc/0x12d0 ? rescuer_thread+0x12b0/0x12b0 kthread+0x2cf/0x3c0 ? kthread_delayed_work_timer_fn+0x1e0/0x1e0 ret_from_fork+0x24/0x30 ce 48 b8 01 01 01 01 01 RIP: memset_erms+0x9/0x10 RSP: ffff8801489af828 CR2: 0000000000000000 ---[ end trace 5e24737a5c492997 ]--- Kernel panic - not syncing: Fatal exception in interrupt Signed-off-by: Ben Greear <greearb@candelatech.com> --- drivers/net/wireless/ath/ath10k/htt_tx.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c index fa134f4..709bb1d 100644 --- a/drivers/net/wireless/ath/ath10k/htt_tx.c +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c @@ -1173,6 +1173,18 @@ static int ath10k_htt_tx_32(struct ath10k_htt *htt, struct htt_msdu_ext_desc *ext_desc_t = NULL; u32 peer_id = HTT_INVALID_PEERID; + if (ar->state != ATH10K_STATE_ON) { + static bool done_once = 0; + if (!done_once) { + done_once = true; + ath10k_err(ar, "Invalid state: %d in ath10k_htt_tx_32, warning will not be repeated.\n", + ar->state); + WARN_ON(1); + } + res = -ENODEV; + goto err; + } + if (unlikely(info->flags & IEEE80211_TX_CTL_TX_OFFCHAN)) freq = ar->scan.roc_freq; @@ -1409,6 +1421,18 @@ static int ath10k_htt_tx_64(struct ath10k_htt *htt, struct htt_msdu_ext_desc_64 *ext_desc = NULL; struct htt_msdu_ext_desc_64 *ext_desc_t = NULL; + if (ar->state != ATH10K_STATE_ON) { + static bool done_once = 0; + if (!done_once) { + done_once = true; + ath10k_err(ar, "Invalid state: %d in ath10k_htt_tx_64, warning will not be repeated.\n", + ar->state); + WARN_ON(1); + } + res = -ENODEV; + goto err; + } + spin_lock_bh(&htt->tx_lock); res = ath10k_htt_tx_alloc_msdu_id(htt, msdu); spin_unlock_bh(&htt->tx_lock);
From: Ben Greear <greearb@candelatech.com> This is against the 4.16 kernel, likely applies to later kernels as well. Firmware is beta ath10k-ct firmware for 9984 NIC. The patch is not firmware or chipset specific. When firmware crashes, packets can still be sent from the mac80211 stack, and that can cause crashes in the ath10k tx path. After adding this patch, I saw cases where the tx path was called in state ATH10K_STATE_RESTARTED. I have not tested the tx_64 path, but assume it has similar issues, so same patch was added to it. Here is example of crash without the patch: The line that crashes decodes as: (gdb) l *(ath10k_htt_tx_32+0x18ba) 0x74a9a is in ath10k_htt_tx_32 (/home/greearb/git/linux-4.16.dev.y/drivers/net/wireless/ath/ath10k/htt_tx.c:1257). 1252 sizeof(struct htt_msdu_ext_desc)); 1253 frags = (struct htt_data_tx_desc_frag *) 1254 &ext_desc_t[msdu_id].frags; 1255 ext_desc = &ext_desc_t[msdu_id]; 1256 frags[0].tword_addr.paddr_lo = 1257 __cpu_to_le32(skb_cb->paddr); 1258 frags[0].tword_addr.paddr_hi = 0; 1259 frags[0].tword_addr.len_16 = __cpu_to_le16(msdu->len); 1260 1261 frags_paddr = htt->frag_desc.paddr + ath10k_pci 0000:04:00.0: ATH10K_END ath10k_pci 0000:04:00.0: firmware crashed! (guid 033040e0-a2e0-499c-b2a2-3e06832c649e) ath10k_pci 0000:04:00.0: firmware register dump: ath10k_pci 0000:04:00.0: [00]: 0x0000000A 0x00000000 0x00000000 0x00000000 ... [snipped rest of crash dump for brevity] ... ath10k_pci 0000:04:00.0: wmi unified ready event not received ath10k_pci 0000:04:00.0: Could not init core: -110