Message ID | 20200526145815.6415-6-mcgrof@kernel.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
Series | kernel: taint when the driver firmware crashes | expand |
On Tue, May 26, 2020 at 7:58 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > This makes use of the new taint_firmware_crashed() to help > annotate when firmware for device drivers crash. When firmware > crashes devices can sometimes become unresponsive, and recovery > sometimes requires a driver unload / reload and in the worst cases > a reboot. Just for the record, the underlying problem you seem to be complaining about does not appear to be a firmware crash at all. It does happen to result in a firmware crash report much later on (because when the PCIe endpoint is this hosed, sooner or later the driver thinks the firmware is dead), but it's not likely the root cause. More below. > Using a taint flag allows us to annotate when this happens clearly. > > I have run into this situation with this driver with the latest > firmware as of today, May 21, 2020 using v5.6.0, leaving me at > a state at which my only option is to reboot. Driver removal and > addition does not fix the situation. This is reported on kernel.org > bugzilla korg#207851 [0]. I took a look, and replied there: https://bugzilla.kernel.org/show_bug.cgi?id=207851#c2 Per the above, it seems more likely you have a PCI or power management problem, not an ath10k or ath10k-firmware problem. > But this isn't the first firmware crash reported, > others have been filed before and none of these bugs have yet been > addressed [1] [2] [3]. Including my own I see these firmware crash > reports: Yes, firmware does crash. Sometimes repeatedly. It also happens to be closed source, so it's nearly impossible for the average Linux dev to debug. But FWIW, those 3 all appear to be recoverable -- and then they crash again a few minutes later. So just as claimed on prior iterations of this patchset, ath10k is doing fine at recovery [*] -- it's "only" the firmware that's a problem. (And, if a WiFi firmware doesn't like something in the RF environment...it's totally understandable that the crash will happen more than once. Of course that sucks, but it's not unexpected.) Crucially, rebooting won't really do anything to help these people, AIUI. Maybe what you really want is to taint the kernel every time a non-free firmware is loaded ;) I'd also note that those 3 reports are 3 years old. There have been many ath10k-firmware updates since then, so it's not necessarily fair to dig those back up. Also, bugzilla.kernel.org is totally ignored by many linux-wireless@ folks. But I digress... All in all, I have no interest in this proposal, for many of the reasons already mentioned on previous iterations. It's way too coarse and won't be useful in understanding what's going on in a system, IMO, at least for ath10k. But it's also easy enough to ignore, so if it makes somebody happy to claim a taint, then so be it. Regards, Brian [*] Although, at least one of those doesn't appear to be as "clean" of a recovery attempt as typical. Maybe there are some lurking driver bugs in there too. > * korg#207851 [0] > * korg#197013 [1] > * korg#201237 [2] > * korg#195987 [3] > > [0] https://bugzilla.kernel.org/show_bug.cgi?id=207851 > [1] https://bugzilla.kernel.org/show_bug.cgi?id=197013 > [2] https://bugzilla.kernel.org/show_bug.cgi?id=201237 > [3] https://bugzilla.kernel.org/show_bug.cgi?id=195987
diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c index 1d941d53fdc9..818c3acc2468 100644 --- a/drivers/net/wireless/ath/ath10k/pci.c +++ b/drivers/net/wireless/ath/ath10k/pci.c @@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct *work) scnprintf(guid, sizeof(guid), "n/a"); ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); + taint_firmware_crashed(); ath10k_print_driver_info(ar); ath10k_pci_dump_registers(ar, crash_data); ath10k_ce_dump_registers(ar, crash_data); @@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar, if (ret) { if (ath10k_pci_has_fw_crashed(ar)) { ath10k_warn(ar, "firmware crashed during chip reset\n"); + taint_firmware_crashed(); ath10k_pci_fw_crashed_clear(ar); ath10k_pci_fw_crashed_dump(ar); } diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c index e2aff2254a40..8b2fc0b89be4 100644 --- a/drivers/net/wireless/ath/ath10k/sdio.c +++ b/drivers/net/wireless/ath/ath10k/sdio.c @@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar) /* TODO: Add firmware crash handling */ ath10k_warn(ar, "firmware crashed\n"); + taint_firmware_crashed(); /* read counter to clear the interrupt, the debug error interrupt is * counter 0. @@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar) if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) { ath10k_err(ar, "firmware crashed!\n"); queue_work(ar->workqueue, &ar->restart_work); + taint_firmware_crashed(); } return ret; } diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c index 354d49b1cd45..071ee7607a4c 100644 --- a/drivers/net/wireless/ath/ath10k/snoc.c +++ b/drivers/net/wireless/ath/ath10k/snoc.c @@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar) scnprintf(guid, sizeof(guid), "n/a"); ath10k_err(ar, "firmware crashed! (guid %s)\n", guid); + taint_firmware_crashed(); ath10k_print_driver_info(ar); ath10k_msa_dump_memory(ar, crash_data); mutex_unlock(&ar->dump_mutex);