Message ID | 20241106-ath9k-deaf-detection-v1-1-736a150d2425@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Delegated to: | Jeff Johnson |
Headers | show |
Series | ath9k: Add RX inactivity detection and reset chip when it occurs | expand |
Hi, Thank you for submitting the patch. On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: > Since this is based on ideas by all three people, but not actually > directly derived from any of the patches, I'm including Suggested-by > tags from Simon, Sven and Felix below, which should hopefully serve as > proper credit. At least for me, this is more than enough. Thanks. I don't have the setup at the moment to test it again - maybe Issam can do this. One concern I would have (because I don't find the notes regarding this problem), is whether this check is now breaking because we count more things. In the past, rxlp/rxok was used for the check. And now I don't know whether the count for the other ones were still increasing. * RXHP (rather sure that "high priority frame" wasn't increasing) * RXEOL ("no RX descriptors available" - I would guess no, but I can't say for sure) * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> Kind regards, Sven
Sven Eckelmann <se@simonwunderlich.de> writes: > Hi, > > Thank you for submitting the patch. > > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: >> Since this is based on ideas by all three people, but not actually >> directly derived from any of the patches, I'm including Suggested-by >> tags from Simon, Sven and Felix below, which should hopefully serve as >> proper credit. > > At least for me, this is more than enough. Thanks. > > I don't have the setup at the moment to test it again - maybe Issam can do > this. One concern I would have (because I don't find the notes regarding this > problem), is whether this check is now breaking because we count more things. > In the past, rxlp/rxok was used for the check. And now I don't know whether > the count for the other ones were still increasing. > > * RXHP (rather sure that "high priority frame" wasn't increasing) > * RXEOL ("no RX descriptors available" - I would guess no, but I can't say for > sure) > * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) > > Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> Great, thanks for the review! I'll let it sit in patchwork for a little while to give people a chance to test it out before sending it over to Kalle to be applied :) -Toke
On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote: > Sven Eckelmann <se@simonwunderlich.de> writes: > > Hi, > > > > Thank you for submitting the patch. > > > > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: > >> Since this is based on ideas by all three people, but not actually > >> directly derived from any of the patches, I'm including Suggested-by > >> tags from Simon, Sven and Felix below, which should hopefully serve as > >> proper credit. > > > > At least for me, this is more than enough. Thanks. > > > > I don't have the setup at the moment to test it again - maybe Issam can do > > this. One concern I would have (because I don't find the notes regarding > > this problem), is whether this check is now breaking because we count > > more things. In the past, rxlp/rxok was used for the check. And now I > > don't know whether the count for the other ones were still increasing. > > > > * RXHP (rather sure that "high priority frame" wasn't increasing) > > * RXEOL ("no RX descriptors available" - I would guess no, but I can't say > > for> > > sure) > > > > * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) > > > > Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> > > Great, thanks for the review! I'll let it sit in patchwork for a little > while to give people a chance to test it out before sending it over to > Kalle to be applied :) > > -Toke Hi Toke, this looks good to me in general. I'm not sure either about the particular RX interrupts. We can test this by putting the AP in a shield box and verify that the counters are actually increasing, and that should be good enough. Acked-by: Simon Wunderlich <sw@simonwunderlich.de> Thank you! Simon
Simon Wunderlich <sw@simonwunderlich.de> writes: > On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote: >> Sven Eckelmann <se@simonwunderlich.de> writes: >> > Hi, >> > >> > Thank you for submitting the patch. >> > >> > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: >> >> Since this is based on ideas by all three people, but not actually >> >> directly derived from any of the patches, I'm including Suggested-by >> >> tags from Simon, Sven and Felix below, which should hopefully serve as >> >> proper credit. >> > >> > At least for me, this is more than enough. Thanks. >> > >> > I don't have the setup at the moment to test it again - maybe Issam can do >> > this. One concern I would have (because I don't find the notes regarding >> > this problem), is whether this check is now breaking because we count >> > more things. In the past, rxlp/rxok was used for the check. And now I >> > don't know whether the count for the other ones were still increasing. >> > >> > * RXHP (rather sure that "high priority frame" wasn't increasing) >> > * RXEOL ("no RX descriptors available" - I would guess no, but I can't say >> > for> >> > sure) >> > >> > * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) >> > >> > Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> >> >> Great, thanks for the review! I'll let it sit in patchwork for a little >> while to give people a chance to test it out before sending it over to >> Kalle to be applied :) >> >> -Toke > > Hi Toke, > > this looks good to me in general. I'm not sure either about the particular RX > interrupts. We can test this by putting the AP in a shield box and verify that > the counters are actually increasing, and that should be good enough. > > Acked-by: Simon Wunderlich <sw@simonwunderlich.de> Great, thanks! Would be awesome if you could test it out an report back! :) -Toke
On 11/6/24 17:03, Simon Wunderlich wrote: > On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote: >> Sven Eckelmann <se@simonwunderlich.de> writes: >>> Hi, >>> >>> Thank you for submitting the patch. >>> >>> On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: >>>> Since this is based on ideas by all three people, but not actually >>>> directly derived from any of the patches, I'm including Suggested-by >>>> tags from Simon, Sven and Felix below, which should hopefully serve as >>>> proper credit. >>> At least for me, this is more than enough. Thanks. >>> >>> I don't have the setup at the moment to test it again - maybe Issam can do >>> this. One concern I would have (because I don't find the notes regarding >>> this problem), is whether this check is now breaking because we count >>> more things. In the past, rxlp/rxok was used for the check. And now I >>> don't know whether the count for the other ones were still increasing. >>> >>> * RXHP (rather sure that "high priority frame" wasn't increasing) >>> * RXEOL ("no RX descriptors available" - I would guess no, but I can't say >>> for> >>> sure) >>> >>> * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) >>> >>> Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> >> Great, thanks for the review! I'll let it sit in patchwork for a little >> while to give people a chance to test it out before sending it over to >> Kalle to be applied :) >> >> -Toke > Hi Toke, > > this looks good to me in general. I'm not sure either about the particular RX > interrupts. We can test this by putting the AP in a shield box and verify that > the counters are actually increasing, and that should be good enough. > > Acked-by: Simon Wunderlich <sw@simonwunderlich.de> > > Thank you! > Simon Hi Toke, I have tested this patch in mesh mode, and it functions as expected. I conducted the test by placing one node inside a shield box and the other outside, then verified whether a reset occurred due to RX path inactivity. Tested-by: Issam Hamdi <ih@simonwunderlich.de> Regards, Issam -
Hamdi Issam <ih@simonwunderlich.de> writes: > On 11/6/24 17:03, Simon Wunderlich wrote: >> On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote: >>> Sven Eckelmann <se@simonwunderlich.de> writes: >>>> Hi, >>>> >>>> Thank you for submitting the patch. >>>> >>>> On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: >>>>> Since this is based on ideas by all three people, but not actually >>>>> directly derived from any of the patches, I'm including Suggested-by >>>>> tags from Simon, Sven and Felix below, which should hopefully serve as >>>>> proper credit. >>>> At least for me, this is more than enough. Thanks. >>>> >>>> I don't have the setup at the moment to test it again - maybe Issam can do >>>> this. One concern I would have (because I don't find the notes regarding >>>> this problem), is whether this check is now breaking because we count >>>> more things. In the past, rxlp/rxok was used for the check. And now I >>>> don't know whether the count for the other ones were still increasing. >>>> >>>> * RXHP (rather sure that "high priority frame" wasn't increasing) >>>> * RXEOL ("no RX descriptors available" - I would guess no, but I can't say >>>> for> >>>> sure) >>>> >>>> * RXORN ("FIFO overrun" I would guess no, but I can't say for sure) >>>> >>>> Reviewed-by: Sven Eckelmann <se@simonwunderlich.de> >>> Great, thanks for the review! I'll let it sit in patchwork for a little >>> while to give people a chance to test it out before sending it over to >>> Kalle to be applied :) >>> >>> -Toke >> Hi Toke, >> >> this looks good to me in general. I'm not sure either about the particular RX >> interrupts. We can test this by putting the AP in a shield box and verify that >> the counters are actually increasing, and that should be good enough. >> >> Acked-by: Simon Wunderlich <sw@simonwunderlich.de> >> >> Thank you! >> Simon > > Hi Toke, > > I have tested this patch in mesh mode, and it functions as expected. > > I conducted the test by placing one node inside a shield box and the > other outside, then verified whether a reset occurred due to RX path > inactivity. > > Tested-by: Issam Hamdi <ih@simonwunderlich.de> Great, thanks for testing! :) -Toke
On Wednesday, 6 November 2024 14:39:14 CET Sven Eckelmann wrote: > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: > > Since this is based on ideas by all three people, but not actually > > directly derived from any of the patches, I'm including Suggested-by > > tags from Simon, Sven and Felix below, which should hopefully serve as > > proper credit. > > At least for me, this is more than enough. Thanks. > > I don't have the setup at the moment to test it again I've prepared a new Freifunk Vogtland firmware with this patch and placed it on a node which "regularly" has this problem. It is a node in a residential neighbourhood with the usual amount of WiFi activity. It was known that this node requires ~2 resets in 10 days. After having it running for 10 1/2 days, I could see that the node was still working correctly and the "Rx path inactive" counter showed 3 resets during that time period. Tested-by: Sven Eckelmann <se@simonwunderlich.de> Kind regards, Sven
Sven Eckelmann <se@simonwunderlich.de> writes: > On Wednesday, 6 November 2024 14:39:14 CET Sven Eckelmann wrote: >> On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote: >> > Since this is based on ideas by all three people, but not actually >> > directly derived from any of the patches, I'm including Suggested-by >> > tags from Simon, Sven and Felix below, which should hopefully serve as >> > proper credit. >> >> At least for me, this is more than enough. Thanks. >> >> I don't have the setup at the moment to test it again > > I've prepared a new Freifunk Vogtland firmware with this patch and placed it > on a node which "regularly" has this problem. It is a node in a residential > neighbourhood with the usual amount of WiFi activity. It was known that this > node requires ~2 resets in 10 days. > > After having it running for 10 1/2 days, I could see that the node was still > working correctly and the "Rx path inactive" counter showed 3 resets during > that time period. > > Tested-by: Sven Eckelmann <se@simonwunderlich.de> Great, thanks for testing! -Toke
On Wed, 06 Nov 2024 13:41:44 +0100, Toke Høiland-Jørgensen wrote: > Some ath9k chips can, seemingly at random, end up in a state which can > be described as "deaf". No or nearly no interrupts are generated anymore > for incoming packets. Existing links either break down after a while and > new links will not be established. > > The circumstances leading to this "deafness" is still unclear, but some > particular chips (especially 2-stream 11n SoCs, but also others) can go > 'deaf' when running AP or mesh (or both) after some time. It's probably > a hardware issue, and doing a channel scan to trigger a chip > reset (which one normally can't do on an AP interface) recovers the > hardware. > > [...] Applied, thanks! [1/1] ath9k: Add RX inactivity detection and reset chip when it occurs commit: b5f871ab4913b2403a7cdcbcde16d39d0b071fb3 Best regards,
diff --git a/drivers/net/wireless/ath/ath9k/ath9k.h b/drivers/net/wireless/ath/ath9k/ath9k.h index 29ca65a732a66c9452b19ad184f513f92e4c8c2c..bcfc8df0efe5b2c5510230e7f59592d51df8365e 100644 --- a/drivers/net/wireless/ath/ath9k/ath9k.h +++ b/drivers/net/wireless/ath/ath9k/ath9k.h @@ -1018,6 +1018,8 @@ struct ath_softc { u8 gtt_cnt; u32 intrstatus; + u32 rx_active_check_time; + u32 rx_active_count; u16 ps_flags; /* PS_* */ bool ps_enabled; bool ps_idle; diff --git a/drivers/net/wireless/ath/ath9k/debug.c b/drivers/net/wireless/ath/ath9k/debug.c index 51abc470125b3cc4e773f5272e2394c8f790a32e..a1e3bcf796f27fd58f5ab307a198927274f8f321 100644 --- a/drivers/net/wireless/ath/ath9k/debug.c +++ b/drivers/net/wireless/ath/ath9k/debug.c @@ -750,6 +750,7 @@ static int read_file_reset(struct seq_file *file, void *data) [RESET_TYPE_CALIBRATION] = "Calibration error", [RESET_TX_DMA_ERROR] = "Tx DMA stop error", [RESET_RX_DMA_ERROR] = "Rx DMA stop error", + [RESET_TYPE_RX_INACTIVE] = "Rx path inactive", }; int i; diff --git a/drivers/net/wireless/ath/ath9k/debug.h b/drivers/net/wireless/ath/ath9k/debug.h index 389459c04d14a5215783b916519d05419b970c20..cb3e75969875afed282b09ff2458dd8ca286d5b1 100644 --- a/drivers/net/wireless/ath/ath9k/debug.h +++ b/drivers/net/wireless/ath/ath9k/debug.h @@ -53,6 +53,7 @@ enum ath_reset_type { RESET_TYPE_CALIBRATION, RESET_TX_DMA_ERROR, RESET_RX_DMA_ERROR, + RESET_TYPE_RX_INACTIVE, __RESET_TYPE_MAX }; diff --git a/drivers/net/wireless/ath/ath9k/link.c b/drivers/net/wireless/ath/ath9k/link.c index d1e5767aab3cbc2ee91018edadf89a223cf1d759..d078a59d7d3cdbe623b507d29cc974b801f2deaf 100644 --- a/drivers/net/wireless/ath/ath9k/link.c +++ b/drivers/net/wireless/ath/ath9k/link.c @@ -50,7 +50,36 @@ static bool ath_tx_complete_check(struct ath_softc *sc) "tx hung, resetting the chip\n"); ath9k_queue_reset(sc, RESET_TYPE_TX_HANG); return false; +} + +#define RX_INACTIVE_CHECK_INTERVAL (4 * MSEC_PER_SEC) + +static bool ath_hw_rx_inactive_check(struct ath_softc *sc) +{ + struct ath_common *common = ath9k_hw_common(sc->sc_ah); + u32 interval, count; + + interval = jiffies_to_msecs(jiffies - sc->rx_active_check_time); + count = sc->rx_active_count; + + if (interval < RX_INACTIVE_CHECK_INTERVAL) + return true; /* too soon to check */ + sc->rx_active_count = 0; + sc->rx_active_check_time = jiffies; + + /* Need at least one interrupt per second, and we should only react if + * we are within a factor two of the expected interval + */ + if (interval > RX_INACTIVE_CHECK_INTERVAL * 2 || + count >= interval / MSEC_PER_SEC) + return true; + + ath_dbg(common, RESET, + "RX inactivity detected. Schedule chip reset\n"); + ath9k_queue_reset(sc, RESET_TYPE_RX_INACTIVE); + + return false; } void ath_hw_check_work(struct work_struct *work) @@ -58,8 +87,8 @@ void ath_hw_check_work(struct work_struct *work) struct ath_softc *sc = container_of(work, struct ath_softc, hw_check_work.work); - if (!ath_hw_check(sc) || - !ath_tx_complete_check(sc)) + if (!ath_hw_check(sc) || !ath_tx_complete_check(sc) || + !ath_hw_rx_inactive_check(sc)) return; ieee80211_queue_delayed_work(sc->hw, &sc->hw_check_work, diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c index b92c89dad8deac7281e0165c9e1a566ba99ece65..998f717a1a86eee539058f6f4d9318dd459fc8dd 100644 --- a/drivers/net/wireless/ath/ath9k/main.c +++ b/drivers/net/wireless/ath/ath9k/main.c @@ -453,6 +453,7 @@ void ath9k_tasklet(struct tasklet_struct *t) ath_rx_tasklet(sc, 0, true); ath_rx_tasklet(sc, 0, false); + sc->rx_active_count++; } if (status & ATH9K_INT_TX) {
Some ath9k chips can, seemingly at random, end up in a state which can be described as "deaf". No or nearly no interrupts are generated anymore for incoming packets. Existing links either break down after a while and new links will not be established. The circumstances leading to this "deafness" is still unclear, but some particular chips (especially 2-stream 11n SoCs, but also others) can go 'deaf' when running AP or mesh (or both) after some time. It's probably a hardware issue, and doing a channel scan to trigger a chip reset (which one normally can't do on an AP interface) recovers the hardware. The only way the driver can detect this state, is by detecting if there has been no RX activity for a while. In this case we can proactively reset the chip (which only takes a small number of milliseconds, so shouldn't interrupt things too much if it has been idle for several seconds), which functions as a workaround. OpenWrt, and various derivatives, have been carrying versions of this workaround for years, that were never upstreamed. One version[0], written by Felix Fietkau, used a simple counter and only reset if there was precisely zero RX activity for a long period of time. This had the problem that in some cases a small number of interrupts would appear even if the device was otherwise not responsive. For this reason, another version[1], written by Simon Wunderlich and Sven Eckelmann, used a time-based approach to calculate the average number of RX interrupts over a longer (four-second) interval, and reset the chip when seeing less than one interrupt per second over this period. However, that version relied on debugfs counters to keep track of the number of interrupts, which means it didn't work at all if debugfs was not enabled. This patch unifies the two versions: it uses the same approach as Felix' patch to count the number of RX handler invocations, but uses the same time-based windowing approach as Simon and Sven's patch to still handle the case where occasional interrupts appear but the device is otherwise deaf. Since this is based on ideas by all three people, but not actually directly derived from any of the patches, I'm including Suggested-by tags from Simon, Sven and Felix below, which should hopefully serve as proper credit. [0] https://patchwork.kernel.org/project/linux-wireless/patch/20170125163654.66431-3-nbd@nbd.name/ [1] https://patchwork.kernel.org/project/linux-wireless/patch/20161117083614.19188-2-sven.eckelmann@open-mesh.com/ Suggested-by: Simon Wunderlich <sw@simonwunderlich.de> Suggested-by: Sven Eckelmann <se@simonwunderlich.de> Suggested-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> --- drivers/net/wireless/ath/ath9k/ath9k.h | 2 ++ drivers/net/wireless/ath/ath9k/debug.c | 1 + drivers/net/wireless/ath/ath9k/debug.h | 1 + drivers/net/wireless/ath/ath9k/link.c | 33 +++++++++++++++++++++++++++++++-- drivers/net/wireless/ath/ath9k/main.c | 1 + 5 files changed, 36 insertions(+), 2 deletions(-) --- base-commit: c33f9c2728d0ccc7472e6239346c0fb3de556e0f change-id: 20241105-ath9k-deaf-detection-0e26bb3243a6