Message ID | 1545318971-28351-2-git-send-email-sgruszka@redhat.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
Series | [v2,1/3] rt2x00: use ratelimited variants dev_warn/dev_err | expand |
On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > Some USB host devices/drivers on some conditions can always return > EPROTO error on submitted URBs. That can cause infinity loop in the > rt2x00 driver. > > Since we can have single EPROTO errors we can not mark as device as > removed to avoid infinity loop. However we can count consecutive > EPROTO errors and mark device as removed if get lot of it. > I choose number 10 as threshold. > > Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com> > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > index 086aad22743d..60b8bccab83d 100644 > --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > @@ -31,6 +31,22 @@ > #include "rt2x00.h" > #include "rt2x00usb.h" > > +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status) > +{ > + if (status == -ENODEV || status == -ENOENT) I am not sure about this, but it looks to me like you would never see ENOENT, but ETIMEDOUT: https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64 In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and passed up the chain. retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status); Maybe I am wrong about this, but then again I have neither ever seen the driver respond to an ENOENT like this when an RT5592 "disappeared". Kind regards, jer
On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote: > On Thu, 20 Dec 2018 at 16:16, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > Some USB host devices/drivers on some conditions can always return > > EPROTO error on submitted URBs. That can cause infinity loop in the > > rt2x00 driver. > > > > Since we can have single EPROTO errors we can not mark as device as > > removed to avoid infinity loop. However we can count consecutive > > EPROTO errors and mark device as removed if get lot of it. > > I choose number 10 as threshold. > > > > Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com> > > Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> > > > diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > > index 086aad22743d..60b8bccab83d 100644 > > --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > > +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c > > @@ -31,6 +31,22 @@ > > #include "rt2x00.h" > > #include "rt2x00usb.h" > > > > +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status) > > +{ > > + if (status == -ENODEV || status == -ENOENT) > > I am not sure about this, but it looks to me like you would never see > ENOENT, but ETIMEDOUT: > > https://github.com/torvalds/linux/blob/master/drivers/usb/core/message.c#L64 > > In usb_start_wait_urb ETIMEDOUT is returned instead of ENOENT and > passed up the chain. > > retval = (ctx.status == -ENOENT ? -ETIMEDOUT : ctx.status); > > > Maybe I am wrong about this, but then again I have neither ever seen > the driver respond to an ENOENT like this when an RT5592 > "disappeared". According to Documentation/driver-api/usb/error-codes.rst ENOENT is valid error code when interface does not exist or when URB was unlinked. Regards Stanislaw
Hi Stanislaw, On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote: > > Maybe I am wrong about this, but then again I have neither ever seen > > the driver respond to an ENOENT like this when an RT5592 > > "disappeared". By "neither ever seen" I mean I have never seen rt2x00usb respond properly like that because no -ENOENT was ever seen. I have many system logs collected over the years showing the exact error that the code was trying to catch and still does try to catch: with USB debug messages enabled, I tend to see usb: kworker.* timed out on ep.* followed by rt2x00 specific warnings about the queues filling up: rt2800usb_txdone: Warning - Data pending for entry N in queue N [may repeat a few times] followed by the fatal: rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X with error -110 [many of these, system is slowly locking up] So the only clue that I had was that perhaps rt2x00usb_vendor_request wasn't catching the correct return value. > According to Documentation/driver-api/usb/error-codes.rst > ENOENT is valid error code when interface does not exist or > when URB was unlinked. rt2x00usb_vendor_request calls usb_control_msg. usb_control_msg according to that document can return -ETIMEDOUT. 1. rt2x00usb_vendor_request calls usb_control_msg. 2. usb_control_msg calls usb_internal_control_msg and passes on its return value. 3. usb_internal_control_msg calls usb_start_wait_urb and passes on its return value. 4. usb_start_wait_urb very specifically substitutes -ETIMEDOUT for -ENOENT as the return value of usb_kill_urb and 5. that is passed back all the way up to rt2x00usb_vendor_request. Kind regards, jer
On Tue, 8 Jan 2019 at 10:30, Jeroen Roovers <jer@airfi.aero> wrote: > I tend to see > > usb: kworker.* timed out on ep.* > > followed by rt2x00 specific warnings about the queues filling up: Sorry that should read "preceded by". > rt2800usb_txdone: Warning - Data pending for entry N in queue N > [may repeat a few times] Kind regards, jer
On 08/01/2019, Jeroen Roovers <jer@airfi.aero> wrote: > Hi Stanislaw, > > > On Mon, 7 Jan 2019 at 16:09, Stanislaw Gruszka <sgruszka@redhat.com> wrote: >> On Mon, Jan 07, 2019 at 01:47:19PM +0100, Jeroen Roovers wrote: >> > Maybe I am wrong about this, but then again I have neither ever seen >> > the driver respond to an ENOENT like this when an RT5592 >> > "disappeared". > > By "neither ever seen" I mean I have never seen rt2x00usb respond > properly like that because no -ENOENT was ever seen. I have many > system logs collected over the years showing the exact error that the > code was trying to catch and still does try to catch: with USB debug > messages enabled, I tend to see > > usb: kworker.* timed out on ep.* > > followed by rt2x00 specific warnings about the queues filling up: > rt2800usb_txdone: Warning - Data pending for entry N in queue N > [may repeat a few times] > > followed by the fatal: > > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X > with error -110 > [many of these, system is slowly locking up] > > So the only clue that I had was that perhaps rt2x00usb_vendor_request > wasn't catching the correct return value. Hi error message vendor request failed - do you get it on a real hardware or in virtualized environment? noticed it only happens when running attached USB device in virtualbox, even in such case I was able once to hit the right combination of ubuntu, vbox and vbox extensions and could run the USB wifi without lockups. so you might consider fixing this bug at different level. oh, and I'd also pay attention to hardware connection at the USB port, at least one of the devices I tested seems flaky when plugged in (edup mini adapter/RT53xx) regards, Tom
Hi Tom, On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <pozega.tomislav@gmail.com> wrote: > > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X > > with error -110 > > [many of these, system is slowly locking up] > > > > So the only clue that I had was that perhaps rt2x00usb_vendor_request > > wasn't catching the correct return value. > > Hi > > error message vendor request failed - do you get it on a real hardware > or in virtualized environment? I only run these on bare metal. What I assume so far is that when rt2x00usb_vendor_request starts failing like this, the MCU has failed. Power cycling the system helps but is undesirable, and sometimes so does a forced removal of rt2800usb, a short recovery period (cooling down, reloading the firmware?) and then loading the module again. But the problem I am looking to solve is not a hardware problem, it is recovering gracefully from a failure in the RT5592, so I have been looking intently at rt2x00usb_vendor_request because that's the function that complains loudly and kills the entire kernel when the RT5592 sees this failure. Kind regards, jer
On Wed, Jan 09, 2019 at 07:17:59AM +0100, Jeroen Roovers wrote: > Hi Tom, > > On Tue, 8 Jan 2019 at 12:04, Tom Psyborg <pozega.tomislav@gmail.com> wrote: > > > rt2x00usb_vendor_request: Error - Vendor Request X failed for offset X > > > with error -110 > > > [many of these, system is slowly locking up] > > > > > > So the only clue that I had was that perhaps rt2x00usb_vendor_request > > > wasn't catching the correct return value. > > > > Hi > > > > error message vendor request failed - do you get it on a real hardware > > or in virtualized environment? > > I only run these on bare metal. What I assume so far is that when > rt2x00usb_vendor_request starts failing like this, the MCU has failed. > Power cycling the system helps but is undesirable, and sometimes so > does a forced removal of rt2800usb, a short recovery period (cooling > down, reloading the firmware?) and then loading the module again. > > But the problem I am looking to solve is not a hardware problem, it is > recovering gracefully from a failure in the RT5592, so I have been > looking intently at rt2x00usb_vendor_request because that's the > function that complains loudly and kills the entire kernel when the > RT5592 sees this failure. So would be below patch (on top of this set) be a solution for at least to not kill the kernel. Or we looking for something better i.e. watchdog ? Regards Stanislaw diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c index 60b8bccab83d..ee5bd0efd2c7 100644 --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c @@ -36,7 +36,7 @@ static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status) if (status == -ENODEV || status == -ENOENT) return true; - if (status == -EPROTO) + if (status == -EPROTO || status == -ETIMEDOUT) rt2x00dev->num_proto_errs++; else rt2x00dev->num_proto_errs = 0;
Hi Stanislaw, On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > So would be below patch (on top of this set) be a solution for at > least to not kill the kernel. Or we looking for something better > i.e. watchdog ? I'll give it a spin. Thanks! Kind regards, jer
Aaaaaaand the results are in. On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote: > > Hi Stanislaw, > > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > So would be below patch (on top of this set) be a solution for at > > least to not kill the kernel. Or we looking for something better > > i.e. watchdog ? > > I'll give it a spin. Thanks! hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x6888 with error -110 kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x6888 with error -110 kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x6110 with error -110 hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x1018 with error -110 kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x1018 with error -110 kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x1910 with error -110 hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due to inactivity hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) rt2x00usb_check_usb_error in your patch is set to clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6 errors were seen. Maybe I should set it to 1 as I have never seen an RT5592 recover from this. The system remained relatively stable until after I tried forcibly removing and then loading the rt2800usb module. A simple `ifconfig` then triggered a kernel panic. Sadly I couldn't capture it in time but I did spot that more phyN (up to phy4) devices had been added. Kind regards, jer
On Thu, Jan 10, 2019 at 03:29:44PM +0100, Jeroen Roovers wrote: > Aaaaaaand the results are in. > > On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote: > > > > Hi Stanislaw, > > > > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > > So would be below patch (on top of this set) be a solution for at > > > least to not kill the kernel. Or we looking for something better > > > i.e. watchdog ? > > > > I'll give it a spin. Thanks! > > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x07 failed for offset 0x6888 with error -110 > kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x6888 with error -110 > kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x6110 with error -110 > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x07 failed for offset 0x1018 with error -110 > kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x1018 with error -110 > kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x1910 with error -110 > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > > rt2x00usb_check_usb_error in your patch is set to > clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6 > errors were seen. Maybe I should set it to 1 as I have never seen an > RT5592 recover from this. I know cases where we have only single USB error and device work after that. But if I remember correctly it was only for EPROTO not ETIMEDOUT. > The system remained relatively stable until after I tried forcibly > removing and then loading the rt2800usb module. A simple `ifconfig` > then triggered a kernel panic. Sadly I couldn't capture it in time but > I did spot that more phyN (up to phy4) devices had been added. Apparently we should not crash and that should be fixed, but without logs there is nothing we can do. Let try to attack problem from other side. RT5592 mises TSSI calibration, so perhaps device overheat for you because of that. Please apply attached patch which fixes txpower setting in mac80211 then set some low txpower value i.e: iw dev wlan0 set txpower fixed 1500 and check if device still stop working with USB errors. Regards Stanislaw diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c index 818aa0060349..ca513e94be9e 100644 --- a/net/mac80211/cfg.c +++ b/net/mac80211/cfg.c @@ -2415,6 +2415,8 @@ static int ieee80211_set_tx_power(struct wiphy *wiphy, bool update_txp_type = false; bool has_monitor = false; + printk("%s mbm %d\n", __func__, mbm); + if (wdev) { sdata = IEEE80211_WDEV_TO_SUB_IF(wdev); diff --git a/net/mac80211/iface.c b/net/mac80211/iface.c index 3a0171a65db3..022874bf49f3 100644 --- a/net/mac80211/iface.c +++ b/net/mac80211/iface.c @@ -49,19 +49,23 @@ static void ieee80211_iface_work(struct work_struct *work); bool __ieee80211_recalc_txpower(struct ieee80211_sub_if_data *sdata) { + struct ieee80211_local *local = sdata->local; struct ieee80211_chanctx_conf *chanctx_conf; int power; - rcu_read_lock(); - chanctx_conf = rcu_dereference(sdata->vif.chanctx_conf); - if (!chanctx_conf) { + if (local->use_chanctx) { + rcu_read_lock(); + chanctx_conf = rcu_dereference(sdata->vif.chanctx_conf); + if (!chanctx_conf) { + rcu_read_unlock(); + return false; + } + power = ieee80211_chandef_max_power(&chanctx_conf->def); rcu_read_unlock(); - return false; + } else { + power = ieee80211_chandef_max_power(&local->hw.conf.chandef); } - power = ieee80211_chandef_max_power(&chanctx_conf->def); - rcu_read_unlock(); - if (sdata->user_power_level != IEEE80211_UNSET_POWER_LEVEL) power = min(power, sdata->user_power_level); diff --git a/net/mac80211/main.c b/net/mac80211/main.c index 7b8320d4a8e4..6005c5e09be6 100644 --- a/net/mac80211/main.c +++ b/net/mac80211/main.c @@ -144,10 +144,14 @@ static u32 ieee80211_hw_conf_chan(struct ieee80211_local *local) rcu_read_lock(); list_for_each_entry_rcu(sdata, &local->interfaces, list) { - if (!rcu_access_pointer(sdata->vif.chanctx_conf)) - continue; if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN) continue; + if (sdata->user_power_level != IEEE80211_UNSET_POWER_LEVEL) + power = min(power, sdata->user_power_level); + if (sdata->ap_power_level != IEEE80211_UNSET_POWER_LEVEL) + power = min(power, sdata->ap_power_level); + if (!rcu_access_pointer(sdata->vif.chanctx_conf)) + continue; power = min(power, sdata->vif.bss_conf.txpower); } rcu_read_unlock();
On Thu, 10 Jan 2019 at 15:29, Jeroen Roovers <jer@airfi.aero> wrote: > > Aaaaaaand the results are in. > > On Thu, 10 Jan 2019 at 08:49, Jeroen Roovers <jer@airfi.aero> wrote: > > > > Hi Stanislaw, > > > > On Wed, 9 Jan 2019 at 12:33, Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > > So would be below patch (on top of this set) be a solution for at > > > least to not kill the kernel. Or we looking for something better > > > i.e. watchdog ? > > > > I'll give it a spin. Thanks! > > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > kernel: [ 500.782266] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x07 failed for offset 0x6888 with error -110 > kernel: [ 500.912237] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x6888 with error -110 > kernel: [ 501.042235] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x6110 with error -110 > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > kernel: [ 501.772201] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x07 failed for offset 0x1018 with error -110 > kernel: [ 501.902177] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x1018 with error -110 > kernel: [ 501.972186] ieee80211 phy0: rt2x00usb_vendor_request: Error > - Vendor Request 0x06 failed for offset 0x1910 with error -110 > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: disassociated due > to inactivity > hostapd: wlan0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due > to inactivity (timer DEAUTH/REMOVE) > > rt2x00usb_check_usb_error in your patch is set to > clearDEVICE_STATE_PRESENT after ten errors, but in this case only 6 > errors were seen. Maybe I should set it to 1 as I have never seen an > RT5592 recover from this. > > The system remained relatively stable until after I tried forcibly > removing and then loading the rt2800usb module. A simple `ifconfig` > then triggered a kernel panic. Sadly I couldn't capture it in time but > I did spot that more phyN (up to phy4) devices had been added. Yes, even when some rt2x00usb_vendor_request starts to fail but keeps on trying, the WLAN modules remain unrecoverable except by removing power, so the 10 retries are usually not reached and the device is never removed. Could it be that some operations do succeed, perhaps because the MCU was reset and is now capable of responding to some "vendor requests" but not others? Checking for num_proto_errs > 1 would then make as much sense as num_proto_errs > 10 when running an AP. Maybe it's different for an STA? Kind regards, jer
On Tue, Jan 22, 2019 at 06:32:05PM +0100, Jeroen Roovers wrote: > > then triggered a kernel panic. Sadly I couldn't capture it in time but > > I did spot that more phyN (up to phy4) devices had been added. > > Yes, even when some rt2x00usb_vendor_request starts to fail but keeps > on trying, the WLAN modules remain unrecoverable except by removing > power, so the 10 retries are usually not reached and the device is > never removed. Could it be that some operations do succeed, perhaps > because the MCU was reset and is now capable of responding to some > "vendor requests" but not others? > > Checking for num_proto_errs > 1 would then make as much sense as > num_proto_errs > 10 when running an AP. Maybe it's different for an > STA? Does it make sense to do num_proto_err > 3 check. Would that be helpful on your problem with rt2800usb device? I don't want make num_proto_err > 1 since this will remove device just after 2 errors. Also have you tested tx_power patch with lowering txpower ? Did it improve things ? Stanislaw
diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00.h b/drivers/net/wireless/ralink/rt2x00/rt2x00.h index 94459d5ee01b..88e0dc803227 100644 --- a/drivers/net/wireless/ralink/rt2x00/rt2x00.h +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00.h @@ -1022,6 +1022,7 @@ struct rt2x00_dev { unsigned int extra_tx_headroom; struct usb_anchor *anchor; + unsigned int num_proto_errs; /* Clock for System On Chip devices. */ struct clk *clk; diff --git a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c index 086aad22743d..60b8bccab83d 100644 --- a/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c +++ b/drivers/net/wireless/ralink/rt2x00/rt2x00usb.c @@ -31,6 +31,22 @@ #include "rt2x00.h" #include "rt2x00usb.h" +static bool rt2x00usb_check_usb_error(struct rt2x00_dev *rt2x00dev, int status) +{ + if (status == -ENODEV || status == -ENOENT) + return true; + + if (status == -EPROTO) + rt2x00dev->num_proto_errs++; + else + rt2x00dev->num_proto_errs = 0; + + if (rt2x00dev->num_proto_errs > 10) + return true; + + return false; +} + /* * Interfacing with the HW. */ @@ -57,7 +73,7 @@ int rt2x00usb_vendor_request(struct rt2x00_dev *rt2x00dev, if (status >= 0) return 0; - if (status == -ENODEV || status == -ENOENT) { + if (rt2x00usb_check_usb_error(rt2x00dev, status)) { /* Device has disappeared. */ clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags); break; @@ -321,7 +337,7 @@ static bool rt2x00usb_kick_tx_entry(struct queue_entry *entry, void *data) status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC); if (status) { - if (status == -ENODEV || status == -ENOENT) + if (rt2x00usb_check_usb_error(rt2x00dev, status)) clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags); set_bit(ENTRY_DATA_IO_FAILED, &entry->flags); rt2x00lib_dmadone(entry); @@ -410,7 +426,7 @@ static bool rt2x00usb_kick_rx_entry(struct queue_entry *entry, void *data) status = usb_submit_urb(entry_priv->urb, GFP_ATOMIC); if (status) { - if (status == -ENODEV || status == -ENOENT) + if (rt2x00usb_check_usb_error(rt2x00dev, status)) clear_bit(DEVICE_STATE_PRESENT, &rt2x00dev->flags); set_bit(ENTRY_DATA_IO_FAILED, &entry->flags); rt2x00lib_dmadone(entry);
Some USB host devices/drivers on some conditions can always return EPROTO error on submitted URBs. That can cause infinity loop in the rt2x00 driver. Since we can have single EPROTO errors we can not mark as device as removed to avoid infinity loop. However we can count consecutive EPROTO errors and mark device as removed if get lot of it. I choose number 10 as threshold. Reported-and-tested-by: Randy Oostdyk <linux-kernel@oostdyk.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> --- drivers/net/wireless/ralink/rt2x00/rt2x00.h | 1 + drivers/net/wireless/ralink/rt2x00/rt2x00usb.c | 22 +++++++++++++++++++--- 2 files changed, 20 insertions(+), 3 deletions(-)