wlcore: Fix BUG with clear completion on timeout

Message ID 20181001213805.86511-1-tony@atomide.com (mailing list archive)
State Accepted
Commit 4e651bad848955d88b29a568bfbfb4b831270e16
Delegated to: Kalle Valo

Commit Message

Tony Lindgren Oct. 1, 2018, 9:38 p.m. UTC
We do not currently clear wl->elp_compl on ELP timeout, so we have a bogus
lingering pointer that wlcore_irq will then try to access after recovery
is done:

BUG: spinlock bad magic on CPU#1, irq/255-wl12xx/580
...
(spin_dump) from [<c01b9344>] (do_raw_spin_lock+0xc8/0x124)
(do_raw_spin_lock) from [<c09b3970>] (_raw_spin_lock_irqsave+0x68/0x74)
(_raw_spin_lock_irqsave) from [<c01a02f0>] (complete+0x24/0x58)
(complete) from [<bf572610>] (wlcore_irq+0x48/0x17c [wlcore])
(wlcore_irq [wlcore]) from [<c01c5efc>] (irq_thread_fn+0x2c/0x64)
(irq_thread_fn) from [<c01c623c>] (irq_thread+0x148/0x290)
(irq_thread) from [<c016b4b0>] (kthread+0x160/0x17c)
(kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
...

After that the system will hang. Let's fix this by adding a flag for
recovery and moving the recovery work call to the error handling
section.

We also want to set WL1271_FLAG_INTENDED_FW_RECOVERY, actually clear it
in wl1271_recovery_work(), and downgrade the error to a warning to
prevent overly verbose output.

Cc: Eyal Reizer <eyalr@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
---
 drivers/net/wireless/ti/wlcore/main.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

Comments

Kalle Valo Oct. 5, 2018, 8:33 a.m. UTC | #1
Tony Lindgren <tony@atomide.com> wrote:

> We do not currently clear wl->elp_compl on ELP timeout and we have bogus
> lingering pointer that wlcore_irq then will try to access after recovery
> is done:
> 
> BUG: spinlock bad magic on CPU#1, irq/255-wl12xx/580
> ...
> (spin_dump) from [<c01b9344>] (do_raw_spin_lock+0xc8/0x124)
> (do_raw_spin_lock) from [<c09b3970>] (_raw_spin_lock_irqsave+0x68/0x74)
> (_raw_spin_lock_irqsave) from [<c01a02f0>] (complete+0x24/0x58)
> (complete) from [<bf572610>] (wlcore_irq+0x48/0x17c [wlcore])
> (wlcore_irq [wlcore]) from [<c01c5efc>] (irq_thread_fn+0x2c/0x64)
> (irq_thread_fn) from [<c01c623c>] (irq_thread+0x148/0x290)
> (irq_thread) from [<c016b4b0>] (kthread+0x160/0x17c)
> (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
> ...
> 
> After that the system will hang. Let's fix this by adding a flag for
> recovery and moving the recovery work call to the error handling
> section.
> 
> And we want to set WL1271_FLAG_INTENDED_FW_RECOVERY and actually clear
> it too in wl1271_recovery_work() and just downgrade the error to a
> warning to prevent overly verbose output.
> 
> Cc: Eyal Reizer <eyalr@ti.com>
> Signed-off-by: Tony Lindgren <tony@atomide.com>

Patch applied to wireless-drivers-next.git, thanks.

4e651bad8489 wlcore: Fix BUG with clear completion on timeout

Adam Ford Nov. 30, 2018, 1:16 p.m. UTC | #2
On Fri, Oct 5, 2018 at 3:33 AM Kalle Valo <kvalo@codeaurora.org> wrote:
>
> Tony Lindgren <tony@atomide.com> wrote:
>
> > We do not currently clear wl->elp_compl on ELP timeout and we have bogus
> > lingering pointer that wlcore_irq then will try to access after recovery
> > is done:
> >
> > BUG: spinlock bad magic on CPU#1, irq/255-wl12xx/580
> > ...
> > (spin_dump) from [<c01b9344>] (do_raw_spin_lock+0xc8/0x124)
> > (do_raw_spin_lock) from [<c09b3970>] (_raw_spin_lock_irqsave+0x68/0x74)
> > (_raw_spin_lock_irqsave) from [<c01a02f0>] (complete+0x24/0x58)
> > (complete) from [<bf572610>] (wlcore_irq+0x48/0x17c [wlcore])
> > (wlcore_irq [wlcore]) from [<c01c5efc>] (irq_thread_fn+0x2c/0x64)
> > (irq_thread_fn) from [<c01c623c>] (irq_thread+0x148/0x290)
> > (irq_thread) from [<c016b4b0>] (kthread+0x160/0x17c)
> > (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
> > ...
> >
> > After that the system will hang. Let's fix this by adding a flag for
> > recovery and moving the recovery work call to the error handling
> > section.
> >
> > And we want to set WL1271_FLAG_INTENDED_FW_RECOVERY and actually clear
> > it too in wl1271_recovery_work() and just downgrade the error to a
> > warning to prevent overly verbose output.
> >

Do we know how far back this bug goes and which versions need this
patch applied to it?  I have seen something similar on 4.19, but I
haven't tried this patch to fix it.  It wasn't clear to me if this is
linux-next or 4.19 or something different.

thanks

adam
> > Cc: Eyal Reizer <eyalr@ti.com>
> > Signed-off-by: Tony Lindgren <tony@atomide.com>
>
> Patch applied to wireless-drivers-next.git, thanks.
>
> 4e651bad8489 wlcore: Fix BUG with clear completion on timeout
>
> --
> https://patchwork.kernel.org/patch/10622767/
>
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
>
Tony Lindgren Nov. 30, 2018, 6:32 p.m. UTC | #3
Hi,

* Adam Ford <aford173@gmail.com> [181130 13:16]:
> On Fri, Oct 5, 2018 at 3:33 AM Kalle Valo <kvalo@codeaurora.org> wrote:
> >
> > Tony Lindgren <tony@atomide.com> wrote:
> >
> > > We do not currently clear wl->elp_compl on ELP timeout and we have bogus
> > > lingering pointer that wlcore_irq then will try to access after recovery
> > > is done:
> > >
> > > BUG: spinlock bad magic on CPU#1, irq/255-wl12xx/580
> > > ...
> > > (spin_dump) from [<c01b9344>] (do_raw_spin_lock+0xc8/0x124)
> > > (do_raw_spin_lock) from [<c09b3970>] (_raw_spin_lock_irqsave+0x68/0x74)
> > > (_raw_spin_lock_irqsave) from [<c01a02f0>] (complete+0x24/0x58)
> > > (complete) from [<bf572610>] (wlcore_irq+0x48/0x17c [wlcore])
> > > (wlcore_irq [wlcore]) from [<c01c5efc>] (irq_thread_fn+0x2c/0x64)
> > > (irq_thread_fn) from [<c01c623c>] (irq_thread+0x148/0x290)
> > > (irq_thread) from [<c016b4b0>] (kthread+0x160/0x17c)
> > > (kthread) from [<c01010b4>] (ret_from_fork+0x14/0x20)
> > > ...
> > >
> > > After that the system will hang. Let's fix this by adding a flag for
> > > recovery and moving the recovery work call to the error handling
> > > section.
> > >
> > > And we want to set WL1271_FLAG_INTENDED_FW_RECOVERY and actually clear
> > > it too in wl1271_recovery_work() and just downgrade the error to a
> > > warning to prevent overly verbose output.
> > >
> 
> Do we know how far back this bug goes and which versions need this
> patch applied to it?  I have seen something similar on 4.19, but I
> haven't tried this patch to fix it.  It wasn't clear to me if this is
> linux-next or 4.19 or something different.

I'm not sure if this is needed for v4.19 as the wakeirq patch
is not there. Maybe give it a try and see if it helps with
the issue you're seeing, then request inclusion for stable if
it helps?

BTW any wlcore issues with earlier kernels should be debugged and
tested separately. Fixes done after changing wlcore to use PM runtime
and wakeirq may be incomplete for earlier kernels; that's the two
commits below and any changes related to them.

And in general there seem to be two categories of common issues
with wlcore that I've seen: the GPIO interrupt not behaving with the
SoC, or old firmware being used for wlcore.

Regards,

Tony

8< -----------------
3c83dd577c7f ("wlcore: Add support for optional wakeirq")
fa2648a34e73 ("wlcore: Add support for runtime PM")

Patch

diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
--- a/drivers/net/wireless/ti/wlcore/main.c
+++ b/drivers/net/wireless/ti/wlcore/main.c
@@ -957,6 +957,8 @@  static void wl1271_recovery_work(struct work_struct *work)
 	BUG_ON(wl->conf.recovery.bug_on_recovery &&
 	       !test_bit(WL1271_FLAG_INTENDED_FW_RECOVERY, &wl->flags));
 
+	clear_bit(WL1271_FLAG_INTENDED_FW_RECOVERY, &wl->flags);
+
 	if (wl->conf.recovery.no_recovery) {
 		wl1271_info("No recovery (chosen on module load). Fw will remain stuck.");
 		goto out_unlock;
@@ -6710,6 +6712,7 @@  static int __maybe_unused wlcore_runtime_resume(struct device *dev)
 	int ret;
 	unsigned long start_time = jiffies;
 	bool pending = false;
+	bool recovery = false;
 
 	/* Nothing to do if no ELP mode requested */
 	if (!test_bit(WL1271_FLAG_IN_ELP, &wl->flags))
@@ -6726,7 +6729,7 @@  static int __maybe_unused wlcore_runtime_resume(struct device *dev)
 
 	ret = wlcore_raw_write32(wl, HW_ACCESS_ELP_CTRL_REG, ELPCTRL_WAKE_UP);
 	if (ret < 0) {
-		wl12xx_queue_recovery_work(wl);
+		recovery = true;
 		goto err;
 	}
 
@@ -6734,11 +6737,12 @@  static int __maybe_unused wlcore_runtime_resume(struct device *dev)
 		ret = wait_for_completion_timeout(&compl,
 			msecs_to_jiffies(WL1271_WAKEUP_TIMEOUT));
 		if (ret == 0) {
-			wl1271_error("ELP wakeup timeout!");
-			wl12xx_queue_recovery_work(wl);
+			wl1271_warning("ELP wakeup timeout!");
 
 			/* Return no error for runtime PM for recovery */
-			return 0;
+			ret = 0;
+			recovery = true;
+			goto err;
 		}
 	}
 
@@ -6753,6 +6757,12 @@  static int __maybe_unused wlcore_runtime_resume(struct device *dev)
 	spin_lock_irqsave(&wl->wl_lock, flags);
 	wl->elp_compl = NULL;
 	spin_unlock_irqrestore(&wl->wl_lock, flags);
+
+	if (recovery) {
+		set_bit(WL1271_FLAG_INTENDED_FW_RECOVERY, &wl->flags);
+		wl12xx_queue_recovery_work(wl);
+	}
+
 	return ret;
 }
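
For readers who want to trace the control flow outside the kernel, the
following is a minimal userspace sketch of the pattern the patch applies:
record that recovery is needed, clear the completion pointer first, and only
then queue the recovery work. Everything prefixed with fake_ is a hypothetical
stand-in and not part of wlcore. The sketch assumes the completion is local to
the resume function (as the use of &compl in the patch suggests), omits the
wl_lock spinlock, and funnels every path, including success, through a single
cleanup label, which may differ from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the wlcore definitions. */
struct fake_completion { int done; };

struct fake_wl {
	struct fake_completion *elp_compl; /* pointer the IRQ handler would complete() */
	bool intended_fw_recovery;         /* models WL1271_FLAG_INTENDED_FW_RECOVERY */
	int recovery_queued;               /* counts queued recovery work items */
};

/* Models wl12xx_queue_recovery_work(): must never run with a stale elp_compl. */
static void fake_queue_recovery(struct fake_wl *wl)
{
	assert(wl->elp_compl == NULL);
	wl->recovery_queued++;
}

/*
 * Models the patched resume control flow.
 * write_ret:   result of the wake-up register write (< 0 on failure).
 * irq_arrived: whether the wake-up completion fired before the timeout.
 */
static int fake_runtime_resume(struct fake_wl *wl, int write_ret, bool irq_arrived)
{
	struct fake_completion compl_onstack = { 0 }; /* lives on this stack frame */
	bool recovery = false;
	int ret;

	wl->elp_compl = &compl_onstack;

	ret = write_ret;
	if (ret < 0) {
		recovery = true;
		goto err;
	}

	if (!irq_arrived) {
		/* Timeout: report no error to the caller, but request recovery. */
		ret = 0;
		recovery = true;
		goto err;
	}

	ret = 0;
err:
	/* Drop the pointer to the on-stack completion before anything else. */
	wl->elp_compl = NULL;

	if (recovery) {
		wl->intended_fw_recovery = true;
		fake_queue_recovery(wl);
	}
	return ret;
}
```

Calling fake_runtime_resume() with irq_arrived set to false models the ELP
timeout case: it returns 0, leaves elp_compl NULL, sets the flag, and queues
exactly one recovery item. Before the patch, the timeout path returned
directly and never reached the cleanup that clears the pointer.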