Revert "Revert "wifi: ath11k: Enable threaded NAPI""

Message ID 20230809073432.4193-1-johan+linaro@kernel.org (mailing list archive)
State Rejected
Delegated to: Kalle Valo
Series Revert "Revert "wifi: ath11k: Enable threaded NAPI""

Commit Message

Johan Hovold Aug. 9, 2023, 7:34 a.m. UTC
This reverts commit d265ebe41c911314bd273c218a37088835959fa1.

Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
more interrupts received) almost immediately during RX.

Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
Enable threaded NAPI") so that a simple revert is no longer possible.

As commit d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
does not address the underlying issue reported with QCN9074, it seems we
need to reenable threaded NAPI before fixing both bugs properly.

Fixes: d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---

Hi Kalle,

Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
splat once).

I'm supposed to be on holiday this week, but thanks to the rain I gave
rc5 a try and ran into this.

I've added Bjorn, Mani and Konrad on CC who may be able to help with
debugging this further if needed while I'm out-of-office.

Johan


 drivers/net/wireless/ath/ath11k/ahb.c  | 1 +
 drivers/net/wireless/ath/ath11k/pcic.c | 1 +
 2 files changed, 2 insertions(+)

Comments

Manikanta Pubbisetty Aug. 9, 2023, 9:02 a.m. UTC | #1
On 8/9/2023 1:04 PM, Johan Hovold wrote:
> This reverts commit d265ebe41c911314bd273c218a37088835959fa1.
> 
> Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
> more interrupts received) almost immediately during RX.
> 
> Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
> Enable threaded NAPI") so that a simple revert is no longer possible.
> 

This is getting as weird as it gets :)

> As commit d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
> does not address the underlying issue reported with QCN9074, it seems we
> need to reenable threaded NAPI before fixing both bugs properly.
> 

It seems that the revert has actually solved the issue reported with 
QCN9074.

https://bugzilla.kernel.org/show_bug.cgi?id=217536

We have been trying to reproduce the problem on X86+QCN9074 (with
threaded NAPI) for quite some time, but have not managed to yet.

Enabling or disabling threaded NAPI is a simple change, so I'm surprised
to hear that interrupts are blocked when threaded NAPI is disabled.

Which chip does the Lenovo ThinkPad X13s have?

Thanks,
Manikanta
Johan Hovold Aug. 9, 2023, 9:16 a.m. UTC | #2
On Wed, Aug 09, 2023 at 02:32:37PM +0530, Manikanta Pubbisetty wrote:
> On 8/9/2023 1:04 PM, Johan Hovold wrote:
> > This reverts commit d265ebe41c911314bd273c218a37088835959fa1.
> > 
> > Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
> > more interrupts received) almost immediately during RX.
> > 
> > Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
> > Enable threaded NAPI") so that a simple revert is no longer possible.
> > 
> 
> This is getting as weird as it gets :)
> 
> > As commit d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
> > does not address the underlying issue reported with QCN9074, it seems we
> > need to reenable threaded NAPI before fixing both bugs properly.
> > 
> 
> It seems that the revert has actually solved the issue reported with 
> QCN9074.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=217536

Sure, but it's only a workaround as the underlying cause has not been
identified.

> We have been trying to reproduce the problem on X86+QCN9074 (with
> threaded NAPI) for quite some time, but have not managed to yet.
>
> Enabling or disabling threaded NAPI is a simple change, so I'm surprised
> to hear that interrupts are blocked when threaded NAPI is disabled.

It sounds to me like the driver's locking is broken if moving to softirq
processing hangs the machine like this. But I have not had time to try
to track it down, beyond verifying that reenabling threaded NAPI makes
the problem go away.

> Which chip does the Lenovo ThinkPad X13s have?

It's a WCN6855 (QCNFA765).

Johan
Manikanta Pubbisetty Aug. 10, 2023, 4:33 a.m. UTC | #3
On 8/9/2023 2:46 PM, Johan Hovold wrote:
> On Wed, Aug 09, 2023 at 02:32:37PM +0530, Manikanta Pubbisetty wrote:
>> On 8/9/2023 1:04 PM, Johan Hovold wrote:
>>> This reverts commit d265ebe41c911314bd273c218a37088835959fa1.
>>>
>>> Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
>>> more interrupts received) almost immediately during RX.
>>>
>>> Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
>>> Enable threaded NAPI") so that a simple revert is no longer possible.
>>>
>>
>> This is getting as weird as it gets :)
>>
>>> As commit d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
>>> does not address the underlying issue reported with QCN9074, it seems we
>>> need to reenable threaded NAPI before fixing both bugs properly.
>>>
>>
>> It seems that the revert has actually solved the issue reported with
>> QCN9074.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=217536
> 
> Sure, but it's only a workaround as the underlying cause has not been
> identified.
> 
>> We have been trying to reproduce the problem on X86+QCN9074 (with
>> threaded NAPI) for quite some time, but have not managed to yet.
>>
>> Enabling or disabling threaded NAPI is a simple change, so I'm surprised
>> to hear that interrupts are blocked when threaded NAPI is disabled.
> 
> It sounds to me like the driver's locking is broken if moving to softirq
> processing hangs the machine like this. But I have not had time to try
> to track it down, beyond verifying that reenabling threaded NAPI makes
> the problem go away.
> 
>> Which chip does the Lenovo ThinkPad X13s have?
> 
> It's a WCN6855 (QCNFA765).
> 

WCN6855 and QCN9074 share the same driver code base since both are PCIe
devices. It is surprising that one works and the other does not. Do you
have a dmesg log from when this problem occurred?

We are working to root-cause the original problem. The obstacle so far
is that we have not been able to reproduce it at Qualcomm. We are
planning to work with the reporter to get more logs.

Thanks,
Manikanta
Manikanta Pubbisetty Aug. 10, 2023, 4:46 a.m. UTC | #4
On 8/9/2023 2:46 PM, Johan Hovold wrote:
> On Wed, Aug 09, 2023 at 02:32:37PM +0530, Manikanta Pubbisetty wrote:
>> On 8/9/2023 1:04 PM, Johan Hovold wrote:
>>> This reverts commit d265ebe41c911314bd273c218a37088835959fa1.
>>>
>>> Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
>>> more interrupts received) almost immediately during RX.
>>>
>>> Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
>>> Enable threaded NAPI") so that a simple revert is no longer possible.
>>>
>>
>> This is getting as weird as it gets :)
>>
>>> As commit d265ebe41c91 ("Revert "wifi: ath11k: Enable threaded NAPI"")
>>> does not address the underlying issue reported with QCN9074, it seems we
>>> need to reenable threaded NAPI before fixing both bugs properly.
>>>
>>
>> It seems that the revert has actually solved the issue reported with
>> QCN9074.
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=217536
> 
> Sure, but it's only a workaround as the underlying cause has not been
> identified.
> 
>> We have been trying to reproduce the problem on X86+QCN9074 (with
>> threaded NAPI) for quite some time, but have not managed to yet.
>>
>> Enabling or disabling threaded NAPI is a simple change, so I'm surprised
>> to hear that interrupts are blocked when threaded NAPI is disabled.
> 
> It sounds to me like the driver's locking is broken if moving to softirq
> processing hangs the machine like this. But I have not had time to try
> to track it down, beyond verifying that reenabling threaded NAPI makes
> the problem go away.
> 
>> Which chip does the Lenovo ThinkPad X13s have?
> 
> It's a WCN6855 (QCNFA765).
> 

Also, it is worth trying this patch:
https://patchwork.kernel.org/project/linux-wireless/patch/20230601033840.2997-1-quic_bqiang@quicinc.com/
It seems to fix a known interrupt issue on WCN6855. Could you please
give it a try?

Thanks,
Manikanta
Johan Hovold Aug. 21, 2023, 1:27 p.m. UTC | #5
On Thu, Aug 10, 2023 at 10:03:55AM +0530, Manikanta Pubbisetty wrote:
> On 8/9/2023 2:46 PM, Johan Hovold wrote:
> > On Wed, Aug 09, 2023 at 02:32:37PM +0530, Manikanta Pubbisetty wrote:
> > > On 8/9/2023 1:04 PM, Johan Hovold wrote:
> > > > This reverts commit d265ebe41c911314bd273c218a37088835959fa1.
> > > > 
> > > > Disabling threaded NAPI causes the Lenovo ThinkPad X13s to hang (e.g. no
> > > > more interrupts received) almost immediately during RX.
> > > > 
> > > > Apparently something broke since commit 13aa2fb692d3 ("wifi: ath11k:
> > > > Enable threaded NAPI") so that a simple revert is no longer possible.

> > > Which chip does the Lenovo ThinkPad X13s have?
> > 
> > It's a WCN6855 (QCNFA765).
> 
> WCN6855 and QCN9074 share the same driver code base since both are PCIe
> devices. It is surprising that one works and the other does not. Do you
> have a dmesg log from when this problem occurred?

I can't access the logs after I hit this bug (e.g. as there are no
interrupts received from the keyboard) but when I trigger this from the
console, there is nothing logged when the hang happens.

Later, secondary errors are logged from other drivers that are no longer
receiving interrupts either, and RCU detects a stall as I mentioned
elsewhere.

Johan
Johan Hovold Aug. 21, 2023, 1:29 p.m. UTC | #6
On Thu, Aug 10, 2023 at 10:16:21AM +0530, Manikanta Pubbisetty wrote:

> Also, it is worth trying this patch:
> https://patchwork.kernel.org/project/linux-wireless/patch/20230601033840.2997-1-quic_bqiang@quicinc.com/
> It seems to fix a known interrupt issue on WCN6855. Could you please
> give it a try?

That patch makes no difference as the X13s uses multiple MSIs (so
ath11k_pci_set_irq_affinity_hint() is a no-op).

Johan
Johan Hovold Aug. 21, 2023, 1:41 p.m. UTC | #7
Hi Kalle,

On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:

> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
> splat once).
> 
> I'm supposed to be on holiday this week, but thanks to the rain I gave
> rc5 a try and ran into this.
> 
> I've added Bjorn, Mani and Konrad on CC who may be able to help with
> debugging this further if needed while I'm out-of-office.

Back from my holiday now, and this regression is still there with
6.5-rc7.

Any chance we can get the offending commit reverted before 6.5 is
released? 

I'll take a closer look at this meanwhile.

Johan
Kalle Valo Aug. 22, 2023, 12:56 p.m. UTC | #8
Johan Hovold <johan@kernel.org> writes:

> Hi Kalle,
>
> On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
>
>> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
>> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
>> splat once).
>> 
>> I'm supposed to be on holiday this week, but thanks to the rain I gave
>> rc5 a try and ran into this.
>> 
>> I've added Bjorn, Mani and Konrad on CC who may be able to help with
>> debugging this further if needed while I'm out-of-office.
>
> Back from my holiday now, and this regression is still there with
> 6.5-rc7.

I was also away but back now.

> Any chance we can get the offending commit reverted before 6.5 is
> released? 

The problem here is that would break QCN9074 again so there is no good
solution. I suspect we have a fundamental issue in ath11k which we just
haven't discovered yet. I would prefer to get to the bottom of this
before reverting anything.

> I'll take a closer look at this meanwhile.

Thanks, much appreciated. Did you try enabling all kernel debug
features, maybe they would give some hints?
Johan Hovold Aug. 22, 2023, 1:44 p.m. UTC | #9
On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
> Johan Hovold <johan@kernel.org> writes:
> > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
> >
> >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
> >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
> >> splat once).

> > Any chance we can get the offending commit reverted before 6.5 is
> > released? 
> 
> The problem here is that would break QCN9074 again so there is no good
> solution. I suspect we have a fundamental issue in ath11k which we just
> haven't discovered yet. I would prefer to get to the bottom of this
> before reverting anything.

Sure, ideally we can find and fix the underlying issues these next few
days, but since this regression was introduced in rc5 in an attempt to
address the QCN9074 issue which has been there since 6.1 I think we
need to revert otherwise. 

> > I'll take a closer look at this meanwhile.
> 
> Thanks, much appreciated. Did you try enabling all kernel debug
> features, maybe they would give some hints?

Yes, I have a bunch of those enabled. Lockdep does not complain, but the
hard lockup detector triggers and it looks like CPU0 (which handles most
interrupts on this machine currently) has got stuck while processing an
interrupt.

RCU also detects the stall on CPU0 and provides a task dump for
ksoftirqd with the following call trace:

	__switch_to
	run_ksoftirqd
	smpboot_thread_fn
	kthread
	ret_from_fork

I just tried the out-of-tree pseudo NMI series [0] to get a stack trace,
but CPU0 does not respond to those either when I hit this.

Note that it takes a bit of RX to trigger this, but I hit it as soon as
I try to download something substantial (e.g. after a couple of MB).

Johan

[0] https://lore.kernel.org/lkml/20230419225604.21204-1-dianders@chromium.org/
Johan Hovold Aug. 26, 2023, 3:53 p.m. UTC | #10
On Tue, Aug 22, 2023 at 03:44:45PM +0200, Johan Hovold wrote:
> On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
> > Johan Hovold <johan@kernel.org> writes:
> > > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
> > >
> > >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
> > >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
> > >> splat once).
> 
> > > Any chance we can get the offending commit reverted before 6.5 is
> > > released? 
> > 
> > The problem here is that would break QCN9074 again so there is no good
> > solution. I suspect we have a fundamental issue in ath11k which we just
> > haven't discovered yet. I would prefer to get to the bottom of this
> > before reverting anything.
> 
> Sure, ideally we can find and fix the underlying issues these next few
> days, but since this regression was introduced in rc5 in an attempt to
> address the QCN9074 issue which has been there since 6.1 I think we
> need to revert otherwise. 

I've managed to track down what causes the hang on the X13s after
disabling threaded NAPI. Turns out to be a severe regression in the
genirq code that causes the software resend tasklet to loop
indefinitely.

I've just sent a fix here:

	https://lore.kernel.org/lkml/20230826154004.1417-1-johan+linaro@kernel.org/

I've also made some progress on the QCN9074 hang, but keeping the
threaded NAPI revert for now is indeed the right thing to do.

Johan
Kalle Valo Aug. 29, 2023, 11:47 a.m. UTC | #11
Johan Hovold <johan@kernel.org> writes:

> On Tue, Aug 22, 2023 at 03:44:45PM +0200, Johan Hovold wrote:
>> On Tue, Aug 22, 2023 at 03:56:24PM +0300, Kalle Valo wrote:
>> > Johan Hovold <johan@kernel.org> writes:
>> > > On Wed, Aug 09, 2023 at 09:34:32AM +0200, Johan Hovold wrote:
>> > >
>> > >> Disabling threaded NAPI caused a severe regression in 6.5-rc5 by making
>> > >> the X13s completely unusable (e.g. no keyboard input, I've seen an RCU
>> > >> splat once).
>> 
>> > > Any chance we can get the offending commit reverted before 6.5 is
>> > > released? 
>> > 
>> > The problem here is that would break QCN9074 again so there is no good
>> > solution. I suspect we have a fundamental issue in ath11k which we just
>> > haven't discovered yet. I would prefer to get to the bottom of this
>> > before reverting anything.
>> 
>> Sure, ideally we can find and fix the underlying issues these next few
>> days, but since this regression was introduced in rc5 in an attempt to
>> address the QCN9074 issue which has been there since 6.1 I think we
>> need to revert otherwise. 
>
> I've managed to track down what causes the hang on the X13s after
> disabling threaded NAPI. Turns out to be a severe regression in the
> genirq code that causes the software resend tasklet to loop
> indefinitely.
>
> I've just sent a fix here:
>
> 	https://lore.kernel.org/lkml/20230826154004.1417-1-johan+linaro@kernel.org/

Oh wow, that's a tricky bug :o I'm sure it was not easy to find.

> I've also made some progress on the QCN9074 hang, but keeping the
> threaded NAPI revert for now is indeed the right thing to do.

Ok, thanks for the update and for also looking at this problem. Very much
appreciated! I'm sure we have a major bug lurking somewhere in ath11k,
and it would be so good to fix that.

Patch

diff --git a/drivers/net/wireless/ath/ath11k/ahb.c b/drivers/net/wireless/ath/ath11k/ahb.c
index 139da578831a..1cebba7889d7 100644
--- a/drivers/net/wireless/ath/ath11k/ahb.c
+++ b/drivers/net/wireless/ath/ath11k/ahb.c
@@ -376,6 +376,7 @@  static void ath11k_ahb_ext_irq_enable(struct ath11k_base *ab)
 		struct ath11k_ext_irq_grp *irq_grp = &ab->ext_irq_grp[i];
 
 		if (!irq_grp->napi_enabled) {
+			dev_set_threaded(&irq_grp->napi_ndev, true);
 			napi_enable(&irq_grp->napi);
 			irq_grp->napi_enabled = true;
 		}
diff --git a/drivers/net/wireless/ath/ath11k/pcic.c b/drivers/net/wireless/ath/ath11k/pcic.c
index c63083633b37..c899616fbee4 100644
--- a/drivers/net/wireless/ath/ath11k/pcic.c
+++ b/drivers/net/wireless/ath/ath11k/pcic.c
@@ -466,6 +466,7 @@  void ath11k_pcic_ext_irq_enable(struct ath11k_base *ab)
 		struct ath11k_ext_irq_grp *irq_grp = &ab->ext_irq_grp[i];
 
 		if (!irq_grp->napi_enabled) {
+			dev_set_threaded(&irq_grp->napi_ndev, true);
 			napi_enable(&irq_grp->napi);
 			irq_grp->napi_enabled = true;
 		}