diff mbox series

[net-next,1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI

Message ID 20220203184031.1074008-1-yannick.vignon@oss.nxp.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next,1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 7 this patch: 7
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 18 this patch: 18
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 12 this patch: 12
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Yannick Vignon Feb. 3, 2022, 6:40 p.m. UTC
From: Yannick Vignon <yannick.vignon@nxp.com>

If NAPI was not scheduled from interrupt or softirq,
__raise_softirq_irqoff would mark the softirq pending, but not
wake up ksoftirqd. With force threaded IRQs, this is
compensated by the fact that the interrupt handlers are
protected inside a local_bh_disable()/local_bh_enable()
section, and bh_enable will call do_softirq if needed. With
normal threaded IRQs however, this is no longer the case
(unless the interrupt handler itself calls local_bh_enable()),
which results in a pending softirq not being handled, and the
following message being printed out from tick-sched.c:
"NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"

Call raise_softirq_irqoff instead to make sure ksoftirqd is
woken up in such a case, ensuring __napi_schedule, etc behave
normally in more situations than just from an interrupt,
softirq or from within a bh_disable/bh_enable section.

Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
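
For reference, the difference between the two helpers is that
raise_softirq_irqoff() additionally wakes ksoftirqd when the softirq is
raised from task context. Roughly, from kernel/softirq.c:

	void raise_softirq_irqoff(unsigned int nr)
	{
		/* Mark the softirq pending, as __raise_softirq_irqoff()
		 * alone would do.
		 */
		__raise_softirq_irqoff(nr);

		/* In an interrupt or softirq the pending work runs on
		 * exit; otherwise wake ksoftirqd so it runs soon.
		 */
		if (!in_interrupt())
			wakeup_softirqd();
	}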

Comments

Eric Dumazet Feb. 3, 2022, 7:08 p.m. UTC | #1
On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
<yannick.vignon@oss.nxp.com> wrote:
>
> From: Yannick Vignon <yannick.vignon@nxp.com>
>
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()
> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),
> which results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>
> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc behave
> normally in more situations than just from an interrupt,
> softirq or from within a bh_disable/bh_enable section.
>

This is buggy. NAPI is called from the right context.

Can you provide a stack trace or something, so that the buggy driver
can be fixed ?

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f6..f93b3173454c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>         }
>
>         list_add_tail(&napi->poll_list, &sd->poll_list);
> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>  }
>
>  #ifdef CONFIG_RPS
> --
> 2.25.1
>
Sebastian Andrzej Siewior Feb. 3, 2022, 7:41 p.m. UTC | #2
On 2022-02-03 19:40:30 [+0100], Yannick Vignon wrote:
> From: Yannick Vignon <yannick.vignon@nxp.com>
> 
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()

This is not compensation; it is one of the reasons why it has been added.

> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),

Exactly.

> which results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"

Yes. This also includes various other call sites.

> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc behave
> normally in more situations than just from an interrupt,
> softirq or from within a bh_disable/bh_enable section.

I would suggest adding a bh dis/en section around the function that is
known to raise BH. This change to ____napi_schedule(), as you suggest,
will wake ksoftirqd, which is not what you want. What you want is to
process NAPI in your current context.
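
A minimal sketch of that suggestion for an explicitly threaded IRQ
handler (the foo_* names are hypothetical):

	static irqreturn_t foo_irq_thread_fn(int irq, void *data)
	{
		struct foo_priv *priv = data;

		local_bh_disable();
		napi_schedule(&priv->napi);
		local_bh_enable();	/* a raised NET_RX_SOFTIRQ runs here */

		return IRQ_HANDLED;
	}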

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>

Sebastian
Yannick Vignon Feb. 3, 2022, 11:40 p.m. UTC | #3
On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> <yannick.vignon@oss.nxp.com> wrote:
>>
>> From: Yannick Vignon <yannick.vignon@nxp.com>
>>
>> If NAPI was not scheduled from interrupt or softirq,
>> __raise_softirq_irqoff would mark the softirq pending, but not
>> wake up ksoftirqd. With force threaded IRQs, this is
>> compensated by the fact that the interrupt handlers are
>> protected inside a local_bh_disable()/local_bh_enable()
>> section, and bh_enable will call do_softirq if needed. With
>> normal threaded IRQs however, this is no longer the case
>> (unless the interrupt handler itself calls local_bh_enable()),
>> which results in a pending softirq not being handled, and the
>> following message being printed out from tick-sched.c:
>> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>>
>> Call raise_softirq_irqoff instead to make sure ksoftirqd is
>> woken up in such a case, ensuring __napi_schedule, etc behave
>> normally in more situations than just from an interrupt,
>> softirq or from within a bh_disable/bh_enable section.
>>
> 
> This is buggy. NAPI is called from the right context.
> 
> Can you provide a stack trace or something, so that the buggy driver
> can be fixed ?
> 

Maybe some background on how I came to this would be helpful. I have 
been chasing down sources of latencies in processing rx packets on a 
PREEMPT_RT kernel and the stmmac driver. I observed that the main ones 
were bh_dis/en sections, preventing even my high-prio, (force-)threaded 
rx irq from being handled in a timely manner. Given that explicitly 
threaded irq handlers were not enclosed in a bh_dis/en section, and that 
from what I saw the stmmac interrupt handler didn't need such a 
protection anyway, I modified the stmmac driver to request threaded 
interrupts. This worked, save for that "NOHZ" message: because 
__napi_schedule was now called from a kernel thread context, the softirq 
was no longer triggered.
(note that the problem solves itself when enabling threaded NAPI)
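
The stmmac change described above would look roughly like the following
(a hedged sketch; the exact flags and arguments in the real driver may
differ):

	/* With a NULL primary handler the core installs a default one
	 * and runs stmmac_interrupt() from a dedicated IRQ thread.
	 */
	ret = request_threaded_irq(dev->irq, NULL, stmmac_interrupt,
				   IRQF_ONESHOT, dev->name, dev);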

Is there a rule saying we shouldn't call __napi_schedule from a regular 
kernel thread, and in particular a threaded interrupt handler?

>> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
>> ---
>>   net/core/dev.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f6..f93b3173454c 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>          }
>>
>>          list_add_tail(&napi->poll_list, &sd->poll_list);
>> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>   }
>>
>>   #ifdef CONFIG_RPS
>> --
>> 2.25.1
>>
Eric Dumazet Feb. 3, 2022, 11:57 p.m. UTC | #4
On Thu, Feb 3, 2022 at 3:40 PM Yannick Vignon
<yannick.vignon@oss.nxp.com> wrote:
>
> On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> > On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> > <yannick.vignon@oss.nxp.com> wrote:
> >>
> >> From: Yannick Vignon <yannick.vignon@nxp.com>
> >>
> >> If NAPI was not scheduled from interrupt or softirq,
> >> __raise_softirq_irqoff would mark the softirq pending, but not
> >> wake up ksoftirqd. With force threaded IRQs, this is
> >> compensated by the fact that the interrupt handlers are
> >> protected inside a local_bh_disable()/local_bh_enable()
> >> section, and bh_enable will call do_softirq if needed. With
> >> normal threaded IRQs however, this is no longer the case
> >> (unless the interrupt handler itself calls local_bh_enable()),
> >> which results in a pending softirq not being handled, and the
> >> following message being printed out from tick-sched.c:
> >> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
> >>
> >> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> >> woken up in such a case, ensuring __napi_schedule, etc behave
> >> normally in more situations than just from an interrupt,
> >> softirq or from within a bh_disable/bh_enable section.
> >>
> >
> > This is buggy. NAPI is called from the right context.
> >
> > Can you provide a stack trace or something, so that the buggy driver
> > can be fixed ?
> >
>
> Maybe some background on how I came to this would be helpful. I have
> been chasing down sources of latencies in processing rx packets on a
> PREEMPT_RT kernel and the stmmac driver. I observed that the main ones
> were bh_dis/en sections, preventing even my high-prio, (force-)threaded
> rx irq from being handled in a timely manner. Given that explicitly
> threaded irq handlers were not enclosed in a bh_dis/en section, and that
> from what I saw the stmmac interrupt handler didn't need such a
> protection anyway, I modified the stmmac driver to request threaded
> interrupts. This worked, save for that "NOHZ" message: because
> __napi_schedule was now called from a kernel thread context, the softirq
> was no longer triggered.
> (note that the problem solves itself when enabling threaded NAPI)
>
> Is there a rule saying we shouldn't call __napi_schedule from a regular
> kernel thread, and in particular a threaded interrupt handler?

The rule is that you need to be in a safe context before calling
__napi_schedule().

This has been the case for more than a decade.

We are not going to slow down the stack just in case a process context
does not care about BH.

Please check:

commit ec13ee80145ccb95b00e6e610044bbd94a170051
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Wed May 16 10:57:12 2012 +0300

    virtio_net: invoke softirqs after __napi_schedule


>
> >> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> >> ---
> >>   net/core/dev.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 1baab07820f6..f93b3173454c 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> >>          }
> >>
> >>          list_add_tail(&napi->poll_list, &sd->poll_list);
> >> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >>   }
> >>
> >>   #ifdef CONFIG_RPS
> >> --
> >> 2.25.1
> >>
>
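
For reference, the virtio_net fix Eric points to wrapped the schedule in
a bh-off/on section, so that a softirq raised from process context is
executed on local_bh_enable(). Roughly:

	local_bh_disable();
	__napi_schedule(&vi->napi);
	local_bh_enable();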
Jakub Kicinski Feb. 4, 2022, 1:09 a.m. UTC | #5
On Fri, 4 Feb 2022 00:40:41 +0100 Yannick Vignon wrote:
> Maybe some background on how I came to this would be helpful. I have 
> been chasing down sources of latencies in processing rx packets on a 
> PREEMPT_RT kernel and the stmmac driver. I observed that the main ones 
> were bh_dis/en sections, preventing even my high-prio, (force-)threaded 
> rx irq from being handled in a timely manner. Given that explicitly 
> threaded irq handlers were not enclosed in a bh_dis/en section, and that 
> from what I saw the stmmac interrupt handler didn't need such a 
> protection anyway, I modified the stmmac driver to request threaded 
> interrupts. This worked, save for that "NOHZ" message: because 
> __napi_schedule was now called from a kernel thread context, the softirq 
> was no longer triggered.
> (note that the problem solves itself when enabling threaded NAPI)

Let's be clear that the problem only exists when switching to threaded
IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
check in __napi_schedule_irqoff() which should handle your problem on
PREEMPT_RT.

We should slap a lockdep warning for non-irq contexts in
____napi_schedule(), I think; it was proposed but got lost.
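
The PREEMPT_RT check referred to above: on RT, __napi_schedule_irqoff()
falls back to __napi_schedule(), since threaded handlers there do not
actually run with interrupts disabled. Roughly, from net/core/dev.c:

	void __napi_schedule_irqoff(struct napi_struct *n)
	{
		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			____napi_schedule(this_cpu_ptr(&softnet_data), n);
		else
			__napi_schedule(n);
	}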
Sebastian Andrzej Siewior Feb. 4, 2022, 8:19 a.m. UTC | #6
On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> Let's be clear that the problem only exists when switching to threaded
> IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
> check in __napi_schedule_irqoff() which should handle your problem on
> PREEMPT_RT.

It does not. The problem is the missing bh-off/on around the call. The
forced-threaded handler has this. His explicit threaded-handler does not
and needs it.

> We should slap a lockdep warning for non-irq contexts in
> ____napi_schedule(), I think; it was proposed but got lost.

Something like this perhaps?:

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f65..11c5f003d1591 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 {
 	struct task_struct *thread;
 
+	lockdep_assert_once(hardirq_count() | softirq_count());
+	lockdep_assert_irqs_disabled();
+
 	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
 		/* Paired with smp_mb__before_atomic() in
 		 * napi_enable()/dev_set_threaded().

Be aware that this (the first assert) will trigger in dev_cpu_dead() and
needs a bh-off/on section around it. I should have something in my RT tree :)

Sebastian
Jakub Kicinski Feb. 4, 2022, 3:43 p.m. UTC | #7
On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> > Let's be clear that the problem only exists when switching to threaded
> > IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
> > check in __napi_schedule_irqoff() which should handle your problem on
> > PREEMPT_RT.  
> 
> It does not. The problem is the missing bh-off/on around the call. The
> forced-threaded handler has this. His explicit threaded-handler does not
> and needs it.

I see; what I was getting at is that on PREEMPT_RT IRQs are already
threaded, so I thought the patch was only targeting non-RT. I didn't
think that explicitly threading IRQs is advantageous also on RT.

> > We should slap a lockdep warning for non-irq contexts in
> > ____napi_schedule(), I think; it was proposed but got lost.
> 
> Something like this perhaps?:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f65..11c5f003d1591 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>  {
>  	struct task_struct *thread;
>  
> +	lockdep_assert_once(hardirq_count() | softirq_count());
> +	lockdep_assert_irqs_disabled();
> +
>  	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>  		/* Paired with smp_mb__before_atomic() in
>  		 * napi_enable()/dev_set_threaded().


Yannick Vignon Feb. 4, 2022, 5:15 p.m. UTC | #8
On 2/4/2022 4:43 PM, Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
>> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
>>> Let's be clear that the problem only exists when switching to threaded
>>> IRQs on _non_ PREEMPT_RT kernel (or old kernels). We already have a
>>> check in __napi_schedule_irqoff() which should handle your problem on
>>> PREEMPT_RT.
>>
>> It does not. The problem is the missing bh-off/on around the call. The
>> forced-threaded handler has this. His explicit threaded-handler does not
>> and needs it.
> 
> I see; what I was getting at is that on PREEMPT_RT IRQs are already
> threaded, so I thought the patch was only targeting non-RT. I didn't
> think that explicitly threading IRQs is advantageous also on RT.
> 

Something I forgot to mention is that the final use case I care about 
uses threaded NAPI (because of the improvement it gives when processing 
latency-sensitive network streams). And in that case, __napi_schedule is 
simply waking up the NAPI thread, no softirq is needed, and my 
controversial change isn't even needed for the whole system to work 
properly.

>>> We should slap a lockdep warning for non-irq contexts in
>>> ____napi_schedule(), I think; it was proposed but got lost.
>>
>> Something like this perhaps?:
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f65..11c5f003d1591 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>   {
>>   	struct task_struct *thread;
>>   
>> +	lockdep_assert_once(hardirq_count() | softirq_count());
>> +	lockdep_assert_irqs_disabled();
>> +
>>   	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>>   		/* Paired with smp_mb__before_atomic() in
>>   		 * napi_enable()/dev_set_threaded().
> 
> 
Jakub Kicinski Feb. 4, 2022, 5:36 p.m. UTC | #9
On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> >> Be aware that this (the first assert) will trigger in dev_cpu_dead() and
> >> needs a bh-off/on section around it. I should have something in my RT tree :)
> > 
> > Or we could push the asserts only into the driver-facing helpers
> > (__napi_schedule(), __napi_schedule_irqoff()).  
> 
> As I explained above, everything is working fine when using threaded 
> NAPI. Why then forbid such a use case?
> 
> How about something like this instead:
> in the (stmmac) threaded interrupt handler:
> if (test_bit(NAPI_STATE_THREADED, &napi->state))
> 	__napi_schedule();
> else {
> 	local_bh_disable();
> 	__napi_schedule();
> 	local_bh_enable();
> }

Looks slightly racy: we check the bit again in ____napi_schedule() and
it may change in between.

> Then in __napi_schedule, add the lockdep checks, but __below__ the "if 
> (threaded) { ... }" block.
> 
> Would that be an acceptable change? Because really, the whole point of 
> my patchqueue is to remove latencies imposed on network interrupts by 
> bh_disable/enable sections. If moving to explicitly threaded IRQs means 
> the bh_disable/enable section is simply moved down the path and around 
> __napi_schedule, there is just no point.

IMHO seems reasonable as long as it's coded up neatly.
Jakub Kicinski Feb. 4, 2022, 5:45 p.m. UTC | #10
On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> > I see; what I was getting at is that on PREEMPT_RT IRQs are already
> > threaded, so I thought the patch was only targeting non-RT. I didn't
> > think that explicitly threading IRQs is advantageous also on RT.
> 
> Something I forgot to mention is that the final use case I care about 
> uses threaded NAPI (because of the improvement it gives when processing 
> latency-sensitive network streams). And in that case, __napi_schedule is 
> simply waking up the NAPI thread, no softirq is needed, and my 
> controversial change isn't even needed for the whole system to work 
> properly.

Coincidentally, I believe the threaded NAPI wake up is buggy - 
we assume the thread is only woken up when NAPI gets scheduled,
but IIUC signal delivery and other rare paths may wake up kthreads,
randomly.
Sebastian Andrzej Siewior Feb. 4, 2022, 6:03 p.m. UTC | #11
On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> Coincidentally, I believe the threaded NAPI wake up is buggy - 
> we assume the thread is only woken up when NAPI gets scheduled,
> but IIUC signal delivery and other rare paths may wake up kthreads,
> randomly.

I had to look into NAPI-threads for some reason.
What I dislike is that after enabling it via sysfs I have to:
- adjust task priority manual so it is preferred over other threads.
  This is usually important on RT. But then there is no overload
  protection.

- set an affinity-mask for the thread so it does not migrate from one
  CPU to the other. This is worse for a RT task where the scheduler
  tries to keep the task running.

Wouldn't it work to utilize the threaded-IRQ API and use that instead of
the custom thread? Basically the primary handler would do what it already
does (disable the interrupt) and the threaded handler would feed packets
into the stack. In the overload case one would need to lower the
thread-priority.

Sebastian
Jakub Kicinski Feb. 4, 2022, 6:50 p.m. UTC | #12
On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > we assume the thread is only woken up when NAPI gets scheduled,
> > but IIUC signal delivery and other rare paths may wake up kthreads,
> > randomly.  
> 
> I had to look into NAPI-threads for some reason.
> What I dislike is that after enabling it via sysfs I have to:
> - adjust task priority manually so it is preferred over other threads.
>   This is usually important on RT. But then there is no overload
>   protection.
> 
> - set an affinity-mask for the thread so it does not migrate from one
>   CPU to the other. This is worse for a RT task where the scheduler
>   tries to keep the task running.
> 
> Wouldn't it work to utilize the threaded-IRQ API and use that instead of
> the custom thread? Basically the primary handler would do what it already
> does (disable the interrupt) and the threaded handler would feed packets
> into the stack. In the overload case one would need to lower the
> thread-priority.

Sounds like an interesting direction if you ask me! That said I have
not been able to make threaded NAPI useful in my experiments / with my
workloads so I'd defer to Wei for confirmation.

To be clear -- are you suggesting that drivers just switch to threaded
NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
dynamically engages a thread in a generic fashion?
Jakub Kicinski Feb. 4, 2022, 6:52 p.m. UTC | #13
On Fri, 4 Feb 2022 10:50:35 -0800 Jakub Kicinski wrote:
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI,

s/NAPI/IRQs/
Sebastian Andrzej Siewior Feb. 8, 2022, 11:51 a.m. UTC | #14
On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > > we assume the thread is only woken up when NAPI gets scheduled,
> > > but IIUC signal delivery and other rare paths may wake up kthreads,
> > > randomly.  
> > 
> > I had to look into NAPI-threads for some reason.
> > What I dislike is that after enabling it via sysfs I have to:
> > - adjust task priority manually so it is preferred over other threads.
> >   This is usually important on RT. But then there is no overload
> >   protection.
> > 
> > - set an affinity-mask for the thread so it does not migrate from one
> >   CPU to the other. This is worse for a RT task where the scheduler
> >   tries to keep the task running.
> > 
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead of
> > the custom thread? Basically the primary handler would do what it already
> > does (disable the interrupt) and the threaded handler would feed packets
> > into the stack. In the overload case one would need to lower the
> > thread-priority.
> 
> Sounds like an interesting direction if you ask me! That said I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads so I'd defer to Wei for confirmation.
> 
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> dynamically engages a thread in a generic fashion?

Uhm, kind of, yes.

Now you have
	request_irq(, handler_irq);
	netif_napi_add(, , handler_napi);

The handler_irq() disables the interrupt line and schedules the softirq
to process handler_napi(). Once handler_napi() is done it re-enables the
interrupt line, otherwise it will be processed again on the next tick.

If you enable threaded NAPI then you end up with a thread and the
softirq is no longer used. I don't know what the next action is but I
guess you search for that thread and pin it manually to a CPU and assign
an RT priority (probably, otherwise it will compete with other tasks for
CPU resources).

Instead we could have
	request_threaded_irq(, handler_irq, handler_napi);

And we would have basically the same outcome. Except that handler_napi()
runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
the CPU affinity is adjusted if the IRQ-affinity is changed).
We would still have to work out the details of what handler_irq() is
allowed to do and how to handle one IRQ and multiple handler_napi().

If you wrap request_threaded_irq() in something like request_napi_irq()
then you could switch between the former (softirq) and latter (thread)
based NAPI handling (since you have all the needed details).
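
One possible shape for such a wrapper; every name below is hypothetical,
nothing like it exists upstream:

	/* Sketch: choose softirq- or thread-based NAPI per driver switch. */
	static int request_napi_irq(unsigned int irq, irq_handler_t handler,
				    irq_handler_t napi_thread_fn, bool threaded,
				    unsigned long flags, const char *name,
				    void *dev)
	{
		if (threaded)
			return request_threaded_irq(irq, handler,
						    napi_thread_fn,
						    flags | IRQF_ONESHOT,
						    name, dev);
		return request_irq(irq, handler, flags, name, dev);
	}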

Sebastian
Jakub Kicinski Feb. 8, 2022, 3:35 p.m. UTC | #15
On Tue, 8 Feb 2022 12:51:07 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:  
> > > I had to look into NAPI-threads for some reason.
> > > What I dislike is that after enabling it via sysfs I have to:
> > > - adjust task priority manually so it is preferred over other threads.
> > >   This is usually important on RT. But then there is no overload
> > >   protection.
> > > 
> > > - set an affinity-mask for the thread so it does not migrate from one
> > >   CPU to the other. This is worse for a RT task where the scheduler
> > >   tries to keep the task running.
> > > 
> > > Wouldn't it work to utilize the threaded-IRQ API and use that instead of
> > > the custom thread? Basically the primary handler would do what it already
> > > does (disable the interrupt) and the threaded handler would feed packets
> > > into the stack. In the overload case one would need to lower the
> > > thread-priority.  
> > 
> > Sounds like an interesting direction if you ask me! That said I have
> > not been able to make threaded NAPI useful in my experiments / with my
> > workloads so I'd defer to Wei for confirmation.
> > 
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?  
> 
> Uhm, kind of, yes.
> 
> Now you have
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
> 
> The handler_irq() disables the interrupt line and schedules the softirq
> to process handler_napi(). Once handler_napi() is done it re-enables the
> interrupt line, otherwise it will be processed again on the next tick.
> 
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is but I
> guess you search for that thread and pin it manually to a CPU and assign
> an RT priority (probably, otherwise it will compete with other tasks for
> CPU resources).

FWIW I don't think servers would want RT prio.

> Instead we could have
> 	request_threaded_irq(, handler_irq, handler_napi);
> 
> And we would have basically the same outcome. Except that handler_napi()
> runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> the CPU affinity is adjusted if the IRQ-affinity is changed).
> We would still have to work out the details of what handler_irq() is
> allowed to do and how to handle one IRQ and multiple handler_napi().
> 
> If you wrap request_threaded_irq() in something like request_napi_irq()
> then you could switch between the former (softirq) and latter (thread)
> based NAPI handling (since you have all the needed details).

One use case to watch out for is drivers which explicitly moved 
to threaded NAPI because they want to schedule multiple threads 
(NAPIs) from a single IRQ to spread processing across more cores.
Paolo Abeni Feb. 8, 2022, 3:57 p.m. UTC | #16
On Tue, 2022-02-08 at 12:51 +0100, Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > > Coincidentally, I believe the threaded NAPI wake up is buggy - 
> > > > we assume the thread is only woken up when NAPI gets scheduled,
> > > > but IIUC signal delivery and other rare paths may wake up kthreads,
> > > > randomly.  
> > > 
> > > I had to look into NAPI-threads for some reason.
> > > What I dislike is that after enabling it via sysfs I have to:
> > > - adjust task priority manually so it is preferred over other threads.
> > >   This is usually important on RT. But then there is no overload
> > >   protection.
> > > 
> > > - set an affinity-mask for the thread so it does not migrate from one
> > >   CPU to the other. This is worse for a RT task where the scheduler
> > >   tries to keep the task running.
> > > 
> > > Wouldn't it work to utilize the threaded-IRQ API and use that instead of
> > > the custom thread? Basically the primary handler would do what it already
> > > does (disable the interrupt) and the threaded handler would feed packets
> > > into the stack. In the overload case one would need to lower the
> > > thread-priority.
> > 
> > Sounds like an interesting direction if you ask me! That said I have
> > not been able to make threaded NAPI useful in my experiments / with my
> > workloads so I'd defer to Wei for confirmation.
> > 
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?
> 
> Uhm, kind of, yes.
> 
> Now you have
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
> 
> The handler_irq() disables the interrupt line and schedules the softirq
> to process handler_napi(). Once handler_napi() is done it re-enables the
> interrupt line, otherwise it will be processed again on the next tick.
> 
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is but I
> guess you search for that thread and pin it manually to a CPU and assign
> an RT priority (probably, otherwise it will compete with other tasks for
> CPU resources).
> 
> Instead we could have
> 	request_threaded_irq(, handler_irq, handler_napi);
> 
> And we would have basically the same outcome. Except that handler_napi()
> runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> the CPU affinity is adjusted if the IRQ-affinity is changed).
> We would still have to work out the details of what handler_irq() is
> allowed to do and how to handle one IRQ and multiple handler_napi().
> 
> If you wrap request_threaded_irq() in something like request_napi_irq()
> then you could switch between the former (softirq) and latter (thread)
> based NAPI handling (since you have all the needed details).

just for historic reference:

https://lkml.org/lkml/2016/6/15/460

I think that running the thread performing the NAPI loop with
SCHED_FIFO would be dangerous WRT DDOS. Even the affinity setting can
give mixed results depending on the workload - unless you do good
static CPUs allocation pinning each process manually, not really a
generic setup.

Cheers,

Paolo
Sebastian Andrzej Siewior Feb. 8, 2022, 5:45 p.m. UTC | #17
On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > If you enable threaded NAPI then you end up with a thread and the
> > softirq is no longer used. I don't know what the next action is but I
> > guess you search for that thread and pin it manually to a CPU and assign
> > an RT priority (probably, otherwise it will compete with other tasks for
> > CPU resources).
> 
> FWIW I don't think servers would want RT prio.

but then the NAPI thread is treated the same way as your xz -9.

> > Instead we could have
> > 	request_threaded_irq(, handler_irq, handler_napi);
> > 
> > And we would have basically the same outcome. Except that handler_napi()
> > runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> > the CPU affinity is adjusted if the IRQ-affinity is changed).
> > We would still have to work out the details of what handler_irq() is
> > allowed to do and how to handle one IRQ and multiple handler_napi().
> > 
> > If you wrap request_threaded_irq() in something like request_napi_irq()
> > then you could switch between the former (softirq) and latter (thread)
> > based NAPI handling (since you have all the needed details).
> 
> One use case to watch out for is drivers which explicitly moved 
> to threaded NAPI because they want to schedule multiple threads 
> (NAPIs) from a single IRQ to spread processing across more cores.

the request_napi_irq() could have a sysfs switch (like we currently
have).
But you mentioned one IRQ and multiple NAPI threads, to distribute
across cores. The usual case is that you have one IRQ for a network queue
and this network queue has one NAPI struct, right?

What about the case where you have one IRQ but two network queues, each
with its own NAPI struct? Would you then use napi-threads and manually pin
queue-napi#1 to CPU#1 and queue-napi#2 to CPU#2 while the IRQ itself
fires on CPU#1?

Sebastian
Sebastian Andrzej Siewior Feb. 8, 2022, 6:21 p.m. UTC | #18
On 2022-02-08 16:57:59 [+0100], Paolo Abeni wrote:
> just for historic reference:
> 
> https://lkml.org/lkml/2016/6/15/460

that is
  https://lore.kernel.org/all/cover.1465996447.git.pabeni@redhat.com

let me digest that later…

> I think that running the thread performing the NAPI loop with
> SCHED_FIFO would be dangerous WRT DDOS. Even the affinity setting can
> give mixed results depending on the workload - unless you do a good
> static CPU allocation, pinning each process manually, which is not
> really a generic setup.

The DDoS part is where I meant we need to figure out the details. Since it
is a threaded-interrupt we could do msleep() as a break which is similar
to what softirq does. Detecting such a case might be difficult since the
budget is per-thread only and does not involve other NAPI-structs on the
same CPU like backlog.
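
A hedged sketch of that overload break, inside a hypothetical threaded
handler (the foo_* names are made up):

	static irqreturn_t foo_napi_thread_fn(int irq, void *data)
	{
		struct foo_priv *priv = data;

		/* Keep polling; whenever the budget is exhausted, sleep
		 * briefly so SCHED_OTHER tasks still get CPU time.
		 */
		while (foo_poll(priv, NAPI_POLL_WEIGHT) >= NAPI_POLL_WEIGHT)
			msleep(1);

		foo_enable_irq(priv);	/* re-enable the interrupt line */
		return IRQ_HANDLED;
	}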

The performance is usually best if the IRQ and threaded handler are
running on the same CPU. The win with the NAPI thread seems to be that
if two NAPIs fire on the same CPU then the scheduler will see two tasks
which will be moved (at some point) to different CPUs. And in a DDoS
like situation they can constantly push new skbs into the stack and
SCHED_OTHER ensures that the user can still do something. Without napi
threads, both NAPIs would be processed on the same CPU (in order) and
eventually moved to ksoftirqd where it will take a break under high
load.

> Cheers,
> 
> Paolo

Sebastian
Jakub Kicinski Feb. 9, 2022, 12:16 a.m. UTC | #19
On Tue, 8 Feb 2022 18:45:53 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > > If you enable threaded NAPI then you end up with a thread and the
> > > softirq is no longer used. I don't know what the next action is but I
> > > guess you search for that thread and pin it manually to a CPU and assign
> > > an RT priority (probably, otherwise it will compete with other tasks for
> > > CPU resources).  
> > 
> > FWIW I don't think servers would want RT prio.  
> 
> but then the NAPI thread is treated the same way like your xz -9.

You'd either pin the workload away from the network processing cores 
or, if there's no pinning, prefer requests to run to completion to
achieve lower latency.

> > > Instead we could have
> > > 	request_threaded_irq(, handler_irq, handler_napi);
> > > 
> > > And we would have basically the same outcome. Except that handler_napi()
> > > runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and
> > > the CPU affinity is adjusted if the IRQ-affinity is changed).
> > > We would still have to work out the details of what handler_irq() is
> > > allowed to do and how to handle one IRQ and multiple handler_napi().
> > > 
> > > If you wrap request_threaded_irq() in something like request_napi_irq()
> > > then you could switch between the former (softirq) and latter (thread)
> > > based NAPI handling (since you have all the needed details).  
> > 
> > One use case to watch out for is drivers which explicitly moved 
> > to threaded NAPI because they want to schedule multiple threads 
> > (NAPIs) from a single IRQ to spread processing across more cores.  
> 
> the request_napi_irq() could have a sysfs switch (like we currently
> have).
> But you mentioned one IRQ and multiple NAPI threads, to distribute
> across cores. The usual case is that you have one IRQ for a network queue
> and this network queue has one NAPI struct, right?
> 
> What about the case where you have one IRQ but two network queues, each
> with its own NAPI struct? Would you then use napi-threads and manually pin
> queue-napi#1 to CPU#1 and queue-napi#2 to CPU#2 while the IRQ itself
> fires on CPU#1?

It was mt76, the WiFi driver. I looked in the morning and I think it
had a NAPI for Tx and 2 NAPIs for Rx. The scheduler can spread them
around. Felix will have the exact details.
Marc Kleine-Budde Feb. 9, 2022, 11:26 a.m. UTC | #20
On 04.02.2022 10:50:35, Jakub Kicinski wrote:
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead of
> > the custom thread? Basically the primary handler would do what it already
> > does (disable the interrupt) and the threaded handler would feed packets
> > into the stack. In the overload case one would need to lower the
> > thread-priority.
> 
> Sounds like an interesting direction if you ask me! That said I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads so I'd defer to Wei for confirmation.

FWIW: We have a threaded NAPI use case. On a PREEMPT_RT enabled imx6 (4
core 32 bit ARM) the NAPI of the CAN controller (flexcan) is running in
threaded mode with raised priorities. This reduces latency spikes and
jitter of RX'ed CAN frames, which are part of a closed control loop.

regards,
Marc
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f6..f93b3173454c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4239,7 +4239,7 @@  static inline void ____napi_schedule(struct softnet_data *sd,
 	}
 
 	list_add_tail(&napi->poll_list, &sd->poll_list);
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+	raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
 
 #ifdef CONFIG_RPS