Message ID | 20220203184031.1074008-1-yannick.vignon@oss.nxp.com
---|---
State | Changes Requested
Delegated to: | Netdev Maintainers
Series | [net-next,1/2] net: napi: wake up ksoftirqd if needed after scheduling NAPI
On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon <yannick.vignon@oss.nxp.com> wrote:
>
> From: Yannick Vignon <yannick.vignon@nxp.com>
>
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()
> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),
> which results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>
> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc. behave
> normally in more situations than just from an interrupt,
> softirq, or from within a bh_disable/bh_enable section.

This is buggy. NAPI is called from the right context.

Can you provide a stack trace or something, so that the buggy driver
can be fixed?

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> ---
>  net/core/dev.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f6..f93b3173454c 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>         }
>
>         list_add_tail(&napi->poll_list, &sd->poll_list);
> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>  }
>
>  #ifdef CONFIG_RPS
> --
> 2.25.1
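
For reference, the distinction the patch hinges on lives in kernel/softirq.c. A simplified sketch of the two helpers (paraphrased, not quoted from this thread):

    /* kernel/softirq.c (simplified): only marks the softirq pending. */
    void __raise_softirq_irqoff(unsigned int nr)
    {
            lockdep_assert_irqs_disabled();
            or_softirq_pending(1UL << nr);
    }

    /* Also wakes ksoftirqd when not called from hard-IRQ/softirq
     * context, which is exactly the case the patch description targets. */
    inline void raise_softirq_irqoff(unsigned int nr)
    {
            __raise_softirq_irqoff(nr);
            if (!in_interrupt())
                    wakeup_softirqd();
    }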
On 2022-02-03 19:40:30 [+0100], Yannick Vignon wrote:
> From: Yannick Vignon <yannick.vignon@nxp.com>
>
> If NAPI was not scheduled from interrupt or softirq,
> __raise_softirq_irqoff would mark the softirq pending, but not
> wake up ksoftirqd. With force threaded IRQs, this is
> compensated by the fact that the interrupt handlers are
> protected inside a local_bh_disable()/local_bh_enable()

This is not "compensated"; it is one of the reasons why that section was
added in the first place.

> section, and bh_enable will call do_softirq if needed. With
> normal threaded IRQs however, this is no longer the case
> (unless the interrupt handler itself calls local_bh_enable()),

Exactly.

> which results in a pending softirq not being handled, and the
> following message being printed out from tick-sched.c:
> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"

Yes. This also covers various other call sites.

> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> woken up in such a case, ensuring __napi_schedule, etc. behave
> normally in more situations than just from an interrupt,
> softirq, or from within a bh_disable/bh_enable section.

I would suggest adding a bh disable/enable pair around the function that
is known to raise BH. The change to ____napi_schedule() you suggest will
wake ksoftirqd, which is not what you want. What you want is to process
NAPI in your current context.

> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>

Sebastian
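
A minimal sketch of the wrapping Sebastian suggests, assuming a hypothetical driver (my_priv and my_rx_irq_thread are illustrative names, not from this thread):

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    struct my_priv {
            struct napi_struct napi;        /* driver's NAPI instance */
    };

    /* Explicitly threaded RX handler: runs in process context, so the
     * softirq raised by napi_schedule() would otherwise stay pending. */
    static irqreturn_t my_rx_irq_thread(int irq, void *data)
    {
            struct my_priv *priv = data;

            local_bh_disable();
            napi_schedule(&priv->napi);
            local_bh_enable();      /* runs the pending NET_RX_SOFTIRQ here */

            return IRQ_HANDLED;
    }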
On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> <yannick.vignon@oss.nxp.com> wrote:
>>
>> From: Yannick Vignon <yannick.vignon@nxp.com>
>>
>> If NAPI was not scheduled from interrupt or softirq,
>> __raise_softirq_irqoff would mark the softirq pending, but not
>> wake up ksoftirqd. With force threaded IRQs, this is
>> compensated by the fact that the interrupt handlers are
>> protected inside a local_bh_disable()/local_bh_enable()
>> section, and bh_enable will call do_softirq if needed. With
>> normal threaded IRQs however, this is no longer the case
>> (unless the interrupt handler itself calls local_bh_enable()),
>> which results in a pending softirq not being handled, and the
>> following message being printed out from tick-sched.c:
>> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
>>
>> Call raise_softirq_irqoff instead to make sure ksoftirqd is
>> woken up in such a case, ensuring __napi_schedule, etc. behave
>> normally in more situations than just from an interrupt,
>> softirq, or from within a bh_disable/bh_enable section.
>>
>
> This is buggy. NAPI is called from the right context.
>
> Can you provide a stack trace or something, so that the buggy driver
> can be fixed?

Maybe some background on how I came to this would be helpful. I have
been chasing down sources of latency in rx packet processing on a
PREEMPT_RT kernel with the stmmac driver. I observed that the main ones
were bh_disable/enable sections, which prevented even my high-priority,
(force-)threaded rx irq from being handled in a timely manner. Given
that explicitly threaded irq handlers are not enclosed in a
bh_disable/enable section, and that from what I saw the stmmac interrupt
handler didn't need such protection anyway, I modified the stmmac driver
to request threaded interrupts. This worked, save for that "NOHZ"
message: because __napi_schedule was now called from a kernel thread
context, the softirq was no longer triggered.
(Note that the problem solves itself when enabling threaded NAPI.)

Is there a rule saying we shouldn't call __napi_schedule from a regular
kernel thread, and in particular from a threaded interrupt handler?

>> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
>> ---
>>  net/core/dev.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f6..f93b3173454c 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>         }
>>
>>         list_add_tail(&napi->poll_list, &sd->poll_list);
>> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
>> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
>>  }
>>
>>  #ifdef CONFIG_RPS
>> --
>> 2.25.1
>>
On Thu, Feb 3, 2022 at 3:40 PM Yannick Vignon <yannick.vignon@oss.nxp.com> wrote:
>
> On 2/3/2022 8:08 PM, Eric Dumazet wrote:
> > On Thu, Feb 3, 2022 at 11:06 AM Yannick Vignon
> > <yannick.vignon@oss.nxp.com> wrote:
> >>
> >> From: Yannick Vignon <yannick.vignon@nxp.com>
> >>
> >> If NAPI was not scheduled from interrupt or softirq,
> >> __raise_softirq_irqoff would mark the softirq pending, but not
> >> wake up ksoftirqd. With force threaded IRQs, this is
> >> compensated by the fact that the interrupt handlers are
> >> protected inside a local_bh_disable()/local_bh_enable()
> >> section, and bh_enable will call do_softirq if needed. With
> >> normal threaded IRQs however, this is no longer the case
> >> (unless the interrupt handler itself calls local_bh_enable()),
> >> which results in a pending softirq not being handled, and the
> >> following message being printed out from tick-sched.c:
> >> "NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n"
> >>
> >> Call raise_softirq_irqoff instead to make sure ksoftirqd is
> >> woken up in such a case, ensuring __napi_schedule, etc. behave
> >> normally in more situations than just from an interrupt,
> >> softirq, or from within a bh_disable/bh_enable section.
> >>
> >
> > This is buggy. NAPI is called from the right context.
> >
> > Can you provide a stack trace or something, so that the buggy driver
> > can be fixed?
>
> Maybe some background on how I came to this would be helpful. I have
> been chasing down sources of latency in rx packet processing on a
> PREEMPT_RT kernel with the stmmac driver. I observed that the main ones
> were bh_disable/enable sections, which prevented even my high-priority,
> (force-)threaded rx irq from being handled in a timely manner. Given
> that explicitly threaded irq handlers are not enclosed in a
> bh_disable/enable section, and that from what I saw the stmmac interrupt
> handler didn't need such protection anyway, I modified the stmmac driver
> to request threaded interrupts. This worked, save for that "NOHZ"
> message: because __napi_schedule was now called from a kernel thread
> context, the softirq was no longer triggered.
> (Note that the problem solves itself when enabling threaded NAPI.)
>
> Is there a rule saying we shouldn't call __napi_schedule from a regular
> kernel thread, and in particular from a threaded interrupt handler?

The rule is that you need to be in a safe context before calling
__napi_schedule(). This has been the case for more than a decade.

We are not going to slow down the stack just in case a process context
does not care about BH.

Please check:

commit ec13ee80145ccb95b00e6e610044bbd94a170051
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Wed May 16 10:57:12 2012 +0300

    virtio_net: invoke softirqs after __napi_schedule

> >> Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
> >> ---
> >>  net/core/dev.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 1baab07820f6..f93b3173454c 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
> >>         }
> >>
> >>         list_add_tail(&napi->poll_list, &sd->poll_list);
> >> -       __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >> +       raise_softirq_irqoff(NET_RX_SOFTIRQ);
> >>  }
> >>
> >>  #ifdef CONFIG_RPS
> >> --
> >> 2.25.1
> >>
On Fri, 4 Feb 2022 00:40:41 +0100 Yannick Vignon wrote:
> Maybe some background on how I came to this would be helpful. I have
> been chasing down sources of latency in rx packet processing on a
> PREEMPT_RT kernel with the stmmac driver. I observed that the main ones
> were bh_disable/enable sections, which prevented even my high-priority,
> (force-)threaded rx irq from being handled in a timely manner. Given
> that explicitly threaded irq handlers are not enclosed in a
> bh_disable/enable section, and that from what I saw the stmmac interrupt
> handler didn't need such protection anyway, I modified the stmmac driver
> to request threaded interrupts. This worked, save for that "NOHZ"
> message: because __napi_schedule was now called from a kernel thread
> context, the softirq was no longer triggered.
> (Note that the problem solves itself when enabling threaded NAPI.)

Let's be clear that the problem only exists when switching to threaded
IRQs on a _non_ PREEMPT_RT kernel (or on old kernels). We already have a
check in __napi_schedule_irqoff() which should handle your problem on
PREEMPT_RT.

We should slap a lockdep warning for non-irq contexts in
____napi_schedule(), I think; it was proposed before but got lost.
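
For context, the PREEMPT_RT check Jakub refers to looked roughly like this in net/core/dev.c of that era (paraphrased from memory of commit 8380c81d5c4f, so treat the details as approximate):

    void __napi_schedule_irqoff(struct napi_struct *n)
    {
            if (!IS_ENABLED(CONFIG_PREEMPT_RT))
                    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
            else
                    __napi_schedule(n);     /* RT: take the irq-save path */
    }
    EXPORT_SYMBOL(__napi_schedule_irqoff);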
On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> Let's be clear that the problem only exists when switching to threaded
> IRQs on a _non_ PREEMPT_RT kernel (or on old kernels). We already have
> a check in __napi_schedule_irqoff() which should handle your problem
> on PREEMPT_RT.

It does not. The problem is the missing bh-off/on around the call. The
forced-threaded handler has this; his explicit threaded handler does not
and needs it.

> We should slap a lockdep warning for non-irq contexts in
> ____napi_schedule(), I think; it was proposed before but got lost.

Something like this perhaps?:

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f65..11c5f003d1591 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 {
 	struct task_struct *thread;
 
+	lockdep_assert_once(hardirq_count() | softirq_count());
+	lockdep_assert_irqs_disabled();
+
 	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
 		/* Paired with smp_mb__before_atomic() in
 		 * napi_enable()/dev_set_threaded().

Be aware that this (the first assert) will trigger in dev_cpu_dead() and
needs a bh-off/on around it. I should have something in my RT tree :)

Sebastian
On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
> > Let's be clear that the problem only exists when switching to threaded
> > IRQs on a _non_ PREEMPT_RT kernel (or on old kernels). We already have
> > a check in __napi_schedule_irqoff() which should handle your problem
> > on PREEMPT_RT.
>
> It does not. The problem is the missing bh-off/on around the call. The
> forced-threaded handler has this; his explicit threaded handler does
> not and needs it.

I see. What I was getting at is that on PREEMPT_RT IRQs are already
threaded, so I thought the patch was only targeting non-RT; I didn't
realize that explicitly threading IRQs is advantageous on RT as well.

> > We should slap a lockdep warning for non-irq contexts in
> > ____napi_schedule(), I think; it was proposed before but got lost.
>
> Something like this perhaps?:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1baab07820f65..11c5f003d1591 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>  {
>  	struct task_struct *thread;
>
> +	lockdep_assert_once(hardirq_count() | softirq_count());
> +	lockdep_assert_irqs_disabled();
> +
>  	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>  		/* Paired with smp_mb__before_atomic() in
>  		 * napi_enable()/dev_set_threaded().
On 2/4/2022 4:43 PM, Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 09:19:22 +0100 Sebastian Andrzej Siewior wrote:
>> On 2022-02-03 17:09:01 [-0800], Jakub Kicinski wrote:
>>> Let's be clear that the problem only exists when switching to threaded
>>> IRQs on a _non_ PREEMPT_RT kernel (or on old kernels). We already have
>>> a check in __napi_schedule_irqoff() which should handle your problem
>>> on PREEMPT_RT.
>>
>> It does not. The problem is the missing bh-off/on around the call. The
>> forced-threaded handler has this; his explicit threaded handler does
>> not and needs it.
>
> I see. What I was getting at is that on PREEMPT_RT IRQs are already
> threaded, so I thought the patch was only targeting non-RT; I didn't
> realize that explicitly threading IRQs is advantageous on RT as well.

Something I forgot to mention is that the final use case I care about
uses threaded NAPI (because of the improvement it gives when processing
latency-sensitive network streams). In that case, __napi_schedule simply
wakes up the NAPI thread, no softirq is needed, and my controversial
change isn't even needed for the whole system to work properly.

>>> We should slap a lockdep warning for non-irq contexts in
>>> ____napi_schedule(), I think; it was proposed before but got lost.
>>
>> Something like this perhaps?:
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1baab07820f65..11c5f003d1591 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4217,6 +4217,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,
>>  {
>>  	struct task_struct *thread;
>>
>> +	lockdep_assert_once(hardirq_count() | softirq_count());
>> +	lockdep_assert_irqs_disabled();
>> +
>>  	if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
>>  		/* Paired with smp_mb__before_atomic() in
>>  		 * napi_enable()/dev_set_threaded().
On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> >> Be aware that this (the first assert) will trigger in dev_cpu_dead()
> >> and needs a bh-off/on around it. I should have something in my RT
> >> tree :)
> >
> > Or we could push the asserts only into the driver-facing helpers
> > (__napi_schedule(), __napi_schedule_irqoff()).
>
> As I explained above, everything is working fine when using threaded
> NAPI. Why then forbid such a use case?
>
> How about something like this instead, in the (stmmac) threaded
> interrupt handler:
>
> 	if (test_bit(NAPI_STATE_THREADED, &napi->state))
> 		__napi_schedule(napi);
> 	else {
> 		local_bh_disable();
> 		__napi_schedule(napi);
> 		local_bh_enable();
> 	}

Looks slightly racy: we check the bit again in ____napi_schedule() and
it may change in between.

> Then in __napi_schedule, add the lockdep checks, but __below__ the "if
> (threaded) { ... }" block.
>
> Would that be an acceptable change? Because really, the whole point of
> my patch queue is to remove the latencies imposed on network interrupts
> by bh_disable/enable sections. If moving to explicitly threaded IRQs
> means the bh_disable/enable section is simply moved down the path and
> around __napi_schedule, there is just no point.

IMHO seems reasonable as long as it's coded up neatly.
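
To make the race concrete: the threaded branch at the top of ____napi_schedule() looked roughly like this (simplified from memory of net/core/dev.c around that time; details approximate). If dev_set_threaded() clears NAPI_STATE_THREADED between the driver's test_bit() and this one, the driver skips the bh wrap yet execution falls through to the softirq path, leaving the softirq pending again.

    static inline void ____napi_schedule(struct softnet_data *sd,
                                         struct napi_struct *napi)
    {
            struct task_struct *thread;

            if (test_bit(NAPI_STATE_THREADED, &napi->state)) {
                    /* Wake the per-NAPI kthread instead of raising the softirq. */
                    thread = READ_ONCE(napi->thread);
                    if (thread) {
                            set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
                            wake_up_process(thread);
                            return;
                    }
            }

            list_add_tail(&napi->poll_list, &sd->poll_list);
            __raise_softirq_irqoff(NET_RX_SOFTIRQ); /* the line the patch changes */
    }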
On Fri, 4 Feb 2022 18:15:40 +0100 Yannick Vignon wrote:
> > I see. What I was getting at is that on PREEMPT_RT IRQs are already
> > threaded, so I thought the patch was only targeting non-RT; I didn't
> > realize that explicitly threading IRQs is advantageous on RT as well.
>
> Something I forgot to mention is that the final use case I care about
> uses threaded NAPI (because of the improvement it gives when processing
> latency-sensitive network streams). In that case, __napi_schedule simply
> wakes up the NAPI thread, no softirq is needed, and my controversial
> change isn't even needed for the whole system to work properly.

Coincidentally, I believe the threaded NAPI wake-up is buggy: we assume
the thread is only woken up when NAPI gets scheduled, but IIUC signal
delivery and other rare paths may wake up kthreads randomly.
On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> Coincidentally, I believe the threaded NAPI wake-up is buggy: we assume
> the thread is only woken up when NAPI gets scheduled, but IIUC signal
> delivery and other rare paths may wake up kthreads randomly.

I had to look into NAPI threads for some reason.
What I dislike is that after enabling them via sysfs I have to:
- adjust the task priority manually so it is preferred over other
  threads. This is usually important on RT. But then there is no
  overload protection.

- set an affinity mask for the thread so it does not migrate from one
  CPU to the other. This is worse for an RT task, where the scheduler
  tries to keep the task running.

Wouldn't it work to utilize the threaded-IRQ API and use that instead
of the custom thread? Basically, the primary handler would do what it
already does (disable the interrupt) and the threaded handler would
feed packets into the stack. In the overload case one would need to
lower the thread priority.

Sebastian
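
A minimal sketch of the split Sebastian describes; the my_* names are hypothetical helpers, and this is the generic request_threaded_irq() pattern rather than code from the thread:

    #include <linux/interrupt.h>
    #include <linux/sched.h>

    struct my_priv;                                 /* driver state */
    void my_hw_mask_irq(struct my_priv *priv);      /* hypothetical helpers */
    void my_hw_unmask_irq(struct my_priv *priv);
    int my_rx_poll(struct my_priv *priv, int budget);

    /* Primary handler: hard-IRQ context; just mask the device interrupt
     * and kick the IRQ thread. */
    static irqreturn_t my_primary_handler(int irq, void *data)
    {
            struct my_priv *priv = data;

            my_hw_mask_irq(priv);
            return IRQ_WAKE_THREAD;
    }

    /* Threaded handler: feed packets into the stack, then unmask. */
    static irqreturn_t my_thread_handler(int irq, void *data)
    {
            struct my_priv *priv = data;

            /* Poll until the queue is drained (budget of 64 is illustrative). */
            while (my_rx_poll(priv, 64) == 64)
                    cond_resched();

            my_hw_unmask_irq(priv);
            return IRQ_HANDLED;
    }

    static int my_request_rx_irq(struct my_priv *priv, unsigned int irq)
    {
            return request_threaded_irq(irq, my_primary_handler,
                                        my_thread_handler, 0,
                                        "my-eth-rx", priv);
    }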
On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > Coincidentally, I believe the threaded NAPI wake-up is buggy: we
> > assume the thread is only woken up when NAPI gets scheduled, but IIUC
> > signal delivery and other rare paths may wake up kthreads randomly.
>
> I had to look into NAPI threads for some reason.
> What I dislike is that after enabling them via sysfs I have to:
> - adjust the task priority manually so it is preferred over other
>   threads. This is usually important on RT. But then there is no
>   overload protection.
>
> - set an affinity mask for the thread so it does not migrate from one
>   CPU to the other. This is worse for an RT task, where the scheduler
>   tries to keep the task running.
>
> Wouldn't it work to utilize the threaded-IRQ API and use that instead
> of the custom thread? Basically, the primary handler would do what it
> already does (disable the interrupt) and the threaded handler would
> feed packets into the stack. In the overload case one would need to
> lower the thread priority.

Sounds like an interesting direction if you ask me! That said, I have
not been able to make threaded NAPI useful in my experiments / with my
workloads, so I'd defer to Wei for confirmation.

To be clear -- are you suggesting that drivers just switch to threaded
NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
dynamically engages a thread in a generic fashion?
On Fri, 4 Feb 2022 10:50:35 -0800 Jakub Kicinski wrote:
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI,

s/NAPI/IRQs/
On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > Coincidentally, I believe the threaded NAPI wake-up is buggy: we
> > > assume the thread is only woken up when NAPI gets scheduled, but
> > > IIUC signal delivery and other rare paths may wake up kthreads
> > > randomly.
> >
> > I had to look into NAPI threads for some reason.
> > What I dislike is that after enabling them via sysfs I have to:
> > - adjust the task priority manually so it is preferred over other
> >   threads. This is usually important on RT. But then there is no
> >   overload protection.
> >
> > - set an affinity mask for the thread so it does not migrate from one
> >   CPU to the other. This is worse for an RT task, where the scheduler
> >   tries to keep the task running.
> >
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > of the custom thread? Basically, the primary handler would do what it
> > already does (disable the interrupt) and the threaded handler would
> > feed packets into the stack. In the overload case one would need to
> > lower the thread priority.
>
> Sounds like an interesting direction if you ask me! That said, I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads, so I'd defer to Wei for confirmation.
>
> To be clear -- are you suggesting that drivers just switch to threaded
> NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> dynamically engages a thread in a generic fashion?

Uhm, kind of, yes.

Now you have:

	request_irq(, handler_irq);
	netif_napi_add(, , handler_napi);

handler_irq() disables the interrupt line and schedules the softirq to
process handler_napi(). Once handler_napi() is done, it re-enables the
interrupt line; otherwise it will be processed again on the next tick.

If you enable threaded NAPI then you end up with a thread and the
softirq is no longer used. I don't know what the next action is, but I
guess you search for that thread, pin it manually to a CPU, and assign
an RT priority (probably, otherwise it will compete with other tasks
for CPU resources).

Instead we could have:

	request_threaded_irq(, handler_irq, handler_napi);

and we would have basically the same outcome, except that handler_napi()
runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and the
CPU affinity is adjusted if the IRQ affinity is changed).
We would still have to work out the details of what handler_irq() is
allowed to do and how to handle one IRQ with multiple handler_napi().

If you wrap request_threaded_irq() in something like request_napi_irq()
then you could switch between the former (softirq-) and latter (thread-)
based NAPI handling (since you have all the needed details).

Sebastian
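
A rough sketch of what such a request_napi_irq() wrapper could look like; everything here (the name, the struct, the mode flag) is hypothetical, extrapolated from Sebastian's description:

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>
    #include <linux/sched.h>

    struct napi_irq {
            struct napi_struct *napi;               /* used in softirq mode */
            int (*poll)(struct napi_struct *, int); /* driver poll function */
    };

    /* Thread mode: run the driver's poll in the IRQ thread until drained. */
    static irqreturn_t napi_irq_thread_fn(int irq, void *data)
    {
            struct napi_irq *ni = data;

            while (ni->poll(ni->napi, NAPI_POLL_WEIGHT) >= NAPI_POLL_WEIGHT)
                    cond_resched();
            return IRQ_HANDLED;
    }

    /* Hypothetical wrapper: one call site, two NAPI dispatch modes. */
    static int request_napi_irq(unsigned int irq, irq_handler_t ack,
                                struct napi_irq *ni, unsigned long flags,
                                const char *name, bool threaded)
    {
            if (threaded)
                    /* 'ack' masks the device IRQ, returns IRQ_WAKE_THREAD */
                    return request_threaded_irq(irq, ack, napi_irq_thread_fn,
                                                flags, name, ni);
            /* softirq mode: 'ack' calls napi_schedule(ni->napi) itself */
            return request_irq(irq, ack, flags, name, ni);
    }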
On Tue, 8 Feb 2022 12:51:07 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > > I had to look into NAPI threads for some reason.
> > > What I dislike is that after enabling them via sysfs I have to:
> > > - adjust the task priority manually so it is preferred over other
> > >   threads. This is usually important on RT. But then there is no
> > >   overload protection.
> > >
> > > - set an affinity mask for the thread so it does not migrate from
> > >   one CPU to the other. This is worse for an RT task, where the
> > >   scheduler tries to keep the task running.
> > >
> > > Wouldn't it work to utilize the threaded-IRQ API and use that
> > > instead of the custom thread? Basically, the primary handler would
> > > do what it already does (disable the interrupt) and the threaded
> > > handler would feed packets into the stack. In the overload case one
> > > would need to lower the thread priority.
> >
> > Sounds like an interesting direction if you ask me! That said, I have
> > not been able to make threaded NAPI useful in my experiments / with
> > my workloads, so I'd defer to Wei for confirmation.
> >
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?
>
> Uhm, kind of, yes.
>
> Now you have:
>
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
>
> handler_irq() disables the interrupt line and schedules the softirq to
> process handler_napi(). Once handler_napi() is done, it re-enables the
> interrupt line; otherwise it will be processed again on the next tick.
>
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is, but I
> guess you search for that thread, pin it manually to a CPU, and assign
> an RT priority (probably, otherwise it will compete with other tasks
> for CPU resources).

FWIW I don't think servers would want RT prio.

> Instead we could have:
>
> 	request_threaded_irq(, handler_irq, handler_napi);
>
> and we would have basically the same outcome, except that handler_napi()
> runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and the
> CPU affinity is adjusted if the IRQ affinity is changed).
> We would still have to work out the details of what handler_irq() is
> allowed to do and how to handle one IRQ with multiple handler_napi().
>
> If you wrap request_threaded_irq() in something like request_napi_irq()
> then you could switch between the former (softirq-) and latter (thread-)
> based NAPI handling (since you have all the needed details).

One use case to watch out for is drivers which explicitly moved
to threaded NAPI because they want to schedule multiple threads
(NAPIs) from a single IRQ to spread processing across more cores.
On Tue, 2022-02-08 at 12:51 +0100, Sebastian Andrzej Siewior wrote:
> On 2022-02-04 10:50:35 [-0800], Jakub Kicinski wrote:
> > On Fri, 4 Feb 2022 19:03:31 +0100 Sebastian Andrzej Siewior wrote:
> > > On 2022-02-04 09:45:22 [-0800], Jakub Kicinski wrote:
> > > > Coincidentally, I believe the threaded NAPI wake-up is buggy: we
> > > > assume the thread is only woken up when NAPI gets scheduled, but
> > > > IIUC signal delivery and other rare paths may wake up kthreads
> > > > randomly.
> > >
> > > I had to look into NAPI threads for some reason.
> > > What I dislike is that after enabling them via sysfs I have to:
> > > - adjust the task priority manually so it is preferred over other
> > >   threads. This is usually important on RT. But then there is no
> > >   overload protection.
> > >
> > > - set an affinity mask for the thread so it does not migrate from
> > >   one CPU to the other. This is worse for an RT task, where the
> > >   scheduler tries to keep the task running.
> > >
> > > Wouldn't it work to utilize the threaded-IRQ API and use that
> > > instead of the custom thread? Basically, the primary handler would
> > > do what it already does (disable the interrupt) and the threaded
> > > handler would feed packets into the stack. In the overload case one
> > > would need to lower the thread priority.
> >
> > Sounds like an interesting direction if you ask me! That said, I have
> > not been able to make threaded NAPI useful in my experiments / with
> > my workloads, so I'd defer to Wei for confirmation.
> >
> > To be clear -- are you suggesting that drivers just switch to threaded
> > NAPI, or a more dynamic approach where echo 1 > /proc/irq/$n/threaded
> > dynamically engages a thread in a generic fashion?
>
> Uhm, kind of, yes.
>
> Now you have:
>
> 	request_irq(, handler_irq);
> 	netif_napi_add(, , handler_napi);
>
> handler_irq() disables the interrupt line and schedules the softirq to
> process handler_napi(). Once handler_napi() is done, it re-enables the
> interrupt line; otherwise it will be processed again on the next tick.
>
> If you enable threaded NAPI then you end up with a thread and the
> softirq is no longer used. I don't know what the next action is, but I
> guess you search for that thread, pin it manually to a CPU, and assign
> an RT priority (probably, otherwise it will compete with other tasks
> for CPU resources).
>
> Instead we could have:
>
> 	request_threaded_irq(, handler_irq, handler_napi);
>
> and we would have basically the same outcome, except that handler_napi()
> runs at SCHED_FIFO/50 and has the same CPU affinity as the IRQ (and the
> CPU affinity is adjusted if the IRQ affinity is changed).
> We would still have to work out the details of what handler_irq() is
> allowed to do and how to handle one IRQ with multiple handler_napi().
>
> If you wrap request_threaded_irq() in something like request_napi_irq()
> then you could switch between the former (softirq-) and latter (thread-)
> based NAPI handling (since you have all the needed details).

Just for historic reference:

https://lkml.org/lkml/2016/6/15/460

I think that running the thread performing the NAPI loop with SCHED_FIFO
would be dangerous WRT DDoS. Even the affinity setting can give mixed
results depending on the workload - unless you do a good static CPU
allocation, pinning each process manually, it is not really a generic
setup.

Cheers,

Paolo
On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > If you enable threaded NAPI then you end up with a thread and the
> > softirq is no longer used. I don't know what the next action is, but
> > I guess you search for that thread, pin it manually to a CPU, and
> > assign an RT priority (probably, otherwise it will compete with
> > other tasks for CPU resources).
>
> FWIW I don't think servers would want RT prio.

But then the NAPI thread is treated the same way as your xz -9.

> > Instead we could have:
> >
> > 	request_threaded_irq(, handler_irq, handler_napi);
> >
> > and we would have basically the same outcome, except that
> > handler_napi() runs at SCHED_FIFO/50 and has the same CPU affinity
> > as the IRQ (and the CPU affinity is adjusted if the IRQ affinity is
> > changed).
> > We would still have to work out the details of what handler_irq() is
> > allowed to do and how to handle one IRQ with multiple handler_napi().
> >
> > If you wrap request_threaded_irq() in something like
> > request_napi_irq() then you could switch between the former
> > (softirq-) and latter (thread-) based NAPI handling (since you have
> > all the needed details).
>
> One use case to watch out for is drivers which explicitly moved
> to threaded NAPI because they want to schedule multiple threads
> (NAPIs) from a single IRQ to spread processing across more cores.

request_napi_irq() could have a sysfs switch (like we currently have).
But you mentioned one IRQ and multiple NAPI threads, to distribute the
work across cores. The usual case is that you have one IRQ for a network
queue and this network queue has one NAPI struct, right?

So in that case you would have one IRQ but two network queues, each with
its own NAPI struct? And then you use napi-threads and pin queue-napi#1
manually to CPU#1 and queue-napi#2 to CPU#2, while the IRQ itself fires
on CPU#1?

Sebastian
On 2022-02-08 16:57:59 [+0100], Paolo Abeni wrote:
> Just for historic reference:
>
> https://lkml.org/lkml/2016/6/15/460

That is
https://lore.kernel.org/all/cover.1465996447.git.pabeni@redhat.com
Let me digest that later…

> I think that running the thread performing the NAPI loop with
> SCHED_FIFO would be dangerous WRT DDoS. Even the affinity setting can
> give mixed results depending on the workload - unless you do a good
> static CPU allocation, pinning each process manually, it is not really
> a generic setup.

The DDoS part is where I meant we need to figure out the details. Since
it is a threaded interrupt, we could do an msleep() as a break, which is
similar to what the softirq code does. Detecting such a case might be
difficult, since the budget is per-thread only and does not involve
other NAPI structs on the same CPU, like the backlog.

The performance is usually best if the IRQ and the threaded handler run
on the same CPU. The win with NAPI threads seems to be that if two NAPIs
fire on the same CPU, the scheduler will see two tasks which will be
moved (at some point) to different CPUs. And in a DDoS-like situation
they can constantly push new skbs into the stack while SCHED_OTHER
ensures that the user can still do something. Without NAPI threads, both
NAPIs would be processed on the same CPU (in order) and eventually moved
to ksoftirqd, where they would take a break under high load.

Sebastian
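
A hedged sketch of that msleep() break, reusing the hypothetical my_* helpers from the earlier sketches; the two-tick deadline is an arbitrary illustrative choice, not anything proposed in the thread:

    #include <linux/delay.h>
    #include <linux/interrupt.h>
    #include <linux/jiffies.h>
    #include <linux/netdevice.h>

    struct my_priv;                                 /* as in earlier sketches */
    int my_rx_poll(struct my_priv *priv, int budget);
    void my_hw_unmask_irq(struct my_priv *priv);

    static irqreturn_t my_thread_handler(int irq, void *data)
    {
            struct my_priv *priv = data;
            unsigned long deadline = jiffies + 2;   /* ~2 ticks of polling */

            while (my_rx_poll(priv, NAPI_POLL_WEIGHT) >= NAPI_POLL_WEIGHT) {
                    if (time_after(jiffies, deadline)) {
                            msleep(1);      /* the "break": let SCHED_OTHER run */
                            deadline = jiffies + 2;
                    }
            }

            my_hw_unmask_irq(priv);
            return IRQ_HANDLED;
    }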
On Tue, 8 Feb 2022 18:45:53 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-08 07:35:20 [-0800], Jakub Kicinski wrote:
> > > If you enable threaded NAPI then you end up with a thread and the
> > > softirq is no longer used. I don't know what the next action is,
> > > but I guess you search for that thread, pin it manually to a CPU,
> > > and assign an RT priority (probably, otherwise it will compete
> > > with other tasks for CPU resources).
> >
> > FWIW I don't think servers would want RT prio.
>
> But then the NAPI thread is treated the same way as your xz -9.

You'd either pin the workload away from the network processing cores or,
if there's no pinning, prefer requests to run to completion to achieve
lower latency.

> > > Instead we could have:
> > >
> > > 	request_threaded_irq(, handler_irq, handler_napi);
> > >
> > > and we would have basically the same outcome, except that
> > > handler_napi() runs at SCHED_FIFO/50 and has the same CPU affinity
> > > as the IRQ (and the CPU affinity is adjusted if the IRQ affinity
> > > is changed).
> > > We would still have to work out the details of what handler_irq()
> > > is allowed to do and how to handle one IRQ with multiple
> > > handler_napi().
> > >
> > > If you wrap request_threaded_irq() in something like
> > > request_napi_irq() then you could switch between the former
> > > (softirq-) and latter (thread-) based NAPI handling (since you
> > > have all the needed details).
> >
> > One use case to watch out for is drivers which explicitly moved
> > to threaded NAPI because they want to schedule multiple threads
> > (NAPIs) from a single IRQ to spread processing across more cores.
>
> request_napi_irq() could have a sysfs switch (like we currently have).
> But you mentioned one IRQ and multiple NAPI threads, to distribute the
> work across cores. The usual case is that you have one IRQ for a
> network queue and this network queue has one NAPI struct, right?
>
> So in that case you would have one IRQ but two network queues, each
> with its own NAPI struct? And then you use napi-threads and pin
> queue-napi#1 manually to CPU#1 and queue-napi#2 to CPU#2, while the
> IRQ itself fires on CPU#1?

It was mt76, the WiFi driver. I looked in the morning and I think it has
a NAPI for Tx and 2 NAPIs for Rx. The scheduler can spread them around.
Felix will have the exact details.
On 04.02.2022 10:50:35, Jakub Kicinski wrote:
> > Wouldn't it work to utilize the threaded-IRQ API and use that instead
> > of the custom thread? Basically, the primary handler would do what it
> > already does (disable the interrupt) and the threaded handler would
> > feed packets into the stack. In the overload case one would need to
> > lower the thread priority.
>
> Sounds like an interesting direction if you ask me! That said, I have
> not been able to make threaded NAPI useful in my experiments / with my
> workloads, so I'd defer to Wei for confirmation.

FWIW: We have a threaded-NAPI use case. On a PREEMPT_RT-enabled imx6
(4-core 32-bit ARM), the NAPI of the CAN controller (flexcan) runs in
threaded mode with raised priorities. This reduces latency spikes and
jitter of RX'ed CAN frames, which are part of a closed control loop.

regards,
Marc
diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f6..f93b3173454c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4239,7 +4239,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 	}
 
 	list_add_tail(&napi->poll_list, &sd->poll_list);
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+	raise_softirq_irqoff(NET_RX_SOFTIRQ);
 }
 
 #ifdef CONFIG_RPS