
[net,v2] net: fix race between napi kthread mode and busy poll

Message ID 20210227003047.1051347-1-weiwan@google.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series: [net,v2] net: fix race between napi kthread mode and busy poll

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers warning 5 maintainers not CCed: andriin@fb.com cong.wang@bytedance.com ap420073@gmail.com daniel@iogearbox.net ast@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 6948 this patch: 6948
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 90 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 7161 this patch: 7161
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Wei Wang Feb. 27, 2021, 12:30 a.m. UTC
Currently, napi_thread_wait() checks the NAPI_STATE_SCHED bit to
determine whether the kthread owns this napi and may call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears the SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch fixes the race by adding a new bit,
NAPI_STATE_SCHED_THREADED, to napi->state. This bit gets set in
____napi_schedule() if threaded mode is enabled, gets cleared in
napi_complete_done(), and the kthread only polls the napi if this
bit is set. This distinguishes the kthread's ownership of the napi
from the other scenarios and fixes the race.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 20 +++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

Comments

Jakub Kicinski Feb. 27, 2021, 12:48 a.m. UTC | #1
On Fri, 26 Feb 2021 16:30:47 -0800 Wei Wang wrote:
>  		thread = READ_ONCE(napi->thread);
>  		if (thread) {
> +			set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
>  			wake_up_process(thread);

What about the version which checks RUNNING? As long as
wake_up_process() implies a barrier I _think_ it should 
work as well. Am I missing some case, or did you decide
to go with the simpler/safer approach?
Wei Wang Feb. 27, 2021, 1:02 a.m. UTC | #2
On Fri, Feb 26, 2021 at 4:48 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 26 Feb 2021 16:30:47 -0800 Wei Wang wrote:
> >               thread = READ_ONCE(napi->thread);
> >               if (thread) {
> > +                     set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
> >                       wake_up_process(thread);
>
> What about the version which checks RUNNING? As long as
> wake_up_process() implies a barrier I _think_ it should
> work as well. Am I missing some case, or did you decide
> to go with the simpler/safer approach?


I assume you are referring to the following proposed patch from your
previous email, right?
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4294,6 +4294,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       if (thread->state == TASK_RUNNING)
+                               set_bit(NAPIF_STATE_SCHED_THREAD, &napi->state);
                        wake_up_process(thread);
                        return;
                }
@@ -6486,7 +6488,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));

                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
-                             NAPIF_STATE_PREFER_BUSY_POLL);
+                             NAPIF_STATE_PREFER_BUSY_POLL |
+                             NAPIF_STATE_SCHED_THREAD);

                /* If STATE_MISSED was set, leave STATE_SCHED set,
                 * because we will call napi->poll() one more time.
@@ -6968,16 +6971,24 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)

 static int napi_thread_wait(struct napi_struct *napi)
 {
+       bool woken = false;
+
        set_current_state(TASK_INTERRUPTIBLE);

        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+               unsigned long state = READ_ONCE(napi->state);
+
+               if ((state & NAPIF_STATE_SCHED) &&
+                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;
+               } else {
+                       WARN_ON(woken);
                }

                schedule();
+               woken = true;
                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);

I don't think it is sufficient to set the SCHED_THREADED bit only when
the thread is in RUNNING state.
In fact, the thread is most likely NOT in RUNNING state before we call
wake_up_process() in ____napi_schedule(), because it has finished the
previous round of napi->poll() and the SCHED bit was cleared, so
napi_thread_wait() set the state to INTERRUPTIBLE and the schedule()
call should already have put it to sleep.
Jakub Kicinski Feb. 27, 2021, 1:22 a.m. UTC | #3
On Fri, 26 Feb 2021 17:02:17 -0800 Wei Wang wrote:
>  static int napi_thread_wait(struct napi_struct *napi)
>  {
> +       bool woken = false;
> +
>         set_current_state(TASK_INTERRUPTIBLE);
> 
>         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> +               unsigned long state = READ_ONCE(napi->state);
> +
> +               if ((state & NAPIF_STATE_SCHED) &&
> +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
>                         WARN_ON(!list_empty(&napi->poll_list));
>                         __set_current_state(TASK_RUNNING);
>                         return 0;
> +               } else {
> +                       WARN_ON(woken);
>                 }
> 
>                 schedule();
> +               woken = true;
>                 set_current_state(TASK_INTERRUPTIBLE);
>         }
>         __set_current_state(TASK_RUNNING);
> 
> I don't think it is sufficient to only set SCHED_THREADED bit when the
> thread is in RUNNING state.
> In fact, the thread is most likely NOT in RUNNING mode before we call
> wake_up_process() in ____napi_schedule(), because it has finished the
> previous round of napi->poll() and SCHED bit was cleared, so
> napi_thread_wait() sets the state to INTERRUPTIBLE and schedule() call
> should already put it in sleep.

That's why the check says "|| woken":

	((state & NAPIF_STATE_SCHED_THREAD) || woken))

thread knows it owns the NAPI if:

  (a) the NAPI has the explicit flag set
or
  (b) it was just woken up and !kthread_should_stop(), since only
      someone who just claimed the normal SCHED on thread's behalf 
      will wake it up
Wei Wang Feb. 27, 2021, 1:35 a.m. UTC | #4
On Fri, Feb 26, 2021 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 26 Feb 2021 17:02:17 -0800 Wei Wang wrote:
> >  static int napi_thread_wait(struct napi_struct *napi)
> >  {
> > +       bool woken = false;
> > +
> >         set_current_state(TASK_INTERRUPTIBLE);
> >
> >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> > +               unsigned long state = READ_ONCE(napi->state);
> > +
> > +               if ((state & NAPIF_STATE_SCHED) &&
> > +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
> >                         WARN_ON(!list_empty(&napi->poll_list));
> >                         __set_current_state(TASK_RUNNING);
> >                         return 0;
> > +               } else {
> > +                       WARN_ON(woken);
> >                 }
> >
> >                 schedule();
> > +               woken = true;
> >                 set_current_state(TASK_INTERRUPTIBLE);
> >         }
> >         __set_current_state(TASK_RUNNING);
> >
> > I don't think it is sufficient to only set SCHED_THREADED bit when the
> > thread is in RUNNING state.
> > In fact, the thread is most likely NOT in RUNNING mode before we call
> > wake_up_process() in ____napi_schedule(), because it has finished the
> > previous round of napi->poll() and SCHED bit was cleared, so
> > napi_thread_wait() sets the state to INTERRUPTIBLE and schedule() call
> > should already put it in sleep.
>
> That's why the check says "|| woken":
>
>         ((state & NAPIF_STATE_SCHED_THREAD) ||  woken))
>
> thread knows it owns the NAPI if:
>
>   (a) the NAPI has the explicit flag set
> or
>   (b) it was just woken up and !kthread_should_stop(), since only
>       someone who just claimed the normal SCHED on thread's behalf
>       will wake it up

The 'woken' is set after schedule(). If it is the first time
napi_thread_wait() is called, and SCHED_THREADED is not set, and
woken is not set either, this thread will be put to sleep when it
reaches schedule(), even though there is work waiting to be done on
that napi. And I think this kthread will not be woken up again
afterwards, since the SCHED bit is already grabbed.
Jakub Kicinski Feb. 27, 2021, 2:08 a.m. UTC | #5
On Fri, 26 Feb 2021 17:35:21 -0800 Wei Wang wrote:
> On Fri, Feb 26, 2021 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 26 Feb 2021 17:02:17 -0800 Wei Wang wrote:  
> > >  static int napi_thread_wait(struct napi_struct *napi)
> > >  {
> > > +       bool woken = false;
> > > +
> > >         set_current_state(TASK_INTERRUPTIBLE);
> > >
> > >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> > > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> > > +               unsigned long state = READ_ONCE(napi->state);
> > > +
> > > +               if ((state & NAPIF_STATE_SCHED) &&
> > > +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
> > >                         WARN_ON(!list_empty(&napi->poll_list));
> > >                         __set_current_state(TASK_RUNNING);
> > >                         return 0;
> > > +               } else {
> > > +                       WARN_ON(woken);
> > >                 }
> > >
> > >                 schedule();
> > > +               woken = true;
> > >                 set_current_state(TASK_INTERRUPTIBLE);
> > >         }
> > >         __set_current_state(TASK_RUNNING);
> > >
> > > I don't think it is sufficient to only set SCHED_THREADED bit when the
> > > thread is in RUNNING state.
> > > In fact, the thread is most likely NOT in RUNNING mode before we call
> > > wake_up_process() in ____napi_schedule(), because it has finished the
> > > previous round of napi->poll() and SCHED bit was cleared, so
> > > napi_thread_wait() sets the state to INTERRUPTIBLE and schedule() call
> > > should already put it in sleep.  
> >
> > That's why the check says "|| woken":
> >
> >         ((state & NAPIF_STATE_SCHED_THREAD) ||  woken))
> >
> > thread knows it owns the NAPI if:
> >
> >   (a) the NAPI has the explicit flag set
> > or
> >   (b) it was just woken up and !kthread_should_stop(), since only
> >       someone who just claimed the normal SCHED on thread's behalf
> >       will wake it up  
> 
> The 'woken' is set after schedule(). If it is the first time
> napi_thread_wait() is called, and SCHED_THREADED is not set, and
> woken is not set either, this thread will be put to sleep when it
> reaches schedule(), even though there is work waiting to be done on
> that napi. And I think this kthread will not be woken up again
> afterwards, since the SCHED bit is already grabbed.

Indeed, looks like the task will be in WAKING state until it runs?
We can switch the check in ____napi_schedule() from

	if (thread->state == TASK_RUNNING)

to

	if (!(thread->state & TASK_INTERRUPTIBLE))

?
Wei Wang Feb. 27, 2021, 7 p.m. UTC | #6
On Fri, Feb 26, 2021 at 6:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 26 Feb 2021 17:35:21 -0800 Wei Wang wrote:
> > On Fri, Feb 26, 2021 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Fri, 26 Feb 2021 17:02:17 -0800 Wei Wang wrote:
> > > >  static int napi_thread_wait(struct napi_struct *napi)
> > > >  {
> > > > +       bool woken = false;
> > > > +
> > > >         set_current_state(TASK_INTERRUPTIBLE);
> > > >
> > > >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> > > > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> > > > +               unsigned long state = READ_ONCE(napi->state);
> > > > +
> > > > +               if ((state & NAPIF_STATE_SCHED) &&
> > > > +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
> > > >                         WARN_ON(!list_empty(&napi->poll_list));
> > > >                         __set_current_state(TASK_RUNNING);
> > > >                         return 0;
> > > > +               } else {
> > > > +                       WARN_ON(woken);
> > > >                 }
> > > >
> > > >                 schedule();
> > > > +               woken = true;
> > > >                 set_current_state(TASK_INTERRUPTIBLE);
> > > >         }
> > > >         __set_current_state(TASK_RUNNING);
> > > >
> > > > I don't think it is sufficient to only set SCHED_THREADED bit when the
> > > > thread is in RUNNING state.
> > > > In fact, the thread is most likely NOT in RUNNING mode before we call
> > > > wake_up_process() in ____napi_schedule(), because it has finished the
> > > > previous round of napi->poll() and SCHED bit was cleared, so
> > > > napi_thread_wait() sets the state to INTERRUPTIBLE and schedule() call
> > > > should already put it in sleep.
> > >
> > > That's why the check says "|| woken":
> > >
> > >         ((state & NAPIF_STATE_SCHED_THREAD) ||  woken))
> > >
> > > thread knows it owns the NAPI if:
> > >
> > >   (a) the NAPI has the explicit flag set
> > > or
> > >   (b) it was just woken up and !kthread_should_stop(), since only
> > >       someone who just claimed the normal SCHED on thread's behalf
> > >       will wake it up
> >
> > The 'woken' is set after schedule(). If it is the first time
> > napi_thread_wait() is called, and SCHED_THREADED is not set, and
> > woken is not set either, this thread will be put to sleep when it
> > reaches schedule(), even though there is work waiting to be done on
> > that napi. And I think this kthread will not be woken up again
> > afterwards, since the SCHED bit is already grabbed.
>
> Indeed, looks like the task will be in WAKING state until it runs?
> We can switch the check in ____napi_schedule() from
>
>         if (thread->state == TASK_RUNNING)
>
> to
>
>         if (!(thread->state & TASK_INTERRUPTIBLE))
>
> ?

Hmm... I am not very sure what state the thread will be put in after
kthread_create(). Could it be in TASK_INTERRUPTIBLE?
Wei Wang Feb. 27, 2021, 11:23 p.m. UTC | #7
On Sat, Feb 27, 2021 at 11:00 AM Wei Wang <weiwan@google.com> wrote:
>
> On Fri, Feb 26, 2021 at 6:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Fri, 26 Feb 2021 17:35:21 -0800 Wei Wang wrote:
> > > On Fri, Feb 26, 2021 at 5:22 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > >
> > > > On Fri, 26 Feb 2021 17:02:17 -0800 Wei Wang wrote:
> > > > >  static int napi_thread_wait(struct napi_struct *napi)
> > > > >  {
> > > > > +       bool woken = false;
> > > > > +
> > > > >         set_current_state(TASK_INTERRUPTIBLE);
> > > > >
> > > > >         while (!kthread_should_stop() && !napi_disable_pending(napi)) {
> > > > > -               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
> > > > > +               unsigned long state = READ_ONCE(napi->state);
> > > > > +
> > > > > +               if ((state & NAPIF_STATE_SCHED) &&
> > > > > +                   ((state & NAPIF_STATE_SCHED_THREAD) || woken)) {
> > > > >                         WARN_ON(!list_empty(&napi->poll_list));
> > > > >                         __set_current_state(TASK_RUNNING);
> > > > >                         return 0;
> > > > > +               } else {
> > > > > +                       WARN_ON(woken);
> > > > >                 }
> > > > >
> > > > >                 schedule();
> > > > > +               woken = true;
> > > > >                 set_current_state(TASK_INTERRUPTIBLE);
> > > > >         }
> > > > >         __set_current_state(TASK_RUNNING);
> > > > >
> > > > > I don't think it is sufficient to only set SCHED_THREADED bit when the
> > > > > thread is in RUNNING state.
> > > > > In fact, the thread is most likely NOT in RUNNING mode before we call
> > > > > wake_up_process() in ____napi_schedule(), because it has finished the
> > > > > previous round of napi->poll() and SCHED bit was cleared, so
> > > > > napi_thread_wait() sets the state to INTERRUPTIBLE and schedule() call
> > > > > should already put it in sleep.
> > > >
> > > > That's why the check says "|| woken":
> > > >
> > > >         ((state & NAPIF_STATE_SCHED_THREAD) ||  woken))
> > > >
> > > > thread knows it owns the NAPI if:
> > > >
> > > >   (a) the NAPI has the explicit flag set
> > > > or
> > > >   (b) it was just woken up and !kthread_should_stop(), since only
> > > >       someone who just claimed the normal SCHED on thread's behalf
> > > >       will wake it up
> > >
> > > The 'woken' is set after schedule(). If it is the first time
> > > napi_thread_wait() is called, and SCHED_THREADED is not set, and
> > > woken is not set either, this thread will be put to sleep when it
> > > reaches schedule(), even though there is work waiting to be done on
> > > that napi. And I think this kthread will not be woken up again
> > > afterwards, since the SCHED bit is already grabbed.
> >
> > Indeed, looks like the task will be in WAKING state until it runs?
> > We can switch the check in ____napi_schedule() from
> >
> >         if (thread->state == TASK_RUNNING)
> >
> > to
> >
> >         if (!(thread->state & TASK_INTERRUPTIBLE))
> >
> > ?
>
> Hmm... I am not very sure what state the thread will be put in after
> kthread_create(). Could it be in TASK_INTERRUPTIBLE?

I did a printk and confirmed that the thread->state is
TASK_UNINTERRUPTIBLE after kthread_create() is called.
So I think if we change the above check to:
          if (thread->state != TASK_INTERRUPTIBLE)
                  set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
it should work.

I tested the following patch on my setup and saw no issues:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..682908707c1a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,6 +360,7 @@ enum {
        NAPI_STATE_IN_BUSY_POLL,        /* sk_busy_loop() owns this NAPI */
        NAPI_STATE_PREFER_BUSY_POLL,    /* prefer busy-polling over softirq processing*/
        NAPI_STATE_THREADED,            /* The poll is performed inside its own thread*/
+       NAPI_STATE_SCHED_THREADED,      /* Napi is currently scheduled in threaded mode */
 };

 enum {
@@ -372,6 +373,7 @@ enum {
        NAPIF_STATE_IN_BUSY_POLL        = BIT(NAPI_STATE_IN_BUSY_POLL),
        NAPIF_STATE_PREFER_BUSY_POLL    = BIT(NAPI_STATE_PREFER_BUSY_POLL),
        NAPIF_STATE_THREADED            = BIT(NAPI_STATE_THREADED),
+       NAPIF_STATE_SCHED_THREADED      = BIT(NAPI_STATE_SCHED_THREADED),
 };

 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..43607523ee99 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1501,17 +1501,18 @@ static int napi_kthread_create(struct napi_struct *n)
 {
        int err = 0;

-       /* Create and wake up the kthread once to put it in
-        * TASK_INTERRUPTIBLE mode to avoid the blocked task
-        * warning and work with loadavg.
+       /* Avoid waking up the kthread during creation to prevent
+        * potential race.
         */
-       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
-                               n->dev->name, n->napi_id);
+       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
+                                  n->dev->name, n->napi_id);
        if (IS_ERR(n->thread)) {
                err = PTR_ERR(n->thread);
-               pr_err("kthread_run failed with err %d\n", err);
+               pr_err("kthread_create failed with err %d\n", err);
                n->thread = NULL;
        }
@@ -4294,6 +4295,8 @@ static inline void ____napi_schedule(struct softnet_data *sd,
                 */
                thread = READ_ONCE(napi->thread);
                if (thread) {
+                       if (thread->state != TASK_INTERRUPTIBLE)
+                               set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
                        wake_up_process(thread);
                        return;
                }
@@ -6486,6 +6489,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
                WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));

                new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
+                             NAPIF_STATE_SCHED_THREADED |
                              NAPIF_STATE_PREFER_BUSY_POLL);

                /* If STATE_MISSED was set, leave STATE_SCHED set,
@@ -6968,16 +6972,24 @@ static int napi_poll(struct napi_struct *n, struct list_head *repoll)

 static int napi_thread_wait(struct napi_struct *napi)
 {
+       bool woken = false;
+
        set_current_state(TASK_INTERRUPTIBLE);

        while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-               if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+               /* Testing SCHED_THREADED bit here to make sure the current
+                * kthread owns this napi and could poll on this napi.
+                * Testing SCHED bit is not enough because SCHED bit might be
+                * set by some other busy poll thread or by napi_disable().
+                */
+               if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state) || woken) {
                        WARN_ON(!list_empty(&napi->poll_list));
                        __set_current_state(TASK_RUNNING);
                        return 0;
                }

                schedule();
+               /* woken being true indicates this thread owns this napi. */
+               woken = true;
                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);

Jakub, Eric and Alexander,
What do you think of the above patch?
To me, the logic here is more complicated than the original v2 patch,
but it avoids repeated set_bit() calls in ____napi_schedule(), so it
may be worthwhile?
Jakub Kicinski Feb. 28, 2021, 7:17 p.m. UTC | #8
On Sat, 27 Feb 2021 15:23:56 -0800 Wei Wang wrote:
> > > Indeed, looks like the task will be in WAKING state until it runs?
> > > We can switch the check in ____napi_schedule() from
> > >
> > >         if (thread->state == TASK_RUNNING)
> > >
> > > to
> > >
> > >         if (!(thread->state & TASK_INTERRUPTIBLE))
> > >
> > > ?  
> >
> > Hmm... I am not very sure what state the thread will be put in after
> > kthread_create(). Could it be in TASK_INTERRUPTIBLE?  
> 
> I did a printk and confirmed that the thread->state is
> TASK_UNINTERRUPTIBLE after kthread_create() is called.
> So I think if we change the above state to:
>           if (thread->state != TASK_INTERRUPTIBLE)
>                   set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
> It should work.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index 6c5967e80132..43607523ee99 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1501,17 +1501,18 @@ static int napi_kthread_create(struct napi_struct *n)
>  {
>         int err = 0;
> 
> -       /* Create and wake up the kthread once to put it in
> -        * TASK_INTERRUPTIBLE mode to avoid the blocked task
> -        * warning and work with loadavg.
> +       /* Avoid waking up the kthread during creation to prevent
> +        * potential race.
>          */
> -       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
> -                               n->dev->name, n->napi_id);
> +       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> +                                  n->dev->name, n->napi_id);

Does kthread_run() make the thread go into TASK_INTERRUPTIBLE?
It just calls wake_up_process(), which according to a comment in the
kdoc:

 * Conceptually does:
 *
 *   If (@state & @p->state) @p->state = TASK_RUNNING.

So I think we could safely stick to kthread_run() if the condition
at the NAPI wake point checks for INTERRUPTIBLE?
Wei Wang March 1, 2021, 6:16 p.m. UTC | #9
On Sun, Feb 28, 2021 at 11:17 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 27 Feb 2021 15:23:56 -0800 Wei Wang wrote:
> > > > Indeed, looks like the task will be in WAKING state until it runs?
> > > > We can switch the check in ____napi_schedule() from
> > > >
> > > >         if (thread->state == TASK_RUNNING)
> > > >
> > > > to
> > > >
> > > >         if (!(thread->state & TASK_INTERRUPTIBLE))
> > > >
> > > > ?
> > >
> > > Hmm... I am not very sure what state the thread will be put in after
> > > kthread_create(). Could it be in TASK_INTERRUPTIBLE?
> >
> > I did a printk and confirmed that the thread->state is
> > TASK_UNINTERRUPTIBLE after kthread_create() is called.
> > So I think if we change the above state to:
> >           if (thread->state != TASK_INTERRUPTIBLE)
> >                   set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
> > It should work.
>
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 6c5967e80132..43607523ee99 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1501,17 +1501,18 @@ static int napi_kthread_create(struct napi_struct *n)
> >  {
> >         int err = 0;
> >
> > -       /* Create and wake up the kthread once to put it in
> > -        * TASK_INTERRUPTIBLE mode to avoid the blocked task
> > -        * warning and work with loadavg.
> > +       /* Avoid waking up the kthread during creation to prevent
> > +        * potential race.
> >          */
> > -       n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
> > -                               n->dev->name, n->napi_id);
> > +       n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> > +                                  n->dev->name, n->napi_id);
>
> Does kthread_run() make the thread go into TASK_INTERRUPTIBLE ?
> It just calls wake_up_process(), which according to a comment in the
> kdoc..
>
>  * Conceptually does:
>  *
>  *   If (@state & @p->state) @p->state = TASK_RUNNING.
>
> So I think we could safely stick to kthread_run() if the condition in
> at the NAPI wake point checks for INTERRUPTIBLE?

I think so. kthread_run() wakes up the kthread, then napi_thread_wait()
should put it in TASK_INTERRUPTIBLE state and schedule() will make it go
to sleep and wait for the next napi_schedule().
I've also tested this on my setup and saw no issues.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ddf4cfc12615..682908707c1a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -360,6 +360,7 @@  enum {
 	NAPI_STATE_IN_BUSY_POLL,	/* sk_busy_loop() owns this NAPI */
 	NAPI_STATE_PREFER_BUSY_POLL,	/* prefer busy-polling over softirq processing*/
 	NAPI_STATE_THREADED,		/* The poll is performed inside its own thread*/
+	NAPI_STATE_SCHED_THREADED,	/* Napi is currently scheduled in threaded mode */
 };
 
 enum {
@@ -372,6 +373,7 @@  enum {
 	NAPIF_STATE_IN_BUSY_POLL	= BIT(NAPI_STATE_IN_BUSY_POLL),
 	NAPIF_STATE_PREFER_BUSY_POLL	= BIT(NAPI_STATE_PREFER_BUSY_POLL),
 	NAPIF_STATE_THREADED		= BIT(NAPI_STATE_THREADED),
+	NAPIF_STATE_SCHED_THREADED	= BIT(NAPI_STATE_SCHED_THREADED),
 };
 
 enum gro_result {
diff --git a/net/core/dev.c b/net/core/dev.c
index 6c5967e80132..d4ce154c8df5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1501,15 +1501,14 @@  static int napi_kthread_create(struct napi_struct *n)
 {
 	int err = 0;
 
-	/* Create and wake up the kthread once to put it in
-	 * TASK_INTERRUPTIBLE mode to avoid the blocked task
-	 * warning and work with loadavg.
+	/* Avoid waking up the kthread during creation to prevent
+	 * potential race.
 	 */
-	n->thread = kthread_run(napi_threaded_poll, n, "napi/%s-%d",
-				n->dev->name, n->napi_id);
+	n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
+				   n->dev->name, n->napi_id);
 	if (IS_ERR(n->thread)) {
 		err = PTR_ERR(n->thread);
-		pr_err("kthread_run failed with err %d\n", err);
+		pr_err("kthread_create failed with err %d\n", err);
 		n->thread = NULL;
 	}
 
@@ -4294,6 +4293,7 @@  static inline void ____napi_schedule(struct softnet_data *sd,
 		 */
 		thread = READ_ONCE(napi->thread);
 		if (thread) {
+			set_bit(NAPI_STATE_SCHED_THREADED, &napi->state);
 			wake_up_process(thread);
 			return;
 		}
@@ -6486,6 +6486,7 @@  bool napi_complete_done(struct napi_struct *n, int work_done)
 		WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
 
 		new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED |
+			      NAPIF_STATE_SCHED_THREADED |
 			      NAPIF_STATE_PREFER_BUSY_POLL);
 
 		/* If STATE_MISSED was set, leave STATE_SCHED set,
@@ -6971,7 +6972,12 @@  static int napi_thread_wait(struct napi_struct *napi)
 	set_current_state(TASK_INTERRUPTIBLE);
 
 	while (!kthread_should_stop() && !napi_disable_pending(napi)) {
-		if (test_bit(NAPI_STATE_SCHED, &napi->state)) {
+		/* Testing SCHED_THREADED bit here to make sure the current
+		 * kthread owns this napi and could poll on this napi.
+		 * Testing SCHED bit is not enough because SCHED bit might be
+		 * set by some other busy poll thread or by napi_disable().
+		 */
+		if (test_bit(NAPI_STATE_SCHED_THREADED, &napi->state)) {
 			WARN_ON(!list_empty(&napi->poll_list));
 			__set_current_state(TASK_RUNNING);
 			return 0;