
[net,v2] napi: fix race inside napi_enable

Message ID 20210918085232.71436-1-xuanzhuo@linux.alibaba.com (mailing list archive)
State Accepted
Commit 3765996e4f0b8a755cab215a08df744490c76052
Delegated to: Netdev Maintainers
Headers show
Series [net,v2] napi: fix race inside napi_enable

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers success CCed 13 of 13 maintainers
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 6 this patch: 6
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
netdev/build_allmodconfig_warn success Errors and warnings before: 6 this patch: 6
netdev/header_inline success Link

Commit Message

Xuan Zhuo Sept. 18, 2021, 8:52 a.m. UTC
The race below can leave napi.state with NAPI_STATE_SCHED set while the
napi is not on the poll_list, which causes napi_disable() to get stuck.

The prefix "NAPI_STATE_" is removed in the figure below, and
NAPI_STATE_HASHED is ignored in napi.state.

                      CPU0       |                   CPU1       | napi.state
===============================================================================
napi_disable()                   |                              | SCHED | NPSVC
napi_enable()                    |                              |
{                                |                              |
    smp_mb__before_atomic();     |                              |
    clear_bit(SCHED, &n->state); |                              | NPSVC
                                 | napi_schedule_prep()         | SCHED | NPSVC
                                 | napi_poll()                  |
                                 |   napi_complete_done()       |
                                 |   {                          |
                                 |      if (n->state & (NPSVC | | (1)
                                 |               _BUSY_POLL)))  |
                                 |           return false;      |
                                 |     ................         |
                                 |   }                          | SCHED | NPSVC
                                 |                              |
    clear_bit(NPSVC, &n->state); |                              | SCHED
}                                |                              |
                                 |                              |
napi_schedule_prep()             |                              | SCHED | MISSED (2)

(1) Returns false directly here, because NAPI_STATE_NPSVC is set.
(2) NAPI_STATE_SCHED is already set, so the napi is not added to sd->poll_list.

Since NAPI_STATE_SCHED is set but the napi is not on the sd->poll_list
queue, nothing will ever poll it, so NAPI_STATE_SCHED can never be
cleared.

1. This queue will no longer receive packets.
2. If napi_disable() is then called under rtnl_lock, it blocks forever
   with rtnl_lock held, stalling the whole system.

This patch reimplements napi_enable() with cmpxchg, so that both bits
are cleared in one atomic operation and the race caused by clearing
the two bits separately cannot occur.

Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/core/dev.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Comments

patchwork-bot+netdevbpf@kernel.org Sept. 20, 2021, 8:50 a.m. UTC | #1
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Sat, 18 Sep 2021 16:52:32 +0800 you wrote:
> [...]

Here is the summary with links:
  - [net,v2] napi: fix race inside napi_enable
    https://git.kernel.org/netdev/net/c/3765996e4f0b

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Jakub Kicinski Sept. 20, 2021, 7:20 p.m. UTC | #2
On Sat, 18 Sep 2021 16:52:32 +0800 Xuan Zhuo wrote:
> [...]

Why don't you just invert the order of clearing the bits:

diff --git a/net/core/dev.c b/net/core/dev.c
index a796754f75cc..706eca8112c1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6953,8 +6953,8 @@ void napi_enable(struct napi_struct *n)
 {
        BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
        smp_mb__before_atomic();
-       clear_bit(NAPI_STATE_SCHED, &n->state);
        clear_bit(NAPI_STATE_NPSVC, &n->state);
+       clear_bit(NAPI_STATE_SCHED, &n->state);
        if (n->dev->threaded && n->thread)
                set_bit(NAPI_STATE_THREADED, &n->state);
 }

That's simpler and symmetric with the disable path.
Xuan Zhuo Sept. 22, 2021, 6:47 a.m. UTC | #3
On Mon, 20 Sep 2021 12:20:24 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Sat, 18 Sep 2021 16:52:32 +0800 Xuan Zhuo wrote:
> > [...]
>
> Why don't you just invert the order of clearing the bits:

I think this should be an atomic operation. The original two-step clear
is itself problematic, so this is more than a fix for just this one
race.

Thanks.

Jakub Kicinski Sept. 23, 2021, 1:14 p.m. UTC | #4
On Wed, 22 Sep 2021 14:47:47 +0800 Xuan Zhuo wrote:
> > Why don't you just invert the order of clearing the bits:  
> 
> I think this should be an atomic operation. The original two-step clear
> is itself problematic, so this is more than a fix for just this one
> race.

[resending, my MUA seems to have corrupted the CC list previously]

Can you show what breaks by it being non-atomic?

Because, again, the disable part is not atomic. Either it's needed on
both sides or it's not needed on either.
Jakub Kicinski Sept. 23, 2021, 2:54 p.m. UTC | #5
On Thu, 23 Sep 2021 21:40:56 +0800 Xuan Zhuo wrote:
> On Thu, 23 Sep 2021 06:14:17 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> > > I think this should be an atomic operation. The original two-step clear
> > > is itself problematic, so this is more than a fix for just this one
> > > race.
> >
> > [resending, my MUA seems to have corrupted the CC list previously]
> >
> > Can you show what breaks by it being non-atomic?  
> 
> Isn't this very problem caused by the non-atomic update?
> 
> Of course, reordering the clears does seem to solve this particular
> problem. But compared with switching to an atomic operation, we would
> have to test for any other problems the reordering might introduce.
> 
> Relatively speaking, the atomic update is the simpler approach.

Whether atomics are simple or not is not the question.

What I'm saying is that having asymmetric enable and disable paths
is fragile.

> > Because, again, the disable part is not atomic. Either it's needed on
> > both sides or it's not needed on either.  
> 
> For the disable path, I think it's okay not to use atomics. Have you
> considered any special scenarios?

The point is both sides should do the same thing.
Sukadev Bhattiprolu Oct. 18, 2021, 9:58 p.m. UTC | #6
Xuan Zhuo [xuanzhuo@linux.alibaba.com] wrote:
> The process will cause napi.state to contain NAPI_STATE_SCHED and
> not in the poll_list, which will cause napi_disable() to get stuck.
> 
> The prefix "NAPI_STATE_" is removed in the figure below, and
> NAPI_STATE_HASHED is ignored in napi.state.
> 
>                       CPU0       |                   CPU1       | napi.state
> ===============================================================================
> napi_disable()                   |                              | SCHED | NPSVC
> napi_enable()                    |                              |
> {                                |                              |
>     smp_mb__before_atomic();     |                              |
>     clear_bit(SCHED, &n->state); |                              | NPSVC
>                                  | napi_schedule_prep()         | SCHED | NPSVC
>                                  | napi_poll()                  |
>                                  |   napi_complete_done()       |
>                                  |   {                          |
>                                  |      if (n->state & (NPSVC | | (1)
>                                  |               _BUSY_POLL)))  |
>                                  |           return false;      |
>                                  |     ................         |
>                                  |   }                          | SCHED | NPSVC
>                                  |                              |
>     clear_bit(NPSVC, &n->state); |                              | SCHED
> }                                |                              |

So it's possible that after cpu0 cleared SCHED, cpu1 could have set it
back, and we are going to use cmpxchg() to detect that and retry, right?
If so,


> [...]
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 74fd402d26dd..7ee9fecd3aff 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
>   */
>  void napi_enable(struct napi_struct *n)
>  {
> -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> -	smp_mb__before_atomic();
> -	clear_bit(NAPI_STATE_SCHED, &n->state);
> -	clear_bit(NAPI_STATE_NPSVC, &n->state);
> -	if (n->dev->threaded && n->thread)
> -		set_bit(NAPI_STATE_THREADED, &n->state);
> +	unsigned long val, new;
> +
> +	do {
> +		val = READ_ONCE(n->state);
> +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));

is this BUG_ON valid/needed? We could have lost the cmpxchg() and
the other thread could have set NAPI_STATE_SCHED?

Sukadev

> +
> +		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> +		if (n->dev->threaded && n->thread)
> +			new |= NAPIF_STATE_THREADED;
> +	} while (cmpxchg(&n->state, val, new) != val);
>  }
>  EXPORT_SYMBOL(napi_enable);
>  
> -- 
> 2.31.0
Jakub Kicinski Oct. 18, 2021, 10:55 p.m. UTC | #7
On Mon, 18 Oct 2021 14:58:08 -0700 Sukadev Bhattiprolu wrote:
> >                       CPU0       |                   CPU1       | napi.state
> > ===============================================================================
> > napi_disable()                   |                              | SCHED | NPSVC
> > napi_enable()                    |                              |
> > {                                |                              |
> >     smp_mb__before_atomic();     |                              |
> >     clear_bit(SCHED, &n->state); |                              | NPSVC
> >                                  | napi_schedule_prep()         | SCHED | NPSVC
> >                                  | napi_poll()                  |
> >                                  |   napi_complete_done()       |
> >                                  |   {                          |
> >                                  |      if (n->state & (NPSVC | | (1)
> >                                  |               _BUSY_POLL)))  |
> >                                  |           return false;      |
> >                                  |     ................         |
> >                                  |   }                          | SCHED | NPSVC
> >                                  |                              |
> >     clear_bit(NPSVC, &n->state); |                              | SCHED
> > }                                |                              |  
> 
> So its possible that after cpu0 cleared SCHED, cpu1 could have set it
> back and we are going to use cmpxchg() to detect and retry right? If so,

This is a state diagram before the change. There's no cmpxchg() here.
        
> > [...]
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 74fd402d26dd..7ee9fecd3aff 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
> >   */
> >  void napi_enable(struct napi_struct *n)
> >  {
> > -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> > -	smp_mb__before_atomic();
> > -	clear_bit(NAPI_STATE_SCHED, &n->state);
> > -	clear_bit(NAPI_STATE_NPSVC, &n->state);
> > -	if (n->dev->threaded && n->thread)
> > -		set_bit(NAPI_STATE_THREADED, &n->state);
> > +	unsigned long val, new;
> > +
> > +	do {
> > +		val = READ_ONCE(n->state);
> > +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));  
> 
> is this BUG_ON valid/needed? We could have lost the cmpxchg() and
> the other thread could have set NAPI_STATE_SCHED?

The BUG_ON() is here to make sure that when napi_enable() is called the
napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
on disabled NAPIs because that bit means ownership. Whoever disabled
the NAPI owns it.

That BUG_ON() could have been taken outside of the loop; there's no
point in re-checking on every try.

Are you seeing NAPI-related failures? We had at least 3 reports in the
last two weeks of strange failures which look like NAPI state getting
corrupted on net-next...
Dany Madden Oct. 18, 2021, 11:36 p.m. UTC | #8
On 2021-10-18 15:55, Jakub Kicinski wrote:
> On Mon, 18 Oct 2021 14:58:08 -0700 Sukadev Bhattiprolu wrote:
>> > [...]
>> 
>> So it's possible that after cpu0 cleared SCHED, cpu1 could have set it
>> back, and we are going to use cmpxchg() to detect that and retry,
>> right? If so,
> 
> This is a state diagram before the change. There's no cmpxchg() here.
> 
>> > [...]
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 74fd402d26dd..7ee9fecd3aff 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
>> >   */
>> >  void napi_enable(struct napi_struct *n)
>> >  {
>> > -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
>> > -	smp_mb__before_atomic();
>> > -	clear_bit(NAPI_STATE_SCHED, &n->state);
>> > -	clear_bit(NAPI_STATE_NPSVC, &n->state);
>> > -	if (n->dev->threaded && n->thread)
>> > -		set_bit(NAPI_STATE_THREADED, &n->state);
>> > +	unsigned long val, new;
>> > +
>> > +	do {
>> > +		val = READ_ONCE(n->state);
>> > +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
>> 
>> is this BUG_ON valid/needed? We could have lost the cmpxchg() and
>> the other thread could have set NAPI_STATE_SCHED?
> 
> The BUG_ON() is here to make sure that when napi_enable() is called the
> napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
> on disabled NAPIs because that bit means ownership. Whoever disabled
> the NAPI owns it.
> 
> That BUG_ON() could have been taken outside of the loop, there's no
> point re-checking on every try.
> 
> Are you seeing NAPI-related failures? We had at least 3 reports in the
> last two weeks of strange failures which look like NAPI state getting
> corrupted on net-next...

We hit two napi-related crashes while attempting an MTU size change.

1st crash:
[430425.020051] ------------[ cut here ]------------
[430425.020053] kernel BUG at ../net/core/dev.c:6938!
[430425.020057] Oops: Exception in kernel mode, sig: 5 [#1]
[430425.020068] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[430425.020075] Modules linked in: binfmt_misc rpadlpar_io rpaphp 
xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core 
ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress 
lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror 
dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
[430425.020123] CPU: 0 PID: 34337 Comm: kworker/0:3 Kdump: loaded 
Tainted: G    W     5.15.0-rc2-suka-00486-gce916130f5f6 #3
[430425.020133] Workqueue: events_long __ibmvnic_reset [ibmvnic]
[430425.020145] NIP: c000000000cb03f4 LR: c0080000014a4ce8 CTR: 
c000000000cb03b0
[430425.020151] REGS: c00000002e5d37e0 TRAP: 0700  Tainted: G    W     
(5.15.0-rc2-suka-00486-gce916130f5f6)
[430425.020159] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> 
CR: 28248428 XER: 20000001
[430425.020176] CFAR: c0080000014ad9cc IRQMASK: 0
         GPR00: c0080000014a4ce8 c00000002e5d3a80 c000000001b12100 
c0000001274f3190
         GPR04: 00000000ffff36dc fffffffffffffff6 0000000000000019 
0000000000000010
         GPR08: c00000002ec48900 0000000000000001 c0000001274f31a0 
c0080000014ad9b8
         GPR12: c000000000cb03b0 c000000001d00000 0000000080000000 
00000000000003fe
         GPR16: 00000000000006e3 0000000000000000 0000000000000008 
c00000002ec48af8
         GPR20: c00000002ec48db0 0000000000000000 0000000000000004 
0000000000000000
         GPR24: c00000002ec48000 0000000000000004 c00000002ec49070 
0000000000000006
         GPR28: c00000002ec48900 c00000002ec48900 0000000000000002 
c00000002ec48000
[430425.020248] NIP [c000000000cb03f4] napi_enable+0x44/0xc0
[430425.020257] LR [c0080000014a4ce8] __ibmvnic_open+0xf0/0x440 
[ibmvnic]
[430425.020265] Call Trace:
[430425.020269] [c00000002e5d3a80] [c00000002ec48900] 0xc00000002ec48900 
(unreliable)
[430425.020277] [c00000002e5d3ab0] [c0080000014a4f40] 
__ibmvnic_open+0x348/0x440 [ibmvnic]
[430425.020286] [c00000002e5d3b40] [c0080000014ace58] 
__ibmvnic_reset+0xb10/0xe40 [ibmvnic]
[430425.020296] [c00000002e5d3c60] [c0000000001673a4] 
process_one_work+0x2d4/0x5d0
[430425.020305] [c00000002e5d3d00] [c000000000167718] 
worker_thread+0x78/0x6c0
[430425.020314] [c00000002e5d3da0] [c000000000173388] 
kthread+0x188/0x190
[430425.020322] [c00000002e5d3e10] [c00000000000cee4] 
ret_from_kernel_thread+0x5c/0x64
[430425.020331] Instruction dump:
[430425.020335] 38a0fff6 39430010 e92d0c80 f9210028 39200000 60000000 
60000000 e9030010
[430425.020348] f9010020 e9210020 7d2948f8 792907e0 <0b090000> e9230038 
7d072838 89290889
[430425.020364] ---[ end trace 3abb5ec5589518ca ]---
[430425.068100]
[430425.068108] Sending IPI to other CPUs
[430425.068206] IPI complete
[430425.090333] kexec: Starting switchover sequence.

2nd crash:
[ 1526.539335] NAPI poll function 0x96b6b00f7adfd returned 0, exceeding its budget of -49738736.
[ 1526.539349] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 1526.539354] Faulting instruction address: 0xc000000000cc4054
[ 1526.539358] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1526.539376] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 1526.539390] Modules linked in: rpadlpar_io rpaphp xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
[ 1526.539469] CPU: 0 PID: 11 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.15.0-rc2-suka-00489-gd86e74e0e2e9 #4
[ 1526.539484] NIP: c000000000cc4054 LR: c000000000cc4494 CTR: c00000000089b9f0
[ 1526.539495] REGS: c00000000652b790 TRAP: 0380  Not tainted (5.15.0-rc2-suka-00489-gd86e74e0e2e9)
[ 1526.539506] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28004224 XER: 20000001
[ 1526.539538] CFAR: c000000000cc4490 IRQMASK: 0
         GPR00: c000000000cc4494 c00000000652ba30 c000000001b12100 c0000003fd090d08
         GPR04: 00000000fffeffff c00000000652b800 0000000000000027 c0000003fd007e08
         GPR08: 0000000000000023 0000000000000000 0000000000000027 0000000000000001
         GPR12: 0000000028004224 c000000001d00000 c0000000064ceb00 0000000000000001
         GPR16: 0000000000000000 c000000001b33a00 000000010001df1f 0000000000000003
         GPR20: c0000003fd090c00 0000000000000026 ffffffffffffffff 0000000000000000
         GPR24: c000000001b33a00 c00000000652bba8 000000010001df20 c00000000652ba58
         GPR28: c00000000652bb97 fffffffffd090c10 c0000003fd090d08 0000000000000000
[ 1526.539661] NIP [c000000000cc4054] netif_receive_skb_list_internal+0x54/0x340
[ 1526.539675] LR [c000000000cc4494] gro_normal_list.part.149+0x34/0x60
[ 1526.539687] Call Trace:
[ 1526.539696] [c00000000652ba30] [c0000000001ca308] vprintk_emit+0xe8/0x2b0 (unreliable)
[ 1526.539713] [c00000000652bab0] [c000000000cc4494] gro_normal_list.part.149+0x34/0x60
[ 1526.539727] [c00000000652bae0] [c000000000cc6340] __napi_poll+0x250/0x330
[ 1526.539741] [c00000000652bb70] [c000000000cc697c] net_rx_action+0x31c/0x370
[ 1526.539754] [c00000000652bc20] [c000000000f3c8fc] __do_softirq+0x16c/0x420
[ 1526.539768] [c00000000652bd20] [c000000000147654] run_ksoftirqd+0x64/0x90
[ 1526.539783] [c00000000652bd40] [c00000000017955c] smpboot_thread_fn+0x21c/0x260
[ 1526.539796] [c00000000652bda0] [c000000000173388] kthread+0x188/0x190
[ 1526.539812] [c00000000652be10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
[ 1526.539827] Instruction dump:
[ 1526.539835] fbe1fff8 7c7e1b78 f8010010 f821ff81 ebe30000 3b610028 e92d0c80 f9210048
[ 1526.539858] 39200000 fb610028 fb610030 7fa3f840 <eb9f0000> 419e004c 7ffdfb78 60000000
[ 1526.539884] ---[ end trace e9681bc32653835d ]---
[ 1526.559758]
[ 1526.559766] Sending IPI to other CPUs
[ 1526.559858] IPI complete
[ 1526.581983] kexec: Starting switchover sequence.
Jakub Kicinski Oct. 18, 2021, 11:47 p.m. UTC | #9
On Mon, 18 Oct 2021 16:36:36 -0700 Dany Madden wrote:
> > The BUG_ON() is here to make sure that when napi_enable() is called the
> > napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
> > on disabled NAPIs because that bit means ownership. Whoever disabled
> > the NAPI owns it.
> > 
> > That BUG_ON() could have been taken outside of the loop, there's no
> > point re-checking on every try.
> > 
> > Are you seeing NAPI-related failures? We had at least 3 reports in the
> > last two weeks of strange failures which look like NAPI state getting
> > corrupted on net-next...  
> 
> We hit two napi related crashes while attempting mtu size change.

Is it reproducible or happens rarely and randomly?
Sukadev Bhattiprolu Oct. 19, 2021, 12:01 a.m. UTC | #10
Jakub Kicinski [kuba@kernel.org] wrote:
> On Mon, 18 Oct 2021 16:36:36 -0700 Dany Madden wrote:
> > > The BUG_ON() is here to make sure that when napi_enable() is called the
> > > napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
> > > on disabled NAPIs because that bit means ownership. Whoever disabled
> > > the NAPI owns it.
> > > 
> > > That BUG_ON() could have been taken outside of the loop, there's no
> > > point re-checking on every try.
> > > 
> > > Are you seeing NAPI-related failures? We had at least 3 reports in the
> > > last two weeks of strange failures which look like NAPI state getting
> > > corrupted on net-next...  
> > 
> > We hit two napi related crashes while attempting mtu size change.

BTW these are with a couple of bug fixes in ibmvnic driver applied
to 1e0083bd0777 ("gve: DQO: avoid unused variable warnings").

We are trying to narrow it down to the following change, which is
required to be able to change the MTU on ibmvnic.

> 
> Is it reproducible or happens rarely and randomly?

Random. We have been testing for a couple of weeks but hit both today.

Sukadev

---
index 8f17096e614d..a1533979c670 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1914,8 +1914,6 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
        ind_bufp = &tx_scrq->ind_buf;
 
        if (test_bit(0, &adapter->resetting)) {
-               if (!netif_subqueue_stopped(netdev, skb))
-                       netif_stop_subqueue(netdev, queue_num);
                dev_kfree_skb_any(skb);
Sukadev Bhattiprolu Oct. 22, 2021, 3:16 a.m. UTC | #11
Dany Madden [drt@linux.ibm.com] wrote:
> 
> We hit two napi related crashes while attempting mtu size change.
> 
> 1st crash:
> [430425.020051] ------------[ cut here ]------------
> [430425.020053] kernel BUG at ../net/core/dev.c:6938!
> [430425.020057] Oops: Exception in kernel mode, sig: 5 [#1]
> [430425.020068] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [430425.020075] Modules linked in: binfmt_misc rpadlpar_io rpaphp xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
> [430425.020123] CPU: 0 PID: 34337 Comm: kworker/0:3 Kdump: loaded Tainted: G W     5.15.0-rc2-suka-00486-gce916130f5f6 #3
> [430425.020133] Workqueue: events_long __ibmvnic_reset [ibmvnic]
> [430425.020145] NIP: c000000000cb03f4 LR: c0080000014a4ce8 CTR: c000000000cb03b0
> [430425.020151] REGS: c00000002e5d37e0 TRAP: 0700  Tainted: G    W (5.15.0-rc2-suka-00486-gce916130f5f6)
> [430425.020159] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28248428 XER: 20000001
> [430425.020176] CFAR: c0080000014ad9cc IRQMASK: 0
>         GPR00: c0080000014a4ce8 c00000002e5d3a80 c000000001b12100 c0000001274f3190
>         GPR04: 00000000ffff36dc fffffffffffffff6 0000000000000019 0000000000000010
>         GPR08: c00000002ec48900 0000000000000001 c0000001274f31a0 c0080000014ad9b8
>         GPR12: c000000000cb03b0 c000000001d00000 0000000080000000 00000000000003fe
>         GPR16: 00000000000006e3 0000000000000000 0000000000000008 c00000002ec48af8
>         GPR20: c00000002ec48db0 0000000000000000 0000000000000004 0000000000000000
>         GPR24: c00000002ec48000 0000000000000004 c00000002ec49070 0000000000000006
>         GPR28: c00000002ec48900 c00000002ec48900 0000000000000002 c00000002ec48000
> [430425.020248] NIP [c000000000cb03f4] napi_enable+0x44/0xc0
> [430425.020257] LR [c0080000014a4ce8] __ibmvnic_open+0xf0/0x440 [ibmvnic]
> [430425.020265] Call Trace:
> [430425.020269] [c00000002e5d3a80] [c00000002ec48900] 0xc00000002ec48900 (unreliable)
> [430425.020277] [c00000002e5d3ab0] [c0080000014a4f40] __ibmvnic_open+0x348/0x440 [ibmvnic]
> [430425.020286] [c00000002e5d3b40] [c0080000014ace58] __ibmvnic_reset+0xb10/0xe40 [ibmvnic]
> [430425.020296] [c00000002e5d3c60] [c0000000001673a4] process_one_work+0x2d4/0x5d0
> [430425.020305] [c00000002e5d3d00] [c000000000167718] worker_thread+0x78/0x6c0
> [430425.020314] [c00000002e5d3da0] [c000000000173388] kthread+0x188/0x190
> [430425.020322] [c00000002e5d3e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
> [430425.020331] Instruction dump:
> [430425.020335] 38a0fff6 39430010 e92d0c80 f9210028 39200000 60000000 60000000 e9030010
> [430425.020348] f9010020 e9210020 7d2948f8 792907e0 <0b090000> e9230038 7d072838 89290889
> [430425.020364] ---[ end trace 3abb5ec5589518ca ]---
> [430425.068100]
> [430425.068108] Sending IPI to other CPUs
> [430425.068206] IPI complete
> [430425.090333] kexec: Starting switchover sequence.

Jakub,

We hit this napi_enable() BUG_ON() crash three times this week. In two
of those cases it appears that

	napi->state == netdev_priv(netdev)

i.e. napi->state contains the ibmvnic_adapter pointer in our case.

	# Crash was on eth3

	crash> net |grep eth3
	c00000002e948000  eth3   10.1.194.173

	crash> net_device |grep SIZE
	SIZE: 2304

	crash> px 2304
	$1 = 0x900

	crash> ibmvnic_adapter c00000002e948900 |grep napi
	  napi = 0xc00000003b7dc000,
	  num_active_rx_napi = 8,
	  napi_enabled = false,

	crash> napi_struct 0xc00000003b7dc000 |grep state
	  state = 13835058056063650048,
	    state = 0 '\000',

	crash> px 13835058056063650048
	$2 = 0xc00000002e948900		#eth3 ibmvnic_adapter above

In the third case napi->state was 16 (i.e. NAPI_STATE_SCHED was clear,
hence the BUG_ON() fired in napi_enable()).

Let us know if any other fields are of interest. Do we have any clues on
when this started?

Sukadev
Jakub Kicinski Oct. 25, 2021, 5:36 p.m. UTC | #12
On Thu, 21 Oct 2021 20:16:04 -0700 Sukadev Bhattiprolu wrote:
> Let us know if any other fields are of interest. Do we have any clues on
> when this started?

I think this is the first of the series of similar hard to explain
reports:

https://lore.kernel.org/all/000000000000c1524005cdeacc5f@google.com/

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index 74fd402d26dd..7ee9fecd3aff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6923,12 +6923,16 @@  EXPORT_SYMBOL(napi_disable);
  */
 void napi_enable(struct napi_struct *n)
 {
-	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	smp_mb__before_atomic();
-	clear_bit(NAPI_STATE_SCHED, &n->state);
-	clear_bit(NAPI_STATE_NPSVC, &n->state);
-	if (n->dev->threaded && n->thread)
-		set_bit(NAPI_STATE_THREADED, &n->state);
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
+
+		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
+		if (n->dev->threaded && n->thread)
+			new |= NAPIF_STATE_THREADED;
+	} while (cmpxchg(&n->state, val, new) != val);
 }
 EXPORT_SYMBOL(napi_enable);