Message ID | 20210918085232.71436-1-xuanzhuo@linux.alibaba.com (mailing list archive)
---|---
State | Accepted |
Commit | 3765996e4f0b8a755cab215a08df744490c76052 |
Delegated to: | Netdev Maintainers |
Series | [net,v2] napi: fix race inside napi_enable |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success |
netdev/fixes_present | success |
netdev/patch_count | success |
netdev/tree_selection | success | Clearly marked for net
netdev/subject_prefix | success |
netdev/cc_maintainers | success | CCed 13 of 13 maintainers
netdev/source_inline | success | Was 0 now: 0
netdev/verify_signedoff | success |
netdev/module_param | success | Was 0 now: 0
netdev/build_32bit | success | Errors and warnings before: 6 this patch: 6
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0
netdev/verify_fixes | success |
netdev/checkpatch | warning | WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
netdev/build_allmodconfig_warn | success | Errors and warnings before: 6 this patch: 6
netdev/header_inline | success |
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Sat, 18 Sep 2021 16:52:32 +0800 you wrote:
> The race described below leaves NAPI_STATE_SCHED set in napi.state
> while the napi is not on the poll_list, which causes napi_disable()
> to get stuck.
>
> The prefix "NAPI_STATE_" is dropped in the figure below, and
> NAPI_STATE_HASHED is ignored in napi.state.
>
> CPU0                             |          CPU1            | napi.state
> ===============================================================================
> napi_disable()                   |                          | SCHED | NPSVC
> napi_enable()                    |                          |
> {                                |                          |
>     smp_mb__before_atomic();     |                          |
>     clear_bit(SCHED, &n->state); |                          | NPSVC
>                                  | napi_schedule_prep()     | SCHED | NPSVC
>                                  | napi_poll()              |
>                                  | napi_complete_done()     |
>                                  | {                        |
>                                  |   if (n->state & (NPSVC  | (1)
>                                  |       | _BUSY_POLL)))    |
>                                  |     return false;        |
>                                  |   ................       |
>                                  | }                        | SCHED | NPSVC
>                                  |                          |
>     clear_bit(NPSVC, &n->state); |                          | SCHED
> }                                |                          |
>                                  |                          |
> napi_schedule_prep()             |                          | SCHED | MISSED (2)
>
> [...]

Here is the summary with links:
  - [net,v2] napi: fix race inside napi_enable
    https://git.kernel.org/netdev/net/c/3765996e4f0b

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
On Sat, 18 Sep 2021 16:52:32 +0800 Xuan Zhuo wrote:
> The race described below leaves NAPI_STATE_SCHED set in napi.state
> while the napi is not on the poll_list, which causes napi_disable()
> to get stuck.
>
> The prefix "NAPI_STATE_" is dropped in the figure below, and
> NAPI_STATE_HASHED is ignored in napi.state.
>
> CPU0                             |          CPU1            | napi.state
> ===============================================================================
> napi_disable()                   |                          | SCHED | NPSVC
> napi_enable()                    |                          |
> {                                |                          |
>     smp_mb__before_atomic();     |                          |
>     clear_bit(SCHED, &n->state); |                          | NPSVC
>                                  | napi_schedule_prep()     | SCHED | NPSVC
>                                  | napi_poll()              |
>                                  | napi_complete_done()     |
>                                  | {                        |
>                                  |   if (n->state & (NPSVC  | (1)
>                                  |       | _BUSY_POLL)))    |
>                                  |     return false;        |
>                                  |   ................       |
>                                  | }                        | SCHED | NPSVC
>                                  |                          |
>     clear_bit(NPSVC, &n->state); |                          | SCHED
> }                                |                          |
>                                  |                          |
> napi_schedule_prep()             |                          | SCHED | MISSED (2)
>
> (1) Returns early here, because NAPI_STATE_NPSVC is set.
> (2) NAPI_STATE_SCHED is already set, so the napi is not added to
>     sd->poll_list.
>
> Since NAPI_STATE_SCHED is set while the napi is not on the
> sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and will
> remain set forever.
>
> 1. The queue no longer receives packets.
> 2. If napi_disable() is then called under rtnl_lock, it blocks while
>    holding rtnl_lock, affecting the whole system.
>
> This patch uses cmpxchg() to implement napi_enable(), which ensures
> there is no race due to the two bits being cleared separately.
>
> Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Why don't you just invert the order of clearing the bits:

diff --git a/net/core/dev.c b/net/core/dev.c
index a796754f75cc..706eca8112c1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6953,8 +6953,8 @@ void napi_enable(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
 	smp_mb__before_atomic();
-	clear_bit(NAPI_STATE_SCHED, &n->state);
 	clear_bit(NAPI_STATE_NPSVC, &n->state);
+	clear_bit(NAPI_STATE_SCHED, &n->state);
 	if (n->dev->threaded && n->thread)
 		set_bit(NAPI_STATE_THREADED, &n->state);
 }

That's simpler and symmetric with the disable path.
On Mon, 20 Sep 2021 12:20:24 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Sat, 18 Sep 2021 16:52:32 +0800 Xuan Zhuo wrote:
> > The race described below leaves NAPI_STATE_SCHED set in napi.state
> > while the napi is not on the poll_list, which causes napi_disable()
> > to get stuck.
> >
> > The prefix "NAPI_STATE_" is dropped in the figure below, and
> > NAPI_STATE_HASHED is ignored in napi.state.
> >
> > CPU0                             |          CPU1            | napi.state
> > ===============================================================================
> > napi_disable()                   |                          | SCHED | NPSVC
> > napi_enable()                    |                          |
> > {                                |                          |
> >     smp_mb__before_atomic();     |                          |
> >     clear_bit(SCHED, &n->state); |                          | NPSVC
> >                                  | napi_schedule_prep()     | SCHED | NPSVC
> >                                  | napi_poll()              |
> >                                  | napi_complete_done()     |
> >                                  | {                        |
> >                                  |   if (n->state & (NPSVC  | (1)
> >                                  |       | _BUSY_POLL)))    |
> >                                  |     return false;        |
> >                                  |   ................       |
> >                                  | }                        | SCHED | NPSVC
> >                                  |                          |
> >     clear_bit(NPSVC, &n->state); |                          | SCHED
> > }                                |                          |
> >                                  |                          |
> > napi_schedule_prep()             |                          | SCHED | MISSED (2)
> >
> > (1) Returns early here, because NAPI_STATE_NPSVC is set.
> > (2) NAPI_STATE_SCHED is already set, so the napi is not added to
> >     sd->poll_list.
> >
> > Since NAPI_STATE_SCHED is set while the napi is not on the
> > sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and will
> > remain set forever.
> >
> > 1. The queue no longer receives packets.
> > 2. If napi_disable() is then called under rtnl_lock, it blocks while
> >    holding rtnl_lock, affecting the whole system.
> >
> > This patch uses cmpxchg() to implement napi_enable(), which ensures
> > there is no race due to the two bits being cleared separately.
> >
> > Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
>
> Why don't you just invert the order of clearing the bits:

I think it should be an atomic operation. The original two-step clear is
itself problematic, so from this perspective this is not just a fix for
this one problem.

Thanks.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index a796754f75cc..706eca8112c1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6953,8 +6953,8 @@ void napi_enable(struct napi_struct *n)
>  {
>  	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
>  	smp_mb__before_atomic();
> -	clear_bit(NAPI_STATE_SCHED, &n->state);
>  	clear_bit(NAPI_STATE_NPSVC, &n->state);
> +	clear_bit(NAPI_STATE_SCHED, &n->state);
>  	if (n->dev->threaded && n->thread)
>  		set_bit(NAPI_STATE_THREADED, &n->state);
>  }
>
> That's simpler and symmetric with the disable path.
On Wed, 22 Sep 2021 14:47:47 +0800 Xuan Zhuo wrote:
> > Why don't you just invert the order of clearing the bits:
>
> I think it should be an atomic operation. The original two-step clear
> is itself problematic, so from this perspective this is not just a fix
> for this one problem.

[resending, my MUA seems to have corrupted the CC list previously]

Can you show what breaks by it being non-atomic? Because, again, the
disable part is not atomic. Either it's needed on both sides or it's
not needed on either.
On Thu, 23 Sep 2021 21:40:56 +0800 Xuan Zhuo wrote:
> On Thu, 23 Sep 2021 06:14:17 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> > > I think it should be an atomic operation. The original two-step
> > > clear is itself problematic, so from this perspective this is not
> > > just a fix for this one problem.
> >
> > [resending, my MUA seems to have corrupted the CC list previously]
> >
> > Can you show what breaks by it being non-atomic?
>
> Isn't the problem this time caused exactly by the non-atomicity?
>
> Of course, for this particular problem, adjusting the order seems to
> be enough. But compared to switching to an atomic operation, we would
> have to test for other problems that reordering might introduce.
>
> Relatively speaking, using an atomic update is the simpler approach.

Whether atomics are simple or not is not the question. What I'm saying
is that having asymmetric enable and disable paths is fragile.

> > Because, again, the disable part is not atomic. Either it's needed on
> > both sides or it's not needed on either.
>
> For the disable part, I think it's okay not to use atomics. Have you
> considered any special scenarios?

The point is both sides should do the same thing.
Xuan Zhuo [xuanzhuo@linux.alibaba.com] wrote:
> The race described below leaves NAPI_STATE_SCHED set in napi.state
> while the napi is not on the poll_list, which causes napi_disable()
> to get stuck.
>
> The prefix "NAPI_STATE_" is dropped in the figure below, and
> NAPI_STATE_HASHED is ignored in napi.state.
>
> CPU0                             |          CPU1            | napi.state
> ===============================================================================
> napi_disable()                   |                          | SCHED | NPSVC
> napi_enable()                    |                          |
> {                                |                          |
>     smp_mb__before_atomic();     |                          |
>     clear_bit(SCHED, &n->state); |                          | NPSVC
>                                  | napi_schedule_prep()     | SCHED | NPSVC
>                                  | napi_poll()              |
>                                  | napi_complete_done()     |
>                                  | {                        |
>                                  |   if (n->state & (NPSVC  | (1)
>                                  |       | _BUSY_POLL)))    |
>                                  |     return false;        |
>                                  |   ................       |
>                                  | }                        | SCHED | NPSVC
>                                  |                          |
>     clear_bit(NPSVC, &n->state); |                          | SCHED
> }                                |                          |

So it's possible that after cpu0 cleared SCHED, cpu1 could have set it
back, and we are going to use cmpxchg() to detect that and retry,
right? If so,

>                                  |                          |
> napi_schedule_prep()             |                          | SCHED | MISSED (2)
>
> (1) Returns early here, because NAPI_STATE_NPSVC is set.
> (2) NAPI_STATE_SCHED is already set, so the napi is not added to
>     sd->poll_list.
>
> Since NAPI_STATE_SCHED is set while the napi is not on the
> sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and will
> remain set forever.
>
> 1. The queue no longer receives packets.
> 2. If napi_disable() is then called under rtnl_lock, it blocks while
>    holding rtnl_lock, affecting the whole system.
>
> This patch uses cmpxchg() to implement napi_enable(), which ensures
> there is no race due to the two bits being cleared separately.
>
> Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> ---
>  net/core/dev.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 74fd402d26dd..7ee9fecd3aff 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
>   */
>  void napi_enable(struct napi_struct *n)
>  {
> -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> -	smp_mb__before_atomic();
> -	clear_bit(NAPI_STATE_SCHED, &n->state);
> -	clear_bit(NAPI_STATE_NPSVC, &n->state);
> -	if (n->dev->threaded && n->thread)
> -		set_bit(NAPI_STATE_THREADED, &n->state);
> +	unsigned long val, new;
> +
> +	do {
> +		val = READ_ONCE(n->state);
> +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));

Is this BUG_ON() valid/needed? We could have lost the cmpxchg() and the
other thread could have set NAPI_STATE_SCHED?

Sukadev

> +
> +		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
> +		if (n->dev->threaded && n->thread)
> +			new |= NAPIF_STATE_THREADED;
> +	} while (cmpxchg(&n->state, val, new) != val);
>  }
>  EXPORT_SYMBOL(napi_enable);
>
> --
> 2.31.0
On Mon, 18 Oct 2021 14:58:08 -0700 Sukadev Bhattiprolu wrote:
> > CPU0                             |          CPU1            | napi.state
> > ===============================================================================
> > napi_disable()                   |                          | SCHED | NPSVC
> > napi_enable()                    |                          |
> > {                                |                          |
> >     smp_mb__before_atomic();     |                          |
> >     clear_bit(SCHED, &n->state); |                          | NPSVC
> >                                  | napi_schedule_prep()     | SCHED | NPSVC
> >                                  | napi_poll()              |
> >                                  | napi_complete_done()     |
> >                                  | {                        |
> >                                  |   if (n->state & (NPSVC  | (1)
> >                                  |       | _BUSY_POLL)))    |
> >                                  |     return false;        |
> >                                  |   ................       |
> >                                  | }                        | SCHED | NPSVC
> >                                  |                          |
> >     clear_bit(NPSVC, &n->state); |                          | SCHED
> > }                                |                          |
>
> So it's possible that after cpu0 cleared SCHED, cpu1 could have set it
> back, and we are going to use cmpxchg() to detect that and retry,
> right? If so,

This is a state diagram before the change. There's no cmpxchg() here.

> >                                  |                          |
> > napi_schedule_prep()             |                          | SCHED | MISSED (2)
> >
> > (1) Returns early here, because NAPI_STATE_NPSVC is set.
> > (2) NAPI_STATE_SCHED is already set, so the napi is not added to
> >     sd->poll_list.
> >
> > Since NAPI_STATE_SCHED is set while the napi is not on the
> > sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and will
> > remain set forever.
> >
> > 1. The queue no longer receives packets.
> > 2. If napi_disable() is then called under rtnl_lock, it blocks while
> >    holding rtnl_lock, affecting the whole system.
> >
> > This patch uses cmpxchg() to implement napi_enable(), which ensures
> > there is no race due to the two bits being cleared separately.
> >
> > Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
> > ---
> >  net/core/dev.c | 16 ++++++++++------
> >  1 file changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 74fd402d26dd..7ee9fecd3aff 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
> >   */
> >  void napi_enable(struct napi_struct *n)
> >  {
> > -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
> > -	smp_mb__before_atomic();
> > -	clear_bit(NAPI_STATE_SCHED, &n->state);
> > -	clear_bit(NAPI_STATE_NPSVC, &n->state);
> > -	if (n->dev->threaded && n->thread)
> > -		set_bit(NAPI_STATE_THREADED, &n->state);
> > +	unsigned long val, new;
> > +
> > +	do {
> > +		val = READ_ONCE(n->state);
> > +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
>
> Is this BUG_ON() valid/needed? We could have lost the cmpxchg() and the
> other thread could have set NAPI_STATE_SCHED?

The BUG_ON() is here to make sure that when napi_enable() is called the
napi instance was dormant, i.e. disabled. We have the "STATE_SCHED" bit
set on disabled NAPIs because that bit means ownership. Whoever
disabled the NAPI owns it.

That BUG_ON() could have been taken outside of the loop, there's no
point re-checking on every try.

Are you seeing NAPI-related failures? We had at least 3 reports in the
last two weeks of strange failures which look like NAPI state getting
corrupted on net-next...
On 2021-10-18 15:55, Jakub Kicinski wrote:
> On Mon, 18 Oct 2021 14:58:08 -0700 Sukadev Bhattiprolu wrote:
>> > CPU0                             |          CPU1            | napi.state
>> > ===============================================================================
>> > napi_disable()                   |                          | SCHED | NPSVC
>> > napi_enable()                    |                          |
>> > {                                |                          |
>> >     smp_mb__before_atomic();     |                          |
>> >     clear_bit(SCHED, &n->state); |                          | NPSVC
>> >                                  | napi_schedule_prep()     | SCHED | NPSVC
>> >                                  | napi_poll()              |
>> >                                  | napi_complete_done()     |
>> >                                  | {                        |
>> >                                  |   if (n->state & (NPSVC  | (1)
>> >                                  |       | _BUSY_POLL)))    |
>> >                                  |     return false;        |
>> >                                  |   ................       |
>> >                                  | }                        | SCHED | NPSVC
>> >                                  |                          |
>> >     clear_bit(NPSVC, &n->state); |                          | SCHED
>> > }                                |                          |
>>
>> So it's possible that after cpu0 cleared SCHED, cpu1 could have set it
>> back, and we are going to use cmpxchg() to detect that and retry,
>> right? If so,
>
> This is a state diagram before the change. There's no cmpxchg() here.
>
>> > napi_schedule_prep()             |                          | SCHED | MISSED (2)
>> >
>> > (1) Returns early here, because NAPI_STATE_NPSVC is set.
>> > (2) NAPI_STATE_SCHED is already set, so the napi is not added to
>> >     sd->poll_list.
>> >
>> > Since NAPI_STATE_SCHED is set while the napi is not on the
>> > sd->poll_list queue, NAPI_STATE_SCHED can never be cleared and will
>> > remain set forever.
>> >
>> > 1. The queue no longer receives packets.
>> > 2. If napi_disable() is then called under rtnl_lock, it blocks while
>> >    holding rtnl_lock, affecting the whole system.
>> >
>> > This patch uses cmpxchg() to implement napi_enable(), which ensures
>> > there is no race due to the two bits being cleared separately.
>> >
>> > Fixes: 2d8bff12699abc ("netpoll: Close race condition between poll_one_napi and napi_disable")
>> > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> > Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
>> > ---
>> >  net/core/dev.c | 16 ++++++++++------
>> >  1 file changed, 10 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index 74fd402d26dd..7ee9fecd3aff 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
>> >   */
>> >  void napi_enable(struct napi_struct *n)
>> >  {
>> > -	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
>> > -	smp_mb__before_atomic();
>> > -	clear_bit(NAPI_STATE_SCHED, &n->state);
>> > -	clear_bit(NAPI_STATE_NPSVC, &n->state);
>> > -	if (n->dev->threaded && n->thread)
>> > -		set_bit(NAPI_STATE_THREADED, &n->state);
>> > +	unsigned long val, new;
>> > +
>> > +	do {
>> > +		val = READ_ONCE(n->state);
>> > +		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
>>
>> Is this BUG_ON() valid/needed? We could have lost the cmpxchg() and the
>> other thread could have set NAPI_STATE_SCHED?
>
> The BUG_ON() is here to make sure that when napi_enable() is called the
> napi instance was dormant, i.e. disabled. We have the "STATE_SCHED" bit
> set on disabled NAPIs because that bit means ownership. Whoever
> disabled the NAPI owns it.
>
> That BUG_ON() could have been taken outside of the loop, there's no
> point re-checking on every try.
>
> Are you seeing NAPI-related failures? We had at least 3 reports in the
> last two weeks of strange failures which look like NAPI state getting
> corrupted on net-next...

We hit two napi related crashes while attempting an MTU size change.

1st crash:

[430425.020051] ------------[ cut here ]------------
[430425.020053] kernel BUG at ../net/core/dev.c:6938!
[430425.020057] Oops: Exception in kernel mode, sig: 5 [#1]
[430425.020068] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[430425.020075] Modules linked in: binfmt_misc rpadlpar_io rpaphp xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
[430425.020123] CPU: 0 PID: 34337 Comm: kworker/0:3 Kdump: loaded Tainted: G W 5.15.0-rc2-suka-00486-gce916130f5f6 #3
[430425.020133] Workqueue: events_long __ibmvnic_reset [ibmvnic]
[430425.020145] NIP: c000000000cb03f4 LR: c0080000014a4ce8 CTR: c000000000cb03b0
[430425.020151] REGS: c00000002e5d37e0 TRAP: 0700 Tainted: G W (5.15.0-rc2-suka-00486-gce916130f5f6)
[430425.020159] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28248428 XER: 20000001
[430425.020176] CFAR: c0080000014ad9cc IRQMASK: 0
GPR00: c0080000014a4ce8 c00000002e5d3a80 c000000001b12100 c0000001274f3190
GPR04: 00000000ffff36dc fffffffffffffff6 0000000000000019 0000000000000010
GPR08: c00000002ec48900 0000000000000001 c0000001274f31a0 c0080000014ad9b8
GPR12: c000000000cb03b0 c000000001d00000 0000000080000000 00000000000003fe
GPR16: 00000000000006e3 0000000000000000 0000000000000008 c00000002ec48af8
GPR20: c00000002ec48db0 0000000000000000 0000000000000004 0000000000000000
GPR24: c00000002ec48000 0000000000000004 c00000002ec49070 0000000000000006
GPR28: c00000002ec48900 c00000002ec48900 0000000000000002 c00000002ec48000
[430425.020248] NIP [c000000000cb03f4] napi_enable+0x44/0xc0
[430425.020257] LR [c0080000014a4ce8] __ibmvnic_open+0xf0/0x440 [ibmvnic]
[430425.020265] Call Trace:
[430425.020269] [c00000002e5d3a80] [c00000002ec48900] 0xc00000002ec48900 (unreliable)
[430425.020277] [c00000002e5d3ab0] [c0080000014a4f40] __ibmvnic_open+0x348/0x440 [ibmvnic]
[430425.020286] [c00000002e5d3b40] [c0080000014ace58] __ibmvnic_reset+0xb10/0xe40 [ibmvnic]
[430425.020296] [c00000002e5d3c60] [c0000000001673a4] process_one_work+0x2d4/0x5d0
[430425.020305] [c00000002e5d3d00] [c000000000167718] worker_thread+0x78/0x6c0
[430425.020314] [c00000002e5d3da0] [c000000000173388] kthread+0x188/0x190
[430425.020322] [c00000002e5d3e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
[430425.020331] Instruction dump:
[430425.020335] 38a0fff6 39430010 e92d0c80 f9210028 39200000 60000000 60000000 e9030010
[430425.020348] f9010020 e9210020 7d2948f8 792907e0 <0b090000> e9230038 7d072838 89290889
[430425.020364] ---[ end trace 3abb5ec5589518ca ]---
[430425.068100]
[430425.068108] Sending IPI to other CPUs
[430425.068206] IPI complete
[430425.090333] kexec: Starting switchover sequence.

2nd crash:

[ 1526.539335] NAPI poll function 0x96b6b00f7adfd returned 0, exceeding its budget of -49738736.
[ 1526.539349] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 1526.539354] Faulting instruction address: 0xc000000000cc4054
[ 1526.539358] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1526.539376] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 1526.539390] Modules linked in: rpadlpar_io rpaphp xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
[ 1526.539469] CPU: 0 PID: 11 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.15.0-rc2-suka-00489-gd86e74e0e2e9 #4
[ 1526.539484] NIP: c000000000cc4054 LR: c000000000cc4494 CTR: c00000000089b9f0
[ 1526.539495] REGS: c00000000652b790 TRAP: 0380 Not tainted (5.15.0-rc2-suka-00489-gd86e74e0e2e9)
[ 1526.539506] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28004224 XER: 20000001
[ 1526.539538] CFAR: c000000000cc4490 IRQMASK: 0
GPR00: c000000000cc4494 c00000000652ba30 c000000001b12100 c0000003fd090d08
GPR04: 00000000fffeffff c00000000652b800 0000000000000027 c0000003fd007e08
GPR08: 0000000000000023 0000000000000000 0000000000000027 0000000000000001
GPR12: 0000000028004224 c000000001d00000 c0000000064ceb00 0000000000000001
GPR16: 0000000000000000 c000000001b33a00 000000010001df1f 0000000000000003
GPR20: c0000003fd090c00 0000000000000026 ffffffffffffffff 0000000000000000
GPR24: c000000001b33a00 c00000000652bba8 000000010001df20 c00000000652ba58
GPR28: c00000000652bb97 fffffffffd090c10 c0000003fd090d08 0000000000000000
[ 1526.539661] NIP [c000000000cc4054] netif_receive_skb_list_internal+0x54/0x340
[ 1526.539675] LR [c000000000cc4494] gro_normal_list.part.149+0x34/0x60
[ 1526.539687] Call Trace:
[ 1526.539696] [c00000000652ba30] [c0000000001ca308] vprintk_emit+0xe8/0x2b0 (unreliable)
[ 1526.539713] [c00000000652bab0] [c000000000cc4494] gro_normal_list.part.149+0x34/0x60
[ 1526.539727] [c00000000652bae0] [c000000000cc6340] __napi_poll+0x250/0x330
[ 1526.539741] [c00000000652bb70] [c000000000cc697c] net_rx_action+0x31c/0x370
[ 1526.539754] [c00000000652bc20] [c000000000f3c8fc] __do_softirq+0x16c/0x420
[ 1526.539768] [c00000000652bd20] [c000000000147654] run_ksoftirqd+0x64/0x90
[ 1526.539783] [c00000000652bd40] [c00000000017955c] smpboot_thread_fn+0x21c/0x260
[ 1526.539796] [c00000000652bda0] [c000000000173388] kthread+0x188/0x190
[ 1526.539812] [c00000000652be10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
[ 1526.539827] Instruction dump:
[ 1526.539835] fbe1fff8 7c7e1b78 f8010010 f821ff81 ebe30000 3b610028 e92d0c80 f9210048
[ 1526.539858] 39200000 fb610028 fb610030 7fa3f840 <eb9f0000> 419e004c 7ffdfb78 60000000
[ 1526.539884] ---[ end trace e9681bc32653835d ]---
[ 1526.559758]
[ 1526.559766] Sending IPI to other CPUs
[ 1526.559858] IPI complete
[ 1526.581983] kexec: Starting switchover sequence.
On Mon, 18 Oct 2021 16:36:36 -0700 Dany Madden wrote:
> > The BUG_ON() is here to make sure that when napi_enable() is called the
> > napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
> > on disabled NAPIs because that bit means ownership. Whoever disabled
> > the NAPI owns it.
> >
> > That BUG_ON() could have been taken outside of the loop, there's no
> > point re-checking on every try.
> >
> > Are you seeing NAPI-related failures? We had at least 3 reports in the
> > last two weeks of strange failures which look like NAPI state getting
> > corrupted on net-next...
>
> We hit two napi related crashes while attempting mtu size change.

Is it reproducible or happens rarely and randomly?
Jakub Kicinski [kuba@kernel.org] wrote:
> On Mon, 18 Oct 2021 16:36:36 -0700 Dany Madden wrote:
> > > The BUG_ON() is here to make sure that when napi_enable() is called the
> > > napi instance was dormant, i.e. disabled. We have "STATE_SCHED" bit set
> > > on disabled NAPIs because that bit means ownership. Whoever disabled
> > > the NAPI owns it.
> > >
> > > That BUG_ON() could have been taken outside of the loop, there's no
> > > point re-checking on every try.
> > >
> > > Are you seeing NAPI-related failures? We had at least 3 reports in the
> > > last two weeks of strange failures which look like NAPI state getting
> > > corrupted on net-next...
> >
> > We hit two napi related crashes while attempting mtu size change.

BTW these are with a couple of bug fixes in the ibmvnic driver applied
on top of 1e0083bd0777 ("gve: DQO: avoid unused variable warnings"). We
are trying to narrow it down to the following change, which is required
to be able to change the MTU on ibmvnic.

> Is it reproducible or happens rarely and randomly?

Random. We have been testing for a couple of weeks but hit both today.

Sukadev

---
index 8f17096e614d..a1533979c670 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1914,8 +1914,6 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	ind_bufp = &tx_scrq->ind_buf;
 
 	if (test_bit(0, &adapter->resetting)) {
-		if (!netif_subqueue_stopped(netdev, skb))
-			netif_stop_subqueue(netdev, queue_num);
 		dev_kfree_skb_any(skb);
Dany Madden [drt@linux.ibm.com] wrote:
> We hit two napi related crashes while attempting mtu size change.
>
> 1st crash:
> [430425.020051] ------------[ cut here ]------------
> [430425.020053] kernel BUG at ../net/core/dev.c:6938!
> [430425.020057] Oops: Exception in kernel mode, sig: 5 [#1]
> [430425.020068] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
> [430425.020075] Modules linked in: binfmt_misc rpadlpar_io rpaphp xt_tcpudp iptable_filter ip_tables x_tables pseries_rng ibmvnic rng_core ibmveth vmx_crypto gf128mul fuse btrfs blake2b_generic xor zstd_compress lzo_compress raid6_pq dm_service_time crc32c_vpmsum dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
> [430425.020123] CPU: 0 PID: 34337 Comm: kworker/0:3 Kdump: loaded Tainted: G W 5.15.0-rc2-suka-00486-gce916130f5f6 #3
> [430425.020133] Workqueue: events_long __ibmvnic_reset [ibmvnic]
> [430425.020145] NIP: c000000000cb03f4 LR: c0080000014a4ce8 CTR: c000000000cb03b0
> [430425.020151] REGS: c00000002e5d37e0 TRAP: 0700 Tainted: G W (5.15.0-rc2-suka-00486-gce916130f5f6)
> [430425.020159] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28248428 XER: 20000001
> [430425.020176] CFAR: c0080000014ad9cc IRQMASK: 0
> GPR00: c0080000014a4ce8 c00000002e5d3a80 c000000001b12100 c0000001274f3190
> GPR04: 00000000ffff36dc fffffffffffffff6 0000000000000019 0000000000000010
> GPR08: c00000002ec48900 0000000000000001 c0000001274f31a0 c0080000014ad9b8
> GPR12: c000000000cb03b0 c000000001d00000 0000000080000000 00000000000003fe
> GPR16: 00000000000006e3 0000000000000000 0000000000000008 c00000002ec48af8
> GPR20: c00000002ec48db0 0000000000000000 0000000000000004 0000000000000000
> GPR24: c00000002ec48000 0000000000000004 c00000002ec49070 0000000000000006
> GPR28: c00000002ec48900 c00000002ec48900 0000000000000002 c00000002ec48000
> [430425.020248] NIP [c000000000cb03f4] napi_enable+0x44/0xc0
> [430425.020257] LR [c0080000014a4ce8] __ibmvnic_open+0xf0/0x440 [ibmvnic]
> [430425.020265] Call Trace:
> [430425.020269] [c00000002e5d3a80] [c00000002ec48900] 0xc00000002ec48900 (unreliable)
> [430425.020277] [c00000002e5d3ab0] [c0080000014a4f40] __ibmvnic_open+0x348/0x440 [ibmvnic]
> [430425.020286] [c00000002e5d3b40] [c0080000014ace58] __ibmvnic_reset+0xb10/0xe40 [ibmvnic]
> [430425.020296] [c00000002e5d3c60] [c0000000001673a4] process_one_work+0x2d4/0x5d0
> [430425.020305] [c00000002e5d3d00] [c000000000167718] worker_thread+0x78/0x6c0
> [430425.020314] [c00000002e5d3da0] [c000000000173388] kthread+0x188/0x190
> [430425.020322] [c00000002e5d3e10] [c00000000000cee4] ret_from_kernel_thread+0x5c/0x64
> [430425.020331] Instruction dump:
> [430425.020335] 38a0fff6 39430010 e92d0c80 f9210028 39200000 60000000 60000000 e9030010
> [430425.020348] f9010020 e9210020 7d2948f8 792907e0 <0b090000> e9230038 7d072838 89290889
> [430425.020364] ---[ end trace 3abb5ec5589518ca ]---
> [430425.068100]
> [430425.068108] Sending IPI to other CPUs
> [430425.068206] IPI complete
> [430425.090333] kexec: Starting switchover sequence.

Jakub,

We hit this napi_enable() BUG_ON() crash three times this week. In two
of those cases it appears that napi->state == netdev_priv(netdev), i.e.
it contains the ibmvnic_adapter pointer in our case.

# Crash was on eth3
crash> net |grep eth3
c00000002e948000  eth3  10.1.194.173
crash> net_device |grep SIZE
SIZE: 2304
crash> px 2304
$1 = 0x900
crash> ibmvnic_adapter c00000002e948900 |grep napi
  napi = 0xc00000003b7dc000,
  num_active_rx_napi = 8,
  napi_enabled = false,
crash> napi_struct 0xc00000003b7dc000 |grep state
  state = 13835058056063650048,
  state = 0 '\000',
crash> px 13835058056063650048
$2 = 0xc00000002e948900      # eth3 ibmvnic_adapter above

In the third case napi->state was 16 (i.e. NAPI_STATE_SCHED was clear,
hence the BUG in napi_enable()).

Let us know if any other fields are of interest. Do we have any clues
on when this started?

Sukadev
On Thu, 21 Oct 2021 20:16:04 -0700 Sukadev Bhattiprolu wrote:
> Let us know if any other fields are of interest. Do we have any clues
> on when this started?

I think this is the first of the series of similar hard to explain
reports:

https://lore.kernel.org/all/000000000000c1524005cdeacc5f@google.com/
diff --git a/net/core/dev.c b/net/core/dev.c
index 74fd402d26dd..7ee9fecd3aff 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6923,12 +6923,16 @@ EXPORT_SYMBOL(napi_disable);
  */
 void napi_enable(struct napi_struct *n)
 {
-	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
-	smp_mb__before_atomic();
-	clear_bit(NAPI_STATE_SCHED, &n->state);
-	clear_bit(NAPI_STATE_NPSVC, &n->state);
-	if (n->dev->threaded && n->thread)
-		set_bit(NAPI_STATE_THREADED, &n->state);
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
+
+		new = val & ~(NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC);
+		if (n->dev->threaded && n->thread)
+			new |= NAPIF_STATE_THREADED;
+	} while (cmpxchg(&n->state, val, new) != val);
 }
 EXPORT_SYMBOL(napi_enable);