diff mbox series

r8169: transmit queue timeouts and IRQ masking

Message ID ad6a0c52-4dcb-444e-88cd-a6c490a817fe@gmail.com (mailing list archive)
State Changes Requested
Headers show
Series r8169: transmit queue timeouts and IRQ masking | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Ken Milmore May 6, 2024, 9:28 p.m. UTC
I have a motherboard with an integrated RTL8125B network adapter, and I have found a way to predictably cause TX queue timeouts with the r8169 driver.

Briefly, if rtl8169_poll() ever gets called with interrupts unmasked on the device, then it seems to be possible to get it stuck in a non-interrupting state.
It appears disaster can be averted in this case by making sure device interrupts are always masked when inside rtl8169_poll()! For which see below...

The preconditions I found for causing a timeout are:
- Set gro_flush_timeout to a NON-ZERO value
- Set napi_defer_hard_irqs to ZERO
- Put some heavy bidirectional load on the interface (I find iperf3 to another host does the job nicely: 1Gbps is enough).

e.g.
# echo 20000 > /sys/class/net/eth0/gro_flush_timeout
# echo 0 > /sys/class/net/eth0/napi_defer_hard_irqs
# iperf3 --bidir -c hostname

The bitrate falls off to zero almost immediately, whereafter the interface just stops working:

[ ID][Role] Interval           Transfer     Bitrate         Retr  Cwnd
[  5][TX-C]   0.00-1.00   sec  1010 KBytes  8.26 Mbits/sec    1   1.39 KBytes       
[  7][RX-C]   0.00-1.00   sec   421 KBytes  3.45 Mbits/sec                  
[  5][TX-C]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   1.39 KBytes       
[  7][RX-C]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec                  
[  5][TX-C]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.39 KBytes       

On recent(ish) kernels I see the "ASPM disabled on Tx timeout" message as it tries to recover, then after some delay, the familiar "transmit queue 0 timed out" warning usually occurs.


[  149.473134] ------------[ cut here ]------------
[  149.473155] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out 6812 ms
[  149.473188] WARNING: CPU: 18 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x235/0x240
[  149.473206] Modules linked in: nft_chain_nat nf_nat bridge stp llc qrtr sunrpc binfmt_misc ip6t_REJECT nf_reject_ipv6 joydev ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink hid_generic nls_ascii nls_cp437 usbhid vfat hid fat amdgpu intel_rapl_msr snd_sof_pci_intel_tgl intel_rapl_common snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci x86_pkg_temp_thermal intel_powerclamp snd_sof_xtensa_dsp iwlmvm snd_sof coretemp kvm_intel snd_sof_utils snd_hda_codec_realtek snd_soc_hdac_hda mac80211 kvm snd_hda_ext_core snd_hda_codec_generic snd_soc_acpi_intel_match snd_soc_acpi ledtrig_audio snd_soc_core irqbypass snd_compress libarc4 drm_exec snd_pcm_dmaengine snd_hda_codec_hdmi ghash_clmulni_intel amdxcp soundwire_bus drm_buddy sha512_ssse3 gpu_sched sha256_ssse3 snd_hda_intel sha1_ssse3
[  149.473574]  drm_suballoc_helper snd_intel_dspcfg iwlwifi snd_intel_sdw_acpi drm_display_helper snd_hda_codec cec aesni_intel snd_hda_core crypto_simd cryptd rc_core snd_hwdep mei_pxp mei_hdcp snd_pcm rapl drm_ttm_helper pmt_telemetry iTCO_wdt cfg80211 ttm pmt_class snd_timer intel_pmc_bxt evdev intel_cstate drm_kms_helper wmi_bmof mxm_wmi snd iTCO_vendor_support mei_me intel_uncore i2c_algo_bit ee1004 pcspkr watchdog mei soundcore rfkill intel_vsec serial_multi_instantiate intel_pmc_core acpi_pad acpi_tad button drm nct6683 parport_pc ppdev lp parport loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod nvme ahci nvme_core xhci_pci libahci t10_pi xhci_hcd r8169 libata realtek crc64_rocksoft mdio_devres crc64 usbcore scsi_mod crc_t10dif i2c_i801 crct10dif_generic libphy crc32_pclmul crct10dif_pclmul crc32c_intel i2c_smbus video scsi_common usb_common crct10dif_common fan wmi pinctrl_alderlake
[  149.474122] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 6.6.13+bpo-amd64 #1  Debian 6.6.13-1~bpo12+1
[  149.474134] Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A WIFI DDR4 (MS-7D43), BIOS 1.E0 09/14/2023
[  149.474141] RIP: 0010:dev_watchdog+0x235/0x240
[  149.474154] Code: ff ff ff 48 89 df c6 05 6c 2a 40 01 01 e8 e3 37 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 60 5f f2 9f e8 0b e3 6a ff <0f> 0b e9 2a ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[  149.474163] RSP: 0018:ffffa4c3c0444e78 EFLAGS: 00010286
[  149.474176] RAX: 0000000000000000 RBX: ffff93d94b354000 RCX: 000000000000083f
[  149.474185] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[  149.474192] RBP: ffff93d94b3544c8 R08: 0000000000000000 R09: ffffa4c3c0444d00
[  149.474199] R10: 0000000000000003 R11: ffff93e0bf780228 R12: ffff93d94b346a00
[  149.474207] R13: ffff93d94b35441c R14: 0000000000000000 R15: 0000000000001a9c
[  149.474215] FS:  0000000000000000(0000) GS:ffff93e09f680000(0000) knlGS:0000000000000000
[  149.474225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  149.474232] CR2: 00007f616800a008 CR3: 0000000136820000 CR4: 0000000000f50ee0
[  149.474240] PKRU: 55555554
[  149.474247] Call Trace:
[  149.474254]  <IRQ>
[  149.474261]  ? dev_watchdog+0x235/0x240
[  149.474271]  ? __warn+0x81/0x130
[  149.474289]  ? dev_watchdog+0x235/0x240
[  149.474298]  ? report_bug+0x171/0x1a0
[  149.474314]  ? handle_bug+0x41/0x70
[  149.474326]  ? exc_invalid_op+0x17/0x70
[  149.474338]  ? asm_exc_invalid_op+0x1a/0x20
[  149.474353]  ? dev_watchdog+0x235/0x240
[  149.474363]  ? dev_watchdog+0x235/0x240
[  149.474372]  ? __pfx_dev_watchdog+0x10/0x10
[  149.474381]  call_timer_fn+0x24/0x130
[  149.474396]  ? __pfx_dev_watchdog+0x10/0x10
[  149.474404]  __run_timers+0x222/0x2c0
[  149.474420]  run_timer_softirq+0x1d/0x40
[  149.474433]  __do_softirq+0xc7/0x2ae
[  149.474444]  __irq_exit_rcu+0x96/0xb0
[  149.474459]  sysvec_apic_timer_interrupt+0x72/0x90
[  149.474469]  </IRQ>
[  149.474473]  <TASK>
[  149.474479]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  149.474492] RIP: 0010:cpuidle_enter_state+0xcc/0x440
[  149.474504] Code: fa b6 53 ff e8 35 f4 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 43 c4 52 ff 45 84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[  149.474514] RSP: 0018:ffffa4c3c022be90 EFLAGS: 00000246
[  149.474524] RAX: ffff93e09f6b3440 RBX: ffffc4c3bfcb2140 RCX: 000000000000001f
[  149.474529] RDX: 0000000000000012 RSI: 000000003c9b26c9 RDI: 0000000000000000
[  149.474536] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000500
[  149.474542] R10: 0000000000000007 R11: ffff93e09f6b1fe4 R12: ffffffffa079a500
[  149.474548] R13: 00000022cd4ab97d R14: 0000000000000004 R15: 0000000000000000
[  149.474560]  cpuidle_enter+0x2d/0x40
[  149.474571]  do_idle+0x20d/0x270
[  149.474585]  cpu_startup_entry+0x2a/0x30
[  149.474598]  start_secondary+0x11e/0x140
[  149.474613]  secondary_startup_64_no_verify+0x18f/0x19b
[  149.474630]  </TASK>
[  149.474635] ---[ end trace 0000000000000000 ]---


Here's a dump of the registers (MAC Address redacted).
It seems to be stuck on TxDescUnavail | RxOverflow | TxOK | RxOk, and interrupts are unmasked.

# ethtool -d eth0
Unknown RealTek chip (TxConfig: 0x67100f00)
Offset		Values
------		------
0x0000:		XXXXXXXXXXXXXXXXX fe 09 40 00 00 00 80 00 01 00 
0x0010:		00 00 57 ff 00 00 00 00 0a 00 00 00 00 00 00 00 
0x0020:		00 10 7d ff 00 00 00 00 19 1e 9f b4 79 a5 3f 3b 
0x0030:		00 00 00 00 00 00 00 0c 3f 00 00 00 95 00 00 00 
0x0040:		00 0f 10 67 0e 0f c2 40 00 00 00 00 00 00 00 00 
0x0050:		11 00 cf bc 60 11 03 01 00 00 00 00 00 00 00 00 
0x0060:		00 00 00 00 00 00 00 00 00 00 02 00 f3 00 80 f0 
0x0070:		00 00 00 00 00 00 00 00 07 00 00 00 00 00 b3 e9 
0x0080:		62 60 02 00 00 02 20 00 00 00 00 00 00 00 00 00 
0x0090:		00 00 00 00 60 00 20 02 62 64 00 00 00 00 00 00 
0x00a0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
0x00b0:		1f 00 00 00 80 00 00 00 ec 10 1a d2 01 00 01 00 
0x00c0:		00 00 00 00 00 00 00 00 00 00 00 00 12 00 00 00 
0x00d0:		21 00 04 12 00 00 01 00 00 00 00 40 ff ff ff ff 
0x00e0:		20 20 03 01 00 00 91 fe 00 00 00 00 ff 02 00 00 
0x00f0:		3f 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 


I tried instrumenting the code a bit and found that rtl8169_poll() being called with interrupts unmasked seems to be a precursor to the problem occuring.
The *only* time this usually happens is when a GRO timer has been set but napi_defer_hard_irqs is off.
Now that the defaults are gro_flush_timeout=20000, napi_defer_hard_irqs=1, this probably doesn't happen very often for most people.
I guess it will happen with busy polling, but I haven't tested that yet.

Comments

Heiner Kallweit May 8, 2024, 9:14 p.m. UTC | #1
On 06.05.2024 23:28, Ken Milmore wrote:
> I have a motherboard with an integrated RTL8125B network adapter, and I have found a way to predictably cause TX queue timeouts with the r8169 driver.
> 
> Briefly, if rtl8169_poll() ever gets called with interrupts unmasked on the device, then it seems to be possible to get it stuck in a non-interrupting state.
> It appears disaster can be averted in this case by making sure device interrupts are always masked when inside rtl8169_poll()! For which see below...
> 
> The preconditions I found for causing a timeout are:
> - Set gro_flush_timeout to a NON-ZERO value
> - Set napi_defer_hard_irqs to ZERO
> - Put some heavy bidirectional load on the interface (I find iperf3 to another host does the job nicely: 1Gbps is enough).
> 
> e.g.
> # echo 20000 > /sys/class/net/eth0/gro_flush_timeout
> # echo 0 > /sys/class/net/eth0/napi_defer_hard_irqs
> # iperf3 --bidir -c hostname
> 

Thanks for the interesting report and the thorough analysis.
I could reproduce the issue on RTL8168h. It didn't even take heavy
bidirectional load, iperf3 -R -c hostname was enough in my case.

> The bitrate falls off to zero almost immediately, whereafter the interface just stops working:
> 
> [ ID][Role] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5][TX-C]   0.00-1.00   sec  1010 KBytes  8.26 Mbits/sec    1   1.39 KBytes       
> [  7][RX-C]   0.00-1.00   sec   421 KBytes  3.45 Mbits/sec                  
> [  5][TX-C]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    0   1.39 KBytes       
> [  7][RX-C]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec                  
> [  5][TX-C]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   1.39 KBytes       
> 
> On recent(ish) kernels I see the "ASPM disabled on Tx timeout" message as it tries to recover, then after some delay, the familiar "transmit queue 0 timed out" warning usually occurs.
> 
> 
> [  149.473134] ------------[ cut here ]------------
> [  149.473155] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out 6812 ms
> [  149.473188] WARNING: CPU: 18 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x235/0x240
> [  149.473206] Modules linked in: nft_chain_nat nf_nat bridge stp llc qrtr sunrpc binfmt_misc ip6t_REJECT nf_reject_ipv6 joydev ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink hid_generic nls_ascii nls_cp437 usbhid vfat hid fat amdgpu intel_rapl_msr snd_sof_pci_intel_tgl intel_rapl_common snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation intel_uncore_frequency intel_uncore_frequency_common snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci x86_pkg_temp_thermal intel_powerclamp snd_sof_xtensa_dsp iwlmvm snd_sof coretemp kvm_intel snd_sof_utils snd_hda_codec_realtek snd_soc_hdac_hda mac80211 kvm snd_hda_ext_core snd_hda_codec_generic snd_soc_acpi_intel_match snd_soc_acpi ledtrig_audio snd_soc_core irqbypass snd_compress libarc4 drm_exec snd_pcm_dmaengine snd_hda_codec_hdmi ghash_clmulni_intel amdxcp soundwire_bus drm_buddy sha512_ssse3 gpu_sched sha256_ssse3 snd_hda_intel sha1_ssse3
> [  149.473574]  drm_suballoc_helper snd_intel_dspcfg iwlwifi snd_intel_sdw_acpi drm_display_helper snd_hda_codec cec aesni_intel snd_hda_core crypto_simd cryptd rc_core snd_hwdep mei_pxp mei_hdcp snd_pcm rapl drm_ttm_helper pmt_telemetry iTCO_wdt cfg80211 ttm pmt_class snd_timer intel_pmc_bxt evdev intel_cstate drm_kms_helper wmi_bmof mxm_wmi snd iTCO_vendor_support mei_me intel_uncore i2c_algo_bit ee1004 pcspkr watchdog mei soundcore rfkill intel_vsec serial_multi_instantiate intel_pmc_core acpi_pad acpi_tad button drm nct6683 parport_pc ppdev lp parport loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod nvme ahci nvme_core xhci_pci libahci t10_pi xhci_hcd r8169 libata realtek crc64_rocksoft mdio_devres crc64 usbcore scsi_mod crc_t10dif i2c_i801 crct10dif_generic libphy crc32_pclmul crct10dif_pclmul crc32c_intel i2c_smbus video scsi_common usb_common crct10dif_common fan wmi pinctrl_alderlake
> [  149.474122] CPU: 18 PID: 0 Comm: swapper/18 Not tainted 6.6.13+bpo-amd64 #1  Debian 6.6.13-1~bpo12+1
> [  149.474134] Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A WIFI DDR4 (MS-7D43), BIOS 1.E0 09/14/2023
> [  149.474141] RIP: 0010:dev_watchdog+0x235/0x240
> [  149.474154] Code: ff ff ff 48 89 df c6 05 6c 2a 40 01 01 e8 e3 37 fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 60 5f f2 9f e8 0b e3 6a ff <0f> 0b e9 2a ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
> [  149.474163] RSP: 0018:ffffa4c3c0444e78 EFLAGS: 00010286
> [  149.474176] RAX: 0000000000000000 RBX: ffff93d94b354000 RCX: 000000000000083f
> [  149.474185] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
> [  149.474192] RBP: ffff93d94b3544c8 R08: 0000000000000000 R09: ffffa4c3c0444d00
> [  149.474199] R10: 0000000000000003 R11: ffff93e0bf780228 R12: ffff93d94b346a00
> [  149.474207] R13: ffff93d94b35441c R14: 0000000000000000 R15: 0000000000001a9c
> [  149.474215] FS:  0000000000000000(0000) GS:ffff93e09f680000(0000) knlGS:0000000000000000
> [  149.474225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  149.474232] CR2: 00007f616800a008 CR3: 0000000136820000 CR4: 0000000000f50ee0
> [  149.474240] PKRU: 55555554
> [  149.474247] Call Trace:
> [  149.474254]  <IRQ>
> [  149.474261]  ? dev_watchdog+0x235/0x240
> [  149.474271]  ? __warn+0x81/0x130
> [  149.474289]  ? dev_watchdog+0x235/0x240
> [  149.474298]  ? report_bug+0x171/0x1a0
> [  149.474314]  ? handle_bug+0x41/0x70
> [  149.474326]  ? exc_invalid_op+0x17/0x70
> [  149.474338]  ? asm_exc_invalid_op+0x1a/0x20
> [  149.474353]  ? dev_watchdog+0x235/0x240
> [  149.474363]  ? dev_watchdog+0x235/0x240
> [  149.474372]  ? __pfx_dev_watchdog+0x10/0x10
> [  149.474381]  call_timer_fn+0x24/0x130
> [  149.474396]  ? __pfx_dev_watchdog+0x10/0x10
> [  149.474404]  __run_timers+0x222/0x2c0
> [  149.474420]  run_timer_softirq+0x1d/0x40
> [  149.474433]  __do_softirq+0xc7/0x2ae
> [  149.474444]  __irq_exit_rcu+0x96/0xb0
> [  149.474459]  sysvec_apic_timer_interrupt+0x72/0x90
> [  149.474469]  </IRQ>
> [  149.474473]  <TASK>
> [  149.474479]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [  149.474492] RIP: 0010:cpuidle_enter_state+0xcc/0x440
> [  149.474504] Code: fa b6 53 ff e8 35 f4 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 43 c4 52 ff 45 84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
> [  149.474514] RSP: 0018:ffffa4c3c022be90 EFLAGS: 00000246
> [  149.474524] RAX: ffff93e09f6b3440 RBX: ffffc4c3bfcb2140 RCX: 000000000000001f
> [  149.474529] RDX: 0000000000000012 RSI: 000000003c9b26c9 RDI: 0000000000000000
> [  149.474536] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000500
> [  149.474542] R10: 0000000000000007 R11: ffff93e09f6b1fe4 R12: ffffffffa079a500
> [  149.474548] R13: 00000022cd4ab97d R14: 0000000000000004 R15: 0000000000000000
> [  149.474560]  cpuidle_enter+0x2d/0x40
> [  149.474571]  do_idle+0x20d/0x270
> [  149.474585]  cpu_startup_entry+0x2a/0x30
> [  149.474598]  start_secondary+0x11e/0x140
> [  149.474613]  secondary_startup_64_no_verify+0x18f/0x19b
> [  149.474630]  </TASK>
> [  149.474635] ---[ end trace 0000000000000000 ]---
> 
> 
> Here's a dump of the registers (MAC Address redacted).
> It seems to be stuck on TxDescUnavail | RxOverflow | TxOK | RxOk, and interrupts are unmasked.
> 
> # ethtool -d eth0
> Unknown RealTek chip (TxConfig: 0x67100f00)
> Offset		Values
> ------		------
> 0x0000:		XXXXXXXXXXXXXXXXX fe 09 40 00 00 00 80 00 01 00 
> 0x0010:		00 00 57 ff 00 00 00 00 0a 00 00 00 00 00 00 00 
> 0x0020:		00 10 7d ff 00 00 00 00 19 1e 9f b4 79 a5 3f 3b 
> 0x0030:		00 00 00 00 00 00 00 0c 3f 00 00 00 95 00 00 00 
> 0x0040:		00 0f 10 67 0e 0f c2 40 00 00 00 00 00 00 00 00 
> 0x0050:		11 00 cf bc 60 11 03 01 00 00 00 00 00 00 00 00 
> 0x0060:		00 00 00 00 00 00 00 00 00 00 02 00 f3 00 80 f0 
> 0x0070:		00 00 00 00 00 00 00 00 07 00 00 00 00 00 b3 e9 
> 0x0080:		62 60 02 00 00 02 20 00 00 00 00 00 00 00 00 00 
> 0x0090:		00 00 00 00 60 00 20 02 62 64 00 00 00 00 00 00 
> 0x00a0:		00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 0x00b0:		1f 00 00 00 80 00 00 00 ec 10 1a d2 01 00 01 00 
> 0x00c0:		00 00 00 00 00 00 00 00 00 00 00 00 12 00 00 00 
> 0x00d0:		21 00 04 12 00 00 01 00 00 00 00 40 ff ff ff ff 
> 0x00e0:		20 20 03 01 00 00 91 fe 00 00 00 00 ff 02 00 00 
> 0x00f0:		3f 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00 
> 
> 
> I tried instrumenting the code a bit and found that rtl8169_poll() being called with interrupts unmasked seems to be a precursor to the problem occuring.
> The *only* time this usually happens is when a GRO timer has been set but napi_defer_hard_irqs is off.
> Now that the defaults are gro_flush_timeout=20000, napi_defer_hard_irqs=1, this probably doesn't happen very often for most people.
> I guess it will happen with busy polling, but I haven't tested that yet.
> 
> 
> diff --git linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
> index 81fd31f..927786f 100644
> --- linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c
> +++ linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4601,6 +4601,8 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>         struct net_device *dev = tp->dev;
>         int work_done;
>  
> +       WARN_ONCE(RTL_R32(tp, IntrMask_8125) != 0, "rtl8169_poll: IRQs enabled!");
> +
>         rtl_tx(dev, tp, budget);
>  
>         work_done = rtl_rx(dev, tp, budget);
> 
> [ 5055.978473] ------------[ cut here ]------------
> [ 5055.978503] rtl8169_poll: IRQs enabled!
> [ 5055.978527] WARNING: CPU: 15 PID: 0 at /home/ken/work/r8169/linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c:4604 rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.978568] Modules linked in: r8169(OE) realtek mdio_devres libphy nft_chain_nat nf_nat bridge stp llc ip6t_REJECT nf_reject_ipv6 qrtr ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc nft_compat nf_tables libcrc32c nfnetlink binfmt_misc joydev nls_ascii nls_cp437 vfat fat hid_generic usbhid hid amdgpu intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_sof_pci_intel_tgl intel_uncore_frequency_common snd_sof_intel_hda_common x86_pkg_temp_thermal soundwire_intel intel_powerclamp soundwire_generic_allocation snd_sof_intel_hda_mlink coretemp iwlmvm soundwire_cadence kvm_intel snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof kvm mac80211 snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core irqbypass snd_soc_acpi_intel_match snd_soc_acpi libarc4 snd_hda_codec_realtek ghash_clmulni_intel snd_soc_core sha512_ssse3 snd_hda_codec_generic drm_exec sha256_ssse3 amdxcp sha1_ssse3 ledtrig_audio drm_buddy iwlwifi gpu_sched snd_compress snd_hda_codec_hdmi snd_pcm_dmaengine
> [ 5055.979159]  drm_suballoc_helper soundwire_bus drm_display_helper aesni_intel snd_hda_intel cec crypto_simd cryptd snd_intel_dspcfg rc_core snd_intel_sdw_acpi rapl snd_hda_codec mei_hdcp drm_ttm_helper mei_pxp pmt_telemetry intel_cstate cfg80211 evdev pmt_class snd_hda_core ttm snd_hwdep mei_me snd_pcm drm_kms_helper iTCO_wdt wmi_bmof intel_pmc_bxt snd_timer intel_uncore snd iTCO_vendor_support i2c_algo_bit mei ee1004 mxm_wmi watchdog pcspkr soundcore rfkill intel_vsec serial_multi_instantiate intel_pmc_core acpi_tad acpi_pad button drm nct6683 parport_pc ppdev lp parport loop configfs efi_pstore efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod nvme ahci nvme_core libahci xhci_pci t10_pi libata xhci_hcd crc64_rocksoft crc64 usbcore scsi_mod crc_t10dif i2c_i801 crc32_pclmul crct10dif_generic crct10dif_pclmul crc32c_intel i2c_smbus video scsi_common usb_common fan crct10dif_common wmi pinctrl_alderlake [last unloaded: r8169(OE)]
> [ 5055.979787] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G        W  OE      6.6.13+bpo-amd64 #1  Debian 6.6.13-1~bpo12+1
> [ 5055.979797] Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A WIFI DDR4 (MS-7D43), BIOS 1.E0 09/14/2023
> [ 5055.979803] RIP: 0010:rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.979829] Code: 19 00 00 76 40 89 50 38 eb 98 80 3d 24 e2 00 00 00 0f 85 66 fb ff ff 48 c7 c7 7a d0 8d c0 c6 05 10 e2 00 00 01 e8 ab f6 fe f4 <0f> 0b e9 4c fb ff ff ba 40 00 00 00 88 50 38 e9 a5 fc ff ff 31 c0
> [ 5055.979836] RSP: 0018:ffffad4340618e88 EFLAGS: 00010286
> [ 5055.979846] RAX: 0000000000000000 RBX: ffff8f799b16c9e0 RCX: 000000000000083f
> [ 5055.979853] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
> [ 5055.979859] RBP: ffff8f799b16c9e0 R08: 0000000000000000 R09: ffffad4340618d10
> [ 5055.979864] R10: 0000000000000003 R11: ffff8f80ff780228 R12: ffff8f799b16c000
> [ 5055.979871] R13: 0000000000000040 R14: ffff8f799b16c9c0 R15: ffff8f799b16c9e0
> [ 5055.979877] FS:  0000000000000000(0000) GS:ffff8f80df5c0000(0000) knlGS:0000000000000000
> [ 5055.979884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5055.979890] CR2: 00007f0062503000 CR3: 00000004afc20000 CR4: 0000000000f50ee0
> [ 5055.979897] PKRU: 55555554
> [ 5055.979903] Call Trace:
> [ 5055.979911]  <IRQ>
> [ 5055.979916]  ? rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.979940]  ? __warn+0x81/0x130
> [ 5055.979955]  ? rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.979978]  ? report_bug+0x171/0x1a0
> [ 5055.979992]  ? handle_bug+0x41/0x70
> [ 5055.980004]  ? exc_invalid_op+0x17/0x70
> [ 5055.980014]  ? asm_exc_invalid_op+0x1a/0x20
> [ 5055.980026]  ? rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.980048]  ? rtl8169_poll+0x4e5/0x520 [r8169]
> [ 5055.980070]  ? ktime_get+0x3c/0xa0
> [ 5055.980079]  ? sched_clock+0x10/0x30
> [ 5055.980090]  __napi_poll+0x28/0x1b0
> [ 5055.980104]  net_rx_action+0x2a4/0x380
> [ 5055.980116]  __do_softirq+0xc7/0x2ae
> [ 5055.980126]  __irq_exit_rcu+0x96/0xb0
> [ 5055.980139]  sysvec_apic_timer_interrupt+0x72/0x90
> [ 5055.980148]  </IRQ>
> [ 5055.980152]  <TASK>
> [ 5055.980156]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 5055.980167] RIP: 0010:cpuidle_enter_state+0xcc/0x440
> [ 5055.980179] Code: fa b6 53 ff e8 35 f4 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 43 c4 52 ff 45 84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
> [ 5055.980187] RSP: 0018:ffffad4340213e90 EFLAGS: 00000246
> [ 5055.980197] RAX: ffff8f80df5f3440 RBX: ffffcd433fbf2140 RCX: 000000000000001f
> [ 5055.980203] RDX: 000000000000000f RSI: 000000003c9b26c9 RDI: 0000000000000000
> [ 5055.980208] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000012
> [ 5055.980214] R10: 0000000000000008 R11: ffff8f80df5f1fe4 R12: ffffffffb759a500
> [ 5055.980219] R13: 000004992fcca629 R14: 0000000000000004 R15: 0000000000000000
> [ 5055.980229]  cpuidle_enter+0x2d/0x40
> [ 5055.980238]  do_idle+0x20d/0x270
> [ 5055.980253]  cpu_startup_entry+0x2a/0x30
> [ 5055.980265]  start_secondary+0x11e/0x140
> [ 5055.980279]  secondary_startup_64_no_verify+0x18f/0x19b
> [ 5055.980294]  </TASK>
> [ 5055.980298] ---[ end trace 0000000000000000 ]---
> 
> 
> So to make the problem go away, I found that putting an unconditional call to rtl_irq_disable() up front in rtl8169_poll() is sufficient.
> This seems a shame, since in almost every case, interrupts are already off at this point so it is an unnecessary write to the card.
> 
> I assume it is rtl8169_interrupt() clearing the interrupt status register while something inside rtl8169_interrupt() is going on that causes the problem, so this needs to be avoided.
> I tried moving the interrupt masking around inside rtl_tx() and rtl_rx() to see if I could work out which specific place is vulnerable to the race, but it was inconclusive.
> 
In general there's nothing wrong with having interrupts enabled.
Disabling them is just an optimization. Seems like this scenario
triggers some silicon bug.

> 
> The cheap hack below seems like a more performant solution than masking interrupts unconditionally in the poll function:
> If a hardware interrupt comes in and NAPI_STATE_SCHED is set, we assume we're either in the poll function already or it will be called again soon, so we can safely disable interrupts.
> It has worked perfectly for me so far, although it doesn't prevent the poll function from *ever* getting called with interrupts on. I suspect it will come apart with busy polling or the like.
> 
> No doubt some sort of semaphore between the interrupt handler and poll function will be needed to decide who gets to disable interrupts.
> It would be great if NAPI had a "begin polling" upcall or something...
> 
> 
> diff --git linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
> index 81fd31f..60cf4f6 100644
> --- linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c
> +++ linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4521,6 +4521,11 @@ release_descriptor:
>         return count;
>  }
>  
> +static inline bool napi_is_scheduled(struct napi_struct *n)
> +{
> +       return test_bit(NAPI_STATE_SCHED, &n->state);
> +}
> +
>  static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  {
>         struct rtl8169_private *tp = dev_instance;
> @@ -4546,7 +4551,8 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>         if (napi_schedule_prep(&tp->napi)) {
>                 rtl_irq_disable(tp);
>                 __napi_schedule(&tp->napi);
> -       }
> +       } else if (napi_is_scheduled(&tp->napi))
> +               rtl_irq_disable(tp);

Re-reading &tp->napi may be racy, and I think the code delivers
a wrong result if NAPI_STATE_SCHEDand NAPI_STATE_DISABLE
both are set.

>  out:
>         rtl_ack_events(tp, status);

The following uses a modified version of napi_schedule_prep()
to avoid re-reading the napi state.
We would have to see whether this extension to the net core is
acceptable, as r8169 would be the only user for now.
For testing it's one patch, for submitting it would need to be
splitted.

---
 drivers/net/ethernet/realtek/r8169_main.c |  6 ++++--
 include/linux/netdevice.h                 |  7 ++++++-
 net/core/dev.c                            | 12 ++++++------
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index eb329f0ab..94b97a16d 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 {
 	struct rtl8169_private *tp = dev_instance;
 	u32 status = rtl_get_events(tp);
+	int ret;
 
 	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
 		return IRQ_NONE;
@@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 	}
 
-	if (napi_schedule_prep(&tp->napi)) {
+	ret = __napi_schedule_prep(&tp->napi);
+	if (ret >= 0)
 		rtl_irq_disable(tp);
+	if (ret > 0)
 		__napi_schedule(&tp->napi);
-	}
 out:
 	rtl_ack_events(tp, status);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 42b9e6dc6..3df560264 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -498,7 +498,12 @@ static inline bool napi_is_scheduled(struct napi_struct *n)
 	return test_bit(NAPI_STATE_SCHED, &n->state);
 }
 
-bool napi_schedule_prep(struct napi_struct *n);
+int __napi_schedule_prep(struct napi_struct *n);
+
+static inline bool napi_schedule_prep(struct napi_struct *n)
+{
+	return __napi_schedule_prep(n) > 0;
+}
 
 /**
  *	napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 4bf081c5a..126eab121 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6102,21 +6102,21 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
- *	napi_schedule_prep - check if napi can be scheduled
+ *	__napi_schedule_prep - check if napi can be scheduled
  *	@n: napi context
  *
  * Test if NAPI routine is already running, and if not mark
  * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
+ * insure only one NAPI poll instance runs. Return -1 if
+ * there is a pending NAPI disable.
  */
-bool napi_schedule_prep(struct napi_struct *n)
+int __napi_schedule_prep(struct napi_struct *n)
 {
 	unsigned long new, val = READ_ONCE(n->state);
 
 	do {
 		if (unlikely(val & NAPIF_STATE_DISABLE))
-			return false;
+			return -1;
 		new = val | NAPIF_STATE_SCHED;
 
 		/* Sets STATE_MISSED bit if STATE_SCHED was already set
@@ -6131,7 +6131,7 @@ bool napi_schedule_prep(struct napi_struct *n)
 
 	return !(val & NAPIF_STATE_SCHED);
 }
-EXPORT_SYMBOL(napi_schedule_prep);
+EXPORT_SYMBOL(__napi_schedule_prep);
 
 /**
  * __napi_schedule_irqoff - schedule for receive
Ken Milmore May 9, 2024, 10:24 p.m. UTC | #2
On 08/05/2024 22:14, Heiner Kallweit wrote:
> 
> Re-reading &tp->napi may be racy, and I think the code delivers
> a wrong result if NAPI_STATE_SCHEDand NAPI_STATE_DISABLE
> both are set.
> 
>>  out:
>>         rtl_ack_events(tp, status);
> 
> The following uses a modified version of napi_schedule_prep()
> to avoid re-reading the napi state.
> We would have to see whether this extension to the net core is
> acceptable, as r8169 would be the only user for now.
> For testing it's one patch, for submitting it would need to be
> splitted.
> 
> ---
>  drivers/net/ethernet/realtek/r8169_main.c |  6 ++++--
>  include/linux/netdevice.h                 |  7 ++++++-
>  net/core/dev.c                            | 12 ++++++------
>  3 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index eb329f0ab..94b97a16d 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  {
>  	struct rtl8169_private *tp = dev_instance;
>  	u32 status = rtl_get_events(tp);
> +	int ret;
>  
>  	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>  		return IRQ_NONE;
> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>  	}
>  
> -	if (napi_schedule_prep(&tp->napi)) {
> +	ret = __napi_schedule_prep(&tp->napi);
> +	if (ret >= 0)
>  		rtl_irq_disable(tp);
> +	if (ret > 0)
>  		__napi_schedule(&tp->napi);
> -	}
>  out:
>  	rtl_ack_events(tp, status);
>  
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 42b9e6dc6..3df560264 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -498,7 +498,12 @@ static inline bool napi_is_scheduled(struct napi_struct *n)
>  	return test_bit(NAPI_STATE_SCHED, &n->state);
>  }
>  
> -bool napi_schedule_prep(struct napi_struct *n);
> +int __napi_schedule_prep(struct napi_struct *n);
> +
> +static inline bool napi_schedule_prep(struct napi_struct *n)
> +{
> +	return __napi_schedule_prep(n) > 0;
> +}
>  
>  /**
>   *	napi_schedule - schedule NAPI poll
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 4bf081c5a..126eab121 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6102,21 +6102,21 @@ void __napi_schedule(struct napi_struct *n)
>  EXPORT_SYMBOL(__napi_schedule);
>  
>  /**
> - *	napi_schedule_prep - check if napi can be scheduled
> + *	__napi_schedule_prep - check if napi can be scheduled
>   *	@n: napi context
>   *
>   * Test if NAPI routine is already running, and if not mark
>   * it as running.  This is used as a condition variable to
> - * insure only one NAPI poll instance runs.  We also make
> - * sure there is no pending NAPI disable.
> + * insure only one NAPI poll instance runs. Return -1 if
> + * there is a pending NAPI disable.
>   */
> -bool napi_schedule_prep(struct napi_struct *n)
> +int __napi_schedule_prep(struct napi_struct *n)
>  {
>  	unsigned long new, val = READ_ONCE(n->state);
>  
>  	do {
>  		if (unlikely(val & NAPIF_STATE_DISABLE))
> -			return false;
> +			return -1;
>  		new = val | NAPIF_STATE_SCHED;
>  
>  		/* Sets STATE_MISSED bit if STATE_SCHED was already set
> @@ -6131,7 +6131,7 @@ bool napi_schedule_prep(struct napi_struct *n)
>  
>  	return !(val & NAPIF_STATE_SCHED);
>  }
> -EXPORT_SYMBOL(napi_schedule_prep);
> +EXPORT_SYMBOL(__napi_schedule_prep);
>  
>  /**
>   * __napi_schedule_irqoff - schedule for receive

Here is a possible alternative (albeit expensive), using a flag in the driver:

diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
index 6e34177..d703af1 100644
--- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
@@ -579,6 +579,7 @@ enum rtl_flag {
        RTL_FLAG_TASK_RESET_PENDING,
        RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
        RTL_FLAG_TASK_TX_TIMEOUT,
+       RTL_FLAG_IRQ_DISABLED,
        RTL_FLAG_MAX
 };
 
@@ -4609,6 +4610,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 
        if (napi_schedule_prep(&tp->napi)) {
                rtl_irq_disable(tp);
+               set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
                __napi_schedule(&tp->napi);
        }
 out:
@@ -4655,12 +4657,17 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
        struct net_device *dev = tp->dev;
        int work_done;
 
+       if (!test_and_set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags))
+               rtl_irq_disable(tp);
+
        rtl_tx(dev, tp, budget);
 
        work_done = rtl_rx(dev, tp, budget);
 
-       if (work_done < budget && napi_complete_done(napi, work_done))
+       if (work_done < budget && napi_complete_done(napi, work_done)) {
+               clear_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
                rtl_irq_enable(tp);
+       }
 
        return work_done;
 }
Heiner Kallweit May 10, 2024, 10:06 p.m. UTC | #3
On 10.05.2024 00:24, Ken Milmore wrote:
> On 08/05/2024 22:14, Heiner Kallweit wrote:
>>
>> Re-reading &tp->napi may be racy, and I think the code delivers
>> a wrong result if NAPI_STATE_SCHEDand NAPI_STATE_DISABLE
>> both are set.
>>
>>>  out:
>>>         rtl_ack_events(tp, status);
>>
>> The following uses a modified version of napi_schedule_prep()
>> to avoid re-reading the napi state.
>> We would have to see whether this extension to the net core is
>> acceptable, as r8169 would be the only user for now.
>> For testing it's one patch, for submitting it would need to be
>> splitted.
>>
>> ---
>>  drivers/net/ethernet/realtek/r8169_main.c |  6 ++++--
>>  include/linux/netdevice.h                 |  7 ++++++-
>>  net/core/dev.c                            | 12 ++++++------
>>  3 files changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index eb329f0ab..94b97a16d 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -4639,6 +4639,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>  {
>>  	struct rtl8169_private *tp = dev_instance;
>>  	u32 status = rtl_get_events(tp);
>> +	int ret;
>>  
>>  	if ((status & 0xffff) == 0xffff || !(status & tp->irq_mask))
>>  		return IRQ_NONE;
>> @@ -4657,10 +4658,11 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>  		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>  	}
>>  
>> -	if (napi_schedule_prep(&tp->napi)) {
>> +	ret = __napi_schedule_prep(&tp->napi);
>> +	if (ret >= 0)
>>  		rtl_irq_disable(tp);
>> +	if (ret > 0)
>>  		__napi_schedule(&tp->napi);
>> -	}
>>  out:
>>  	rtl_ack_events(tp, status);
>>  
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 42b9e6dc6..3df560264 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -498,7 +498,12 @@ static inline bool napi_is_scheduled(struct napi_struct *n)
>>  	return test_bit(NAPI_STATE_SCHED, &n->state);
>>  }
>>  
>> -bool napi_schedule_prep(struct napi_struct *n);
>> +int __napi_schedule_prep(struct napi_struct *n);
>> +
>> +static inline bool napi_schedule_prep(struct napi_struct *n)
>> +{
>> +	return __napi_schedule_prep(n) > 0;
>> +}
>>  
>>  /**
>>   *	napi_schedule - schedule NAPI poll
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 4bf081c5a..126eab121 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -6102,21 +6102,21 @@ void __napi_schedule(struct napi_struct *n)
>>  EXPORT_SYMBOL(__napi_schedule);
>>  
>>  /**
>> - *	napi_schedule_prep - check if napi can be scheduled
>> + *	__napi_schedule_prep - check if napi can be scheduled
>>   *	@n: napi context
>>   *
>>   * Test if NAPI routine is already running, and if not mark
>>   * it as running.  This is used as a condition variable to
>> - * insure only one NAPI poll instance runs.  We also make
>> - * sure there is no pending NAPI disable.
>> + * insure only one NAPI poll instance runs. Return -1 if
>> + * there is a pending NAPI disable.
>>   */
>> -bool napi_schedule_prep(struct napi_struct *n)
>> +int __napi_schedule_prep(struct napi_struct *n)
>>  {
>>  	unsigned long new, val = READ_ONCE(n->state);
>>  
>>  	do {
>>  		if (unlikely(val & NAPIF_STATE_DISABLE))
>> -			return false;
>> +			return -1;
>>  		new = val | NAPIF_STATE_SCHED;
>>  
>>  		/* Sets STATE_MISSED bit if STATE_SCHED was already set
>> @@ -6131,7 +6131,7 @@ bool napi_schedule_prep(struct napi_struct *n)
>>  
>>  	return !(val & NAPIF_STATE_SCHED);
>>  }
>> -EXPORT_SYMBOL(napi_schedule_prep);
>> +EXPORT_SYMBOL(__napi_schedule_prep);
>>  
>>  /**
>>   * __napi_schedule_irqoff - schedule for receive
> 
> Here is a possible alternative (albeit expensive), using a flag in the driver:
> 
> diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
> index 6e34177..d703af1 100644
> --- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
> +++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
> @@ -579,6 +579,7 @@ enum rtl_flag {
>         RTL_FLAG_TASK_RESET_PENDING,
>         RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
>         RTL_FLAG_TASK_TX_TIMEOUT,
> +       RTL_FLAG_IRQ_DISABLED,
>         RTL_FLAG_MAX
>  };
>  
> @@ -4609,6 +4610,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  
>         if (napi_schedule_prep(&tp->napi)) {
>                 rtl_irq_disable(tp);
> +               set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
>                 __napi_schedule(&tp->napi);
>         }
>  out:
> @@ -4655,12 +4657,17 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>         struct net_device *dev = tp->dev;
>         int work_done;
>  
> +       if (!test_and_set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags))
> +               rtl_irq_disable(tp);
> +
>         rtl_tx(dev, tp, budget);
>  
>         work_done = rtl_rx(dev, tp, budget);
>  
> -       if (work_done < budget && napi_complete_done(napi, work_done))
> +       if (work_done < budget && napi_complete_done(napi, work_done)) {
> +               clear_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
>                 rtl_irq_enable(tp);
> +       }
>  
>         return work_done;
>  }
> 
> 
> 
> 

Nice idea. The following is a simplified version.
It's based on the thought that between scheduling NAPI and start of NAPI
polling interrupts don't hurt.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169_main.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index e5ea827a2..7b04dfecc 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -592,6 +592,7 @@ enum rtl_flag {
 	RTL_FLAG_TASK_RESET_PENDING,
 	RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
 	RTL_FLAG_TASK_TX_TIMEOUT,
+	RTL_FLAG_IRQ_DISABLED,
 	RTL_FLAG_MAX
 };
 
@@ -4657,10 +4658,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
 	}
 
-	if (napi_schedule_prep(&tp->napi)) {
-		rtl_irq_disable(tp);
-		__napi_schedule(&tp->napi);
-	}
+	napi_schedule(&tp->napi);
 out:
 	rtl_ack_events(tp, status);
 
@@ -4714,12 +4712,17 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 	struct net_device *dev = tp->dev;
 	int work_done;
 
+	if (!test_and_set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags))
+		rtl_irq_disable(tp);
+
 	rtl_tx(dev, tp, budget);
 
 	work_done = rtl_rx(dev, tp, budget);
 
-	if (work_done < budget && napi_complete_done(napi, work_done))
+	if (work_done < budget && napi_complete_done(napi, work_done)) {
+		clear_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
 		rtl_irq_enable(tp);
+	}
 
 	return work_done;
 }
Ken Milmore May 10, 2024, 10:29 p.m. UTC | #4
On 10/05/2024 23:06, Heiner Kallweit wrote:
> 
> Nice idea. The following is a simplified version.
> It's based on the thought that between scheduling NAPI and start of NAPI
> polling interrupts don't hurt.
> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/ethernet/realtek/r8169_main.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
> index e5ea827a2..7b04dfecc 100644
> --- a/drivers/net/ethernet/realtek/r8169_main.c
> +++ b/drivers/net/ethernet/realtek/r8169_main.c
> @@ -592,6 +592,7 @@ enum rtl_flag {
>  	RTL_FLAG_TASK_RESET_PENDING,
>  	RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
>  	RTL_FLAG_TASK_TX_TIMEOUT,
> +	RTL_FLAG_IRQ_DISABLED,
>  	RTL_FLAG_MAX
>  };
>  
> @@ -4657,10 +4658,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>  		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>  	}
>  
> -	if (napi_schedule_prep(&tp->napi)) {
> -		rtl_irq_disable(tp);
> -		__napi_schedule(&tp->napi);
> -	}
> +	napi_schedule(&tp->napi);
>  out:
>  	rtl_ack_events(tp, status);
>  
> @@ -4714,12 +4712,17 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>  	struct net_device *dev = tp->dev;
>  	int work_done;
>  
> +	if (!test_and_set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags))
> +		rtl_irq_disable(tp);
> +
>  	rtl_tx(dev, tp, budget);
>  
>  	work_done = rtl_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done))
> +	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +		clear_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
>  		rtl_irq_enable(tp);
> +	}
>  
>  	return work_done;
>  }

Reading this worries me though:

https://docs.kernel.org/networking/napi.html
"napi_disable() and subsequent calls to the poll method only wait for the ownership of the instance to be released, not for the poll method to exit.
This means that drivers should avoid accessing any data structures after calling napi_complete_done()."

Which seems to imply that the IRQ enable following napi_complete_done() is unguarded, and might race with the disable on an incoming poll.
Is that a possibility?
Heiner Kallweit May 11, 2024, 4:31 p.m. UTC | #5
On 11.05.2024 00:29, Ken Milmore wrote:
> On 10/05/2024 23:06, Heiner Kallweit wrote:
>>
>> Nice idea. The following is a simplified version.
>> It's based on the thought that between scheduling NAPI and start of NAPI
>> polling interrupts don't hurt.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/ethernet/realtek/r8169_main.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
>> index e5ea827a2..7b04dfecc 100644
>> --- a/drivers/net/ethernet/realtek/r8169_main.c
>> +++ b/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -592,6 +592,7 @@ enum rtl_flag {
>>  	RTL_FLAG_TASK_RESET_PENDING,
>>  	RTL_FLAG_TASK_RESET_NO_QUEUE_WAKE,
>>  	RTL_FLAG_TASK_TX_TIMEOUT,
>> +	RTL_FLAG_IRQ_DISABLED,
>>  	RTL_FLAG_MAX
>>  };
>>  
>> @@ -4657,10 +4658,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
>>  		rtl_schedule_task(tp, RTL_FLAG_TASK_RESET_PENDING);
>>  	}
>>  
>> -	if (napi_schedule_prep(&tp->napi)) {
>> -		rtl_irq_disable(tp);
>> -		__napi_schedule(&tp->napi);
>> -	}
>> +	napi_schedule(&tp->napi);
>>  out:
>>  	rtl_ack_events(tp, status);
>>  
>> @@ -4714,12 +4712,17 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>>  	struct net_device *dev = tp->dev;
>>  	int work_done;
>>  
>> +	if (!test_and_set_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags))
>> +		rtl_irq_disable(tp);
>> +
>>  	rtl_tx(dev, tp, budget);
>>  
>>  	work_done = rtl_rx(dev, tp, budget);
>>  
>> -	if (work_done < budget && napi_complete_done(napi, work_done))
>> +	if (work_done < budget && napi_complete_done(napi, work_done)) {
>> +		clear_bit(RTL_FLAG_IRQ_DISABLED, tp->wk.flags);
>>  		rtl_irq_enable(tp);
>> +	}
>>  
>>  	return work_done;
>>  }
> 
> Reading this worries me though:
> 
> https://docs.kernel.org/networking/napi.html
> "napi_disable() and subsequent calls to the poll method only wait for the ownership of the instance to be released, not for the poll method to exit.
> This means that drivers should avoid accessing any data structures after calling napi_complete_done()."
> 
According to kernel doc napi_disable() waits.

/**
 *	napi_disable - prevent NAPI from scheduling
 *	@n: NAPI context
 *
 * Stop NAPI from being scheduled on this context.
 * Waits till any outstanding processing completes.
 */

> Which seems to imply that the IRQ enable following napi_complete_done() is unguarded, and might race with the disable on an incoming poll.
> Is that a possibility?

Same documents states in section "Scheduling and IRQ masking":
IRQ should only be unmasked after a successful call to napi_complete_done()
So I think we should be fine.
Ken Milmore May 12, 2024, 7:49 p.m. UTC | #6
On 11/05/2024 17:31, Heiner Kallweit wrote:
> On 11.05.2024 00:29, Ken Milmore wrote:
>>
>> Reading this worries me though:
>>
>> https://docs.kernel.org/networking/napi.html
>> "napi_disable() and subsequent calls to the poll method only wait for the ownership of the instance to be released, not for the poll method to exit.
>> This means that drivers should avoid accessing any data structures after calling napi_complete_done()."
>>
> According to kernel doc napi_disable() waits.
> 
> /**
>  *	napi_disable - prevent NAPI from scheduling
>  *	@n: NAPI context
>  *
>  * Stop NAPI from being scheduled on this context.
>  * Waits till any outstanding processing completes.
>  */
> 
>> Which seems to imply that the IRQ enable following napi_complete_done() is unguarded, and might race with the disable on an incoming poll.
>> Is that a possibility?
> 
> Same documents states in section "Scheduling and IRQ masking":
> IRQ should only be unmasked after a successful call to napi_complete_done()
> So I think we should be fine.
> 

Nevertheless, it would be good if we could get away without the flag.

I had started out with the assumption that an interrupt acknowledgement coinciding with some part of the work being done in rtl8169_poll() might be the cause of the problem.
So it seemed natural to try guarding the whole block by disabling interrupts at the beginning.
But this seems to work just as well:

diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
index 6e34177..353ce99 100644
--- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
@@ -4659,8 +4659,10 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
 
 	work_done = rtl_rx(dev, tp, budget);
 
-	if (work_done < budget && napi_complete_done(napi, work_done))
+	if (work_done < budget && napi_complete_done(napi, work_done)) {
+		rtl_irq_disable(tp);
 		rtl_irq_enable(tp);
+	}
 
 	return work_done;
 }

On this basis, I assume the problem may actually involve some subtlety with the behaviour of the interrupt mask and status registers.

In addition, I'm not sure it is such a good idea to do away with disabling interrupts from within rtl8169_interrupt().
This causes a modest, but noticeable increase in IRQ rate which I measured at around 3 to 7%, depending on whether the load is Tx or Rx heavy and also on the setting of gro_flush_timeout and napi_defer_hard_irqs.

e.g.
Tx only test with iperf3, gro_flush_timeout=20000, napi_defer_hard_irqs=1:
Averaged 32343 vs 30165 interrupts per second, an increase of about 7%.

Bidirectional test with with gro_flush_timeout=0, napi_defer_hard_irqs=0:
Averaged 82118 vs 79689 interrupts per second, an increase of about 3%.

Given that these NICs are already fairly heavy on interrupt rate, it seems a shame to make them even worse!

All in all I preferred the solution where we do all the interrupt disabling in rtl8169_interrupt(), notwithstanding that it may require a change to the interface of napi_schedule_prep().
Heiner Kallweit May 12, 2024, 10:08 p.m. UTC | #7
On 12.05.2024 21:49, Ken Milmore wrote:
> 
> 
> On 11/05/2024 17:31, Heiner Kallweit wrote:
>> On 11.05.2024 00:29, Ken Milmore wrote:
>>>
>>> Reading this worries me though:
>>>
>>> https://docs.kernel.org/networking/napi.html
>>> "napi_disable() and subsequent calls to the poll method only wait for the ownership of the instance to be released, not for the poll method to exit.
>>> This means that drivers should avoid accessing any data structures after calling napi_complete_done()."
>>>
>> According to kernel doc napi_disable() waits.
>>
>> /**
>>  *	napi_disable - prevent NAPI from scheduling
>>  *	@n: NAPI context
>>  *
>>  * Stop NAPI from being scheduled on this context.
>>  * Waits till any outstanding processing completes.
>>  */
>>
>>> Which seems to imply that the IRQ enable following napi_complete_done() is unguarded, and might race with the disable on an incoming poll.
>>> Is that a possibility?
>>
>> Same documents states in section "Scheduling and IRQ masking":
>> IRQ should only be unmasked after a successful call to napi_complete_done()
>> So I think we should be fine.
>>
> 
> Nevertheless, it would be good if we could get away without the flag.
> 
> I had started out with the assumption that an interrupt acknowledgement coinciding with some part of the work being done in rtl8169_poll() might be the cause of the problem.
> So it seemed natural to try guarding the whole block by disabling interrupts at the beginning.
> But this seems to work just as well:
> 
> diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
> index 6e34177..353ce99 100644
> --- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
> +++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
> @@ -4659,8 +4659,10 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>  
>  	work_done = rtl_rx(dev, tp, budget);
>  
> -	if (work_done < budget && napi_complete_done(napi, work_done))
> +	if (work_done < budget && napi_complete_done(napi, work_done)) {
> +		rtl_irq_disable(tp);
>  		rtl_irq_enable(tp);
> +	}
>  
>  	return work_done;
>  }
> 
> On this basis, I assume the problem may actually involve some subtlety with the behaviour of the interrupt mask and status registers.
> 
In the register dump in your original report the interrupt mask is set.
So it seems rtl_irq_enable() was executed. I don't have an explanation
why a previous rtl_irq_disable() makes a difference.
Interesting would be whether it has to be a write to the interrupt mask
register, or whether a write to any register is sufficient.

> In addition, I'm not sure it is such a good idea to do away with disabling interrupts from within rtl8169_interrupt().
> This causes a modest, but noticeable increase in IRQ rate which I measured at around 3 to 7%, depending on whether the load is Tx or Rx heavy and also on the setting of gro_flush_timeout and napi_defer_hard_irqs.
> 
> e.g.
> Tx only test with iperf3, gro_flush_timeout=20000, napi_defer_hard_irqs=1:
> Averaged 32343 vs 30165 interrupts per second, an increase of about 7%.
> 
> Bidirectional test with with gro_flush_timeout=0, napi_defer_hard_irqs=0:
> Averaged 82118 vs 79689 interrupts per second, an increase of about 3%.
> 
> Given that these NICs are already fairly heavy on interrupt rate, it seems a shame to make them even worse!
> 
> All in all I preferred the solution where we do all the interrupt disabling in rtl8169_interrupt(), notwithstanding that it may require a change to the interface of napi_schedule_prep().

I agree.
Ken Milmore May 13, 2024, 12:27 a.m. UTC | #8
On 12/05/2024 23:08, Heiner Kallweit wrote:
> On 12.05.2024 21:49, Ken Milmore wrote:
>>
>> I had started out with the assumption that an interrupt acknowledgement coinciding with some part of the work being done in rtl8169_poll() might be the cause of the problem.
>> So it seemed natural to try guarding the whole block by disabling interrupts at the beginning.
>> But this seems to work just as well:
>>
>> diff --git linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
>> index 6e34177..353ce99 100644
>> --- linux-source-6.1~/drivers/net/ethernet/realtek/r8169_main.c
>> +++ linux-source-6.1/drivers/net/ethernet/realtek/r8169_main.c
>> @@ -4659,8 +4659,10 @@ static int rtl8169_poll(struct napi_struct *napi, int budget)
>>  
>>  	work_done = rtl_rx(dev, tp, budget);
>>  
>> -	if (work_done < budget && napi_complete_done(napi, work_done))
>> +	if (work_done < budget && napi_complete_done(napi, work_done)) {
>> +		rtl_irq_disable(tp);
>>  		rtl_irq_enable(tp);
>> +	}
>>  
>>  	return work_done;
>>  }
>>
>> On this basis, I assume the problem may actually involve some subtlety with the behaviour of the interrupt mask and status registers.
>>
> In the register dump in your original report the interrupt mask is set.
> So it seems rtl_irq_enable() was executed. I don't have an explanation
> why a previous rtl_irq_disable() makes a difference.
> Interesting would be whether it has to be a write to the interrupt mask
> register, or whether a write to any register is sufficient.
> 

In place of calling rtl_irq_disable(), I tried poking at the doorbell and at some of the unused timer registers. These had no effect.

I tried writing various different values to the mask register:

RTL_W32(tp, IntrMask_8125, 0x00); // worked, naturally
RTL_W32(tp, IntrMask_8125, 0x3f); // no effect
RTL_W32(tp, IntrMask_8125, 0x3b); // no effect
RTL_W32(tp, IntrMask_8125, 0x3a); // worked!

So masking both TxOK and RxOK before unmasking seemed to work, but masking either of them individually didn't.

Also, masking just TxOK then just RxOK in sequence, or vice versa didn't seem to work; they both had to be masked together.

YMMV! :-)
diff mbox series

Patch

diff --git linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
index 81fd31f..927786f 100644
--- linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
@@ -4601,6 +4601,8 @@  static int rtl8169_poll(struct napi_struct *napi, int budget)
        struct net_device *dev = tp->dev;
        int work_done;
 
+       WARN_ONCE(RTL_R32(tp, IntrMask_8125) != 0, "rtl8169_poll: IRQs enabled!");
+
        rtl_tx(dev, tp, budget);
 
        work_done = rtl_rx(dev, tp, budget);

[ 5055.978473] ------------[ cut here ]------------
[ 5055.978503] rtl8169_poll: IRQs enabled!
[ 5055.978527] WARNING: CPU: 15 PID: 0 at /home/ken/work/r8169/linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c:4604 rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.978568] Modules linked in: r8169(OE) realtek mdio_devres libphy nft_chain_nat nf_nat bridge stp llc ip6t_REJECT nf_reject_ipv6 qrtr ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sunrpc nft_compat nf_tables libcrc32c nfnetlink binfmt_misc joydev nls_ascii nls_cp437 vfat fat hid_generic usbhid hid amdgpu intel_rapl_msr intel_rapl_common intel_uncore_frequency snd_sof_pci_intel_tgl intel_uncore_frequency_common snd_sof_intel_hda_common x86_pkg_temp_thermal soundwire_intel intel_powerclamp soundwire_generic_allocation snd_sof_intel_hda_mlink coretemp iwlmvm soundwire_cadence kvm_intel snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof kvm mac80211 snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core irqbypass snd_soc_acpi_intel_match snd_soc_acpi libarc4 snd_hda_codec_realtek ghash_clmulni_intel snd_soc_core sha512_ssse3 snd_hda_codec_generic drm_exec sha256_ssse3 amdxcp sha1_ssse3 ledtrig_audio drm_buddy iwlwifi gpu_sched snd_compress snd_hda_codec_hdmi snd_pcm_dmaengine
[ 5055.979159]  drm_suballoc_helper soundwire_bus drm_display_helper aesni_intel snd_hda_intel cec crypto_simd cryptd snd_intel_dspcfg rc_core snd_intel_sdw_acpi rapl snd_hda_codec mei_hdcp drm_ttm_helper mei_pxp pmt_telemetry intel_cstate cfg80211 evdev pmt_class snd_hda_core ttm snd_hwdep mei_me snd_pcm drm_kms_helper iTCO_wdt wmi_bmof intel_pmc_bxt snd_timer intel_uncore snd iTCO_vendor_support i2c_algo_bit mei ee1004 mxm_wmi watchdog pcspkr soundcore rfkill intel_vsec serial_multi_instantiate intel_pmc_core acpi_tad acpi_pad button drm nct6683 parport_pc ppdev lp parport loop configfs efi_pstore efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic dm_mod nvme ahci nvme_core libahci xhci_pci t10_pi libata xhci_hcd crc64_rocksoft crc64 usbcore scsi_mod crc_t10dif i2c_i801 crc32_pclmul crct10dif_generic crct10dif_pclmul crc32c_intel i2c_smbus video scsi_common usb_common fan crct10dif_common wmi pinctrl_alderlake [last unloaded: r8169(OE)]
[ 5055.979787] CPU: 15 PID: 0 Comm: swapper/15 Tainted: G        W  OE      6.6.13+bpo-amd64 #1  Debian 6.6.13-1~bpo12+1
[ 5055.979797] Hardware name: Micro-Star International Co., Ltd. MS-7D43/PRO B660M-A WIFI DDR4 (MS-7D43), BIOS 1.E0 09/14/2023
[ 5055.979803] RIP: 0010:rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.979829] Code: 19 00 00 76 40 89 50 38 eb 98 80 3d 24 e2 00 00 00 0f 85 66 fb ff ff 48 c7 c7 7a d0 8d c0 c6 05 10 e2 00 00 01 e8 ab f6 fe f4 <0f> 0b e9 4c fb ff ff ba 40 00 00 00 88 50 38 e9 a5 fc ff ff 31 c0
[ 5055.979836] RSP: 0018:ffffad4340618e88 EFLAGS: 00010286
[ 5055.979846] RAX: 0000000000000000 RBX: ffff8f799b16c9e0 RCX: 000000000000083f
[ 5055.979853] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[ 5055.979859] RBP: ffff8f799b16c9e0 R08: 0000000000000000 R09: ffffad4340618d10
[ 5055.979864] R10: 0000000000000003 R11: ffff8f80ff780228 R12: ffff8f799b16c000
[ 5055.979871] R13: 0000000000000040 R14: ffff8f799b16c9c0 R15: ffff8f799b16c9e0
[ 5055.979877] FS:  0000000000000000(0000) GS:ffff8f80df5c0000(0000) knlGS:0000000000000000
[ 5055.979884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5055.979890] CR2: 00007f0062503000 CR3: 00000004afc20000 CR4: 0000000000f50ee0
[ 5055.979897] PKRU: 55555554
[ 5055.979903] Call Trace:
[ 5055.979911]  <IRQ>
[ 5055.979916]  ? rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.979940]  ? __warn+0x81/0x130
[ 5055.979955]  ? rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.979978]  ? report_bug+0x171/0x1a0
[ 5055.979992]  ? handle_bug+0x41/0x70
[ 5055.980004]  ? exc_invalid_op+0x17/0x70
[ 5055.980014]  ? asm_exc_invalid_op+0x1a/0x20
[ 5055.980026]  ? rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.980048]  ? rtl8169_poll+0x4e5/0x520 [r8169]
[ 5055.980070]  ? ktime_get+0x3c/0xa0
[ 5055.980079]  ? sched_clock+0x10/0x30
[ 5055.980090]  __napi_poll+0x28/0x1b0
[ 5055.980104]  net_rx_action+0x2a4/0x380
[ 5055.980116]  __do_softirq+0xc7/0x2ae
[ 5055.980126]  __irq_exit_rcu+0x96/0xb0
[ 5055.980139]  sysvec_apic_timer_interrupt+0x72/0x90
[ 5055.980148]  </IRQ>
[ 5055.980152]  <TASK>
[ 5055.980156]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 5055.980167] RIP: 0010:cpuidle_enter_state+0xcc/0x440
[ 5055.980179] Code: fa b6 53 ff e8 35 f4 ff ff 8b 53 04 49 89 c5 0f 1f 44 00 00 31 ff e8 43 c4 52 ff 45 84 ff 0f 85 57 02 00 00 fb 0f 1f 44 00 00 <45> 85 f6 0f 88 85 01 00 00 49 63 d6 48 8d 04 52 48 8d 04 82 49 8d
[ 5055.980187] RSP: 0018:ffffad4340213e90 EFLAGS: 00000246
[ 5055.980197] RAX: ffff8f80df5f3440 RBX: ffffcd433fbf2140 RCX: 000000000000001f
[ 5055.980203] RDX: 000000000000000f RSI: 000000003c9b26c9 RDI: 0000000000000000
[ 5055.980208] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000012
[ 5055.980214] R10: 0000000000000008 R11: ffff8f80df5f1fe4 R12: ffffffffb759a500
[ 5055.980219] R13: 000004992fcca629 R14: 0000000000000004 R15: 0000000000000000
[ 5055.980229]  cpuidle_enter+0x2d/0x40
[ 5055.980238]  do_idle+0x20d/0x270
[ 5055.980253]  cpu_startup_entry+0x2a/0x30
[ 5055.980265]  start_secondary+0x11e/0x140
[ 5055.980279]  secondary_startup_64_no_verify+0x18f/0x19b
[ 5055.980294]  </TASK>
[ 5055.980298] ---[ end trace 0000000000000000 ]---


So to make the problem go away, I found that putting an unconditional call to rtl_irq_disable() up front in rtl8169_poll() is sufficient.
This seems a shame, since in almost every case, interrupts are already off at this point so it is an unnecessary write to the card.

I assume it is rtl8169_interrupt() clearing the interrupt status register while something inside rtl8169_interrupt() is going on that causes the problem, so this needs to be avoided.
I tried moving the interrupt masking around inside rtl_tx() and rtl_rx() to see if I could work out which specific place is vulnerable to the race, but it was inconclusive.


The cheap hack below seems like a more performant solution than masking interrupts unconditionally in the poll function:
If a hardware interrupt comes in and NAPI_STATE_SCHED is set, we assume we're either in the poll function already or it will be called again soon, so we can safely disable interrupts.
It has worked perfectly for me so far, although it doesn't prevent the poll function from *ever* getting called with interrupts on. I suspect it will come apart with busy polling or the like.

No doubt some sort of semaphore between the interrupt handler and poll function will be needed to decide who gets to disable interrupts.
It would be great if NAPI had a "begin polling" upcall or something...


diff --git linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
index 81fd31f..60cf4f6 100644
--- linux-source-6.6~/drivers/net/ethernet/realtek/r8169_main.c
+++ linux-source-6.6/drivers/net/ethernet/realtek/r8169_main.c
@@ -4521,6 +4521,11 @@  release_descriptor:
        return count;
 }
 
+static inline bool napi_is_scheduled(struct napi_struct *n)
+{
+       return test_bit(NAPI_STATE_SCHED, &n->state);
+}
+
 static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 {
        struct rtl8169_private *tp = dev_instance;
@@ -4546,7 +4551,8 @@  static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
        if (napi_schedule_prep(&tp->napi)) {
                rtl_irq_disable(tp);
                __napi_schedule(&tp->napi);
-       }
+       } else if (napi_is_scheduled(&tp->napi))
+               rtl_irq_disable(tp);
 out:
        rtl_ack_events(tp, status);