
tcp: check socket state before calling WARN_ON

Message ID 20241203081247.1533534-1-youngmin.nam@samsung.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series: tcp: check socket state before calling WARN_ON

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3 this patch: 3
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 305 this patch: 305
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 18 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-12-03--15-00 (tests: 760)

Commit Message

Youngmin Nam Dec. 3, 2024, 8:12 a.m. UTC
We encountered the following WARNINGs
in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
which triggered a kernel panic due to panic_on_warn.

case 1.
------------[ cut here ]------------
WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
Call trace:
 tcp_sacktag_write_queue+0xae8/0xb60
 tcp_ack+0x4ec/0x12b8
 tcp_rcv_state_process+0x22c/0xd38
 tcp_v4_do_rcv+0x220/0x300
 tcp_v4_rcv+0xa5c/0xbb4
 ip_protocol_deliver_rcu+0x198/0x34c
 ip_local_deliver_finish+0x94/0xc4
 ip_local_deliver+0x74/0x10c
 ip_rcv+0xa0/0x13c
Kernel panic - not syncing: kernel: panic_on_warn set ...

case 2.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
Call trace:
 tcp_fastretrans_alert+0x8ac/0xa74
 tcp_ack+0x904/0x12b8
 tcp_rcv_state_process+0x22c/0xd38
 tcp_v4_do_rcv+0x220/0x300
 tcp_v4_rcv+0xa5c/0xbb4
 ip_protocol_deliver_rcu+0x198/0x34c
 ip_local_deliver_finish+0x94/0xc4
 ip_local_deliver+0x74/0x10c
 ip_rcv+0xa0/0x13c
Kernel panic - not syncing: kernel: panic_on_warn set ...

When we checked the socket state value at the time of the issue,
it was 0x4:

skc_state = 0x4,

This is "TCP_FIN_WAIT1", which means the device had closed its socket.

enum {
	TCP_ESTABLISHED = 1,
	TCP_SYN_SENT,
	TCP_SYN_RECV,
	TCP_FIN_WAIT1,

This also means tp->packets_out had been reset to 0
by tcp_write_queue_purge().

In a congested network, a TCP ACK for an already closed session
may arrive from the peer after a long delay. Such a late ACK can
trigger the WARN_ON macros above.

To make these warnings more meaningful, we would like to call
WARN_ON only when the socket state is "TCP_ESTABLISHED".
This will prevent the kernel from triggering a panic
due to panic_on_warn.

Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com>
---
 net/ipv4/tcp_input.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
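
For readers without the tree at hand, here is a sketch of the helpers these
two warnings exercise, paraphrased from include/net/tcp.h and
net/ipv4/tcp_input.c (exact definitions may differ across kernel versions):

static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
{
	return tp->sacked_out + tp->lost_out;
}

static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp)
{
	return tp->packets_out - tcp_left_out(tp) + tp->retrans_out;
}

/* net/ipv4/tcp_input.c: the check behind the tcp_fastretrans_alert() WARN */
#define tcp_verify_left_out(tp)	WARN_ON(tcp_left_out(tp) > tp->packets_out)

Once tp->packets_out has been zeroed while sacked_out/lost_out/retrans_out
are still nonzero, tcp_verify_left_out() fires even though the connection
is no longer doing useful work.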

Comments

Eric Dumazet Dec. 3, 2024, 11:07 a.m. UTC | #1
On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> We encountered the following WARNINGs
> in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> which triggered a kernel panic due to panic_on_warn.
>
> case 1.
> ------------[ cut here ]------------
> WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> Call trace:
>  tcp_sacktag_write_queue+0xae8/0xb60
>  tcp_ack+0x4ec/0x12b8
>  tcp_rcv_state_process+0x22c/0xd38
>  tcp_v4_do_rcv+0x220/0x300
>  tcp_v4_rcv+0xa5c/0xbb4
>  ip_protocol_deliver_rcu+0x198/0x34c
>  ip_local_deliver_finish+0x94/0xc4
>  ip_local_deliver+0x74/0x10c
>  ip_rcv+0xa0/0x13c
> Kernel panic - not syncing: kernel: panic_on_warn set ...
>
> case 2.
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> Call trace:
>  tcp_fastretrans_alert+0x8ac/0xa74
>  tcp_ack+0x904/0x12b8
>  tcp_rcv_state_process+0x22c/0xd38
>  tcp_v4_do_rcv+0x220/0x300
>  tcp_v4_rcv+0xa5c/0xbb4
>  ip_protocol_deliver_rcu+0x198/0x34c
>  ip_local_deliver_finish+0x94/0xc4
>  ip_local_deliver+0x74/0x10c
>  ip_rcv+0xa0/0x13c
> Kernel panic - not syncing: kernel: panic_on_warn set ...
>

I have not seen these warnings firing. Neal, have you seen this in the past ?

Please provide the kernel version (this must be a pristine LTS one)
and symbolized stack traces using scripts/decode_stacktrace.sh.

If this warning is easy to trigger, please provide a packetdrill test.



> When we check the socket state value at the time of the issue,
> it was 0x4.
>
> skc_state = 0x4,
>
> This is "TCP_FIN_WAIT1" and which means the device closed its socket.
>
> enum {
>         TCP_ESTABLISHED = 1,
>         TCP_SYN_SENT,
>         TCP_SYN_RECV,
>         TCP_FIN_WAIT1,
>
> And also this means tp->packets_out was initialized as 0
> by tcp_write_queue_purge().

What stack trace leads to this tcp_write_queue_purge() exactly ?

>
> In a congested network situation, a TCP ACK for
> an already closed session may be received with a delay from the peer.
> This can trigger the WARN_ON macro to help debug the situation.
>
> To make this situation more meaningful, we would like to call
> WARN_ON only when the state of the socket is "TCP_ESTABLISHED".
> This will prevent the kernel from triggering a panic
> due to panic_on_warn.
>
> Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com>
> ---
>  net/ipv4/tcp_input.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5bdf13ac26ef..62f4c285ab80 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2037,7 +2037,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
>         WARN_ON((int)tp->sacked_out < 0);
>         WARN_ON((int)tp->lost_out < 0);
>         WARN_ON((int)tp->retrans_out < 0);
> -       WARN_ON((int)tcp_packets_in_flight(tp) < 0);
> +       if (sk->sk_state == TCP_ESTABLISHED)

In any case this test on sk_state is too specific.

> +               WARN_ON((int)tcp_packets_in_flight(tp) < 0);
>  #endif
>         return state->flag;
>  }
> @@ -3080,7 +3081,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
>                 return;
>
>         /* C. Check consistency of the current state. */
> -       tcp_verify_left_out(tp);
> +       if (sk->sk_state == TCP_ESTABLISHED)
> +               tcp_verify_left_out(tp);
>
>         /* D. Check state exit conditions. State can be terminated
>          *    when high_seq is ACKed. */
> --
> 2.39.2
>
Neal Cardwell Dec. 3, 2024, 3:34 p.m. UTC | #2
On Tue, Dec 3, 2024 at 6:07 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > We encountered the following WARNINGs
> > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > which triggered a kernel panic due to panic_on_warn.
> >
> > case 1.
> > ------------[ cut here ]------------
> > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > Call trace:
> >  tcp_sacktag_write_queue+0xae8/0xb60
> >  tcp_ack+0x4ec/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
> > case 2.
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > Call trace:
> >  tcp_fastretrans_alert+0x8ac/0xa74
> >  tcp_ack+0x904/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
>
> I have not seen these warnings firing. Neal, have you seen this in the past ?

I can't recall seeing these warnings over the past 5 years or so, and
(from checking our monitoring) they don't seem to be firing in our
fleet recently.

> In any case this test on sk_state is too specific.

I agree with Eric. IMHO TCP_FIN_WAIT1 deserves all the same warnings
as ESTABLISHED, since in this state the connection may still have a
big queue of data it is trying to reliably send to the other side,
with full loss recovery and congestion control logic.

I would suggest that instead of running with panic_on_warn, it would
make more sense to not panic on warnings, and instead add more detail
to these warning messages in your kernels during testing, to help
debug what is going wrong. I would suggest adding the following to the
warning message (a sketch follows the list):

tp->packets_out
tp->sacked_out
tp->lost_out
tp->retrans_out
tcp_is_sack(tp)
tp->mss_cache
inet_csk(sk)->icsk_ca_state
inet_csk(sk)->icsk_pmtu_cookie
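
A minimal sketch of how those fields could be folded into the existing
check in tcp_sacktag_write_queue(), using the stock WARN_ONCE() helper
(illustrative only; not code from this thread):

WARN_ONCE((int)tcp_packets_in_flight(tp) < 0,
	  "pkts_out=%u sacked=%u lost=%u retrans=%u sack=%d mss=%u ca=%u pmtu=%u\n",
	  tp->packets_out, tp->sacked_out, tp->lost_out, tp->retrans_out,
	  tcp_is_sack(tp), tp->mss_cache,
	  inet_csk(sk)->icsk_ca_state, inet_csk(sk)->icsk_pmtu_cookie);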

A hunch would be that this is either firing for (a) non-SACK
connections, or (b) after an MTU reduction.

In particular, you might try `echo 0 >
/proc/sys/net/ipv4/tcp_mtu_probing` and see if that makes the warnings
go away.

cheers,
neal
Jakub Kicinski Dec. 4, 2024, 2:18 a.m. UTC | #3
On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > I have not seen these warnings firing. Neal, have you seen this in the past ?  
> 
> I can't recall seeing these warnings over the past 5 years or so, and
> (from checking our monitoring) they don't seem to be firing in our
> fleet recently.

FWIW I see this at Meta on 5.12 kernels, but nothing since.
Could be that one of our workloads is pinned to 5.12.
Youngmin, what's the newest kernel you can repro this on?
Youngmin Nam Dec. 4, 2024, 3:08 a.m. UTC | #4
Hi Eric.
Thanks for looking at this issue.

On Tue, Dec 03, 2024 at 12:07:05PM +0100, Eric Dumazet wrote:
> On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > We encountered the following WARNINGs
> > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > which triggered a kernel panic due to panic_on_warn.
> >
> > case 1.
> > ------------[ cut here ]------------
> > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > Call trace:
> >  tcp_sacktag_write_queue+0xae8/0xb60
> >  tcp_ack+0x4ec/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
> > case 2.
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > Call trace:
> >  tcp_fastretrans_alert+0x8ac/0xa74
> >  tcp_ack+0x904/0x12b8
> >  tcp_rcv_state_process+0x22c/0xd38
> >  tcp_v4_do_rcv+0x220/0x300
> >  tcp_v4_rcv+0xa5c/0xbb4
> >  ip_protocol_deliver_rcu+0x198/0x34c
> >  ip_local_deliver_finish+0x94/0xc4
> >  ip_local_deliver+0x74/0x10c
> >  ip_rcv+0xa0/0x13c
> > Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
> 
> I have not seen these warnings firing. Neal, have you seen this in the past ?
> 
> Please provide the kernel version (this must be a pristine LTS one).
We are running an Android kernel for Android mobile devices, based on LTS kernel 6.6.30.
But we've seen this issue since the 5.15 LTS kernel.

> and symbolized stack traces using scripts/decode_stacktrace.sh
Unfortunately, we don't have the matching vmlinux right now, so we need to rebuild and reproduce.
> 
> If this warning was easy to trigger, please provide a packetdrill test ?
I'm not sure if we can run a packetdrill test on an Android device. Anyway, let me check.

FYI, Here are more detailed logs.

Case 1.
[26496.422651]I[4:  napi/wlan0-33:  467] ------------[ cut here ]------------
[26496.422665]I[4:  napi/wlan0-33:  467] WARNING: CPU: 4 PID: 467 at net/ipv4/tcp_input.c:2026 tcp_sacktag_write_queue+0xae8/0xb60
[26496.423420]I[4:  napi/wlan0-33:  467] CPU: 4 PID: 467 Comm: napi/wlan0-33 Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20240930.125201-4k #1 a1c80b36942fa6e9575b2544032a7536ed502804
[26496.423427]I[4:  napi/wlan0-33:  467] Hardware name: Samsung ERD9955 board based on S5E9955 (DT)
[26496.423432]I[4:  napi/wlan0-33:  467] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[26496.423438]I[4:  napi/wlan0-33:  467] pc : tcp_sacktag_write_queue+0xae8/0xb60
[26496.423446]I[4:  napi/wlan0-33:  467] lr : tcp_ack+0x4ec/0x12b8
[26496.423455]I[4:  napi/wlan0-33:  467] sp : ffffffc096b8b690
[26496.423458]I[4:  napi/wlan0-33:  467] x29: ffffffc096b8b710 x28: 0000000000008001 x27: 000000005526d635
[26496.423469]I[4:  napi/wlan0-33:  467] x26: ffffff8a19079684 x25: 000000005526dbfd x24: 0000000000000001
[26496.423480]I[4:  napi/wlan0-33:  467] x23: 000000000000000a x22: ffffff88e5f5b680 x21: 000000005526dbc9
[26496.423489]I[4:  napi/wlan0-33:  467] x20: ffffff8a19078d80 x19: ffffff88e9f4193e x18: ffffffd083114c80
[26496.423499]I[4:  napi/wlan0-33:  467] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000000
[26496.423508]I[4:  napi/wlan0-33:  467] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
[26496.423517]I[4:  napi/wlan0-33:  467] x11: 0000000000000000 x10: 0000000000000001 x9 : 00000000fffffffd
[26496.423526]I[4:  napi/wlan0-33:  467] x8 : 0000000000000001 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
[26496.423536]I[4:  napi/wlan0-33:  467] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc096b8b818
[26496.423545]I[4:  napi/wlan0-33:  467] x2 : 000000005526d635 x1 : ffffff88e5f5b680 x0 : ffffff8a19078d80
[26496.423555]I[4:  napi/wlan0-33:  467] Call trace:
[26496.423558]I[4:  napi/wlan0-33:  467]  tcp_sacktag_write_queue+0xae8/0xb60
[26496.423566]I[4:  napi/wlan0-33:  467]  tcp_ack+0x4ec/0x12b8
[26496.423573]I[4:  napi/wlan0-33:  467]  tcp_rcv_state_process+0x22c/0xd38
[26496.423580]I[4:  napi/wlan0-33:  467]  tcp_v4_do_rcv+0x220/0x300
[26496.423590]I[4:  napi/wlan0-33:  467]  tcp_v4_rcv+0xa5c/0xbb4
[26496.423596]I[4:  napi/wlan0-33:  467]  ip_protocol_deliver_rcu+0x198/0x34c
[26496.423607]I[4:  napi/wlan0-33:  467]  ip_local_deliver_finish+0x94/0xc4
[26496.423614]I[4:  napi/wlan0-33:  467]  ip_local_deliver+0x74/0x10c
[26496.423620]I[4:  napi/wlan0-33:  467]  ip_rcv+0xa0/0x13c
[26496.423625]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_core+0xe14/0x1104
[26496.423642]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_list_core+0xb8/0x2dc
[26496.423649]I[4:  napi/wlan0-33:  467]  netif_receive_skb_list_internal+0x234/0x320
[26496.423655]I[4:  napi/wlan0-33:  467]  napi_complete_done+0xb4/0x1a0
[26496.423660]I[4:  napi/wlan0-33:  467]  slsi_rx_netif_napi_poll+0x22c/0x258 [scsc_wlan 16ac2100e65b7c78ce863cecc238b39b162dbe82]
[26496.423822]I[4:  napi/wlan0-33:  467]  __napi_poll+0x5c/0x238
[26496.423829]I[4:  napi/wlan0-33:  467]  napi_threaded_poll+0x110/0x204
[26496.423836]I[4:  napi/wlan0-33:  467]  kthread+0x114/0x138
[26496.423845]I[4:  napi/wlan0-33:  467]  ret_from_fork+0x10/0x20
[26496.423856]I[4:  napi/wlan0-33:  467] Kernel panic - not syncing: kernel: panic_on_warn set ..

Case 2.
[ 1843.463330]I[0: surfaceflinger:  648] ------------[ cut here ]------------
[ 1843.463355]I[0: surfaceflinger:  648] WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004 tcp_fastretrans_alert+0x8ac/0xa74
[ 1843.464508]I[0: surfaceflinger:  648] CPU: 0 PID: 648 Comm: surfaceflinger Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20241017.075836-4k #1 de751202c2c5ab3ec352a00ae470fc5e907bdcfe
[ 1843.464520]I[0: surfaceflinger:  648] Hardware name: Samsung ERD8855 board based on S5E8855 (DT)
[ 1843.464527]I[0: surfaceflinger:  648] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1843.464535]I[0: surfaceflinger:  648] pc : tcp_fastretrans_alert+0x8ac/0xa74
[ 1843.464548]I[0: surfaceflinger:  648] lr : tcp_ack+0x904/0x12b8
[ 1843.464556]I[0: surfaceflinger:  648] sp : ffffffc0800036e0
[ 1843.464561]I[0: surfaceflinger:  648] x29: ffffffc0800036e0 x28: 0000000000008005 x27: 000000001bc05562
[ 1843.464579]I[0: surfaceflinger:  648] x26: ffffff890418a3c4 x25: 0000000000000000 x24: 000000000000cd02
[ 1843.464595]I[0: surfaceflinger:  648] x23: 000000001bc05562 x22: 0000000000000000 x21: ffffffc080003800
[ 1843.464611]I[0: surfaceflinger:  648] x20: ffffffc08000378c x19: ffffff8904189ac0 x18: 0000000000000000
[ 1843.464627]I[0: surfaceflinger:  648] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000001
[ 1843.464642]I[0: surfaceflinger:  648] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
[ 1843.464658]I[0: surfaceflinger:  648] x11: ffffff883e9c9540 x10: 0000000000000001 x9 : 0000000000000001
[ 1843.464673]I[0: surfaceflinger:  648] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
[ 1843.464689]I[0: surfaceflinger:  648] x5 : 0000000000000000 x4 : ffffffc08000378c x3 : ffffffc080003800
[ 1843.464704]I[0: surfaceflinger:  648] x2 : 0000000000000000 x1 : 000000001bc05562 x0 : ffffff8904189ac0
[ 1843.464720]I[0: surfaceflinger:  648] Call trace:
[ 1843.464725]I[0: surfaceflinger:  648]  tcp_fastretrans_alert+0x8ac/0xa74
[ 1843.464735]I[0: surfaceflinger:  648]  tcp_ack+0x904/0x12b8
[ 1843.464743]I[0: surfaceflinger:  648]  tcp_rcv_state_process+0x22c/0xd38
[ 1843.464751]I[0: surfaceflinger:  648]  tcp_v4_do_rcv+0x220/0x300
[ 1843.464760]I[0: surfaceflinger:  648]  tcp_v4_rcv+0xa5c/0xbb4
[ 1843.464767]I[0: surfaceflinger:  648]  ip_protocol_deliver_rcu+0x198/0x34c
[ 1843.464776]I[0: surfaceflinger:  648]  ip_local_deliver_finish+0x94/0xc4
[ 1843.464784]I[0: surfaceflinger:  648]  ip_local_deliver+0x74/0x10c
[ 1843.464791]I[0: surfaceflinger:  648]  ip_rcv+0xa0/0x13c
[ 1843.464799]I[0: surfaceflinger:  648]  __netif_receive_skb_core+0xe14/0x1104
[ 1843.464810]I[0: surfaceflinger:  648]  __netif_receive_skb+0x40/0x124
[ 1843.464818]I[0: surfaceflinger:  648]  netif_receive_skb+0x7c/0x234
[ 1843.464825]I[0: surfaceflinger:  648]  slsi_rx_data_deliver_skb+0x1e0/0xdbc [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
[ 1843.465025]I[0: surfaceflinger:  648]  slsi_ba_process_complete+0x70/0xa4 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
[ 1843.465219]I[0: surfaceflinger:  648]  slsi_ba_aging_timeout_handler+0x324/0x354 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
[ 1843.465410]I[0: surfaceflinger:  648]  call_timer_fn+0xd0/0x360
[ 1843.465423]I[0: surfaceflinger:  648]  __run_timers+0x1b4/0x268
[ 1843.465432]I[0: surfaceflinger:  648]  run_timer_softirq+0x24/0x4c
[ 1843.465440]I[0: surfaceflinger:  648]  __do_softirq+0x158/0x45c
[ 1843.465448]I[0: surfaceflinger:  648]  ____do_softirq+0x10/0x20
[ 1843.465455]I[0: surfaceflinger:  648]  call_on_irq_stack+0x3c/0x74
[ 1843.465463]I[0: surfaceflinger:  648]  do_softirq_own_stack+0x1c/0x2c
[ 1843.465470]I[0: surfaceflinger:  648]  __irq_exit_rcu+0x54/0xb4
[ 1843.465480]I[0: surfaceflinger:  648]  irq_exit_rcu+0x10/0x1c
[ 1843.465489]I[0: surfaceflinger:  648]  el0_interrupt+0x54/0xe0
[ 1843.465499]I[0: surfaceflinger:  648]  __el0_irq_handler_common+0x18/0x28
[ 1843.465508]I[0: surfaceflinger:  648]  el0t_64_irq_handler+0x10/0x1c
[ 1843.465516]I[0: surfaceflinger:  648]  el0t_64_irq+0x1a8/0x1ac
[ 1843.465525]I[0: surfaceflinger:  648] Kernel panic - not syncing: kernel: panic_on_warn set ...

> > When we check the socket state value at the time of the issue,
> > it was 0x4.
> >
> > skc_state = 0x4,
> >
> > This is "TCP_FIN_WAIT1" and which means the device closed its socket.
> >
> > enum {
> >         TCP_ESTABLISHED = 1,
> >         TCP_SYN_SENT,
> >         TCP_SYN_RECV,
> >         TCP_FIN_WAIT1,
> >
> > And also this means tp->packets_out was initialized as 0
> > by tcp_write_queue_purge().
> 
> What stack trace leads to this tcp_write_queue_purge() exactly ?
I couldn't find the exact call stack for this.
But based on the ramdump snapshot, I believe the function was called.

(*(struct tcp_sock *)(0xFFFFFF800467AB00)).packets_out = 0
(*(struct inet_connection_sock *)0xFFFFFF800467AB00).icsk_backoff = 0

> >
> > In a congested network situation, a TCP ACK for
> > an already closed session may be received with a delay from the peer.
> > This can trigger the WARN_ON macro to help debug the situation.
> >
> > To make this situation more meaningful, we would like to call
> > WARN_ON only when the state of the socket is "TCP_ESTABLISHED".
> > This will prevent the kernel from triggering a panic
> > due to panic_on_warn.
> >
> > Signed-off-by: Youngmin Nam <youngmin.nam@samsung.com>
> > ---
> >  net/ipv4/tcp_input.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 5bdf13ac26ef..62f4c285ab80 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2037,7 +2037,8 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
> >         WARN_ON((int)tp->sacked_out < 0);
> >         WARN_ON((int)tp->lost_out < 0);
> >         WARN_ON((int)tp->retrans_out < 0);
> > -       WARN_ON((int)tcp_packets_in_flight(tp) < 0);
> > +       if (sk->sk_state == TCP_ESTABLISHED)
> 
> In any case this test on sk_state is too specific.
Yes, I agree as well. Do you have any idea how we can avoid the kernel panic we are running into?

> > +               WARN_ON((int)tcp_packets_in_flight(tp) < 0);
> >  #endif
> >         return state->flag;
> >  }
> > @@ -3080,7 +3081,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
> >                 return;
> >
> >         /* C. Check consistency of the current state. */
> > -       tcp_verify_left_out(tp);
> > +       if (sk->sk_state == TCP_ESTABLISHED)
> > +               tcp_verify_left_out(tp);
> >
> >         /* D. Check state exit conditions. State can be terminated
> >          *    when high_seq is ACKed. */
> > --
> > 2.39.2
> >
>
Youngmin Nam Dec. 4, 2024, 3:26 a.m. UTC | #5
On Tue, Dec 03, 2024 at 10:34:46AM -0500, Neal Cardwell wrote:
> On Tue, Dec 3, 2024 at 6:07 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > We encountered the following WARNINGs
> > > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > > which triggered a kernel panic due to panic_on_warn.
> > >
> > > case 1.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > > Call trace:
> > >  tcp_sacktag_write_queue+0xae8/0xb60
> > >  tcp_ack+0x4ec/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> > > case 2.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > > Call trace:
> > >  tcp_fastretrans_alert+0x8ac/0xa74
> > >  tcp_ack+0x904/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> >
> > I have not seen these warnings firing. Neal, have you seen this in the past ?
> 
> I can't recall seeing these warnings over the past 5 years or so, and
> (from checking our monitoring) they don't seem to be firing in our
> fleet recently.
> 
> > In any case this test on sk_state is too specific.
> 
> I agree with Eric. IMHO TCP_FIN_WAIT1 deserves all the same warnings
> as ESTABLISHED, since in this state the connection may still have a
> big queue of data it is trying to reliably send to the other side,
> with full loss recovery and congestion control logic.
Yes I agree with Eric as well.

> 
> I would suggest that instead of running with panic_on_warn it would
> make more sense to not panic on warning, and instead add more detail
> to these warning messages in your kernels during your testing, to help
> debug what is going wrong. I would suggest adding to the warning
> message:
> 
> tp->packets_out
> tp->sacked_out
> tp->lost_out
> tp->retrans_out
> tcp_is_sack(tp)
> tp->mss_cache
> inet_csk(sk)->icsk_ca_state
> inet_csk(sk)->icsk_pmtu_cookie

Hi Neal.
Thanks for your opinion.

By the way, we enable panic_on_warn by default for stability.
As you know, panic_on_warn is not applied to a specific subsystem but to the entire kernel.
We just want to avoid the kernel panic.

So, when I look at the LWN article below, I think we might use pr_warn() instead of WARN_ON():
https://lwn.net/Articles/969923/

What do you think?
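
A sketch of the shape that alternative could take (hypothetical; not part
of the posted patch):

if (unlikely(tcp_left_out(tp) > tp->packets_out))
	pr_warn_ratelimited("TCP: left_out %u > packets_out %u, state %d\n",
			    tcp_left_out(tp), tp->packets_out, sk->sk_state);

This logs the inconsistency without tripping panic_on_warn, at the cost of
hiding it from tooling that only watches for WARN splats.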
> 
> A hunch would be that this is either firing for (a) non-SACK
> connections, or (b) after an MTU reduction.
> 
> In particular, you might try `echo 0 >
> /proc/sys/net/ipv4/tcp_mtu_probing` and see if that makes the warnings
> go away.
> 
> cheers,
> neal
>
Youngmin Nam Dec. 4, 2024, 3:39 a.m. UTC | #6
On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > I have not seen these warnings firing. Neal, have you seen this in the past ?  
> > 
> > I can't recall seeing these warnings over the past 5 years or so, and
> > (from checking our monitoring) they don't seem to be firing in our
> > fleet recently.
> 
> FWIW I see this at Meta on 5.12 kernels, but nothing since.
> Could be that one of our workloads is pinned to 5.12.
> Youngmin, what's the newest kernel you can repro this on?
> 
Hi Jakub.
Thank you for taking an interest in this issue.

We've seen this issue since the 5.15 kernel.
Now we can see it on the 6.6 kernel, which is the newest kernel we are running.
Eric Dumazet Dec. 4, 2024, 7:13 a.m. UTC | #7
On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > >
> > > I can't recall seeing these warnings over the past 5 years or so, and
> > > (from checking our monitoring) they don't seem to be firing in our
> > > fleet recently.
> >
> > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > Could be that one of our workloads is pinned to 5.12.
> > Youngmin, what's the newest kernel you can repro this on?
> >
> Hi Jakub.
> Thank you for taking an interest in this issue.
>
> We've seen this issue since 5.15 kernel.
> Now, we can see this on 6.6 kernel which is the newest kernel we are running.

The fact that we are processing ACK packets after the write queue has
been purged would be a serious bug.

Thus the WARN() makes sense to us.

It would be easy to build a packetdrill test. Please do so, then we
can fix the root cause.

Thank you !
Dujeong.lee Dec. 4, 2024, 7:48 a.m. UTC | #8
On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> To: Youngmin Nam <youngmin.nam@samsung.com>
> Cc: Jakub Kicinski <kuba@kernel.org>; Neal Cardwell <ncardwell@google.com>;
> davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com;
> horms@kernel.org; dujeong.lee@samsung.com; guo88.liu@samsung.com;
> yiwang.cai@samsung.com; netdev@vger.kernel.org; linux-
> kernel@vger.kernel.org; joonki.min@samsung.com; hajun.sung@samsung.com;
> d7271.choe@samsung.com; sw.ju@samsung.com
> Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
> 
> On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com>
> wrote:
> >
> > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > I have not seen these warnings firing. Neal, have you seen this in
> the past ?
> > > >
> > > > I can't recall seeing these warnings over the past 5 years or so,
> > > > and (from checking our monitoring) they don't seem to be firing in
> > > > our fleet recently.
> > >
> > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > Could be that one of our workloads is pinned to 5.12.
> > > Youngmin, what's the newest kernel you can repro this on?
> > >
> > Hi Jakub.
> > Thank you for taking an interest in this issue.
> >
> > We've seen this issue since 5.15 kernel.
> > Now, we can see this on 6.6 kernel which is the newest kernel we are
> running.
> 
> The fact that we are processing ACK packets after the write queue has been
> purged would be a serious bug.
> 
> Thus the WARN() makes sense to us.
> 
> It would be easy to build a packetdrill test. Please do so, then we can
> fix the root cause.
> 
> Thank you !


Please let me share some more details and clarifications on the issue, based on a ramdump snapshot we secured locally.

1) This issue has been reported on the Android T Linux kernel, when we enabled panic_on_warn for the first time.
The reproduction rate is not high; it can be seen in any test case with a public internet connection.

2) Analysis from ramdump (which is not available at the moment).
2-A) From ramdump, I was able to find below values.
tp->packets_out = 0
tp->retrans_out = 1
tp->max_packets_out = 1
tp->max_packets_seq = 1575830358
tp->snd_ssthresh = 5
tp->snd_cwnd = 1
tp->prior_cwnd = 10
tp->write_seq = 1575830359
tp->pushed_seq = 1575830358
tp->lost_out = 1
tp->sacked_out = 0
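
Plugging these values into the left-out/in-flight formulas (a standalone
check; formulas paraphrased from include/net/tcp.h, mirroring the kernel's
unsigned arithmetic):

#include <stdio.h>

int main(void)
{
	/* values from the ramdump above */
	unsigned int packets_out = 0, sacked_out = 0;
	unsigned int lost_out = 1, retrans_out = 1;

	unsigned int left_out = sacked_out + lost_out;	/* tcp_left_out() */
	/* tcp_packets_in_flight(), then cast to int as the WARN_ON does */
	int in_flight = (int)(packets_out - left_out + retrans_out);

	printf("tcp_verify_left_out: left_out=%u > packets_out=%u -> %s\n",
	       left_out, packets_out, left_out > packets_out ? "WARN fires" : "ok");
	printf("tcp_sacktag_write_queue: in_flight=%d < 0 -> %s\n",
	       in_flight, in_flight < 0 ? "WARN fires" : "ok");
	return 0;
}

With this snapshot, left_out (1) exceeds packets_out (0), which matches the
tcp_fastretrans_alert() warning; in_flight still computes to 0.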

2-B) The last Tx packet from the device was the following (Time: 17371.562934):
Hex:
4500005b95a3400040063e34c0a848188efacf0a888a01bb5ded432f5ad8ab29801800495b5800000101080a3a52197fef299d901703030022f3589123b0523bdd07be137a98ca9b5d3475332d4382c7b420571e6d437a07ba7787

Internet Protocol Version 4
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
Total Length: 91
Identification: 0x95a3 (38307)
010. .... = Flags: 0x2, Don't fragment
...0 0000 0000 0000 = Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x3e34
Header checksum status: Unverified
Source Address: 192.168.72.24
Destination Address: 142.250.207.10
Stream index: 0

Transmission Control Protocol
Source Port: 34954
Destination Port: 443
Stream index: 0
Conversation completeness: Incomplete (0)
TCP Segment Len: 39
Sequence Number: 0x5ded432f
Sequence Number (raw): 1575830319
Next Sequence Number: 40
Acknowledgment Number: 0x5ad8ab29
Acknowledgment number (raw): 1524149033
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
Window: 73
Calculated window size: 73
Window size scaling factor: -1 (unknown)
Checksum: 0x5b58
Checksum Status: Unverified
Urgent Pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
Timestamps
SEQ/ACK analysis
TCP payload (39 bytes)
Transport Layer Security
TLSv1.2 Record Layer: Application Data Protocol: Hypertext Transfer Protocol

2-C) When the WARN hit, the DUT was processing the following packet (Time: 17399.502603, 28 seconds after the last Tx):
Hex:
456000405FA20000720681F08EFACF0AC0A8481801BB888A5AD8AB295DED4356B010010D93D800000101080AEF299EF43A52089F0101050A5DED432F5DED4356

Internet Protocol Version 4
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x60 (DSCP: CS3, ECN: Not-ECT)
Total Length: 64
Identification: 0x5fa2 (24482)
000. .... = Flags: 0x0
...0 0000 0000 0000 = Fragment Offset: 0
Time to Live: 114
Protocol: TCP (6)
Header Checksum: 0x81f0
Header checksum status: Unverified
Source Address: 142.250.207.10
Destination Address: 192.168.72.24
Stream index: 0

Transmission Control Protocol
Source Port: 443
Destination Port: 34954
Stream index: 0
Conversation completeness: Incomplete (0)
TCP Segment Len: 0
Sequence Number: 0x5ad8ab29
Sequence Number (raw): 1524149033
Next Sequence Number: 1
Acknowledgment Number: 0x5ded4356
Acknowledgment number (raw): 1575830358
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 269
Calculated window size: 269
Window size scaling factor: -1 (unknown)
Checksum: 0x93d8
Checksum Status: Unverified
Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps, No-Operation (NOP), No-Operation (NOP), SACK
Timestamps

2-D) The DUT received this ACK from the Access Point 28 seconds after its last transmission.

3) Clarification on the "tcp_write_queue_purge" claim:
This is just my conjecture based on the ramdump snapshot; it is not shown in the call trace.
Based on the TCP state in the snapshot, I assumed tcp_write_queue_purge() had been called and packets_out had been cleared.

4) In our kernel "/proc/sys/net/ipv4/tcp_mtu_probing" is set to 0.
Eric Dumazet Dec. 4, 2024, 8:55 a.m. UTC | #9
On Wed, Dec 4, 2024 at 4:22 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> On Tue, Dec 03, 2024 at 10:34:46AM -0500, Neal Cardwell wrote:
> > On Tue, Dec 3, 2024 at 6:07 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > We encountered the following WARNINGs
> > > > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > > > which triggered a kernel panic due to panic_on_warn.
> > > >
> > > > case 1.
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > > > Call trace:
> > > >  tcp_sacktag_write_queue+0xae8/0xb60
> > > >  tcp_ack+0x4ec/0x12b8
> > > >  tcp_rcv_state_process+0x22c/0xd38
> > > >  tcp_v4_do_rcv+0x220/0x300
> > > >  tcp_v4_rcv+0xa5c/0xbb4
> > > >  ip_protocol_deliver_rcu+0x198/0x34c
> > > >  ip_local_deliver_finish+0x94/0xc4
> > > >  ip_local_deliver+0x74/0x10c
> > > >  ip_rcv+0xa0/0x13c
> > > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > > >
> > > > case 2.
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > > > Call trace:
> > > >  tcp_fastretrans_alert+0x8ac/0xa74
> > > >  tcp_ack+0x904/0x12b8
> > > >  tcp_rcv_state_process+0x22c/0xd38
> > > >  tcp_v4_do_rcv+0x220/0x300
> > > >  tcp_v4_rcv+0xa5c/0xbb4
> > > >  ip_protocol_deliver_rcu+0x198/0x34c
> > > >  ip_local_deliver_finish+0x94/0xc4
> > > >  ip_local_deliver+0x74/0x10c
> > > >  ip_rcv+0xa0/0x13c
> > > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > > >
> > >
> > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> >
> > I can't recall seeing these warnings over the past 5 years or so, and
> > (from checking our monitoring) they don't seem to be firing in our
> > fleet recently.
> >
> > > In any case this test on sk_state is too specific.
> >
> > I agree with Eric. IMHO TCP_FIN_WAIT1 deserves all the same warnings
> > as ESTABLISHED, since in this state the connection may still have a
> > big queue of data it is trying to reliably send to the other side,
> > with full loss recovery and congestion control logic.
> Yes I agree with Eric as well.
>
> >
> > I would suggest that instead of running with panic_on_warn it would
> > make more sense to not panic on warning, and instead add more detail
> > to these warning messages in your kernels during your testing, to help
> > debug what is going wrong. I would suggest adding to the warning
> > message:
> >
> > tp->packets_out
> > tp->sacked_out
> > tp->lost_out
> > tp->retrans_out
> > tcp_is_sack(tp)
> > tp->mss_cache
> > inet_csk(sk)->icsk_ca_state
> > inet_csk(sk)->icsk_pmtu_cookie
>
> Hi Neal.
> Thanks for your opinion.
>
> By the way, we enable panic_on_warn by default for stability.
> As you know, panic_on_warn is not applied to a specific subsystem but to the entire kernel.
> We just want to avoid the kernel panic.
>
> So when I see below lwn article, I think we might use pr_warn() instaed of WARN_ON().
> https://lwn.net/Articles/969923/
>
> How do you think of it ?

You want to silence a WARN_ON() because you chose to make all WARN_ON()s fatal.

We want something that enables us to fix real bugs, because we really
care about TCP being correct.

We have these discussions all the time.
https://lwn.net/Articles/969923/ is a good summary.

It makes sense for debug kernels (for instance those used by syzkaller
or other fuzzers) to panic, but for production this is a high risk;
there is a reason panic_on_warn is not set by default.

If we use a soft print there like pr_warn(), no future bug will be caught.

What next: add a new sysctl to panic whenever a pr_warn() is hit by syzkaller?

Then Android will set this sysctl "because of stability concerns"

> >
> > A hunch would be that this is either firing for (a) non-SACK
> > connections, or (b) after an MTU reduction.
> >
> > In particular, you might try `echo 0 >
> > /proc/sys/net/ipv4/tcp_mtu_probing` and see if that makes the warnings
> > go away.
> >
> > cheers,
> > neal
> >
Eric Dumazet Dec. 4, 2024, 9:03 a.m. UTC | #10
On Wed, Dec 4, 2024 at 4:05 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> Hi Eric.
> Thanks for looking at this issue.
>
> On Tue, Dec 03, 2024 at 12:07:05PM +0100, Eric Dumazet wrote:
> > On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > We encountered the following WARNINGs
> > > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > > which triggered a kernel panic due to panic_on_warn.
> > >
> > > case 1.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > > Call trace:
> > >  tcp_sacktag_write_queue+0xae8/0xb60
> > >  tcp_ack+0x4ec/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> > > case 2.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > > Call trace:
> > >  tcp_fastretrans_alert+0x8ac/0xa74
> > >  tcp_ack+0x904/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> >
> > I have not seen these warnings firing. Neal, have you seen this in the past ?
> >
> > Please provide the kernel version (this must be a pristine LTS one).
> We are running Android kernel for Android mobile device which is based on LTS kernel 6.6-30.
> But we've seen this issue since kernel 5.15 LTS.
>
> > and symbolized stack traces using scripts/decode_stacktrace.sh
> Unfortunately, we don't have the matched vmlinux right now. So we need to rebuild and reproduce.
> >
> > If this warning was easy to trigger, please provide a packetdrill test ?
> I'm not sure if we can run packetdrill test on Android device. Anyway let me check.
>
> FYI, Here are more detailed logs.
>
> Case 1.
> [26496.422651]I[4:  napi/wlan0-33:  467] ------------[ cut here ]------------
> [26496.422665]I[4:  napi/wlan0-33:  467] WARNING: CPU: 4 PID: 467 at net/ipv4/tcp_input.c:2026 tcp_sacktag_write_queue+0xae8/0xb60
> [26496.423420]I[4:  napi/wlan0-33:  467] CPU: 4 PID: 467 Comm: napi/wlan0-33 Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20240930.125201-4k #1 a1c80b36942fa6e9575b2544032a7536ed502804
> [26496.423427]I[4:  napi/wlan0-33:  467] Hardware name: Samsung ERD9955 board based on S5E9955 (DT)
> [26496.423432]I[4:  napi/wlan0-33:  467] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [26496.423438]I[4:  napi/wlan0-33:  467] pc : tcp_sacktag_write_queue+0xae8/0xb60
> [26496.423446]I[4:  napi/wlan0-33:  467] lr : tcp_ack+0x4ec/0x12b8
> [26496.423455]I[4:  napi/wlan0-33:  467] sp : ffffffc096b8b690
> [26496.423458]I[4:  napi/wlan0-33:  467] x29: ffffffc096b8b710 x28: 0000000000008001 x27: 000000005526d635
> [26496.423469]I[4:  napi/wlan0-33:  467] x26: ffffff8a19079684 x25: 000000005526dbfd x24: 0000000000000001
> [26496.423480]I[4:  napi/wlan0-33:  467] x23: 000000000000000a x22: ffffff88e5f5b680 x21: 000000005526dbc9
> [26496.423489]I[4:  napi/wlan0-33:  467] x20: ffffff8a19078d80 x19: ffffff88e9f4193e x18: ffffffd083114c80
> [26496.423499]I[4:  napi/wlan0-33:  467] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000000
> [26496.423508]I[4:  napi/wlan0-33:  467] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
> [26496.423517]I[4:  napi/wlan0-33:  467] x11: 0000000000000000 x10: 0000000000000001 x9 : 00000000fffffffd
> [26496.423526]I[4:  napi/wlan0-33:  467] x8 : 0000000000000001 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
> [26496.423536]I[4:  napi/wlan0-33:  467] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc096b8b818
> [26496.423545]I[4:  napi/wlan0-33:  467] x2 : 000000005526d635 x1 : ffffff88e5f5b680 x0 : ffffff8a19078d80
> [26496.423555]I[4:  napi/wlan0-33:  467] Call trace:
> [26496.423558]I[4:  napi/wlan0-33:  467]  tcp_sacktag_write_queue+0xae8/0xb60
> [26496.423566]I[4:  napi/wlan0-33:  467]  tcp_ack+0x4ec/0x12b8
> [26496.423573]I[4:  napi/wlan0-33:  467]  tcp_rcv_state_process+0x22c/0xd38
> [26496.423580]I[4:  napi/wlan0-33:  467]  tcp_v4_do_rcv+0x220/0x300
> [26496.423590]I[4:  napi/wlan0-33:  467]  tcp_v4_rcv+0xa5c/0xbb4
> [26496.423596]I[4:  napi/wlan0-33:  467]  ip_protocol_deliver_rcu+0x198/0x34c
> [26496.423607]I[4:  napi/wlan0-33:  467]  ip_local_deliver_finish+0x94/0xc4
> [26496.423614]I[4:  napi/wlan0-33:  467]  ip_local_deliver+0x74/0x10c
> [26496.423620]I[4:  napi/wlan0-33:  467]  ip_rcv+0xa0/0x13c
> [26496.423625]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_core+0xe14/0x1104
> [26496.423642]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_list_core+0xb8/0x2dc
> [26496.423649]I[4:  napi/wlan0-33:  467]  netif_receive_skb_list_internal+0x234/0x320
> [26496.423655]I[4:  napi/wlan0-33:  467]  napi_complete_done+0xb4/0x1a0
> [26496.423660]I[4:  napi/wlan0-33:  467]  slsi_rx_netif_napi_poll+0x22c/0x258 [scsc_wlan 16ac2100e65b7c78ce863cecc238b39b162dbe82]
> [26496.423822]I[4:  napi/wlan0-33:  467]  __napi_poll+0x5c/0x238
> [26496.423829]I[4:  napi/wlan0-33:  467]  napi_threaded_poll+0x110/0x204
> [26496.423836]I[4:  napi/wlan0-33:  467]  kthread+0x114/0x138
> [26496.423845]I[4:  napi/wlan0-33:  467]  ret_from_fork+0x10/0x20
> [26496.423856]I[4:  napi/wlan0-33:  467] Kernel panic - not syncing: kernel: panic_on_warn set ..
>
> Case 2.
> [ 1843.463330]I[0: surfaceflinger:  648] ------------[ cut here ]------------
> [ 1843.463355]I[0: surfaceflinger:  648] WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004 tcp_fastretrans_alert+0x8ac/0xa74
> [ 1843.464508]I[0: surfaceflinger:  648] CPU: 0 PID: 648 Comm: surfaceflinger Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20241017.075836-4k #1 de751202c2c5ab3ec352a00ae470fc5e907bdcfe
> [ 1843.464520]I[0: surfaceflinger:  648] Hardware name: Samsung ERD8855 board based on S5E8855 (DT)
> [ 1843.464527]I[0: surfaceflinger:  648] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> [ 1843.464535]I[0: surfaceflinger:  648] pc : tcp_fastretrans_alert+0x8ac/0xa74
> [ 1843.464548]I[0: surfaceflinger:  648] lr : tcp_ack+0x904/0x12b8
> [ 1843.464556]I[0: surfaceflinger:  648] sp : ffffffc0800036e0
> [ 1843.464561]I[0: surfaceflinger:  648] x29: ffffffc0800036e0 x28: 0000000000008005 x27: 000000001bc05562
> [ 1843.464579]I[0: surfaceflinger:  648] x26: ffffff890418a3c4 x25: 0000000000000000 x24: 000000000000cd02
> [ 1843.464595]I[0: surfaceflinger:  648] x23: 000000001bc05562 x22: 0000000000000000 x21: ffffffc080003800
> [ 1843.464611]I[0: surfaceflinger:  648] x20: ffffffc08000378c x19: ffffff8904189ac0 x18: 0000000000000000
> [ 1843.464627]I[0: surfaceflinger:  648] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000001
> [ 1843.464642]I[0: surfaceflinger:  648] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
> [ 1843.464658]I[0: surfaceflinger:  648] x11: ffffff883e9c9540 x10: 0000000000000001 x9 : 0000000000000001
> [ 1843.464673]I[0: surfaceflinger:  648] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
> [ 1843.464689]I[0: surfaceflinger:  648] x5 : 0000000000000000 x4 : ffffffc08000378c x3 : ffffffc080003800
> [ 1843.464704]I[0: surfaceflinger:  648] x2 : 0000000000000000 x1 : 000000001bc05562 x0 : ffffff8904189ac0
> [ 1843.464720]I[0: surfaceflinger:  648] Call trace:
> [ 1843.464725]I[0: surfaceflinger:  648]  tcp_fastretrans_alert+0x8ac/0xa74
> [ 1843.464735]I[0: surfaceflinger:  648]  tcp_ack+0x904/0x12b8
> [ 1843.464743]I[0: surfaceflinger:  648]  tcp_rcv_state_process+0x22c/0xd38
> [ 1843.464751]I[0: surfaceflinger:  648]  tcp_v4_do_rcv+0x220/0x300
> [ 1843.464760]I[0: surfaceflinger:  648]  tcp_v4_rcv+0xa5c/0xbb4
> [ 1843.464767]I[0: surfaceflinger:  648]  ip_protocol_deliver_rcu+0x198/0x34c
> [ 1843.464776]I[0: surfaceflinger:  648]  ip_local_deliver_finish+0x94/0xc4
> [ 1843.464784]I[0: surfaceflinger:  648]  ip_local_deliver+0x74/0x10c
> [ 1843.464791]I[0: surfaceflinger:  648]  ip_rcv+0xa0/0x13c
> [ 1843.464799]I[0: surfaceflinger:  648]  __netif_receive_skb_core+0xe14/0x1104
> [ 1843.464810]I[0: surfaceflinger:  648]  __netif_receive_skb+0x40/0x124
> [ 1843.464818]I[0: surfaceflinger:  648]  netif_receive_skb+0x7c/0x234
> [ 1843.464825]I[0: surfaceflinger:  648]  slsi_rx_data_deliver_skb+0x1e0/0xdbc [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> [ 1843.465025]I[0: surfaceflinger:  648]  slsi_ba_process_complete+0x70/0xa4 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> [ 1843.465219]I[0: surfaceflinger:  648]  slsi_ba_aging_timeout_handler+0x324/0x354 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> [ 1843.465410]I[0: surfaceflinger:  648]  call_timer_fn+0xd0/0x360
> [ 1843.465423]I[0: surfaceflinger:  648]  __run_timers+0x1b4/0x268
> [ 1843.465432]I[0: surfaceflinger:  648]  run_timer_softirq+0x24/0x4c
> [ 1843.465440]I[0: surfaceflinger:  648]  __do_softirq+0x158/0x45c
> [ 1843.465448]I[0: surfaceflinger:  648]  ____do_softirq+0x10/0x20
> [ 1843.465455]I[0: surfaceflinger:  648]  call_on_irq_stack+0x3c/0x74
> [ 1843.465463]I[0: surfaceflinger:  648]  do_softirq_own_stack+0x1c/0x2c
> [ 1843.465470]I[0: surfaceflinger:  648]  __irq_exit_rcu+0x54/0xb4
> [ 1843.465480]I[0: surfaceflinger:  648]  irq_exit_rcu+0x10/0x1c
> [ 1843.465489]I[0: surfaceflinger:  648]  el0_interrupt+0x54/0xe0
> [ 1843.465499]I[0: surfaceflinger:  648]  __el0_irq_handler_common+0x18/0x28
> [ 1843.465508]I[0: surfaceflinger:  648]  el0t_64_irq_handler+0x10/0x1c
> [ 1843.465516]I[0: surfaceflinger:  648]  el0t_64_irq+0x1a8/0x1ac
> [ 1843.465525]I[0: surfaceflinger:  648] Kernel panic - not syncing: kernel: panic_on_warn set ...
>
> > > When we check the socket state value at the time of the issue,
> > > it was 0x4.
> > >
> > > skc_state = 0x4,
> > >
> > > This is "TCP_FIN_WAIT1" and which means the device closed its socket.
> > >
> > > enum {
> > >         TCP_ESTABLISHED = 1,
> > >         TCP_SYN_SENT,
> > >         TCP_SYN_RECV,
> > >         TCP_FIN_WAIT1,
> > >
> > > And also this means tp->packets_out was initialized as 0
> > > by tcp_write_queue_purge().
> >
> > What stack trace leads to this tcp_write_queue_purge() exactly ?
> I couldn't find the exact call stack to this.
> But I just thought that the function would be called based on ramdump snapshot.
>
> (*(struct tcp_sock *)(0xFFFFFF800467AB00)).packets_out = 0
> (*(struct inet_connection_sock *)0xFFFFFF800467AB00).icsk_backoff = 0

TCP_FIN_WAIT1 is set whenever the application does a shutdown(fd, SHUT_WR);

This means that all bytes in the send queue and retransmit queue
should be kept, and will eventually be sent.

tcp_write_queue_purge() must not be called until we receive a valid
RST packet or hit a fatal timeout.
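
A minimal userspace illustration of the sequence described above
(hypothetical helper; error handling omitted):

#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

static void half_close(int fd, const void *buf, size_t len)
{
	(void)write(fd, buf, len);	/* bytes may still sit in the retransmit queue */
	shutdown(fd, SHUT_WR);		/* kernel sends FIN; socket enters TCP_FIN_WAIT1 */
	/* from here the queued bytes must survive until ACKed; only a valid
	 * RST or a fatal timeout may legitimately purge them */
}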

6.6.30 is old; LTS 6.6.63 has some TCP changes that might be related.

$ git log --oneline v6.6.30..v6.6.63 -- net/ipv4/tcp*c
229dfdc36f31a8d47433438bc0e6e1662c4ab404 tcp: fix mptcp DSS corruption due to large pmtu xmit
2daffbd861de532172079dceef5c0f36a26eee14 tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out
718c49f840ef4e451bf44a8a62aae89ebdd5a687 tcp: new TCP_INFO stats for RTO events
04dce9a120502aea4ca66eebf501f404a22cd493 tcp: fix tcp_enter_recovery() to zero retrans_stamp when it's safe
e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f tcp: fix to allow timestamp undo if no retransmits were sent
5cce1c07bf8972d3ab94c25aa9fb6170f99082e0 tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process
4fe707a2978929b498d3730d77a6ab103881420d tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
9fd29738377c10749cb292510ebc202988ea0a31 tcp: Don't drop SYN+ACK for simultaneous connect().
c8219a27fa43a2cbf99f5176f6dddfe73e7a24ae tcp_bpf: fix return value of tcp_bpf_sendmsg()
69f397e60c3be615c32142682d62fc0b6d5d5d67 net: remove NULL-pointer net parameter in ip_metrics_convert
f0974e6bc385f0e53034af17abbb86ac0246ef1c tcp: do not export tcp_twsk_purge()
99580ae890ec8bd98b21a2a9c6668f8f1555b62e tcp: prevent concurrent execution of tcp_sk_exit_batch
7348061662c7149b81e38e545d5dd6bd427bec81 tcp/dccp: do not care about families in inet_twsk_purge()
227355ad4e4a6da5435451b3cc7f3ed9091288fa tcp: Update window clamping condition
77100f2e8412dbb84b3e7f1b947c9531cb509492 tcp_metrics: optimize tcp_metrics_flush_all()
6772c4868a8e7ad5305957cdb834ce881793acb7 net: drop bad gso csum_start and offset in virtio_net_hdr
1cfdc250b3d210acd5a4a47337b654e04693cf7f tcp: Adjust clamping window for applications specifying SO_RCVBUF
f9fef23a81db9adc1773979fabf921eba679d5e7 tcp: annotate data-races around tp->window_clamp
44aa1e461ccd1c2e49cffad5e75e1b944ec590ef tcp: fix races in tcp_v[46]_err()
bc4f9c2ccd68afec3a8395d08fd329af2022c7e8 tcp: fix race in tcp_write_err()
ecc6836d63513fb4857a7525608d7fdd0c837cb3 tcp: add tcp_done_with_error() helper
dfcdd7f89e401d2c6616be90c76c2fac3fa98fde tcp: avoid too many retransmit packets
b75f281bddebdcf363884f0d53c562366e9ead99 tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
124886cf20599024eb33608a2c3608b7abf3839b tcp: fix incorrect undo caused by DSACK of TLP retransmit
8c2debdd170e395934ac0e039748576dfde14e99 tcp_metrics: validate source addr length
8a7fc2362d6d234befde681ea4fb6c45c1789ed5 UPSTREAM: tcp: fix DSACK undo in fast recovery to call tcp_try_to_open()
b4b26d23a1e2bc188cec8592e111d68d83b9031f tcp: fix tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
fdae4d139f4778b20a40c60705c53f5f146459b5 Fix race for duplicate reqsk on identical SYN
250fad18b0c959b137ad745428fb411f1ac1bbc6 tcp: clear tp->retrans_stamp in tcp_rcv_fastopen_synack()
acdf17546ef8ee73c94e442e3f4b933e42c3dfac tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
50569d12945f86fa4b321c4b1c3005874dbaa0f1 net: tls: fix marking packets as decrypted
02261d3f9dc7d1d7be7d778f839e3404ab99034c tcp: Fix shift-out-of-bounds in dctcp_update_alpha().
00bb933578acd88395bf6e770cacdbe2d6a0be86 tcp: avoid premature drops in tcp_add_backlog()
6e48faad92be13166184d21506e4e54c79c13adc tcp: Use refcount_inc_not_zero() in tcp_twsk_unique().
f47d0d32fa94e815fdd78b8b88684873e67939f4 tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
Neal Cardwell Dec. 4, 2024, 2:21 p.m. UTC | #11
On Wed, Dec 4, 2024 at 2:48 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
>
> On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> > To: Youngmin Nam <youngmin.nam@samsung.com>
> > Cc: Jakub Kicinski <kuba@kernel.org>; Neal Cardwell <ncardwell@google.com>;
> > davem@davemloft.net; dsahern@kernel.org; pabeni@redhat.com;
> > horms@kernel.org; dujeong.lee@samsung.com; guo88.liu@samsung.com;
> > yiwang.cai@samsung.com; netdev@vger.kernel.org; linux-
> > kernel@vger.kernel.org; joonki.min@samsung.com; hajun.sung@samsung.com;
> > d7271.choe@samsung.com; sw.ju@samsung.com
> > Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
> >
> > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com>
> > wrote:
> > >
> > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > I have not seen these warnings firing. Neal, have you seen this in
> > the past ?
> > > > >
> > > > > I can't recall seeing these warnings over the past 5 years or so,
> > > > > and (from checking our monitoring) they don't seem to be firing in
> > > > > our fleet recently.
> > > >
> > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > Could be that one of our workloads is pinned to 5.12.
> > > > Youngmin, what's the newest kernel you can repro this on?
> > > >
> > > Hi Jakub.
> > > Thank you for taking an interest in this issue.
> > >
> > > We've seen this issue since 5.15 kernel.
> > > Now, we can see this on 6.6 kernel which is the newest kernel we are
> > running.
> >
> > The fact that we are processing ACK packets after the write queue has been
> > purged would be a serious bug.
> >
> > Thus the WARN() makes sense to us.
> >
> > It would be easy to build a packetdrill test. Please do so, then we can
> > fix the root cause.
> >
> > Thank you !
>
>
> Please let me share some more details and clarifications on the issue from ramdump snapshot locally secured.
>
> 1) This issue has been reported from Android-T linux kernel when we enabled panic_on_warn for the first time.
> Reproduction rate is not high and can be seen in any test cases with public internet connection.
>
> 2) Analysis from ramdump (which is not available at the moment).
> 2-A) From ramdump, I was able to find below values.
> tp->packets_out = 0
> tp->retrans_out = 1
> tp->max_packets_out = 1
> tp->max_packets_Seq = 1575830358
> tp->snd_ssthresh = 5
> tp->snd_cwnd = 1
> tp->prior_cwnd = 10
> tp->wite_seq = 1575830359
> tp->pushed_seq = 1575830358
> tp->lost_out = 1
> tp->sacked_out = 0

Thanks for all the details! If the ramdump becomes available again at
some point, would it be possible to pull out the following values as
well:

tp->mss_cache
inet_csk(sk)->icsk_pmtu_cookie
inet_csk(sk)->icsk_ca_state

Thanks,
neal
Youngmin Nam Dec. 5, 2024, 2:45 a.m. UTC | #12
On Wed, Dec 04, 2024 at 10:03:14AM +0100, Eric Dumazet wrote:
> On Wed, Dec 4, 2024 at 4:05 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > Hi Eric.
> > Thanks for looking at this issue.
> >
> > On Tue, Dec 03, 2024 at 12:07:05PM +0100, Eric Dumazet wrote:
> > > On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > We encountered the following WARNINGs
> > > > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > > > which triggered a kernel panic due to panic_on_warn.
> > > >
> > > > case 1.
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > > > Call trace:
> > > >  tcp_sacktag_write_queue+0xae8/0xb60
> > > >  tcp_ack+0x4ec/0x12b8
> > > >  tcp_rcv_state_process+0x22c/0xd38
> > > >  tcp_v4_do_rcv+0x220/0x300
> > > >  tcp_v4_rcv+0xa5c/0xbb4
> > > >  ip_protocol_deliver_rcu+0x198/0x34c
> > > >  ip_local_deliver_finish+0x94/0xc4
> > > >  ip_local_deliver+0x74/0x10c
> > > >  ip_rcv+0xa0/0x13c
> > > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > > >
> > > > case 2.
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > > > Call trace:
> > > >  tcp_fastretrans_alert+0x8ac/0xa74
> > > >  tcp_ack+0x904/0x12b8
> > > >  tcp_rcv_state_process+0x22c/0xd38
> > > >  tcp_v4_do_rcv+0x220/0x300
> > > >  tcp_v4_rcv+0xa5c/0xbb4
> > > >  ip_protocol_deliver_rcu+0x198/0x34c
> > > >  ip_local_deliver_finish+0x94/0xc4
> > > >  ip_local_deliver+0x74/0x10c
> > > >  ip_rcv+0xa0/0x13c
> > > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > > >
> > >
> > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > >
> > > Please provide the kernel version (this must be a pristine LTS one).
> > We are running Android kernel for Android mobile device which is based on LTS kernel 6.6-30.
> > But we've seen this issue since kernel 5.15 LTS.
> >
> > > and symbolized stack traces using scripts/decode_stacktrace.sh
> > Unfortunately, we don't have the matched vmlinux right now. So we need to rebuild and reproduce.
> > >
> > > If this warning was easy to trigger, please provide a packetdrill test ?
> > I'm not sure if we can run packetdrill test on Android device. Anyway let me check.
> >
> > FYI, Here are more detailed logs.
> >
> > Case 1.
> > [26496.422651]I[4:  napi/wlan0-33:  467] ------------[ cut here ]------------
> > [26496.422665]I[4:  napi/wlan0-33:  467] WARNING: CPU: 4 PID: 467 at net/ipv4/tcp_input.c:2026 tcp_sacktag_write_queue+0xae8/0xb60
> > [26496.423420]I[4:  napi/wlan0-33:  467] CPU: 4 PID: 467 Comm: napi/wlan0-33 Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20240930.125201-4k #1 a1c80b36942fa6e9575b2544032a7536ed502804
> > [26496.423427]I[4:  napi/wlan0-33:  467] Hardware name: Samsung ERD9955 board based on S5E9955 (DT)
> > [26496.423432]I[4:  napi/wlan0-33:  467] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [26496.423438]I[4:  napi/wlan0-33:  467] pc : tcp_sacktag_write_queue+0xae8/0xb60
> > [26496.423446]I[4:  napi/wlan0-33:  467] lr : tcp_ack+0x4ec/0x12b8
> > [26496.423455]I[4:  napi/wlan0-33:  467] sp : ffffffc096b8b690
> > [26496.423458]I[4:  napi/wlan0-33:  467] x29: ffffffc096b8b710 x28: 0000000000008001 x27: 000000005526d635
> > [26496.423469]I[4:  napi/wlan0-33:  467] x26: ffffff8a19079684 x25: 000000005526dbfd x24: 0000000000000001
> > [26496.423480]I[4:  napi/wlan0-33:  467] x23: 000000000000000a x22: ffffff88e5f5b680 x21: 000000005526dbc9
> > [26496.423489]I[4:  napi/wlan0-33:  467] x20: ffffff8a19078d80 x19: ffffff88e9f4193e x18: ffffffd083114c80
> > [26496.423499]I[4:  napi/wlan0-33:  467] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000000
> > [26496.423508]I[4:  napi/wlan0-33:  467] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
> > [26496.423517]I[4:  napi/wlan0-33:  467] x11: 0000000000000000 x10: 0000000000000001 x9 : 00000000fffffffd
> > [26496.423526]I[4:  napi/wlan0-33:  467] x8 : 0000000000000001 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
> > [26496.423536]I[4:  napi/wlan0-33:  467] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffc096b8b818
> > [26496.423545]I[4:  napi/wlan0-33:  467] x2 : 000000005526d635 x1 : ffffff88e5f5b680 x0 : ffffff8a19078d80
> > [26496.423555]I[4:  napi/wlan0-33:  467] Call trace:
> > [26496.423558]I[4:  napi/wlan0-33:  467]  tcp_sacktag_write_queue+0xae8/0xb60
> > [26496.423566]I[4:  napi/wlan0-33:  467]  tcp_ack+0x4ec/0x12b8
> > [26496.423573]I[4:  napi/wlan0-33:  467]  tcp_rcv_state_process+0x22c/0xd38
> > [26496.423580]I[4:  napi/wlan0-33:  467]  tcp_v4_do_rcv+0x220/0x300
> > [26496.423590]I[4:  napi/wlan0-33:  467]  tcp_v4_rcv+0xa5c/0xbb4
> > [26496.423596]I[4:  napi/wlan0-33:  467]  ip_protocol_deliver_rcu+0x198/0x34c
> > [26496.423607]I[4:  napi/wlan0-33:  467]  ip_local_deliver_finish+0x94/0xc4
> > [26496.423614]I[4:  napi/wlan0-33:  467]  ip_local_deliver+0x74/0x10c
> > [26496.423620]I[4:  napi/wlan0-33:  467]  ip_rcv+0xa0/0x13c
> > [26496.423625]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_core+0xe14/0x1104
> > [26496.423642]I[4:  napi/wlan0-33:  467]  __netif_receive_skb_list_core+0xb8/0x2dc
> > [26496.423649]I[4:  napi/wlan0-33:  467]  netif_receive_skb_list_internal+0x234/0x320
> > [26496.423655]I[4:  napi/wlan0-33:  467]  napi_complete_done+0xb4/0x1a0
> > [26496.423660]I[4:  napi/wlan0-33:  467]  slsi_rx_netif_napi_poll+0x22c/0x258 [scsc_wlan 16ac2100e65b7c78ce863cecc238b39b162dbe82]
> > [26496.423822]I[4:  napi/wlan0-33:  467]  __napi_poll+0x5c/0x238
> > [26496.423829]I[4:  napi/wlan0-33:  467]  napi_threaded_poll+0x110/0x204
> > [26496.423836]I[4:  napi/wlan0-33:  467]  kthread+0x114/0x138
> > [26496.423845]I[4:  napi/wlan0-33:  467]  ret_from_fork+0x10/0x20
> > [26496.423856]I[4:  napi/wlan0-33:  467] Kernel panic - not syncing: kernel: panic_on_warn set ..
> >
> > Case 2.
> > [ 1843.463330]I[0: surfaceflinger:  648] ------------[ cut here ]------------
> > [ 1843.463355]I[0: surfaceflinger:  648] WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004 tcp_fastretrans_alert+0x8ac/0xa74
> > [ 1843.464508]I[0: surfaceflinger:  648] CPU: 0 PID: 648 Comm: surfaceflinger Tainted: G S         OE      6.6.30-android15-8-geeceb2c9cdf1-ab20241017.075836-4k #1 de751202c2c5ab3ec352a00ae470fc5e907bdcfe
> > [ 1843.464520]I[0: surfaceflinger:  648] Hardware name: Samsung ERD8855 board based on S5E8855 (DT)
> > [ 1843.464527]I[0: surfaceflinger:  648] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> > [ 1843.464535]I[0: surfaceflinger:  648] pc : tcp_fastretrans_alert+0x8ac/0xa74
> > [ 1843.464548]I[0: surfaceflinger:  648] lr : tcp_ack+0x904/0x12b8
> > [ 1843.464556]I[0: surfaceflinger:  648] sp : ffffffc0800036e0
> > [ 1843.464561]I[0: surfaceflinger:  648] x29: ffffffc0800036e0 x28: 0000000000008005 x27: 000000001bc05562
> > [ 1843.464579]I[0: surfaceflinger:  648] x26: ffffff890418a3c4 x25: 0000000000000000 x24: 000000000000cd02
> > [ 1843.464595]I[0: surfaceflinger:  648] x23: 000000001bc05562 x22: 0000000000000000 x21: ffffffc080003800
> > [ 1843.464611]I[0: surfaceflinger:  648] x20: ffffffc08000378c x19: ffffff8904189ac0 x18: 0000000000000000
> > [ 1843.464627]I[0: surfaceflinger:  648] x17: 00000000529c6ef0 x16: 000000000000ff8b x15: 0000000000000001
> > [ 1843.464642]I[0: surfaceflinger:  648] x14: 0000000000000001 x13: 0000000000000001 x12: 0000000000000000
> > [ 1843.464658]I[0: surfaceflinger:  648] x11: ffffff883e9c9540 x10: 0000000000000001 x9 : 0000000000000001
> > [ 1843.464673]I[0: surfaceflinger:  648] x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffffd081ec0bc4
> > [ 1843.464689]I[0: surfaceflinger:  648] x5 : 0000000000000000 x4 : ffffffc08000378c x3 : ffffffc080003800
> > [ 1843.464704]I[0: surfaceflinger:  648] x2 : 0000000000000000 x1 : 000000001bc05562 x0 : ffffff8904189ac0
> > [ 1843.464720]I[0: surfaceflinger:  648] Call trace:
> > [ 1843.464725]I[0: surfaceflinger:  648]  tcp_fastretrans_alert+0x8ac/0xa74
> > [ 1843.464735]I[0: surfaceflinger:  648]  tcp_ack+0x904/0x12b8
> > [ 1843.464743]I[0: surfaceflinger:  648]  tcp_rcv_state_process+0x22c/0xd38
> > [ 1843.464751]I[0: surfaceflinger:  648]  tcp_v4_do_rcv+0x220/0x300
> > [ 1843.464760]I[0: surfaceflinger:  648]  tcp_v4_rcv+0xa5c/0xbb4
> > [ 1843.464767]I[0: surfaceflinger:  648]  ip_protocol_deliver_rcu+0x198/0x34c
> > [ 1843.464776]I[0: surfaceflinger:  648]  ip_local_deliver_finish+0x94/0xc4
> > [ 1843.464784]I[0: surfaceflinger:  648]  ip_local_deliver+0x74/0x10c
> > [ 1843.464791]I[0: surfaceflinger:  648]  ip_rcv+0xa0/0x13c
> > [ 1843.464799]I[0: surfaceflinger:  648]  __netif_receive_skb_core+0xe14/0x1104
> > [ 1843.464810]I[0: surfaceflinger:  648]  __netif_receive_skb+0x40/0x124
> > [ 1843.464818]I[0: surfaceflinger:  648]  netif_receive_skb+0x7c/0x234
> > [ 1843.464825]I[0: surfaceflinger:  648]  slsi_rx_data_deliver_skb+0x1e0/0xdbc [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> > [ 1843.465025]I[0: surfaceflinger:  648]  slsi_ba_process_complete+0x70/0xa4 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> > [ 1843.465219]I[0: surfaceflinger:  648]  slsi_ba_aging_timeout_handler+0x324/0x354 [scsc_wlan 12b378a8d5cf5e6cd833136fc49079d43751bd28]
> > [ 1843.465410]I[0: surfaceflinger:  648]  call_timer_fn+0xd0/0x360
> > [ 1843.465423]I[0: surfaceflinger:  648]  __run_timers+0x1b4/0x268
> > [ 1843.465432]I[0: surfaceflinger:  648]  run_timer_softirq+0x24/0x4c
> > [ 1843.465440]I[0: surfaceflinger:  648]  __do_softirq+0x158/0x45c
> > [ 1843.465448]I[0: surfaceflinger:  648]  ____do_softirq+0x10/0x20
> > [ 1843.465455]I[0: surfaceflinger:  648]  call_on_irq_stack+0x3c/0x74
> > [ 1843.465463]I[0: surfaceflinger:  648]  do_softirq_own_stack+0x1c/0x2c
> > [ 1843.465470]I[0: surfaceflinger:  648]  __irq_exit_rcu+0x54/0xb4
> > [ 1843.465480]I[0: surfaceflinger:  648]  irq_exit_rcu+0x10/0x1c
> > [ 1843.465489]I[0: surfaceflinger:  648]  el0_interrupt+0x54/0xe0
> > [ 1843.465499]I[0: surfaceflinger:  648]  __el0_irq_handler_common+0x18/0x28
> > [ 1843.465508]I[0: surfaceflinger:  648]  el0t_64_irq_handler+0x10/0x1c
> > [ 1843.465516]I[0: surfaceflinger:  648]  el0t_64_irq+0x1a8/0x1ac
> > [ 1843.465525]I[0: surfaceflinger:  648] Kernel panic - not syncing: kernel: panic_on_warn set ...
> >
> > > > When we check the socket state value at the time of the issue,
> > > > it was 0x4.
> > > >
> > > > skc_state = 0x4,
> > > >
> > > > This is "TCP_FIN_WAIT1" and which means the device closed its socket.
> > > >
> > > > enum {
> > > >         TCP_ESTABLISHED = 1,
> > > >         TCP_SYN_SENT,
> > > >         TCP_SYN_RECV,
> > > >         TCP_FIN_WAIT1,
> > > >
> > > > And also this means tp->packets_out was initialized as 0
> > > > by tcp_write_queue_purge().
> > >
> > > What stack trace leads to this tcp_write_queue_purge() exactly ?
> > I couldn't find the exact call stack leading to this.
> > But I assumed the function had been called, based on the ramdump snapshot.
> >
> > (*(struct tcp_sock *)(0xFFFFFF800467AB00)).packets_out = 0
> > (*(struct inet_connection_sock *)0xFFFFFF800467AB00).icsk_backoff = 0
> 
> TCP_FIN_WAIT1 is set whenever the application does a shutdown(fd, SHUT_WR);
> 
> This means that all bytes in the send queue and retransmit queue
> should be kept, and will eventually be sent.
> 
>  tcp_write_queue_purge() must not be called until we receive some
> valid RST packet or fatal timeout.
> 
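To make sure I follow, here is a minimal user-space sketch of the sequence
you describe (assuming "fd" is a connected TCP socket with data still
unacknowledged by the peer):

#include <sys/socket.h>
#include <unistd.h>

static void half_close(int fd)
{
	const char msg[] = "last bytes";

	write(fd, msg, sizeof(msg));	/* may still sit in the rtx queue */
	shutdown(fd, SHUT_WR);		/* FIN queued; socket enters FIN_WAIT1 */
	/*
	 * The kernel must keep retransmitting the queued bytes and the FIN
	 * until the peer ACKs them, or until a valid RST or a fatal timeout
	 * terminates the connection.
	 */
}
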
> 6.6.30 is old; LTS 6.6.63 has some TCP changes that might be related.
> 
> $ git log --oneline v6.6.30..v6.6.63 -- net/ipv4/tcp*c
> 229dfdc36f31a8d47433438bc0e6e1662c4ab404 tcp: fix mptcp DSS corruption
> due to large pmtu xmit
> 2daffbd861de532172079dceef5c0f36a26eee14 tcp: fix TFO SYN_RECV to not
> zero retrans_stamp with retransmits out
> 718c49f840ef4e451bf44a8a62aae89ebdd5a687 tcp: new TCP_INFO stats for RTO events
> 04dce9a120502aea4ca66eebf501f404a22cd493 tcp: fix tcp_enter_recovery()
> to zero retrans_stamp when it's safe
> e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f tcp: fix to allow timestamp
> undo if no retransmits were sent
> 5cce1c07bf8972d3ab94c25aa9fb6170f99082e0 tcp: avoid reusing FIN_WAIT2
> when trying to find port in connect() process
> 4fe707a2978929b498d3730d77a6ab103881420d tcp: process the 3rd ACK with
> sk_socket for TFO/MPTCP
> 9fd29738377c10749cb292510ebc202988ea0a31 tcp: Don't drop SYN+ACK for
> simultaneous connect().
> c8219a27fa43a2cbf99f5176f6dddfe73e7a24ae tcp_bpf: fix return value of
> tcp_bpf_sendmsg()
> 69f397e60c3be615c32142682d62fc0b6d5d5d67 net: remove NULL-pointer net
> parameter in ip_metrics_convert
> f0974e6bc385f0e53034af17abbb86ac0246ef1c tcp: do not export tcp_twsk_purge()
> 99580ae890ec8bd98b21a2a9c6668f8f1555b62e tcp: prevent concurrent
> execution of tcp_sk_exit_batch
> 7348061662c7149b81e38e545d5dd6bd427bec81 tcp/dccp: do not care about
> families in inet_twsk_purge()
> 227355ad4e4a6da5435451b3cc7f3ed9091288fa tcp: Update window clamping condition
> 77100f2e8412dbb84b3e7f1b947c9531cb509492 tcp_metrics: optimize
> tcp_metrics_flush_all()
> 6772c4868a8e7ad5305957cdb834ce881793acb7 net: drop bad gso csum_start
> and offset in virtio_net_hdr
> 1cfdc250b3d210acd5a4a47337b654e04693cf7f tcp: Adjust clamping window
> for applications specifying SO_RCVBUF
> f9fef23a81db9adc1773979fabf921eba679d5e7 tcp: annotate data-races
> around tp->window_clamp
> 44aa1e461ccd1c2e49cffad5e75e1b944ec590ef tcp: fix races in tcp_v[46]_err()
> bc4f9c2ccd68afec3a8395d08fd329af2022c7e8 tcp: fix race in tcp_write_err()
> ecc6836d63513fb4857a7525608d7fdd0c837cb3 tcp: add tcp_done_with_error() helper
> dfcdd7f89e401d2c6616be90c76c2fac3fa98fde tcp: avoid too many retransmit packets
> b75f281bddebdcf363884f0d53c562366e9ead99 tcp: use signed arithmetic in
> tcp_rtx_probe0_timed_out()
> 124886cf20599024eb33608a2c3608b7abf3839b tcp: fix incorrect undo
> caused by DSACK of TLP retransmit
> 8c2debdd170e395934ac0e039748576dfde14e99 tcp_metrics: validate source
> addr length
> 8a7fc2362d6d234befde681ea4fb6c45c1789ed5 UPSTREAM: tcp: fix DSACK undo
> in fast recovery to call tcp_try_to_open()
> b4b26d23a1e2bc188cec8592e111d68d83b9031f tcp: fix
> tcp_rcv_fastopen_synack() to enter TCP_CA_Loss for failed TFO
> fdae4d139f4778b20a40c60705c53f5f146459b5 Fix race for duplicate reqsk
> on identical SYN
> 250fad18b0c959b137ad745428fb411f1ac1bbc6 tcp: clear tp->retrans_stamp
> in tcp_rcv_fastopen_synack()
> acdf17546ef8ee73c94e442e3f4b933e42c3dfac tcp: count CLOSE-WAIT sockets
> for TCP_MIB_CURRESTAB
> 50569d12945f86fa4b321c4b1c3005874dbaa0f1 net: tls: fix marking packets
> as decrypted
> 02261d3f9dc7d1d7be7d778f839e3404ab99034c tcp: Fix shift-out-of-bounds
> in dctcp_update_alpha().
> 00bb933578acd88395bf6e770cacdbe2d6a0be86 tcp: avoid premature drops in
> tcp_add_backlog()
> 6e48faad92be13166184d21506e4e54c79c13adc tcp: Use
> refcount_inc_not_zero() in tcp_twsk_unique().
> f47d0d32fa94e815fdd78b8b88684873e67939f4 tcp: defer
> shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets
> 
Thanks for the information.
Let me try with our newest Android kernel, which is 6.6.50.
Dujeong.lee Dec. 5, 2024, 12:31 p.m. UTC | #13
On Wed, Dec 4, 2024 at 11:22 PM Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Dec 4, 2024 at 2:48 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
> > On Wed, Dec 4, 2024 at 4:14 PM Eric Dumazet wrote:
> > > To: Youngmin Nam <youngmin.nam@samsung.com>
> > > Cc: Jakub Kicinski <kuba@kernel.org>; Neal Cardwell
> > > <ncardwell@google.com>; davem@davemloft.net; dsahern@kernel.org;
> > > pabeni@redhat.com; horms@kernel.org; dujeong.lee@samsung.com;
> > > guo88.liu@samsung.com; yiwang.cai@samsung.com;
> > > netdev@vger.kernel.org; linux- kernel@vger.kernel.org;
> > > joonki.min@samsung.com; hajun.sung@samsung.com;
> > > d7271.choe@samsung.com; sw.ju@samsung.com
> > > Subject: Re: [PATCH] tcp: check socket state before calling WARN_ON
> > >
> > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam
> > > <youngmin.nam@samsung.com>
> > > wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > I have not seen these warnings firing. Neal, have you seen
> > > > > > > this in
> > > the past ?
> > > > > >
> > > > > > I can't recall seeing these warnings over the past 5 years or
> > > > > > so, and (from checking our monitoring) they don't seem to be
> > > > > > firing in our fleet recently.
> > > > >
> > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > >
> > > > Hi Jakub.
> > > > Thank you for taking an interest in this issue.
> > > >
> > > > We've seen this issue since 5.15 kernel.
> > > > Now, we can see this on 6.6 kernel which is the newest kernel we
> > > > are
> > > running.
> > >
> > > The fact that we are processing ACK packets after the write queue
> > > has been purged would be a serious bug.
> > >
> > > Thus the WARN() makes sense to us.
> > >
> > > It would be easy to build a packetdrill test. Please do so, then we
> > > can fix the root cause.
> > >
> > > Thank you !
> >
> >
> > Please let me share some more details and clarifications on the issue,
> > based on a ramdump snapshot we secured locally.
> >
> > 1) This issue has been reported on the Android-T Linux kernel since we
> > enabled panic_on_warn for the first time.
> > The reproduction rate is not high, and the issue can be seen in any test
> > case with a public internet connection.
> >
> > 2) Analysis from ramdump (which is not available at the moment).
> > 2-A) From ramdump, I was able to find below values.
> > tp->packets_out = 0
> > tp->retrans_out = 1
> > tp->max_packets_out = 1
> > tp->max_packets_seq = 1575830358
> > tp->snd_ssthresh = 5
> > tp->snd_cwnd = 1
> > tp->prior_cwnd = 10
> > tp->write_seq = 1575830359
> > tp->pushed_seq = 1575830358
> > tp->lost_out = 1
> > tp->sacked_out = 0
> 
> Thanks for all the details! If the ramdump becomes available again at some
> point, would it be possible to pull out the following values as
> well:
> 
> tp->mss_cache
> inet_csk(sk)->icsk_pmtu_cookie
> inet_csk(sk)->icsk_ca_state
> 
> Thanks,
> neal

Okay, I will check the below values once the ramdump is secured.
- tp->mss_cache
- inet_csk(sk)->icsk_pmtu_cookie
- inet_csk(sk)->icsk_ca_state

Now we are running test with the latest kernel.

Thanks
Dujeong.
Youngmin Nam Dec. 6, 2024, 5:53 a.m. UTC | #14
On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > >
> > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > (from checking our monitoring) they don't seem to be firing in our
> > > > fleet recently.
> > >
> > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > Could be that one of our workloads is pinned to 5.12.
> > > Youngmin, what's the newest kernel you can repro this on?
> > >
> > Hi Jakub.
> > Thank you for taking an interest in this issue.
> >
> > We've seen this issue since 5.15 kernel.
> > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> 
> The fact that we are processing ACK packets after the write queue has
> been purged would be a serious bug.
> 
> Thus the WARN() makes sense to us.
> 
> It would be easy to build a packetdrill test. Please do so, then we
> can fix the root cause.
> 
> Thank you !
> 

Hi Eric.

Unfortunately, we are not familiar with packetdrill.
Referring to the official website on GitHub, I tried to install it on my device.

Here is what I did on my local machine.

$ mkdir packetdrill
$ cd packetdrill
$ git clone https://github.com/google/packetdrill.git .
$ cd gtests/net/packetdrill/
$./configure
$ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc

$ adb root
$ adb push packetdrill /data/
$ adb shell

And here is what I did on my device

erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
/system/bin/sh: ./packetdrill/run_all.py: No such file or directory

I'm not sure if this procedure is correct.
Could you help us run packetdrill on an Android device?
Eric Dumazet Dec. 6, 2024, 8:35 a.m. UTC | #15
On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > >
> > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > fleet recently.
> > > >
> > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > Could be that one of our workloads is pinned to 5.12.
> > > > Youngmin, what's the newest kernel you can repro this on?
> > > >
> > > Hi Jakub.
> > > Thank you for taking an interest in this issue.
> > >
> > > We've seen this issue since 5.15 kernel.
> > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> >
> > The fact that we are processing ACK packets after the write queue has
> > been purged would be a serious bug.
> >
> > Thus the WARN() makes sense to us.
> >
> > It would be easy to build a packetdrill test. Please do so, then we
> > can fix the root cause.
> >
> > Thank you !
> >
>
> Hi Eric.
>
> Unfortunately, we are not familiar with the Packetdrill test.
> Refering to the official website on Github, I tried to install it on my device.
>
> Here is what I did on my local machine.
>
> $ mkdir packetdrill
> $ cd packetdrill
> $ git clone https://github.com/google/packetdrill.git .
> $ cd gtests/net/packetdrill/
> $./configure
> $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
>
> $ adb root
> $ adb push packetdrill /data/
> $ adb shell
>
> And here is what I did on my device
>
> erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
>
> I'm not sure if this procedure is correct.
> Could you help us run the Packetdrill on an Android device ?

packetdrill can run anywhere, for instance on your laptop, no need to
compile / install it on Android

Then you can run a single test like

# packetdrill gtests/net/tcp/sack/sack-route-refresh-ip-tos.pkt
Youngmin Nam Dec. 6, 2024, 9:01 a.m. UTC | #16
On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > >
> > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > fleet recently.
> > > > >
> > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > >
> > > > Hi Jakub.
> > > > Thank you for taking an interest in this issue.
> > > >
> > > > We've seen this issue since 5.15 kernel.
> > > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> > >
> > > The fact that we are processing ACK packets after the write queue has
> > > been purged would be a serious bug.
> > >
> > > Thus the WARN() makes sense to us.
> > >
> > > It would be easy to build a packetdrill test. Please do so, then we
> > > can fix the root cause.
> > >
> > > Thank you !
> > >
> >
> > Hi Eric.
> >
> > Unfortunately, we are not familiar with the Packetdrill test.
> > Refering to the official website on Github, I tried to install it on my device.
> >
> > Here is what I did on my local machine.
> >
> > $ mkdir packetdrill
> > $ cd packetdrill
> > $ git clone https://github.com/google/packetdrill.git .
> > $ cd gtests/net/packetdrill/
> > $./configure
> > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> >
> > $ adb root
> > $ adb push packetdrill /data/
> > $ adb shell
> >
> > And here is what I did on my device
> >
> > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> >
> > I'm not sure if this procedure is correct.
> > Could you help us run the Packetdrill on an Android device ?
> 
> packetdrill can run anywhere, for instance on your laptop, no need to
> compile / install it on Android
> 
> Then you can run single test like
> 
> # packetdrill gtests/net/tcp/sack/sack-route-refresh-ip-tos.pkt
> 

You mean, to test an Android device, we need to run packetdrill on a laptop, right?

Laptop(run packetdrill script) <--------------------------> Android device

By the way, how can we test the Android device (DUT) from packetdrill running on the laptop?
I hope you understand that I am asking this question because we are not familiar with packetdrill.
Thanks.
Eric Dumazet Dec. 6, 2024, 9:08 a.m. UTC | #17
On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > > >
> > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > > >
> > > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > > fleet recently.
> > > > > >
> > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > >
> > > > > Hi Jakub.
> > > > > Thank you for taking an interest in this issue.
> > > > >
> > > > > We've seen this issue since 5.15 kernel.
> > > > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> > > >
> > > > The fact that we are processing ACK packets after the write queue has
> > > > been purged would be a serious bug.
> > > >
> > > > Thus the WARN() makes sense to us.
> > > >
> > > > It would be easy to build a packetdrill test. Please do so, then we
> > > > can fix the root cause.
> > > >
> > > > Thank you !
> > > >
> > >
> > > Hi Eric.
> > >
> > > Unfortunately, we are not familiar with the Packetdrill test.
> > > Refering to the official website on Github, I tried to install it on my device.
> > >
> > > Here is what I did on my local machine.
> > >
> > > $ mkdir packetdrill
> > > $ cd packetdrill
> > > $ git clone https://github.com/google/packetdrill.git .
> > > $ cd gtests/net/packetdrill/
> > > $./configure
> > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > >
> > > $ adb root
> > > $ adb push packetdrill /data/
> > > $ adb shell
> > >
> > > And here is what I did on my device
> > >
> > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > >
> > > I'm not sure if this procedure is correct.
> > > Could you help us run the Packetdrill on an Android device ?
> >
> > packetdrill can run anywhere, for instance on your laptop, no need to
> > compile / install it on Android
> >
> > Then you can run single test like
> >
> > # packetdrill gtests/net/tcp/sack/sack-route-refresh-ip-tos.pkt
> >
>
> You mean, to test an Android device, we need to run packetdrill on a laptop, right?
>
> Laptop(run packetdrill script) <--------------------------> Android device
>
> By the way, how can we test the Android device (DUT) from packetdrill running on the laptop?
> I hope you understand that I am asking this question because we are not familiar with packetdrill.
> Thanks.

packetdrill does not need to run on a physical DUT; it uses a software
stack: TCP and a tun device.

You have a kernel tree, compile it and run a VM, like virtme-ng

vng -bv

We use this to run kernel selftests in which we started adding
packetdrill tests (in recent kernel tree)

./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-4pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_client.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_batch.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-win-update.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-fq-ack-per-2pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_maxfrags.pkt
./tools/testing/selftests/net/packetdrill/tcp_inq_server.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_exclusive.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_basic.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_small.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited-9-packets-out.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_oneshot.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-server.pkt
./tools/testing/selftests/net/packetdrill/tcp_inq_client.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_edge.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-client.pkt
./tools/testing/selftests/net/packetdrill/tcp_zerocopy_closed.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-1pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-idle.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-5pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-6pkt.pkt
./tools/testing/selftests/net/packetdrill/tcp_md5_md5-only-on-client-ack.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_old.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_syn_challenge_ack.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_inexact_rst.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_reuse.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_rst_invalid.pkt
./tools/testing/selftests/net/netfilter/packetdrill/conntrack_ack_loss_stall.pkt
Neal Cardwell Dec. 6, 2024, 3:34 p.m. UTC | #18
On Fri, Dec 6, 2024 at 4:08 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > > > >
> > > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > > > >
> > > > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > > > fleet recently.
> > > > > > >
> > > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > > >
> > > > > > Hi Jakub.
> > > > > > Thank you for taking an interest in this issue.
> > > > > >
> > > > > > We've seen this issue since 5.15 kernel.
> > > > > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> > > > >
> > > > > The fact that we are processing ACK packets after the write queue has
> > > > > been purged would be a serious bug.
> > > > >
> > > > > Thus the WARN() makes sense to us.
> > > > >
> > > > > It would be easy to build a packetdrill test. Please do so, then we
> > > > > can fix the root cause.
> > > > >
> > > > > Thank you !
> > > > >
> > > >
> > > > Hi Eric.
> > > >
> > > > Unfortunately, we are not familiar with the Packetdrill test.
> > > > Refering to the official website on Github, I tried to install it on my device.
> > > >
> > > > Here is what I did on my local machine.
> > > >
> > > > $ mkdir packetdrill
> > > > $ cd packetdrill
> > > > $ git clone https://github.com/google/packetdrill.git .
> > > > $ cd gtests/net/packetdrill/
> > > > $./configure
> > > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > > >
> > > > $ adb root
> > > > $ adb push packetdrill /data/
> > > > $ adb shell
> > > >
> > > > And here is what I did on my device
> > > >
> > > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > > >
> > > > I'm not sure if this procedure is correct.
> > > > Could you help us run the Packetdrill on an Android device ?

BTW, Youngmin, do you have a packet trace (e.g., tcpdump .pcap file)
of the workload that causes this warning?

If not, in order to construct a packetdrill test to reproduce this
issue, you may need to:

(1) add code to the warning to print the local and remote IP address
and port number when the warning fires (see DBGUNDO() for an example,
and the rough sketch below)

(2) take a tcpdump .pcap trace of the workload

Then you can use the {local_ip:local_port, remote_ip:remote_port} info
from (1) to find the packet trace in (2) that can be used to construct
a packetdrill test to reproduce this issue.
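
For step (1), a rough sketch of such a debug print (hypothetical and
untested, modeled on DBGUNDO()'s use of the inet fields; the exact
WARN_ON condition at net/ipv4/tcp_input.c:2026 may differ in your tree):

	/* Print the 4-tuple when the warning fires, so the flow can be
	 * matched against the tcpdump capture from step (2).
	 */
	if (WARN_ON((int)tcp_packets_in_flight(tp) < 0)) {
		const struct inet_sock *inet = inet_sk(sk);

		pr_err("TCP warn: %pI4:%u -> %pI4:%u\n",
		       &inet->inet_saddr, ntohs(inet->inet_sport),
		       &inet->inet_daddr, ntohs(inet->inet_dport));
	}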

thanks,
neal
Youngmin Nam Dec. 9, 2024, 1:32 a.m. UTC | #19
On Fri, Dec 06, 2024 at 10:08:17AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > > > >
> > > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > > > >
> > > > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > > > fleet recently.
> > > > > > >
> > > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > > >
> > > > > > Hi Jakub.
> > > > > > Thank you for taking an interest in this issue.
> > > > > >
> > > > > > We've seen this issue since 5.15 kernel.
> > > > > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> > > > >
> > > > > The fact that we are processing ACK packets after the write queue has
> > > > > been purged would be a serious bug.
> > > > >
> > > > > Thus the WARN() makes sense to us.
> > > > >
> > > > > It would be easy to build a packetdrill test. Please do so, then we
> > > > > can fix the root cause.
> > > > >
> > > > > Thank you !
> > > > >
> > > >
> > > > Hi Eric.
> > > >
> > > > Unfortunately, we are not familiar with the Packetdrill test.
> > > > Refering to the official website on Github, I tried to install it on my device.
> > > >
> > > > Here is what I did on my local machine.
> > > >
> > > > $ mkdir packetdrill
> > > > $ cd packetdrill
> > > > $ git clone https://github.com/google/packetdrill.git .
> > > > $ cd gtests/net/packetdrill/
> > > > $./configure
> > > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > > >
> > > > $ adb root
> > > > $ adb push packetdrill /data/
> > > > $ adb shell
> > > >
> > > > And here is what I did on my device
> > > >
> > > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > > >
> > > > I'm not sure if this procedure is correct.
> > > > Could you help us run the Packetdrill on an Android device ?
> > >
> > > packetdrill can run anywhere, for instance on your laptop, no need to
> > > compile / install it on Android
> > >
> > > Then you can run single test like
> > >
> > > # packetdrill gtests/net/tcp/sack/sack-route-refresh-ip-tos.pkt
> > >
> >
> > You mean, to test an Android device, we need to run packetdrill on a laptop, right?
> >
> > Laptop(run packetdrill script) <--------------------------> Android device
> >
> > By the way, how can we test the Android device (DUT) from packetdrill running on the laptop?
> > I hope you understand that I am asking this question because we are not familiar with packetdrill.
> > Thanks.
> 
> packetdrill does not need to run on a physical DUT; it uses a software
> stack: TCP and a tun device.
> 
> You have a kernel tree, compile it and run a VM, like virtme-ng
> 
> vng -bv
> 
> We use this to run kernel selftests in which we started adding
> packetdrill tests (in recent kernel tree)
> 
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-4pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_batch.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-win-update.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-fq-ack-per-2pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_maxfrags.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_inq_server.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_exclusive.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_basic.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_small.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited-9-packets-out.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_oneshot.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-server.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_inq_client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_edge.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_closed.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-1pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-idle.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-5pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-6pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_md5_md5-only-on-client-ack.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_old.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_syn_challenge_ack.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_inexact_rst.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_reuse.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_rst_invalid.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_ack_loss_stall.pkt
> 

You mean we should run our kernel in a virtual machine environment instead of on a real device?
Actually, we don't have a virtual environment for the Android kernel.
Additionally, I'm not sure whether a virtual environment for an Android device is available.
Anyway, we are going to try to reproduce this issue using our stability stress test.

Thanks.
Youngmin Nam Dec. 9, 2024, 1:52 a.m. UTC | #20
On Fri, Dec 06, 2024 at 10:34:16AM -0500, Neal Cardwell wrote:
> On Fri, Dec 6, 2024 at 4:08 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > > > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > > >
> > > > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > > > > >
> > > > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > > > I have not seen these warnings firing. Neal, have you seen this in the past ?
> > > > > > > > >
> > > > > > > > > I can't recall seeing these warnings over the past 5 years or so, and
> > > > > > > > > (from checking our monitoring) they don't seem to be firing in our
> > > > > > > > > fleet recently.
> > > > > > > >
> > > > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > > > >
> > > > > > > Hi Jakub.
> > > > > > > Thank you for taking an interest in this issue.
> > > > > > >
> > > > > > > We've seen this issue since 5.15 kernel.
> > > > > > > Now, we can see this on 6.6 kernel which is the newest kernel we are running.
> > > > > >
> > > > > > The fact that we are processing ACK packets after the write queue has
> > > > > > been purged would be a serious bug.
> > > > > >
> > > > > > Thus the WARN() makes sense to us.
> > > > > >
> > > > > > It would be easy to build a packetdrill test. Please do so, then we
> > > > > > can fix the root cause.
> > > > > >
> > > > > > Thank you !
> > > > > >
> > > > >
> > > > > Hi Eric.
> > > > >
> > > > > Unfortunately, we are not familiar with the Packetdrill test.
> > > > > Refering to the official website on Github, I tried to install it on my device.
> > > > >
> > > > > Here is what I did on my local machine.
> > > > >
> > > > > $ mkdir packetdrill
> > > > > $ cd packetdrill
> > > > > $ git clone https://github.com/google/packetdrill.git .
> > > > > $ cd gtests/net/packetdrill/
> > > > > $./configure
> > > > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > > > >
> > > > > $ adb root
> > > > > $ adb push packetdrill /data/
> > > > > $ adb shell
> > > > >
> > > > > And here is what I did on my device
> > > > >
> > > > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > > > >
> > > > > I'm not sure if this procedure is correct.
> > > > > Could you help us run the Packetdrill on an Android device ?
> 
> BTW, Youngmin, do you have a packet trace (e.g., tcpdump .pcap file)
> of the workload that causes this warning?
> 
> If not, in order to construct a packetdrill test to reproduce this
> issue, you may need to:
> 
> (1) add code to the warning to print the local and remote IP address
> and port number when the warning fires (see DBGUNDO() for an example)
> 
> (2) take a tcpdump .pcap trace of the workload
> 
> Then you can use the {local_ip:local_port, remote_ip:remote_port} info
> from (1) to find the packet trace in (2) that can be used to construct
> a packetdrill test to reproduce this issue.
> 
> thanks,
> neal
> 

(Neal, please ignore my previous email as I missed adding the CC list.)

Thank you for your detailed and considerate information.

We are currently trying to reproduce this issue using our stability stress test and
aiming to capture the tcpdump output.

Thanks.
Dujeong.lee Dec. 9, 2024, 10:16 a.m. UTC | #21
On Fri, Dec 06, 2024 at 10:08:17AM +0100, Eric Dumazet wrote:
> On Fri, Dec 6, 2024 at 9:58 AM Youngmin Nam <youngmin.nam@samsung.com>
> wrote:
> >
> > On Fri, Dec 06, 2024 at 09:35:32AM +0100, Eric Dumazet wrote:
> > > On Fri, Dec 6, 2024 at 6:50 AM Youngmin Nam <youngmin.nam@samsung.com>
> wrote:
> > > >
> > > > On Wed, Dec 04, 2024 at 08:13:33AM +0100, Eric Dumazet wrote:
> > > > > On Wed, Dec 4, 2024 at 4:35 AM Youngmin Nam
> <youngmin.nam@samsung.com> wrote:
> > > > > >
> > > > > > On Tue, Dec 03, 2024 at 06:18:39PM -0800, Jakub Kicinski wrote:
> > > > > > > On Tue, 3 Dec 2024 10:34:46 -0500 Neal Cardwell wrote:
> > > > > > > > > I have not seen these warnings firing. Neal, have you seen
> this in the past ?
> > > > > > > >
> > > > > > > > I can't recall seeing these warnings over the past 5 years
> > > > > > > > or so, and (from checking our monitoring) they don't seem
> > > > > > > > to be firing in our fleet recently.
> > > > > > >
> > > > > > > FWIW I see this at Meta on 5.12 kernels, but nothing since.
> > > > > > > Could be that one of our workloads is pinned to 5.12.
> > > > > > > Youngmin, what's the newest kernel you can repro this on?
> > > > > > >
> > > > > > Hi Jakub.
> > > > > > Thank you for taking an interest in this issue.
> > > > > >
> > > > > > We've seen this issue since 5.15 kernel.
> > > > > > Now, we can see this on 6.6 kernel which is the newest kernel we
> are running.
> > > > >
> > > > > The fact that we are processing ACK packets after the write
> > > > > queue has been purged would be a serious bug.
> > > > >
> > > > > Thus the WARN() makes sense to us.
> > > > >
> > > > > It would be easy to build a packetdrill test. Please do so, then
> > > > > we can fix the root cause.
> > > > >
> > > > > Thank you !
> > > > >
> > > >
> > > > Hi Eric.
> > > >
> > > > Unfortunately, we are not familiar with the Packetdrill test.
> > > > Refering to the official website on Github, I tried to install it on
> my device.
> > > >
> > > > Here is what I did on my local machine.
> > > >
> > > > $ mkdir packetdrill
> > > > $ cd packetdrill
> > > > $ git clone https://github.com/google/packetdrill.git .
> > > > $ cd gtests/net/packetdrill/
> > > > $./configure
> > > > $ make CC=/home/youngmin/Downloads/arm-gnu-toolchain-13.3.rel1-x86_64-aarch64-none-linux-gnu/bin/aarch64-none-linux-gnu-gcc
> > > >
> > > > $ adb root
> > > > $ adb push packetdrill /data/
> > > > $ adb shell
> > > >
> > > > And here is what I did on my device
> > > >
> > > > erd9955:/data/packetdrill/gtests/net # ./packetdrill/run_all.py -S -v -L -l tcp/
> > > > /system/bin/sh: ./packetdrill/run_all.py: No such file or directory
> > > >
> > > > I'm not sure if this procedure is correct.
> > > > Could you help us run the Packetdrill on an Android device ?
> > >
> > > packetdrill can run anywhere, for instance on your laptop, no need
> > > to compile / install it on Android
> > >
> > > Then you can run single test like
> > >
> > > # packetdrill gtests/net/tcp/sack/sack-route-refresh-ip-tos.pkt
> > >
> >
> > You mean, to test an Android device, we need to run packetdrill on a
> > laptop, right?
> >
> > Laptop(run packetdrill script) <--------------------------> Android
> > device
> >
> > By the way, how can we test the Android device (DUT) from packetdrill
> > running on the laptop?
> > I hope you understand that I am asking this question because we are not
> > familiar with packetdrill.
> > Thanks.
> 
> packetdrill does not need to run on a physical DUT; it uses a software
> stack: TCP and a tun device.
> 
> You have a kernel tree, compile it and run a VM, like virtme-ng
> 
> vng -bv
> 
> We use this to run kernel selftests in which we started adding packetdrill
> tests (in recent kernel tree)
> 
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-4pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_batch.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-win-update.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-fq-ack-per-2pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_maxfrags.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_inq_server.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_exclusive.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_basic.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_small.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited-9-packets-out.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_oneshot.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-server.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_inq_client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_epoll_edge.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-app-limited.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_fastopen-client.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_zerocopy_closed.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-1pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-after-idle.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-5pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_slow_start_slow-start-ack-per-2pkt-send-6pkt.pkt
> ./tools/testing/selftests/net/packetdrill/tcp_md5_md5-only-on-client-ack.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_old.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_syn_challenge_ack.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_inexact_rst.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_synack_reuse.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_rst_invalid.pkt
> ./tools/testing/selftests/net/netfilter/packetdrill/conntrack_ack_loss_stall.pkt

Thanks for all the details on packetdrill; we are also exploring the USENIX 2013 material.
I have one question. The issue happens when the DUT receives a TCP ACK from the network with a large delay, e.g., 28 seconds since the last Tx. Is packetdrill able to emulate this network delay (or congestion) at the script level?
Eric Dumazet Dec. 9, 2024, 10:20 a.m. UTC | #22
On Mon, Dec 9, 2024 at 11:16 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
>

> Thanks for all the details on packetdrill; we are also exploring the USENIX 2013 material.
> I have one question. The issue happens when the DUT receives a TCP ACK from the network with a large delay, e.g., 28 seconds since the last Tx. Is packetdrill able to emulate this network delay (or congestion) at the script level?

Yes, the packetdrill scripts can wait an arbitrary amount of time
between each event

+28 <next event>

28 seconds seems okay. If the issue was triggered after 4 days,
packetdrill would be impractical ;)
Dujeong.lee Dec. 10, 2024, 3:38 a.m. UTC | #23
On Mon, Dec 9, 2024 at 7:21 PM Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Dec 9, 2024 at 11:16 AM Dujeong.lee <dujeong.lee@samsung.com>
> wrote:
> >
> 
> > Thanks for all the details on packetdrill; we are also exploring the
> > USENIX 2013 material.
> > I have one question. The issue happens when the DUT receives a TCP ACK
> > from the network with a large delay, e.g., 28 seconds since the last Tx.
> > Is packetdrill able to emulate this network delay (or congestion) at the
> > script level?
> 
> Yes, the packetdrill scripts can wait an arbitrary amount of time between
> each event
> 
> +28 <next event>
> 
> 28 seconds seems okay. If the issue was triggered after 4 days,
> packetdrill would be impractical ;)

Hi all,

We secured a new ramdump.
Please find below the relevant values, along with the TCP header details.

tp->packets_out = 0
tp->sacked_out = 0
tp->lost_out = 1
tp->retrans_out = 1
tp->rx_opt.sack_ok = 5 (tcp_is_sack(tp))
tp->mss_cache = 1400
((struct inet_connection_sock *)sk)->icsk_ca_state = 4
((struct inet_connection_sock *)sk)->icsk_pmtu_cookie = 1500

Hex from ip header:
45 00 00 40 75 40 00 00 39 06 91 13 8E FB 2A CA C0 A8 00 F7 01 BB A7 CC 51 F8 63 CC 52 59 6D A6 B0 10 04 04 77 76 00 00 01 01 08 0A 89 72 C8 42 62 F5 F5 D1 01 01 05 0A 52 59 6D A5 52 59 6D A6

Transmission Control Protocol
Source Port: 443
Destination Port: 42956
TCP Segment Len: 0
Sequence Number (raw): 1375232972
Acknowledgment number (raw): 1381592486
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 1028
Calculated window size: 1028
Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps, No-Operation (NOP), No-Operation (NOP), SACK

If anyone wants to check other values, please feel free to ask me.

Thanks,
Dujeong.
Dujeong.lee Dec. 10, 2024, 7:10 a.m. UTC | #24
On Tue, Dec 10, 2024 at 12:39 PM Dujeong Lee wrote:
> On Mon, Dec 9, 2024 at 7:21 PM Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Dec 9, 2024 at 11:16 AM Dujeong.lee <dujeong.lee@samsung.com>
> > wrote:
> > >
> >
> > > Thanks for all the details on packetdrill; we are also exploring the
> > > USENIX 2013 material.
> > > I have one question. The issue happens when the DUT receives a TCP ACK
> > > from the network with a large delay, e.g., 28 seconds since the last Tx.
> > > Is packetdrill able to emulate this network delay (or congestion) at the
> > > script level?
> >
> > Yes, the packetdrill scripts can wait an arbitrary amount of time
> > between each event
> >
> > +28 <next event>
> >
> > 28 seconds seems okay. If the issue was triggered after 4 days,
> > packetdrill would be impractical ;)
> 
> Hi all,
> 
> We secured a new ramdump.
> Please find below the relevant values, along with the TCP header details.
> 
> tp->packets_out = 0
> tp->sacked_out = 0
> tp->lost_out = 1
> tp->retrans_out = 1
> tp->rx_opt.sack_ok = 5 (tcp_is_sack(tp))
> tp->mss_cache = 1400
> ((struct inet_connection_sock *)sk)->icsk_ca_state = 4
> ((struct inet_connection_sock *)sk)->icsk_pmtu_cookie = 1500
> 
> Hex from ip header:
> 45 00 00 40 75 40 00 00 39 06 91 13 8E FB 2A CA C0 A8 00 F7 01 BB A7 CC 51
> F8 63 CC 52 59 6D A6 B0 10 04 04 77 76 00 00 01 01 08 0A 89 72 C8 42 62 F5
> F5 D1 01 01 05 0A 52 59 6D A5 52 59 6D A6
> 
> Transmission Control Protocol
> Source Port: 443
> Destination Port: 42956
> TCP Segment Len: 0
> Sequence Number (raw): 1375232972
> Acknowledgment number (raw): 1381592486
> 1011 .... = Header Length: 44 bytes (11)
> Flags: 0x010 (ACK)
> Window: 1028
> Calculated window size: 1028
> Urgent Pointer: 0
> Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps,
> No-Operation (NOP), No-Operation (NOP), SACK
> 
> If anyone wants to check other values, please feel free to ask me
> 
> Thanks,
> Dujeong.

I have a question.

From the latest ramdump I could see that
1) tcp_sk(sk)->packets_out = 0
2) inet_csk(sk)->icsk_backoff = 0
3) sk_write_queue.len = 0
which suggests that tcp_write_queue_purge was indeed called.

Noting that:
1) tcp_write_queue_purge() resets packets_out to 0,
and
2) in_flight should be non-negative, where in_flight = packets_out - left_out + retrans_out (and left_out = sacked_out + lost_out),
what if we reset left_out and retrans_out as well in tcp_write_queue_purge(), as sketched below?

Do we see any potential issue with this?
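
To make the idea concrete, here is an untested sketch (assuming the 6.6
shape of tcp_write_queue_purge() in net/ipv4/tcp.c; the elided part of the
body is unchanged):

void tcp_write_queue_purge(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* ... existing purge of sk_write_queue and the rtx queue ... */

	tp->packets_out = 0;
	/* proposed: also clear the counters that feed
	 * tcp_packets_in_flight(), i.e.
	 * packets_out - (sacked_out + lost_out) + retrans_out,
	 * so in_flight cannot go negative after a purge
	 */
	tp->sacked_out = 0;
	tp->lost_out = 0;
	tp->retrans_out = 0;
	inet_csk(sk)->icsk_backoff = 0;
}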
Youngmin Nam Dec. 13, 2024, 7:14 a.m. UTC | #25
On Wed, Dec 04, 2024 at 12:08:59PM +0900, Youngmin Nam wrote:
> Hi Eric.
> Thanks for looking at this issue.
> 
> On Tue, Dec 03, 2024 at 12:07:05PM +0100, Eric Dumazet wrote:
> > On Tue, Dec 3, 2024 at 9:10 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > We encountered the following WARNINGs
> > > in tcp_sacktag_write_queue()/tcp_fastretrans_alert()
> > > which triggered a kernel panic due to panic_on_warn.
> > >
> > > case 1.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 4 PID: 453 at net/ipv4/tcp_input.c:2026
> > > Call trace:
> > >  tcp_sacktag_write_queue+0xae8/0xb60
> > >  tcp_ack+0x4ec/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> > > case 2.
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 648 at net/ipv4/tcp_input.c:3004
> > > Call trace:
> > >  tcp_fastretrans_alert+0x8ac/0xa74
> > >  tcp_ack+0x904/0x12b8
> > >  tcp_rcv_state_process+0x22c/0xd38
> > >  tcp_v4_do_rcv+0x220/0x300
> > >  tcp_v4_rcv+0xa5c/0xbb4
> > >  ip_protocol_deliver_rcu+0x198/0x34c
> > >  ip_local_deliver_finish+0x94/0xc4
> > >  ip_local_deliver+0x74/0x10c
> > >  ip_rcv+0xa0/0x13c
> > > Kernel panic - not syncing: kernel: panic_on_warn set ...
> > >
> > 
> > I have not seen these warnings firing. Neal, have you seen this in the past?
> > 
> > Please provide the kernel version (this must be a pristine LTS one).
> We are running an Android kernel for an Android mobile device, which is based on LTS kernel 6.6-30.
> But we've seen this issue since kernel 5.15 LTS.
> 
> > and symbolized stack traces using scripts/decode_stacktrace.sh
> Unfortunately, we don't have the matched vmlinux right now. So we need to rebuild and reproduce.

Hi Eric.

We successfully reproduced this issue.
Here is the symbolized stack trace.

* Case 1
WARNING: CPU: 2 PID: 509 at net/ipv4/tcp_input.c:2026 tcp_sacktag_write_queue+0xae8/0xb60

panic+0x180                        mov w0, wzr (kernel/panic.c:369)
__warn+0x1d4                       adrp x0, #0xffffffd08256b000 <f_midi_longname+48857> (kernel/panic.c:240)
report_bug+0x174                   mov w19, #1 (lib/bug.c:201)
bug_handler+0x24                   cmp w0, #1 (arch/arm64/kernel/traps.c:1032)
brk_handler+0x94                   cbz w0, #0xffffffd081015eac <brk_handler+220> (arch/arm64/kernel/debug-monitors.c:330)
do_debug_exception+0xa4            cbz w0, #0xffffffd08103afe8 <do_debug_exception+200> (arch/arm64/mm/fault.c:965)
el1_dbg+0x58                       bl #0xffffffd08203994c <arm64_exit_el1_dbg> (arch/arm64/kernel/entry-common.c:443)
el1h_64_sync_handler+0x3c          b #0xffffffd082038884 <el1h_64_sync_handler+120> (arch/arm64/kernel/entry-common.c:482)
el1h_64_sync+0x68                  b #0xffffffd081012150 <ret_to_kernel> (arch/arm64/kernel/entry.S:594)
tcp_sacktag_write_queue+0xae8      brk #0x800 (net/ipv4/tcp_input.c:2029)
tcp_ack+0x494                      orr w21, w0, w21 (net/ipv4/tcp_input.c:3914)
tcp_rcv_state_process+0x224        ldrb w8, [x19, #0x12] (net/ipv4/tcp_input.c:6635)
tcp_v4_do_rcv+0x1ec                cbz w0, #0xffffffd081eb0628 <tcp_v4_do_rcv+520> (net/ipv4/tcp_ipv4.c:1757)
tcp_v4_rcv+0x984                   mov x0, x20 (include/linux/spinlock.h:391)
ip_protocol_deliver_rcu+0x194      tbz w0, #0x1f, #0xffffffd081e7cd00 <ip_protocol_deliver_rcu+496> (net/ipv4/ip_input.c:207)
ip_local_deliver+0xe4              bl #0xffffffd081166910 <__rcu_read_unlock> (include/linux/rcupdate.h:818)
ip_rcv+0x90                        mov w21, w0 (include/net/dst.h:468)
__netif_receive_skb_core+0xdc4     mov x23, x27 (net/core/dev.c:2241)
__netif_receive_skb_list_core+0xb8  ldr x26, [sp, #8] (net/core/dev.c:5648)
netif_receive_skb_list_inter..+0x228  tbz w21, #0, #0xffffffd081d819dc <netif_receive_skb_list_internal+576> (net/core/dev.c:5716)
napi_complete_done+0xb4            str x22, [x19, #0x108] (include/linux/list.h:37)
slsi_rx_netif_napi_poll+0x22c      mov w0, w20 (../exynos/soc-series/s-android15/drivers/net/wireless/pcie_scsc/netif.c:1722)
__napi_poll+0x5c                   mov w19, w0 (net/core/dev.c:6575)
napi_threaded_poll+0x110           strb wzr, [x28, #0x39] (net/core/dev.c:6721)
kthread+0x114                      sxtw x0, w0 (kernel/kthread.c:390)
ret_from_fork+0x10                 mrs x28, sp_el0 (arch/arm64/kernel/entry.S:862)

* Case 2
WARNING: CPU: 7 PID: 2099 at net/ipv4/tcp_input.c:3030 tcp_fastretrans_alert+0x860/0x910

panic+0x180                        mov w0, wzr (kernel/panic.c:369)
__warn+0x1d4                       adrp x0, #0xffffffd08256b000 <f_midi_longname+48857> (kernel/panic.c:240)
report_bug+0x174                   mov w19, #1 (lib/bug.c:201)
bug_handler+0x24                   cmp w0, #1 (arch/arm64/kernel/traps.c:1032)
brk_handler+0x94                   cbz w0, #0xffffffd081015eac <brk_handler+220> (arch/arm64/kernel/debug-monitors.c:330)
do_debug_exception+0xa4            cbz w0, #0xffffffd08103afe8 <do_debug_exception+200> (arch/arm64/mm/fault.c:965)
el1_dbg+0x58                       bl #0xffffffd08203994c <arm64_exit_el1_dbg> (arch/arm64/kernel/entry-common.c:443)
el1h_64_sync_handler+0x3c          b #0xffffffd082038884 <el1h_64_sync_handler+120> (arch/arm64/kernel/entry-common.c:482)
el1h_64_sync+0x68                  b #0xffffffd081012150 <ret_to_kernel> (arch/arm64/kernel/entry.S:594)
tcp_fastretrans_alert+0x860        brk #0x800 (net/ipv4/tcp_input.c:2723)
tcp_ack+0x8a4                      ldur w21, [x29, #-0x20] (net/ipv4/tcp_input.c:3991)
tcp_rcv_state_process+0x224        ldrb w8, [x19, #0x12] (net/ipv4/tcp_input.c:6635)
tcp_v4_do_rcv+0x1ec                cbz w0, #0xffffffd081eb0628 <tcp_v4_do_rcv+520> (net/ipv4/tcp_ipv4.c:1757)
tcp_v4_rcv+0x984                   mov x0, x20 (include/linux/spinlock.h:391)
ip_protocol_deliver_rcu+0x194      tbz w0, #0x1f, #0xffffffd081e7cd00 <ip_protocol_deliver_rcu+496> (net/ipv4/ip_input.c:207)
ip_local_deliver+0xe4              bl #0xffffffd081166910 <__rcu_read_unlock> (include/linux/rcupdate.h:818)
ip_rcv+0x90                        mov w21, w0 (include/net/dst.h:468)
__netif_receive_skb_core+0xdc4     mov x23, x27 (net/core/dev.c:2241)
__netif_receive_skb+0x40           ldr x2, [sp, #8] (net/core/dev.c:5570)
netif_receive_skb+0x3c             mov w19, w0 (net/core/dev.c:5771)
slsi_rx_data_deliver_skb+0xbe0     cmp w0, #1 (../exynos/soc-series/s-android15/drivers/net/wireless/pcie_scsc/sap_ma.c:1104)
slsi_ba_process_complete+0x70      mov x0, x21 (include/linux/spinlock.h:356)
slsi_ba_aging_timeout_handler+0x324  mov x0, x21 (include/linux/spinlock.h:396)
call_timer_fn+0x4c                 nop (arch/arm64/include/asm/jump_label.h:22)
__run_timers+0x1c4                 mov x0, x19 (kernel/time/timer.c:1755)
run_timer_softirq+0x24             mov w9, #0x1280 (kernel/time/timer.c:2038)
handle_softirqs+0x124              nop (arch/arm64/include/asm/jump_label.h:22)
__do_softirq+0x14                  ldp x29, x30, [sp], #0x10 (kernel/softirq.c:634)
____do_softirq+0x10                ldp x29, x30, [sp], #0x10 (arch/arm64/kernel/irq.c:82)
call_on_irq_stack+0x3c             mov sp, x29 (arch/arm64/kernel/entry.S:896)
do_softirq_own_stack+0x1c          ldp x29, x30, [sp], #0x10 (arch/arm64/kernel/irq.c:87)
__irq_exit_rcu+0x54                adrp x9, #0xffffffd083064000 <this_cpu_vector> (kernel/softirq.c:662)
irq_exit_rcu+0x10                  ldp x29, x30, [sp], #0x10 (kernel/softirq.c:697)
el0_interrupt+0x54                 bl #0xffffffd0810197b4 <local_daif_mask> (arch/arm64/kernel/entry-common.c:136)
__el0_irq_handler_common+0x18      ldp x29, x30, [sp], #0x10 (arch/arm64/kernel/entry-common.c:774)
el0t_64_irq_handler+0x10           ldp x29, x30, [sp], #0x10 (arch/arm64/kernel/entry-common.c:779)
el0t_64_irq+0x1a8                  b #0xffffffd0810121b8 <ret_to_user> (arch/arm64/kernel/entry.S:600)
Dujeong.lee Dec. 18, 2024, 10:18 a.m. UTC | #26
On Tue, December 10, 2024 at 4:10 PM Dujeong Lee wrote:
> [...]
> I have a question.
> 
> From the latest ramdump I could see that
> 1) tcp_sk(sk)->packets_out = 0
> 2) inet_csk(sk)->icsk_backoff = 0
> 3) sk_write_queue.len = 0
> which suggests that tcp_write_queue_purge was indeed called.
> 
> Noting that:
> 1) tcp_write_queue_purge resets packets_out to 0 and
> 2) in_flight should be non-negative where in_flight = packets_out -
> left_out + retrans_out, what if we reset left_out and retrans_out as well
> in tcp_write_queue_purge?
> 
> Do we see any potential issue with this?

Hello Eric and Neal.

It is a gentle reminder.
Could you please review the latest ramdump values and the question?

Thanks,
Dujeong.
Eric Dumazet Dec. 18, 2024, 10:27 a.m. UTC | #27
On Wed, Dec 18, 2024 at 11:18 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
>
> > [...]
> > I have a question.
> >
> > From the latest ramdump I could see that
> > 1) tcp_sk(sk)->packets_out = 0
> > 2) inet_csk(sk)->icsk_backoff = 0
> > 3) sk_write_queue.len = 0
> > which suggests that tcp_write_queue_purge was indeed called.
> >
> > Noting that:
> > 1) tcp_write_queue_purge resets packets_out to 0 and
> > 2) in_flight should be non-negative where in_flight = packets_out -
> > left_out + retrans_out, what if we reset left_out and retrans_out as well
> > in tcp_write_queue_purge?
> >
> > Do we see any potential issue with this?
>
> Hello Eric and Neal.
>
> It is a gentle reminder.
> > Could you please review the latest ramdump values and the question?

It will have to wait until next year, Neal is OOO.

I asked for a packetdrill reproducer; I cannot spend days working on an
issue that does not trigger on our production hosts.

Something could be wrong in your trees, or perhaps some eBPF program
changing the state of the socket...
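
If you want to rule out the eBPF angle, one quick check (assuming bpftool
is available on the device; adjust the cgroup root to your setup):

  bpftool prog list                   # all loaded BPF programs
  bpftool cgroup tree /sys/fs/cgroup  # cgroup-attached programs (e.g. sock_ops) that can touch TCP sockets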
Dujeong.lee Dec. 30, 2024, 12:23 a.m. UTC | #28
On Wed, Dec 18, 2024 7:28 PM Eric Dumazet <edumazet@google.com> wrote:

> [...]
> 
> It will have to wait until next year, Neal is OOO.
> 
> I asked for a packetdrill reproducer; I cannot spend days working on an
> issue that does not trigger on our production hosts.
> 
> Something could be wrong in your trees, or perhaps some eBPF program
> changing the state of the socket...

Hi Eric

I tried to make a packetdrill script for local mode, which injects delayed ACKs for the data and FIN after close.

// Test basic connection teardown where local process closes first:
// the local process calls close() first, so we send a FIN.
// Then we receive a delayed ACK for data and FIN.
// Then we receive a FIN and ACK it.

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                               // Create socket
   +.01...0.011 connect(3, ..., ...) = 0                                      // Initiate connection
   +0 >  S 0:0(0) <...>                                                       // Send SYN
   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>      // Receive SYN-ACK with TCP options
   +0 >  . 1:1(0) ack 1                                                       // Send ACK

   +0 write(3, ..., 1000) = 1000                                              // Write 1000 bytes
   +0 >  P. 1:1001(1000) ack 1                                                // Send data with PSH flag

   +0 close(3) = 0                                                            // Local side initiates close
   +0 >  F. 1001:1001(0) ack 1                                                // Send FIN
   +1 < . 1:1(0) ack 1001 win 257                                              // Receive ACK for data
   +0 < . 1:1(0) ack 1002 win 257                                             // Receive ACK for FIN

   +0 < F. 1:1(0) ack 1002 win 257                                            // Receive FIN from remote
   +0 >  . 1002:1002(0) ack 2                                                 // Send ACK for FIN


But I got the below error when I ran the script.

$ sudo ./packetdrill ../tcp/close/close-half-delayed-ack.pkt
../tcp/close/close-half-delayed-ack.pkt:22: error handling packet: live packet field tcp_fin: expected: 0 (0x0) vs actual: 1 (0x1)
script packet:  1.010997 . 1002:1002(0) ack 2
actual packet:  0.014840 F. 1001:1001(0) ack 1 win 256

From the log, it looks like there is a difference between the script and the actual results.
Could you help (provide any guidance) to check why this error happens?

Thanks,
Dujeong.
Eric Dumazet Dec. 30, 2024, 9:33 a.m. UTC | #29
On Mon, Dec 30, 2024 at 1:24 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
>
> [...]
> But I got the below error when I ran the script.
>
> $ sudo ./packetdrill ../tcp/close/close-half-delayed-ack.pkt
> ../tcp/close/close-half-delayed-ack.pkt:22: error handling packet: live packet field tcp_fin: expected: 0 (0x0) vs actual: 1 (0x1)
> script packet:  1.010997 . 1002:1002(0) ack 2
> actual packet:  0.014840 F. 1001:1001(0) ack 1 win 256

This means the FIN was retransmitted earlier.
Then the data segment was probably also retransmitted.

You can use "tcpdump -i any &" while developing your script.
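
For example (one possible invocation; port 8080 matches the captures later
in this thread):

  tcpdump -i any -n -ttt 'tcp port 8080' &

-n disables name resolution and -ttt prints the time delta between packets,
which makes unexpected retransmissions easy to spot.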

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                            // Create socket
   +.01...0.111 connect(3, ..., ...) = 0                                   // Initiate connection
   +0 >  S 0:0(0) <...>                                                    // Send SYN
  +.1 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>   // Receive SYN-ACK with TCP options
   +0 >  . 1:1(0) ack 1                                                    // Send ACK

   +0 write(3, ..., 1000) = 1000                                           // Write 1000 bytes
   +0 >  P. 1:1001(1000) ack 1                                             // Send data with PSH flag

   +0 close(3) = 0                                                         // Local side initiates close
   +0 >  F. 1001:1001(0) ack 1                                             // Send FIN
  +.2 >  F. 1001:1001(0) ack 1                                             // FIN retransmit
+.2~+.4 >  P. 1:1001(1000) ack 1                                           // RTX

   +0 < . 1:1(0) ack 1001 win 257                                          // Receive ACK for data
   +0 >  F. 1001:1001(0) ack 1                                             // FIN retransmit
   +0 < . 1:1(0) ack 1002 win 257                                          // Receive ACK for FIN

   +0 < F. 1:1(0) ack 1002 win 257                                         // Receive FIN from remote
   +0 >  . 1002:1002(0) ack 2                                              // Send ACK for FIN
Dujeong.lee Jan. 2, 2025, 12:22 a.m. UTC | #30
On Mon, Dec 30, 2024 at 6:34 PM Eric Dumazet <edumazet@google.com>
wrote:
>
> > [...]
> > But I got the below error when I ran the script.
> >
> > $ sudo ./packetdrill ../tcp/close/close-half-delayed-ack.pkt
> > ../tcp/close/close-half-delayed-ack.pkt:22: error handling packet: live packet field tcp_fin: expected: 0 (0x0) vs actual: 1 (0x1)
> > script packet:  1.010997 . 1002:1002(0) ack 2
> > actual packet:  0.014840 F. 1001:1001(0) ack 1 win 256
> 
> This means the FIN was retransmitted earlier.
> Then the data segment was probably also retransmitted.
> 
> [...]

Hi Eric,

I modified the script and inlined the tcpdump capture:

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                               // Create socket
   +.01...0.011 connect(3, ..., ...) = 0                                      // Initiate connection
   +0 >  S 0:0(0) <...>                                                       // Send SYN
1 0.000000 192.168.114.235 192.0.2.1 TCP 80 40784 → 8080 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 SACK_PERM TSval=2913446377 TSecr=0 WS=256

   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>      // Receive SYN-ACK with TCP options
2 0.000209 192.0.2.1 192.168.114.235 TCP 72 8080 → 40784 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=1000 WS=64 SACK_PERM

   +0 >  . 1:1(0) ack 1                                                       // Send ACK
3 0.000260 192.168.114.235 192.0.2.1 TCP 60 40784 → 8080 [ACK] Seq=1 Ack=1 Win=65536 Len=0

   +0 write(3, ..., 1000) = 1000                                              // Write 1000 bytes
   +0 >  P. 1:1001(1000) ack 1                                                // Send data with PSH flag
4 0.000344 192.168.114.235 192.0.2.1 TCP 1060 40784 → 8080 [PSH, ACK] Seq=1 Ack=1 Win=65536 Len=1000

   +0 close(3) = 0                                                            // Local side initiates close
   +0 >  F. 1001:1001(0) ack 1                                                // Send FIN
5 0.000381 192.168.114.235 192.0.2.1 TCP 60 40784 → 8080 [FIN, ACK] Seq=1001 Ack=1 Win=65536 Len=0

   +.2 >  F. 1001:1001(0) ack 1                                               // FIN retransmit
6 0.004545 192.168.114.235 192.0.2.1 TCP 60 [TCP Retransmission] 40784 → 8080 [FIN, ACK] Seq=1001 Ack=1 Win=65536 Len=0

   +.2~+.4 >  P. 1:1001(1000) ack 1                                           // RTX
   +0 < . 1:1(0) ack 1001 win 257                                             // Receive ACK for data
   +0 < . 1:1(0) ack 1002 win 257                                             // Receive ACK for FIN

   +0 < F. 1:1(0) ack 1002 win 257                                            // Receive FIN from remote
   +0 >  . 1002:1002(0) ack 2                                                 // Send ACK for FIN


And hit the below error.
../tcp/close/close-half-delayed-ack.pkt:18: error handling packet: timing error: expected outbound packet at 0.210706 sec but happened at 0.014838 sec; tolerance 0.025002 sec
script packet:  0.210706 F. 1001:1001(0) ack 1
actual packet:  0.014838 F. 1001:1001(0) ack 1 win 256

For me, it looks like the delay in the below line does not take effect in packetdrill.
+.2 >  F. 1001:1001(0) ack 1                                               // FIN retransmit

Thanks,
Dujeong.
Eric Dumazet Jan. 2, 2025, 8:16 a.m. UTC | #31
On Thu, Jan 2, 2025 at 1:22 AM Dujeong.lee <dujeong.lee@samsung.com> wrote:
>
> [...]
>
> And hit the below error.
> ../tcp/close/close-half-delayed-ack.pkt:18: error handling packet: timing error: expected outbound packet at 0.210706 sec but happened at 0.014838 sec; tolerance 0.025002 sec
> script packet:  0.210706 F. 1001:1001(0) ack 1
> actual packet:  0.014838 F. 1001:1001(0) ack 1 win 256
>
> For me, it looks like the delay in the below line does not take effect in packetdrill.
> +.2 >  F. 1001:1001(0) ack 1                                               // FIN retransmit

I think you misunderstood how packetdrill works.

In packetdrill, you can specify delays for incoming packets (to
account for network delays, or remote TCP stack bugs/behavior).

But outgoing packets are generated by the kernel TCP stack.
Packetdrill checks that these packets have the expected layouts and
are sent at the expected times.
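
To illustrate with the syntax used above (made-up lines, not from a real
script):

  +.2 < . 1:1(0) ack 1001 win 257   // '<': packetdrill injects this ACK 200 ms after the previous event
  +.2 >  F. 1001:1001(0) ack 1      // '>': an expectation; the kernel itself must send this at ~200 ms

A delay on a '<' line shapes the injected traffic; a delay on a '>' line
only defines when the kernel's own packet is expected to appear.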
Dujeong.lee Jan. 3, 2025, 4:16 a.m. UTC | #32
On Thu, Jan 2, 2025 at 5:17 PM Eric Dumazet <edumazet@google.com>
wrote:

> [...]
> I think you misunderstood how packetdrill works.
> 
> In packetdrill, you can specify delays for incoming packets (to account
> for network delays, or remote TCP stack bugs/behavior).
> 
> But outgoing packets are generated by the kernel TCP stack.
> Packetdrill checks that these packets have the expected layout and are
> sent at the expected time.


Hi Eric and Neal

Thanks for the explanation.
I have now updated the script to follow the local packet Tx pattern
observed with tcpdump and injected delays for the remote packets.
Now it works without any issue.

// Test basic connection teardown where the local process closes first:
// the local process calls close() first, so we send a FIN.
// Then we receive a delayed ACK for the data and the FIN.
// Then we receive a FIN and ACK it.

`../common/defaults.sh`
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3                               // Create socket
   +.01...0.011 connect(3, ..., ...) = 0                                      // Initiate connection
   +0 >  S 0:0(0) <...>                                                       // Send SYN
   +0 < S. 0:0(0) ack 1 win 32768 <mss 1000,nop,wscale 6,nop,nop,sackOK>      // Receive SYN-ACK with TCP options
   +0 >  . 1:1(0) ack 1                                                       // Send ACK

   +0 write(3, ..., 1000) = 1000                                              // Write 1000 bytes

   +0 >  P. 1:1001(1000) ack 1                                                // Send data with PSH flag

   +0 close(3) = 0                                                            // Local side initiates close

   +0 >  F. 1001:1001(0) ack 1                                                // Send FIN

   +0 >  F. 1001:1001(0) ack 1                                                // FIN retransmit

   +.2  >  P. 1:1001(1000) ack 1                                              // RTX
   +.4  >  P. 1:1001(1000) ack 1                                              // RTX
   +.8  >  P. 1:1001(1000) ack 1                                              // RTX
   +1.6 >  P. 1:1001(1000) ack 1                                              // RTX
   +3.2 >  P. 1:1001(1000) ack 1                                              // RTX
   +6.4 >  P. 1:1001(1000) ack 1                                              // RTX
   +13  >  P. 1:1001(1000) ack 1                                              // RTX
   +26  >  P. 1:1001(1000) ack 1                                              // RTX

   +1 < . 1:1(0) ack 1002 win 257                                            // Receive ACK for FIN

   +1 < . 1:1(0) ack 1001 win 257                                            // Receive ACK for data

   +0 < F. 1:1(0) ack 1002 win 257                                            // Receive FIN from remote

We will develop the script further to reliably reproduce the case.
Maybe we need to get a good tcpdump trace when the issue happens, but that would take some time.

In the meantime, since we have a complete ramdump snapshot, it would be appreciated if Neal could see whether the values I provided earlier reveal anything.

tp->packets_out = 0
tp->sacked_out = 0
tp->lost_out = 1
tp->retrans_out = 1
tp->rx_opt.sack_ok = 5 (tcp_is_sack(tp))
tp->mss_cache = 1400
((struct inet_connection_sock *)sk)->icsk_ca_state = 4 
((struct inet_connection_sock *)sk)->icsk_pmtu_cookie = 1500
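
For reference, icsk_ca_state = 4 is TCP_CA_Loss; a sketch of the relevant
enum from include/net/tcp.h (abridged, omitting the TCPF_* mask defines):

enum tcp_ca_state {
	TCP_CA_Open = 0,
	TCP_CA_Disorder = 1,
	TCP_CA_CWR = 2,
	TCP_CA_Recovery = 3,
	TCP_CA_Loss = 4
};
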
Hex from ip header:
45 00 00 40 75 40 00 00 39 06 91 13 8E FB 2A CA
C0 A8 00 F7 01 BB A7 CC 51 F8 63 CC 52 59 6D A6
B0 10 04 04 77 76 00 00 01 01 08 0A 89 72 C8 42
62 F5 F5 D1 01 01 05 0A 52 59 6D A5 52 59 6D A6
Transmission Control Protocol
Source Port: 443
Destination Port: 42956
TCP Segment Len: 0
Sequence Number (raw): 1375232972
Acknowledgment number (raw): 1381592486
1011 .... = Header Length: 44 bytes (11)
Flags: 0x010 (ACK)
Window: 1028
Calculated window size: 1028
Urgent Pointer: 0
Options: (24 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps, No-Operation (NOP), No-Operation (NOP), SACK

Thanks,
Dujeong.
Youngmin Nam Jan. 17, 2025, 5:08 a.m. UTC | #33
> Thanks for all the details! If the ramdump becomes available again at
> some point, would it be possible to pull out the following values as
> well:
> 
> tp->mss_cache
> inet_csk(sk)->icsk_pmtu_cookie
> inet_csk(sk)->icsk_ca_state
> 
> Thanks,
> neal
> 

Hi Neal. Happy new year.

We are currently trying to capture a tcpdump during the problem situation
to construct the Packetdrill script. However, this issue does not occur very often.

By the way, we have a full ramdump, so we can provide the information you requested.

tp->packets_out = 0
tp->sacked_out = 0
tp->lost_out = 4
tp->retrans_out = 1
tcp_is_sack(tp) = 1
tp->mss_cache = 1428
inet_csk(sk)->icsk_ca_state = 4
inet_csk(sk)->icsk_pmtu_cookie = 1500

If you need any specific information from the ramdump, please let me know.

Thanks.
Neal Cardwell Jan. 17, 2025, 3:18 p.m. UTC | #34
On Fri, Jan 17, 2025 at 12:04 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> > Thanks for all the details! If the ramdump becomes available again at
> > some point, would it be possible to pull out the following values as
> > well:
> >
> > tp->mss_cache
> > inet_csk(sk)->icsk_pmtu_cookie
> > inet_csk(sk)->icsk_ca_state
> >
> > Thanks,
> > neal
> >
>
> Hi Neal. Happy new year.
>
> We are currently trying to capture a tcpdump during the problem situation
> to construct the Packetdrill script. However, this issue does not occur very often.
>
> By the way, we have a full ramdump, so we can provide the information you requested.
>
> tp->packets_out = 0
> tp->sacked_out = 0
> tp->lost_out = 4
> tp->retrans_out = 1
> tcp_is_sack(tp) = 1
> tp->mss_cache = 1428
> inet_csk(sk)->icsk_ca_state = 4
> inet_csk(sk)->icsk_pmtu_cookie = 1500
>
> If you need any specific information from the ramdump, please let me know.

The icsk_ca_state = 4 is interesting, since that's TCP_CA_Loss,
indicating RTO recovery. Perhaps the socket suffered many recurring
timeouts and timed out with ETIMEDOUT,
causing the tcp_write_queue_purge() call that reset packets_out to
0... and then some race happened during the teardown process that
caused another incoming packet to be processed in this resulting
inconsistent state?
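
For context, tcp_write_queue_purge() looks roughly like this -- an
abridged sketch of net/ipv4/tcp.c, not verbatim; note what it resets
and what it leaves alone:

/* Abridged sketch of tcp_write_queue_purge() */
void tcp_write_queue_purge(struct sock *sk)
{
	struct sk_buff *skb;

	tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
		tcp_wmem_free_skb(sk, skb);	/* drop unsent data */
	tcp_rtx_queue_purge(sk);		/* drop un-ACKed data */
	tcp_clear_all_retrans_hints(tcp_sk(sk));
	tcp_sk(sk)->packets_out = 0;		/* reset to zero here ... */
	inet_csk(sk)->icsk_backoff = 0;
	/* ... but sacked_out, lost_out and retrans_out are not touched */
}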

Do you have a way to use GDB or a similar tool to print all the fields
of the socket? Like:

  (gdb)  p *(struct tcp_sock*) some_hex_address_goes_here

?

If so, that could be useful in extracting further hints about what
state this socket is in.

If that's not possible, but a few extra fields are possible, would you
be able to pull out the following:

tp->retrans_stamp
tp->tcp_mstamp
icsk->icsk_retransmits
icsk->icsk_backoff
icsk->icsk_rto

thanks,
neal
Youngmin Nam Jan. 20, 2025, 12:18 a.m. UTC | #35
On Fri, Jan 17, 2025 at 10:18:58AM -0500, Neal Cardwell wrote:
> On Fri, Jan 17, 2025 at 12:04 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > [quoted values snipped; identical to Youngmin's message above]
> 
> The icsk_ca_state = 4 is interesting, since that's TCP_CA_Loss,
> indicating RTO recovery. Perhaps the socket suffered many recurring
> timeouts and timed out with ETIMEDOUT,
> causing the tcp_write_queue_purge() call that reset packets_out to
> 0... and then some race happened during the teardown process that
> caused another incoming packet to be processed in this resulting
> inconsistent state?
> 
> Do you have a way to use GDB or a similar tool to print all the fields
> of the socket? Like:
> 
>   (gdb)  p *(struct tcp_sock*) some_hex_address_goes_here
> 
> ?
> 
> If so, that could be useful in extracting further hints about what
> state this socket is in.
> 
> If that's not possible, but a few extra fields are possible, would you
> be able to pull out the following:
> 
> tp->retrans_stamp
> tp->tcp_mstamp
> icsk->icsk_retransmits
> icsk->icsk_backoff
> icsk->icsk_rto
> 
> thanks,
> neal
> 

Hi Neal,
Thank you for looking into this issue.
When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
We can provide any information you would like to inspect.

tp->retrans_stamp = 3339228
tp->tcp_mstamp = 3552879949
icsk->icsk_retransmits = 2
icsk->icsk_backoff = 0
icsk->icsk_rto = 16340

Here is all the information about tcp_sock.

(struct tcp_sock *) tp = 0xFFFFFF88C1053C00 -> (
  (struct inet_connection_sock) inet_conn = ((struct inet_sock) icsk_inet = ((struct sock) sk = ((struct sock_common) __sk_common = ((__addrpair) skc_addrpair = 13979358786200921654, (__be32) skc_daddr = 255714870, (__be32) skc_rcv_saddr = 3254823104, (unsigned int) skc_hash = 2234333897, (__u16 [2]) skc_u16hashes = (15049, 34093), (__portpair) skc_portpair = 3600464641, (__be16) skc_dport = 47873, (__u16) skc_num = 54938, (unsigned short) skc_family = 10, (volatile unsigned char) skc_state = 4, (unsigned char:4) skc_reuse = 0, (unsigned char:1) skc_reuseport = 0, (unsigned char:1) skc_ipv6only = 0, (unsigned char:1) skc_net_refcnt = 1, (int) skc_bound_dev_if = 0, (struct hlist_node) skc_bind_node = ((struct hlist_node *) next = 0x0, (struct hlist_node * *) pprev = 0xFFFFFF881CE60AC0), (struct hlist_node) skc_portaddr_node = ((struct hlist_node *) next = 0x0, (struct hlist_node * *) pprev = 0xFFFFFF881CE60AC0), (struct proto *) skc_prot = 0xFFFFFFD08322CFE0, (possible_net_t) skc_net = ((struct net *) net = 0xFFFFFFD083316600), (struct in6_addr) skc_v6_daddr = ((union) in6_u = ((__u8 [16]) u6_addr8 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 54, 230, 61, 15), (__be16 [8]) u6_addr16 = (0, 0, 0, 0, 0, 65535, 58934, 3901), (__be32 [4]) u6_addr32 = (0, 0, 4294901760, 255714870))), (struct in6_addr) skc_v6_rcv_saddr = ((union) in6_u = ((__u8 [16]) u6_addr8 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 192, 168, 0, 194), (__be16 [8]) u6_addr16 = (0, 0, 0, 0, 0, 65535, 43200, 49664), (__be32 [4]) u6_addr32 = (0, 0, 4294901760, 3254823104))), (atomic64_t) skc_cookie = ((s64) counter = 6016), (unsigned long) skc_flags = 769, (struct sock *) skc_listener = 0x0301, (struct inet_timewait_death_row *) skc_tw_dr = 0x0301, (int [0]) skc_dontcopy_begin = (), (struct hlist_node) skc_node = ((struct hlist_node *) next = 0x7593, (struct hlist_node * *) pprev = 0xFFFFFF882D49D648), (struct hlist_nulls_node) skc_nulls_node = ((struct hlist_nulls_node *) next = 0x7593, (struct hlist_nulls_node * *) pprev = 0xFFFFFF882D49D648), (unsigned short) skc_tx_queue_mapping = 65535, (unsigned short) skc_rx_queue_mapping = 65535, (int) skc_incoming_cpu = 2, (u32) skc_rcv_wnd = 2, (u32) skc_tw_rcv_nxt = 2, (refcount_t) skc_refcnt = ((atomic_t) refs = ((int) counter = 2)), (int [0]) skc_dontcopy_end = (), (u32) skc_rxhash = 0, (u32) skc_window_clamp = 0, (u32) skc_tw_snd_nxt = 0), (struct dst_entry *) sk_rx_dst = 0xFFFFFF8821384F00, (int) sk_rx_dst_ifindex = 14, (u32) sk_rx_dst_cookie = 0, (socket_lock_t) sk_lock = ((spinlock_t) slock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 1), (u8) locked = 1, (u8) pending = 0, (u16) locked_pending = 1, (u16) tail = 0))), (int) owned = 0, (wait_queue_head_t) wq = ((spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_loc
  (u16) tcp_header_len = 32,
  (u16) gso_segs = 26,
  (__be32) pred_flags = 2566918272,
  (u64) bytes_received = 7950,
  (u32) segs_in = 21,
  (u32) data_segs_in = 12,
  (u32) rcv_nxt = 2956372822,
  (u32) copied_seq = 2956372822,
  (u32) rcv_wup = 2956372822,
  (u32) snd_nxt = 2253384964,
  (u32) segs_out = 30,
  (u32) data_segs_out = 13,
  (u64) bytes_sent = 11381,
  (u64) bytes_acked = 3841,
  (u32) dsack_dups = 0,
  (u32) snd_una = 2253381091,
  (u32) snd_sml = 2253384964,
  (u32) rcv_tstamp = 757520,
  (u32) lsndtime = 768000,
  (u32) last_oow_ack_time = 750065,
  (u32) compressed_ack_rcv_nxt = 2956370839,
  (u32) tsoffset = 1072186408,
  (struct list_head) tsq_node = ((struct list_head *) next = 0xFFFFFF88C1054260, (struct list_head *) prev = 0xFFFFFF88C1054260),
  (struct list_head) tsorted_sent_queue = ((struct list_head *) next = 0xFFFFFF88C1054270, (struct list_head *) prev = 0xFFFFFF88C1054270),
  (u32) snd_wl1 = 2956372861,
  (u32) snd_wnd = 78336,
  (u32) max_window = 78336,
  (u32) mss_cache = 1428,
  (u32) window_clamp = 2752512,
  (u32) rcv_ssthresh = 91439,
  (u8) scaling_ratio = 84,
  (struct tcp_rack) rack = ((u64) mstamp = 3312043553, (u32) rtt_us = 3103952, (u32) end_seq = 2253381091, (u32) last_delivered = 0, (u8) reo_wnd_steps = 1, (u8:5) reo_wnd_persist = 0, (u8:1) dsack_seen = 0, (u8:1) advanced = 1),
  (u16) advmss = 1448,
  (u8) compressed_ack = 0,
  (u8:2) dup_ack_counter = 2,
  (u8:1) tlp_retrans = 1,
  (u8:5) unused = 0,
  (u32) chrono_start = 780636,
  (u32 [3]) chrono_stat = (30386, 0, 0),
  (u8:2) chrono_type = 1,
  (u8:1) rate_app_limited = 1,
  (u8:1) fastopen_connect = 0,
  (u8:1) fastopen_no_cookie = 0,
  (u8:1) is_sack_reneg = 0,
  (u8:2) fastopen_client_fail = 0,
  (u8:4) nonagle = 0,
  (u8:1) thin_lto = 0,
  (u8:1) recvmsg_inq = 0,
  (u8:1) repair = 0,
  (u8:1) frto = 1,
  (u8) repair_queue = 0,
  (u8:2) save_syn = 0,
  (u8:1) syn_data = 0,
  (u8:1) syn_fastopen = 0,
  (u8:1) syn_fastopen_exp = 0,
  (u8:1) syn_fastopen_ch = 0,
  (u8:1) syn_data_acked = 0,
  (u8:1) is_cwnd_limited = 1,
  (u32) tlp_high_seq = 0,
  (u32) tcp_tx_delay = 0,
  (u64) tcp_wstamp_ns = 3371996070858,
  (u64) tcp_clock_cache = 3552879949296,
  (u64) tcp_mstamp = 3552879949,
  (u32) srtt_us = 29633751,
  (u32) mdev_us = 10160190,
  (u32) mdev_max_us = 10160190,
  (u32) rttvar_us = 12632227,
  (u32) rtt_seq = 2253383947,
  (struct minmax) rtt_min = ((struct minmax_sample [3]) s = (((u32) t = 753091, (u32) v = 330326), ((u32) t = 753091, (u32) v = 330326), ((u32) t = 753091, (u32) v = 330326))),
  (u32) packets_out = 0,
  (u32) retrans_out = 1,
  (u32) max_packets_out = 4,
  (u32) cwnd_usage_seq = 2253384964,
  (u16) urg_data = 0,
  (u8) ecn_flags = 0,
  (u8) keepalive_probes = 0,
  (u32) reordering = 3,
  (u32) reord_seen = 0,
  (u32) snd_up = 2253381091,
  (struct tcp_options_received) rx_opt = ((int) ts_recent_stamp = 3330, (u32) ts_recent = 1119503967, (u32) rcv_tsval = 1119728668, (u32) rcv_tsecr = 3312043, (u16:1) saw_tstamp = 1, (u16:1) tstamp_ok = 1, (u16:1) dsack = 0, (u16:1) wscale_ok = 1, (u16:3) sack_ok = 1, (u16:1) smc_ok = 0, (u16:4) snd_wscale = 9, (u16:4) rcv_wscale = 9, (u8:1) saw_unknown = 0, (u8:7) unused = 0, (u8) num_sacks = 0, (u16) user_mss = 0, (u16) mss_clamp = 1440),
  (u32) snd_ssthresh = 7,
  (u32) snd_cwnd = 1,
  (u32) snd_cwnd_cnt = 0,
  (u32) snd_cwnd_clamp = 4294967295,
  (u32) snd_cwnd_used = 0,
  (u32) snd_cwnd_stamp = 768000,
  (u32) prior_cwnd = 10,
  (u32) prr_delivered = 0,
  (u32) prr_out = 0,
  (u32) delivered = 7,
  (u32) delivered_ce = 0,
  (u32) lost = 8,
  (u32) app_limited = 8,
  (u64) first_tx_mstamp = 3312043553,
  (u64) delivered_mstamp = 3315147505,
  (u32) rate_delivered = 1,
  (u32) rate_interval_us = 330326,
  (u32) rcv_wnd = 91648,
  (u32) write_seq = 2253384964,
  (u32) notsent_lowat = 0,
  (u32) pushed_seq = 2253384963,
  (u32) lost_out = 4,
  (u32) sacked_out = 0,
  (struct hrtimer) pacing_timer = ((struct timerqueue_node) node = ((struct rb_node) node = ((unsigned long) __rb_parent_color = 18446743561551823800, (struct rb_node *) rb_right = 0x0, (struct rb_node *) rb_left = 0x0), (ktime_t) expires = 0), (ktime_t) _softexpires = 0, (enum hrtimer_restart (*)()) function = 0xFFFFFFD081EA565C, (struct hrtimer_clock_base *) base = 0xFFFFFF8962589DC0, (u8) state = 0, (u8) is_rel = 0, (u8) is_soft = 1, (u8) is_hard = 0, (u64) android_kabi_reserved1 = 0),
  (struct hrtimer) compressed_ack_timer = ((struct timerqueue_node) node = ((struct rb_node) node = ((unsigned long) __rb_parent_color = 18446743561551823872, (struct rb_node *) rb_right = 0x0, (struct rb_node *) rb_left = 0x0), (ktime_t) expires = 0), (ktime_t) _softexpires = 0, (enum hrtimer_restart (*)()) function = 0xFFFFFFD081EAE348, (struct hrtimer_clock_base *) base = 0xFFFFFF8962589DC0, (u8) state = 0, (u8) is_rel = 0, (u8) is_soft = 1, (u8) is_hard = 0, (u64) android_kabi_reserved1 = 0),
  (struct sk_buff *) lost_skb_hint = 0x0,
  (struct sk_buff *) retransmit_skb_hint = 0x0,
  (struct rb_root) out_of_order_queue = ((struct rb_node *) rb_node = 0x0),
  (struct sk_buff *) ooo_last_skb = 0xFFFFFF891768FB00,
  (struct tcp_sack_block [1]) duplicate_sack = (((u32) start_seq = 2956372143, (u32) end_seq = 2956372822)),
  (struct tcp_sack_block [4]) selective_acks = (((u32) start_seq = 2956371078, (u32) end_seq = 2956372143), ((u32) start_seq = 0, (u32) end_seq = 0), ((u32) start_seq = 0, (u32) end_seq = 0), ((u32) start_seq = 0, (u32) end_seq = 0)),
  (struct tcp_sack_block [4]) recv_sack_cache = (((u32) start_seq = 0, (u32) end_seq = 0), ((u32) start_seq = 0, (u32) end_seq = 0), ((u32) start_seq = 0, (u32) end_seq = 0), ((u32) start_seq = 0, (u32) end_seq = 0)),
  (struct sk_buff *) highest_sack = 0x0,
  (int) lost_cnt_hint = 0,
  (u32) prior_ssthresh = 2147483647,
  (u32) high_seq = 2253384964,
  (u32) retrans_stamp = 3339228,
  (u32) undo_marker = 2253381091,
  (int) undo_retrans = 3,
  (u64) bytes_retrans = 3669,
  (u32) total_retrans = 6,
  (u32) urg_seq = 0,
  (unsigned int) keepalive_time = 0,
  (unsigned int) keepalive_intvl = 0,
  (int) linger2 = 0,
  (u8) bpf_sock_ops_cb_flags = 0,
  (u8:1) bpf_chg_cc_inprogress = 0,
  (u16) timeout_rehash = 5,
  (u32) rcv_ooopack = 2,
  (u32) rcv_rtt_last_tsecr = 3312043,
  (struct) rcv_rtt_est = ((u32) rtt_us = 0, (u32) seq = 2956431836, (u64) time = 3301136665),
  (struct) rcvq_space = ((u32) space = 14480, (u32) seq = 2956364872, (u64) time = 3300257291),
  (struct) mtu_probe = ((u32) probe_seq_start = 0, (u32) probe_seq_end = 0),
  (u32) plb_rehash = 0,
  (u32) mtu_info = 0,
  (struct tcp_fastopen_request *) fastopen_req = 0x0,
  (struct request_sock *) fastopen_rsk = 0x0,
  (struct saved_syn *) saved_syn = 0x0,
  (u64) android_oem_data1 = 0,
  (u64) android_kabi_reserved1 = 0)

And here are the details of inet_connection_sock.

(struct tcp_sock *) tp = 0xFFFFFF88C1053C00 -> (
  (struct inet_connection_sock) inet_conn = (
    (struct inet_sock) icsk_inet = ((struct sock) sk = ((struct sock_common) __sk_common = ((__addrpair) skc_addrpair = 13979358786200921654, (__be32) skc_daddr = 255714870, (__be32) skc_rcv_saddr = 3254823104, (unsigned int) skc_hash = 2234333897, (__u16 [2]) skc_u16hashes = (15049, 34093), (__portpair) skc_portpair = 3600464641, (__be16) skc_dport = 47873, (__u16) skc_num = 54938, (unsigned short) skc_family = 10, (volatile unsigned char) skc_state = 4, (unsigned char:4) skc_reuse = 0, (unsigned char:1) skc_reuseport = 0, (unsigned char:1) skc_ipv6only = 0, (unsigned char:1) skc_net_refcnt = 1, (int) skc_bound_dev_if = 0, (struct hlist_node) skc_bind_node = ((struct hlist_node *) next = 0x0, (struct hlist_node * *) pprev = 0xFFFFFF881CE60AC0), (struct hlist_node) skc_portaddr_node = ((struct hlist_node *) next = 0x0, (struct hlist_node * *) pprev = 0xFFFFFF881CE60AC0), (struct proto *) skc_prot = 0xFFFFFFD08322CFE0, (possible_net_t) skc_net = ((struct net *) net = 0xFFFFFFD083316600), (struct in6_addr) skc_v6_daddr = ((union) in6_u = ((__u8 [16]) u6_addr8 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 54, 230, 61, 15), (__be16 [8]) u6_addr16 = (0, 0, 0, 0, 0, 65535, 58934, 3901), (__be32 [4]) u6_addr32 = (0, 0, 4294901760, 255714870))), (struct in6_addr) skc_v6_rcv_saddr = ((union) in6_u = ((__u8 [16]) u6_addr8 = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 192, 168, 0, 194), (__be16 [8]) u6_addr16 = (0, 0, 0, 0, 0, 65535, 43200, 49664), (__be32 [4]) u6_addr32 = (0, 0, 4294901760, 3254823104))), (atomic64_t) skc_cookie = ((s64) counter = 6016), (unsigned long) skc_flags = 769, (struct sock *) skc_listener = 0x0301, (struct inet_timewait_death_row *) skc_tw_dr = 0x0301, (int [0]) skc_dontcopy_begin = (), (struct hlist_node) skc_node = ((struct hlist_node *) next = 0x7593, (struct hlist_node * *) pprev = 0xFFFFFF882D49D648), (struct hlist_nulls_node) skc_nulls_node = ((struct hlist_nulls_node *) next = 0x7593, (struct hlist_nulls_node * *) pprev = 0xFFFFFF882D49D648), (unsigned short) skc_tx_queue_mapping = 65535, (unsigned short) skc_rx_queue_mapping = 65535, (int) skc_incoming_cpu = 2, (u32) skc_rcv_wnd = 2, (u32) skc_tw_rcv_nxt = 2, (refcount_t) skc_refcnt = ((atomic_t) refs = ((int) counter = 2)), (int [0]) skc_dontcopy_end = (), (u32) skc_rxhash = 0, (u32) skc_window_clamp = 0, (u32) skc_tw_snd_nxt = 0), (struct dst_entry *) sk_rx_dst = 0xFFFFFF8821384F00, (int) sk_rx_dst_ifindex = 14, (u32) sk_rx_dst_cookie = 0, (socket_lock_t) sk_lock = ((spinlock_t) slock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 1), (u8) locked = 1, (u8) pending = 0, (u16) locked_pending = 1, (u16) tail = 0))), (int) owned = 0, (wait_queue_head_t) wq = ((spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 0), (u8) locked = 0, (u8) pending = 0, (u16) locked_pending = 0, (u16) tail = 0))), (struct list_head) head = ((struct list_head *) next = 0xFFFFFF88C1053CA8, (struct list_head *) prev = 0xFFFFFF88C1053CA8))), (atomic_t) sk_drops = ((int) counter = 7), (int) sk_rcvlowat = 1, (struct sk_buff_head) sk_error_queue = ((struct sk_buff *) next = 0xFFFFFF88C1053CC0, (struct sk_buff *) prev = 0xFFFFFF88C1053CC0, (struct sk_buff_list) list = ((struct sk_buff *) next = 0xFFFFFF88C1053CC0, (struct sk_buff *) prev = 0xFFFFFF88C1053CC0), (__u32) qlen = 0, (spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 0), (u8) locked = 0, (u8) pending = 0, (u16) 
locked_pending = 0, (u16) tail = 0)))), (struct sk_buff_head) sk_receive_queue = ((struct sk_buff *) next = 0xFFFFFF88C1053CD8, (struct sk_buff *) prev = 0xFFFFFF88C1053CD8, (struct sk_buff_list) list = ((struct sk_buff *) next = 0xFFFFFF88C1053CD8, (struct sk_buff *) prev = 0xFFFFFF88C1053CD8), (__u32) qlen = 0, (spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 0), (u8) locked = 0, (u8) pending = 0, (u16) locked_pending = 0, (u16) tail = 0)))), (struct) sk_backlog = ((at
    (struct request_sock_queue) icsk_accept_queue = ((spinlock_t) rskq_lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 0), (u8) locked = 0, (u8) pending = 0, (u16) locked_pending = 0, (u16) tail = 0))), (u8) rskq_defer_accept = 0, (u32) synflood_warned = 0, (atomic_t) qlen = ((int) counter = 0), (atomic_t) young = ((int) counter = 0), (struct request_sock *) rskq_accept_head = 0x0, (struct request_sock *) rskq_accept_tail = 0x0, (struct fastopen_queue) fastopenq = ((struct request_sock *) rskq_rst_head = 0x0, (struct request_sock *) rskq_rst_tail = 0x0, (spinlock_t) lock = ((struct raw_spinlock) rlock = ((arch_spinlock_t) raw_lock = ((atomic_t) val = ((int) counter = 0), (u8) locked = 0, (u8) pending = 0, (u16) locked_pending = 0, (u16) tail = 0))), (int) qlen = 0, (int) max_qlen = 0, (struct tcp_fastopen_context *) ctx = 0x0)),
    (struct inet_bind_bucket *) icsk_bind_hash = 0xFFFFFF881CE60A80,
    (struct inet_bind2_bucket *) icsk_bind2_hash = 0xFFFFFF881B1FD500,
    (unsigned long) icsk_timeout = 4295751636,
    (struct timer_list) icsk_retransmit_timer = ((struct hlist_node) entry = ((struct hlist_node *) next = 0xDEAD000000000122, (struct hlist_node * *) pprev = 0x0), (unsigned long) expires = 4295751636, (void (*)()) function = 0xFFFFFFD081EADE28, (u32) flags = 1056964613, (u64) android_kabi_reserved1 = 0, (u64) android_kabi_reserved2 = 0),
    (struct timer_list) icsk_delack_timer = ((struct hlist_node) entry = ((struct hlist_node *) next = 0xDEAD000000000122, (struct hlist_node * *) pprev = 0x0), (unsigned long) expires = 4295721093, (void (*)()) function = 0xFFFFFFD081EADF60, (u32) flags = 25165829, (u64) android_kabi_reserved1 = 0, (u64) android_kabi_reserved2 = 0),
    (__u32) icsk_rto = 16340,
    (__u32) icsk_rto_min = 50,
    (__u32) icsk_delack_max = 50,
    (__u32) icsk_pmtu_cookie = 1500,
    (const struct tcp_congestion_ops *) icsk_ca_ops = 0xFFFFFFD0830A9440,
    (const struct inet_connection_sock_af_ops *) icsk_af_ops = 0xFFFFFFD082290FC8,
    (const struct tcp_ulp_ops *) icsk_ulp_ops = 0x0,
    (void *) icsk_ulp_data = 0x0,
    (void (*)()) icsk_clean_acked = 0x0,
    (unsigned int (*)()) icsk_sync_mss = 0xFFFFFFD081EA60EC,
    (__u8:5) icsk_ca_state = 4,
    (__u8:1) icsk_ca_initialized = 1,
    (__u8:1) icsk_ca_setsockopt = 0,
    (__u8:1) icsk_ca_dst_locked = 0,
    (__u8) icsk_retransmits = 2,
    (__u8) icsk_pending = 0,
    (__u8) icsk_backoff = 0,
    (__u8) icsk_syn_retries = 0,
    (__u8) icsk_probes_out = 0,
    (__u16) icsk_ext_hdr_len = 0,
    (struct) icsk_ack = ((__u8) pending = 0, (__u8) quick = 15, (__u8) pingpong = 0, (__u8) retry = 0, (__u32) ato = 10, (unsigned long) timeout = 4295721093, (__u32) lrcvtime = 753787, (__u16) last_seg_size = 0, (__u16) rcv_mss = 1428),
    (struct) icsk_mtup = ((int) search_high = 1480, (int) search_low = 1076, (u32:31) probe_size = 0, (u32:1) enabled = 0, (u32) probe_timestamp = 0),
    (u32) icsk_probes_tstamp = 0,
    (u32) icsk_user_timeout = 0,
    (u64) android_kabi_reserved1 = 0,
    (u64 [13]) icsk_ca_priv = (0, 0, 0, 0, 0, 14482612842890526720, 14482612845143911684, 4294967295, 0, 0, 0, 0, 0)),
Youngmin Nam Feb. 3, 2025, 5:21 a.m. UTC | #36
On Mon, Jan 20, 2025 at 09:18:48AM +0900, Youngmin Nam wrote:
> [full quote of the previous message -- including the tcp_sock and
> inet_connection_sock dumps shown above -- snipped]

Hi Neal.

When you have a chance, could you take a look at this ramdump snapshot?
Neal Cardwell Feb. 24, 2025, 9:13 p.m. UTC | #37
On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
>
> > Hi Neal,
> > Thank you for looking into this issue.
> > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > We can provide any information you would like to inspect.

Thanks again for raising this issue, and providing all that data!

I've come up with a reproducer for this issue, and an explanation for
why this has only been seen on Android so far, and a theory about a
related socket leak issue, and a proposed fix for the WARN and the
socket leak.

Here is the scenario:

+ user process A has a socket in TCP_ESTABLISHED

+ user process A calls close(fd)

+ socket calls __tcp_close() and tcp_close_state() decides to enter
TCP_FIN_WAIT1 and send a FIN

+ FIN is lost and retransmitted, making the state:
---
 tp->packets_out = 1
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---

+ someone invokes "ss" to --kill the socket using the functionality in
(c1e64e298b8c "net: diag: Support destroying TCP sockets")

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a

 (note: this was added for Android, so it would not be surprising to see
this inet_diag --kill run on Android)

+ the ss --kill causes a call to tcp_abort()

+ tcp_abort() calls tcp_write_queue_purge()

+ tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
retrans_out=1

+ tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state

+ ACK arrives and causes a WARN_ON from tcp_verify_left_out():

#define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)

because the state has:

 ---
 tcp_left_out(tp) = sacked_out + lost_out = 1
  tp->packets_out = 0
---

because the state is:

---
 tp->packets_out = 0
 tp->sacked_out = 0
 tp->lost_out = 1
 tp->retrans_out = 1
---
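
For reference, tcp_left_out() is the helper from include/net/tcp.h:

static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
{
	return tp->sacked_out + tp->lost_out;
}

so with this state the WARN_ON condition (1 > 0) is true.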

I guess perhaps one fix would be to just have tcp_write_queue_purge()
zero out those other fields:

---
 tp->sacked_out = 0
 tp->lost_out = 0
 tp->retrans_out = 0
---
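
A minimal sketch of that idea, expressed as a delta against the tail of
tcp_write_queue_purge() (untested, for illustration only):

--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ void tcp_write_queue_purge(struct sock *sk)
 	tcp_clear_all_retrans_hints(tcp_sk(sk));
 	tcp_sk(sk)->packets_out = 0;
+	tcp_sk(sk)->sacked_out = 0;
+	tcp_sk(sk)->lost_out = 0;
+	tcp_sk(sk)->retrans_out = 0;
 	inet_csk(sk)->icsk_backoff = 0;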

However, there is a related and worse problem. Because this killed
socket has tp->packets_out == 0, the next time the RTO timer fires,
tcp_retransmit_timer() notices !tp->packets_out is true, so it short
circuits and returns without setting another RTO timer or checking to
see if the socket should be deleted. So the tcp_sock is now sitting in
memory with no timer set to delete it. So we could leak a socket this
way. So AFAICT to fix this socket leak problem, perhaps we want a
patch like the following (not tested yet), so that we delete all
killed sockets immediately, whether they are SOCK_DEAD (orphans for
which the user already called close()) or not:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 28cf19317b6c2..a266078b8ec8c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
        local_bh_disable();
        bh_lock_sock(sk);

-       if (!sock_flag(sk, SOCK_DEAD)) {
-               if (tcp_need_reset(sk->sk_state))
-                       tcp_send_active_reset(sk, GFP_ATOMIC);
-               tcp_done_with_error(sk, err);
-       }
+       if (tcp_need_reset(sk->sk_state))
+               tcp_send_active_reset(sk, GFP_ATOMIC);
+       tcp_done_with_error(sk, err);

        bh_unlock_sock(sk);
        local_bh_enable();
-       tcp_write_queue_purge(sk);
        release_sock(sk);
        return 0;
 }
---

Here is a packetdrill script that reproduces a scenario similar to that:

---  gtests/net/tcp/inet_diag/inet-diag-fin-wait-1-retrans-kill.pkt
// Test SOCK_DESTROY on TCP_FIN_WAIT1 sockets
// We use the "ss" socket statistics tool, which uses inet_diag sockets.

// ss -K can be slow
--tolerance_usecs=15000

// Set up config.
`../common/defaults.sh`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 2>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 32890

   +0 accept(3, ..., ...) = 4

// Send 4 data segments.
   +0 write(4, ..., 4000) = 4000
   +0 > P. 1:4001(4000) ack 1

   +0 close(4) = 0
   +0 > F. 4001:4001(0) ack 1

// In TCP_FIN_WAIT1 now...

// Send FIN as a TLP probe at 2*srtt:
+.200 > F. 4001:4001(0) ack 1

// Retransmit head.
+.300 > . 1:1001(1000) ack 1
   +0 `ss -tinmo src :8080`

// Test what happens when we ss --kill a socket in TCP_FIN_WAIT1.
// ss --kill is scary! Don't mess with the filter or risk killing many flows!
   +0 `ss -t --kill -n src :8080 `

   +0 `echo check what is left...; ss -tinmo src :8080`

// An ACK arrives that carries a SACK block and
// makes us call tcp_verify_left_out(tp) to WARN about the inconsistency:
+.010 < . 1:1(0) ack 1 win 32890 <sack 1001:2001,nop,nop>

   +0 `echo after SACK; ss -tinmo src :8080`
---

That script triggers one of the warning cases mentioned in the netdev
email thread (see below).

Note that when I extend the packetdrill script above to expect another
RTO retransmission, that retransmission never happens, which AFAICT
supports my theory about the socket leak issue.
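
For reference, the short circuit in question, sketched from
tcp_retransmit_timer() in net/ipv4/tcp_timer.c (abridged, not verbatim):

void tcp_retransmit_timer(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	/* Nothing we believe to be in flight: return without re-arming
	 * the RTO timer and without the usual "has this orphan been
	 * around too long?" checks -- so for a killed socket with
	 * packets_out == 0, once this fires, no timer is left to ever
	 * clean the socket up.
	 */
	if (!tp->packets_out)
		return;

	/* ... otherwise: retransmit the head, back off, re-arm RTO ... */
}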

best regards,
neal

---
ps: here is the warning triggered by the packetdrill script above

[412967.317794] ------------[ cut here ]------------
[412967.317801] WARNING: CPU: 109 PID: 865840 at net/ipv4/tcp_input.c:2141 tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317805] Modules linked in: dummy act_mirred sch_netem ifb bridge stp llc vfat fat i2c_mux_pca954x i2c_mux gq sha3_generic spi_lewisburg_pch spidev cdc_acm google_bmc_lpc google_bmc_mailbox xhci_pci xhci_hcd i2c_iimc
[412967.317818] CPU: 109 PID: 865840 Comm: packetdrill Kdump: loaded Tainted: G S               N 5.10.0-smp-1300.91.890.1 #6
[412967.317820] Hardware name: Google LLC Indus/Indus_QC_00, BIOS 30.60.4 02/23/2023
[412967.317821] RIP: 0010:tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317824] Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 0f 0b 41 83 be b0 06 00 00 00 79 a0 0f 0b eb 9c 0f 0b 41 8b 86 d0 06 00 00 eb 9c <0f> 0b eb b3 0f 0b e9 ac fe ff ff b8 ff ff ff ff e9 39 ff ff ff e8
[412967.317826] RSP: 0018:ffff999ff84b9770 EFLAGS: 00010286
[412967.317827] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000005
[412967.317828] RDX: 00000000fffffffc RSI: ffff99a04bd7d200 RDI: ffff99a07956e1c8
[412967.317830] RBP: ffff999ff84b9830 R08: ffff99a04bd7d200 R09: 00000000f82c8a14
[412967.317831] R10: ffff999f8b67dc54 R11: ffff999f8b67dc00 R12: 0000000000000054
[412967.317832] R13: 0000000000000001 R14: ffff99a07956e040 R15: 0000000000000014
[412967.317833] FS:  00007f4cf48dc740(0000) GS:ffff99cebfc80000(0000) knlGS:0000000000000000
[412967.317834] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[412967.317835] CR2: 00000000085e9480 CR3: 000000309df36006 CR4: 00000000007726f0
[412967.317836] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[412967.317837] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[412967.317838] PKRU: 55555554
[412967.317839] Call Trace:
[412967.317845]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317848]  ? __warn+0x195/0x2a0
[412967.317853]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317869]  ? report_bug+0xe6/0x150
[412967.317872]  ? handle_bug+0x4c/0x90
[412967.317878]  ? exc_invalid_op+0x3a/0x110
[412967.317881]  ? asm_exc_invalid_op+0x12/0x20
[412967.317885]  ? tcp_sacktag_write_queue+0xb6f/0xb90
[412967.317889]  tcp_ack+0x5c8/0x1a90
[412967.317893]  ? prep_new_page+0x81/0xe0
[412967.317897]  ? prep_new_page+0x41/0xe0
[412967.317900]  ? get_page_from_freelist+0x1556/0x15b0
[412967.317904]  tcp_rcv_state_process+0x2cd/0xe00
[412967.317908]  tcp_v4_do_rcv+0x2ac/0x370
[412967.317913]  tcp_v4_rcv+0x9b8/0xab0
[412967.317916]  ip_protocol_deliver_rcu+0x71/0x110
[412967.317921]  ip_local_deliver+0xa8/0x130
[412967.317925]  ? inet_rtm_getroute+0x191/0x910
[412967.317928]  ? ip_local_deliver+0x130/0x130
[412967.317931]  ip_rcv+0x41/0xd0
[412967.317934]  ? ip_rcv_core+0x300/0x300
[412967.317937]  __netif_receive_skb+0x9e/0x160
[412967.317942]  netif_receive_skb+0x2c/0x130
[412967.317945]  tun_rx_batched+0x17b/0x1e0
[412967.317950]  tun_get_user+0xe39/0xff0
[412967.317953]  ? __switch_to_asm+0x3a/0x60
[412967.317958]  tun_chr_write_iter+0x57/0x80
[412967.317961]  do_iter_readv_writev+0x143/0x180
[412967.317967]  do_iter_write+0x8b/0x1d0
[412967.317970]  vfs_writev+0x96/0x130
[412967.317974]  do_writev+0x6b/0x100
[412967.317978]  do_syscall_64+0x6d/0xa0
[412967.317981]  entry_SYSCALL_64_after_hwframe+0x67/0xd1
[412967.317985] RIP: 0033:0x7f4cf4a10885
Neal Cardwell Feb. 25, 2025, 5:24 p.m. UTC | #38
On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> >
> > > Hi Neal,
> > > Thank you for looking into this issue.
> > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > We can provide any information you would like to inspect.
>
> Thanks again for raising this issue, and providing all that data!
>
> I've come up with a reproducer for this issue, and an explanation for
> why this has only been seen on Android so far, and a theory about a
> related socket leak issue, and a proposed fix for the WARN and the
> socket leak.
>
> Here is the scenario:
>
> + user process A has a socket in TCP_ESTABLISHED
>
> + user process A calls close(fd)
>
> + socket calls __tcp_close() and tcp_close_state() decides to enter
> TCP_FIN_WAIT1 and send a FIN
>
> + FIN is lost and retransmitted, making the state:
> ---
>  tp->packets_out = 1
>  tp->sacked_out = 0
>  tp->lost_out = 1
>  tp->retrans_out = 1
> ---
>
> + someone invokes "ss" to --kill the socket using the functionality in
> (c1e64e298b8c "net: diag: Support destroying TCP sockets")
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
>
>  (note: this was added for Android, so would not be surprising to have
> this inet_diag --kill run on Android)
>
> + the ss --kill causes a call to tcp_abort()
>
> + tcp_abort() calls tcp_write_queue_purge()
>
> + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> retrans_out=1
>
> + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
>
> + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
>
> #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
>
> because the state has:
>
>  ---
>  tcp_left_out(tp) = sacked_out + lost_out = 1
>   tp->packets_out = 0
> ---
>
> because the state is:
>
> ---
>  tp->packets_out = 0
>  tp->sacked_out = 0
>  tp->lost_out = 1
>  tp->retrans_out = 1
> ---
>
> I guess perhaps one fix would be to just have tcp_write_queue_purge()
> zero out those other fields:
>
> ---
>  tp->sacked_out = 0
>  tp->lost_out = 0
>  tp->retrans_out = 0
> ---
>
> However, there is a related and worse problem. Because this killed
> socket has tp->packets_out = 0, the next time the RTO timer fires,
> tcp_retransmit_timer() notices !tp->packets_out is true, so it short
> circuits and returns without setting another RTO timer or checking to
> see if the socket should be deleted. So the tcp_sock is now sitting in
> memory with no timer set to delete it. So we could leak a socket this
> way. So AFAICT to fix this socket leak problem, perhaps we want a
> patch like the following (not tested yet), so that we delete all
> killed sockets immediately, whether they are SOCK_DEAD (orphans for
> which the user already called close() or not) :
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 28cf19317b6c2..a266078b8ec8c 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
>
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC);
> +       tcp_done_with_error(sk, err);
>
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         release_sock(sk);
>         return 0;
>  }
> ---

Actually, it seems like a similar fix was already merged into Linux v6.11:

bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort

Details below.

Youngmin, does your kernel have this bac76cf89816b fix? If not, can
you please cherry-pick this fix and retest?

Thanks!
neal

ps: details for bac76cf89816b:

commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
Author: Xueming Feng <kuro@kuroa.me>
Date:   Mon Aug 26 18:23:27 2024 +0800

    tcp: fix forever orphan socket caused by tcp_abort

    We have some problem closing zero-window fin-wait-1 tcp sockets in our
    environment. This patch come from the investigation.

    Previously tcp_abort only sends out reset and calls tcp_done when the
    socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
    purging the write queue, but not close the socket and left it to the
    timer.

    While purging the write queue, tp->packets_out and sk->sk_write_queue
    is cleared along the way. However tcp_retransmit_timer have early
    return based on !tp->packets_out and tcp_probe_timer have early
    return based on !sk->sk_write_queue.

    This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
    and socket not being killed by the timers, converting a zero-windowed
    orphan into a forever orphan.

    This patch removes the SOCK_DEAD check in tcp_abort, making it send
    reset to peer and close the socket accordingly. Preventing the
    timer-less orphan from happening.

    According to Lorenzo's email in the v1 thread, the check was there to
    prevent force-closing the same socket twice. That situation is handled
    by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
    already closed.

    The -ENOENT code comes from the associate patch Lorenzo made for
    iproute2-ss; link attached below, which also conform to RFC 9293.

    At the end of the patch, tcp_write_queue_purge(sk) is removed because it
    was already called in tcp_done_with_error().

    p.s. This is the same patch with v2. Resent due to mis-labeled "changes
    requested" on patchwork.kernel.org.

    Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
    Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
    Signed-off-by: Xueming Feng <kuro@kuroa.me>
    Tested-by: Lorenzo Colitti <lorenzo@google.com>
    Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e03a342c9162b..831a18dc7aa6d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
                /* Don't race with userspace socket closes such as tcp_close. */
                lock_sock(sk);

+       /* Avoid closing the same socket twice. */
+       if (sk->sk_state == TCP_CLOSE) {
+               if (!has_current_bpf_ctx())
+                       release_sock(sk);
+               return -ENOENT;
+       }
+
        if (sk->sk_state == TCP_LISTEN) {
                tcp_set_state(sk, TCP_CLOSE);
                inet_csk_listen_stop(sk);
@@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
        local_bh_disable();
        bh_lock_sock(sk);

-       if (!sock_flag(sk, SOCK_DEAD)) {
-               if (tcp_need_reset(sk->sk_state))
-                       tcp_send_active_reset(sk, GFP_ATOMIC,
-                                             SK_RST_REASON_NOT_SPECIFIED);
-               tcp_done_with_error(sk, err);
-       }
+       if (tcp_need_reset(sk->sk_state))
+               tcp_send_active_reset(sk, GFP_ATOMIC,
+                                     SK_RST_REASON_NOT_SPECIFIED);
+       tcp_done_with_error(sk, err);

        bh_unlock_sock(sk);
        local_bh_enable();
-       tcp_write_queue_purge(sk);
        if (!has_current_bpf_ctx())
                release_sock(sk);
        return 0;
Yuchung Cheng Feb. 25, 2025, 6:28 p.m. UTC | #39
On Tue, Feb 25, 2025 at 9:25 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > > Hi Neal,
> > > > Thank you for looking into this issue.
> > > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > > We can provide any information you would like to inspect.
> >
> > Thanks again for raising this issue, and providing all that data!
> >
> > I've come up with a reproducer for this issue, and an explanation for
> > why this has only been seen on Android so far, and a theory about a
> > related socket leak issue, and a proposed fix for the WARN and the
> > socket leak.
> >
> > Here is the scenario:
> >
> > + user process A has a socket in TCP_ESTABLISHED
> >
> > + user process A calls close(fd)
> >
> > + socket calls __tcp_close() and tcp_close_state() decides to enter
> > TCP_FIN_WAIT1 and send a FIN
> >
> > + FIN is lost and retransmitted, making the state:
> > ---
> >  tp->packets_out = 1
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > + someone invokes "ss" to --kill the socket using the functionality in
> > (c1e64e298b8c "net: diag: Support destroying TCP sockets")
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
> >
> >  (note: this was added for Android, so would not be surprising to have
> > this inet_diag --kill run on Android)
> >
> > + the ss --kill causes a call to tcp_abort()
> >
> > + tcp_abort() calls tcp_write_queue_purge()
> >
> > + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> > retrans_out=1
> >
> > + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
> >
> > + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
> >
> > #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
> >
> > because the state has:
> >
> >  ---
> >  tcp_left_out(tp) = sacked_out + lost_out = 1
> >   tp->packets_out = 0
> > ---
> >
> > because the state is:
> >
> > ---
> >  tp->packets_out = 0
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > I guess perhaps one fix would be to just have tcp_write_queue_purge()
> > zero out those other fields:
> >
> > ---
> >  tp->sacked_out = 0
> >  tp->lost_out = 0
> >  tp->retrans_out = 0
> > ---
> >
> > However, there is a related and worse problem. Because this killed
> > socket has tp->packets_out = 0, the next time the RTO timer fires,
Zeroing all in-flight stats in tcp_write_queue_purge still makes sense
to me. Why will the RTO timer still fire if packets_out is zeroed?


> > tcp_retransmit_timer() notices !tp->packets_out is true, so it short
> > circuits and returns without setting another RTO timer or checking to
> > see if the socket should be deleted. So the tcp_sock is now sitting in
> > memory with no timer set to delete it. So we could leak a socket this
> > way. So AFAICT to fix this socket leak problem, perhaps we want a
> > patch like the following (not tested yet), so that we delete all
> > killed sockets immediately, whether they are SOCK_DEAD (orphans for
> > which the user already called close() or not) :
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 28cf19317b6c2..a266078b8ec8c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
> >         local_bh_disable();
> >         bh_lock_sock(sk);
> >
> > -       if (!sock_flag(sk, SOCK_DEAD)) {
> > -               if (tcp_need_reset(sk->sk_state))
> > -                       tcp_send_active_reset(sk, GFP_ATOMIC);
> > -               tcp_done_with_error(sk, err);
> > -       }
> > +       if (tcp_need_reset(sk->sk_state))
> > +               tcp_send_active_reset(sk, GFP_ATOMIC);
> > +       tcp_done_with_error(sk, err);
> >
> >         bh_unlock_sock(sk);
> >         local_bh_enable();
> > -       tcp_write_queue_purge(sk);
> >         release_sock(sk);
> >         return 0;
> >  }
> > ---
>
> Actually, it seems like a similar fix was already merged into Linux v6.11:
>
> bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort
>
> Details below.
>
> Youngmin, does your kernel have this bac76cf89816b fix? If not, can
> you please cherry-pick this fix and retest?
>
> Thanks!
> neal
>
> ps: details for bac76cf89816b:
>
> commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
> Author: Xueming Feng <kuro@kuroa.me>
> Date:   Mon Aug 26 18:23:27 2024 +0800
>
>     tcp: fix forever orphan socket caused by tcp_abort
>
>     We have some problem closing zero-window fin-wait-1 tcp sockets in our
>     environment. This patch come from the investigation.
>
>     Previously tcp_abort only sends out reset and calls tcp_done when the
>     socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
>     purging the write queue, but not close the socket and left it to the
>     timer.
>
>     While purging the write queue, tp->packets_out and sk->sk_write_queue
>     is cleared along the way. However tcp_retransmit_timer have early
>     return based on !tp->packets_out and tcp_probe_timer have early
>     return based on !sk->sk_write_queue.
>
>     This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
>     and socket not being killed by the timers, converting a zero-windowed
>     orphan into a forever orphan.
>
>     This patch removes the SOCK_DEAD check in tcp_abort, making it send
>     reset to peer and close the socket accordingly. Preventing the
>     timer-less orphan from happening.
>
>     According to Lorenzo's email in the v1 thread, the check was there to
>     prevent force-closing the same socket twice. That situation is handled
>     by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
>     already closed.
>
>     The -ENOENT code comes from the associate patch Lorenzo made for
>     iproute2-ss; link attached below, which also conform to RFC 9293.
>
>     At the end of the patch, tcp_write_queue_purge(sk) is removed because it
>     was already called in tcp_done_with_error().
>
>     p.s. This is the same patch with v2. Resent due to mis-labeled "changes
>     requested" on patchwork.kernel.org.
>
>     Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
>     Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
>     Signed-off-by: Xueming Feng <kuro@kuroa.me>
>     Tested-by: Lorenzo Colitti <lorenzo@google.com>
>     Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>     Reviewed-by: Eric Dumazet <edumazet@google.com>
>     Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
>     Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162b..831a18dc7aa6d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
>                 /* Don't race with userspace socket closes such as tcp_close. */
>                 lock_sock(sk);
>
> +       /* Avoid closing the same socket twice. */
> +       if (sk->sk_state == TCP_CLOSE) {
> +               if (!has_current_bpf_ctx())
> +                       release_sock(sk);
> +               return -ENOENT;
> +       }
> +
>         if (sk->sk_state == TCP_LISTEN) {
>                 tcp_set_state(sk, TCP_CLOSE);
>                 inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
>
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> -                                             SK_RST_REASON_NOT_SPECIFIED);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC,
> +                                     SK_RST_REASON_NOT_SPECIFIED);
> +       tcp_done_with_error(sk, err);
>
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         if (!has_current_bpf_ctx())
>                 release_sock(sk);
>         return 0;
Eric Dumazet Feb. 25, 2025, 6:43 p.m. UTC | #40
On Tue, Feb 25, 2025 at 7:28 PM Yuchung Cheng <ycheng@google.com> wrote:
>
> On Tue, Feb 25, 2025 at 9:25 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > > >
> > > > > Hi Neal,
> > > > > Thank you for looking into this issue.
> > > > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > > > We can provide any information you would like to inspect.
> > >
> > > Thanks again for raising this issue, and providing all that data!
> > >
> > > I've come up with a reproducer for this issue, and an explanation for
> > > why this has only been seen on Android so far, and a theory about a
> > > related socket leak issue, and a proposed fix for the WARN and the
> > > socket leak.
> > >
> > > Here is the scenario:
> > >
> > > + user process A has a socket in TCP_ESTABLISHED
> > >
> > > + user process A calls close(fd)
> > >
> > > + socket calls __tcp_close() and tcp_close_state() decides to enter
> > > TCP_FIN_WAIT1 and send a FIN
> > >
> > > + FIN is lost and retransmitted, making the state:
> > > ---
> > >  tp->packets_out = 1
> > >  tp->sacked_out = 0
> > >  tp->lost_out = 1
> > >  tp->retrans_out = 1
> > > ---
> > >
> > > + someone invokes "ss" to --kill the socket using the functionality in
> > > (c1e64e298b8c "net: diag: Support destroying TCP sockets")
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
> > >
> > >  (note: this was added for Android, so would not be surprising to have
> > > this inet_diag --kill run on Android)
> > >
> > > + the ss --kill causes a call to tcp_abort()
> > >
> > > + tcp_abort() calls tcp_write_queue_purge()
> > >
> > > + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> > > retrans_out=1
> > >
> > > + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
> > >
> > > + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
> > >
> > > #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
> > >
> > > because the state has:
> > >
> > >  ---
> > >  tcp_left_out(tp) = sacked_out + lost_out = 1
> > >   tp->packets_out = 0
> > > ---
> > >
> > > because the state is:
> > >
> > > ---
> > >  tp->packets_out = 0
> > >  tp->sacked_out = 0
> > >  tp->lost_out = 1
> > >  tp->retrans_out = 1
> > > ---
> > >
> > > I guess perhaps one fix would be to just have tcp_write_queue_purge()
> > > zero out those other fields:
> > >
> > > ---
> > >  tp->sacked_out = 0
> > >  tp->lost_out = 0
> > >  tp->retrans_out = 0
> > > ---
> > >
> > > However, there is a related and worse problem. Because this killed
> > > socket has tp->packets_out = 0, the next time the RTO timer fires,
> Zeroing all inflights stats in tcp_write_queue_purge still makes sense
> to me. Why will the RTO timer still fire if packets_out is zeroed?

By definition, tcp_write_queue_purge() must only happen when the
socket reaches a final state.

No further transmit is possible, since purging breaks a major TCP
principle (stream mode: data queued by sendmsg() cannot be zapped).

tcp_write_timer_handler() immediately returns if the final state is reached.

if (((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)) ..
    return;

Also look at INET_CSK_CLEAR_TIMERS if you want to know why the
retransmit timer can fire.
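
To make that state gate concrete, here is a standalone model; the TCP_*
values match include/net/tcp_states.h and TCPF_* are 1 << TCP_*, but
timer_runs() is only a sketch of the check, not the kernel handler:

---
/* Userspace model of the tcp_write_timer_handler() state gate. */
#include <stdio.h>

enum {
	TCP_ESTABLISHED = 1,
	TCP_FIN_WAIT1   = 4,
	TCP_CLOSE       = 7,
	TCP_LISTEN      = 10,
};
#define TCPF_CLOSE  (1 << TCP_CLOSE)
#define TCPF_LISTEN (1 << TCP_LISTEN)

static int timer_runs(int sk_state)
{
	if ((1 << sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return 0; /* final state reached: handler returns at once */
	return 1;
}

int main(void)
{
	printf("TCP_FIN_WAIT1: handler runs = %d\n", timer_runs(TCP_FIN_WAIT1));
	printf("TCP_CLOSE:     handler runs = %d\n", timer_runs(TCP_CLOSE));
	return 0;
}
---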
Youngmin Nam March 1, 2025, 5:37 a.m. UTC | #41
On Tue, Feb 25, 2025 at 12:24:47PM -0500, Neal Cardwell wrote:
> On Mon, Feb 24, 2025 at 4:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Feb 3, 2025 at 12:17 AM Youngmin Nam <youngmin.nam@samsung.com> wrote:
> > >
> > > > Hi Neal,
> > > > Thank you for looking into this issue.
> > > > When we first encountered this issue, we also suspected that tcp_write_queue_purge() was being called.
> > > > We can provide any information you would like to inspect.
> >
> > Thanks again for raising this issue, and providing all that data!
> >
> > I've come up with a reproducer for this issue, and an explanation for
> > why this has only been seen on Android so far, and a theory about a
> > related socket leak issue, and a proposed fix for the WARN and the
> > socket leak.
> >
> > Here is the scenario:
> >
> > + user process A has a socket in TCP_ESTABLISHED
> >
> > + user process A calls close(fd)
> >
> > + socket calls __tcp_close() and tcp_close_state() decides to enter
> > TCP_FIN_WAIT1 and send a FIN
> >
> > + FIN is lost and retransmitted, making the state:
> > ---
> >  tp->packets_out = 1
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > + someone invokes "ss" to --kill the socket using the functionality in
> > (c1e64e298b8c "net: diag: Support destroying TCP sockets")
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c1e64e298b8cad309091b95d8436a0255c84f54a
> >
> >  (note: this was added for Android, so would not be surprising to have
> > this inet_diag --kill run on Android)
> >
> > + the ss --kill causes a call to tcp_abort()
> >
> > + tcp_abort() calls tcp_write_queue_purge()
> >
> > + tcp_write_queue_purge() sets packets_out=0 but leaves lost_out=1,
> > retrans_out=1
> >
> > + tcp_sock still exists in TCP_FIN_WAIT1 but now with an inconsistent state
> >
> > + ACK arrives and causes a WARN_ON from tcp_verify_left_out():
> >
> > #define tcp_verify_left_out(tp) WARN_ON(tcp_left_out(tp) > tp->packets_out)
> >
> > because the state has:
> >
> >  ---
> >  tcp_left_out(tp) = sacked_out + lost_out = 1
> >   tp->packets_out = 0
> > ---
> >
> > because the state is:
> >
> > ---
> >  tp->packets_out = 0
> >  tp->sacked_out = 0
> >  tp->lost_out = 1
> >  tp->retrans_out = 1
> > ---
> >
> > I guess perhaps one fix would be to just have tcp_write_queue_purge()
> > zero out those other fields:
> >
> > ---
> >  tp->sacked_out = 0
> >  tp->lost_out = 0
> >  tp->retrans_out = 0
> > ---
> >
> > However, there is a related and worse problem. Because this killed
> > socket has tp->packets_out = 0, the next time the RTO timer fires,
> > tcp_retransmit_timer() notices !tp->packets_out is true, so it short
> > circuits and returns without setting another RTO timer or checking to
> > see if the socket should be deleted. So the tcp_sock is now sitting in
> > memory with no timer set to delete it. So we could leak a socket this
> > way. So AFAICT to fix this socket leak problem, perhaps we want a
> > patch like the following (not tested yet), so that we delete all
> > killed sockets immediately, whether they are SOCK_DEAD (orphans for
> > which the user already called close() or not) :
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 28cf19317b6c2..a266078b8ec8c 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -5563,15 +5563,12 @@ int tcp_abort(struct sock *sk, int err)
> >         local_bh_disable();
> >         bh_lock_sock(sk);
> >
> > -       if (!sock_flag(sk, SOCK_DEAD)) {
> > -               if (tcp_need_reset(sk->sk_state))
> > -                       tcp_send_active_reset(sk, GFP_ATOMIC);
> > -               tcp_done_with_error(sk, err);
> > -       }
> > +       if (tcp_need_reset(sk->sk_state))
> > +               tcp_send_active_reset(sk, GFP_ATOMIC);
> > +       tcp_done_with_error(sk, err);
> >
> >         bh_unlock_sock(sk);
> >         local_bh_enable();
> > -       tcp_write_queue_purge(sk);
> >         release_sock(sk);
> >         return 0;
> >  }
> > ---
> 
> Actually, it seems like a similar fix was already merged into Linux v6.11:
> 
> bac76cf89816b tcp: fix forever orphan socket caused by tcp_abort
> 
> Details below.
> 
> Youngmin, does your kernel have this bac76cf89816b fix? If not, can
> you please cherry-pick this fix and retest?
> 
> Thanks!
> neal

Hi Neal.

Thank you for your effort in debugging this issue with me.
I also appreciate the detailed explanation and the pointer to the related patch.

Our kernel (an Android kernel based on 6.6 LTS) does not have the patch you mentioned (bac76cf89816b).

I'll let you know the test results after applying the patch.

Thank you.

> 
> ps: details for bac76cf89816b:
> 
> commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4
> Author: Xueming Feng <kuro@kuroa.me>
> Date:   Mon Aug 26 18:23:27 2024 +0800
> 
>     tcp: fix forever orphan socket caused by tcp_abort
> 
>     We have some problem closing zero-window fin-wait-1 tcp sockets in our
>     environment. This patch come from the investigation.
> 
>     Previously tcp_abort only sends out reset and calls tcp_done when the
>     socket is not SOCK_DEAD, aka orphan. For orphan socket, it will only
>     purging the write queue, but not close the socket and left it to the
>     timer.
> 
>     While purging the write queue, tp->packets_out and sk->sk_write_queue
>     is cleared along the way. However tcp_retransmit_timer have early
>     return based on !tp->packets_out and tcp_probe_timer have early
>     return based on !sk->sk_write_queue.
> 
>     This caused ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 not being resched
>     and socket not being killed by the timers, converting a zero-windowed
>     orphan into a forever orphan.
> 
>     This patch removes the SOCK_DEAD check in tcp_abort, making it send
>     reset to peer and close the socket accordingly. Preventing the
>     timer-less orphan from happening.
> 
>     According to Lorenzo's email in the v1 thread, the check was there to
>     prevent force-closing the same socket twice. That situation is handled
>     by testing for TCP_CLOSE inside lock, and returning -ENOENT if it is
>     already closed.
> 
>     The -ENOENT code comes from the associate patch Lorenzo made for
>     iproute2-ss; link attached below, which also conform to RFC 9293.
> 
>     At the end of the patch, tcp_write_queue_purge(sk) is removed because it
>     was already called in tcp_done_with_error().
> 
>     p.s. This is the same patch with v2. Resent due to mis-labeled "changes
>     requested" on patchwork.kernel.org.
> 
>     Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
>     Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
>     Signed-off-by: Xueming Feng <kuro@kuroa.me>
>     Tested-by: Lorenzo Colitti <lorenzo@google.com>
>     Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
>     Reviewed-by: Eric Dumazet <edumazet@google.com>
>     Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
>     Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e03a342c9162b..831a18dc7aa6d 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4637,6 +4637,13 @@ int tcp_abort(struct sock *sk, int err)
>                 /* Don't race with userspace socket closes such as tcp_close. */
>                 lock_sock(sk);
> 
> +       /* Avoid closing the same socket twice. */
> +       if (sk->sk_state == TCP_CLOSE) {
> +               if (!has_current_bpf_ctx())
> +                       release_sock(sk);
> +               return -ENOENT;
> +       }
> +
>         if (sk->sk_state == TCP_LISTEN) {
>                 tcp_set_state(sk, TCP_CLOSE);
>                 inet_csk_listen_stop(sk);
> @@ -4646,16 +4653,13 @@ int tcp_abort(struct sock *sk, int err)
>         local_bh_disable();
>         bh_lock_sock(sk);
> 
> -       if (!sock_flag(sk, SOCK_DEAD)) {
> -               if (tcp_need_reset(sk->sk_state))
> -                       tcp_send_active_reset(sk, GFP_ATOMIC,
> -                                             SK_RST_REASON_NOT_SPECIFIED);
> -               tcp_done_with_error(sk, err);
> -       }
> +       if (tcp_need_reset(sk->sk_state))
> +               tcp_send_active_reset(sk, GFP_ATOMIC,
> +                                     SK_RST_REASON_NOT_SPECIFIED);
> +       tcp_done_with_error(sk, err);
> 
>         bh_unlock_sock(sk);
>         local_bh_enable();
> -       tcp_write_queue_purge(sk);
>         if (!has_current_bpf_ctx())
>                 release_sock(sk);
>         return 0;
>
diff mbox series

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5bdf13ac26ef..62f4c285ab80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2037,7 +2037,8 @@  tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	WARN_ON((int)tp->sacked_out < 0);
 	WARN_ON((int)tp->lost_out < 0);
 	WARN_ON((int)tp->retrans_out < 0);
-	WARN_ON((int)tcp_packets_in_flight(tp) < 0);
+	if (sk->sk_state == TCP_ESTABLISHED)
+		WARN_ON((int)tcp_packets_in_flight(tp) < 0);
 #endif
 	return state->flag;
 }
@@ -3080,7 +3081,8 @@  static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		return;
 
 	/* C. Check consistency of the current state. */
-	tcp_verify_left_out(tp);
+	if (sk->sk_state == TCP_ESTABLISHED)
+		tcp_verify_left_out(tp);
 
 	/* D. Check state exit conditions. State can be terminated
 	 *    when high_seq is ACKed. */