[bpf-next] tcp: fix sock skb accounting in tcp_read_skb()

Message ID	20220709222029.297471-1-xiyou.wangcong@gmail.com (mailing list archive)
State	Changes Requested
Delegated to:	BPF
Headers	show Return-Path: <netdev-owner@kernel.org> From: Cong Wang <xiyou.wangcong@gmail.com> To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, Cong Wang <cong.wang@bytedance.com>, syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com, Stanislav Fomichev <sdf@google.com>, Eric Dumazet <edumazet@google.com>, John Fastabend <john.fastabend@gmail.com> Subject: [Patch bpf-next] tcp: fix sock skb accounting in tcp_read_skb() Date: Sat, 9 Jul 2022 15:20:29 -0700 Message-Id: <20220709222029.297471-1-xiyou.wangcong@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[bpf-next] tcp: fix sock skb accounting in tcp_read_skb() \| expand [bpf-next] tcp: fix sock skb accounting in tcp_read_skb()

Context	Check	Description
netdev/tree_selection	success	Clearly marked for bpf-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers	fail	1 blamed authors not CCed: daniel@iogearbox.net; 12 maintainers not CCed: daniel@iogearbox.net songliubraving@fb.com ast@kernel.org pabeni@redhat.com kuba@kernel.org davem@davemloft.net yoshfuji@linux-ipv6.org yhs@fb.com kafai@fb.com dsahern@kernel.org andrii@kernel.org kpsingh@kernel.org
netdev/build_clang	success	Errors and warnings before: 6 this patch: 6
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	success	Errors and warnings before: 2 this patch: 2
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 7 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-PR	success	PR summary
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for Kernel LATEST on ubuntu-latest with llvm-15

Cong Wang July 9, 2022, 10:20 p.m. UTC

From: Cong Wang <cong.wang@bytedance.com>

Before commit 965b57b469a5 ("net: Introduce a new proto_ops
->read_skb()"), skb was not dequeued from receive queue hence
when we close TCP socket skb can be just flushed synchronously.

After this commit, we have to uncharge skb immediately after being
dequeued, otherwise it is still charged in the original sock. And we
still need to retain skb->sk, as eBPF programs may extract sock
information from skb->sk. Therefore, we have to call
skb_set_owner_sk_safe() here.

Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
Tested-by: Stanislav Fomichev <sdf@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/ipv4/tcp.c | 1 +
 1 file changed, 1 insertion(+)

Eric Dumazet July 12, 2022, 1:20 p.m. UTC | #1

On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cong.wang@bytedance.com>
>
> Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> ->read_skb()"), skb was not dequeued from receive queue hence
> when we close TCP socket skb can be just flushed synchronously.
>
> After this commit, we have to uncharge skb immediately after being
> dequeued, otherwise it is still charged in the original sock. And we
> still need to retain skb->sk, as eBPF programs may extract sock
> information from skb->sk. Therefore, we have to call
> skb_set_owner_sk_safe() here.
>
> Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> Tested-by: Stanislav Fomichev <sdf@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/ipv4/tcp.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 9d2fd3ced21b..c6b1effb2afd 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
>                 int used;
>
>                 __skb_unlink(skb, &sk->sk_receive_queue);
> +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
>                 used = recv_actor(sk, skb);
>                 if (used <= 0) {
>                         if (!copied)
> --
> 2.34.1
>

I am reading tcp_read_skb(),it seems to have other bugs.
I wonder why syzbot has not caught up yet.

It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
The reason tcp_read_sock() passes a @len parameter is that is it not
skb->len, but (skb->len - offset)

Also if recv_actor(sk, skb) returns 0, we probably still need to
advance tp->copied_seq,
for instance if skb had a pure FIN (and thus skb->len == 0), since you
removed the skb from sk_receive_queue ?

Cong Wang July 17, 2022, 4:56 p.m. UTC | #2

On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > ->read_skb()"), skb was not dequeued from receive queue hence
> > when we close TCP socket skb can be just flushed synchronously.
> >
> > After this commit, we have to uncharge skb immediately after being
> > dequeued, otherwise it is still charged in the original sock. And we
> > still need to retain skb->sk, as eBPF programs may extract sock
> > information from skb->sk. Therefore, we have to call
> > skb_set_owner_sk_safe() here.
> >
> > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> > Tested-by: Stanislav Fomichev <sdf@google.com>
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  net/ipv4/tcp.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 9d2fd3ced21b..c6b1effb2afd 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> >                 int used;
> >
> >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> >                 used = recv_actor(sk, skb);
> >                 if (used <= 0) {
> >                         if (!copied)
> > --
> > 2.34.1
> >
> 
> I am reading tcp_read_skb(),it seems to have other bugs.
> I wonder why syzbot has not caught up yet.

As you mentioned this here I assume you suggest I should fix all bugs in
one patch? (I am fine either way in this case, only slightly prefer to fix
one bug in each patch for readability.)

> 
> It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> The reason tcp_read_sock() passes a @len parameter is that is it not
> skb->len, but (skb->len - offset)

If I understand tcp_recv_skb() correctly it only returns an offset for a
partial read of an skb. IOW, if we always read an entire skb at a time,
offset makes no sense here, right?

> 
> Also if recv_actor(sk, skb) returns 0, we probably still need to
> advance tp->copied_seq,
> for instance if skb had a pure FIN (and thus skb->len == 0), since you
> removed the skb from sk_receive_queue ?

Doesn't the following code handle this case?

        if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
                consume_skb(skb);
                ++seq;
                break;
        }

which is copied from tcp_read_sock()...

Thanks.

Eric Dumazet July 18, 2022, 7:26 a.m. UTC | #3

On Sun, Jul 17, 2022 at 6:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> > On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > > ->read_skb()"), skb was not dequeued from receive queue hence
> > > when we close TCP socket skb can be just flushed synchronously.
> > >
> > > After this commit, we have to uncharge skb immediately after being
> > > dequeued, otherwise it is still charged in the original sock. And we
> > > still need to retain skb->sk, as eBPF programs may extract sock
> > > information from skb->sk. Therefore, we have to call
> > > skb_set_owner_sk_safe() here.
> > >
> > > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> > > Tested-by: Stanislav Fomichev <sdf@google.com>
> > > Cc: Eric Dumazet <edumazet@google.com>
> > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > ---
> > >  net/ipv4/tcp.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > index 9d2fd3ced21b..c6b1effb2afd 100644
> > > --- a/net/ipv4/tcp.c
> > > +++ b/net/ipv4/tcp.c
> > > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > >                 int used;
> > >
> > >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> > >                 used = recv_actor(sk, skb);
> > >                 if (used <= 0) {
> > >                         if (!copied)
> > > --
> > > 2.34.1
> > >
> >
> > I am reading tcp_read_skb(),it seems to have other bugs.
> > I wonder why syzbot has not caught up yet.
>
> As you mentioned this here I assume you suggest I should fix all bugs in
> one patch? (I am fine either way in this case, only slightly prefer to fix
> one bug in each patch for readability.)

I only wonder that after fixing all bugs, we might end up with  tcp_read_sk()
being a clone of tcp_read_sock() :/

syzbot has a relevant report:

------------[ cut here ]------------
cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
Modules linked in:
CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 06/29/2022
Workqueue: events nsim_dev_trap_report_work
RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
RSP: 0018:ffffc90000007700 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:461 [inline]
ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
__netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
__netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
__napi_poll+0xb3/0x6e0 net/core/dev.c:6506
napi_poll net/core/dev.c:6573 [inline]
net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
__do_softirq+0x29b/0x9c2 kernel/softirq.c:571
do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
</IRQ>
<TASK>
do_softirq kernel/softirq.c:464 [inline]
__local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
spin_unlock_bh include/linux/spinlock.h:394 [inline]
nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
process_one_work+0x996/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e9/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
</TASK>------------[ cut here ]------------
cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
Modules linked in:
CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 06/29/2022
Workqueue: events nsim_dev_trap_report_work
RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
RSP: 0018:ffffc90000007700 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
dst_input include/net/dst.h:461 [inline]
ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
NF_HOOK include/linux/netfilter.h:307 [inline]
NF_HOOK include/linux/netfilter.h:301 [inline]
ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
__netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
__netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
__napi_poll+0xb3/0x6e0 net/core/dev.c:6506
napi_poll net/core/dev.c:6573 [inline]
net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
__do_softirq+0x29b/0x9c2 kernel/softirq.c:571
do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
</IRQ>
<TASK>
do_softirq kernel/softirq.c:464 [inline]
__local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
spin_unlock_bh include/linux/spinlock.h:394 [inline]
nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
process_one_work+0x996/0x1610 kernel/workqueue.c:2289
worker_thread+0x665/0x1080 kernel/workqueue.c:2436
kthread+0x2e9/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
</TASK>

>
> >
> > It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> > The reason tcp_read_sock() passes a @len parameter is that is it not
> > skb->len, but (skb->len - offset)
>
> If I understand tcp_recv_skb() correctly it only returns an offset for a
> partial read of an skb. IOW, if we always read an entire skb at a time,
> offset makes no sense here, right?
>
> >
> > Also if recv_actor(sk, skb) returns 0, we probably still need to
> > advance tp->copied_seq,
> > for instance if skb had a pure FIN (and thus skb->len == 0), since you
> > removed the skb from sk_receive_queue ?
>
> Doesn't the following code handle this case?
>
>         if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
>                 consume_skb(skb);
>                 ++seq;
>                 break;
>         }
>
> which is copied from tcp_read_sock()...

I do not think this is enough, because you can break from the loop
before doing this  check about TCPHDR_FIN,
after skb has been unlinked from sk_receive_queue. TCP won't be able
to catch FIN.

__skb_unlink(skb, &sk->sk_receive_queue);
used = recv_actor(sk, skb);
if (used <= 0) {
    if (!copied)
        copied = used;
    break;                         /// HERE ///
}
seq += used;
copied += used;

if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {



>
> Thanks.

Cong Wang July 24, 2022, 5:59 p.m. UTC | #4

On Mon, Jul 18, 2022 at 09:26:29AM +0200, Eric Dumazet wrote:
> On Sun, Jul 17, 2022 at 6:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> > > On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > From: Cong Wang <cong.wang@bytedance.com>
> > > >
> > > > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > > > ->read_skb()"), skb was not dequeued from receive queue hence
> > > > when we close TCP socket skb can be just flushed synchronously.
> > > >
> > > > After this commit, we have to uncharge skb immediately after being
> > > > dequeued, otherwise it is still charged in the original sock. And we
> > > > still need to retain skb->sk, as eBPF programs may extract sock
> > > > information from skb->sk. Therefore, we have to call
> > > > skb_set_owner_sk_safe() here.
> > > >
> > > > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > > > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> > > > Tested-by: Stanislav Fomichev <sdf@google.com>
> > > > Cc: Eric Dumazet <edumazet@google.com>
> > > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > > ---
> > > >  net/ipv4/tcp.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > index 9d2fd3ced21b..c6b1effb2afd 100644
> > > > --- a/net/ipv4/tcp.c
> > > > +++ b/net/ipv4/tcp.c
> > > > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > > >                 int used;
> > > >
> > > >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > > > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> > > >                 used = recv_actor(sk, skb);
> > > >                 if (used <= 0) {
> > > >                         if (!copied)
> > > > --
> > > > 2.34.1
> > > >
> > >
> > > I am reading tcp_read_skb(),it seems to have other bugs.
> > > I wonder why syzbot has not caught up yet.
> >
> > As you mentioned this here I assume you suggest I should fix all bugs in
> > one patch? (I am fine either way in this case, only slightly prefer to fix
> > one bug in each patch for readability.)
> 
> I only wonder that after fixing all bugs, we might end up with  tcp_read_sk()
> being a clone of tcp_read_sock() :/

I really wish so, but unfortunately the partial read looks impossible to
merged with full skb read.


> 
> syzbot has a relevant report:
> 

Please provide a reproducer if you have, I don't see this report
anywhere (except here of course).

> ------------[ cut here ]------------
> cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> Modules linked in:
> CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 06/29/2022
> Workqueue: events nsim_dev_trap_report_work
> RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <IRQ>
> tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> NF_HOOK include/linux/netfilter.h:307 [inline]
> NF_HOOK include/linux/netfilter.h:301 [inline]
> ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> dst_input include/net/dst.h:461 [inline]
> ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> NF_HOOK include/linux/netfilter.h:307 [inline]
> NF_HOOK include/linux/netfilter.h:301 [inline]
> ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> napi_poll net/core/dev.c:6573 [inline]
> net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> </IRQ>
> <TASK>
> do_softirq kernel/softirq.c:464 [inline]
> __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> spin_unlock_bh include/linux/spinlock.h:394 [inline]
> nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
> process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> kthread+0x2e9/0x3a0 kernel/kthread.c:376
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> </TASK>------------[ cut here ]------------
> cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> Modules linked in:
> CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 06/29/2022
> Workqueue: events nsim_dev_trap_report_work
> RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <IRQ>
> tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> NF_HOOK include/linux/netfilter.h:307 [inline]
> NF_HOOK include/linux/netfilter.h:301 [inline]
> ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> dst_input include/net/dst.h:461 [inline]
> ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> NF_HOOK include/linux/netfilter.h:307 [inline]
> NF_HOOK include/linux/netfilter.h:301 [inline]
> ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> napi_poll net/core/dev.c:6573 [inline]
> net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> </IRQ>
> <TASK>
> do_softirq kernel/softirq.c:464 [inline]
> __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> spin_unlock_bh include/linux/spinlock.h:394 [inline]
> nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
> process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> kthread+0x2e9/0x3a0 kernel/kthread.c:376
> ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> </TASK>
> 
> >
> > >
> > > It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> > > The reason tcp_read_sock() passes a @len parameter is that is it not
> > > skb->len, but (skb->len - offset)
> >
> > If I understand tcp_recv_skb() correctly it only returns an offset for a
> > partial read of an skb. IOW, if we always read an entire skb at a time,
> > offset makes no sense here, right?
> >
> > >
> > > Also if recv_actor(sk, skb) returns 0, we probably still need to
> > > advance tp->copied_seq,
> > > for instance if skb had a pure FIN (and thus skb->len == 0), since you
> > > removed the skb from sk_receive_queue ?
> >
> > Doesn't the following code handle this case?
> >
> >         if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> >                 consume_skb(skb);
> >                 ++seq;
> >                 break;
> >         }
> >
> > which is copied from tcp_read_sock()...
> 
> I do not think this is enough, because you can break from the loop
> before doing this  check about TCPHDR_FIN,

The logic is same for tcp_read_sock(). :)


> after skb has been unlinked from sk_receive_queue. TCP won't be able
> to catch FIN.

So TCP does not process FIN before ->sk_data_ready()? I wonder how FIN
(at least a pure FIN as you mentioned above) ends up being queued in
->sk_receive_queue anyway?

Thanks!

Eric Dumazet July 25, 2022, 8:45 a.m. UTC | #5

On Sun, Jul 24, 2022 at 7:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, Jul 18, 2022 at 09:26:29AM +0200, Eric Dumazet wrote:
> > On Sun, Jul 17, 2022 at 6:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> > > > On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > From: Cong Wang <cong.wang@bytedance.com>
> > > > >
> > > > > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > > > > ->read_skb()"), skb was not dequeued from receive queue hence
> > > > > when we close TCP socket skb can be just flushed synchronously.
> > > > >
> > > > > After this commit, we have to uncharge skb immediately after being
> > > > > dequeued, otherwise it is still charged in the original sock. And we
> > > > > still need to retain skb->sk, as eBPF programs may extract sock
> > > > > information from skb->sk. Therefore, we have to call
> > > > > skb_set_owner_sk_safe() here.
> > > > >
> > > > > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > > > > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> > > > > Tested-by: Stanislav Fomichev <sdf@google.com>
> > > > > Cc: Eric Dumazet <edumazet@google.com>
> > > > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > > > ---
> > > > >  net/ipv4/tcp.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > index 9d2fd3ced21b..c6b1effb2afd 100644
> > > > > --- a/net/ipv4/tcp.c
> > > > > +++ b/net/ipv4/tcp.c
> > > > > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > > > >                 int used;
> > > > >
> > > > >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > > > > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> > > > >                 used = recv_actor(sk, skb);
> > > > >                 if (used <= 0) {
> > > > >                         if (!copied)
> > > > > --
> > > > > 2.34.1
> > > > >
> > > >
> > > > I am reading tcp_read_skb(),it seems to have other bugs.
> > > > I wonder why syzbot has not caught up yet.
> > >
> > > As you mentioned this here I assume you suggest I should fix all bugs in
> > > one patch? (I am fine either way in this case, only slightly prefer to fix
> > > one bug in each patch for readability.)
> >
> > I only wonder that after fixing all bugs, we might end up with  tcp_read_sk()
> > being a clone of tcp_read_sock() :/
>
> I really wish so, but unfortunately the partial read looks impossible to
> merged with full skb read.
>
>
> >
> > syzbot has a relevant report:
> >
>
> Please provide a reproducer if you have, I don't see this report
> anywhere (except here of course).

No repro yet.

I usually hold syzbot report until they have enough signal (repro, and
eventually bisection) in them to be considered.

>
> > ------------[ cut here ]------------
> > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > Modules linked in:
> > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 06/29/2022
> > Workqueue: events nsim_dev_trap_report_work
> > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <IRQ>
> > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > NF_HOOK include/linux/netfilter.h:307 [inline]
> > NF_HOOK include/linux/netfilter.h:301 [inline]
> > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > dst_input include/net/dst.h:461 [inline]
> > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > NF_HOOK include/linux/netfilter.h:307 [inline]
> > NF_HOOK include/linux/netfilter.h:301 [inline]
> > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > napi_poll net/core/dev.c:6573 [inline]
> > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > </IRQ>
> > <TASK>
> > do_softirq kernel/softirq.c:464 [inline]
> > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsi
m/dev.c:840
> > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > </TASK>------------[ cut here ]------------
> > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > Modules linked in:
> > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > BIOS Google 06/29/2022
> > Workqueue: events nsim_dev_trap_report_work
> > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <IRQ>
> > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > NF_HOOK include/linux/netfilter.h:307 [inline]
> > NF_HOOK include/linux/netfilter.h:301 [inline]
> > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > dst_input include/net/dst.h:461 [inline]
> > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > NF_HOOK include/linux/netfilter.h:307 [inline]
> > NF_HOOK include/linux/netfilter.h:301 [inline]
> > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > napi_poll net/core/dev.c:6573 [inline]
> > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > </IRQ>
> > <TASK>
> > do_softirq kernel/softirq.c:464 [inline]
> > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
> > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > </TASK>
> >
> > >
> > > >
> > > > It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> > > > The reason tcp_read_sock() passes a @len parameter is that is it not
> > > > skb->len, but (skb->len - offset)
> > >
> > > If I understand tcp_recv_skb() correctly it only returns an offset for a
> > > partial read of an skb. IOW, if we always read an entire skb at a time,
> > > offset makes no sense here, right?
> > >
> > > >
> > > > Also if recv_actor(sk, skb) returns 0, we probably still need to
> > > > advance tp->copied_seq,
> > > > for instance if skb had a pure FIN (and thus skb->len == 0), since you
> > > > removed the skb from sk_receive_queue ?
> > >
> > > Doesn't the following code handle this case?
> > >
> > >         if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > >                 consume_skb(skb);
> > >                 ++seq;
> > >                 break;
> > >         }
> > >
> > > which is copied from tcp_read_sock()...
> >
> > I do not think this is enough, because you can break from the loop
> > before doing this  check about TCPHDR_FIN,
>
> The logic is same for tcp_read_sock(). :)
>
>
> > after skb has been unlinked from sk_receive_queue. TCP won't be able
> > to catch FIN.
>
> So TCP does not process FIN before ->sk_data_ready()? I wonder how FIN
> (at least a pure FIN as you mentioned above) ends up being queued in
> ->sk_receive_queue anyway?

That's how TCP stores packets, including the final FIN.

(Because this skb can also contain payload anyway)

>
> Thanks!

Cong Wang July 26, 2022, 4:14 p.m. UTC | #6

On Mon, Jul 25, 2022 at 10:45:56AM +0200, Eric Dumazet wrote:
> On Sun, Jul 24, 2022 at 7:59 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Mon, Jul 18, 2022 at 09:26:29AM +0200, Eric Dumazet wrote:
> > > On Sun, Jul 17, 2022 at 6:56 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > On Tue, Jul 12, 2022 at 03:20:37PM +0200, Eric Dumazet wrote:
> > > > > On Sun, Jul 10, 2022 at 12:20 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > > >
> > > > > > From: Cong Wang <cong.wang@bytedance.com>
> > > > > >
> > > > > > Before commit 965b57b469a5 ("net: Introduce a new proto_ops
> > > > > > ->read_skb()"), skb was not dequeued from receive queue hence
> > > > > > when we close TCP socket skb can be just flushed synchronously.
> > > > > >
> > > > > > After this commit, we have to uncharge skb immediately after being
> > > > > > dequeued, otherwise it is still charged in the original sock. And we
> > > > > > still need to retain skb->sk, as eBPF programs may extract sock
> > > > > > information from skb->sk. Therefore, we have to call
> > > > > > skb_set_owner_sk_safe() here.
> > > > > >
> > > > > > Fixes: 965b57b469a5 ("net: Introduce a new proto_ops ->read_skb()")
> > > > > > Reported-and-tested-by: syzbot+a0e6f8738b58f7654417@syzkaller.appspotmail.com
> > > > > > Tested-by: Stanislav Fomichev <sdf@google.com>
> > > > > > Cc: Eric Dumazet <edumazet@google.com>
> > > > > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > > > > > index 9d2fd3ced21b..c6b1effb2afd 100644
> > > > > > --- a/net/ipv4/tcp.c
> > > > > > +++ b/net/ipv4/tcp.c
> > > > > > @@ -1749,6 +1749,7 @@ int tcp_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
> > > > > >                 int used;
> > > > > >
> > > > > >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > > > > > +               WARN_ON(!skb_set_owner_sk_safe(skb, sk));
> > > > > >                 used = recv_actor(sk, skb);
> > > > > >                 if (used <= 0) {
> > > > > >                         if (!copied)
> > > > > > --
> > > > > > 2.34.1
> > > > > >
> > > > >
> > > > > I am reading tcp_read_skb(),it seems to have other bugs.
> > > > > I wonder why syzbot has not caught up yet.
> > > >
> > > > As you mentioned this here I assume you suggest I should fix all bugs in
> > > > one patch? (I am fine either way in this case, only slightly prefer to fix
> > > > one bug in each patch for readability.)
> > >
> > > I only wonder that after fixing all bugs, we might end up with  tcp_read_sk()
> > > being a clone of tcp_read_sock() :/
> >
> > I really wish so, but unfortunately the partial read looks impossible to
> > merged with full skb read.
> >
> >
> > >
> > > syzbot has a relevant report:
> > >
> >
> > Please provide a reproducer if you have, I don't see this report
> > anywhere (except here of course).
> 
> No repro yet.
> 
> I usually hold syzbot report until they have enough signal (repro, and
> eventually bisection) in them to be considered.
> 
> >
> > > ------------[ cut here ]------------
> > > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Modules linked in:
> > > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > BIOS Google 06/29/2022
> > > Workqueue: events nsim_dev_trap_report_work
> > > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <IRQ>
> > > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > > dst_input include/net/dst.h:461 [inline]
> > > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > > napi_poll net/core/dev.c:6573 [inline]
> > > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > > </IRQ>
> > > <TASK>
> > > do_softirq kernel/softirq.c:464 [inline]
> > > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsi
> m/dev.c:840
> > > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > > </TASK>------------[ cut here ]------------
> > > cleanup rbuf bug: copied 301B4426 seq 301B4426 rcvnxt 302142E8
> > > WARNING: CPU: 0 PID: 3744 at net/ipv4/tcp.c:1567
> > > tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Modules linked in:
> > > CPU: 0 PID: 3744 Comm: kworker/0:7 Not tainted
> > > 5.19.0-rc5-syzkaller-01095-gedb2c3476db9 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine,
> > > BIOS Google 06/29/2022
> > > Workqueue: events nsim_dev_trap_report_work
> > > RIP: 0010:tcp_cleanup_rbuf+0x11d/0x5b0 net/ipv4/tcp.c:1567
> > > Code: ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e d7 03 00 00 8b 8d 38
> > > 08 00 00 44 89 e2 44 89 f6 48 c7 c7 20 82 df 8a e8 94 d8 58 01 <0f> 0b
> > > e8 cc 84 a0 f9 48 8d bd 88 07 00 00 48 b8 00 00 00 00 00 fc
> > > RSP: 0018:ffffc90000007700 EFLAGS: 00010282
> > > RAX: 0000000000000000 RBX: 000000000004fef7 RCX: 0000000000000000
> > > RDX: ffff8880201abb00 RSI: ffffffff8160d438 RDI: fffff52000000ed2
> > > RBP: ffff888016819800 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000103 R11: 0000000000000001 R12: 00000000301b4426
> > > R13: 0000000000000000 R14: 00000000301b4426 R15: 00000000301b4426
> > > FS: 0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 0000000020c55000 CR3: 0000000075009000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <IRQ>
> > > tcp_read_skb+0x29e/0x430 net/ipv4/tcp.c:1775
> > > sk_psock_verdict_data_ready+0x9d/0xc0 net/core/skmsg.c:1209
> > > tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4985
> > > tcp_data_queue+0x1bb2/0x4c60 net/ipv4/tcp_input.c:5059
> > > tcp_rcv_established+0x82f/0x20e0 net/ipv4/tcp_input.c:5984
> > > tcp_v4_do_rcv+0x66c/0x9b0 net/ipv4/tcp_ipv4.c:1661
> > > tcp_v4_rcv+0x343b/0x3940 net/ipv4/tcp_ipv4.c:2078
> > > ip_protocol_deliver_rcu+0xa3/0x7c0 net/ipv4/ip_input.c:205
> > > ip_local_deliver_finish+0x2e8/0x4c0 net/ipv4/ip_input.c:233
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_local_deliver+0x1aa/0x200 net/ipv4/ip_input.c:254
> > > dst_input include/net/dst.h:461 [inline]
> > > ip_rcv_finish+0x1cb/0x2f0 net/ipv4/ip_input.c:437
> > > NF_HOOK include/linux/netfilter.h:307 [inline]
> > > NF_HOOK include/linux/netfilter.h:301 [inline]
> > > ip_rcv+0xaa/0xd0 net/ipv4/ip_input.c:557
> > > __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5480
> > > __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5594
> > > process_backlog+0x3a0/0x7c0 net/core/dev.c:5922
> > > __napi_poll+0xb3/0x6e0 net/core/dev.c:6506
> > > napi_poll net/core/dev.c:6573 [inline]
> > > net_rx_action+0x9c1/0xd90 net/core/dev.c:6684
> > > __do_softirq+0x29b/0x9c2 kernel/softirq.c:571
> > > do_softirq.part.0+0xde/0x130 kernel/softirq.c:472
> > > </IRQ>
> > > <TASK>
> > > do_softirq kernel/softirq.c:464 [inline]
> > > __local_bh_enable_ip+0x102/0x120 kernel/softirq.c:396
> > > spin_unlock_bh include/linux/spinlock.h:394 [inline]
> > > nsim_dev_trap_report drivers/net/netdevsim/dev.c:814 [inline]
> > > nsim_dev_trap_report_work+0x84d/0xba0 drivers/net/netdevsim/dev.c:840
> > > process_one_work+0x996/0x1610 kernel/workqueue.c:2289
> > > worker_thread+0x665/0x1080 kernel/workqueue.c:2436
> > > kthread+0x2e9/0x3a0 kernel/kthread.c:376
> > > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302
> > > </TASK>
> > >
> > > >
> > > > >
> > > > > It ignores the offset value from tcp_recv_skb(), this looks wrong to me.
> > > > > The reason tcp_read_sock() passes a @len parameter is that is it not
> > > > > skb->len, but (skb->len - offset)
> > > >
> > > > If I understand tcp_recv_skb() correctly it only returns an offset for a
> > > > partial read of an skb. IOW, if we always read an entire skb at a time,
> > > > offset makes no sense here, right?
> > > >
> > > > >
> > > > > Also if recv_actor(sk, skb) returns 0, we probably still need to
> > > > > advance tp->copied_seq,
> > > > > for instance if skb had a pure FIN (and thus skb->len == 0), since you
> > > > > removed the skb from sk_receive_queue ?
> > > >
> > > > Doesn't the following code handle this case?
> > > >
> > > >         if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > > >                 consume_skb(skb);
> > > >                 ++seq;
> > > >                 break;
> > > >         }
> > > >
> > > > which is copied from tcp_read_sock()...
> > >
> > > I do not think this is enough, because you can break from the loop
> > > before doing this  check about TCPHDR_FIN,
> >
> > The logic is same for tcp_read_sock(). :)
> >
> >
> > > after skb has been unlinked from sk_receive_queue. TCP won't be able
> > > to catch FIN.
> >
> > So TCP does not process FIN before ->sk_data_ready()? I wonder how FIN
> > (at least a pure FIN as you mentioned above) ends up being queued in
> > ->sk_receive_queue anyway?
> 
> That's how TCP stores packets, including the final FIN.
> 
> (Because this skb can also contain payload anyway)

If TCP really wants to queue a FIN with skb->len==0, then we have to
adjust the return value for recv_actor(), because we currently use 0 as
an error too (meaning no data is consumed):

        if (sk_psock_verdict_apply(psock, skb, ret) < 0)
                len = 0;  // here!
out:
        rcu_read_unlock();
        return len;


BTW, what is wrong if we simply drop it before queueing to
sk_receive_queue in TCP? Is it there just for collapse?

Thanks.

Eric Dumazet July 26, 2022, 4:47 p.m. UTC | #7

On Tue, Jul 26, 2022 at 6:14 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> If TCP really wants to queue a FIN with skb->len==0, then we have to
> adjust the return value for recv_actor(), because we currently use 0 as
> an error too (meaning no data is consumed):
>
>         if (sk_psock_verdict_apply(psock, skb, ret) < 0)
>                 len = 0;  // here!
> out:
>         rcu_read_unlock();
>         return len;
>
>
> BTW, what is wrong if we simply drop it before queueing to
> sk_receive_queue in TCP? Is it there just for collapse?

Because an incoming segment can have payload and FIN.

The consumer will need to consume the payload before FIN is considered/consumed,
with the complication of MSG_PEEK ...

Right after tcp_read_skb() removes the skb from sk_receive_queue,
we need to update TCP state, regardless of recv_actor().

Maybe like that:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ba2bdc81137490bd1748cde95789f8d2bff3ab0f..6e2c11cd921872e406baffc475c9870e147578a1
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1759,20 +1759,15 @@ int tcp_read_skb(struct sock *sk,
skb_read_actor_t recv_actor)
                int used;

                __skb_unlink(skb, &sk->sk_receive_queue);
+               seq = TCP_SKB_CB(skb)->end_seq;
                used = recv_actor(sk, skb);
                if (used <= 0) {
                        if (!copied)
                                copied = used;
                        break;
                }
-               seq += used;
                copied += used;

-               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
-                       consume_skb(skb);
-                       ++seq;
-                       break;
-               }
                consume_skb(skb);
                break;
        }

Eric Dumazet July 26, 2022, 4:49 p.m. UTC | #8

On Tue, Jul 26, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jul 26, 2022 at 6:14 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > If TCP really wants to queue a FIN with skb->len==0, then we have to
> > adjust the return value for recv_actor(), because we currently use 0 as
> > an error too (meaning no data is consumed):
> >
> >         if (sk_psock_verdict_apply(psock, skb, ret) < 0)
> >                 len = 0;  // here!
> > out:
> >         rcu_read_unlock();
> >         return len;
> >
> >
> > BTW, what is wrong if we simply drop it before queueing to
> > sk_receive_queue in TCP? Is it there just for collapse?
>
> Because an incoming segment can have payload and FIN.
>
> The consumer will need to consume the payload before FIN is considered/consumed,
> with the complication of MSG_PEEK ...
>
> Right after tcp_read_skb() removes the skb from sk_receive_queue,
> we need to update TCP state, regardless of recv_actor().
>
> Maybe like that:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ba2bdc81137490bd1748cde95789f8d2bff3ab0f..6e2c11cd921872e406baffc475c9870e147578a1
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1759,20 +1759,15 @@ int tcp_read_skb(struct sock *sk,
> skb_read_actor_t recv_actor)
>                 int used;
>
>                 __skb_unlink(skb, &sk->sk_receive_queue);
> +               seq = TCP_SKB_CB(skb)->end_seq;
>                 used = recv_actor(sk, skb);
>                 if (used <= 0) {
>                         if (!copied)
>                                 copied = used;
>                         break;
>                 }
> -               seq += used;
>                 copied += used;
>
> -               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> -                       consume_skb(skb);
> -                       ++seq;
> -                       break;
> -               }
>                 consume_skb(skb);
>                 break;
>         }

Or even better:

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ba2bdc81137490bd1748cde95789f8d2bff3ab0f..66c187a2592c042565211565adb3f40a811dfd7d
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1759,21 +1759,15 @@ int tcp_read_skb(struct sock *sk,
skb_read_actor_t recv_actor)
                int used;

                __skb_unlink(skb, &sk->sk_receive_queue);
+               seq = TCP_SKB_CB(skb)->end_seq;
                used = recv_actor(sk, skb);
+               consume_skb(skb);
                if (used <= 0) {
                        if (!copied)
                                copied = used;
                        break;
                }
-               seq += used;
                copied += used;
-
-               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
-                       consume_skb(skb);
-                       ++seq;
-                       break;
-               }
-               consume_skb(skb);
                break;
        }
        WRITE_ONCE(tp->copied_seq, seq);

Eric Dumazet July 26, 2022, 4:53 p.m. UTC | #9

On Tue, Jul 26, 2022 at 6:49 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jul 26, 2022 at 6:47 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Jul 26, 2022 at 6:14 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > > If TCP really wants to queue a FIN with skb->len==0, then we have to
> > > adjust the return value for recv_actor(), because we currently use 0 as
> > > an error too (meaning no data is consumed):
> > >
> > >         if (sk_psock_verdict_apply(psock, skb, ret) < 0)
> > >                 len = 0;  // here!
> > > out:
> > >         rcu_read_unlock();
> > >         return len;
> > >
> > >
> > > BTW, what is wrong if we simply drop it before queueing to
> > > sk_receive_queue in TCP? Is it there just for collapse?
> >
> > Because an incoming segment can have payload and FIN.
> >
> > The consumer will need to consume the payload before FIN is considered/consumed,
> > with the complication of MSG_PEEK ...
> >
> > Right after tcp_read_skb() removes the skb from sk_receive_queue,
> > we need to update TCP state, regardless of recv_actor().
> >
> > Maybe like that:
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index ba2bdc81137490bd1748cde95789f8d2bff3ab0f..6e2c11cd921872e406baffc475c9870e147578a1
> > 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -1759,20 +1759,15 @@ int tcp_read_skb(struct sock *sk,
> > skb_read_actor_t recv_actor)
> >                 int used;
> >
> >                 __skb_unlink(skb, &sk->sk_receive_queue);
> > +               seq = TCP_SKB_CB(skb)->end_seq;
> >                 used = recv_actor(sk, skb);
> >                 if (used <= 0) {
> >                         if (!copied)
> >                                 copied = used;
> >                         break;
> >                 }
> > -               seq += used;
> >                 copied += used;
> >
> > -               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> > -                       consume_skb(skb);
> > -                       ++seq;
> > -                       break;
> > -               }
> >                 consume_skb(skb);
> >                 break;
> >         }
>
> Or even better:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index ba2bdc81137490bd1748cde95789f8d2bff3ab0f..66c187a2592c042565211565adb3f40a811dfd7d
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1759,21 +1759,15 @@ int tcp_read_skb(struct sock *sk,
> skb_read_actor_t recv_actor)
>                 int used;
>
>                 __skb_unlink(skb, &sk->sk_receive_queue);
> +               seq = TCP_SKB_CB(skb)->end_seq;
>                 used = recv_actor(sk, skb);
> +               consume_skb(skb);
>                 if (used <= 0) {
>                         if (!copied)
>                                 copied = used;
>                         break;
>                 }
> -               seq += used;
>                 copied += used;
> -
> -               if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) {
> -                       consume_skb(skb);
> -                       ++seq;
> -                       break;
> -               }
> -               consume_skb(skb);
>                 break;
>         }
>         WRITE_ONCE(tp->copied_seq, seq);

Note that this code will still not behave properly if we have in
receive queues two skbs of 1000 bytes of payload like:

seq 1:1001
seq 501:1501

tcp_recvmsg() would copy 1000 bytes from the first skb, then 500 bytes
from second skb.

[bpf-next] tcp: fix sock skb accounting in tcp_read_skb()

Checks

Commit Message

Comments

Patch