diff mbox series

[bpf,v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

Message ID 20210701061656.34150-1-xiyou.wangcong@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf,v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers warning 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org andrii@kernel.org kafai@fb.com ast@kernel.org songliubraving@fb.com davem@davemloft.net kuba@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: line length of 102 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success Link

Commit Message

Cong Wang July 1, 2021, 6:16 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

Jiang observed frequent OOMs when testing our AF_UNIX/UDP
proxy. This is due to the fact that we do not actually limit
the socket memory before queueing an skb to ingress_skb. We
charge the skb memory later when handling the psock backlog,
but that is not limited either.

This patch adds a check against sk->sk_rcvbuf right before
queuing to ingress_skb and drops packets if this limit is
exceeded. This is very similar to the UDP receive path.
Ideally we should set the skb owner before this check too,
but it is hard to make TCP happy about sk_forward_alloc.

Reported-by: Jiang Wang <jiang.wang@bytedance.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/core/skmsg.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Jakub Sitnicki July 1, 2021, 3:56 p.m. UTC | #1
On Thu, Jul 01, 2021 at 08:16 AM CEST, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> Jiang observed OOM frequently when testing our AF_UNIX/UDP
> proxy. This is due to the fact that we do not actually limit
> the socket memory before queueing skb to ingress_skb. We
> charge the skb memory later when handling the psock backlog,
> but it is not limited either.
>
> This patch adds checks for sk->sk_rcvbuf right before queuing
> to ingress_skb and drops packets if this limit exceeds. This
> is very similar to UDP receive path. Ideally we should set the
> skb owner before this check too, but it is hard to make TCP
> happy about sk_forward_alloc.
>
> Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>

By saying that it is hard to make TCP happy about sk_forward_alloc, I'm
guessing you're referring to problems described in 144748eb0c44 ("bpf,
sockmap: Fix incorrect fwd_alloc accounting") [1]?

Thanks for the fix.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=144748eb0c445091466c9b741ebd0bfcc5914f3d

[...]
John Fastabend July 1, 2021, 4:23 p.m. UTC | #2
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Jiang observed OOM frequently when testing our AF_UNIX/UDP
> proxy. This is due to the fact that we do not actually limit
> the socket memory before queueing skb to ingress_skb. We
> charge the skb memory later when handling the psock backlog,
> but it is not limited either.

Right, it's not limiting, but charging it should push back on
the stack so it stops feeding skbs to us. Maybe this doesn't
happen on the UDP side?

> 
> This patch adds checks for sk->sk_rcvbuf right before queuing
> to ingress_skb and drops packets if this limit exceeds. This
> is very similar to UDP receive path. Ideally we should set the
> skb owner before this check too, but it is hard to make TCP
> happy about sk_forward_alloc.

But it breaks TCP side see below.

> 
> Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/core/skmsg.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9b6160a191f8..a5185c781332 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>  		return -EIO;
>  	}
>  	spin_lock_bh(&psock_other->ingress_lock);
> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>  		spin_unlock_bh(&psock_other->ingress_lock);
>  		skb_bpf_redirect_clear(skb);
>  		sock_drop(from->sk, skb);
> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>  		}
>  		if (err < 0) {
>  			spin_lock_bh(&psock->ingress_lock);
> -			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> +			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> +			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>  				skb_queue_tail(&psock->ingress_skb, skb);

We can't just drop the packet in the memory overrun case here. This will
break TCP because the data will be gone and no one will retransmit.

That's why, in the current scheme, on redirect we can push back
when we move it to the other socket's ingress message queue or
redirect into the other socket via send.

At one point I considered charging the data sitting in
ingress_skb. Would that solve the problem here? I think it would
cause the enqueue on the UDP side to start dropping packets from
__udp_enqueue_schedule_skb().

>  				schedule_work(&psock->work);
>  				err = 0;
> -- 
> 2.27.0
>
John Fastabend July 1, 2021, 4:26 p.m. UTC | #3
Jakub Sitnicki wrote:
> On Thu, Jul 01, 2021 at 08:16 AM CEST, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > but it is not limited either.
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops packets if this limit exceeds. This
> > is very similar to UDP receive path. Ideally we should set the
> > skb owner before this check too, but it is hard to make TCP
> > happy about sk_forward_alloc.
> >
> > Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> 
> Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
> 
> By saying that it is hard to make TCP happy about sk_forward_alloc, I'm
> guessing you're referring to problems described in 144748eb0c44 ("bpf,
> sockmap: Fix incorrect fwd_alloc accounting") [1]?

I have a couple of fixes on my stack here that I'm testing, which
clean up the teardown logic. Once that's in place, maybe it's as
simple as adding the owner_r bits and calling the destructor to
ensure memory accounting happens earlier, so these ingress_skb
packets are accounted for.

I'll flush those out today; maybe it will be clear then.

> 
> Thanks for the fix.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=144748eb0c445091466c9b741ebd0bfcc5914f3d
> 
> [...]
Cong Wang July 1, 2021, 6 p.m. UTC | #4
On Thu, Jul 1, 2021 at 9:23 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > but it is not limited either.
>
> Right, its not limiting but charging it should push back on
> the stack so it stops feeding skbs to us. Maybe this doesn't
> happen in UDP side?

The OOM is due to skb queued in ingress_skb, not due to
user-space consuming skb slowly.

>
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops packets if this limit exceeds. This
> > is very similar to UDP receive path. Ideally we should set the
> > skb owner before this check too, but it is hard to make TCP
> > happy about sk_forward_alloc.
>
> But it breaks TCP side see below.
>
> >
> > Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  net/core/skmsg.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 9b6160a191f8..a5185c781332 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
> >               return -EIO;
> >       }
> >       spin_lock_bh(&psock_other->ingress_lock);
> > -     if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> > +     if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> > +         atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
> >               spin_unlock_bh(&psock_other->ingress_lock);
> >               skb_bpf_redirect_clear(skb);
> >               sock_drop(from->sk, skb);
> > @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
> >               }
> >               if (err < 0) {
> >                       spin_lock_bh(&psock->ingress_lock);
> > -                     if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> > +                     if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> > +                         atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
> >                               skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.
>
> Thats why in the current scheme on redirect we can push back when we
> move it to the other queues ingress message queue or redirect into
> the other socket via send.
>
> At one point I considered charging the data sitting in the ingress_skb?
> Would that solve the problem here? I think it would cause the enqueue
> at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?

I tried to move skb_set_owner_r() here; TCP is clearly unhappy about it,
as I explained in the changelog. Yes, it would probably help if we could
move it here.

Thanks.
Cong Wang July 2, 2021, 7:33 p.m. UTC | #5
On Thu, Jul 1, 2021 at 11:00 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Jul 1, 2021 at 9:23 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > We can't just drop the packet in the memory overrun case here. This will
> > break TCP because the data will be gone and no one will retransmit.
> >
> > Thats why in the current scheme on redirect we can push back when we
> > move it to the other queues ingress message queue or redirect into
> > the other socket via send.
> >
> > At one point I considered charging the data sitting in the ingress_skb?
> > Would that solve the problem here? I think it would cause the enqueue
> > at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?
>
> I tried to move skb_set_owner_r() here, TCP is clearly unhappy about it,
> as I explained in changelog. Yes, it probably helps if we could move it here.

Sorry for replying too quickly. Actually it probably does _not_ help,
because we have to limit the dest socket buffer, not the source.

For example, if we have 5 sockets in a sockmap and all of them send
packets to the same one, the dest socket could still get 5x sk_rcvbuf
bytes pending in its ingress_skb.

And I do not know why you want to single out the TCP case; the
eBPF program can decide to drop TCP packets too, and that drop is
no different from a drop due to rcvbuf overlimit.

Therefore, I think my patch is fine as it is.

Thanks.
Jakub Sitnicki July 3, 2021, 5:52 p.m. UTC | #6
On Thu, Jul 01, 2021 at 06:23 PM CEST, John Fastabend wrote:

[...]

>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 9b6160a191f8..a5185c781332 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>>  		return -EIO;
>>  	}
>>  	spin_lock_bh(&psock_other->ingress_lock);
>> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
>> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
>> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>>  		spin_unlock_bh(&psock_other->ingress_lock);
>>  		skb_bpf_redirect_clear(skb);
>>  		sock_drop(from->sk, skb);
>> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>>  		}
>>  		if (err < 0) {
>>  			spin_lock_bh(&psock->ingress_lock);
>> -			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>> +			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
>> +			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>>  				skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.

I don't think it's always the case that data will be gone. But you're
right that it breaks TCP. I was too quick to Ack this patch.

When running with just the verdict prog attached, the -EIO error from
sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.

tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
on sk_receive_queue.

  sk->sk_data_ready
    sk_psock_verdict_data_ready
      ->read_sock(..., sk_psock_verdict_recv)
        tcp_read_sock (used = copied = 0)
          sk_psock_verdict_recv -> ret = 0
            sk_psock_verdict_apply -> -EIO
              sk_psock_skb_redirect -> -EIO

However, I think this gets us stuck. What if no more data gets queued,
and sk_data_ready doesn't get called again?


Then there is the case when a parser prog is attached. In this case the
skb is really gone if we drop it on redirect.

In sk_psock_strp_read, we ignore the -EIO error from
sk_psock_verdict_apply, and return to tcp_read_sock how many bytes have
been parsed.

  sk->sk_data_ready
    sk_psock_verdict_data_ready
      ->read_sock(..., sk_psock_verdict_recv)
        tcp_read_sock (used = copied = eaten)
          strp_recv -> ret = eaten
            __strp_recv -> ret = eaten
              strp->cb.rcv_msg -> -EIO
                sk_psock_verdict_apply -> -EIO
                  sk_psock_redirect -> -EIO

Maybe we could put the skb back on strp->skb_head list on error, though?

But again some notification would need to trigger a re-read, or we are
stuck.
Jakub Sitnicki July 4, 2021, 1:10 p.m. UTC | #7
On Sat, Jul 03, 2021 at 07:52 PM CEST, Jakub Sitnicki wrote:

[...]

> Then there is the case when a parser prog is attached. In this case the
> skb is really gone if we drop it on redirect.
>
> In sk_psock_strp_read, we ignore the -EIO error from
> sk_psock_verdict_apply, and return to tcp_read_sock how many bytes have
> been parsed.
>
>   sk->sk_data_ready
>     sk_psock_verdict_data_ready
>       ->read_sock(..., sk_psock_verdict_recv)
>         tcp_read_sock (used = copied = eaten)
>           strp_recv -> ret = eaten
>             __strp_recv -> ret = eaten
>               strp->cb.rcv_msg -> -EIO
>                 sk_psock_verdict_apply -> -EIO
>                   sk_psock_redirect -> -EIO

Copy-paste error above. The call chain for the parser case actually goes like so:

  sk->sk_data_ready
    sk_psock_strp_data_ready
      strp_data_ready
        strp_read_sock
          ->read_sock(..., strp_recv)
            tcp_read_sock (used = copied = skb->len)
              strp_recv -> ret = skb->len
                __strp_recv -> ret = skb->len (dummy parser case)
                  strp->cb.rcv_msg -> -EIO
                    sk_psock_verdict_apply -> -EIO
                      sk_psock_redirect -> -EIO
Cong Wang July 4, 2021, 7:53 p.m. UTC | #8
On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> When running with just the verdict prog attached, the -EIO error from
> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
>
> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
> on sk_receive_queue.

Are you sure?

When recv_actor() returns 0, the while loop breaks:

1661                         used = recv_actor(desc, skb, offset, len);
1662                         if (used <= 0) {
1663                                 if (!copied)
1664                                         copied = used;
1665                                 break;

then it calls sk_eat_skb() a few lines after the loop:
...
1690                 sk_eat_skb(sk, skb);

>
>   sk->sk_data_ready
>     sk_psock_verdict_data_ready
>       ->read_sock(..., sk_psock_verdict_recv)
>         tcp_read_sock (used = copied = 0)
>           sk_psock_verdict_recv -> ret = 0
>             sk_psock_verdict_apply -> -EIO
>               sk_psock_skb_redirect -> -EIO
>
> However, I think this gets us stuck. What if no more data gets queued,
> and sk_data_ready doesn't get called again?

I think it is dropped by sk_eat_skb() in the TCP case, and of course the
sender will retransmit it after detecting the loss. So from this point of
view, there is no difference between drops due to overlimit and drops
due to eBPF program policy.

Thanks.
Jakub Sitnicki July 5, 2021, 8:24 a.m. UTC | #9
On Sun, Jul 04, 2021 at 09:53 PM CEST, Cong Wang wrote:
> On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> When running with just the verdict prog attached, the -EIO error from
>> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
>> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
>>
>> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
>> on sk_receive_queue.
>
> Are you sure?
>
> When recv_actor() returns 0, the while loop breaks:
>
> 1661                         used = recv_actor(desc, skb, offset, len);
> 1662                         if (used <= 0) {
> 1663                                 if (!copied)
> 1664                                         copied = used;
> 1665                                 break;
>
> then it calls sk_eat_skb() a few lines after the loop:
> ...
> 1690                 sk_eat_skb(sk, skb);

This sk_eat_skb is still within the loop:

1636:int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
1637-		  sk_read_actor_t recv_actor)
1638-{
	…
1643-	int copied = 0;
        …
1647-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
1648-		if (offset < skb->len) {
			…
1661-			used = recv_actor(desc, skb, offset, len);
1662-			if (used <= 0) {
1663-				if (!copied)
1664-					copied = used;
1665-				break;
1666-			} else if (used <= len) {
1667-				seq += used;
1668-				copied += used;
1669-				offset += used;
1670-			}
			…
1684-		}
		…
1690-		sk_eat_skb(sk, skb);
		…
1694-	}
	…
1699-	/* Clean up data we have read: This will do ACK frames. */
1700-	if (copied > 0) {
1701-		tcp_recv_skb(sk, seq, &offset);
1702-		tcp_cleanup_rbuf(sk, copied);
1703-	}
1704-	return copied;
1705-}

sk_eat_skb could also get called via tcp_recv_skb → sk_eat_skb if
recv_actor returned > 0 (the case when we have a parser attached).

>
>>
>>   sk->sk_data_ready
>>     sk_psock_verdict_data_ready
>>       ->read_sock(..., sk_psock_verdict_recv)
>>         tcp_read_sock (used = copied = 0)
>>           sk_psock_verdict_recv -> ret = 0
>>             sk_psock_verdict_apply -> -EIO
>>               sk_psock_skb_redirect -> -EIO
>>
>> However, I think this gets us stuck. What if no more data gets queued,
>> and sk_data_ready doesn't get called again?
>
> I think it is dropped by sk_eat_skb() in TCP case and of course the
> sender will retransmit it after detecting this loss. So from this point of
> view, there is no difference between drops due to overlimit and drops
> due to eBPF program policy.

I'm not sure the retransmit will happen.

We update tp->rcv_nxt (tcp_rcv_nxt_update) when skb gets pushed onto
sk_receive_queue in either:

 - tcp_rcv_established -> tcp_queue_rcv, or
 - tcp_rcv_established -> tcp_data_queue -> tcp_queue_rcv

... and schedule ACK (tcp_event_data_recv) to be sent.

Say we are in quickack mode, then
tcp_ack_snd_check()/__tcp_ack_snd_check() would cause ACK to be sent
out.
John Fastabend July 5, 2021, 4:24 p.m. UTC | #10
Jakub Sitnicki wrote:
> On Sun, Jul 04, 2021 at 09:53 PM CEST, Cong Wang wrote:
> > On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> When running with just the verdict prog attached, the -EIO error from
> >> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
> >> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
> >>
> >> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
> >> on sk_receive_queue.
> >
> > Are you sure?
> >
> > When recv_actor() returns 0, the while loop breaks:
> >
> > 1661                         used = recv_actor(desc, skb, offset, len);
> > 1662                         if (used <= 0) {
> > 1663                                 if (!copied)
> > 1664                                         copied = used;
> > 1665                                 break;
> >
> > then it calls sk_eat_skb() a few lines after the loop:
> > ...
> > 1690                 sk_eat_skb(sk, skb);
> 
> This sk_eat_skb is still within the loop:
> 
> 1636:int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> 1637-		  sk_read_actor_t recv_actor)
> 1638-{
> 	…
> 1643-	int copied = 0;
>         …
> 1647-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> 1648-		if (offset < skb->len) {
> 			…
> 1661-			used = recv_actor(desc, skb, offset, len);
> 1662-			if (used <= 0) {
> 1663-				if (!copied)
> 1664-					copied = used;
> 1665-				break;
> 1666-			} else if (used <= len) {
> 1667-				seq += used;
> 1668-				copied += used;
> 1669-				offset += used;
> 1670-			}
> 			…
> 1684-		}
> 		…
> 1690-		sk_eat_skb(sk, skb);
> 		…
> 1694-	}
> 	…
> 1699-	/* Clean up data we have read: This will do ACK frames. */
> 1700-	if (copied > 0) {
> 1701-		tcp_recv_skb(sk, seq, &offset);
> 1702-		tcp_cleanup_rbuf(sk, copied);
> 1703-	}
> 1704-	return copied;
> 1705-}
> 
> sk_eat_skb could get called by tcp_recv_skb → sk_eat_skb if recv_actor
> returned > 0 (the case when we have parser attached).
> 
> >
> >>
> >>   sk->sk_data_ready
> >>     sk_psock_verdict_data_ready
> >>       ->read_sock(..., sk_psock_verdict_recv)
> >>         tcp_read_sock (used = copied = 0)
> >>           sk_psock_verdict_recv -> ret = 0
> >>             sk_psock_verdict_apply -> -EIO
> >>               sk_psock_skb_redirect -> -EIO
> >>
> >> However, I think this gets us stuck. What if no more data gets queued,
> >> and sk_data_ready doesn't get called again?

Agreed, it looks like it will get stuck. I found a similar stuck queue
in the skmsg backlog case; for that I'm testing changing our workqueue
over to a delayed workqueue and then calling it again after some
delay.

We could do the same here, I think: return 0 to stop
tcp_read_sock as you show. The data then stays on sk_receive_queue
and the memory is still accounted for, solving the memory overrun
on the dest socket. Then we kick a workqueue item that will call
data_ready at some point in the future, with a backoff so we don't
keep hitting it repeatedly. I think this should work, and I have
patches testing this on the backlog paths now.

> >
> > I think it is dropped by sk_eat_skb() in TCP case and of course the
> > sender will retransmit it after detecting this loss. So from this point of
> > view, there is no difference between drops due to overlimit and drops
> > due to eBPF program policy.
> 
> I'm not sure the retransmit will happen.
> 
> We update tp->rcv_nxt (tcp_rcv_nxt_update) when skb gets pushed onto
> sk_receive_queue in either:
> 
>  - tcp_rcv_established -> tcp_queue_rcv, or
>  - tcp_rcv_established -> tcp_data_queue -> tcp_queue_rcv
> 
> ... and schedule ACK (tcp_event_data_recv) to be sent.

Right, this is what we've observed before when we did have a few drops
in error cases. Then some applications break, users send bug reports,
and we have to fix them. We can't unintentionally drop TCP packets. A
policy drop, though, is different: in that case we want to break the
session. FWIW, I'm considering adding a socket reset helper so we can
do this cleanly and tear the socket down.

> 
> Say we are in quickack mode, then
> tcp_ack_snd_check()/__tcp_ack_snd_check() would cause ACK to be sent
> out.

Yep.

For the stream parser case, I propose we stop the stream parser so it
won't pull in more data, then requeue the skb we don't have memory for,
and finally use the same delayed workqueue to start up the stream
parser again.

It requires some patches to get working, but this should get us
correct handling for TCP, and it avoids drops in the UDP stack, which
may or may not be useful.

I'll try to flush my queue of patches tomorrow; it's a holiday here
today. With those in place, I think the above becomes not so difficult.

.John

Patch

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9b6160a191f8..a5185c781332 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -854,7 +854,8 @@  static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
 		return -EIO;
 	}
 	spin_lock_bh(&psock_other->ingress_lock);
-	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
+	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
+	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
 		spin_unlock_bh(&psock_other->ingress_lock);
 		skb_bpf_redirect_clear(skb);
 		sock_drop(from->sk, skb);
@@ -930,7 +931,8 @@  static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 		}
 		if (err < 0) {
 			spin_lock_bh(&psock->ingress_lock);
-			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
+			    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
 				skb_queue_tail(&psock->ingress_skb, skb);
 				schedule_work(&psock->work);
 				err = 0;