| Message ID | 20210701061656.34150-1-xiyou.wangcong@gmail.com (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | BPF |
| Series | [bpf,v2] skmsg: check sk_rcvbuf limit before queuing to ingress_skb |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | |
| netdev/fixes_present | fail | Series targets non-next tree, but doesn't contain any Fixes tags |
| netdev/patch_count | success | |
| netdev/tree_selection | success | Clearly marked for bpf |
| netdev/subject_prefix | success | |
| netdev/cc_maintainers | warning | 8 maintainers not CCed: yhs@fb.com kpsingh@kernel.org andrii@kernel.org kafai@fb.com ast@kernel.org songliubraving@fb.com davem@davemloft.net kuba@kernel.org |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 1 this patch: 1 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | |
| netdev/checkpatch | warning | WARNING: line length of 102 exceeds 80 columns; WARNING: line length of 85 exceeds 80 columns |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 1 this patch: 1 |
| netdev/header_inline | success | |
On Thu, Jul 01, 2021 at 08:16 AM CEST, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> Jiang observed OOM frequently when testing our AF_UNIX/UDP
> proxy. This is due to the fact that we do not actually limit
> the socket memory before queueing skb to ingress_skb. We
> charge the skb memory later when handling the psock backlog,
> but it is not limited either.
>
> This patch adds checks for sk->sk_rcvbuf right before queuing
> to ingress_skb and drops packets if this limit exceeds. This
> is very similar to UDP receive path. Ideally we should set the
> skb owner before this check too, but it is hard to make TCP
> happy about sk_forward_alloc.
>
> Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>

By saying that it is hard to make TCP happy about sk_forward_alloc, I'm
guessing you're referring to problems described in 144748eb0c44 ("bpf,
sockmap: Fix incorrect fwd_alloc accounting") [1]?

Thanks for the fix.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=144748eb0c445091466c9b741ebd0bfcc5914f3d

[...]
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> Jiang observed OOM frequently when testing our AF_UNIX/UDP
> proxy. This is due to the fact that we do not actually limit
> the socket memory before queueing skb to ingress_skb. We
> charge the skb memory later when handling the psock backlog,
> but it is not limited either.

Right, its not limiting but charging it should push back on
the stack so it stops feeding skbs to us. Maybe this doesn't
happen in UDP side?

>
> This patch adds checks for sk->sk_rcvbuf right before queuing
> to ingress_skb and drops packets if this limit exceeds. This
> is very similar to UDP receive path. Ideally we should set the
> skb owner before this check too, but it is hard to make TCP
> happy about sk_forward_alloc.

But it breaks TCP side see below.

>
> Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  net/core/skmsg.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 9b6160a191f8..a5185c781332 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>  		return -EIO;
>  	}
>  	spin_lock_bh(&psock_other->ingress_lock);
> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>  		spin_unlock_bh(&psock_other->ingress_lock);
>  		skb_bpf_redirect_clear(skb);
>  		sock_drop(from->sk, skb);
> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>  	}
>  	if (err < 0) {
>  		spin_lock_bh(&psock->ingress_lock);
> -		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> +		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> +		    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>  			skb_queue_tail(&psock->ingress_skb, skb);

We can't just drop the packet in the memory overrun case here. This will
break TCP because the data will be gone and no one will retransmit.

Thats why in the current scheme on redirect we can push back when we
move it to the other queues ingress message queue or redirect into
the other socket via send.

At one point I considered charging the data sitting in the ingress_skb?
Would that solve the problem here? I think it would cause the enqueue
at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?

>  			schedule_work(&psock->work);
>  			err = 0;
> --
> 2.27.0
>
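For context on the charging idea raised above: skb_set_owner_r() is the existing helper that assigns an skb to a socket and adds skb->truesize to sk_rmem_alloc, so charging before queuing to ingress_skb might look roughly like the sketch below. The helper name sk_psock_queue_charged is made up for illustration; this is an untested sketch of the idea, not part of the patch, and as the changelog notes, doing this upsets TCP's sk_forward_alloc accounting.

```c
/* Hedged sketch (net/core/skmsg.c context): charge the skb to the ingress
 * socket before queuing it to ingress_skb, so sk_rmem_alloc reflects what
 * is sitting in the queue.  Illustration only, not a tested change.
 */
static int sk_psock_queue_charged(struct sk_psock *psock, struct sk_buff *skb)
{
	struct sock *sk = psock->sk;

	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
		return -ENOMEM;			/* push back instead of queueing */

	skb_set_owner_r(skb, sk);		/* charges skb->truesize to sk_rmem_alloc */
	skb_queue_tail(&psock->ingress_skb, skb);
	schedule_work(&psock->work);
	return 0;
}
```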
Jakub Sitnicki wrote:
> On Thu, Jul 01, 2021 at 08:16 AM CEST, Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > but it is not limited either.
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops packets if this limit exceeds. This
> > is very similar to UDP receive path. Ideally we should set the
> > skb owner before this check too, but it is hard to make TCP
> > happy about sk_forward_alloc.
> >
> > Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
>
> Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
>
> By saying that it is hard to make TCP happy about sk_forward_alloc, I'm
> guessing you're referring to problems described in 144748eb0c44 ("bpf,
> sockmap: Fix incorrect fwd_alloc accounting") [1]?

I have a couple fixes on my stack here I'm testing that clean up
the tear down logic. Once thats in place maybe its as simple as adding
the owner_r bits and calling the destructor to ensure memory accounting
happens earlier so these ingress_skb packets are accounted for.

I'll flush those out today, maybe it will be clear then.

>
> Thanks for the fix.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=144748eb0c445091466c9b741ebd0bfcc5914f3d
>
> [...]
On Thu, Jul 1, 2021 at 9:23 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > Jiang observed OOM frequently when testing our AF_UNIX/UDP
> > proxy. This is due to the fact that we do not actually limit
> > the socket memory before queueing skb to ingress_skb. We
> > charge the skb memory later when handling the psock backlog,
> > but it is not limited either.
>
> Right, its not limiting but charging it should push back on
> the stack so it stops feeding skbs to us. Maybe this doesn't
> happen in UDP side?

The OOM is due to skb queued in ingress_skb, not due to user-space
consuming skb slowly.

>
> >
> > This patch adds checks for sk->sk_rcvbuf right before queuing
> > to ingress_skb and drops packets if this limit exceeds. This
> > is very similar to UDP receive path. Ideally we should set the
> > skb owner before this check too, but it is hard to make TCP
> > happy about sk_forward_alloc.
>
> But it breaks TCP side see below.
>
> >
> > Reported-by: Jiang Wang <jiang.wang@bytedance.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  net/core/skmsg.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> > index 9b6160a191f8..a5185c781332 100644
> > --- a/net/core/skmsg.c
> > +++ b/net/core/skmsg.c
> > @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
> >  		return -EIO;
> >  	}
> >  	spin_lock_bh(&psock_other->ingress_lock);
> > -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
> > +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
> > +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
> >  		spin_unlock_bh(&psock_other->ingress_lock);
> >  		skb_bpf_redirect_clear(skb);
> >  		sock_drop(from->sk, skb);
> > @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
> >  	}
> >  	if (err < 0) {
> >  		spin_lock_bh(&psock->ingress_lock);
> > -		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
> > +		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
> > +		    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
> >  			skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.
>
> Thats why in the current scheme on redirect we can push back when we
> move it to the other queues ingress message queue or redirect into
> the other socket via send.
>
> At one point I considered charging the data sitting in the ingress_skb?
> Would that solve the problem here? I think it would cause the enqueue
> at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?

I tried to move skb_set_owner_r() here, TCP is clearly unhappy about it,
as I explained in changelog. Yes, it probably helps if we could move it
here.

Thanks.
On Thu, Jul 1, 2021 at 11:00 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Jul 1, 2021 at 9:23 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > We can't just drop the packet in the memory overrun case here. This will
> > break TCP because the data will be gone and no one will retransmit.
> >
> > Thats why in the current scheme on redirect we can push back when we
> > move it to the other queues ingress message queue or redirect into
> > the other socket via send.
> >
> > At one point I considered charging the data sitting in the ingress_skb?
> > Would that solve the problem here? I think it would cause the enqueue
> > at the UDP to start dropping packets from __udp_enqueue_schedule_skb()?
>
> I tried to move skb_set_owner_r() here, TCP is clearly unhappy about it,
> as I explained in changelog. Yes, it probably helps if we could move it here.

Sorry for replying too quickly. Actually it probably does _not_ help,
because we have to limit the dest socket buffer, not the source. For
example, if we have 5 sockets in a sockmap and all of them send packets
to the same one, the dest socket could still get 5x sk_rcvbuf bytes
pending in its ingress_skb.

And I do not know why you want to single out the TCP case, at least
the ebpf program can decide to drop TCP packets too, this drop is not
any different from the drop due to rcvbuf overlimit.

Therefore, I think my patch is fine as it is.

Thanks.
On Thu, Jul 01, 2021 at 06:23 PM CEST, John Fastabend wrote:

[...]

>> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
>> index 9b6160a191f8..a5185c781332 100644
>> --- a/net/core/skmsg.c
>> +++ b/net/core/skmsg.c
>> @@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
>>  		return -EIO;
>>  	}
>>  	spin_lock_bh(&psock_other->ingress_lock);
>> -	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
>> +	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
>> +	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
>>  		spin_unlock_bh(&psock_other->ingress_lock);
>>  		skb_bpf_redirect_clear(skb);
>>  		sock_drop(from->sk, skb);
>> @@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
>>  	}
>>  	if (err < 0) {
>>  		spin_lock_bh(&psock->ingress_lock);
>> -		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>> +		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
>> +		    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
>>  			skb_queue_tail(&psock->ingress_skb, skb);
>
> We can't just drop the packet in the memory overrun case here. This will
> break TCP because the data will be gone and no one will retransmit.

I don't think it's always the case that data will be gone. But you're
right that it breaks TCP. I was too quick to Ack this patch.

When running with just the verdict prog attached, the -EIO error from
sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.

tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
on sk_receive_queue.

sk->sk_data_ready
  sk_psock_verdict_data_ready
    ->read_sock(..., sk_psock_verdict_recv)
      tcp_read_sock (used = copied = 0)
        sk_psock_verdict_recv -> ret = 0
          sk_psock_verdict_apply -> -EIO
            sk_psock_skb_redirect -> -EIO

However, I think this gets us stuck. What if no more data gets queued,
and sk_data_ready doesn't get called again?

Then there is the case when a parser prog is attached. In this case the
skb is really gone if we drop it on redirect.

In sk_psock_strp_read, we ignore the -EIO error from
sk_psock_verdict_apply, and return to tcp_read_sock how many bytes have
been parsed.

sk->sk_data_ready
  sk_psock_verdict_data_ready
    ->read_sock(..., sk_psock_verdict_recv)
      tcp_read_sock (used = copied = eaten)
        strp_recv -> ret = eaten
          __strp_recv -> ret = eaten
            strp->cb.rcv_msg -> -EIO
              sk_psock_verdict_apply -> -EIO
                sk_psock_redirect -> -EIO

Maybe we could put the skb back on strp->skb_head list on error, though?
But again some notification would need to trigger a re-read, or we are
stuck.
On Sat, Jul 03, 2021 at 07:52 PM CEST, Jakub Sitnicki wrote:

[...]

> Then there is the case when a parser prog is attached. In this case the
> skb is really gone if we drop it on redirect.
>
> In sk_psock_strp_read, we ignore the -EIO error from
> sk_psock_verdict_apply, and return to tcp_read_sock how many bytes have
> been parsed.
>
> sk->sk_data_ready
>   sk_psock_verdict_data_ready
>     ->read_sock(..., sk_psock_verdict_recv)
>       tcp_read_sock (used = copied = eaten)
>         strp_recv -> ret = eaten
>           __strp_recv -> ret = eaten
>             strp->cb.rcv_msg -> -EIO
>               sk_psock_verdict_apply -> -EIO
>                 sk_psock_redirect -> -EIO

Copy-paste error. The call chain for the parser case goes as so:

sk->sk_data_ready
  sk_psock_strp_data_ready
    strp_data_ready
      strp_read_sock
        ->read_sock(..., strp_recv)
          tcp_read_sock (used = copied = skb->len)
            strp_recv -> ret = skb->len
              __strp_recv -> ret = skb->len (dummy parser case)
                strp->cb.rcv_msg -> -EIO
                  sk_psock_verdict_apply -> -EIO
                    sk_psock_redirect -> -EIO
On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> When running with just the verdict prog attached, the -EIO error from
> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
>
> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
> on sk_receive_queue.

Are you sure?

When recv_actor() returns 0, the while loop breaks:

1661                 used = recv_actor(desc, skb, offset, len);
1662                 if (used <= 0) {
1663                         if (!copied)
1664                                 copied = used;
1665                         break;

then it calls sk_eat_skb() a few lines after the loop:

...
1690                 sk_eat_skb(sk, skb);

>
> sk->sk_data_ready
>   sk_psock_verdict_data_ready
>     ->read_sock(..., sk_psock_verdict_recv)
>       tcp_read_sock (used = copied = 0)
>         sk_psock_verdict_recv -> ret = 0
>           sk_psock_verdict_apply -> -EIO
>             sk_psock_skb_redirect -> -EIO
>
> However, I think this gets us stuck. What if no more data gets queued,
> and sk_data_ready doesn't get called again?

I think it is dropped by sk_eat_skb() in TCP case and of course the
sender will retransmit it after detecting this loss. So from this point of
view, there is no difference between drops due to overlimit and drops
due to eBPF program policy.

Thanks.
On Sun, Jul 04, 2021 at 09:53 PM CEST, Cong Wang wrote:
> On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> When running with just the verdict prog attached, the -EIO error from
>> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
>> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
>>
>> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
>> on sk_receive_queue.
>
> Are you sure?
>
> When recv_actor() returns 0, the while loop breaks:
>
> 1661                 used = recv_actor(desc, skb, offset, len);
> 1662                 if (used <= 0) {
> 1663                         if (!copied)
> 1664                                 copied = used;
> 1665                         break;
>
> then it calls sk_eat_skb() a few lines after the loop:
> ...
> 1690                 sk_eat_skb(sk, skb);

This sk_eat_skb is still within the loop:

1636:int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
1637-		  sk_read_actor_t recv_actor)
1638-{
…
1643-	int copied = 0;
…
1647-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
1648-		if (offset < skb->len) {
…
1661-			used = recv_actor(desc, skb, offset, len);
1662-			if (used <= 0) {
1663-				if (!copied)
1664-					copied = used;
1665-				break;
1666-			} else if (used <= len) {
1667-				seq += used;
1668-				copied += used;
1669-				offset += used;
1670-			}
…
1684-		}
…
1690-		sk_eat_skb(sk, skb);
…
1694-	}
…
1699-	/* Clean up data we have read: This will do ACK frames. */
1700-	if (copied > 0) {
1701-		tcp_recv_skb(sk, seq, &offset);
1702-		tcp_cleanup_rbuf(sk, copied);
1703-	}
1704-	return copied;
1705-}

sk_eat_skb could get called by tcp_recv_skb → sk_eat_skb if recv_actor
returned > 0 (the case when we have parser attached).

>
>>
>> sk->sk_data_ready
>>   sk_psock_verdict_data_ready
>>     ->read_sock(..., sk_psock_verdict_recv)
>>       tcp_read_sock (used = copied = 0)
>>         sk_psock_verdict_recv -> ret = 0
>>           sk_psock_verdict_apply -> -EIO
>>             sk_psock_skb_redirect -> -EIO
>>
>> However, I think this gets us stuck. What if no more data gets queued,
>> and sk_data_ready doesn't get called again?
>
> I think it is dropped by sk_eat_skb() in TCP case and of course the
> sender will retransmit it after detecting this loss. So from this point of
> view, there is no difference between drops due to overlimit and drops
> due to eBPF program policy.

I'm not sure the retransmit will happen.

We update tp->rcv_nxt (tcp_rcv_nxt_update) when skb gets pushed onto
sk_receive_queue in either:

- tcp_rcv_established -> tcp_queue_rcv, or
- tcp_rcv_established -> tcp_data_queue -> tcp_queue_rcv

... and schedule ACK (tcp_event_data_recv) to be sent.

Say we are in quickack mode, then
tcp_ack_snd_check()/__tcp_ack_snd_check() would cause ACK to be sent
out.
Jakub Sitnicki wrote:
> On Sun, Jul 04, 2021 at 09:53 PM CEST, Cong Wang wrote:
> > On Sat, Jul 3, 2021 at 10:52 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> When running with just the verdict prog attached, the -EIO error from
> >> sk_psock_verdict_apply is propagated up to tcp_read_sock. That is, it
> >> maps to 0 bytes used by recv_actor. sk_psock_verdict_recv in this case.
> >>
> >> tcp_read_sock, if 0 bytes were used = copied, won't sk_eat_skb. It stays
> >> on sk_receive_queue.
> >
> > Are you sure?
> >
> > When recv_actor() returns 0, the while loop breaks:
> >
> > 1661                 used = recv_actor(desc, skb, offset, len);
> > 1662                 if (used <= 0) {
> > 1663                         if (!copied)
> > 1664                                 copied = used;
> > 1665                         break;
> >
> > then it calls sk_eat_skb() a few lines after the loop:
> > ...
> > 1690                 sk_eat_skb(sk, skb);
>
> This sk_eat_skb is still within the loop:
>
> 1636:int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
> 1637-		  sk_read_actor_t recv_actor)
> 1638-{
> …
> 1643-	int copied = 0;
> …
> 1647-	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
> 1648-		if (offset < skb->len) {
> …
> 1661-			used = recv_actor(desc, skb, offset, len);
> 1662-			if (used <= 0) {
> 1663-				if (!copied)
> 1664-					copied = used;
> 1665-				break;
> 1666-			} else if (used <= len) {
> 1667-				seq += used;
> 1668-				copied += used;
> 1669-				offset += used;
> 1670-			}
> …
> 1684-		}
> …
> 1690-		sk_eat_skb(sk, skb);
> …
> 1694-	}
> …
> 1699-	/* Clean up data we have read: This will do ACK frames. */
> 1700-	if (copied > 0) {
> 1701-		tcp_recv_skb(sk, seq, &offset);
> 1702-		tcp_cleanup_rbuf(sk, copied);
> 1703-	}
> 1704-	return copied;
> 1705-}
>
> sk_eat_skb could get called by tcp_recv_skb → sk_eat_skb if recv_actor
> returned > 0 (the case when we have parser attached).
>
> >
> >>
> >> sk->sk_data_ready
> >>   sk_psock_verdict_data_ready
> >>     ->read_sock(..., sk_psock_verdict_recv)
> >>       tcp_read_sock (used = copied = 0)
> >>         sk_psock_verdict_recv -> ret = 0
> >>           sk_psock_verdict_apply -> -EIO
> >>             sk_psock_skb_redirect -> -EIO
> >>
> >> However, I think this gets us stuck. What if no more data gets queued,
> >> and sk_data_ready doesn't get called again?

Agree, looks like it will be stuck. I found a similar stuck queue in the
skmsg backlog case; for that I'm testing changing our workqueue over to
a delayed workqueue and then calling it again after some delay.

We could do the same here I think. Return 0 to stop the tcp_read_sock
as you show. Then the data is still on the sk_receive_queue and memory
should still be accounted for, solving the memory overrun on the dest
socket. Then we kick a workqueue item that will call data_ready at some
point in the future, and we do a backoff so we don't keep hitting it
repeatedly.

I think this should work and I have the patches testing now for doing
it on the backlog paths.

> >
> > I think it is dropped by sk_eat_skb() in TCP case and of course the
> > sender will retransmit it after detecting this loss. So from this point of
> > view, there is no difference between drops due to overlimit and drops
> > due to eBPF program policy.
>
> I'm not sure the retransmit will happen.
>
> We update tp->rcv_nxt (tcp_rcv_nxt_update) when skb gets pushed onto
> sk_receive_queue in either:
>
> - tcp_rcv_established -> tcp_queue_rcv, or
> - tcp_rcv_established -> tcp_data_queue -> tcp_queue_rcv
>
> ... and schedule ACK (tcp_event_data_recv) to be sent.

Right, this is what we've observed before when we did have a few drops
on error cases. And then some applications will break, users will send
bug reports, and we have to fix them. We can't unintentionally drop TCP
packets.

A policy drop, though, is different; in that case we want to break the
session. FWIW I'm considering adding a reset socket helper so we can do
this cleanly and tear the socket down.

>
> Say we are in quickack mode, then
> tcp_ack_snd_check()/__tcp_ack_snd_check() would cause ACK to be sent
> out.

Yep.

For the stream parser case, I propose we stop the stream parser so it
won't pull in more data, then requeue the skb we don't have memory for,
and finally use the same delayed workqueue to start up the stream parser
again. It requires some patches to get working, but this should get us
the correct handling for TCP. And it avoids drops in the UDP stack,
which may or may not be useful.

I'll try to flush my queue of patches tomorrow; it's a holiday today
here. With those I think the above becomes not so difficult.

.John
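A rough sketch of the retry scheme John outlines: instead of dropping when the destination is over its rcvbuf limit, pause the stream parser and schedule a delayed work item that later unpauses it and kicks sk_data_ready() again. strp_pause()/strp_unpause() and delayed work are existing kernel facilities, but the sk_psock fields and helper names below (ingress_retry, strp_enabled, sk_psock_backpressure) are hypothetical, and the wiring is untested.

```c
/* Hedged sketch of the backoff idea discussed above: when the destination
 * socket is over its rcvbuf limit, do not drop the skb.  Pause the stream
 * parser so no more data is pulled in, and schedule a delayed retry that
 * unpauses it and kicks ->sk_data_ready() again.  Assumes ingress_retry
 * is a struct delayed_work initialized elsewhere with INIT_DELAYED_WORK().
 */
static void sk_psock_ingress_retry(struct work_struct *work)
{
	struct sk_psock *psock = container_of(to_delayed_work(work),
					      struct sk_psock, ingress_retry);

	if (psock->strp_enabled)
		strp_unpause(&psock->strp);	/* let the parser run again */
	psock->sk->sk_data_ready(psock->sk);	/* re-attempt the read */
}

static void sk_psock_backpressure(struct sk_psock *psock)
{
	if (psock->strp_enabled)
		strp_pause(&psock->strp);	/* stop pulling in more data */
	/* Retry later with a small backoff instead of dropping the skb. */
	schedule_delayed_work(&psock->ingress_retry, msecs_to_jiffies(10));
}
```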
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 9b6160a191f8..a5185c781332 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -854,7 +854,8 @@ static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
 		return -EIO;
 	}
 	spin_lock_bh(&psock_other->ingress_lock);
-	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
+	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED) ||
+	    atomic_read(&sk_other->sk_rmem_alloc) > READ_ONCE(sk_other->sk_rcvbuf)) {
 		spin_unlock_bh(&psock_other->ingress_lock);
 		skb_bpf_redirect_clear(skb);
 		sock_drop(from->sk, skb);
@@ -930,7 +931,8 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 	}
 	if (err < 0) {
 		spin_lock_bh(&psock->ingress_lock);
-		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
+		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED) &&
+		    atomic_read(&sk_other->sk_rmem_alloc) <= READ_ONCE(sk_other->sk_rcvbuf)) {
 			skb_queue_tail(&psock->ingress_skb, skb);
 			schedule_work(&psock->work);
 			err = 0;