Message ID | 20210517022322.50501-1-xiyou.wangcong@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Headers | show |
Series | [bpf] udp: fix a memory leak in udp_read_sock() | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Clearly marked for bpf |
netdev/subject_prefix | success | Link |
netdev/cc_maintainers | fail | 1 blamed authors not CCed: ast@kernel.org; 10 maintainers not CCed: dsahern@kernel.org yhs@fb.com kpsingh@kernel.org yoshfuji@linux-ipv6.org andrii@kernel.org kafai@fb.com ast@kernel.org songliubraving@fb.com davem@davemloft.net kuba@kernel.org |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 7 this patch: 7 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 13 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 7 this patch: 7 |
netdev/header_inline | success | Link |
On Sun, May 16, 2021 at 8:33 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > From: Cong Wang <cong.wang@bytedance.com> > > sk_psock_verdict_recv() clones the skb and uses the clone > afterward, so udp_read_sock() should free the original skb after > done using it. > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > Cc: John Fastabend <john.fastabend@gmail.com> > Cc: Daniel Borkmann <daniel@iogearbox.net> > Cc: Jakub Sitnicki <jakub@cloudflare.com> > Cc: Lorenz Bauer <lmb@cloudflare.com> > Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Song Liu <song@kernel.org>
Cong Wang wrote: > From: Cong Wang <cong.wang@bytedance.com> > > sk_psock_verdict_recv() clones the skb and uses the clone > afterward, so udp_read_sock() should free the original skb after > done using it. The clone only happens if sk_psock_verdict_recv() returns >0. > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > Cc: John Fastabend <john.fastabend@gmail.com> > Cc: Daniel Borkmann <daniel@iogearbox.net> > Cc: Jakub Sitnicki <jakub@cloudflare.com> > Cc: Lorenz Bauer <lmb@cloudflare.com> > Signed-off-by: Cong Wang <cong.wang@bytedance.com> > --- > net/ipv4/udp.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 15f5504adf5b..e31d67fd5183 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, > if (used <= 0) { > if (!copied) > copied = used; > + kfree_skb(skb); This case is different from the TCP side, if there is an error the sockmap side will also call kfree_skb(). In TCP side we peek the skb because we don't want to drop it. On UDP side this will just drop data on the floor. Its not super friendly, but its UDP so we are making the assumption this is ok? We've tried to remove all the drop data cases from TCP it would be nice to not drop data on UDP side if we can help it. Could we requeue or peek the UDP skb to avoid this? > break; > } else if (used <= skb->len) { > copied += used; > } > > + kfree_skb(skb); > if (!desc->count) > break; > } > -- > 2.25.1 >
On Mon, May 17, 2021 at 10:36 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > From: Cong Wang <cong.wang@bytedance.com> > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > afterward, so udp_read_sock() should free the original skb after > > done using it. > > The clone only happens if sk_psock_verdict_recv() returns >0. Sure, in case of error, no one uses the original skb either, so still need to free it. > > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > Cc: John Fastabend <john.fastabend@gmail.com> > > Cc: Daniel Borkmann <daniel@iogearbox.net> > > Cc: Jakub Sitnicki <jakub@cloudflare.com> > > Cc: Lorenz Bauer <lmb@cloudflare.com> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com> > > --- > > net/ipv4/udp.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > index 15f5504adf5b..e31d67fd5183 100644 > > --- a/net/ipv4/udp.c > > +++ b/net/ipv4/udp.c > > @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, > > if (used <= 0) { > > if (!copied) > > copied = used; > > + kfree_skb(skb); > > This case is different from the TCP side, if there is an error > the sockmap side will also call kfree_skb(). In TCP side we peek > the skb because we don't want to drop it. On UDP side this will > just drop data on the floor. Its not super friendly, but its > UDP so we are making the assumption this is ok? We've tried > to remove all the drop data cases from TCP it would be nice > to not drop data on UDP side if we can help it. Could we > requeue or peek the UDP skb to avoid this? TCP is special because it supports splice() where we can do a partial read, so it needs to peek the skb, right? UDP only supports sockmap, where we always read a whole skb, so we do not need to peek here? Thanks.
Cong Wang wrote: > On Mon, May 17, 2021 at 10:36 PM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > afterward, so udp_read_sock() should free the original skb after > > > done using it. > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > Sure, in case of error, no one uses the original skb either, > so still need to free it. But the data is going to be dropped then. I'm questioning if this is the best we can do or not. Its simplest sure, but could we do a bit more work and peek those skbs or requeue them? Otherwise if you cross memory limits for a bit your likely to drop these unnecessarily. > > > > > > > > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap") > > > Cc: John Fastabend <john.fastabend@gmail.com> > > > Cc: Daniel Borkmann <daniel@iogearbox.net> > > > Cc: Jakub Sitnicki <jakub@cloudflare.com> > > > Cc: Lorenz Bauer <lmb@cloudflare.com> > > > Signed-off-by: Cong Wang <cong.wang@bytedance.com> > > > --- > > > net/ipv4/udp.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > > > index 15f5504adf5b..e31d67fd5183 100644 > > > --- a/net/ipv4/udp.c > > > +++ b/net/ipv4/udp.c > > > @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, > > > if (used <= 0) { > > > if (!copied) > > > copied = used; > > > + kfree_skb(skb); > > > > This case is different from the TCP side, if there is an error > > the sockmap side will also call kfree_skb(). In TCP side we peek > > the skb because we don't want to drop it. On UDP side this will > > just drop data on the floor. Its not super friendly, but its > > UDP so we are making the assumption this is ok? We've tried > > to remove all the drop data cases from TCP it would be nice > > to not drop data on UDP side if we can help it. Could we > > requeue or peek the UDP skb to avoid this? > > TCP is special because it supports splice() where we can > do a partial read, so it needs to peek the skb, right? UDP only > supports sockmap, where we always read a whole skb, so we > do not need to peek here? Its also about not dropping data. In TCP we should not drop data at this point in the stack so if we get an error, ENOMEM or otherwise, we need to ensure we keep the original skb. > > Thanks.
On Tue, May 18, 2021 at 12:56 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > <john.fastabend@gmail.com> wrote: > > > > > > Cong Wang wrote: > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > afterward, so udp_read_sock() should free the original skb after > > > > done using it. > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > Sure, in case of error, no one uses the original skb either, > > so still need to free it. > > But the data is going to be dropped then. I'm questioning if this > is the best we can do or not. Its simplest sure, but could we > do a bit more work and peek those skbs or requeue them? Otherwise > if you cross memory limits for a bit your likely to drop these > unnecessarily. What are the benefits of not dropping it? When sockmap takes over sk->sk_data_ready() it should have total control over the skb's in the receive queue. Otherwise user-space recvmsg() would race with sockmap when they try to read the first skb at the same time, therefore potentially user-space could get duplicated data (one via recvmsg(), one via sockmap). I don't see any benefits but races here. Thanks.
Cong Wang wrote: > On Tue, May 18, 2021 at 12:56 PM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > <john.fastabend@gmail.com> wrote: > > > > > > > > Cong Wang wrote: > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > done using it. > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > Sure, in case of error, no one uses the original skb either, > > > so still need to free it. > > > > But the data is going to be dropped then. I'm questioning if this > > is the best we can do or not. Its simplest sure, but could we > > do a bit more work and peek those skbs or requeue them? Otherwise > > if you cross memory limits for a bit your likely to drop these > > unnecessarily. > > What are the benefits of not dropping it? When sockmap takes > over sk->sk_data_ready() it should have total control over the skb's > in the receive queue. Otherwise user-space recvmsg() would race > with sockmap when they try to read the first skb at the same time, > therefore potentially user-space could get duplicated data (one via > recvmsg(), one via sockmap). I don't see any benefits but races here. The benefit of _not_ dropping it is the packet gets to the receiver side. We've spent a bit of effort to get a packet across the network, received on the stack, and then we drop it at the last point is not so friendly. About races is the socket is locked by the caller here? Or is this not the case for UDP. Its OK in the end to say "its UDP and lossy" but ideally we don't make things worse by adding sockmap into the stack. We had these problems already on TCP side, where they are much more severe because sender believes retransmits will happen, and fixed them by now. It would be nice if UDP side also didn't introduce drops. > > Thanks.
On Wed, May 19, 2021 at 12:06 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > <john.fastabend@gmail.com> wrote: > > > > > > Cong Wang wrote: > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > Cong Wang wrote: > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > > done using it. > > > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > > > Sure, in case of error, no one uses the original skb either, > > > > so still need to free it. > > > > > > But the data is going to be dropped then. I'm questioning if this > > > is the best we can do or not. Its simplest sure, but could we > > > do a bit more work and peek those skbs or requeue them? Otherwise > > > if you cross memory limits for a bit your likely to drop these > > > unnecessarily. > > > > What are the benefits of not dropping it? When sockmap takes > > over sk->sk_data_ready() it should have total control over the skb's > > in the receive queue. Otherwise user-space recvmsg() would race > > with sockmap when they try to read the first skb at the same time, > > therefore potentially user-space could get duplicated data (one via > > recvmsg(), one via sockmap). I don't see any benefits but races here. > > The benefit of _not_ dropping it is the packet gets to the receiver > side. We've spent a bit of effort to get a packet across the network, > received on the stack, and then we drop it at the last point is not > so friendly. Well, at least udp_recvmsg() could drop packets too in various scenarios, for example, a copy error. So, I do not think sockmap is special. > > About races is the socket is locked by the caller here? Or is this > not the case for UDP. Unlike TCP, the sock is not locked during BH for UDP receive path. Locking it is not the answer here, because 1) we certainly do not want to slow down UDP fast path; 2) UDP lacks sk->sk_backlog_rcv(). > > Its OK in the end to say "its UDP and lossy" but ideally we don't > make things worse by adding sockmap into the stack. We had these > problems already on TCP side, where they are much more severe > because sender believes retransmits will happen, and fixed them > by now. It would be nice if UDP side also didn't introduce > drops. Like I said, the normal UDP receive path drops packets too, sockmap is not different here. TCP does peek packets, for two reasons: 1) it has to support splice(); 2) it has locked the socket during BH receive. UDP has none of them, so UDP can't peek packets here. Thanks.
Cong Wang wrote: > On Wed, May 19, 2021 at 12:06 PM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > > <john.fastabend@gmail.com> wrote: > > > > > > > > Cong Wang wrote: > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > Cong Wang wrote: > > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > > > done using it. > > > > > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > > > > > Sure, in case of error, no one uses the original skb either, > > > > > so still need to free it. > > > > > > > > But the data is going to be dropped then. I'm questioning if this > > > > is the best we can do or not. Its simplest sure, but could we > > > > do a bit more work and peek those skbs or requeue them? Otherwise > > > > if you cross memory limits for a bit your likely to drop these > > > > unnecessarily. > > > > > > What are the benefits of not dropping it? When sockmap takes > > > over sk->sk_data_ready() it should have total control over the skb's > > > in the receive queue. Otherwise user-space recvmsg() would race > > > with sockmap when they try to read the first skb at the same time, > > > therefore potentially user-space could get duplicated data (one via > > > recvmsg(), one via sockmap). I don't see any benefits but races here. > > > > The benefit of _not_ dropping it is the packet gets to the receiver > > side. We've spent a bit of effort to get a packet across the network, > > received on the stack, and then we drop it at the last point is not > > so friendly. > > Well, at least udp_recvmsg() could drop packets too in various > scenarios, for example, a copy error. So, I do not think sockmap > is special. OK I am at least convinced now that dropping packets is OK and likely a useful performance/complexity compromise. But, at this point we wont have any visibility into these drops correct? Looks like the pattern in UDP stack to handle this is to increment sk_drops and UDP_MIB_INERRORS. How about we do that here as well?
On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Wed, May 19, 2021 at 12:06 PM John Fastabend > > <john.fastabend@gmail.com> wrote: > > > > > > Cong Wang wrote: > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > Cong Wang wrote: > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > > > > done using it. > > > > > > > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > > > > > > > Sure, in case of error, no one uses the original skb either, > > > > > > so still need to free it. > > > > > > > > > > But the data is going to be dropped then. I'm questioning if this > > > > > is the best we can do or not. Its simplest sure, but could we > > > > > do a bit more work and peek those skbs or requeue them? Otherwise > > > > > if you cross memory limits for a bit your likely to drop these > > > > > unnecessarily. > > > > > > > > What are the benefits of not dropping it? When sockmap takes > > > > over sk->sk_data_ready() it should have total control over the skb's > > > > in the receive queue. Otherwise user-space recvmsg() would race > > > > with sockmap when they try to read the first skb at the same time, > > > > therefore potentially user-space could get duplicated data (one via > > > > recvmsg(), one via sockmap). I don't see any benefits but races here. > > > > > > The benefit of _not_ dropping it is the packet gets to the receiver > > > side. We've spent a bit of effort to get a packet across the network, > > > received on the stack, and then we drop it at the last point is not > > > so friendly. > > > > Well, at least udp_recvmsg() could drop packets too in various > > scenarios, for example, a copy error. So, I do not think sockmap > > is special. > > OK I am at least convinced now that dropping packets is OK and likely > a useful performance/complexity compromise. > > But, at this point we wont have any visibility into these drops correct? > Looks like the pattern in UDP stack to handle this is to increment > sk_drops and UDP_MIB_INERRORS. How about we do that here as well? We are not dropping the packet, the packet is cloned and deliver to user-space via sk_psock_verdict_recv(), thus, we are simply leaking the original skb, regardless of any error. Maybe udp_read_sock() should check desc->error, but it has nothing to do with this path which only aims to address a memory leak. A separate patch is need to check desc->error, if really needed. Thanks.
Cong Wang wrote: > On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Wed, May 19, 2021 at 12:06 PM John Fastabend > > > <john.fastabend@gmail.com> wrote: > > > > > > > > Cong Wang wrote: > > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > Cong Wang wrote: > > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > > > > > done using it. > > > > > > > > > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > > > > > > > > > Sure, in case of error, no one uses the original skb either, > > > > > > > so still need to free it. > > > > > > > > > > > > But the data is going to be dropped then. I'm questioning if this > > > > > > is the best we can do or not. Its simplest sure, but could we > > > > > > do a bit more work and peek those skbs or requeue them? Otherwise > > > > > > if you cross memory limits for a bit your likely to drop these > > > > > > unnecessarily. > > > > > > > > > > What are the benefits of not dropping it? When sockmap takes > > > > > over sk->sk_data_ready() it should have total control over the skb's > > > > > in the receive queue. Otherwise user-space recvmsg() would race > > > > > with sockmap when they try to read the first skb at the same time, > > > > > therefore potentially user-space could get duplicated data (one via > > > > > recvmsg(), one via sockmap). I don't see any benefits but races here. > > > > > > > > The benefit of _not_ dropping it is the packet gets to the receiver > > > > side. We've spent a bit of effort to get a packet across the network, > > > > received on the stack, and then we drop it at the last point is not > > > > so friendly. > > > > > > Well, at least udp_recvmsg() could drop packets too in various > > > scenarios, for example, a copy error. So, I do not think sockmap > > > is special. > > > > OK I am at least convinced now that dropping packets is OK and likely > > a useful performance/complexity compromise. > > > > But, at this point we wont have any visibility into these drops correct? > > Looks like the pattern in UDP stack to handle this is to increment > > sk_drops and UDP_MIB_INERRORS. How about we do that here as well? > > We are not dropping the packet, the packet is cloned and deliver to > user-space via sk_psock_verdict_recv(), thus, we are simply leaking > the original skb, regardless of any error. Maybe udp_read_sock() > should check desc->error, but it has nothing to do with this path which > only aims to address a memory leak. A separate patch is need to check > desc->error, if really needed. > > Thanks. "We are not dropping the packet" you'll need to be more explicit on how this path is not dropping the skb, udp_read_sock() skb = skb_recv_udp() __skb_recv_udp() __skb_try_recv_from_queue() __skb_unlink() // skb is off the queue used = recv_actor() sk_psock_verdict_recv() return 0; if (used <= 0) { kfree(skb) // skb is unlink'd and kfree'd break; return 0 So if in the error case the skb is unlink'd from the queue and kfree'd where is it still existing, how do we get it back? It sure looks dropped to me. Yes, the kfree() is needed to not leak it, but I'm saying we don't want to drop packets silently. The convention in UDP space looks to be inc sk->sk_drop and inc the MIB. When we have to debug this on deployed systems and packets silently get dropped its going to cause lots of pain so lets be sure we get the counters correct. .John
On Thu, May 20, 2021 at 10:43 AM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > Cong Wang wrote: > > > > On Wed, May 19, 2021 at 12:06 PM John Fastabend > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > Cong Wang wrote: > > > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone > > > > > > > > > > afterward, so udp_read_sock() should free the original skb after > > > > > > > > > > done using it. > > > > > > > > > > > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0. > > > > > > > > > > > > > > > > Sure, in case of error, no one uses the original skb either, > > > > > > > > so still need to free it. > > > > > > > > > > > > > > But the data is going to be dropped then. I'm questioning if this > > > > > > > is the best we can do or not. Its simplest sure, but could we > > > > > > > do a bit more work and peek those skbs or requeue them? Otherwise > > > > > > > if you cross memory limits for a bit your likely to drop these > > > > > > > unnecessarily. > > > > > > > > > > > > What are the benefits of not dropping it? When sockmap takes > > > > > > over sk->sk_data_ready() it should have total control over the skb's > > > > > > in the receive queue. Otherwise user-space recvmsg() would race > > > > > > with sockmap when they try to read the first skb at the same time, > > > > > > therefore potentially user-space could get duplicated data (one via > > > > > > recvmsg(), one via sockmap). I don't see any benefits but races here. > > > > > > > > > > The benefit of _not_ dropping it is the packet gets to the receiver > > > > > side. We've spent a bit of effort to get a packet across the network, > > > > > received on the stack, and then we drop it at the last point is not > > > > > so friendly. > > > > > > > > Well, at least udp_recvmsg() could drop packets too in various > > > > scenarios, for example, a copy error. So, I do not think sockmap > > > > is special. > > > > > > OK I am at least convinced now that dropping packets is OK and likely > > > a useful performance/complexity compromise. > > > > > > But, at this point we wont have any visibility into these drops correct? > > > Looks like the pattern in UDP stack to handle this is to increment > > > sk_drops and UDP_MIB_INERRORS. How about we do that here as well? > > > > We are not dropping the packet, the packet is cloned and deliver to > > user-space via sk_psock_verdict_recv(), thus, we are simply leaking > > the original skb, regardless of any error. Maybe udp_read_sock() > > should check desc->error, but it has nothing to do with this path which > > only aims to address a memory leak. A separate patch is need to check > > desc->error, if really needed. > > > > Thanks. > > "We are not dropping the packet" you'll need to be more explicit on > how this path is not dropping the skb, You know it is cloned, don't you? So if we clone an skb and deliver the clone then free the original, what is dropped here? Absolutely nothing. By "drop", we clearly mean nothing is delivered. Or do you have any different definition of "drop"? > > udp_read_sock() > skb = skb_recv_udp() > __skb_recv_udp() > __skb_try_recv_from_queue() > __skb_unlink() // skb is off the queue > used = recv_actor() > sk_psock_verdict_recv() Why do you intentionally ignore the fact the skb is cloned and the clone is delivered?? > return 0; > if (used <= 0) { > kfree(skb) // skb is unlink'd and kfree'd Why do you ignore the other kfree_skb() I added in this patch? Which is clearly on the non-error path. This is why I said the skb needs to be freed _regardless_ of error or not. You just keep ignoring it. > break; > return 0 > > So if in the error case the skb is unlink'd from the queue and > kfree'd where is it still existing, how do we get it back? It > sure looks dropped to me. Yes, the kfree() is needed to not > leak it, but I'm saying we don't want to drop packets silently. See above, you clearly ignored the other kfree_skb() which is on non-error path. > The convention in UDP space looks to be inc sk->sk_drop and inc > the MIB. When we have to debug this on deployed systems and > packets silently get dropped its going to cause lots of pain so > lets be sure we get the counters correct. Sure, let me quote what I already said: " A separate patch is need to check desc->error, if really needed." This patch, as its subject tells, aims to address a memory leak, not to address error counters. BTW, TCP does not increase error counters either, yet another reason it deserves a separate patch to address both. Thanks.
Cong Wang wrote: > On Thu, May 20, 2021 at 10:43 AM John Fastabend > <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > > > Cong Wang wrote: > > > > > On Wed, May 19, 2021 at 12:06 PM John Fastabend > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > Cong Wang wrote: > > > > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend > > > > > > > > > <john.fastabend@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > Cong Wang wrote: > > > > > > > > > > > From: Cong Wang <cong.wang@bytedance.com> > > > > > > > > > > > [...] > > > > "We are not dropping the packet" you'll need to be more explicit on > > how this path is not dropping the skb, > > You know it is cloned, don't you? So if we clone an skb and deliver > the clone then free the original, what is dropped here? Absolutely > nothing. > > By "drop", we clearly mean nothing is delivered. Or do you have > any different definition of "drop"? This is my meaning of 'drop'. > > > > > udp_read_sock() > > skb = skb_recv_udp() > > __skb_recv_udp() > > __skb_try_recv_from_queue() > > __skb_unlink() // skb is off the queue > > used = recv_actor() > > sk_psock_verdict_recv() > > Why do you intentionally ignore the fact the skb is cloned > and the clone is delivered?? I'm only concerned about the error case here. > > > return 0; > > if (used <= 0) { > > kfree(skb) // skb is unlink'd and kfree'd > > Why do you ignore the other kfree_skb() I added in this patch? > Which is clearly on the non-error path. This is why I said the > skb needs to be freed _regardless_ of error or not. You just > keep ignoring it. Agree it needs a kfree_skb in both cases I'm not ignoring it. My issue with this fix is it is not complete. We need some counter to increment the udp stats so we can learn about these drops. And it looks like an extra two lines will get us that. > > > > break; > > return 0 > > > > So if in the error case the skb is unlink'd from the queue and > > kfree'd where is it still existing, how do we get it back? It > > sure looks dropped to me. Yes, the kfree() is needed to not > > leak it, but I'm saying we don't want to drop packets silently. > > See above, you clearly ignored the other kfree_skb() which is > on non-error path. Ignored in this thread because its correct and looks good to me. My only issue is with the error path. > > > The convention in UDP space looks to be inc sk->sk_drop and inc > > the MIB. When we have to debug this on deployed systems and > > packets silently get dropped its going to cause lots of pain so > > lets be sure we get the counters correct. > > Sure, let me quote what I already said: > " A separate patch is need to check desc->error, if really needed." Check desc->error or even just used <= 0 is sufficient here. Yes it is really needed I don't want to debug this in a prod environment without it. > > This patch, as its subject tells, aims to address a memory leak, not > to address error counters. BTW, TCP does not increase error OK either add the counters in this patch or as a series of two patches so we get a complete fix in one series. Without it some box out there will randomly drop UDP packets, probably DNS packets for extra fun, and we will have no way of knowing other than sporadic packet loss. Unless your arguing against having the counters or that the counters don't make sense for some reason? > counters either, yet another reason it deserves a separate patch > to address both. TCP case is different if we drop packets in a TCP error case thats not a 'lets increment the counters' problem the drop needs to be fixed we can't let data fall out of a TCP stream because no one will retransmit it. We've learned this the hard way. > > Thanks.
On Fri, May 21, 2021 at 3:09 PM John Fastabend <john.fastabend@gmail.com> wrote: > OK either add the counters in this patch or as a series of two > patches so we get a complete fix in one series. Without it some > box out there will randomly drop UDP packets, probably DNS > packets for extra fun, and we will have no way of knowing other > than sporadic packet loss. Unless your arguing against having the > counters or that the counters don't make sense for some reason? I never object increasing any counter here, My argument is it belongs to a separate patch for 3 reasons: 1) TCP does not have one either, hence needs to fix together; 2) A patch should fix one bug, not two or more bugs together; 3) It is not the only one place which needs to increase the counter, all of these kfree_skb()'s need, for example, this one inside sk_psock_verdict_recv(): psock = sk_psock(sk); if (unlikely(!psock)) { len = 0; kfree_skb(skb); goto out; } This example also shows it is harder to do so, because sk_psock_verdict_recv() is independent of any protocol, it is hard to increase a protocol-specific counter there. (Another one is in sk_psock_verdict_apply().) > > > counters either, yet another reason it deserves a separate patch > > to address both. > > TCP case is different if we drop packets in a TCP error case > thats not a 'lets increment the counters' problem the drop needs > to be fixed we can't let data fall out of a TCP stream because > no one will retransmit it. We've learned this the hard way. I think TCP always increases some counter when dropping a packet despite of retransmission, for example: static void tcp_drop(struct sock *sk, struct sk_buff *skb) { sk_drops_add(sk, skb); __kfree_skb(skb); } Thanks.
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 15f5504adf5b..e31d67fd5183 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc, if (used <= 0) { if (!copied) copied = used; + kfree_skb(skb); break; } else if (used <= skb->len) { copied += used; } + kfree_skb(skb); if (!desc->count) break; }