diff mbox series

[bpf,2/4] bpf, sockmap: Fix race in ingress receive verdict with redirect to self

Message ID 20211011191647.418704-3-john.fastabend@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf, sockmap: fixes stress testing and regression | expand

Checks

Context Check Description
bpf/vmtest-bpf-PR fail PR summary
netdev/cover_letter success Series has a cover letter
netdev/fixes_present success Fixes tag present in non-next series
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers warning 13 maintainers not CCed: andrii@kernel.org edumazet@google.com ast@kernel.org songliubraving@fb.com yoshfuji@linux-ipv6.org yhs@fb.com davem@davemloft.net kafai@fb.com jakub@cloudflare.com kuba@kernel.org dsahern@kernel.org lmb@cloudflare.com kpsingh@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Fixes tag looks correct
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 71 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success No static functions without inline keyword in header files
bpf/vmtest-bpf fail VM_Test

Commit Message

John Fastabend Oct. 11, 2021, 7:16 p.m. UTC
A socket in a sockmap may have different combinations of programs
attached depending on configuration. There can be no programs in which
case the socket acts as a sink only. There can be a TX program in this
case a BPF program is attached to sending side, but no RX program is
attached. There can be an RX program only where sends have no BPF
program attached, but receives are hooked with BPF. And finally,
both TX and RX programs may be attached. Giving us the permutations,

 None, Tx, Rx, and TxRx

To date most of our use cases have been TX case being used as a fast
datapath to directly copy between local application and a userspace
proxy. Or Rx cases and TxRX applications that are operating an in
kernel based proxy. The traffic in the first case where we hook
applications into a userspace application looks like this,

  AppA  redirect   AppB
   Tx <-----------> Rx
   |                |
   +                +
   TCP <--> lo <--> TCP

In this case all traffic from AppA (after 3whs) is copied into the
AppB ingress queue and no traffic is ever on the TCP recieive_queue.

In the second case the application never receives, except in some
rare error cases, traffic on the actual user space socket. Instead
the send happens in the kernel.

           AppProxy       socket pool
       sk0 ------------->{sk1,sk2, skn}
        ^                      |
        |                      |
        |                      v
       ingress              lb egress
       TCP                  TCP

Here because traffic is never read off the socket with userspace
recv() APIs there is only ever one reader on the sk receive_queue.
Namely the BPF programs.

However, we've started to introduce a third configuration where the
BPF program on receive should process the data, but then the normal
case is to push the data into the receive queue of AppB.

       AppB
       recv()                (userspace)
     -----------------------
       tcp_bpf_recvmsg()     (kernel)
         |             |
         |             |
         |             |
       ingress_msgQ    |
         |             |
       RX_BPF          |
         |             |
         v             v
       sk->receive_queue


This is different from the App{A,B} redirect because traffic is
first received on the sk->receive_queue.

Now for the issue. The tcp_bpf_recvmsg() handler first checks the
ingress_msg queue for any data handled by the BPF rx program and
returned with PASS code so that it was enqueued on the ingress msg
queue. Then if no data exists on that queue it checks the socket
receive queue. Unfortunately, this is the same receive_queue the
BPF program is reading data off of. So we get a race. Its possible
for the recvmsg() hook to pull data off the receive_queue before
the BPF hook has a chance to read it. It typically happens when
an application is banging on recv() and getting EAGAINs. Until
they manage to race with the RX BPF program.

To fix this we note that before this patch at attach time when
the socket is loaded into the map we check if it needs a TX
program or just the base set of proto bpf hooks. Then it uses
the above general RX hook regardless of if we have a BPF program
attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And
to fix above race when an RX program is attached we use a new
hook that is nearly identical to the old one except now we
do not let the recv() call skip the RX BPF program. Now only
the BPF program pulls data from sk->receive_queue and recv()
only pulls data from the ingress msgQ post BPF program handling.

With this resolved our AppB from above has been up and running
for many hours without detecting any errors. We do this by
correlating counters in RX BPF events and the AppB to ensure
data is never skipping the BPF program. Selftests, was not
able to detect this because we only run them for a short
period of time on well ordered send/recvs so we don't get any
of the noise we see in real application environments.

Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 net/ipv4/tcp_bpf.c | 47 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

Comments

Jakub Sitnicki Oct. 19, 2021, 9:16 a.m. UTC | #1
On Mon, Oct 11, 2021 at 09:16 PM CEST, John Fastabend wrote:
> A socket in a sockmap may have different combinations of programs
> attached depending on configuration. There can be no programs in which
> case the socket acts as a sink only. There can be a TX program in this
> case a BPF program is attached to sending side, but no RX program is
> attached. There can be an RX program only where sends have no BPF
> program attached, but receives are hooked with BPF. And finally,
> both TX and RX programs may be attached. Giving us the permutations,
>
>  None, Tx, Rx, and TxRx
>
> To date most of our use cases have been TX case being used as a fast
> datapath to directly copy between local application and a userspace
> proxy. Or Rx cases and TxRX applications that are operating an in
> kernel based proxy. The traffic in the first case where we hook
> applications into a userspace application looks like this,
>
>   AppA  redirect   AppB
>    Tx <-----------> Rx
>    |                |
>    +                +
>    TCP <--> lo <--> TCP
>
> In this case all traffic from AppA (after 3whs) is copied into the
> AppB ingress queue and no traffic is ever on the TCP recieive_queue.
>
> In the second case the application never receives, except in some
> rare error cases, traffic on the actual user space socket. Instead
> the send happens in the kernel.
>
>            AppProxy       socket pool
>        sk0 ------------->{sk1,sk2, skn}
>         ^                      |
>         |                      |
>         |                      v
>        ingress              lb egress
>        TCP                  TCP
>
> Here because traffic is never read off the socket with userspace
> recv() APIs there is only ever one reader on the sk receive_queue.
> Namely the BPF programs.
>
> However, we've started to introduce a third configuration where the
> BPF program on receive should process the data, but then the normal
> case is to push the data into the receive queue of AppB.
>
>        AppB
>        recv()                (userspace)
>      -----------------------
>        tcp_bpf_recvmsg()     (kernel)
>          |             |
>          |             |
>          |             |
>        ingress_msgQ    |
>          |             |
>        RX_BPF          |
>          |             |
>          v             v
>        sk->receive_queue
>
>
> This is different from the App{A,B} redirect because traffic is
> first received on the sk->receive_queue.
>
> Now for the issue. The tcp_bpf_recvmsg() handler first checks the
> ingress_msg queue for any data handled by the BPF rx program and
> returned with PASS code so that it was enqueued on the ingress msg
> queue. Then if no data exists on that queue it checks the socket
> receive queue. Unfortunately, this is the same receive_queue the
> BPF program is reading data off of. So we get a race. Its possible
> for the recvmsg() hook to pull data off the receive_queue before
> the BPF hook has a chance to read it. It typically happens when
> an application is banging on recv() and getting EAGAINs. Until
> they manage to race with the RX BPF program.
>
> To fix this we note that before this patch at attach time when
> the socket is loaded into the map we check if it needs a TX
> program or just the base set of proto bpf hooks. Then it uses
> the above general RX hook regardless of if we have a BPF program
> attached at rx or not. This patch now extends this check to
> handle all cases enumerated above, TX, RX, TXRX, and none. And
> to fix above race when an RX program is attached we use a new
> hook that is nearly identical to the old one except now we
> do not let the recv() call skip the RX BPF program. Now only
> the BPF program pulls data from sk->receive_queue and recv()
> only pulls data from the ingress msgQ post BPF program handling.
>
> With this resolved our AppB from above has been up and running
> for many hours without detecting any errors. We do this by
> correlating counters in RX BPF events and the AppB to ensure
> data is never skipping the BPF program. Selftests, was not
> able to detect this because we only run them for a short
> period of time on well ordered send/recvs so we don't get any
> of the noise we see in real application environments.
>
> Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
diff mbox series

Patch

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 35dcfb04f53d..0cc420c0e259 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -185,6 +185,41 @@  static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
 	return ret;
 }
 
+static int tcp_bpf_recvmsg_parser(struct sock *sk,
+				  struct msghdr *msg,
+				  size_t len,
+				  int nonblock,
+				  int flags,
+				  int *addr_len)
+{
+	struct sk_psock *psock;
+	int copied;
+
+	if (unlikely(flags & MSG_ERRQUEUE))
+		return inet_recv_error(sk, msg, len, addr_len);
+
+	psock = sk_psock_get(sk);
+	if (unlikely(!psock))
+		return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
+
+	lock_sock(sk);
+msg_bytes_ready:
+	copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
+	if (!copied) {
+		long timeo;
+		int data;
+
+		timeo = sock_rcvtimeo(sk, nonblock);
+		data = tcp_msg_wait_data(sk, psock, timeo);
+		if (data && !sk_psock_queue_empty(psock))
+			goto msg_bytes_ready;
+		copied = -EAGAIN;
+	}
+	release_sock(sk);
+	sk_psock_put(sk, psock);
+	return copied;
+}
+
 static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		    int nonblock, int flags, int *addr_len)
 {
@@ -465,6 +500,8 @@  enum {
 enum {
 	TCP_BPF_BASE,
 	TCP_BPF_TX,
+	TCP_BPF_RX,
+	TCP_BPF_TXRX,
 	TCP_BPF_NUM_CFGS,
 };
 
@@ -483,6 +520,12 @@  static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
 	prot[TCP_BPF_TX]			= prot[TCP_BPF_BASE];
 	prot[TCP_BPF_TX].sendmsg		= tcp_bpf_sendmsg;
 	prot[TCP_BPF_TX].sendpage		= tcp_bpf_sendpage;
+
+	prot[TCP_BPF_RX]			= prot[TCP_BPF_BASE];
+	prot[TCP_BPF_RX].recvmsg		= tcp_bpf_recvmsg_parser;
+
+	prot[TCP_BPF_TXRX]			= prot[TCP_BPF_TX];
+	prot[TCP_BPF_TXRX].recvmsg		= tcp_bpf_recvmsg_parser;
 }
 
 static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
@@ -520,6 +563,10 @@  int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 	int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
 	int config = psock->progs.msg_parser   ? TCP_BPF_TX   : TCP_BPF_BASE;
 
+	if (psock->progs.stream_verdict || psock->progs.skb_verdict) {
+		config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
+	}
+
 	if (restore) {
 		if (inet_csk_has_ulp(sk)) {
 			/* TLS does not have an unhash proto in SW cases,