diff mbox series

[bpf-next,v4,1/2] net: bpf: Check SKB ownership against full socket.

Message ID 20230624014600.576756-2-kuifeng@meta.com (mailing list archive)
State Accepted
Commit 223f5f79f2ce8facd9d77dd44e9f403343630bfc
Headers show
Series Fix missing synack in BPF cgroup_skb filters | expand

Commit Message

Kui-Feng Lee June 24, 2023, 1:45 a.m. UTC
Check SKB ownership of an SKB against full sockets instead of request_sock.

The filters were called only if an SKB is owned by the sock that the SKB is
sent out through.  In another words, skb->sk should point to the sock that
it is sending through its egress.  However, the filters would miss SYN/ACK
SKBs that they are owned by a request_sock but sent through the listener
sock, that is the socket listening incoming connections.

However, the listener socket is also the full socket of the request socket.
We should use the full socket as the owner socket of an SKB instead.

What is the ownership check for?
====================================

BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
ownership of an SKB. Alexei referred to a mailing list conversation that
took place a few years ago. [1] In that conversation, Daniel Borkmann
stated that:

    Wouldn't that mean however, when you go through stacked devices that
    you'd run the same eBPF cgroup program for skb->sk multiple times?

According to what Daniel said, the ownership check mentioned earlier
presumably prevents multiple calls of egress filters caused by an skb.

A test that reproduce this scenario shows that the BPF cgroup egress
programs can be called multiple times for one SKB if this ownership
check is not there.  So, we can not just remove this check.

Test Stacked Devices
=======================

We use L2TP to build an environment of stacked devices. L2TP (Layer 2
Tunneling Protocol) is a tunneling protocol used to support virtual private
networks (VPNs). It relays encapsulated packets; for example in UDP, to its
peer by using a socket.

Using L2TP, packets are first sent through the IP stack and should then
arrive at an L2TP device. The device will expand its SKB header to
encapsulate the packet. The SKB will be sent back to the IP stack using the
socket that was made for the L2TP session. After that, the routing process
will occur once more, but this time for a new destination.

We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
using L2TP.  The run_ping() function in l2tp.sh is where the main change
occurred.

    run_ping()
    {
        local desc="$1"

        sleep 10
        run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1
        fc00:101::2
        log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
        sleep 10

    }

The test will use L2TP devices to send PING messages. These messages will
have a message size of 227 bytes as a special label to distinguish them.
This is not an ideal solution, but works.

During the execution of the test script, bpftrace was attached to
ip6_finish_output() and l2tp_xmit_skb(). BPF

    bpftrace -e '
      kfunc:ip6_finish_output {
        time("%H:%M:%S: ");
        printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p
                skb->sk=%p\n", args->skb, args->skb->len,
               args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
      kfunc:l2tp_xmit_skb {
        time("%H:%M:%S: ");
        printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
	       args->session->tunnel->sock); }'

The following is part of the output messages printed by bpftrace.

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
              cgroup=0xffff88810741f800 sk=0xffff888105f3b900
              skb->sk=0xffff888105f3b900

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103dd6300
              skb->sk=0xffff888105f3b900

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
              cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
              skb->sk=0xffff888101fe0000

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103483180
              skb->sk=0xffff888101fe0000

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

The first four entries describe a PING message that was sent using the ping
command, whereas the following four entries describe the response received.
Multiple sockets are used to send one skb, including the socket used by the
L2TP session. This can be observed.

Based on this information, it seems that the ownership check is designed to
avoid multiple calls of egress filters caused by a single skb.

[1] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/,

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Daniel Borkmann June 30, 2023, 2:12 p.m. UTC | #1
On 6/24/23 3:45 AM, Kui-Feng Lee wrote:
> Check SKB ownership of an SKB against full sockets instead of request_sock.
> 
> The filters were called only if an SKB is owned by the sock that the SKB is
> sent out through.  In another words, skb->sk should point to the sock that
> it is sending through its egress.  However, the filters would miss SYN/ACK
> SKBs that they are owned by a request_sock but sent through the listener
> sock, that is the socket listening incoming connections.
> 
> However, the listener socket is also the full socket of the request socket.
> We should use the full socket as the owner socket of an SKB instead.
> 
> What is the ownership check for?
> ====================================
> 
> BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
> ownership of an SKB. Alexei referred to a mailing list conversation that
> took place a few years ago. [1] In that conversation, Daniel Borkmann
> stated that:
> 
>      Wouldn't that mean however, when you go through stacked devices that
>      you'd run the same eBPF cgroup program for skb->sk multiple times?
> 
> According to what Daniel said, the ownership check mentioned earlier
> presumably prevents multiple calls of egress filters caused by an skb.
> 
> A test that reproduce this scenario shows that the BPF cgroup egress
> programs can be called multiple times for one SKB if this ownership
> check is not there.  So, we can not just remove this check.
> 
> Test Stacked Devices
> =======================
> 
> We use L2TP to build an environment of stacked devices. L2TP (Layer 2
> Tunneling Protocol) is a tunneling protocol used to support virtual private
> networks (VPNs). It relays encapsulated packets; for example in UDP, to its
> peer by using a socket.
> 
> Using L2TP, packets are first sent through the IP stack and should then
> arrive at an L2TP device. The device will expand its SKB header to
> encapsulate the packet. The SKB will be sent back to the IP stack using the
> socket that was made for the L2TP session. After that, the routing process
> will occur once more, but this time for a new destination.
> 
> We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
> using L2TP.  The run_ping() function in l2tp.sh is where the main change
> occurred.
> 
>      run_ping()
>      {
>          local desc="$1"
> 
>          sleep 10
>          run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1
>          fc00:101::2
>          log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
>          sleep 10
> 
>      }
> 
> The test will use L2TP devices to send PING messages. These messages will
> have a message size of 227 bytes as a special label to distinguish them.
> This is not an ideal solution, but works.
> 
> During the execution of the test script, bpftrace was attached to
> ip6_finish_output() and l2tp_xmit_skb(). BPF
> 
>      bpftrace -e '
>        kfunc:ip6_finish_output {
>          time("%H:%M:%S: ");
>          printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p
>                  skb->sk=%p\n", args->skb, args->skb->len,
>                 args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
>        kfunc:l2tp_xmit_skb {
>          time("%H:%M:%S: ");
>          printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
> 	       args->session->tunnel->sock); }'
> 
> The following is part of the output messages printed by bpftrace.
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
>                cgroup=0xffff88810741f800 sk=0xffff888105f3b900
>                skb->sk=0xffff888105f3b900
> 
>      16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
>                cgroup=0xffff88810741f800 sk=0xffff888103dd6300
>                skb->sk=0xffff888105f3b900
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
>                cgroup=(nil) sk=(nil) skb->sk=(nil)
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
>                cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
>                skb->sk=0xffff888101fe0000
> 
>      16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
>                cgroup=0xffff88810741f800 sk=0xffff888103483180
>                skb->sk=0xffff888101fe0000
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
>                cgroup=(nil) sk=(nil) skb->sk=(nil)
> 
> The first four entries describe a PING message that was sent using the ping
> command, whereas the following four entries describe the response received.
> Multiple sockets are used to send one skb, including the socket used by the
> L2TP session. This can be observed.
> 
> Based on this information, it seems that the ownership check is designed to
> avoid multiple calls of egress filters caused by a single skb.

Thanks for this investigation and adding these details to the commit log,
this is very useful to keep for the archive. I did some minor formatting
and pushed it out to bpf-next.
diff mbox series

Patch

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..8506690dbb9c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,9 +199,9 @@  static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
-		if (sk_fullsock(__sk) &&				       \
+		if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&	       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
 			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
 						      CGROUP_INET_EGRESS); \