diff mbox series

[bpf-next,v4,1/2] net: bpf: Check SKB ownership against full socket.

Message ID 20230624014600.576756-2-kuifeng@meta.com (mailing list archive)
State Accepted
Commit 223f5f79f2ce8facd9d77dd44e9f403343630bfc
Delegated to: BPF
Headers show
Series Fix missing synack in BPF cgroup_skb filters | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-31 success Logs for veristat
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1070 this patch: 1070
netdev/cc_maintainers success CCed 12 of 12 maintainers
netdev/build_clang success Errors and warnings before: 142 this patch: 142
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1077 this patch: 1077
netdev/checkpatch warning WARNING: From:/Signed-off-by: email address mismatch: 'From: Kui-Feng Lee <thinker.li@gmail.com>' != 'Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>'
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ${{ matrix.test }} on ${{ matrix.arch }} with ${{ matrix.toolchain_full }}
bpf/vmtest-bpf-next-VM_Test-2 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-5 fail Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 fail Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-8 success Logs for veristat

Commit Message

Kui-Feng Lee June 24, 2023, 1:45 a.m. UTC
Check SKB ownership of an SKB against full sockets instead of request_sock.

The filters were called only if an SKB is owned by the sock that the SKB is
sent out through.  In another words, skb->sk should point to the sock that
it is sending through its egress.  However, the filters would miss SYN/ACK
SKBs that they are owned by a request_sock but sent through the listener
sock, that is the socket listening incoming connections.

However, the listener socket is also the full socket of the request socket.
We should use the full socket as the owner socket of an SKB instead.

What is the ownership check for?
====================================

BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
ownership of an SKB. Alexei referred to a mailing list conversation that
took place a few years ago. [1] In that conversation, Daniel Borkmann
stated that:

    Wouldn't that mean however, when you go through stacked devices that
    you'd run the same eBPF cgroup program for skb->sk multiple times?

According to what Daniel said, the ownership check mentioned earlier
presumably prevents multiple calls of egress filters caused by an skb.

A test that reproduce this scenario shows that the BPF cgroup egress
programs can be called multiple times for one SKB if this ownership
check is not there.  So, we can not just remove this check.

Test Stacked Devices
=======================

We use L2TP to build an environment of stacked devices. L2TP (Layer 2
Tunneling Protocol) is a tunneling protocol used to support virtual private
networks (VPNs). It relays encapsulated packets; for example in UDP, to its
peer by using a socket.

Using L2TP, packets are first sent through the IP stack and should then
arrive at an L2TP device. The device will expand its SKB header to
encapsulate the packet. The SKB will be sent back to the IP stack using the
socket that was made for the L2TP session. After that, the routing process
will occur once more, but this time for a new destination.

We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
using L2TP.  The run_ping() function in l2tp.sh is where the main change
occurred.

    run_ping()
    {
        local desc="$1"

        sleep 10
        run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1
        fc00:101::2
        log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
        sleep 10

    }

The test will use L2TP devices to send PING messages. These messages will
have a message size of 227 bytes as a special label to distinguish them.
This is not an ideal solution, but works.

During the execution of the test script, bpftrace was attached to
ip6_finish_output() and l2tp_xmit_skb(). BPF

    bpftrace -e '
      kfunc:ip6_finish_output {
        time("%H:%M:%S: ");
        printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p
                skb->sk=%p\n", args->skb, args->skb->len,
               args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
      kfunc:l2tp_xmit_skb {
        time("%H:%M:%S: ");
        printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
	       args->session->tunnel->sock); }'

The following is part of the output messages printed by bpftrace.

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
              cgroup=0xffff88810741f800 sk=0xffff888105f3b900
              skb->sk=0xffff888105f3b900

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103dd6300
              skb->sk=0xffff888105f3b900

    16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
              cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
              skb->sk=0xffff888101fe0000

    16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=0xffff88810741f800 sk=0xffff888103483180
              skb->sk=0xffff888101fe0000

    16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
              cgroup=(nil) sk=(nil) skb->sk=(nil)

The first four entries describe a PING message that was sent using the ping
command, whereas the following four entries describe the response received.
Multiple sockets are used to send one skb, including the socket used by the
L2TP session. This can be observed.

Based on this information, it seems that the ownership check is designed to
avoid multiple calls of egress filters caused by a single skb.

[1] https://lore.kernel.org/all/58193E9D.7040201@iogearbox.net/,

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf-cgroup.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Daniel Borkmann June 30, 2023, 2:12 p.m. UTC | #1
On 6/24/23 3:45 AM, Kui-Feng Lee wrote:
> Check SKB ownership of an SKB against full sockets instead of request_sock.
> 
> The filters were called only if an SKB is owned by the sock that the SKB is
> sent out through.  In another words, skb->sk should point to the sock that
> it is sending through its egress.  However, the filters would miss SYN/ACK
> SKBs that they are owned by a request_sock but sent through the listener
> sock, that is the socket listening incoming connections.
> 
> However, the listener socket is also the full socket of the request socket.
> We should use the full socket as the owner socket of an SKB instead.
> 
> What is the ownership check for?
> ====================================
> 
> BPF_CGROUP_RUN_PROG_INET_EGRESS() checked sk == skb->sk to ensure the
> ownership of an SKB. Alexei referred to a mailing list conversation that
> took place a few years ago. [1] In that conversation, Daniel Borkmann
> stated that:
> 
>      Wouldn't that mean however, when you go through stacked devices that
>      you'd run the same eBPF cgroup program for skb->sk multiple times?
> 
> According to what Daniel said, the ownership check mentioned earlier
> presumably prevents multiple calls of egress filters caused by an skb.
> 
> A test that reproduce this scenario shows that the BPF cgroup egress
> programs can be called multiple times for one SKB if this ownership
> check is not there.  So, we can not just remove this check.
> 
> Test Stacked Devices
> =======================
> 
> We use L2TP to build an environment of stacked devices. L2TP (Layer 2
> Tunneling Protocol) is a tunneling protocol used to support virtual private
> networks (VPNs). It relays encapsulated packets; for example in UDP, to its
> peer by using a socket.
> 
> Using L2TP, packets are first sent through the IP stack and should then
> arrive at an L2TP device. The device will expand its SKB header to
> encapsulate the packet. The SKB will be sent back to the IP stack using the
> socket that was made for the L2TP session. After that, the routing process
> will occur once more, but this time for a new destination.
> 
> We changed tools/testing/selftests/net/l2tp.sh to set up a test environment
> using L2TP.  The run_ping() function in l2tp.sh is where the main change
> occurred.
> 
>      run_ping()
>      {
>          local desc="$1"
> 
>          sleep 10
>          run_cmd host-1 ${ping6} -s 227 -c 4 -i 10 -I fc00:101::1
>          fc00:101::2
>          log_test $? 0 "IPv6 route through L2TP tunnel ${desc}"
>          sleep 10
> 
>      }
> 
> The test will use L2TP devices to send PING messages. These messages will
> have a message size of 227 bytes as a special label to distinguish them.
> This is not an ideal solution, but works.
> 
> During the execution of the test script, bpftrace was attached to
> ip6_finish_output() and l2tp_xmit_skb(). BPF
> 
>      bpftrace -e '
>        kfunc:ip6_finish_output {
>          time("%H:%M:%S: ");
>          printf("ip6_finish_output skb=%p skb->len=%d cgroup=%p sk=%p
>                  skb->sk=%p\n", args->skb, args->skb->len,
>                 args->sk->sk_cgrp_data.cgroup, args->sk, args->skb->sk); }
>        kfunc:l2tp_xmit_skb {
>          time("%H:%M:%S: ");
>          printf("l2tp_xmit_skb skb=%p sk=%p\n", args->skb,
> 	       args->session->tunnel->sock); }'
> 
> The following is part of the output messages printed by bpftrace.
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=275
>                cgroup=0xffff88810741f800 sk=0xffff888105f3b900
>                skb->sk=0xffff888105f3b900
> 
>      16:35:20: l2tp_xmit_skb skb=0xffff888103d8e600 sk=0xffff888103dd6300
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
>                cgroup=0xffff88810741f800 sk=0xffff888103dd6300
>                skb->sk=0xffff888105f3b900
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e600 skb->len=337
>                cgroup=(nil) sk=(nil) skb->sk=(nil)
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=275
>                cgroup=0xffffffff837741d0 sk=0xffff888101fe0000
>                skb->sk=0xffff888101fe0000
> 
>      16:35:20: l2tp_xmit_skb skb=0xffff888103d8e000 sk=0xffff888103483180
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
>                cgroup=0xffff88810741f800 sk=0xffff888103483180
>                skb->sk=0xffff888101fe0000
> 
>      16:35:20: ip6_finish_output skb=0xffff888103d8e000 skb->len=337
>                cgroup=(nil) sk=(nil) skb->sk=(nil)
> 
> The first four entries describe a PING message that was sent using the ping
> command, whereas the following four entries describe the response received.
> Multiple sockets are used to send one skb, including the socket used by the
> L2TP session. This can be observed.
> 
> Based on this information, it seems that the ownership check is designed to
> avoid multiple calls of egress filters caused by a single skb.

Thanks for this investigation and adding these details to the commit log,
this is very useful to keep for the archive. I did some minor formatting
and pushed it out to bpf-next.
diff mbox series

Patch

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 57e9e109257e..8506690dbb9c 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -199,9 +199,9 @@  static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb)			       \
 ({									       \
 	int __ret = 0;							       \
-	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk && sk == skb->sk) { \
+	if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {		       \
 		typeof(sk) __sk = sk_to_full_sk(sk);			       \
-		if (sk_fullsock(__sk) &&				       \
+		if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&	       \
 		    cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))	       \
 			__ret = __cgroup_bpf_run_filter_skb(__sk, skb,	       \
 						      CGROUP_INET_EGRESS); \