Message ID | 20210510034433.52818-1-kuniyu@amazon.co.jp (mailing list archive) |
---|---|
Series | Socket migration for SO_REUSEPORT. |
On Sun, May 9, 2021 at 8:45 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
>
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, so the
> in-flight ACK of 3WHS is answered with an RST.
>
> To avoid such a situation, users have to know deeply how the kernel handles
> SYN packets and implement connection draining by eBPF [2]:
>
>   1. Stop routing SYN packets to the listener by eBPF.
>   2. Wait for all timers to expire to complete requests.
>   3. Accept connections until EAGAIN, then close the listener.
>
>   or
>
>   1. Start counting SYN packets and accept syscalls using the eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.
>
> Either way, we cannot close a listener immediately. However, ideally,
> the application need not drain the not yet accepted sockets because 3WHS
> and tying a connection to a listener are just the kernel behaviour. The
> root cause is within the kernel, so the issue should be addressed in kernel
> space and should not be visible to user space. This patchset fixes it so
> that users need not take care of kernel implementation and connection
> draining. With this patchset, the kernel redistributes requests and
> connections from a listener to the others in the same reuseport group
> at/after close or shutdown syscalls.
>
> Although some software does connection draining, there are still merits in
> migration. For some security reasons, such as replacing TLS certificates,
> we may want to apply new settings as soon as possible and/or we may not be
> able to wait for connection draining. The sockets in the accept queue have
> not started application sessions yet. So, if we do not drain such sockets,
> they can be handled by the newer listeners and could have a longer
> lifetime. It is difficult to drain all connections in every case, but we
> can decrease such aborted connections by migration. In that sense,
> migration is always better than draining.
>
> Moreover, auto-migration simplifies user space logic and also works well in
> a case where we cannot modify and build a server program to implement the
> workaround.
>
> Note that the source and destination listeners MUST have the same settings
> at the socket API level; otherwise, applications may face inconsistency and
> cause errors. In such a case, we have to use the eBPF program to select a
> specific listener or to cancel migration.
>
> Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
> snippets along the way.
>
>
> Link:
>  [1] The SO_REUSEPORT socket option
>  https://lwn.net/Articles/542629/
>
>  [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
>  https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/
>
>
> Changelog:
>  v5:
>   * Move initialization of sk_node from 6th to 5th patch
>   * Initialize sk_refcnt in reqsk_clone()
>   * Modify some definitions in reqsk_timer_handler()
>   * Validate in which path/state migration happens in selftest
>
>  v4:
>  https://lore.kernel.org/bpf/20210427034623.46528-1-kuniyu@amazon.co.jp/
>   * Make some functions and variables 'static' in selftest
>   * Remove 'scalability' from the cover letter
>
>  v3:
>  https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
>   * Add sysctl back for reuseport_grow()
>   * Add helper functions to manage socks[]
>   * Separate migration related logic into functions: reuseport_resurrect(),
>     reuseport_stop_listen_sock(), reuseport_migrate_sock()
>   * Clone request_sock to be migrated
>   * Migrate request one by one
>   * Pass child socket to eBPF prog
>
>  v2:
>  https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
>   * Do not save closed sockets in socks[]
>   * Revert 607904c357c61adf20b8fd18af765e501d61a385
>   * Extract inet_csk_reqsk_queue_migrate() into a single patch
>   * Change the spin_lock order to avoid lockdep warning
>   * Add static to __reuseport_select_sock
>   * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
>   * Set the default attach type in bpf_prog_load_check_attach()
>   * Define new proto of BPF_FUNC_get_socket_cookie
>   * Fix test to be compiled successfully
>   * Update commit messages
>
>  v1:
>  https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
>   * Remove the sysctl option
>   * Enable migration if eBPF program is not attached
>   * Add expected_attach_type to check if eBPF program can migrate sockets
>   * Add a field to tell migration type to eBPF program
>   * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
>   * Allocate an empty skb if skb is NULL
>   * Pass req_to_sk(req)->sk_hash because listener's hash is zero
>   * Update commit messages and cover letter
>
>  RFC:
>  https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/
>
>
> Kuniyuki Iwashima (11):
>   net: Introduce net.ipv4.tcp_migrate_req.
>   tcp: Add num_closed_socks to struct sock_reuseport.
>   tcp: Keep TCP_CLOSE sockets in the reuseport group.
>   tcp: Add reuseport_migrate_sock() to select a new listener.
>   tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
>   bpf: Support BPF_FUNC_get_socket_cookie() for
>     BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Support socket migration by eBPF.
>   libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
>
>  Documentation/networking/ip-sysctl.rst        |  20 +
>  include/linux/bpf.h                           |   1 +
>  include/linux/filter.h                        |   2 +
>  include/net/netns/ipv4.h                      |   1 +
>  include/net/request_sock.h                    |   2 +
>  include/net/sock_reuseport.h                  |   9 +-
>  include/uapi/linux/bpf.h                      |  16 +
>  kernel/bpf/syscall.c                          |  13 +
>  net/core/filter.c                             |  23 +-
>  net/core/request_sock.c                       |  39 ++
>  net/core/sock_reuseport.c                     | 337 +++++++++--
>  net/ipv4/inet_connection_sock.c               | 146 ++++-
>  net/ipv4/inet_hashtables.c                    |   2 +-
>  net/ipv4/sysctl_net_ipv4.c                    |   9 +
>  net/ipv4/tcp_ipv4.c                           |  20 +-
>  net/ipv6/tcp_ipv6.c                           |  14 +-
>  tools/include/uapi/linux/bpf.h                |  16 +
>  tools/lib/bpf/libbpf.c                        |   5 +-
>  tools/testing/selftests/bpf/network_helpers.c |   2 +-
>  tools/testing/selftests/bpf/network_helpers.h |   1 +
>  .../bpf/prog_tests/migrate_reuseport.c        | 532 ++++++++++++++++++
>  .../bpf/progs/test_migrate_reuseport.c        |  67 +++
>  22 files changed, 1217 insertions(+), 60 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
>
> --
> 2.30.2
>

One test is failing in CI ([0]), please take a look.

  [0] https://travis-ci.com/github/kernel-patches/bpf/builds/225784969
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Date: Thu, 13 May 2021 14:27:13 -0700
> On Sun, May 9, 2021 at 8:45 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
[...]
>
> One test is failing in CI ([0]), please take a look.
>
>   [0] https://travis-ci.com/github/kernel-patches/bpf/builds/225784969

Thank you for checking.

The test needs to drop SYN+ACK and currently it is done by iptables or
ip6tables. But it seems that I should not use them. Should this be done
by XDP?

---8<---
iptables v1.8.5 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
Perhaps iptables or your kernel needs to be upgraded.
ip6tables v1.8.5 (legacy): can't initialize ip6tables table `filter': Table does not exist (do you need to insmod?)
Perhaps ip6tables or your kernel needs to be upgraded.
---8<---
On Fri, May 14, 2021 at 08:23:00AM +0900, Kuniyuki Iwashima wrote:
> From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Date: Thu, 13 May 2021 14:27:13 -0700
> > On Sun, May 9, 2021 at 8:45 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > >
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation [1]. When a SYN packet is received, the connection is tied
> > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners on the same port could accept
> > > such connections.
> [...]
> >
> > One test is failing in CI ([0]), please take a look.
> >
> >   [0] https://travis-ci.com/github/kernel-patches/bpf/builds/225784969
>
> Thank you for checking.
>
> The test needs to drop SYN+ACK and currently it is done by iptables or
> ip6tables. But it seems that I should not use them. Should this be done
> by XDP?
or drop it at a bpf_prog@tc-egress.

I also don't have iptables in my kconfig and I had to add them
to run this test. None of the test_progs depends on iptables also.

>
> ---8<---
> iptables v1.8.5 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
> Perhaps iptables or your kernel needs to be upgraded.
> ip6tables v1.8.5 (legacy): can't initialize ip6tables table `filter': Table does not exist (do you need to insmod?)
> Perhaps ip6tables or your kernel needs to be upgraded.
> ---8<---
>
From: Martin KaFai Lau <kafai@fb.com>
Date: Thu, 13 May 2021 23:26:25 -0700
> On Fri, May 14, 2021 at 08:23:00AM +0900, Kuniyuki Iwashima wrote:
> > From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> > Date: Thu, 13 May 2021 14:27:13 -0700
> > > On Sun, May 9, 2021 at 8:45 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > >
> > > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > > accept connections evenly. However, there is a defect in the current
> > > > implementation [1]. When a SYN packet is received, the connection is tied
> > > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > > requests during the three-way handshake and child sockets in the accept
> > > > queue are dropped even if other listeners on the same port could accept
> > > > such connections.
> > [...]
> > >
> > > One test is failing in CI ([0]), please take a look.
> > >
> > >   [0] https://travis-ci.com/github/kernel-patches/bpf/builds/225784969
> >
> > Thank you for checking.
> >
> > The test needs to drop SYN+ACK and currently it is done by iptables or
                           ^^^^^^^
                           the final ACK of 3WHS

Sorry, this was a typo.

> > ip6tables. But it seems that I should not use them. Should this be done
> > by XDP?
> or drop it at a bpf_prog@tc-egress.
>
> I also don't have iptables in my kconfig and I had to add them
> to run this test. None of the test_progs depends on iptables also.

I'll rewrite the dropping part with XDP or TC.
Thank you.

> >
> > ---8<---
> > iptables v1.8.5 (legacy): can't initialize iptables table `filter': Table does not exist (do you need to insmod?)
> > Perhaps iptables or your kernel needs to be upgraded.
> > ip6tables v1.8.5 (legacy): can't initialize ip6tables table `filter': Table does not exist (do you need to insmod?)
> > Perhaps ip6tables or your kernel needs to be upgraded.
> > ---8<---
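
As a rough illustration of what the replacement for the iptables rule has to decide, the match for "the final ACK of 3WHS" can be written as plain user-space C over a raw frame. This is not the selftest's code and not an XDP or TC program. The function name is_bare_ack_to_port() is made up, the sketch assumes an untagged Ethernet + IPv4 frame, and it matches every segment with ACK set and SYN/FIN/RST clear that is addressed to the given port, so a real test would have to narrow it further (for example by only enabling the drop while the handshake is in progress).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN	14
#define ETHERTYPE_IPV4	0x0800
#define IP_PROTO_TCP	6
#define IP_MIN_HDR_LEN	20
#define TCP_MIN_HDR_LEN	20

#define TCP_FLAG_FIN	0x01
#define TCP_FLAG_SYN	0x02
#define TCP_FLAG_RST	0x04
#define TCP_FLAG_ACK	0x10

/* Read a big-endian (network order) 16-bit value. */
static uint16_t load_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: return true if the frame is an IPv4 TCP segment to
 * dst_port (host byte order) with ACK set and SYN, FIN and RST clear.
 * Every access is bounds-checked first, as a BPF program would also have
 * to do before touching packet data. */
bool is_bare_ack_to_port(const uint8_t *frame, size_t len, uint16_t dst_port)
{
	const uint8_t *ip, *tcp;
	size_t ihl;
	uint8_t flags;

	assert(frame != NULL);

	if (len < ETH_HDR_LEN + IP_MIN_HDR_LEN)
		return false;
	if (load_be16(frame + 12) != ETHERTYPE_IPV4)
		return false;

	ip = frame + ETH_HDR_LEN;
	if ((ip[0] >> 4) != 4)
		return false;

	/* The IPv4 header length is given in 32-bit words. */
	ihl = (size_t)(ip[0] & 0x0f) * 4;
	if (ihl < IP_MIN_HDR_LEN)
		return false;
	if (ip[9] != IP_PROTO_TCP)
		return false;

	/* Non-first fragments carry no TCP header. */
	if ((load_be16(ip + 6) & 0x1fff) != 0)
		return false;
	if (len < ETH_HDR_LEN + ihl + TCP_MIN_HDR_LEN)
		return false;

	tcp = ip + ihl;
	if (load_be16(tcp + 2) != dst_port)
		return false;

	flags = tcp[13];
	return (flags & TCP_FLAG_ACK) &&
	       !(flags & (TCP_FLAG_SYN | TCP_FLAG_FIN | TCP_FLAG_RST));
}
```

The SYN+ACK excluded by this match is what the earlier message named by mistake. The same checks would apply at XDP or at tc-egress, and an IPv6 variant would be needed to cover the ip6tables case.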