[v3,bpf-next,00/11] Socket migration for SO_REUSEPORT.

Message ID	20210420154140.80034-1-kuniyu@amazon.co.jp (mailing list archive)
Headers	Return-Path: <bpf-owner@kernel.org> From: Kuniyuki Iwashima <kuniyu@amazon.co.jp> To: "David S . Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Eric Dumazet <edumazet@google.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <kafai@fb.com> CC: Benjamin Herrenschmidt <benh@amazon.com>, Kuniyuki Iwashima <kuniyu@amazon.co.jp>, Kuniyuki Iwashima <kuni1840@gmail.com>, <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [PATCH v3 bpf-next 00/11] Socket migration for SO_REUSEPORT. Date: Wed, 21 Apr 2021 00:41:29 +0900 Message-ID: <20210420154140.80034-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	Socket migration for SO_REUSEPORT. [v3,bpf-next,00/11] Socket migration for SO_REUSEPORT. [v3,bpf-next,01/11] net: Introduce net.ipv4.tcp_migrate_req. [v3,bpf-next,02/11] tcp: Add num_closed_socks to struct sock_reuseport. [v3,bpf-next,03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group. [v3,bpf-next,04/11] tcp: Add reuseport_migrate_sock() to select a new listener. [v3,bpf-next,05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. [v3,bpf-next,06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs. [v3,bpf-next,07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. [v3,bpf-next,08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT. [v3,bpf-next,09/11] bpf: Support socket migration by eBPF. [v3,bpf-next,10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT. [v3,bpf-next,11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Message

Iwashima, Kuniyuki April 20, 2021, 3:41 p.m. UTC

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation [1]. When a SYN packet is received, the connection is tied
to a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.

This situation can happen when server management tools restart server
processes (such as nginx). For instance, when we change the nginx
configuration and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, so the in-flight
ACK of the 3WHS is answered with an RST.

The SO_REUSEPORT option is excellent for improving scalability. As a
trade-off, however, users have to know in depth how the kernel handles SYN
packets and implement connection draining with eBPF [2] in one of two ways:

1. Stop routing SYN packets to the listener by eBPF.
2. Wait for all timers to expire to complete requests.
3. Accept connections until EAGAIN, then close the listener.

or

1. Start counting SYN packets and accept syscalls using the eBPF map.
2. Stop routing SYN packets.
3. Accept connections up to the count, then close the listener.

Either way, we cannot close a listener immediately. Ideally, however, the
application should not need to drain the not-yet-accepted sockets, because
the 3WHS and the tying of a connection to a listener are purely kernel
behaviour. The root cause is within the kernel, so the issue should be
addressed in kernel space and should not be visible to user space. This
patchset fixes it so that users need not take care of the kernel
implementation and connection draining. With this patchset, the kernel
redistributes requests and connections from a listener to the others in the
same reuseport group at/after close or shutdown syscalls.
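Patch 01 of the series introduces the net.ipv4.tcp_migrate_req sysctl. As a hedged sketch, it could be toggled from C as shown below. The procfs path and the meaning of the value (1 to enable) are assumptions here, based on the usual sysctl-to-procfs mapping; the authoritative description is the ip-sysctl.rst change in the series. The helper names are hypothetical, and writing the real file requires sufficient privileges.

```c
#include <stdio.h>

/* Assumed procfs location of net.ipv4.tcp_migrate_req. */
#define TCP_MIGRATE_REQ_PATH "/proc/sys/net/ipv4/tcp_migrate_req"

/* Hypothetical helper: write an integer to a sysctl-style file.
 * Returns 0 on success, -1 on failure. */
int write_sysctl_int(const char *path, int value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", value) > 0;
    if (fclose(f) != 0)     /* write errors may only surface on close */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: read an integer back from a sysctl-style file.
 * Returns 0 on success, -1 on failure. */
int read_sysctl_int(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = fscanf(f, "%d", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would use write_sysctl_int(TCP_MIGRATE_REQ_PATH, 1) and check the return value, since the file does not exist on kernels without this series.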

Although some software does connection draining, there are still merits in
migration. For security reasons, such as replacing TLS certificates, we may
want to apply new settings as soon as possible and/or we may not be able to
wait for connection draining. The sockets in the accept queue have not
started application sessions yet. So, if we do not drain such sockets, they
can be handled by the newer listeners and could have a longer lifetime. It
is difficult to drain all connections in every case, but we can reduce the
number of such aborted connections by migration. In that sense, migration
is always better than draining.

Moreover, auto-migration simplifies user space logic and also works well in
a case where we cannot modify and build a server program to implement the
workaround.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
cause errors. In such a case, we have to use an eBPF program to select a
specific listener or to cancel migration.

Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
snippets along the way.

Link:
[1] The SO_REUSEPORT socket option
https://lwn.net/Articles/542629/

[2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/

Changelog:
v3:
* Add sysctl back for reuseport_grow()
* Add helper functions to manage socks[]
* Separate migration related logic into functions: reuseport_resurrect(),
reuseport_stop_listen_sock(), reuseport_migrate_sock()
* Clone request_sock to be migrated
* Migrate request one by one
* Pass child socket to eBPF prog

v2:
https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
* Do not save closed sockets in socks[]
* Revert 607904c357c61adf20b8fd18af765e501d61a385
* Extract inet_csk_reqsk_queue_migrate() into a single patch
* Change the spin_lock order to avoid lockdep warning
* Add static to __reuseport_select_sock
* Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
* Set the default attach type in bpf_prog_load_check_attach()
* Define new proto of BPF_FUNC_get_socket_cookie
* Fix test to be compiled successfully
* Update commit messages

v1:
https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
* Remove the sysctl option
* Enable migration if eBPF program is not attached
* Add expected_attach_type to check if eBPF program can migrate sockets
* Add a field to tell migration type to eBPF program
* Support BPF_FUNC_get_socket_cookie to get the cookie of sk
* Allocate an empty skb if skb is NULL
* Pass req_to_sk(req)->sk_hash because listener's hash is zero
* Update commit messages and cover letter

RFC:
https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/

Kuniyuki Iwashima (11):
net: Introduce net.ipv4.tcp_migrate_req.
tcp: Add num_closed_socks to struct sock_reuseport.
tcp: Keep TCP_CLOSE sockets in the reuseport group.
tcp: Add reuseport_migrate_sock() to select a new listener.
tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
bpf: Support BPF_FUNC_get_socket_cookie() for
BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Support socket migration by eBPF.
libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Documentation/networking/ip-sysctl.rst | 20 +
include/linux/bpf.h | 1 +
include/linux/filter.h | 2 +
include/net/netns/ipv4.h | 1 +
include/net/request_sock.h | 2 +
include/net/sock_reuseport.h | 9 +-
include/uapi/linux/bpf.h | 16 +
kernel/bpf/syscall.c | 13 +
net/core/filter.c | 23 +-
net/core/request_sock.c | 38 ++
net/core/sock_reuseport.c | 337 ++++++++++--
net/ipv4/inet_connection_sock.c | 147 +++++-
net/ipv4/inet_hashtables.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 9 +
net/ipv4/tcp_ipv4.c | 20 +-
net/ipv6/tcp_ipv6.c | 14 +-
tools/include/uapi/linux/bpf.h | 16 +
tools/lib/bpf/libbpf.c | 5 +-
tools/testing/selftests/bpf/network_helpers.c | 2 +-
tools/testing/selftests/bpf/network_helpers.h | 1 +
.../bpf/prog_tests/migrate_reuseport.c | 483 ++++++++++++++++++
.../bpf/progs/test_migrate_reuseport.c | 51 ++
22 files changed, 1150 insertions(+), 62 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c

Comments

Eric Dumazet April 20, 2021, 4:43 p.m. UTC | #1

On 4/20/21 5:41 PM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.
> 
> The SO_REUSEPORT option is excellent to improve scalability.

This was before the SYN processing was made lockless.

I really wonder if we still need SO_REUSEPORT for TCP ?

Eventually a new accept() system call where different threads
can express how they want to choose the children sockets would
be less invasive.

Instead of having many listeners, have one listener and eventually multiple
accept queues to improve scalability of accept() phase.

Iwashima, Kuniyuki April 21, 2021, 11:30 a.m. UTC | #2

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Tue, 20 Apr 2021 18:43:36 +0200
> On 4/20/21 5:41 PM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
> > 
> > The SO_REUSEPORT option is excellent to improve scalability.
> 
> This was before the SYN processing was made lockless.
>
> I really wonder if we still need SO_REUSEPORT for TCP ?

I'm sorry, this might be misleading. That was an old topic in v3.5. Also,
scalability and performance are not the primary reasons to use SO_REUSEPORT
nowadays.

There are cases which need SO_REUSEPORT for other reasons.

If servers take both UDP and TCP requests (for example, proxy of QUIC and
HTTP2), it is nice to have the same eBPF mechanism to handle UDP and TCP.

Also, regarding configuration reloads, some applications want to keep
things simple by reloading the configuration through process replacement.

Then, even with the new accept() syscall, I think migration (of the queue
or of the children) would still be needed. If it worked like fd passing, it
might not work when the process died in the middle of fd passing.

So, I think it is better to do migration in kernel without interaction with
the old process.

In this respect, SO_REUSEPORT is good because we can bind a new process
without interaction with the old process. And with this patchset, we can
migrate requests by calling close()/shutdown() on the old listener.

> 
> Eventually a new accept() system call where different threads
> can express how they want to choose the children sockets would
> be less invasive.
> 
> Instead of having many listeners, have one listener and eventually multiple
> accept queues to improve scalability of accept() phase.

It sounds interesting. Could you elaborate on the idea?

And sorry, I couldn't quite understand what "invasive" means here. Does it
mean the new accept() would need fewer changes, have a simpler API, or
something else?

Also, I wonder if the new accept() would have flexibility similar to what
eBPF offers.