[v4,bpf-next,00/11] Socket migration for SO_REUSEPORT.

Message ID	20210427034623.46528-1-kuniyu@amazon.co.jp (mailing list archive)
Headers	show Return-Path: <bpf-owner@kernel.org> From: Kuniyuki Iwashima <kuniyu@amazon.co.jp> To: "David S . Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Eric Dumazet <edumazet@google.com>, Alexei Starovoitov <ast@kernel.org>, Daniel Borkmann <daniel@iogearbox.net>, Andrii Nakryiko <andrii@kernel.org>, Martin KaFai Lau <kafai@fb.com> CC: Benjamin Herrenschmidt <benh@amazon.com>, Kuniyuki Iwashima <kuniyu@amazon.co.jp>, Kuniyuki Iwashima <kuni1840@gmail.com>, <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org> Subject: [PATCH v4 bpf-next 00/11] Socket migration for SO_REUSEPORT. Date: Tue, 27 Apr 2021 12:46:12 +0900 Message-ID: <20210427034623.46528-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Precedence: bulk
Series	Socket migration for SO_REUSEPORT. \| expand [v4,bpf-next,00/11] Socket migration for SO_REUSEPORT. [v4,bpf-next,01/11] net: Introduce net.ipv4.tcp_migrate_req. [v4,bpf-next,02/11] tcp: Add num_closed_socks to struct sock_reuseport. [v4,bpf-next,03/11] tcp: Keep TCP_CLOSE sockets in the reuseport group. [v4,bpf-next,04/11] tcp: Add reuseport_migrate_sock() to select a new listener. [v4,bpf-next,05/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. [v4,bpf-next,06/11] tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs. [v4,bpf-next,07/11] tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. [v4,bpf-next,08/11] bpf: Support BPF_FUNC_get_socket_cookie() for BPF_PROG_TYPE_SK_REUSEPORT. [v4,bpf-next,09/11] bpf: Support socket migration by eBPF. [v4,bpf-next,10/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT. [v4,bpf-next,11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Iwashima, Kuniyuki April 27, 2021, 3:46 a.m. UTC

The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation [1]. When a SYN packet is received, the connection is tied
to a listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners on the same port could accept
such connections.

This situation can happen when various server management tools restart
server (such as nginx) processes. For instance, when we change nginx
configurations and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, resulting in the
in-flight ACK of 3WHS is responded by RST.

To avoid such a situation, users have to know deeply how the kernel handles
SYN packets and implement connection draining by eBPF [2]:

1. Stop routing SYN packets to the listener by eBPF.
2. Wait for all timers to expire to complete requests
3. Accept connections until EAGAIN, then close the listener.

or

1. Start counting SYN packets and accept syscalls using the eBPF map.
2. Stop routing SYN packets.
3. Accept connections up to the count, then close the listener.

In either way, we cannot close a listener immediately. However, ideally,
the application need not drain the not yet accepted sockets because 3WHS
and tying a connection to a listener are just the kernel behaviour. The
root cause is within the kernel, so the issue should be addressed in kernel
space and should not be visible to user space. This patchset fixes it so
that users need not take care of kernel implementation and connection
draining. With this patchset, the kernel redistributes requests and
connections from a listener to the others in the same reuseport group
at/after close or shutdown syscalls.

Although some software does connection draining, there are still merits in
migration. For some security reasons, such as replacing TLS certificates,
we may want to apply new settings as soon as possible and/or we may not be
able to wait for connection draining. The sockets in the accept queue have
not started application sessions yet. So, if we do not drain such sockets,
they can be handled by the newer listeners and could have a longer
lifetime. It is difficult to drain all connections in every case, but we
can decrease such aborted connections by migration. In that sense,
migration is always better than draining.

Moreover, auto-migration simplifies user space logic and also works well in
a case where we cannot modify and build a server program to implement the
workaround.

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may face inconsistency and
cause errors. In such a case, we have to use the eBPF program to select a
specific listener or to cancel migration.

Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
snippets along the way.

Link:
[1] The SO_REUSEPORT socket option
https://lwn.net/Articles/542629/

[2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/

Changelog:
v4:
* Make some functions and variables 'static' in selftest
* Remove 'scalability' from the cover letter because it is not
primarily reason to use SO_REUSEPORT

v3:
https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/
* Add sysctl back for reuseport_grow()
* Add helper functions to manage socks[]
* Separate migration related logic into functions: reuseport_resurrect(),
reuseport_stop_listen_sock(), reuseport_migrate_sock()
* Clone request_sock to be migrated
* Migrate request one by one
* Pass child socket to eBPF prog

v2:
https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/
* Do not save closed sockets in socks[]
* Revert 607904c357c61adf20b8fd18af765e501d61a385
* Extract inet_csk_reqsk_queue_migrate() into a single patch
* Change the spin_lock order to avoid lockdep warning
* Add static to __reuseport_select_sock
* Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
* Set the default attach type in bpf_prog_load_check_attach()
* Define new proto of BPF_FUNC_get_socket_cookie
* Fix test to be compiled successfully
* Update commit messages

v1:
https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/
* Remove the sysctl option
* Enable migration if eBPF progam is not attached
* Add expected_attach_type to check if eBPF program can migrate sockets
* Add a field to tell migration type to eBPF program
* Support BPF_FUNC_get_socket_cookie to get the cookie of sk
* Allocate an empty skb if skb is NULL
* Pass req_to_sk(req)->sk_hash because listener's hash is zero
* Update commit messages and coverletter

RFC:
https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/

Kuniyuki Iwashima (11):
net: Introduce net.ipv4.tcp_migrate_req.
tcp: Add num_closed_socks to struct sock_reuseport.
tcp: Keep TCP_CLOSE sockets in the reuseport group.
tcp: Add reuseport_migrate_sock() to select a new listener.
tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
bpf: Support BPF_FUNC_get_socket_cookie() for
BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Support socket migration by eBPF.
libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Documentation/networking/ip-sysctl.rst | 20 +
include/linux/bpf.h | 1 +
include/linux/filter.h | 2 +
include/net/netns/ipv4.h | 1 +
include/net/request_sock.h | 2 +
include/net/sock_reuseport.h | 9 +-
include/uapi/linux/bpf.h | 16 +
kernel/bpf/syscall.c | 13 +
net/core/filter.c | 23 +-
net/core/request_sock.c | 38 ++
net/core/sock_reuseport.c | 337 ++++++++++--
net/ipv4/inet_connection_sock.c | 147 +++++-
net/ipv4/inet_hashtables.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 9 +
net/ipv4/tcp_ipv4.c | 20 +-
net/ipv6/tcp_ipv6.c | 14 +-
tools/include/uapi/linux/bpf.h | 16 +
tools/lib/bpf/libbpf.c | 5 +-
tools/testing/selftests/bpf/network_helpers.c | 2 +-
tools/testing/selftests/bpf/network_helpers.h | 1 +
.../bpf/prog_tests/migrate_reuseport.c | 484 ++++++++++++++++++
.../bpf/progs/test_migrate_reuseport.c | 51 ++
22 files changed, 1151 insertions(+), 62 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c

Jason Baron April 27, 2021, 4:38 p.m. UTC | #1

On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

Hi Kuniyuki,

I had implemented a different approach to this that I wanted to get your
thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
listen fd (or any other fd) around. Currently, if you have an 'old' webserver
that you want to replace with a 'new' webserver, you would need a separate
process to receive the listen fd and then have that process send the fd to
the new webserver, if they are not running con-currently. So instead what
I'm proposing is a 'delayed close' for a unix socket. That is, one could do:

1) bind unix socket with path '/sockets'
2) sendmsg() the listen fd via the unix socket
2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
3) exit/close the old webserver and the listen socket
4) start the new webserver
5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
6) recvmsg() the listen fd

So the idea is that we set a timeout on the unix socket. If the new process
does not start and bind to the unix socket, it simply closes, thus releasing
the listen socket. However, if it does bind it can now call recvmsg() and
use the listen fd as normal. It can then simply continue to use the old listen
fds and/or create new ones and drain the old ones.

Thus, the old and new webservers do not have to run concurrently. This doesn't
involve any changes to the tcp layer and can be used to pass any type of fd.
not sure if it's actually useful for anything else though.

I'm not sure if this solves your use-case or not but I thought I'd share it.
One can also inherit the fds like in systemd's socket activation model, but
that again requires another process to hold open the listen fd.

I have a very rough patch (emphasis on rough), that implements this idea that
I'm attaching below to explain it better. It would need a bunch of fixups and
it's against an older kernel, but hopefully gives this direction a better
explanation.

Thanks,

-Jason




> 
> To avoid such a situation, users have to know deeply how the kernel handles
> SYN packets and implement connection draining by eBPF [2]:
> 
>   1. Stop routing SYN packets to the listener by eBPF.
>   2. Wait for all timers to expire to complete requests
>   3. Accept connections until EAGAIN, then close the listener.
> 
>   or
> 
>   1. Start counting SYN packets and accept syscalls using the eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.
> 
> In either way, we cannot close a listener immediately. However, ideally,
> the application need not drain the not yet accepted sockets because 3WHS
> and tying a connection to a listener are just the kernel behaviour. The
> root cause is within the kernel, so the issue should be addressed in kernel
> space and should not be visible to user space. This patchset fixes it so
> that users need not take care of kernel implementation and connection
> draining. With this patchset, the kernel redistributes requests and
> connections from a listener to the others in the same reuseport group
> at/after close or shutdown syscalls.
> 
> Although some software does connection draining, there are still merits in
> migration. For some security reasons, such as replacing TLS certificates,
> we may want to apply new settings as soon as possible and/or we may not be
> able to wait for connection draining. The sockets in the accept queue have
> not started application sessions yet. So, if we do not drain such sockets,
> they can be handled by the newer listeners and could have a longer
> lifetime. It is difficult to drain all connections in every case, but we
> can decrease such aborted connections by migration. In that sense,
> migration is always better than draining. 
> 
> Moreover, auto-migration simplifies user space logic and also works well in
> a case where we cannot modify and build a server program to implement the
> workaround.
> 
> Note that the source and destination listeners MUST have the same settings
> at the socket API level; otherwise, applications may face inconsistency and
> cause errors. In such a case, we have to use the eBPF program to select a
> specific listener or to cancel migration.
> 
> Special thanks to Martin KaFai Lau for bouncing ideas and exchanging code
> snippets along the way.
> 
> 
> Link:
>  [1] The SO_REUSEPORT socket option
>  https://urldefense.com/v3/__https://lwn.net/Articles/542629/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHvYbG59iA$ 
> 
>  [2] Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as drain mode
>  https://urldefense.com/v3/__https://lore.kernel.org/netdev/1458828813.10868.65.camel@edumazet-glaptop3.roam.corp.google.com/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHv5_PVAcw$ 
> 
> 
> Changelog:
>  v4:
>   * Make some functions and variables 'static' in selftest
>   * Remove 'scalability' from the cover letter because it is not
>     primarily reason to use SO_REUSEPORT
> 
>  v3:
>  https://urldefense.com/v3/__https://lore.kernel.org/bpf/20210420154140.80034-1-kuniyu@amazon.co.jp/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHtKFGgFOg$ 
>   * Add sysctl back for reuseport_grow()
>   * Add helper functions to manage socks[]
>   * Separate migration related logic into functions: reuseport_resurrect(),
>     reuseport_stop_listen_sock(), reuseport_migrate_sock()
>   * Clone request_sock to be migrated
>   * Migrate request one by one
>   * Pass child socket to eBPF prog
> 
>  v2:
>  https://urldefense.com/v3/__https://lore.kernel.org/netdev/20201207132456.65472-1-kuniyu@amazon.co.jp/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHtxujEgug$ 
>   * Do not save closed sockets in socks[]
>   * Revert 607904c357c61adf20b8fd18af765e501d61a385
>   * Extract inet_csk_reqsk_queue_migrate() into a single patch
>   * Change the spin_lock order to avoid lockdep warning
>   * Add static to __reuseport_select_sock
>   * Use refcount_inc_not_zero() in reuseport_select_migrated_sock()
>   * Set the default attach type in bpf_prog_load_check_attach()
>   * Define new proto of BPF_FUNC_get_socket_cookie
>   * Fix test to be compiled successfully
>   * Update commit messages
> 
>  v1:
>  https://urldefense.com/v3/__https://lore.kernel.org/netdev/20201201144418.35045-1-kuniyu@amazon.co.jp/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHsPqhRjHg$ 
>   * Remove the sysctl option
>   * Enable migration if eBPF progam is not attached
>   * Add expected_attach_type to check if eBPF program can migrate sockets
>   * Add a field to tell migration type to eBPF program
>   * Support BPF_FUNC_get_socket_cookie to get the cookie of sk
>   * Allocate an empty skb if skb is NULL
>   * Pass req_to_sk(req)->sk_hash because listener's hash is zero
>   * Update commit messages and coverletter
> 
>  RFC:
>  https://urldefense.com/v3/__https://lore.kernel.org/netdev/20201117094023.3685-1-kuniyu@amazon.co.jp/__;!!GjvTz_vk!EfhfOTCU_7XOxhuo8yW-66aU3Arq_7mkRIloYIyJYvsuGuFYTPYMmHsn-5vckQ$ 
> 
> 
> Kuniyuki Iwashima (11):
>   net: Introduce net.ipv4.tcp_migrate_req.
>   tcp: Add num_closed_socks to struct sock_reuseport.
>   tcp: Keep TCP_CLOSE sockets in the reuseport group.
>   tcp: Add reuseport_migrate_sock() to select a new listener.
>   tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.
>   tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK.
>   bpf: Support BPF_FUNC_get_socket_cookie() for
>     BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Support socket migration by eBPF.
>   libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT.
>   bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
> 
>  Documentation/networking/ip-sysctl.rst        |  20 +
>  include/linux/bpf.h                           |   1 +
>  include/linux/filter.h                        |   2 +
>  include/net/netns/ipv4.h                      |   1 +
>  include/net/request_sock.h                    |   2 +
>  include/net/sock_reuseport.h                  |   9 +-
>  include/uapi/linux/bpf.h                      |  16 +
>  kernel/bpf/syscall.c                          |  13 +
>  net/core/filter.c                             |  23 +-
>  net/core/request_sock.c                       |  38 ++
>  net/core/sock_reuseport.c                     | 337 ++++++++++--
>  net/ipv4/inet_connection_sock.c               | 147 +++++-
>  net/ipv4/inet_hashtables.c                    |   2 +-
>  net/ipv4/sysctl_net_ipv4.c                    |   9 +
>  net/ipv4/tcp_ipv4.c                           |  20 +-
>  net/ipv6/tcp_ipv6.c                           |  14 +-
>  tools/include/uapi/linux/bpf.h                |  16 +
>  tools/lib/bpf/libbpf.c                        |   5 +-
>  tools/testing/selftests/bpf/network_helpers.c |   2 +-
>  tools/testing/selftests/bpf/network_helpers.h |   1 +
>  .../bpf/prog_tests/migrate_reuseport.c        | 484 ++++++++++++++++++
>  .../bpf/progs/test_migrate_reuseport.c        |  51 ++
>  22 files changed, 1151 insertions(+), 62 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
>

Maciej Żenczykowski April 27, 2021, 9:55 p.m. UTC | #2

On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation [1]. When a SYN packet is received, the connection is tied
> to a listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners on the same port could accept
> such connections.
>
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in the
> in-flight ACK of 3WHS is responded by RST.

This is IMHO a userspace bug.

You should never be closing or creating new SO_REUSEPORT sockets on a
running server (listening port).

There's at least 3 ways to accomplish this.

One involves a shim parent process that takes care of creating the
sockets (without close-on-exec),
then fork-exec's the actual server process[es] (which will use the
already opened listening fds),
and can thus re-fork-exec a new child while using the same set of sockets.
Here the old server can terminate before the new one starts.

(one could even envision systemd being modified to support this...)

The second involves the old running server fork-execing the new server
and handing off the non-CLOEXEC sockets that way.

The third approach involves unix fd passing of sockets to hand off the
listening sockets from the old process/thread(s) to the new
process/thread(s).  Once handed off the old server can stop accept'ing
on the listening sockets and close them (the real copies are in the
child), finish processing any still active connections (or time them
out) and terminate.

Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
count), nor closing the final copy of a given socket.

This is basically the same thing that was needed not to lose incoming
connections in a pre-SO_REUSEPORT world.
(no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
triggering a RST during the server restart, it just makes the window
when RSTs happen shorter)

This was from day one (I reported to Tom and worked with him on the
very initial distribution function) envisioned to work like this,
and we (Google) have always used it with unix fd handoff to support
transparent restart.

Maciej Żenczykowski April 27, 2021, 10 p.m. UTC | #3

On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> >
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
>
> This is IMHO a userspace bug.
>
> You should never be closing or creating new SO_REUSEPORT sockets on a
> running server (listening port).
>
> There's at least 3 ways to accomplish this.
>
> One involves a shim parent process that takes care of creating the
> sockets (without close-on-exec),
> then fork-exec's the actual server process[es] (which will use the
> already opened listening fds),
> and can thus re-fork-exec a new child while using the same set of sockets.
> Here the old server can terminate before the new one starts.
>
> (one could even envision systemd being modified to support this...)
>
> The second involves the old running server fork-execing the new server
> and handing off the non-CLOEXEC sockets that way.

(this doesn't even need to be fork-exec -- can just be exec -- and is
potentially easier)

> The third approach involves unix fd passing of sockets to hand off the
> listening sockets from the old process/thread(s) to the new
> process/thread(s).  Once handed off the old server can stop accept'ing
> on the listening sockets and close them (the real copies are in the
> child), finish processing any still active connections (or time them

(this doesn't actually need to be a child, in can be an entirely new
parallel instance of the server,
potentially running in an entirely new container/cgroup setup, though
in the same network namespace)

> out) and terminate.
>
> Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> count), nor closing the final copy of a given socket.
>
> This is basically the same thing that was needed not to lose incoming
> connections in a pre-SO_REUSEPORT world.
> (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> triggering a RST during the server restart, it just makes the window
> when RSTs happen shorter)
>
> This was from day one (I reported to Tom and worked with him on the
> very initial distribution function) envisioned to work like this,
> and we (Google) have always used it with unix fd handoff to support
> transparent restart.

Martin KaFai Lau April 28, 2021, 1:27 a.m. UTC | #4

On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
> 
> 
> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
> 
> Hi Kuniyuki,
> 
> I had implemented a different approach to this that I wanted to get your
> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> that you want to replace with a 'new' webserver, you would need a separate
> process to receive the listen fd and then have that process send the fd to
> the new webserver, if they are not running con-currently. So instead what
> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> 
> 1) bind unix socket with path '/sockets'
> 2) sendmsg() the listen fd via the unix socket
> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> 3) exit/close the old webserver and the listen socket
> 4) start the new webserver
> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> 6) recvmsg() the listen fd
> 
> So the idea is that we set a timeout on the unix socket. If the new process
> does not start and bind to the unix socket, it simply closes, thus releasing
> the listen socket. However, if it does bind it can now call recvmsg() and
> use the listen fd as normal. It can then simply continue to use the old listen
> fds and/or create new ones and drain the old ones.
> 
> Thus, the old and new webservers do not have to run concurrently. This doesn't
> involve any changes to the tcp layer and can be used to pass any type of fd.
> not sure if it's actually useful for anything else though.
We also used to do tcp-listen(/udp) fd transfer because the new process can not
bind to the same IP:PORT in the old kernel without SO_REUSEPORT.  Some of the
services listen to many different IP:PORT(s).  Transferring all of them
was ok-ish but the old and new process do not necessary listen to the same set
of IP:PORT(s) (e.g. the config may have changed during restart) and it further
complicates the fd transfer logic in the userspace.

It was then moved to SO_REUSEPORT.  The new process can create its listen fds
without depending on the old process.  It pretty much starts as if there is
no old process.  There is no need to transfer the fds, simplified the userspace
logic.  The old and new process can work independently.  The old and new process
still run concurrently for a brief time period to avoid service disruption.

Iwashima, Kuniyuki April 28, 2021, 8:13 a.m. UTC | #5

From:   Jason Baron <jbaron@akamai.com>
Date:   Tue, 27 Apr 2021 12:38:58 -0400
> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation [1]. When a SYN packet is received, the connection is tied
> > to a listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners on the same port could accept
> > such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in the
> > in-flight ACK of 3WHS is responded by RST.
> 
> Hi Kuniyuki,
> 
> I had implemented a different approach to this that I wanted to get your
> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> that you want to replace with a 'new' webserver, you would need a separate
> process to receive the listen fd and then have that process send the fd to
> the new webserver, if they are not running con-currently. So instead what
> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> 
> 1) bind unix socket with path '/sockets'
> 2) sendmsg() the listen fd via the unix socket
> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> 3) exit/close the old webserver and the listen socket
> 4) start the new webserver
> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> 6) recvmsg() the listen fd
> 
> So the idea is that we set a timeout on the unix socket. If the new process
> does not start and bind to the unix socket, it simply closes, thus releasing
> the listen socket. However, if it does bind it can now call recvmsg() and
> use the listen fd as normal. It can then simply continue to use the old listen
> fds and/or create new ones and drain the old ones.
> 
> Thus, the old and new webservers do not have to run concurrently. This doesn't
> involve any changes to the tcp layer and can be used to pass any type of fd.
> not sure if it's actually useful for anything else though.
> 
> I'm not sure if this solves your use-case or not but I thought I'd share it.
> One can also inherit the fds like in systemd's socket activation model, but
> that again requires another process to hold open the listen fd.

Thank you for sharing code.

It seems bit more crash-tolerant than normal fd passing, but it can still
suffer if the process dies before passing fds. With this patch set, we can
migrate children sockets even if the process dies.

Also, as Martin said, fd passing tends to make application complicated.

If we do not mind these points, your approach could be an option.


> 
> I have a very rough patch (emphasis on rough), that implements this idea that
> I'm attaching below to explain it better. It would need a bunch of fixups and
> it's against an older kernel, but hopefully gives this direction a better
> explanation.
> 
> Thanks,
> 
> -Jason

Iwashima, Kuniyuki April 28, 2021, 8:18 a.m. UTC | #6

From:   Maciej Żenczykowski <zenczykowski@gmail.com>
Date:   Tue, 27 Apr 2021 15:00:12 -0700
> On Tue, Apr 27, 2021 at 2:55 PM Maciej Żenczykowski
> <zenczykowski@gmail.com> wrote:
> > On Mon, Apr 26, 2021 at 8:47 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation [1]. When a SYN packet is received, the connection is tied
> > > to a listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners on the same port could accept
> > > such connections.
> > >
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in the
> > > in-flight ACK of 3WHS is responded by RST.
> >
> > This is IMHO a userspace bug.

I do not think so.

If the kernel selected another listener for incoming connections, they
could be accept()ed. There is no room for usersapce to change the behaviour
without an in-kernel tool, eBPF. A feature that can cause failure
stochastically due to kernel behaviour cannot be a userspace bug.


> >
> > You should never be closing or creating new SO_REUSEPORT sockets on a
> > running server (listening port).
> >
> > There's at least 3 ways to accomplish this.
> >
> > One involves a shim parent process that takes care of creating the
> > sockets (without close-on-exec),
> > then fork-exec's the actual server process[es] (which will use the
> > already opened listening fds),
> > and can thus re-fork-exec a new child while using the same set of sockets.
> > Here the old server can terminate before the new one starts.
> >
> > (one could even envision systemd being modified to support this...)
> >
> > The second involves the old running server fork-execing the new server
> > and handing off the non-CLOEXEC sockets that way.
> 
> (this doesn't even need to be fork-exec -- can just be exec -- and is
> potentially easier)
> 
> > The third approach involves unix fd passing of sockets to hand off the
> > listening sockets from the old process/thread(s) to the new
> > process/thread(s).  Once handed off the old server can stop accept'ing
> > on the listening sockets and close them (the real copies are in the
> > child), finish processing any still active connections (or time them
> 
> (this doesn't actually need to be a child, in can be an entirely new
> parallel instance of the server,
> potentially running in an entirely new container/cgroup setup, though
> in the same network namespace)
> 
> > out) and terminate.
> >
> > Either way you're never creating new SO_REUSEPORT sockets (dup doesn't
> > count), nor closing the final copy of a given socket.

Indeed each approach can be an option, but it makes application more
complicated. Also what if the process holding the listener fd died, there
could be down time.

I do not think every approach works well in everywhere for everyone.


> >
> > This is basically the same thing that was needed not to lose incoming
> > connections in a pre-SO_REUSEPORT world.
> > (no SO_REUSEADDR by itself doesn't prevent an incoming SYN from
> > triggering a RST during the server restart, it just makes the window
> > when RSTs happen shorter)

SO_REUSEPORT makes each process/listener independent, and we need not pass
fds. So, it makes application much simpler. Even with SO_REUSEPORT, one
listener might crash, but it is more tolerant than losing all connections
at once.

To enjoy such merits, isn't it natural to improve the existing feature in
this post-SO_REUSEPORT world?


> >
> > This was from day one (I reported to Tom and worked with him on the
> > very initial distribution function) envisioned to work like this,
> > and we (Google) have always used it with unix fd handoff to support
> > transparent restart.

Eric Dumazet April 28, 2021, 2:18 p.m. UTC | #7

On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
>>
>>
>> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
>>> The SO_REUSEPORT option allows sockets to listen on the same port and to
>>> accept connections evenly. However, there is a defect in the current
>>> implementation [1]. When a SYN packet is received, the connection is tied
>>> to a listening socket. Accordingly, when the listener is closed, in-flight
>>> requests during the three-way handshake and child sockets in the accept
>>> queue are dropped even if other listeners on the same port could accept
>>> such connections.
>>>
>>> This situation can happen when various server management tools restart
>>> server (such as nginx) processes. For instance, when we change nginx
>>> configurations and restart it, it spins up new workers that respect the new
>>> configuration and closes all listeners on the old workers, resulting in the
>>> in-flight ACK of 3WHS is responded by RST.
>>
>> Hi Kuniyuki,
>>
>> I had implemented a different approach to this that I wanted to get your
>> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
>> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
>> that you want to replace with a 'new' webserver, you would need a separate
>> process to receive the listen fd and then have that process send the fd to
>> the new webserver, if they are not running con-currently. So instead what
>> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
>>
>> 1) bind unix socket with path '/sockets'
>> 2) sendmsg() the listen fd via the unix socket
>> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
>> 3) exit/close the old webserver and the listen socket
>> 4) start the new webserver
>> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
>> 6) recvmsg() the listen fd
>>
>> So the idea is that we set a timeout on the unix socket. If the new process
>> does not start and bind to the unix socket, it simply closes, thus releasing
>> the listen socket. However, if it does bind it can now call recvmsg() and
>> use the listen fd as normal. It can then simply continue to use the old listen
>> fds and/or create new ones and drain the old ones.
>>
>> Thus, the old and new webservers do not have to run concurrently. This doesn't
>> involve any changes to the tcp layer and can be used to pass any type of fd.
>> not sure if it's actually useful for anything else though.
> We also used to do tcp-listen(/udp) fd transfer because the new process can not
> bind to the same IP:PORT in the old kernel without SO_REUSEPORT.  Some of the
> services listen to many different IP:PORT(s).  Transferring all of them
> was ok-ish but the old and new process do not necessary listen to the same set
> of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> complicates the fd transfer logic in the userspace.
> 
> It was then moved to SO_REUSEPORT.  The new process can create its listen fds
> without depending on the old process.  It pretty much starts as if there is
> no old process.  There is no need to transfer the fds, simplified the userspace
> logic.  The old and new process can work independently.  The old and new process
> still run concurrently for a brief time period to avoid service disruption.
> 


Note that another technique is to force syncookies during the switch of old/new servers.

echo 2 >/proc/sys/net/ipv4/tcp_syncookies

If there is interest, we could add a socket option to override the sysctl on a per-socket basis.

Jason Baron April 28, 2021, 2:44 p.m. UTC | #8

On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> From:   Jason Baron <jbaron@akamai.com>
> Date:   Tue, 27 Apr 2021 12:38:58 -0400
>> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
>>> The SO_REUSEPORT option allows sockets to listen on the same port and to
>>> accept connections evenly. However, there is a defect in the current
>>> implementation [1]. When a SYN packet is received, the connection is tied
>>> to a listening socket. Accordingly, when the listener is closed, in-flight
>>> requests during the three-way handshake and child sockets in the accept
>>> queue are dropped even if other listeners on the same port could accept
>>> such connections.
>>>
>>> This situation can happen when various server management tools restart
>>> server (such as nginx) processes. For instance, when we change nginx
>>> configurations and restart it, it spins up new workers that respect the new
>>> configuration and closes all listeners on the old workers, resulting in the
>>> in-flight ACK of 3WHS is responded by RST.
>>
>> Hi Kuniyuki,
>>
>> I had implemented a different approach to this that I wanted to get your
>> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
>> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
>> that you want to replace with a 'new' webserver, you would need a separate
>> process to receive the listen fd and then have that process send the fd to
>> the new webserver, if they are not running con-currently. So instead what
>> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
>>
>> 1) bind unix socket with path '/sockets'
>> 2) sendmsg() the listen fd via the unix socket
>> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
>> 3) exit/close the old webserver and the listen socket
>> 4) start the new webserver
>> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
>> 6) recvmsg() the listen fd
>>
>> So the idea is that we set a timeout on the unix socket. If the new process
>> does not start and bind to the unix socket, it simply closes, thus releasing
>> the listen socket. However, if it does bind it can now call recvmsg() and
>> use the listen fd as normal. It can then simply continue to use the old listen
>> fds and/or create new ones and drain the old ones.
>>
>> Thus, the old and new webservers do not have to run concurrently. This doesn't
>> involve any changes to the tcp layer and can be used to pass any type of fd.
>> not sure if it's actually useful for anything else though.
>>
>> I'm not sure if this solves your use-case or not but I thought I'd share it.
>> One can also inherit the fds like in systemd's socket activation model, but
>> that again requires another process to hold open the listen fd.
> 
> Thank you for sharing code.
> 
> It seems bit more crash-tolerant than normal fd passing, but it can still
> suffer if the process dies before passing fds. With this patch set, we can
> migrate children sockets even if the process dies.
> 

I don't think crashing should be much of an issue. The old server can setup the
unix socket patch '/sockets' when it starts up and queue the listen sockets
there from the start. When it dies it will close all its fds, and the new
server can pick anything up any fds that are in the '/sockets' queue.


> Also, as Martin said, fd passing tends to make application complicated.
> 

It may be but perhaps its more flexible? It gives the new server the
chance to re-use the existing listen fds, close, drain and/or start new
ones. It also addresses the non-REUSEPORT case where you can't bind right
away.

Thanks,

-Jason

> If we do not mind these points, your approach could be an option.
> 
>

Iwashima, Kuniyuki April 28, 2021, 3:49 p.m. UTC | #9

From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed, 28 Apr 2021 16:18:30 +0200
> On 4/28/21 3:27 AM, Martin KaFai Lau wrote:
> > On Tue, Apr 27, 2021 at 12:38:58PM -0400, Jason Baron wrote:
> >>
> >>
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes. For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of 3WHS is responded by RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running con-currently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 3) exit/close the old webserver and the listen socket
> >> 4) start the new webserver
> >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> >> 6) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> not sure if it's actually useful for anything else though.
> > We also used to do tcp-listen(/udp) fd transfer because the new process can not
> > bind to the same IP:PORT in the old kernel without SO_REUSEPORT.  Some of the
> > services listen to many different IP:PORT(s).  Transferring all of them
> > was ok-ish but the old and new process do not necessary listen to the same set
> > of IP:PORT(s) (e.g. the config may have changed during restart) and it further
> > complicates the fd transfer logic in the userspace.
> > 
> > It was then moved to SO_REUSEPORT.  The new process can create its listen fds
> > without depending on the old process.  It pretty much starts as if there is
> > no old process.  There is no need to transfer the fds, simplified the userspace
> > logic.  The old and new process can work independently.  The old and new process
> > still run concurrently for a brief time period to avoid service disruption.
> > 
> 
> 
> Note that another technique is to force syncookies during the switch of old/new servers.
> 
> echo 2 >/proc/sys/net/ipv4/tcp_syncookies
> 
> If there is interest, we could add a socket option to override the sysctl on a per-socket basis.

It can be a work-around but syncookies has its own downside. Forcing it may
lose some valuable TCP options. If there is an approach without syncookies,
it is better.

Iwashima, Kuniyuki April 28, 2021, 3:52 p.m. UTC | #10

From:   Jason Baron <jbaron@akamai.com>
Date:   Wed, 28 Apr 2021 10:44:12 -0400
> On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > From:   Jason Baron <jbaron@akamai.com>
> > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> >>> accept connections evenly. However, there is a defect in the current
> >>> implementation [1]. When a SYN packet is received, the connection is tied
> >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> >>> requests during the three-way handshake and child sockets in the accept
> >>> queue are dropped even if other listeners on the same port could accept
> >>> such connections.
> >>>
> >>> This situation can happen when various server management tools restart
> >>> server (such as nginx) processes. For instance, when we change nginx
> >>> configurations and restart it, it spins up new workers that respect the new
> >>> configuration and closes all listeners on the old workers, resulting in the
> >>> in-flight ACK of 3WHS is responded by RST.
> >>
> >> Hi Kuniyuki,
> >>
> >> I had implemented a different approach to this that I wanted to get your
> >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> >> that you want to replace with a 'new' webserver, you would need a separate
> >> process to receive the listen fd and then have that process send the fd to
> >> the new webserver, if they are not running con-currently. So instead what
> >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> >>
> >> 1) bind unix socket with path '/sockets'
> >> 2) sendmsg() the listen fd via the unix socket
> >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> >> 3) exit/close the old webserver and the listen socket
> >> 4) start the new webserver
> >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> >> 6) recvmsg() the listen fd
> >>
> >> So the idea is that we set a timeout on the unix socket. If the new process
> >> does not start and bind to the unix socket, it simply closes, thus releasing
> >> the listen socket. However, if it does bind it can now call recvmsg() and
> >> use the listen fd as normal. It can then simply continue to use the old listen
> >> fds and/or create new ones and drain the old ones.
> >>
> >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> >> involve any changes to the tcp layer and can be used to pass any type of fd.
> >> not sure if it's actually useful for anything else though.
> >>
> >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> >> One can also inherit the fds like in systemd's socket activation model, but
> >> that again requires another process to hold open the listen fd.
> > 
> > Thank you for sharing code.
> > 
> > It seems bit more crash-tolerant than normal fd passing, but it can still
> > suffer if the process dies before passing fds. With this patch set, we can
> > migrate children sockets even if the process dies.
> > 
> 
> I don't think crashing should be much of an issue. The old server can setup the
> unix socket patch '/sockets' when it starts up and queue the listen sockets
> there from the start. When it dies it will close all its fds, and the new
> server can pick anything up any fds that are in the '/sockets' queue.
> 
> 
> > Also, as Martin said, fd passing tends to make application complicated.
> > 
> 
> It may be but perhaps its more flexible? It gives the new server the
> chance to re-use the existing listen fds, close, drain and/or start new
> ones. It also addresses the non-REUSEPORT case where you can't bind right
> away.

If the flexibility is really worth the complexity, we do not care about it.
But, SO_REUSEPORT can give enough flexibility we want.

With socket migration, there is no need to reuse listener (fd passing),
drain children (incoming connections are automatically migrated if there is
already another listener bind()ed), and of course another listener can
close itself and migrated children.

If two different approaches resolves the same issue and one does not need
complexity in userspace, we select the simpler one.

Eric Dumazet April 28, 2021, 4:33 p.m. UTC | #11

On Wed, Apr 28, 2021 at 5:52 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
>
> From:   Jason Baron <jbaron@akamai.com>
> Date:   Wed, 28 Apr 2021 10:44:12 -0400
> > On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > > From:   Jason Baron <jbaron@akamai.com>
> > > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> > >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> > >>> accept connections evenly. However, there is a defect in the current
> > >>> implementation [1]. When a SYN packet is received, the connection is tied
> > >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> > >>> requests during the three-way handshake and child sockets in the accept
> > >>> queue are dropped even if other listeners on the same port could accept
> > >>> such connections.
> > >>>
> > >>> This situation can happen when various server management tools restart
> > >>> server (such as nginx) processes. For instance, when we change nginx
> > >>> configurations and restart it, it spins up new workers that respect the new
> > >>> configuration and closes all listeners on the old workers, resulting in the
> > >>> in-flight ACK of 3WHS is responded by RST.
> > >>
> > >> Hi Kuniyuki,
> > >>
> > >> I had implemented a different approach to this that I wanted to get your
> > >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> > >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> > >> that you want to replace with a 'new' webserver, you would need a separate
> > >> process to receive the listen fd and then have that process send the fd to
> > >> the new webserver, if they are not running con-currently. So instead what
> > >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> > >>
> > >> 1) bind unix socket with path '/sockets'
> > >> 2) sendmsg() the listen fd via the unix socket
> > >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> > >> 3) exit/close the old webserver and the listen socket
> > >> 4) start the new webserver
> > >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> > >> 6) recvmsg() the listen fd
> > >>
> > >> So the idea is that we set a timeout on the unix socket. If the new process
> > >> does not start and bind to the unix socket, it simply closes, thus releasing
> > >> the listen socket. However, if it does bind it can now call recvmsg() and
> > >> use the listen fd as normal. It can then simply continue to use the old listen
> > >> fds and/or create new ones and drain the old ones.
> > >>
> > >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> > >> involve any changes to the tcp layer and can be used to pass any type of fd.
> > >> not sure if it's actually useful for anything else though.
> > >>
> > >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> > >> One can also inherit the fds like in systemd's socket activation model, but
> > >> that again requires another process to hold open the listen fd.
> > >
> > > Thank you for sharing code.
> > >
> > > It seems bit more crash-tolerant than normal fd passing, but it can still
> > > suffer if the process dies before passing fds. With this patch set, we can
> > > migrate children sockets even if the process dies.
> > >
> >
> > I don't think crashing should be much of an issue. The old server can setup the
> > unix socket patch '/sockets' when it starts up and queue the listen sockets
> > there from the start. When it dies it will close all its fds, and the new
> > server can pick anything up any fds that are in the '/sockets' queue.
> >
> >
> > > Also, as Martin said, fd passing tends to make application complicated.
> > >
> >
> > It may be but perhaps its more flexible? It gives the new server the
> > chance to re-use the existing listen fds, close, drain and/or start new
> > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > away.
>
> If the flexibility is really worth the complexity, we do not care about it.
> But, SO_REUSEPORT can give enough flexibility we want.
>
> With socket migration, there is no need to reuse listener (fd passing),
> drain children (incoming connections are automatically migrated if there is
> already another listener bind()ed), and of course another listener can
> close itself and migrated children.
>
> If two different approaches resolves the same issue and one does not need
> complexity in userspace, we select the simpler one.

Kernel bloat and complexity is _not_ the simplest choice.

Touching a complex part of TCP stack is quite risky.

Iwashima, Kuniyuki April 29, 2021, 3:16 a.m. UTC | #12

From:   Eric Dumazet <edumazet@google.com>
Date:   Wed, 28 Apr 2021 18:33:32 +0200
> On Wed, Apr 28, 2021 at 5:52 PM Kuniyuki Iwashima <kuniyu@amazon.co.jp> wrote:
> >
> > From:   Jason Baron <jbaron@akamai.com>
> > Date:   Wed, 28 Apr 2021 10:44:12 -0400
> > > On 4/28/21 4:13 AM, Kuniyuki Iwashima wrote:
> > > > From:   Jason Baron <jbaron@akamai.com>
> > > > Date:   Tue, 27 Apr 2021 12:38:58 -0400
> > > >> On 4/26/21 11:46 PM, Kuniyuki Iwashima wrote:
> > > >>> The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > >>> accept connections evenly. However, there is a defect in the current
> > > >>> implementation [1]. When a SYN packet is received, the connection is tied
> > > >>> to a listening socket. Accordingly, when the listener is closed, in-flight
> > > >>> requests during the three-way handshake and child sockets in the accept
> > > >>> queue are dropped even if other listeners on the same port could accept
> > > >>> such connections.
> > > >>>
> > > >>> This situation can happen when various server management tools restart
> > > >>> server (such as nginx) processes. For instance, when we change nginx
> > > >>> configurations and restart it, it spins up new workers that respect the new
> > > >>> configuration and closes all listeners on the old workers, resulting in the
> > > >>> in-flight ACK of 3WHS is responded by RST.
> > > >>
> > > >> Hi Kuniyuki,
> > > >>
> > > >> I had implemented a different approach to this that I wanted to get your
> > > >> thoughts about. The idea is to use unix sockets and SCM_RIGHTS to pass the
> > > >> listen fd (or any other fd) around. Currently, if you have an 'old' webserver
> > > >> that you want to replace with a 'new' webserver, you would need a separate
> > > >> process to receive the listen fd and then have that process send the fd to
> > > >> the new webserver, if they are not running con-currently. So instead what
> > > >> I'm proposing is a 'delayed close' for a unix socket. That is, one could do:
> > > >>
> > > >> 1) bind unix socket with path '/sockets'
> > > >> 2) sendmsg() the listen fd via the unix socket
> > > >> 2) setsockopt() some 'timeout' on the unix socket (maybe 10 seconds or so)
> > > >> 3) exit/close the old webserver and the listen socket
> > > >> 4) start the new webserver
> > > >> 5) create new unix socket and bind to '/sockets' (if has MAY_WRITE file permissions)
> > > >> 6) recvmsg() the listen fd
> > > >>
> > > >> So the idea is that we set a timeout on the unix socket. If the new process
> > > >> does not start and bind to the unix socket, it simply closes, thus releasing
> > > >> the listen socket. However, if it does bind it can now call recvmsg() and
> > > >> use the listen fd as normal. It can then simply continue to use the old listen
> > > >> fds and/or create new ones and drain the old ones.
> > > >>
> > > >> Thus, the old and new webservers do not have to run concurrently. This doesn't
> > > >> involve any changes to the tcp layer and can be used to pass any type of fd.
> > > >> not sure if it's actually useful for anything else though.
> > > >>
> > > >> I'm not sure if this solves your use-case or not but I thought I'd share it.
> > > >> One can also inherit the fds like in systemd's socket activation model, but
> > > >> that again requires another process to hold open the listen fd.
> > > >
> > > > Thank you for sharing code.
> > > >
> > > > It seems bit more crash-tolerant than normal fd passing, but it can still
> > > > suffer if the process dies before passing fds. With this patch set, we can
> > > > migrate children sockets even if the process dies.
> > > >
> > >
> > > I don't think crashing should be much of an issue. The old server can setup the
> > > unix socket patch '/sockets' when it starts up and queue the listen sockets
> > > there from the start. When it dies it will close all its fds, and the new
> > > server can pick anything up any fds that are in the '/sockets' queue.
> > >
> > >
> > > > Also, as Martin said, fd passing tends to make application complicated.
> > > >
> > >
> > > It may be but perhaps its more flexible? It gives the new server the
> > > chance to re-use the existing listen fds, close, drain and/or start new
> > > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > > away.
> >
> > If the flexibility is really worth the complexity, we do not care about it.
> > But, SO_REUSEPORT can give enough flexibility we want.
> >
> > With socket migration, there is no need to reuse listener (fd passing),
> > drain children (incoming connections are automatically migrated if there is
> > already another listener bind()ed), and of course another listener can
> > close itself and migrated children.
> >
> > If two different approaches resolves the same issue and one does not need
> > complexity in userspace, we select the simpler one.
> 
> Kernel bloat and complexity is _not_ the simplest choice.
> 
> Touching a complex part of TCP stack is quite risky.

Yes, we understand that is not a simple decision and your concern. So many
reviews are needed to see if our approach is really risky or not.

Martin KaFai Lau May 5, 2021, 6:54 a.m. UTC | #13

On Thu, Apr 29, 2021 at 12:16:09PM +0900, Kuniyuki Iwashima wrote:
[ ... ]

> > > > It may be but perhaps its more flexible? It gives the new server the
> > > > chance to re-use the existing listen fds, close, drain and/or start new
> > > > ones. It also addresses the non-REUSEPORT case where you can't bind right
> > > > away.

> > > If the flexibility is really worth the complexity, we do not care about it.
> > > But, SO_REUSEPORT can give enough flexibility we want.
> > >
> > > With socket migration, there is no need to reuse listener (fd passing),
> > > drain children (incoming connections are automatically migrated if there is
> > > already another listener bind()ed), and of course another listener can
> > > close itself and migrated children.
> > >
> > > If two different approaches resolves the same issue and one does not need
> > > complexity in userspace, we select the simpler one.

> > 
> > Kernel bloat and complexity is _not_ the simplest choice.
> > 
> > Touching a complex part of TCP stack is quite risky.

> 
> Yes, we understand that is not a simple decision and your concern. So many
> reviews are needed to see if our approach is really risky or not.

If fd passing is sufficient for a set of use cases, it is great.

However, it does not work well for everyone.  We are not saying
the SO_REUSEPORT(+ optional bpf) is better in all cases also.

After SO_REUSEPORT was added, some people had moved from fd-passing
to SO_REUSEPORT instead and have one bpf policy to select for both
TCP and UDP sk.

Since SO_REUSEPORT was first added, there has been multiple contributions
from different people and companies.  For example, first adding bpf
support to UDP, then to TCP, then a much more flexible way to select sk
from reuseport_array, and then sock_map/sock_hash support.  That is another
perspective showing that people find it useful.  Each of the contributions
changed the kernel code also for practical use cases.

This set is an extension/improvement to address a lacking in SO_REUSEPORT
when some of the sk is closed.  Patch 2 to 4 are the prep work
in sock_reuseport.c and they have the most changes in this set.
Patch 5 to 7 are the changes in tcp.  The code has been structured
to be as isolated as possible.  It will be most useful to at least
review and getting feedback in this part.  The remaining is bpf
related.

[v4,bpf-next,00/11] Socket migration for SO_REUSEPORT.

Message

Comments