[RFC,bpf-next,0/8] Socket migration for SO_REUSEPORT.

Message ID 20201117094023.3685-1-kuniyu@amazon.co.jp

Message

Iwashima, Kuniyuki Nov. 17, 2020, 9:40 a.m. UTC
The SO_REUSEPORT option allows sockets to listen on the same port and to
accept connections evenly. However, there is a defect in the current
implementation. When a SYN packet is received, the connection is tied to a
listening socket. Accordingly, when the listener is closed, in-flight
requests during the three-way handshake and child sockets in the accept
queue are dropped even if other listeners could accept such connections.

This situation can happen when various server management tools restart
server processes such as nginx. For instance, when we change the nginx
configuration and restart it, it spins up new workers that respect the new
configuration and closes all listeners on the old workers, so the
in-flight final ACK of the 3WHS is answered with a RST.

As a workaround for this issue, we can do connection draining by eBPF:

  1. Before closing a listener, stop routing SYN packets to it.
  2. Wait enough time for requests to complete 3WHS.
  3. Accept connections until EAGAIN, then close the listener.
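
For reference, step 3 could look roughly like the following userspace
sketch (assuming a non-blocking listener fd; the function and variable
names are only illustrative):

#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>

/* Drain the accept queue until EAGAIN, then close the listener.
 * Assumes lfd is non-blocking and has already been excluded from
 * SYN routing (steps 1-2).
 */
static void drain_and_close(int lfd)
{
    for (;;) {
        int cfd = accept(lfd, NULL, NULL);

        if (cfd < 0) {
            if (errno == EINTR)
                continue;
            /* EAGAIN: the queue looks empty, but requests still in
             * the 3WHS are not visible here.
             */
            break;
        }
        /* Hand cfd to the application; closed here only to keep
         * the sketch self-contained.
         */
        close(cfd);
    }
    close(lfd);
}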

Although this approach seems to work well, EAGAIN tells us nothing about
how many requests are still in the middle of the 3WHS. Thus, to complete
connection draining, we have to track the number of such requests by
counting SYN packets with eBPF.

  1. Start counting SYN packets and accept syscalls using eBPF map.
  2. Stop routing SYN packets.
  3. Accept connections up to the count, then close the listener.
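
A rough sketch of the eBPF side of step 1 is below (map and section names
are illustrative). For TCP, a BPF_PROG_TYPE_SK_REUSEPORT program runs when
a SYN packet selects a listener, so counting its invocations approximates
counting SYNs; accept() calls would be counted separately in userspace.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} syn_count SEC(".maps");

SEC("sk_reuseport")
int count_syn(struct sk_reuseport_md *md)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&syn_count, &key);

    if (val)
        __sync_fetch_and_add(val, 1);

    /* SK_PASS without bpf_sk_select_reuseport() keeps the kernel's
     * normal hash-based listener selection.
     */
    return SK_PASS;
}

char _license[] SEC("license") = "GPL";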

When eBPF is used only for connection draining, this seems a bit
expensive. Moreover, in some situations we cannot modify and rebuild a
server program to implement the workaround. This patchset introduces a new
sysctl option to free userland programs from this kernel issue. If we
enable net.ipv4.tcp_migrate_req before creating a reuseport group,
requests and connections are redistributed from a listener to the other
listeners in the same reuseport group at close() or shutdown().
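
For illustration, enabling it from a program could look like the sketch
below (the sysctl path only exists with this series applied;
sysctl -w net.ipv4.tcp_migrate_req=1 is equivalent):

#include <fcntl.h>
#include <unistd.h>

/* Write "1" to the proposed sysctl before the reuseport group is set up. */
static int enable_tcp_migrate_req(void)
{
    int fd = open("/proc/sys/net/ipv4/tcp_migrate_req", O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, "1", 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}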

Note that the source and destination listeners MUST have the same settings
at the socket API level; otherwise, applications may see inconsistent
state and hit errors. In such a case, an eBPF program must be used to
select a specific listener or to cancel the migration.

Kuniyuki Iwashima (8):
  net: Introduce net.ipv4.tcp_migrate_req.
  tcp: Keep TCP_CLOSE sockets in the reuseport group.
  tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.
  tcp: Migrate TFO requests causing RST during TCP_SYN_RECV.
  tcp: Migrate TCP_NEW_SYN_RECV requests.
  bpf: Add cookie in sk_reuseport_md.
  bpf: Call bpf_run_sk_reuseport() for socket migration.
  bpf: Test BPF_PROG_TYPE_SK_REUSEPORT for socket migration.

 Documentation/networking/ip-sysctl.rst        |  15 ++
 include/linux/bpf.h                           |   1 +
 include/net/inet_connection_sock.h            |  13 ++
 include/net/netns/ipv4.h                      |   1 +
 include/net/request_sock.h                    |  13 ++
 include/net/sock_reuseport.h                  |   8 +-
 include/uapi/linux/bpf.h                      |   1 +
 net/core/filter.c                             |  34 +++-
 net/core/sock_reuseport.c                     | 110 +++++++++--
 net/ipv4/inet_connection_sock.c               |  84 ++++++++-
 net/ipv4/inet_hashtables.c                    |   9 +-
 net/ipv4/sysctl_net_ipv4.c                    |   9 +
 net/ipv4/tcp_ipv4.c                           |   9 +-
 net/ipv6/tcp_ipv6.c                           |   9 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/migrate_reuseport.c        | 175 ++++++++++++++++++
 .../bpf/progs/test_migrate_reuseport_kern.c   |  53 ++++++
 17 files changed, 511 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c

Comments

David Laight Nov. 18, 2020, 9:18 a.m. UTC | #1
From: Kuniyuki Iwashima
> Sent: 17 November 2020 09:40
> 
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.

Can't you do something to stop new connections being queued (like
setting the 'backlog' to zero), then carry on doing accept()s
for a guard time (or until the queue length is zero) before finally
closing the listening socket.

	David

Eric Dumazet Nov. 18, 2020, 4:25 p.m. UTC | #2
On 11/17/20 10:40 AM, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.
> 

I know some programs are simply removing a listener from the group,
so that they no longer handle new SYN packets,
and wait until all timers or 3WHS have completed before closing them.

They pass the fds of newly accepted children to the more recent programs using af_unix fd passing,
while in this draining mode.
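
A minimal sketch of such af_unix fd passing (names are illustrative; the
receiving process uses recvmsg() with a matching SCM_RIGHTS control
message):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one accepted fd to the new process over a connected AF_UNIX socket. */
static int send_fd(int uds, int fd)
{
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof(ctrl.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(uds, &msg, 0) == 1 ? 0 : -1;
}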

Quite frankly, mixing eBPF in the picture is distracting.

It seems you want some way to transfer request sockets (and/or not yet accepted established ones)
from fd1 to fd2, isn't it something that should be discussed independently?
Martin KaFai Lau Nov. 19, 2020, 1:49 a.m. UTC | #3
On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> The SO_REUSEPORT option allows sockets to listen on the same port and to
> accept connections evenly. However, there is a defect in the current
> implementation. When a SYN packet is received, the connection is tied to a
> listening socket. Accordingly, when the listener is closed, in-flight
> requests during the three-way handshake and child sockets in the accept
> queue are dropped even if other listeners could accept such connections.
> 
> This situation can happen when various server management tools restart
> server (such as nginx) processes. For instance, when we change nginx
> configurations and restart it, it spins up new workers that respect the new
> configuration and closes all listeners on the old workers, resulting in
> in-flight ACK of 3WHS is responded by RST.
> 
> As a workaround for this issue, we can do connection draining by eBPF:
> 
>   1. Before closing a listener, stop routing SYN packets to it.
>   2. Wait enough time for requests to complete 3WHS.
>   3. Accept connections until EAGAIN, then close the listener.
> 
> Although this approach seems to work well, EAGAIN has nothing to do with
> how many requests are still during 3WHS. Thus, we have to know the number
It sounds like the application can already drain the established socket
by accept()?  To solve the problem that you have,
does it mean migrating req_sk (the in-progress 3WHS) is enough?

Applications can already use the bpf prog to do (1) and divert
the SYN to the newly started process.

If the application cares about service disruption,
it usually needs to drain the fd(s) that it already has and
finishes serving the pending request (e.g. https) on them anyway.
The time taking to finish those could already be longer than it takes
to drain the accept queue or finish off the 3WHS in reasonable time.
or the application that you have does not need to drain the fd(s) 
it already has and it can close them immediately?

> of such requests by counting SYN packets by eBPF to complete connection
> draining.
> 
>   1. Start counting SYN packets and accept syscalls using eBPF map.
>   2. Stop routing SYN packets.
>   3. Accept connections up to the count, then close the listener.
Iwashima, Kuniyuki Nov. 19, 2020, 10:01 p.m. UTC | #4
From:   David Laight <David.Laight@ACULAB.COM>
Date:   Wed, 18 Nov 2020 09:18:24 +0000
> From: Kuniyuki Iwashima
> > Sent: 17 November 2020 09:40
> > 
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> 
> Can't you do something to stop new connections being queued (like
> setting the 'backlog' to zero), then carry on doing accept()s
> for a guard time (or until the queue length is zero) before finally
> closing the listening socket.

Yes, but with eBPF.
Some ideas were suggested and discussed at length in the thread below,
which resulted in connection draining by eBPF being merged.
https://lore.kernel.org/netdev/1443313848-751-1-git-send-email-tolga.ceylan@gmail.com/


Also, setting the backlog to zero does not work well.
https://lore.kernel.org/netdev/1447262610.17135.114.camel@edumazet-glaptop2.roam.corp.google.com/

---8<---
From: Eric Dumazet <eric.dumazet@gmail.com>
Subject: Re: [PATCH 1/1] net: Add SO_REUSEPORT_LISTEN_OFF socket option as
 drain mode
Date: Wed, 11 Nov 2015 09:23:30 -0800
> Actually listen(fd, 0) is not going to work well :
> 
> For request_sock that were created (by incoming SYN packet) before this
> listen(fd, 0) call, the 3rd packet (ACK coming from client) would not be
> able to create a child attached to this listener.
> 
> sk_acceptq_is_full() test in tcp_v4_syn_recv_sock() would simply drop
> the thing.
---8<---
Iwashima, Kuniyuki Nov. 19, 2020, 10:05 p.m. UTC | #5
From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Wed, 18 Nov 2020 17:25:44 +0100
> On 11/17/20 10:40 AM, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> > 
> 
> I know some programs are simply removing a listener from the group,
> so that they no longer handle new SYN packets,
> and wait until all timers or 3WHS have completed before closing them.
> 
> They pass fd of newly accepted children to more recent programs using af_unix fd passing,
> while in this draining mode.

Just out of curiosity, could you tell me which software does this so that I can study it further?


> Quite frankly, mixing eBPF in the picture is distracting.

I agree.
Also, I think eBPF itself is not necessary in many cases, and I want
to make user programs simpler with this patchset.

The SO_REUSEPORT implementation is excellent for improving scalability. On
the other hand, as a trade-off, users have to understand in depth how the
kernel handles SYN packets and implement connection draining with eBPF.


> It seems you want some way to transfer request sockets (and/or not yet accepted established ones)
> from fd1 to fd2, isn't it something that should be discussed independently ?

I understand you to be saying that the issue and how to transfer sockets
should be discussed independently. Please correct me if I have
misunderstood your question.

The kernel handles the 3WHS, and users cannot observe it (without eBPF).
Many users believe SO_REUSEPORT should ideally distribute all connections
across the available listeners, but in fact some connections may be
aborted silently. Some users may think that if the kernel had selected
another listener, those connections would not have been dropped.

The root cause is within the kernel, so the issue should be addressed in
kernel space and should not be visible to userspace. To avoid making users
implement something new, I want to fix the root cause by transferring
sockets automatically so that users do not have to care about the kernel
implementation or connection draining.

Moreover, if possible, I did not want to mix eBPF into the issue. But
there may be cases where different applications listen on the same port
and eBPF routes packets to each of them by some rule. In such cases,
redistributing sockets against the user's intention would break the
application. This patchset works in many cases, but to handle such cases,
I added the eBPF part.
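
For example, a rough sketch of such an eBPF program is below (map layout
and key derivation are illustrative; with this patchset, the same
sk_reuseport program type is also run at migration time, so it can pick
the destination listener or, presumably by returning SK_DROP, cancel the
migration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Listeners are registered in this map from userspace via their fds. */
struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, 16);
    __type(key, __u32);
    __type(value, __u64);
} listeners SEC(".maps");

SEC("sk_reuseport")
int select_listener(struct sk_reuseport_md *md)
{
    /* Key derivation is application-specific, e.g. based on md->hash. */
    __u32 key = 0;

    if (bpf_sk_select_reuseport(md, &listeners, &key, 0) == 0)
        return SK_PASS;

    /* No suitable listener found. */
    return SK_DROP;
}

char _license[] SEC("license") = "GPL";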
Iwashima, Kuniyuki Nov. 19, 2020, 10:17 p.m. UTC | #6
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 18 Nov 2020 17:49:13 -0800
> On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > accept connections evenly. However, there is a defect in the current
> > implementation. When a SYN packet is received, the connection is tied to a
> > listening socket. Accordingly, when the listener is closed, in-flight
> > requests during the three-way handshake and child sockets in the accept
> > queue are dropped even if other listeners could accept such connections.
> > 
> > This situation can happen when various server management tools restart
> > server (such as nginx) processes. For instance, when we change nginx
> > configurations and restart it, it spins up new workers that respect the new
> > configuration and closes all listeners on the old workers, resulting in
> > in-flight ACK of 3WHS is responded by RST.
> > 
> > As a workaround for this issue, we can do connection draining by eBPF:
> > 
> >   1. Before closing a listener, stop routing SYN packets to it.
> >   2. Wait enough time for requests to complete 3WHS.
> >   3. Accept connections until EAGAIN, then close the listener.
> > 
> > Although this approach seems to work well, EAGAIN has nothing to do with
> > how many requests are still during 3WHS. Thus, we have to know the number
> It sounds like the application can already drain the established socket
> by accept()?  To solve the problem that you have,
> does it mean migrating req_sk (the in-progress 3WHS) is enough?

Ideally, the application should only need to drain the accepted sockets,
because the 3WHS and tying a connection to a listener are purely kernel
behaviour. Also, there are cases where we want to apply new configurations
as soon as possible, such as replacing TLS certificates.

It is possible to drain the established sockets by accept(), but the
sockets in the accept queue have not started application sessions yet. So,
if we do not drain such sockets (or if the kernel happened to select
another listener), we could apply the new settings much earlier.

Moreover, the established sockets may become long-standing connections, so
we may not be able to finish draining for a long time and may have to
force-close them (whereas they would have a longer lifetime if they were
migrated to a new listener).


> Applications can already use the bpf prog to do (1) and divert
> the SYN to the newly started process.
> 
> If the application cares about service disruption,
> it usually needs to drain the fd(s) that it already has and
> finishes serving the pending request (e.g. https) on them anyway.
> The time taking to finish those could already be longer than it takes
> to drain the accept queue or finish off the 3WHS in reasonable time.
> or the application that you have does not need to drain the fd(s) 
> it already has and it can close them immediately?

From the point of view of service disruption, I agree with you.

However, I think there are situations where we would rather apply new
configurations than keep draining sockets that use the old ones, and that
if the kernel migrates sockets automatically, we can simplify user
programs.
Martin KaFai Lau Nov. 20, 2020, 2:31 a.m. UTC | #7
On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > accept connections evenly. However, there is a defect in the current
> > > implementation. When a SYN packet is received, the connection is tied to a
> > > listening socket. Accordingly, when the listener is closed, in-flight
> > > requests during the three-way handshake and child sockets in the accept
> > > queue are dropped even if other listeners could accept such connections.
> > > 
> > > This situation can happen when various server management tools restart
> > > server (such as nginx) processes. For instance, when we change nginx
> > > configurations and restart it, it spins up new workers that respect the new
> > > configuration and closes all listeners on the old workers, resulting in
> > > in-flight ACK of 3WHS is responded by RST.
> > > 
> > > As a workaround for this issue, we can do connection draining by eBPF:
> > > 
> > >   1. Before closing a listener, stop routing SYN packets to it.
> > >   2. Wait enough time for requests to complete 3WHS.
> > >   3. Accept connections until EAGAIN, then close the listener.
> > > 
> > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > how many requests are still during 3WHS. Thus, we have to know the number
> > It sounds like the application can already drain the established socket
> > by accept()?  To solve the problem that you have,
> > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> 
> Ideally, the application needs to drain only the accepted sockets because
> 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> there are some cases where we want to apply new configurations as soon as
> possible such as replacing TLS certificates.
> 
> It is possible to drain the established sockets by accept(), but the
> sockets in the accept queue have not started application sessions yet. So,
> if we do not drain such sockets (or if the kernel happened to select
> another listener), we can (could) apply the new settings much earlier.
> 
> Moreover, the established sockets may start long-standing connections so
> that we cannot complete draining for a long time and may have to
> force-close them (and they would have longer lifetime if they are migrated
> to a new listener).
> 
> 
> > Applications can already use the bpf prog to do (1) and divert
> > the SYN to the newly started process.
> > 
> > If the application cares about service disruption,
> > it usually needs to drain the fd(s) that it already has and
> > finishes serving the pending request (e.g. https) on them anyway.
> > The time taking to finish those could already be longer than it takes
> > to drain the accept queue or finish off the 3WHS in reasonable time.
> > or the application that you have does not need to drain the fd(s) 
> > it already has and it can close them immediately?
> 
> In the point of view of service disruption, I agree with you.
> 
> However, I think that there are some situations where we want to apply new
> configurations rather than to drain sockets with old configurations and
> that if the kernel migrates sockets automatically, we can simplify user
> programs.
This configuration-update(/new-TLS-cert...etc) consideration will be useful
if it is also included in the cover letter.

It sounds like the service that you have is draining the existing
already-accepted fd(s) which are using the old configuration.
Those existing fd(s) could also be long-lived.  Potentially those
existing fd(s) will be much higher in number than the
to-be-accepted fd(s)?

Or did you mean that in some cases it wants to migrate to the new
configuration ASAP (e.g. for security reasons) even if it has to close all
the already-accepted fd(s) which are using the old configuration?

In either case, considering that the already-accepted fd(s)
are usually much greater in number, do the to-be-accepted
connections make any difference percentage-wise?
Iwashima, Kuniyuki Nov. 21, 2020, 10:16 a.m. UTC | #8
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 19 Nov 2020 18:31:57 -0800
> On Fri, Nov 20, 2020 at 07:17:49AM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Wed, 18 Nov 2020 17:49:13 -0800
> > > On Tue, Nov 17, 2020 at 06:40:15PM +0900, Kuniyuki Iwashima wrote:
> > > > The SO_REUSEPORT option allows sockets to listen on the same port and to
> > > > accept connections evenly. However, there is a defect in the current
> > > > implementation. When a SYN packet is received, the connection is tied to a
> > > > listening socket. Accordingly, when the listener is closed, in-flight
> > > > requests during the three-way handshake and child sockets in the accept
> > > > queue are dropped even if other listeners could accept such connections.
> > > > 
> > > > This situation can happen when various server management tools restart
> > > > server (such as nginx) processes. For instance, when we change nginx
> > > > configurations and restart it, it spins up new workers that respect the new
> > > > configuration and closes all listeners on the old workers, resulting in
> > > > in-flight ACK of 3WHS is responded by RST.
> > > > 
> > > > As a workaround for this issue, we can do connection draining by eBPF:
> > > > 
> > > >   1. Before closing a listener, stop routing SYN packets to it.
> > > >   2. Wait enough time for requests to complete 3WHS.
> > > >   3. Accept connections until EAGAIN, then close the listener.
> > > > 
> > > > Although this approach seems to work well, EAGAIN has nothing to do with
> > > > how many requests are still during 3WHS. Thus, we have to know the number
> > > It sounds like the application can already drain the established socket
> > > by accept()?  To solve the problem that you have,
> > > does it mean migrating req_sk (the in-progress 3WHS) is enough?
> > 
> > Ideally, the application needs to drain only the accepted sockets because
> > 3WHS and tying a connection to a listener are just kernel behaviour. Also,
> > there are some cases where we want to apply new configurations as soon as
> > possible such as replacing TLS certificates.
> > 
> > It is possible to drain the established sockets by accept(), but the
> > sockets in the accept queue have not started application sessions yet. So,
> > if we do not drain such sockets (or if the kernel happened to select
> > another listener), we can (could) apply the new settings much earlier.
> > 
> > Moreover, the established sockets may start long-standing connections so
> > that we cannot complete draining for a long time and may have to
> > force-close them (and they would have longer lifetime if they are migrated
> > to a new listener).
> > 
> > 
> > > Applications can already use the bpf prog to do (1) and divert
> > > the SYN to the newly started process.
> > > 
> > > If the application cares about service disruption,
> > > it usually needs to drain the fd(s) that it already has and
> > > finishes serving the pending request (e.g. https) on them anyway.
> > > The time taking to finish those could already be longer than it takes
> > > to drain the accept queue or finish off the 3WHS in reasonable time.
> > > or the application that you have does not need to drain the fd(s) 
> > > it already has and it can close them immediately?
> > 
> > In the point of view of service disruption, I agree with you.
> > 
> > However, I think that there are some situations where we want to apply new
> > configurations rather than to drain sockets with old configurations and
> > that if the kernel migrates sockets automatically, we can simplify user
> > programs.
> This configuration-update(/new-TLS-cert...etc) consideration will be useful
> if it is also included in the cover letter.

I will add this to the next cover letter.


> It sounds like the service that you have is draining the existing
> already-accepted fd(s) which are using the old configuration.
> Those existing fd(s) could also be long life.  Potentially those
> existing fd(s) will be in a much higher number than the
> to-be-accepted fd(s)?

In many cases, yes.


> or you meant in some cases it wants to migrate to the new configuration
> ASAP (e.g. for security reason) even it has to close all the
> already-accepted fds() which are using the old configuration??

And sometimes, yes.
As you expected, for some reasons including security, there are cases
where we have to prioritize closing connections over completing them.

For example, HTTP/1.1 connections are often short-lived, and we can
complete draining quickly. However, they can sometimes become long-lived
by upgrading to WebSocket, and then we may not be able to wait for
draining to finish.


> In either cases, considering the already-accepted fd(s)
> is usually in a much more number, does the to-be-accepted
> connection make any difference percentage-wise?

It is difficult to drain all connections in every case, but migration can
reduce the number of such aborted connections. In that sense, I think
migration is always better than draining.