From patchwork Tue Nov 17 09:40:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911817 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.9 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14F85C64E75 for ; Tue, 17 Nov 2020 09:41:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BAA432466D for ; Tue, 17 Nov 2020 09:41:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="eEWyVZhe" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727353AbgKQJlU (ORCPT ); Tue, 17 Nov 2020 04:41:20 -0500 Received: from smtp-fw-6001.amazon.com ([52.95.48.154]:10411 "EHLO smtp-fw-6001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725355AbgKQJlU (ORCPT ); Tue, 17 Nov 2020 04:41:20 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606080; x=1637142080; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=+7noeoH7kZjzLDKf5z3YvNItSMxEV1uiKyWB4NySDiQ=; b=eEWyVZhePVq05yoGDnVYEIMFgDcg3NN0xEP7SJ23y9a8qDpZhMS2Vhdt kgzJiDpH8QZtpUqx2J2k/R06ZBWws8r5tzo3mGcJSAmneJ5R3gBaES7Ma LmgqbQ5nfPC3/ACcNlIaGTyJuhsgecYKC9GDG6hhMKNuhBCBwkdXgaVPk w=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="66876881" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-7d76a15f.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6001.iad6.amazon.com with ESMTP; 17 Nov 2020 09:41:19 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1a-7d76a15f.us-east-1.amazon.com (Postfix) with ESMTPS id 33D1AA0570; Tue, 17 Nov 2020 09:41:18 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:17 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:07 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 1/8] net: Introduce net.ipv4.tcp_migrate_req. Date: Tue, 17 Nov 2020 18:40:16 +0900 Message-ID: <20201117094023.3685-2-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled, and then we call listen() for SO_REUSEPORT enabled sockets and close one, we will be able to migrate its child sockets to another listener. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- Documentation/networking/ip-sysctl.rst | 15 +++++++++++++++ include/net/netns/ipv4.h | 1 + net/ipv4/sysctl_net_ipv4.c | 9 +++++++++ 3 files changed, 25 insertions(+) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index dd2b12a32b73..4116771bf5ef 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -712,6 +712,21 @@ tcp_syncookies - INTEGER network connections you can set this knob to 2 to enable unconditionally generation of syncookies. +tcp_migrate_req - INTEGER + By default, when a listening socket is closed, child sockets are also + closed. If it has SO_REUSEPORT enabled, the dropped connections should + have been accepted by other listeners on the same port. This option + makes it possible to migrate child sockets to another listener when + calling close() or shutdown(). + + Default: 0 + + Note that the source and destination listeners _must_ have the same + settings at the socket API level. If there are different kinds of + sockets on the port, disable this option or use + BPF_PROG_TYPE_SK_REUSEPORT program to select the correct socket by + bpf_sk_select_reuseport() or to cancel migration by returning SK_DROP. + tcp_fastopen - INTEGER Enable TCP Fast Open (RFC7413) to send and accept data in the opening SYN packet. diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 8e4fcac4df72..a3edc30d6a63 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -132,6 +132,7 @@ struct netns_ipv4 { int sysctl_tcp_syn_retries; int sysctl_tcp_synack_retries; int sysctl_tcp_syncookies; + int sysctl_tcp_migrate_req; int sysctl_tcp_reordering; int sysctl_tcp_retries1; int sysctl_tcp_retries2; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 3e5f4f2e705e..6b76298fa271 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -933,6 +933,15 @@ static struct ctl_table ipv4_net_table[] = { .proc_handler = proc_dointvec }, #endif + { + .procname = "tcp_migrate_req", + .data = &init_net.ipv4.sysctl_tcp_migrate_req, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE + }, { .procname = "tcp_reordering", .data = &init_net.ipv4.sysctl_tcp_reordering, From patchwork Tue Nov 17 09:40:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911819 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8719AC2D0E4 for ; Tue, 17 Nov 2020 09:42:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 194AE2464E for ; Tue, 17 Nov 2020 09:42:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="UG3TcgUt" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727240AbgKQJlc (ORCPT ); Tue, 17 Nov 2020 04:41:32 -0500 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:35108 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725355AbgKQJlc (ORCPT ); Tue, 17 Nov 2020 04:41:32 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606090; x=1637142090; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=t9UMtEM+NwbJ05gHATdIGVFgiZ8G6C5NenHPztMgsX0=; b=UG3TcgUtfgoEgPwiGZcZOiE36n+b94DuGIzFpWnAp/mOfNhmpoJ2Hze3 cjWM+T1Obq1ElrqlYBr+tbOUTB+/sK8XiXC37Qr7cAPPn5BBJ6CvXmFvX tPO2VtTdPRuEAeHb3A8bS7PSbdwDgc3YLeu4pycZHErrPYp4XnJn98Kn7 w=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="96088317" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1d-74cf8b49.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 17 Nov 2020 09:41:29 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1d-74cf8b49.us-east-1.amazon.com (Postfix) with ESMTPS id E77A1C077D; Tue, 17 Nov 2020 09:41:27 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:27 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:23 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 2/8] tcp: Keep TCP_CLOSE sockets in the reuseport group. Date: Tue, 17 Nov 2020 18:40:17 +0900 Message-ID: <20201117094023.3685-3-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This patch is a preparation patch to migrate incoming connections in the later commits and adds two fields (migrate_req and num_closed_socks) to the struct sock_reuseport to keep TCP_CLOSE sockets in the reuseport group. If migrate_req is 1, and then we close a listening socket, we can migrate its connections to another listener in the same reuseport group. Then we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, it is impossible because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and to have access to it while any child socket references to them. The point is that reuseport_detach_sock() is called twice from inet_unhash() and sk_destruct(). At first, it moves the socket backwards in socks[] and increments num_closed_socks. Later, when all migrated connections are accepted, it removes the socket from socks[], decrements num_closed_socks, and sets NULL to sk_reuseport_cb. By this change, closed sockets can keep sk_reuseport_cb until all child requests have been freed or accepted. Consequently calling listen() after shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or inet_csk_bind_conflict() which expect that such sockets should not have the reuseport group. Therefore, this patch loosens such validation rules so that the socket can listen again if it has the same reuseport group with other listening sockets. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h | 6 ++- net/core/sock_reuseport.c | 83 +++++++++++++++++++++++++++------ net/ipv4/inet_connection_sock.c | 7 ++- 3 files changed, 79 insertions(+), 17 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 505f1e18e9bf..ade3af55c91f 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock; struct sock_reuseport { struct rcu_head rcu; - u16 max_socks; /* length of socks */ - u16 num_socks; /* elements in socks */ + u16 max_socks; /* length of socks */ + u16 num_socks; /* elements in socks */ + u16 num_closed_socks; /* closed elements in socks */ /* The last synq overflow event timestamp of this * reuse->socks[] group. */ @@ -23,6 +24,7 @@ struct sock_reuseport { unsigned int reuseport_id; unsigned int bind_inany:1; unsigned int has_conns:1; + unsigned int migrate_req:1; struct bpf_prog __rcu *prog; /* optional BPF sock selector */ struct sock *socks[]; /* array of sock pointers */ }; diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index bbdd3c7b6cb5..01a8b4ba39d7 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -36,6 +36,7 @@ static struct sock_reuseport *__reuseport_alloc(unsigned int max_socks) int reuseport_alloc(struct sock *sk, bool bind_inany) { struct sock_reuseport *reuse; + struct net *net = sock_net(sk); int id, ret = 0; /* bh lock used since this function call may precede hlist lock in @@ -75,6 +76,8 @@ int reuseport_alloc(struct sock *sk, bool bind_inany) reuse->socks[0] = sk; reuse->num_socks = 1; reuse->bind_inany = bind_inany; + reuse->migrate_req = sk->sk_protocol == IPPROTO_TCP ? + net->ipv4.sysctl_tcp_migrate_req : 0; rcu_assign_pointer(sk->sk_reuseport_cb, reuse); out: @@ -98,16 +101,22 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) return NULL; more_reuse->num_socks = reuse->num_socks; + more_reuse->num_closed_socks = reuse->num_closed_socks; more_reuse->prog = reuse->prog; more_reuse->reuseport_id = reuse->reuseport_id; more_reuse->bind_inany = reuse->bind_inany; more_reuse->has_conns = reuse->has_conns; + more_reuse->migrate_req = reuse->migrate_req; + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); - more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); + memcpy(more_reuse->socks + + (more_reuse->max_socks - more_reuse->num_closed_socks), + reuse->socks + reuse->num_socks, + reuse->num_closed_socks * sizeof(struct sock *)); - for (i = 0; i < reuse->num_socks; ++i) + for (i = 0; i < reuse->max_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, more_reuse); @@ -129,6 +138,25 @@ static void reuseport_free_rcu(struct rcu_head *head) kfree(reuse); } +static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk, + bool closed) +{ + int left, right; + + if (!closed) { + left = 0; + right = reuse->num_socks; + } else { + left = reuse->max_socks - reuse->num_closed_socks; + right = reuse->max_socks; + } + + for (; left < right; left++) + if (reuse->socks[left] == sk) + return left; + return -1; +} + /** * reuseport_add_sock - Add a socket to the reuseport group of another. * @sk: New socket to add to the group. @@ -153,12 +181,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) lockdep_is_held(&reuseport_lock)); old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); - if (old_reuse && old_reuse->num_socks != 1) { + + if (old_reuse == reuse) { + int i = reuseport_sock_index(reuse, sk, true); + + if (i == -1) { + spin_unlock_bh(&reuseport_lock); + return -EBUSY; + } + + reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks]; + reuse->num_closed_socks--; + } else if (old_reuse && old_reuse->num_socks != 1) { spin_unlock_bh(&reuseport_lock); return -EBUSY; } - if (reuse->num_socks == reuse->max_socks) { + if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) { reuse = reuseport_grow(reuse); if (!reuse) { spin_unlock_bh(&reuseport_lock); @@ -174,8 +213,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) spin_unlock_bh(&reuseport_lock); - if (old_reuse) + if (old_reuse && old_reuse != reuse) call_rcu(&old_reuse->rcu, reuseport_free_rcu); + return 0; } EXPORT_SYMBOL(reuseport_add_sock); @@ -199,17 +239,34 @@ void reuseport_detach_sock(struct sock *sk) */ bpf_sk_reuseport_detach(sk); - rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + if (!reuse->migrate_req || sk->sk_state == TCP_LISTEN) { + i = reuseport_sock_index(reuse, sk, false); + if (i == -1) + goto out; + + reuse->num_socks--; + reuse->socks[i] = reuse->socks[reuse->num_socks]; - for (i = 0; i < reuse->num_socks; i++) { - if (reuse->socks[i] == sk) { - reuse->socks[i] = reuse->socks[reuse->num_socks - 1]; - reuse->num_socks--; - if (reuse->num_socks == 0) - call_rcu(&reuse->rcu, reuseport_free_rcu); - break; + if (reuse->migrate_req) { + reuse->num_closed_socks++; + reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; + } else { + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); } + } else { + i = reuseport_sock_index(reuse, sk, true); + if (i == -1) + goto out; + + reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks]; + reuse->num_closed_socks--; + + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); } + + if (reuse->num_socks + reuse->num_closed_socks == 0) + call_rcu(&reuse->rcu, reuseport_free_rcu); +out: spin_unlock_bh(&reuseport_lock); } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 4148f5f78f31..be8cda5b664f 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk, bool reuse = sk->sk_reuse; bool reuseport = !!sk->sk_reuseport; kuid_t uid = sock_i_uid((struct sock *)sk); + struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb); /* * Unlike other sk lookup places we do not check @@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk, if ((!relax || (!reuseport_ok && reuseport && sk2->sk_reuseport && - !rcu_access_pointer(sk->sk_reuseport_cb) && + (!reuseport_cb || + reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) && (sk2->sk_state == TCP_TIME_WAIT || uid_eq(uid, sock_i_uid(sk2))))) && inet_rcv_saddr_equal(sk, sk2, true)) break; } else if (!reuseport_ok || !reuseport || !sk2->sk_reuseport || - rcu_access_pointer(sk->sk_reuseport_cb) || + (reuseport_cb && + reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2)))) { if (inet_rcv_saddr_equal(sk, sk2, true)) From patchwork Tue Nov 17 09:40:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911815 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 372E3C63798 for ; Tue, 17 Nov 2020 09:42:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D40BF24656 for ; Tue, 17 Nov 2020 09:42:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="uD5un3ij" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727417AbgKQJlm (ORCPT ); Tue, 17 Nov 2020 04:41:42 -0500 Received: from smtp-fw-6001.amazon.com ([52.95.48.154]:10482 "EHLO smtp-fw-6001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727402AbgKQJll (ORCPT ); Tue, 17 Nov 2020 04:41:41 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606101; x=1637142101; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=pqphziyiMnMteRD3/Ub1HhlxTKc6lKBH2L8N47H+TE4=; b=uD5un3ij/o6pEn3mwDvieaE3qc8XsHjUTOhXZQhW60IIOKVJk/Ody+oF tCFoPAQk65BBZ5CiaDKtHY8wZnZLvLPi6J95NwZXXGULU7c1SEQV8Hjbe xup0w+M/OUNtyZJDX8mmWRoqoBo/7YEZe1p3QgHlOaN6yxWvgGI7lkRdt k=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="66876948" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-807d4a99.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6001.iad6.amazon.com with ESMTP; 17 Nov 2020 09:41:40 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1a-807d4a99.us-east-1.amazon.com (Postfix) with ESMTPS id 064B8A1F36; Tue, 17 Nov 2020 09:41:37 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:37 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:33 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. Date: Tue, 17 Nov 2020 18:40:18 +0900 Message-ID: <20201117094023.3685-4-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This patch lets reuseport_detach_sock() return a pointer of struct sock, which is used only by inet_unhash(). If it is not NULL, inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV sockets from the closing listener to the selected one. Listening sockets hold incoming connections as a linked list of struct request_sock in the accept queue, and each request has reference to a full socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the requests from the closing listener's queue and relink them to the head of the new listener's queue. We do not process each request, so the migration completes in O(1) time complexity. However, in the case of TCP_SYN_RECV sockets, we will take special care in the next commit. By default, we select the last element of socks[] as the new listener. This behaviour is based on how the kernel moves sockets in socks[]. For example, we call listen() for four sockets (A, B, C, D), and close the first two by turns. The sockets move in socks[] like below. (See also [1]) socks[0] : A <-. socks[0] : D socks[0] : D socks[1] : B | => socks[1] : B <-. => socks[1] : C socks[2] : C | socks[2] : C --' socks[3] : D --' Then, if C and D have newer settings than A and B, and each socket has a request (a, b, c, d) in their accept queue, we can redistribute old requests evenly to new listeners. socks[0] : A (a) <-. socks[0] : D (a + d) socks[0] : D (a + d) socks[1] : B (b) | => socks[1] : B (b) <-. => socks[1] : C (b + c) socks[2] : C (c) | socks[2] : C (c) --' socks[3] : D (d) --' Here, (A, D), or (B, C) can have different application settings, but they MUST have the same settings at the socket API level; otherwise, unexpected error may happen. For instance, if only the new listeners have TCP_SAVE_SYN, old requests do not have SYN data, so the application will face inconsistency and cause an error. Therefore, if there are different kinds of sockets, we must disable this feature or use an eBPF program described in later commits. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/ --- include/net/inet_connection_sock.h | 1 + include/net/sock_reuseport.h | 2 +- net/core/sock_reuseport.c | 8 +++++++- net/ipv4/inet_connection_sock.c | 30 ++++++++++++++++++++++++++++++ net/ipv4/inet_hashtables.c | 9 +++++++-- 5 files changed, 46 insertions(+), 4 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 7338b3865a2a..2ea2d743f8fc 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk, struct sock *inet_csk_reqsk_queue_add(struct sock *sk, struct request_sock *req, struct sock *child); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk); void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req, unsigned long timeout); struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index ade3af55c91f..ece1c70ca907 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -32,7 +32,7 @@ struct sock_reuseport { extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); -extern void reuseport_detach_sock(struct sock *sk); +extern struct sock *reuseport_detach_sock(struct sock *sk); extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 01a8b4ba39d7..74a46197854b 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -220,9 +220,10 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } EXPORT_SYMBOL(reuseport_add_sock); -void reuseport_detach_sock(struct sock *sk) +struct sock *reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; + struct sock *nsk = NULL; int i; spin_lock_bh(&reuseport_lock); @@ -248,6 +249,9 @@ void reuseport_detach_sock(struct sock *sk) reuse->socks[i] = reuse->socks[reuse->num_socks]; if (reuse->migrate_req) { + if (reuse->num_socks) + nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + reuse->num_closed_socks++; reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; } else { @@ -268,6 +272,8 @@ void reuseport_detach_sock(struct sock *sk) call_rcu(&reuse->rcu, reuseport_free_rcu); out: spin_unlock_bh(&reuseport_lock); + + return nsk; } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index be8cda5b664f..583db7e2b1da 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk, } EXPORT_SYMBOL(inet_csk_reqsk_queue_add); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) +{ + struct request_sock_queue *old_accept_queue, *new_accept_queue; + + old_accept_queue = &inet_csk(sk)->icsk_accept_queue; + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; + + spin_lock(&old_accept_queue->rskq_lock); + spin_lock(&new_accept_queue->rskq_lock); + + if (old_accept_queue->rskq_accept_head) { + if (new_accept_queue->rskq_accept_head) + old_accept_queue->rskq_accept_tail->dl_next = + new_accept_queue->rskq_accept_head; + else + new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail; + + new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head; + old_accept_queue->rskq_accept_head = NULL; + old_accept_queue->rskq_accept_tail = NULL; + + WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog); + WRITE_ONCE(sk->sk_ack_backlog, 0); + } + + spin_unlock(&new_accept_queue->rskq_lock); + spin_unlock(&old_accept_queue->rskq_lock); +} +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate); + struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, struct request_sock *req, bool own_req) { diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 8cbe74313f38..f35c76cf3365 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -629,6 +629,7 @@ void inet_unhash(struct sock *sk) { struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; struct inet_listen_hashbucket *ilb = NULL; + struct sock *nsk; spinlock_t *lock; if (sk_unhashed(sk)) @@ -644,8 +645,12 @@ void inet_unhash(struct sock *sk) if (sk_unhashed(sk)) goto unlock; - if (rcu_access_pointer(sk->sk_reuseport_cb)) - reuseport_detach_sock(sk); + if (rcu_access_pointer(sk->sk_reuseport_cb)) { + nsk = reuseport_detach_sock(sk); + if (nsk) + inet_csk_reqsk_queue_migrate(sk, nsk); + } + if (ilb) { inet_unhash2(hashinfo, sk); ilb->count--; From patchwork Tue Nov 17 09:40:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911825 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E9213C64E7C for ; Tue, 17 Nov 2020 09:42:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 959FC2463F for ; Tue, 17 Nov 2020 09:42:12 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="NSGICnlb" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727476AbgKQJlz (ORCPT ); Tue, 17 Nov 2020 04:41:55 -0500 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:35208 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727218AbgKQJlz (ORCPT ); Tue, 17 Nov 2020 04:41:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606115; x=1637142115; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=hEFvtzZFGSnVCTICpdWrUCFtgf37qHg3ZYvzvaotl6s=; b=NSGICnlbxGMacyENHQvicjqEvZENYV/z2ZXY0/IMJzLzyxiN8uXeVGnd +xXmBrIYf573z2RV7/QExmXijLpx/xC/Myth5WN2Xe4TDTkf1nw78GhRf jlvWcSk8Ja16hLF16c0W1JBfFMQm+Y+DTpQA0CdA5pVTWJmSFFxUPyNY2 U=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="96088423" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1a-af6a10df.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 17 Nov 2020 09:41:54 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1a-af6a10df.us-east-1.amazon.com (Postfix) with ESMTPS id BA015A0481; Tue, 17 Nov 2020 09:41:52 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:52 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:41:48 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 4/8] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV. Date: Tue, 17 Nov 2020 18:40:19 +0900 Message-ID: <20201117094023.3685-5-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC A TFO request socket is only freed after BOTH 3WHS has completed (or aborted) and the child socket has been accepted (or its listener closed). Hence, depending on the order, there can be two kinds of request sockets in the accept queue. 3WHS -> accept : TCP_ESTABLISHED accept -> 3WHS : TCP_SYN_RECV Unlike TCP_ESTABLISHED socket, accept() does not free the request socket for TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove(). Also, it accesses request_sock.rsk_listener. So, in order to complete TFO socket migration, we have to set the current listener to it at accept() before reqsk_fastopen_remove(). Moreover, if TFO request caused RST before 3WHS has completed, it is held in the listener's TFO queue to prevent DDoS attack. Thus, we also have to migrate the requests in TFO queue. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 583db7e2b1da..398c5c708bc5 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) tcp_rsk(req)->tfo_listener) { spin_lock_bh(&queue->fastopenq.lock); if (tcp_rsk(req)->tfo_listener) { + if (req->rsk_listener != sk) { + /* TFO request was migrated to another listener so + * the new listener must be used in reqsk_fastopen_remove() + * to hold requests which cause RST. + */ + sock_put(req->rsk_listener); + sock_hold(sk); + req->rsk_listener = sk; + } + /* We are still waiting for the final ACK from 3WHS * so can't free req now. Instead, we set req->sk to * NULL to signify that the child socket is taken @@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req, if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) { BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req); - BUG_ON(sk != req->rsk_listener); /* Paranoid, to prevent race condition if * an inbound pkt destined for child is @@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add); void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) { struct request_sock_queue *old_accept_queue, *new_accept_queue; + struct fastopen_queue *old_fastopenq, *new_fastopenq; old_accept_queue = &inet_csk(sk)->icsk_accept_queue; new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; @@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) spin_unlock(&new_accept_queue->rskq_lock); spin_unlock(&old_accept_queue->rskq_lock); + + old_fastopenq = &old_accept_queue->fastopenq; + new_fastopenq = &new_accept_queue->fastopenq; + + spin_lock_bh(&old_fastopenq->lock); + spin_lock_bh(&new_fastopenq->lock); + + new_fastopenq->qlen += old_fastopenq->qlen; + old_fastopenq->qlen = 0; + + if (old_fastopenq->rskq_rst_head) { + if (new_fastopenq->rskq_rst_head) + old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head; + else + old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail; + + new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head; + old_fastopenq->rskq_rst_head = NULL; + old_fastopenq->rskq_rst_tail = NULL; + } + + spin_unlock_bh(&new_fastopenq->lock); + spin_unlock_bh(&old_fastopenq->lock); } EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate); From patchwork Tue Nov 17 09:40:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911821 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D0D7C63777 for ; Tue, 17 Nov 2020 09:42:48 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F1E852464E for ; Tue, 17 Nov 2020 09:42:47 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="RyjwacqK" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727497AbgKQJmN (ORCPT ); Tue, 17 Nov 2020 04:42:13 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:8845 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726594AbgKQJmM (ORCPT ); Tue, 17 Nov 2020 04:42:12 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606132; x=1637142132; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=6AQ/lwoNdq7Ec+hf7LIe46G27QpR4QjgzjOBsKqGnbI=; b=RyjwacqK+uHLmkLiptNzYt2l8W1IHYegLa3xI7vhYpns2Q5nlKmsYptx kNuIWmtFsX5qnqJu8yKzGV5VDjobKwN7Ndt9oErO5Wnke7cpwKfdjiXwT Dji74t9upmNxLcRzjXGAXKmMY9bxY2Qh2ru9+4VxaFqLtYxqmM9TpGSVE Y=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="65440629" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1a-821c648d.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 17 Nov 2020 09:42:10 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1a-821c648d.us-east-1.amazon.com (Postfix) with ESMTPS id 01FFDA1F93; Tue, 17 Nov 2020 09:42:07 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:07 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:03 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 5/8] tcp: Migrate TCP_NEW_SYN_RECV requests. Date: Tue, 17 Nov 2020 18:40:20 +0900 Message-ID: <20201117094023.3685-6-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV requests at receiving the final ACK or sending a SYN+ACK. Therefore, this patch changes the code to call reuseport_select_sock() even if the listening socket is TCP_CLOSE. If we can pick out a listening socket from the reuseport group, we rewrite request_sock.rsk_listener and resume processing the request. Note that we call reuseport_select_sock() with skb NULL so that it selects a listener randomly by hash. There are two reasons to do so. First, we do not remember from which listener to which listener we have migrated sockets at close() or shutdown() syscalls, so we redistribute the requests evenly. As regards the second, we will cover in a later commit. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 12 ++++++++++++ include/net/request_sock.h | 13 +++++++++++++ net/ipv4/inet_connection_sock.c | 12 ++++++++++-- net/ipv4/tcp_ipv4.c | 9 +++++++-- net/ipv6/tcp_ipv6.c | 9 +++++++-- 5 files changed, 49 insertions(+), 6 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 2ea2d743f8fc..1e0958f5eb21 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk) reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); } +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk, + struct sock *nsk, + struct request_sock *req) +{ + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue, + &inet_csk(nsk)->icsk_accept_queue, + req); + sock_put(sk); + sock_hold(nsk); + req->rsk_listener = nsk; +} + static inline int inet_csk_reqsk_queue_len(const struct sock *sk) { return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue); diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 29e41ff3ec93..d18ba0b857cc 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue) atomic_inc(&queue->qlen); } +static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue, + struct request_sock_queue *new_accept_queue, + const struct request_sock *req) +{ + atomic_dec(&old_accept_queue->qlen); + atomic_inc(&new_accept_queue->qlen); + + if (req->num_timeout == 0) { + atomic_dec(&old_accept_queue->young); + atomic_inc(&new_accept_queue->young); + } +} + static inline int reqsk_queue_len(const struct request_sock_queue *queue) { return atomic_read(&queue->qlen); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 398c5c708bc5..8be20e3fff4f 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -743,8 +743,16 @@ static void reqsk_timer_handler(struct timer_list *t) struct request_sock_queue *queue = &icsk->icsk_accept_queue; int max_syn_ack_retries, qlen, expire = 0, resend = 0; - if (inet_sk_state_load(sk_listener) != TCP_LISTEN) - goto drop; + if (inet_sk_state_load(sk_listener) != TCP_LISTEN) { + sk_listener = reuseport_select_sock(sk_listener, sk_listener->sk_hash, NULL, 0); + if (!sk_listener) { + sk_listener = req->rsk_listener; + goto drop; + } + inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req); + icsk = inet_csk(sk_listener); + queue = &icsk->icsk_accept_queue; + } max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries; /* Normally all the openreqs are young and become mature diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index c2d5132c523c..7219984584ae 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1957,8 +1957,13 @@ int tcp_v4_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_sock(sk, sk->sk_hash, NULL, 0); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } /* We own a reference on the listener, increase it again * as we might lose it too soon. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 8db59f4e5f13..9a068b69a26e 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1619,8 +1619,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_sock(sk, sk->sk_hash, NULL, 0); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } sock_hold(sk); refcounted = true; From patchwork Tue Nov 17 09:40:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911829 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E60FC64E7C for ; Tue, 17 Nov 2020 09:42:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4C066206B2 for ; Tue, 17 Nov 2020 09:42:49 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="ba9b2fiJ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727560AbgKQJme (ORCPT ); Tue, 17 Nov 2020 04:42:34 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:8913 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727122AbgKQJme (ORCPT ); Tue, 17 Nov 2020 04:42:34 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606153; x=1637142153; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=OeOCcafXw5F3JQeGaXDQErSXG44umN5sS/1hxlRsSz0=; b=ba9b2fiJnRGjs6Ztt3Dy1Os25P5ZXd59BPcvJf8K8Qxaf71fVQupab3Z oc1NH10EQFvkyx41vRDyErkNRr7TAdyv+Sj7HXRbyS+q1P/aNkRVZfVBZ wSSWYWbd7PdoPw/2UwTT7ic8p4bBCyeqJY+kb/gJ0O1RAO8Nx3/SxyKtg 8=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="65440695" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1d-37fd6b3d.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 17 Nov 2020 09:42:33 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan2.iad.amazon.com [10.40.163.34]) by email-inbound-relay-1d-37fd6b3d.us-east-1.amazon.com (Postfix) with ESMTPS id D3F46281F8D; Tue, 17 Nov 2020 09:42:29 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.249) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:29 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:18 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 6/8] bpf: Add cookie in sk_reuseport_md. Date: Tue, 17 Nov 2020 18:40:21 +0900 Message-ID: <20201117094023.3685-7-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing in order to select the new listener. Currently, we can get a unique ID for each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch exposes the ID to the eBPF program. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 1 + net/core/filter.c | 8 ++++++++ tools/include/uapi/linux/bpf.h | 1 + 4 files changed, 11 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 581b2a2e78eb..c0646eceffa2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1897,6 +1897,7 @@ struct sk_reuseport_kern { u32 hash; u32 reuseport_id; bool bind_inany; + u64 cookie; }; bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 162999b12790..3fcddb032838 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4403,6 +4403,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u64 cookie; /* ID of the listener in map */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..01e28f283962 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9862,6 +9862,7 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; reuse_kern->bind_inany = reuse->bind_inany; + reuse_kern->cookie = sock_gen_cookie(sk); } struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, @@ -10010,6 +10011,9 @@ sk_reuseport_is_valid_access(int off, int size, case offsetof(struct sk_reuseport_md, hash): return size == size_default; + case bpf_ctx_range(struct sk_reuseport_md, cookie): + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10082,6 +10086,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, bind_inany): SK_REUSEPORT_LOAD_FIELD(bind_inany); break; + + case offsetof(struct sk_reuseport_md, cookie): + SK_REUSEPORT_LOAD_FIELD(cookie); + break; } return insn - insn_buf; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 162999b12790..3fcddb032838 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4403,6 +4403,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u64 cookie; /* ID of the listener in map */ }; #define BPF_TAG_SIZE 8 From patchwork Tue Nov 17 09:40:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911827 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7A74C64E7B for ; Tue, 17 Nov 2020 09:42:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 834A72463F for ; Tue, 17 Nov 2020 09:42:49 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="BEvJLRk+" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727633AbgKQJml (ORCPT ); Tue, 17 Nov 2020 04:42:41 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:8939 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727122AbgKQJml (ORCPT ); Tue, 17 Nov 2020 04:42:41 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606160; x=1637142160; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=wrzdg6wvKa1DnV7c+HpDFpqsfyVbRJIGt6M+PZA79cU=; b=BEvJLRk+ioOn5ZQI/dX/CUV6bfTvWEev1cmCGBLrTjlofkav57c8qFx/ iprLIDWFFSxbQ96iHok6cQKi9LuI0MxY17QdBzc1pYy7M219S7vo9YRF0 0gyoEQ1iyqp+alXvG3ruGXicLABsRx0hVo+ug1NqZxvOQLLlR2vMC+MiN c=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="65440717" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-1d-98acfc19.us-east-1.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 17 Nov 2020 09:42:40 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1d-98acfc19.us-east-1.amazon.com (Postfix) with ESMTPS id 9998DA182C; Tue, 17 Nov 2020 09:42:37 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:36 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:33 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 7/8] bpf: Call bpf_run_sk_reuseport() for socket migration. Date: Tue, 17 Nov 2020 18:40:22 +0900 Message-ID: <20201117094023.3685-8-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This patch makes it possible to select a new listener for socket migration by eBPF. The noteworthy point is that we select a listening socket in reuseport_detach_sock() and reuseport_select_sock(), but we do not have struct skb in the unhash path. Since we cannot pass skb to the eBPF program, we run only the BPF_PROG_TYPE_SK_REUSEPORT program by calling bpf_run_sk_reuseport() with skb NULL. So, some fields derived from skb are also NULL in the eBPF program. Moreover, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/core/filter.c | 26 +++++++++++++++++++++----- net/core/sock_reuseport.c | 23 ++++++++++++++++++++--- net/ipv4/inet_hashtables.c | 2 +- 3 files changed, 42 insertions(+), 9 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 01e28f283962..ffc4591878b8 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -8914,6 +8914,22 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type, SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF(S, NS, F, NF, \ BPF_FIELD_SIZEOF(NS, NF), 0) +#define SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, SIZE, OFF) \ + do { \ + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(S, F), si->dst_reg, \ + si->src_reg, offsetof(S, F)); \ + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); \ + *insn++ = BPF_LDX_MEM( \ + SIZE, si->dst_reg, si->dst_reg, \ + bpf_target_off(NS, NF, sizeof_field(NS, NF), \ + target_size) \ + + OFF); \ + } while (0) + +#define SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(S, NS, F, NF) \ + SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF_OR_NULL(S, NS, F, NF, \ + BPF_FIELD_SIZEOF(NS, NF), 0) + /* SOCK_ADDR_STORE_NESTED_FIELD_OFF() has semantic similar to * SOCK_ADDR_LOAD_NESTED_FIELD_SIZE_OFF() but for store operation. * @@ -9858,7 +9874,7 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, reuse_kern->skb = skb; reuse_kern->sk = sk; reuse_kern->selected_sk = NULL; - reuse_kern->data_end = skb->data + skb_headlen(skb); + reuse_kern->data_end = skb ? skb->data + skb_headlen(skb) : NULL; reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; reuse_kern->bind_inany = reuse->bind_inany; @@ -10039,10 +10055,10 @@ sk_reuseport_is_valid_access(int off, int size, }) #define SK_REUSEPORT_LOAD_SKB_FIELD(SKB_FIELD) \ - SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern, \ - struct sk_buff, \ - skb, \ - SKB_FIELD) + SOCK_ADDR_LOAD_NESTED_FIELD_OR_NULL(struct sk_reuseport_kern, \ + struct sk_buff, \ + skb, \ + SKB_FIELD) #define SK_REUSEPORT_LOAD_SK_FIELD(SK_FIELD) \ SOCK_ADDR_LOAD_NESTED_FIELD(struct sk_reuseport_kern, \ diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 74a46197854b..903f78ab35c3 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -224,6 +224,7 @@ struct sock *reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; struct sock *nsk = NULL; + struct bpf_prog *prog; int i; spin_lock_bh(&reuseport_lock); @@ -249,8 +250,16 @@ struct sock *reuseport_detach_sock(struct sock *sk) reuse->socks[i] = reuse->socks[reuse->num_socks]; if (reuse->migrate_req) { - if (reuse->num_socks) - nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + if (reuse->num_socks) { + prog = rcu_dereference(reuse->prog); + if (prog && prog->type == BPF_PROG_TYPE_SK_REUSEPORT) + nsk = bpf_run_sk_reuseport(reuse, sk, prog, + NULL, sk->sk_hash); + + if (!nsk) + nsk = i == reuse->num_socks ? + reuse->socks[i - 1] : reuse->socks[i]; + } reuse->num_closed_socks++; reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; @@ -340,8 +349,16 @@ struct sock *reuseport_select_sock(struct sock *sk, /* paired with smp_wmb() in reuseport_add_sock() */ smp_rmb(); - if (!prog || !skb) + if (!prog) + goto select_by_hash; + + if (!skb) { + if (reuse->migrate_req && + prog->type == BPF_PROG_TYPE_SK_REUSEPORT) + sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); + goto select_by_hash; + } if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index f35c76cf3365..d981e4876679 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -647,7 +647,7 @@ void inet_unhash(struct sock *sk) if (rcu_access_pointer(sk->sk_reuseport_cb)) { nsk = reuseport_detach_sock(sk); - if (nsk) + if (!IS_ERR_OR_NULL(nsk)) inet_csk_reqsk_queue_migrate(sk, nsk); } From patchwork Tue Nov 17 09:40:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Iwashima, Kuniyuki" X-Patchwork-Id: 11911831 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, UNWANTED_LANGUAGE_BODY,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D067AC2D0E4 for ; Tue, 17 Nov 2020 09:43:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 696E8206B2 for ; Tue, 17 Nov 2020 09:43:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="jyjnfL10" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727286AbgKQJmz (ORCPT ); Tue, 17 Nov 2020 04:42:55 -0500 Received: from smtp-fw-9102.amazon.com ([207.171.184.29]:35428 "EHLO smtp-fw-9102.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727122AbgKQJmz (ORCPT ); Tue, 17 Nov 2020 04:42:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1605606174; x=1637142174; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=Yjg2cIeTqSYiEXQDPPifY0mk/C3SRGVuvOtSIHcTkFM=; b=jyjnfL105HTlvx6ebZWV4FAbDUirHMejMk0fPEDFkFu2zyQ+MxfidKHP zCaDRM/VX4URcPpfEvy8Ic1twVm13GWfNAACqlFkXL1JyBKzXNjp9VV+c ZOtI1nGULon0bTM0ftcejgRa+bRJJ1QBydiCX74JLn6G4gkbVH/w4l1Ge 8=; X-IronPort-AV: E=Sophos;i="5.77,485,1596499200"; d="scan'208";a="96088626" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-1a-af6a10df.us-east-1.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9102.sea19.amazon.com with ESMTP; 17 Nov 2020 09:42:53 +0000 Received: from EX13MTAUWB001.ant.amazon.com (iad12-ws-svc-p26-lb9-vlan3.iad.amazon.com [10.40.163.38]) by email-inbound-relay-1a-af6a10df.us-east-1.amazon.com (Postfix) with ESMTPS id 44CE1A0201; Tue, 17 Nov 2020 09:42:52 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:51 +0000 Received: from 38f9d3582de7.ant.amazon.com.com (10.43.161.237) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 17 Nov 2020 09:42:47 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , Subject: [RFC PATCH bpf-next 8/8] bpf: Test BPF_PROG_TYPE_SK_REUSEPORT for socket migration. Date: Tue, 17 Nov 2020 18:40:23 +0900 Message-ID: <20201117094023.3685-9-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201117094023.3685-1-kuniyu@amazon.co.jp> References: <20201117094023.3685-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.161.237] X-ClientProxiedBy: EX13D07UWA003.ant.amazon.com (10.43.160.35) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net X-Patchwork-State: RFC This patch adds a test for net.ipv4.tcp_migrate_req with eBPF. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- .../bpf/prog_tests/migrate_reuseport.c | 175 ++++++++++++++++++ .../bpf/progs/test_migrate_reuseport_kern.c | 53 ++++++ 2 files changed, 228 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c new file mode 100644 index 000000000000..fb182e575371 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c @@ -0,0 +1,175 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. call listen() for 5 server sockets. + * 2. update a map to migrate all child socket + * to the last server socket (map[cookie] = 4) + * 3. call connect() for 25 client sockets. + * 4. call close() first 4 server sockets. + * 5. call receive() for the last server socket. + * + * Author: Kuniyuki Iwashima + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define NUM_SOCKS 5 +#define LOCALHOST "127.0.0.1" +#define err_exit(condition, message) \ + do { \ + if (condition) { \ + perror("ERROR: " message " "); \ + setup_sysctl(0); \ + exit(1); \ + } \ + } while (0) + +__u64 server_fds[NUM_SOCKS]; +int prog_fd, map_fd, migrate_map_fd; + +void setup_sysctl(int value) +{ + FILE *file; + + file = fopen("/proc/sys/net/ipv4/tcp_migrate_req", "w"); + fprintf(file, "%d", value); + fclose(file); +} + +void setup_bpf(void) +{ + struct bpf_object *obj; + struct bpf_program *prog; + struct bpf_map *map, *migrate_map; + int err; + + obj = bpf_object__open("test_migrate_reuseport_kern.o"); + err_exit(libbpf_get_error(obj), "opening BPF object file failed"); + + err = bpf_object__load(obj); + err_exit(err, "loading BPF object failed"); + + prog = bpf_program__next(NULL, obj); + err_exit(!prog, "loading BPF program failed"); + + map = bpf_object__find_map_by_name(obj, "reuseport_map"); + err_exit(!map, "loading BPF reuseport_map failed"); + + migrate_map = bpf_object__find_map_by_name(obj, "migrate_map"); + err_exit(!map, "loading BPF migrate_map failed"); + + prog_fd = bpf_program__fd(prog); + map_fd = bpf_map__fd(map); + migrate_map_fd = bpf_map__fd(migrate_map); +} + +void test_listen(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int i, err, optval = 1, migrated_to = NUM_SOCKS - 1; + __u64 value; + + addr.sin_family = AF_INET; + addr.sin_port = htons(80); + inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SOCKS; i++) { + server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + err_exit(server_fds[i] == -1, "socket() for listener sockets failed"); + + err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT, + &optval, sizeof(optval)); + err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed"); + + if (i == 0) { + err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, + &prog_fd, sizeof(prog_fd)); + err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed"); + } + + err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len); + err_exit(err == -1, "bind() failed"); + + err = listen(server_fds[i], 32); + err_exit(err == -1, "listen() failed"); + + err = bpf_map_update_elem(map_fd, &i, &server_fds[i], BPF_NOEXIST); + err_exit(err == -1, "updating BPF reuseport_map failed"); + + err = bpf_map_lookup_elem(map_fd, &i, &value); + err_exit(err == -1, "looking up BPF reuseport_map failed"); + + printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to); + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST); + err_exit(err == -1, "updating BPF migrate_map failed"); + } +} + +void test_connect(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int i, err, client_fd; + + addr.sin_family = AF_INET; + addr.sin_port = htons(80); + inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SOCKS * 5; i++) { + client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + err_exit(client_fd == -1, "socket() for listener sockets failed"); + + err = connect(client_fd, (struct sockaddr *)&addr, addr_len); + err_exit(err == -1, "connect() failed"); + + close(client_fd); + } +} + +void test_close(void) +{ + int i; + + for (i = 0; i < NUM_SOCKS - 1; i++) + close(server_fds[i]); +} + +void test_receive(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int cnt, client_fd; + + fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK); + + for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) { + client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len); + err_exit(client_fd == -1, "accept() failed"); + } + + printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5); +} + +int main(void) +{ + setup_sysctl(1); + setup_bpf(); + test_listen(); + test_connect(); + test_close(); + test_receive(); + close(server_fds[NUM_SOCKS - 1]); + setup_sysctl(0); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c new file mode 100644 index 000000000000..79f8a3465c20 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. If data is not NULL (SYN packet), + * return SK_PASS without selecting a listener. + * 2. If data is NULL (socket migration), + * select a listener (reuseport_map[map[cookie]]) + * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#define NULL ((void *)0) + +int _version SEC("version") = 1; + +struct bpf_map_def SEC("maps") reuseport_map = { + .type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, + .key_size = sizeof(int), + .value_size = sizeof(__u64), + .max_entries = 256, +}; + +struct bpf_map_def SEC("maps") migrate_map = { + .type = BPF_MAP_TYPE_HASH, + .key_size = sizeof(__u64), + .value_size = sizeof(int), + .max_entries = 256, +}; + +SEC("sk_reuseport") +int select_by_skb_data(struct sk_reuseport_md *reuse_md) +{ + int *key, flags = 0; + void *data = reuse_md->data; + __u64 cookie = reuse_md->cookie; + + if (data) + return SK_PASS; + + key = bpf_map_lookup_elem(&migrate_map, &cookie); + if (key == NULL) + return SK_DROP; + + bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags); + + return SK_PASS; +} + +char _license[] SEC("license") = "GPL";