From patchwork Fri Aug 26 00:04:33 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 01/13] fs/lock: Revive LOCK_MAND.
Date: Thu, 25 Aug 2022 17:04:33 -0700
Message-ID: <20220826000445.46552-2-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Commit 90f7d7a0d0d6 ("locks: remove LOCK_MAND flock lock support")
removed LOCK_MAND support from the kernel because nothing checked the
flag, nor was there any use case. This patch revives LOCK_MAND to
introduce a mandatory lock for read/write on /proc/sys.

Currently, that is the only use case, so we make two changes while
reverting the commit. First, LOCK_MAND used to be allowed with any
f_mode; that behaviour is not restored. Instead, we enforce the same
FMODE_READ|FMODE_WRITE check as LOCK_SH and LOCK_EX. Second,
f_ops->flock() implementations returned -EOPNOTSUPP when called with
LOCK_MAND. The following patch does not use f_ops->flock(), so we do
the validation before calling f_ops->flock().
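For reference, a minimal userspace sketch of the intended semantics,
using the same knob the selftest in patch 03 of this series uses; error
handling is trimmed, and _GNU_SOURCE is assumed so that <fcntl.h>
exposes the LOCK_MAND/LOCK_READ/LOCK_WRITE definitions:

#define _GNU_SOURCE		/* LOCK_MAND, LOCK_READ, LOCK_WRITE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
	/* LOCK_MAND requires the file to be opened for read/write. */
	int fd = open("/proc/sys/net/ipv4/tcp_migrate_req", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Or LOCK_READ and/or LOCK_WRITE into the command to state
	 * which operations stay allowed for *other* openers; here,
	 * reads remain possible but writes are denied.
	 */
	if (flock(fd, LOCK_MAND | LOCK_READ) < 0)
		perror("flock");

	/* ... the knob is write-protected against other openers ... */

	flock(fd, LOCK_UN);
	close(fd);
	return 0;
}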
Signed-off-by: Kuniyuki Iwashima
---
 fs/locks.c                       | 57 ++++++++++++++++++++------------
 include/uapi/asm-generic/fcntl.h |  5 ---
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c266cfdc3291..03ff10a3165e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -421,6 +421,10 @@ static inline int flock_translate_cmd(int cmd) {
 	case LOCK_UN:
 		return F_UNLCK;
 	}
+
+	if (cmd & LOCK_MAND)
+		return cmd & (LOCK_MAND | LOCK_RW);
+
 	return -EINVAL;
 }
@@ -879,6 +883,10 @@ static bool flock_locks_conflict(struct file_lock *caller_fl,
 	if (caller_fl->fl_file == sys_fl->fl_file)
 		return false;
 
+	if (caller_fl->fl_type & LOCK_MAND ||
+	    sys_fl->fl_type & LOCK_MAND)
+		return true;
+
 	return locks_conflict(caller_fl, sys_fl);
 }
@@ -2077,9 +2085,7 @@ EXPORT_SYMBOL(locks_lock_inode_wait);
  *	- %LOCK_SH -- a shared lock.
  *	- %LOCK_EX -- an exclusive lock.
  *	- %LOCK_UN -- remove an existing lock.
- *	- %LOCK_MAND -- a 'mandatory' flock. (DEPRECATED)
- *
- * %LOCK_MAND support has been removed from the kernel.
+ *	- %LOCK_MAND -- a 'mandatory' flock. (only supported on /proc/sys/)
  */
 SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 {
@@ -2087,19 +2093,6 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	struct file_lock fl;
 	struct fd f;
 
-	/*
-	 * LOCK_MAND locks were broken for a long time in that they never
-	 * conflicted with one another and didn't prevent any sort of open,
-	 * read or write activity.
-	 *
-	 * Just ignore these requests now, to preserve legacy behavior, but
-	 * throw a warning to let people know that they don't actually work.
-	 */
-	if (cmd & LOCK_MAND) {
-		pr_warn_once("Attempt to set a LOCK_MAND lock via flock(2). This support has been removed and the request ignored.\n");
-		return 0;
-	}
-
 	type = flock_translate_cmd(cmd & ~LOCK_NB);
 	if (type < 0)
 		return type;
@@ -2109,6 +2102,7 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	if (!f.file)
 		return error;
 
+	/* LOCK_MAND supports only read/write on proc_sysctl for now */
 	if (type != F_UNLCK && !(f.file->f_mode & (FMODE_READ | FMODE_WRITE)))
 		goto out_putf;
@@ -2122,12 +2116,18 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
 	if (can_sleep)
 		fl.fl_flags |= FL_SLEEP;
 
-	if (f.file->f_op->flock)
+	if (f.file->f_op->flock) {
+		if (cmd & LOCK_MAND) {
+			error = -EOPNOTSUPP;
+			goto out_putf;
+		}
+
 		error = f.file->f_op->flock(f.file,
 					    (can_sleep) ? F_SETLKW : F_SETLK,
 					    &fl);
-	else
+	} else {
 		error = locks_lock_file_wait(f.file, &fl);
+	}
 
  out_putf:
 	fdput(f);
@@ -2711,7 +2711,11 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 		seq_printf(f, " %s ",
 			   (inode == NULL) ? "*NOINODE*" : "ADVISORY ");
 	} else if (IS_FLOCK(fl)) {
-		seq_puts(f, "FLOCK  ADVISORY  ");
+		if (fl->fl_type & LOCK_MAND) {
+			seq_puts(f, "FLOCK  MANDATORY ");
+		} else {
+			seq_puts(f, "FLOCK  ADVISORY  ");
+		}
 	} else if (IS_LEASE(fl)) {
 		if (fl->fl_flags & FL_DELEG)
 			seq_puts(f, "DELEG  ");
@@ -2727,10 +2731,19 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
 	} else {
 		seq_puts(f, "UNKNOWN UNKNOWN  ");
 	}
-	type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
 
-	seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
-		   (type == F_RDLCK) ? "READ" : "UNLCK");
+	if (fl->fl_type & LOCK_MAND) {
+		seq_printf(f, "%s ",
+			   (fl->fl_type & LOCK_READ)
+			   ? (fl->fl_type & LOCK_WRITE) ? "RW   " : "READ "
+			   : (fl->fl_type & LOCK_WRITE) ? "WRITE" : "NONE ");
+	} else {
+		type = IS_LEASE(fl) ? target_leasetype(fl) : fl->fl_type;
+
+		seq_printf(f, "%s ", (type == F_WRLCK) ? "WRITE" :
+			   (type == F_RDLCK) ? "READ" : "UNLCK");
+	}
+
 	if (inode) {
 		/* userspace relies on this representation of dev_t */
 		seq_printf(f, "%d %02x:%02x:%lu ", fl_pid,
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 1ecdb911add8..94fb8c6fd543 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -180,11 +180,6 @@ struct f_owner_ex {
 #define LOCK_NB		4	/* or'd with one of the above to prevent blocking */
 #define LOCK_UN		8	/* remove lock */
 
-
-/*
- * LOCK_MAND support has been removed from the kernel. We leave the symbols
- * here to not break legacy builds, but these should not be used in new code.
- */
 #define LOCK_MAND	32	/* This is a mandatory flock ... */
 #define LOCK_READ	64	/* which allows concurrent read operations */
 #define LOCK_WRITE	128	/* which allows concurrent write operations */

From patchwork Fri Aug 26 00:04:34 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 02/13] sysctl: Support LOCK_MAND for read/write.
Date: Thu, 25 Aug 2022 17:04:34 -0700
Message-ID: <20220826000445.46552-3-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

The preceding patch added LOCK_MAND support to flock(), and this patch
adds read/write protection for sysctl knobs. Read and write operations
return -EPERM if the file is mandatory-locked by another opener.

The following patches introduce sysctl knobs which are read during
clone() or unshare() to control a per-netns hash table size for
TCP/UDP. In such a case, we can use write protection to guarantee the
hash table's size for the child netns. The difference from
BPF_PROG_TYPE_CGROUP_SYSCTL is that the BPF program requires processes
to be in the same cgroup to allow/deny read/write to sysctl knobs.

Note that the read protection may be of little use for sysctl knobs
whose value can be inferred another way. For example, we can learn
fs.nr_open by opening too many files and checking the error, and
net.ipv4.tcp_syn_retries by dropping SYNs and dumping the retransmitted
packets.
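A sketch of the resulting behaviour from userspace, under the
assumption that one descriptor already holds LOCK_MAND | LOCK_READ on
the knob (same placeholder knob as before; _GNU_SOURCE assumed):

#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#define KNOB "/proc/sys/net/ipv4/tcp_migrate_req"

int main(void)
{
	char buf[8];
	int fd1 = open(KNOB, O_RDWR);	/* lock holder */
	int fd2 = open(KNOB, O_RDWR);	/* independent opener */

	assert(fd1 >= 0 && fd2 >= 0);
	assert(flock(fd1, LOCK_MAND | LOCK_READ) == 0);

	/* LOCK_READ keeps reads from other files allowed... */
	assert(read(fd2, buf, sizeof(buf)) > 0);

	/* ...but without LOCK_WRITE, their writes fail with EPERM. */
	assert(write(fd2, "0\n", 2) == -1 && errno == EPERM);

	/* The holder's own struct file is exempt from the checks. */
	assert(write(fd1, "0\n", 2) == 2);

	flock(fd1, LOCK_UN);
	close(fd1);
	close(fd2);
	return 0;
}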
Signed-off-by: Kuniyuki Iwashima
---
 fs/locks.c            | 26 ++++++++++++++++++++++++++
 fs/proc/proc_sysctl.c | 25 ++++++++++++++++++++++++-
 include/linux/fs.h    |  1 +
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/fs/locks.c b/fs/locks.c
index 03ff10a3165e..c858c6c61920 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -890,6 +890,32 @@ static bool flock_locks_conflict(struct file_lock *caller_fl,
 	return locks_conflict(caller_fl, sys_fl);
 }
 
+int flock_mandatory_locked(struct file *filp)
+{
+	struct file_lock_context *ctx;
+	struct file_lock *fl;
+	int flags = 0;
+
+	ctx = smp_load_acquire(&file_inode(filp)->i_flctx);
+	if (!ctx)
+		goto out;
+
+	spin_lock(&ctx->flc_lock);
+	list_for_each_entry(fl, &ctx->flc_flock, fl_list) {
+		if (!(fl->fl_type & LOCK_MAND))
+			continue;
+
+		if (fl->fl_file != filp)
+			flags = fl->fl_type & (LOCK_MAND | LOCK_RW);
+
+		break;
+	}
+	spin_unlock(&ctx->flc_lock);
+out:
+	return flags;
+}
+EXPORT_SYMBOL(flock_mandatory_locked);
+
 void posix_test_lock(struct file *filp, struct file_lock *fl)
 {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 021e83fe831f..ce2755670970 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -561,10 +561,30 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
+static bool proc_mandatory_locked(struct file *filp, int write)
+{
+	int flags = flock_mandatory_locked(filp);
+
+	if (flags & LOCK_MAND) {
+		if (write) {
+			if (flags & LOCK_WRITE)
+				return false;
+		} else {
+			if (flags & LOCK_READ)
+				return false;
+		}
+
+		return true;
+	}
+
+	return false;
+}
+
 static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 				     int write)
 {
-	struct inode *inode = file_inode(iocb->ki_filp);
+	struct file *filp = iocb->ki_filp;
+	struct inode *inode = file_inode(filp);
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
 	size_t count = iov_iter_count(iter);
@@ -582,6 +602,9 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 	if (sysctl_perm(head, table, write ? MAY_WRITE : MAY_READ))
 		goto out;
 
+	if (proc_mandatory_locked(filp, write))
+		goto out;
+
 	/* if that can happen at all, it should be -EINVAL, not -EISDIR */
 	error = -EINVAL;
 	if (!table->proc_handler)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9eced4cc286e..5d1d4b10a868 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1164,6 +1164,7 @@ extern void locks_copy_conflock(struct file_lock *, struct file_lock *);
 extern void locks_remove_posix(struct file *, fl_owner_t);
 extern void locks_remove_file(struct file *);
 extern void locks_release_private(struct file_lock *);
+int flock_mandatory_locked(struct file *filp);
 extern void posix_test_lock(struct file *, struct file_lock *);
 extern int posix_lock_file(struct file *, struct file_lock *, struct file_lock *);
 extern int locks_delete_block(struct file_lock *);

From patchwork Fri Aug 26 00:04:35 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 03/13] selftest: sysctl: Add test for flock(LOCK_MAND).
Date: Thu, 25 Aug 2022 17:04:35 -0700
Message-ID: <20220826000445.46552-4-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This patch adds test cases for LOCK_MAND:

  - LOCK_MAND with/without LOCK_READ/LOCK_WRITE
  - read/write
  - same fd
  - different fd
  - same PID
  - different PID
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima , , Subject: [PATCH v1 net-next 03/13] selftest: sysctl: Add test for flock(LOCK_MAND). Date: Thu, 25 Aug 2022 17:04:35 -0700 Message-ID: <20220826000445.46552-4-kuniyu@amazon.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> MIME-Version: 1.0 X-Originating-IP: [10.43.162.140] X-ClientProxiedBy: EX13D10UWB001.ant.amazon.com (10.43.161.111) To EX19D004ANA001.ant.amazon.com (10.37.240.138) Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch adds test cases for LOCK_MAND: - LOCK_MAND with/without LOCK_READ/LOCK_WRITE - read/write - same fd - different fd - same PID - different PID Signed-off-by: Kuniyuki Iwashima --- tools/testing/selftests/sysctl/.gitignore | 2 + tools/testing/selftests/sysctl/Makefile | 9 +- tools/testing/selftests/sysctl/sysctl_flock.c | 157 ++++++++++++++++++ 3 files changed, 163 insertions(+), 5 deletions(-) create mode 100644 tools/testing/selftests/sysctl/.gitignore create mode 100644 tools/testing/selftests/sysctl/sysctl_flock.c diff --git a/tools/testing/selftests/sysctl/.gitignore b/tools/testing/selftests/sysctl/.gitignore new file mode 100644 index 000000000000..a3382ba798a6 --- /dev/null +++ b/tools/testing/selftests/sysctl/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/sysctl_flock diff --git a/tools/testing/selftests/sysctl/Makefile b/tools/testing/selftests/sysctl/Makefile index 110301f9f5be..eb565b6c8340 100644 --- a/tools/testing/selftests/sysctl/Makefile +++ b/tools/testing/selftests/sysctl/Makefile @@ -2,12 +2,11 @@ # Makefile for sysctl selftests. # Expects kernel.sysctl_writes_strict=1. -# No binaries, but make sure arg-less "make" doesn't trigger "run_tests". -all: +CFLAGS = -Wall -Wl,--no-as-needed -O2 -g +CFLAGS += -D_GNU_SOURCE TEST_PROGS := sysctl.sh -include ../lib.mk +TEST_GEN_PROGS += sysctl_flock -# Nothing to clean up. -clean: +include ../lib.mk diff --git a/tools/testing/selftests/sysctl/sysctl_flock.c b/tools/testing/selftests/sysctl/sysctl_flock.c new file mode 100644 index 000000000000..4c4be10ae0d3 --- /dev/null +++ b/tools/testing/selftests/sysctl/sysctl_flock.c @@ -0,0 +1,157 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright Amazon.com Inc. or its affiliates. 
*/ + +#include +#include +#include + +#include "../kselftest_harness.h" + +#define SYSCTL_PATH "/proc/sys/net/ipv4/tcp_migrate_req" +#define SYSCTL_BUFLEN 8 + +FIXTURE(proc_sysctl_flock) +{ + int fd; + char buf[SYSCTL_BUFLEN]; + int len; +}; + +FIXTURE_VARIANT(proc_sysctl_flock) +{ + int cmd; +}; + +FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand) +{ + .cmd = LOCK_MAND, +}; + +FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_read) +{ + .cmd = LOCK_MAND | LOCK_READ, +}; + +FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_write) +{ + .cmd = LOCK_MAND | LOCK_WRITE, +}; + +FIXTURE_VARIANT_ADD(proc_sysctl_flock, lock_mand_read_write) +{ + .cmd = LOCK_MAND | LOCK_READ | LOCK_WRITE, +}; + +int proc_sysctl_open(struct __test_metadata *_metadata) +{ + int fd; + + fd = open(SYSCTL_PATH, O_RDWR); + ASSERT_NE(-1, fd); + + return fd; +} + +FIXTURE_SETUP(proc_sysctl_flock) +{ + self->fd = proc_sysctl_open(_metadata); + ASSERT_EQ(0, flock(self->fd, variant->cmd)); + + self->len = read(self->fd, self->buf, sizeof(self->buf)); + ASSERT_NE(-1, self->len); + + ASSERT_EQ(self->len, write(self->fd, self->buf, self->len)); +} + +FIXTURE_TEARDOWN(proc_sysctl_flock) +{ + flock(self->fd, LOCK_UN); + close(self->fd); +} + +int is_readable(int cmd) +{ + return cmd & LOCK_READ; +} + +int is_writable(int cmd) +{ + return cmd & LOCK_WRITE; +} + +void proc_sysctl_newfd_read(struct __test_metadata *_metadata, + FIXTURE_DATA(proc_sysctl_flock) *self, + const FIXTURE_VARIANT(proc_sysctl_flock) *variant) +{ + char buf[SYSCTL_BUFLEN]; + int err, fd; + + fd = proc_sysctl_open(_metadata); + + err = read(fd, buf, SYSCTL_BUFLEN); + if (is_readable(variant->cmd)) { + ASSERT_EQ(self->len, err); + } else { + ASSERT_EQ(-1, err); + } + + close(fd); +} + +void proc_sysctl_newfd_write(struct __test_metadata *_metadata, + FIXTURE_DATA(proc_sysctl_flock) *self, + const FIXTURE_VARIANT(proc_sysctl_flock) *variant) +{ + int err, fd; + + fd = proc_sysctl_open(_metadata); + + err = write(fd, self->buf, self->len); + if (is_writable(variant->cmd)) { + ASSERT_EQ(self->len, err); + } else { + ASSERT_EQ(-1, err); + } + + close(fd); +} + +void proc_sysctl_fork(struct __test_metadata *_metadata, + FIXTURE_DATA(proc_sysctl_flock) *self, + const FIXTURE_VARIANT(proc_sysctl_flock) *variant, + int read) +{ + int pid, status; + + pid = fork(); + if (pid == 0) { + if (read) + return proc_sysctl_newfd_read(_metadata, self, variant); + else + return proc_sysctl_newfd_write(_metadata, self, variant); + } + + waitpid(pid, &status, 0); +} + +TEST_F(proc_sysctl_flock, test_newfd_read) +{ + proc_sysctl_newfd_read(_metadata, self, variant); +} + +TEST_F(proc_sysctl_flock, test_newfd_write) +{ + proc_sysctl_newfd_write(_metadata, self, variant); +} + +TEST_F(proc_sysctl_flock, test_fork_newfd_read) +{ + proc_sysctl_fork(_metadata, self, variant, 1); +} + +TEST_F(proc_sysctl_flock, test_fork_newfd_write) +{ + proc_sysctl_fork(_metadata, self, variant, 0); +} + +TEST_HARNESS_MAIN From patchwork Fri Aug 26 00:04:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955386 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CFABDECAAD1 for ; Fri, 26 Aug 2022 00:06:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S244176AbiHZAGt (ORCPT ); Thu, 25 
Aug 2022 20:06:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34354 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S244254AbiHZAGj (ORCPT ); Thu, 25 Aug 2022 20:06:39 -0400 Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DD454C7F83; Thu, 25 Aug 2022 17:06:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1661472399; x=1693008399; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AMeidybKzTy6YY1kywsY9OZrxS2aZyZZxaWs+l8aFcM=; b=lgkJDvyf0jxpscjWHBkfy4CoIaohtSuz03OtuhB7EU9BtFHUNbHvILek ta4Xw/LRU5S5xQyi1PzYQOZ5YvciKR2s/CC+WpGUZRuxwcuay9hiSHiSt /HHWwmDtCfa6gIMPmai5JYpbsUqrkNbFYIo9LNQCBZaiDMjTe3K6I120h Q=; X-IronPort-AV: E=Sophos;i="5.93,264,1654560000"; d="scan'208";a="253012392" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-pdx-1box-2b-3386f33d.us-west-2.amazon.com) ([10.25.36.210]) by smtp-border-fw-9102.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Aug 2022 00:06:39 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-pdx-1box-2b-3386f33d.us-west-2.amazon.com (Postfix) with ESMTPS id 3279398F7A; Fri, 26 Aug 2022 00:06:38 +0000 (UTC) Received: from EX19D004ANA001.ant.amazon.com (10.37.240.138) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.38; Fri, 26 Aug 2022 00:06:37 +0000 Received: from 88665a182662.ant.amazon.com (10.43.162.140) by EX19D004ANA001.ant.amazon.com (10.37.240.138) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1118.12; Fri, 26 Aug 2022 00:06:34 +0000 From: Kuniyuki Iwashima To: "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima , , Subject: [PATCH v1 net-next 04/13] net: Introduce init2() for pernet_operations. Date: Thu, 25 Aug 2022 17:04:36 -0700 Message-ID: <20220826000445.46552-5-kuniyu@amazon.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> MIME-Version: 1.0 X-Originating-IP: [10.43.162.140] X-ClientProxiedBy: EX13D10UWB001.ant.amazon.com (10.43.161.111) To EX19D004ANA001.ant.amazon.com (10.37.240.138) Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch adds a new init function for pernet_operations, init2(). We call each init2() during clone() or unshare() only, where we can access the parent netns for a child netns creation. Signed-off-by: Kuniyuki Iwashima --- include/net/net_namespace.h | 3 +++ net/core/net_namespace.c | 18 +++++++++++------- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 8c3587d5c308..3ca426649756 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -410,6 +410,8 @@ struct pernet_operations { * from register_pernet_subsys(), unregister_pernet_subsys() * register_pernet_device() and unregister_pernet_device(). * + * init2() is called during clone() or unshare() only. + * * Exit methods using blocking RCU primitives, such as * synchronize_rcu(), should be implemented via exit_batch. 
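To illustrate the new hook, a minimal sketch of a pernet_operations
user that inherits a value from its parent netns; the ops and the
inherited field below are a hypothetical example, not part of this
series (kernel context assumed):

/* Hypothetical example: init() still provides defaults at boot and
 * registration time, while init2() additionally runs for a netns
 * created via clone(CLONE_NEWNET)/unshare(CLONE_NEWNET), where the
 * parent netns (old_net) is available to inherit from.
 */
static int example_net_init(struct net *net)
{
	net->ipv4.sysctl_tcp_syn_retries = TCP_SYN_RETRIES;
	return 0;
}

static int example_net_init2(struct net *net, struct net *old_net)
{
	net->ipv4.sysctl_tcp_syn_retries =
		old_net->ipv4.sysctl_tcp_syn_retries;
	return 0;
}

static struct pernet_operations example_net_ops = {
	.init	= example_net_init,
	.init2	= example_net_init2,
};

Such an ops would be registered as usual, e.g. with
register_pernet_subsys(&example_net_ops); init2() is simply skipped
whenever there is no parent netns.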
Signed-off-by: Kuniyuki Iwashima
---
 include/net/net_namespace.h |  3 +++
 net/core/net_namespace.c    | 18 +++++++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 8c3587d5c308..3ca426649756 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -410,6 +410,8 @@ struct pernet_operations {
  * from register_pernet_subsys(), unregister_pernet_subsys()
  * register_pernet_device() and unregister_pernet_device().
  *
+ * init2() is called during clone() or unshare() only.
+ *
 * Exit methods using blocking RCU primitives, such as
 * synchronize_rcu(), should be implemented via exit_batch.
 * Then, destruction of a group of net requires single
@@ -422,6 +424,7 @@ struct pernet_operations {
 * the calls.
 */
 	int (*init)(struct net *net);
+	int (*init2)(struct net *net, struct net *old_net);
 	void (*pre_exit)(struct net *net);
 	void (*exit)(struct net *net);
 	void (*exit_batch)(struct list_head *net_exit_list);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 6b9f19122ec1..b120ff97d9f5 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -116,7 +116,8 @@ static int net_assign_generic(struct net *net, unsigned int id, void *data)
 	return 0;
 }
 
-static int ops_init(const struct pernet_operations *ops, struct net *net)
+static int ops_init(const struct pernet_operations *ops,
+		    struct net *net, struct net *old_net)
 {
 	int err = -ENOMEM;
 	void *data = NULL;
@@ -133,6 +134,8 @@ static int ops_init(const struct pernet_operations *ops, struct net *net)
 	err = 0;
 	if (ops->init)
 		err = ops->init(net);
+	if (!err && ops->init2 && old_net)
+		err = ops->init2(net, old_net);
 	if (!err)
 		return 0;
 
@@ -301,7 +304,8 @@ EXPORT_SYMBOL_GPL(get_net_ns_by_id);
 /*
  * setup_net runs the initializers for the network namespace object.
  */
-static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
+static __net_init int setup_net(struct net *net, struct net *old_net,
+				struct user_namespace *user_ns)
 {
 	/* Must be called with pernet_ops_rwsem held */
 	const struct pernet_operations *ops, *saved_ops;
@@ -323,7 +327,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	mutex_init(&net->ipv4.ra_mutex);
 
 	list_for_each_entry(ops, &pernet_list, list) {
-		error = ops_init(ops, net);
+		error = ops_init(ops, net, old_net);
 		if (error < 0)
 			goto out_undo;
 	}
@@ -469,7 +473,7 @@ struct net *copy_net_ns(unsigned long flags,
 	if (rv < 0)
 		goto put_userns;
 
-	rv = setup_net(net, user_ns);
+	rv = setup_net(net, old_net, user_ns);
 
 	up_read(&pernet_ops_rwsem);
 
@@ -1107,7 +1111,7 @@ void __init net_ns_init(void)
 	init_net.key_domain = &init_net_key_domain;
 #endif
 	down_write(&pernet_ops_rwsem);
-	if (setup_net(&init_net, &init_user_ns))
+	if (setup_net(&init_net, NULL, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
 	init_net_initialized = true;
@@ -1148,7 +1152,7 @@ static int __register_pernet_operations(struct list_head *list,
 		memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net));
 		old = set_active_memcg(memcg);
-		error = ops_init(ops, net);
+		error = ops_init(ops, net, NULL);
 		set_active_memcg(old);
 		mem_cgroup_put(memcg);
 		if (error)
@@ -1188,7 +1192,7 @@ static int __register_pernet_operations(struct list_head *list,
 		return 0;
 	}
 
-	return ops_init(ops, &init_net);
+	return ops_init(ops, &init_net, NULL);
 }
 
 static void __unregister_pernet_operations(struct pernet_operations *ops)

From patchwork Fri Aug 26 00:04:37 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 05/13] tcp: Clean up some functions.
Date: Thu, 25 Aug 2022 17:04:37 -0700
Message-ID: <20220826000445.46552-6-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This patch adds no functional change; it cleans up some functions that
the following patches touch, so that they are tidy and easy to
review/revert.
The changes are:

  - Keep reverse Christmas tree order
  - Remove the unnecessary initialisation of port in
    inet_csk_find_open_port()
  - Use req_to_sk() once in reqsk_queue_unlink()

Signed-off-by: Kuniyuki Iwashima
---
 net/ipv4/inet_connection_sock.c | 21 ++++++++++-----------
 net/ipv4/inet_hashtables.c      | 29 +++++++++++++++--------------
 net/ipv4/tcp_ipv4.c             |  4 ++--
 3 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index f0038043b661..8e71d65cfad4 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -286,15 +286,13 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int port = 0;
+	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
-	bool relax = false;
-	int i, low, high, attempt_half;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
-	int l3mdev;
+	bool relax = false;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 ports_exhausted:
@@ -471,15 +469,14 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, port = snum;
-	struct net *net = sock_net(sk);
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
 	struct inet_bind2_bucket *tb2 = NULL;
 	struct inet_bind_bucket *tb = NULL;
 	bool head2_lock_acquired = false;
-	int l3mdev;
+	int ret = 1, port = snum, l3mdev;
+	struct net *net = sock_net(sk);
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
@@ -909,14 +906,16 @@ static void reqsk_migrate_reset(struct request_sock *req)
 /* return true if req was found in the ehash table */
 static bool reqsk_queue_unlink(struct request_sock *req)
 {
-	struct inet_hashinfo *hashinfo = req_to_sk(req)->sk_prot->h.hashinfo;
+	struct sock *sk = req_to_sk(req);
 	bool found = false;
 
-	if (sk_hashed(req_to_sk(req))) {
-		spinlock_t *lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
+	if (sk_hashed(sk)) {
+		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+		spinlock_t *lock;
 
+		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
 		spin_lock(lock);
-		found = __sk_nulls_del_node_init_rcu(req_to_sk(req));
+		found = __sk_nulls_del_node_init_rcu(sk);
 		spin_unlock(lock);
 	}
 	if (timer_pending(&req->rsk_timer) && del_timer_sync(&req->rsk_timer))
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 60d77e234a68..29dce78de179 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -169,13 +169,14 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 static void __inet_put_port(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
-			hashinfo->bhash_size);
-	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(hashinfo, sk, sock_net(sk),
-				      inet_sk(sk)->inet_num);
+	struct inet_bind_hashbucket *head, *head2;
+	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
+	int bhash;
+
+	bhash = inet_bhashfn(net, inet_sk(sk)->inet_num, hashinfo->bhash_size);
+	head = &hashinfo->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(hashinfo, sk, net, inet_sk(sk)->inet_num);
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -209,17 +210,17 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
 	unsigned short port = inet_sk(child)->inet_num;
-	const int bhash = inet_bhashfn(sock_net(sk), port,
-			table->bhash_size);
-	struct inet_bind_hashbucket *head = &table->bhash[bhash];
-	struct inet_bind_hashbucket *head2 =
-		inet_bhashfn_portaddr(table, child, sock_net(sk), port);
+	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
-	bool update_fastreuse = false;
 	struct net *net = sock_net(sk);
+	bool update_fastreuse = false;
 	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
-	int l3mdev;
+	int bhash, l3mdev;
+
+	bhash = inet_bhashfn(net, port, table->bhash_size);
+	head = &table->bhash[bhash];
+	head2 = inet_bhashfn_portaddr(table, child, net, port);
 
 	spin_lock(&head->lock);
 	spin_lock(&head2->lock);
@@ -629,8 +630,8 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
-	struct hlist_nulls_head *list;
 	struct inet_ehash_bucket *head;
+	struct hlist_nulls_head *list;
 	spinlock_t *lock;
 	bool ret = true;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cc2ad67f75be..61a9bf661814 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2406,9 +2406,9 @@ static void *established_get_first(struct seq_file *seq)
 
 static void *established_get_next(struct seq_file *seq, void *cur)
 {
-	struct sock *sk = cur;
-	struct hlist_nulls_node *node;
 	struct tcp_iter_state *st = seq->private;
+	struct hlist_nulls_node *node;
+	struct sock *sk = cur;
 
 	++st->num;
 	++st->offset;

From patchwork Fri Aug 26 00:04:38 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 06/13] tcp: Set NULL to sk->sk_prot->h.hashinfo.
Date: Thu, 25 Aug 2022 17:04:38 -0700
Message-ID: <20220826000445.46552-7-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

We will soon introduce an optional per-netns ehash. This means we
cannot use the global sk->sk_prot->h.hashinfo to fetch a TCP hashinfo.
Instead, set sk->sk_prot->h.hashinfo to NULL for TCP and get the proper
hashinfo from net->ipv4.tcp_death_row->hashinfo.

Note that we need not consult sk->sk_prot->h.hashinfo at all if DCCP
is disabled.
Signed-off-by: Kuniyuki Iwashima
---
 include/net/inet_hashtables.h   | 10 ++++++++++
 net/ipv4/af_inet.c              |  2 +-
 net/ipv4/inet_connection_sock.c |  6 +++---
 net/ipv4/inet_hashtables.c      | 14 +++++++-------
 net/ipv4/tcp_ipv4.c             |  2 +-
 net/ipv6/tcp_ipv6.c             |  2 +-
 6 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 44a419b9e3d5..2c866112433e 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -170,6 +170,16 @@ struct inet_hashinfo {
 	struct inet_listen_hashbucket	*lhash2;
 };
 
+static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IP_DCCP)
+	return sk->sk_prot->h.hashinfo ? :
+		sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#else
+	return sock_net(sk)->ipv4.tcp_death_row->hashinfo;
+#endif
+}
+
 static inline struct inet_listen_hashbucket *
 inet_lhash2_bucket(struct inet_hashinfo *h, u32 hash)
 {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d3ab1ae32ef5..6d81d98a34ca 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1250,7 +1250,7 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 	}
 
 	prev_addr_hashbucket =
-		inet_bhashfn_portaddr(sk->sk_prot->h.hashinfo, sk,
+		inet_bhashfn_portaddr(inet_get_hashinfo(sk), sk,
 				      sock_net(sk), inet->inet_num);
 
 	inet->inet_saddr = inet->inet_rcv_saddr = new_saddr;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8e71d65cfad4..9e3613da3e57 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -285,7 +285,7 @@ inet_csk_find_open_port(const struct sock *sk, struct inet_bind_bucket **tb_ret,
 		struct inet_bind2_bucket **tb2_ret,
 		struct inet_bind_hashbucket **head2_ret, int *port_ret)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	int i, low, high, attempt_half, port, l3mdev;
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
@@ -468,7 +468,7 @@ void inet_csk_update_fastreuse(struct inet_bind_bucket *tb,
 int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	bool found_port = false, check_bind_conflict = true;
 	bool bhash_created = false, bhash2_created = false;
 	struct inet_bind_hashbucket *head, *head2;
@@ -910,7 +910,7 @@ static bool reqsk_queue_unlink(struct request_sock *req)
 	bool found = false;
 
 	if (sk_hashed(sk)) {
-		struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+		struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 		spinlock_t *lock;
 
 		lock = inet_ehash_lockp(hashinfo, req->rsk_hash);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 29dce78de179..23bfff989be6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -168,7 +168,7 @@ void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
  */
 static void __inet_put_port(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_bind_hashbucket *head, *head2;
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb;
@@ -208,7 +208,7 @@ EXPORT_SYMBOL(inet_put_port);
 
 int __inet_inherit_port(const struct sock *sk, struct sock *child)
 {
-	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *table = inet_get_hashinfo(sk);
 	unsigned short port = inet_sk(child)->inet_num;
 	struct inet_bind_hashbucket *head, *head2;
 	bool created_inet_bind_bucket = false;
@@ -629,7 +629,7 @@ static bool inet_ehash_lookup_by_sk(struct sock *sk,
 */
 bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_ehash_bucket *head;
 	struct hlist_nulls_head *list;
 	spinlock_t *lock;
@@ -701,7 +701,7 @@ static int inet_reuseport_add_sock(struct sock *sk,
 
 int __inet_hash(struct sock *sk, struct sock *osk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 	struct inet_listen_hashbucket *ilb2;
 	int err = 0;
 
@@ -747,7 +747,7 @@ EXPORT_SYMBOL_GPL(inet_hash);
 
 void inet_unhash(struct sock *sk)
 {
-	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hashinfo = inet_get_hashinfo(sk);
 
 	if (sk_unhashed(sk))
 		return;
@@ -834,7 +834,7 @@ inet_bind2_bucket_find(const struct inet_bind_hashbucket *head, const struct net
 struct inet_bind_hashbucket *
 inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, int port)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	u32 hash;
 #if IS_ENABLED(CONFIG_IPV6)
 	struct in6_addr addr_any = {};
@@ -850,7 +850,7 @@ inet_bhash2_addr_any_hashbucket(const struct sock *sk, const struct net *net, in
 int inet_bhash2_update_saddr(struct inet_bind_hashbucket *prev_saddr, struct sock *sk)
 {
-	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
+	struct inet_hashinfo *hinfo = inet_get_hashinfo(sk);
 	struct inet_bind2_bucket *tb2, *new_tb2;
 	int l3mdev = inet_sk_bound_l3mdev(sk);
 	struct inet_bind_hashbucket *head2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 61a9bf661814..7c3b3ce85a5e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3083,7 +3083,7 @@ struct proto tcp_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp_timewait_sock_ops,
 	.rsk_prot		= &tcp_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ff5c4fc135fc..791f24da9212 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2193,7 +2193,7 @@ struct proto tcpv6_prot = {
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
 	.twsk_prot		= &tcp6_timewait_sock_ops,
 	.rsk_prot		= &tcp6_request_sock_ops,
-	.h.hashinfo		= &tcp_hashinfo,
+	.h.hashinfo		= NULL,
 	.no_autobind		= true,
 	.diag_destroy		= tcp_abort,
 };

From patchwork Fri Aug 26 00:04:39 2022
From: Kuniyuki Iwashima
To: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
    Jeff Layton, Chuck Lever, Luis Chamberlain, Kees Cook, Iurii Zaikin
Subject: [PATCH v1 net-next 07/13] tcp: Access &tcp_hashinfo via net.
Date: Thu, 25 Aug 2022 17:04:39 -0700
Message-ID: <20220826000445.46552-8-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

We will soon introduce an optional per-netns ehash, so we can no longer
use tcp_hashinfo directly in most places. Instead, access it via
net->ipv4.tcp_death_row->hashinfo. Direct access remains valid only
while initialising tcp_hashinfo itself and while creating/destroying
each netns.
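The conversion this patch repeats at each call site follows one
pattern; a condensed sketch (kernel context; the wrapper function name
is illustrative only, the lookup helper and fields are existing kernel
APIs):

/* Illustrative only: resolve the hashinfo from the netns instead of
 * referencing the global tcp_hashinfo directly.
 */
static struct sock *example_v4_lookup(struct net *net, const struct iphdr *iph,
				      const struct tcphdr *th, int dif)
{
	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;

	/* was: __inet_lookup_established(net, &tcp_hashinfo, ...) */
	return __inet_lookup_established(net, hinfo, iph->saddr, th->source,
					 iph->daddr, ntohs(th->dest), dif, 0);
}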
Signed-off-by: Kuniyuki Iwashima
---
 .../chelsio/inline_crypto/chtls/chtls_cm.c    |  5 +-
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |  5 +-
 .../net/ethernet/netronome/nfp/crypto/tls.c   |  5 +-
 net/core/filter.c                             |  5 +-
 net/ipv4/esp4.c                               |  3 +-
 net/ipv4/inet_hashtables.c                    |  2 +-
 net/ipv4/netfilter/nf_socket_ipv4.c           |  2 +-
 net/ipv4/netfilter/nf_tproxy_ipv4.c           | 17 +++--
 net/ipv4/tcp_diag.c                           | 18 +++++-
 net/ipv4/tcp_ipv4.c                           | 63 +++++++++++--------
 net/ipv4/tcp_minisocks.c                      |  2 +-
 net/ipv6/esp6.c                               |  3 +-
 net/ipv6/inet6_hashtables.c                   |  4 +-
 net/ipv6/netfilter/nf_socket_ipv6.c           |  2 +-
 net/ipv6/netfilter/nf_tproxy_ipv6.c           |  5 +-
 net/ipv6/tcp_ipv6.c                           | 16 ++---
 net/mptcp/mptcp_diag.c                        |  7 ++-
 17 files changed, 98 insertions(+), 66 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
index ddfe9208529a..f90bfba4b303 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c
@@ -1069,8 +1069,7 @@ static void chtls_pass_accept_rpl(struct sk_buff *skb,
 	cxgb4_l2t_send(csk->egress_dev, skb, csk->l2t_entry);
 }
 
-static void inet_inherit_port(struct inet_hashinfo *hash_info,
-			      struct sock *lsk, struct sock *newsk)
+static void inet_inherit_port(struct sock *lsk, struct sock *newsk)
 {
 	local_bh_disable();
 	__inet_inherit_port(lsk, newsk);
@@ -1240,7 +1239,7 @@ static struct sock *chtls_recv_sock(struct sock *lsk,
 					   ipv4.sysctl_tcp_window_scaling),
 					   tp->window_clamp);
 	neigh_release(n);
-	inet_inherit_port(&tcp_hashinfo, lsk, newsk);
+	inet_inherit_port(lsk, newsk);
 	csk_set_flag(csk, CSK_CONN_INLINE);
 	bh_unlock_sock(newsk); /* tcp_create_openreq_child ->sk_clone_lock */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
index 13145ecaf839..9ae36772cc15 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ktls_rx.c
@@ -461,6 +461,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 {
 	struct ethhdr *eth = (struct ethhdr *)(skb->data);
 	struct net_device *netdev = rq->netdev;
+	struct net *net = dev_net(netdev);
 	struct sock *sk = NULL;
 	unsigned int datalen;
 	struct iphdr *iph;
@@ -475,7 +476,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 		depth += sizeof(struct iphdr);
 		th = (void *)iph + sizeof(struct iphdr);
 
-		sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					     iph->saddr, th->source, iph->daddr,
 					     th->dest, netdev->ifindex);
 #if IS_ENABLED(CONFIG_IPV6)
@@ -485,7 +486,7 @@ static void resync_update_sn(struct mlx5e_rq *rq, struct sk_buff *skb)
 		depth += sizeof(struct ipv6hdr);
 		th = (void *)ipv6h + sizeof(struct ipv6hdr);
 
-		sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 						&ipv6h->saddr, th->source,
 						&ipv6h->daddr, ntohs(th->dest),
 						netdev->ifindex, 0);
diff --git a/drivers/net/ethernet/netronome/nfp/crypto/tls.c b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
index 78368e71ce83..7948c04a32dc 100644
--- a/drivers/net/ethernet/netronome/nfp/crypto/tls.c
+++ b/drivers/net/ethernet/netronome/nfp/crypto/tls.c
@@ -474,6 +474,7 @@ int nfp_net_tls_rx_resync_req(struct net_device *netdev,
 {
 	struct nfp_net *nn = netdev_priv(netdev);
 	struct nfp_net_tls_offload_ctx *ntls;
+	struct net *net = dev_net(netdev);
 	struct ipv6hdr *ipv6h;
 	struct tcphdr *th;
 	struct iphdr *iph;
@@ -494,13 +495,13 @@ int nfp_net_tls_rx_resync_req(struct net_device *netdev,
 	switch (ipv6h->version) {
 	case 4:
-		sk = inet_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 					     iph->saddr, th->source, iph->daddr,
 					     th->dest, netdev->ifindex);
 		break;
 #if IS_ENABLED(CONFIG_IPV6)
 	case 6:
-		sk = __inet6_lookup_established(dev_net(netdev), &tcp_hashinfo,
+		sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo,
 						&ipv6h->saddr, th->source,
 						&ipv6h->daddr, ntohs(th->dest),
 						netdev->ifindex, 0);
diff --git a/net/core/filter.c b/net/core/filter.c
index e8508aaafd27..dd5844033b66 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6468,6 +6468,7 @@ static const struct bpf_func_proto bpf_lwt_seg6_adjust_srh_proto = {
 static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 			      int dif, int sdif, u8 family, u8 proto)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	bool refcounted = false;
 	struct sock *sk = NULL;
 
@@ -6476,7 +6477,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		__be32 dst4 = tuple->ipv4.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet_lookup(net, hinfo, NULL, 0,
 					   src4, tuple->ipv4.sport,
 					   dst4, tuple->ipv4.dport,
 					   dif, sdif, &refcounted);
@@ -6490,7 +6491,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple,
 		struct in6_addr *dst6 = (struct in6_addr *)&tuple->ipv6.daddr;
 
 		if (proto == IPPROTO_TCP)
-			sk = __inet6_lookup(net, &tcp_hashinfo, NULL, 0,
+			sk = __inet6_lookup(net, hinfo, NULL, 0,
 					    src6, tuple->ipv6.sport,
 					    dst6, ntohs(tuple->ipv6.dport),
 					    dif, sdif, &refcounted);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 5c03eba787e5..951c965d9322 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -134,6 +134,7 @@ static void esp_free_tcp_sk(struct rcu_head *head)
 static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 {
 	struct xfrm_encap_tmpl *encap = x->encap;
+	struct net *net = xs_net(x);
 	struct esp_tcp_sk *esk;
 	__be16 sport, dport;
 	struct sock *nsk;
@@ -160,7 +161,7 @@ static struct sock *esp_find_tcp_sk(struct xfrm_state *x)
 	}
 	spin_unlock_bh(&x->lock);
 
-	sk = inet_lookup_established(xs_net(x), &tcp_hashinfo, x->id.daddr.a4,
+	sk = inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, x->id.daddr.a4,
 				     dport, x->props.saddr.a4, sport, 0);
 	if (!sk)
 		return ERR_PTR(-ENOENT);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 23bfff989be6..5eb21a95179b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -386,7 +386,7 @@ static inline struct sock *inet_lookup_run_bpf(struct net *net,
 	struct sock *sk, *reuse_sk;
 	bool no_reuseport;
 
-	if (hashinfo != &tcp_hashinfo)
+	if (hashinfo != net->ipv4.tcp_death_row->hashinfo)
 		return NULL; /* only TCP is supported */
 
 	no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_TCP, saddr, sport,
diff --git a/net/ipv4/netfilter/nf_socket_ipv4.c b/net/ipv4/netfilter/nf_socket_ipv4.c
index 2d42e4c35a20..9ba496d061cc 100644
--- a/net/ipv4/netfilter/nf_socket_ipv4.c
+++ b/net/ipv4/netfilter/nf_socket_ipv4.c
@@ -71,7 +71,7 @@ nf_socket_get_sock_v4(struct net *net, struct sk_buff *skb, const int doff,
 {
 	switch (protocol) {
 	case IPPROTO_TCP:
-		return inet_lookup(net, &tcp_hashinfo, skb, doff,
+		return inet_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff,
 				   saddr, sport, daddr, dport,
 				   in->ifindex);
 	case IPPROTO_UDP:
diff --git a/net/ipv4/netfilter/nf_tproxy_ipv4.c b/net/ipv4/netfilter/nf_tproxy_ipv4.c
index b2bae0b0e42a..ce4e0d50a55c 100644
--- a/net/ipv4/netfilter/nf_tproxy_ipv4.c
+++ b/net/ipv4/netfilter/nf_tproxy_ipv4.c
@@ -79,6 +79,7 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 		      const struct net_device *in,
 		      const enum nf_tproxy_lookup_t lookup_type)
 {
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo;
 	struct sock *sk;
 
 	switch (protocol) {
@@ -92,12 +93,10 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 		switch (lookup_type) {
 		case NF_TPROXY_LOOKUP_LISTENER:
-			sk = inet_lookup_listener(net, &tcp_hashinfo, skb,
-						  ip_hdrlen(skb) +
-						  __tcp_hdrlen(hp),
-						  saddr, sport,
-						  daddr, dport,
-						  in->ifindex, 0);
+			sk = inet_lookup_listener(net, hinfo, skb,
+						  ip_hdrlen(skb) + __tcp_hdrlen(hp),
+						  saddr, sport, daddr, dport,
+						  in->ifindex, 0);
 
 			if (sk && !refcount_inc_not_zero(&sk->sk_refcnt))
 				sk = NULL;
@@ -108,9 +107,9 @@ nf_tproxy_get_sock_v4(struct net *net, struct sk_buff *skb,
 			 */
 			break;
 		case NF_TPROXY_LOOKUP_ESTABLISHED:
-			sk = inet_lookup_established(net, &tcp_hashinfo,
-						     saddr, sport, daddr, dport,
-						     in->ifindex);
+			sk = inet_lookup_established(net, hinfo,
+						     saddr, sport, daddr, dport,
+						     in->ifindex);
 			break;
 		default:
 			BUG();
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index 75a1c985f49a..958c923020c0 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -181,13 +181,21 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
 static void tcp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  const struct inet_diag_req_v2 *r)
 {
-	inet_diag_dump_icsk(&tcp_hashinfo, skb, cb, r);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	inet_diag_dump_icsk(hinfo, skb, cb, r);
 }
 
 static int tcp_diag_dump_one(struct netlink_callback *cb,
 			     const struct inet_diag_req_v2 *req)
 {
-	return inet_diag_dump_one_icsk(&tcp_hashinfo, cb, req);
+	struct inet_hashinfo *hinfo;
+
+	hinfo = sock_net(cb->skb->sk)->ipv4.tcp_death_row->hashinfo;
+
+	return inet_diag_dump_one_icsk(hinfo, cb, req);
 }
 
 #ifdef CONFIG_INET_DIAG_DESTROY
@@ -195,9 +203,13 @@ static int tcp_diag_destroy(struct sk_buff *in_skb,
 			    const struct inet_diag_req_v2 *req)
 {
 	struct net *net = sock_net(in_skb->sk);
-	struct sock *sk = inet_diag_find_one_icsk(net, &tcp_hashinfo, req);
+	struct inet_hashinfo *hinfo;
+	struct sock *sk;
 	int err;
 
+	hinfo = net->ipv4.tcp_death_row->hashinfo;
+	sk = inet_diag_find_one_icsk(net, hinfo, req);
+
 	if (IS_ERR(sk))
 		return PTR_ERR(sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7c3b3ce85a5e..b07930643b11 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -249,7 +249,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	if (!inet->inet_saddr) {
 		if (inet_csk(sk)->icsk_bind2_hash) {
-			prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo,
+			prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo,
 								     sk, sock_net(sk),
 								     inet->inet_num);
 			prev_sk_rcv_saddr = sk->sk_rcv_saddr;
@@ -494,7 +494,7 @@ int tcp_v4_err(struct sk_buff *skb, u32 info)
 	int err;
 	struct net *net = dev_net(skb->dev);
 
-	sk = __inet_lookup_established(net, &tcp_hashinfo, iph->daddr,
+	sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, iph->daddr,
 				       th->dest, iph->saddr, ntohs(th->source),
 				       inet_iif(skb), 0);
 	if (!sk) {
@@ -759,7 +759,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 	 * Incoming packet is checked with md5 hash with finding
key, * no RST generated if md5 hash doesn't match. */ - sk1 = __inet_lookup_listener(net, &tcp_hashinfo, NULL, 0, + sk1 = __inet_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo, NULL, 0, ip_hdr(skb)->saddr, th->source, ip_hdr(skb)->daddr, ntohs(th->source), dif, sdif); @@ -1728,6 +1728,7 @@ EXPORT_SYMBOL(tcp_v4_do_rcv); int tcp_v4_early_demux(struct sk_buff *skb) { + struct net *net = dev_net(skb->dev); const struct iphdr *iph; const struct tcphdr *th; struct sock *sk; @@ -1744,7 +1745,7 @@ int tcp_v4_early_demux(struct sk_buff *skb) if (th->doff < sizeof(struct tcphdr) / 4) return 0; - sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo, + sk = __inet_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, iph->saddr, th->source, iph->daddr, ntohs(th->dest), skb->skb_iif, inet_sdif(skb)); @@ -1970,7 +1971,8 @@ int tcp_v4_rcv(struct sk_buff *skb) th = (const struct tcphdr *)skb->data; iph = ip_hdr(skb); lookup: - sk = __inet_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), th->source, + sk = __inet_lookup_skb(net->ipv4.tcp_death_row->hashinfo, + skb, __tcp_hdrlen(th), th->source, th->dest, sdif, &refcounted); if (!sk) goto no_tcp_socket; @@ -2152,9 +2154,9 @@ int tcp_v4_rcv(struct sk_buff *skb) } switch (tcp_timewait_state_process(inet_twsk(sk), skb, th)) { case TCP_TW_SYN: { - struct sock *sk2 = inet_lookup_listener(dev_net(skb->dev), - &tcp_hashinfo, skb, - __tcp_hdrlen(th), + struct sock *sk2 = inet_lookup_listener(net, + net->ipv4.tcp_death_row->hashinfo, + skb, __tcp_hdrlen(th), iph->saddr, th->source, iph->daddr, th->dest, inet_iif(skb), @@ -2304,15 +2306,16 @@ static bool seq_sk_match(struct seq_file *seq, const struct sock *sk) */ static void *listening_get_first(struct seq_file *seq) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct tcp_iter_state *st = seq->private; st->offset = 0; - for (; st->bucket <= tcp_hashinfo.lhash2_mask; st->bucket++) { + for (; st->bucket <= hinfo->lhash2_mask; st->bucket++) { struct inet_listen_hashbucket *ilb2; struct hlist_nulls_node *node; struct sock *sk; - ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + ilb2 = &hinfo->lhash2[st->bucket]; if (hlist_nulls_empty(&ilb2->nulls_head)) continue; @@ -2337,6 +2340,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur) struct tcp_iter_state *st = seq->private; struct inet_listen_hashbucket *ilb2; struct hlist_nulls_node *node; + struct inet_hashinfo *hinfo; struct sock *sk = cur; ++st->num; @@ -2348,7 +2352,8 @@ static void *listening_get_next(struct seq_file *seq, void *cur) return sk; } - ilb2 = &tcp_hashinfo.lhash2[st->bucket]; + hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; + ilb2 = &hinfo->lhash2[st->bucket]; spin_unlock(&ilb2->lock); ++st->bucket; return listening_get_first(seq); @@ -2370,9 +2375,10 @@ static void *listening_get_idx(struct seq_file *seq, loff_t *pos) return rc; } -static inline bool empty_bucket(const struct tcp_iter_state *st) +static inline bool empty_bucket(struct inet_hashinfo *hinfo, + const struct tcp_iter_state *st) { - return hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].chain); + return hlist_nulls_empty(&hinfo->ehash[st->bucket].chain); } /* @@ -2381,20 +2387,21 @@ static inline bool empty_bucket(const struct tcp_iter_state *st) */ static void *established_get_first(struct seq_file *seq) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct tcp_iter_state *st = seq->private; st->offset = 0; - for (; st->bucket <= tcp_hashinfo.ehash_mask; ++st->bucket) { 
+ for (; st->bucket <= hinfo->ehash_mask; ++st->bucket) { struct sock *sk; struct hlist_nulls_node *node; - spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, st->bucket); + spinlock_t *lock = inet_ehash_lockp(hinfo, st->bucket); /* Lockless fast path for the common case of empty buckets */ - if (empty_bucket(st)) + if (empty_bucket(hinfo, st)) continue; spin_lock_bh(lock); - sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) { + sk_nulls_for_each(sk, node, &hinfo->ehash[st->bucket].chain) { if (seq_sk_match(seq, sk)) return sk; } @@ -2406,6 +2413,7 @@ static void *established_get_first(struct seq_file *seq) static void *established_get_next(struct seq_file *seq, void *cur) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct tcp_iter_state *st = seq->private; struct hlist_nulls_node *node; struct sock *sk = cur; @@ -2420,7 +2428,7 @@ static void *established_get_next(struct seq_file *seq, void *cur) return sk; } - spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket)); + spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket)); ++st->bucket; return established_get_first(seq); } @@ -2458,6 +2466,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos) static void *tcp_seek_last_pos(struct seq_file *seq) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct tcp_iter_state *st = seq->private; int bucket = st->bucket; int offset = st->offset; @@ -2466,7 +2475,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) switch (st->state) { case TCP_SEQ_STATE_LISTENING: - if (st->bucket > tcp_hashinfo.lhash2_mask) + if (st->bucket > hinfo->lhash2_mask) break; st->state = TCP_SEQ_STATE_LISTENING; rc = listening_get_first(seq); @@ -2478,7 +2487,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq) st->state = TCP_SEQ_STATE_ESTABLISHED; fallthrough; case TCP_SEQ_STATE_ESTABLISHED: - if (st->bucket > tcp_hashinfo.ehash_mask) + if (st->bucket > hinfo->ehash_mask) break; rc = established_get_first(seq); while (offset-- && rc && bucket == st->bucket) @@ -2546,16 +2555,17 @@ EXPORT_SYMBOL(tcp_seq_next); void tcp_seq_stop(struct seq_file *seq, void *v) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct tcp_iter_state *st = seq->private; switch (st->state) { case TCP_SEQ_STATE_LISTENING: if (v != SEQ_START_TOKEN) - spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); + spin_unlock(&hinfo->lhash2[st->bucket].lock); break; case TCP_SEQ_STATE_ESTABLISHED: if (v) - spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket)); + spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket)); break; } } @@ -2750,6 +2760,7 @@ static int bpf_iter_tcp_realloc_batch(struct bpf_tcp_iter_state *iter, static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq, struct sock *start_sk) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct bpf_tcp_iter_state *iter = seq->private; struct tcp_iter_state *st = &iter->state; struct hlist_nulls_node *node; @@ -2769,7 +2780,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq, expected++; } } - spin_unlock(&tcp_hashinfo.lhash2[st->bucket].lock); + spin_unlock(&hinfo->lhash2[st->bucket].lock); return expected; } @@ -2777,6 +2788,7 @@ static unsigned int bpf_iter_tcp_listening_batch(struct seq_file *seq, static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq, struct sock *start_sk) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct 
bpf_tcp_iter_state *iter = seq->private; struct tcp_iter_state *st = &iter->state; struct hlist_nulls_node *node; @@ -2796,13 +2808,14 @@ static unsigned int bpf_iter_tcp_established_batch(struct seq_file *seq, expected++; } } - spin_unlock_bh(inet_ehash_lockp(&tcp_hashinfo, st->bucket)); + spin_unlock_bh(inet_ehash_lockp(hinfo, st->bucket)); return expected; } static struct sock *bpf_iter_tcp_batch(struct seq_file *seq) { + struct inet_hashinfo *hinfo = seq_file_net(seq)->ipv4.tcp_death_row->hashinfo; struct bpf_tcp_iter_state *iter = seq->private; struct tcp_iter_state *st = &iter->state; unsigned int expected; @@ -2818,7 +2831,7 @@ static struct sock *bpf_iter_tcp_batch(struct seq_file *seq) st->offset = 0; st->bucket++; if (st->state == TCP_SEQ_STATE_LISTENING && - st->bucket > tcp_hashinfo.lhash2_mask) { + st->bucket > hinfo->lhash2_mask) { st->state = TCP_SEQ_STATE_ESTABLISHED; st->bucket = 0; } diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index cb95d88497ae..361aad67c6d6 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -319,7 +319,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo) /* Linkage updates. * Note that access to tw after this point is illegal. */ - inet_twsk_hashdance(tw, sk, &tcp_hashinfo); + inet_twsk_hashdance(tw, sk, tcp_death_row->hashinfo); local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 8220923a12f7..3cdeed3d2e1a 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -151,6 +151,7 @@ static void esp_free_tcp_sk(struct rcu_head *head) static struct sock *esp6_find_tcp_sk(struct xfrm_state *x) { struct xfrm_encap_tmpl *encap = x->encap; + struct net *net = xs_net(x); struct esp_tcp_sk *esk; __be16 sport, dport; struct sock *nsk; @@ -177,7 +178,7 @@ static struct sock *esp6_find_tcp_sk(struct xfrm_state *x) } spin_unlock_bh(&x->lock); - sk = __inet6_lookup_established(xs_net(x), &tcp_hashinfo, &x->id.daddr.in6, + sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, &x->id.daddr.in6, dport, &x->props.saddr.in6, ntohs(sport), 0, 0); if (!sk) return ERR_PTR(-ENOENT); diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 7d53d62783b1..52513a6a54fd 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@ -21,8 +21,6 @@ #include #include -extern struct inet_hashinfo tcp_hashinfo; - u32 inet6_ehashfn(const struct net *net, const struct in6_addr *laddr, const u16 lport, const struct in6_addr *faddr, const __be16 fport) @@ -169,7 +167,7 @@ static inline struct sock *inet6_lookup_run_bpf(struct net *net, struct sock *sk, *reuse_sk; bool no_reuseport; - if (hashinfo != &tcp_hashinfo) + if (hashinfo != net->ipv4.tcp_death_row->hashinfo) return NULL; /* only TCP is supported */ no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_TCP, saddr, sport, diff --git a/net/ipv6/netfilter/nf_socket_ipv6.c b/net/ipv6/netfilter/nf_socket_ipv6.c index aa5bb8789ba0..6356407065c7 100644 --- a/net/ipv6/netfilter/nf_socket_ipv6.c +++ b/net/ipv6/netfilter/nf_socket_ipv6.c @@ -83,7 +83,7 @@ nf_socket_get_sock_v6(struct net *net, struct sk_buff *skb, int doff, { switch (protocol) { case IPPROTO_TCP: - return inet6_lookup(net, &tcp_hashinfo, skb, doff, + return inet6_lookup(net, net->ipv4.tcp_death_row->hashinfo, skb, doff, saddr, sport, daddr, dport, in->ifindex); case IPPROTO_UDP: diff --git a/net/ipv6/netfilter/nf_tproxy_ipv6.c b/net/ipv6/netfilter/nf_tproxy_ipv6.c index 6bac68fb27a3..b0e5fb3e4d95 100644 
--- a/net/ipv6/netfilter/nf_tproxy_ipv6.c +++ b/net/ipv6/netfilter/nf_tproxy_ipv6.c @@ -80,6 +80,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, const struct net_device *in, const enum nf_tproxy_lookup_t lookup_type) { + struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo; struct sock *sk; switch (protocol) { @@ -93,7 +94,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, switch (lookup_type) { case NF_TPROXY_LOOKUP_LISTENER: - sk = inet6_lookup_listener(net, &tcp_hashinfo, skb, + sk = inet6_lookup_listener(net, hinfo, skb, thoff + __tcp_hdrlen(hp), saddr, sport, daddr, ntohs(dport), @@ -108,7 +109,7 @@ nf_tproxy_get_sock_v6(struct net *net, struct sk_buff *skb, int thoff, */ break; case NF_TPROXY_LOOKUP_ESTABLISHED: - sk = __inet6_lookup_established(net, &tcp_hashinfo, + sk = __inet6_lookup_established(net, hinfo, saddr, sport, daddr, ntohs(dport), in->ifindex, 0); break; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 791f24da9212..27b2fd98a2c4 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -286,12 +286,14 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, goto failure; } + tcp_death_row = sock_net(sk)->ipv4.tcp_death_row; + if (!saddr) { struct inet_bind_hashbucket *prev_addr_hashbucket = NULL; struct in6_addr prev_v6_rcv_saddr; if (icsk->icsk_bind2_hash) { - prev_addr_hashbucket = inet_bhashfn_portaddr(&tcp_hashinfo, + prev_addr_hashbucket = inet_bhashfn_portaddr(tcp_death_row->hashinfo, sk, sock_net(sk), inet->inet_num); prev_v6_rcv_saddr = sk->sk_v6_rcv_saddr; @@ -325,7 +327,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr, inet->inet_dport = usin->sin6_port; tcp_set_state(sk, TCP_SYN_SENT); - tcp_death_row = sock_net(sk)->ipv4.tcp_death_row; err = inet6_hash_connect(tcp_death_row, sk); if (err) goto late_failure; @@ -403,7 +404,7 @@ static int tcp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, bool fatal; int err; - sk = __inet6_lookup_established(net, &tcp_hashinfo, + sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, &hdr->daddr, th->dest, &hdr->saddr, ntohs(th->source), skb->dev->ifindex, inet6_sdif(skb)); @@ -1037,7 +1038,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb) * no RST generated if md5 hash doesn't match. 
*/ sk1 = inet6_lookup_listener(net, - &tcp_hashinfo, NULL, 0, + net->ipv4.tcp_death_row->hashinfo, NULL, 0, &ipv6h->saddr, th->source, &ipv6h->daddr, ntohs(th->source), dif, sdif); @@ -1636,7 +1637,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) hdr = ipv6_hdr(skb); lookup: - sk = __inet6_lookup_skb(&tcp_hashinfo, skb, __tcp_hdrlen(th), + sk = __inet6_lookup_skb(net->ipv4.tcp_death_row->hashinfo, skb, __tcp_hdrlen(th), th->source, th->dest, inet6_iif(skb), sdif, &refcounted); if (!sk) @@ -1811,7 +1812,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) { struct sock *sk2; - sk2 = inet6_lookup_listener(dev_net(skb->dev), &tcp_hashinfo, + sk2 = inet6_lookup_listener(net, net->ipv4.tcp_death_row->hashinfo, skb, __tcp_hdrlen(th), &ipv6_hdr(skb)->saddr, th->source, &ipv6_hdr(skb)->daddr, @@ -1844,6 +1845,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) void tcp_v6_early_demux(struct sk_buff *skb) { + struct net *net = dev_net(skb->dev); const struct ipv6hdr *hdr; const struct tcphdr *th; struct sock *sk; @@ -1861,7 +1863,7 @@ void tcp_v6_early_demux(struct sk_buff *skb) return; /* Note : We use inet6_iif() here, not tcp_v6_iif() */ - sk = __inet6_lookup_established(dev_net(skb->dev), &tcp_hashinfo, + sk = __inet6_lookup_established(net, net->ipv4.tcp_death_row->hashinfo, &hdr->saddr, th->source, &hdr->daddr, ntohs(th->dest), inet6_iif(skb), inet6_sdif(skb)); diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c index 7f9a71780437..12b0aac25a39 100644 --- a/net/mptcp/mptcp_diag.c +++ b/net/mptcp/mptcp_diag.c @@ -81,15 +81,18 @@ static void mptcp_diag_dump_listeners(struct sk_buff *skb, struct netlink_callba struct mptcp_diag_ctx *diag_ctx = (void *)cb->ctx; struct nlattr *bc = cb_data->inet_diag_nla_bc; struct net *net = sock_net(skb->sk); + struct inet_hashinfo *hinfo; int i; - for (i = diag_ctx->l_slot; i <= tcp_hashinfo.lhash2_mask; i++) { + hinfo = net->ipv4.tcp_death_row->hashinfo; + + for (i = diag_ctx->l_slot; i <= hinfo->lhash2_mask; i++) { struct inet_listen_hashbucket *ilb; struct hlist_nulls_node *node; struct sock *sk; int num = 0; - ilb = &tcp_hashinfo.lhash2[i]; + ilb = &hinfo->lhash2[i]; rcu_read_lock(); spin_lock(&ilb->lock); From patchwork Fri Aug 26 00:04:40 2022 X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955402
From: Kuniyuki Iwashima To: "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima Subject: [PATCH v1 net-next 08/13] tcp: Introduce optional per-netns ehash. Date: Thu, 25 Aug 2022 17:04:40 -0700 Message-ID: <20220826000445.46552-9-kuniyu@amazon.com> In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> X-Mailing-List: linux-fsdevel@vger.kernel.org The more sockets we have in the hash table, the more time we spend looking up a socket. When multiple small workloads run on the same host, they penalise each other and cause performance degradation. Also, the root cause might be a single workload that consumes far more resources than the others. This often happens on a cloud service where different workloads share the same computing resources. To resolve the issue, we introduce an optional per-netns hash table for TCP, but it covers only ehash; we still share the global bhash and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. Also, we can reduce lock contention. We can control the ehash size via a new sysctl knob. However, depending on the workload, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, if we fail to allocate enough memory for a new ehash, we fall back to using the global ehash. We can check the current ehash size via another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (the per-netns ehash is disabled or its allocation failed).
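For illustration, here is a minimal userspace sketch of the same flow as the shell session below (the demo program is not part of this series; it assumes CAP_SYS_ADMIN and the usual /proc/sys/net/ipv4 knob paths):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Request a 100-entry ehash for future child netns; the kernel rounds
 * it up to 128 (2^n). The knob must be written before unshare()/clone()
 * because the ehash is sized at netns creation time.
 */
int main(void)
{
	char buf[32] = {};
	int fd;

	fd = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_WRONLY);
	if (fd < 0 || write(fd, "100", 3) != 3) {
		perror("tcp_child_ehash_entries");
		return 1;
	}
	close(fd);

	if (unshare(CLONE_NEWNET)) {	/* enter a netns with its own ehash */
		perror("unshare");
		return 1;
	}

	/* /proc/sys/net/* resolves against the opener's netns. */
	fd = open("/proc/sys/net/ipv4/tcp_ehash_entries", O_RDONLY);
	if (fd >= 0 && read(fd, buf, sizeof(buf) - 1) > 0)
		printf("tcp_ehash_entries = %s", buf);	/* expect 128 */
	if (fd >= 0)
		close(fd);

	return 0;
}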
# dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) # sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default # ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash # sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100 # sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 128 # rounded up to 2^n # ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own per-netns ehash When two or more processes in the same netns create a per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways: 1) Share the global ehash, then create a per-netns ehash. First, unshare() with tcp_child_ehash_entries==0. This creates dedicated netns sysctl knobs, where we can safely change tcp_child_ehash_entries and then clone()/unshare() to create a per-netns ehash. 2) Lock the sysctl knob. We can use flock(LOCK_MAND) or BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on the sysctl knobs. Note that the default values of two other sysctl knobs depend on the ehash size and should be tuned carefully: tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128) Also, we could further optimise ehash lookup/iteration for the per-netns ehash in the future by removing the netns comparison. As a bonus, we can dismantle a netns faster. Currently, while destroying a netns, we call inet_twsk_purge(), which walks through the global ehash. That walk can be expensive because the global ehash holds many sockets, from all netns, that are not in TIME_WAIT. With a split ehash, inet_twsk_purge() only needs to clean up the TIME_WAIT sockets in each netns. Signed-off-by: Kuniyuki Iwashima --- Documentation/networking/ip-sysctl.rst | 20 +++++++++ include/net/inet_hashtables.h | 6 +++ include/net/netns/ipv4.h | 1 + net/dccp/proto.c | 2 + net/ipv4/inet_hashtables.c | 57 ++++++++++++++++++++++++++ net/ipv4/inet_timewait_sock.c | 4 +- net/ipv4/sysctl_net_ipv4.c | 57 ++++++++++++++++++++++++++ net/ipv4/tcp.c | 1 + net/ipv4/tcp_ipv4.c | 53 ++++++++++++++++++++---- net/ipv6/tcp_ipv6.c | 12 +++++- 10 files changed, 202 insertions(+), 11 deletions(-) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 56cd4ea059b2..97a0952b11e3 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -1037,6 +1037,26 @@ tcp_challenge_ack_limit - INTEGER in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) Default: 1000 +tcp_ehash_entries - INTEGER + Read-only number of hash buckets for TCP sockets in the current + networking namespace. + + A negative value means the networking namespace does not own its + hash buckets and shares the initial networking namespace's one. + +tcp_child_ehash_entries - INTEGER + Control the number of hash buckets for TCP sockets in the child + networking namespace, which must be set before clone() or unshare(). + + The written value except for 0 is rounded up to 2^n. 0 is a special + value, meaning the child networking namespace will share the initial + networking namespace's hash buckets.
+ + Note that the child will use the global one in case the kernel + fails to allocate enough memory. + + Default: 0 + UDP variables ============= diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 2c866112433e..039440936ab2 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -168,6 +168,8 @@ struct inet_hashinfo { /* The 2nd listener table hashed by local port and address */ unsigned int lhash2_mask; struct inet_listen_hashbucket *lhash2; + + bool pernet; }; static inline struct inet_hashinfo *inet_get_hashinfo(const struct sock *sk) @@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo) hashinfo->ehash_locks = NULL; } +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, + unsigned int ehash_entries); +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo); + struct inet_bind_bucket * inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net, struct inet_bind_hashbucket *head, diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index c7320ef356d9..6d9c01879027 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -170,6 +170,7 @@ struct netns_ipv4 { int sysctl_tcp_pacing_ca_ratio; int sysctl_tcp_wmem[3]; int sysctl_tcp_rmem[3]; + unsigned int sysctl_tcp_child_ehash_entries; unsigned long sysctl_tcp_comp_sack_delay_ns; unsigned long sysctl_tcp_comp_sack_slack_ns; int sysctl_max_syn_backlog; diff --git a/net/dccp/proto.c b/net/dccp/proto.c index 7cd4a6cc99fc..c548ca3e9b0e 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1197,6 +1197,8 @@ static int __init dccp_init(void) INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain); } + dccp_hashinfo.pernet = false; + rc = dccp_mib_init(); if (rc) goto out_free_dccp_bhash2; diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 5eb21a95179b..a57932b14bc6 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -1145,3 +1145,60 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo) return 0; } EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc); + +struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo, + unsigned int ehash_entries) +{ + struct inet_hashinfo *new_hashinfo; + int i; + + new_hashinfo = kmalloc(sizeof(*new_hashinfo), GFP_KERNEL); + if (!new_hashinfo) + goto err; + + new_hashinfo->ehash = kvmalloc_array(ehash_entries, + sizeof(struct inet_ehash_bucket), + GFP_KERNEL); + if (!new_hashinfo->ehash) + goto free_hashinfo; + + new_hashinfo->ehash_mask = ehash_entries - 1; + + if (inet_ehash_locks_alloc(new_hashinfo)) + goto free_ehash; + + for (i = 0; i < ehash_entries; i++) + INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i); + + new_hashinfo->bind_bucket_cachep = hashinfo->bind_bucket_cachep; + new_hashinfo->bhash = hashinfo->bhash; + new_hashinfo->bind2_bucket_cachep = hashinfo->bind2_bucket_cachep; + new_hashinfo->bhash2 = hashinfo->bhash2; + new_hashinfo->bhash_size = hashinfo->bhash_size; + + new_hashinfo->lhash2_mask = hashinfo->lhash2_mask; + new_hashinfo->lhash2 = hashinfo->lhash2; + + new_hashinfo->pernet = true; + + return new_hashinfo; + +free_ehash: + kvfree(new_hashinfo->ehash); +free_hashinfo: + kfree(new_hashinfo); +err: + return NULL; +} +EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc); + +void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo) +{ + if (!hashinfo->pernet) + return; + + inet_ehash_locks_free(hashinfo); + kvfree(hashinfo->ehash); + kfree(hashinfo); +} 
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free); diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index 47ccc343c9fb..a5d40acde9d6 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -59,8 +59,10 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw) inet_twsk_bind_unhash(tw, hashinfo); spin_unlock(&bhead->lock); - if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) + if (refcount_dec_and_test(&tw->tw_dr->tw_refcount)) { + inet_pernet_hashinfo_free(hashinfo); kfree(tw->tw_dr); + } inet_twsk_put(tw); } diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 5490c285668b..03a3187c4705 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -382,6 +382,48 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl, return ret; } +static int proc_tcp_ehash_entries(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct net *net = container_of(table->data, struct net, + ipv4.sysctl_tcp_child_ehash_entries); + struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row->hashinfo; + int tcp_ehash_entries; + struct ctl_table tbl; + + tcp_ehash_entries = hinfo->ehash_mask + 1; + + /* A negative number indicates that the child netns + * shares the global ehash. + */ + if (!net_eq(net, &init_net) && !hinfo->pernet) + tcp_ehash_entries *= -1; + + tbl.data = &tcp_ehash_entries; + tbl.maxlen = sizeof(int); + + return proc_dointvec(&tbl, write, buffer, lenp, ppos); +} + +static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + unsigned int tcp_child_ehash_entries; + int ret; + + ret = proc_douintvec(table, write, buffer, lenp, ppos); + if (!write || ret) + return ret; + + tcp_child_ehash_entries = READ_ONCE(*(unsigned int *)table->data); + if (tcp_child_ehash_entries) + tcp_child_ehash_entries = roundup_pow_of_two(tcp_child_ehash_entries); + + WRITE_ONCE(*(unsigned int *)table->data, tcp_child_ehash_entries); + + return 0; +} + #ifdef CONFIG_IP_ROUTE_MULTIPATH static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write, void *buffer, size_t *lenp, @@ -1321,6 +1363,21 @@ static struct ctl_table ipv4_net_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, + { + .procname = "tcp_ehash_entries", + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, + .mode = 0444, + .proc_handler = proc_tcp_ehash_entries, + }, + { + .procname = "tcp_child_ehash_entries", + .data = &init_net.ipv4.sysctl_tcp_child_ehash_entries, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_tcp_child_ehash_entries, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_INT_MAX, + }, { .procname = "udp_rmem_min", .data = &init_net.ipv4.sysctl_udp_rmem_min, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index baf6adb723ad..f8ce673e32cb 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -4788,6 +4788,7 @@ void __init tcp_init(void) INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain); } + tcp_hashinfo.pernet = false; cnt = tcp_hashinfo.ehash_mask + 1; sysctl_tcp_max_orphans = cnt / 2; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index b07930643b11..604119f46b52 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3109,14 +3109,23 @@ static void __net_exit tcp_sk_exit(struct net *net) if (net->ipv4.tcp_congestion_control) bpf_module_put(net->ipv4.tcp_congestion_control, net->ipv4.tcp_congestion_control->owner); - if (refcount_dec_and_test(&tcp_death_row->tw_refcount)) + if 
(refcount_dec_and_test(&tcp_death_row->tw_refcount)) { + inet_pernet_hashinfo_free(tcp_death_row->hashinfo); kfree(tcp_death_row); + } } -static int __net_init tcp_sk_init(struct net *net) +static void __net_init tcp_set_hashinfo(struct net *net, struct inet_hashinfo *hinfo) { - int cnt; + int ehash_entries = hinfo->ehash_mask + 1; + net->ipv4.tcp_death_row->hashinfo = hinfo; + net->ipv4.tcp_death_row->sysctl_max_tw_buckets = ehash_entries / 2; + net->ipv4.sysctl_max_syn_backlog = max(128, ehash_entries / 128); +} + +static int __net_init tcp_sk_init(struct net *net) +{ net->ipv4.sysctl_tcp_ecn = 2; net->ipv4.sysctl_tcp_ecn_fallback = 1; @@ -3145,12 +3154,10 @@ static int __net_init tcp_sk_init(struct net *net) net->ipv4.tcp_death_row = kzalloc(sizeof(struct inet_timewait_death_row), GFP_KERNEL); if (!net->ipv4.tcp_death_row) return -ENOMEM; + refcount_set(&net->ipv4.tcp_death_row->tw_refcount, 1); - cnt = tcp_hashinfo.ehash_mask + 1; - net->ipv4.tcp_death_row->sysctl_max_tw_buckets = cnt / 2; - net->ipv4.tcp_death_row->hashinfo = &tcp_hashinfo; + tcp_set_hashinfo(net, &tcp_hashinfo); - net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128); net->ipv4.sysctl_tcp_sack = 1; net->ipv4.sysctl_tcp_window_scaling = 1; net->ipv4.sysctl_tcp_timestamps = 1; @@ -3206,18 +3213,46 @@ static int __net_init tcp_sk_init(struct net *net) return 0; } +static int __net_init tcp_sk_init_pernet_hashinfo(struct net *net, struct net *old_net) +{ + struct inet_hashinfo *child_hinfo; + int ehash_entries; + + ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries); + if (!ehash_entries) + goto out; + + child_hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries); + if (child_hinfo) + tcp_set_hashinfo(net, child_hinfo); + else + pr_warn("Failed to allocate TCP ehash (entries: %u) " + "for a netns, fallback to use the global one\n", + ehash_entries); +out: + return 0; +} + static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list) { + bool purge_once = true; struct net *net; - inet_twsk_purge(&tcp_hashinfo, AF_INET); + list_for_each_entry(net, net_exit_list, exit_list) { + if (net->ipv4.tcp_death_row->hashinfo->pernet) { + inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET); + } else if (purge_once) { + inet_twsk_purge(&tcp_hashinfo, AF_INET); + purge_once = false; + } - list_for_each_entry(net, net_exit_list, exit_list) tcp_fastopen_ctx_destroy(net); + } } static struct pernet_operations __net_initdata tcp_sk_ops = { .init = tcp_sk_init, + .init2 = tcp_sk_init_pernet_hashinfo, .exit = tcp_sk_exit, .exit_batch = tcp_sk_exit_batch, }; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 27b2fd98a2c4..19f730428720 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -2229,7 +2229,17 @@ static void __net_exit tcpv6_net_exit(struct net *net) static void __net_exit tcpv6_net_exit_batch(struct list_head *net_exit_list) { - inet_twsk_purge(&tcp_hashinfo, AF_INET6); + bool purge_once = true; + struct net *net; + + list_for_each_entry(net, net_exit_list, exit_list) { + if (net->ipv4.tcp_death_row->hashinfo->pernet) { + inet_twsk_purge(net->ipv4.tcp_death_row->hashinfo, AF_INET6); + } else if (purge_once) { + inet_twsk_purge(&tcp_hashinfo, AF_INET6); + purge_once = false; + } + } } static struct pernet_operations tcpv6_net_ops = { From patchwork Fri Aug 26 00:04:41 2022 X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955401
From: Kuniyuki Iwashima To: "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima Subject: [PATCH v1 net-next 09/13] udp: Clean up some functions. Date: Thu, 25 Aug 2022 17:04:41 -0700 Message-ID: <20220826000445.46552-10-kuniyu@amazon.com> In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> X-Mailing-List: linux-fsdevel@vger.kernel.org This patch adds no functional change; it only cleans up some functions that the following patches touch, so that they are tidy and the later changes are easy to review/revert.
The change is - Keep reverse christmas tree order Signed-off-by: Kuniyuki Iwashima --- net/ipv4/udp.c | 35 +++++++++++++++++++++-------------- net/ipv6/udp.c | 12 ++++++++---- 2 files changed, 29 insertions(+), 18 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 34eda973bbf1..c073304e60bb 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -232,10 +232,10 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot) int udp_lib_get_port(struct sock *sk, unsigned short snum, unsigned int hash2_nulladdr) { - struct udp_hslot *hslot, *hslot2; struct udp_table *udptable = sk->sk_prot->h.udp_table; - int error = 1; + struct udp_hslot *hslot, *hslot2; struct net *net = sock_net(sk); + int error = 1; if (!snum) { int low, high, remaining; @@ -2524,10 +2524,13 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net, __be16 rmt_port, __be32 rmt_addr, int dif, int sdif) { - struct sock *sk, *result; unsigned short hnum = ntohs(loc_port); - unsigned int slot = udp_hashfn(net, hnum, udp_table.mask); - struct udp_hslot *hslot = &udp_table.hash[slot]; + struct sock *sk, *result; + struct udp_hslot *hslot; + unsigned int slot; + + slot = udp_hashfn(net, hnum, udp_table.mask); + hslot = &udp_table.hash[slot]; /* Do not bother scanning a too big list */ if (hslot->count > 10) @@ -2555,14 +2558,18 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net, __be16 rmt_port, __be32 rmt_addr, int dif, int sdif) { - unsigned short hnum = ntohs(loc_port); - unsigned int hash2 = ipv4_portaddr_hash(net, loc_addr, hnum); - unsigned int slot2 = hash2 & udp_table.mask; - struct udp_hslot *hslot2 = &udp_table.hash2[slot2]; INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr); - const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum); + unsigned short hnum = ntohs(loc_port); + unsigned int hash2, slot2; + struct udp_hslot *hslot2; + __portpair ports; struct sock *sk; + hash2 = ipv4_portaddr_hash(net, loc_addr, hnum); + slot2 = hash2 & udp_table.mask; + hslot2 = &udp_table.hash2[slot2]; + ports = INET_COMBINED_PORTS(rmt_port, hnum); + udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { if (inet_match(net, sk, acookie, ports, dif, sdif)) return sk; @@ -2963,10 +2970,10 @@ EXPORT_SYMBOL(udp_prot); static struct sock *udp_get_first(struct seq_file *seq, int start) { - struct sock *sk; - struct udp_seq_afinfo *afinfo; struct udp_iter_state *state = seq->private; struct net *net = seq_file_net(seq); + struct udp_seq_afinfo *afinfo; + struct sock *sk; if (state->bpf_seq_afinfo) afinfo = state->bpf_seq_afinfo; @@ -2997,9 +3004,9 @@ static struct sock *udp_get_first(struct seq_file *seq, int start) static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk) { - struct udp_seq_afinfo *afinfo; struct udp_iter_state *state = seq->private; struct net *net = seq_file_net(seq); + struct udp_seq_afinfo *afinfo; if (state->bpf_seq_afinfo) afinfo = state->bpf_seq_afinfo; @@ -3055,8 +3062,8 @@ EXPORT_SYMBOL(udp_seq_next); void udp_seq_stop(struct seq_file *seq, void *v) { - struct udp_seq_afinfo *afinfo; struct udp_iter_state *state = seq->private; + struct udp_seq_afinfo *afinfo; if (state->bpf_seq_afinfo) afinfo = state->bpf_seq_afinfo; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 16c176e7c69a..f7e0248a00bc 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1036,12 +1036,16 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net, int dif, int sdif) { unsigned short hnum = ntohs(loc_port); - unsigned int hash2 = ipv6_portaddr_hash(net, loc_addr, hnum); - unsigned 
int slot2 = hash2 & udp_table.mask; - struct udp_hslot *hslot2 = &udp_table.hash2[slot2]; - const __portpair ports = INET_COMBINED_PORTS(rmt_port, hnum); + unsigned int hash2, slot2; + struct udp_hslot *hslot2; + __portpair ports; struct sock *sk; + hash2 = ipv6_portaddr_hash(net, loc_addr, hnum); + slot2 = hash2 & udp_table.mask; + hslot2 = &udp_table.hash2[slot2]; + ports = INET_COMBINED_PORTS(rmt_port, hnum); + udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { if (sk->sk_state == TCP_ESTABLISHED && inet6_match(net, sk, rmt_addr, loc_addr, ports, dif, sdif)) From patchwork Fri Aug 26 00:04:42 2022 X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955403 From: Kuniyuki Iwashima To: "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima Subject: [PATCH v1 net-next 10/13] udp: Set NULL to sk->sk_prot->h.udp_table.
Date: Thu, 25 Aug 2022 17:04:42 -0700 Message-ID: <20220826000445.46552-11-kuniyu@amazon.com> In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> X-Mailing-List: linux-fsdevel@vger.kernel.org We will soon introduce an optional per-netns hash table for UDP. This means we cannot use the global sk->sk_prot->h.udp_table to fetch a UDP hash table. Instead, set sk->sk_prot->h.udp_table to NULL for UDP and get the proper table from net->ipv4.udp_table. Note that we still need sk->sk_prot->h.udp_table for UDP-Lite. Signed-off-by: Kuniyuki Iwashima --- include/net/netns/ipv4.h | 1 + net/ipv4/udp.c | 15 +++++++++++---- net/ipv6/udp.c | 2 +- 3 files changed, 13 insertions(+), 5 deletions(-) diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 6d9c01879027..c367da5d61e2 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -42,6 +42,7 @@ struct tcp_fastopen_context; struct netns_ipv4 { struct inet_timewait_death_row *tcp_death_row; + struct udp_table *udp_table; #ifdef CONFIG_SYSCTL struct ctl_table_header *forw_hdr; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index c073304e60bb..1bfae1fbe682 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -131,6 +131,11 @@ EXPORT_PER_CPU_SYMBOL_GPL(udp_memory_per_cpu_fw_alloc); #define MAX_UDP_PORTS 65536 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN) +static struct udp_table *udp_get_table_prot(struct sock *sk) +{ + return sk->sk_prot->h.udp_table ? : sock_net(sk)->ipv4.udp_table; +} + static int udp_lib_lport_inuse(struct net *net, __u16 num, const struct udp_hslot *hslot, unsigned long *bitmap, @@ -232,7 +237,7 @@ static int udp_reuseport_add_sock(struct sock *sk, struct udp_hslot *hslot) int udp_lib_get_port(struct sock *sk, unsigned short snum, unsigned int hash2_nulladdr) { - struct udp_table *udptable = sk->sk_prot->h.udp_table; + struct udp_table *udptable = udp_get_table_prot(sk); struct udp_hslot *hslot, *hslot2; struct net *net = sock_net(sk); int error = 1; @@ -2004,7 +2009,7 @@ EXPORT_SYMBOL(udp_disconnect); void udp_lib_unhash(struct sock *sk) { if (sk_hashed(sk)) { - struct udp_table *udptable = sk->sk_prot->h.udp_table; + struct udp_table *udptable = udp_get_table_prot(sk); struct udp_hslot *hslot, *hslot2; hslot = udp_hashslot(udptable, sock_net(sk), @@ -2035,7 +2040,7 @@ EXPORT_SYMBOL(udp_lib_unhash); void udp_lib_rehash(struct sock *sk, u16 newhash) { if (sk_hashed(sk)) { - struct udp_table *udptable = sk->sk_prot->h.udp_table; + struct udp_table *udptable = udp_get_table_prot(sk); struct udp_hslot *hslot, *hslot2, *nhslot2; hslot2 = udp_hashslot2(udptable, udp_sk(sk)->udp_portaddr_hash); @@ -2960,7 +2965,7 @@ struct proto udp_prot = { .sysctl_wmem_offset = offsetof(struct net, ipv4.sysctl_udp_wmem_min), .sysctl_rmem_offset = offsetof(struct net, ipv4.sysctl_udp_rmem_min), .obj_size = sizeof(struct udp_sock), - .h.udp_table = &udp_table, + .h.udp_table = NULL, .diag_destroy = udp_abort, }; EXPORT_SYMBOL(udp_prot); @@ -3273,6 +3278,8 @@ EXPORT_SYMBOL(udp_flow_hashrnd); static int __net_init udp_sysctl_init(struct net *net) { + net->ipv4.udp_table = &udp_table; + net->ipv4.sysctl_udp_rmem_min = PAGE_SIZE; net->ipv4.sysctl_udp_wmem_min = PAGE_SIZE; diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index
f7e0248a00bc..5dc069d5ad1e 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1747,7 +1747,7 @@ struct proto udpv6_prot = { .sysctl_wmem_offset = offsetof(struct net, ipv4.sysctl_udp_wmem_min), .sysctl_rmem_offset = offsetof(struct net, ipv4.sysctl_udp_rmem_min), .obj_size = sizeof(struct udp6_sock), - .h.udp_table = &udp_table, + .h.udp_table = NULL, .diag_destroy = udp_abort, }; From patchwork Fri Aug 26 00:04:43 2022 X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955404 From: Kuniyuki Iwashima To: "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima Subject: [PATCH v1 net-next 11/13] udp: Set NULL to udp_seq_afinfo.udp_table.
Date: Thu, 25 Aug 2022 17:04:43 -0700 Message-ID: <20220826000445.46552-12-kuniyu@amazon.com> In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> X-Mailing-List: linux-fsdevel@vger.kernel.org We will soon introduce an optional per-netns hash table for UDP. This means we cannot use the global udp_seq_afinfo.udp_table to fetch a UDP hash table. Instead, set udp_seq_afinfo.udp_table to NULL for UDP and get the proper table from net->ipv4.udp_table. Note that we still need udp_seq_afinfo.udp_table for UDP-Lite. Signed-off-by: Kuniyuki Iwashima --- net/ipv4/udp.c | 32 ++++++++++++++++++++++++-------- net/ipv6/udp.c | 2 +- 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 1bfae1fbe682..17b4b175fc67 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -136,6 +136,12 @@ static struct udp_table *udp_get_table_prot(struct sock *sk) return sk->sk_prot->h.udp_table ? : sock_net(sk)->ipv4.udp_table; } +static struct udp_table *udp_get_table_afinfo(struct udp_seq_afinfo *afinfo, + struct net *net) +{ + return afinfo->udp_table ? : net->ipv4.udp_table; +} + static int udp_lib_lport_inuse(struct net *net, __u16 num, const struct udp_hslot *hslot, unsigned long *bitmap, @@ -2978,6 +2984,7 @@ static struct sock *udp_get_first(struct seq_file *seq, int start) struct udp_iter_state *state = seq->private; struct net *net = seq_file_net(seq); struct udp_seq_afinfo *afinfo; + struct udp_table *udptable; struct sock *sk; if (state->bpf_seq_afinfo) @@ -2985,9 +2992,11 @@ static struct sock *udp_get_first(struct seq_file *seq, int start) else afinfo = pde_data(file_inode(seq->file)); - for (state->bucket = start; state->bucket <= afinfo->udp_table->mask; + udptable = udp_get_table_afinfo(afinfo, net); + + for (state->bucket = start; state->bucket <= udptable->mask; ++state->bucket) { - struct udp_hslot *hslot = &afinfo->udp_table->hash[state->bucket]; + struct udp_hslot *hslot = &udptable->hash[state->bucket]; if (hlist_empty(&hslot->head)) continue; @@ -3012,6 +3021,7 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk) struct udp_iter_state *state = seq->private; struct net *net = seq_file_net(seq); struct udp_seq_afinfo *afinfo; + struct udp_table *udptable; if (state->bpf_seq_afinfo) afinfo = state->bpf_seq_afinfo; @@ -3025,8 +3035,11 @@ static struct sock *udp_get_next(struct seq_file *seq, struct sock *sk) sk->sk_family != afinfo->family))); if (!sk) { - if (state->bucket <= afinfo->udp_table->mask) - spin_unlock_bh(&afinfo->udp_table->hash[state->bucket].lock); + udptable = udp_get_table_afinfo(afinfo, net); + + if (state->bucket <= udptable->mask) + spin_unlock_bh(&udptable->hash[state->bucket].lock); + + return udp_get_first(seq, state->bucket + 1); } return sk; @@ -3069,14 +3082,17 @@ void udp_seq_stop(struct seq_file *seq, void *v) { struct udp_iter_state *state = seq->private; struct udp_seq_afinfo *afinfo; + struct udp_table *udptable; if (state->bpf_seq_afinfo) afinfo = state->bpf_seq_afinfo; else afinfo = pde_data(file_inode(seq->file)); - if (state->bucket <= afinfo->udp_table->mask) - spin_unlock_bh(&afinfo->udp_table->hash[state->bucket].lock); + udptable = udp_get_table_afinfo(afinfo, seq_file_net(seq)); + + if (state->bucket <= udptable->mask) +
spin_unlock_bh(&udptable->hash[state->bucket].lock); } EXPORT_SYMBOL(udp_seq_stop); @@ -3189,7 +3205,7 @@ EXPORT_SYMBOL(udp_seq_ops); static struct udp_seq_afinfo udp4_seq_afinfo = { .family = AF_INET, - .udp_table = &udp_table, + .udp_table = NULL, }; static int __net_init udp4_proc_init_net(struct net *net) @@ -3309,7 +3325,7 @@ static int bpf_iter_init_udp(void *priv_data, struct bpf_iter_aux_info *aux) return -ENOMEM; afinfo->family = AF_UNSPEC; - afinfo->udp_table = &udp_table; + afinfo->udp_table = NULL; st->bpf_seq_afinfo = afinfo; ret = bpf_iter_init_seq_net(priv_data, aux); if (ret) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 5dc069d5ad1e..8abf16c5a8a1 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -1697,7 +1697,7 @@ EXPORT_SYMBOL(udp6_seq_ops); static struct udp_seq_afinfo udp6_seq_afinfo = { .family = AF_INET6, - .udp_table = &udp_table, + .udp_table = NULL, }; int __net_init udp6_proc_init(struct net *net) From patchwork Fri Aug 26 00:04:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 12955405 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 76CB7ECAAD1 for ; Fri, 26 Aug 2022 00:09:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S244549AbiHZAJ0 (ORCPT ); Thu, 25 Aug 2022 20:09:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36612 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S244609AbiHZAIs (ORCPT ); Thu, 25 Aug 2022 20:08:48 -0400 Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F2043C926D; Thu, 25 Aug 2022 17:08:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1661472520; x=1693008520; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=W7V0jySFfeHCIADKO7KQy2sISBJJhxiYLxE6HrEK8wk=; b=QlYwOE90/I8jEWbQ9nV2Ezw3KDBYvmmzYmeDVMLO/LlM2ILW8gbsE4Jb s5XtJicKQe2jMZFV9n2JMNlSlFGoKnqBsv1PlV/pdS3VBDW5xqjU20I5A lPh9ely6PDk4Hv8vX0QXqMl4oESVzNau0fXGxKuLLXGNHB9ccwYuSZcV8 o=; X-IronPort-AV: E=Sophos;i="5.93,264,1654560000"; d="scan'208";a="253012902" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO email-inbound-relay-pdx-2b-2520d768.us-west-2.amazon.com) ([10.25.36.210]) by smtp-border-fw-9102.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Aug 2022 00:08:38 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-pdx-2b-2520d768.us-west-2.amazon.com (Postfix) with ESMTPS id B60D145458; Fri, 26 Aug 2022 00:08:37 +0000 (UTC) Received: from EX19D004ANA001.ant.amazon.com (10.37.240.138) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.38; Fri, 26 Aug 2022 00:08:37 +0000 Received: from 88665a182662.ant.amazon.com (10.43.162.140) by EX19D004ANA001.ant.amazon.com (10.37.240.138) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA) id 15.2.1118.12; Fri, 26 Aug 2022 00:08:33 +0000 From: Kuniyuki Iwashima To: "David S. 
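One subtlety in the seq_file changes above: udp_get_first() locks a bucket that udp_seq_stop() later unlocks, so both callbacks must resolve the *same* table. Routing every access through one helper, as udp_get_table_afinfo() does, keeps them in agreement. A compressed sketch of that lock-across-callbacks shape, with pthread mutexes standing in for the kernel's per-bucket spinlocks and all names illustrative:

    #include <pthread.h>
    #include <stdio.h>

    #define NBUCKETS 4

    struct bucket { pthread_mutex_t lock; /* + an hlist head in the kernel */ };
    struct table { struct bucket hash[NBUCKETS]; };

    static struct table pernet_table;

    /* every callback resolves the table the same way, so the bucket
     * unlocked in stop() is the one locked in first()/next()
     */
    static struct table *resolve_table(void) { return &pernet_table; }

    struct iter_state { int bucket; };

    static void iter_start(struct iter_state *st)
    {
            st->bucket = 0;
            pthread_mutex_lock(&resolve_table()->hash[st->bucket].lock);
    }

    static void iter_stop(struct iter_state *st)
    {
            if (st->bucket < NBUCKETS)
                    pthread_mutex_unlock(&resolve_table()->hash[st->bucket].lock);
    }

    int main(void)
    {
            struct iter_state st;
            int i;

            for (i = 0; i < NBUCKETS; i++)
                    pthread_mutex_init(&pernet_table.hash[i].lock, NULL);

            iter_start(&st);
            /* ... walk the sockets in the locked bucket here ... */
            iter_stop(&st);
            puts("iterated one bucket under its lock");
            return 0;
    }

If first() and stop() ever resolved different tables, stop() would unlock a lock it never took; funnelling both through one resolver removes that failure mode by construction.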
Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jeff Layton , Chuck Lever , Luis Chamberlain , Kees Cook , Iurii Zaikin CC: Kuniyuki Iwashima , Kuniyuki Iwashima , , Subject: [PATCH v1 net-next 12/13] udp: Access &udp_table via net. Date: Thu, 25 Aug 2022 17:04:44 -0700 Message-ID: <20220826000445.46552-13-kuniyu@amazon.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com> References: <20220826000445.46552-1-kuniyu@amazon.com> MIME-Version: 1.0 X-Originating-IP: [10.43.162.140] X-ClientProxiedBy: EX13D10UWB001.ant.amazon.com (10.43.161.111) To EX19D004ANA001.ant.amazon.com (10.37.240.138) Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org We will soon introduce an optional per-netns hash table for UDP. This means we cannot use udp_table directly in most places. Instead, access it via net->ipv4.udp_table. The access will be valid only while initialising udp_table itself and creating/destroying each netns. Signed-off-by: Kuniyuki Iwashima --- net/core/filter.c | 4 ++-- net/ipv4/udp.c | 23 +++++++++++++---------- net/ipv4/udp_diag.c | 6 +++--- net/ipv4/udp_offload.c | 5 +++-- net/ipv6/udp.c | 19 +++++++++++-------- net/ipv6/udp_offload.c | 5 +++-- 6 files changed, 35 insertions(+), 27 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index dd5844033b66..38b247f5de24 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -6484,7 +6484,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, else sk = __udp4_lib_lookup(net, src4, tuple->ipv4.sport, dst4, tuple->ipv4.dport, - dif, sdif, &udp_table, NULL); + dif, sdif, net->ipv4.udp_table, NULL); #if IS_ENABLED(CONFIG_IPV6) } else { struct in6_addr *src6 = (struct in6_addr *)&tuple->ipv6.saddr; @@ -6500,7 +6500,7 @@ static struct sock *sk_lookup(struct net *net, struct bpf_sock_tuple *tuple, src6, tuple->ipv6.sport, dst6, tuple->ipv6.dport, dif, sdif, - &udp_table, NULL); + net->ipv4.udp_table, NULL); #endif } diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 17b4b175fc67..f4825e38762a 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -478,7 +478,7 @@ static struct sock *udp4_lookup_run_bpf(struct net *net, struct sock *sk, *reuse_sk; bool no_reuseport; - if (udptable != &udp_table) + if (udptable != net->ipv4.udp_table) return NULL; /* only UDP is supported */ no_reuseport = bpf_sk_lookup_run_v4(net, IPPROTO_UDP, saddr, sport, @@ -559,10 +559,11 @@ struct sock *udp4_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport) { const struct iphdr *iph = ip_hdr(skb); + struct net *net = dev_net(skb->dev); - return __udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport, + return __udp4_lib_lookup(net, iph->saddr, sport, iph->daddr, dport, inet_iif(skb), - inet_sdif(skb), &udp_table, NULL); + inet_sdif(skb), net->ipv4.udp_table, NULL); } /* Must be called under rcu_read_lock(). 
@@ -575,7 +576,7 @@ struct sock *udp4_lib_lookup(struct net *net, __be32 saddr, __be16 sport, struct sock *sk; sk = __udp4_lib_lookup(net, saddr, sport, daddr, dport, - dif, 0, &udp_table, NULL); + dif, 0, net->ipv4.udp_table, NULL); if (sk && !refcount_inc_not_zero(&sk->sk_refcnt)) sk = NULL; return sk; @@ -810,7 +811,7 @@ int __udp4_lib_err(struct sk_buff *skb, u32 info, struct udp_table *udptable) int udp_err(struct sk_buff *skb, u32 info) { - return __udp4_lib_err(skb, info, &udp_table); + return __udp4_lib_err(skb, info, dev_net(skb->dev)->ipv4.udp_table); } /* @@ -2535,13 +2536,14 @@ static struct sock *__udp4_lib_mcast_demux_lookup(struct net *net, __be16 rmt_port, __be32 rmt_addr, int dif, int sdif) { + struct udp_table *udptable = net->ipv4.udp_table; unsigned short hnum = ntohs(loc_port); struct sock *sk, *result; struct udp_hslot *hslot; unsigned int slot; - slot = udp_hashfn(net, hnum, udp_table.mask); - hslot = &udp_table.hash[slot]; + slot = udp_hashfn(net, hnum, udptable->mask); + hslot = &udptable->hash[slot]; /* Do not bother scanning a too big list */ if (hslot->count > 10) @@ -2569,6 +2571,7 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net, __be16 rmt_port, __be32 rmt_addr, int dif, int sdif) { + struct udp_table *udptable = net->ipv4.udp_table; INET_ADDR_COOKIE(acookie, rmt_addr, loc_addr); unsigned short hnum = ntohs(loc_port); unsigned int hash2, slot2; @@ -2577,8 +2580,8 @@ static struct sock *__udp4_lib_demux_lookup(struct net *net, struct sock *sk; hash2 = ipv4_portaddr_hash(net, loc_addr, hnum); - slot2 = hash2 & udp_table.mask; - hslot2 = &udp_table.hash2[slot2]; + slot2 = hash2 & udptable->mask; + hslot2 = &udptable->hash2[slot2]; ports = INET_COMBINED_PORTS(rmt_port, hnum); udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { @@ -2660,7 +2663,7 @@ int udp_v4_early_demux(struct sk_buff *skb) int udp_rcv(struct sk_buff *skb) { - return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP); + return __udp4_lib_rcv(skb, dev_net(skb->dev)->ipv4.udp_table, IPPROTO_UDP); } void udp_destroy_sock(struct sock *sk) diff --git a/net/ipv4/udp_diag.c b/net/ipv4/udp_diag.c index 1ed8c4d78e5c..de3f2d31f510 100644 --- a/net/ipv4/udp_diag.c +++ b/net/ipv4/udp_diag.c @@ -147,13 +147,13 @@ static void udp_dump(struct udp_table *table, struct sk_buff *skb, static void udp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb, const struct inet_diag_req_v2 *r) { - udp_dump(&udp_table, skb, cb, r); + udp_dump(sock_net(cb->skb->sk)->ipv4.udp_table, skb, cb, r); } static int udp_diag_dump_one(struct netlink_callback *cb, const struct inet_diag_req_v2 *req) { - return udp_dump_one(&udp_table, cb, req); + return udp_dump_one(sock_net(cb->skb->sk)->ipv4.udp_table, cb, req); } static void udp_diag_get_info(struct sock *sk, struct inet_diag_msg *r, @@ -225,7 +225,7 @@ static int __udp_diag_destroy(struct sk_buff *in_skb, static int udp_diag_destroy(struct sk_buff *in_skb, const struct inet_diag_req_v2 *req) { - return __udp_diag_destroy(in_skb, req, &udp_table); + return __udp_diag_destroy(in_skb, req, sock_net(in_skb->sk)->ipv4.udp_table); } static int udplite_diag_destroy(struct sk_buff *in_skb, diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 6d1a4bec2614..aedde65e2268 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -600,10 +600,11 @@ static struct sock *udp4_gro_lookup_skb(struct sk_buff *skb, __be16 sport, __be16 dport) { const struct iphdr *iph = skb_gro_network_header(skb); + struct net *net = dev_net(skb->dev); - return 
__udp4_lib_lookup(dev_net(skb->dev), iph->saddr, sport, + return __udp4_lib_lookup(net, iph->saddr, sport, iph->daddr, dport, inet_iif(skb), - inet_sdif(skb), &udp_table, NULL); + inet_sdif(skb), net->ipv4.udp_table, NULL); } INDIRECT_CALLABLE_SCOPE diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 8abf16c5a8a1..739a8c86f0b6 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -203,7 +203,7 @@ static inline struct sock *udp6_lookup_run_bpf(struct net *net, struct sock *sk, *reuse_sk; bool no_reuseport; - if (udptable != &udp_table) + if (udptable != net->ipv4.udp_table) return NULL; /* only UDP is supported */ no_reuseport = bpf_sk_lookup_run_v6(net, IPPROTO_UDP, saddr, sport, @@ -284,10 +284,11 @@ struct sock *udp6_lib_lookup_skb(const struct sk_buff *skb, __be16 sport, __be16 dport) { const struct ipv6hdr *iph = ipv6_hdr(skb); + struct net *net = dev_net(skb->dev); - return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport, + return __udp6_lib_lookup(net, &iph->saddr, sport, &iph->daddr, dport, inet6_iif(skb), - inet6_sdif(skb), &udp_table, NULL); + inet6_sdif(skb), net->ipv4.udp_table, NULL); } /* Must be called under rcu_read_lock(). @@ -300,7 +301,7 @@ struct sock *udp6_lib_lookup(struct net *net, const struct in6_addr *saddr, __be struct sock *sk; sk = __udp6_lib_lookup(net, saddr, sport, daddr, dport, - dif, 0, &udp_table, NULL); + dif, 0, net->ipv4.udp_table, NULL); if (sk && !refcount_inc_not_zero(&sk->sk_refcnt)) sk = NULL; return sk; @@ -667,7 +668,8 @@ static __inline__ int udpv6_err(struct sk_buff *skb, struct inet6_skb_parm *opt, u8 type, u8 code, int offset, __be32 info) { - return __udp6_lib_err(skb, opt, type, code, offset, info, &udp_table); + return __udp6_lib_err(skb, opt, type, code, offset, info, + dev_net(skb->dev)->ipv4.udp_table); } static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb) @@ -1035,6 +1037,7 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net, __be16 rmt_port, const struct in6_addr *rmt_addr, int dif, int sdif) { + struct udp_table *udptable = net->ipv4.udp_table; unsigned short hnum = ntohs(loc_port); unsigned int hash2, slot2; struct udp_hslot *hslot2; @@ -1042,8 +1045,8 @@ static struct sock *__udp6_lib_demux_lookup(struct net *net, struct sock *sk; hash2 = ipv6_portaddr_hash(net, loc_addr, hnum); - slot2 = hash2 & udp_table.mask; - hslot2 = &udp_table.hash2[slot2]; + slot2 = hash2 & udptable->mask; + hslot2 = &udptable->hash2[slot2]; ports = INET_COMBINED_PORTS(rmt_port, hnum); udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) { @@ -1099,7 +1102,7 @@ void udp_v6_early_demux(struct sk_buff *skb) INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb) { - return __udp6_lib_rcv(skb, &udp_table, IPPROTO_UDP); + return __udp6_lib_rcv(skb, dev_net(skb->dev)->ipv4.udp_table, IPPROTO_UDP); } /* diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c index 7720d04ed396..e0e10f6bcdc1 100644 --- a/net/ipv6/udp_offload.c +++ b/net/ipv6/udp_offload.c @@ -116,10 +116,11 @@ static struct sock *udp6_gro_lookup_skb(struct sk_buff *skb, __be16 sport, __be16 dport) { const struct ipv6hdr *iph = skb_gro_network_header(skb); + struct net *net = dev_net(skb->dev); - return __udp6_lib_lookup(dev_net(skb->dev), &iph->saddr, sport, + return __udp6_lib_lookup(net, &iph->saddr, sport, &iph->daddr, dport, inet6_iif(skb), - inet6_sdif(skb), &udp_table, NULL); + inet6_sdif(skb), net->ipv4.udp_table, NULL); } INDIRECT_CALLABLE_SCOPE From patchwork Fri Aug 26 00:04:45 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
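A recurring shape in the patch above: callers that used the global &udp_table now derive the netns first (via dev_net(skb->dev) or sock_net(sk)) and take the table from there. Even the IPv6 paths read net->ipv4.udp_table, since one table serves both address families. A toy version of that resolution order with made-up types; skb, dev, and netns here are not the kernel's definitions:

    #include <stdio.h>

    struct udp_table { unsigned int mask; };
    struct netns { struct udp_table *udp_table; };  /* net->ipv4.udp_table */
    struct device { struct netns *nd_net; };
    struct pkt { struct device *dev; };             /* stands in for sk_buff */

    /* dev_net() analogue: the receiving device tells us which netns we are in */
    static struct netns *pkt_net(const struct pkt *p)
    {
            return p->dev->nd_net;
    }

    /* lookup takes the table as an argument instead of reading a global,
     * mirroring __udp4_lib_lookup()/__udp6_lib_lookup() after this patch
     */
    static void lookup(const struct udp_table *t)
    {
            printf("looking up in a table with %u buckets\n", t->mask + 1);
    }

    int main(void)
    {
            struct udp_table pernet = { .mask = 15 };
            struct netns ns = { .udp_table = &pernet };
            struct device dev = { .nd_net = &ns };
            struct pkt skb = { .dev = &dev };

            lookup(pkt_net(&skb)->udp_table);  /* was: lookup(&udp_table) */
            return 0;
    }

Threading the table through as a parameter is what lets the final patch swap in a per-netns table without touching the lookup logic again.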
From patchwork Fri Aug 26 00:04:45 2022
X-Patchwork-Id: 12955406
From: Kuniyuki Iwashima
Subject: [PATCH v1 net-next 13/13] udp: Introduce optional per-netns hash table.
Date: Thu, 25 Aug 2022 17:04:45 -0700
Message-ID: <20220826000445.46552-14-kuniyu@amazon.com>
In-Reply-To: <20220826000445.46552-1-kuniyu@amazon.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

We introduce an optional per-netns hash table for UDP. With a smaller,
dedicated hash table, we can look up sockets faster, isolate noisy
neighbours, and reduce lock contention.

The hash table size is controlled by a new sysctl knob. However, the
right size depends heavily on the workload and requires careful tuning,
so the feature is disabled by default (net.ipv4.udp_child_hash_entries
== 0). If the kernel fails to allocate enough memory for a new hash
table, the netns falls back to the global one.

The current hash table size can be checked via another read-only sysctl
knob, net.ipv4.udp_hash_entries; a negative value means the netns shares
the global hash table (the per-netns table is disabled or its allocation
failed).

In the future, we could further optimise hash table lookup/iteration by
dropping the netns comparison for per-netns tables, and optimise the
sparse udp_hslot layout by embedding it in udp_table.

Signed-off-by: Kuniyuki Iwashima
---
 Documentation/networking/ip-sysctl.rst | 20 ++++++++
 include/net/netns/ipv4.h               |  2 +
 net/ipv4/sysctl_net_ipv4.c             | 56 +++++++++++++++++++++
 net/ipv4/udp.c                         | 69 ++++++++++++++++++++++++++
 4 files changed, 147 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 97a0952b11e3..6dc4e2853e39 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1090,6 +1090,26 @@ udp_rmem_min - INTEGER
 udp_wmem_min - INTEGER
 	UDP does not have tx memory accounting and this tunable has no effect.
 
+udp_hash_entries - INTEGER
+	Read-only number of hash buckets for UDP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+udp_child_hash_entries - INTEGER
+	Control the number of hash buckets for UDP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	Any value written other than 0 is rounded up to the next power of
+	2. 0 is a special value, meaning the child networking namespace
+	will share the initial networking namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.
+
+	Default: 0
+
 RAW variables
 =============
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c367da5d61e2..a1be7ebb7338 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -200,6 +200,8 @@ struct netns_ipv4 {
 
 	atomic_t dev_addr_genid;
 
+	unsigned int sysctl_udp_child_hash_entries;
+
 #ifdef CONFIG_SYSCTL
 	unsigned long *sysctl_local_reserved_ports;
 	int sysctl_ip_prot_sock;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 03a3187c4705..b3cea3f36463 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -424,6 +424,47 @@ static int proc_tcp_child_ehash_entries(struct ctl_table *table, int write,
 	return 0;
 }
 
+static int proc_udp_hash_entries(struct ctl_table *table, int write,
+				 void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_udp_child_hash_entries);
+	int udp_hash_entries;
+	struct ctl_table tbl;
+
+	udp_hash_entries = net->ipv4.udp_table->mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global udp_table.
+	 */
+	if (!net_eq(net, &init_net) && net->ipv4.udp_table == &udp_table)
+		udp_hash_entries *= -1;
+
+	tbl.data = &udp_hash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
+static int proc_udp_child_hash_entries(struct ctl_table *table, int write,
+				       void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int udp_child_hash_entries;
+	int ret;
+
+	ret = proc_douintvec(table, write, buffer, lenp, ppos);
+	if (!write || ret)
+		return ret;
+
+	udp_child_hash_entries = READ_ONCE(*(unsigned int *)table->data);
+	if (udp_child_hash_entries)
+		udp_child_hash_entries = roundup_pow_of_two(udp_child_hash_entries);
+
+	WRITE_ONCE(*(unsigned int *)table->data, udp_child_hash_entries);
+
+	return 0;
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1378,6 +1419,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_INT_MAX,
 	},
+	{
+		.procname = "udp_hash_entries",
+		.data = &init_net.ipv4.sysctl_udp_child_hash_entries,
+		.mode = 0444,
+		.proc_handler = proc_udp_hash_entries,
+	},
+	{
+		.procname = "udp_child_hash_entries",
+		.data = &init_net.ipv4.sysctl_udp_child_hash_entries,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_udp_child_hash_entries,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_INT_MAX,
+	},
 	{
 		.procname = "udp_rmem_min",
 		.data = &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4825e38762a..c41306225305 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -3309,8 +3309,77 @@ static int __net_init udp_sysctl_init(struct net *net)
 	return 0;
 }
 
+static struct udp_table __net_init *udp_pernet_table_alloc(unsigned int hash_entries)
+{
+	struct udp_table *udptable;
+	int i;
+
+	udptable = kmalloc(sizeof(*udptable), GFP_KERNEL);
+	if (!udptable)
+		goto out;
+
+	udptable->hash = kvmalloc_array(hash_entries * 2,
+					sizeof(struct udp_hslot), GFP_KERNEL);
+	if (!udptable->hash)
+		goto free_table;
+
+	udptable->hash2 = udptable->hash + hash_entries;
+	udptable->mask = hash_entries - 1;
+	udptable->log = ilog2(hash_entries);
+
+	for (i = 0; i < hash_entries; i++) {
+		INIT_HLIST_HEAD(&udptable->hash[i].head);
+		udptable->hash[i].count = 0;
+		spin_lock_init(&udptable->hash[i].lock);
+
+		INIT_HLIST_HEAD(&udptable->hash2[i].head);
+		udptable->hash2[i].count = 0;
+		spin_lock_init(&udptable->hash2[i].lock);
+	}
+
+	return udptable;
+
+free_table:
+	kfree(udptable);
+out:
+	return NULL;
+}
+
+static int __net_init udp_pernet_table_init(struct net *net, struct net *old_net)
+{
+	struct udp_table *udptable;
+	unsigned int hash_entries;
+
+	hash_entries = READ_ONCE(old_net->ipv4.sysctl_udp_child_hash_entries);
+	if (!hash_entries)
+		goto out;
+
+	udptable = udp_pernet_table_alloc(hash_entries);
+	if (udptable)
+		net->ipv4.udp_table = udptable;
+	else
+		pr_warn("Failed to allocate UDP hash table (entries: %u) "
+			"for a netns, falling back to the global one\n",
+			hash_entries);
+out:
+	return 0;
+}
+
+static void __net_exit udp_pernet_table_free(struct net *net)
+{
+	struct udp_table *udptable = net->ipv4.udp_table;
+
+	if (udptable == &udp_table)
+		return;
+
+	kvfree(udptable->hash);
+	kfree(udptable);
+}
+
 static struct pernet_operations __net_initdata udp_sysctl_ops = {
 	.init = udp_sysctl_init,
+	.init2 = udp_pernet_table_init,
+	.exit = udp_pernet_table_free,
 };
 
 #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
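udp_pernet_table_alloc() above packs both hash arrays into a single allocation: 2 * hash_entries slots, with hash2 starting at hash + hash_entries, and mask = hash_entries - 1, which relies on the sysctl handler having already rounded the entry count up to a power of two. A standalone sketch of that layout, using plain malloc/calloc in place of kmalloc/kvmalloc_array and illustrative names throughout:

    #include <stdio.h>
    #include <stdlib.h>

    struct hslot { int count; /* + an hlist head and a spinlock in the kernel */ };

    struct table {
            struct hslot *hash;   /* hashed by local port */
            struct hslot *hash2;  /* hashed by local port and address */
            unsigned int mask;
    };

    static struct table *table_alloc(unsigned int entries) /* must be 2^n */
    {
            struct table *t = malloc(sizeof(*t));
            unsigned int i;

            if (!t)
                    return NULL;

            /* one allocation carries both logical arrays */
            t->hash = calloc(entries * 2, sizeof(*t->hash));
            if (!t->hash) {
                    free(t);
                    return NULL;
            }
            t->hash2 = t->hash + entries;
            t->mask = entries - 1; /* valid only because entries is a power of 2 */

            for (i = 0; i < entries; i++) {
                    t->hash[i].count = 0;
                    t->hash2[i].count = 0;
            }
            return t;
    }

    int main(void)
    {
            struct table *t = table_alloc(16);

            if (!t)
                    return 1;
            printf("slot for port 53: %u\n", 53 & t->mask);
            free(t->hash); /* hash2 lives inside the same allocation */
            free(t);
            return 0;
    }

Operationally, per the documentation added above, the knob takes effect at namespace creation: write net.ipv4.udp_child_hash_entries in the parent, then clone() or unshare() the new netns, and confirm the result by reading its net.ipv4.udp_hash_entries.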