From patchwork Wed Nov 18 10:47:45 2020
X-Patchwork-Submitter: Giuseppe Scrivano
X-Patchwork-Id: 11914763
From: Giuseppe Scrivano
To: linux-kernel@vger.kernel.org, christian.brauner@ubuntu.com
Cc: linux@rasmusvillemoes.dk, viro@zeniv.linux.org.uk,
 linux-fsdevel@vger.kernel.org, containers@lists.linux-foundation.org
Subject: [PATCH v3 1/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC
Date: Wed, 18 Nov 2020 11:47:45 +0100
Message-Id: <20201118104746.873084-2-gscrivan@redhat.com>
In-Reply-To: <20201118104746.873084-1-gscrivan@redhat.com>
References: <20201118104746.873084-1-gscrivan@redhat.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
immediately close the files but it sets the close-on-exec bit.

It is useful for e.g. container runtimes that usually install a
seccomp profile "as late as possible" before execv'ing the container
process itself.  The container runtime could either do:

                   1                                  2

- install_seccomp_profile();       - close_range(MIN_FD, MAX_INT, 0);
- close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
- execve(...);                     - execve(...);

Both alternatives have some disadvantages.

In the first variant the seccomp_profile cannot block the close_range
syscall, as well as opendir/read/close/... for the fallback on older
kernels.
In the second variant, close_range() can be used only on the fds
that are not going to be needed by the runtime anymore, and it must be
potentially called multiple times to account for the different ranges
that must be closed.

Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
The runtime is able to use the existing open fds, the seccomp profile
can block close_range() and the syscalls used for its fallback.

Signed-off-by: Giuseppe Scrivano
---
 fs/file.c                        | 44 ++++++++++++++++++++++++--------
 include/uapi/linux/close_range.h |  3 +++
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 21c0893f2f1d..69382580ae32 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -672,6 +672,35 @@ int __close_fd(struct files_struct *files, unsigned fd)
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+static inline void __range_cloexec(struct files_struct *cur_fds,
+				   unsigned int fd, unsigned int max_fd)
+{
+	struct fdtable *fdt;
+
+	if (fd > max_fd)
+		return;
+
+	spin_lock(&cur_fds->file_lock);
+	fdt = files_fdtable(cur_fds);
+	bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);
+	spin_unlock(&cur_fds->file_lock);
+}
+
+static inline void __range_close(struct files_struct *cur_fds, unsigned int fd,
+				 unsigned int max_fd)
+{
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(cur_fds, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, cur_fds);
+		cond_resched();
+	}
+}
+
 /**
  * __close_range() - Close all file descriptors in a given range.
  *
@@ -687,7 +716,7 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
 	struct task_struct *me = current;
 	struct files_struct *cur_fds = me->files, *fds = NULL;
 
-	if (flags & ~CLOSE_RANGE_UNSHARE)
+	if (flags & ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC))
 		return -EINVAL;
 
 	if (fd > max_fd)
@@ -725,16 +754,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
 	}
 
 	max_fd = min(max_fd, cur_max);
-	while (fd <= max_fd) {
-		struct file *file;
 
-		file = pick_file(cur_fds, fd++);
-		if (!file)
-			continue;
-
-		filp_close(file, cur_fds);
-		cond_resched();
-	}
+	if (flags & CLOSE_RANGE_CLOEXEC)
+		__range_cloexec(cur_fds, fd, max_fd);
+	else
+		__range_close(cur_fds, fd, max_fd);
 
 	if (fds) {
 		/*
diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
index 6928a9fdee3c..2d804281554c 100644
--- a/include/uapi/linux/close_range.h
+++ b/include/uapi/linux/close_range.h
@@ -5,5 +5,8 @@
 /* Unshare the file descriptor table before closing file descriptors. */
 #define CLOSE_RANGE_UNSHARE	(1U << 1)
 
+/* Set the FD_CLOEXEC bit instead of closing the file descriptor. */
+#define CLOSE_RANGE_CLOEXEC	(1U << 2)
+
 #endif /* _UAPI_LINUX_CLOSE_RANGE_H */

From patchwork Wed Nov 18 10:47:46 2020
X-Patchwork-Submitter: Giuseppe Scrivano
X-Patchwork-Id: 11914765
From: Giuseppe Scrivano
To: linux-kernel@vger.kernel.org, christian.brauner@ubuntu.com
Cc: linux@rasmusvillemoes.dk, viro@zeniv.linux.org.uk,
 linux-fsdevel@vger.kernel.org, containers@lists.linux-foundation.org
Subject: [PATCH v3 2/2] selftests: core: add tests for CLOSE_RANGE_CLOEXEC
Date: Wed, 18 Nov 2020 11:47:46 +0100
Message-Id: <20201118104746.873084-3-gscrivan@redhat.com>
In-Reply-To: <20201118104746.873084-1-gscrivan@redhat.com>
References: <20201118104746.873084-1-gscrivan@redhat.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Check that close_range(initial_fd, last_fd, CLOSE_RANGE_CLOEXEC)
correctly sets the close-on-exec bit for the specified file
descriptors.

Open 100 file descriptors and set the close-on-exec flag for a subset
of them first, then set it for every file descriptor above 2.

Make sure RLIMIT_NOFILE doesn't affect the result.

Signed-off-by: Giuseppe Scrivano
Reported-by: kernel test robot
---
 .../testing/selftests/core/close_range_test.c | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index c99b98b0d461..18992c383852 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 
 #include "../kselftest_harness.h"
 #include "../clone3/clone3_selftests.h"
@@ -23,6 +24,10 @@
 #define CLOSE_RANGE_UNSHARE	(1U << 1)
 #endif
 
+#ifndef CLOSE_RANGE_CLOEXEC
+#define CLOSE_RANGE_CLOEXEC	(1U << 2)
+#endif
+
 static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
 				  unsigned int flags)
 {
@@ -224,4 +229,73 @@ TEST(close_range_unshare_capped)
 	EXPECT_EQ(0, WEXITSTATUS(status));
 }
 
+TEST(close_range_cloexec)
+{
+	int i, ret;
+	int open_fds[101];
+	struct rlimit rlimit;
+
+	for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+		int fd;
+
+		fd = open("/dev/null", O_RDONLY);
+		ASSERT_GE(fd, 0) {
+			if (errno == ENOENT)
+				XFAIL(return, "Skipping test since /dev/null does not exist");
+		}
+
+		open_fds[i] = fd;
+	}
+
+	ret = sys_close_range(1000, 1000, CLOSE_RANGE_CLOEXEC);
+	if (ret < 0) {
+		if (errno == ENOSYS)
+			XFAIL(return, "close_range() syscall not supported");
+		if (errno == EINVAL)
+			XFAIL(return, "close_range() doesn't support CLOSE_RANGE_CLOEXEC");
+	}
+
+	/* Ensure the FD_CLOEXEC bit is set also with a resource limit in place. */
+	ASSERT_EQ(0, getrlimit(RLIMIT_NOFILE, &rlimit));
+	rlimit.rlim_cur = 25;
+	ASSERT_EQ(0, setrlimit(RLIMIT_NOFILE, &rlimit));
+
+	/* Set close-on-exec for two ranges: [0-50] and [75-100]. */
+	ret = sys_close_range(open_fds[0], open_fds[50], CLOSE_RANGE_CLOEXEC);
+	ASSERT_EQ(0, ret);
+	ret = sys_close_range(open_fds[75], open_fds[100], CLOSE_RANGE_CLOEXEC);
+	ASSERT_EQ(0, ret);
+
+	for (i = 0; i <= 50; i++) {
+		int flags = fcntl(open_fds[i], F_GETFD);
+
+		EXPECT_GT(flags, -1);
+		EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+	}
+
+	for (i = 51; i <= 74; i++) {
+		int flags = fcntl(open_fds[i], F_GETFD);
+
+		EXPECT_GT(flags, -1);
+		EXPECT_EQ(flags & FD_CLOEXEC, 0);
+	}
+
+	for (i = 75; i <= 100; i++) {
+		int flags = fcntl(open_fds[i], F_GETFD);
+
+		EXPECT_GT(flags, -1);
+		EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+	}
+
+	/* Test a common pattern. */
+	ret = sys_close_range(3, UINT_MAX, CLOSE_RANGE_CLOEXEC);
+	for (i = 0; i <= 100; i++) {
+		int flags = fcntl(open_fds[i], F_GETFD);
+
+		EXPECT_GT(flags, -1);
+		EXPECT_EQ(flags & FD_CLOEXEC, FD_CLOEXEC);
+	}
+}
+
+
 TEST_HARNESS_MAIN
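
For illustration only, and not part of the series: a minimal userspace
sketch of how a runtime might use the new flag before installing its
seccomp profile. The helper names set_cloexec_from() and
cloexec_fallback() are hypothetical. The syscall number 436 is an
assumption used only when the libc headers do not define
__NR_close_range, and the /proc/self/fd walk is one possible form of
the opendir/read/close fallback the first patch mentions for older
kernels.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 436 is used only if the libc headers lack the definition. */
#ifndef __NR_close_range
#define __NR_close_range 436
#endif

#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/*
 * Hypothetical fallback for kernels without CLOSE_RANGE_CLOEXEC:
 * walk /proc/self/fd and set FD_CLOEXEC on every fd >= first.
 */
static int cloexec_fallback(unsigned int first)
{
	DIR *dir = opendir("/proc/self/fd");
	struct dirent *de;

	if (!dir)
		return -1;

	while ((de = readdir(dir)) != NULL) {
		char *end;
		long fd = strtol(de->d_name, &end, 10);
		int flags;

		/* Skip "." and "..", and fds below the requested start. */
		if (end == de->d_name || *end != '\0' || fd < (long)first)
			continue;

		flags = fcntl((int)fd, F_GETFD);
		if (flags >= 0)
			fcntl((int)fd, F_SETFD, flags | FD_CLOEXEC);
	}

	closedir(dir);
	return 0;
}

/*
 * Hypothetical helper: mark every open fd >= first as close-on-exec.
 * Tries close_range() first and falls back on any failure.
 */
int set_cloexec_from(unsigned int first)
{
	if (syscall(__NR_close_range, first, ~0U, CLOSE_RANGE_CLOEXEC) == 0)
		return 0;

	return cloexec_fallback(first);
}
```

A runtime following the commit message would call set_cloexec_from(3)
first, then install its seccomp profile, then execve(). The fds stay
usable by the runtime until the exec, and the profile is free to block
close_range() and the fallback syscalls.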