From patchwork Fri Oct 30 23:52:41 2015
X-Patchwork-Submitter: Linus Torvalds
X-Patchwork-Id: 7530061
Date: Fri, 30 Oct 2015 16:52:41 -0700
In-Reply-To: <20151030223317.GK22011@ZenIV.linux.org.uk>
Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect
 for sockets in accept(3)
From: Linus Torvalds
To: Al Viro
Cc: Eric Dumazet, David Miller, Stephen Hemminger, Network Development,
 David Howells, linux-fsdevel
List-ID: linux-fsdevel@vger.kernel.org
On Fri, Oct 30, 2015 at 3:33 PM, Al Viro wrote:
> On Fri, Oct 30, 2015 at 02:50:46PM -0700, Linus Torvalds wrote:
>
>> Anyway. This is a pretty simple patch, and I actually think that we
>> could just get rid of the "next_fd" logic entirely with this. That
>> would make this *patch* be more complicated, but it would make the
>> resulting *code* be simpler.
>
> Dropping next_fd would screw you in case of strictly sequential
> allocations...

I don't think it would matter in real life, since I don't really think
you have lots of fd's with strictly sequential behavior.

That said, the trivial "open lots of fds" benchmark would show it, so
I guess we can just keep it. The next_fd logic is not expensive or
complex, after all.

> Your point re power-of-two allocations is well-taken, but then I'm not
> sure that kzalloc() is good enough here.

Attached is an updated patch that just uses the regular bitmap
allocator and extends it to also have the bitmap of bitmaps. It
actually simplifies the patch, so I guess it's better this way.

Anyway, I've tested it all a bit more, and for a trivial worst-case
stress program that explicitly kills the next_fd logic by doing

	for (i = 0; i < 1000000; i++) {
		close(3);
		dup2(0, 3);
		if (dup(0) < 0)
			break;
	}

it takes it down from roughly 10s to 0.2s. So the patch is quite
noticeable on that kind of pattern. (A stand-alone version of this
loop follows the diffstat below.)

NOTE! You'll obviously need to increase your limits to actually be
able to do the above with lots of file descriptors.

I ran Eric's test-program too, and find_next_zero_bit() dropped to a
fraction of a percent. It's not entirely gone, but it's down in the
noise.

I really suspect this patch is "good enough" in reality, and I would
*much* rather do something like this than add a new non-POSIX flag
that people have to update their binaries for. I agree with Eric that
*some* people will do so, but it's still the wrong thing to do. Let's
just make performance with the normal semantics be good enough that we
don't need to play odd special games.

Eric?

                  Linus

 fs/file.c               | 39 +++++++++++++++++++++++++++++++++++----
 include/linux/fdtable.h |  2 ++
 2 files changed, 37 insertions(+), 4 deletions(-)

Tested-by: Eric Dumazet
Acked-by: Eric Dumazet
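For anyone who wants to reproduce the timing above, here is the stress
loop as a complete program. Only the loop body comes from the mail; the
includes, the main() wrapper and the error handling are illustrative
scaffolding added here. Per the NOTE in the mail, raise the fd limit
first (e.g. "ulimit -n 1100000"), since each successful dup() in the
loop permanently consumes one more descriptor.

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int i;

		for (i = 0; i < 1000000; i++) {
			close(3);	/* free fd 3: next_fd falls back to 3 */
			dup2(0, 3);	/* refill fd 3; dup2() leaves next_fd alone */
			if (dup(0) < 0) {	/* has to scan past every fd opened so far */
				perror("dup");
				break;
			}
		}
		return 0;
	}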
diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct. Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = alloc_fdmem(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
 	fdt->open_fds = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
+	data += nr / BITS_PER_BYTE;
+	fdt->full_fds_bits = data;
 
 	return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +470,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +507,7 @@ repeat:
 	fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
 
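To make the full_fds_bits idea concrete: each bit of the new bitmap
summarizes one BITS_PER_LONG-sized word of open_fds and is set only
while that word is completely full, so the allocator can skip saturated
regions a whole word at a time and bit-scans only the first word that
still has a hole. Below is a user-space sketch of that two-level
search; find_next_fd() and the update in set_open_fd() mirror the logic
in the patch, while this find_next_zero_bit() is a naive stand-in for
the kernel helper, and MAX_FDS, the global arrays and main() are
illustrative scaffolding.

	#include <stdio.h>

	#define BITS_PER_LONG	(8 * sizeof(long))
	#define MAX_FDS		1024

	static unsigned long open_fds[MAX_FDS / BITS_PER_LONG];	/* one bit per fd */
	static unsigned long full_fds_bits[1];			/* one bit per open_fds word */

	/* naive equivalent of the kernel's find_next_zero_bit() */
	static unsigned long find_next_zero_bit(const unsigned long *map,
						unsigned long size, unsigned long off)
	{
		for (; off < size; off++)
			if (!(map[off / BITS_PER_LONG] & (1UL << (off % BITS_PER_LONG))))
				return off;
		return size;
	}

	static unsigned long find_next_fd(unsigned long start)
	{
		unsigned long maxfd = MAX_FDS;
		unsigned long maxbit = maxfd / BITS_PER_LONG;
		unsigned long bitbit = start / BITS_PER_LONG;

		/* first level: skip whole words known to be full */
		bitbit = find_next_zero_bit(full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
		if (bitbit > maxfd)
			return maxfd;
		if (bitbit > start)
			start = bitbit;
		/* second level: bit-scan only from the first non-full word */
		return find_next_zero_bit(open_fds, maxfd, start);
	}

	static void set_open_fd(unsigned long fd)
	{
		open_fds[fd / BITS_PER_LONG] |= 1UL << (fd % BITS_PER_LONG);
		fd /= BITS_PER_LONG;
		if (!~open_fds[fd])	/* word just became all-ones: mark it full */
			full_fds_bits[fd / BITS_PER_LONG] |= 1UL << (fd % BITS_PER_LONG);
	}

	int main(void)
	{
		unsigned long fd;

		for (fd = 0; fd < 200; fd++)	/* mark fds 0..199 as open */
			set_open_fd(fd);
		printf("first free fd: %lu\n", find_next_fd(0));	/* prints 200 */
		return 0;
	}

In the stress loop, open_fds is nearly all ones, so the old
single-level scan re-walked on the order of i bits on every dup(),
while the summary bitmap cuts that to roughly i/BITS_PER_LONG
word-sized steps, which is consistent with the roughly 50x speedup
(10s down to 0.2s) reported above.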