From patchwork Fri Oct 30 21:50:46 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Linus Torvalds X-Patchwork-Id: 7529391 Return-Path: X-Original-To: patchwork-linux-fsdevel@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork1.web.kernel.org (Postfix) with ESMTP id DC1059F399 for ; Fri, 30 Oct 2015 21:50:58 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 0A16720480 for ; Fri, 30 Oct 2015 21:50:58 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id F36D120481 for ; Fri, 30 Oct 2015 21:50:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1031127AbbJ3Vut (ORCPT ); Fri, 30 Oct 2015 17:50:49 -0400 Received: from mail-lf0-f52.google.com ([209.85.215.52]:35604 "EHLO mail-lf0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754016AbbJ3Vus (ORCPT ); Fri, 30 Oct 2015 17:50:48 -0400 Received: by lfbn126 with SMTP id n126so40295027lfb.2; Fri, 30 Oct 2015 14:50:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=aou+B37WoWaYDhvKf48VJD1FRZUyT8H3wq70/gG93eU=; b=KAQO2ZkTBXFIn8ho4FiGABNXJI24FmZXpITtZ5A4+MKPgy2NNZY61gAQ1K7rqu4FPI 67jOmJ03aXOlyAVPJ1BYp+lil56Xdgub8pVoOeTzXlYg5obEj6V95r2IdPaf+OStsadp S+9bSlT+CIrNR6t+bPWOhPT62E0k25u784xHVhDPskqAluR0YfqIjp3AUqFsfTkvrLm1 KJBSQvQL32dOXRodv42VnPCZ+cEeGb3ptEqXMKlXJQSaPkE65Cp/6HaJFRj/hBlKiTWT 0vmGebUDE2FmSywqMOGo0u6Q6avEaVznfR/uuwpkUEswGPokwunO/tEPrZ+tbwSz7ahG 7sCg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:cc:content-type; bh=aou+B37WoWaYDhvKf48VJD1FRZUyT8H3wq70/gG93eU=; b=h6MtHpXAt6VrNCWdjvddt4hd8CuYy22BXRNs89fDyIrOO3xs3eFFMA4cwiAEljP574 zzg7Nyzua/7rN9J01sMAaytoFZmv2rJGH7qB1oTv4gy5LtSKy1s4PkX36JTz83BGWiIB NMkl1NO0d74rB0Dcx3cxKGdZfJE8zSmN6knt0= MIME-Version: 1.0 X-Received: by 10.112.13.226 with SMTP id k2mr5070589lbc.89.1446241846569; Fri, 30 Oct 2015 14:50:46 -0700 (PDT) Received: by 10.112.63.129 with HTTP; Fri, 30 Oct 2015 14:50:46 -0700 (PDT) In-Reply-To: References: <1446043677.7476.68.camel@edumazet-glaptop2.roam.corp.google.com> <20151028211347.GC22011@ZenIV.linux.org.uk> <1446068668.7476.89.camel@edumazet-glaptop2.roam.corp.google.com> <20151028223330.GD22011@ZenIV.linux.org.uk> <1446073709.7476.93.camel@edumazet-glaptop2.roam.corp.google.com> <20151029001532.GE22011@ZenIV.linux.org.uk> <1446089381.7476.114.camel@edumazet-glaptop2.roam.corp.google.com> <20151029041611.GF22011@ZenIV.linux.org.uk> <1446122119.7476.138.camel@edumazet-glaptop2.roam.corp.google.com> <20151030210215.GI22011@ZenIV.linux.org.uk> Date: Fri, 30 Oct 2015 14:50:46 -0700 X-Google-Sender-Auth: dYOM3ZbBOIuYa7deoEfUsB660Z0 Message-ID: Subject: Re: [Bug 106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3) From: Linus Torvalds To: Al Viro Cc: Eric Dumazet , David Miller , Stephen Hemminger , Network Development , David Howells , linux-fsdevel Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Spam-Status: No, score=-7.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD,T_DKIM_INVALID,T_TVD_MIME_EPI, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Fri, Oct 30, 2015 at 2:23 PM, Linus Torvalds wrote: > On Fri, Oct 30, 2015 at 2:02 PM, Al Viro wrote: >> >> Your variant has 1:64 ratio; obviously better than now, but we can actually >> do 1:bits-per-cacheline quite easily. > > Ok, but in that case you end up needing a counter for each cacheline > too (to count how many bits, in order to know when to say "cacheline > is entirely full"). So here's a largely untested version of my "one bit per word" approach. It seems to work, but looking at it, I'm unhappy with a few things: - using kmalloc() for the .full_fds_bits[] array is simple, but disgusting, since 99% of all programs just have a single word. I know I talked about just adding the allocation to the same one that allocates the bitmaps themselves, but I got lazy and didn't do it. Especially since that code seems to try fairly hard to make the allocations nice powers of two, according to the comments. That may actually matter from an allocation standpoint. - Maybe we could just use that "full_fds_bits_init" field for when a single word is sufficient, and avoid the kmalloc that way? Anyway. This is a pretty simple patch, and I actually think that we could just get rid of the "next_fd" logic entirely with this. That would make this *patch* be more complicated, but it would make the resulting *code* be simpler. Hmm? Want to play with this? Eric, what does this do to your test-case? Linus fs/file.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--- include/linux/fdtable.h | 2 ++ 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/fs/file.c b/fs/file.c index 6c672ad329e9..0b89a97cabb5 100644 --- a/fs/file.c +++ b/fs/file.c @@ -48,6 +48,7 @@ static void __free_fdtable(struct fdtable *fdt) { kvfree(fdt->fd); kvfree(fdt->open_fds); + kfree(fdt->full_fds_bits); kfree(fdt); } @@ -56,6 +57,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu) __free_fdtable(container_of(rcu, struct fdtable, rcu)); } +#define BITBIT_NR(nr) BITS_TO_LONGS(BITS_TO_LONGS(nr)) +#define BITBIT_SIZE(nr) (BITBIT_NR(nr) * sizeof(long)) + /* * Expand the fdset in the files_struct. Called with the files spinlock * held for write. @@ -77,6 +81,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt) memset((char *)(nfdt->open_fds) + cpy, 0, set); memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy); memset((char *)(nfdt->close_on_exec) + cpy, 0, set); + + cpy = BITBIT_SIZE(ofdt->max_fds); + set = BITBIT_SIZE(nfdt->max_fds) - cpy; + memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy); + memset(cpy+(char *)nfdt->full_fds_bits, 0, set); } static struct fdtable * alloc_fdtable(unsigned int nr) @@ -122,8 +131,21 @@ static struct fdtable * alloc_fdtable(unsigned int nr) data += nr / BITS_PER_BYTE; fdt->close_on_exec = data; + /* + * The "bitmap of bitmaps" has a bit set for each word in + * the open_fds array that is full. We initialize it to + * zero both at allocation and at copying, because it is + * important that it never have a bit set for a non-full + * word, but it doesn't have to be exact otherwise. + */ + fdt->full_fds_bits = kzalloc(BITBIT_SIZE(nr), GFP_KERNEL); + if (!fdt->full_fds_bits) + goto out_nofull; + return fdt; +out_nofull: + kvfree(fdt->open_fds); out_arr: kvfree(fdt->fd); out_fdt: @@ -229,14 +251,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt) __clear_bit(fd, fdt->close_on_exec); } -static inline void __set_open_fd(int fd, struct fdtable *fdt) +static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt) { __set_bit(fd, fdt->open_fds); + fd /= BITS_PER_LONG; + if (!~fdt->open_fds[fd]) + __set_bit(fd, fdt->full_fds_bits); } -static inline void __clear_open_fd(int fd, struct fdtable *fdt) +static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt) { __clear_bit(fd, fdt->open_fds); + __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits); } static int count_open_files(struct fdtable *fdt) @@ -280,6 +306,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) new_fdt->max_fds = NR_OPEN_DEFAULT; new_fdt->close_on_exec = newf->close_on_exec_init; new_fdt->open_fds = newf->open_fds_init; + new_fdt->full_fds_bits = newf->full_fds_bits_init; new_fdt->fd = &newf->fd_array[0]; spin_lock(&oldf->file_lock); @@ -323,6 +350,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp) memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8); memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8); + memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files)); for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; @@ -454,10 +482,25 @@ struct files_struct init_files = { .fd = &init_files.fd_array[0], .close_on_exec = init_files.close_on_exec_init, .open_fds = init_files.open_fds_init, + .full_fds_bits = init_files.full_fds_bits_init, }, .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock), }; +static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start) +{ + unsigned long maxfd = fdt->max_fds; + unsigned long maxbit = maxfd / BITS_PER_LONG; + unsigned long bitbit = start / BITS_PER_LONG; + + bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; + if (bitbit > maxfd) + return maxfd; + if (bitbit > start) + start = bitbit; + return find_next_zero_bit(fdt->open_fds, maxfd, start); +} + /* * allocate a file descriptor, mark it busy. */ @@ -476,7 +519,7 @@ repeat: fd = files->next_fd; if (fd < fdt->max_fds) - fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd); + fd = find_next_fd(fdt, fd); /* * N.B. For clone tasks sharing a files structure, this test diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index 674e3e226465..5295535b60c6 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -26,6 +26,7 @@ struct fdtable { struct file __rcu **fd; /* current fd array */ unsigned long *close_on_exec; unsigned long *open_fds; + unsigned long *full_fds_bits; struct rcu_head rcu; }; @@ -59,6 +60,7 @@ struct files_struct { int next_fd; unsigned long close_on_exec_init[1]; unsigned long open_fds_init[1]; + unsigned long full_fds_bits_init[1]; struct file __rcu * fd_array[NR_OPEN_DEFAULT]; };