Message ID | CA+55aFwfFb=LXU77AbiPDHgWcpBwTBoJB4EMCZgTgX32cxMYWw@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, Oct 30, 2015 at 04:52:41PM -0700, Linus Torvalds wrote:
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
>
> Eric?

IIRC, at least part of what Eric used to complain about was that in seriously
multithreaded processes doing a lot of e.g. socket(2), we end up with a lot of
bouncing of the cacheline containing the first free bits in the bitmap. But
looking at the whole thing, I really wonder whether tons of threads asking for
random bytes won't get at least as much cacheline bouncing while getting said
bytes, so I'm not sure that rationale has survived.

PS: this problem obviously exists in Linus' variant as well as in mine; the
question is whether Eric's approach manages to avoid it in the first place.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 2015-10-30 at 16:52 -0700, Linus Torvalds wrote:
sequential allocations...
> I don't think it would matter in real life, since I don't really think
> you have lots of fd's with strictly sequential behavior.
>
> That said, the trivial "open lots of fds" benchmark would show it, so
> I guess we can just keep it. The next_fd logic is not expensive or
> complex, after all.

+1

> Attached is an updated patch that just uses the regular bitmap
> allocator and extends it to also have the bitmap of bitmaps. It
> actually simplifies the patch, so I guess it's better this way.
>
> Anyway, I've tested it all a bit more, and for a trivial worst-case
> stress program that explicitly kills the next_fd logic by doing
>
>     for (i = 0; i < 1000000; i++) {
>         close(3);
>         dup2(0, 3);
>         if (dup(0) < 0)
>             break;
>     }
>
> it takes it down from roughly 10s to 0.2s. So the patch is quite
> noticeable on that kind of pattern.
>
> NOTE! You'll obviously need to increase your limits to actually be
> able to do the above with lots of file descriptors.
>
> I ran Eric's test-program too, and find_next_zero_bit() dropped to a
> fraction of a percent. It's not entirely gone, but it's down in the
> noise.
>
> I really suspect this patch is "good enough" in reality, and I would
> *much* rather do something like this than add a new non-POSIX flag
> that people have to update their binaries for. I agree with Eric that
> *some* people will do so, but it's still the wrong thing to do. Let's
> just make performance with the normal semantics be good enough that we
> don't need to play odd special games.
>
> Eric?

I absolutely agree a generic solution is far better, especially when its
performance is on par.
Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>

Note that a non-POSIX flag (or a thread personality hint) would still allow
the kernel to do proper NUMA affinity placement: say the fd_array and bitmaps
are split over the 2 nodes (or more, but most servers nowadays really have 2
sockets). Then at fd allocation time, we can prefer to pick an fd for which
the memory holding the various bits and the file pointer is on the local node.

This speeds up subsequent fd system calls for programs that constantly blow
away CPU caches, saving QPI transactions.

Thanks a lot, Linus.

lpaa24:~# taskset ff0ff ./opensock -t 16 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3992764

lpaa24:~# ./opensock -t 48 -n 10000000 -l 10
count=10000000 (check/increase ulimit -n)
total = 3545249

Profile with 16 threads:

    69.55%  opensock  [.] memset
    11.83%  [kernel]  [k] queued_spin_lock_slowpath
     1.91%  [kernel]  [k] _find_next_bit.part.0
     1.68%  [kernel]  [k] _raw_spin_lock
     0.99%  [kernel]  [k] kmem_cache_alloc
     0.99%  [kernel]  [k] memset_erms
     0.95%  [kernel]  [k] get_empty_filp
     0.82%  [kernel]  [k] __close_fd
     0.73%  [kernel]  [k] __alloc_fd
     0.65%  [kernel]  [k] sk_alloc
     0.63%  opensock  [.] child_function
     0.56%  [kernel]  [k] fput
     0.35%  [kernel]  [k] sock_alloc
     0.31%  [kernel]  [k] kmem_cache_free
     0.31%  [kernel]  [k] inode_init_always
     0.28%  [kernel]  [k] d_set_d_op
     0.27%  [kernel]  [k] entry_SYSCALL_64_after_swapgs

Profile with 48 threads:

    57.92%  [kernel]  [k] queued_spin_lock_slowpath
    32.14%  opensock  [.] memset
     0.81%  [kernel]  [k] _find_next_bit.part.0
     0.51%  [kernel]  [k] _raw_spin_lock
     0.45%  [kernel]  [k] kmem_cache_alloc
     0.38%  [kernel]  [k] kmem_cache_free
     0.34%  [kernel]  [k] __close_fd
     0.32%  [kernel]  [k] memset_erms
     0.25%  [kernel]  [k] __alloc_fd
     0.24%  [kernel]  [k] get_empty_filp
     0.23%  opensock  [.] child_function
     0.18%  [kernel]  [k] __d_alloc
     0.17%  [kernel]  [k] inode_init_always
     0.16%  [kernel]  [k] sock_alloc
     0.16%  [kernel]  [k] del_timer
     0.15%  [kernel]  [k] entry_SYSCALL_64_after_swapgs
     0.15%  perf      [.] 0x000000000004d924
     0.15%  [kernel]  [k] tcp_close
diff --git a/fs/file.c b/fs/file.c
index 6c672ad329e9..6f6eb2b03af5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -56,6 +56,9 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
 	__free_fdtable(container_of(rcu, struct fdtable, rcu));
 }
 
+#define BITBIT_NR(nr)	BITS_TO_LONGS(BITS_TO_LONGS(nr))
+#define BITBIT_SIZE(nr)	(BITBIT_NR(nr) * sizeof(long))
+
 /*
  * Expand the fdset in the files_struct.  Called with the files spinlock
  * held for write.
@@ -77,6 +80,11 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
 	memset((char *)(nfdt->open_fds) + cpy, 0, set);
 	memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
 	memset((char *)(nfdt->close_on_exec) + cpy, 0, set);
+
+	cpy = BITBIT_SIZE(ofdt->max_fds);
+	set = BITBIT_SIZE(nfdt->max_fds) - cpy;
+	memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
+	memset(cpy+(char *)nfdt->full_fds_bits, 0, set);
 }
 
 static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -115,12 +123,14 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
 	fdt->fd = data;
 
 	data = alloc_fdmem(max_t(size_t,
-				 2 * nr / BITS_PER_BYTE, L1_CACHE_BYTES));
+				 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES));
 	if (!data)
 		goto out_arr;
 	fdt->open_fds = data;
 	data += nr / BITS_PER_BYTE;
 	fdt->close_on_exec = data;
+	data += nr / BITS_PER_BYTE;
+	fdt->full_fds_bits = data;
 
 	return fdt;
 
@@ -229,14 +239,18 @@ static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 	__clear_bit(fd, fdt->close_on_exec);
 }
 
-static inline void __set_open_fd(int fd, struct fdtable *fdt)
+static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__set_bit(fd, fdt->open_fds);
+	fd /= BITS_PER_LONG;
+	if (!~fdt->open_fds[fd])
+		__set_bit(fd, fdt->full_fds_bits);
 }
 
-static inline void __clear_open_fd(int fd, struct fdtable *fdt)
+static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
 {
 	__clear_bit(fd, fdt->open_fds);
+	__clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
 }
 
 static int count_open_files(struct fdtable *fdt)
@@ -280,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	new_fdt->max_fds = NR_OPEN_DEFAULT;
 	new_fdt->close_on_exec = newf->close_on_exec_init;
 	new_fdt->open_fds = newf->open_fds_init;
+	new_fdt->full_fds_bits = newf->full_fds_bits_init;
 	new_fdt->fd = &newf->fd_array[0];
 
 	spin_lock(&oldf->file_lock);
@@ -323,6 +338,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 
 	memcpy(new_fdt->open_fds, old_fdt->open_fds, open_files / 8);
 	memcpy(new_fdt->close_on_exec, old_fdt->close_on_exec, open_files / 8);
+	memcpy(new_fdt->full_fds_bits, old_fdt->full_fds_bits, BITBIT_SIZE(open_files));
 
 	for (i = open_files; i != 0; i--) {
 		struct file *f = *old_fds++;
@@ -454,10 +470,25 @@ struct files_struct init_files = {
 		.fd		= &init_files.fd_array[0],
 		.close_on_exec	= init_files.close_on_exec_init,
 		.open_fds	= init_files.open_fds_init,
+		.full_fds_bits	= init_files.full_fds_bits_init,
 	},
 	.file_lock	= __SPIN_LOCK_UNLOCKED(init_files.file_lock),
 };
 
+static unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
+{
+	unsigned long maxfd = fdt->max_fds;
+	unsigned long maxbit = maxfd / BITS_PER_LONG;
+	unsigned long bitbit = start / BITS_PER_LONG;
+
+	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
+	if (bitbit > maxfd)
+		return maxfd;
+	if (bitbit > start)
+		start = bitbit;
+	return find_next_zero_bit(fdt->open_fds, maxfd, start);
+}
+
 /*
  * allocate a file descriptor, mark it busy.
  */
@@ -476,7 +507,7 @@ repeat:
 	fd = files->next_fd;
 
 	if (fd < fdt->max_fds)
-		fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
+		fd = find_next_fd(fdt, fd);
 
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 674e3e226465..5295535b60c6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -26,6 +26,7 @@ struct fdtable {
 	struct file __rcu **fd;      /* current fd array */
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
+	unsigned long *full_fds_bits;
 	struct rcu_head rcu;
 };
 
@@ -59,6 +60,7 @@ struct files_struct {
 	int next_fd;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
+	unsigned long full_fds_bits_init[1];
 	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
 };
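The core idea of the patch -- a second-level bitmap (`full_fds_bits`) in which each bit says "this word of `open_fds` is completely full", letting the search skip full words instead of scanning them bit by bit -- can be sketched in plain userspace C. Names such as `full_words` and `alloc_fd_sim` are ours, and the linear word scan stands in for the kernel's `find_next_zero_bit()`:

```c
#include <string.h>

#define MAX_FDS   4096
#define BITS      (8 * (int)sizeof(unsigned long))
#define WORDS     (MAX_FDS / BITS)
#define BITWORDS  ((WORDS + BITS - 1) / BITS)

static unsigned long open_fds[WORDS];       /* one bit per fd */
static unsigned long full_words[BITWORDS];  /* one bit per full open_fds word */

/* Two-level search: find the first open_fds word that is not marked
 * full, then the first zero bit inside it.  Full words are skipped in
 * one test each instead of BITS bit-tests. */
static int alloc_fd_sim(void)
{
    int w, b;

    for (w = 0; w < WORDS; w++)
        if (!(full_words[w / BITS] & (1UL << (w % BITS))))
            break;
    if (w == WORDS)
        return -1;                          /* table full */
    for (b = 0; b < BITS; b++)
        if (!(open_fds[w] & (1UL << b)))
            break;
    open_fds[w] |= 1UL << b;
    if (!~open_fds[w])                      /* word now full: mark it */
        full_words[w / BITS] |= 1UL << (w % BITS);
    return w * BITS + b;
}

static void free_fd_sim(int fd)
{
    int w = fd / BITS;

    open_fds[w] &= ~(1UL << (fd % BITS));
    full_words[w / BITS] &= ~(1UL << (w % BITS));  /* no longer full */
}
```

As in `__set_open_fd()`/`__clear_open_fd()` above, the second-level bit is set only when a word becomes all-ones and is cleared unconditionally on free, so the first non-full word always contains the lowest free fd.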