Message ID | 20240717145018.3972922-4-yu.ma@intel.com (mailing list archive) |
---|---|
State | New |
Series | fs/file.c: optimize the critical section of file_lock in |
On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>
> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> word contains next_fd, as:
> (1) next_fd indicates the lower bound for the first free fd.
> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.

this is stale -- now the fast path searches up to 64 fds in the lower bitmap

> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
>
> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> versions.

I think this paragraph is droppable. You already got an ack from Jan
below, so stating he agrees with the patch is redundant. As for me I
don't think this warrants mentioning. Just remove it, perhaps
Christian will be willing to massage it by himself to avoid another
series posting.

> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  fs/file.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/file.c b/fs/file.c
> index 1be2a5bcc7c4..729c07a4fc28 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>  	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>  	unsigned int maxbit = maxfd / BITS_PER_LONG;
>  	unsigned int bitbit = start / BITS_PER_LONG;
> +	unsigned int bit;
> +
> +	/*
> +	 * Try to avoid looking at the second level bitmap
> +	 */
> +	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> +				 start & (BITS_PER_LONG - 1));
> +	if (bit < BITS_PER_LONG)
> +		return bit + bitbit * BITS_PER_LONG;
>
>  	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>  	if (bitbit >= maxfd)
> --
> 2.43.0
>
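The fast path under discussion scans only the single 64-bit open_fds word that contains next_fd before touching the second-level bitmap. A minimal userspace sketch of that word-level scan, assuming hypothetical helper names (word_find_next_zero_bit and fast_path_fd are illustrative stand-ins, not the kernel's implementation):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define BITS_PER_LONG 64

/* Hypothetical stand-in for the kernel's find_next_zero_bit() restricted
 * to one word: returns the index of the first zero bit at or after
 * 'offset', or 'size' when every bit in range is set. */
static unsigned int word_find_next_zero_bit(uint64_t word, unsigned int size,
                                            unsigned int offset)
{
	for (unsigned int i = offset; i < size; i++)
		if (!(word & (1ULL << i)))
			return i;
	return size;
}

/* The patch's fast path: look only at the open_fds word containing
 * 'start'.  UINT_MAX here models "word is full, the caller must fall
 * back to the full_fds_bits search". */
static unsigned int fast_path_fd(const uint64_t *open_fds, unsigned int start)
{
	unsigned int bitbit = start / BITS_PER_LONG;
	unsigned int bit = word_find_next_zero_bit(open_fds[bitbit],
						   BITS_PER_LONG,
						   start & (BITS_PER_LONG - 1));
	if (bit < BITS_PER_LONG)
		return bit + bitbit * BITS_PER_LONG;
	return UINT_MAX;
}
```

Because the search size passed in is the constant BITS_PER_LONG, the real find_next_zero_bit() can take its small-size fast path, which is the point Yu Ma makes below.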
On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
>> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
>> word contains next_fd, as:
>> (1) next_fd indicates the lower bound for the first free fd.
>> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
> this is stale -- now the fast path searches up to 64 fds in the lower bitmap

Nope, this is still valid, as the searching size of the fast path inside
of find_next_fd() is always 64, it will execute the fast path inside of
find_next_zero_bit().

>
>> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>>
>> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
>> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
>> versions.
> I think this paragraph is droppable. You already got an ack from Jan
> below, so stating he agrees with the patch is redundant. As for me I
> don't think this warrants mentioning. Just remove it, perhaps
> Christian will be willing to massage it by himself to avoid another
> series posting.

The idea of fast path for the word contains next_fd is from you,
although this patch is small, I think it is reasonable to record here
out of my respect. Appreciate for your guide and comments on this patch,
I've learned a lot on the way of resolving problems :)


Regards

Yu

>> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
>> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
>>
>> Reviewed-by: Jan Kara <jack@suse.cz>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>>  fs/file.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 1be2a5bcc7c4..729c07a4fc28 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>  	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
>>  	unsigned int maxbit = maxfd / BITS_PER_LONG;
>>  	unsigned int bitbit = start / BITS_PER_LONG;
>> +	unsigned int bit;
>> +
>> +	/*
>> +	 * Try to avoid looking at the second level bitmap
>> +	 */
>> +	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
>> +				 start & (BITS_PER_LONG - 1));
>> +	if (bit < BITS_PER_LONG)
>> +		return bit + bitbit * BITS_PER_LONG;
>>
>>  	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>  	if (bitbit >= maxfd)
>> --
>> 2.43.0
>>
>
I think this is getting too much fluff traffic at this point, which is
partially my fault. I'm buggering off.

Overall the patchset looks good, I don't see any technical reasons to
avoid merging it.

On Sat, Jul 20, 2024 at 2:57 PM Ma, Yu <yu.ma@intel.com> wrote:
>
>
> On 7/20/2024 1:53 AM, Mateusz Guzik wrote:
> > On Wed, Jul 17, 2024 at 4:24 PM Yu Ma <yu.ma@intel.com> wrote:
> >> Skip 2-levels searching via find_next_zero_bit() when there is free slot in the
> >> word contains next_fd, as:
> >> (1) next_fd indicates the lower bound for the first free fd.
> >> (2) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> >> searching.
> > this is stale -- now the fast path searches up to 64 fds in the lower bitmap
>
> Nope, this is still valid, as the searching size of the fast path inside
> of find_next_fd() is always 64, it will execute the fast path inside of
> find_next_zero_bit().
>
> >
> >> (3) After fdt is expanded (the bitmap size doubled for each time of expansion),
> >> it would never be shrunk. The search size increases but there are few open fds
> >> available here.
> >>
> >> This fast path is proposed by Mateusz Guzik <mjguzik@gmail.com>, and agreed by
> >> Jan Kara <jack@suse.cz>, which is more generic and scalable than previous
> >> versions.
> > I think this paragraph is droppable. You already got an ack from Jan
> > below, so stating he agrees with the patch is redundant. As for me I
> > don't think this warrants mentioning. Just remove it, perhaps
> > Christian will be willing to massage it by himself to avoid another
> > series posting.
>
> The idea of fast path for the word contains next_fd is from you,
> although this patch is small, I think it is reasonable to record here
> out of my respect. Appreciate for your guide and comments on this patch,
> I've learned a lot on the way of resolving problems :)
>
>
> Regards
>
> Yu
>
> >> And on top of patch 1 and 2, it improves pts/blogbench-1.1.0 read by
> >> 8% and write by 4% on Intel ICX 160 cores configuration with v6.10-rc7.
> >>
> >> Reviewed-by: Jan Kara <jack@suse.cz>
> >> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> >> Signed-off-by: Yu Ma <yu.ma@intel.com>
> >> ---
> >>  fs/file.c | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/fs/file.c b/fs/file.c
> >> index 1be2a5bcc7c4..729c07a4fc28 100644
> >> --- a/fs/file.c
> >> +++ b/fs/file.c
> >> @@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >>  	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
> >>  	unsigned int maxbit = maxfd / BITS_PER_LONG;
> >>  	unsigned int bitbit = start / BITS_PER_LONG;
> >> +	unsigned int bit;
> >> +
> >> +	/*
> >> +	 * Try to avoid looking at the second level bitmap
> >> +	 */
> >> +	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
> >> +				 start & (BITS_PER_LONG - 1));
> >> +	if (bit < BITS_PER_LONG)
> >> +		return bit + bitbit * BITS_PER_LONG;
> >>
> >>  	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >>  	if (bitbit >= maxfd)
> >> --
> >> 2.43.0
> >>
> >
Hello,

kernel test robot noticed a 6.3% improvement of will-it-scale.per_thread_ops on:

commit: b8decf0015a8b1ff02cdac61c0aa54355d8e73d7 ("[PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()")
url: https://github.com/intel-lab-lkp/linux/commits/Yu-Ma/fs-file-c-remove-sanity_check-and-add-likely-unlikely-in-alloc_fd/20240717-224830
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20240717145018.3972922-4-yu.ma@intel.com/
patch subject: [PATCH v5 3/3] fs/file.c: add fast path in find_next_fd()

testcase: will-it-scale
test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory
parameters:

	nr_task: 100%
	mode: thread
	test: open3
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240806/202408062152.7e5b5d6d-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-13/performance/x86_64-rhel-8.3/thread/100%/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/open3/will-it-scale

commit:
  5bb3423bf9 ("fs/file.c: conditionally clear full_fds")
  b8decf0015 ("fs/file.c: add fast path in find_next_fd()")

5bb3423bf9f9d91e b8decf0015a8b1ff02cdac61c0a
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    848151            +6.2%     901119 ±  2%  will-it-scale.224.threads
      3785            +6.3%       4022 ±  2%  will-it-scale.per_thread_ops
    848151            +6.2%     901119 ±  2%  will-it-scale.workload
      0.28 ±  4%     +13.3%       0.32 ±  3%  perf-stat.i.MPKI
     31.31 ±  3%      +2.0       33.28        perf-stat.i.cache-miss-rate%
  14955855 ±  4%     +13.6%   16995785 ±  4%  perf-stat.i.cache-misses
  49676581            +6.7%   53009444 ±  3%  perf-stat.i.cache-references
     43955 ±  4%     -12.3%      38549 ±  4%  perf-stat.i.cycles-between-cache-misses
      0.28 ±  4%     +13.4%       0.32 ±  4%  perf-stat.overall.MPKI
     29.84 ±  3%      +1.9       31.78 ±  2%  perf-stat.overall.cache-miss-rate%
     43445 ±  4%     -12.1%      38200 ±  4%  perf-stat.overall.cycles-between-cache-misses
  19005976            -5.4%   17972604 ±  2%  perf-stat.overall.path-length
  14869677 ±  4%     +13.6%   16898438 ±  4%  perf-stat.ps.cache-misses
  49821402            +6.7%   53168235 ±  3%  perf-stat.ps.cache-references
     49.42            -0.1       49.34        perf-profile.calltrace.cycles-pp.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.40            -0.1       49.32        perf-profile.calltrace.cycles-pp.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
     49.25            -0.1       49.18        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64
     49.20            -0.1       49.13        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat
     49.33            -0.1       49.26        perf-profile.calltrace.cycles-pp._raw_spin_lock.file_close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.28            -0.1       49.22        perf-profile.calltrace.cycles-pp._raw_spin_lock.alloc_fd.do_sys_openat2.__x64_sys_openat.do_syscall_64
     50.14            +0.0       50.18        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
     50.17            +0.0       50.21        perf-profile.calltrace.cycles-pp.open64
      0.64 ±  5%      +0.1        0.75 ±  6%  perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.62 ±  5%      +0.1        0.74 ±  6%  perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
     49.42            -0.1       49.34        perf-profile.children.cycles-pp.alloc_fd
     49.40            -0.1       49.32        perf-profile.children.cycles-pp.file_close_fd
      0.06            -0.0        0.05        perf-profile.children.cycles-pp.file_close_fd_locked
      0.15 ±  5%      +0.0        0.17 ±  4%  perf-profile.children.cycles-pp.init_file
      0.22 ±  3%      +0.0        0.25 ±  3%  perf-profile.children.cycles-pp.alloc_empty_file
      0.18 ±  6%      +0.0        0.22 ±  6%  perf-profile.children.cycles-pp.__fput
     50.14            +0.0       50.18        perf-profile.children.cycles-pp.__x64_sys_openat
     50.18            +0.0       50.22        perf-profile.children.cycles-pp.open64
      0.18 ± 14%      +0.0        0.23 ±  7%  perf-profile.children.cycles-pp.do_dentry_open
      0.30 ±  8%      +0.1        0.36 ±  8%  perf-profile.children.cycles-pp.do_open
      0.64 ±  5%      +0.1        0.75 ±  6%  perf-profile.children.cycles-pp.do_filp_open
      0.63 ±  5%      +0.1        0.75 ±  6%  perf-profile.children.cycles-pp.path_openat
      0.06            -0.0        0.05        perf-profile.self.cycles-pp.file_close_fd_locked
      0.16 ±  2%      +0.0        0.18 ±  2%  perf-profile.self.cycles-pp._raw_spin_lock
      0.08 ± 12%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.__fput
      0.05 ±  7%      +0.1        0.10 ±  4%  perf-profile.self.cycles-pp.alloc_fd

Disclaimer: Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
diff --git a/fs/file.c b/fs/file.c
index 1be2a5bcc7c4..729c07a4fc28 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -491,6 +491,15 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
 	unsigned int maxfd = fdt->max_fds; /* always multiple of BITS_PER_LONG */
 	unsigned int maxbit = maxfd / BITS_PER_LONG;
 	unsigned int bitbit = start / BITS_PER_LONG;
+	unsigned int bit;
+
+	/*
+	 * Try to avoid looking at the second level bitmap
+	 */
+	bit = find_next_zero_bit(&fdt->open_fds[bitbit], BITS_PER_LONG,
+				 start & (BITS_PER_LONG - 1));
+	if (bit < BITS_PER_LONG)
+		return bit + bitbit * BITS_PER_LONG;
 
 	bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
 	if (bitbit >= maxfd)
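The whole find_next_fd() flow in the diff above — the new single-word fast path plus the existing two-level fallback through full_fds_bits (one summary bit per BITS_PER_LONG-sized word of open_fds, set when that word is completely full) — can be modeled in userspace. This is an illustrative sketch under assumed names (find_zero and find_next_fd_sketch are simplified stand-ins for the kernel's find_next_zero_bit() and find_next_fd()), not the kernel code:

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG 64

/* Simplified stand-in for find_next_zero_bit(): first zero bit in 'map'
 * at or after 'offset', or 'nbits' if none is found in range. */
static unsigned int find_zero(const uint64_t *map, unsigned int nbits,
			      unsigned int offset)
{
	for (unsigned int i = offset; i < nbits; i++)
		if (!(map[i / BITS_PER_LONG] & (1ULL << (i % BITS_PER_LONG))))
			return i;
	return nbits;
}

/* Toy model of the patched search; returns 'maxfd' when no fd is free. */
static unsigned int find_next_fd_sketch(const uint64_t *open_fds,
					const uint64_t *full_fds_bits,
					unsigned int maxfd, unsigned int start)
{
	unsigned int maxbit = maxfd / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;

	/* Fast path: scan only the open_fds word containing 'start'. */
	unsigned int bit = find_zero(&open_fds[bitbit], BITS_PER_LONG,
				     start % BITS_PER_LONG);
	if (bit < BITS_PER_LONG)
		return bit + bitbit * BITS_PER_LONG;

	/* Slow path: find the first not-full word via the summary bitmap,
	 * then the first zero bit inside it. */
	bitbit = find_zero(full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= maxfd)
		return maxfd;
	return find_zero(open_fds, maxfd, bitbit);
}
```

When the word containing next_fd has a free slot — the common case for workloads that open and close fds at low numbers — the summary bitmap is never touched, which is where the reported cache-miss and throughput improvements come from.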