Message ID: 20240614163416.728752-4-yu.ma@intel.com (mailing list archive)
State: New
Series: fs/file.c: optimize the critical section of
On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the

Total nitpick: FILE is the libc thing, I would refer to it as 'struct
file'. See below for the actual point.

> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  fs/file.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a0e94a178c0b..59d62909e2e3 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	else
>  		__clear_close_on_exec(fd, fdt);
>  	error = fd;
> -#if 1
> -	/* Sanity check */
> -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> -		rcu_assign_pointer(fdt->fd[fd], NULL);
> -	}
> -#endif
>

I was going to ask when was the last time anyone seen this fire and
suggest getting rid of it if enough time(tm) passed. Turns out it does
show up sometimes, latest result I found is 2017 vintage:
https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ

So you are moving this to another locked area, but one which does not
execute in the benchmark?

Patch 2/3 states 28% read and 14% write increase, this commit message
claims it goes up to 32% and 15% respectively -- pretty big. I presume
this has to do with bouncing a line containing the fd.

I would argue moving this check elsewhere is about as good as removing
it altogether, but that's for the vfs overlords to decide.

All that aside, looking at disasm of alloc_fd it is pretty clear there
is time to save, for example:

	if (unlikely(nr >= fdt->max_fds)) {
		if (fd >= end) {
			error = -EMFILE;
			goto out;
		}
		error = expand_files(fd, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}

This elides 2 branches and a func call in the common case. Completely
untested, maybe has some brainfarts, feel free to take without credit
and further massage the routine.

Moreover my disasm shows that even looking for a bit results in
a func call(!) to _find_next_zero_bit -- someone(tm) should probably
massage it into another inline.

After the above massaging is done and if it turns out the check has to
stay, you can plausibly damage-control it with prefetch -- issue it
immediately after finding the fd number, before any other work.

All that said, by the above I'm confident there is still *some*
performance left on the table despite the lock.

> out:
> 	spin_unlock(&files->file_lock);
> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
>  }
>  EXPORT_SYMBOL(get_unused_fd_flags);
>
> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  {
>  	struct fdtable *fdt = files_fdtable(files);
>  	__clear_open_fd(fd, fdt);
> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  void put_unused_fd(unsigned int fd)
>  {
>  	struct files_struct *files = current->files;
> +	struct fdtable *fdt = files_fdtable(files);
>  	spin_lock(&files->file_lock);
> +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> +		rcu_assign_pointer(fdt->fd[fd], NULL);
> +	}
>  	__put_unused_fd(files, fd);
>  	spin_unlock(&files->file_lock);
>  }
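[Editorial note: the control-flow change suggested above can be sketched in userspace. This is toy code with invented names (toy_fdtable, toy_alloc_fd, and so on), not the kernel implementation: the common case, where the found slot is both inside the table and under the limit, pays for neither the expand_files() branches nor the call.]

```c
#include <assert.h>
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define TOY_MAX_FDS 128

struct toy_fdtable {
	unsigned int max_fds;
	unsigned long open_fds[TOY_MAX_FDS / (sizeof(unsigned long) * CHAR_BIT)];
};

/* first zero bit at or after start; returns max_fds when the table is full */
static unsigned int toy_find_next_zero(const struct toy_fdtable *fdt,
				       unsigned int start)
{
	for (unsigned int fd = start; fd < fdt->max_fds; fd++)
		if (!(fdt->open_fds[fd / WORD_BITS] & (1UL << (fd % WORD_BITS))))
			return fd;
	return fdt->max_fds;
}

/* end is the snapshot of the rlimit; -1 stands in for -EMFILE,
 * -2 for "would need expand_files() plus a retry" */
static int toy_alloc_fd(struct toy_fdtable *fdt, unsigned int start,
			unsigned int end)
{
	unsigned int fd = toy_find_next_zero(fdt, start);

	/* the limit check runs regardless of max_fds (see the follow-up mail) */
	if (fd >= end)
		return -1;
	/* only the out-of-range case pays for the expand path */
	if (fd >= fdt->max_fds)
		return -2;
	fdt->open_fds[fd / WORD_BITS] |= 1UL << (fd % WORD_BITS);
	return (int)fd;
}
```

The real routine additionally retries after a successful expansion (`goto repeat`); the sketch only marks that path with a sentinel.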
On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
>
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
>
[...]
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> 	if (unlikely(nr >= fdt->max_fds)) {
> 		if (fd >= end) {
> 			error = -EMFILE;
> 			goto out;
> 		}
> 		error = expand_files(fd, fd);
> 		if (error < 0)
> 			goto out;
> 		if (error)
> 			goto repeat;
> 	}
>

Now that I wrote it I noticed the fd < end check has to be performed
regardless of max_fds -- someone could have changed rlimit to a lower
value after using a higher fd. But the main point stands: the call to
expand_files and associated error handling don't have to be there.

> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
[...]
On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.

Good point, not a nitpick at all ;) , will update the word in the commit
message.

[...]
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?

The consideration here, as mentioned, is to reduce the performance impact
(while keeping the sanity check and the same functionality) by moving it
from the critical path to a non-critical one, as put_unused_fd() is mostly
used for error handling when an fd is allocated successfully but the
struct file fails to be obtained in the next step.

[...]
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.

Thank you Guzik for such a quick check and the good suggestions :) Yes,
there are extra branches and a func call that can be reduced for better
performance. Considering the fix for the fast path, how about a flow as
drafted below (besides the massage to find_next_fd()):

	error = -EMFILE;
	if (fd < fdt->max_fds) {
		if (~fdt->open_fds[0]) {
			fd = find_next_zero_bit(fdt->open_fds,
						BITS_PER_LONG, fd);
			goto fastreturn;
		}
		fd = find_next_fd(fdt, fd);
	}

	if (unlikely(fd >= fdt->max_fds)) {
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}
fastreturn:
	if (unlikely(fd >= end))
		goto out;
	if (start <= files->next_fd)
		files->next_fd = fd + 1;

	....

[...]
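[Editorial note: the "first word" fast path in the draft above can be sketched in isolation. When any of the first BITS_PER_LONG fds is free, `~open_fds[0]` is non-zero and a single-word scan suffices, skipping the full find_next_fd() walk. Toy userspace code; `first_free_in_word()` is an invented helper standing in for the inlined single-word find_next_zero_bit.]

```c
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* index of the first zero bit in word, or WORD_BITS if the word is full */
static unsigned int first_free_in_word(unsigned long word)
{
	unsigned long free_mask = ~word;	/* the draft's ~fdt->open_fds[0] test */

	if (!free_mask)
		return WORD_BITS;		/* all low fds in use: take the slow path */
	return (unsigned int)__builtin_ctzl(free_mask);
}
```

Note `__builtin_ctzl` is undefined for a zero argument, hence the explicit full-word guard before it.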
On Sat, Jun 15, 2024 at 06:41:45AM GMT, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
>
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
>
[...]
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.

This all dates back to 9d11a5176cc5 ("I just put a pre-90 on
ftp.kernel.org, and I'm happy to report that Davem") which is pre-git.

I think removing is fine.
On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
>
> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
>
> The consideration here as mentioned is to reduce the performance impact (if
> to reserve the sanity check, and have the same functionality) by moving it
> from critical path to non-critical, as put_unused_fd() is mostly used for
> error handling when fd is allocated successfully but struct file failed to
> obtained in the next step.
>

As mentioned by Christian in his mail this check can just be removed.

> 	error = -EMFILE;
> 	if (fd < fdt->max_fds) {

I would likely() on it.

> 		if (~fdt->open_fds[0]) {
> 			fd = find_next_zero_bit(fdt->open_fds,
> 						BITS_PER_LONG, fd);
> 			goto fastreturn;
> 		}
> 		fd = find_next_fd(fdt, fd);
> 	}
>
> 	if (unlikely(fd >= fdt->max_fds)) {
> 		error = expand_files(files, fd);
> 		if (error < 0)
> 			goto out;
> 		if (error)
> 			goto repeat;
> 	}
> fastreturn:
> 	if (unlikely(fd >= end))
> 		goto out;
> 	if (start <= files->next_fd)
> 		files->next_fd = fd + 1;
>
> 	....
>

This is not a review, but it does read fine.

LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
tests, I would make sure they still pass before posting a v2.

I would definitely try moving out the lock to its own cacheline --
currently it resides with the bitmaps (and first 4 fds of the embedded
array).

I expect it to provide some win on top of the current patchset, but
whether it will be sufficient to justify it I have no idea.

Something of this sort:

diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2944d4aa413b..423cb599268a 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -50,11 +50,11 @@ struct files_struct {
    * written part on a separate cache line in SMP
    */
 	spinlock_t file_lock ____cacheline_aligned_in_smp;
-	unsigned int next_fd;
+	unsigned int next_fd ____cacheline_aligned_in_smp;
 	unsigned long close_on_exec_init[1];
 	unsigned long open_fds_init[1];
 	unsigned long full_fds_bits_init[1];
-	struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+	struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
 };

 struct file_operations;

(feel free to just take)

All that said, I have to note blogbench has numerous bugs. For example
it collects stats by merely bumping a global variable (not even with
atomics), so some updates are straight up lost.

To my understanding it was meant to test filesystems and I'm guessing
threading (instead of separate processes) was only used to make it
easier to share statistics. Given the current scales this
unintentionally transitioned into bottlenecking on the file_lock
instead.

There were scalability changes made about 9 years ago and according to
git log they were benchmarked by Eric Dumazet, while benchmark code was
not shared. I suggest CC'ing the guy with your v2 and asking if he can
bench. Even if he is no longer in position to do it perhaps he can rope
someone in (or even better share his benchmark).
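[Editorial note: the effect of the diff above can be illustrated in userspace with C11 `alignas` standing in for the kernel's `____cacheline_aligned_in_smp`. The struct below is an assumed stand-in, not the real files_struct: the point is that the frequently written `next_fd`/bitmap words stop sharing a cache line with the lock word, so writers of one no longer bounce the other.]

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64			/* typical x86 line size */

struct toy_files {
	int file_lock;			/* stand-in for spinlock_t */
	alignas(CACHELINE) unsigned int next_fd;
	unsigned long close_on_exec_init[1];
	unsigned long open_fds_init[1];
	unsigned long full_fds_bits_init[1];
	alignas(CACHELINE) void *fd_array[64];	/* stand-in for the embedded fd array */
};
```

With the alignment in place, `offsetof` confirms `next_fd` and `fd_array` each start on their own line, at the cost of some padding after `file_lock` and after the bitmap words.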
On 6/17/2024 7:23 PM, Mateusz Guzik wrote:
> On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
[...]
> As mentioned by Christian in his mail this check can just be removed.

Yes, that's great, I'll update in v2.

>> 	error = -EMFILE;
>> 	if (fd < fdt->max_fds) {
> I would likely() on it.

That's better :)

[...]
> This is not a review, but it does read fine.
>
> LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
> tests, I would make sure they still pass before posting a v2.

Got it, thanks for the kind reminder.

> I would definitely try moving out the lock to its own cacheline --
> currently it resides with the bitmaps (and first 4 fds of the embedded
> array).
>
> I expect it to provide some win on top of the current patchset, but
> whether it will be sufficient to justify it I have no idea.
>
[...]
>
> (feel free to just take)

Thanks Guzik for the thoughtful suggestions. We have previously tried
separating file_lock and next_fd onto different cachelines, but observed
no obvious improvement; we'll double check and verify these 2 tips to see
how it goes.

> All that said, I have to note blogbench has numerous bugs. For example
> it collects stats by merely bumping a global variable (not even with
> atomics), so some updates are straight up lost.
>
[...]
>
> There were scalability changes made about 9 years ago and according to
> git log they were benchmarked by Eric Dumazet, while benchmark code was
> not shared. I suggest CC'ing the guy with your v2 and asking if he can
> bench. Even if he is no longer in position to do it perhaps he can rope
> someone in (or even better share his benchmark).

Good advice. We also noticed the problems with blogbench; the current data
in the commits was collected with a revision that makes the variables used
for score statistics/reporting safe, and we will submit a patch to
blogbench with that update. We'll copy Eric for more information on his
benchmark and check the details if it is available.
On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> >
[...]
>
> Now that I wrote it I noticed the fd < end check has to be performed
> regardless of max_fds -- someone could have changed rlimit to a lower
> value after using a higher fd. But the main point stands: the call to
> expand_files and associated error handling don't have to be there.

To really prevent someone from mucking with rlimit, we should probably
take the task_lock to prevent do_prlimit() racing with this function.

	task_lock(current->group_leader);

Tim

[...]
On Mon, Jun 17, 2024 at 7:55 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
[...]
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
>
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
>
> 	task_lock(current->group_leader);
>

It's fine to race against rlimit adjustments. The problem here is that
both in my toy refactoring above and the posted patch the thread can
use a high fd, lower the rlimit on its own and not have it respected
on calls made later.

[...]
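[Editorial note: Guzik's point above, that racing with rlimit updates is fine as long as the allocator honors whichever value it observed, can be sketched as follows. The limit is read exactly once into a local snapshot and every later check uses that snapshot, never a re-read. Toy code; the volatile pointer is an invented stand-in for the task's concurrently updatable RLIMIT_NOFILE.]

```c
/* returns the fd if it passes the observed limit, -1 (for -EMFILE) otherwise */
static int check_against_snapshot(unsigned int fd,
				  const volatile unsigned int *rlim_cur)
{
	unsigned int end = *rlim_cur;	/* single read: the observed value */

	/* even if *rlim_cur changes after this point, the call stays
	 * self-consistent with the value it saw at entry */
	return fd < end ? (int)fd : -1;
}
```

The bug being discussed is the opposite shape: checking against `max_fds` first and only sometimes consulting the limit, so a lowered limit may never be looked at.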
On Mon, 2024-06-17 at 10:55 -0700, Tim Chen wrote:
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
[...]
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
>
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
>
> 	task_lock(current->group_leader);
>
> Tim

And also move the task_lock in do_prlimit() before the RLIMIT_NOFILE check.

diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..b4e523728c3e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 		return -EINVAL;
 	resource = array_index_nospec(resource, RLIM_NLIMITS);

+	task_lock(tsk->group_leader);
 	if (new_rlim) {
 		if (new_rlim->rlim_cur > new_rlim->rlim_max)
 			return -EINVAL;
@@ -1481,7 +1482,6 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,

 	/* Holding a refcount on tsk protects tsk->signal from disappearing. */
 	rlim = tsk->signal->rlim + resource;
-	task_lock(tsk->group_leader);
 	if (new_rlim) {
 		/*
 		 * Keep the capable check against init_user_ns until cgroups can

Tim

[...]
On Mon 17-06-24 11:04:41, Tim Chen wrote:
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..b4e523728c3e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>  		return -EINVAL;
>  	resource = array_index_nospec(resource, RLIM_NLIMITS);
>  
> +	task_lock(tsk->group_leader);
>  	if (new_rlim) {
>  		if (new_rlim->rlim_cur > new_rlim->rlim_max)
>  			return -EINVAL;

This is clearly broken as it leaves the lock behind on the error, no?
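[Editor's note: the lock-leak Michal describes can be modeled in userspace. This is a minimal sketch only, using a pthread mutex to stand in for task_lock(); the function names and simplified bodies are illustrative, not the kernel code.]

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Stand-in for task_lock(tsk->group_leader). */
static pthread_mutex_t task_lock_model = PTHREAD_MUTEX_INITIALIZER;

/*
 * The broken shape from the proposed hunk: take the lock first, then
 * validate, and return early without unlocking -- the lock is leaked.
 */
static int do_prlimit_broken(long cur, long max)
{
	pthread_mutex_lock(&task_lock_model);
	if (cur > max)
		return -EINVAL;	/* returns with the lock still held */
	pthread_mutex_unlock(&task_lock_model);
	return 0;
}

/* The safe shape: validate before taking the lock at all. */
static int do_prlimit_fixed(long cur, long max)
{
	if (cur > max)
		return -EINVAL;	/* nothing held, nothing leaked */
	pthread_mutex_lock(&task_lock_model);
	/* ... the actual limit update would happen here ... */
	pthread_mutex_unlock(&task_lock_model);
	return 0;
}
```

A trylock after the error path makes the leak visible: it fails with EBUSY after the broken variant but succeeds after the fixed one.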
On Tue, Jun 18, 2024 at 10:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >  		return -EINVAL;
> >  	resource = array_index_nospec(resource, RLIM_NLIMITS);
> >  
> > +	task_lock(tsk->group_leader);
> >  	if (new_rlim) {
> >  		if (new_rlim->rlim_cur > new_rlim->rlim_max)
> >  			return -EINVAL;
>
> This is clearly broken as it leaves the lock behind on the error, no?

As I explained in my other e-mail there is no need to synchronize against
rlimit changes, merely the code needs to honor the observed value, whatever
it is. This holds for the stock kernel, does not hold for v1 of the
patchset and presumably will be addressed in v2.

Even if some synchronization was required, moving the lock higher does not
buy anything because the newly protected area only validates the new limit
before there is any attempt at setting it and the old limit is not looked
at.

I think we can safely drop this area and patiently wait for v2.
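[Editor's note: a userspace sketch of the "honor the observed value" argument. The allocator reads the limit once and applies that snapshot consistently, so no lock against concurrent rlimit updates is needed. Names and the simplified logic are illustrative, not the kernel implementation.]

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the RLIMIT_NOFILE-derived bound; a racing prlimit() may change it. */
static unsigned int rlimit_nofile = 1024;

static int alloc_fd_model(unsigned int first_free)
{
	/* single snapshot of the limit; every later check uses this copy */
	unsigned int end = rlimit_nofile;

	/*
	 * The bound must be checked even if the fd table is already large:
	 * the limit may have been lowered after higher fds were handed out.
	 */
	if (first_free >= end)
		return -EMFILE;
	return (int)first_free;
}
```

Whichever value the snapshot observes, old or new, the result is consistent with some valid ordering of the allocation against the limit change, which is all the stock kernel promises.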
On Tue, 2024-06-18 at 10:35 +0200, Michal Hocko wrote:
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >  		return -EINVAL;
> >  	resource = array_index_nospec(resource, RLIM_NLIMITS);
> >  
> > +	task_lock(tsk->group_leader);
> >  	if (new_rlim) {
> >  		if (new_rlim->rlim_cur > new_rlim->rlim_max)
> >  			return -EINVAL;
>
> This is clearly broken as it leaves the lock behind on the error, no?

Thanks for pointing that out. Need unlock before return.

I don't like adding task_lock in alloc_fd path if there are multiple
alloc_fd going on causing contention. The race with rlimit change should
be a very rare thing. Should be sufficient that patch one check for fd
not going beyond the observed rlimit.

Tim
diff --git a/fs/file.c b/fs/file.c
index a0e94a178c0b..59d62909e2e3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
 
 out:
 	spin_unlock(&files->file_lock);
@@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
 }
 EXPORT_SYMBOL(get_unused_fd_flags);
 
-static void __put_unused_fd(struct files_struct *files, unsigned int fd)
+static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
 	__clear_open_fd(fd, fdt);
@@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 void put_unused_fd(unsigned int fd)
 {
 	struct files_struct *files = current->files;
+	struct fdtable *fdt = files_fdtable(files);
 	spin_lock(&files->file_lock);
+	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
+		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
+		rcu_assign_pointer(fdt->fd[fd], NULL);
+	}
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
 }
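[Editor's note: a userspace model of the relocated sanity check. A plain pointer array stands in for fdt->fd[], a counter stands in for the KERN_WARNING printk, and there is no RCU or locking; this only illustrates the check-and-repair shape the patch moves into put_unused_fd().]

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_FDS 16

static void *fd_table[MODEL_MAX_FDS];	/* models fdt->fd[] */
static int warnings;			/* models the printk firing */

static void put_unused_fd_model(unsigned int fd)
{
	/* models: if (unlikely(rcu_access_pointer(fdt->fd[fd]))) */
	if (fd_table[fd] != NULL) {
		warnings++;		/* models the KERN_WARNING printk */
		fd_table[fd] = NULL;	/* models rcu_assign_pointer(..., NULL) */
	}
	/* __put_unused_fd() bookkeeping (clearing the open-fd bit,
	 * updating next_fd) would follow here */
}
```

The check fires only when a slot being recycled still holds a pointer, which is exactly the rare corruption the original alloc_fd() check was hunting; in the common path it is a single predictable branch.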