
[3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()

Message ID 20240614163416.728752-4-yu.ma@intel.com (mailing list archive)
State New
Series fs/file.c: optimize the critical section of

Commit Message

Ma, Yu June 14, 2024, 4:34 p.m. UTC
alloc_fd() has a sanity check inside to make sure the struct file mapping to the
allocated fd is NULL. Move the sanity check from the performance-critical
alloc_fd() path to the non-performance-critical put_unused_fd() path.

As the initial NULL condition is assured by zero initialization in init_file,
we just need to make sure that the slot is NULL when the fd is recycled back.
Three functions call __put_unused_fd() to return an fd: file_close_fd_locked(),
do_close_on_exec() and put_unused_fd(). The first two already implement the
NULL check, so add it to put_unused_fd() as well to cover all release paths.

Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 fs/file.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

Comments

Mateusz Guzik June 15, 2024, 4:41 a.m. UTC | #1
On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the

Total nitpick: FILE is the libc thing, I would refer to it as 'struct
file'. See below for the actual point.

> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  fs/file.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index a0e94a178c0b..59d62909e2e3 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	else
>  		__clear_close_on_exec(fd, fdt);
>  	error = fd;
> -#if 1
> -	/* Sanity check */
> -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> -		rcu_assign_pointer(fdt->fd[fd], NULL);
> -	}
> -#endif
>  

I was going to ask when anyone last saw this fire, and suggest getting rid
of it if enough time(tm) has passed. Turns out it does show up sometimes;
the latest result I found is 2017 vintage:
https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ

So you are moving this to another locked area, but one which does not
execute in the benchmark?

Patch 2/3 states 28% read and 14% write increase, this commit message
claims it goes up to 32% and 15% respectively -- pretty big. I presume
this has to do with bouncing a line containing the fd.

I would argue moving this check elsewhere is about as good as removing
it altogether, but that's for the vfs overlords to decide.

All that aside, looking at disasm of alloc_fd it is pretty clear there
is time to save, for example:

	if (unlikely(nr >= fdt->max_fds)) {
		if (fd >= end) {
			error = -EMFILE;
			goto out;
		}
		error = expand_files(fd, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}

This elides 2 branches and a func call in the common case. Completely
untested, maybe has some brainfarts, feel free to take without credit
and further massage the routine.

Moreover my disasm shows that even looking for a bit results in
a func call(!) to _find_next_zero_bit -- someone(tm) should probably
massage it into another inline.

After the above massaging is done and if it turns out the check has to
stay, you can plausibly damage-control it with prefetch -- issue it
immediately after finding the fd number, before any other work.

All that said, by the above I'm confident there is still *some*
performance left on the table despite the lock.

>  out:
>  	spin_unlock(&files->file_lock);
> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
>  }
>  EXPORT_SYMBOL(get_unused_fd_flags);
>  
> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  {
>  	struct fdtable *fdt = files_fdtable(files);
>  	__clear_open_fd(fd, fdt);
> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>  void put_unused_fd(unsigned int fd)
>  {
>  	struct files_struct *files = current->files;
> +	struct fdtable *fdt = files_fdtable(files);
>  	spin_lock(&files->file_lock);
> +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> +		rcu_assign_pointer(fdt->fd[fd], NULL);
> +	}
>  	__put_unused_fd(files, fd);
>  	spin_unlock(&files->file_lock);
>  }
Mateusz Guzik June 15, 2024, 5:07 a.m. UTC | #2
On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> 
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
> 
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> >  fs/file.c | 14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/file.c b/fs/file.c
> > index a0e94a178c0b..59d62909e2e3 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> >  	else
> >  		__clear_close_on_exec(fd, fdt);
> >  	error = fd;
> > -#if 1
> > -	/* Sanity check */
> > -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > -		rcu_assign_pointer(fdt->fd[fd], NULL);
> > -	}
> > -#endif
> >  
> 
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> 
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
> 
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
> 
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.
> 
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
> 
> 	if (unlikely(nr >= fdt->max_fds)) {
> 		if (fd >= end) {
> 			error = -EMFILE;
> 			goto out;
> 		}
> 		error = expand_files(fd, fd);
> 		if (error < 0)
> 			goto out;
> 		if (error)
> 			goto repeat;
> 	}
> 

Now that I wrote it I noticed the fd < end check has to be performed
regardless of max_fds -- someone could have changed rlimit to a lower
value after using a higher fd. But the main point stands: the call to
expand_files and associated error handling don't have to be there.

> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
> 
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
> 
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
> 
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.
> 
> >  out:
> >  	spin_unlock(&files->file_lock);
> > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> >  }
> >  EXPORT_SYMBOL(get_unused_fd_flags);
> >  
> > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> >  {
> >  	struct fdtable *fdt = files_fdtable(files);
> >  	__clear_open_fd(fd, fdt);
> > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> >  void put_unused_fd(unsigned int fd)
> >  {
> >  	struct files_struct *files = current->files;
> > +	struct fdtable *fdt = files_fdtable(files);
> >  	spin_lock(&files->file_lock);
> > +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > +		rcu_assign_pointer(fdt->fd[fd], NULL);
> > +	}
> >  	__put_unused_fd(files, fd);
> >  	spin_unlock(&files->file_lock);
> >  }
Ma, Yu June 16, 2024, 3:47 a.m. UTC | #3
On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.

Good point, not nitpick at all ;) , will update the word in commit message.

>> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
>> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>> ---
>>   fs/file.c | 14 ++++++--------
>>   1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a0e94a178c0b..59d62909e2e3 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>   	else
>>   		__clear_close_on_exec(fd, fdt);
>>   	error = fd;
>> -#if 1
>> -	/* Sanity check */
>> -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> -		rcu_assign_pointer(fdt->fd[fd], NULL);
>> -	}
>> -#endif
>>   
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?

The consideration here, as mentioned, is to reduce the performance impact
while preserving the sanity check and the same functionality, by moving it
from the critical path to a non-critical one: put_unused_fd() is mostly used
for error handling, when an fd was allocated successfully but the struct file
could not be obtained in the next step.
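That error path follows the common caller pattern (a schematic, not any specific kernel call site; some_file_constructor is a hypothetical step that may fail):

```c
/* Schematic caller: put_unused_fd() runs only on the error path, after
 * the fd was reserved but before any file was installed, so the slot is
 * expected to still be NULL when it is recycled. */
int fd = get_unused_fd_flags(O_CLOEXEC);
if (fd < 0)
	return fd;
file = some_file_constructor();		/* hypothetical, may fail */
if (IS_ERR(file)) {
	put_unused_fd(fd);		/* non-critical error path */
	return PTR_ERR(file);
}
fd_install(fd, file);
return fd;
```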

>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> 	if (unlikely(nr >= fdt->max_fds)) {
> 		if (fd >= end) {
> 			error = -EMFILE;
> 			goto out;
> 		}
> 		error = expand_files(fd, fd);
> 		if (error < 0)
> 			goto out;
> 		if (error)
> 			goto repeat;
> 	}
>
> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
>
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.

Thank you Guzik for the quick check and good suggestions :) Yes, there are
extra branches and a function call that can be removed for better performance.
For the fast path fix, how about a flow like the draft below (besides the
massaging of find_next_fd()):

        error = -EMFILE;
        if (fd < fdt->max_fds) {
                if (~fdt->open_fds[0]) {
                        fd = find_next_zero_bit(fdt->open_fds,
                                                BITS_PER_LONG, fd);
                        goto fastreturn;
                }
                fd = find_next_fd(fdt, fd);
        }

        if (unlikely(fd >= fdt->max_fds)) {
                error = expand_files(files, fd);
                if (error < 0)
                        goto out;
                if (error)
                        goto repeat;
        }
fastreturn:
        if (unlikely(fd >= end))
                goto out;
        if (start <= files->next_fd)
                files->next_fd = fd + 1;

        ....

>>   out:
>>   	spin_unlock(&files->file_lock);
>> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
>>   }
>>   EXPORT_SYMBOL(get_unused_fd_flags);
>>   
>> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
>>   {
>>   	struct fdtable *fdt = files_fdtable(files);
>>   	__clear_open_fd(fd, fdt);
>> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>>   void put_unused_fd(unsigned int fd)
>>   {
>>   	struct files_struct *files = current->files;
>> +	struct fdtable *fdt = files_fdtable(files);
>>   	spin_lock(&files->file_lock);
>> +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
>> +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
>> +		rcu_assign_pointer(fdt->fd[fd], NULL);
>> +	}
>>   	__put_unused_fd(files, fd);
>>   	spin_unlock(&files->file_lock);
>>   }
Christian Brauner June 17, 2024, 8:36 a.m. UTC | #4
On Sat, Jun 15, 2024 at 06:41:45AM GMT, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> 
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
> 
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> >  fs/file.c | 14 ++++++--------
> >  1 file changed, 6 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/file.c b/fs/file.c
> > index a0e94a178c0b..59d62909e2e3 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> >  	else
> >  		__clear_close_on_exec(fd, fdt);
> >  	error = fd;
> > -#if 1
> > -	/* Sanity check */
> > -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > -		rcu_assign_pointer(fdt->fd[fd], NULL);
> > -	}
> > -#endif
> >  
> 
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> 
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
> 
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
> 
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.

This all dates back to 9d11a5176cc5 ("I just put a pre-90 on
ftp.kernel.org, and I'm happy to report that Davem")
which is pre-git. I think removing is fine.
Mateusz Guzik June 17, 2024, 11:23 a.m. UTC | #5
On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
> 
> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
> 
> The consideration here as mentioned is to reduce the performance impact (if
> to reserve the sanity check, and have the same functionality) by moving it
> from critical path to non-critical, as put_unused_fd() is mostly used for
> error handling when fd is allocated successfully but struct file failed to
> obtained in the next step.
> 

As mentioned by Christian in his mail this check can just be removed.

>         error = -EMFILE;
>         if (fd < fdt->max_fds) {

I would likely() on it.

>                 if (~fdt->open_fds[0]) {
>                         fd = find_next_zero_bit(fdt->open_fds,
> BITS_PER_LONG, fd);
>                         goto fastreturn;
>                 }
>                 fd = find_next_fd(fdt, fd);
>         }
> 
>         if (unlikely(fd >= fdt->max_fds)) {
>                 error = expand_files(files, fd);
>                 if (error < 0)
>                         goto out;
>                 if (error)
>                         goto repeat;
>         }
> fastreturn:
>         if (unlikely(fd >= end))
>                 goto out;
>         if (start <= files->next_fd)
>                 files->next_fd = fd + 1;
> 
>        ....
> 

This is not a review, but it does read fine.

LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
tests, I would make sure they still pass before posting a v2.

I would definitely try moving out the lock to its own cacheline --
currently it resides with the bitmaps (and first 4 fds of the embedded
array).

I expect it to provide some win on top of the current patchset, but
whether it will be sufficient to justify it I have no idea.

Something of this sort:
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2944d4aa413b..423cb599268a 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -50,11 +50,11 @@ struct files_struct {
    * written part on a separate cache line in SMP
    */
        spinlock_t file_lock ____cacheline_aligned_in_smp;
-       unsigned int next_fd;
+       unsigned int next_fd ____cacheline_aligned_in_smp;
        unsigned long close_on_exec_init[1];
        unsigned long open_fds_init[1];
        unsigned long full_fds_bits_init[1];
-       struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+       struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
 };

 struct file_operations;

(feel free to just take)

All that said, I have to note blogbench has numerous bugs. For example
it collects stats by merely bumping a global variable (not even with
atomics), so some updates are straight up lost.

To my understanding it was meant to test filesystems and I'm guessing
threading (instead of separate processes) was only used to make it
easier to share statistics. Given the current scales this
unintentionally transitioned into bottlenecking on the file_lock
instead.

There were scalability changes made about 9 years ago and, according to
git log, they were benchmarked by Eric Dumazet, though the benchmark code was
not shared. I suggest CC'ing him on your v2 and asking if he can bench.
Even if he is no longer in a position to do it, perhaps he can rope
someone in (or, even better, share his benchmark).
Ma, Yu June 17, 2024, 5:22 p.m. UTC | #6
On 6/17/2024 7:23 PM, Mateusz Guzik wrote:
> On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
>> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
>>> So you are moving this to another locked area, but one which does not
>>> execute in the benchmark?
>> The consideration here as mentioned is to reduce the performance impact (if
>> to reserve the sanity check, and have the same functionality) by moving it
>> from critical path to non-critical, as put_unused_fd() is mostly used for
>> error handling when fd is allocated successfully but struct file failed to
>> obtained in the next step.
>>
> As mentioned by Christian in his mail this check can just be removed.

Yes, that's great, I'll update it in v2.

>
>>          error = -EMFILE;
>>          if (fd < fdt->max_fds) {
> I would likely() on it.

That's better :)

>
>>                  if (~fdt->open_fds[0]) {
>>                          fd = find_next_zero_bit(fdt->open_fds,
>> BITS_PER_LONG, fd);
>>                          goto fastreturn;
>>                  }
>>                  fd = find_next_fd(fdt, fd);
>>          }
>>
>>          if (unlikely(fd >= fdt->max_fds)) {
>>                  error = expand_files(files, fd);
>>                  if (error < 0)
>>                          goto out;
>>                  if (error)
>>                          goto repeat;
>>          }
>> fastreturn:
>>          if (unlikely(fd >= end))
>>                  goto out;
>>          if (start <= files->next_fd)
>>                  files->next_fd = fd + 1;
>>
>>         ....
>>
> This is not a review, but it does read fine.
>
> LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
> tests, I would make sure they still pass before posting a v2.
Got it, thanks for the kind reminder.
>
> I would definitely try moving out the lock to its own cacheline --
> currently it resides with the bitmaps (and first 4 fds of the embedded
> array).
>
> I expect it to provide some win on top of the current patchset, but
> whether it will be sufficient to justify it I have no idea.
>
> Something of this sort:
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index 2944d4aa413b..423cb599268a 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -50,11 +50,11 @@ struct files_struct {
>      * written part on a separate cache line in SMP
>      */
>          spinlock_t file_lock ____cacheline_aligned_in_smp;
> -       unsigned int next_fd;
> +       unsigned int next_fd ____cacheline_aligned_in_smp;
>          unsigned long close_on_exec_init[1];
>          unsigned long open_fds_init[1];
>          unsigned long full_fds_bits_init[1];
> -       struct file __rcu * fd_array[NR_OPEN_DEFAULT];
> +       struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
>   };
>
>   struct file_operations;
>
> (feel free to just take)

Thanks Guzik for the thoughtful suggestions. It seems we have tried separating 
file_lock and next_fd into different cachelines before, but no obvious 
improvement was observed. We'll double check and verify these 2 tips to see 
how it goes.

>
> All that said, I have to note blogbench has numerous bugs. For example
> it collects stats by merely bumping a global variable (not even with
> atomics), so some updates are straight up lost.
>
> To my understanding it was meant to test filesystems and I'm guessing
> threading (instead of separate processes) was only used to make it
> easier to share statistics. Given the current scales this
> unintentionally transitioned into bottlenecking on the file_lock
> instead.
>
> There were scalability changes made about 9 years ago and according to
> git log they were benchmarked by Eric Dumazet, while benchmark code was
> not shared. I suggest CC'ing the guy with your v2 and asking if he can
> bench.  Even if he is no longer in position to do it perhaps he can rope
> someone in (or even better share his benchmark).

Good advice, we also noticed the problems with blogbench. The data in these 
commits was collected with a revised version that makes the variables used for 
score statistics/reporting safe, and we will submit that revision as a patch 
to blogbench. We'll CC Eric for more information on his benchmark and check 
the details if it is available.
Tim Chen June 17, 2024, 5:55 p.m. UTC | #7
On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> > 
> > Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> > file'. See below for the actual point.
> > 
> > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > > 
> > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > ---
> > >  fs/file.c | 14 ++++++--------
> > >  1 file changed, 6 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/file.c b/fs/file.c
> > > index a0e94a178c0b..59d62909e2e3 100644
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > >  	else
> > >  		__clear_close_on_exec(fd, fdt);
> > >  	error = fd;
> > > -#if 1
> > > -	/* Sanity check */
> > > -	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > -		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > -		rcu_assign_pointer(fdt->fd[fd], NULL);
> > > -	}
> > > -#endif
> > >  
> > 
> > I was going to ask when was the last time anyone seen this fire and
> > suggest getting rid of it if enough time(tm) passed. Turns out it does
> > show up sometimes, latest result I found is 2017 vintage:
> > https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> > 
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
> > 
> > Patch 2/3 states 28% read and 14% write increase, this commit message
> > claims it goes up to 32% and 15% respectively -- pretty big. I presume
> > this has to do with bouncing a line containing the fd.
> > 
> > I would argue moving this check elsewhere is about as good as removing
> > it altogether, but that's for the vfs overlords to decide.
> > 
> > All that aside, looking at disasm of alloc_fd it is pretty clear there
> > is time to save, for example:
> > 
> > 	if (unlikely(nr >= fdt->max_fds)) {
> > 		if (fd >= end) {
> > 			error = -EMFILE;
> > 			goto out;
> > 		}
> > 		error = expand_files(fd, fd);
> > 		if (error < 0)
> > 			goto out;
> > 		if (error)
> > 			goto repeat;
> > 	}
> > 
> 
> Now that I wrote it I noticed the fd < end check has to be performed
> regardless of max_fds -- someone could have changed rlimit to a lower
> value after using a higher fd. But the main point stands: the call to
> expand_files and associated error handling don't have to be there.

To really prevent someone from mucking with rlimit, we should probably
take the task_lock to prevent do_prlimit() racing with this function.

task_lock(current->group_leader);

Tim

> 
> > This elides 2 branches and a func call in the common case. Completely
> > untested, maybe has some brainfarts, feel free to take without credit
> > and further massage the routine.
> > 
> > Moreover my disasm shows that even looking for a bit results in
> > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > massage it into another inline.
> > 
> > After the above massaging is done and if it turns out the check has to
> > stay, you can plausibly damage-control it with prefetch -- issue it
> > immediately after finding the fd number, before any other work.
> > 
> > All that said, by the above I'm confident there is still *some*
> > performance left on the table despite the lock.
> > 
> > >  out:
> > >  	spin_unlock(&files->file_lock);
> > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > >  }
> > >  EXPORT_SYMBOL(get_unused_fd_flags);
> > >  
> > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > >  {
> > >  	struct fdtable *fdt = files_fdtable(files);
> > >  	__clear_open_fd(fd, fdt);
> > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > >  void put_unused_fd(unsigned int fd)
> > >  {
> > >  	struct files_struct *files = current->files;
> > > +	struct fdtable *fdt = files_fdtable(files);
> > >  	spin_lock(&files->file_lock);
> > > +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > +		rcu_assign_pointer(fdt->fd[fd], NULL);
> > > +	}
> > >  	__put_unused_fd(files, fd);
> > >  	spin_unlock(&files->file_lock);
> > >  }
>
Mateusz Guzik June 17, 2024, 5:59 p.m. UTC | #8
On Mon, Jun 17, 2024 at 7:55 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> > >
> > > Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> > > file'. See below for the actual point.
> > >
> > > > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > > > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > >  fs/file.c | 14 ++++++--------
> > > >  1 file changed, 6 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a0e94a178c0b..59d62909e2e3 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > >   else
> > > >           __clear_close_on_exec(fd, fdt);
> > > >   error = fd;
> > > > -#if 1
> > > > - /* Sanity check */
> > > > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > > > -         printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > > > -         rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > - }
> > > > -#endif
> > > >
> > >
> > > I was going to ask when was the last time anyone seen this fire and
> > > suggest getting rid of it if enough time(tm) passed. Turns out it does
> > > show up sometimes, latest result I found is 2017 vintage:
> > > https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
> > >
> > > So you are moving this to another locked area, but one which does not
> > > execute in the benchmark?
> > >
> > > Patch 2/3 states 28% read and 14% write increase, this commit message
> > > claims it goes up to 32% and 15% respectively -- pretty big. I presume
> > > this has to do with bouncing a line containing the fd.
> > >
> > > I would argue moving this check elsewhere is about as good as removing
> > > it altogether, but that's for the vfs overlords to decide.
> > >
> > > All that aside, looking at disasm of alloc_fd it is pretty clear there
> > > is time to save, for example:
> > >
> > >     if (unlikely(nr >= fdt->max_fds)) {
> > >             if (fd >= end) {
> > >                     error = -EMFILE;
> > >                     goto out;
> > >             }
> > >             error = expand_files(fd, fd);
> > >             if (error < 0)
> > >                     goto out;
> > >             if (error)
> > >                     goto repeat;
> > >     }
> > >
> >
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
>
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
>
> task_lock(current->group_leader);
>

It's fine to race against rlimit adjustments.

The problem here is that with both my toy refactoring above and the
posted patch, a thread can use a high fd, lower the rlimit on its own,
and then not have that limit respected on calls made later.

>
> >
> > > This elides 2 branches and a func call in the common case. Completely
> > > untested, maybe has some brainfarts, feel free to take without credit
> > > and further massage the routine.
> > >
> > > Moreover my disasm shows that even looking for a bit results in
> > > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > > massage it into another inline.
> > >
> > > After the above massaging is done and if it turns out the check has to
> > > stay, you can plausibly damage-control it with prefetch -- issue it
> > > immediately after finding the fd number, before any other work.
> > >
> > > All that said, by the above I'm confident there is still *some*
> > > performance left on the table despite the lock.
> > >
> > > >  out:
> > > >   spin_unlock(&files->file_lock);
> > > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > > >  }
> > > >  EXPORT_SYMBOL(get_unused_fd_flags);
> > > >
> > > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > >  {
> > > >   struct fdtable *fdt = files_fdtable(files);
> > > >   __clear_open_fd(fd, fdt);
> > > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > >  void put_unused_fd(unsigned int fd)
> > > >  {
> > > >   struct files_struct *files = current->files;
> > > > + struct fdtable *fdt = files_fdtable(files);
> > > >   spin_lock(&files->file_lock);
> > > > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > > +         printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > > +         rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > + }
> > > >   __put_unused_fd(files, fd);
> > > >   spin_unlock(&files->file_lock);
> > > >  }
> >
>
Tim Chen June 17, 2024, 6:04 p.m. UTC | #9
On Mon, 2024-06-17 at 10:55 -0700, Tim Chen wrote:
> On Sat, 2024-06-15 at 07:07 +0200, Mateusz Guzik wrote:
> > On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> > > On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > > > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> > > 
> > > 
> > 
> > Now that I wrote it I noticed the fd < end check has to be performed
> > regardless of max_fds -- someone could have changed rlimit to a lower
> > value after using a higher fd. But the main point stands: the call to
> > expand_files and associated error handling don't have to be there.
> 
> To really prevent someone from mucking with rlimit, we should probably
> take the task_lock to prevent do_prlimit() racing with this function.
> 
> task_lock(current->group_leader);
> 
> Tim


And also move the task_lock in do_prlimit() before the RLIMIT_NOFILE check.

diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..b4e523728c3e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
                return -EINVAL;
        resource = array_index_nospec(resource, RLIM_NLIMITS);
 
+       task_lock(tsk->group_leader);
        if (new_rlim) {
                if (new_rlim->rlim_cur > new_rlim->rlim_max)
                        return -EINVAL;
@@ -1481,7 +1482,6 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
 
        /* Holding a refcount on tsk protects tsk->signal from disappearing. */
        rlim = tsk->signal->rlim + resource;
-       task_lock(tsk->group_leader);
        if (new_rlim) {
                /*
                 * Keep the capable check against init_user_ns until cgroups can

Tim
> > 
> > > This elides 2 branches and a func call in the common case. Completely
> > > untested, maybe has some brainfarts, feel free to take without credit
> > > and further massage the routine.
> > > 
> > > Moreover my disasm shows that even looking for a bit results in
> > > a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> > > massage it into another inline.
> > > 
> > > After the above massaging is done and if it turns out the check has to
> > > stay, you can plausibly damage-control it with prefetch -- issue it
> > > immediately after finding the fd number, before any other work.
> > > 
> > > All that said, by the above I'm confident there is still *some*
> > > performance left on the table despite the lock.
> > > 
> > > >  out:
> > > >  	spin_unlock(&files->file_lock);
> > > > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > > >  }
> > > >  EXPORT_SYMBOL(get_unused_fd_flags);
> > > >  
> > > > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > >  {
> > > >  	struct fdtable *fdt = files_fdtable(files);
> > > >  	__clear_open_fd(fd, fdt);
> > > > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > > >  void put_unused_fd(unsigned int fd)
> > > >  {
> > > >  	struct files_struct *files = current->files;
> > > > +	struct fdtable *fdt = files_fdtable(files);
> > > >  	spin_lock(&files->file_lock);
> > > > +	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > > > +		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > > > +		rcu_assign_pointer(fdt->fd[fd], NULL);
> > > > +	}
> > > >  	__put_unused_fd(files, fd);
> > > >  	spin_unlock(&files->file_lock);
> > > >  }
> > 
>
Michal Hocko June 18, 2024, 8:35 a.m. UTC | #10
On Mon 17-06-24 11:04:41, Tim Chen wrote:
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..b4e523728c3e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
>                 return -EINVAL;
>         resource = array_index_nospec(resource, RLIM_NLIMITS);
>  
> +       task_lock(tsk->group_leader);
>         if (new_rlim) {
>                 if (new_rlim->rlim_cur > new_rlim->rlim_max)
>                         return -EINVAL;

This is clearly broken as it leaves the lock behind on the error, no?
Mateusz Guzik June 18, 2024, 9:06 a.m. UTC | #11
On Tue, Jun 18, 2024 at 10:35 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >                 return -EINVAL;
> >         resource = array_index_nospec(resource, RLIM_NLIMITS);
> >
> > +       task_lock(tsk->group_leader);
> >         if (new_rlim) {
> >                 if (new_rlim->rlim_cur > new_rlim->rlim_max)
> >                         return -EINVAL;
>
> This is clearly broken as it leaves the lock behind on the error, no?

As I explained in my other e-mail, there is no need to synchronize
against rlimit changes; the code merely needs to honor the observed
value, whatever it is.

This holds for the stock kernel, does not hold for v1 of the patchset
and presumably will be addressed in v2.

Even if some synchronization was required, moving the lock higher does
not buy anything because the newly protected area only validates the
new limit before there is any attempt at setting it and the old limit
is not looked at.

I think we can safely drop this area and patiently wait for v2.
Tim Chen June 18, 2024, 8:40 p.m. UTC | #12
On Tue, 2024-06-18 at 10:35 +0200, Michal Hocko wrote:
> On Mon 17-06-24 11:04:41, Tim Chen wrote:
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 3a2df1bd9f64..b4e523728c3e 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -1471,6 +1471,7 @@ static int do_prlimit(struct task_struct *tsk, unsigned int resource,
> >                 return -EINVAL;
> >         resource = array_index_nospec(resource, RLIM_NLIMITS);
> >  
> > +       task_lock(tsk->group_leader);
> >         if (new_rlim) {
> >                 if (new_rlim->rlim_cur > new_rlim->rlim_max)
> >                         return -EINVAL;
> 
> This is clearly broken as it leaves the lock behind on the error, no?

Thanks for pointing that out.  We need to unlock before the return. I
don't like adding task_lock in the alloc_fd() path, since multiple
concurrent alloc_fd() calls would cause contention.

The race with an rlimit change should be a very rare thing.  It should
be sufficient that patch one checks that the fd does not go beyond the
observed rlimit.

Tim

Patch

diff --git a/fs/file.c b/fs/file.c
index a0e94a178c0b..59d62909e2e3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -548,13 +548,6 @@  static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	else
 		__clear_close_on_exec(fd, fdt);
 	error = fd;
-#if 1
-	/* Sanity check */
-	if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
-		printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
-		rcu_assign_pointer(fdt->fd[fd], NULL);
-	}
-#endif
 
 out:
 	spin_unlock(&files->file_lock);
@@ -572,7 +565,7 @@  int get_unused_fd_flags(unsigned flags)
 }
 EXPORT_SYMBOL(get_unused_fd_flags);
 
-static void __put_unused_fd(struct files_struct *files, unsigned int fd)
+static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
 {
 	struct fdtable *fdt = files_fdtable(files);
 	__clear_open_fd(fd, fdt);
@@ -583,7 +576,12 @@  static void __put_unused_fd(struct files_struct *files, unsigned int fd)
 void put_unused_fd(unsigned int fd)
 {
 	struct files_struct *files = current->files;
+	struct fdtable *fdt = files_fdtable(files);
 	spin_lock(&files->file_lock);
+	if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
+		printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
+		rcu_assign_pointer(fdt->fd[fd], NULL);
+	}
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
 }