Message ID | 20240622154904.3774273-2-yu.ma@intel.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | fs/file.c: optimize the critical section of file_lock in | expand |
On Sat 22-06-24 11:49:02, Yu Ma wrote: > There is available fd in the lower 64 bits of open_fds bitmap for most cases > when we look for an available fd slot. Skip 2-levels searching via > find_next_zero_bit() for this common fast path. > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a > free slot is available there, as: > (1) The fd allocation algorithm would always allocate fd from small to large. > Lower bits in open_fds bitmap would be used much more frequently than higher > bits. > (2) After fdt is expanded (the bitmap size doubled for each time of expansion), > it would never be shrunk. The search size increases but there are few open fds > available here. > (3) find_next_zero_bit() itself has a fast path inside to speed up searching > when size<=64. > > Besides, "!start" is added to fast path condition to ensure the allocated fd is > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios: > (1) Allocating a new fd (the most common usage scenario) via > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0). > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from > old_fd's index in fdt, which is only called by syscall fcntl. > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4. > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > Signed-off-by: Yu Ma <yu.ma@intel.com> > --- > fs/file.c | 35 +++++++++++++++++++++-------------- > 1 file changed, 21 insertions(+), 14 deletions(-) > > diff --git a/fs/file.c b/fs/file.c > index a3b72aa64f11..50e900a47107 100644 > --- a/fs/file.c > +++ b/fs/file.c > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) > if (fd < files->next_fd) > fd = files->next_fd; > > - if (fd < fdt->max_fds) > + error = -EMFILE; > + if (likely(fd < fdt->max_fds)) { > + if (~fdt->open_fds[0] && !start) { > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd); So I don't think this is quite correct. If files->next_fd is set, we could end up calling find_next_zero_bit() starting from quite high offset causing a regression? Also because we don't expand in this case, we could cause access beyond end of fdtable? Finally, AFAIU this speeds up the lookup for cases where fd < 64 is available at the cost of cases where the first long is full (there we unnecessarily load open_fds[0] into cache). Did you check if the cost is visible (e.g. by making blogbench occupy first 64 fds before starting its load)? Honza > + goto fastreturn; > + } > fd = find_next_fd(fdt, fd); > + } > + > + if (unlikely(fd >= fdt->max_fds)) { > + error = expand_files(files, fd); > + if (error < 0) > + goto out; > + /* > + * If we needed to expand the fs array we > + * might have blocked - try again. > + */ > + if (error) > + goto repeat; > + } > > +fastreturn: > /* > * N.B. For clone tasks sharing a files structure, this test > * will limit the total number of files that can be opened. > */ > - error = -EMFILE; > - if (fd >= end) > + if (unlikely(fd >= end)) > goto out; > > - error = expand_files(files, fd); > - if (error < 0) > - goto out; > - > - /* > - * If we needed to expand the fs array we > - * might have blocked - try again. > - */ > - if (error) > - goto repeat; > - > if (start <= files->next_fd) > files->next_fd = fd + 1; > > -- > 2.43.0 >
On Tue 25-06-24 13:52:57, Jan Kara wrote: > On Sat 22-06-24 11:49:02, Yu Ma wrote: > > There is available fd in the lower 64 bits of open_fds bitmap for most cases > > when we look for an available fd slot. Skip 2-levels searching via > > find_next_zero_bit() for this common fast path. > > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a > > free slot is available there, as: > > (1) The fd allocation algorithm would always allocate fd from small to large. > > Lower bits in open_fds bitmap would be used much more frequently than higher > > bits. > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion), > > it would never be shrunk. The search size increases but there are few open fds > > available here. > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching > > when size<=64. > > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios: > > (1) Allocating a new fd (the most common usage scenario) via > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0). > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from > > old_fd's index in fdt, which is only called by syscall fcntl. > > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4. > > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > Signed-off-by: Yu Ma <yu.ma@intel.com> > > --- > > fs/file.c | 35 +++++++++++++++++++++-------------- > > 1 file changed, 21 insertions(+), 14 deletions(-) > > > > diff --git a/fs/file.c b/fs/file.c > > index a3b72aa64f11..50e900a47107 100644 > > --- a/fs/file.c > > +++ b/fs/file.c > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) > > if (fd < files->next_fd) > > fd = files->next_fd; > > > > - if (fd < fdt->max_fds) > > + error = -EMFILE; > > + if (likely(fd < fdt->max_fds)) { > > + if (~fdt->open_fds[0] && !start) { > > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd); > > So I don't think this is quite correct. If files->next_fd is set, we could > end up calling find_next_zero_bit() starting from quite high offset causing > a regression? Also because we don't expand in this case, we could cause access > beyond end of fdtable? OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should be false. As such the above condition could be rewritten as: if (!start && files->next_fd < BITS_PER_LONG) to avoid loading the first bitmap long if we know it is full? Or we could maybe go as far as: if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds)) goto fastreturn; because AFAIU this should work in exactly the same cases as your code? Honza > > + goto fastreturn; > > + } > > fd = find_next_fd(fdt, fd); > > + } > > + > > + if (unlikely(fd >= fdt->max_fds)) { > > + error = expand_files(files, fd); > > + if (error < 0) > > + goto out; > > + /* > > + * If we needed to expand the fs array we > > + * might have blocked - try again. > > + */ > > + if (error) > > + goto repeat; > > + } > > > > +fastreturn: > > /* > > * N.B. For clone tasks sharing a files structure, this test > > * will limit the total number of files that can be opened. > > */ > > - error = -EMFILE; > > - if (fd >= end) > > + if (unlikely(fd >= end)) > > goto out; > > > > - error = expand_files(files, fd); > > - if (error < 0) > > - goto out; > > - > > - /* > > - * If we needed to expand the fs array we > > - * might have blocked - try again. > > - */ > > - if (error) > > - goto repeat; > > - > > if (start <= files->next_fd) > > files->next_fd = fd + 1; > > > > -- > > 2.43.0 > > > -- > Jan Kara <jack@suse.com> > SUSE Labs, CR >
On 6/25/2024 8:53 PM, Jan Kara wrote: > On Tue 25-06-24 13:52:57, Jan Kara wrote: >> On Sat 22-06-24 11:49:02, Yu Ma wrote: >>> There is available fd in the lower 64 bits of open_fds bitmap for most cases >>> when we look for an available fd slot. Skip 2-levels searching via >>> find_next_zero_bit() for this common fast path. >>> >>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a >>> free slot is available there, as: >>> (1) The fd allocation algorithm would always allocate fd from small to large. >>> Lower bits in open_fds bitmap would be used much more frequently than higher >>> bits. >>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion), >>> it would never be shrunk. The search size increases but there are few open fds >>> available here. >>> (3) find_next_zero_bit() itself has a fast path inside to speed up searching >>> when size<=64. >>> >>> Besides, "!start" is added to fast path condition to ensure the allocated fd is >>> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios: >>> (1) Allocating a new fd (the most common usage scenario) via >>> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0). >>> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from >>> old_fd's index in fdt, which is only called by syscall fcntl. >>> >>> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved >>> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4. >>> >>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> >>> Signed-off-by: Yu Ma <yu.ma@intel.com> >>> --- >>> fs/file.c | 35 +++++++++++++++++++++-------------- >>> 1 file changed, 21 insertions(+), 14 deletions(-) >>> >>> diff --git a/fs/file.c b/fs/file.c >>> index a3b72aa64f11..50e900a47107 100644 >>> --- a/fs/file.c >>> +++ b/fs/file.c >>> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) >>> if (fd < files->next_fd) >>> fd = files->next_fd; >>> >>> - if (fd < fdt->max_fds) >>> + error = -EMFILE; >>> + if (likely(fd < fdt->max_fds)) { >>> + if (~fdt->open_fds[0] && !start) { >>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd); >> So I don't think this is quite correct. If files->next_fd is set, we could >> end up calling find_next_zero_bit() starting from quite high offset causing >> a regression? Also because we don't expand in this case, we could cause access >> beyond end of fdtable? > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should > be false. As such the above condition could be rewritten as: > > if (!start && files->next_fd < BITS_PER_LONG) > > to avoid loading the first bitmap long if we know it is full? Or we could > maybe go as far as: > > if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds)) > goto fastreturn; > > because AFAIU this should work in exactly the same cases as your code? > > Honza Thanks Honza for the good concern and suggestions here, while both above conditions are not enough to ensure that there is available fd in the first 64 bits of open_fds. As next_fd just means there is no available fd before next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is returned back, then next_fd would be set as 3 per fd recycling logic (i.e. in __put_unused_fd()), next time when alloc_fd() being called, it would return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd() being called again, next_fd==4, but actually it's already been occupied. So find_next_zero_bit() is needed to find the real 0 bit anyway. The conditions should either be like it is in patch or if (!start && !test_bit(0, fdt->full_fds_bits)), the latter should also have the bitmap loading cost, but another point is that a bit in full_fds_bits represents 64 bits in open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, maybe we can modify the condition to use full_fds_bits ? >>> + goto fastreturn; >>> + } >>> fd = find_next_fd(fdt, fd); >>> + } >>> + >>> + if (unlikely(fd >= fdt->max_fds)) { >>> + error = expand_files(files, fd); >>> + if (error < 0) >>> + goto out; >>> + /* >>> + * If we needed to expand the fs array we >>> + * might have blocked - try again. >>> + */ >>> + if (error) >>> + goto repeat; >>> + } >>> >>> +fastreturn: >>> /* >>> * N.B. For clone tasks sharing a files structure, this test >>> * will limit the total number of files that can be opened. >>> */ >>> - error = -EMFILE; >>> - if (fd >= end) >>> + if (unlikely(fd >= end)) >>> goto out; >>> >>> - error = expand_files(files, fd); >>> - if (error < 0) >>> - goto out; >>> - >>> - /* >>> - * If we needed to expand the fs array we >>> - * might have blocked - try again. >>> - */ >>> - if (error) >>> - goto repeat; >>> - >>> if (start <= files->next_fd) >>> files->next_fd = fd + 1; >>> >>> -- >>> 2.43.0 >>> >> -- >> Jan Kara <jack@suse.com> >> SUSE Labs, CR >>
On Tue 25-06-24 23:33:34, Ma, Yu wrote: > On 6/25/2024 8:53 PM, Jan Kara wrote: > > On Tue 25-06-24 13:52:57, Jan Kara wrote: > > > On Sat 22-06-24 11:49:02, Yu Ma wrote: > > > > There is available fd in the lower 64 bits of open_fds bitmap for most cases > > > > when we look for an available fd slot. Skip 2-levels searching via > > > > find_next_zero_bit() for this common fast path. > > > > > > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a > > > > free slot is available there, as: > > > > (1) The fd allocation algorithm would always allocate fd from small to large. > > > > Lower bits in open_fds bitmap would be used much more frequently than higher > > > > bits. > > > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion), > > > > it would never be shrunk. The search size increases but there are few open fds > > > > available here. > > > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching > > > > when size<=64. > > > > > > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is > > > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios: > > > > (1) Allocating a new fd (the most common usage scenario) via > > > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0). > > > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from > > > > old_fd's index in fdt, which is only called by syscall fcntl. > > > > > > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved > > > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4. > > > > > > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> > > > > Signed-off-by: Yu Ma <yu.ma@intel.com> > > > > --- > > > > fs/file.c | 35 +++++++++++++++++++++-------------- > > > > 1 file changed, 21 insertions(+), 14 deletions(-) > > > > > > > > diff --git a/fs/file.c b/fs/file.c > > > > index a3b72aa64f11..50e900a47107 100644 > > > > --- a/fs/file.c > > > > +++ b/fs/file.c > > > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) > > > > if (fd < files->next_fd) > > > > fd = files->next_fd; > > > > - if (fd < fdt->max_fds) > > > > + error = -EMFILE; > > > > + if (likely(fd < fdt->max_fds)) { > > > > + if (~fdt->open_fds[0] && !start) { > > > > + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd); > > > So I don't think this is quite correct. If files->next_fd is set, we could > > > end up calling find_next_zero_bit() starting from quite high offset causing > > > a regression? Also because we don't expand in this case, we could cause access > > > beyond end of fdtable? > > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in > > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should > > be false. As such the above condition could be rewritten as: > > > > if (!start && files->next_fd < BITS_PER_LONG) > > > > to avoid loading the first bitmap long if we know it is full? Or we could > > maybe go as far as: > > > > if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds)) > > goto fastreturn; > > > > because AFAIU this should work in exactly the same cases as your code? > > > > Honza > > Thanks Honza for the good concern and suggestions here, while both above > conditions are not enough to ensure that there is available fd in the first > 64 bits of open_fds. As next_fd just means there is no available fd before > next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is > returned back, then next_fd would be set as 3 per fd recycling logic (i.e. > in __put_unused_fd()), next time when alloc_fd() being called, it would > return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd() > being called again, next_fd==4, but actually it's already been occupied. So > find_next_zero_bit() is needed to find the real 0 bit anyway. Indeed, thanks for correcting me! next_fd is just a lower bound for the first free fd. > The conditions > should either be like it is in patch or if (!start && !test_bit(0, > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > but another point is that a bit in full_fds_bits represents 64 bits in > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > maybe we can modify the condition to use full_fds_bits ? So maybe I'm wrong but I think the biggest benefit of your code compared to plain find_next_fd() is exactly in that we don't have to load full_fds_bits into cache. So I'm afraid that using full_fds_bits in the condition would destroy your performance gains. Thinking about this with a fresh head how about putting implementing your optimization like: --- a/fs/file.c +++ b/fs/file.c @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) unsigned int maxbit = maxfd / BITS_PER_LONG; unsigned int bitbit = start / BITS_PER_LONG; + /* + * Optimistically search the first long of the open_fds bitmap. It + * saves us from loading full_fds_bits into cache in the common case + * and because BITS_PER_LONG > start >= files->next_fd, we have quite + * a good chance there's a bit free in there. + */ + if (start < BITS_PER_LONG) { + unsigned int bit; + + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); + if (bit < BITS_PER_LONG) + return bit; + } + bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; if (bitbit >= maxfd) return maxfd; Plus your optimizations with likely / unlikely. This way the code flow in alloc_fd() stays more readable, we avoid loading the first open_fds long into cache if it is full, and we should get all the performance benefits? Honza > > > > + goto fastreturn; > > > > + } > > > > fd = find_next_fd(fdt, fd); > > > > + } > > > > + > > > > + if (unlikely(fd >= fdt->max_fds)) { > > > > + error = expand_files(files, fd); > > > > + if (error < 0) > > > > + goto out; > > > > + /* > > > > + * If we needed to expand the fs array we > > > > + * might have blocked - try again. > > > > + */ > > > > + if (error) > > > > + goto repeat; > > > > + } > > > > +fastreturn: > > > > /* > > > > * N.B. For clone tasks sharing a files structure, this test > > > > * will limit the total number of files that can be opened. > > > > */ > > > > - error = -EMFILE; > > > > - if (fd >= end) > > > > + if (unlikely(fd >= end)) > > > > goto out; > > > > - error = expand_files(files, fd); > > > > - if (error < 0) > > > > - goto out; > > > > - > > > > - /* > > > > - * If we needed to expand the fs array we > > > > - * might have blocked - try again. > > > > - */ > > > > - if (error) > > > > - goto repeat; > > > > - > > > > if (start <= files->next_fd) > > > > files->next_fd = fd + 1; > > > > -- > > > > 2.43.0 > > > > > > > -- > > > Jan Kara <jack@suse.com> > > > SUSE Labs, CR > > > >
On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote: > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the > first free fd. > > > The conditions > > should either be like it is in patch or if (!start && !test_bit(0, > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > > but another point is that a bit in full_fds_bits represents 64 bits in > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > > maybe we can modify the condition to use full_fds_bits ? > > So maybe I'm wrong but I think the biggest benefit of your code compared to > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > into cache. So I'm afraid that using full_fds_bits in the condition would > destroy your performance gains. Thinking about this with a fresh head how > about putting implementing your optimization like: > > --- a/fs/file.c > +++ b/fs/file.c > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > unsigned int maxbit = maxfd / BITS_PER_LONG; > unsigned int bitbit = start / BITS_PER_LONG; > > + /* > + * Optimistically search the first long of the open_fds bitmap. It > + * saves us from loading full_fds_bits into cache in the common case > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > + * a good chance there's a bit free in there. > + */ > + if (start < BITS_PER_LONG) { > + unsigned int bit; > + > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); Say start is 31 (< BITS_PER_LONG) bit found here could be 32 and greater than start. Do we care if we return bit > start? Tim > + if (bit < BITS_PER_LONG) > + return bit; > + } > + > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; > if (bitbit >= maxfd) > return maxfd; > > Plus your optimizations with likely / unlikely. This way the code flow in > alloc_fd() stays more readable, we avoid loading the first open_fds long > into cache if it is full, and we should get all the performance benefits? > > Honza > > > > > > > + goto fastreturn; > > > > > + } > > > > > fd = find_next_fd(fdt, fd); > > > > > + } > > > > > + > > > > > + if (unlikely(fd >= fdt->max_fds)) { > > > > > + error = expand_files(files, fd); > > > > > + if (error < 0) > > > > > + goto out; > > > > > + /* > > > > > + * If we needed to expand the fs array we > > > > > + * might have blocked - try again. > > > > > + */ > > > > > + if (error) > > > > > + goto repeat; > > > > > + } > > > > > +fastreturn: > > > > > /* > > > > > * N.B. For clone tasks sharing a files structure, this test > > > > > * will limit the total number of files that can be opened. > > > > > */ > > > > > - error = -EMFILE; > > > > > - if (fd >= end) > > > > > + if (unlikely(fd >= end)) > > > > > goto out; > > > > > - error = expand_files(files, fd); > > > > > - if (error < 0) > > > > > - goto out; > > > > > - > > > > > - /* > > > > > - * If we needed to expand the fs array we > > > > > - * might have blocked - try again. > > > > > - */ > > > > > - if (error) > > > > > - goto repeat; > > > > > - > > > > > if (start <= files->next_fd) > > > > > files->next_fd = fd + 1; > > > > > -- > > > > > 2.43.0 > > > > > > > > > -- > > > > Jan Kara <jack@suse.com> > > > > SUSE Labs, CR > > > > > >
On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote: > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote: > > > > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the > > first free fd. > > > > > The conditions > > > should either be like it is in patch or if (!start && !test_bit(0, > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > > > but another point is that a bit in full_fds_bits represents 64 bits in > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > > > maybe we can modify the condition to use full_fds_bits ? > > > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > into cache. So I'm afraid that using full_fds_bits in the condition would > > destroy your performance gains. Thinking about this with a fresh head how > > about putting implementing your optimization like: > > > > --- a/fs/file.c > > +++ b/fs/file.c > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > unsigned int bitbit = start / BITS_PER_LONG; > > > > + /* > > + * Optimistically search the first long of the open_fds bitmap. It > > + * saves us from loading full_fds_bits into cache in the common case > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > + * a good chance there's a bit free in there. > > + */ > > + if (start < BITS_PER_LONG) { > > + unsigned int bit; > > + > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > Say start is 31 (< BITS_PER_LONG) > bit found here could be 32 and greater than start. Do we care if we return bit > start? Sorry, I mean to say that we could find a bit like 30 that is less than start instead of the other way round. Tim >
On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote: > So maybe I'm wrong but I think the biggest benefit of your code compared to > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > into cache. So I'm afraid that using full_fds_bits in the condition would > destroy your performance gains. Thinking about this with a fresh head how > about putting implementing your optimization like: > > --- a/fs/file.c > +++ b/fs/file.c > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > unsigned int maxbit = maxfd / BITS_PER_LONG; > unsigned int bitbit = start / BITS_PER_LONG; > > + /* > + * Optimistically search the first long of the open_fds bitmap. It > + * saves us from loading full_fds_bits into cache in the common case > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > + * a good chance there's a bit free in there. > + */ > + if (start < BITS_PER_LONG) { > + unsigned int bit; > + > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > + if (bit < BITS_PER_LONG) > + return bit; > + } > + > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; > if (bitbit >= maxfd) > return maxfd; > > Plus your optimizations with likely / unlikely. This way the code flow in > alloc_fd() stays more readable, we avoid loading the first open_fds long > into cache if it is full, and we should get all the performance benefits? > Huh. So when I read the patch previously I assumed this is testing the bit word for the map containing next_fd (whatever it is), avoiding looking at the higher level bitmap and inlining the op (instead of calling the fully fledged func for bit scans). I did not mentally register this is in fact only checking for the beginning of the range of the entire thing. So apologies from my end as based on my feedback some work was done and I'm going to ask to further redo it. blogbench spawns 100 or so workers, say total fd count hovers just above 100. say this lines up with about half of more cases in practice for that benchmark. Even so, that's a benchmark-specific optimization. A busy web server can have literally tens of thousands of fds open (and this is a pretty mundane case), making the 0-63 range not particularly interesting. That aside I think the patchset is in the wrong order -- first patch tries to not look at the higher level bitmap, while second reduces stores made there. This makes it quite unclear how much is it worth to reduce looking there if atomics are conditional. So here is what I propose in terms of the patches: 1. NULL check removal, sprinkling of likely/unlikely and expand_files call avoidance; no measurements done vs stock kernel for some effort saving, just denote in the commit message there is less work under the lock and treat it as baseline 2. conditional higher level bitmap clear as submitted; benchmarked against 1 3. open_fds check within the range containing fd, avoiding higher level bitmap if a free slot is found. this should not result in any func calls if successful; benchmarked against the above Optionally the bitmap routines can grow variants which always inline and are used here. If so that would probably land between 1 and 2 on the list. You noted you know about blogbench bugs and have them fixed. Would be good to post a link to a pull request or some other spot for a reference. I'll be best if the vfs folk comment on what they want here.
On Wed 26-06-24 09:52:50, Tim Chen wrote: > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote: > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote: > > > > > > > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the > > > first free fd. > > > > > > > The conditions > > > > should either be like it is in patch or if (!start && !test_bit(0, > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > > > > but another point is that a bit in full_fds_bits represents 64 bits in > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > > > > maybe we can modify the condition to use full_fds_bits ? > > > > > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > > into cache. So I'm afraid that using full_fds_bits in the condition would > > > destroy your performance gains. Thinking about this with a fresh head how > > > about putting implementing your optimization like: > > > > > > --- a/fs/file.c > > > +++ b/fs/file.c > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > > unsigned int bitbit = start / BITS_PER_LONG; > > > > > > + /* > > > + * Optimistically search the first long of the open_fds bitmap. It > > > + * saves us from loading full_fds_bits into cache in the common case > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > > + * a good chance there's a bit free in there. > > > + */ > > > + if (start < BITS_PER_LONG) { > > > + unsigned int bit; > > > + > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > > > Say start is 31 (< BITS_PER_LONG) > > bit found here could be 32 and greater than start. Do we care if we return bit > start? > > Sorry, I mean to say that we could find a bit like 30 that is less than > start instead of the other way round. Well, I propose calling find_next_zero_bit() with offset set to 'start' so it cannot possibly happen that the returned bit number is smaller than start... But maybe I'm missing something? Honza
On Thu, Jun 27, 2024 at 2:09 PM Jan Kara <jack@suse.cz> wrote: > > On Wed 26-06-24 09:52:50, Tim Chen wrote: > > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote: > > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote: > > > > > > > > > > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the > > > > first free fd. > > > > > > > > > The conditions > > > > > should either be like it is in patch or if (!start && !test_bit(0, > > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > > > > > but another point is that a bit in full_fds_bits represents 64 bits in > > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > > > > > maybe we can modify the condition to use full_fds_bits ? > > > > > > > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > > > into cache. So I'm afraid that using full_fds_bits in the condition would > > > > destroy your performance gains. Thinking about this with a fresh head how > > > > about putting implementing your optimization like: > > > > > > > > --- a/fs/file.c > > > > +++ b/fs/file.c > > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > > > unsigned int bitbit = start / BITS_PER_LONG; > > > > > > > > + /* > > > > + * Optimistically search the first long of the open_fds bitmap. It > > > > + * saves us from loading full_fds_bits into cache in the common case > > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > > > + * a good chance there's a bit free in there. > > > > + */ > > > > + if (start < BITS_PER_LONG) { > > > > + unsigned int bit; > > > > + > > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > > > > > Say start is 31 (< BITS_PER_LONG) > > > bit found here could be 32 and greater than start. Do we care if we return bit > start? > > > > Sorry, I mean to say that we could find a bit like 30 that is less than > > start instead of the other way round. > > Well, I propose calling find_next_zero_bit() with offset set to 'start' so > it cannot possibly happen that the returned bit number is smaller than > start... But maybe I'm missing something? > You gate it with " if (start < BITS_PER_LONG)" which only covers the small initital range, while I'm arguing this should work for any fd.
On Wed 26-06-24 21:13:07, Mateusz Guzik wrote: > On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote: > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > into cache. So I'm afraid that using full_fds_bits in the condition would > > destroy your performance gains. Thinking about this with a fresh head how > > about putting implementing your optimization like: > > > > --- a/fs/file.c > > +++ b/fs/file.c > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > unsigned int bitbit = start / BITS_PER_LONG; > > > > + /* > > + * Optimistically search the first long of the open_fds bitmap. It > > + * saves us from loading full_fds_bits into cache in the common case > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > + * a good chance there's a bit free in there. > > + */ > > + if (start < BITS_PER_LONG) { > > + unsigned int bit; > > + > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > + if (bit < BITS_PER_LONG) > > + return bit; > > + } > > + > > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; > > if (bitbit >= maxfd) > > return maxfd; > > > > Plus your optimizations with likely / unlikely. This way the code flow in > > alloc_fd() stays more readable, we avoid loading the first open_fds long > > into cache if it is full, and we should get all the performance benefits? > > > > Huh. > > So when I read the patch previously I assumed this is testing the bit > word for the map containing next_fd (whatever it is), avoiding looking > at the higher level bitmap and inlining the op (instead of calling the > fully fledged func for bit scans). > > I did not mentally register this is in fact only checking for the > beginning of the range of the entire thing. So apologies from my end > as based on my feedback some work was done and I'm going to ask to > further redo it. Well, just confirming the fact that the way the code was written was somewhat confusing ;) > blogbench spawns 100 or so workers, say total fd count hovers just > above 100. say this lines up with about half of more cases in practice > for that benchmark. > > Even so, that's a benchmark-specific optimization. A busy web server > can have literally tens of thousands of fds open (and this is a pretty > mundane case), making the 0-63 range not particularly interesting. I agree this optimization helps only processes with low number of open fds. On the other hand that is usually the majority of the processes on the system so the optimization makes sense to me. That being said your idea of searching the word with next_fd makes sense as well... > That aside I think the patchset is in the wrong order -- first patch > tries to not look at the higher level bitmap, while second reduces > stores made there. This makes it quite unclear how much is it worth to > reduce looking there if atomics are conditional. > > So here is what I propose in terms of the patches: > 1. NULL check removal, sprinkling of likely/unlikely and expand_files > call avoidance; no measurements done vs stock kernel for some effort > saving, just denote in the commit message there is less work under the > lock and treat it as baseline > 2. conditional higher level bitmap clear as submitted; benchmarked against 1 > 3. open_fds check within the range containing fd, avoiding higher > level bitmap if a free slot is found. this should not result in any > func calls if successful; benchmarked against the above Yeah, I guess this ordering is the most obvious -> the least obvious so it makes sense to me. Honza
On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote: > On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote: > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > into cache. So I'm afraid that using full_fds_bits in the condition would > > destroy your performance gains. Thinking about this with a fresh head how > > about putting implementing your optimization like: > > > > --- a/fs/file.c > > +++ b/fs/file.c > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > unsigned int bitbit = start / BITS_PER_LONG; > > > > + /* > > + * Optimistically search the first long of the open_fds bitmap. It > > + * saves us from loading full_fds_bits into cache in the common case > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > + * a good chance there's a bit free in there. > > + */ > > + if (start < BITS_PER_LONG) { > > + unsigned int bit; > > + > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > + if (bit < BITS_PER_LONG) > > + return bit; > > + } > > + > > bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; > > if (bitbit >= maxfd) > > return maxfd; > > > > Plus your optimizations with likely / unlikely. This way the code flow in > > alloc_fd() stays more readable, we avoid loading the first open_fds long > > into cache if it is full, and we should get all the performance benefits? > > > > Huh. > > So when I read the patch previously I assumed this is testing the bit > word for the map containing next_fd (whatever it is), avoiding looking > at the higher level bitmap and inlining the op (instead of calling the > fully fledged func for bit scans). > > I did not mentally register this is in fact only checking for the > beginning of the range of the entire thing. So apologies from my end > as based on my feedback some work was done and I'm going to ask to > further redo it. > > blogbench spawns 100 or so workers, say total fd count hovers just > above 100. say this lines up with about half of more cases in practice > for that benchmark. > > Even so, that's a benchmark-specific optimization. A busy web server > can have literally tens of thousands of fds open (and this is a pretty > mundane case), making the 0-63 range not particularly interesting. > > That aside I think the patchset is in the wrong order -- first patch > tries to not look at the higher level bitmap, while second reduces > stores made there. This makes it quite unclear how much is it worth to > reduce looking there if atomics are conditional. > > So here is what I propose in terms of the patches: > 1. NULL check removal, sprinkling of likely/unlikely and expand_files > call avoidance; no measurements done vs stock kernel for some effort > saving, just denote in the commit message there is less work under the > lock and treat it as baseline > 2. conditional higher level bitmap clear as submitted; benchmarked against 1 > 3. open_fds check within the range containing fd, avoiding higher > level bitmap if a free slot is found. this should not result in any > func calls if successful; benchmarked against the above > > Optionally the bitmap routines can grow variants which always inline > and are used here. If so that would probably land between 1 and 2 on > the list. > > You noted you know about blogbench bugs and have them fixed. Would be > good to post a link to a pull request or some other spot for a > reference. > > I'll be best if the vfs folk comment on what they want here. Optimizing only the < BIT_PER_LONG seems less desirable then making it work for arbitrary next_fd. Imho, it'll also be easier to follow if everything follows the same logic.
On Thu, 2024-06-27 at 14:09 +0200, Jan Kara wrote: > On Wed 26-06-24 09:52:50, Tim Chen wrote: > > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote: > > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote: > > > > > > > > > > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the > > > > first free fd. > > > > > > > > > The conditions > > > > > should either be like it is in patch or if (!start && !test_bit(0, > > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost, > > > > > but another point is that a bit in full_fds_bits represents 64 bits in > > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way, > > > > > maybe we can modify the condition to use full_fds_bits ? > > > > > > > > So maybe I'm wrong but I think the biggest benefit of your code compared to > > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits > > > > into cache. So I'm afraid that using full_fds_bits in the condition would > > > > destroy your performance gains. Thinking about this with a fresh head how > > > > about putting implementing your optimization like: > > > > > > > > --- a/fs/file.c > > > > +++ b/fs/file.c > > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) > > > > unsigned int maxbit = maxfd / BITS_PER_LONG; > > > > unsigned int bitbit = start / BITS_PER_LONG; > > > > > > > > + /* > > > > + * Optimistically search the first long of the open_fds bitmap. It > > > > + * saves us from loading full_fds_bits into cache in the common case > > > > + * and because BITS_PER_LONG > start >= files->next_fd, we have quite > > > > + * a good chance there's a bit free in there. > > > > + */ > > > > + if (start < BITS_PER_LONG) { > > > > + unsigned int bit; > > > > + > > > > + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); > > > > > > Say start is 31 (< BITS_PER_LONG) > > > bit found here could be 32 and greater than start. Do we care if we return bit > start? > > > > Sorry, I mean to say that we could find a bit like 30 that is less than > > start instead of the other way round. > > Well, I propose calling find_next_zero_bit() with offset set to 'start' so > it cannot possibly happen that the returned bit number is smaller than > start... But maybe I'm missing something? Yes, you're right. the search begin at start bit so it is okay. Tim > > Honza
On 6/27/2024 11:33 PM, Christian Brauner wrote: > On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote: >> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote: >>> So maybe I'm wrong but I think the biggest benefit of your code compared to >>> plain find_next_fd() is exactly in that we don't have to load full_fds_bits >>> into cache. So I'm afraid that using full_fds_bits in the condition would >>> destroy your performance gains. Thinking about this with a fresh head how >>> about putting implementing your optimization like: >>> >>> --- a/fs/file.c >>> +++ b/fs/file.c >>> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start) >>> unsigned int maxbit = maxfd / BITS_PER_LONG; >>> unsigned int bitbit = start / BITS_PER_LONG; >>> >>> + /* >>> + * Optimistically search the first long of the open_fds bitmap. It >>> + * saves us from loading full_fds_bits into cache in the common case >>> + * and because BITS_PER_LONG > start >= files->next_fd, we have quite >>> + * a good chance there's a bit free in there. >>> + */ >>> + if (start < BITS_PER_LONG) { >>> + unsigned int bit; >>> + >>> + bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start); >>> + if (bit < BITS_PER_LONG) >>> + return bit; >>> + } >>> + >>> bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG; >>> if (bitbit >= maxfd) >>> return maxfd; >>> >>> Plus your optimizations with likely / unlikely. This way the code flow in >>> alloc_fd() stays more readable, we avoid loading the first open_fds long >>> into cache if it is full, and we should get all the performance benefits? >>> >> Huh. >> >> So when I read the patch previously I assumed this is testing the bit >> word for the map containing next_fd (whatever it is), avoiding looking >> at the higher level bitmap and inlining the op (instead of calling the >> fully fledged func for bit scans). >> >> I did not mentally register this is in fact only checking for the >> beginning of the range of the entire thing. So apologies from my end >> as based on my feedback some work was done and I'm going to ask to >> further redo it. >> >> blogbench spawns 100 or so workers, say total fd count hovers just >> above 100. say this lines up with about half of more cases in practice >> for that benchmark. >> >> Even so, that's a benchmark-specific optimization. A busy web server >> can have literally tens of thousands of fds open (and this is a pretty >> mundane case), making the 0-63 range not particularly interesting. >> >> That aside I think the patchset is in the wrong order -- first patch >> tries to not look at the higher level bitmap, while second reduces >> stores made there. This makes it quite unclear how much is it worth to >> reduce looking there if atomics are conditional. >> >> So here is what I propose in terms of the patches: >> 1. NULL check removal, sprinkling of likely/unlikely and expand_files >> call avoidance; no measurements done vs stock kernel for some effort >> saving, just denote in the commit message there is less work under the >> lock and treat it as baseline >> 2. conditional higher level bitmap clear as submitted; benchmarked against 1 >> 3. open_fds check within the range containing fd, avoiding higher >> level bitmap if a free slot is found. this should not result in any >> func calls if successful; benchmarked against the above >> >> Optionally the bitmap routines can grow variants which always inline >> and are used here. If so that would probably land between 1 and 2 on >> the list. >> >> You noted you know about blogbench bugs and have them fixed. Would be >> good to post a link to a pull request or some other spot for a >> reference. >> >> I'll be best if the vfs folk comment on what they want here. > Optimizing only the < BIT_PER_LONG seems less desirable then making it > work for arbitrary next_fd. Imho, it'll also be easier to follow if > everything follows the same logic. Sorry that this message is a bit long. Thanks for your time to review. Firstly sincerely thanks all for the hot discussion and kind suggestions with your expertise to make the patch set better. At least, we already reached some agreements on removing sanity_check and adding conditional clear (p.s. I'll revise the bug_on to warn_on in fd_install() as aligned). I fully agree with Guzik's suggestion to resort the patches. As the remaining focus of discussion is around fast path, I suggest that we submit patch 1 & 2 (after reorder) for up-streaming firstly (with data remeasured on latest kernel version accordingly), then we focus on discussion for fast path. For this fast path idea, here I summarized some info for further discussion, why I still think it is valuable: 1. The original intention for fast path is to reduce func calls and avoid unnecessary load/store on the members sharing the same cacheline (such as file_lock, next_fd and the 3 bitmaps. BTW, we've tried to add __cacheline_aligned_in_smp for next_fd and fd array, no improvement observed), specially, yes, specially, all these operations are inside of critical section of file_lock. 2. For fast path implementation, the essential and simple point is to directly return an available bit if there is free bit in [0-63]. I'd emphasize that it does not only improve low number of open fds (even it is the majority case on system as Honza agreed), but also improve the cases that lots of fds open/close frequently with short task (as per the algorithm, lower bits will be prioritized to allocate after being recycled). Not only blogbench, a synthetic benchmark, but also the realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological performance case for __alloc_fd()"), which literally introduced this 2-levels bitmap searching algorithm to vfs as we see now. We may ask Eric for help to see whether it's possible to let us have some data on it. Besides, for those lots of fds are allocated and occupied for not-short time, the lock contention would be much less than the scenarios we're talking about here, then the impact of change would be much less. 3. Now we talk about the extra cost of fast path based on the patch we submitted. To be honest, before fast path, we firstly found that alloc_fd() is only called in two scenarios, as I mentioned in commit: (1) called by __get_unused_fd_flags() to find available fd start from bit 0, which is the most common usage. It means start==0 for alloc_fd(), and with this premise, alloc_fd() logic can be simpler, two branches for comparing to next_fd can be reduced; (2) dup a fd via dup_fd() to find a fd start from old_fd, which means "start" is not zero, but it is only called by fcntl. Then the first impression came into our mind is that why we sacrifice the performance of absolutely majority cases for this specific dup_fd. So we revised __get_unused_fd_flags() to not call alloc_fd() directly, but with the same logic as alloc_fd() by omitted the branches related to "start". Based on this version, we then found fast path would be possibly more efficient than the stock 2-levels bitmap searching based on the considerations stated in item 2 and commit message of this thread. Leaving aside the benefit, the extra cost is an conditional branch, but with 2 branches related to "start" has been reduced, it is still profitable, not even to say the cost can be alleviated by branch predictor. However, with this draft version, the code of __get_unused_fd_flags() duplicates a lot with alloc_fd(), then we change to current version for concise code. What I want to say is that there is space to make it faster with cost less than stock. For whether to use open_fds[0] as conditions for fast path, we think it's OK as all bitmaps are almost on the same cacheline, and we finaly need to write open_fds to allocate bit anyway. 4. Based on patches order as suggested by Guzik, we've re-measured the data on latest kernel 6.10-rc5, removing sanity_check and add likely/unlikely would have 6% gain for read, and 2% for write. Combined with conditional clear, it would have 14% gain for read, and 8% for write. If with fast path, it might have another ~15% gain to read (we do not re-measure this one yet due to time, will make up soon).
On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote: > 2. For fast path implementation, the essential and simple point is to > directly return an available bit if there is free bit in [0-63]. I'd > emphasize that it does not only improve low number of open fds (even it > is the majority case on system as Honza agreed), but also improve the > cases that lots of fds open/close frequently with short task (as per the > algorithm, lower bits will be prioritized to allocate after being > recycled). Not only blogbench, a synthetic benchmark, but also the > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological > performance case for __alloc_fd()"), which literally introduced this > 2-levels bitmap searching algorithm to vfs as we see now. I don't understand how using next_fd instead is supposed to be inferior. Maybe I should clarify that by API contract the kernel must return the lowest free fd it can find. To that end it maintains the next_fd field as a hint to hopefully avoid some of the search work. In the stock kernel the first thing done in alloc_fd is setting it as a starting point: fdt = files_fdtable(files); fd = start; if (fd < files->next_fd) fd = files->next_fd; that is all the calls which come here with 0 start their search from next_fd position. Suppose you implemented the patch as suggested by me and next_fd fits the range of 0-63. Then you get the benefit of lower level bitmap check just like in the patch you submitted, but without having to first branch on whether you happen to be in that range. Suppose next_fd is somewhere higher up, say 80. With your general approach the optimization wont be done whatsoever or it will be attempted at the 0-63 range when it is an invariant it finds no free fds. With what I'm suggesting the general idea of taking a peek at the lower level bitmap can be applied across the entire fd space. Some manual mucking will be needed to make sure this never pulls more than one cacheline, easiest way out I see would be to align next_fd to BITS_PER_LONG for the bitmap search purposes. Outside of the scope of this patchset, but definitely worth considering, is an observation that this still pulls an entire cacheline worth of a bitmap (assuming it grew). If one could assume that the size is always a multiply of 64 bytes (which it is after first expansion) the 64 byte scan could be entirely inlined -- there is quite a bit of fd fields in this range we may as well scan in hopes of avoiding looking at the higher level bitmap, after all we already paid for fetching it. This would take the optimization to its logical conclusion. Perhaps it would be ok to special-case the lower bitmap to start with 64 bytes so that there would be no need to branch on it.
On Thu 27-06-24 21:59:12, Mateusz Guzik wrote: > On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote: > > 2. For fast path implementation, the essential and simple point is to > > directly return an available bit if there is free bit in [0-63]. I'd > > emphasize that it does not only improve low number of open fds (even it > > is the majority case on system as Honza agreed), but also improve the > > cases that lots of fds open/close frequently with short task (as per the > > algorithm, lower bits will be prioritized to allocate after being > > recycled). Not only blogbench, a synthetic benchmark, but also the > > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological > > performance case for __alloc_fd()"), which literally introduced this > > 2-levels bitmap searching algorithm to vfs as we see now. > > I don't understand how using next_fd instead is supposed to be inferior. > > Maybe I should clarify that by API contract the kernel must return the > lowest free fd it can find. To that end it maintains the next_fd field > as a hint to hopefully avoid some of the search work. > > In the stock kernel the first thing done in alloc_fd is setting it as > a starting point: > fdt = files_fdtable(files); > fd = start; > if (fd < files->next_fd) > fd = files->next_fd; > > that is all the calls which come here with 0 start their search from > next_fd position. Yup. > Suppose you implemented the patch as suggested by me and next_fd fits > the range of 0-63. Then you get the benefit of lower level bitmap > check just like in the patch you submitted, but without having to > first branch on whether you happen to be in that range. > > Suppose next_fd is somewhere higher up, say 80. With your general > approach the optimization wont be done whatsoever or it will be > attempted at the 0-63 range when it is an invariant it finds no free > fds. > > With what I'm suggesting the general idea of taking a peek at the > lower level bitmap can be applied across the entire fd space. Some > manual mucking will be needed to make sure this never pulls more than > one cacheline, easiest way out I see would be to align next_fd to > BITS_PER_LONG for the bitmap search purposes. Well, all you need to do is to call: bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG], BITS_PER_LONG, start & (BITS_PER_LONG-1)); if (bit < BITS_PER_LONG) return bit + (start & ~(BITS_PER_LONG - 1)); in find_next_fd(). Not sure if this is what you meant by aligning next_fd to BITS_PER_LONG... Honza
On 6/28/2024 3:59 AM, Mateusz Guzik wrote: > On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote: >> 2. For fast path implementation, the essential and simple point is to >> directly return an available bit if there is free bit in [0-63]. I'd >> emphasize that it does not only improve low number of open fds (even it >> is the majority case on system as Honza agreed), but also improve the >> cases that lots of fds open/close frequently with short task (as per the >> algorithm, lower bits will be prioritized to allocate after being >> recycled). Not only blogbench, a synthetic benchmark, but also the >> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological >> performance case for __alloc_fd()"), which literally introduced this >> 2-levels bitmap searching algorithm to vfs as we see now. > I don't understand how using next_fd instead is supposed to be inferior. > > Maybe I should clarify that by API contract the kernel must return the > lowest free fd it can find. To that end it maintains the next_fd field > as a hint to hopefully avoid some of the search work. > > In the stock kernel the first thing done in alloc_fd is setting it as > a starting point: > fdt = files_fdtable(files); > fd = start; > if (fd < files->next_fd) > fd = files->next_fd; > > that is all the calls which come here with 0 start their search from > next_fd position. > > Suppose you implemented the patch as suggested by me and next_fd fits > the range of 0-63. Then you get the benefit of lower level bitmap > check just like in the patch you submitted, but without having to > first branch on whether you happen to be in that range. > > Suppose next_fd is somewhere higher up, say 80. With your general > approach the optimization wont be done whatsoever or it will be > attempted at the 0-63 range when it is an invariant it finds no free > fds. > > With what I'm suggesting the general idea of taking a peek at the > lower level bitmap can be applied across the entire fd space. Some > manual mucking will be needed to make sure this never pulls more than > one cacheline, easiest way out I see would be to align next_fd to > BITS_PER_LONG for the bitmap search purposes. Some misunderstanding here, Guzik, I thought you felt not so worth for fast path in previous feedback, so the whole message sent just wanna say we still think the original idea is reasonable. Back to the point here, the way to implement it in find_next_fd() by searching the word with next_fd makes sense and OK to me. It's efficient, concise and should bring us the expected benefits. I'll re-measure the data for reference based on the code proposed by you and Honza. > Outside of the scope of this patchset, but definitely worth > considering, is an observation that this still pulls an entire > cacheline worth of a bitmap (assuming it grew). If one could assume > that the size is always a multiply of 64 bytes (which it is after > first expansion) the 64 byte scan could be entirely inlined -- there > is quite a bit of fd fields in this range we may as well scan in hopes > of avoiding looking at the higher level bitmap, after all we already > paid for fetching it. This would take the optimization to its logical > conclusion. > > Perhaps it would be ok to special-case the lower bitmap to start with > 64 bytes so that there would be no need to branch on it.
On 6/28/2024 5:12 PM, Jan Kara wrote: > On Thu 27-06-24 21:59:12, Mateusz Guzik wrote: >> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote: >>> 2. For fast path implementation, the essential and simple point is to >>> directly return an available bit if there is free bit in [0-63]. I'd >>> emphasize that it does not only improve low number of open fds (even it >>> is the majority case on system as Honza agreed), but also improve the >>> cases that lots of fds open/close frequently with short task (as per the >>> algorithm, lower bits will be prioritized to allocate after being >>> recycled). Not only blogbench, a synthetic benchmark, but also the >>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological >>> performance case for __alloc_fd()"), which literally introduced this >>> 2-levels bitmap searching algorithm to vfs as we see now. >> I don't understand how using next_fd instead is supposed to be inferior. >> >> Maybe I should clarify that by API contract the kernel must return the >> lowest free fd it can find. To that end it maintains the next_fd field >> as a hint to hopefully avoid some of the search work. >> >> In the stock kernel the first thing done in alloc_fd is setting it as >> a starting point: >> fdt = files_fdtable(files); >> fd = start; >> if (fd < files->next_fd) >> fd = files->next_fd; >> >> that is all the calls which come here with 0 start their search from >> next_fd position. > Yup. > >> Suppose you implemented the patch as suggested by me and next_fd fits >> the range of 0-63. Then you get the benefit of lower level bitmap >> check just like in the patch you submitted, but without having to >> first branch on whether you happen to be in that range. >> >> Suppose next_fd is somewhere higher up, say 80. With your general >> approach the optimization wont be done whatsoever or it will be >> attempted at the 0-63 range when it is an invariant it finds no free >> fds. >> >> With what I'm suggesting the general idea of taking a peek at the >> lower level bitmap can be applied across the entire fd space. Some >> manual mucking will be needed to make sure this never pulls more than >> one cacheline, easiest way out I see would be to align next_fd to >> BITS_PER_LONG for the bitmap search purposes. > Well, all you need to do is to call: > > bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG], > BITS_PER_LONG, start & (BITS_PER_LONG-1)); > if (bit < BITS_PER_LONG) > return bit + (start & ~(BITS_PER_LONG - 1)); > > > in find_next_fd(). Not sure if this is what you meant by aligning next_fd > to BITS_PER_LONG... > > Honza So the idea here is to take a peek at the word contains next_fd, return directly if get free bit there. Not sure about the efficiency here if open/close fd frequently and next_fd is distributed very randomly. Just give a quick try for this part of code, seems kernel failed to boot up
How about you post a working version of the patchset, even if it only looks at 0-63 and we are going to massage it afterwards. On Sat, Jun 29, 2024 at 5:41 PM Ma, Yu <yu.ma@intel.com> wrote: > > > On 6/28/2024 5:12 PM, Jan Kara wrote: > > On Thu 27-06-24 21:59:12, Mateusz Guzik wrote: > >> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote: > >>> 2. For fast path implementation, the essential and simple point is to > >>> directly return an available bit if there is free bit in [0-63]. I'd > >>> emphasize that it does not only improve low number of open fds (even it > >>> is the majority case on system as Honza agreed), but also improve the > >>> cases that lots of fds open/close frequently with short task (as per the > >>> algorithm, lower bits will be prioritized to allocate after being > >>> recycled). Not only blogbench, a synthetic benchmark, but also the > >>> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological > >>> performance case for __alloc_fd()"), which literally introduced this > >>> 2-levels bitmap searching algorithm to vfs as we see now. > >> I don't understand how using next_fd instead is supposed to be inferior. > >> > >> Maybe I should clarify that by API contract the kernel must return the > >> lowest free fd it can find. To that end it maintains the next_fd field > >> as a hint to hopefully avoid some of the search work. > >> > >> In the stock kernel the first thing done in alloc_fd is setting it as > >> a starting point: > >> fdt = files_fdtable(files); > >> fd = start; > >> if (fd < files->next_fd) > >> fd = files->next_fd; > >> > >> that is all the calls which come here with 0 start their search from > >> next_fd position. > > Yup. > > > >> Suppose you implemented the patch as suggested by me and next_fd fits > >> the range of 0-63. Then you get the benefit of lower level bitmap > >> check just like in the patch you submitted, but without having to > >> first branch on whether you happen to be in that range. > >> > >> Suppose next_fd is somewhere higher up, say 80. With your general > >> approach the optimization wont be done whatsoever or it will be > >> attempted at the 0-63 range when it is an invariant it finds no free > >> fds. > >> > >> With what I'm suggesting the general idea of taking a peek at the > >> lower level bitmap can be applied across the entire fd space. Some > >> manual mucking will be needed to make sure this never pulls more than > >> one cacheline, easiest way out I see would be to align next_fd to > >> BITS_PER_LONG for the bitmap search purposes. > > Well, all you need to do is to call: > > > > bit = find_next_zero_bit(fdt->open_fds[start / BITS_PER_LONG], > > BITS_PER_LONG, start & (BITS_PER_LONG-1)); > > if (bit < BITS_PER_LONG) > > return bit + (start & ~(BITS_PER_LONG - 1)); > > > > > > in find_next_fd(). Not sure if this is what you meant by aligning next_fd > > to BITS_PER_LONG... > > > > Honza > > So the idea here is to take a peek at the word contains next_fd, return > directly if get free bit there. Not sure about the efficiency here if > open/close fd frequently and next_fd is distributed very randomly. Just > give a quick try for this part of code, seems kernel failed to boot > up
diff --git a/fs/file.c b/fs/file.c index a3b72aa64f11..50e900a47107 100644 --- a/fs/file.c +++ b/fs/file.c @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags) if (fd < files->next_fd) fd = files->next_fd; - if (fd < fdt->max_fds) + error = -EMFILE; + if (likely(fd < fdt->max_fds)) { + if (~fdt->open_fds[0] && !start) { + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd); + goto fastreturn; + } fd = find_next_fd(fdt, fd); + } + + if (unlikely(fd >= fdt->max_fds)) { + error = expand_files(files, fd); + if (error < 0) + goto out; + /* + * If we needed to expand the fs array we + * might have blocked - try again. + */ + if (error) + goto repeat; + } +fastreturn: /* * N.B. For clone tasks sharing a files structure, this test * will limit the total number of files that can be opened. */ - error = -EMFILE; - if (fd >= end) + if (unlikely(fd >= end)) goto out; - error = expand_files(files, fd); - if (error < 0) - goto out; - - /* - * If we needed to expand the fs array we - * might have blocked - try again. - */ - if (error) - goto repeat; - if (start <= files->next_fd) files->next_fd = fd + 1;