Message ID | 20230925205545.4135472-1-mjguzik@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | vfs: shave work on failed file open |
> +void fput_badopen(struct file *file)
> +{
> +        if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
> +                fput(file);
> +                return;
> +        }
> +
> +        if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
> +                fput(file);
> +                return;
> +        }
> +
> +        /* zero out the ref count to appease possible asserts */
> +        atomic_long_set(&file->f_count, 0);

Afaict this could just be:

        if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {

> +        file_free_badopen(file);
> +}
> +EXPORT_SYMBOL(fput_badopen);

Should definitely not be exported and only be available to core vfs
code. So this should go into fs/internal.h.
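For illustration, here is a rough sketch of how fput_badopen() might read with the suggested cmpxchg folded in. This is untested and only rearranges names already present in the patch; it is not an authoritative version of the function:

```c
void fput_badopen(struct file *file)
{
	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
		fput(file);
		return;
	}

	/*
	 * Claim the sole reference and zero the count in one step; fall
	 * back to the regular fput() if someone else unexpectedly holds
	 * a reference to a file that never got opened.
	 */
	if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
		fput(file);
		return;
	}

	file_free_badopen(file);
}
```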
On 9/26/23, Christian Brauner <brauner@kernel.org> wrote:
>> +void fput_badopen(struct file *file)
>> +{
>> +        if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
>> +                fput(file);
>> +                return;
>> +        }
>> +
>> +        if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
>> +                fput(file);
>> +                return;
>> +        }
>> +
>> +        /* zero out the ref count to appease possible asserts */
>> +        atomic_long_set(&file->f_count, 0);
>
> Afaict this could just be:
>
>         if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
>

This would bring back one of the atomics, but I'm not going to die on
this hill. If you insist on this change I'm going to have to tweak some
comments and bench again.

>> +        file_free_badopen(file);
>> +}
>> +EXPORT_SYMBOL(fput_badopen);
>
> Should definitely not be exported and only be available to core vfs
> code. So this should go into fs/internal.h.
>

Ok
> >         if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {

> bench again.

Can you see how much of a difference it makes because imho it really
looks a lot nicer than this ugly atomic_read followed by atomic_set...
On 9/26/23, Christian Brauner <brauner@kernel.org> wrote:
>> >         if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) {
>
>> bench again.
>
> Can you see how much of a difference it makes because imho it really
> looks a lot nicer than this ugly atomic_read followed by atomic_set...
>

Huh, turns out to be in the noise here and I see why. Immediately
following this there are several atomic ops anyway, notably to unref
creds and apparmor labels. So happens the top of the profile is the
allocator(!). These can be fixed, but that's perhaps for another time.

If going this route then perhaps atomic_long_dec_and_test just like
fput? Although one could argue that if the cmpxchg failed then something
fishy is going on and the "real" fput would be safer.

Ultimately there is a lot of handwaving possible whichever way, so just
pick something and I'll send a v2.
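For comparison, a hedged sketch of the atomic_long_dec_and_test() shape floated above (untested; names come from the patch). One wrinkle with this form is that once the count has been dropped there is no clean way to fall back to the regular fput(), which is presumably the trade-off being weighed against the cmpxchg variant:

```c
void fput_badopen(struct file *file)
{
	if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
		fput(file);
		return;
	}

	/* drop the reference the same way fput() does */
	if (likely(atomic_long_dec_and_test(&file->f_count)))
		file_free_badopen(file);
	else
		/* somebody else held a ref to a never-opened file */
		WARN_ON_ONCE(1);
}
```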
>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes: > Failed opens (mostly ENOENT) legitimately happen a lot, for example here > are stats from stracing kernel build for few seconds (strace -fc make): > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ------------------ > 0.76 0.076233 5 15040 3688 openat > (this is tons of header files tried in different paths) > Apart from a rare corner case where the file object is fully constructed > and we need to abort, there is a lot of overhead which can be avoided. > Most notably delegation of freeing to task_work, which comes with an > enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for > an example). > Benched with will-it-scale with a custom testcase based on > tests/open1.c: > [snip] > while (1) { > int fd = open("/tmp/nonexistent", O_RDONLY); > assert(fd == -1); > (*iterations)++; > } > [/snip] > Sapphire Rapids, one worker in single-threaded case (ops/s): > before: 1950013 > after: 2914973 (+49%) So what are the times in a multi-threaded case? Just wondering what happens if you have a bunch of makes or other jobs like that all running at once. > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> > --- > fs/file_table.c | 39 +++++++++++++++++++++++++++++++++++++++ > fs/namei.c | 2 +- > include/linux/file.h | 1 + > 3 files changed, 41 insertions(+), 1 deletion(-) > diff --git a/fs/file_table.c b/fs/file_table.c > index ee21b3da9d08..320dc1f9aa0e 100644 > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -82,6 +82,16 @@ static inline void file_free(struct file *f) > call_rcu(&f->f_rcuhead, file_free_rcu); > } > +static inline void file_free_badopen(struct file *f) > +{ > + BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED)); eww... what a BUG_ON() here? This seems *way* overkill to crash the system here, and you don't even check if f exists first as well, since I assume the caller checks it or already knows it? Why not just return an error here and keep going? What happens if you do? > + security_file_free(f); > + put_cred(f->f_cred); > + if (likely(!(f->f_mode & FMODE_NOACCOUNT))) > + percpu_counter_dec(&nr_files); > + kmem_cache_free(filp_cachep, f); > +} > + > /* > * Return the total number of open files in the system > */ > @@ -468,6 +478,35 @@ void __fput_sync(struct file *file) > EXPORT_SYMBOL(fput); > EXPORT_SYMBOL(__fput_sync); > +/* > + * Clean up after failing to open (e.g., open(2) returns with -ENOENT). > + * > + * This represents opportunities to shave on work in the common case compared > + * to the usual fput: > + * 1. vast majority of the time FMODE_OPENED is not set, meaning there is no > + * need to delegate to task_work > + * 2. if the above holds then we are guaranteed we have the only reference with > + * nobody else seeing the file, thus no need to use atomics to release it > + * 3. 
then there is no need to delegate freeing to RCU > + */ > +void fput_badopen(struct file *file) > +{ > + if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) { > + fput(file); > + return; > + } > + > + if (WARN_ON(atomic_long_read(&file->f_count) != 1)) { > + fput(file); > + return; > + } > + > + /* zero out the ref count to appease possible asserts */ > + atomic_long_set(&file->f_count, 0); > + file_free_badopen(file); > +} > +EXPORT_SYMBOL(fput_badopen); > + > void __init files_init(void) > { > filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, > diff --git a/fs/namei.c b/fs/namei.c > index 567ee547492b..67579fe30b28 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata *nd, > WARN_ON(1); > error = -EINVAL; > } > - fput(file); > + fput_badopen(file); > if (error == -EOPENSTALE) { > if (flags & LOOKUP_RCU) > error = -ECHILD; > diff --git a/include/linux/file.h b/include/linux/file.h > index 6e9099d29343..96300e27d9a8 100644 > --- a/include/linux/file.h > +++ b/include/linux/file.h > @@ -15,6 +15,7 @@ > struct file; > extern void fput(struct file *); > +extern void fput_badopen(struct file *); > struct file_operations; > struct task_struct; > -- > 2.39.2
On 9/26/23, John Stoffel <john@stoffel.org> wrote:
>>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes:
>
>> Failed opens (mostly ENOENT) legitimately happen a lot, for example here
>> are stats from stracing a kernel build for a few seconds (strace -fc make):
>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ------------------
>>   0.76    0.076233           5     15040      3688 openat
>
>> (this is tons of header files tried in different paths)
>
>> Apart from a rare corner case where the file object is fully constructed
>> and we need to abort, there is a lot of overhead which can be avoided.
>
>> Most notably delegation of freeing to task_work, which comes with an
>> enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)") for
>> an example).
>
>> Benched with will-it-scale with a custom testcase based on
>> tests/open1.c:
>> [snip]
>>         while (1) {
>>                 int fd = open("/tmp/nonexistent", O_RDONLY);
>>                 assert(fd == -1);
>
>>                 (*iterations)++;
>>         }
>> [/snip]
>
>> Sapphire Rapids, one worker in single-threaded case (ops/s):
>> before: 1950013
>> after:  2914973 (+49%)
>
>
> So what are the times in a multi-threaded case? Just wondering what
> happens if you have a bunch of makes or other jobs like that all
> running at once.
>

On my kernel they heavily bottleneck on apparmor, I already mailed the
author:
https://lore.kernel.org/all/CAGudoHFfG7mARwSqcoLNwV81-KX4Bici5FQHjoNG4f9m83oLyg@mail.gmail.com/
maybe I'll hack up the fix

When running without that LSM and on a *stock* kernel it heavily
bottlenecks somewhere in the bowels of SLUB + RCU.

Without the LSM and with the patch it scales almost perfectly, as one
would expect.

I don't have numbers or perf output handy.

>
>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>> ---
>>  fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
>>  fs/namei.c           |  2 +-
>>  include/linux/file.h |  1 +
>>  3 files changed, 41 insertions(+), 1 deletion(-)
>
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index ee21b3da9d08..320dc1f9aa0e 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f)
>>          call_rcu(&f->f_rcuhead, file_free_rcu);
>>  }
>
>> +static inline void file_free_badopen(struct file *f)
>> +{
>> +        BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
>
> eww... why a BUG_ON() here? This seems *way* overkill to crash the
> system here, and you don't even check if f exists first as well, since
> I assume the caller checks it or already knows it?
>
> Why not just return an error here and keep going? What happens if you do?
>

The only caller already checked these flags, so I think BUGing out is
prudent.

>
>> +        security_file_free(f);
>> +        put_cred(f->f_cred);
>> +        if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
>> +                percpu_counter_dec(&nr_files);
>> +        kmem_cache_free(filp_cachep, f);
>> +}
>> +
>>  /*
>>   * Return the total number of open files in the system
>>   */
>> @@ -468,6 +478,35 @@ void __fput_sync(struct file *file)
>>  EXPORT_SYMBOL(fput);
>>  EXPORT_SYMBOL(__fput_sync);
>
>> +/*
>> + * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
>> + *
>> + * This represents opportunities to shave on work in the common case compared
>> + * to the usual fput:
>> + * 1. vast majority of the time FMODE_OPENED is not set, meaning there is no
>> + *    need to delegate to task_work
>> + * 2. if the above holds then we are guaranteed we have the only reference with
>> + *    nobody else seeing the file, thus no need to use atomics to release it
>> + * 3. then there is no need to delegate freeing to RCU
>> + */
>> +void fput_badopen(struct file *file)
>> +{
>> +        if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
>> +                fput(file);
>> +                return;
>> +        }
>> +
>> +        if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
>> +                fput(file);
>> +                return;
>> +        }
>> +
>> +        /* zero out the ref count to appease possible asserts */
>> +        atomic_long_set(&file->f_count, 0);
>> +        file_free_badopen(file);
>> +}
>> +EXPORT_SYMBOL(fput_badopen);
>> +
>>  void __init files_init(void)
>>  {
>>          filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 567ee547492b..67579fe30b28 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata *nd,
>>                  WARN_ON(1);
>>                  error = -EINVAL;
>>          }
>> -        fput(file);
>> +        fput_badopen(file);
>>          if (error == -EOPENSTALE) {
>>                  if (flags & LOOKUP_RCU)
>>                          error = -ECHILD;
>> diff --git a/include/linux/file.h b/include/linux/file.h
>> index 6e9099d29343..96300e27d9a8 100644
>> --- a/include/linux/file.h
>> +++ b/include/linux/file.h
>> @@ -15,6 +15,7 @@
>>  struct file;
>
>>  extern void fput(struct file *);
>> +extern void fput_badopen(struct file *);
>
>>  struct file_operations;
>>  struct task_struct;
>> --
>> 2.39.2
>
>
>>>>> "Mateusz" == Mateusz Guzik <mjguzik@gmail.com> writes: > On 9/26/23, John Stoffel <john@stoffel.org> wrote: >> >>> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> >>> --- >>> fs/file_table.c | 39 +++++++++++++++++++++++++++++++++++++++ >>> fs/namei.c | 2 +- >>> include/linux/file.h | 1 + >>> 3 files changed, 41 insertions(+), 1 deletion(-) >> >>> diff --git a/fs/file_table.c b/fs/file_table.c >>> index ee21b3da9d08..320dc1f9aa0e 100644 >>> --- a/fs/file_table.c >>> +++ b/fs/file_table.c >>> @@ -82,6 +82,16 @@ static inline void file_free(struct file *f) >>> call_rcu(&f->f_rcuhead, file_free_rcu); >>> } >> >>> +static inline void file_free_badopen(struct file *f) >>> +{ >>> + BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED)); >> >> eww... what a BUG_ON() here? This seems *way* overkill to crash the >> system here, and you don't even check if f exists first as well, since >> I assume the caller checks it or already knows it? >> >> Why not just return an error here and keep going? What happens if you do? >> > The only caller already checked these flags, so I think BUGing out is prudent. So how would the flags change if they had been checked before? And if they are wrong, why not just exit without doing anything? Crashing the system just because you can't free some memory seems like a horrible thing to do. Linus has said multiple times that BUG_ON() isn't the answer. You should just do a WARN_ON() instead. Or WARN_ONCE(), don't just kill the entire system like this. John
diff --git a/fs/file_table.c b/fs/file_table.c
index ee21b3da9d08..320dc1f9aa0e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -82,6 +82,16 @@ static inline void file_free(struct file *f)
         call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
+static inline void file_free_badopen(struct file *f)
+{
+        BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
+        security_file_free(f);
+        put_cred(f->f_cred);
+        if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
+                percpu_counter_dec(&nr_files);
+        kmem_cache_free(filp_cachep, f);
+}
+
 /*
  * Return the total number of open files in the system
  */
@@ -468,6 +478,35 @@ void __fput_sync(struct file *file)
 EXPORT_SYMBOL(fput);
 EXPORT_SYMBOL(__fput_sync);
 
+/*
+ * Clean up after failing to open (e.g., open(2) returns with -ENOENT).
+ *
+ * This represents opportunities to shave on work in the common case compared
+ * to the usual fput:
+ * 1. vast majority of the time FMODE_OPENED is not set, meaning there is no
+ *    need to delegate to task_work
+ * 2. if the above holds then we are guaranteed we have the only reference with
+ *    nobody else seeing the file, thus no need to use atomics to release it
+ * 3. then there is no need to delegate freeing to RCU
+ */
+void fput_badopen(struct file *file)
+{
+        if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) {
+                fput(file);
+                return;
+        }
+
+        if (WARN_ON(atomic_long_read(&file->f_count) != 1)) {
+                fput(file);
+                return;
+        }
+
+        /* zero out the ref count to appease possible asserts */
+        atomic_long_set(&file->f_count, 0);
+        file_free_badopen(file);
+}
+EXPORT_SYMBOL(fput_badopen);
+
 void __init files_init(void)
 {
         filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
diff --git a/fs/namei.c b/fs/namei.c
index 567ee547492b..67579fe30b28 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata *nd,
                 WARN_ON(1);
                 error = -EINVAL;
         }
-        fput(file);
+        fput_badopen(file);
         if (error == -EOPENSTALE) {
                 if (flags & LOOKUP_RCU)
                         error = -ECHILD;
diff --git a/include/linux/file.h b/include/linux/file.h
index 6e9099d29343..96300e27d9a8 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -15,6 +15,7 @@
 struct file;
 
 extern void fput(struct file *);
+extern void fput_badopen(struct file *);
 
 struct file_operations;
 struct task_struct;
Failed opens (mostly ENOENT) legitimately happen a lot, for example here
are stats from stracing a kernel build for a few seconds (strace -fc make):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
  0.76    0.076233           5     15040      3688 openat

(this is tons of header files tried in different paths)

Apart from a rare corner case where the file object is fully constructed
and we need to abort, there is a lot of overhead which can be avoided.

Most notably delegation of freeing to task_work, which comes with an
enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)") for
an example).

Benched with will-it-scale with a custom testcase based on
tests/open1.c:

[snip]
        while (1) {
                int fd = open("/tmp/nonexistent", O_RDONLY);
                assert(fd == -1);

                (*iterations)++;
        }
[/snip]

Sapphire Rapids, one worker in single-threaded case (ops/s):
before: 1950013
after:  2914973 (+49%)

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 fs/file_table.c      | 39 +++++++++++++++++++++++++++++++++++++++
 fs/namei.c           |  2 +-
 include/linux/file.h |  1 +
 3 files changed, 41 insertions(+), 1 deletion(-)
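For readers without the will-it-scale harness, a minimal standalone approximation of the quoted testcase could look like the sketch below. It is an illustration only, not the actual tests/open1.c derivative used to produce the numbers above; the path is assumed not to exist:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	unsigned long iterations = 0;
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		/* exercise only the failed-open path */
		int fd = open("/tmp/nonexistent", O_RDONLY);
		assert(fd == -1);
		iterations++;
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while (now.tv_sec - start.tv_sec < 5);

	printf("%lu failed opens in ~5s (~%lu ops/s)\n",
	       iterations, iterations / 5);
	return 0;
}
```

Compiling with something like `gcc -O2 bench.c -o bench` and running it before and after the patch should show the same direction of change as the single-worker will-it-scale numbers, though absolute figures will differ.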