Message ID | 20240109194818.91465-1-amir73il@gmail.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] fsnotify: optimize the case of no access event watchers
On 1/9/24 12:48 PM, Amir Goldstein wrote:
> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> optimized the case where there are no fsnotify watchers on any of the
> filesystem's objects.
>
> It is quite common for a system to have a single local filesystem and
> it is quite common for the system to have some inotify watches on some
> config files or directories, so the optimization of no marks at all is
> often not in effect.
>
> Access event watchers are far less common, so optimizing the case of
> no marks with access events could improve performance for more systems,
> especially for the performance sensitive hot io path.
>
> Maintain a per-sb counter of objects that have marks with access
> events in their mask and use that counter to optimize out the call to
> fsnotify() in fsnotify access hooks.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Jens,
>
> You may want to try if this patch improves performance for your workload
> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.

Ran the usual test, and this effectively removes fsnotify from the
profiles, which (as per other email) is between 5-6% of CPU time. So I'd
say it looks mighty appealing!
On 1/9/24 1:12 PM, Jens Axboe wrote:
> On 1/9/24 12:48 PM, Amir Goldstein wrote:
>> [...]
>> You may want to try if this patch improves performance for your workload
>> with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.
>
> Ran the usual test, and this effectively removes fsnotify from the
> profiles, which (as per other email) is between 5-6% of CPU time. So I'd
> say it looks mighty appealing!

Tried with an IRQ based workload as well, as those are always impacted
more by the fsnotify slowness. This patch removes ~8% of useless
overhead in that case, so even bigger win there.
On Tue, Jan 9, 2024 at 10:24 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/9/24 1:12 PM, Jens Axboe wrote:
> > On 1/9/24 12:48 PM, Amir Goldstein wrote:
> > > [...]
> >
> > Ran the usual test, and this effectively removes fsnotify from the
> > profiles, which (as per other email) is between 5-6% of CPU time. So I'd
> > say it looks mighty appealing!
>
> Tried with an IRQ based workload as well, as those are always impacted
> more by the fsnotify slowness. This patch removes ~8% of useless
> overhead in that case, so even bigger win there.
>

Do the IRQ based workloads always go through io_req_io_end()?
Meaning that unlike the polled io workloads, they also incur the
overhead of the fsnotify_{modify,access}() hooks?
I remember I asked you once (cannot find where) whether
io_complete_rw_iopoll() needs the fsnotify hooks and you said that
it is a highly specialized code path for fast io, whose users will not
want those access/modify hooks.

Considering the fact that fsnotify_{modify,access}() could just as well
be bypassed by mmap() read/write, I fully support this reasoning.

Anyway, that explains (to me) why compiling-out the fsnotify_perm()
hooks took away all the regression that you observed in upstream,
because I was wondering where the overhead of fsnotify_access() was.

Jan,

What are your thoughts about this optimization patch?

My thoughts are that the optimization is clearly a win, but do we
really want to waste a full long in super_block for counting access
event watchers that may never exist?

Should we perhaps instead use a flag to say that "access watchers
existed"?

We could put s_fsnotify_access_watchers inside a struct
fsnotify_sb_mark_connector and special case alloc/free of the
FSNOTIFY_OBJ_TYPE_SB connector. Using a specialized
fsnotify_sb_mark_connector will make room for similar optimizations in
the future and could be a placeholder for the "default permission mask"
that we discussed.

Thoughts?

Thanks,
Amir.
On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> My thoughts are that the optimization is clearly a win, but do we
> really want to waste a full long in super_block for counting access
> event watchers that may never exist?

Would it make more sense for it to be global, and perhaps even use the
static key infrastructure to enable/disable fsnotify?
On Wed, Jan 10, 2024 at 2:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> > My thoughts are that the optimization is clearly a win, but do we
> > really want to waste a full long in super_block for counting access
> > event watchers that may never exist?
>
> Would it make more sense for it to be global,

Ironically, we once tried to disable fsnotify hooks on pipefs and found
out (the hard way) that some people are using the IN_ACCESS event to get
notified on pipe reads or something like that, so the global option may
result in less predictable performance.

> and perhaps even use the
> static key infrastructure to enable/disable fsnotify?

We are talking about disabling the specific fsnotify hooks
fsnotify_access() and fsnotify_perm(). I doubt that the static key
infrastructure is required, because with this patch, Jens did not
observe any regression compared to the code being compiled out.

Thanks,
Amir.
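[Editor's note: for readers unfamiliar with Willy's suggestion, a global static key gating a hook would look roughly like the following kernel-style sketch. It is illustrative pseudocode, not compilable standalone, and the key/function names are invented for the example; the jump-label macros themselves (DEFINE_STATIC_KEY_FALSE, static_branch_unlikely, static_branch_inc/dec) are the real kernel API.]

	/* illustrative names, not from the patch */
	DEFINE_STATIC_KEY_FALSE(fsnotify_access_watchers_key);

	static inline int fsnotify_access_hook(struct file *file)
	{
		/* with the key disabled, this branch is patched out
		 * to a no-op at runtime - near-zero hot-path cost */
		if (!static_branch_unlikely(&fsnotify_access_watchers_key))
			return 0;
		return __fsnotify_access_slowpath(file);
	}

	/* first access-event mark added anywhere:
	 *	static_branch_inc(&fsnotify_access_watchers_key);
	 * last one removed:
	 *	static_branch_dec(&fsnotify_access_watchers_key); */

As Amir notes, the catch is that the key is global: one IN_ACCESS watch on pipefs would re-enable the hook for every filesystem, whereas the per-sb counter confines the cost to the watched sb.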
On Tue 09-01-24 21:48:18, Amir Goldstein wrote:
> Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> optimized the case where there are no fsnotify watchers on any of the
> filesystem's objects.
>
> It is quite common for a system to have a single local filesystem and
> it is quite common for the system to have some inotify watches on some
> config files or directories, so the optimization of no marks at all is
> often not in effect.

I agree.

> Access event watchers are far less common, so optimizing the case of
> no marks with access events could improve performance for more systems,
> especially for the performance sensitive hot io path.
>
> Maintain a per-sb counter of objects that have marks with access
> events in their mask and use that counter to optimize out the call to
> fsnotify() in fsnotify access hooks.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>

I'm not saying no to this but let's discuss first before hacking in some
partial solution :). AFAIU what Jens sees is something similar as was
reported for example here [1]. In these cases we go through
fsnotify_parent() down to fsnotify() until the check:

	if (!(test_mask & marks_mask))
		return 0;

there. And this is all relatively cheap (pure fetches like
sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks,
sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask)
except for parent handling in __fsnotify_parent(). That requires
multiple atomic operations - take_dentry_name_snapshot() is lock &
unlock of d_lock, dget() is cmpxchg on d_lockref, dput() is another
cmpxchg on d_lockref - and this gets really expensive, even more so if
multiple threads race for the same parent dentry.

So I think what we ideally need is a way to avoid this expensive
"stabilize parent & its name" game unless we are pretty sure we are
going to generate the event. There's no way to avoid fetching the parent
early if dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we can still
postpone the take_dentry_name_snapshot() cost though). However for the
case where dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and
this is the case here AFAICT) we need the parent only after the check of
'test_mask & marks_mask'. So in that case we could completely avoid the
extra cost of parent dentry handling.

So in principle I believe we could make fsnotify noticeably lighter for
the case where no event is generated. Just the question is how to
refactor the current set of functions to achieve this without creating
an unmaintainable mess. I suspect if we lifted creation & some
prefilling of iter_info into __fsnotify_parent() and then fill in the
parent in case we need it for reporting only in fsnotify(), the code
could be reasonably readable. We'd need to always propagate the dentry
down to fsnotify() though, currently we often propagate only the inode
because the dentry may be (still in case of create or already in case of
unlink) negative. Similarly we'd somehow need to communicate down into
fsnotify() whether the passed name needs snapshotting or not...

What do you think?

								Honza

[1] https://lore.kernel.org/all/SJ0PR08MB6494F5A32B7C60A5AD8B33C2AB09A@SJ0PR08MB6494.namprd08.prod.outlook.com
On Wed, Jan 10, 2024 at 3:56 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 09-01-24 21:48:18, Amir Goldstein wrote:
> > [...]
>
> I'm not saying no to this but let's discuss first before hacking in some
> partial solution :). AFAIU what Jens sees is something similar as was
> reported for example here [1]. In these cases we go through
> fsnotify_parent() down to fsnotify() until the check:
>
>	if (!(test_mask & marks_mask))
>		return 0;
>
> there. And this is all relatively cheap (pure fetches like
> sb->s_fsnotify_marks, mnt->mnt_fsnotify_marks, inode->i_fsnotify_marks,
> sb->s_fsnotify_mask, mnt->mnt_fsnotify_mask, inode->i_fsnotify_mask)
> except for parent handling in __fsnotify_parent(). That requires
> multiple atomic operations - take_dentry_name_snapshot() is lock &
> unlock of d_lock, dget() is cmpxchg on d_lockref, dput() is another
> cmpxchg on d_lockref - and this gets really expensive, even more so if
> multiple threads race for the same parent dentry.

Sorry, I forgot to link the Jens regression report [1] which included
this partial perf report:

     3.36%  [kernel.vmlinux]  [k] fsnotify
     2.32%  [kernel.vmlinux]  [k] __fsnotify_paren

Your analysis about __fsnotify_parent() may be correct, but what would
be the explanation for the time spent in fsnotify() in this report?

Jens,

Can you provide an expanded view of the perf functions called by
fsnotify() and __fsnotify_parent()?

In general, I think that previous optimization work, such as this commit
by Mel:

71d734103edf ("fsnotify: Rearrange fast path to minimise overhead when
there is no watcher")

showed improvements when checks were inlined.

[1] https://lore.kernel.org/linux-fsdevel/53682ece-f0e7-48de-9a1c-879ee34b0449@kernel.dk/

> So I think what we ideally need is a way to avoid this expensive
> "stabilize parent & its name" game unless we are pretty sure we are
> going to generate the event. There's no way to avoid fetching the
> parent early if dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED (we
> can still postpone the take_dentry_name_snapshot() cost though).
> However for the case where dentry->d_flags &
> DCACHE_FSNOTIFY_PARENT_WATCHED == 0 (and this is the case here AFAICT)
> we need the parent only after the check of 'test_mask & marks_mask'.
> So in that case we could completely avoid the extra cost of parent
> dentry handling.
>
> So in principle I believe we could make fsnotify noticeably lighter
> for the case where no event is generated. Just the question is how to
> refactor the current set of functions to achieve this without creating
> an unmaintainable mess. I suspect if we lifted creation & some
> prefilling of iter_info into __fsnotify_parent() and then fill in the
> parent in case we need it for reporting only in fsnotify(), the code
> could be reasonably readable. We'd need to always propagate the dentry
> down to fsnotify() though, currently we often propagate only the inode
> because the dentry may be (still in case of create or already in case
> of unlink) negative. Similarly we'd somehow need to communicate down
> into fsnotify() whether the passed name needs snapshotting or not...
>
> What do you think?

I am not saying no ;) but it sounds a bit complicated, so if the goal is
to reduce the overhead of fsnotify_access() and fsnotify_perm(), which I
don't think any application cares about, then I'd rather go with a much
simpler solution even if it does not cover all the corner cases.

Thanks,
Amir.
On Wed 10-01-24 16:41:46, Amir Goldstein wrote:
> On Wed, Jan 10, 2024 at 3:56 PM Jan Kara <jack@suse.cz> wrote:
> > [...]
>
> Sorry, I forgot to link the Jens regression report [1] which included
> this partial perf report:
>
>      3.36%  [kernel.vmlinux]  [k] fsnotify
>      2.32%  [kernel.vmlinux]  [k] __fsnotify_paren
>
> Your analysis about __fsnotify_parent() may be correct, but what would
> be the explanation for the time spent in fsnotify() in this report?

OK, I don't have a good explanation for why fsnotify() itself is so high
in Jens' profile. Maybe cacheline fetches caused by the
inode->i_fsnotify_mask and inode->i_fsnotify_marks checks are causing
the overhead but it would be good to have it confirmed.

> In general, I think that previous optimization work, such as this
> commit by Mel:
> 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead when
> there is no watcher")
> showed improvements when checks were inlined.

Well, I believe that commit helped exactly because the check for
DCACHE_FSNOTIFY_PARENT_WATCHED was moved ahead of all the work now in
__fsnotify_parent().

> [1] https://lore.kernel.org/linux-fsdevel/53682ece-f0e7-48de-9a1c-879ee34b0449@kernel.dk/
>
> > So I think what we ideally need is a way to avoid this expensive
> > "stabilize parent & its name" game unless we are pretty sure we are
> > going to generate the event. [...]
>
> I am not saying no ;) but it sounds a bit complicated, so if the goal
> is to reduce the overhead of fsnotify_access() and fsnotify_perm(),
> which I don't think any application cares about, then I'd rather go
> with a much simpler solution even if it does not cover all the corner
> cases.

OK, let's figure out what exactly causes the slowdown in Jens' case
first. I agree your solution helps mitigate the cost of
fsnotify_access() for reads but I foresee people complaining about
fsnotify_modify() cost for writes in short order :) and there it is not
so simple to solve as there's likely some watch for the FS_MODIFY event
somewhere.

								Honza
> > I am not saying no ;) but it sounds a bit complicated, so if the
> > goal is to reduce the overhead of fsnotify_access() and
> > fsnotify_perm(), which I don't think any application cares about,
> > then I'd rather go with a much simpler solution even if it does not
> > cover all the corner cases.
>
> OK, let's figure out what exactly causes the slowdown in Jens' case
> first. I agree your solution helps mitigate the cost of
> fsnotify_access() for reads but I foresee people complaining about
> fsnotify_modify() cost for writes in short order :) and there it is
> not so simple to solve as there's likely some watch for the FS_MODIFY
> event somewhere.

Actually, I think we may be able to eat the cake and leave it whole.

As I've written to you once in a different context, I think fsnotify
has two mostly non-overlapping use cases:

1. Watch FS_ACCESS/FS_MODIFY on selective inodes (a.k.a. tail)
2. Watch sb/mount/recursive subtree

Anyone who tries watching FS_ACCESS/FS_MODIFY on a large data set would
find it way too noisy to be useful. For example, we have a filesystem
monitor application which needs to know about changes to any file in
the filesystem, but we cannot use FS_MODIFY for that monitor because it
is too noisy, so we use FS_CLOSE_WRITE as second best.

I guess we cannot rule out that the use case of watching direct
children for FS_ACCESS/FS_MODIFY may exist in the wild.

IOW, I think we should be able to optimize most of the access/modify
hooks by checking those events specifically in the inline helpers
against:

1. the inode mask
2. the parent's watching-children mask
3. sb->s_iflags & SB_I_FSNOTIFY_ACCESS_MONITOR

The use of such an access monitor is currently unlikely, so I think
tainting the sb is good enough for now.

Note that the upcoming pre-content events (a.k.a HSM) are certainly
going to fall under this "unlikely" category - if you want to use HSM
(on-demand file content filling) on a filesystem, you will pay the
performance penalty for some intensive io workloads unless you use an
fd with FMODE_NONOTIFY. I think that is to be expected.

HSM is anyway expected to incur an extra performance penalty for
sb_write_barrier() (for the pre-modify events) and in my wip
sb_write_barrier branch [1], there is a similar optimization that does
activate_sb_write_barrier() on the first pre-modify event watch and
then leaves it activated on that sb forever.

I will try to write this patch.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/sb_write_barrier
On Wed, Jan 10, 2024 at 11:08:17AM +0200, Amir Goldstein wrote:
> On Tue, Jan 9, 2024 at 10:24 PM Jens Axboe <axboe@kernel.dk> wrote:
> > [...]
> >
> > Tried with an IRQ based workload as well, as those are always
> > impacted more by the fsnotify slowness. This patch removes ~8% of
> > useless overhead in that case, so even bigger win there.
>
> Do the IRQ based workloads always go through io_req_io_end()?
> Meaning that unlike the polled io workloads, they also incur the
> overhead of the fsnotify_{modify,access}() hooks?
>
> [...]
>
> Jan,
>
> What are your thoughts about this optimization patch?
>
> My thoughts are that the optimization is clearly a win, but do we
> really want to waste a full long in super_block for counting access
> event watchers that may never exist?

Meh, not too excited about it. Couldn't you use a flag in
s_fsnotify_mask? Maybe that's what you mean below.

> Should we perhaps instead use a flag to say that "access watchers
> existed"?
>
> We could put s_fsnotify_access_watchers inside a struct
> fsnotify_sb_mark_connector and special case alloc/free of
> FSNOTIFY_OBJ_TYPE_SB connector.

Would it make sense, without incurring performance impact, to move the
atomic_long_t s_fsnotify_connectors and your new atomic_long_t into a
struct fsnotify_data that gets allocated when an sb is created? Then we
don't waste space in struct super_block. This is basically a copy-pasta
of the LSM sb->s_security approach.
> > Jan,
> >
> > What are your thoughts about this optimization patch?
> >
> > My thoughts are that the optimization is clearly a win, but do we
> > really want to waste a full long in super_block for counting access
> > event watchers that may never exist?
>
> Meh, not too excited about it. Couldn't you use a flag in
> s_fsnotify_mask? Maybe that's what you mean below.

Yes, for my v2 I will use a "taint" flag in s_fsnotify_mask instead of
keeping count of content monitors.

> > Should we perhaps instead use a flag to say that "access watchers
> > existed"?
> >
> > We could put s_fsnotify_access_watchers inside a struct
> > fsnotify_sb_mark_connector and special case alloc/free of
> > FSNOTIFY_OBJ_TYPE_SB connector.
>
> Would it make sense without incurring performance impact to move
> atomic_long_t s_fsnotify_connectors and your new atomic_long_t into a
> struct fsnotify_data that gets allocated when an sb is created? Then we
> don't waste space in struct super_block. This is basically a copy-paste
> of the LSM sb->s_security approach.

It is not as trivial to move s_fsnotify_connectors, but I think it
should be possible, because basically, I don't think we need to ever
detach an sb connector once it is attached.

I did already look at this when writing the above suggestion, but it was
not straightforward so I left it for later.

This change would be relevant for another feature that I need for HSM
called "default event mask" (sort of like the default iptables rule), so
I will look into making those changes when I get to this feature.

Thanks,
Amir.
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index d6944ff86ffa..14dc581df414 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -153,9 +153,29 @@ static struct inode *fsnotify_update_iref(struct fsnotify_mark_connector *conn,
 	return inode;
 }
 
+/*
+ * To avoid the performance penalty of rare access event watchers in the hot
+ * io path, keep track of whether any such watchers exists on the filesystem.
+ */
+static void fsnotify_update_sb_watchers(struct fsnotify_mark_connector *conn,
+					u32 old_mask, u32 new_mask)
+{
+	struct super_block *sb = fsnotify_connector_sb(conn);
+	bool old_watchers = old_mask & ALL_FSNOTIFY_ACCESS_EVENTS;
+	bool new_watchers = new_mask & ALL_FSNOTIFY_ACCESS_EVENTS;
+
+	if (!sb || old_watchers == new_watchers)
+		return;
+
+	if (new_watchers)
+		atomic_long_inc(&sb->s_fsnotify_access_watchers);
+	else
+		atomic_long_dec(&sb->s_fsnotify_access_watchers);
+}
+
 static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 {
-	u32 new_mask = 0;
+	u32 old_mask, new_mask = 0;
 	bool want_iref = false;
 	struct fsnotify_mark *mark;
 
@@ -163,6 +183,7 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 	/* We can get detached connector here when inode is getting unlinked. */
 	if (!fsnotify_valid_obj_type(conn->type))
 		return NULL;
+	old_mask = fsnotify_conn_mask(conn);
 	hlist_for_each_entry(mark, &conn->list, obj_list) {
 		if (!(mark->flags & FSNOTIFY_MARK_FLAG_ATTACHED))
 			continue;
@@ -172,6 +193,7 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 		want_iref = true;
 	}
 	*fsnotify_conn_mask_p(conn) = new_mask;
+	fsnotify_update_sb_watchers(conn, old_mask, new_mask);
 	return fsnotify_update_iref(conn, want_iref);
 }
 
@@ -243,11 +265,13 @@ static void *fsnotify_detach_connector_from_object(
 					   unsigned int *type)
 {
 	struct inode *inode = NULL;
+	u32 old_mask;
 
 	*type = conn->type;
 	if (conn->type == FSNOTIFY_OBJ_TYPE_DETACHED)
 		return NULL;
 
+	old_mask = fsnotify_conn_mask(conn);
 	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
 		inode = fsnotify_conn_inode(conn);
 		inode->i_fsnotify_mask = 0;
@@ -261,6 +285,7 @@ static void *fsnotify_detach_connector_from_object(
 		fsnotify_conn_sb(conn)->s_fsnotify_mask = 0;
 	}
 
+	fsnotify_update_sb_watchers(conn, old_mask, 0);
 	fsnotify_put_sb_connectors(conn);
 	rcu_assign_pointer(*(conn->obj), NULL);
 	conn->obj = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42298095b7a5..643d2aeb037d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1282,6 +1282,7 @@ struct super_block {
 	 * inodes objects are currently double-accounted.
 	 */
 	atomic_long_t s_fsnotify_connectors;
+	atomic_long_t s_fsnotify_access_watchers;
 
 	/* Read-only state of the superblock is being changed */
 	int s_readonly_remount;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 8300a5286988..9dba2e038017 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -17,6 +17,14 @@
 #include <linux/slab.h>
 #include <linux/bug.h>
 
+/* Are there any inode/mount/sb objects that are being watched? */
+static inline int fsnotify_sb_has_watchers(struct super_block *sb, __u32 mask)
+{
+	if (mask & ALL_FSNOTIFY_ACCESS_EVENTS)
+		return atomic_long_read(&sb->s_fsnotify_access_watchers);
+	return atomic_long_read(&sb->s_fsnotify_connectors);
+}
+
 /*
  * Notify this @dir inode about a change in a child directory entry.
  * The directory entry may have turned positive or negative or its inode may
@@ -30,7 +38,7 @@ static inline int fsnotify_name(__u32 mask, const void *data, int data_type,
 				struct inode *dir, const struct qstr *name,
 				u32 cookie)
 {
-	if (atomic_long_read(&dir->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(dir->i_sb, mask))
 		return 0;
 
 	return fsnotify(mask, data, data_type, dir, name, NULL, cookie);
@@ -44,7 +52,7 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
 
 static inline void fsnotify_inode(struct inode *inode, __u32 mask)
 {
-	if (atomic_long_read(&inode->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(inode->i_sb, mask))
 		return;
 
 	if (S_ISDIR(inode->i_mode))
@@ -59,7 +67,7 @@ static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
 {
 	struct inode *inode = d_inode(dentry);
 
-	if (atomic_long_read(&inode->i_sb->s_fsnotify_connectors) == 0)
+	if (!fsnotify_sb_has_watchers(inode->i_sb, mask))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode)) {
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 7f63be5ca0f1..5e0e76cbd7ee 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -77,6 +77,8 @@
  */
 #define ALL_FSNOTIFY_DIRENT_EVENTS (FS_CREATE | FS_DELETE | FS_MOVE | FS_RENAME)
 
+#define ALL_FSNOTIFY_ACCESS_EVENTS (FS_ACCESS | FS_ACCESS_PERM)
+
 #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM | \
 				  FS_OPEN_EXEC_PERM)
Commit e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
optimized the case where there are no fsnotify watchers on any of the
filesystem's objects.

It is quite common for a system to have a single local filesystem and
it is quite common for the system to have some inotify watches on some
config files or directories, so the optimization of no marks at all is
often not in effect.

Access event watchers are far less common, so optimizing the case of
no marks with access events could improve performance for more systems,
especially for the performance sensitive hot io path.

Maintain a per-sb counter of objects that have marks with access
events in their mask and use that counter to optimize out the call to
fsnotify() in fsnotify access hooks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Jens,

You may want to try if this patch improves performance for your workload
with SECURITY=Y and FANOTIFY_ACCESS_PERMISSIONS=Y.

I am not at all sure that access events are the right thing to optimize.
I am pretty sure that access event watchers are rare in the wild, unlike
FS_MODIFY watchers that are common (e.g. tail -f).

Let me know how this patch works for you and we can continue the
discussion with Jan from there.

Thanks,
Amir.

 fs/notify/mark.c                 | 27 ++++++++++++++++++++++++++-
 include/linux/fs.h               |  1 +
 include/linux/fsnotify.h         | 14 +++++++++++---
 include/linux/fsnotify_backend.h |  2 ++
 4 files changed, 40 insertions(+), 4 deletions(-)