Message ID: 20200608140557.GG3127@techsingularity.net (mailing list archive)
State: New, archived
Series: fsnotify: Rearrange fast path to minimise overhead when there is no watcher
On Mon, Jun 8, 2020 at 5:05 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The fsnotify paths are trivial to hit even when there are no watchers and
> they are surprisingly expensive. For example, every successful vfs_write()
> hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
> FMODE_NONOTIFY is set, which is an internal flag invisible to userspace.
> As it stands, fsnotify_parent is a guaranteed functional call even if there
> are no watchers and fsnotify() does a substantial amount of unnecessary
> work before it checks if there are any watchers. A perf profile showed
> that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
> total samples taken in that function during a test. This patch rearranges
> the fast paths to reduce the amount of work done when there are no watchers.
>
> The test motivating this was "perf bench sched messaging --pipe". Despite
> the fact that the pipes are anonymous, fsnotify is still called a lot and
> the overhead is noticeable even though it's completely pointless. It's
> likely the overhead is negligible for real IO so this is an extreme
> example. This is a comparison of hackbench using processes and pipes on
> a 1-socket machine with 8 CPU threads without fanotify watchers.
>
>                        5.7.0              5.7.0
>                      vanilla  fastfsnotify-v1r1
> Amean 1      0.4837 (  0.00%)   0.4630 *  4.27%*
> Amean 3      1.5447 (  0.00%)   1.4557 (  5.76%)
> Amean 5      2.6037 (  0.00%)   2.4363 (  6.43%)
> Amean 7      3.5987 (  0.00%)   3.4757 (  3.42%)
> Amean 12     5.8267 (  0.00%)   5.6983 (  2.20%)
> Amean 18     8.4400 (  0.00%)   8.1327 (  3.64%)
> Amean 24    11.0187 (  0.00%)  10.0290 *  8.98%*
> Amean 30    13.1013 (  0.00%)  12.8510 (  1.91%)
> Amean 32    13.9190 (  0.00%)  13.2410 (  4.87%)
>
>                        5.7.0              5.7.0
>                      vanilla  fastfsnotify-v1r1
> Duration User         157.05             152.79
> Duration System      1279.98            1219.32
> Duration Elapsed      182.81             174.52
>
> This is showing that the latencies are improved by roughly 2-9%. The
> variability is not shown but some of these results are within the noise
> as this workload heavily overloads the machine. That said, the system CPU
> usage is reduced by quite a bit so it makes sense to avoid the overhead
> even if it is a bit tricky to detect at times. A perf profile of just 1
> group of tasks showed that 5.14% of samples taken were in either fsnotify()
> or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> mostly function entry and the initial check for watchers. The check for
> watchers is complicated enough that inlining it may be controversial.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hi Mel,

Thanks for looking into this.

> ---
> fs/notify/fsnotify.c | 25 +++++++++++++++----------
> include/linux/fsnotify.h | 10 ++++++++++
> include/linux/fsnotify_backend.h | 4 ++--
> 3 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 72d332ce8e12..de7bbfd973c0 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
> }
>
> /* Notify this dentry's parent about a child's events.
*/ > -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > int data_type) > { > struct dentry *parent; > @@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > > return ret; > } > -EXPORT_SYMBOL_GPL(fsnotify_parent); > +EXPORT_SYMBOL_GPL(__fsnotify_parent); > > static int send_to_group(struct inode *to_tell, > __u32 mask, const void *data, > @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > struct fsnotify_iter_info iter_info = {}; > struct super_block *sb = to_tell->i_sb; > struct mount *mnt = NULL; > - __u32 mnt_or_sb_mask = sb->s_fsnotify_mask; > + __u32 mnt_or_sb_mask; > int ret = 0; > - __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS); > + __u32 test_mask; > > - if (path) { > + if (path) > mnt = real_mount(path->mnt); > - mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > - } > - /* An event "on child" is not intended for a mount/sb mark */ > - if (mask & FS_EVENT_ON_CHILD) > - mnt_or_sb_mask = 0; > > /* > * Optimization: srcu_read_lock() has a memory barrier which can > @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && > (!mnt || !mnt->mnt_fsnotify_marks)) > return 0; > + > + /* An event "on child" is not intended for a mount/sb mark */ > + mnt_or_sb_mask = 0; > + if (!(mask & FS_EVENT_ON_CHILD)) { > + mnt_or_sb_mask = sb->s_fsnotify_mask; > + if (path) > + mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > + } > + Are you sure that loading ->_fsnotify_mask is so much more expensive than only checking ->_fsnotify_marks? They are surely on the same cache line. Isn't it possible that you just moved the penalty to ->_fsnotify_marks check with this change? Did you test performance with just the fsnotify_parent() change alone? 
In any case, this hunk seriously conflicts with a patch set I am working on, so I would rather not merge this change as is. If you provide me the feedback that testing ->_fsnotify_marks before loading ->_fsnotify_mask is beneficial on its own, then I will work this change into my series. > /* > * if this is a modify event we may need to clear the ignored masks > * otherwise return if neither the inode nor the vfsmount/sb care about > * this type of event. > */ > + test_mask = (mask & ALL_FSNOTIFY_EVENTS); > if (!(mask & FS_MODIFY) && > !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask))) > return 0; > diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h > index 5ab28f6c7d26..508f6bb0b06b 100644 > --- a/include/linux/fsnotify.h > +++ b/include/linux/fsnotify.h > @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry, > fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0); > } > > +/* Notify this dentry's parent about a child's events. */ > +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask, > + const void *data, int data_type) > +{ > + if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)) > + return 0; > + > + return __fsnotify_parent(dentry, mask, data, data_type); > +} > + This change looks good as is, but I'm afraid my series is going to make it obsolete. It may very well be that my series will introduce more of a performance penalty to your workload. It would be very much appreciated if you could run a test for me. I will gladly work some more optimizations into my series. You can find the relevant part of my work at: https://github.com/amir73il/linux/commits/fsnotify_name What this work does essentially is two things: 1. Call backend once instead of twice when both inode and parent are watching. 2. Snapshot name and parent inode to pass to backend not only when parent is watching, but also when an sb/mnt mark exists which requests to get file names with events. 
Compared to the existing implementation of fsnotify_parent(), my code needs to also test bits in inode->i_fsnotify_mask, inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask before the fast path can be taken. So it's back to square one w.r.t. your optimizations. I could add the same optimization as in fsnotify() to check ->_fsnotify_marks before ->_fsnotify_mask if you can provide me some concrete numbers. Thanks, Amir.
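To make the ordering question above concrete, here is a userspace sketch of the reordered fast path. The struct layouts, stub type names, and the `fsnotify_sketch` helper are illustrative stand-ins, not the kernel's definitions, and the FS_MODIFY/ignored-mask handling is omitted; the point shown is only the control flow: the marks lists are tested before any mask is loaded or combined.

```c
#include <stddef.h>

/* Illustrative stand-ins for the kernel structures (hypothetical layout). */
struct sb_stub    { void *s_fsnotify_marks;   unsigned int s_fsnotify_mask; };
struct mnt_stub   { void *mnt_fsnotify_marks; unsigned int mnt_fsnotify_mask; };
struct inode_stub { void *i_fsnotify_marks;   unsigned int i_fsnotify_mask;
                    struct sb_stub *i_sb; };

#define FS_EVENT_ON_CHILD 0x08000000u

/* Returns 0 when the fast path bails out, 1 when groups would be notified. */
int fsnotify_sketch(struct inode_stub *to_tell, unsigned int mask,
                    struct mnt_stub *mnt)
{
    /* Reordered fast path: test the marks lists first... */
    if (!to_tell->i_fsnotify_marks && !to_tell->i_sb->s_fsnotify_marks &&
        (!mnt || !mnt->mnt_fsnotify_marks))
        return 0;

    /* ...and only combine the sb/mnt masks once a watcher may exist. */
    unsigned int mnt_or_sb_mask = 0;
    if (!(mask & FS_EVENT_ON_CHILD)) {
        mnt_or_sb_mask = to_tell->i_sb->s_fsnotify_mask;
        if (mnt)
            mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
    }
    return (mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)) != 0;
}
```

The point of contention in the thread is whether deferring the mask loads buys anything when the marks pointers share a cacheline with the masks; the sketch shows only the control flow, not the cache behaviour.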
On Mon 08-06-20 15:05:57, Mel Gorman wrote: > The fsnotify paths are trivial to hit even when there are no watchers and > they are surprisingly expensive. For example, every successful vfs_write() > hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless > FMODE_NONOTIFY is set which is an internal flag invisible to userspace. > As it stands, fsnotify_parent is a guaranteed functional call even if there > are no watchers and fsnotify() does a substantial amount of unnecessary > work before it checks if there are any watchers. A perf profile showed > that applying mnt->mnt_fsnotify_mask in fnotify() was almost half of the > total samples taken in that function during a test. This patch rearranges > the fast paths to reduce the amount of work done when there are no watchers. > > The test motivating this was "perf bench sched messaging --pipe". Despite > the fact the pipes are anonymous, fsnotify is still called a lot and > the overhead is noticable even though it's completely pointless. It's > likely the overhead is negligible for real IO so this is an extreme > example. This is a comparison of hackbench using processes and pipes on > a 1-socket machine with 8 CPU threads without fanotify watchers. > > 5.7.0 5.7.0 > vanilla fastfsnotify-v1r1 > Amean 1 0.4837 ( 0.00%) 0.4630 * 4.27%* > Amean 3 1.5447 ( 0.00%) 1.4557 ( 5.76%) > Amean 5 2.6037 ( 0.00%) 2.4363 ( 6.43%) > Amean 7 3.5987 ( 0.00%) 3.4757 ( 3.42%) > Amean 12 5.8267 ( 0.00%) 5.6983 ( 2.20%) > Amean 18 8.4400 ( 0.00%) 8.1327 ( 3.64%) > Amean 24 11.0187 ( 0.00%) 10.0290 * 8.98%* > Amean 30 13.1013 ( 0.00%) 12.8510 ( 1.91%) > Amean 32 13.9190 ( 0.00%) 13.2410 ( 4.87%) > > 5.7.0 5.7.0 > vanilla fastfsnotify-v1r1 > Duration User 157.05 152.79 > Duration System 1279.98 1219.32 > Duration Elapsed 182.81 174.52 > > This is showing that the latencies are improved by roughly 2-9%. 
The > variability is not shown but some of these results are within the noise > as this workload heavily overloads the machine. That said, the system CPU > usage is reduced by quite a bit so it makes sense to avoid the overhead > even if it is a bit tricky to detect at times. A perf profile of just 1 > group of tasks showed that 5.14% of samples taken were in either fsnotify() > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify, > mostly function entry and the initial check for watchers. The check for > watchers is complicated enough that inlining it may be controversial. > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Thanks for the patch! I have to tell I'm surprised this small reordering helps so much. For pipe inode we will bail on: if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && (!mnt || !mnt->mnt_fsnotify_marks)) return 0; So what we save with the reordering is sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively. We also save a function call of fsnotify_parent() but I would think that is very cheap (compared to the whole write path) as well. The patch is simple enough so I have no problem merging it but I'm just surprised by the results... Hum, maybe the structure randomization is used in the builds and so e.g. sb->s_fsnotify_mask and sb->s_fsnotify_marks don't end up in the same cacheline? But I don't think we enable that in SUSE builds? Honza > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c > index 72d332ce8e12..de7bbfd973c0 100644 > --- a/fs/notify/fsnotify.c > +++ b/fs/notify/fsnotify.c > @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode) > } > > /* Notify this dentry's parent about a child's events. 
*/ > -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > int data_type) > { > struct dentry *parent; > @@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > > return ret; > } > -EXPORT_SYMBOL_GPL(fsnotify_parent); > +EXPORT_SYMBOL_GPL(__fsnotify_parent); > > static int send_to_group(struct inode *to_tell, > __u32 mask, const void *data, > @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > struct fsnotify_iter_info iter_info = {}; > struct super_block *sb = to_tell->i_sb; > struct mount *mnt = NULL; > - __u32 mnt_or_sb_mask = sb->s_fsnotify_mask; > + __u32 mnt_or_sb_mask; > int ret = 0; > - __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS); > + __u32 test_mask; > > - if (path) { > + if (path) > mnt = real_mount(path->mnt); > - mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > - } > - /* An event "on child" is not intended for a mount/sb mark */ > - if (mask & FS_EVENT_ON_CHILD) > - mnt_or_sb_mask = 0; > > /* > * Optimization: srcu_read_lock() has a memory barrier which can > @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && > (!mnt || !mnt->mnt_fsnotify_marks)) > return 0; > + > + /* An event "on child" is not intended for a mount/sb mark */ > + mnt_or_sb_mask = 0; > + if (!(mask & FS_EVENT_ON_CHILD)) { > + mnt_or_sb_mask = sb->s_fsnotify_mask; > + if (path) > + mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > + } > + > /* > * if this is a modify event we may need to clear the ignored masks > * otherwise return if neither the inode nor the vfsmount/sb care about > * this type of event. 
> */ > + test_mask = (mask & ALL_FSNOTIFY_EVENTS); > if (!(mask & FS_MODIFY) && > !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask))) > return 0; > diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h > index 5ab28f6c7d26..508f6bb0b06b 100644 > --- a/include/linux/fsnotify.h > +++ b/include/linux/fsnotify.h > @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry, > fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0); > } > > +/* Notify this dentry's parent about a child's events. */ > +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask, > + const void *data, int data_type) > +{ > + if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)) > + return 0; > + > + return __fsnotify_parent(dentry, mask, data, data_type); > +} > + > /* > * Simple wrappers to consolidate calls fsnotify_parent()/fsnotify() when > * an event is on a file/dentry. > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > index f0c506405b54..1626fa7d10ff 100644 > --- a/include/linux/fsnotify_backend.h > +++ b/include/linux/fsnotify_backend.h > @@ -379,7 +379,7 @@ struct fsnotify_mark { > /* main fsnotify call to send events */ > extern int fsnotify(struct inode *to_tell, __u32 mask, const void *data, > int data_type, const struct qstr *name, u32 cookie); > -extern int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > +extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data, > int data_type); > extern void __fsnotify_inode_delete(struct inode *inode); > extern void __fsnotify_vfsmount_delete(struct vfsmount *mnt); > @@ -541,7 +541,7 @@ static inline int fsnotify(struct inode *to_tell, __u32 mask, const void *data, > return 0; > } > > -static inline int fsnotify_parent(struct dentry *dentry, __u32 mask, > +static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask, > const void *data, int data_type) > { > return 0;
On Mon, Jun 08, 2020 at 06:03:56PM +0300, Amir Goldstein wrote: > > @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > > struct fsnotify_iter_info iter_info = {}; > > struct super_block *sb = to_tell->i_sb; > > struct mount *mnt = NULL; > > - __u32 mnt_or_sb_mask = sb->s_fsnotify_mask; > > + __u32 mnt_or_sb_mask; > > int ret = 0; > > - __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS); > > + __u32 test_mask; > > > > - if (path) { > > + if (path) > > mnt = real_mount(path->mnt); > > - mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > > - } > > - /* An event "on child" is not intended for a mount/sb mark */ > > - if (mask & FS_EVENT_ON_CHILD) > > - mnt_or_sb_mask = 0; > > > > /* > > * Optimization: srcu_read_lock() has a memory barrier which can > > @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > > if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && > > (!mnt || !mnt->mnt_fsnotify_marks)) > > return 0; > > + > > + /* An event "on child" is not intended for a mount/sb mark */ > > + mnt_or_sb_mask = 0; > > + if (!(mask & FS_EVENT_ON_CHILD)) { > > + mnt_or_sb_mask = sb->s_fsnotify_mask; > > + if (path) > > + mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; > > + } > > + > > Are you sure that loading ->_fsnotify_mask is so much more expensive > than only checking ->_fsnotify_marks? They are surely on the same cache line. > Isn't it possible that you just moved the penalty to ->_fsnotify_marks check > with this change? The profile indicated that the cost was in this line mnt_or_sb_mask |= mnt->mnt_fsnotify_mask; as opposed to the other checks. Hence, I deferred the calculation of mnt_or_sb_mask until it was definitely needed. However, it's perfectly possible that the line simply looked hot because the function entry was hot in general. > Did you test performance with just the fsnotify_parent() change alone? > No, but I can. 
I looked at the profile of fsnotify() first before moving on to fsnotify_parent() but I didn't test them in isolation as deferring unnecessary calculations in a fast path seemed reasonable. > In any case, this hunk seriously conflicts with a patch set I am working on, > so I would rather not merging this change as is. > > If you provide me the feedback that testing ->_fsnotify_marks before loading > ->_fsnotify_mask is beneficial on its own, then I will work this change into > my series. > Will do. If the fsnotify_parent() changes are useful on their own, I'll post a v2 of the patch with just that change. > > /* > > * if this is a modify event we may need to clear the ignored masks > > * otherwise return if neither the inode nor the vfsmount/sb care about > > * this type of event. > > */ > > + test_mask = (mask & ALL_FSNOTIFY_EVENTS); > > if (!(mask & FS_MODIFY) && > > !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask))) > > return 0; > > diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h > > index 5ab28f6c7d26..508f6bb0b06b 100644 > > --- a/include/linux/fsnotify.h > > +++ b/include/linux/fsnotify.h > > @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry, > > fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0); > > } > > > > +/* Notify this dentry's parent about a child's events. */ > > +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask, > > + const void *data, int data_type) > > +{ > > + if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED)) > > + return 0; > > + > > + return __fsnotify_parent(dentry, mask, data, data_type); > > +} > > + > > This change looks good as is, but I'm afraid my series is going to > make it obsolete. > It may very well be that my series will introduce more performance > penalty to your workload. > > It would be very much appreciated if you could run a test for me. > I will gladly work in some more optimizations into my series. 
> > You can find the relevant part of my work at: > https://github.com/amir73il/linux/commits/fsnotify_name > Sure. > What this work does essentially is two things: > 1. Call backend once instead of twice when both inode and parent are > watching. > 2. Snapshot name and parent inode to pass to backend not only when > parent is watching, but also when an sb/mnt mark exists which > requests to get file names with events. > > Compared to the existing implementation of fsnotify_parent(), > my code needs to also test bits in inode->i_fsnotify_mask, > inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask > before the fast path can be taken. > So its back to square one w.r.t your optimizations. > Seems fair but it may be worth noting that the changes appear to be optimising the case where there are watchers. The case where there are no watchers at all is also interesting and probably a lot more common. I didn't look too closely at your series as I'm not familiar with fsnotify in general. However, at a glance it looks like fsnotify_parent() executes a substantial amount of code even if there are no watchers but I could be wrong.
> > What this work does essentially is two things:
> > 1. Call backend once instead of twice when both inode and parent are
> > watching.
> > 2. Snapshot name and parent inode to pass to backend not only when
> > parent is watching, but also when an sb/mnt mark exists which
> > requests to get file names with events.
> >
> > Compared to the existing implementation of fsnotify_parent(),
> > my code needs to also test bits in inode->i_fsnotify_mask,
> > inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask
> > before the fast path can be taken.
> > So its back to square one w.r.t your optimizations.
> >
>
> Seems fair but it may be worth noting that the changes appear to be
> optimising the case where there are watchers. The case where there are
> no watchers at all is also interesting and probably a lot more common. I

My changes are not optimizations. They are for adding functionality.
Surely, that shouldn't come at a cost for the common case.

> didn't look too closely at your series as I'm not familiar with fsnotify
> in general. However, at a glance it looks like fsnotify_parent() executes
> a substantial amount of code even if there are no watchers but I could
> be wrong.
>

I don't know about substantial; I would say it is on par with the amount of
code that you tried to optimize out of fsnotify().

Before bailing out with the DCACHE_FSNOTIFY_PARENT_WATCHED
test, it also references d_inode->i_sb, real_mount(path->mnt)
and fetches all their ->x_fsnotify_mask fields.

I changed the call pattern from the open/modify/... hooks from:

fsnotify_parent(...);
fsnotify(...);

to:

fsnotify_parent(...); /* which calls fsnotify() */

So the NULL marks optimization could be done in the beginning of
fsnotify_parent() and it will be just as effective as it is in fsnotify().

Thanks,
Amir.
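The folded call pattern described above can be sketched in userspace C. The stub types and the helper name are hypothetical (the real series lives in the linked branch); the sketch only illustrates the shape of the change: one entry point bails out for the parent, the inode, the sb and the mount in a single check, so the NULL-marks optimization works from fsnotify_parent() just as it does from fsnotify().

```c
#include <stddef.h>

/* Hypothetical simplified stand-ins; field names only mirror the kernel's. */
struct sb_s     { void *s_fsnotify_marks; };
struct mnt_s    { void *mnt_fsnotify_marks; };
struct inode_s  { void *i_fsnotify_marks; struct sb_s *i_sb; };
struct dentry_s { unsigned int d_flags; struct inode_s *d_inode; };

#define DCACHE_FSNOTIFY_PARENT_WATCHED 0x1u

/*
 * Single entry point replacing the fsnotify_parent(); fsnotify(); pair:
 * returns 0 when no object has any watcher (the fast path), 1 when one
 * backend call would cover both the parent and the child.
 */
int fsnotify_parent_sketch(struct dentry_s *dentry, struct mnt_s *mnt)
{
    struct inode_s *inode = dentry->d_inode;

    if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED) &&
        !inode->i_fsnotify_marks && !inode->i_sb->s_fsnotify_marks &&
        (!mnt || !mnt->mnt_fsnotify_marks))
        return 0;   /* nobody is watching anything */

    return 1;       /* one backend call handles parent and child */
}
```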
On Mon, Jun 08, 2020 at 05:19:43PM +0200, Jan Kara wrote: > > This is showing that the latencies are improved by roughly 2-9%. The > > variability is not shown but some of these results are within the noise > > as this workload heavily overloads the machine. That said, the system CPU > > usage is reduced by quite a bit so it makes sense to avoid the overhead > > even if it is a bit tricky to detect at times. A perf profile of just 1 > > group of tasks showed that 5.14% of samples taken were in either fsnotify() > > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify, > > mostly function entry and the initial check for watchers. The check for > > watchers is complicated enough that inlining it may be controversial. > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > Thanks for the patch! I have to tell I'm surprised this small reordering > helps so much. For pipe inode we will bail on: > > if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && > (!mnt || !mnt->mnt_fsnotify_marks)) > return 0; > > So what we save with the reordering is sb->s_fsnotify_mask and > mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as > sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively. It is likely that the contribution of that change is marginal relative to the fsnotify_parent() call. I'll know by tomorrow morning at the latest. > We also > save a function call of fsnotify_parent() but I would think that is very > cheap (compared to the whole write path) as well. > To be fair, it is cheap but with this particular workload, we call vfs_write() a *lot* and the path is not that long so it builds up to 5% of samples overall. Given that these were anonymous pipes, it surprised me to see fsnotify at all which is why I took a closer look. > The patch is simple enough so I have no problem merging it but I'm just > surprised by the results... Hum, maybe the structure randomization is used > in the builds and so e.g. 
sb->s_fsnotify_mask and sb->s_fsnotify_marks > don't end up in the same cacheline? But I don't think we enable that in > SUSE builds? > Correct, GCC_PLUGIN_RANDSTRUCT was not set.
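Jan's cacheline argument can be checked mechanically in userspace. This sketch uses an illustrative struct (not the kernel's struct super_block) and an assumed 64-byte line size to show how one might verify that a marks pointer and its mask fall in the same cache line:

```c
#include <stddef.h>

/* Illustrative layout only: some earlier fields, then the adjacent pair. */
struct sb_like {
    char         padding[32];       /* stand-in for unrelated fields */
    void        *s_fsnotify_marks;  /* pointer tested by the fast path */
    unsigned int s_fsnotify_mask;   /* mask stored right next to it */
};

/* Returns 1 if two member offsets fall into the same cache line. */
int same_cacheline(size_t a, size_t b, size_t line_size)
{
    return (a / line_size) == (b / line_size);
}
```

With the assumed layout both offsets land in line 0, matching the intuition that loading the mask after testing the marks pointer is effectively free, unless structure randomization separates the fields.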
On Mon, Jun 8, 2020 at 7:50 PM Mel Gorman <mgorman@techsingularity.net> wrote: > > On Mon, Jun 08, 2020 at 05:19:43PM +0200, Jan Kara wrote: > > > This is showing that the latencies are improved by roughly 2-9%. The > > > variability is not shown but some of these results are within the noise > > > as this workload heavily overloads the machine. That said, the system CPU > > > usage is reduced by quite a bit so it makes sense to avoid the overhead > > > even if it is a bit tricky to detect at times. A perf profile of just 1 > > > group of tasks showed that 5.14% of samples taken were in either fsnotify() > > > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify, > > > mostly function entry and the initial check for watchers. The check for > > > watchers is complicated enough that inlining it may be controversial. > > > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > > > Thanks for the patch! I have to tell I'm surprised this small reordering > > helps so much. For pipe inode we will bail on: > > > > if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks && > > (!mnt || !mnt->mnt_fsnotify_marks)) > > return 0; > > > > So what we save with the reordering is sb->s_fsnotify_mask and > > mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as > > sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively. > > It is likely that the contribution of that change is marginal relative > to the fsnotify_parent() call. I'll know by tomorrow morning at the latest. > > > We also > > save a function call of fsnotify_parent() but I would think that is very > > cheap (compared to the whole write path) as well. > > > > To be fair, it is cheap but with this particular workload, we call > vfs_write() a *lot* and the path is not that long so it builds up to 5% > of samples overall. Given that these were anonymous pipes, it surprised > me to see fsnotify at all which is why I took a closer look. 
>

I should note that after:

7c49b8616460 fs/notify: optimize inotify/fsnotify code for unwatched files

which speaks of a similar workload, the code looked quite similar to your optimization. It was:

60f7ed8c7c4d fsnotify: send path type events to group with super block marks

that started accessing ->x_fsnotify_mask before ->x_fsnotify_marks, although I still find it hard to believe that this makes a real difference.

Thanks,
Amir.
On Mon, Jun 08, 2020 at 07:26:10PM +0300, Amir Goldstein wrote: > > > What this work does essentially is two things: > > > 1. Call backend once instead of twice when both inode and parent are > > > watching. > > > 2. Snapshot name and parent inode to pass to backend not only when > > > parent is watching, but also when an sb/mnt mark exists which > > > requests to get file names with events. > > > > > > Compared to the existing implementation of fsnotify_parent(), > > > my code needs to also test bits in inode->i_fsnotify_mask, > > > inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask > > > before the fast path can be taken. > > > So its back to square one w.r.t your optimizations. > > > > > > > Seems fair but it may be worth noting that the changes appear to be > > optimising the case where there are watchers. The case where there are > > no watchers at all is also interesting and probably a lot more common. I > > My changes are not optimizations. They are for adding functionality. > Surely, that shouldn't come at a cost for the common case. > My bad. I interpreted the folding of fsnotify_parent calling fsnotify to be a potential optimisation particularly if it bailed earlier (which it doesn't do but maybe it could). > > didn't look too closely at your series as I'm not familiar with fsnotify > > in general. However, at a glance it looks like fsnotify_parent() executes > > a substantial amount of code even if there are no watchers but I could > > be wrong. > > > > I don't about substantial, I would say it is on par with the amount of > code that you tries to optimize out of fsnotify(). > > Before bailing out with DCACHE_FSNOTIFY_PARENT_WATCHED > test, it also references d_inode->i_sb, real_mount(path->mnt) > and fetches all their ->x_fsnotify_mask fields. > > I changed the call pattern from open/modify/... 
hooks from:
> fsnotify_parent(...);
> fsnotify(...);
>
> to:
> fsnotify_parent(...); /* which calls fsnotify() */
>
> So the NULL marks optimization could be done in beginning of
> fsnotify_parent() and it will be just as effective as it is in fsnotify().
>

Something like that may be required because:

                       5.7.0              5.7.0              5.7.0              5.7.0
                     vanilla  fastfsnotify-v1r1  fastfsnotify-v2r1      amir-20200608
Amean 1      0.4837 (  0.00%)   0.4630 *  4.27%*   0.4597 *  4.96%*   0.4967 * -2.69%*
Amean 3      1.5447 (  0.00%)   1.4557 (  5.76%)   1.5310 (  0.88%)   1.6587 * -7.38%*
Amean 5      2.6037 (  0.00%)   2.4363 (  6.43%)   2.4237 (  6.91%)   2.6400 ( -1.40%)
Amean 7      3.5987 (  0.00%)   3.4757 (  3.42%)   3.6543 ( -1.55%)   3.9040 * -8.48%*
Amean 12     5.8267 (  0.00%)   5.6983 (  2.20%)   5.5903 (  4.06%)   6.2593 ( -7.43%)
Amean 18     8.4400 (  0.00%)   8.1327 (  3.64%)   7.7150 *  8.59%*   8.9940 ( -6.56%)
Amean 24    11.0187 (  0.00%)  10.0290 *  8.98%*   9.8977 * 10.17%*  11.7247 * -6.41%*
Amean 30    13.1013 (  0.00%)  12.8510 (  1.91%)  12.2087 *  6.81%*  14.0290 * -7.08%*
Amean 32    13.9190 (  0.00%)  13.2410 (  4.87%)  13.2900 (  4.52%)  14.7140 * -5.71%*

vanilla and fastfsnotify-v1r1 are the same. fastfsnotify-v2r1 is just the fsnotify_parent() change, which is mostly worse and may indicate that the first patch was reasonable. amir-20200608 is your branch as of today and it appears to introduce a substantial regression, albeit in an extreme case where fsnotify overhead is visible. The regressions are mostly larger than noise, with the caveat that it may be machine specific given that the machine is overloaded. I accept that adding extra functionality to fsnotify may be desirable but ideally it would not hurt the case where there are no watchers at all.

So what's the right way forward? The patch as-is even though the fsnotify() change itself may be marginal, a patch that just inlines the fast path of fsnotify_parent, or wait for the additional functionality and try to address the overhead on top?
> > > didn't look too closely at your series as I'm not familiar with fsnotify
> > > in general. However, at a glance it looks like fsnotify_parent() executes
> > > a substantial amount of code even if there are no watchers but I could
> > > be wrong.
> >
> > I don't know about substantial, I would say it is on par with the amount
> > of code that you tried to optimize out of fsnotify().
> >
> > Before bailing out with the DCACHE_FSNOTIFY_PARENT_WATCHED
> > test, it also references d_inode->i_sb, real_mount(path->mnt)
> > and fetches all their ->x_fsnotify_mask fields.
> >
> > I changed the call pattern from the open/modify/... hooks from:
> > fsnotify_parent(...);
> > fsnotify(...);
> >
> > to:
> > fsnotify_parent(...); /* which calls fsnotify() */
> >
> > So the NULL marks optimization could be done in the beginning of
> > fsnotify_parent() and it will be just as effective as it is in fsnotify().
>
> Something like that may be required because
>
> [...]
>
> vanilla and fastfsnotify-v1r1 are the same. fastfsnotify-v2r1 is just the
> fsnotify_parent() change which is mostly worse and may indicate that the
> first patch was reasonable. amir-20200608 is your branch as of today and
> it appears to introduce a substantial regression albeit in an extreme case
> where fsnotify overhead is visible. The regressions are mostly larger
> than noise with the caveat it may be machine specific given that the
> machine is overloaded. I accept that adding extra functionality to fsnotify
> may be desirable but ideally it would not hurt the case where there are
> no watchers at all.
>

Of course.
And thanks for catching this regression even before I posted the patches :-)

> So what's the right way forward? The patch as-is even though the fsnotify()
> change itself may be marginal, a patch that just inlines the fast path
> of fsnotify_parent or wait for the additional functionality and try and
> address the overhead on top?
>

Let me add your optimizations on top of my branch with the needed
adaptations and send you a branch for testing.

Thanks,
Amir.
On Mon, Jun 8, 2020 at 9:12 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [...]
>
> Let me add your optimizations on top of my branch with the needed
> adaptations and send you a branch for testing.

https://github.com/amir73il/linux/commits/fsnotify_name-for-mel

Cheers,
Amir.
On Mon, Jun 08, 2020 at 11:39:26PM +0300, Amir Goldstein wrote:
> > Let me add your optimizations on top of my branch with the needed
> > adaptations and send you a branch for testing.
>
> https://github.com/amir73il/linux/commits/fsnotify_name-for-mel
>

Sorry for the delay getting back. The machine was busy with other tests
and it took a while to reach this on the queue. Fairly good news:

hackbench-process-pipes
                          5.7.0                  5.7.0                  5.7.0                  5.7.0
                        vanilla      fastfsnotify-v1r1          amir-20200608           amir-for-mel
Amean     1        0.4837 (   0.00%)        0.4630 *   4.27%*        0.4967 *  -2.69%*        0.4680 (   3.24%)
Amean     3        1.5447 (   0.00%)        1.4557 (   5.76%)        1.6587 *  -7.38%*        1.4807 (   4.14%)
Amean     5        2.6037 (   0.00%)        2.4363 (   6.43%)        2.6400 (  -1.40%)        2.4900 (   4.37%)
Amean     7        3.5987 (   0.00%)        3.4757 (   3.42%)        3.9040 *  -8.48%*        3.5130 (   2.38%)
Amean     12       5.8267 (   0.00%)        5.6983 (   2.20%)        6.2593 (  -7.43%)        5.6967 (   2.23%)
Amean     18       8.4400 (   0.00%)        8.1327 (   3.64%)        8.9940 (  -6.56%)        7.7240 *   8.48%*
Amean     24      11.0187 (   0.00%)       10.0290 *   8.98%*       11.7247 *  -6.41%*        9.5793 *  13.06%*
Amean     30      13.1013 (   0.00%)       12.8510 (   1.91%)       14.0290 *  -7.08%*       12.1630 (   7.16%)
Amean     32      13.9190 (   0.00%)       13.2410 (   4.87%)       14.7140 *  -5.71%*       13.2457 *   4.84%*

The first two sets of results are the vanilla kernel and just my patch,
respectively, to have two baselines. amir-20200608 is the first git branch
you pointed me to and amir-for-mel is this latest branch. Comparing the
optimisation and your series, we get

hackbench-process-pipes
                          5.7.0                  5.7.0
              fastfsnotify-v1r1           amir-for-mel
Amean     1        0.4630 (   0.00%)        0.4680 (  -1.08%)
Amean     3        1.4557 (   0.00%)        1.4807 (  -1.72%)
Amean     5        2.4363 (   0.00%)        2.4900 (  -2.20%)
Amean     7        3.4757 (   0.00%)        3.5130 (  -1.07%)
Amean     12       5.6983 (   0.00%)        5.6967 (   0.03%)
Amean     18       8.1327 (   0.00%)        7.7240 (   5.03%)
Amean     24      10.0290 (   0.00%)        9.5793 (   4.48%)
Amean     30      12.8510 (   0.00%)       12.1630 (   5.35%)
Amean     32      13.2410 (   0.00%)       13.2457 (  -0.04%)

As you can see, your patches with the optimisation layered on top are
comparable to just the optimisation on its own. It's not universally
better, but it would not look like a regression when comparing releases.
The differences are mostly within the noise as there is some variability
involved for this workload, so I would not worry too much about it (the
caveats being that other machines may behave differently, as may other
workloads). A minor issue is that this is probably a bisection hazard. If
a series is merged and LKP points the finger somewhere in the middle of
your series, then I suggest you ask them to test the optimisation commit
ID to see if the regression goes away.

Thanks, Amir.
On Wed, Jun 10, 2020 at 3:59 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Sorry for the delay getting back. The machine was busy with other tests
> and it took a while to reach this on the queue. Fairly good news
>
> [...]
>
> As you can see, your patches with the optimisation layered on top is
> comparable to just the optimisation on its own. It's not universally
> better but it would not look like a regression when comparing releases.
> The differences are mostly within the noise as there is some variability
> involved for this workload so I would not worry too much about it (caveats
> are other machines may be different as well as other workloads).

Excellent!
Thanks for verifying.

TBH, this result is not surprising, because despite all the changes from
fastfsnotify-v1r1 to amir-for-mel, the code that is executed when there
are no watches should be quite similar. Without any unexpected compiler
optimizations that may differ between our versions, fsnotify hooks called
for directories should execute the exact same code.

fsnotify hooks called for non-directories (your workload) should execute
almost the same code. I spotted one additional test for access to
d_inode and one additional test of:
    dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED
in the entry to __fsnotify_parent().

> A minor issue is that this is probably a bisection hazard. If a series
> is merged and LKP points the finger somewhere in the middle of your
> series then I suggest you ask them to test the optimisation commit ID
> to see if the regression goes away.
>

No worries. I wasn't planning on submitting the altered patch.
I just wanted to let you test the final result.
I will apply your change before my series and make sure to keep the
optimizations while my changes are applied on top of that.

Thanks,
Amir.

On Wed, Jun 10, 2020 at 4:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> [...]
>
> No worries. I wasn't planning on submitting the altered patch.
> I just wanted to let you test the final result.
> I will apply your change before my series and make sure to keep the
> optimizations while my changes are applied on top of that.
>

FYI, just posted your patch with a minor style change at the bottom of
my prep patch series.

Thanks,
Amir.
On Fri, Jun 12, 2020 at 12:38:15PM +0300, Amir Goldstein wrote:
> > No worries. I wasn't planning on submitting the altered patch.
> > I just wanted to let you test the final result.
> > I will apply your change before my series and make sure to keep
> > optimizations while my changes are applied on top of that.
>
> FYI, just posted your patch with a minor style change at the bottom
> of my prep patch series.
>

Thanks!
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 72d332ce8e12..de7bbfd973c0 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
 }
 
 /* Notify this dentry's parent about a child's events. */
-int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
+int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 		    int data_type)
 {
 	struct dentry *parent;
@@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(fsnotify_parent);
+EXPORT_SYMBOL_GPL(__fsnotify_parent);
 
 static int send_to_group(struct inode *to_tell,
 			 __u32 mask, const void *data,
@@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
 	struct fsnotify_iter_info iter_info = {};
 	struct super_block *sb = to_tell->i_sb;
 	struct mount *mnt = NULL;
-	__u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
+	__u32 mnt_or_sb_mask;
 	int ret = 0;
-	__u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
+	__u32 test_mask;
 
-	if (path) {
+	if (path)
 		mnt = real_mount(path->mnt);
-		mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
-	}
-	/* An event "on child" is not intended for a mount/sb mark */
-	if (mask & FS_EVENT_ON_CHILD)
-		mnt_or_sb_mask = 0;
 
 	/*
 	 * Optimization: srcu_read_lock() has a memory barrier which can
@@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
 	if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
 	    (!mnt || !mnt->mnt_fsnotify_marks))
 		return 0;
+
+	/* An event "on child" is not intended for a mount/sb mark */
+	mnt_or_sb_mask = 0;
+	if (!(mask & FS_EVENT_ON_CHILD)) {
+		mnt_or_sb_mask = sb->s_fsnotify_mask;
+		if (path)
+			mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
+	}
+
 	/*
 	 * if this is a modify event we may need to clear the ignored masks
 	 * otherwise return if neither the inode nor the vfsmount/sb care about
 	 * this type of event.
 	 */
+	test_mask = (mask & ALL_FSNOTIFY_EVENTS);
 	if (!(mask & FS_MODIFY) &&
 	    !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
 		return 0;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 5ab28f6c7d26..508f6bb0b06b 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
 	fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
 }
 
+/* Notify this dentry's parent about a child's events. */
+static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
+				  const void *data, int data_type)
+{
+	if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
+		return 0;
+
+	return __fsnotify_parent(dentry, mask, data, data_type);
+}
+
 /*
  * Simple wrappers to consolidate calls fsnotify_parent()/fsnotify() when
  * an event is on a file/dentry.
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index f0c506405b54..1626fa7d10ff 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -379,7 +379,7 @@ struct fsnotify_mark {
 /* main fsnotify call to send events */
 extern int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
 		    int data_type, const struct qstr *name, u32 cookie);
-extern int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
+extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 			   int data_type);
 extern void __fsnotify_inode_delete(struct inode *inode);
 extern void __fsnotify_vfsmount_delete(struct vfsmount *mnt);
@@ -541,7 +541,7 @@ static inline int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
 	return 0;
 }
 
-static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
+static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask,
 				    const void *data, int data_type)
 {
 	return 0;
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify, which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set, which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed function call even if there
are no watchers, and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no watchers.

The test motivating this was "perf bench sched messaging --pipe". Despite
the fact that the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO, so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.

                          5.7.0                  5.7.0
                        vanilla      fastfsnotify-v1r1
Amean     1        0.4837 (   0.00%)        0.4630 *   4.27%*
Amean     3        1.5447 (   0.00%)        1.4557 (   5.76%)
Amean     5        2.6037 (   0.00%)        2.4363 (   6.43%)
Amean     7        3.5987 (   0.00%)        3.4757 (   3.42%)
Amean     12       5.8267 (   0.00%)        5.6983 (   2.20%)
Amean     18       8.4400 (   0.00%)        8.1327 (   3.64%)
Amean     24      11.0187 (   0.00%)       10.0290 *   8.98%*
Amean     30      13.1013 (   0.00%)       12.8510 (   1.91%)
Amean     32      13.9190 (   0.00%)       13.2410 (   4.87%)

                          5.7.0       5.7.0
                        vanilla  fastfsnotify-v1r1
Duration User         157.05      152.79
Duration System      1279.98     1219.32
Duration Elapsed      182.81      174.52

This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown, but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit, so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify(),
mostly function entry and the initial check for watchers. The check for
watchers is complicated enough that inlining it may be controversial.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 fs/notify/fsnotify.c             | 25 +++++++++++++++----------
 include/linux/fsnotify.h         | 10 ++++++++++
 include/linux/fsnotify_backend.h |  4 ++--
 3 files changed, 27 insertions(+), 12 deletions(-)