
fsnotify: Rearrange fast path to minimise overhead when there is no watcher

Message ID 20200608140557.GG3127@techsingularity.net (mailing list archive)
State New, archived

Commit Message

Mel Gorman June 8, 2020, 2:05 p.m. UTC
The fsnotify paths are trivial to hit even when there are no watchers and
they are surprisingly expensive. For example, every successful vfs_write()
hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
As it stands, fsnotify_parent is a guaranteed function call even if there
are no watchers and fsnotify() does a substantial amount of unnecessary
work before it checks if there are any watchers. A perf profile showed
that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
total samples taken in that function during a test. This patch rearranges
the fast paths to reduce the amount of work done when there are no watchers.

The test motivating this was "perf bench sched messaging --pipe". Despite
the fact the pipes are anonymous, fsnotify is still called a lot and
the overhead is noticeable even though it's completely pointless. It's
likely the overhead is negligible for real IO so this is an extreme
example. This is a comparison of hackbench using processes and pipes on
a 1-socket machine with 8 CPU threads without fanotify watchers.

                              5.7.0                  5.7.0
                            vanilla      fastfsnotify-v1r1
Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)

                       5.7.0       5.7.0
                     vanilla fastfsnotify-v1r1
Duration User         157.05      152.79
Duration System      1279.98     1219.32
Duration Elapsed      182.81      174.52

This is showing that the latencies are improved by roughly 2-9%. The
variability is not shown but some of these results are within the noise
as this workload heavily overloads the machine. That said, the system CPU
usage is reduced by quite a bit so it makes sense to avoid the overhead
even if it is a bit tricky to detect at times. A perf profile of just 1
group of tasks showed that 5.14% of samples taken were in either fsnotify()
or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
mostly function entry and the initial check for watchers.  The check for
watchers is complicated enough that inlining it may be controversial.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 fs/notify/fsnotify.c             | 25 +++++++++++++++----------
 include/linux/fsnotify.h         | 10 ++++++++++
 include/linux/fsnotify_backend.h |  4 ++--
 3 files changed, 27 insertions(+), 12 deletions(-)

Comments

Amir Goldstein June 8, 2020, 3:03 p.m. UTC | #1
On Mon, Jun 8, 2020 at 5:05 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The fsnotify paths are trivial to hit even when there are no watchers and
> they are surprisingly expensive. For example, every successful vfs_write()
> hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
> FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
> As it stands, fsnotify_parent is a guaranteed function call even if there
> are no watchers and fsnotify() does a substantial amount of unnecessary
> work before it checks if there are any watchers. A perf profile showed
> that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
> total samples taken in that function during a test. This patch rearranges
> the fast paths to reduce the amount of work done when there are no watchers.
>
> The test motivating this was "perf bench sched messaging --pipe". Despite
> the fact the pipes are anonymous, fsnotify is still called a lot and
> the overhead is noticeable even though it's completely pointless. It's
> likely the overhead is negligible for real IO so this is an extreme
> example. This is a comparison of hackbench using processes and pipes on
> a 1-socket machine with 8 CPU threads without fanotify watchers.
>
>                               5.7.0                  5.7.0
>                             vanilla      fastfsnotify-v1r1
> Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
> Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
> Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
> Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
> Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
> Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
> Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
> Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
> Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)
>
>                        5.7.0       5.7.0
>                      vanilla fastfsnotify-v1r1
> Duration User         157.05      152.79
> Duration System      1279.98     1219.32
> Duration Elapsed      182.81      174.52
>
> This is showing that the latencies are improved by roughly 2-9%. The
> variability is not shown but some of these results are within the noise
> as this workload heavily overloads the machine. That said, the system CPU
> usage is reduced by quite a bit so it makes sense to avoid the overhead
> even if it is a bit tricky to detect at times. A perf profile of just 1
> group of tasks showed that 5.14% of samples taken were in either fsnotify()
> or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> mostly function entry and the initial check for watchers.  The check for
> watchers is complicated enough that inlining it may be controversial.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hi Mel,

Thanks for looking into this

> ---
>  fs/notify/fsnotify.c             | 25 +++++++++++++++----------
>  include/linux/fsnotify.h         | 10 ++++++++++
>  include/linux/fsnotify_backend.h |  4 ++--
>  3 files changed, 27 insertions(+), 12 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 72d332ce8e12..de7bbfd973c0 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
>  }
>
>  /* Notify this dentry's parent about a child's events. */
> -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>                     int data_type)
>  {
>         struct dentry *parent;
> @@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>
>         return ret;
>  }
> -EXPORT_SYMBOL_GPL(fsnotify_parent);
> +EXPORT_SYMBOL_GPL(__fsnotify_parent);
>
>  static int send_to_group(struct inode *to_tell,
>                          __u32 mask, const void *data,
> @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>         struct fsnotify_iter_info iter_info = {};
>         struct super_block *sb = to_tell->i_sb;
>         struct mount *mnt = NULL;
> -       __u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
> +       __u32 mnt_or_sb_mask;
>         int ret = 0;
> -       __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> +       __u32 test_mask;
>
> -       if (path) {
> +       if (path)
>                 mnt = real_mount(path->mnt);
> -               mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> -       }
> -       /* An event "on child" is not intended for a mount/sb mark */
> -       if (mask & FS_EVENT_ON_CHILD)
> -               mnt_or_sb_mask = 0;
>
>         /*
>          * Optimization: srcu_read_lock() has a memory barrier which can
> @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>         if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
>             (!mnt || !mnt->mnt_fsnotify_marks))
>                 return 0;
> +
> +       /* An event "on child" is not intended for a mount/sb mark */
> +       mnt_or_sb_mask = 0;
> +       if (!(mask & FS_EVENT_ON_CHILD)) {
> +               mnt_or_sb_mask = sb->s_fsnotify_mask;
> +               if (path)
> +                       mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> +       }
> +

Are you sure that loading ->_fsnotify_mask is so much more expensive
than only checking ->_fsnotify_marks? They are surely on the same cache line.
Isn't it possible that you just moved the penalty to ->_fsnotify_marks check
with this change?
Did you test performance with just the fsnotify_parent() change alone?

In any case, this hunk seriously conflicts with a patch set I am working on,
so I would rather not merge this change as is.

If you provide me the feedback that testing ->_fsnotify_marks before loading
->_fsnotify_mask is beneficial on its own, then I will work this change into
my series.

>         /*
>          * if this is a modify event we may need to clear the ignored masks
>          * otherwise return if neither the inode nor the vfsmount/sb care about
>          * this type of event.
>          */
> +       test_mask = (mask & ALL_FSNOTIFY_EVENTS);
>         if (!(mask & FS_MODIFY) &&
>             !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
>                 return 0;
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index 5ab28f6c7d26..508f6bb0b06b 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
>         fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
>  }
>
> +/* Notify this dentry's parent about a child's events. */
> +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
> +                                 const void *data, int data_type)
> +{
> +       if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
> +               return 0;
> +
> +       return __fsnotify_parent(dentry, mask, data, data_type);
> +}
> +

This change looks good as is, but I'm afraid my series is going to
make it obsolete.
It may very well be that my series will introduce more performance
penalty to your workload.

It would be very much appreciated if you could run a test for me.
I will gladly work some more optimizations into my series.

You can find the relevant part of my work at:
https://github.com/amir73il/linux/commits/fsnotify_name

What this work does essentially is two things:
1. Call backend once instead of twice when both inode and parent are
    watching.
2. Snapshot name and parent inode to pass to backend not only when
    parent is watching, but also when an sb/mnt mark exists which
    requests to get file names with events.

Compared to the existing implementation of fsnotify_parent(),
my code needs to also test bits in inode->i_fsnotify_mask,
inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask
before the fast path can be taken.
So it's back to square one w.r.t. your optimizations.

I could add the same optimization as in fsnotify() to check
->_fsnotify_marks before ->_fsnotify_mask if you can provide
me some concrete numbers.

Thanks,
Amir.
Jan Kara June 8, 2020, 3:19 p.m. UTC | #2
On Mon 08-06-20 15:05:57, Mel Gorman wrote:
> The fsnotify paths are trivial to hit even when there are no watchers and
> they are surprisingly expensive. For example, every successful vfs_write()
> hits fsnotify_modify which calls both fsnotify_parent and fsnotify unless
> FMODE_NONOTIFY is set which is an internal flag invisible to userspace.
> As it stands, fsnotify_parent is a guaranteed function call even if there
> are no watchers and fsnotify() does a substantial amount of unnecessary
> work before it checks if there are any watchers. A perf profile showed
> that applying mnt->mnt_fsnotify_mask in fsnotify() was almost half of the
> total samples taken in that function during a test. This patch rearranges
> the fast paths to reduce the amount of work done when there are no watchers.
> 
> The test motivating this was "perf bench sched messaging --pipe". Despite
> the fact the pipes are anonymous, fsnotify is still called a lot and
> the overhead is noticeable even though it's completely pointless. It's
> likely the overhead is negligible for real IO so this is an extreme
> example. This is a comparison of hackbench using processes and pipes on
> a 1-socket machine with 8 CPU threads without fanotify watchers.
> 
>                               5.7.0                  5.7.0
>                             vanilla      fastfsnotify-v1r1
> Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*
> Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)
> Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)
> Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)
> Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)
> Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)
> Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*
> Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)
> Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)
> 
>                        5.7.0       5.7.0
>                      vanilla fastfsnotify-v1r1
> Duration User         157.05      152.79
> Duration System      1279.98     1219.32
> Duration Elapsed      182.81      174.52
> 
> This is showing that the latencies are improved by roughly 2-9%. The
> variability is not shown but some of these results are within the noise
> as this workload heavily overloads the machine. That said, the system CPU
> usage is reduced by quite a bit so it makes sense to avoid the overhead
> even if it is a bit tricky to detect at times. A perf profile of just 1
> group of tasks showed that 5.14% of samples taken were in either fsnotify()
> or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> mostly function entry and the initial check for watchers.  The check for
> watchers is complicated enough that inlining it may be controversial.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Thanks for the patch! I have to tell I'm surprised this small reordering
helps so much. For pipe inode we will bail on:

       if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
           (!mnt || !mnt->mnt_fsnotify_marks))
               return 0;

So what we save with the reordering is sb->s_fsnotify_mask and
mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as
sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively. We also
save a function call of fsnotify_parent() but I would think that is very
cheap (compared to the whole write path) as well.

The patch is simple enough so I have no problem merging it but I'm just
surprised by the results... Hum, maybe the structure randomization is used
in the builds and so e.g. sb->s_fsnotify_mask and sb->s_fsnotify_marks
don't end up in the same cacheline? But I don't think we enable that in
SUSE builds?

								Honza


> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 72d332ce8e12..de7bbfd973c0 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -143,7 +143,7 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
>  }
>  
>  /* Notify this dentry's parent about a child's events. */
> -int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> +int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>  		    int data_type)
>  {
>  	struct dentry *parent;
> @@ -174,7 +174,7 @@ int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>  
>  	return ret;
>  }
> -EXPORT_SYMBOL_GPL(fsnotify_parent);
> +EXPORT_SYMBOL_GPL(__fsnotify_parent);
>  
>  static int send_to_group(struct inode *to_tell,
>  			 __u32 mask, const void *data,
> @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>  	struct fsnotify_iter_info iter_info = {};
>  	struct super_block *sb = to_tell->i_sb;
>  	struct mount *mnt = NULL;
> -	__u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
> +	__u32 mnt_or_sb_mask;
>  	int ret = 0;
> -	__u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> +	__u32 test_mask;
>  
> -	if (path) {
> +	if (path)
>  		mnt = real_mount(path->mnt);
> -		mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> -	}
> -	/* An event "on child" is not intended for a mount/sb mark */
> -	if (mask & FS_EVENT_ON_CHILD)
> -		mnt_or_sb_mask = 0;
>  
>  	/*
>  	 * Optimization: srcu_read_lock() has a memory barrier which can
> @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
>  	if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
>  	    (!mnt || !mnt->mnt_fsnotify_marks))
>  		return 0;
> +
> +	/* An event "on child" is not intended for a mount/sb mark */
> +	mnt_or_sb_mask = 0;
> +	if (!(mask & FS_EVENT_ON_CHILD)) {
> +		mnt_or_sb_mask = sb->s_fsnotify_mask;
> +		if (path)
> +			mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> +	}
> +
>  	/*
>  	 * if this is a modify event we may need to clear the ignored masks
>  	 * otherwise return if neither the inode nor the vfsmount/sb care about
>  	 * this type of event.
>  	 */
> +	test_mask = (mask & ALL_FSNOTIFY_EVENTS);
>  	if (!(mask & FS_MODIFY) &&
>  	    !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
>  		return 0;
> diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> index 5ab28f6c7d26..508f6bb0b06b 100644
> --- a/include/linux/fsnotify.h
> +++ b/include/linux/fsnotify.h
> @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
>  	fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
>  }
>  
> +/* Notify this dentry's parent about a child's events. */
> +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
> +				  const void *data, int data_type)
> +{
> +	if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
> +		return 0;
> +
> +	return __fsnotify_parent(dentry, mask, data, data_type);
> +}
> +
>  /*
>   * Simple wrappers to consolidate calls fsnotify_parent()/fsnotify() when
>   * an event is on a file/dentry.
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index f0c506405b54..1626fa7d10ff 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -379,7 +379,7 @@ struct fsnotify_mark {
>  /* main fsnotify call to send events */
>  extern int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
>  		    int data_type, const struct qstr *name, u32 cookie);
> -extern int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> +extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>  			   int data_type);
>  extern void __fsnotify_inode_delete(struct inode *inode);
>  extern void __fsnotify_vfsmount_delete(struct vfsmount *mnt);
> @@ -541,7 +541,7 @@ static inline int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
>  	return 0;
>  }
>  
> -static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
> +static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask,
>  				  const void *data, int data_type)
>  {
>  	return 0;
Mel Gorman June 8, 2020, 4:06 p.m. UTC | #3
On Mon, Jun 08, 2020 at 06:03:56PM +0300, Amir Goldstein wrote:
> > @@ -315,17 +315,12 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
> >         struct fsnotify_iter_info iter_info = {};
> >         struct super_block *sb = to_tell->i_sb;
> >         struct mount *mnt = NULL;
> > -       __u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
> > +       __u32 mnt_or_sb_mask;
> >         int ret = 0;
> > -       __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> > +       __u32 test_mask;
> >
> > -       if (path) {
> > +       if (path)
> >                 mnt = real_mount(path->mnt);
> > -               mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> > -       }
> > -       /* An event "on child" is not intended for a mount/sb mark */
> > -       if (mask & FS_EVENT_ON_CHILD)
> > -               mnt_or_sb_mask = 0;
> >
> >         /*
> >          * Optimization: srcu_read_lock() has a memory barrier which can
> > @@ -337,11 +332,21 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
> >         if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
> >             (!mnt || !mnt->mnt_fsnotify_marks))
> >                 return 0;
> > +
> > +       /* An event "on child" is not intended for a mount/sb mark */
> > +       mnt_or_sb_mask = 0;
> > +       if (!(mask & FS_EVENT_ON_CHILD)) {
> > +               mnt_or_sb_mask = sb->s_fsnotify_mask;
> > +               if (path)
> > +                       mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
> > +       }
> > +
> 
> Are you sure that loading ->_fsnotify_mask is so much more expensive
> than only checking ->_fsnotify_marks? They are surely on the same cache line.
> Isn't it possible that you just moved the penalty to ->_fsnotify_marks check
> with this change?

The profile indicated that the cost was in this line

	mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;

as opposed to the other checks. Hence, I deferred the calculation of
mnt_or_sb_mask until it was definitely needed. However, it's perfectly
possible that the line simply looked hot because the function entry was
hot in general.

> Did you test performance with just the fsnotify_parent() change alone?
> 

No, but I can. I looked at the profile of fsnotify() first before moving
on to fsnotify_parent() but I didn't test them in isolation as deferring
unnecessary calculations in a fast path seemed reasonable.

> In any case, this hunk seriously conflicts with a patch set I am working on,
> so I would rather not merge this change as is.
> 
> If you provide me the feedback that testing ->_fsnotify_marks before loading
> ->_fsnotify_mask is beneficial on its own, then I will work this change into
> my series.
> 

Will do. If the fsnotify_parent() changes are useful on their own, I'll
post a v2 of the patch with just that change.

> >         /*
> >          * if this is a modify event we may need to clear the ignored masks
> >          * otherwise return if neither the inode nor the vfsmount/sb care about
> >          * this type of event.
> >          */
> > +       test_mask = (mask & ALL_FSNOTIFY_EVENTS);
> >         if (!(mask & FS_MODIFY) &&
> >             !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
> >                 return 0;
> > diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
> > index 5ab28f6c7d26..508f6bb0b06b 100644
> > --- a/include/linux/fsnotify.h
> > +++ b/include/linux/fsnotify.h
> > @@ -44,6 +44,16 @@ static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
> >         fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
> >  }
> >
> > +/* Notify this dentry's parent about a child's events. */
> > +static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
> > +                                 const void *data, int data_type)
> > +{
> > +       if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
> > +               return 0;
> > +
> > +       return __fsnotify_parent(dentry, mask, data, data_type);
> > +}
> > +
> 
> This change looks good as is, but I'm afraid my series is going to
> make it obsolete.
> It may very well be that my series will introduce more performance
> penalty to your workload.
> 
> It would be very much appreciated if you could run a test for me.
> I will gladly work some more optimizations into my series.
> 
> You can find the relevant part of my work at:
> https://github.com/amir73il/linux/commits/fsnotify_name
> 

Sure.

> What this work does essentially is two things:
> 1. Call backend once instead of twice when both inode and parent are
>     watching.
> 2. Snapshot name and parent inode to pass to backend not only when
>     parent is watching, but also when an sb/mnt mark exists which
>     requests to get file names with events.
> 
> Compared to the existing implementation of fsnotify_parent(),
> my code needs to also test bits in inode->i_fsnotify_mask,
> inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask
> before the fast path can be taken.
> So it's back to square one w.r.t. your optimizations.
> 

Seems fair but it may be worth noting that the changes appear to be
optimising the case where there are watchers. The case where there are
no watchers at all is also interesting and probably a lot more common. I
didn't look too closely at your series as I'm not familiar with fsnotify
in general. However, at a glance it looks like fsnotify_parent() executes
a substantial amount of code even if there are no watchers but I could
be wrong.
Amir Goldstein June 8, 2020, 4:26 p.m. UTC | #4
> > What this work does essentially is two things:
> > 1. Call backend once instead of twice when both inode and parent are
> >     watching.
> > 2. Snapshot name and parent inode to pass to backend not only when
> >     parent is watching, but also when an sb/mnt mark exists which
> >     requests to get file names with events.
> >
> > Compared to the existing implementation of fsnotify_parent(),
> > my code needs to also test bits in inode->i_fsnotify_mask,
> > inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask
> > before the fast path can be taken.
> > So it's back to square one w.r.t. your optimizations.
> >
>
> Seems fair but it may be worth noting that the changes appear to be
> optimising the case where there are watchers. The case where there are
> no watchers at all is also interesting and probably a lot more common. I

My changes are not optimizations. They are for adding functionality.
Surely, that shouldn't come at a cost for the common case.

> didn't look too closely at your series as I'm not familiar with fsnotify
> in general. However, at a glance it looks like fsnotify_parent() executes
> a substantial amount of code even if there are no watchers but I could
> be wrong.
>

I don't know about substantial; I would say it is on par with the amount of
code that you tried to optimize out of fsnotify().

Before bailing out with DCACHE_FSNOTIFY_PARENT_WATCHED
test, it also references d_inode->i_sb,  real_mount(path->mnt)
and fetches all their ->x_fsnotify_mask fields.

I changed the call pattern from open/modify/... hooks from:
fsnotify_parent(...);
fsnotify(...);

to:
fsnotify_parent(...); /* which calls fsnotify() */

So the NULL marks optimization could be done at the beginning of
fsnotify_parent() and it will be just as effective as it is in fsnotify().

Thanks,
Amir.
Mel Gorman June 8, 2020, 4:50 p.m. UTC | #5
On Mon, Jun 08, 2020 at 05:19:43PM +0200, Jan Kara wrote:
> > This is showing that the latencies are improved by roughly 2-9%. The
> > variability is not shown but some of these results are within the noise
> > as this workload heavily overloads the machine. That said, the system CPU
> > usage is reduced by quite a bit so it makes sense to avoid the overhead
> > even if it is a bit tricky to detect at times. A perf profile of just 1
> > group of tasks showed that 5.14% of samples taken were in either fsnotify()
> > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> > mostly function entry and the initial check for watchers.  The check for
> > watchers is complicated enough that inlining it may be controversial.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Thanks for the patch! I have to tell I'm surprised this small reordering
> helps so much. For pipe inode we will bail on:
> 
>        if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
>            (!mnt || !mnt->mnt_fsnotify_marks))
>                return 0;
> 
> So what we save with the reordering is sb->s_fsnotify_mask and
> mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as
> sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively.

It is likely that the contribution of that change is marginal relative
to the fsnotify_parent() call. I'll know by tomorrow morning at the latest.

> We also
> save a function call of fsnotify_parent() but I would think that is very
> cheap (compared to the whole write path) as well.
> 

To be fair, it is cheap but with this particular workload, we call
vfs_write() a *lot* and the path is not that long so it builds up to 5%
of samples overall. Given that these were anonymous pipes, it surprised
me to see fsnotify at all which is why I took a closer look.

> The patch is simple enough so I have no problem merging it but I'm just
> surprised by the results... Hum, maybe the structure randomization is used
> in the builds and so e.g. sb->s_fsnotify_mask and sb->s_fsnotify_marks
> don't end up in the same cacheline? But I don't think we enable that in
> SUSE builds?
> 

Correct, GCC_PLUGIN_RANDSTRUCT was not set.
Amir Goldstein June 8, 2020, 5:45 p.m. UTC | #6
On Mon, Jun 8, 2020 at 7:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Jun 08, 2020 at 05:19:43PM +0200, Jan Kara wrote:
> > > This is showing that the latencies are improved by roughly 2-9%. The
> > > variability is not shown but some of these results are within the noise
> > > as this workload heavily overloads the machine. That said, the system CPU
> > > usage is reduced by quite a bit so it makes sense to avoid the overhead
> > > even if it is a bit tricky to detect at times. A perf profile of just 1
> > > group of tasks showed that 5.14% of samples taken were in either fsnotify()
> > > or fsnotify_parent(). With the patch, 2.8% of samples were in fsnotify,
> > > mostly function entry and the initial check for watchers.  The check for
> > > watchers is complicated enough that inlining it may be controversial.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >
> > Thanks for the patch! I have to tell I'm surprised this small reordering
> > helps so much. For pipe inode we will bail on:
> >
> >        if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
> >            (!mnt || !mnt->mnt_fsnotify_marks))
> >                return 0;
> >
> > So what we save with the reordering is sb->s_fsnotify_mask and
> > mnt->mnt_fsnotify_mask fetch but that should be the same cacheline as
> > sb->s_fsnotify_marks and mnt->mnt_fsnotify_marks, respectively.
>
> It is likely that the contribution of that change is marginal relative
> to the fsnotify_parent() call. I'll know by tomorrow morning at the latest.
>
> > We also
> > save a function call of fsnotify_parent() but I would think that is very
> > cheap (compared to the whole write path) as well.
> >
>
> To be fair, it is cheap but with this particular workload, we call
> vfs_write() a *lot* and the path is not that long so it builds up to 5%
> of samples overall. Given that these were anonymous pipes, it surprised
> me to see fsnotify at all which is why I took a closer look.
>

I should note that after:
7c49b8616460 fs/notify: optimize inotify/fsnotify code for unwatched files
which speaks of a similar workload,
the code looked quite similar to your optimization.

It was:
60f7ed8c7c4d fsnotify: send path type events to group with super block marks

That started accessing ->x_fsnotify_mask before ->x_fsnotify_marks,
although I still find it hard to believe that this makes a real difference.

Thanks,
Amir.
Mel Gorman June 8, 2020, 6:01 p.m. UTC | #7
On Mon, Jun 08, 2020 at 07:26:10PM +0300, Amir Goldstein wrote:
> > > What this work does essentially is two things:
> > > 1. Call backend once instead of twice when both inode and parent are
> > >     watching.
> > > 2. Snapshot name and parent inode to pass to backend not only when
> > >     parent is watching, but also when an sb/mnt mark exists which
> > >     requests to get file names with events.
> > >
> > > Compared to the existing implementation of fsnotify_parent(),
> > > my code needs to also test bits in inode->i_fsnotify_mask,
> > > inode->i_sb->s_fsnotify_mask and mnt->mnt_fsnotify_mask
> > > before the fast path can be taken.
> > > So its back to square one w.r.t your optimizations.
> > >
> >
> > Seems fair but it may be worth noting that the changes appear to be
> > optimising the case where there are watchers. The case where there are
> > no watchers at all is also interesting and probably a lot more common. I
> 
> My changes are not optimizations. They are for adding functionality.
> Surely, that shouldn't come at a cost for the common case.
> 

My bad. I interpreted the folding of the fsnotify() call into fsnotify_parent()
as a potential optimisation, particularly if it bailed earlier (which it
doesn't do, but maybe it could).

> > didn't look too closely at your series as I'm not familiar with fsnotify
> > in general. However, at a glance it looks like fsnotify_parent() executes
> > a substantial amount of code even if there are no watchers but I could
> > be wrong.
> >
> 
> I don't know about substantial, I would say it is on par with the amount of
> code that you tried to optimize out of fsnotify().
> 
> Before bailing out with DCACHE_FSNOTIFY_PARENT_WATCHED
> test, it also references d_inode->i_sb,  real_mount(path->mnt)
> and fetches all their ->x_fsnotify_mask fields.
> 
> I changed the call pattern from open/modify/... hooks from:
> fsnotify_parent(...);
> fsnotify(...);
> 
> to:
> fsnotify_parent(...); /* which calls fsnotify() */
> 
> So the NULL marks optimization could be done in beginning of
> fsnotify_parent() and it will be just as effective as it is in fsnotify().
> 

Something like that may be required because

                              5.7.0                  5.7.0                  5.7.0                  5.7.0
                            vanilla      fastfsnotify-v1r1      fastfsnotify-v2r1          amir-20200608
Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4597 *   4.96%*      0.4967 *  -2.69%*
Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.5310 (   0.88%)      1.6587 *  -7.38%*
Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.4237 (   6.91%)      2.6400 (  -1.40%)
Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.6543 (  -1.55%)      3.9040 *  -8.48%*
Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      5.5903 (   4.06%)      6.2593 (  -7.43%)
Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      7.7150 *   8.59%*      8.9940 (  -6.56%)
Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*      9.8977 *  10.17%*     11.7247 *  -6.41%*
Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     12.2087 *   6.81%*     14.0290 *  -7.08%*
Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     13.2900 (   4.52%)     14.7140 *  -5.71%*

vanilla and fastfsnotify-v1r1 are the same as before. fastfsnotify-v2r1 is just
the fsnotify_parent() change, which is mostly worse and may indicate that the
first patch was reasonable. amir-20200608 is your branch as of today and
it appears to introduce a substantial regression, albeit in an extreme case
where fsnotify overhead is visible. The regressions are mostly larger
than noise, with the caveat that this may be machine-specific given that the
machine is overloaded. I accept that adding extra functionality to fsnotify
may be desirable but ideally it would not hurt the case where there are
no watchers at all.

So what's the right way forward? The patch as-is (even though the fsnotify()
change itself may be marginal), a patch that just inlines the fast path
of fsnotify_parent(), or waiting for the additional functionality and then
addressing the overhead on top?
Amir Goldstein June 8, 2020, 6:12 p.m. UTC | #8
> > > didn't look too closely at your series as I'm not familiar with fsnotify
> > > in general. However, at a glance it looks like fsnotify_parent() executes
> > > a substantial amount of code even if there are no watchers but I could
> > > be wrong.
> > >
> >
> > I don't know about substantial, I would say it is on par with the amount of
> > code that you tried to optimize out of fsnotify().
> >
> > Before bailing out with DCACHE_FSNOTIFY_PARENT_WATCHED
> > test, it also references d_inode->i_sb,  real_mount(path->mnt)
> > and fetches all their ->x_fsnotify_mask fields.
> >
> > I changed the call pattern from open/modify/... hooks from:
> > fsnotify_parent(...);
> > fsnotify(...);
> >
> > to:
> > fsnotify_parent(...); /* which calls fsnotify() */
> >
> > So the NULL marks optimization could be done in beginning of
> > fsnotify_parent() and it will be just as effective as it is in fsnotify().
> >
>
> Something like that may be required because
>
>                               5.7.0                  5.7.0                  5.7.0                  5.7.0
>                             vanilla      fastfsnotify-v1r1      fastfsnotify-v2r1          amir-20200608
> Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4597 *   4.96%*      0.4967 *  -2.69%*
> Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.5310 (   0.88%)      1.6587 *  -7.38%*
> Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.4237 (   6.91%)      2.6400 (  -1.40%)
> Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.6543 (  -1.55%)      3.9040 *  -8.48%*
> Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      5.5903 (   4.06%)      6.2593 (  -7.43%)
> Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      7.7150 *   8.59%*      8.9940 (  -6.56%)
> Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*      9.8977 *  10.17%*     11.7247 *  -6.41%*
> Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     12.2087 *   6.81%*     14.0290 *  -7.08%*
> Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     13.2900 (   4.52%)     14.7140 *  -5.71%*
>
> vanilla and fastfsnotify-v1r1 are the same as before. fastfsnotify-v2r1 is just
> the fsnotify_parent() change, which is mostly worse and may indicate that the
> first patch was reasonable. amir-20200608 is your branch as of today and
> it appears to introduce a substantial regression, albeit in an extreme case
> where fsnotify overhead is visible. The regressions are mostly larger
> than noise, with the caveat that this may be machine-specific given that the
> machine is overloaded. I accept that adding extra functionality to fsnotify
> may be desirable but ideally it would not hurt the case where there are
> no watchers at all.
>

Of course.
And thanks for catching this regression even before I posted the patches :-)

> So what's the right way forward? The patch as-is (even though the fsnotify()
> change itself may be marginal), a patch that just inlines the fast path
> of fsnotify_parent(), or waiting for the additional functionality and then
> addressing the overhead on top?
>
>

Let me add your optimizations on top of my branch with the needed
adaptations and send you a branch for testing.

Thanks,
Amir.
Amir Goldstein June 8, 2020, 8:39 p.m. UTC | #9
On Mon, Jun 8, 2020 at 9:12 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > > didn't look too closely at your series as I'm not familiar with fsnotify
> > > > in general. However, at a glance it looks like fsnotify_parent() executes
> > > > a substantial amount of code even if there are no watchers but I could
> > > > be wrong.
> > > >
> > >
> > > I don't know about substantial, I would say it is on par with the amount of
> > > code that you tried to optimize out of fsnotify().
> > >
> > > Before bailing out with DCACHE_FSNOTIFY_PARENT_WATCHED
> > > test, it also references d_inode->i_sb,  real_mount(path->mnt)
> > > and fetches all their ->x_fsnotify_mask fields.
> > >
> > > I changed the call pattern from open/modify/... hooks from:
> > > fsnotify_parent(...);
> > > fsnotify(...);
> > >
> > > to:
> > > fsnotify_parent(...); /* which calls fsnotify() */
> > >
> > > So the NULL marks optimization could be done in beginning of
> > > fsnotify_parent() and it will be just as effective as it is in fsnotify().
> > >
> >
> > Something like that may be required because
> >
> >                               5.7.0                  5.7.0                  5.7.0                  5.7.0
> >                             vanilla      fastfsnotify-v1r1      fastfsnotify-v2r1          amir-20200608
> > Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4597 *   4.96%*      0.4967 *  -2.69%*
> > Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.5310 (   0.88%)      1.6587 *  -7.38%*
> > Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.4237 (   6.91%)      2.6400 (  -1.40%)
> > Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.6543 (  -1.55%)      3.9040 *  -8.48%*
> > Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      5.5903 (   4.06%)      6.2593 (  -7.43%)
> > Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      7.7150 *   8.59%*      8.9940 (  -6.56%)
> > Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*      9.8977 *  10.17%*     11.7247 *  -6.41%*
> > Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     12.2087 *   6.81%*     14.0290 *  -7.08%*
> > Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     13.2900 (   4.52%)     14.7140 *  -5.71%*
> >
> > vanilla and fastfsnotify-v1r1 are the same as before. fastfsnotify-v2r1 is just
> > the fsnotify_parent() change, which is mostly worse and may indicate that the
> > first patch was reasonable. amir-20200608 is your branch as of today and
> > it appears to introduce a substantial regression, albeit in an extreme case
> > where fsnotify overhead is visible. The regressions are mostly larger
> > than noise, with the caveat that this may be machine-specific given that the
> > machine is overloaded. I accept that adding extra functionality to fsnotify
> > may be desirable but ideally it would not hurt the case where there are
> > no watchers at all.
> >
>
> Of course.
> And thanks for catching this regression even before I posted the patches :-)
>
> > So what's the right way forward? The patch as-is (even though the fsnotify()
> > change itself may be marginal), a patch that just inlines the fast path
> > of fsnotify_parent(), or waiting for the additional functionality and then
> > addressing the overhead on top?
> >
> >
>
> Let me add your optimizations on top of my branch with the needed
> adaptations and send you a branch for testing.

https://github.com/amir73il/linux/commits/fsnotify_name-for-mel

Cheers,
Amir.
Mel Gorman June 10, 2020, 12:59 p.m. UTC | #10
On Mon, Jun 08, 2020 at 11:39:26PM +0300, Amir Goldstein wrote:
> > Let me add your optimizations on top of my branch with the needed
> > adaptations and send you a branch for testing.
> 
> https://github.com/amir73il/linux/commits/fsnotify_name-for-mel
> 

Sorry for the delay getting back. The machine was busy with other tests
and it took a while to reach this on the queue. Fairly good news

hackbench-process-pipes
                              5.7.0                  5.7.0                  5.7.0                  5.7.0
                            vanilla      fastfsnotify-v1r1          amir-20200608           amir-for-mel
Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4967 *  -2.69%*      0.4680 (   3.24%)
Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.6587 *  -7.38%*      1.4807 (   4.14%)
Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.6400 (  -1.40%)      2.4900 (   4.37%)
Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.9040 *  -8.48%*      3.5130 (   2.38%)
Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      6.2593 (  -7.43%)      5.6967 (   2.23%)
Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      8.9940 (  -6.56%)      7.7240 *   8.48%*
Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*     11.7247 *  -6.41%*      9.5793 *  13.06%*
Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     14.0290 *  -7.08%*     12.1630 (   7.16%)
Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     14.7140 *  -5.71%*     13.2457 *   4.84%*

The first two sets of results are the vanilla kernel and just my patch
respectively, to provide two baselines. amir-20200608 is the first git branch
you pointed me to and amir-for-mel is this latest branch. Comparing the
optimisation and your series, we get:

hackbench-process-pipes
                              5.7.0                  5.7.0
                  fastfsnotify-v1r1           amir-for-mel
Amean     1       0.4630 (   0.00%)      0.4680 (  -1.08%)
Amean     3       1.4557 (   0.00%)      1.4807 (  -1.72%)
Amean     5       2.4363 (   0.00%)      2.4900 (  -2.20%)
Amean     7       3.4757 (   0.00%)      3.5130 (  -1.07%)
Amean     12      5.6983 (   0.00%)      5.6967 (   0.03%)
Amean     18      8.1327 (   0.00%)      7.7240 (   5.03%)
Amean     24     10.0290 (   0.00%)      9.5793 (   4.48%)
Amean     30     12.8510 (   0.00%)     12.1630 (   5.35%)
Amean     32     13.2410 (   0.00%)     13.2457 (  -0.04%)

As you can see, your patches with the optimisation layered on top are
comparable to just the optimisation on its own. It's not universally
better but it would not look like a regression when comparing releases.
The differences are mostly within the noise as there is some variability
involved for this workload, so I would not worry too much about it (with
the caveat that other machines and other workloads may differ). A minor
issue is that this is probably a bisection hazard. If a series is merged
and LKP points the finger somewhere in the middle of your series then
I suggest you ask them to test the optimisation commit ID to see if the
regression goes away.

Thanks Amir.
Amir Goldstein June 10, 2020, 1:24 p.m. UTC | #11
On Wed, Jun 10, 2020 at 3:59 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Jun 08, 2020 at 11:39:26PM +0300, Amir Goldstein wrote:
> > > Let me add your optimizations on top of my branch with the needed
> > > adaptations and send you a branch for testing.
> >
> > https://github.com/amir73il/linux/commits/fsnotify_name-for-mel
> >
>
> Sorry for the delay getting back. The machine was busy with other tests
> and it took a while to reach this on the queue. Fairly good news
>
> hackbench-process-pipes
>                               5.7.0                  5.7.0                  5.7.0                  5.7.0
>                             vanilla      fastfsnotify-v1r1          amir-20200608           amir-for-mel
> Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4967 *  -2.69%*      0.4680 (   3.24%)
> Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.6587 *  -7.38%*      1.4807 (   4.14%)
> Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.6400 (  -1.40%)      2.4900 (   4.37%)
> Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.9040 *  -8.48%*      3.5130 (   2.38%)
> Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      6.2593 (  -7.43%)      5.6967 (   2.23%)
> Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      8.9940 (  -6.56%)      7.7240 *   8.48%*
> Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*     11.7247 *  -6.41%*      9.5793 *  13.06%*
> Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     14.0290 *  -7.08%*     12.1630 (   7.16%)
> Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     14.7140 *  -5.71%*     13.2457 *   4.84%*
>
> The first two sets of results are the vanilla kernel and just my patch
> respectively, to provide two baselines. amir-20200608 is the first git branch
> you pointed me to and amir-for-mel is this latest branch. Comparing the
> optimisation and your series, we get:
>
> hackbench-process-pipes
>                               5.7.0                  5.7.0
>                   fastfsnotify-v1r1           amir-for-mel
> Amean     1       0.4630 (   0.00%)      0.4680 (  -1.08%)
> Amean     3       1.4557 (   0.00%)      1.4807 (  -1.72%)
> Amean     5       2.4363 (   0.00%)      2.4900 (  -2.20%)
> Amean     7       3.4757 (   0.00%)      3.5130 (  -1.07%)
> Amean     12      5.6983 (   0.00%)      5.6967 (   0.03%)
> Amean     18      8.1327 (   0.00%)      7.7240 (   5.03%)
> Amean     24     10.0290 (   0.00%)      9.5793 (   4.48%)
> Amean     30     12.8510 (   0.00%)     12.1630 (   5.35%)
> Amean     32     13.2410 (   0.00%)     13.2457 (  -0.04%)
>
> As you can see, your patches with the optimisation layered on top are
> comparable to just the optimisation on its own. It's not universally
> better but it would not look like a regression when comparing releases.
> The differences are mostly within the noise as there is some variability
> involved for this workload, so I would not worry too much about it (with
> the caveat that other machines and other workloads may differ).

Excellent!
Thanks for verifying.

TBH, this result is not surprising, because despite all the changes from
fastfsnotify-v1r1 to amir-for-mel, the code that is executed when there
are no watches should be quite similar. Without any unexpected compiler
optimizations that may differ between our versions, fsnotify hooks called
for directories should execute the exact same code.

fsnotify hooks called for non-directories (your workload) should execute
almost the same code. I spotted one additional test for access to
d_inode and one additional test of:
dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED
in the entry to __fsnotify_parent().

> A minor
> issue is that this is probably a bisection hazard. If a series is merged
> and LKP points the finger somewhere in the middle of your series then
> I suggest you ask them to test the optimisation commit ID to see if the
> regression goes away.
>

> No worries. I wasn't planning on submitting the altered patch.
I just wanted to let you test the final result.
I will apply your change before my series and make sure to keep
optimizations while my changes are applied on top of that.

Thanks,
Amir.
Amir Goldstein June 12, 2020, 9:38 a.m. UTC | #12
On Wed, Jun 10, 2020 at 4:24 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jun 10, 2020 at 3:59 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Mon, Jun 08, 2020 at 11:39:26PM +0300, Amir Goldstein wrote:
> > > > Let me add your optimizations on top of my branch with the needed
> > > > adaptations and send you a branch for testing.
> > >
> > > https://github.com/amir73il/linux/commits/fsnotify_name-for-mel
> > >
> >
> > Sorry for the delay getting back. The machine was busy with other tests
> > and it took a while to reach this on the queue. Fairly good news
> >
> > hackbench-process-pipes
> >                               5.7.0                  5.7.0                  5.7.0                  5.7.0
> >                             vanilla      fastfsnotify-v1r1          amir-20200608           amir-for-mel
> > Amean     1       0.4837 (   0.00%)      0.4630 *   4.27%*      0.4967 *  -2.69%*      0.4680 (   3.24%)
> > Amean     3       1.5447 (   0.00%)      1.4557 (   5.76%)      1.6587 *  -7.38%*      1.4807 (   4.14%)
> > Amean     5       2.6037 (   0.00%)      2.4363 (   6.43%)      2.6400 (  -1.40%)      2.4900 (   4.37%)
> > Amean     7       3.5987 (   0.00%)      3.4757 (   3.42%)      3.9040 *  -8.48%*      3.5130 (   2.38%)
> > Amean     12      5.8267 (   0.00%)      5.6983 (   2.20%)      6.2593 (  -7.43%)      5.6967 (   2.23%)
> > Amean     18      8.4400 (   0.00%)      8.1327 (   3.64%)      8.9940 (  -6.56%)      7.7240 *   8.48%*
> > Amean     24     11.0187 (   0.00%)     10.0290 *   8.98%*     11.7247 *  -6.41%*      9.5793 *  13.06%*
> > Amean     30     13.1013 (   0.00%)     12.8510 (   1.91%)     14.0290 *  -7.08%*     12.1630 (   7.16%)
> > Amean     32     13.9190 (   0.00%)     13.2410 (   4.87%)     14.7140 *  -5.71%*     13.2457 *   4.84%*
> >
> > The first two sets of results are the vanilla kernel and just my patch
> > respectively, to provide two baselines. amir-20200608 is the first git branch
> > you pointed me to and amir-for-mel is this latest branch. Comparing the
> > optimisation and your series, we get:
> >
> > hackbench-process-pipes
> >                               5.7.0                  5.7.0
> >                   fastfsnotify-v1r1           amir-for-mel
> > Amean     1       0.4630 (   0.00%)      0.4680 (  -1.08%)
> > Amean     3       1.4557 (   0.00%)      1.4807 (  -1.72%)
> > Amean     5       2.4363 (   0.00%)      2.4900 (  -2.20%)
> > Amean     7       3.4757 (   0.00%)      3.5130 (  -1.07%)
> > Amean     12      5.6983 (   0.00%)      5.6967 (   0.03%)
> > Amean     18      8.1327 (   0.00%)      7.7240 (   5.03%)
> > Amean     24     10.0290 (   0.00%)      9.5793 (   4.48%)
> > Amean     30     12.8510 (   0.00%)     12.1630 (   5.35%)
> > Amean     32     13.2410 (   0.00%)     13.2457 (  -0.04%)
> >
> > As you can see, your patches with the optimisation layered on top are
> > comparable to just the optimisation on its own. It's not universally
> > better but it would not look like a regression when comparing releases.
> > The differences are mostly within the noise as there is some variability
> > involved for this workload, so I would not worry too much about it (with
> > the caveat that other machines and other workloads may differ).
>
> Excellent!
> Thanks for verifying.
>
> TBH, this result is not surprising, because despite all the changes from
> fastfsnotify-v1r1 to amir-for-mel, the code that is executed when there
> are no watches should be quite similar. Without any unexpected compiler
> optimizations that may differ between our versions, fsnotify hooks called
> for directories should execute the exact same code.
>
> fsnotify hooks called for non-directories (your workload) should execute
> almost the same code. I spotted one additional test for access to
> d_inode and one additional test of:
> dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED
> in the entry to __fsnotify_parent().
>
> > A minor
> > issue is that this is probably a bisection hazard. If a series is merged
> > and LKP points the finger somewhere in the middle of your series then
> > I suggest you ask them to test the optimisation commit ID to see if the
> > regression goes away.
> >
>
> > No worries. I wasn't planning on submitting the altered patch.
> I just wanted to let you test the final result.
> I will apply your change before my series and make sure to keep
> optimizations while my changes are applied on top of that.
>

FYI, just posted your patch with a minor style change at the bottom
of my prep patch series.

Thanks,
Amir.
Mel Gorman June 12, 2020, 12:42 p.m. UTC | #13
On Fri, Jun 12, 2020 at 12:38:15PM +0300, Amir Goldstein wrote:
> > No worries. I wasn't planning on submitting the altered patch.
> > I just wanted to let you test the final result.
> > I will apply your change before my series and make sure to keep
> > optimizations while my changes are applied on top of that.
> >
> 
> FYI, just posted your patch with a minor style change at the bottom
> of my prep patch series.
> 

Thanks!
diff mbox series

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 72d332ce8e12..de7bbfd973c0 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -143,7 +143,7 @@  void __fsnotify_update_child_dentry_flags(struct inode *inode)
 }
 
 /* Notify this dentry's parent about a child's events. */
-int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
+int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 		    int data_type)
 {
 	struct dentry *parent;
@@ -174,7 +174,7 @@  int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(fsnotify_parent);
+EXPORT_SYMBOL_GPL(__fsnotify_parent);
 
 static int send_to_group(struct inode *to_tell,
 			 __u32 mask, const void *data,
@@ -315,17 +315,12 @@  int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
 	struct fsnotify_iter_info iter_info = {};
 	struct super_block *sb = to_tell->i_sb;
 	struct mount *mnt = NULL;
-	__u32 mnt_or_sb_mask = sb->s_fsnotify_mask;
+	__u32 mnt_or_sb_mask;
 	int ret = 0;
-	__u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
+	__u32 test_mask;
 
-	if (path) {
+	if (path)
 		mnt = real_mount(path->mnt);
-		mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
-	}
-	/* An event "on child" is not intended for a mount/sb mark */
-	if (mask & FS_EVENT_ON_CHILD)
-		mnt_or_sb_mask = 0;
 
 	/*
 	 * Optimization: srcu_read_lock() has a memory barrier which can
@@ -337,11 +332,21 @@  int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is,
 	if (!to_tell->i_fsnotify_marks && !sb->s_fsnotify_marks &&
 	    (!mnt || !mnt->mnt_fsnotify_marks))
 		return 0;
+
+	/* An event "on child" is not intended for a mount/sb mark */
+	mnt_or_sb_mask = 0;
+	if (!(mask & FS_EVENT_ON_CHILD)) {
+		mnt_or_sb_mask = sb->s_fsnotify_mask;
+		if (path)
+			mnt_or_sb_mask |= mnt->mnt_fsnotify_mask;
+	}
+
 	/*
 	 * if this is a modify event we may need to clear the ignored masks
 	 * otherwise return if neither the inode nor the vfsmount/sb care about
 	 * this type of event.
 	 */
+	test_mask = (mask & ALL_FSNOTIFY_EVENTS);
 	if (!(mask & FS_MODIFY) &&
 	    !(test_mask & (to_tell->i_fsnotify_mask | mnt_or_sb_mask)))
 		return 0;
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index 5ab28f6c7d26..508f6bb0b06b 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -44,6 +44,16 @@  static inline void fsnotify_dirent(struct inode *dir, struct dentry *dentry,
 	fsnotify_name(dir, mask, d_inode(dentry), &dentry->d_name, 0);
 }
 
+/* Notify this dentry's parent about a child's events. */
+static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
+				  const void *data, int data_type)
+{
+	if (!(dentry->d_flags & DCACHE_FSNOTIFY_PARENT_WATCHED))
+		return 0;
+
+	return __fsnotify_parent(dentry, mask, data, data_type);
+}
+
 /*
  * Simple wrappers to consolidate calls fsnotify_parent()/fsnotify() when
  * an event is on a file/dentry.
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index f0c506405b54..1626fa7d10ff 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -379,7 +379,7 @@  struct fsnotify_mark {
 /* main fsnotify call to send events */
 extern int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
 		    int data_type, const struct qstr *name, u32 cookie);
-extern int fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
+extern int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 			   int data_type);
 extern void __fsnotify_inode_delete(struct inode *inode);
 extern void __fsnotify_vfsmount_delete(struct vfsmount *mnt);
@@ -541,7 +541,7 @@  static inline int fsnotify(struct inode *to_tell, __u32 mask, const void *data,
 	return 0;
 }
 
-static inline int fsnotify_parent(struct dentry *dentry, __u32 mask,
+static inline int __fsnotify_parent(struct dentry *dentry, __u32 mask,
 				  const void *data, int data_type)
 {
 	return 0;