[v3] fsnotify: optimize the case of no parent watcher

Message ID	20240116113247.758848-1-amir73il@gmail.com (mailing list archive)
State	New, archived
Headers	show Received: from mail-wm1-f46.google.com (mail-wm1-f46.google.com [209.85.128.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 81E371B940 for <linux-fsdevel@vger.kernel.org>; Tue, 16 Jan 2024 11:32:55 +0000 (UTC) From: Amir Goldstein <amir73il@gmail.com> To: Jan Kara <jack@suse.cz>, Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org>, linux-fsdevel@vger.kernel.org Subject: [PATCH v3] fsnotify: optimize the case of no parent watcher Date: Tue, 16 Jan 2024 13:32:47 +0200 Message-Id: <20240116113247.758848-1-amir73il@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v3] fsnotify: optimize the case of no parent watcher \| expand [v3] fsnotify: optimize the case of no parent watcher

Amir Goldstein Jan. 16, 2024, 11:32 a.m. UTC

If parent inode is not watching, check for the event in masks of
sb/mount/inode masks early to optimize out most of the code in
__fsnotify_parent() and avoid calling fsnotify().

Jens has reported that this optimization improves BW and IOPS in an
io_uring benchmark by more than 10% and reduces perf reported CPU usage.

before:

+    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
+    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent

after:

+    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent

Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Jan,

Considering that this change looks like a clear win and it actually
the change that you suggested, I cleaned it up a bit and posting for
your consideration.

I've kept the wrappers fsnotify_path() and fsnotify_sb_has_watchers()
although they are not directly related to the optimization, because they
makes the code a bit nicer IMO.

Jens,

I took the liberty to add Reported-and-tested-by to attribute for your
help even though the patch that you tested was slightly different than
this cleaned up version, because logically it should be identical.

Thanks,
Amir.

Changes since v2:
- Add helper fsnotify_object_watched()
- Move the optimization from inlined fsnotify_path() to
  extern __fsnotify_parent()
- Optimize for any event - not only content events
- Drop the SB_I_CONTENT_WATCHED flag

 fs/notify/fsnotify.c     | 28 +++++++++++++++++-----------
 include/linux/fsnotify.h | 22 +++++++++++++++-------
 2 files changed, 32 insertions(+), 18 deletions(-)

Jan Kara Jan. 16, 2024, 12:04 p.m. UTC | #1

On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> If parent inode is not watching, check for the event in masks of
> sb/mount/inode masks early to optimize out most of the code in
> __fsnotify_parent() and avoid calling fsnotify().
> 
> Jens has reported that this optimization improves BW and IOPS in an
> io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> 
> before:
> 
> +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> 
> after:
> 
> +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> 
> Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> Jan,
> 
> Considering that this change looks like a clear win and it actually
> the change that you suggested, I cleaned it up a bit and posting for
> your consideration.

Agreed, I like this. What did you generate this patch against? It does not
apply on top of current Linus' tree (maybe it needs the change sitting in
VFS tree - which is fine I can wait until that gets merged)?

> I've kept the wrappers fsnotify_path() and fsnotify_sb_has_watchers()
> although they are not directly related to the optimization, because they
> makes the code a bit nicer IMO.

Yeah, these cleanups are fine. I would have prefered them as a separate
patch (some people might want the performance improvement to be backported
and this makes it unnecessarily more complex) but don't resend just because
of that.

								Honza

Amir Goldstein Jan. 16, 2024, 12:53 p.m. UTC | #2

On Tue, Jan 16, 2024 at 2:04 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> > If parent inode is not watching, check for the event in masks of
> > sb/mount/inode masks early to optimize out most of the code in
> > __fsnotify_parent() and avoid calling fsnotify().
> >
> > Jens has reported that this optimization improves BW and IOPS in an
> > io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> >
> > before:
> >
> > +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> > +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> >
> > after:
> >
> > +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> >
> > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > Jan,
> >
> > Considering that this change looks like a clear win and it actually
> > the change that you suggested, I cleaned it up a bit and posting for
> > your consideration.
>
> Agreed, I like this. What did you generate this patch against? It does not
> apply on top of current Linus' tree (maybe it needs the change sitting in
> VFS tree - which is fine I can wait until that gets merged)?
>

Yes, it is on top of Christian's vfs-fixes branch.

> > I've kept the wrappers fsnotify_path() and fsnotify_sb_has_watchers()
> > although they are not directly related to the optimization, because they
> > makes the code a bit nicer IMO.
>
> Yeah, these cleanups are fine. I would have prefered them as a separate
> patch (some people might want the performance improvement to be backported
> and this makes it unnecessarily more complex) but don't resend just because
> of that.
>

Makes sense.

Thanks,
Amir.

Jan Kara Jan. 24, 2024, 4:07 p.m. UTC | #3

On Tue 16-01-24 14:53:00, Amir Goldstein wrote:
> On Tue, Jan 16, 2024 at 2:04 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> > > If parent inode is not watching, check for the event in masks of
> > > sb/mount/inode masks early to optimize out most of the code in
> > > __fsnotify_parent() and avoid calling fsnotify().
> > >
> > > Jens has reported that this optimization improves BW and IOPS in an
> > > io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> > >
> > > before:
> > >
> > > +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> > > +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > >
> > > after:
> > >
> > > +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > >
> > > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > > Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > ---
> > >
> > > Jan,
> > >
> > > Considering that this change looks like a clear win and it actually
> > > the change that you suggested, I cleaned it up a bit and posting for
> > > your consideration.
> >
> > Agreed, I like this. What did you generate this patch against? It does not
> > apply on top of current Linus' tree (maybe it needs the change sitting in
> > VFS tree - which is fine I can wait until that gets merged)?
> >
> 
> Yes, it is on top of Christian's vfs-fixes branch.

Merged your improvement now (and I've split off the cleanup into a separate
change and dropped the creation of fsnotify_path() which seemed a bit
pointless with a single caller). All pushed out.

								Honza

Amir Goldstein Jan. 24, 2024, 4:20 p.m. UTC | #4

On Wed, Jan 24, 2024 at 6:08 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 16-01-24 14:53:00, Amir Goldstein wrote:
> > On Tue, Jan 16, 2024 at 2:04 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> > > > If parent inode is not watching, check for the event in masks of
> > > > sb/mount/inode masks early to optimize out most of the code in
> > > > __fsnotify_parent() and avoid calling fsnotify().
> > > >
> > > > Jens has reported that this optimization improves BW and IOPS in an
> > > > io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> > > >
> > > > before:
> > > >
> > > > +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> > > > +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > >
> > > > after:
> > > >
> > > > +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > >
> > > > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > > > Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > > >
> > > > Jan,
> > > >
> > > > Considering that this change looks like a clear win and it actually
> > > > the change that you suggested, I cleaned it up a bit and posting for
> > > > your consideration.
> > >
> > > Agreed, I like this. What did you generate this patch against? It does not
> > > apply on top of current Linus' tree (maybe it needs the change sitting in
> > > VFS tree - which is fine I can wait until that gets merged)?
> > >
> >
> > Yes, it is on top of Christian's vfs-fixes branch.
>
> Merged your improvement now (and I've split off the cleanup into a separate
> change and dropped the creation of fsnotify_path() which seemed a bit
> pointless with a single caller). All pushed out.
>

Thanks!
Amir.

Amir Goldstein Feb. 13, 2024, 7:45 p.m. UTC | #5

On Wed, Jan 24, 2024 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 6:08 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 16-01-24 14:53:00, Amir Goldstein wrote:
> > > On Tue, Jan 16, 2024 at 2:04 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> > > > > If parent inode is not watching, check for the event in masks of
> > > > > sb/mount/inode masks early to optimize out most of the code in
> > > > > __fsnotify_parent() and avoid calling fsnotify().
> > > > >
> > > > > Jens has reported that this optimization improves BW and IOPS in an
> > > > > io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> > > > >
> > > > > before:
> > > > >
> > > > > +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> > > > > +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > > >
> > > > > after:
> > > > >
> > > > > +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > > >
> > > > > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > > > > Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > ---
> > > > >
> > > > > Jan,
> > > > >
> > > > > Considering that this change looks like a clear win and it actually
> > > > > the change that you suggested, I cleaned it up a bit and posting for
> > > > > your consideration.
> > > >
> > > > Agreed, I like this. What did you generate this patch against? It does not
> > > > apply on top of current Linus' tree (maybe it needs the change sitting in
> > > > VFS tree - which is fine I can wait until that gets merged)?
> > > >
> > >
> > > Yes, it is on top of Christian's vfs-fixes branch.
> >
> > Merged your improvement now (and I've split off the cleanup into a separate
> > change and dropped the creation of fsnotify_path() which seemed a bit
> > pointless with a single caller). All pushed out.
> >
>

Jan & Jens,

Although Jan has already queued this v3 patch with sufficient performance
improvement for Jens' workloads, I got a performance regression report from
kernel robot on will-it-scale microbenchmark (buffered write loop)
on my fan_pre_content patches, so I tried to improve on the existing solution.

I tried something similar to v1/v2 patches, where the sb keeps accounting
of the number of watchers for specific sub-classes of events.

I've made two major changes:
1. moved to counters into a per-sb state object fsnotify_sb_connector
    as Christian requested
2. The counters are by fanotify classes, not by specific events, so they
    can be used to answer the questions:
a) Are there any fsnotify watchers on this sb?
b) Are there any fanotify permission class listeners on this sb?
c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?

I think that those questions are very relevant in the real world, because
a positive answer to (b) and (c) is quite rare in the real world, so the
overhead on the permission hooks could be completely eliminated in
the common case.

If needed, we can further bisect the class counters per specific painful
events (e.g. FAN_ACCESS*), but there is no need to do that before
we see concrete benchmark results.

Jens,

If you feel like it, you can see if this branch further improves your
workloads:

https://github.com/amir73il/linux/commits/fsnotify-perf/

Jan,

Whenever you have the time, feel free to see if this is a valid direction,
if not for the perf optimization then we are going to need the
fsnotify_sb_connector container for other features as well.

Thanks!
Amir.

Jan Kara Feb. 14, 2024, 11:23 a.m. UTC | #6

On Tue 13-02-24 21:45:56, Amir Goldstein wrote:
> On Wed, Jan 24, 2024 at 6:20 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Wed, Jan 24, 2024 at 6:08 PM Jan Kara <jack@suse.cz> wrote:
> > > On Tue 16-01-24 14:53:00, Amir Goldstein wrote:
> > > > On Tue, Jan 16, 2024 at 2:04 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 16-01-24 13:32:47, Amir Goldstein wrote:
> > > > > > If parent inode is not watching, check for the event in masks of
> > > > > > sb/mount/inode masks early to optimize out most of the code in
> > > > > > __fsnotify_parent() and avoid calling fsnotify().
> > > > > >
> > > > > > Jens has reported that this optimization improves BW and IOPS in an
> > > > > > io_uring benchmark by more than 10% and reduces perf reported CPU usage.
> > > > > >
> > > > > > before:
> > > > > >
> > > > > > +    4.51%  io_uring  [kernel.vmlinux]  [k] fsnotify
> > > > > > +    3.67%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > > > >
> > > > > > after:
> > > > > >
> > > > > > +    2.37%  io_uring  [kernel.vmlinux]  [k] __fsnotify_parent
> > > > > >
> > > > > > Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
> > > > > > Link: https://lore.kernel.org/linux-fsdevel/b45bd8ff-5654-4e67-90a6-aad5e6759e0b@kernel.dk/
> > > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > > ---
> > > > > >
> > > > > > Jan,
> > > > > >
> > > > > > Considering that this change looks like a clear win and it actually
> > > > > > the change that you suggested, I cleaned it up a bit and posting for
> > > > > > your consideration.
> > > > >
> > > > > Agreed, I like this. What did you generate this patch against? It does not
> > > > > apply on top of current Linus' tree (maybe it needs the change sitting in
> > > > > VFS tree - which is fine I can wait until that gets merged)?
> > > > >
> > > >
> > > > Yes, it is on top of Christian's vfs-fixes branch.
> > >
> > > Merged your improvement now (and I've split off the cleanup into a separate
> > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > pointless with a single caller). All pushed out.
> > >
> >
> 
> Jan & Jens,
> 
> Although Jan has already queued this v3 patch with sufficient performance
> improvement for Jens' workloads, I got a performance regression report from
> kernel robot on will-it-scale microbenchmark (buffered write loop)
> on my fan_pre_content patches, so I tried to improve on the existing solution.
> 
> I tried something similar to v1/v2 patches, where the sb keeps accounting
> of the number of watchers for specific sub-classes of events.
> 
> I've made two major changes:
> 1. moved to counters into a per-sb state object fsnotify_sb_connector
>     as Christian requested
> 2. The counters are by fanotify classes, not by specific events, so they
>     can be used to answer the questions:
> a) Are there any fsnotify watchers on this sb?
> b) Are there any fanotify permission class listeners on this sb?
> c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> 
> I think that those questions are very relevant in the real world, because
> a positive answer to (b) and (c) is quite rare in the real world, so the
> overhead on the permission hooks could be completely eliminated in
> the common case.
> 
> If needed, we can further bisect the class counters per specific painful
> events (e.g. FAN_ACCESS*), but there is no need to do that before
> we see concrete benchmark results.

OK, I think this idea is sound, I'd just be interested whether the 0-day
bot (or somebody else) is able to see improvement with this. Otherwise why
bother :)

> Jan,
> 
> Whenever you have the time, feel free to see if this is a valid direction,
> if not for the perf optimization then we are going to need the
> fsnotify_sb_connector container for other features as well.

So firstly the name fsnotify_sb_connector really confuses me. I'd save
"connector" names to fsnotify_mark_connector. Maybe fsnotify_sb_info?

Then I dislike how we have to specialcase superblock in quite a few places
and add these wrappers and what not. This seems to be mostly caused by the
fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
What if we just put fsnotify_connp_t there? I understand that this will
mean one more pointer fetch if there are actually marks attached to the
superblock and the event mask matches s_fsnotify_mask. But in that case we
are likely to generate the event anyway so the cost of that compared to
event generation is negligible?

And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
which means that we need to pass object pointer (in the form of void *)
instead of fsnotify_connp_t to various mark adding functions (and transform
it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
setting up fsnotify_sb_info). Passing void * around is not great but it
should be fairly limited (and actually reduces the knowledge of fsnotify
internals outside of the fsnotify core).

								Honza

Amir Goldstein Feb. 14, 2024, 1:40 p.m. UTC | #7

> > > > Merged your improvement now (and I've split off the cleanup into a separate
> > > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > > pointless with a single caller). All pushed out.
> > > >
> > >
> >
> > Jan & Jens,
> >
> > Although Jan has already queued this v3 patch with sufficient performance
> > improvement for Jens' workloads, I got a performance regression report from
> > kernel robot on will-it-scale microbenchmark (buffered write loop)
> > on my fan_pre_content patches, so I tried to improve on the existing solution.
> >
> > I tried something similar to v1/v2 patches, where the sb keeps accounting
> > of the number of watchers for specific sub-classes of events.
> >
> > I've made two major changes:
> > 1. moved to counters into a per-sb state object fsnotify_sb_connector
> >     as Christian requested
> > 2. The counters are by fanotify classes, not by specific events, so they
> >     can be used to answer the questions:
> > a) Are there any fsnotify watchers on this sb?
> > b) Are there any fanotify permission class listeners on this sb?
> > c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> >
> > I think that those questions are very relevant in the real world, because
> > a positive answer to (b) and (c) is quite rare in the real world, so the
> > overhead on the permission hooks could be completely eliminated in
> > the common case.
> >
> > If needed, we can further bisect the class counters per specific painful
> > events (e.g. FAN_ACCESS*), but there is no need to do that before
> > we see concrete benchmark results.
>
> OK, I think this idea is sound, I'd just be interested whether the 0-day
> bot (or somebody else) is able to see improvement with this. Otherwise why
> bother :)
>

Exactly.

> > Jan,
> >
> > Whenever you have the time, feel free to see if this is a valid direction,
> > if not for the perf optimization then we are going to need the
> > fsnotify_sb_connector container for other features as well.
>
> So firstly the name fsnotify_sb_connector really confuses me. I'd save
> "connector" names to fsnotify_mark_connector. Maybe fsnotify_sb_info?
>

Sure.

> Then I dislike how we have to specialcase superblock in quite a few places
> and add these wrappers and what not. This seems to be mostly caused by the
> fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
> What if we just put fsnotify_connp_t there? I understand that this will
> mean one more pointer fetch if there are actually marks attached to the
> superblock and the event mask matches s_fsnotify_mask. But in that case we
> are likely to generate the event anyway so the cost of that compared to
> event generation is negligible?
>

I guess that can work.
I can try it and see if there are any other complications.

> And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
> which means that we need to pass object pointer (in the form of void *)
> instead of fsnotify_connp_t to various mark adding functions (and transform
> it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
> setting up fsnotify_sb_info). Passing void * around is not great but it
> should be fairly limited (and actually reduces the knowledge of fsnotify
> internals outside of the fsnotify core).

Unless I am missing something, I think we only need to pass an extra sb
arg to fsnotify_add_mark_locked()? and it does not sound like a big deal.
For adding an sb mark, connp arg could be NULL, and then we get connp
from sb->fsnotify_sb_info after making sure that it is allocated.

I will get to look at it in ~2 weeks.

Thanks for the quick feedback.
Amir.

Jan Kara Feb. 15, 2024, 8:36 a.m. UTC | #8

On Wed 14-02-24 15:40:31, Amir Goldstein wrote:
> > > > > Merged your improvement now (and I've split off the cleanup into a separate
> > > > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > > > pointless with a single caller). All pushed out.
> > > > >
> > > >
> > >
> > > Jan & Jens,
> > >
> > > Although Jan has already queued this v3 patch with sufficient performance
> > > improvement for Jens' workloads, I got a performance regression report from
> > > kernel robot on will-it-scale microbenchmark (buffered write loop)
> > > on my fan_pre_content patches, so I tried to improve on the existing solution.
> > >
> > > I tried something similar to v1/v2 patches, where the sb keeps accounting
> > > of the number of watchers for specific sub-classes of events.
> > >
> > > I've made two major changes:
> > > 1. moved to counters into a per-sb state object fsnotify_sb_connector
> > >     as Christian requested
> > > 2. The counters are by fanotify classes, not by specific events, so they
> > >     can be used to answer the questions:
> > > a) Are there any fsnotify watchers on this sb?
> > > b) Are there any fanotify permission class listeners on this sb?
> > > c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> > >
> > > I think that those questions are very relevant in the real world, because
> > > a positive answer to (b) and (c) is quite rare in the real world, so the
> > > overhead on the permission hooks could be completely eliminated in
> > > the common case.
> > >
> > > If needed, we can further bisect the class counters per specific painful
> > > events (e.g. FAN_ACCESS*), but there is no need to do that before
> > > we see concrete benchmark results.
...
 
> > Then I dislike how we have to specialcase superblock in quite a few places
> > and add these wrappers and what not. This seems to be mostly caused by the
> > fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
> > What if we just put fsnotify_connp_t there? I understand that this will
> > mean one more pointer fetch if there are actually marks attached to the
> > superblock and the event mask matches s_fsnotify_mask. But in that case we
> > are likely to generate the event anyway so the cost of that compared to
> > event generation is negligible?
> >
> 
> I guess that can work.
> I can try it and see if there are any other complications.
> 
> > And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
> > which means that we need to pass object pointer (in the form of void *)
> > instead of fsnotify_connp_t to various mark adding functions (and transform
> > it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
> > setting up fsnotify_sb_info). Passing void * around is not great but it
> > should be fairly limited (and actually reduces the knowledge of fsnotify
> > internals outside of the fsnotify core).
> 
> Unless I am missing something, I think we only need to pass an extra sb
> arg to fsnotify_add_mark_locked()? and it does not sound like a big deal.
> For adding an sb mark, connp arg could be NULL, and then we get connp
> from sb->fsnotify_sb_info after making sure that it is allocated.

Yes that would be another possibility but frankly I like passing the
'object' pointer instead of connp pointer a bit more. But we can see how
the code looks like.

								Honza

Jens Axboe Feb. 15, 2024, 3:07 p.m. UTC | #9

On 2/13/24 12:45 PM, Amir Goldstein wrote:
> Jens,
> 
> If you feel like it, you can see if this branch further improves your
> workloads:
> 
> https://github.com/amir73il/linux/commits/fsnotify-perf/

Baseline is current -git with changes I have for 6.9, so not with the
previous patch. 4 optanes, usually random reads, and being run on a dual
socket Intel(R) Xeon(R) Platinum 8458P  CPU @ 2.7GHz.

IOPS=16.20M, BW=7.91GiB/s, IOS/call=32/31
IOPS=16.20M, BW=7.91GiB/s, IOS/call=32/32
IOPS=16.20M, BW=7.91GiB/s, IOS/call=32/31

and in perf profile, we see:

+    3.16%  io_uring  [kernel.kallsyms]  [k] fsnotify
+    2.04%  io_uring  [kernel.kallsyms]  [k] __fsnotify_parent

with this branch pulled in:

IOPS=17.44M, BW=8.52GiB/s, IOS/call=32/31
IOPS=17.45M, BW=8.52GiB/s, IOS/call=32/31
IOPS=17.45M, BW=8.52GiB/s, IOS/call=32/31

and perf diff shows:

     2.04%     -1.15%  [kernel.kallsyms]  [k] __fsnotify_parent
     3.16%             [kernel.kallsyms]  [k] fsnotify

with a big reduction for __fsnotify_parent() and fsnotify() being
totally gone. In absolute terms, this is all we see in perf profile with
the patch:

+    0.89%  io_uring  [kernel.kallsyms]  [k] __fsnotify_parent

iow, we went from over 5% of added overhead to less than 1%.

Amir Goldstein March 6, 2024, 2:51 p.m. UTC | #10

On Thu, Feb 15, 2024 at 10:36 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 14-02-24 15:40:31, Amir Goldstein wrote:
> > > > > > Merged your improvement now (and I've split off the cleanup into a separate
> > > > > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > > > > pointless with a single caller). All pushed out.
> > > > > >
> > > > >
> > > >
> > > > Jan & Jens,
> > > >
> > > > Although Jan has already queued this v3 patch with sufficient performance
> > > > improvement for Jens' workloads, I got a performance regression report from
> > > > kernel robot on will-it-scale microbenchmark (buffered write loop)
> > > > on my fan_pre_content patches, so I tried to improve on the existing solution.
> > > >
> > > > I tried something similar to v1/v2 patches, where the sb keeps accounting
> > > > of the number of watchers for specific sub-classes of events.
> > > >
> > > > I've made two major changes:
> > > > 1. moved to counters into a per-sb state object fsnotify_sb_connector
> > > >     as Christian requested
> > > > 2. The counters are by fanotify classes, not by specific events, so they
> > > >     can be used to answer the questions:
> > > > a) Are there any fsnotify watchers on this sb?
> > > > b) Are there any fanotify permission class listeners on this sb?
> > > > c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> > > >
> > > > I think that those questions are very relevant in the real world, because
> > > > a positive answer to (b) and (c) is quite rare in the real world, so the
> > > > overhead on the permission hooks could be completely eliminated in
> > > > the common case.
> > > >
> > > > If needed, we can further bisect the class counters per specific painful
> > > > events (e.g. FAN_ACCESS*), but there is no need to do that before
> > > > we see concrete benchmark results.
> ...
>
> > > Then I dislike how we have to specialcase superblock in quite a few places
> > > and add these wrappers and what not. This seems to be mostly caused by the
> > > fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
> > > What if we just put fsnotify_connp_t there? I understand that this will
> > > mean one more pointer fetch if there are actually marks attached to the
> > > superblock and the event mask matches s_fsnotify_mask. But in that case we
> > > are likely to generate the event anyway so the cost of that compared to
> > > event generation is negligible?
> > >
> >
> > I guess that can work.
> > I can try it and see if there are any other complications.
> >
> > > And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
> > > which means that we need to pass object pointer (in the form of void *)
> > > instead of fsnotify_connp_t to various mark adding functions (and transform
> > > it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
> > > setting up fsnotify_sb_info). Passing void * around is not great but it
> > > should be fairly limited (and actually reduces the knowledge of fsnotify
> > > internals outside of the fsnotify core).
> >
> > Unless I am missing something, I think we only need to pass an extra sb
> > arg to fsnotify_add_mark_locked()? and it does not sound like a big deal.
> > For adding an sb mark, connp arg could be NULL, and then we get connp
> > from sb->fsnotify_sb_info after making sure that it is allocated.
>
> Yes that would be another possibility but frankly I like passing the
> 'object' pointer instead of connp pointer a bit more. But we can see how
> the code looks like.

Ok, here it is:

https://github.com/amir73il/linux/commits/fsnotify-sbinfo/

I agree that the interface does end up looking better this way.

I've requested to re-test performance on fsnotify-sbinfo.

You can use this rebased branch to look at the diff from the
the previous patches that were tested by 0day:

https://github.com/amir73il/linux/commits/fsnotify-sbconn/

If you have the bandwidth to consider those patches as candidates
for (the second half of?) 6.9 merge window, I can post them for review.

The main benefit of getting them merged in 6.9 is that I could work
on pre-content events for 6.10 without having to worry about perf
regressions.

Thanks,
Amir.

Jan Kara March 8, 2024, 4 p.m. UTC | #11

On Wed 06-03-24 16:51:06, Amir Goldstein wrote:
> On Thu, Feb 15, 2024 at 10:36 AM Jan Kara <jack@suse.cz> wrote:
> > On Wed 14-02-24 15:40:31, Amir Goldstein wrote:
> > > > > > > Merged your improvement now (and I've split off the cleanup into a separate
> > > > > > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > > > > > pointless with a single caller). All pushed out.
> > > > > > >
> > > > > >
> > > > >
> > > > > Jan & Jens,
> > > > >
> > > > > Although Jan has already queued this v3 patch with sufficient performance
> > > > > improvement for Jens' workloads, I got a performance regression report from
> > > > > kernel robot on will-it-scale microbenchmark (buffered write loop)
> > > > > on my fan_pre_content patches, so I tried to improve on the existing solution.
> > > > >
> > > > > I tried something similar to v1/v2 patches, where the sb keeps accounting
> > > > > of the number of watchers for specific sub-classes of events.
> > > > >
> > > > > I've made two major changes:
> > > > > 1. moved to counters into a per-sb state object fsnotify_sb_connector
> > > > >     as Christian requested
> > > > > 2. The counters are by fanotify classes, not by specific events, so they
> > > > >     can be used to answer the questions:
> > > > > a) Are there any fsnotify watchers on this sb?
> > > > > b) Are there any fanotify permission class listeners on this sb?
> > > > > c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> > > > >
> > > > > I think that those questions are very relevant in the real world, because
> > > > > a positive answer to (b) and (c) is quite rare in the real world, so the
> > > > > overhead on the permission hooks could be completely eliminated in
> > > > > the common case.
> > > > >
> > > > > If needed, we can further bisect the class counters per specific painful
> > > > > events (e.g. FAN_ACCESS*), but there is no need to do that before
> > > > > we see concrete benchmark results.
> > ...
> >
> > > > Then I dislike how we have to specialcase superblock in quite a few places
> > > > and add these wrappers and what not. This seems to be mostly caused by the
> > > > fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
> > > > What if we just put fsnotify_connp_t there? I understand that this will
> > > > mean one more pointer fetch if there are actually marks attached to the
> > > > superblock and the event mask matches s_fsnotify_mask. But in that case we
> > > > are likely to generate the event anyway so the cost of that compared to
> > > > event generation is negligible?
> > > >
> > >
> > > I guess that can work.
> > > I can try it and see if there are any other complications.
> > >
> > > > And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
> > > > which means that we need to pass object pointer (in the form of void *)
> > > > instead of fsnotify_connp_t to various mark adding functions (and transform
> > > > it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
> > > > setting up fsnotify_sb_info). Passing void * around is not great but it
> > > > should be fairly limited (and actually reduces the knowledge of fsnotify
> > > > internals outside of the fsnotify core).
> > >
> > > Unless I am missing something, I think we only need to pass an extra sb
> > > arg to fsnotify_add_mark_locked()? and it does not sound like a big deal.
> > > For adding an sb mark, connp arg could be NULL, and then we get connp
> > > from sb->fsnotify_sb_info after making sure that it is allocated.
> >
> > Yes that would be another possibility but frankly I like passing the
> > 'object' pointer instead of connp pointer a bit more. But we can see how
> > the code looks like.
> 
> Ok, here it is:
> 
> https://github.com/amir73il/linux/commits/fsnotify-sbinfo/
> 
> I agree that the interface does end up looking better this way.

Yep, the interface looks fine. I have left some comments on github
regarding typos and some suspicious things.

> I've requested to re-test performance on fsnotify-sbinfo.
> 
> You can use this rebased branch to look at the diff from the
> the previous patches that were tested by 0day:
> 
> https://github.com/amir73il/linux/commits/fsnotify-sbconn/
> 
> If you have the bandwidth to consider those patches as candidates
> for (the second half of?) 6.9 merge window, I can post them for review.

Well, unless Linus does rc8, I don't think we should queue these for the
merge window as it is too late by now. But please post them for review,
I'll have a look. I can then push them to my tree early into a stable
branch and you can base your patches on my branch. If the patches then need
to go through VFS tree, Christian is fine with pulling my tree...

							Honza

Christian Brauner March 11, 2024, 1:51 p.m. UTC | #12

On Fri, Mar 08, 2024 at 05:00:58PM +0100, Jan Kara wrote:
> On Wed 06-03-24 16:51:06, Amir Goldstein wrote:
> > On Thu, Feb 15, 2024 at 10:36 AM Jan Kara <jack@suse.cz> wrote:
> > > On Wed 14-02-24 15:40:31, Amir Goldstein wrote:
> > > > > > > > Merged your improvement now (and I've split off the cleanup into a separate
> > > > > > > > change and dropped the creation of fsnotify_path() which seemed a bit
> > > > > > > > pointless with a single caller). All pushed out.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > > Jan & Jens,
> > > > > >
> > > > > > Although Jan has already queued this v3 patch with sufficient performance
> > > > > > improvement for Jens' workloads, I got a performance regression report from
> > > > > > kernel robot on will-it-scale microbenchmark (buffered write loop)
> > > > > > on my fan_pre_content patches, so I tried to improve on the existing solution.
> > > > > >
> > > > > > I tried something similar to v1/v2 patches, where the sb keeps accounting
> > > > > > of the number of watchers for specific sub-classes of events.
> > > > > >
> > > > > > I've made two major changes:
> > > > > > 1. moved to counters into a per-sb state object fsnotify_sb_connector
> > > > > >     as Christian requested
> > > > > > 2. The counters are by fanotify classes, not by specific events, so they
> > > > > >     can be used to answer the questions:
> > > > > > a) Are there any fsnotify watchers on this sb?
> > > > > > b) Are there any fanotify permission class listeners on this sb?
> > > > > > c) Are there any fanotify pre-content (a.k.a HSM) class listeners on this sb?
> > > > > >
> > > > > > I think that those questions are very relevant in the real world, because
> > > > > > a positive answer to (b) and (c) is quite rare in the real world, so the
> > > > > > overhead on the permission hooks could be completely eliminated in
> > > > > > the common case.
> > > > > >
> > > > > > If needed, we can further bisect the class counters per specific painful
> > > > > > events (e.g. FAN_ACCESS*), but there is no need to do that before
> > > > > > we see concrete benchmark results.
> > > ...
> > >
> > > > > Then I dislike how we have to specialcase superblock in quite a few places
> > > > > and add these wrappers and what not. This seems to be mostly caused by the
> > > > > fact that you directly embed fsnotify_mark_connector into fsnotify_sb_info.
> > > > > What if we just put fsnotify_connp_t there? I understand that this will
> > > > > mean one more pointer fetch if there are actually marks attached to the
> > > > > superblock and the event mask matches s_fsnotify_mask. But in that case we
> > > > > are likely to generate the event anyway so the cost of that compared to
> > > > > event generation is negligible?
> > > > >
> > > >
> > > > I guess that can work.
> > > > I can try it and see if there are any other complications.
> > > >
> > > > > And I'd allocate fsnotify_sb_info on demand from fsnotify_add_mark_locked()
> > > > > which means that we need to pass object pointer (in the form of void *)
> > > > > instead of fsnotify_connp_t to various mark adding functions (and transform
> > > > > it to fsnotify_connp_t only in fsnotify_add_mark_locked() after possibly
> > > > > setting up fsnotify_sb_info). Passing void * around is not great but it
> > > > > should be fairly limited (and actually reduces the knowledge of fsnotify
> > > > > internals outside of the fsnotify core).
> > > >
> > > > Unless I am missing something, I think we only need to pass an extra sb
> > > > arg to fsnotify_add_mark_locked()? and it does not sound like a big deal.
> > > > For adding an sb mark, connp arg could be NULL, and then we get connp
> > > > from sb->fsnotify_sb_info after making sure that it is allocated.
> > >
> > > Yes that would be another possibility but frankly I like passing the
> > > 'object' pointer instead of connp pointer a bit more. But we can see how
> > > the code looks like.
> > 
> > Ok, here it is:
> > 
> > https://github.com/amir73il/linux/commits/fsnotify-sbinfo/
> > 
> > I agree that the interface does end up looking better this way.
> 
> Yep, the interface looks fine. I have left some comments on github
> regarding typos and some suspicious things.
> 
> > I've requested to re-test performance on fsnotify-sbinfo.
> > 
> > You can use this rebased branch to look at the diff from the
> > the previous patches that were tested by 0day:
> > 
> > https://github.com/amir73il/linux/commits/fsnotify-sbconn/
> > 
> > If you have the bandwidth to consider those patches as candidates
> > for (the second half of?) 6.9 merge window, I can post them for review.
> 
> Well, unless Linus does rc8, I don't think we should queue these for the
> merge window as it is too late by now. But please post them for review,
> I'll have a look. I can then push them to my tree early into a stable
> branch and you can base your patches on my branch. If the patches then need
> to go through VFS tree, Christian is fine with pulling my tree...

I'm absolutely opposed to touching anything that you do. I'm joking of
course, I'm very happy to pull from you!

[v3] fsnotify: optimize the case of no parent watcher

Commit Message

Comments

Patch