
[v4,0/5] fsnotify: fix softlockups iterating over d_subdirs

Message ID 20221111220614.991928-1-stephen.s.brennan@oracle.com

Message

Stephen Brennan Nov. 11, 2022, 10:06 p.m. UTC
Hi Jan, Amir, Al,

Here's my v4 patch series that aims to eliminate soft lockups when updating
dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
allowing the flag to be lazily cleared in the fsnotify_parent() function,
via Amir's patch. This allowed me to drop patch #2 from my previous series
(fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
replaced it with "fsnotify: require inode lock held during child flag
update", patch #5 in this series. I also added "dnotify: move
fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
issues with dnotify.

Jan expressed concerns about lock ordering of the inode rwsem with the
fsnotify group mutex. I built this with lockdep enabled (see below for the
lock debugging .config section -- I'm not too familiar with lockdep so I
wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
I could find in LTP, with no lockdep splats to be found. I don't know that
this can completely satisfy the concerns about lock ordering: I'm reading
through the code to better understand the concern about "the removal of
oneshot mark during modify event generation". But I'm encouraged by the
LTP+lockdep results.

I also went ahead and did my negative dentry oriented testing. Of course
the fsnotify_parent() issue is fully resolved, and when I tested several
processes all using inotifywait on the same directory full of negative
dentries, I was able to use ftrace to confirm that
fsnotify_update_children_dentry_flags() was called exactly once for all
processes. No softlockups occurred!

I originally wrote this series to make the last patch (#5) optional: if for
some reason we didn't think it was necessary to hold the inode rwsem, then
we could omit it -- the main penalty being the race condition described in
the patch description. I tested without the last patch and LTP also passed
with lockdep enabled, but of course when multiple tasks ran inotifywait
on the same directory (with many negative dentries), only the first waited
for the flag updates; the rest of the tasks returned immediately even though
the flags were not yet ready.

I agree with Amir that as long as the lock ordering is fine, we should keep
patch #5. And if that's the case, I can reorder the series to make it
more logical, and avoid introducing logic in
fsnotify_update_children_dentry_flags() for handling d_move/cursor races
only to delete it later in the series.

1. fsnotify: clear PARENT_WATCHED flags lazily
2. fsnotify: Use d_find_any_alias to get dentry associated with inode
3. dnotify: move fsnotify_recalc_mask() outside spinlock
4. fsnotify: allow sleepable child flag update
5. fsnotify: require inode lock held during child flag update

Thanks for continuing to read this series, I hope we're making progress
toward a simpler way to fix these scaling issues!

Stephen

Amir Goldstein (1):
  fsnotify: clear PARENT_WATCHED flags lazily

Stephen Brennan (4):
  fsnotify: Use d_find_any_alias to get dentry associated with inode
  dnotify: move fsnotify_recalc_mask() outside spinlock
  fsnotify: allow sleepable child flag update
  fsnotify: require inode lock held during child flag update

 fs/notify/dnotify/dnotify.c      |  28 ++++++---
 fs/notify/fsnotify.c             | 101 ++++++++++++++++++++++---------
 fs/notify/fsnotify.h             |   3 +-
 fs/notify/mark.c                 |  40 +++++++++++-
 include/linux/fsnotify_backend.h |   8 ++-
 5 files changed, 136 insertions(+), 44 deletions(-)

Comments

Stephen Brennan Nov. 11, 2022, 10:08 p.m. UTC | #1
Stephen Brennan <stephen.s.brennan@oracle.com> writes:

> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.

Of course, I forgot to append the .config section:

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RAW_LOCK_NESTING is not set
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
CONFIG_CSD_LOCK_WAIT_DEBUG=y
# end of Lock Debugging (spinlocks, mutexes, etc...)

Jan Kara Nov. 22, 2022, 11:50 a.m. UTC | #2
Hi Stephen!

On Fri 11-11-22 14:06:09, Stephen Brennan wrote:
> Here's my v4 patch series that aims to eliminate soft lockups when updating
> dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> allowing the flag to be lazily cleared in the fsnotify_parent() function,
> via Amir's patch. This allowed me to drop patch #2 from my previous series
> (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> replaced it with "fsnotify: require inode lock held during child flag
> update", patch #5 in this series. I also added "dnotify: move
> fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> issues with dnotify.

Yes, the series is now much simpler. Thanks!

> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.

So I had a look and I think your patches could cause a deadlock, at least for
nfsd. The problem is with things like inotify IN_ONESHOT marks: they get
auto-deleted as soon as they trigger. Thus e.g. fsnotify_mkdir() can trigger
an IN_ONESHOT mark, which then gets removed by calling fsnotify_destroy_mark()
from inotify_handle_inode_event(). And nfsd calls e.g. fsnotify_mkdir()
while holding dir->i_rwsem. So we have lock ordering like:

nfsd_mkdir()
  inode_lock(dir);
    ...
    __nfsd_mkdir(dir, ...)
      fsnotify_mkdir(dir, dentry);
        ...
        inotify_handle_inode_event()
          ...
          fsnotify_destroy_mark()
            fsnotify_group_lock(group)

So we have dir->i_rwsem > group->mark_mutex. But we also have callchains
like:

inotify_add_watch()
  inotify_update_watch()
    fsnotify_group_lock(group)
    inotify_update_existing_watch()
      ...
      fsnotify_recalc_mask()
        inode_lock(dir); -> added by your series

which creates ordering group->mark_mutex > dir->i_rwsem.
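The inversion in the two chains above can be modeled outside the kernel with a toy lockdep-style order tracker. This is purely illustrative user-space Python, not kernel code; the "lock" names simply mirror the objects in the quoted call chains:

```python
# Toy lockdep: record which locks are held when each lock is taken, and
# flag any pair ever acquired in both orders (a potential ABBA deadlock).
# The names mirror the kernel objects above; nothing here is kernel code.

class OrderTracker:
    def __init__(self):
        self.held = []        # stack of currently-held lock names
        self.before = set()   # (a, b): a was held while b was taken

    def acquire(self, name):
        for h in self.held:
            if (name, h) in self.before:
                raise RuntimeError(
                    f"lock inversion: {h} -> {name} vs {name} -> {h}")
            self.before.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)

t = OrderTracker()

# nfsd_mkdir() path: dir->i_rwsem, then group->mark_mutex
t.acquire("dir->i_rwsem")
t.acquire("group->mark_mutex")   # via fsnotify_destroy_mark()
t.release("group->mark_mutex")
t.release("dir->i_rwsem")

# inotify_add_watch() path with the series applied: reverse order
t.acquire("group->mark_mutex")
try:
    t.acquire("dir->i_rwsem")    # taken inside fsnotify_recalc_mask()
except RuntimeError as e:
    print(e)                     # the kind of cycle lockdep would report
```

The tracker fires on the second chain because the opposite ordering was already recorded, which is essentially what a lockdep splat would show if both paths were exercised.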

It is even worse with dnotify, which (even with your patches) ends up
calling fsnotify_recalc_mask() from dnotify_handle_event(), so we have the
possibility of a direct A->A deadlock. But I'd leave dnotify aside; I think
it can be massaged to not need to call fsnotify_recalc_mask()
(__fsnotify_recalc_mask() would be enough there).

Still I'm not 100% sure about a proper way out of this. The simplicity of
alias->d_subdirs iteration with i_rwsem held is compelling. We could mandate
that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
nfsd) but IMO that is pushing the complexity from the fsnotify core into
its users which is undesirable. Maybe we could grab inode->i_rwsem in those
places adding / removing notification marks before we grab
group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
has the inode->i_rwsem held and be done with it? That pushes a bit of
complexity into the fsnotify backends but it is not too bad.
fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
Amir?
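Jan's proposal (take inode->i_rwsem at the mark add/remove call sites before grabbing group->mark_mutex, and have fsnotify_recalc_mask() merely assert that the inode lock is held) can be sketched with a small user-space model. This is illustrative Python only; the function bodies are made up and just mirror the kernel names involved:

```python
# Illustrative model of the proposed locking rule (not kernel code):
# callers that add/remove marks take the inode lock first, then the group
# lock, and fsnotify_recalc_mask() only asserts the inode lock is held.

held = set()  # names of "locks" the current context holds

def lock(name):
    # Enforce the consistent global order: inode->i_rwsem must come
    # before group->mark_mutex, never the other way around.
    assert not (name == "inode->i_rwsem" and "group->mark_mutex" in held), \
        "must take inode->i_rwsem before group->mark_mutex"
    held.add(name)

def unlock(name):
    held.discard(name)

def fsnotify_recalc_mask():
    # Mirrors a lockdep_assert_held()-style check: no locking here,
    # just verification that the caller did its part.
    assert "inode->i_rwsem" in held, "caller must hold inode->i_rwsem"

def inotify_update_watch():
    # Proposed ordering at a mark-add site:
    lock("inode->i_rwsem")
    lock("group->mark_mutex")
    fsnotify_recalc_mask()   # safe: the caller already holds the inode lock
    unlock("group->mark_mutex")
    unlock("inode->i_rwsem")

inotify_update_watch()
```

The point of the model is that once every mark add/remove path establishes the same inode-before-group order, fsnotify_recalc_mask() needs no locking of its own, only a lockdep-style assertion.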

> I originally wrote this series to make the last patch (#5) optional: if for
> some reason we didn't think it was necessary to hold the inode rwsem, then
> we could omit it -- the main penalty being the race condition described in
> the patch description. I tested without the last patch and LTP also passed
> with lockdep enabled, but of course when multiple tasks ran inotifywait
> on the same directory (with many negative dentries), only the first waited
> for the flag updates; the rest of the tasks returned immediately even though
> the flags were not yet ready.
> 
> I agree with Amir that as long as the lock ordering is fine, we should keep
> patch #5. And if that's the case, I can reorder the series to make it
> more logical, and avoid introducing logic in
> fsnotify_update_children_dentry_flags() for handling d_move/cursor races
> only to delete it later in the series.
> 
> 1. fsnotify: clear PARENT_WATCHED flags lazily
> 2. fsnotify: Use d_find_any_alias to get dentry associated with inode
> 3. dnotify: move fsnotify_recalc_mask() outside spinlock
> 4. fsnotify: allow sleepable child flag update
> 5. fsnotify: require inode lock held during child flag update
> 
> Thanks for continuing to read this series, I hope we're making progress
> toward a simpler way to fix these scaling issues!

Yeah, so I'd be for making sure i_rwsem is held where we need it first and
only after that add reschedule handling into
fsnotify_update_children_dentry_flags(). That makes the series more
logical.

								Honza
Amir Goldstein Nov. 22, 2022, 2:03 p.m. UTC | #3
On Tue, Nov 22, 2022 at 1:50 PM Jan Kara <jack@suse.cz> wrote:
>
> [...]
>
> Still I'm not 100% sure about a proper way out of this. The simplicity of
> alias->d_subdirs iteration with i_rwsem held is compelling.

Agreed.

> We could mandate
> that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
> nfsd) but IMO that is pushing the complexity from the fsnotify core into
> its users which is undesirable.

I think inode in this context is the parent inode, so all fsnotify hooks
in namei.c are holding inode->i_rwsem by design.

> Maybe we could grab inode->i_rwsem in those
> places adding / removing notification marks before we grab
> group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
> has the inode->i_rwsem held and be done with it? That pushes a bit of
> complexity into the fsnotify backends but it is not too bad.
> fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
> Amir?
>

Absolutely agree - I think it makes sense and will simplify things a lot.

Obviously, we need to assert inode_is_locked() in fsnotify_recalc_mask()
only for (conn->type == FSNOTIFY_OBJ_TYPE_INODE).

Thanks,
Amir.