diff mbox series

[v3,3/3] fsnotify: allow sleepable child flag update

Message ID 20221028001016.332663-4-stephen.s.brennan@oracle.com (mailing list archive)
State New, archived
Headers show
Series fsnotify: fix softlockups iterating over d_subdirs | expand

Commit Message

Stephen Brennan Oct. 28, 2022, 12:10 a.m. UTC
With very large d_subdirs lists, iteration can take a long time. Since
iteration needs to hold parent->d_lock, this can trigger soft lockups.
It would be best to make this iteration sleepable. Since we have the
inode locked exclusive, we can drop the parent->d_lock and sleep,
holding a reference to a child dentry, and continue iteration once we
wake.

Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
---

Notes:
    v3:
    - removed if statements around dput()
    v2:
    - added a check for child->d_parent != alias and retry logic

 fs/notify/fsnotify.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

Comments

Amir Goldstein Oct. 28, 2022, 9:32 a.m. UTC | #1
On Fri, Oct 28, 2022 at 3:10 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> With very large d_subdirs lists, iteration can take a long time. Since
> iteration needs to hold parent->d_lock, this can trigger soft lockups.
> It would be best to make this iteration sleepable. Since we have the
> inode locked exclusive, we can drop the parent->d_lock and sleep,
> holding a reference to a child dentry, and continue iteration once we
> wake.
>
> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
> ---
>

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

some comment nits and one fortify suggestion

> Notes:
>     v3:
>     - removed if statements around dput()
>     v2:
>     - added a check for child->d_parent != alias and retry logic
>
>  fs/notify/fsnotify.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index ccb8a3a6c522..34e5d18235a7 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -102,10 +102,12 @@ void fsnotify_sb_delete(struct super_block *sb)
>   * on a child we run all of our children and set a dentry flag saying that the
>   * parent cares.  Thus when an event happens on a child it can quickly tell
>   * if there is a need to find a parent and send the event to the parent.
> + *
> + * Context: inode locked exclusive
>   */
>  static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
>  {
> -       struct dentry *alias, *child;
> +       struct dentry *child, *alias, *last_ref = NULL;
>         int watched;
>
>         if (!S_ISDIR(inode->i_mode))
> @@ -120,12 +122,37 @@ static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
>         alias = d_find_any_alias(inode);
>
>         /*
> -        * run all of the children of the original inode and fix their
> -        * d_flags to indicate parental interest (their parent is the
> -        * original inode)
> +        * These lists can get very long, so we may need to sleep during
> +        * iteration. Normally this would be impossible without a cursor,
> +        * but since we have the inode locked exclusive, we're guaranteed
> +        * that the directory won't be modified, so whichever dentry we
> +        * pick to sleep on won't get moved. So, start a manual iteration
> +        * over d_subdirs which will allow us to sleep.
>          */
>         spin_lock(&alias->d_lock);
> +retry:
>         list_for_each_entry(child, &alias->d_subdirs, d_child) {
> +               if (need_resched()) {
> +                       /*
> +                        * We need to hold a reference while we sleep. But when
> +                        * we wake, dput() could free the dentry, invalidating
> +                        * the list pointers. We can't look at the list pointers
> +                        * until we re-lock the parent, and we can't dput() once
> +                        * we have the parent locked. So the solution is to hold
> +                        * onto our reference and free it the *next* time we drop
> +                        * alias->d_lock: either at the end of the function, or
> +                        * at the time of the next sleep.
> +                        */

My personal preference would be to move this above if (needed_reschd())
it is not any less clear when this comment is above the condition
and less indented will read nicer.

> +                       dget(child);
> +                       spin_unlock(&alias->d_lock);
> +                       dput(last_ref);
> +                       last_ref = child;
> +                       cond_resched();
> +                       spin_lock(&alias->d_lock);
> +                       if (child->d_parent != alias)
> +                               goto retry;

Is this expected? If not, then we need a WARN_ON_ONCE().
Also, I wonder if it would be better to break out and leave
dentry flags as they are instead of risking some endless
or very long retry loop?

And how about asserting on unexpected !list_empty(&child->d_child)
to avoid an endless loop in list_for_each_entry()?

Thanks,
Amir.
Stephen Brennan Nov. 1, 2022, 9:25 p.m. UTC | #2
Amir Goldstein <amir73il@gmail.com> writes:

> On Fri, Oct 28, 2022 at 3:10 AM Stephen Brennan
> <stephen.s.brennan@oracle.com> wrote:
>>
>> With very large d_subdirs lists, iteration can take a long time. Since
>> iteration needs to hold parent->d_lock, this can trigger soft lockups.
>> It would be best to make this iteration sleepable. Since we have the
>> inode locked exclusive, we can drop the parent->d_lock and sleep,
>> holding a reference to a child dentry, and continue iteration once we
>> wake.
>>
>> Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
>> ---
>>
>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
>
> some comment nits and one fortify suggestion
>
>> Notes:
>>     v3:
>>     - removed if statements around dput()
>>     v2:
>>     - added a check for child->d_parent != alias and retry logic
>>
>>  fs/notify/fsnotify.c | 36 ++++++++++++++++++++++++++++++++----
>>  1 file changed, 32 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index ccb8a3a6c522..34e5d18235a7 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -102,10 +102,12 @@ void fsnotify_sb_delete(struct super_block *sb)
>>   * on a child we run all of our children and set a dentry flag saying that the
>>   * parent cares.  Thus when an event happens on a child it can quickly tell
>>   * if there is a need to find a parent and send the event to the parent.
>> + *
>> + * Context: inode locked exclusive
>>   */
>>  static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
>>  {
>> -       struct dentry *alias, *child;
>> +       struct dentry *child, *alias, *last_ref = NULL;
>>         int watched;
>>
>>         if (!S_ISDIR(inode->i_mode))
>> @@ -120,12 +122,37 @@ static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
>>         alias = d_find_any_alias(inode);
>>
>>         /*
>> -        * run all of the children of the original inode and fix their
>> -        * d_flags to indicate parental interest (their parent is the
>> -        * original inode)
>> +        * These lists can get very long, so we may need to sleep during
>> +        * iteration. Normally this would be impossible without a cursor,
>> +        * but since we have the inode locked exclusive, we're guaranteed
>> +        * that the directory won't be modified, so whichever dentry we
>> +        * pick to sleep on won't get moved. So, start a manual iteration
>> +        * over d_subdirs which will allow us to sleep.
>>          */
>>         spin_lock(&alias->d_lock);
>> +retry:
>>         list_for_each_entry(child, &alias->d_subdirs, d_child) {
>> +               if (need_resched()) {
>> +                       /*
>> +                        * We need to hold a reference while we sleep. But when
>> +                        * we wake, dput() could free the dentry, invalidating
>> +                        * the list pointers. We can't look at the list pointers
>> +                        * until we re-lock the parent, and we can't dput() once
>> +                        * we have the parent locked. So the solution is to hold
>> +                        * onto our reference and free it the *next* time we drop
>> +                        * alias->d_lock: either at the end of the function, or
>> +                        * at the time of the next sleep.
>> +                        */
>
> My personal preference would be to move this above if (needed_reschd())
> it is not any less clear when this comment is above the condition
> and less indented will read nicer.

Definitely.

>> +                       dget(child);
>> +                       spin_unlock(&alias->d_lock);
>> +                       dput(last_ref);
>> +                       last_ref = child;
>> +                       cond_resched();
>> +                       spin_lock(&alias->d_lock);
>> +                       if (child->d_parent != alias)
>> +                               goto retry;
>
> Is this expected? If not, then we need a WARN_ON_ONCE().
> Also, I wonder if it would be better to break out and leave
> dentry flags as they are instead of risking some endless
> or very long retry loop?
>
> And how about asserting on unexpected !list_empty(&child->d_child)
> to avoid an endless loop in list_for_each_entry()?

I was trying to think this through as I prepared v3, and ended I up
leaving it as-is, out of a hope that it was doing something helpful. But
I'm pretty sure Al would ask why I believe that this could happen (e.g.
what scenario am I guarding against happening?). I didn't have a clear
idea of what I was guarding against here when I wrote it, which is my
fault. What's necessary is an audit of the places where d_child is
modified, so we can understand the risky places.

What we're doing here is dropping parent->d_lock and going to sleep,
while still having parent->d_inode->i_rwsem held in write mode. The risk
we have is that this particular dentry is removed from this list while
we slept. Checking d_parent seems particularly unhelpful for that, since
what we care far more about is the d_child list pointers. I did a code
audit of all locations where d_child is modified. Here's all I could
identify:

1. dentry_unlist() in fs/dcache.c
2. __d_move() in fs/dcache.c
3. Initialization code in d_alloc() family
4. Cursor code in fs/libfs.c for dcache_readdir()

For case #1, dentry_unlist() is a helper of dentry_kill(). The real guts
of dentry_kill() must happen with parent->d_lock held. Since we get a
reference to the child with parent->d_lock held, we can be confident
that any dentry_kill() which could have been happening concurrently will
bail out after seeing that we got a reference. At least, that's my
reading of the code.

For case #2, __d_move() can (should?) NOT happen, because the caller
must hold the i_mutex (read i_rwsem in write mode), but we are holding
that ourselves.

For case #3, it doesn't matter... it's initialization time, and we
couldn't have gotten a reference to the dentry until the dentry is put
into the tree.

For case #4, the simple_readdir() code should hold the inode rwsem in
read mode, so we have mutual exclusion from this code too. While it's
possible that we could go to sleep holding a reference to a cursor, I
believe that the cursor could not move without holding i_rwsem in read
mode. So we could never wake up to find we've skipped part of the list.

So to summarize, I don't think there's a case where we could actually
expect that d_child pointers get updated while we sleep with a reference
held and the parent i_rwsem held exclusive. Consequently, no place where
d_parent would change from under our feet. So I will remove this check
form the patch.

Thanks,
Stephen
diff mbox series

Patch

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index ccb8a3a6c522..34e5d18235a7 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -102,10 +102,12 @@  void fsnotify_sb_delete(struct super_block *sb)
  * on a child we run all of our children and set a dentry flag saying that the
  * parent cares.  Thus when an event happens on a child it can quickly tell
  * if there is a need to find a parent and send the event to the parent.
+ *
+ * Context: inode locked exclusive
  */
 static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
 {
-	struct dentry *alias, *child;
+	struct dentry *child, *alias, *last_ref = NULL;
 	int watched;
 
 	if (!S_ISDIR(inode->i_mode))
@@ -120,12 +122,37 @@  static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
 	alias = d_find_any_alias(inode);
 
 	/*
-	 * run all of the children of the original inode and fix their
-	 * d_flags to indicate parental interest (their parent is the
-	 * original inode)
+	 * These lists can get very long, so we may need to sleep during
+	 * iteration. Normally this would be impossible without a cursor,
+	 * but since we have the inode locked exclusive, we're guaranteed
+	 * that the directory won't be modified, so whichever dentry we
+	 * pick to sleep on won't get moved. So, start a manual iteration
+	 * over d_subdirs which will allow us to sleep.
 	 */
 	spin_lock(&alias->d_lock);
+retry:
 	list_for_each_entry(child, &alias->d_subdirs, d_child) {
+		if (need_resched()) {
+			/*
+			 * We need to hold a reference while we sleep. But when
+			 * we wake, dput() could free the dentry, invalidating
+			 * the list pointers. We can't look at the list pointers
+			 * until we re-lock the parent, and we can't dput() once
+			 * we have the parent locked. So the solution is to hold
+			 * onto our reference and free it the *next* time we drop
+			 * alias->d_lock: either at the end of the function, or
+			 * at the time of the next sleep.
+			 */
+			dget(child);
+			spin_unlock(&alias->d_lock);
+			dput(last_ref);
+			last_ref = child;
+			cond_resched();
+			spin_lock(&alias->d_lock);
+			if (child->d_parent != alias)
+				goto retry;
+		}
+
 		if (!child->d_inode)
 			continue;
 
@@ -137,6 +164,7 @@  static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
 		spin_unlock(&child->d_lock);
 	}
 	spin_unlock(&alias->d_lock);
+	dput(last_ref);
 	dput(alias);
 	return watched;
 }