Message ID: 20221028001016.332663-1-stephen.s.brennan@oracle.com (mailing list archive)
Series: fsnotify: fix softlockups iterating over d_subdirs
Hi Stephen!

On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> Here is v3 of the patch series. I've taken all of the feedback,
> thanks Amir, Christian, Hillf, et al. Differences are noted in each
> patch.
>
> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> returns a reference, which I never called dput() on. With that change, I
> no longer see the rpc_pipefs issue, but I do think I need more testing
> and thinking through the third patch. Al, I'd love your feedback on that
> one especially.
>
> Thanks,
> Stephen
>
> Stephen Brennan (3):
> fsnotify: Use d_find_any_alias to get dentry associated with inode
> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> fsnotify: allow sleepable child flag update

Thanks for the patches Stephen and I'm sorry for replying somewhat late.

The first patch is a nobrainer. The other two patches ... complicate things
somewhat more than I'd like. I guess I can live with them if we
don't find a better solution but I'd like to discuss a bit more about
alternatives.

So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
__fsnotify_parent() for the dentry which triggered the event and does not
have watched parent anymore and never bother with full children walk? I
suppose your contention problems will be gone, we'll just pay the price of
dget_parent() + fsnotify_inode_watches_children() for each child that
falsely triggers instead of for only one. Maybe that's not too bad? After
all any event up to this moment triggered this overhead as well...
Am I missing something?

AFAIU this would allow us to avoid the games with the new connector flag
etc... We would probably still need to avoid softlockups when setting the
flag DCACHE_FSNOTIFY_PARENT_WATCHED but that should be much simpler (we
could use the i_rwsem trick like you do).

								Honza

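For reference, the lazy clearing Jan describes amounts to a small helper
called from __fsnotify_parent() when a child's DCACHE_FSNOTIFY_PARENT_WATCHED
flag turns out to be stale. This is only an illustrative sketch; it mirrors
the helper in the patch Amir posts later in this thread:

	/*
	 * Sketch: an event on a child found DCACHE_FSNOTIFY_PARENT_WATCHED set,
	 * but the parent no longer watches children. Clear the flag for this
	 * one dentry only, instead of walking all of the parent's d_subdirs.
	 */
	static void fsnotify_update_child_dentry_flags(struct inode *p_inode,
						       struct dentry *dentry)
	{
		spin_lock(&dentry->d_lock);
		if (!fsnotify_inode_watches_children(p_inode))
			dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
		spin_unlock(&dentry->d_lock);
	}
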
Hi Jan,

Jan Kara <jack@suse.cz> writes:
> Hi Stephen!
>
> On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> Here is v3 of the patch series. I've taken all of the feedback,
>> thanks Amir, Christian, Hillf, et al. Differences are noted in each
>> patch.
>>
>> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> returns a reference, which I never called dput() on. With that change, I
>> no longer see the rpc_pipefs issue, but I do think I need more testing
>> and thinking through the third patch. Al, I'd love your feedback on that
>> one especially.
>>
>> Thanks,
>> Stephen
>>
>> Stephen Brennan (3):
>> fsnotify: Use d_find_any_alias to get dentry associated with inode
>> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>> fsnotify: allow sleepable child flag update
>
> Thanks for the patches Stephen and I'm sorry for replying somewhat late.

Absolutely no worries, these things take time. Thanks for taking a look!

> The first patch is a nobrainer. The other two patches ... complicate things
> somewhat more than I'd like. I guess I can live with them if we
> don't find a better solution but I'd like to discuss a bit more about
> alternatives.

Understood!

> So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> __fsnotify_parent() for the dentry which triggered the event and does not
> have watched parent anymore and never bother with full children walk? I
> suppose your contention problems will be gone, we'll just pay the price of
> dget_parent() + fsnotify_inode_watches_children() for each child that
> falsely triggers instead of for only one. Maybe that's not too bad? After
> all any event up to this moment triggered this overhead as well...

This is an interesting idea. It came across my mind but I don't think I
considered it seriously because I assumed that it was too big a change.
But I suppose in the process I created an even bigger change :P

The false positive dget_parent() + fsnotify_inode_watches_children()
shouldn't be too bad. I could see a situation where there's a lot of
random accesses within a directory, where the dget_parent() could cause
some contention over the parent dentry. But to be fair, the performance
would have been the same or worse while fsnotify was active in that
case, and the contention would go away as most of the dentries get their
flags cleared. So I don't think this is a problem.

> Am I missing something?

I think there's one thing missed here. I understand you'd like to get
rid of the extra flag in the connector. But the advantage of the flag is
avoiding duplicate work by saving a bit of state. Suppose that a mark is
added to a connector, which causes fsnotify_inode_watches_children() to
become true. Then, any subsequent call to fsnotify_recalc_mask() must
call __fsnotify_update_child_dentry_flags(), even though the child
dentry flags don't need to be updated: they're already set. For (very)
large directories, this can take a few seconds, which means that we're
doing a few extra seconds of work each time a new mark is added to or
removed from a connector in that case. I can't imagine that's a super
common workload though, and I don't know if my customers do that (my
guess would be no).

For an example of a test workload where this would make a difference:
one of my test cases is to create a directory with millions of negative
dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
With the series as-is, only the first task needs to do the child flag
update. With your proposed alternative, each task would re-do the
update.

If we were to manage to get __fsnotify_update_child_dentry_flags() to
work correctly, and safely, with the ability to drop the d_lock and
sleep, then I think that wouldn't be too much of a problem, because then
other spinlock users of the directory will get a chance to grab it, and
so we don't risk softlockups. Without the sleepable iterations, it would
be marginally worse than the current solution, but I can't really
comment _how_ much worse because like I said, it doesn't sound like a
frequent usage pattern.

I think I have a _slight_ preference for the current design, but I see
the appeal of simpler code, and your design would still improve things a
lot! Maybe Amir has an opinion too, but of course I'll defer to what you
want.

Thanks,
Stephen

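To make the trade-off Stephen describes concrete: the connector flag is a
remembered bit of state that lets a second mark on the same directory skip
the d_subdirs walk. The sketch below is purely hypothetical (the flag and
helper names are illustrative stand-ins, not taken from Stephen's actual
patch #2, which is not shown in this thread):

	/*
	 * Hypothetical sketch: record on the connector that the children's
	 * DCACHE_FSNOTIFY_PARENT_WATCHED flags are already set, so a second
	 * inotifywait on the same huge directory does not repeat the
	 * multi-second walk. FSNOTIFY_CONN_FLAG_CHILDREN_WATCHED is made up
	 * for this illustration.
	 */
	if (fsnotify_inode_watches_children(inode) &&
	    !(conn->flags & FSNOTIFY_CONN_FLAG_CHILDREN_WATCHED)) {
		__fsnotify_update_child_dentry_flags(inode); /* one d_subdirs walk */
		conn->flags |= FSNOTIFY_CONN_FLAG_CHILDREN_WATCHED;
	}
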
On Tue, Nov 1, 2022 at 10:49 PM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Hi Jan,
>
> Jan Kara <jack@suse.cz> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hillf, et al. Differences are noted in each
> >> patch.
> >>
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >>
> >> Thanks,
> >> Stephen
> >>
> >> Stephen Brennan (3):
> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>
> Absolutely no worries, these things take time. Thanks for taking a look!
>
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
>
> Understood!
>
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event up to this moment triggered this overhead as well...
>
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
>
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
>

I also don't think that is a problem.

> > Am I missing something?
>
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).
>
> For an example of a test workload where this would make a difference:
> one of my test cases is to create a directory with millions of negative
> dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
> With the series as-is, only the first task needs to do the child flag
> update. With your proposed alternative, each task would re-do the
> update.
>
> If we were to manage to get __fsnotify_update_child_dentry_flags() to
> work correctly, and safely, with the ability to drop the d_lock and
> sleep, then I think that wouldn't be too much of a problem, because then
> other spinlock users of the directory will get a chance to grab it, and
> so we don't risk softlockups. Without the sleepable iterations, it would
> be marginally worse than the current solution, but I can't really
> comment _how_ much worse because like I said, it doesn't sound like a
> frequent usage pattern.
>
> I think I have a _slight_ preference for the current design, but I see
> the appeal of simpler code, and your design would still improve things a
> lot! Maybe Amir has an opinion too, but of course I'll defer to what you
> want.
>

IIUC, patches #1+#3 fix a reproducible softlockup, so if Al approves them,
I see no reason to withhold.

With patches #1+#3 applied, the penalty is restricted to adding or
removing/destroying multiple marks on very large dirs or dirs with
many negative dentries.

I think that fixing the synthetic test case of multiple added marks
is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
Something like the attached patch is what Jan had suggested in the
first place.

The synthetic test case of concurrent add/remove watch on the same
dir may still result in multiple children iterations, but that will usually
not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
and probably not worth optimizing for.

Thoughts?

Thanks,
Amir.

On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> Jan Kara <jack@suse.cz> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hillf, et al. Differences are noted in each
> >> patch.
> >>
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >>
> >> Thanks,
> >> Stephen
> >>
> >> Stephen Brennan (3):
> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>
> Absolutely no worries, these things take time. Thanks for taking a look!
>
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
>
> Understood!
>
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event up to this moment triggered this overhead as well...
>
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
>
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
>
> > Am I missing something?
>
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).

I understand. This basically matters for fsnotify_recalc_mask(). As a side
note I've realized that your changes to fsnotify_recalc_mask() acquiring
inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
dnotify because that calls fsnotify_recalc_mask() under a spinlock.
Furthermore it is somewhat worrying also for inotify & fanotify because it
nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
something doesn't force the ordering the other way around (e.g. the removal
of oneshot mark during modify event generation). Did you run tests with
lockdep enabled?

Anyway, if the lock ordering issues can be solved, I suppose we can
optimize fsnotify_recalc_mask() like:

	inode_lock(inode);
	spin_lock(&conn->lock);
	oldmask = inode->i_fsnotify_mask;
	__fsnotify_recalc_mask(conn);
	newmask = inode->i_fsnotify_mask;
	spin_unlock(&conn->lock);
	if (watching children changed(oldmask, newmask))
		__fsnotify_update_child_dentry_flags(...)
	inode_unlock(inode);

And because everything is serialized by inode_lock, we don't have to worry
about inode->i_fsnotify_mask and dentry flags getting out of sync or some
mark addition returning before all children are marked for reporting
events. No need for the connector flag AFAICT.

But the locking issue needs to be resolved first in any case. I need to
think some more...

								Honza

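For what it's worth, the "watching children changed" test in the pseudocode
above boils down to comparing the children-watching state implied by the old
and new masks, using the same FS_EVENT_ON_CHILD / FS_EVENTS_POSS_ON_CHILD
bits that fsnotify_inode_watches_children() checks. A minimal sketch (the
helper name is illustrative):

	static bool fsnotify_children_watch_changed(__u32 oldmask, __u32 newmask)
	{
		bool was = (oldmask & FS_EVENT_ON_CHILD) &&
			   (oldmask & FS_EVENTS_POSS_ON_CHILD);
		bool now = (newmask & FS_EVENT_ON_CHILD) &&
			   (newmask & FS_EVENTS_POSS_ON_CHILD);

		return was != now;
	}
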
Jan Kara <jack@suse.cz> writes:
> On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
>> Jan Kara <jack@suse.cz> writes:
>> > Hi Stephen!
>> >
>> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> >> Here is v3 of the patch series. I've taken all of the feedback,
>> >> thanks Amir, Christian, Hillf, et al. Differences are noted in each
>> >> patch.
>> >>
>> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> >> returns a reference, which I never called dput() on. With that change, I
>> >> no longer see the rpc_pipefs issue, but I do think I need more testing
>> >> and thinking through the third patch. Al, I'd love your feedback on that
>> >> one especially.
>> >>
>> >> Thanks,
>> >> Stephen
>> >>
>> >> Stephen Brennan (3):
>> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
>> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>> >> fsnotify: allow sleepable child flag update
>> >
>> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>>
>> Absolutely no worries, these things take time. Thanks for taking a look!
>>
>> > The first patch is a nobrainer. The other two patches ... complicate things
>> > somewhat more than I'd like. I guess I can live with them if we
>> > don't find a better solution but I'd like to discuss a bit more about
>> > alternatives.
>>
>> Understood!
>>
>> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
>> > __fsnotify_parent() for the dentry which triggered the event and does not
>> > have watched parent anymore and never bother with full children walk? I
>> > suppose your contention problems will be gone, we'll just pay the price of
>> > dget_parent() + fsnotify_inode_watches_children() for each child that
>> > falsely triggers instead of for only one. Maybe that's not too bad? After
>> > all any event up to this moment triggered this overhead as well...
>>
>> This is an interesting idea. It came across my mind but I don't think I
>> considered it seriously because I assumed that it was too big a change.
>> But I suppose in the process I created an even bigger change :P
>>
>> The false positive dget_parent() + fsnotify_inode_watches_children()
>> shouldn't be too bad. I could see a situation where there's a lot of
>> random accesses within a directory, where the dget_parent() could cause
>> some contention over the parent dentry. But to be fair, the performance
>> would have been the same or worse while fsnotify was active in that
>> case, and the contention would go away as most of the dentries get their
>> flags cleared. So I don't think this is a problem.
>>
>> > Am I missing something?
>>
>> I think there's one thing missed here. I understand you'd like to get
>> rid of the extra flag in the connector. But the advantage of the flag is
>> avoiding duplicate work by saving a bit of state. Suppose that a mark is
>> added to a connector, which causes fsnotify_inode_watches_children() to
>> become true. Then, any subsequent call to fsnotify_recalc_mask() must
>> call __fsnotify_update_child_dentry_flags(), even though the child
>> dentry flags don't need to be updated: they're already set. For (very)
>> large directories, this can take a few seconds, which means that we're
>> doing a few extra seconds of work each time a new mark is added to or
>> removed from a connector in that case. I can't imagine that's a super
>> common workload though, and I don't know if my customers do that (my
>> guess would be no).
>
> I understand. This basically matters for fsnotify_recalc_mask(). As a side
> note I've realized that your changes to fsnotify_recalc_mask() acquiring
> inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> Furthermore it is somewhat worrying also for inotify & fanotify because it
> nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> something doesn't force the ordering the other way around (e.g. the removal
> of oneshot mark during modify event generation). Did you run tests with
> lockdep enabled?

No I didn't. I'll be sure to get the LTP tests running with lockdep
early next week and try this series out, we'll probably get an error
like you say.

I'll also take a look at the dnotify use case and see if there's
anything to do there. Hopefully there's something we can do to salvage
it :D

Thanks,
Stephen

> Anyway, if the lock ordering issues can be solved, I suppose we can
> optimize fsnotify_recalc_mask() like:
>
> 	inode_lock(inode);
> 	spin_lock(&conn->lock);
> 	oldmask = inode->i_fsnotify_mask;
> 	__fsnotify_recalc_mask(conn);
> 	newmask = inode->i_fsnotify_mask;
> 	spin_unlock(&conn->lock);
> 	if (watching children changed(oldmask, newmask))
> 		__fsnotify_update_child_dentry_flags(...)
> 	inode_unlock(inode);
>
> And because everything is serialized by inode_lock, we don't have to worry
> about inode->i_fsnotify_mask and dentry flags getting out of sync or some
> mark addition returning before all children are marked for reporting
> events. No need for the connector flag AFAICT.
>
> But the locking issue needs to be resolved first in any case. I need to
> think some more...
>
> Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

On Fri 04-11-22 16:33:18, Stephen Brennan wrote:
> Jan Kara <jack@suse.cz> writes:
> > On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >> > Hi Stephen!
> >> >
> >> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> >> thanks Amir, Christian, Hillf, et al. Differences are noted in each
> >> >> patch.
> >> >>
> >> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> >> returns a reference, which I never called dput() on. With that change, I
> >> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> >> one especially.
> >> >>
> >> >> Thanks,
> >> >> Stephen
> >> >>
> >> >> Stephen Brennan (3):
> >> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> >> fsnotify: allow sleepable child flag update
> >> >
> >> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
> >>
> >> Absolutely no worries, these things take time. Thanks for taking a look!
> >>
> >> > The first patch is a nobrainer. The other two patches ... complicate things
> >> > somewhat more than I'd like. I guess I can live with them if we
> >> > don't find a better solution but I'd like to discuss a bit more about
> >> > alternatives.
> >>
> >> Understood!
> >>
> >> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> >> > __fsnotify_parent() for the dentry which triggered the event and does not
> >> > have watched parent anymore and never bother with full children walk? I
> >> > suppose your contention problems will be gone, we'll just pay the price of
> >> > dget_parent() + fsnotify_inode_watches_children() for each child that
> >> > falsely triggers instead of for only one. Maybe that's not too bad? After
> >> > all any event up to this moment triggered this overhead as well...
> >>
> >> This is an interesting idea. It came across my mind but I don't think I
> >> considered it seriously because I assumed that it was too big a change.
> >> But I suppose in the process I created an even bigger change :P
> >>
> >> The false positive dget_parent() + fsnotify_inode_watches_children()
> >> shouldn't be too bad. I could see a situation where there's a lot of
> >> random accesses within a directory, where the dget_parent() could cause
> >> some contention over the parent dentry. But to be fair, the performance
> >> would have been the same or worse while fsnotify was active in that
> >> case, and the contention would go away as most of the dentries get their
> >> flags cleared. So I don't think this is a problem.
> >>
> >> > Am I missing something?
> >>
> >> I think there's one thing missed here. I understand you'd like to get
> >> rid of the extra flag in the connector. But the advantage of the flag is
> >> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> >> added to a connector, which causes fsnotify_inode_watches_children() to
> >> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> >> call __fsnotify_update_child_dentry_flags(), even though the child
> >> dentry flags don't need to be updated: they're already set. For (very)
> >> large directories, this can take a few seconds, which means that we're
> >> doing a few extra seconds of work each time a new mark is added to or
> >> removed from a connector in that case. I can't imagine that's a super
> >> common workload though, and I don't know if my customers do that (my
> >> guess would be no).
> >
> > I understand. This basically matters for fsnotify_recalc_mask(). As a side
> > note I've realized that your changes to fsnotify_recalc_mask() acquiring
> > inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> > dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> > Furthermore it is somewhat worrying also for inotify & fanotify because it
> > nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> > something doesn't force the ordering the other way around (e.g. the removal
> > of oneshot mark during modify event generation). Did you run tests with
> > lockdep enabled?
>
> No I didn't. I'll be sure to get the LTP tests running with lockdep
> early next week and try this series out, we'll probably get an error
> like you say.
>
> I'll also take a look at the dnotify use case and see if there's
> anything to do there. Hopefully there's something we can do to salvage
> it :D

Yeah, dnotify should be solvable. I'm even willing to accept somewhat
unusual methods for dnotify ;). It is an ancient and rarely used API these
days so I don't want it to block fixes for newer APIs. From a quick look it
should be enough to move the fsnotify_recalc_mask() call in
dnotify_recalc_inode_mask() from under the mark->lock out into the callers
and perhaps rename dnotify_recalc_inode_mask() to dnotify_recalc_mark_mask()
to make the name somewhat less confusing.

								Honza

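A rough sketch of the shape Jan suggests for dnotify (illustrative only,
modeled on the existing dnotify_recalc_inode_mask(); not a tested patch):
recompute the mark's mask under mark->lock and report whether it changed, so
the caller can drop the spinlock before calling fsnotify_recalc_mask():

	static bool dnotify_recalc_mark_mask(struct fsnotify_mark *fsn_mark)
	{
		__u32 new_mask = 0;
		struct dnotify_struct *dn;
		struct dnotify_mark *dn_mark = container_of(fsn_mark,
							    struct dnotify_mark,
							    fsn_mark);

		assert_spin_locked(&fsn_mark->lock);

		for (dn = dn_mark->dn; dn != NULL; dn = dn->dn_next)
			new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
		if (fsn_mark->mask == new_mask)
			return false;
		fsn_mark->mask = new_mask;
		/* caller invokes fsnotify_recalc_mask(fsn_mark->connector) unlocked */
		return true;
	}
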
Amir Goldstein <amir73il@gmail.com> writes:
[...]
> IIUC, patches #1+#3 fix a reproducible softlockup, so if Al approves them,
> I see no reason to withhold.
>
> With patches #1+#3 applied, the penalty is restricted to adding or
> removing/destroying multiple marks on very large dirs or dirs with
> many negative dentries.
>
> I think that fixing the synthetic test case of multiple added marks
> is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
> Something like the attached patch is what Jan had suggested in the
> first place.
>
> The synthetic test case of concurrent add/remove watch on the same
> dir may still result in multiple children iterations, but that will usually
> not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
> and probably not worth optimizing for.
>
> Thoughts?

If I understand your train of thought, your below patch would take the
place of my patch #2, since #3 requires we not hold a spinlock during
fsnotify_update_children_dentry_flags().

And since you fully rely on dentry accessors to lazily clear flags, you
don't need to have the mutual exclusion provided by the inode lock.
I like that a lot.

However, the one thing I see here is that if we no longer hold the inode
lock, patch #3 needs to be re-examined: it assumes that dentries can't
be removed or moved around, and that assumption is only true with the
inode lock held.

I'll test things out with this patch replacing #2 and see how things shake
out. Maybe I can improve #3.

Thanks,
Stephen

>
> Thanks,
> Amir.

> From c8ea1d84397c26ce334bff9d9f721400cebb88dd Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <amir73il@gmail.com>
> Date: Wed, 2 Nov 2022 10:28:01 +0200
> Subject: [PATCH] fsnotify: clear PARENT_WATCHED flags lazily
>
> Call fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags
> only when parent starts watching children.
>
> When parent stops watching children, clear false positive PARENT_WATCHED
> flags lazily in __fsnotify_parent() for each accessed child.
>
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/notify/fsnotify.c             | 26 ++++++++++++++++++++------
>  fs/notify/fsnotify.h             |  3 ++-
>  fs/notify/mark.c                 | 32 +++++++++++++++++++++++++++++---
>  include/linux/fsnotify_backend.h |  8 +++++---
>  4 files changed, 56 insertions(+), 13 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 1e541a9bd12b..f60078d6bb97 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -103,17 +103,13 @@ void fsnotify_sb_delete(struct super_block *sb)
>   * parent cares. Thus when an event happens on a child it can quickly tell
>   * if there is a need to find a parent and send the event to the parent.
>   */
> -void __fsnotify_update_child_dentry_flags(struct inode *inode)
> +void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
>  {
>  	struct dentry *alias;
> -	int watched;
>
>  	if (!S_ISDIR(inode->i_mode))
>  		return;
>
> -	/* determine if the children should tell inode about their events */
> -	watched = fsnotify_inode_watches_children(inode);
> -
>  	spin_lock(&inode->i_lock);
>  	/* run all of the dentries associated with this inode. Since this is a
>  	 * directory, there damn well better only be one item on this list */
> @@ -140,6 +136,24 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
>  	spin_unlock(&inode->i_lock);
>  }
>
> +/*
> + * Lazily clear false positive PARENT_WATCHED flag for child whose parent had
> + * stopped watching children.
> + */
> +static void fsnotify_update_child_dentry_flags(struct inode *inode,
> +					       struct dentry *dentry)
> +{
> +	spin_lock(&dentry->d_lock);
> +	/*
> +	 * d_lock is a sufficient barrier to prevent observing a non-watched
> +	 * parent state from before the fsnotify_update_children_dentry_flags()
> +	 * or fsnotify_update_flags() call that had set PARENT_WATCHED.
> +	 */
> +	if (!fsnotify_inode_watches_children(inode))
> +		dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> +	spin_unlock(&dentry->d_lock);
> +}
> +
>  /* Are inode/sb/mount interested in parent and name info with this event? */
>  static bool fsnotify_event_needs_parent(struct inode *inode, struct mount *mnt,
>  					__u32 mask)
> @@ -208,7 +222,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>  	p_inode = parent->d_inode;
>  	p_mask = fsnotify_inode_watches_children(p_inode);
>  	if (unlikely(parent_watched && !p_mask))
> -		__fsnotify_update_child_dentry_flags(p_inode);
> +		fsnotify_update_child_dentry_flags(p_inode, dentry);
>
>  	/*
>  	 * Include parent/name in notification either if some notification
> diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
> index fde74eb333cc..bce9be36d06b 100644
> --- a/fs/notify/fsnotify.h
> +++ b/fs/notify/fsnotify.h
> @@ -74,7 +74,8 @@ static inline void fsnotify_clear_marks_by_sb(struct super_block *sb)
>   * update the dentry->d_flags of all of inode's children to indicate if inode cares
>   * about events that happen to its children.
>   */
> -extern void __fsnotify_update_child_dentry_flags(struct inode *inode);
> +extern void fsnotify_update_children_dentry_flags(struct inode *inode,
> +						  bool watched);
>
>  extern struct kmem_cache *fsnotify_mark_connector_cachep;
>
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index fcc68b8a40fd..614bce0e7261 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -176,6 +176,24 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  	return fsnotify_update_iref(conn, want_iref);
>  }
>
> +static bool fsnotify_conn_watches_children(
> +					struct fsnotify_mark_connector *conn)
> +{
> +	if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> +		return false;
> +
> +	return fsnotify_inode_watches_children(fsnotify_conn_inode(conn));
> +}
> +
> +static void fsnotify_conn_set_children_dentry_flags(
> +					struct fsnotify_mark_connector *conn)
> +{
> +	if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> +		return;
> +
> +	fsnotify_update_children_dentry_flags(fsnotify_conn_inode(conn), true);
> +}
> +
>  /*
>   * Calculate mask of events for a list of marks. The caller must make sure
>   * connector and connector->obj cannot disappear under us. Callers achieve
> @@ -184,15 +202,23 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>   */
>  void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>  {
> +	bool update_children;
> +
>  	if (!conn)
>  		return;
>
>  	spin_lock(&conn->lock);
> +	update_children = !fsnotify_conn_watches_children(conn);
>  	__fsnotify_recalc_mask(conn);
> +	update_children &= fsnotify_conn_watches_children(conn);
>  	spin_unlock(&conn->lock);
> -	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> -		__fsnotify_update_child_dentry_flags(
> -					fsnotify_conn_inode(conn));
> +	/*
> +	 * Set children's PARENT_WATCHED flags only if parent started watching.
> +	 * When parent stops watching, we clear false positive PARENT_WATCHED
> +	 * flags lazily in __fsnotify_parent().
> +	 */
> +	if (update_children)
> +		fsnotify_conn_set_children_dentry_flags(conn);
>  }
>
>  /* Free all connectors queued for freeing once SRCU period ends */
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index a31423c376a7..bd90bcf6c3b0 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -586,12 +586,14 @@ static inline __u32 fsnotify_parent_needed_mask(__u32 mask)
>
>  static inline int fsnotify_inode_watches_children(struct inode *inode)
>  {
> +	__u32 parent_mask = READ_ONCE(inode->i_fsnotify_mask);
> +
>  	/* FS_EVENT_ON_CHILD is set if the inode may care */
> -	if (!(inode->i_fsnotify_mask & FS_EVENT_ON_CHILD))
> +	if (!(parent_mask & FS_EVENT_ON_CHILD))
>  		return 0;
>  	/* this inode might care about child events, does it care about the
>  	 * specific set of events that can happen on a child? */
> -	return inode->i_fsnotify_mask & FS_EVENTS_POSS_ON_CHILD;
> +	return parent_mask & FS_EVENTS_POSS_ON_CHILD;
>  }
>
>  /*
> @@ -605,7 +607,7 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
>  	/*
>  	 * Serialisation of setting PARENT_WATCHED on the dentries is provided
>  	 * by d_lock. If inotify_inode_watched changes after we have taken
> -	 * d_lock, the following __fsnotify_update_child_dentry_flags call will
> +	 * d_lock, the following fsnotify_update_children_dentry_flags call will
>  	 * find our entry, so it will spin until we complete here, and update
>  	 * us with the new state.
>  	 */
> --
> 2.25.1

On Fri, Nov 11, 2022 at 3:12 AM Stephen Brennan
<stephen.s.brennan@oracle.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> writes:
>
> > On Thu, Nov 10, 2022, 10:25 PM Stephen Brennan <stephen.s.brennan@oracle.com>
> > wrote:
> >
> >> Amir Goldstein <amir73il@gmail.com> writes:
> >> [...]
> >> > IIUC, patches #1+#3 fix a reproducible softlockup, so if Al approves them,
> >> > I see no reason to withhold.
> >> >
> >> > With patches #1+#3 applied, the penalty is restricted to adding or
> >> > removing/destroying multiple marks on very large dirs or dirs with
> >> > many negative dentries.
> >> >
> >> > I think that fixing the synthetic test case of multiple added marks
> >> > is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
> >> > Something like the attached patch is what Jan had suggested in the
> >> > first place.
> >> >
> >> > The synthetic test case of concurrent add/remove watch on the same
> >> > dir may still result in multiple children iterations, but that will
> >> > usually not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
> >> > and probably not worth optimizing for.
> >> >
> >> > Thoughts?
> >>
> >> If I understand your train of thought, your below patch would take the
> >> place of my patch #2, since #3 requires we not hold a spinlock during
> >> fsnotify_update_children_dentry_flags().
> >>
> >> And since you fully rely on dentry accessors to lazily clear flags, you
> >> don't need to have the mutual exclusion provided by the inode lock.
> >>
> >
> > You don't need it to synchronize set and clear of all children, but the
> > inode lock is still useful for the update of children.
> >
> >
> >> I like that a lot.
> >>
> >> However, the one thing I see here is that if we no longer hold the inode
> >> lock, patch #3 needs to be re-examined: it assumes that dentries can't
> >> be removed or moved around, and that assumption is only true with the
> >> inode lock held.
> >>
> >
> > I wasn't planning on dropping that assumption
> > I was assuming that another patch would replace the spinlock with inode
> > lock.
> > As you had planned to do to avoid softlockups
> > When adding marks.
>
> Ah, thanks for the clarity! I believe I was working to that conclusion
> myself too.
>
> I actually believe I can make patch #3 work even when inode lock is not
> held, using the d_parent+retry check, and ensuring we skip cursors

I believe that would be a textbook premature optimization. With all the
patches planned, what would be the problem solved by not holding the inode
lock?

> before cond_resched(). I was hoping to have a tested series ready by the
> end of my workday, but alas, it wasn't in the cards. I've got some
> builds running, which I'll test on my LTP machine in the morning. The
> overall idea is:
>
> fsnotify: clear PARENT_WATCHED flags lazily
> -> As you sent it, no changes
> fsnotify: Use d_find_any_alias to get dentry associated with inode
> -> My patch #1, rebased
> dnotify: move fsnotify_recalc_mask() outside spinlock
> -> New patch to ensure we don't sleep while holding dnotify spinlock
> fsnotify: allow sleepable child flag update
> -> My patch #3, but working even without inode lock held
> fsnotify: require inode lock held during child flag update
> -> An optional additional patch to require the inode lock to be held,
> just like you described.
>
> The main benefit of having this (assuming my changes to my patch #3
> hold out under scrutiny/testing) would be that it prevents the
> situation where another task can do an fsnotify_recalc_mask(), see
> that the flags don't need an update, and return before a different
> task actually finishes the child flag update.

Assuming that you figured out the lock ordering and fixed the dnotify
locking, I just don't see pros in letting go of the inode lock, only cons.
At the most, you can defer getting rid of the inode lock to a later time
after the dust has settled on the other changes, but again, I'm not sure
for what purpose. I cannot see how holding the inode lock could in any way
regress the situation from the current state of holding the spinlock.

Thanks,
Amir.