Message ID | 20201103173300.GF7123@magnolia (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] vfs: remove lockdep bogosity in __sb_start_write | expand |
> int __sb_start_write(struct super_block *sb, int level, bool wait) > { > - int ret = 1; > + if (!wait) > + return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); > > + percpu_down_read(sb->s_writers.rw_sem + level-1); > + return 1; > } > EXPORT_SYMBOL(__sb_start_write); Please split the function into __sb_start_write and __sb_start_write_trylock while you're at it..
On Tue, Nov 03, 2020 at 05:39:21PM +0000, Christoph Hellwig wrote: > > int __sb_start_write(struct super_block *sb, int level, bool wait) > > { > > - int ret = 1; > > + if (!wait) > > + return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); > > > > + percpu_down_read(sb->s_writers.rw_sem + level-1); > > + return 1; > > } > > EXPORT_SYMBOL(__sb_start_write); > > Please split the function into __sb_start_write and > __sb_start_write_trylock while you're at it.. Any thoughts on this patch itself? I don't feel like I have 100% of the context to know whether the removal is a good idea for non-xfs filesystems, though I'm fairly sure the current logic is broken. (Sending a second cleanup patch...) --D
On Tue, Nov 03, 2020 at 10:34:44AM -0800, Darrick J. Wong wrote: > > Please split the function into __sb_start_write and > > __sb_start_write_trylock while you're at it.. > > Any thoughts on this patch itself? I don't feel like I have 100% of the > context to know whether the removal is a good idea for non-xfs > filesystems, though I'm fairly sure the current logic is broken. The existing logic looks pretty bogus to me as well. Did you try to find the discussion that lead to it?
On Tue, Nov 03, 2020 at 06:46:59PM +0000, Christoph Hellwig wrote: > On Tue, Nov 03, 2020 at 10:34:44AM -0800, Darrick J. Wong wrote: > > > Please split the function into __sb_start_write and > > > __sb_start_write_trylock while you're at it.. > > > > Any thoughts on this patch itself? I don't feel like I have 100% of the > > context to know whether the removal is a good idea for non-xfs > > filesystems, though I'm fairly sure the current logic is broken. > > The existing logic looks pretty bogus to me as well. Did you try to find > the discussion that lead to it? TBH I don't know where the discussion happened. The "convert to trylock" behavior first appeared as commit 5accdf82ba25c back in 2012; that commit seems to have come from v6 of a patch[1] that Jan Kara sent to try to fix fs freeze handling back in 2012. The behavior was not in the v5[0] patch, nor was there any discussion for any of the v5 patches that would suggest why things changed from v5 to v6. Dave and I were talking about this on IRC yesterday, and his memory thought that this was lockdep trying to handle xfs taking intwrite protection while handling a write (or page_mkwrite) operation. I'm not sure where "XFS for example gets freeze protection on internal level twice in some cases" would actually happen -- did xfs support nested transactions in the past? We definitely don't now, so I don't think the comment is valid anymore. The last commit to touch this area was f4b554af9931 (in 2015), which says that Dave explained that the trylock hack + comment could be removed, but the patch author never did that, and lore doesn't seem to know where or when Dave actually said that? FWIW I couldn't find a place where we could take the freeze locks in the wrong order (or multiple times) -- I'm pretty sure the file remap stuff (clone/dedupe/copyrange) all grab write freeze protection; I couldn't figure out how you could start with pagefault freeze protection and then need write freeze protection; and AFAIK none of the filesystems that even care about intwrite will try to grab it recursively. Most of them don't allow nested transactions, and the one that does (nilfs2?) appears to be careful not to nest. --D [0] https://lore.kernel.org/linux-fsdevel/1334592845-22862-14-git-send-email-jack@suse.cz/ [1] https://lore.kernel.org/linux-fsdevel/1338589841-9568-14-git-send-email-jack@suse.cz/
On Tue, Nov 03, 2020 at 11:37:50AM -0800, Darrick J. Wong wrote: > On Tue, Nov 03, 2020 at 06:46:59PM +0000, Christoph Hellwig wrote: > > On Tue, Nov 03, 2020 at 10:34:44AM -0800, Darrick J. Wong wrote: > > > > Please split the function into __sb_start_write and > > > > __sb_start_write_trylock while you're at it.. > > > > > > Any thoughts on this patch itself? I don't feel like I have 100% of the > > > context to know whether the removal is a good idea for non-xfs > > > filesystems, though I'm fairly sure the current logic is broken. > > > > The existing logic looks pretty bogus to me as well. Did you try to find > > the discussion that lead to it? > > TBH I don't know where the discussion happened. The "convert to > trylock" behavior first appeared as commit 5accdf82ba25c back in 2012; > that commit seems to have come from v6 of a patch[1] that Jan Kara sent > to try to fix fs freeze handling back in 2012. The behavior was not in > the v5[0] patch, nor was there any discussion for any of the v5 patches > that would suggest why things changed from v5 to v6. > > Dave and I were talking about this on IRC yesterday, and his memory > thought that this was lockdep trying to handle xfs taking intwrite > protection while handling a write (or page_mkwrite) operation. I'm not > sure where "XFS for example gets freeze protection on internal level > twice in some cases" would actually happen -- did xfs support nested > transactions in the past? We definitely don't now, so I don't think the > comment is valid anymore. > > The last commit to touch this area was f4b554af9931 (in 2015), which > says that Dave explained that the trylock hack + comment could be > removed, but the patch author never did that, and lore doesn't seem to > know where or when Dave actually said that? I'm pretty sure this "nesting internal freeze references" stems from the fact we log and flush the superblock after fulling freezing the filesystem to dirty the journal so recovery after a crash while frozen handles unlinked inodes. The high level VFS freeze annotations were not able to handle running this transaction when transactions were supposed to already be blocked and drained, so there was a special hack to hide it from lockdep. Then we ended up hiding it from the VFS via XFS_TRANS_NO_WRITECOUNT in xfs_sync_sb() because we needed it in more places than just freeze (e.g. the log covering code run by the background log worker). It's kinda documented here: /* * xfs_sync_sb * * Sync the superblock to disk. * * Note that the caller is responsible for checking the frozen state of the * filesystem. This procedure uses the non-blocking transaction allocator and * thus will allow modifications to a frozen fs. This is required because this * code can be called during the process of freezing where use of the high-level * allocator would deadlock. */ So, AFAICT, the whole "XFS nests internal transactions" lockdep handling in __sb_start_write() has been unnecessary for quite a few years now.... Cheers, Dave.
On Fri, Nov 06, 2020 at 08:34:15AM +1100, Dave Chinner wrote: > On Tue, Nov 03, 2020 at 11:37:50AM -0800, Darrick J. Wong wrote: > > On Tue, Nov 03, 2020 at 06:46:59PM +0000, Christoph Hellwig wrote: > > > On Tue, Nov 03, 2020 at 10:34:44AM -0800, Darrick J. Wong wrote: > > > > > Please split the function into __sb_start_write and > > > > > __sb_start_write_trylock while you're at it.. > > > > > > > > Any thoughts on this patch itself? I don't feel like I have 100% of the > > > > context to know whether the removal is a good idea for non-xfs > > > > filesystems, though I'm fairly sure the current logic is broken. > > > > > > The existing logic looks pretty bogus to me as well. Did you try to find > > > the discussion that lead to it? > > > > TBH I don't know where the discussion happened. The "convert to > > trylock" behavior first appeared as commit 5accdf82ba25c back in 2012; > > that commit seems to have come from v6 of a patch[1] that Jan Kara sent > > to try to fix fs freeze handling back in 2012. The behavior was not in > > the v5[0] patch, nor was there any discussion for any of the v5 patches > > that would suggest why things changed from v5 to v6. > > > > Dave and I were talking about this on IRC yesterday, and his memory > > thought that this was lockdep trying to handle xfs taking intwrite > > protection while handling a write (or page_mkwrite) operation. I'm not > > sure where "XFS for example gets freeze protection on internal level > > twice in some cases" would actually happen -- did xfs support nested > > transactions in the past? We definitely don't now, so I don't think the > > comment is valid anymore. > > > > The last commit to touch this area was f4b554af9931 (in 2015), which > > says that Dave explained that the trylock hack + comment could be > > removed, but the patch author never did that, and lore doesn't seem to > > know where or when Dave actually said that? > > I'm pretty sure this "nesting internal freeze references" stems from > the fact we log and flush the superblock after fulling freezing the > filesystem to dirty the journal so recovery after a crash while > frozen handles unlinked inodes. > > The high level VFS freeze annotations were not able to handle > running this transaction when transactions were supposed to already > be blocked and drained, so there was a special hack to hide it from > lockdep. Then we ended up hiding it from the VFS via > XFS_TRANS_NO_WRITECOUNT in xfs_sync_sb() because we needed it in > more places than just freeze (e.g. the log covering code > run by the background log worker). It's kinda documented here: > > /* > * xfs_sync_sb > * > * Sync the superblock to disk. > * > * Note that the caller is responsible for checking the frozen state of the > * filesystem. This procedure uses the non-blocking transaction allocator and > * thus will allow modifications to a frozen fs. This is required because this > * code can be called during the process of freezing where use of the high-level > * allocator would deadlock. > */ > > So, AFAICT, the whole "XFS nests internal transactions" lockdep > handling in __sb_start_write() has been unnecessary for quite a few > years now.... Yeah. Would you be willing to RVB this, or are you all waiting for a v2 series? --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Nov 05, 2020 at 06:19:51PM -0800, Darrick J. Wong wrote: > On Fri, Nov 06, 2020 at 08:34:15AM +1100, Dave Chinner wrote: > > On Tue, Nov 03, 2020 at 11:37:50AM -0800, Darrick J. Wong wrote: > > > On Tue, Nov 03, 2020 at 06:46:59PM +0000, Christoph Hellwig wrote: > > > > On Tue, Nov 03, 2020 at 10:34:44AM -0800, Darrick J. Wong wrote: > > > > > > Please split the function into __sb_start_write and > > > > > > __sb_start_write_trylock while you're at it.. > > > > > > > > > > Any thoughts on this patch itself? I don't feel like I have 100% of the > > > > > context to know whether the removal is a good idea for non-xfs > > > > > filesystems, though I'm fairly sure the current logic is broken. > > > > > > > > The existing logic looks pretty bogus to me as well. Did you try to find > > > > the discussion that lead to it? > > > > > > TBH I don't know where the discussion happened. The "convert to > > > trylock" behavior first appeared as commit 5accdf82ba25c back in 2012; > > > that commit seems to have come from v6 of a patch[1] that Jan Kara sent > > > to try to fix fs freeze handling back in 2012. The behavior was not in > > > the v5[0] patch, nor was there any discussion for any of the v5 patches > > > that would suggest why things changed from v5 to v6. > > > > > > Dave and I were talking about this on IRC yesterday, and his memory > > > thought that this was lockdep trying to handle xfs taking intwrite > > > protection while handling a write (or page_mkwrite) operation. I'm not > > > sure where "XFS for example gets freeze protection on internal level > > > twice in some cases" would actually happen -- did xfs support nested > > > transactions in the past? We definitely don't now, so I don't think the > > > comment is valid anymore. > > > > > > The last commit to touch this area was f4b554af9931 (in 2015), which > > > says that Dave explained that the trylock hack + comment could be > > > removed, but the patch author never did that, and lore doesn't seem to > > > know where or when Dave actually said that? > > > > I'm pretty sure this "nesting internal freeze references" stems from > > the fact we log and flush the superblock after fulling freezing the > > filesystem to dirty the journal so recovery after a crash while > > frozen handles unlinked inodes. > > > > The high level VFS freeze annotations were not able to handle > > running this transaction when transactions were supposed to already > > be blocked and drained, so there was a special hack to hide it from > > lockdep. Then we ended up hiding it from the VFS via > > XFS_TRANS_NO_WRITECOUNT in xfs_sync_sb() because we needed it in > > more places than just freeze (e.g. the log covering code > > run by the background log worker). It's kinda documented here: > > > > /* > > * xfs_sync_sb > > * > > * Sync the superblock to disk. > > * > > * Note that the caller is responsible for checking the frozen state of the > > * filesystem. This procedure uses the non-blocking transaction allocator and > > * thus will allow modifications to a frozen fs. This is required because this > > * code can be called during the process of freezing where use of the high-level > > * allocator would deadlock. > > */ > > > > So, AFAICT, the whole "XFS nests internal transactions" lockdep > > handling in __sb_start_write() has been unnecessary for quite a few > > years now.... > > Yeah. Would you be willing to RVB this, or are you all waiting for a v2 > series? Send out a v2 - you probably need to include some of the above information in the change log removing the lockdep stuff so it's preserved this time... Cheers, Dave.
diff --git a/fs/super.c b/fs/super.c index a51c2083cd6b..e1fd667454d4 100644 --- a/fs/super.c +++ b/fs/super.c @@ -1647,36 +1647,11 @@ EXPORT_SYMBOL(__sb_end_write); */ int __sb_start_write(struct super_block *sb, int level, bool wait) { - bool force_trylock = false; - int ret = 1; + if (!wait) + return percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); -#ifdef CONFIG_LOCKDEP - /* - * We want lockdep to tell us about possible deadlocks with freezing - * but it's it bit tricky to properly instrument it. Getting a freeze - * protection works as getting a read lock but there are subtle - * problems. XFS for example gets freeze protection on internal level - * twice in some cases, which is OK only because we already hold a - * freeze protection also on higher level. Due to these cases we have - * to use wait == F (trylock mode) which must not fail. - */ - if (wait) { - int i; - - for (i = 0; i < level - 1; i++) - if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) { - force_trylock = true; - break; - } - } -#endif - if (wait && !force_trylock) - percpu_down_read(sb->s_writers.rw_sem + level-1); - else - ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1); - - WARN_ON(force_trylock && !ret); - return ret; + percpu_down_read(sb->s_writers.rw_sem + level-1); + return 1; } EXPORT_SYMBOL(__sb_start_write);