Message ID | 20230809220545.1308228-8-hch@lst.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [01/13] xfs: reformat the xfs_fs_free prototype | expand |
On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote: > Copy and paste the commit message from Darrick into a comment to explain > the seemly odd invalidate_bdev in xfs_shutdown_devices. ^ seemingly? > > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++ > 1 file changed, 26 insertions(+) > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 4ae3b01ed038c7..c169beb0d8cab3 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -399,6 +399,32 @@ STATIC void > xfs_shutdown_devices( > struct xfs_mount *mp) > { > + /* > + * Udev is triggered whenever anyone closes a block device or unmounts > + * a file systemm on a block device. > + * The default udev rules invoke blkid to read the fs super and create > + * symlinks to the bdev under /dev/disk. For this, it uses buffered > + * reads through the page cache. > + * > + * xfs_db also uses buffered reads to examine metadata. There is no > + * coordination between xfs_db and udev, which means that they can run > + * concurrently. Note there is no coordination between the kernel and > + * blkid either. > + * > + * On a system with 64k pages, the page cache can cache the superblock > + * and the root inode (and hence the root directory) with the same 64k > + * page. If udev spawns blkid after the mkfs and the system is busy > + * enough that it is still running when xfs_db starts up, they'll both > + * read from the same page in the pagecache. > + * > + * The unmount writes updated inode metadata to disk directly. The XFS > + * buffer cache does not use the bdev pagecache, nor does it invalidate > + * the pagecache on umount. If the above scenario occurs, the pagecache This sentence reads a little strangely, since "nor does it invalidate" would seem to conflict with the invalidate_bdev call below. I suggest changing the verb a bit: "The XFS buffer cache does not use the bdev pagecache, so it needs to invalidate that pagecache on unmount." With those two things changed, Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > + * no longer reflects what's on disk, xfs_db reads the stale metadata, > + * and fails to find /a. Most of the time this succeeds because closing > + * a bdev invalidates the page cache, but when processes race, everyone > + * loses. > + */ > if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) { > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); > invalidate_bdev(mp->m_logdev_targp->bt_bdev); > -- > 2.39.2 >
On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote: > On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote: > > Copy and paste the commit message from Darrick into a comment to explain > > the seemly odd invalidate_bdev in xfs_shutdown_devices. > > ^ seemingly? > > > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > --- > > fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > index 4ae3b01ed038c7..c169beb0d8cab3 100644 > > --- a/fs/xfs/xfs_super.c > > +++ b/fs/xfs/xfs_super.c > > @@ -399,6 +399,32 @@ STATIC void > > xfs_shutdown_devices( > > struct xfs_mount *mp) > > { > > + /* > > + * Udev is triggered whenever anyone closes a block device or unmounts > > + * a file systemm on a block device. > > + * The default udev rules invoke blkid to read the fs super and create > > + * symlinks to the bdev under /dev/disk. For this, it uses buffered > > + * reads through the page cache. > > + * > > + * xfs_db also uses buffered reads to examine metadata. There is no > > + * coordination between xfs_db and udev, which means that they can run > > + * concurrently. Note there is no coordination between the kernel and > > + * blkid either. > > + * > > + * On a system with 64k pages, the page cache can cache the superblock > > + * and the root inode (and hence the root directory) with the same 64k > > + * page. If udev spawns blkid after the mkfs and the system is busy > > + * enough that it is still running when xfs_db starts up, they'll both > > + * read from the same page in the pagecache. > > + * > > + * The unmount writes updated inode metadata to disk directly. The XFS > > + * buffer cache does not use the bdev pagecache, nor does it invalidate > > + * the pagecache on umount. If the above scenario occurs, the pagecache > > This sentence reads a little strangely, since "nor does it invalidate" > would seem to conflict with the invalidate_bdev call below. I suggest > changing the verb a bit: > > "The XFS buffer cache does not use the bdev pagecache, so it needs to > invalidate that pagecache on unmount." > > With those two things changed, > Reviewed-by: Darrick J. Wong <djwong@kernel.org> Fixed in-tree.
On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote: > + /* > + * Udev is triggered whenever anyone closes a block device or unmounts > + * a file systemm on a block device. > + * The default udev rules invoke blkid to read the fs super and create > + * symlinks to the bdev under /dev/disk. For this, it uses buffered > + * reads through the page cache. > + * > + * xfs_db also uses buffered reads to examine metadata. There is no > + * coordination between xfs_db and udev, which means that they can run > + * concurrently. Note there is no coordination between the kernel and > + * blkid either. > + * > + * On a system with 64k pages, the page cache can cache the superblock > + * and the root inode (and hence the root directory) with the same 64k > + * page. If udev spawns blkid after the mkfs and the system is busy > + * enough that it is still running when xfs_db starts up, they'll both > + * read from the same page in the pagecache. > + * > + * The unmount writes updated inode metadata to disk directly. The XFS > + * buffer cache does not use the bdev pagecache, nor does it invalidate > + * the pagecache on umount. If the above scenario occurs, the pagecache > + * no longer reflects what's on disk, xfs_db reads the stale metadata, > + * and fails to find /a. Most of the time this succeeds because closing > + * a bdev invalidates the page cache, but when processes race, everyone > + * loses. > + */ > if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) { > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); > invalidate_bdev(mp->m_logdev_targp->bt_bdev); While I have no complaints with this as a commit message, it's just too verbose for an inline comment, IMO. Something pithier and more generic would seem appropriate. How about: /* * Prevent userspace (eg blkid or xfs_db) from seeing stale data. * XFS is not coherent with the bdev's page cache. */
On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote: > > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); > > invalidate_bdev(mp->m_logdev_targp->bt_bdev); > > While I have no complaints with this as a commit message, it's just too > verbose for an inline comment, IMO. Something pithier and more generic > would seem appropriate. How about: > > /* > * Prevent userspace (eg blkid or xfs_db) from seeing stale data. > * XFS is not coherent with the bdev's page cache. > */ Well, this completely misses the point. The point is that XFS should never have to invalidate the page cache because it's not using it, but it has to due to weird races. I tried to condese the message but I could not come up with a good one that's not losing information.
On Wed, Aug 09, 2023 at 03:39:23PM -0700, Darrick J. Wong wrote: > > + * read from the same page in the pagecache. > > + * > > + * The unmount writes updated inode metadata to disk directly. The XFS > > + * buffer cache does not use the bdev pagecache, nor does it invalidate > > + * the pagecache on umount. If the above scenario occurs, the pagecache > > This sentence reads a little strangely, since "nor does it invalidate" > would seem to conflict with the invalidate_bdev call below. I suggest > changing the verb a bit: > > "The XFS buffer cache does not use the bdev pagecache, so it needs to > invalidate that pagecache on unmount." Agreed. I'll forward it to the original author of the sentence time permitting :)
On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote: > On Wed, Aug 09, 2023 at 03:05:39PM -0700, Christoph Hellwig wrote: > > + /* > > + * Udev is triggered whenever anyone closes a block device or unmounts > > + * a file systemm on a block device. > > + * The default udev rules invoke blkid to read the fs super and create > > + * symlinks to the bdev under /dev/disk. For this, it uses buffered > > + * reads through the page cache. > > + * > > + * xfs_db also uses buffered reads to examine metadata. There is no > > + * coordination between xfs_db and udev, which means that they can run > > + * concurrently. Note there is no coordination between the kernel and > > + * blkid either. > > + * > > + * On a system with 64k pages, the page cache can cache the superblock > > + * and the root inode (and hence the root directory) with the same 64k > > + * page. If udev spawns blkid after the mkfs and the system is busy > > + * enough that it is still running when xfs_db starts up, they'll both > > + * read from the same page in the pagecache. > > + * > > + * The unmount writes updated inode metadata to disk directly. The XFS > > + * buffer cache does not use the bdev pagecache, nor does it invalidate > > + * the pagecache on umount. If the above scenario occurs, the pagecache > > + * no longer reflects what's on disk, xfs_db reads the stale metadata, > > + * and fails to find /a. Most of the time this succeeds because closing > > + * a bdev invalidates the page cache, but when processes race, everyone > > + * loses. > > + */ > > if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) { > > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); > > invalidate_bdev(mp->m_logdev_targp->bt_bdev); > > While I have no complaints with this as a commit message, it's just too > verbose for an inline comment, IMO. Something pithier and more generic > would seem appropriate. How about: > > /* > * Prevent userspace (eg blkid or xfs_db) from seeing stale data. > * XFS is not coherent with the bdev's page cache. "XFS' buffer cache is not coherent with the bdev's page cache." ? --D > */
On Thu, Aug 10, 2023 at 05:52:25PM +0200, Christoph Hellwig wrote: > On Thu, Aug 10, 2023 at 04:22:15PM +0100, Matthew Wilcox wrote: > > > blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); > > > invalidate_bdev(mp->m_logdev_targp->bt_bdev); > > > > While I have no complaints with this as a commit message, it's just too > > verbose for an inline comment, IMO. Something pithier and more generic > > would seem appropriate. How about: > > > > /* > > * Prevent userspace (eg blkid or xfs_db) from seeing stale data. > > * XFS is not coherent with the bdev's page cache. > > */ > > Well, this completely misses the point. The point is that XFS should > never have to invalidate the page cache because it's not using it, > but it has to due to weird races. I tried to condese the message but > I could not come up with a good one that's not losing information. Agreed -- it took me a while to set up an arm64 box with just the right debugging info to figure out why certain fstests were flaky. I do think it's useful (despite my other reply to willy) to retain the defect details for hard-to-reproduce errors, and the only way to do that without encountering the dead url problem is to dump it in a huge commit message or a comment. (Too bad there's no way to have a commit whose code comments reference the commit id of that commit to say "Hey, you need to read this commit before you touch this line"...) --D
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 4ae3b01ed038c7..c169beb0d8cab3 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -399,6 +399,32 @@ STATIC void xfs_shutdown_devices( struct xfs_mount *mp) { + /* + * Udev is triggered whenever anyone closes a block device or unmounts + * a file systemm on a block device. + * The default udev rules invoke blkid to read the fs super and create + * symlinks to the bdev under /dev/disk. For this, it uses buffered + * reads through the page cache. + * + * xfs_db also uses buffered reads to examine metadata. There is no + * coordination between xfs_db and udev, which means that they can run + * concurrently. Note there is no coordination between the kernel and + * blkid either. + * + * On a system with 64k pages, the page cache can cache the superblock + * and the root inode (and hence the root directory) with the same 64k + * page. If udev spawns blkid after the mkfs and the system is busy + * enough that it is still running when xfs_db starts up, they'll both + * read from the same page in the pagecache. + * + * The unmount writes updated inode metadata to disk directly. The XFS + * buffer cache does not use the bdev pagecache, nor does it invalidate + * the pagecache on umount. If the above scenario occurs, the pagecache + * no longer reflects what's on disk, xfs_db reads the stale metadata, + * and fails to find /a. Most of the time this succeeds because closing + * a bdev invalidates the page cache, but when processes race, everyone + * loses. + */ if (mp->m_logdev_targp && mp->m_logdev_targp != mp->m_ddev_targp) { blkdev_issue_flush(mp->m_logdev_targp->bt_bdev); invalidate_bdev(mp->m_logdev_targp->bt_bdev);
Copy and paste the commit message from Darrick into a comment to explain the seemly odd invalidate_bdev in xfs_shutdown_devices. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_super.c | 26 ++++++++++++++++++++++++++ 1 file changed, 26 insertions(+)