Message ID: 20221022020345.GA2699923@ceph-admin (mailing list archive)
State: Superseded, archived
Series: [v1] xfs: fix sb write verify for lazysbcount
On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> When lazysbcount is enabled, multiple threads stress test the xfs report
> the following problems:
>
> XFS (loop0): SB summary counter sanity check failed
> XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> 	+0x13b/0x460, xfs_sb block 0x0
> XFS (loop0): Unmount and run xfs_repair
> XFS (loop0): First 128 bytes of corrupted metadata buffer:
> 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(..
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z
> 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... ..........
> 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
> 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................
> 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................
> 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................
> XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
> XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> XFS (loop0): log mount/recovery failed: error -117
> XFS (loop0): log mount failed
>
> The cause of the problem is that during the log recovery process, incorrect
> icount and ifree are recovered from the log and fail to pass the size check

Are you saying that the log contained a transaction in which ifree >
icount?

> in xfs_validate_sb_write().
>
> With lazysbcount is enabled, There is no additional lock protection for
> reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> the m_ifree between the read m_icount and the m_ifree, this will make the
> m_ifree larger than m_icount and written to the log. If we have an unclean
> shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> the counters from the AGF block counts, and the correction is later than
> log recovery. During log recovery, incorrect ifree/icount may be restored
> from the log and written to the super block, since ifree and icount have
> not been corrected at this time, the relationship between ifree and icount
> cannot be checked in xfs_validate_sb_write().
>
> So, don't check the size between ifree and icount in xfs_validate_sb_write()
> when lazysbcount is enabled.

Um, doesn't that neuter a sanity check on all V5 filesystems?

--D

> Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/libxfs/xfs_sb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index a20cade590e9..b4a4e57361e7 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -245,7 +245,7 @@ xfs_validate_sb_write(
>  	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
>  	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
>  	     !xfs_verify_icount(mp, sbp->sb_icount) ||
> -	     sbp->sb_ifree > sbp->sb_icount)) {
> +	     (!xfs_has_lazysbcount(mp) && sbp->sb_ifree > sbp->sb_icount))) {
>  		xfs_warn(mp, "SB summary counter sanity check failed");
>  		return -EFSCORRUPTED;
>  	}
> --
> 2.31.1
>
On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > When lazysbcount is enabled, multiple threads stress test the xfs report
> > the following problems:
> >
> > XFS (loop0): SB summary counter sanity check failed
[...]
> > The cause of the problem is that during the log recovery process, incorrect
> > icount and ifree are recovered from the log and fail to pass the size check
>
> Are you saying that the log contained a transaction in which ifree >
> icount?

Yes, this situation is possible. For example, consider the following sequence:

CPU0                                    CPU1
xfs_log_sb                              xfs_trans_unreserve_and_mod_sb
----------                              ------------------------------
percpu_counter_sum(&mp->m_icount)
                                        percpu_counter_add(&mp->m_icount, idelta)
                                        percpu_counter_add_batch(&mp->m_icount,
                                                idelta, XFS_ICOUNT_BATCH)
percpu_counter_sum(&mp->m_ifree)

> > in xfs_validate_sb_write().
> >
> > With lazysbcount is enabled, There is no additional lock protection for
> > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > the m_ifree between the read m_icount and the m_ifree, this will make the
> > m_ifree larger than m_icount and written to the log. If we have an unclean
> > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > the counters from the AGF block counts, and the correction is later than
> > log recovery. During log recovery, incorrect ifree/icount may be restored
> > from the log and written to the super block, since ifree and icount have
> > not been corrected at this time, the relationship between ifree and icount
> > cannot be checked in xfs_validate_sb_write().
> >
> > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > when lazysbcount is enabled.
>
> Um, doesn't that neuter a sanity check on all V5 filesystems?

Yes, such a modification doesn't look like the best way; all sb write checks
will be affected. Maybe we could add a check for a clean mount to reduce the
scope of the change, but that requires setting XFS_OPSTATE_CLEAN after
re-initialising the incore superblock counters, like this:

--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -242,10 +242,12 @@ xfs_validate_sb_write(
 	 * secondary superblocks, so allow this usage to continue because
 	 * we never read counters from such superblocks.
 	 */
+	bool check = !(xfs_has_lazysbcount(mp) && !xfs_is_clean(mp));
+
 	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
 	     !xfs_verify_icount(mp, sbp->sb_icount) ||
-	     sbp->sb_ifree > sbp->sb_icount)) {
+	     (check && sbp->sb_ifree > sbp->sb_icount))) {
 		xfs_warn(mp, "SB summary counter sanity check failed");
 		return -EFSCORRUPTED;
 	}

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e8bb3c2e847e..0637e5d01e72 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -535,6 +535,8 @@ xfs_check_summary_counts(
 		return error;
 	}
 
+	set_bit(XFS_OPSTATE_CLEAN, &mp->m_opstate);
+
 	return 0;
 }

> --D
>
> > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
[...]
On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > the following problems:

We've had lazy sb counters for 15 years and just about every XFS
filesystem in production uses them, so providing us with some idea
of the scope of the problem and how to reproduce it would be greatly
appreciated.

What stress test are you running? What filesystem config does it
manifest on (other than lazysbcount=1)? How long does the stress
test run for, and where/why does log recovery get run in this stress
test?

> > > XFS (loop0): SB summary counter sanity check failed
[...]
> > > The cause of the problem is that during the log recovery process, incorrect
> > > icount and ifree are recovered from the log and fail to pass the size check
> >
> > Are you saying that the log contained a transaction in which ifree >
> > icount?
>
> Yes, this situation is possible. For example consider the following sequence:
>
> CPU0                                    CPU1
> xfs_log_sb                              xfs_trans_unreserve_and_mod_sb
> ----------                              ------------------------------
> percpu_counter_sum(&mp->m_icount)
>                                         percpu_counter_add(&mp->m_icount, idelta)
>                                         percpu_counter_add_batch(&mp->m_icount,
>                                                 idelta, XFS_ICOUNT_BATCH)
> percpu_counter_sum(&mp->m_ifree)

What caused the xfs_log_sb() to be called? Very few things
actually log the superblock this way at runtime - it's generally
only logged directly like this when a feature bit changes during a
transaction (rare) or at a synchronisation point when everything
else is idle and there's no chance of a race like this occurring...

I can see a couple of routes to this occurring via feature bit
modification, but I don't see them being easy to hit or something
that would exist for very long in the journal. Hence I'm wondering
if there should be runtime protection for xfs_log_sb() to avoid
these problems....

> > > in xfs_validate_sb_write().
> > >
> > > With lazysbcount is enabled, There is no additional lock protection for
> > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > m_ifree larger than m_icount and written to the log.
[...]
> > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > when lazysbcount is enabled.
> >
> > Um, doesn't that neuter a sanity check on all V5 filesystems?
>
> Yes, such modifications don't look like the best way, all sb write checks
> will be affect. Maybe it can increase the judgment of clean mount and reduce
> the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> re-initialise incore superblock counters, like this:

I'm not sure that silencing the warning and continuing log recovery
with an invalid accounting state is the best way to proceed. We
haven't yet recovered unlinked inodes at the time this warning
fires. If ifree really is larger than icount, then we've got a
real problem at this point in log recovery.

Hence I suspect that we should be looking at fixing the runtime race
condition that leads to the problem, not trying to work around
inconsistency in the recovery code.

-Dave.
On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
>
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.
>
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)? How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?
>
[...]
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
>
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....

Maybe. Or perhaps we sample m_i{count,free} until they come up with a
value that will pass the verifier, and only then log the new values to
the primary super xfs_buf?

[...]
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
>
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.

Agreed. Higher level code shouldn't be storing garbage ifree/icount to
the primary superblock xfs_buf, period.

--D

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
>
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.

Ok, here is my simplified script that can reproduce the bug:

#!/bin/bash
device=/dev/sda
testdir=/mnt/test
round=0

function fail()
{
	echo "$*"
	exit 1
}

while [ $round -lt 10000 ]
do
	echo "******* round $round ********"
	mkfs.xfs -f $device
	mkdir -p $testdir
	mount $device $testdir || fail "mount failed!"
	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
	sleep 4
	killall -w fsstress
	umount $testdir
	xfs_repair -e $device > /dev/null
	if [ $? -eq 2 ]; then
		echo "ERR CODE 2: Dirty log exception encountered during repair."
		exit 1
	fi
	sleep 1
	round=$(($round+1))
done

> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)? How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?

I tested with the fsstress tool and used the default mount options. More
configuration details are as follows:

CONFIG_XFS_FS=y
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y

xfs_db> info
meta-data=/dev/sda               isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

With the above test script, it takes several days to reproduce the bug. If
we add a delay in xfs_log_sb(), the problem reproduces quickly:

	if (xfs_has_lazysbcount(mp)) {
		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
+		udelay(2000);
		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
	}

[...]
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
>
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Sun, Oct 23, 2022 at 09:07:42PM -0700, Darrick J. Wong wrote:
> On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> > On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > > the following problems:
[...]
> > What caused the xfs_log_sb() to be called? Very few things
> > actually log the superblock this way at runtime - it's generally
> > only logged directly like this when a feature bit changes during a
> > transaction (rare) or at a synchronisation point when everything
> > else is idle and there's no chance of a race like this occurring...
> >
> > I can see a couple of routes to this occurring via feature bit
> > modification, but I don't see them being easy to hit or something
> > that would exist for very long in the journal. Hence I'm wondering
> > if there should be runtime protection for xfs_log_sb() to avoid
> > these problems....
>
> Maybe. Or perhaps we sample m_i{count,free} until they come up with a
> value that will pass the verifier, and only then log the new values to
> the primary super xfs_buf?

I suspect the simplest thing to do is this:

	mp->m_sb.sb_ifree = min_t(uint64_t, percpu_counter_sum(&mp->m_ifree),
				  mp->m_sb.sb_icount);

That way ifree will never be logged as being greater than icount.
Neither icount nor ifree will be accurate if we are racing with other
updates, but it will guarantee that what we write to the journal
won't trigger corruption warnings.

-Dave.
On Mon, Oct 24, 2022 at 04:43:45PM +1100, Dave Chinner wrote: > On Sun, Oct 23, 2022 at 09:07:42PM -0700, Darrick J. Wong wrote: > > On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote: > > > On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote: > > > > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote: > > > > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote: > > > > > > When lazysbcount is enabled, multiple threads stress test the xfs report > > > > > > the following problems: > > > > > > We've had lazy sb counters for 15 years and just about every XFS > > > filesystem in production uses them, so providing us with some idea > > > of the scope of the problem and how to reproduce it would be greatly > > > appreciated. > > > > > > What stress test are you running? What filesystem config does it > > > manifest on (other than lazysbcount=1)? How long does the stress > > > test run for, and where/why does log recovery get run in this stress > > > test? > > > > > > > > > XFS (loop0): SB summary counter sanity check failed > > > > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify > > > > > > +0x13b/0x460, xfs_sb block 0x0 > > > > > > XFS (loop0): Unmount and run xfs_repair > > > > > > XFS (loop0): First 128 bytes of corrupted metadata buffer: > > > > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. > > > > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > > > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z > > > > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... > > > > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ > > > > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ > > > > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ 
> > > > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ > > > > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply > > > > > > +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. > > > > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s) > > > > > > XFS (loop0): log mount/recovery failed: error -117 > > > > > > XFS (loop0): log mount failed > > > > > > > > > > > > The cause of the problem is that during the log recovery process, incorrect > > > > > > icount and ifree are recovered from the log and fail to pass the size check > > > > > > > > > > Are you saying that the log contained a transaction in which ifree > > > > > > icount? > > > > > > > > Yes, this situation is possible. For example consider the following sequence: > > > > > > > > CPU0 CPU1 > > > > xfs_log_sb xfs_trans_unreserve_and_mod_sb > > > > ---------- ------------------------------ > > > > percpu_counter_sum(&mp->m_icount) > > > > percpu_counter_add(&mp->m_icount, idelta) > > > > percpu_counter_add_batch(&mp->m_icount, > > > > idelta, XFS_ICOUNT_BATCH) > > > > percpu_counter_sum(&mp->m_ifree) > > > > > > What caused the xfs_log_sb() to be called? Very few things > > > actually log the superblock this way at runtime - it's generally > > > only logged directly like this when a feature bit changes during a > > > transaction (rare) or at a synchronisation point when everything > > > else is idle and there's no chance of a race like this occurring... > > > > > > I can see a couple of routes to this occurring via feature bit > > > modification, but I don't see them being easy to hit or something > > > that would exist for very long in the journal. Hence I'm wondering > > > if there should be runtime protection for xfs_log_sb() to avoid > > > these problems.... > > > > Maybe. 
> > Or perhaps we sample m_i{count,free} until they come up with a
> > value that will pass the verifier, and only then log the new values to
> > the primary super xfs_buf?
>
> I suspect the simplest thing to do is this:
>
> 	mp->m_sb.sb_ifree = min_t(uint64_t, percpu_counter_sum(&mp->m_ifree),
> 				  mp->m_sb.sb_icount);
>
> That way ifree will never be logged as being greater than icount.
> Neither icount or ifree will be accurate if we are racing with other
> updates, but it will guarantee that what we write to the journal
> won't trigger corruption warnings.

I agree, this is the simplest and cleanest fix, and it has no wider
impact: it addresses the problem at the point where it is introduced
rather than after it has occurred. I will resend the patch with the
reproducer script attached. Thanks for your advice.

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index a20cade590e9..b4a4e57361e7 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -245,7 +245,7 @@ xfs_validate_sb_write(
 	if (xfs_buf_daddr(bp) == XFS_SB_DADDR && !sbp->sb_inprogress &&
 	    (sbp->sb_fdblocks > sbp->sb_dblocks ||
 	     !xfs_verify_icount(mp, sbp->sb_icount) ||
-	     sbp->sb_ifree > sbp->sb_icount)) {
+	     (!xfs_has_lazysbcount(mp) && sbp->sb_ifree > sbp->sb_icount))) {
 		xfs_warn(mp, "SB summary counter sanity check failed");
 		return -EFSCORRUPTED;
 	}
When lazysbcount is enabled, multiple threads stress-testing xfs report
the following problems:

XFS (loop0): SB summary counter sanity check failed
XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
+0x13b/0x460, xfs_sb block 0x0
XFS (loop0): Unmount and run xfs_repair
XFS (loop0): First 128 bytes of corrupted metadata buffer:
00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
XFS (loop0): Please unmount the filesystem and rectify the problem(s)
XFS (loop0): log mount/recovery failed: error -117
XFS (loop0): log mount failed

The cause of the problem is that, during log recovery, incorrect icount
and ifree are recovered from the log and fail the size check in
xfs_validate_sb_write().

With lazysbcount enabled, there is no additional lock protecting the
reads of m_ifree and m_icount in xfs_log_sb(). If another thread
modifies m_ifree between the reads of m_icount and m_ifree, m_ifree can
become larger than m_icount and be written to the log in that state.
After an unclean shutdown this is corrected by
xfs_initialize_perag_data(), which rebuilds the counters from the AGF
block counts, but that correction happens after log recovery.
During log recovery, incorrect ifree/icount may be restored from the log
and written to the superblock. Since ifree and icount have not yet been
corrected at that point, the relationship between ifree and icount
cannot be checked in xfs_validate_sb_write(). So don't check ifree
against icount in xfs_validate_sb_write() when lazysbcount is enabled.

Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks")
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/libxfs/xfs_sb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)