Message ID | 20221103034736.2604208-1-leo.lilong@huawei.com (mailing list archive) |
---|---|
State | Accepted, archived |
Headers | show |
Series | [v3] xfs: fix sb write verify for lazysbcount | expand |
On Thu, Nov 03, 2022 at 11:47:36AM +0800, Long Li wrote: > When lazysbcount is enabled, fsstress and loop mount/unmount test report > the following problems: > > XFS (loop0): SB summary counter sanity check failed > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, > xfs_sb block 0x0 > XFS (loop0): Unmount and run xfs_repair > XFS (loop0): First 128 bytes of corrupted metadata buffer: > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply > +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. > XFS (loop0): Please unmount the filesystem and rectify the problem(s) > XFS (loop0): log mount/recovery failed: error -117 > XFS (loop0): log mount failed > > This corruption will shutdown the file system and the file system will > no longer be mountable. The following script can reproduce the problem, > but it may take a long time. > > #!/bin/bash > > device=/dev/sda > testdir=/mnt/test > round=0 > > function fail() > { > echo "$*" > exit 1 > } > > mkdir -p $testdir > while [ $round -lt 10000 ] > do > echo "******* round $round ********" > mkfs.xfs -f $device > mount $device $testdir || fail "mount failed!" > fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & > sleep 4 > killall -w fsstress > umount $testdir > xfs_repair -e $device > /dev/null > if [ $? -eq 2 ];then > echo "ERR CODE 2: Dirty log exception during repair." > exit 1 > fi > round=$(($round+1)) > done > > With lazysbcount is enabled, There is no additional lock protection for > reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the > m_ifree, this will make the m_ifree greater than m_icount. For example, > consider the following sequence and ifreedelta is postive: > > CPU0 CPU1 > xfs_log_sb xfs_trans_unreserve_and_mod_sb > ---------- ------------------------------ > percpu_counter_sum(&mp->m_icount) > percpu_counter_add_batch(&mp->m_icount, > idelta, XFS_ICOUNT_BATCH) > percpu_counter_add(&mp->m_ifree, ifreedelta); > percpu_counter_sum(&mp->m_ifree) > > After this, incorrect inode count (sb_ifree > sb_icount) will be writen to > the log. In the subsequent writing of sb, incorrect inode count (sb_ifree > > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write() > that cause the file system shutdown. > > When lazysbcount is enabled, we don't need to guarantee that Lazy sb > counters are completely correct, but we do need to guarantee that sb_ifree > <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount > must be satisfied any time that there /cannot/ be other threads allocating > or freeing inode chunks. If the constraint is violated under these > circumstances, sb_i{count,free} (the ondisk superblock inode counters) > maybe incorrect and need to be marked sick at unmount, the count will > be rebuilt on the next mount. > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") > Signed-off-by: Long Li <leo.lilong@huawei.com> > --- > v3: > - Corrected the description of the cause of the problem > - Add a check for m_icount and m_ifree at unmout > v2: > - Add scripts that could reproduce the problem > - Guaranteed that ifree will never be logged as being greater than icount > > fs/xfs/libxfs/xfs_sb.c | 4 +++- > fs/xfs/xfs_mount.c | 15 +++++++++++++++ > 2 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c > index a20cade590e9..1eeecf2eb2a7 100644 > --- a/fs/xfs/libxfs/xfs_sb.c > +++ b/fs/xfs/libxfs/xfs_sb.c > @@ -972,7 +972,9 @@ xfs_log_sb( > */ > if (xfs_has_lazysbcount(mp)) { > mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount); > - mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree); > + mp->m_sb.sb_ifree = min_t(uint64_t, > + percpu_counter_sum(&mp->m_ifree), > + mp->m_sb.sb_icount); > mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks); > } > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > index e8bb3c2e847e..fb87ffb48f7f 100644 > --- a/fs/xfs/xfs_mount.c > +++ b/fs/xfs/xfs_mount.c > @@ -538,6 +538,20 @@ xfs_check_summary_counts( > return 0; > } > > +static void > +xfs_unmount_check( I'd have called this xfs_ifree_unmount or something, but as this is a fix for a race condition and I'd like to get this all into 6.1 before -rc4 so I can start working on 6.2, so I'll change the name and commit it. Thank you for digging into this. Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > + struct xfs_mount *mp) > +{ > + if (xfs_is_shutdown(mp)) > + return; > + > + if (percpu_counter_sum(&mp->m_ifree) > > + percpu_counter_sum(&mp->m_icount)) { > + xfs_alert(mp, "ifree/icount mismatch at unmount"); > + xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS); > + } > +} > + > /* > * Flush and reclaim dirty inodes in preparation for unmount. Inodes and > * internal inode structures can be sitting in the CIL and AIL at this point, > @@ -1077,6 +1091,7 @@ xfs_unmountfs( > if (error) > xfs_warn(mp, "Unable to free reserved block pool. " > "Freespace may not be correct on next mount."); > + xfs_unmount_check(mp); > > xfs_log_unmount(mp); > xfs_da_unmount(mp); > -- > 2.31.1 >
On Wed, Nov 02, 2022 at 09:22:58PM -0700, Darrick J. Wong wrote: > On Thu, Nov 03, 2022 at 11:47:36AM +0800, Long Li wrote: > > When lazysbcount is enabled, fsstress and loop mount/unmount test report > > the following problems: > > > > XFS (loop0): SB summary counter sanity check failed > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, > > xfs_sb block 0x0 > > XFS (loop0): Unmount and run xfs_repair > > XFS (loop0): First 128 bytes of corrupted metadata buffer: > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply > > +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. > > XFS (loop0): Please unmount the filesystem and rectify the problem(s) > > XFS (loop0): log mount/recovery failed: error -117 > > XFS (loop0): log mount failed > > > > This corruption will shutdown the file system and the file system will > > no longer be mountable. The following script can reproduce the problem, > > but it may take a long time. > > > > #!/bin/bash > > > > device=/dev/sda > > testdir=/mnt/test > > round=0 > > > > function fail() > > { > > echo "$*" > > exit 1 > > } > > > > mkdir -p $testdir > > while [ $round -lt 10000 ] > > do > > echo "******* round $round ********" > > mkfs.xfs -f $device > > mount $device $testdir || fail "mount failed!" > > fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & > > sleep 4 > > killall -w fsstress > > umount $testdir > > xfs_repair -e $device > /dev/null > > if [ $? -eq 2 ];then > > echo "ERR CODE 2: Dirty log exception during repair." > > exit 1 > > fi > > round=$(($round+1)) > > done > > > > With lazysbcount is enabled, There is no additional lock protection for > > reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the > > m_ifree, this will make the m_ifree greater than m_icount. For example, > > consider the following sequence and ifreedelta is postive: > > > > CPU0 CPU1 > > xfs_log_sb xfs_trans_unreserve_and_mod_sb > > ---------- ------------------------------ > > percpu_counter_sum(&mp->m_icount) > > percpu_counter_add_batch(&mp->m_icount, > > idelta, XFS_ICOUNT_BATCH) > > percpu_counter_add(&mp->m_ifree, ifreedelta); > > percpu_counter_sum(&mp->m_ifree) > > > > After this, incorrect inode count (sb_ifree > sb_icount) will be writen to > > the log. In the subsequent writing of sb, incorrect inode count (sb_ifree > > > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write() > > that cause the file system shutdown. > > > > When lazysbcount is enabled, we don't need to guarantee that Lazy sb > > counters are completely correct, but we do need to guarantee that sb_ifree > > <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount > > must be satisfied any time that there /cannot/ be other threads allocating > > or freeing inode chunks. If the constraint is violated under these > > circumstances, sb_i{count,free} (the ondisk superblock inode counters) > > maybe incorrect and need to be marked sick at unmount, the count will > > be rebuilt on the next mount. > > > > Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") > > Signed-off-by: Long Li <leo.lilong@huawei.com> > > --- > > v3: > > - Corrected the description of the cause of the problem > > - Add a check for m_icount and m_ifree at unmout > > v2: > > - Add scripts that could reproduce the problem > > - Guaranteed that ifree will never be logged as being greater than icount > > > > fs/xfs/libxfs/xfs_sb.c | 4 +++- > > fs/xfs/xfs_mount.c | 15 +++++++++++++++ > > 2 files changed, 18 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c > > index a20cade590e9..1eeecf2eb2a7 100644 > > --- a/fs/xfs/libxfs/xfs_sb.c > > +++ b/fs/xfs/libxfs/xfs_sb.c > > @@ -972,7 +972,9 @@ xfs_log_sb( > > */ > > if (xfs_has_lazysbcount(mp)) { > > mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount); > > - mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree); > > + mp->m_sb.sb_ifree = min_t(uint64_t, > > + percpu_counter_sum(&mp->m_ifree), > > + mp->m_sb.sb_icount); > > mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks); > > } > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > > index e8bb3c2e847e..fb87ffb48f7f 100644 > > --- a/fs/xfs/xfs_mount.c > > +++ b/fs/xfs/xfs_mount.c > > @@ -538,6 +538,20 @@ xfs_check_summary_counts( > > return 0; > > } > > > > +static void > > +xfs_unmount_check( > > I'd have called this xfs_ifree_unmount or something, but as this is a > fix for a race condition and I'd like to get this all into 6.1 before > -rc4 so I can start working on 6.2, so I'll change the name and commit > it. Thank you for digging into this. Ok, thank you. > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > --D > > > + struct xfs_mount *mp) > > +{ > > + if (xfs_is_shutdown(mp)) > > + return; > > + > > + if (percpu_counter_sum(&mp->m_ifree) > > > + percpu_counter_sum(&mp->m_icount)) { > > + xfs_alert(mp, "ifree/icount mismatch at unmount"); > > + xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS); > > + } > > +} > > + > > /* > > * Flush and reclaim dirty inodes in preparation for unmount. Inodes and > > * internal inode structures can be sitting in the CIL and AIL at this point, > > @@ -1077,6 +1091,7 @@ xfs_unmountfs( > > if (error) > > xfs_warn(mp, "Unable to free reserved block pool. " > > "Freespace may not be correct on next mount."); > > + xfs_unmount_check(mp); > > > > xfs_log_unmount(mp); > > xfs_da_unmount(mp); > > -- > > 2.31.1 > >
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index a20cade590e9..1eeecf2eb2a7 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -972,7 +972,9 @@ xfs_log_sb( */ if (xfs_has_lazysbcount(mp)) { mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount); - mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree); + mp->m_sb.sb_ifree = min_t(uint64_t, + percpu_counter_sum(&mp->m_ifree), + mp->m_sb.sb_icount); mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks); } diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index e8bb3c2e847e..fb87ffb48f7f 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -538,6 +538,20 @@ xfs_check_summary_counts( return 0; } +static void +xfs_unmount_check( + struct xfs_mount *mp) +{ + if (xfs_is_shutdown(mp)) + return; + + if (percpu_counter_sum(&mp->m_ifree) > + percpu_counter_sum(&mp->m_icount)) { + xfs_alert(mp, "ifree/icount mismatch at unmount"); + xfs_fs_mark_sick(mp, XFS_SICK_FS_COUNTERS); + } +} + /* * Flush and reclaim dirty inodes in preparation for unmount. Inodes and * internal inode structures can be sitting in the CIL and AIL at this point, @@ -1077,6 +1091,7 @@ xfs_unmountfs( if (error) xfs_warn(mp, "Unable to free reserved block pool. " "Freespace may not be correct on next mount."); + xfs_unmount_check(mp); xfs_log_unmount(mp); xfs_da_unmount(mp);
When lazysbcount is enabled, fsstress and loop mount/unmount test report the following problems: XFS (loop0): SB summary counter sanity check failed XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(.. 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... .......... 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................ 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................ 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................ XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed This corruption will shutdown the file system and the file system will no longer be mountable. The following script can reproduce the problem, but it may take a long time. #!/bin/bash device=/dev/sda testdir=/mnt/test round=0 function fail() { echo "$*" exit 1 } mkdir -p $testdir while [ $round -lt 10000 ] do echo "******* round $round ********" mkfs.xfs -f $device mount $device $testdir || fail "mount failed!" fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null & sleep 4 killall -w fsstress umount $testdir xfs_repair -e $device > /dev/null if [ $? -eq 2 ];then echo "ERR CODE 2: Dirty log exception during repair." exit 1 fi round=$(($round+1)) done With lazysbcount is enabled, There is no additional lock protection for reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the m_ifree, this will make the m_ifree greater than m_icount. For example, consider the following sequence and ifreedelta is postive: CPU0 CPU1 xfs_log_sb xfs_trans_unreserve_and_mod_sb ---------- ------------------------------ percpu_counter_sum(&mp->m_icount) percpu_counter_add_batch(&mp->m_icount, idelta, XFS_ICOUNT_BATCH) percpu_counter_add(&mp->m_ifree, ifreedelta); percpu_counter_sum(&mp->m_ifree) After this, incorrect inode count (sb_ifree > sb_icount) will be writen to the log. In the subsequent writing of sb, incorrect inode count (sb_ifree > sb_icount) will fail to pass the boundary check in xfs_validate_sb_write() that cause the file system shutdown. When lazysbcount is enabled, we don't need to guarantee that Lazy sb counters are completely correct, but we do need to guarantee that sb_ifree <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount must be satisfied any time that there /cannot/ be other threads allocating or freeing inode chunks. If the constraint is violated under these circumstances, sb_i{count,free} (the ondisk superblock inode counters) maybe incorrect and need to be marked sick at unmount, the count will be rebuilt on the next mount. Fixes: 8756a5af1819 ("libxfs: add more bounds checking to sb sanity checks") Signed-off-by: Long Li <leo.lilong@huawei.com> --- v3: - Corrected the description of the cause of the problem - Add a check for m_icount and m_ifree at unmout v2: - Add scripts that could reproduce the problem - Guaranteed that ifree will never be logged as being greater than icount fs/xfs/libxfs/xfs_sb.c | 4 +++- fs/xfs/xfs_mount.c | 15 +++++++++++++++ 2 files changed, 18 insertions(+), 1 deletion(-)